Bioinformatics

Posts

Showing posts with the label cython

Needleman-Wunsch global sequence alignment

April 21, 2009

[EDIT 01-01-2010] this is now available at bitbucket . I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as: sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/ or via svn from: svn co http://bpbio.googlecode.com/svn/trunk/nwalign/ it will put an executable 'nwalign' into /usr/bin/ which when run will give this message: Usage: nwalign [options] seq1 seq2 Options: -h, --help show this help message and exit --gap=GAP gap extend penalty (must be integer --gap_init=GAP_INIT gap start penalty (must be integer --match=MATCH match score (must be integer > 0) --mismatch=MISMATCH gap penalty (must be integer --matrix=MATRIX scoring matrix in ncbi/data/ format, if not specificied, match/mismatch are used where the matrix is optional but can be the full path ...

python object initialization speed

April 16, 2009

On the Cython mailing list, I saw this mentioned for avoiding init overhead, so i wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully). I ran a benchmark with 5 cases: PY_NEW macro (still has python overhead for each call to the creator function) regular python init python init using __slots__ cython init (cdef'ed class) batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead batch init on cython class the timings look like this: PY_NEW on Cython class: 1.160 __init__ on Python class: 30.414 __init__ on Python class with slots: 10.242 __init__ on Cython class 1.185 batch PY_NEW total: 0.855 , interval only: 0.383 batch __init__ on Cython class total 0.998 , interval_only: 0.540 So, the PY_NEW is .383 compared to .540 for using a __init__ on a Cython class, but both are much faster than python. I was surprised that using slots gives a 3x speed improvement over a regular...

binary search over intervals

June 04, 2008

[EDIT: update location of code repo] This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with. Binary search is easy. Just look at the python std library implementation (and the C API version ). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order. Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here , that means storing not only t...

levenshtein in cython

April 27, 2008

EDIT(2): fix markup for <char *> casts ... fix malloc (see comments. thanks Bao). NOTE: using a kwarg for limit slows things down. setting that to a required arg and using calloc for m2 speed things up to nearly as fast as the pylevenshtein. Well, it seems to be popular to code up the levenshtein . I actually have a use for this and wanted to practice some Cython , so I've written a version. I used bearophile's recipe , wikibooks , this (from k4st on reddit) and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn). cdef extern from "stdlib.h": ctypedef unsigned int size_t size_t strlen(char *s) void *malloc(size_t size) void free(void *ptr) int strcmp(char *a, char *b) cdef inline size_t imin(int a, int b, int c): if a if c return c return a if c return c return b cpdef int levenshtein(char *a, char *b, int limit=100): cdef int m = strlen(a),...