seqfind: levenshtein + bktree

I've copied this recipe that I modified before, and added the BK Tree structure in cython. It's in my repo here.
Check it out with:
svn checkout

or easy_install with
sudo easy_install

It's now using the Damerau-Levenshtein distance which is more sensible for bioinformatics where transpositions are frequent.

Bearophile's original implementation used a tuple, which made sense, but in Cython, it's more efficient to use an object where the properties can be typed--as a class is converted to a c-struct--so there is no conversion when appending to a python array -- if i understand the generated c code correctly.

Using an object also allows arbitrary info to be passed along with the word when creating the tree, again, this is important for bio-informatics when the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like:

words = [Word("actgc ... acgtc", {'name': 'At2g26540'}), ...]
#and then create a tree in the same fashion as with raw strings:
tree = BKTree(words)
# and search returns a word object:
[(w.word,['name']) for w in tree.find("atc", 2)]

It'll suffice to say (the obvious) it's a lot faster searching with the tree, than comparing every word, on every search.


Popular posts from this blog

my thoughts on golang

filtering paired end reads (high throughput sequencing)

python interval tree