Monday, May 05, 2008

seqfind: levenshtein + bktree

I've copied this recipe that I modified before, and added the BK Tree structure in cython. It's in my repo here.
Check it out with:
svn checkout

or easy_install with
sudo easy_install

It's now using the Damerau-Levenshtein distance which is more sensible for bioinformatics where transpositions are frequent.

Bearophile's original implementation used a tuple, which made sense, but in Cython, it's more efficient to use an object where the properties can be typed--as a class is converted to a c-struct--so there is no conversion when appending to a python array -- if i understand the generated c code correctly.

Using an object also allows arbitrary info to be passed along with the word when creating the tree, again, this is important for bio-informatics when the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like:

words = [Word("actgc ... acgtc", {'name': 'At2g26540'}), ...]
#and then create a tree in the same fashion as with raw strings:
tree = BKTree(words)
# and search returns a word object:
[(w.word,['name']) for w in tree.find("atc", 2)]

It'll suffice to say (the obvious) it's a lot faster searching with the tree, than comparing every word, on every search.

No comments: