biohash

This is a quick project I did a while back but, I've seen people interested in similar ideas, so I'll post my implementation here.

Geohash encodes latitude/longtitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval.

The use to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, which can be as small as the maximum end value. If all the intervals are small, and the rng is very large, the biohash will not help much. The rng used to load must be the same as the one used to dump or the values won't be correct.

I had plans to finish up a set of SQLAlchemy models for this that would save the hash and use it to do range queries behind the scenes, but haven't finished that up yet. The code is in my google SVN. It will even pull in a fast cython version of the encoder.
If anyone wants to use it, improve it or get it going with SQLAlchemy, it available from SVN with:
svn checkout https://bpbio.googlecode.com/svn/trunk/ihash/
it has tests in __init__.py

Comments

Popular posts from this blog

filtering paired end reads (high throughput sequencing)

python interval tree

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign