Thursday, May 08, 2008

openlayers, genomes and image-maps

In response to Titus' post on using imagemaps for genomic visualization:
Why are imagemaps so popular in genomics? As an extreme and unfair comparison, just imagine if was an image map.
Given a CGI script that can accept a url like
and return an appropriate image, you can provide a substantial set of tools using openlayers, which is developed by what must be one of the largest and active developer communities in GIS. (Yes, I am an openlayers fan-boy.)
You can do that with a small addition to openlayers which I updated a couple weeks ago to OL version 2.6. In that update, I removed > 140 lines of code. So, it's now even less of a change to OL. Maybe when 2.7 comes out, I'll figure out how to provide a patch that allows an extra argument to the OpenLayers.Map constructor that limits panning to the horizontal direction -- in which case genome-browser will cease to exist and only the single file containing OpenLayers.Layer.Genomic would be needed.

Maybe I'd need to make a real example of using openlayers for genomics, making more obvious use of layers, the vector stuff, fractional zoom, markers, geocoding, projections, power steering, etc to make it more obvious how badass OL can be. As far as I know, I'm the only one using it for genomics, and 98% of the time I spent developing it was in my spare time--I only have a good excuse to hack on it at $work when something breaks. If more people were using it, I might be more motivated to figure out how to do stacking of images vertically, but still restricting scrolling to the horizontal--to essentially allow the same thing as "tracks" in other genome browsers. Maybe that's the killer feature.


An observation on the difference between the bio and geo programming environments as I interpret them:
There's some tools that (IMHO) are better in the geo world than in the bio. Perhaps that's because of GDAL, the keystone for geo-data formats and projections. Since any other software (in any SWIG-able language) that makes use of gdal can access pretty much any format and I/O to any projection, then geo developers can do things like make nice renderers, or web-based map browsers rather than figuring how to convert that mrSID image in epsg:4326 to a tif in epsg:900913. (I'm also a GDAL fan-boy.)

Contrast this to the bio world where there's a bio for nearly every common language, bio-java, bio-perl, bio-python, bio-ruby, each with it's own blast parser, gen-bank parser, sequence objects, alignment objects. There is no keystone, so there's more duplication of effort, and there's no parser that's used across languagues as is the case for GIS. -- I'm not suggesting that shouldn't be the case, simply making an observation.

Monday, May 05, 2008

seqfind: levenshtein + bktree

I've copied this recipe that I modified before, and added the BK Tree structure in cython. It's in my repo here.
Check it out with:
svn checkout

or easy_install with
sudo easy_install

It's now using the Damerau-Levenshtein distance which is more sensible for bioinformatics where transpositions are frequent.

Bearophile's original implementation used a tuple, which made sense, but in Cython, it's more efficient to use an object where the properties can be typed--as a class is converted to a c-struct--so there is no conversion when appending to a python array -- if i understand the generated c code correctly.

Using an object also allows arbitrary info to be passed along with the word when creating the tree, again, this is important for bio-informatics when the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like:

words = [Word("actgc ... acgtc", {'name': 'At2g26540'}), ...]
#and then create a tree in the same fashion as with raw strings:
tree = BKTree(words)
# and search returns a word object:
[(w.word,['name']) for w in tree.find("atc", 2)]

It'll suffice to say (the obvious) it's a lot faster searching with the tree, than comparing every word, on every search.