Monday, April 07, 2008

Genedex: query genomic features and sequence

Normally, I don't write libraries, I figure smarter people than I should do such things, and I should just use them. But, I got tired enough of writing one-off scripts for genomic feature manipulation-- find the upstream, downstream neighbors and get the sequence -- and I saw enough of the pieces coming together that I decided to build it. I'd complained before about how rtree didn't support 1D indicies. Not only is this not a problem, it's beneficial. Genomic features should have strand information, so that's the 2nd dimension. Then rtree does containment queries, so it's simple to find only the features on a given strand. I realized this about the same time that the docstring for numpy's memmap went from 0 lines to about 100, and it was enhanced to take a filehandle, not just a filename. This means you can send in a start position and a shape to the numpy.memmap constuctor and it can create a numpy array of only that chunk. This means that it's possible to slice an unaltered fasta file using the numpy array syntax. That's very good.

So, if you put those 2 simple things together, you have the start of something powerful. That's what I did. Then I gave it a crappy name: Genedex (Gendex was taken) and slapped it into googlecode. Check it out: My only design goal was to keep it as simple as possible. If the amount of features is under-whelming, that's good.


Also, I generally do TDD very half-ass, with asserts and maybe a couple doctests. However, I recently made fairly substantial changes to the SQLite datasource in featureserver, and wrote this set of doctests while doing so. It works! and I've been using it. So, I did what featuresever (presumably crschmidt) devs did and copied the setup for the shapely doctests. It's pretty useful for design, i'd just write out the code for how I wanted the API to look and then implement. The only thing is, for doctests, the way they're used (at least by me) is to copy the output from executing the code into the doctest. So, if your code is wrong to start with, you just copy the wrong answer into the doctest and it's broken but the tests pass. But, at least it's good for regressions, and I just had to remember not to blindly trust the output. That's true for all testing, but especially so for doctests.

So, there's now more tests than code. But, since it's mostly just tie-ing together pieces that do the real work, it's not much code. Doc-tests are also nice because (as the name suggests) they double as documentation. So, here's the genedex documentation:
It's pretty! It gets colored by pygments, using this script. The only major thing I'd like to add to the library is a plotting class using matplotlib. Then other smaller tasks like a method that takes 2 features and returns the sequence between them.
Any fixes, enhancements, ridicule, etc. will be greeted with commit access.

No comments: