Friday, October 03, 2008

genedex.fasta with numpy.memmap

EDIT: added job posting to comments.
I've been working a bit on genedex, I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree, but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema do allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py. That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on the start and stop. I could use rtree to index the sqlite database, but I'd like to move away from the LGPL. Maybe this KDTree that's already in a scipy branch with a more permissive license. Then it could do both spatial, and attribute queries...
That's all tinkering...

I also cleaned up the genedex.fasta module. The useage is nice, if not entirely the implementation. A fasta file can look like:

>chr1
ATGTCGTCGGCCGC
GGGCCAAGA
CAACGGAGA

>chr3
ATGGAGGAGGCTGGCGAGCGG

>chr2
ATGGCGTGC
ACGGCGGCG
CGCATGTT
CGCCT

where a line starting with > is the name of the sequence ('chr1') and everything up until the next > is the sequence. The problem is the newlines, so every time you want to look at chr1 basepairs 10 to 20, you have to find where the sequence starts, and account for newlines. -- Acutally one should never do that, as Biopython, and pretty much any library will take care of that for you. Pygr for example, creates a new file something.fasta.pureseq which removes all newlines and labels and indexes where the starts and stop are. genedex.fasta.Fasta now does something similar, here's example useage on the file above ('123.fasta').

from genedex import Fasta
f = Fasta('123.fasta')
print f.keys()
print f['chr1'][9:20].tostring()
print f['chr1'][9:20]

after, the fasta file looks like this:

>chr1
ATGTCGTCGGCCGCGGGCCAAGACAACGGAGA
>chr3
ATGGAGGAGGCTGGCGAGCGG
>chr2
ATGGCGTGCACGGCGGCGCGCATGTTCGCCT

with spaces removed--so it's still a valid fasta file (you can also send an argument to the constructor and it will create a new file) and there's a new file 123.fasta.gdx that is a python pickle containing:
{'chr3': (45L, 67L), 'chr2': (73L, 105L), 'chr1': (6L, 39L)}
which indicate the start and stop positions of the sequence in the file.
So the file remains a valid fasta file, but now it can be efficiently sliced. For now, it actually uses a numpy mmmap (numpy.memmap), to take advantage of the broadcasting, other than that, it'd be simpler to just use python mmap. So, when it sees f['chr1'][10:20] it acts just like it's indexing a numpy array, but it's accessing the data directly from the disk (well, not actually, but mmap works it's magic and I dont have to think about that). I like that--I can keep my fasta files as valid, add only a small python pickle file, and get simple, fast, pythonic indexing into them. It does take about 12 seconds to index and flatten the entire sorghum genome (629MB, ~660 million basepairs) on first view, after that first time, it's instantaneous.
Anyway, the source is available and easy_install-able:

svn checkout http://genedex.googlecode.com/svn/trunk/ genedex
As always, I'll gladly take any improvements, bugs, enhancements.

Also, our lab at UCB is looking for another (full-time or close to it, on-site) programmer who knows some biology, perl, and hopefully some python. If you're interested, or know anyone, send me an email. I have no real authority in the matter (or any matter) but I will have some say in this. I'd like to work with someone I can learn from. I'll add a link to the job posting in the comments below once it's posted.