Posts

Showing posts with the label genedex

genedex.fasta with numpy.memmap

EDIT: added job posting to comments. I've been working a bit on genedex , I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree , but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema do allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py . That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on th...

Genedex: query genomic features and sequence

Normally, I don't write libraries, I figure smarter people than I should do such things, and I should just use them. But, I got tired enough of writing one-off scripts for genomic feature manipulation-- find the upstream, downstream neighbors and get the sequence -- and I saw enough of the pieces coming together that I decided to build it. I'd complained before about how rtree didn't support 1D indicies. Not only is this not a problem, it's beneficial. Genomic features should have strand information, so that's the 2nd dimension. Then rtree does containment queries, so it's simple to find only the features on a given strand. I realized this about the same time that the docstring for numpy's memmap went from 0 lines to about 100, and it was enhanced to take a filehandle , not just a filename. This means you can send in a start position and a shape to the numpy.memmap constuctor and it can create a numpy array of only that chunk. This means that it's pos...