Showing posts from February 14, 2010

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative.

Big ol' Matrixbecause it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for more d…