fileindex
Disclaimer: This is about a generic indexing method that fits in here and explained below. Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python . And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via sqlite db. Titus wrote about screed and posted to the biology-in-python mailng list, which is how i originally heard about it. Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record). In the case of flat files, it's a mapping from and id or name to the fseek file-position at ...