Monday, January 04, 2010

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta. One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends.
With this option, a file like:

>a
actg
actg
actg
>b
aaaacccc
aaaacccc

will overwritten (flattened in-place) to:

>a
actgactgactg
>b
aaaaccccaaaacccc

simple. When dealing with large files (where this is actually useful), the flattened file does not behave well when opened in an editor because the editor will attempt to read a number of lines into the buffer and a single line may be 200 mega-bases. So, this is a pain if you're planning to sit down with a cup of joe and read through the genome, but otherwise, the fasta file should be un-affected.
This method is not currently the default (though it may be so in future versions). But, it's possible to use the commandline:
pyfasta flatten some.fasta
which will create the flattened fasta (and the index file) and a placeholder some.fasta.flat, containing the text "@flattened@" as a marker to pyfasta that it's ok to use the original (now-flattened) fasta. Once the file is flattened, there is no performance loss compared to having a separate flat file containing no headers.

pyfasta was a fun project for me in 2009. It's a ridiculously simple little module, but when I started it, I didn't think there was a good alternative. (Though discriminating Fasta-ers should look at the sequence module in pygr, and the Bio.Seq module in BioPython which I think has improved quite a lot recently). It has over 100 tests and very close to 100% test coverage for the modules in pyfasta, and much of the code is run once for each of the 4 backends.

the source is on bitbucket.

No comments: