Posts

Showing posts from January 3, 2010

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta. One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends.
With this option, a file like:

>a
actg
actg
actg
>b
aaaacccc
aaaacccc
will overwritten (fla…