Sunday, January 10, 2010

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance.

Ack

It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.

Bugs

Placement of Gaps

Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments:
AGEBAMAM
AGEBAM--
and
AGEBAMAM
AGEB--AM
Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the scoring matrix.

Gap Extension

nwalign allows use of a scoring matrix to lookup the score/penalty for conversion from one letter to another or a simpler version where the user specifies gap_open, gap_extend, mismatch penalties, and match reward. For the simpler non-matrix path, earlier versions of nwalign did not heed the gap_extension--using the gap_open penalty regardless of the previous entries in the dynamic programming matrix. It's common to use only a gap penalty without a separate gap_extend penalty, nwalign no longer assumes that's what is preferred -- if it is, one can simply specify a `gap_extend` penalty that is equal to `gap_open`.

Scoring

Previous versions of nwalign only returned the alignment and didn't provide access to the score of the alignment. Recent versions allow the user to get the score via:

>>> nw.score_alignment('CEELECANTH', '-PELICAN--', gap_open=-5,
... gap_extend=-2, matrix='PAM250')
11

where `11` is the score of the alignment given, the gap_open and extend_penalty, and subsitution scores in the file 'PAM250' (which is in the NCBI substitution matrix format).

Performance

The earliest version of nwalign was literally about 100 times faster than the perl version from the BLAST book. But, there were a few places where performance has improved even more since that. The most dramatic was in the lookups for the scoring matrix values. Given, again the 2 short strings: "AGEBAMAM" and "AGEBAM", the alignment algorithm has to do the inner loop 48 ( 6 * 8 ) times, in each of those loops, it has to look up the substitution score in the matrix. For example in the first iteration, it has to find the score for "changing" from an "A" to an "A"; in the PAM250 matrix, that score is +2. In the substitution matrix, the row and column "keys" are the amino acids and the values are the scores for changing from the row key to the column key. Previously, I stored the keys in their own array, then did a linear search to find the index. So, for both the row and column, an amino acid letter is translated to an index with a function like:

aa_keys = ['A', 'R', 'N', 'D', ... ] # or 'ARND...'
def findpos(amino_acid, aa_keys):
i = 0
while i < len(aa_keys):
if aa_keys[i] == amino_acid:
return i
i += 1
return - 1

except in C(ython) so it was much faster, the alignment then uses the return values for row and column amino acid to look up the score from the substitution matrix. When the string are long enough-- as they will be with real proteins, this is a huge speed bottleneck. So, it trades memory for speed. Instead of using a 25 * 25 matrix to store the substitution matrix, it now uses an X * X matrix where 'X' is the ord value of the largest amino acid. Usually this will be less than 'Z', so the matrix will be 90 * 90, stored efficiently as a numpy array. From there, the lookup is directly with matrix[ord(x_amino), ord(y_amino)] which is extremely fast in C because the ord is unnecessary as a char can index an array. This gave at least a 3X speed improvement. I could reduce the memory used by the matrix by subtracting ord('A'), but that would be trading speed for memory since it would require 2 extra subtractions per inner loop.

Also, the latest version uses less memory in other areas; in order to do the dynamic programming, the algo has to save N * M arrays of direction "pointers" (in the left/right/diagonal sense (not the c sense), scores and gaps. However, the gaps are not actually needed for the entire trace-back, they are only needed 1 step back to determine if a current gap is a gap_open, or gap_extend. So, now, instead of an N*M matrix for the gap matrix, it's N*1 and M*1 arrays. For large N * M, this is enough to offset the increased memory incurred by how the substitution matrix is stored.

Running a benchmark test script with nwalign-0.1 takes 12.13 seconds and the same script on nwalign-0.3.1 takes 3.41 seconds (and actually runs in 2.99 seconds with unladen-swallow... but that's another story). The script does an alignment on 1200 * 1600 basepair sequences 100 times. As with most things, I figured most of this by banging my head against the wall long enough that the stars aligned, so to speak, anyone who cares to look at the code and offer suggestions would be much appreciated.