Tuesday, April 21, 2009

Needleman-Wunsch global sequence alignment

[EDIT 01-01-2010] this is now available at bitbucket.
I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as:
 sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/

or via svn from:
svn co http://bpbio.googlecode.com/svn/trunk/nwalign/

it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:

nwalign [options] seq1 seq2

-h, --help show this help message and exit
--gap=GAP gap extend penalty (must be integer <= 0)
--gap_init=GAP_INIT gap start penalty (must be integer <= 0)
--match=MATCH match score (must be integer > 0)
--mismatch=MISMATCH gap penalty (must be integer < 0)
--matrix=MATRIX scoring matrix in ncbi/data/ format,
if not specificied, match/mismatch are used

where the matrix is optional but can be the full path specifying the transition score from row to column as here.
If the matrix is specified, match and mismatch are not used. If the matrix is not specified, match and mismatch are used. a typical run looks like:

$ nwalign alphabet alpet

$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG
And usage from a python script can be seen in test.py.

I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple places that could still be sped up, but I think the improvement will be relatively small. This kind of script is perfect for the numpy-cython mix as I can just access the contents of the 2d array at c-speed without having to do pointer arithmetic. If there's any obvious optimizations I missed, I'd be glad to know them.