Tuesday, April 21, 2009

Needleman-Wunsch global sequence alignment

[EDIT 01-01-2010] this is now available at bitbucket.
I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as:
 sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/

or via svn from:
svn co http://bpbio.googlecode.com/svn/trunk/nwalign/

it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:

nwalign [options] seq1 seq2

-h, --help show this help message and exit
--gap=GAP gap extend penalty (must be integer <= 0)
--gap_init=GAP_INIT gap start penalty (must be integer <= 0)
--match=MATCH match score (must be integer > 0)
--mismatch=MISMATCH gap penalty (must be integer < 0)
--matrix=MATRIX scoring matrix in ncbi/data/ format,
if not specificied, match/mismatch are used

where the matrix is optional but can be the full path specifying the transition score from row to column as here.
If the matrix is specified, match and mismatch are not used. If the matrix is not specified, match and mismatch are used. a typical run looks like:

$ nwalign alphabet alpet

$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG
And usage from a python script can be seen in test.py.

I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple places that could still be sped up, but I think the improvement will be relatively small. This kind of script is perfect for the numpy-cython mix as I can just access the contents of the 2d array at c-speed without having to do pointer arithmetic. If there's any obvious optimizations I missed, I'd be glad to know them.


diffusing thoughts said...

What is wrong with EMBOSS needle?

brentp said...

@diffusing thoughts: the only thing wrong with it is that i didnt know it existed, and searches in google for various combinations of needleman-wunsch, global sequence alignment ..., etc. didn't find it. thanks for the pointer.
here's the link for anyone else interested.

Michael said...

EMBOSS works well, but you have to a) write the sequences to files and b) do a process call/fork and c) read back the alignment/scores. I haven't tried this app yet, but it's a general problem (and Biopython only provides wrappers to EMBOSS so the problem remains)

brentp said...

@Michael, that's good to know. let me know if you have any recommendations to make it more useful. i've since uploaded it to pypi