Showing posts from May 30, 2010

python tools for bioinformatics I: convolve for moving average

This will be the first post of a few describing python tools or idioms that I have found work to well for doing bioinformatics -- or more specifically genomics.
Convolvenumpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made of of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve. It does require you do get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')".
So the workflow is toget the sequence into a numpy array of dtype 'c'convert the boolean array to int so True is 1, False is 0decide on the window size and make a kernel of that shape …