Sunday, May 30, 2010

python tools for bioinformatics I: convolve for moving average

This will be the first post of a few describing python tools or idioms that I have found work to well for doing bioinformatics -- or more specifically genomics.


numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made of of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve. It does require you do get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')".
So the workflow is to
  • get the sequence into a numpy array of dtype 'c'
  • convert the boolean array to int so True is 1, False is 0
  • decide on the window size and make a kernel of that shape that sums to 1
  • profit
with a quick addition for plotting:
the first 100 kilobases of gc content for 50 and 250bp windows look like:

where, as expected, the 250 basepair window is more "smoothed".
From there, it's possible to do some analysis, for example grab the regions with the highest gc content.
That's some fairly dense code to pull out the values centered in the highest GC-content windows and then show the mean of those windows. (Arabidopsis thaliana, which I used for that example, has a fairly low genome-wide GC-content, so the 0.61 average is quite high.)

Another Example

There was a recent question on biostar where someone wanted to find regions with an abundance of a couple amino acids above some cutoff.
The main part of that looks like:
    abool = (seq == 'N') | (seq == 'Q') # convert to boolean
    mw = np.convolve(abool, kern, mode='same')
    if mw.max() < cutoff: continue

Where that operates on 'N' and 'Q' amino acids in a protein sequence instead of 'G' and 'C' as in the example above.


Finally, this approach is much more flexible than I've shown, the kernel does not have to be uniform, it can be guassian, or even only taking values to the left of each nucleotide, or weighting the nearest 10 nucleotides at 1 and the rest at 0.2. Even given those changes, once the kernel is chosen, the rest of the "workflow" remains unchanged.


In cases where the input array is numeric--not sequence--there is no need to do the conversion to a boolean, simply run the convolution on the original array with the chosen kernel.

In the examples above, I've shown only np.convolve with mode='same', for most cases dealing with sequence, I think this is a good choice, but it's best to consult the documentation for each specific case.

Finally, in cases where the kernel and the input array are very large, it may be better to use fft convolution from fftconvolve in scipy. I haven't used this much, I think it may require doing your own padding and chopping at the edges...