Posts

Showing posts from May 30, 2010

python tools for bioinformatics I: convolve for moving average

Image
This will be the first post of a few describing python tools or idioms that I have found work to well for doing bioinformatics -- or more specifically genomics. Convolve numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made of of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve . It does require you do get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')". So the workflow is to get the sequence into a numpy array of dtype 'c' convert the boolean array to int so True is 1, False is 0 decide on the window size and make a kernel of ...