Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':
  • CG: a 'G' follows the 'C'

  • CHG: anything but a 'G' follows the 'C' and a 'G' follows that

  • CHH: no 'G' in the 2 subsequent nucleotides.

So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.

Loopless


Given a numpy array `seq`, it's possible to find all the 'C's with:
np.where(seq == 'C')
. From there one can calculate the methylation type without looping or if'ing with:

which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at using the slicing and indexing.
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:

which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:

0: not a C or G
1: CG at a 'C' (+ strand )
2: CHG at a 'C' (+ strand )
3: CHH at a 'C' (+ strand )
4: CG at a 'G' (- strand )
5: CHG at a 'G' (- strand )
6: CHH at a 'G' (- strand )

The full implemenation is here. (even the reverse-complement is done without loops or ifs)

Comments

Unknown said…
I like your approach. In a similar spirit this little function finds the first index of every k-tuple in a numpy array. http://python.pastebin.com/f46f7dae7
brentp said…
@diffusing thoughts . cool. could you show an example usage? i'm not quite sure i follow.
Unknown said…
It allows to find occurrences of words in a sequence e.g. all start codons in a sequence of letters
>>> sequence = np.array('ATGCGCGTAGCTATGAGAGCATCGAT', dtype='c')
>>> find(sequence, 'ATG')
(array([ 0, 12]),)

The first ATG starts at index 0 the second at index 12. The output from "find" can be used to index an array.

>>> sequence[find(sequence, 'ATG')]
array(['A', 'A'],
dtype='|S1')
brentp said…
ah, i see. is there a benefit over the
pure python version
?
i guess yours could more easily handle multiple ktup's with numpy broadcasting.
Unknown said…
The main benefit is that my function generalizes to ndarray i.e. the input could represent an alignment i.e. be a rank-2 numpy array. The other benefit is that it works on numpy arrays as the "find" method is broken in char arrays.

In [1]: "AGCG".find('GC')
Out[1]: 1
In [2]: arr = np.array('AGCG', dtype='c').view(np.char.chararray)
In [3]: arr.find('GC')
Out[3]: array([-1, -1, -1, -1])

char array does not do the right thing because it takes each element from arr e.g. 'A' and calls it's method e.g. 'find' withe the given argument e.g. 'GC', which fails "G".find("GC") returns -1
Anonymous said…
This comment has been removed by a blog administrator.

Popular posts from this blog

filtering paired end reads (high throughput sequencing)

python interval tree

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign