Monday, January 18, 2010

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':
  • CG: a 'G' follows the 'C'

  • CHG: anything but a 'G' follows the 'C' and a 'G' follows that

  • CHH: no 'G' in the 2 subsequent nucleotides.

So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.


Given a numpy array `seq`, it's possible to find all the 'C's with:
np.where(seq == 'C')
. From there one can calculate the methylation type without looping or if'ing with:

which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at using the slicing and indexing.
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:

which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:

0: not a C or G
1: CG at a 'C' (+ strand )
2: CHG at a 'C' (+ strand )
3: CHH at a 'C' (+ strand )
4: CG at a 'G' (- strand )
5: CHG at a 'G' (- strand )
6: CHH at a 'G' (- strand )

The full implemenation is here. (even the reverse-complement is done without loops or ifs)