Loopless programming (for calculating methylation types)
DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':
So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.
Given a numpy array `seq`, it's possible to find all the 'C's with:
which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at using the slicing and indexing.
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:
which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:
The full implemenation is here. (even the reverse-complement is done without loops or ifs)
- CG: a 'G' follows the 'C'
- CHG: anything but a 'G' follows the 'C' and a 'G' follows that
- CHH: no 'G' in the 2 subsequent nucleotides.
So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.
Loopless
Given a numpy array `seq`, it's possible to find all the 'C's with:
np.where(seq == 'C'). From there one can calculate the methylation type without looping or if'ing with:
which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at using the slicing and indexing.
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:
which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:
0: not a C or G
1: CG at a 'C' (+ strand )
2: CHG at a 'C' (+ strand )
3: CHH at a 'C' (+ strand )
4: CG at a 'G' (- strand )
5: CHG at a 'G' (- strand )
6: CHH at a 'G' (- strand )
The full implemenation is here. (even the reverse-complement is done without loops or ifs)
Comments
>>> sequence = np.array('ATGCGCGTAGCTATGAGAGCATCGAT', dtype='c')
>>> find(sequence, 'ATG')
(array([ 0, 12]),)
The first ATG starts at index 0 the second at index 12. The output from "find" can be used to index an array.
>>> sequence[find(sequence, 'ATG')]
array(['A', 'A'],
dtype='|S1')
pure python version
?
i guess yours could more easily handle multiple ktup's with numpy broadcasting.
In [1]: "AGCG".find('GC')
Out[1]: 1
In [2]: arr = np.array('AGCG', dtype='c').view(np.char.chararray)
In [3]: arr.find('GC')
Out[3]: array([-1, -1, -1, -1])
char array does not do the right thing because it takes each element from arr e.g. 'A' and calls it's method e.g. 'find' withe the given argument e.g. 'GC', which fails "G".find("GC") returns -1