python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics.

This post involves grouping items either using a disjoint set or python's
itertools.groupby. I was introduced to both of these
by the post-doc who sits behind me.

Grouper

Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook.
The slightly modified implementation we use is here.
The docstring gives a decent idea of usage:
"""
   This class provides a lightweight way to group arbitrary objects
   together into disjoint sets when a full-blown graph data structure
   would be overkill. 

   Objects can be joined using .join(), tested for connectedness
   using .joined(), and all disjoint sets can be retrieved using list(g)
   The objects being joined must be hashable.

   >>> g = Grouper()
   >>> g.join('a', 'b')
   >>> g.join('b', 'c')
   >>> g.join('d', 'e')
   >>> list(g)
   [['a', 'b', 'c'], ['d', 'e']]
   >>> g.joined('a', 'b')
   True
   >>> g.joined('a', 'c')
   True
   >>> 'f' in g
   False
   >>> g.joined('a', 'd')
   False"""
Basically, the idea is that if there's an explicit connection between `A` and `B` and an explicit connection between `A` and `C`, the Grouper class will create a connection between `B` and `C`. In the implementation above, the explicit connections are created with `Grouper.join()`.

For genomics work, this is useful in many places; one simple use-case is finding local duplicates. These are also called tandem duplications and are identified as a group of genes with very similar sequence co-occurring in a local chromosomal region. We want to group these into a single "family" containing all members, even when there is not an explicit connection between every pair of genes--due to slight sequence divergence or sequence alignment (BLAST) oddities.
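As a tiny sketch of that use-case (hypothetical gene names; only the methods shown in the docstring above are used), grouping pairwise similarity hits into families looks like:

# hypothetical gene names; each pair is an observed similarity (e.g. a filtered BLAST hit)
pairs = [("geneA", "geneB"), ("geneA", "geneC"), ("geneX", "geneY")]

families = Grouper()
for a, b in pairs:
    families.join(a, b)

print list(families)
# [['geneA', 'geneB', 'geneC'], ['geneX', 'geneY']]
print families.joined("geneB", "geneC")
# True -- no explicit pair, but connected through "geneA"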
A snippet of code from here (more nice code from Haibao) shows the use of a Grouper to find syntenic regions by checking whether adjacent members of an already-sorted list are within a certain distance (window) of each other:
import itertools

# `data`, `sbed` and `window` are defined in the surrounding code.
ysorted = sorted(data, key=lambda x: x[1])
g = Grouper()

a, b = itertools.tee(ysorted)
next(b, None)
for ia, ib in itertools.izip(a, b):
    pos1, pos2 = ia[1], ib[1]
    # join if they're close and on the same chromosome.
    if pos2 - pos1 < window and sbed[pos1].seqid == sbed[pos2].seqid:
        g.join(ia, ib)

After this, the elements that are grouped together in the `g` Grouper object are in the same window (though not strictly syntenic).
This is a very nice abstraction for anything where you have groups of related objects. It reduces the accounting involved because, once you have `join`ed all the elements, querying the Grouper object with any element will return all elements in the group to which it belongs.
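The modified Grouper we use may expose a direct lookup for this, but even with only the methods shown in the docstring you can pull out the group containing a particular element by scanning the groups; a minimal (hypothetical) helper:

def find_group(grouper, item):
    """return the group containing `item`, or None if it was never joined (sketch)."""
    for group in grouper:
        if item in group:
            return group
    return None

# with the docstring example above: find_group(g, 'b') -> ['a', 'b', 'c']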

groupby

Python's itertools has a number of useful features. `groupby` is a means to partition an iterable into groups of similar adjacent items, where `similar` is determined by a key function you supply to groupby. The example from the docs is:
>>> from itertools import groupby
>>> [k for k, g in groupby('AAAABBBCCDAABBB')]
['A', 'B', 'C', 'D', 'A', 'B']
>>> [list(g) for k, g in groupby('AAAABBBCCDAABBB')]
[['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['C', 'C'], ['D'], ['A', 'A'], ['B', 'B', 'B']]

but it's also possible to specify an arbitrary grouping function; for example, grouping runs of capital letters together:
>>> [(k, list(g)) for k, g in groupby('AAaaABZcdefcBB', lambda l: l.isupper())]
[(True, ['A', 'A']), (False, ['a', 'a']), (True, ['A', 'B', 'Z']), (False, ['c', 'd', 'e', 'f', 'c']), (True, ['B', 'B'])]

So the `k`ey is whatever the grouping function returned, and each group contains the adjacent items from the iterable that share that key. Because groupby only merges adjacent items, we often want to sort the input first; then each key appears exactly once, and we avoid the extra nested for-loop we would otherwise need for filtering.
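A toy illustration of why the sort matters (made-up data): without sorting, the same key shows up in several fragments; after sorting, each key gives a single group:

>>> from itertools import groupby
>>> letters = list('ABABAB')
>>> [(k, len(list(g))) for k, g in groupby(letters)]
[('A', 1), ('B', 1), ('A', 1), ('B', 1), ('A', 1), ('B', 1)]
>>> [(k, len(list(g))) for k, g in groupby(sorted(letters))]
[('A', 3), ('B', 3)]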
An example use is grouping BLAST hits by query and subject. Here, BlastLine is a simple object that takes a line of tab-delimited BLAST output and converts the e-value and bitscore to floats and the starts and stops to ints, etc.
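A rough sketch of what such a BlastLine might look like (the attribute names here are my own, chosen to match the snippet below; the column order assumes standard -m8 tabular BLAST output):

class BlastLine(object):
    """a single line of tab-delimited (-m8) blast output (sketch)."""
    def __init__(self, line):
        args = line.rstrip("\n").split("\t")
        self.query, self.subject = args[0], args[1]
        self.pctid = float(args[2])
        self.hitlen = int(args[3])
        self.qstart, self.qstop = int(args[6]), int(args[7])
        self.sstart, self.sstop = int(args[8]), int(args[9])
        self.evalue, self.score = float(args[10]), float(args[11])

With a list of those, the grouping itself is: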
import operator
from itertools import groupby

blasts = [BlastLine(line) for line in open(blast_file)]
blasts.sort(key=operator.attrgetter('query', 'subject'))

for (query, subject), blast_iter in groupby(blasts, lambda b: (b.query, b.subject)):
    print query, subject, blast_iter
    # ... do something useful with blast_iter

And since the blasts are sorted beforehand, this gives a useful set of blasts in blast_iter--containing only blasts with a matching query and subject. Doing this without groupby requires quite a bit more code and reduces readability.
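For comparison, here's roughly what the manual version of that loop looks like (a sketch): you end up tracking the previous key yourself and remembering to handle the final group:

# with `blasts` sorted as above
current_key, group = None, []
for b in blasts:
    key = (b.query, b.subject)
    if key != current_key and group:
        # ... do something useful with the finished `group` here
        group = []
    current_key = key
    group.append(b)
if group:
    pass  # ... and don't forget the final group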

I recently saw a post about using `groupby` to parse fasta files. It's a good idea, as a fasta file consists of a single header line followed by N lines of sequence, repeated. A header line always starts with ">", so the test function for groupby is clear; the only trick is pulling the groups back out in header/sequence pairs. Here's one way, building from the post above:
from itertools import groupby

def fasta_iter(fasta_name):
    """
    given a fasta file, yield tuples of header, sequence
    """
    fh = open(fasta_name)
    # ditch the boolean (x[0]) and just keep the header or sequence since
    # we know they alternate.
    faiter = (x[1] for x in groupby(fh, lambda line: line[0] == ">"))
    for header in faiter:
        # drop the ">"
        header = header.next()[1:].strip()
        # join all sequence lines to one.
        seq = "".join(s.strip() for s in faiter.next())
        yield header, seq
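Using it is then just a loop over (header, sequence) pairs (hypothetical file name):

for header, seq in fasta_iter("some.fasta"):
    print header, len(seq)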


groupby + Grouper

Here's a final example with both of these used together to find tandem duplications, given a genomic Bed file (containing chromosome, start, stop) and a list of blasts. Again, this is code from Haibao, from here.
def tandem_grouper(bed, blast_list, tandem_Nmax=10):
    simple_blast = [(b.subject, (b.qseqid, b.qi)) for b in blast_list if b.evalue < 1e-10]
    simple_blast.sort()

    standems = Grouper()
    for name, hits in itertools.groupby(simple_blast, key=lambda x: x[0]):
        # these are already sorted.
        hits = [x[1] for x in hits]
        for ia, a in enumerate(hits[:-1]):
            b = hits[ia + 1]
            # on the same chromosome and rank difference no larger than tandem_Nmax
            if b[1] - a[1] <= tandem_Nmax and b[0] == a[0]:
                standems.join(a[1], b[1])

    return standems

So, this creates a `simple_blast` that's easier to sort and group, iterates over the groups of hits sharing a subject, and then joins hits if they're on the same chromosome and within a given distance of each other (where distance is measured in genes).
I like this example because I'd previously written code to do the same thing without Grouper or groupby, and it was longer, slower and less readable.
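To use the result (a sketch, assuming `bed` and `blast_list` have already been loaded), you just iterate over the returned Grouper; each group is a list of gene indexes that form one tandem family:

standems = tandem_grouper(bed, blast_list, tandem_Nmax=10)
for family in standems:
    print family  # e.g. a list of gene indexes belonging to one tandem family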
