Monday, June 07, 2010

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics.

This post involves grouping items either using a disjoint set or python's
itertools.groupby. I was introduced to both of these
by the post-doc who sits behind me.

Grouper

Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook.
The slightly modified implementation we use is here.
Where the docstring gives a decent idea of useage:
"""
   This class provides a lightweight way to group arbitrary objects
   together into disjoint sets when a full-blown graph data structure
   would be overkill. 

   Objects can be joined using .join(), tested for connectedness
   using .joined(), and all disjoint sets can be retrieved using list(g)
   The objects being joined must be hashable.

   >>> g = Grouper()
   >>> g.join('a', 'b')
   >>> g.join('b', 'c')
   >>> g.join('d', 'e')
   >>> list(g)
   [['a', 'b', 'c'], ['d', 'e']]
   >>> g.joined('a', 'b')
   True
   >>> g.joined('a', 'c')
   True
   >>> 'f' in g
   False
   >>> g.joined('a', 'd')
   False"""
Basically the idea is if there's an explicit connection between `A` and `B` and an explicit connection between `A` and `C` the Grouper class will create a connection between `B` and `C`. In the implementation above, the explicit connections are created with `Grouper.join()` For genomics stuff, we can use this in many places, one simple use-case is finding local duplicates. These are also called tandem duplications and are identified as a group of genes with very similar sequence co-occurring in a local chromosomal region. We want to group these into a single "family" with all members, even if there is not an explicit connection between all genes--due to slight sequence divergence or sequence alignment (BLAST) oddities.
A snippet of code from here (more nice code from Haibao) shows the use of a grouper to find syntenic regions by checking if adjacent members of (an already sorted list) are within a certain distance (window) of each other:

after this, the elements that are grouped in the `g` Grouper object are in the same window (though not strictly syntenic).
This is a very nice abstraction for anything where you have groups of related objects. It reduces the accounting invovled because once you have `join`ed all the elements, querying the Grouper object with any element will return all elements in the group to which it is joined.

groupby

Python's itertools has a number of useful features. `groupby` is a means to partition an iterable similar groups of adjacent items, where `similar` is determined by a function you supply to groupby. The example from the docs is:

but it's also possible to specify an arbitrary grouping function, so grouping capital letters together:

So the `k`ey tells what the grouping function returned and the values or groups are grouped with other adjacent items from the list that also pass the test. So in some cases, we often want to sort the input before using groupby and then we can get groups and avoid an extra nested for-loop that we would otherwise need for filtering.
An example use is grouping BLAST hits by query and subject. Here, BlastLine is a simple object that takes a line from a tab-delimited blast and converts the e-value and bitscore to float and the starts and stops to ints, etc.

and since the blasts are sorted beforehand, this gives a useful set of blasts in blast_iter--containing only blasts with a matching query and subject. To do this without groupby requires quite a bit more code and reduces readability.

I recently saw a post about using `groupby` to parse fasta files. It's a good idea as a fasta file consists of pairs of a single header line, followed by N lines of sequence. A header line always starts with ">" so the test function for groupby is clear, it's simply getting them out in pairwise fashion, here's one way building from the post above:


groupby-grouper

Here's a final example, both of these used together to find tandem duplications given a genomic Bed file (contains chromsome, start, stop) and a list of blasts. Again this is code from Haibao from here.

So, this creates a `simple_blast` that's easier to sort and group, it iterates for the groups based on the subject
and then groups hits if they're on the same chromosome an within a given distance (where distance is measured in genes).
I like this example because I'd previously written code to do the same thing without Grouper or groupby and it was longer, slower and less readable.

No comments: