synteny mapping

Living in Synteny
I've been working on automating synteny mapping between any pairs of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse, there's still a lot of synteny.
Plants can undergo polyploidy, following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.
I've written my own algorithm to find synteny which uses python sets, and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it's ... sorta non-deterministic.
The output looks like this for Arabidopsis against itself:

That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromsomes so above and below the diagonal should be mirror images. and with:
+ the diagonal dark line in the center being where each gene matches itself
+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.
+ the speckled dots all over being because genes have many relatives. and there's lots of frequently appearing transposons.
+ the red lines being what my algorithm finds as diagonals.
The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny finding programs and found them no better than mine, but just found dagchainer, where DAG is directed-acyclic graph. It's a but fussy in that given too many points, it will find spurious diagonals, but it's predictable, and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is published and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesnt rely on protein sequence as many synteny programs do. This is important when dealing with poorly annotated genomes. I had some trouble with it originally, and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters , ran it himself and sent me the results in under a day, including a couple of explanatory emails back and forth. So, now, I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful. Must be another one of those crazy folks who enjoy what they do.


Bao said…
Marker-based colinearity mining software are mostly based on dynamic programming and tend to be very fast. It turns out that the key issues are that you need to tuned the parameters (largely affected by gene density and tandem matches) and work on statistics (see Wang et al. BMC Bioinformatics 2006) or permutation test to filter out spurious matches. However, over long evolutionary distance, colinearity reduces to synteny which is then harder to search, for example, Arabidopsis-rice, moss-angiosperm, fugu-human, etc.
brent said…
bao, thanks for the reference, i'm downloading colinearscan software now. i generally dont like the dependence on protein blast as we have a lot of organisms that are poorly annotated, but i imagine it will work with blastn as well. i think the problem is that there is no real, clear definition of what is a spurious match -- especially as you say, over greater evolutionary distance.
thus far, dagchainer has given enough freedom to to what i want--given the proper input.
Bao said…
Brent, if don't prefer protein anchors, rather matches at nucleotide-level, then enredo ( appears to be a better choice. The software was developed as a component for ensembl pipeline and reviewed in several publications ( but the method itself has not been published(?!). I just glanced at the source code. Rather than trying to maximize the score of the local segments, it tries to simplify the matching graph. IMHO, there are not much trick to the algorithms, most of the manipulations appear to be empirical. But it represents a slightly different approach than both dagchainer/colinearscan. Tell me about your thoughts on this if you get a chance to try.
brent said…
bao, thanks again. that nature paper is a good reference. i haven't yet tried out the actual software, but i'll keep it in mind. IMHO, the trick to the algorithms _is_ the empirical tweakings. though i dont know too much about it--which is why i'm content to use what's availalable.
it sounds like you know much more about what's available--thanks for keeping me informed. does the paterson lab use enredo?

Popular posts from this blog

python interval tree

filtering paired end reads (high throughput sequencing)

my thoughts on golang