synteny mapping
Living in Synteny
I've been working on automating synteny mapping between any pairs of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse, there's still a lot of synteny.
Plants can undergo polyploidy, following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.
I've written my own algorithm to find synteny which uses python sets, and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it's ... sorta non-deterministic.
The output looks like this for Arabidopsis against itself:
That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromsomes so above and below the diagonal should be mirror images. and with:
+ the diagonal dark line in the center being where each gene matches itself
+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.
+ the speckled dots all over being because genes have many relatives. and there's lots of frequently appearing transposons.
+ the red lines being what my algorithm finds as diagonals.
The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny finding programs and found them no better than mine, but just found dagchainer, where DAG is directed-acyclic graph. It's a but fussy in that given too many points, it will find spurious diagonals, but it's predictable, and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is published and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesnt rely on protein sequence as many synteny programs do. This is important when dealing with poorly annotated genomes. I had some trouble with it originally, and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters , ran it himself and sent me the results in under a day, including a couple of explanatory emails back and forth. So, now, I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful. Must be another one of those crazy folks who enjoy what they do.
I've been working on automating synteny mapping between any pairs of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse, there's still a lot of synteny.
Plants can undergo polyploidy, following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.
I've written my own algorithm to find synteny which uses python sets, and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it's ... sorta non-deterministic.
The output looks like this for Arabidopsis against itself:
That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromsomes so above and below the diagonal should be mirror images. and with:
+ the diagonal dark line in the center being where each gene matches itself
+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.
+ the speckled dots all over being because genes have many relatives. and there's lots of frequently appearing transposons.
+ the red lines being what my algorithm finds as diagonals.
The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny finding programs and found them no better than mine, but just found dagchainer, where DAG is directed-acyclic graph. It's a but fussy in that given too many points, it will find spurious diagonals, but it's predictable, and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is published and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesnt rely on protein sequence as many synteny programs do. This is important when dealing with poorly annotated genomes. I had some trouble with it originally, and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters , ran it himself and sent me the results in under a day, including a couple of explanatory emails back and forth. So, now, I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful. Must be another one of those crazy folks who enjoy what they do.
Comments
thus far, dagchainer has given enough freedom to to what i want--given the proper input.
it sounds like you know much more about what's available--thanks for keeping me informed. does the paterson lab use enredo?