Thursday, February 28, 2008

flash, vi, fcsh

All my flash tinkering has been in VIM-- no IDE, no XML, just actionscript. It's a little tough to deal with the adobe compiler as it takes about 11 seconds to compile a large project like modestmaps on my machine. That's not good for a guess-and-check programmer. The typing does catch some errors.

The worldkit project compiles instantaneously with mtasc (the predecessor to haxe)--likewise for the as2 branch of modestmaps. The flash compiler shell drops the compile time for as3 modestmaps to under 3 seconds, so I've added this to my .bash_aliases:

alias fcsh="/usr/bin/rlwrap /opt/src/flex2/bin/fcsh"

the rlwrap is to use readline in the flash shell--meaning I can just press up-arrow to get the previous compile command. By default, one has to paste or type the entire command again.
With that, it's close to a reasonable workflow.

Wednesday, February 27, 2008

synteny mapping

Living in Synteny
I've been working on automating synteny mapping between any pairs of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse, there's still a lot of synteny.
Plants can undergo polyploidy, following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.
I've written my own algorithm to find synteny which uses python sets, and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it's ... sorta non-deterministic.
The output looks like this for Arabidopsis against itself:


That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromsomes so above and below the diagonal should be mirror images. and with:
+ the diagonal dark line in the center being where each gene matches itself
+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.
+ the speckled dots all over being because genes have many relatives. and there's lots of frequently appearing transposons.
+ the red lines being what my algorithm finds as diagonals.
The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny finding programs and found them no better than mine, but just found dagchainer, where DAG is directed-acyclic graph. It's a but fussy in that given too many points, it will find spurious diagonals, but it's predictable, and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is published and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesnt rely on protein sequence as many synteny programs do. This is important when dealing with poorly annotated genomes. I had some trouble with it originally, and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters , ran it himself and sent me the results in under a day, including a couple of explanatory emails back and forth. So, now, I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful. Must be another one of those crazy folks who enjoy what they do.

Tuesday, February 26, 2008

open source gis and flash maps part two

Mash Flap
I started looking into flash mapping stuff lately. For the patch I submitted to worldkit, Mikel gave me commit access! So, now the svn version of worldkit can be compiled with mtasc by typing "make". I feel unreasonably proud of that, given that much of what I did was some global search and replace stuff in VI, and then read and fixed mtasc compiler errors until they went away. It was good fun.
Michal Migurski saw in that same post that I mentioned modestmaps and gave me some good ideas on getting WMS going. I just figured out how to get that working and posted a message to their overly web2.0 forums. Hopefully someone with some real actionscript skillz will clean it up. A mapping library without a good WMS interface is much less useful for most of the stuff I do.

I haven't decided whether to use modestmaps or worldkit, or both. The time stuff the Mikel has done in worldkit is very cool and I haven't really looked at that yet. But I have a time based project starting soon. But, most of my GIS project are reliant on good, hi-resolution imagery and quality roads data, and it's nice if that's just "magically" included via one of the commercial map providers. That's a plus for modestmaps. Though I _still_ do not understand if the modestmaps usage of google, yahoo, microsoft images falls within the terms of use... Especially since the google starts sending "X" tiles instead of imagery after browsing with modesmaps for a while. Meanwhile, I keep using OpenLayers, because that's what sane people do.

Sunday, February 24, 2008

fast python with shedskin

There's a new release of the shedskin compiler. It is able to generate fast shared libraries that can be run from CPython. It can also create binaries so I thought I'd see how it did on some code from this BMC bioinformatcs article compared to psyco and CPython.
I took this iterative, brute-force (Needleman-Wunsch?) alignment code and modified it slightly. That's pasted here. (Notice the first line! that's how it appears in the original code). The modifications allow shedskin to infer the function and variable types. Plus, there's a couple changes I made that improve the run-time for all cases. The max() function is also in the original, but unnecessary because of python's builtin max(), however, pysco does run much faster using their hand-coded max(). For the shedskin run, I removed that extra code and used shedskin's builtin 'cause it made me feel better.

The python code was run as
$ time python -c "import alignment; alignment.imain()"

To run with psyco. The only change from the pasted script is to add
'import psyco; psyco.full()' at the top.

To run with shedskin built executable (after removing the max()):
$ shedskin alignment.py ### generates Makefile
$ make
$ time ./alignment

To run with shedskin as a shared (.so) library
$ shedskin -e -n alignment.py # generates Makefile
$ make
$ time python -c "import alignment; print alignment.imain()"
# python finds and imports the shared library (.so) before the .py file.

Timing:
time reported is real.
Python 2.5.1: 19.030s
Psyco: 1.1336s
Shedskin shared: 0.921
Shedskin binary: 0.818

CPython is much slower at 19seconds compared to ~1 for pysco and shedskin. The output format for the shedskin binary is slightly different because it calls the stuff in __main__, but they all generate the same alignments. It's interesting to look at the alignment.ss.py that shedskin generates, as you can see all the inferred types. The alignment.cpp contains the generated cpp code, which is also quite readable. It's also nice to be able to get a binary executable without jumping through any extra hoops.

Shedskin was very easy to use and faster than psyco for this case, I just pulled from SVN and it worked out of the box. It now has support for sets and regular expressions, and seems quite active. I can see using shedskin for purely brute force stuff where numpy is no help and I might otherwise have to resort to cython. I kinda like cython, but it's nice just to be able to get fast code with little to no modifications.


EDIT:
jython 2.3a0 runs this unaltered in 42 seconds.