Wednesday, December 23, 2009

genome scrubber : mask repetitive sequence

This is to describe a simple tool I've made available (svn repo) for masking repetitive sequence.

rice (Oryza sativa) version 5 sequence looks something like this when run through pyfasta info:

>1 length:43596771
>3 length:36345490
>2 length:35925388
>4 length:35244269
>6 length:31246789
>5 length:29874162
>7 length:29688601
>11 length:28462103
>8 length:28309179
>12 length:27497214
>9 length:23011239
>10 length:22876596

372.078M basepairs in 12 sequences

So, it's not huge (only about 1/10th the size of the human genome), but it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. That sequence is sometimes mistakenly referred to as "junk DNA"; while that's not true, it does make whole-genome analyses a pain, as the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome against itself with this command:
/usr/bin/blastall -K 100 -a 8 -d rice.fasta -e 0.001 -i rice.fasta -m 8 -o rice_rice.blast -p blastn

nearly causes my machine to run out of memory (16G), takes a couple of days to run, and results in a blast output of 5.1G and 84 million rows--that's 84 million blast hits with an e-value below 0.001! By definition, that output is dominated by the repetitive elements. Repetitive elements are interesting, but in the case where we want to look at synteny, we have to wade through that 5.1G of stuff to find the very small chunk of data we need. That adds time to run the sequence comparison, time to parse, time to plot, time to analyze, data to store, etc...
The solution we use in our lab is to create a "masked" sequence where we 'X' out any DNA sequence occurring more than a given number of times in the original blast. So, from the output of the blast above, any basepair that is covered more than 50 times is hard-masked with 'X'. The single script to run this is here, available from svn via:
svn checkout

When run as a script with no options, it will print some help text. To mask rice at 50X, it is run as:
python -b rice_rice.blast -f rice.fasta -o rice -c 50 -m X

This hard-masks with "X". To soft-mask (upper-casing all basepairs except those occurring more than 50 times, which are lower-cased), use the commandline option -m SOFT.
The resulting file: "rice.masked.50.fasta" has the same number of chromosomes, each with the same number of basepairs, but with repetitive regions masked. To show the difference, I re-ran the full-genome blast, except this time on the masked genome:
/usr/bin/blastall -K 100 -a 8 -d rice.masked.50.fasta -e 0.001 -i rice.masked.50.fasta -m 8 -o ricemasked.blast -p blastn

The resulting blast file is only 128MB and 2 million rows (contrast with 5.1G and 84 million rows)--a blast file that's 2.4% of the original size. This makes things simpler, and makes doing genomic analyses more enjoyable. This is a different tool than the dust filter that comes with BLAST (or many others), because it masks based on the global coverage at each location (though there are similar tools).



In order to do the full-genome mask, the script stores an array for each chromosome, with length equal to that of the chromosome, in a pytables/hdf5 structure that acts like a numpy array. So, to increment the counts for a blast hit, the code looks like:

cache[qchr][qstart - 1: qstop] += 1
cache[schr][sstart - 1: sstop] += 1

where cache is the hdf5 structure (except in some cases it's read into memory to improve speed), 'qchr' and 'schr' are the query and subject chromosome keys (with the values being the count arrays), and 's/qstart', 's/qstop' are the bounds of the subject and query for a particular blast line. This sends much of the work to numpy, instead of iterating over the range of qstart, qstop in python. That is repeated for every row in the blast file.
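That update loop can be sketched end-to-end with plain in-memory numpy arrays standing in for the hdf5-backed cache (the chromosome lengths and the sample blast line here are made up; the column order is the standard -m8 tabular layout):

```python
import numpy as np

# hypothetical chromosome lengths; the real script sizes these from the fasta
chrom_lengths = {"1": 1000, "2": 800}
cache = {c: np.zeros(n, dtype=np.uint32) for c, n in chrom_lengths.items()}

def add_blast_line(line):
    # -m8 tabular columns: query, subject, pct_id, aln_len, mismatches,
    # gap_opens, qstart, qstop, sstart, sstop, e-value, bit-score
    f = line.split("\t")
    qchr, schr = f[0], f[1]
    qstart, qstop = int(f[6]), int(f[7])
    # subject coords come in reversed for minus-strand hits
    sstart, sstop = sorted((int(f[8]), int(f[9])))
    cache[qchr][qstart - 1: qstop] += 1
    cache[schr][sstart - 1: sstop] += 1

add_blast_line("1\t2\t98.5\t40\t0\t0\t11\t50\t120\t81\t1e-20\t80.1")
```

Each slice-assignment updates an entire hit's span at c speed, which is the point of pushing the work to numpy.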


Once all of the counts are created, the script uses that data to mask out the original sequence for any element in count that is greater than the cutoff. That is achieved in a single line iterated per chromosome:
masked_seq = np.where(numexpr.evaluate("hit_counts > %i" % cutoff),
                      mask_value, seq).tostring()

where `np` is from "import numpy as np", `seq` is the original fasta sequence in upper case, `mask_value` is either a constant, like 'X', or, in the case of soft-masking, a lower-case copy of `seq`, and `hit_counts` is the array of counts. The only tricky bit there is the use of numexpr, which makes things a bit faster. The `masked_seq` is written to file after its corresponding header, and the masking is repeated for the next chromosome. The entire masking takes under 10 minutes on my machine.
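The same idea, minus numexpr and pytables, fits in a few self-contained lines (the counts here are made up, and .tostring() is spelled .tobytes() in current numpy):

```python
import numpy as np

cutoff = 2
seq = "ACGTACGTAC"
# hypothetical per-basepair counts; the real script reads these from the hdf5 file
hit_counts = np.array([0, 1, 5, 5, 5, 1, 0, 0, 3, 0])

seq_arr = np.frombuffer(seq.upper().encode(), dtype="S1")
x_arr = np.frombuffer(b"X" * len(seq), dtype="S1")
low_arr = np.frombuffer(seq.lower().encode(), dtype="S1")

over = hit_counts > cutoff
hard = np.where(over, x_arr, seq_arr).tobytes().decode()    # hard-mask with 'X'
soft = np.where(over, low_arr, seq_arr).tobytes().decode()  # soft-mask: lower-case
```

Here `hard` comes out as "ACXXXCGTXC" and `soft` as "ACgtaCGTaC"--the over-covered basepairs are either blotted out or down-cased, with the sequence length unchanged.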


Generally, we use the masked fasta file as the final output, but there's useful information in the hdf5 file--a sort of measure of repetitiveness at each basepair in the genome. Also in the svn repo is a short matplotlib web script (for mod_wsgi/mod_python, but it can be modified to run as cgi or from the command line) to generate an image given a request like:

where 'org' is the name of the organism sent to the script, `xmin`, `xmax` specify the extents in basepairs, and `seqid` the chromosome.
Here's an example from our genome viewer (built on openlayers) where I've added copy count as a layer (with admittedly poor cartography). The nice gene renderings and basepair location ruler are courtesy of Eric Lyons.
The height of the blue line indicates the copy count. The flat red line is at a copy count of 50. So, the genes in this region are repetitive, but it's possible to see that they are surrounded by long terminal repeats, and that there is some repetitive sequence in the introns (the gray between the big green exons). Scrolling along, openlayers-style, it's possible to see patterns, and spot transposons.


I've made a few genomes available and will be adding more. There's also a (currently undocumented) service to plot the genome. So if the organism is brachy and the chromosome of interest is Bd1 then:

will result in:

where the image is described above.

If you're doing full-genome analyses, give it a try and let me know any suggestions. There's a full example starting from only a genomic fasta, including the BLAST command, in the README.rst included in svn. If you get a genome that's too big to run through a full-genome BLAST (maize, ahem), you can split it with "pyfasta -split -n 4", run 4 blasts, concat the results, and send them through

Friday, October 23, 2009

rst2s5 with syntax highlighting

Restructured Text to S5 Presentation

(lots of caffeine today so 2 posts in one day)
The stub example presentation I'll be talking about is viewable as a presentation here (click on that page to advance the demo slides).

There's a nice browser-based tool for presentations called S5. Recent python docutils include a tool, rst2s5.py, which converts restructured text to an S5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work.
So pygments, a python library that highlights syntax for many programming languages, comes with this file, which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion:

$ svn export

with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing (included in the source) and changing the STYLE global to a theme that pygments knows about (being comfortable with my machismo, I chose "fruity"). One way to find other themes is to check the drop down box on the paste bin at the pygments site.

To use the template, just edit the index.rst, then run and view the resulting index.html in a browser. The S5 "theme" it's using is specified in the script and contained in the ui/ sub-directory; you can find more themes on the s5 site, along with others that come with the s5 install.

Check out the pretty example project here (with python code 3 slides in). As with any S5 slideshow, you can click to advance slides or use the controls that appear in the bottom right when the mouse is in that area.

As a bonus, if the index.rst file contains python shell sessions (doctests) like the example, you can check them with nose using:

$ nosetests --with-doctest --doctest-extension=.rst index.rst

Thursday, October 22, 2009

some python ctypes stuff in Rtree

I've been working with and on the Rtree python module. It's some cool work done by Howard Butler (hobu) (originally with Sean Gillies) to make python bindings for the Spatial Index C++ library by Marios Hadjieleftheriou which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new ctypes wrapper to that API which appears in Rtree 0.5 and greater. There is some cool ctypes stuff in there which I'm starting to understand.
From the website:

ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.

as a simple example of how ctypes works, we can pretend there's no atan() in python's math module and access the one from libm in c like this:

import ctypes
libm = ctypes.CDLL('libm.so.6')

# the following 2 lines correspond to the c signature: double atan(double)
libm.atan.argtypes = [ctypes.c_double]
libm.atan.restype = ctypes.c_double

print libm.atan(0.22)
print libm.atan(ctypes.c_double(0.22))

where the CDLL call tells ctypes how to find the library (have a look at the Rtree or shapely source code to see the cross-platform way to do that). The argtypes and restype lines tell it the input types and return type respectively, and the two print lines call the c function by way of the ctypes wrapper. Here, it's calling the version of atan with a double precision number. With simple types, you can let ctypes wrap a python value in the type, or you can do so explicitly as in the last line.

Things get more interesting with more complicated return types. For a c function with a char * return type, e.g. this contrived example:

// does not need to be freed.
char* fn_char(){
    char *s = "asdf";
    return s;
}

the ctypes invocation -- with 'ccode' being this contrived library as loaded by ctypes.CDLL -- looks like:
ccode.fn_char.restype = ctypes.c_char_p
print ccode.fn_char()

which returns "asdf" as expected and does not leak. (Note that ctypes.c_char_p is "c char pointer" or "char *".) If you get a copy of a char * and are responsible for freeing its memory, e.g. from the c function:

// needs to be freed.
const char * fn_const_char(){
    return (const char *)strdup("asdf");
}

the ctypes for that looks like:

libc = ctypes.CDLL('libc.so.6')
libc.free.argtypes = [ctypes.c_void_p]

def get_and_free(achar_p):
    s = ctypes.string_at(achar_p)
    libc.free(achar_p)
    return s

ccode.fn_const_char.restype = get_and_free
print ccode.fn_const_char()

where libc is the standard c library, which defines the free() function. In this case, it takes advantage of the feature that .restype can be a callable, which takes the pointer returned from the c code. In get_and_free(), ctypes.string_at() turns that pointer address into a python string. Then the char * pointer is free'd, the python string is returned, and "asdf" is printed as expected.
It's also possible to do more rigorous error checking with errcheck in which case the ctypes looks like:

def err_check(char_p, fn, args):
    s = ctypes.string_at(char_p)
    libc.free(char_p)
    return s

ccode.fn_const_char.restype = ctypes.POINTER(ctypes.c_char)
ccode.fn_const_char.errcheck = err_check

where err_check gets 3 arguments: the result of the function, a reference to the function, and the args sent to it (in this case args is empty). Note that in this case, we have to specify the restype as ctypes.POINTER(ctypes.c_char) so that we still have the pointer address--which we then free. When the restype is specified as ctypes.c_char_p (char *), ctypes automatically gives us the python string, and we can't (as far as I know) free the memory, so a leak occurs. Also, in the case above, I haven't actually added any extra error checking; in Rtree, hobu has a few functions to check the return values--see that code here.

This post has been pretty basic, there's also good ctypes code in shapely, geodjango, and libLAS. Next post I'll talk about callback functions -- calling a C function that expects a pointer-to-a-function with a python function as an argument.

Wednesday, October 14, 2009


I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around.
Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio. As of yesterday, I've moved a lot of stuff from there to bitbucket. "Biostuff" is where I'll put modules that are well documented and tested, in hopes that using a distributed VCS, and a project name that doesn't contain my initials, will encourage contributions. Currently, all the modules on bitbucket are also on pypi.

pyfasta provides pythonic access to fasta sequence files. Previously, it was part of genedex (which I've stopped supporting, since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete), but it's been pulled out, simplified, and improved. Check out the docs on pypi.

nwalign is a command-line or python interface to the Needleman-Wunsch global sequence alignment which I've blogged about previously. Whenever I need to do stuff with cython and numpy I use nwalign.pyx for reference (though there's probably better material out there).

simpletable, as the name suggests, is a wrapper around pytables to remove some of the boilerplate in creating a table, dataset, and Description. That's it.

skidmarks is a small module to check for runs in data (get it, skidmarks, runs?). It implements Wald-Wolfowitz, autocorrelation, serial, and gap tests. Each function (implementing one of those tests) returns a p-value, which indicates the level of support for rejecting the null hypothesis that the sequence is random, along with the chi-square or z-score value as appropriate. I've been using this with monte-carlo simulations to see if runs in genomic data could be explained by random events.
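As a flavor of what one of those tests does, here's a minimal Wald-Wolfowitz runs test--not skidmarks' actual code, just the textbook z-score on the number of runs in a two-symbol sequence:

```python
import math
from itertools import groupby

def runs_test(seq):
    """Wald-Wolfowitz runs test on a two-symbol sequence: returns (z, p)."""
    assert len(set(seq)) == 2, "expects exactly two symbols"
    n1 = sum(1 for x in seq if x == seq[0])
    n2 = len(seq) - n1
    runs = sum(1 for _ in groupby(seq))  # count of maximal same-symbol runs
    # expected number of runs and its variance under the null of randomness
    mu = 2.0 * n1 * n2 / (n1 + n2) + 1
    var = (mu - 1) * (mu - 2) / (n1 + n2 - 1)
    z = (runs - mu) / math.sqrt(var)
    # two-sided normal p-value via the complementary error function
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p
```

A perfectly alternating sequence like "ABABABABAB" has too many runs (z well above 2), while "AAAAABBBBB" has too few (z well below -2); both get small p-values, i.e. both look non-random.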

Any contributions, suggestions, or bug reports are welcomed--the interface at bitbucket should make this easier to do, just fork and fix and pull-request.
Meanwhile, my less documented/ more crappy code will continue to live on google code--at least until it matures. I've got a couple modules in the pipeline that will be added once they're cleaned up and documented.

Sunday, July 26, 2009

starting haxe. (stuff i want to remember)

I've been tinkering with a flash project recently. Actually haxe, so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember.
To start, here's a gist of shell commands that will set up haxe on an ubuntu machine. The installers from the haxe website work fine for windows and mac (and I think 32 bit linux).

Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. These docs are very good, and better than the adobe site; I keep that page open whenever I'm working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting, and added this line to my .vimrc:
autocmd BufRead *.hx set filetype=actionscript

Then compilation and code is simply a matter of following this.

If you only look at haxe/flash occasionally, some stuff can be less than obvious. I've started a github project as a testing ground where I can record the stuff I figure out.
Currently it has examples of:

+ call javascript from flash (ExternalInterface.call())
+ call flash from javascript (ExternalInterface.addCallBack())
+ add an image from a (local) url
+ keyboard events
+ interact with bitmap data of image.
+ style a text field.

all in a single .hx file. Most of those are fairly simple, they just require the correct incantation. There's a version of this running on my work machine here to demo this stuff. The possible interactions are mostly listed as instructions in either the HTML or the flash movie in that page.
To build the flash(9) movie, just type 'haxe build.hxml' in the learnflash/ directory. Commenting out the '--no-traces' line in build.hxml will cause any trace() call to be sent to firebug--this is extremely useful for debugging.

I'll probably add more as I stumble upon it.

Saturday, May 23, 2009

displaying and serving big(ish) data with numpy and memmap

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast, and it will work for anything that can fit in memory (and larger, with a bit of extra work). So the idea is to create a simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script.

A web request containing width=XXX in the url indicates the user wants an image, so a url like:
will give an image:

(geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into
an openlayers slippy map.) Each horrible color represents a different tissue; values extending up from the middle represent the top (+) or Watson (W) strand, while those on the bottom represent the - or Crick strand. The heights are the levels of expression.
Without the width=XXX param, the service returns JSON of values for the + and - strands for each basepair and tissue in the original file--which is less useful than having a summary of the mean value for each tissue over the range:

{"+": [0.008325556293129921, 0.0072462377138435841, 0.0074741491116583347, 0.0017609233036637306, 0.010895536281168461], "-": [0.0088664013892412186, 0.017081102356314659, 0.009236418642103672, 0.0043859495781362057, 0.021286465227603912]}

which are the mean values from basepair 491520 to 499712 for the + and - strands and each of the 5 tissues. A single tissue can be requested by number as:
which gives:

{"+": 0.0074741492195734898, "-": 0.0092364182547917447}

Likewise, one can request an image with only a single tissue:

Transcriptome data often has enough rows that it's possible, but not really convenient, to stick it into an RDBMS. The data I'm using is available here and looks like:

26 C 125.05 106.18 119.80 159.78 101.40
26 W 141.90 139.07 137.50 151.91 171.18
51 C 131.85 108.57 71.00 156.41 133.79
51 W 123.83 129.97 122.50 100.23 139.09
76 C 136.84 104.98 120.30 88.88 151.64
76 W 119.35 112.79 160.00 119.51 130.29
126 C 138.93 100.92 111.00 84.82 103.48
126 W 87.82 118.30 110.30 157.17 126.42
151 C 184.50 71.24 157.80 396.25 86.42
151 W 136.76 119.95 107.00 98.10 108.60
176 C 135.75 86.98 115.80 188.41 93.53
176 W 147.57 158.19 131.00 117.45 164.77
[... millions more ...]

where the first column is basepair position, and the second is either 'W' or 'C' for the Watson or Crick strand of the (double-stranded) DNA sequence. Columns 2 and on are measurements of transcription (see the wikipedia article) for various tissues in the plant Arabidopsis thaliana, which we use because it has a small genome, is well annotated, and doesn't have too much repetitive DNA or too many transposons.

The data is somewhat irregular in that not every 'C' row has a 'W' counterpart (and sometimes vice-versa); for those cases, I create a row to fill the missing row with the same values as the existing row. The code below runs through and creates numpy .npy files of the data:
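The preprocessing code didn't survive into this copy of the post, but based on that description, a sketch of it would look like the following (the function name, dtypes, and exact file layout are my guesses; only the fill-the-missing-strand logic is from the text):

```python
import numpy as np
from collections import defaultdict

def build_arrays(rows):
    """rows: (basepair, strand, *values) tuples. Returns (positions, data)
    where data has one row per basepair: W values then C values side by side."""
    by_pos = defaultdict(dict)
    for pos, strand, *vals in rows:
        by_pos[pos][strand] = [float(v) for v in vals]
    posns, data = [], []
    for pos in sorted(by_pos):
        d = by_pos[pos]
        # a missing 'W' or 'C' row is filled with a copy of its counterpart
        w = d.get("W", d.get("C"))
        c = d.get("C", d.get("W"))
        posns.append(pos)
        data.append(w + c)
    return np.array(posns, dtype=np.uint32), np.array(data, dtype=np.float32)

rows = [(26, "C", 125.05), (26, "W", 141.90), (51, "C", 131.85)]
posns, data = build_arrays(rows)
# then, e.g.: np.save('at.tome.posn.1.npy', posns); np.save('at.tome.1.npy', data)
```

The position 51 row above has no 'W' counterpart, so its Crick values get copied into the Watson slot, keeping the arrays rectangular.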

It saves the positions and the actual transcriptome data into separate files because for the web service, the transcriptome files will be memory-mapped while the smaller position files will be read into memory. This code is run at startup (and so not done for every request):

from numpy.lib import format
for ichr in range(1, 6):  # the 5 arabidopsis (at) chromosomes
    schr = str(ichr)
    # memmap these
    seqids[schr] = format.open_memmap(os.path.join(path, '' % (schr,)), mode='r')
    # read these into memory.
    posns[schr] = np.load(os.path.join(path, 'at.tome.posn.%s.npy' % schr))

So the seqids dict has values that are the memmapped contents of the expression data in the .npy files. For each web request, it then uses numpy's searchsorted to do a binary search: when a user requests &xmin=1234&xmax=4567, searchsorted() finds the position in the array where 1234 would fall (exactly like python's bisect module), and likewise for 4567. The pair of array indexes from the searchsorted() calls with xmin and xmax give the indexes into the tome array from which to grab the data:

minidx = posns[seqid].searchsorted(xmin)
maxidx = posns[seqid].searchsorted(xmax)

data = seqids[seqid][minidx: maxidx]
data_idx = posns[seqid][minidx: maxidx]

where data contains the values of the transcriptome data and data_idx contains the associated basepair positions. I could have saved the basepair positions in the seqids/data numpy array, but since searchsorted() will likely traverse many chunks of the array, I think it's best to have that part in memory rather than memory-mapped.
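Putting the searchsorted slicing together with the JSON summary shown earlier, the summary branch is roughly this (a sketch with small made-up arrays in place of the memmaps; the real data has 5 tissues per strand):

```python
import json
import numpy as np

# stand-ins for the position array and the '+' strand expression columns
posns = np.array([26, 51, 76, 126])
plus = np.array([[141.9, 139.0],
                 [123.8, 129.9],
                 [119.3, 112.7],
                 [87.8, 118.3]])

xmin, xmax = 30, 130
minidx = posns.searchsorted(xmin)  # index of the first position >= 30
maxidx = posns.searchsorted(xmax)  # index of the first position >= 130
# mean per tissue over the requested range, serialized as in the response above
summary = json.dumps({"+": plus[minidx:maxidx].mean(axis=0).tolist()})
```

Here minidx is 1 and maxidx is 4, so the means are taken over the rows at positions 51, 76, and 126; the real script does the same for the '-' strand columns.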
The rest of the code is a bunch of if statements (which I should really spread across multiple functions...) deciding whether to show an image with matplotlib or grab a summary of the data and return it via simplejson. The gist of the entire wsgi script is at the end of this post.
An example of what this looks like in an openlayers map with gene annotations is in the image below.

where one can see some correlation between where the (green) genes are and the higher transcriptome levels. In summary, it's nice to be able to take data in this format and write a 120-line script which provides full, fast access both to the raw data and to images we can use to find patterns to explore.

Tuesday, April 21, 2009

Needleman-Wunsch global sequence alignment

[EDIT 01-01-2010] this is now available at bitbucket.
I've written a simple, fast python version of Needleman-Wunsch, as I couldn't find one to use. It uses cython, and specifically the cython-numpy goodness. It's easy_install-able as:
 sudo easy_install -UZ

or via svn from:
svn co

it will put an executable 'nwalign' into /usr/bin/ which when run will give this message:

nwalign [options] seq1 seq2

-h, --help show this help message and exit
--gap=GAP gap extend penalty (must be integer <= 0)
--gap_init=GAP_INIT gap start penalty (must be integer <= 0)
--match=MATCH match score (must be integer > 0)
--mismatch=MISMATCH gap penalty (must be integer < 0)
--matrix=MATRIX scoring matrix in ncbi/data/ format;
if not specified, match/mismatch are used

where the matrix is optional but can be the full path specifying the transition score from row to column as here.
If the matrix is specified, match and mismatch are ignored; otherwise they are used. A typical run looks like:

$ nwalign alphabet alpet

$ nwalign --matrix /usr/share/ncbi/data/BLOSUM62 EEAEE EEEEG
And usage from a python script can be seen in

I wrote this for a colleague who was using a perl script. This is over 100 times faster than the perl script for long sequences--and the perl script had the BLOSUM62 matrix hard-coded as a hash. There are a couple of places that could still be sped up, but I think the improvement would be relatively small. This kind of script is perfect for the numpy-cython mix, as I can access the contents of the 2d array at c speed without having to do pointer arithmetic. If there are any obvious optimizations I missed, I'd be glad to know them.
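For reference, the algorithm nwalign implements is a short dynamic program; here's a plain-python sketch with a single gap penalty (nwalign itself does this in cython, with separate gap_init/gap_extend and scoring-matrix support):

```python
def nw(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of a and b; returns the aligned pair."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # traceback from the bottom-right corner
    out_a, out_b, i, j = [], [], n, m
    while i or j:
        if i and j and score[i][j] == score[i-1][j-1] + \
                (match if a[i-1] == b[j-1] else mismatch):
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

The python loop over the score table is exactly what gets unbearably slow for long sequences, and is the part the cython-numpy version runs at c speed.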

Thursday, April 16, 2009

python object initialization speed

On the Cython mailing list, I saw this mentioned for avoiding __init__ overhead, so I wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully).
I ran a benchmark with 6 cases:

  1. PY_NEW macro (still has python overhead for each call to the creator function)

  2. regular python init

  3. python init using __slots__

  4. cython init (cdef'ed class)

  5. batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead

  6. batch init on cython class

the timings look like this:

PY_NEW on Cython class: 1.160
__init__ on Python class: 30.414
__init__ on Python class with slots: 10.242
__init__ on Cython class 1.185
batch PY_NEW total: 0.855 , interval only: 0.383
batch __init__ on Cython class total 0.998 , interval_only: 0.540

So, PY_NEW takes .383 seconds compared to .540 for using __init__ on a Cython class, but both are much faster than python. I was surprised that using __slots__ gives a 3x speed improvement over a regular python class. That Cython is faster is no surprise.
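The python-only part of that comparison is easy to reproduce with timeit (a sketch, in python 3 syntax; the class names are made up and the exact ratio varies by interpreter and machine):

```python
import timeit

class Plain(object):
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

class Slotted(object):
    # __slots__ skips the per-instance dict, saving allocation time and memory
    __slots__ = ("start", "stop")
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

t_plain = timeit.timeit(lambda: Plain(1, 2), number=200000)
t_slots = timeit.timeit(lambda: Slotted(1, 2), number=200000)
print("plain: %.3f  slots: %.3f" % (t_plain, t_slots))
```

The slotted instances also have no __dict__ at all, which is where both the speed and the memory savings come from.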
Stefan Behnel explains better than I could.

All the code is smashed uncomfortably into this gist.

for kicks, I tried with unladen-swallow. It comes out almost 2x faster than CPython both with and without __slots__. I didn't use the optimization stuff. Cython even works with unladen-swallow--you just have to rebuild the .so--and the timings are the same as Cython-with-CPython.

Sunday, April 12, 2009

apache mpm-worker with php on low memory servers

I'm partly writing this because I think such valuable info is too hard to find. I'd read in a lot of places that if you want php, you have to use mpm-prefork--but it's not true!

I've been using slicehost for a dev server for almost a year now. Since my 1-year deal is up, I decided to switch to their affiliate, mosso. I really like slicehost, but Mosso "cloud servers" seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now I can keep it as a 256MB instance at about $10/month and resize to a larger instance when doing real dev.
I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's here.
The machine I'm on is extremely fast; normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server--that took 12 minutes, but my instance was accessible for at least 10 of those.
After that, I logged right back in and all works fine. On this dev server, I have to run PHP (at least I don't have to code in it). I've been getting tired of having to use apache's mpm-prefork because ubuntu won't let me have php and mpm-worker together. (For more from someone who actually knows what they're talking about: Graham Dumpleton, the author of mod_wsgi, has a good write-up here.) So every apache process takes up a good chunk of the available memory and things are sloooow. Even just running trac, it starts swapping.
I found a good post on using fcgi for php, followed it blindly, and it all works!
Using mpm-worker, Trac (via mod_wsgi) is so fast I double-checked to make sure my instance was still at 256MB. After setting ThreadStackSize to 1.5MB, trac takes up only 16% of the available memory. The relevant bits of my apache config look like:

<IfModule mpm_worker_module>
    StartServers 2
    MaxClients 40
    MinSpareThreads 5
    MaxSpareThreads 20
    ThreadsPerChild 20
    MaxRequestsPerChild 5000

    MaxKeepAliveRequests 1000
    Timeout 30

    # i keep this fairly high for serving image tiles
    # from mapserver
    KeepAliveTimeout 40

    # default stack is 8MB per thread; only use 1.5MB
    ThreadStackSize 1500000
</IfModule>

AddHandler fcgid-script .php
FCGIWrapper /usr/lib/cgi-bin/php5 .php

and then in 000-default, I added +ExecCGI to the DocumentRoot Directory Options.
If any of that config is not sane, let me know.

I should say that I have no idea how PHP will do under fast-cgi--whether it uses more memory or not, or whether it will have any problems--but on a quick look, everything seems ok.

Apparently, you can also use mpm-worker with php if you build php from source. Since using fcgi worked so easily, I didn't try that.

Friday, February 20, 2009


This is a quick project I did a while back, but I've seen people interested in similar ideas, so I'll post my implementation here.

Geohash encodes latitude/longitude locations into a string such that "nearby places will often present similar prefixes", and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval.

The point is to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and a loads() function, after the pickle interface. The dumps function takes start and end args and returns a 1/0-encoded string. The loads function takes a 1/0-encoded string and returns the tightest interval it can, given that string. Both functions take a rng kwarg, which can be as small as the maximum end value. If all the intervals are small and the rng is very large, the biohash will not help much. The rng used to load must be the same as the one used to dump, or the values won't be correct.
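A sketch of that dumps/loads pair, following the description above (this is not the repo's code, and the default rng and maxbits cap are arbitrary choices of mine):

```python
def dumps(start, end, rng=(0.0, 2 ** 32), maxbits=64):
    """Encode [start, end] as a string of 0s and 1s by halving rng until
    neither half contains the whole interval."""
    lo, hi = rng
    bits = []
    while len(bits) < maxbits:
        mid = (lo + hi) / 2.0
        if end <= mid:       # interval fits entirely in the left half
            bits.append("0")
            hi = mid
        elif start >= mid:   # interval fits entirely in the right half
            bits.append("1")
            lo = mid
        else:                # straddles the midpoint: can't refine further
            break
    return "".join(bits)

def loads(bithash, rng=(0.0, 2 ** 32)):
    """Return the tightest (lo, hi) interval the hash pins down."""
    lo, hi = rng
    for b in bithash:
        mid = (lo + hi) / 2.0
        if b == "0":
            hi = mid
        else:
            lo = mid
    return lo, hi
```

Because the encoding is prefix-based, an interval nested inside another gets a hash that starts with the outer interval's hash--which is exactly what makes BTree prefix queries work.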

I had planned a set of SQLAlchemy models for this that would save the hash and use it to do range queries behind the scenes, but I haven't finished that yet. The code is in my google SVN. It will even pull in a fast cython version of the encoder.
If anyone wants to use it, improve it, or get it going with SQLAlchemy, it's available from SVN with:
svn checkout
it has tests in

Friday, January 09, 2009

Stupid things I did as a Bioinformatics Programmer in 2008

In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it.
My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in unstructured text files or python pickles.

I'd create one at the beginning of the pipeline and not notice it was messed up, in a way that affected other parts, until the entire pipeline was done days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file or load some huge pickle into memory just to get at it. Then I'd need some annotation, and I'd have to add a slow lookup step to a script that'd otherwise run very quickly.

I was passing around data in arrays and tuples, so when I changed the order or added another element in the script that created the tuple, a downstream script using the tuple would be using the wrong index to access data. If I was lucky, my code would fail; if not, it'd be using the strand when what it should have had was the start location.

I hit problems where I'd run out of memory. At one point, I ran out of disk space (it's a big series of datasets), and I hit bugs in software I was using.

When I should have run just 1 chromosome to test the pipeline in 1/100th the time (comparing 10 chromosomes to 10 chromosomes), I ran it over the entire genome.

When I should have taken the time to really fix small mistakes as I found them, I instead worked around them, making the code unnecessarily complex as a result. If I had fixed instead of fudged in those cases, I would have been more productive.

I did write tests, but not enough, and I didn't set up the project in a way that it was really testable. I'm still learning how to do that. All the other stuff may just be discipline, but the testing is very difficult for me in bioinformatics. I've extracted what I could into tested libraries and added checks for the intermediate data, and every time I found a dumb error, I'd add an assert. (there's a discussion of pretty much exactly the problems I'm describing here: )

I'd assume that it was ok to use a tuple, because I was only going to store start and stop, but then later I'd need to add chromosome, then strand, then score, and pretty soon I'd have code elsewhere like:
if cns[2][4] < cns[2][5]:

where I have no idea what I'm comparing. That one's simple to fix--just use objects, or dicts at the very least--but a 2-tuple is so tempting, and a 3-tuple not so bad, and ...
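The cheap middle ground I eventually reached for is collections.namedtuple, which keeps the tuple's convenience but names the fields (the CNS fields here are just for illustration):

```python
from collections import namedtuple

# a hypothetical CNS record; in the real pipeline the fields accreted over time
CNS = namedtuple("CNS", "chromosome start stop strand score")

cns = CNS(chromosome="3", start=1500, stop=1750, strand="+", score=98.2)

# instead of the mystery of cns[2][4] < cns[2][5], the intent is explicit:
assert cns.start < cns.stop
# and it still unpacks and indexes like a plain tuple where old code needs it
assert cns[1] == 1500
```

Misordered or missing fields now fail loudly at construction instead of silently producing a comparison of the wrong columns.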

On top of all this, the project was changing as I was working on it, so I was changing what went into the start of the pipeline, what we were getting out, and the steps in between. And because of all the extra little hacks in there, I would be stuck in some function far away from the data that I needed (*).

It sucked. I can make those sorts of mistakes on a project that runs in 5 seconds and has a very simple, relatively known output--but not on complex pipelines.

It's been painful watching myself do such stupid stuff, and then reading the code afterward, but I think I've learned a lot. Much of that code still sucks, but I've moved to more rigid data structures. Every time I make a change now, I do it for real; it's not just hacked in. I have to deal with someone pacing around, waiting for the results and asking me questions like "why is it so hard to just add in the RNA?" or whatever. Also, statistically speaking, I'm probably running out of mistakes I can make...

And actually, the most difficult things have been inter-personal relationships, but that's for another post--as are the more positive, awesome things I did with projects I set up correctly, and had good test coverage...
And finally, I did make a lot of mistakes, but this has been genuinely a difficult problem.

* see this post, especially the last 4 paragraphs. It's good to hear that someone with 43 years of programming experience has the same problems as I do.