Bioinformatics

Posts

reading code

March 25, 2008

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric. So, I downloaded Ulysses . wc shows it to have 267235 words. Some googling says the average person can read 250 words per - minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough. So on the rice fasta file i ran: grep -v ">" rice.fasta | wc -c to get rid of the 12 header lines (1 per chromosome) and only count sequence (should be within 12 characters counting the extra new-lines). That gives 372,077,765 characters. The average word-size in ulysses is 5. I rounded up to 6. So, the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words So, at 250 words per minute, it'd take: 62012960 / 250 / 60 = 4,134 hours to read the rice genome . That's 172 days. Also, from what I know, the plot is hard to follow. Genome siz...

point partitioning

March 24, 2008

After you spend all day banging your head against your own problems, sometimes it's just nice to bang it on something else for a bit. This question came through on the postgis mailing list and it seemed like a good diversion. I think it's a very clear description of the problem. To quote: I have around 300,000 points, each with a lat, lon and altitude (also converted to geometry). I need to get a subset of those points, where none of them are within 5m (or some other arbitrary distance) of each other. It doesnt matter which points get picked over another, as long as whatever the data set it creates, that none of the points are within that 5m radius and that relatively most of the points are used So, I hacked up a quick solution. It's probably inefficient -- deleting keys from a dict, and removing entries from an rtree index. But, it's easy to understand, and (without the plotting) it runs in about 2 minutes for the requested 300000 points. When plotting, the image lo...

OGR python projection

March 18, 2008

OGR Projection If you're using shapely and you need to do projections, you'll either have a lot of boilerplate or a function like this one. Actually, even in OGR, there's a lot of bioler plate involved in transforming.... from osgeo import ogr from shapely.wkb import loads def project(geom, to_epsg=900913, from_epsg=4326): """utility function to do quick projection with ogr, to and from shapely objects >>> from shapely.geometry import LineString >>> l = LineString([[-121, 43], [-122, 42]]) >>> lp = project(l, from_epsg=4326, to_epsg=26910) >>> lp.wkt 'LINESTRING (663019.0700828594854102 4762755.6415722491219640, 582818.0692490270594135 4650259.8474613213911653)' """ to_srs = ogr.osr.SpatialReference() to_srs.ImportFromEPSG(to_epsg) from_srs = ogr.osr.SpatialReference() from_srs.ImportFromEPSG(from_epsg) ogr_geom = ogr.CreateGeometryFromWkb(geom.wk...

spatially explicit metapopulation models in scipy.

March 16, 2008

Making Pretty Pictures I started to learn to program about 5 years ago running population ecology models in mathematica . Yesterday, I found an old mma notebook with a model modified to include differential parasitoid dispersal to adjacent host cells depending on the host density in those cells. It's bascially a discrete-time Nicholson-Bailey model . But in a grid of cells, where each cell contains a population of hosts (H) and parasitoids (P) that give birth, die, eat, and get eaten according to the NB model. Each generation, following birth/reproduction/predation, the hosts and parasitoids disperse. The hosts disperse equally to the 8 surrounding cells in their neigbhorhood. The parasitoids can move irrespective of host densities when the aggregation parameter (eta) is 0. When aggregation is 1 (eta == 1), the parasitoids move to adjacent cells in exact proportion to host densities in each of the surrounding cells. muH and muP (i'm too lazy to figure out how to write the symb...

rtree: know your nearest neigbhors

March 11, 2008

My computer spends a lot of time looking for neighbors of a given location-- even more so for bio, than for geo. This is what I've learned about the options for doing smarter search so far. SELECT * from ( (SELECT * FROM ((SELECT * FROM feature WHERE start = ? ORDER BY start LIMIT 1)) as u) UNION (SELECT * FROM ((SELECT * FROM feature where stop = ? ORDER BY stop LIMIT 1)) as v) ) as w ORDER BY ABS((start + stop)/2 - ?) LIMIT 1 if you fill in ? with an integer location, that query will return the closest feature most of the time. It's verbose if not ugly, and that's only for 1 dimension. It can return the wrong feature in certain cases.... You have to write it like that in MySQL, because it doesnt support functional indexes , so as soon as you do something like: ORDER BY ABS((start + stop)/2 - ?) it's no longer an indexed search It's a hard problem, even if you're using postgis . And even if you're a postGIS badass . Other than postGIS, there postgres...

What a Shapely genome you have!

March 04, 2008

This might be a case of if you have a really cool hammer , everything looks like a nail, but it was fun mixing tools from different disciplines. After finding synteny , there's a bunch of paired genes whose neighbors are also pairs. Paired ( homologous ) genes have similar sequence because they have some function and can't change without loss of function. Non-gene sequence between the paired genes is mostly randomized via mutation, deletion, etc. But, there is non-gene sequence that is conserved between the genes. These CNS's-- conserved non-coding sequences--are usually sites that bind stuff that regulates the expression of a gene. That looks like this. With one gene on the top, and its pair below, both yellow. Pink lines in the foreground connect putative CNSs (similar sequences) between these genes. That the lines cross is bad. CNSs occur right at the level of noise. So even though a similar sequence occurs near both genes, it could be by chance. It is possible to red...

flash, vi, fcsh

February 28, 2008

All my flash tinkering has been in VIM-- no IDE, no XML, just actionscript. It's a little tough to deal with the adobe compiler as it takes about 11 seconds to compile a large project like modestmaps on my machine. That's not good for a guess-and-check programmer. The typing does catch some errors. The worldkit project compiles instantaneously with mtasc (the predecessor to haxe )--likewise for the as2 branch of modestmaps. The flash compiler shell drops the compile time for as3 modestmaps to under 3 seconds, so I've added this to my .bash_aliases: alias fcsh="/usr/bin/rlwrap /opt/src/flex2/bin/fcsh" the rlwrap is to use readline in the flash shell--meaning I can just press up-arrow to get the previous compile command. By default, one has to paste or type the entire command again. With that, it's close to a reasonable workflow.

synteny mapping

February 27, 2008

Living in Synteny I've been working on automating synteny mapping between any pairs of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse , there's still a lot of synteny. Plants can undergo polyploidy , following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms. I've written my own algorithm to find synteny which uses python sets, and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works but it'...

open source gis and flash maps part two

February 26, 2008

Mash Flap I started looking into flash mapping stuff lately. For the patch I submitted to worldkit, Mikel gave me commit access! So, now the svn version of worldkit can be compiled with mtasc by typing "make". I feel unreasonably proud of that, given that much of what I did was some global search and replace stuff in VI, and then read and fixed mtasc compiler errors until they went away. It was good fun. Michal Migurski saw in that same post that I mentioned modestmaps and gave me some good ideas on getting WMS going. I just figured out how to get that working and posted a message to their overly web2.0 forums. Hopefully someone with some real actionscript skillz will clean it up. A mapping library without a good WMS interface is much less useful for most of the stuff I do. I haven't decided whether to use modestmaps or worldkit, or both. The time stuff the Mikel has done in worldkit is very cool and I haven't really looked at that yet. But I have a tim...

fast python with shedskin

February 24, 2008

There's a new release of the shedskin compiler. It is able to generate fast shared libraries that can be run from CPython. It can also create binaries so I thought I'd see how it did on some code from this BMC bioinformatcs article compared to psyco and CPython. I took this iterative, brute-force ( Needleman-Wunsch ?) alignment code and modified it slightly. That's pasted here . (Notice the first line! that's how it appears in the original code). The modifications allow shedskin to infer the function and variable types. Plus, there's a couple changes I made that improve the run-time for all cases. The max() function is also in the original, but unnecessary because of python's builtin max(), however, pysco does run much faster using their hand-coded max(). For the shedskin run, I removed that extra code and used shedskin's builtin 'cause it made me feel better. The python code was run as $ time python -c "import alignment; alignment.imain()"...

Python Mapscript Tricks

February 23, 2008

I mentioned previously a site which uses google maps with a WMS. The problem with using points (or labels) in a tiled application such as google maps, or OpenLayers is that each tile can only draw its own contents. So if you draw a point with a radius of 10 pixels whose center is 2 px away from the edge of the tile, then 8px of the entire 20px will be chopped. By default, those lost 8 pixels will not be drawn in the adjacent tile because the center of the point does not fall in that tile. so that gives something that looks like this for 2 adjacent tiles: It's hard to even tell that those tiles belong together! Using some python mapscript and PIL , it's pretty simple to make those look like this( except I dont know how to tell blogger not to add the spacin...) : That's actually the same WMS request(s), just changing the call from a simple WMS CGI script that calls mapserver to a WSGI script that uses python mapscript and PIL. The script: 1. takes the current bounding bo...

Open Source GIS

February 21, 2008

Just saw this linked from Sean's post . The quotes in there are absurd. But reminded me there's an important point in favor of OSS that I haven't seen. You can still get help directly via the email list or IRC from F Warmerdam , it's primary author. Likewise for the developers of Mapserver and OpenLayers and PostGIS . I wonder if the lead developers of ESRI products spend their off-time perusing the forum's or mailing lists and answering questions? ( I don't know, they may. But I suspect not. ) There's something to be said for enjoying what you do. And I think that's very true in the case of those in the open-source community. Happy coders make better software. Any programmer that denies that will leave me flabbergasted. I just don't understand how I could be effective only clicking the menus that were provided to me if I chose a black-box solution. But, call me crazy, I like linux. Also, in these parts, if you lock your single-speed bike to a woo...

Flash-y Map-plication vs. 500 marker limit

February 12, 2008

I've been pushing capabilities of javascript mapping frameworks. There's a real limit on the number of markers a browser can display without getting bogged down. 500 is a good limit, and that's probably too high for internet explorer. You can put as many markers as you like via WMS tiles. You can even take map clicks, send them back as AJAX queries and open an info window. I recently helped my friend do this, and we have a pretty snappy map displaying 5,000+ clickable "markers" no problem. It's actually not markers, just tiles, but it works quite well: That's a snapshot of the google maps application, with an info window that appears when a marker is clicked. It sends a GetFeatureInfo request back to a mapserver WMS. Vector drawing in the browser is limited. The amazing efforts of the OpenLayers , featureserver projects make this difficult to assert, as they abstract away all of the browser incompatibilities and give a nice platform to do real vector edit...

python bioinformatics

February 10, 2008

There's a new article out in BMC Bioinformatics with a comparison of the speed and length of programs from various languages. This article was sent to the biology in python ( BIP ) mailing list. Looking at the code, it's not that bad, but it is clear that the authors are not pythonistas, and that the reviewers have done a great job on the actual paper, but likely there was no thorough review of the code. The authors define their own max() function that needlessly overrides python's built-in, and they use the code: line.rstrip('/n') that indicates there was not a thorough understanding of python, or a complete code review. Even a non-python programmer should have seen the intent was to strip a newline '\n', but the operation is not inplace, so the desired behavior could be achieved by: line = line.rstrip('\n') Syntactical mistakes aside, python was given a poor review on speed. Andrew Dalke, of wide-finder (and general python-bio) fame ran the alig...

parallel blasts using python's pp module

February 02, 2008

BLAST handles utilizes multiple cores for some scenarios using the -a flag. however, i often do full genome blasts -- blasting all chromosomes of one organism against all others. For the case of rice (O.s.) against itself, this is 124 jobs. there are simple tools in python to run a queued blast in which on my 8 core machine, each core will run 1 of those blasts, as a job finishes, the pp or parallel python module starts the next job, based on the number of cpus it has detected for your machine. the syntax for the script is: python pblast.py rice_rice_10kmers where rice_rice_10kmers is the section in a config/.ini file to get the parameters. My fasta directory looks like: $ ls /tmp/rice/fasta/ ricetenkmers_chr01.fasta ricetenkmers_chr04.fasta ricetenkmers_chr07.fasta ricetenkmers_chr10.fasta ricetenkmers.order ricetenkmers_chr02.fasta ricetenkmers_chr05.fasta ricetenkmers_chr08.fasta ricetenkmers_chr11.fasta ricetenkmers_chr03.fasta ricetenkmers_chr06.fasta ricetenkmers_chr...

tinycc

November 26, 2007

i've been _trying_ to learn C. tinycc, beside being tiny, it compiles very quickly, allowing you to do cool things like script in C #!/usr/bin/tcc -run #include int main(int argc, char *argv[]) { printf("Hello World %s, %s", argv[0], argv[1]); return 0; } and then run as ./file.c arg_1 which makes it easier for those c-fu to guess and check. it also allows such nice things as c in python which is like pyinline, but uses ctypes and doesn't need write access.

Sorting by proximity to a date in PostgreSQL

October 11, 2007

postgreSQL has great support for dates, => SELECT '2007-08-23'::date - '2006-09-14'::date as days; days ------ 343 given a date column and a date, to find the nearest date, you can "extract the epoch", here, i used ABS as i just want the nearest date, before or after. SELECT *, ABS(EXTRACT(EPOCH FROM(date - '2006-08-23'))::BIGINT) as date_order FROM record WHERE well_id = 1234 ORDER BY date_order limit 1 i suppose this could make a nice PL/PGSQL function...

k-means clustering in scipy

September 29, 2007

it's fairly simple to do clustering of points with similar z-values in scipy: import numpy import matplotlib matplotlib.use('Agg') from scipy.cluster.vq import * import pylab pylab.close() # generate some random xy points and # give them some striation so there will be "real" groups. xy = numpy.random.rand(30,2) xy[3:8,1] -= .9 xy[22:28,1] += .9 # make some z vlues z = numpy.sin(xy[:,1]-0.2*xy[:,1]) # whiten them z = whiten(z) # let scipy do its magic (k==3 groups) res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3) # convert groups to rbg 3-tuples. colors = ([([0,0,0],[1,0,0],[0,0,1])[i] for i in idx]) # show sizes and colors. each color belongs in diff cluster. pylab.scatter(xy[:,0],xy[:,1],s=20*z+9, c=colors) pylab.savefig('/var/www/tmp/clust.png')

using python mapscript to create a shapefile and dbf

July 17, 2007

i always have trouble remembering how to use mapscript. it's pretty simple, but the docs are hard to find and the test cases (though excellent!) have a lot of abstraction. heres some code that creates a shapefile and dbf (using another module). and does a quick projection at the start. import mapscript as M import random from dbfpy import dbf ######################################### # do some projection ######################################### p = 'POINT(466666 466000)' shape = M.shapeObj.fromWKT(p) projInObj = M.projectionObj("init=epsg:32619") projOutObj = M.projectionObj("init=epsg:4326") shape.project(projInObj, projOutObj) print shape.toWKT() ######################################### # create a shapefile from scractch ######################################### ms_dbf = dbf.Dbf("/tmp/t.dbf", new=True) ms_dbf.addField(('some_field', "C", 10)) ms_shapefile = M.shapefileObj('/tmp/t.shp', M.MS_SHAPEFILE_POLYGON) for ...

Note to self: using python logging module

June 27, 2007

import logging logging.basicConfig(level=logging.DEBUG ,format='%(asctime)s [[%(levelname)s]] %(message)s' ,datefmt='%d %b %y %H:%M' ,filename='/tmp/app.log' ,filemode='a') logging.debug('A debug message') logging.info('Some information') logging.warning('A shot across the bows')