Bioinformatics

Posts

Showing posts with the label numpy

Loopless programming (for calculating methylation types)

January 18, 2010

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation , determined by the nucleotides that follow the 'C': CG: a 'G' follows the 'C' CHG: anything but a 'G' follows the 'C' and a 'G' follows that CHH: no 'G' in the 2 subsequent nucleotides. So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs. Loopless Given a nump...

displaying and serving big(ish) data with numpy and memmap

May 23, 2009

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. A web request containing a width=XXX in the url will show the user wants an image. so a url like: http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512 will give an image: (geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression. Without the width=XXX param, the service returns JSON of values...

landsummary.com

December 17, 2008

One of the things I like the least about my real job, and much of the contract work that I do is that i'm usually the only programmer working on each task. So, it's been very fun to work on a project with Josh Livni ( His writeup ). We got together one afternoon, and by the time we left, we had a reasonable start of what we call landsummary , we've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too. What's actually on display is fairly modest. What it does is takes a user-drawn square, circle, or arbitrary polygon, and uses that to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT Polygon and request a particular service -- currently nlcd, population,...

numpy to tiff via gdal

April 23, 2008

EDIT: 7 months later I came back to this and found an error. update in the code below with old line commented out. Rather than venting about a project I've recently decoupled myself from, I'll try to do something constructive... I also posted this to the gispython mailing list, but I've had to figure it out a couple times, so I'll put it here for the record. Given an N * N numpy array, and a bounding box, it's actually fairly simple to make a georeferenced tiff: from osgeo import gdal, gdal_array import numpy from osgeo.gdalconst import GDT_Float64 xsize, ysize = 10, 10 a = numpy.random.random((xsize, ysize)).astype(numpy.float64) xmin, xmax = -121., -119. ymin, ymax = 41., 43. driver = gdal.GetDriverByName('GTiff') # bad: out = driver.Create('a.tiff', a.shape[0], a.shape[1], 1, GDT_Float64) # the args to Create are 'name', xsize, ysize. and .shape[0] is rows, which is y. driver.Create('a.tiff', a.shape[1], a.shape[0], 1, GDT_Float64...