Saturday, May 23, 2009

displaying and serving big(ish) data with numpy and memmap

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script.

A web request containing a width=XXX in the url will show the user wants an image. so a url like:
will give an image:

(geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into
an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression.
Without the width=XXX param, the service returns JSON of values for the + and minus strand for each basepair and tissue in the original file--which is less useful than having a summary of mean value for each tissue over the range:

{"+": [0.008325556293129921, 0.0072462377138435841, 0.0074741491116583347, 0.0017609233036637306, 0.010895536281168461], "-": [0.0088664013892412186, 0.017081102356314659, 0.009236418642103672, 0.0043859495781362057, 0.021286465227603912]}

which are the mean values from basepair 491520 to 499712 for the + and - strands and each of the 5 tissues. A single tissue can be requested by number as:
which gives:

{"+": 0.0074741492195734898, "-": 0.0092364182547917447}

Likewise, one can request and image with only a single tissue:

Transcriptome data often has enough rows of data that it's possible, but not really convenient to stick it into an RDBMS. the data I'm using is available here and looks like:

26 C 125.05 106.18 119.80 159.78 101.40
26 W 141.90 139.07 137.50 151.91 171.18
51 C 131.85 108.57 71.00 156.41 133.79
51 W 123.83 129.97 122.50 100.23 139.09
76 C 136.84 104.98 120.30 88.88 151.64
76 W 119.35 112.79 160.00 119.51 130.29
126 C 138.93 100.92 111.00 84.82 103.48
126 W 87.82 118.30 110.30 157.17 126.42
151 C 184.50 71.24 157.80 396.25 86.42
151 W 136.76 119.95 107.00 98.10 108.60
176 C 135.75 86.98 115.80 188.41 93.53
176 W 147.57 158.19 131.00 117.45 164.77
[... millions more ...]

where the first column is basepair position, the second is either 'W' or 'C' for the Watson or Crick strand of the (double-stranded) DNA sequence. Columns 2 and on are measurements of transcription (see the wikipedia article) for various tissues in the plant Arabidopsis Thaliana which we use because it has a small genome, it's well annotated and doesn't have too much repetitive DNA or too many transposons.

The data is somewhat irregular in that not every 'C' row has a 'W' counterpart and sometimes vice-versa, for those cases, I create a row to fill the missing row with the same values as the existing row. The code below runs through and creates numpy .npy files of the data:

It saves the positions and the actual transcriptome data into separate files because for the web service, the transcriptome files will be memory-mapped while the smaller position files will be read into memory. This code is run at startup (and so not done for every request):

from numpy.lib import format
for ichr in range(1, 6): # for 5 at chrs
schr = str(ichr)
# memmap these
seqids[schr] = format.open_memmap(os.path.join(path, \
'' % (schr,)), mode='r')
# read these into memory.
posns[schr] = np.load(os.path.join(path, 'at.tome.posn.%s.npy' % schr))

So that the seqids dict has values that are the memmaped contents of the expression data in the .npy files. For each web request, it then uses numpy's searchsorted to do a binary search so that when a user requests &xmin=1234&xmax=4567 searchsorted() finds the position in the array where 1234 would fall (exactly like python's bisect module). and like wise for 4567. The pair of array indexes from the searchsorted() calls with xmin and xmax give the indexes into the tome array for which to grab the data:

minidx = posns[seqid].searchsorted(xmin)
maxidx = posns[seqid].searchsorted(xmax)

data = seqids[seqid][minidx: maxidx]
data_idx = posns[seqid][minidx: maxidx]

where data contains the values of the transcriptome data and data_idx contains the associated base_pair positions. I could have saved the basepair positions in the seqids/data numpy array, but since searchsorted() will likely traverse many chunks of the array, I think it's best to have that part in memory rather than memory mapped.
The rest of the code is a bunch of if statements (which I should really spread across multiple functions...) deciding whether to show an image with matplotlib or grab a summary of the data and return it via simplejson. The gist of the entire wsgi script is at the end of this post.
An example of what this looks like in an openlayers map with gene annotations is in the image below.

where one can see some correlation between where the (green) genes are and the higher transcriptome levels. In summary, it's nice to be able to take data in this format and write a 120 line script which provides full, fast access both to the raw data and to images which we can use to find patterns to explore.