Bioinformatics

Posts

numpy + GDAL = agoodle

May 19, 2010

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to landsummary , a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers . You can see the usage on the github page but basically, it'll be something like: which gives you a dictionary of raster grid code keys to the area values-- meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for ra...

fileindex

April 10, 2010

Disclaimer: This is about a generic indexing method that fits in here and explained below. Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python . And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via sqlite db. Titus wrote about screed and posted to the biology-in-python mailng list, which is how i originally heard about it. Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record). In the case of flat files, it's a mapping from and id or name to the fseek file-position at ...

writing and building a lua c extension module

April 04, 2010

[Update: 2009-04-09] This package is now on luarocks under it's new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo . [/Update] I've been messing with Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from python . In lua, all the c functions have a signature like: int c_function(lua_State *L) and you use the lua c API to pull values off the lua_state stack thingy -- L . And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you. Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua dis...

organizing genomic data

February 16, 2010

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative. Big ol' Matrix because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here . They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for mo...

Loopless programming (for calculating methylation types)

January 18, 2010

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation , determined by the nucleotides that follow the 'C': CG: a 'G' follows the 'C' CHG: anything but a 'G' follows the 'C' and a 'G' follows that CHH: no 'G' in the 2 subsequent nucleotides. So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs. Loopless Given a nump...

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

January 10, 2010

I've written previously about nwalign , a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance. Ack It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing. Bugs Placement of Gaps Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments: AGEBAMAM AGEBAM-- and AGEBAMAM AGEB--AM Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through t...

updates to pyfasta

January 04, 2010

At the end of last year, I put in quite a few updates to pyfasta . One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends. With this option, a file like: >a actg actg actg >b aaaacccc aaaacccc will overwri...

genome scrubber : mask repetitive sequence

December 23, 2009

This is to describe a simple tool I've made available ( svn repo ) for masking repetitive sequence. rice (Oryza Sativa) version 5 sequence looks something like below when run through pyfasta info. rice.fasta ========== >1 length:43596771 >3 length:36345490 >2 length:35925388 >4 length:35244269 >6 length:31246789 >5 length:29874162 >7 length:29688601 >11 length:28462103 >8 length:28309179 >12 length:27497214 >9 length:23011239 >10 length:22876596 372.078M basepairs in 12 sequences So, it's not huge (still only 1/10th the size of human) but, it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA", while that's not true, it does make whole-genome analyses a pain as a the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command: /usr/bin/bl...

rst2s5 with syntax highlighting

October 23, 2009

Restructured Text to S5 Presentation (lots of caffeine today so 2 posts in one day) The stub example presentation I'll be talking about is viewable as a presentation here (click on that page to advance the demo slides). There's a nice browser-based tool for presentations called S5 . In recent python docutils , there's a tool called rst2s5.py which converts restructured-text to an s5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work. So pygments , a python library that will highlight syntax for many programming languages comes with this file which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion: $ svn export http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/ with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing rst-directive.py (included in the source) and changing th...

some python ctypes stuff in Rtree

October 22, 2009

I've been working with and on the Rtree python module. It's some cool work done by Howard Butler (hobu) (originally with Sean Gillies ) to make python bindings for the Spatial Index C++ library by Marios Hadjieleftheriou which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new ctypes wrapper to that API which appears in Rtree 0.5 and greater. There is some cool ctypes stuff in there which I'm starting to understand. From the website: ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. as a simple example of how ctypes works we can pretend there's no atan() in python's math module and access the one from libm in c like this: import ctypes libm = ctypes.CDLL('libm.so.6') # the following 2 lines correspond to the c signature: double...

biostuff

October 14, 2009

I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around. Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio . As of yesterday, I've moved a lot of stuff from there to bitbucket . "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster any contribution. Currently, all the modules on bitbucket are also on pypi. pyfasta provides pythonic access fasta sequence files. Previously, it had been a part of genedex (which I've stopped supporting since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete) but it's been pulled out and simplified and improved. Check out the docs on pypi . nwalign is a command-line or python interface to the Needleman-W...

starting haxe. (stuff i want to remember)

July 26, 2009

I've been tinkering with a flash project recently. Actually haxe , so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember. To start, here's a gist of shell commands that will set up haxe on an ubuntu machine. The installers from the haxe website work fine for windows and mac (and I think 32 bit linux). Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. These docs are very good, and better than the adobe site, I have that page open always when working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting and added this line to my .vimrc: autocmd BufRead *.hx set filetype=actionscript Then compilation and code is simply a matter of following this . ...

displaying and serving big(ish) data with numpy and memmap

May 23, 2009

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. A web request containing a width=XXX in the url will show the user wants an image. so a url like: http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512 will give an image: (geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression. Without the width=XXX param, the service returns JSON of values...

Needleman-Wunsch global sequence alignment

April 21, 2009

[EDIT 01-01-2010] this is now available at bitbucket . I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as: sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/ or via svn from: svn co http://bpbio.googlecode.com/svn/trunk/nwalign/ it will put an executable 'nwalign' into /usr/bin/ which when run will give this message: Usage: nwalign [options] seq1 seq2 Options: -h, --help show this help message and exit --gap=GAP gap extend penalty (must be integer --gap_init=GAP_INIT gap start penalty (must be integer --match=MATCH match score (must be integer > 0) --mismatch=MISMATCH gap penalty (must be integer --matrix=MATRIX scoring matrix in ncbi/data/ format, if not specificied, match/mismatch are used where the matrix is optional but can be the full path ...

python object initialization speed

April 16, 2009

On the Cython mailing list, I saw this mentioned for avoiding init overhead, so i wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully). I ran a benchmark with 5 cases: PY_NEW macro (still has python overhead for each call to the creator function) regular python init python init using __slots__ cython init (cdef'ed class) batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead batch init on cython class the timings look like this: PY_NEW on Cython class: 1.160 __init__ on Python class: 30.414 __init__ on Python class with slots: 10.242 __init__ on Cython class 1.185 batch PY_NEW total: 0.855 , interval only: 0.383 batch __init__ on Cython class total 0.998 , interval_only: 0.540 So, the PY_NEW is .383 compared to .540 for using a __init__ on a Cython class, but both are much faster than python. I was surprised that using slots gives a 3x speed improvement over a regular...

apache mpm-worker with php on low memory servers

April 12, 2009

I'm partly writing this because I think such valuable info is too hard to find, I'd read a lot that if you want php, you have to use mpm-prefork but it's not true! I've been using slicehost for dev server for almost a year now. Since my 1 year deal is up, I decided to switch to their affiliate mosso . I really like slicehost, but Mosso "cloud servers" seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now, I can keep it as a 256MB instance at about $10/month and update to a larger instance when doing real dev. I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's here . The machine I'm on is extremely fast, normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server -- that took 12minutes, but my instance was accessible for at least 10 of those. After that, I ...

biohash

February 20, 2009

This is a quick project I did a while back but, I've seen people interested in similar ideas, so I'll post my implementation here. Geohash encodes latitude/longtitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval. The use to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, whic...

Stupid things I did as a Bioinformatics Programmer in 2008

January 09, 2009

In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it. My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in un-structured text files or python pickles. I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly. I was passing around data in arrays and tuples, so then when I changed the order or added another element in...

landsummary.com

December 17, 2008

One of the things I like the least about my real job, and much of the contract work that I do is that i'm usually the only programmer working on each task. So, it's been very fun to work on a project with Josh Livni ( His writeup ). We got together one afternoon, and by the time we left, we had a reasonable start of what we call landsummary , we've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too. What's actually on display is fairly modest. What it does is takes a user-drawn square, circle, or arbitrary polygon, and uses that to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT Polygon and request a particular service -- currently nlcd, population,...

python interval tree

November 03, 2008

EDIT: added a couple points inline. I'm obsessed with trees lately -- of the CS variety, not the plant variety. Although we are studying poplar, so I'll be using trees to study trees . I'd tried a couple times to implement an interval tree from scratch following the Wikipedia entry . Today I more or less did that in python. It's the simplest possible form. There's no insertion (though that's trivial to add), it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method. The wikipedia entry that baffled me was about storing 2 copies of each node's intervals --one sorted by start and the other by stop. I didn't do that as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer since there are the non-leaf nodes). So I just brute force each of those nodes and move to the next. I th...