Posts

GSNAP

Aligners Since starting the methylcoder project, I've been using the bowtie short read aligner . It's very fast, uses very little memory, aligns Illimina, SOLID, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard). There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and, so is not a problem for most users, but, for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the r...

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics. This post involves grouping items either using a disjoint set or python's itertools.groupby . I was introduced to both of these by the post-doc who sits behind me. Grouper Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook. The slightly modified implementation we use is here . Where the docstring gives a decent idea of useage: """ This class provides a lightweight way to group arbitrary objects together into disjoint sets when a full-blown graph data structure would be overkill. Objects can be joined using .join(), tested for connectedness using .joined(), and all disjoint sets can be retrieved using list(g) The objects being joined must be hashable. >>> g = Grouper() >>> g.join('a', 'b') >>> g.join('b...

python tools for bioinformatics I: convolve for moving average

Image
This will be the first post of a few describing python tools or idioms that I have found work to well for doing bioinformatics -- or more specifically genomics. Convolve numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made of of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve . It does require you do get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')". So the workflow is to get the sequence into a numpy array of dtype 'c' convert the boolean array to int so True is 1, False is 0 decide on the window size and make a kernel of ...

numpy + GDAL = agoodle

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to landsummary , a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers . You can see the usage on the github page but basically, it'll be something like: which gives you a dictionary of raster grid code keys to the area values-- meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for ra...

fileindex

Disclaimer: This is about a generic indexing method that fits in here and explained below. Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python . And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via sqlite db. Titus wrote about screed and posted to the biology-in-python mailng list, which is how i originally heard about it. Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record). In the case of flat files, it's a mapping from and id or name to the fseek file-position at ...

writing and building a lua c extension module

[Update: 2009-04-09] This package is now on luarocks under it's new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo . [/Update] I've been messing with Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from python . In lua, all the c functions have a signature like: int c_function(lua_State *L) and you use the lua c API to pull values off the lua_state stack thingy -- L . And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you. Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua dis...

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative. Big ol' Matrix because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here . They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for mo...

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation , determined by the nucleotides that follow the 'C': CG: a 'G' follows the 'C' CHG: anything but a 'G' follows the 'C' and a 'G' follows that CHH: no 'G' in the 2 subsequent nucleotides. So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs. Loopless Given a nump...

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign , a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance. Ack It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing. Bugs Placement of Gaps Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments: AGEBAMAM AGEBAM-- and AGEBAMAM AGEB--AM Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through t...

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta . One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends. With this option, a file like: >a actg actg actg >b aaaacccc aaaacccc will overwri...

genome scrubber : mask repetitive sequence

Image
This is to describe a simple tool I've made available ( svn repo ) for masking repetitive sequence. rice (Oryza Sativa) version 5 sequence looks something like below when run through pyfasta info. rice.fasta ========== >1 length:43596771 >3 length:36345490 >2 length:35925388 >4 length:35244269 >6 length:31246789 >5 length:29874162 >7 length:29688601 >11 length:28462103 >8 length:28309179 >12 length:27497214 >9 length:23011239 >10 length:22876596 372.078M basepairs in 12 sequences So, it's not huge (still only 1/10th the size of human) but, it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA", while that's not true, it does make whole-genome analyses a pain as a the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command: /usr/bin/bl...

rst2s5 with syntax highlighting

Restructured Text to S5 Presentation (lots of caffeine today so 2 posts in one day) The stub example presentation I'll be talking about is viewable as a presentation here (click on that page to advance the demo slides). There's a nice browser-based tool for presentations called S5 . In recent python docutils , there's a tool called rst2s5.py which converts restructured-text to an s5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work. So pygments , a python library that will highlight syntax for many programming languages comes with this file which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion: $ svn export http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/ with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing rst-directive.py (included in the source) and changing th...

some python ctypes stuff in Rtree

I've been working with and on the Rtree python module. It's some cool work done by Howard Butler (hobu) (originally with Sean Gillies ) to make python bindings for the Spatial Index C++ library by Marios Hadjieleftheriou which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new ctypes wrapper to that API which appears in Rtree 0.5 and greater. There is some cool ctypes stuff in there which I'm starting to understand. From the website: ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. as a simple example of how ctypes works we can pretend there's no atan() in python's math module and access the one from libm in c like this: import ctypes libm = ctypes.CDLL('libm.so.6') # the following 2 lines correspond to the c signature: double...

biostuff

I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around. Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio . As of yesterday, I've moved a lot of stuff from there to bitbucket . "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster any contribution. Currently, all the modules on bitbucket are also on pypi. pyfasta provides pythonic access fasta sequence files. Previously, it had been a part of genedex (which I've stopped supporting since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete) but it's been pulled out and simplified and improved. Check out the docs on pypi . nwalign is a command-line or python interface to the Needleman-W...

starting haxe. (stuff i want to remember)

I've been tinkering with a flash project recently. Actually haxe , so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember. To start, here's a gist of shell commands that will set up haxe on an ubuntu machine. The installers from the haxe website work fine for windows and mac (and I think 32 bit linux). Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. These docs are very good, and better than the adobe site, I have that page open always when working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting and added this line to my .vimrc: autocmd BufRead *.hx set filetype=actionscript Then compilation and code is simply a matter of following this . ...

displaying and serving big(ish) data with numpy and memmap

Image
In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. A web request containing a width=XXX in the url will show the user wants an image. so a url like: http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512 will give an image: (geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression. Without the width=XXX param, the service returns JSON of values...

Needleman-Wunsch global sequence alignment

[EDIT 01-01-2010] this is now available at bitbucket . I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as: sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/ or via svn from: svn co http://bpbio.googlecode.com/svn/trunk/nwalign/ it will put an executable 'nwalign' into /usr/bin/ which when run will give this message: Usage: nwalign [options] seq1 seq2 Options: -h, --help show this help message and exit --gap=GAP gap extend penalty (must be integer --gap_init=GAP_INIT gap start penalty (must be integer --match=MATCH match score (must be integer > 0) --mismatch=MISMATCH gap penalty (must be integer --matrix=MATRIX scoring matrix in ncbi/data/ format, if not specificied, match/mismatch are used where the matrix is optional but can be the full path ...

python object initialization speed

On the Cython mailing list, I saw this mentioned for avoiding init overhead, so i wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully). I ran a benchmark with 5 cases: PY_NEW macro (still has python overhead for each call to the creator function) regular python init python init using __slots__ cython init (cdef'ed class) batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead batch init on cython class the timings look like this: PY_NEW on Cython class: 1.160 __init__ on Python class: 30.414 __init__ on Python class with slots: 10.242 __init__ on Cython class 1.185 batch PY_NEW total: 0.855 , interval only: 0.383 batch __init__ on Cython class total 0.998 , interval_only: 0.540 So, the PY_NEW is .383 compared to .540 for using a __init__ on a Cython class, but both are much faster than python. I was surprised that using slots gives a 3x speed improvement over a regular...

apache mpm-worker with php on low memory servers

I'm partly writing this because I think such valuable info is too hard to find, I'd read a lot that if you want php, you have to use mpm-prefork but it's not true! I've been using slicehost for dev server for almost a year now. Since my 1 year deal is up, I decided to switch to their affiliate mosso . I really like slicehost, but Mosso "cloud servers" seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now, I can keep it as a 256MB instance at about $10/month and update to a larger instance when doing real dev. I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's here . The machine I'm on is extremely fast, normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server -- that took 12minutes, but my instance was accessible for at least 10 of those. After that, I ...

biohash

This is a quick project I did a while back but, I've seen people interested in similar ideas, so I'll post my implementation here. Geohash encodes latitude/longtitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval. The use to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, whic...