Bioinformatics

Posts

genome scrubber : mask repetitive sequence

December 23, 2009

This is to describe a simple tool I've made available (svn repo) for masking repetitive sequence. Rice (Oryza sativa) version 5 sequence looks something like below when run through pyfasta info. rice.fasta ========== >1 length:43596771 >3 length:36345490 >2 length:35925388 >4 length:35244269 >6 length:31246789 >5 length:29874162 >7 length:29688601 >11 length:28462103 >8 length:28309179 >12 length:27497214 >9 length:23011239 >10 length:22876596 372.078M basepairs in 12 sequences So, it's not huge (still only 1/10th the size of human), but it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA". While that's not true, it does make whole-genome analyses a pain, as the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command: /usr/bin/bl...

rst2s5 with syntax highlighting

October 23, 2009

Restructured Text to S5 Presentation (lots of caffeine today so 2 posts in one day) The stub example presentation I'll be talking about is viewable as a presentation here (click on that page to advance the demo slides). There's a nice browser-based tool for presentations called S5. In recent python docutils, there's a tool called rst2s5.py which converts restructured-text to an s5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work. So pygments, a python library that will highlight syntax for many programming languages, comes with this file, which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion: $ svn export http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/ with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing rst-directive.py (included in the source) and changing th...

some python ctypes stuff in Rtree

October 22, 2009

I've been working with and on the Rtree python module. It's some cool work done by Howard Butler (hobu) (originally with Sean Gillies) to make python bindings for the Spatial Index C++ library by Marios Hadjieleftheriou, which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new ctypes wrapper to that API which appears in Rtree 0.5 and greater. There is some cool ctypes stuff in there which I'm starting to understand. From the website: ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. As a simple example of how ctypes works, we can pretend there's no atan() in python's math module and access the one from libm in C like this: import ctypes libm = ctypes.CDLL('libm.so.6') # the following 2 lines correspond to the c signature: double...
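Since the excerpt above cuts off mid-example, here is a minimal sketch of the same idea: declare the C signature of atan() and call it through ctypes. The use of ctypes.util.find_library and the c_atan wrapper name are my assumptions, not taken from the original post, and it presumes a C math library can be found on the system.

```python
import ctypes
import ctypes.util
import math

# Assumption: a C math library is discoverable on this system. If
# find_library() comes up empty, fall back to the glibc name from the post.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# These two lines mirror the C signature: double atan(double);
libm.atan.argtypes = [ctypes.c_double]
libm.atan.restype = ctypes.c_double


def c_atan(x):
    """Call atan() from the C math library instead of Python's math module."""
    return libm.atan(x)


# Print both results side by side to compare them by eye.
print(c_atan(1.0), math.atan(1.0))
```

Without the argtypes/restype declarations, ctypes would assume int arguments and an int return value, so declaring the signature is the part that matters here.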

biostuff

October 14, 2009

I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around. Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio. As of yesterday, I've moved a lot of stuff from there to bitbucket. "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster any contribution. Currently, all the modules on bitbucket are also on pypi. pyfasta provides pythonic access to fasta sequence files. Previously, it had been a part of genedex (which I've stopped supporting since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete) but it's been pulled out and simplified and improved. Check out the docs on pypi. nwalign is a command-line or python interface to the Needleman-W...

starting haxe. (stuff i want to remember)

July 26, 2009

I've been tinkering with a flash project recently. Actually haxe, so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember. To start, here's a gist of shell commands that will set up haxe on an ubuntu machine. The installers from the haxe website work fine for windows and mac (and I think 32 bit linux). Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. These docs are very good, and better than the adobe site; I always have that page open when working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting and added this line to my .vimrc: autocmd BufRead *.hx set filetype=actionscript Then compilation and code is simply a matter of following this. ...

displaying and serving big(ish) data with numpy and memmap

May 23, 2009

In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create a simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. A web request containing a width=XXX in the URL indicates that the user wants an image. So a URL like: http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512 will give an image: (geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.) where each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression. Without the width=XXX param, the service returns JSON of values...

Needleman-Wunsch global sequence alignment

April 21, 2009

[EDIT 01-01-2010] This is now available at bitbucket. I've written a simple, fast, python version of Needleman-Wunsch as I couldn't find one to use. It uses Cython and specifically cython-numpy goodness. It's easy-installable as: sudo easy_install -UZ http://bpbio.googlecode.com/svn/trunk/nwalign/ or via svn from: svn co http://bpbio.googlecode.com/svn/trunk/nwalign/ It will put an executable 'nwalign' into /usr/bin/ which when run will give this message: Usage: nwalign [options] seq1 seq2 Options: -h, --help show this help message and exit --gap=GAP gap extend penalty (must be integer --gap_init=GAP_INIT gap start penalty (must be integer --match=MATCH match score (must be integer > 0) --mismatch=MISMATCH gap penalty (must be integer --matrix=MATRIX scoring matrix in ncbi/data/ format, if not specificied, match/mismatch are used where the matrix is optional but can be the full path ...
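For anyone unfamiliar with the algorithm itself, here is a minimal pure-Python sketch of the Needleman-Wunsch scoring recurrence. It is not the nwalign implementation (which uses Cython and separate gap-start and gap-extend penalties). It assumes a single linear gap penalty, returns only the score rather than the aligned strings, and the function name nw_score and its defaults are made up for illustration.

```python
def nw_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score using a linear gap penalty."""
    # Row 0: aligning the empty prefix of seq1 against seq2[:j] costs j gaps.
    prev = [j * gap for j in range(len(seq2) + 1)]
    for i, a in enumerate(seq1, start=1):
        # Column 0: aligning seq1[:i] against the empty prefix costs i gaps.
        curr = [i * gap]
        for j, b in enumerate(seq2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap        # a is aligned to a gap
            left = curr[j - 1] + gap  # b is aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(nw_score("GATTACA", "GATTACA"))
```

Keeping only two rows of the matrix is enough for the score. Recovering the alignment itself needs the full matrix (or traceback pointers).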

python object initialization speed

April 16, 2009

On the Cython mailing list, I saw this mentioned for avoiding init overhead, so I wrote up some code to try it. Basically, instead of using an __init__, it uses the PY_NEW macro (which I don't pretend to understand fully). I ran a benchmark with 6 cases: PY_NEW macro (still has python overhead for each call to the creator function); regular python init; python init using __slots__; cython init (cdef'ed class); batch PY_NEW: calling PY_NEW from inside cython to avoid python call overhead; batch init on cython class. The timings look like this: PY_NEW on Cython class: 1.160; __init__ on Python class: 30.414; __init__ on Python class with slots: 10.242; __init__ on Cython class: 1.185; batch PY_NEW total: 0.855, interval only: 0.383; batch __init__ on Cython class total: 0.998, interval_only: 0.540. So, the PY_NEW is .383 compared to .540 for using an __init__ on a Cython class, but both are much faster than python. I was surprised that using slots gives a 3x speed improvement over a regular...
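The two pure-Python cases from that benchmark are easy to try without Cython. This is a rough sketch, not the original benchmark code: the class names, the start/stop attributes and the iteration count are my own choices, and the numbers will vary by machine and Python version (the gap between the two may be much smaller on a modern interpreter).

```python
import timeit


class Plain(object):
    """Regular Python class: attributes live in a per-instance __dict__."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop


class Slotted(object):
    """Same class, but __slots__ removes the per-instance __dict__."""

    __slots__ = ("start", "stop")

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop


def time_init(cls, n=100000):
    """Return the seconds taken to create n instances of cls."""
    return timeit.timeit(lambda: cls(1, 2), number=n)


for cls in (Plain, Slotted):
    print(cls.__name__, round(time_init(cls), 3))
```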

apache mpm-worker with php on low memory servers

April 12, 2009

I'm partly writing this because I think such valuable info is too hard to find. I'd read a lot that if you want php, you have to use mpm-prefork, but it's not true! I've been using slicehost for a dev server for almost a year now. Since my 1 year deal is up, I decided to switch to their affiliate mosso. I really like slicehost, but Mosso "cloud servers" seem a good fit for a server that goes through spurts of development and use followed by weeks of non-use. So now, I can keep it as a 256MB instance at about $10/month and update to a larger instance when doing real dev. I built it today as a 1024MB instance -- installed all my usual stuff, and updated my build script for ubuntu. That's here. The machine I'm on is extremely fast. Normally I set GDAL building and leave, but it finished before I had a chance. After all was built, I resized it to a 256MB server -- that took 12 minutes, but my instance was accessible for at least 10 of those. After that, I ...

biohash

February 20, 2009

This is a quick project I did a while back, but I've seen people interested in similar ideas, so I'll post my implementation here. Geohash encodes latitude/longitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval. The use is to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, whic...
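The halving idea described above can be sketched in a few lines. This is my own reconstruction from the description, not the original biohash code: I'm assuming rng is the (low, high) extent of the whole space, that bins are half-open, and that the default range shown here is arbitrary.

```python
def dumps(start, end, rng=(0, 2 ** 32)):
    """Encode the interval start..end (inclusive) as a string of 1s and 0s."""
    lo, hi = rng  # assumption: rng is the half-open extent [lo, hi) of the space
    if not lo <= start <= end < hi:
        raise ValueError("interval must lie inside rng")
    bits = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if end < mid:        # interval fits entirely in the lower half
            bits.append("0")
            hi = mid
        elif start >= mid:   # interval fits entirely in the upper half
            bits.append("1")
            lo = mid
        else:                # interval straddles the midpoint: stop halving
            break
    return "".join(bits)


def loads(code, rng=(0, 2 ** 32)):
    """Decode a 1/0 string back to the half-open bin (lo, hi) it names."""
    lo, hi = rng
    for bit in code:
        mid = (lo + hi) // 2
        if bit == "0":
            hi = mid
        else:
            lo = mid
    return lo, hi


code = dumps(5, 6, rng=(0, 16))
print(code, loads(code, rng=(0, 16)))
```

Because a smaller interval nested inside a larger one keeps halving past the point where the larger one stopped, the larger interval's code is a prefix of the smaller one's, which is what makes prefix queries on a BTree index useful.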

Stupid things I did as a Bioinformatics Programmer in 2008

January 09, 2009

In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it. My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in un-structured text files or python pickles. I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly. I was passing around data in arrays and tuples, so then when I changed the order or added another element in...

landsummary.com

December 17, 2008

One of the things I like the least about my real job, and much of the contract work that I do, is that I'm usually the only programmer working on each task. So, it's been very fun to work on a project with Josh Livni (his writeup). We got together one afternoon, and by the time we left, we had a reasonable start of what we call landsummary. We've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too. What's actually on display is fairly modest. It takes a user-drawn square, circle, or arbitrary polygon, and uses that to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT Polygon and request a particular service -- currently nlcd, population,...

python interval tree

November 03, 2008

EDIT: added a couple points inline. I'm obsessed with trees lately -- of the CS variety, not the plant variety. Although we are studying poplar, so I'll be using trees to study trees. I'd tried a couple times to implement an interval tree from scratch following the Wikipedia entry. Today I more or less did that in python. It's the simplest possible form. There's no insertion (though that's trivial to add), it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method. The part of the Wikipedia entry that baffled me was about storing 2 copies of each node's intervals, one sorted by start and the other by stop. I didn't do that as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer since there are the non-leaf nodes). So I just brute force each of those nodes and move to the next. I th...
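Here's a minimal sketch of that kind of tree, written from the description above rather than copied from the original code. The class names, the depth and minbucket parameters, the choice of the median start as each node's center, and the treatment of intervals as closed (start and stop both inclusive) are all my assumptions.

```python
class Interval(object):
    """Any 'thing' with start and stop attributes will do."""

    def __init__(self, start, stop):
        self.start = start
        self.stop = stop


class IntervalTree(object):
    def __init__(self, intervals, depth=16, minbucket=16):
        intervals = list(intervals)
        self.left = self.right = None
        if depth == 0 or len(intervals) <= minbucket:
            # Small enough: keep everything here as a leaf.
            self.center = None
            self.intervals = intervals
            return
        starts = sorted(iv.start for iv in intervals)
        self.center = starts[len(starts) // 2]
        here, lefts, rights = [], [], []
        for iv in intervals:
            if iv.stop < self.center:
                lefts.append(iv)    # entirely left of center
            elif iv.start > self.center:
                rights.append(iv)   # entirely right of center
            else:
                here.append(iv)     # spans the center, stays at this node
        self.intervals = here
        if lefts:
            self.left = IntervalTree(lefts, depth - 1, minbucket)
        if rights:
            self.right = IntervalTree(rights, depth - 1, minbucket)

    def find(self, start, stop):
        """Return all stored intervals overlapping start..stop."""
        # Brute force over this node's own intervals (no sorted copies).
        hits = [iv for iv in self.intervals
                if iv.start <= stop and iv.stop >= start]
        if self.center is None:
            return hits
        if self.left is not None and start < self.center:
            hits.extend(self.left.find(start, stop))
        if self.right is not None and stop > self.center:
            hits.extend(self.right.find(start, stop))
        return hits


tree = IntervalTree([Interval(i, i + 10) for i in range(0, 1000, 5)])
print(len(tree.find(100, 120)))
```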

twill with XHTML (not viewing HTML)

October 25, 2008

Since I couldn't find this anywhere, I'll add it here for those who have the same problem: I was trying to test a website with twill and got this at the end of my traceback: raise BrowserStateError("not viewing HTML") BrowserStateError: not viewing HTML After spending a bunch of time making sure that, yes, it was spitting out HTML, I figured out that it specifically means that twill (actually mechanize) doesn't like XHTML. You can likely fix it by adding this at the top of the script: b = twill.get_browser() b._browser._factory.is_html = True twill.browser = b Presumably, there's a real reason that check is in place, but works-4-me...

appengine memcache memoize decorator

October 24, 2008

[NOTE: see the 2nd comment below about using a tuple as a key. better to just use pickle.dumps] I've been playing with google appengine lately. I'm working on a fun, pointless side project. Here's what I came up with for a cache decorator that pulls from memcache based on the args, kwargs and function name if no explicit key is given. The code for creating a key from those is from the recipe linked in the docstring. """ a decorator to use memcache on google appengine. optional arguments: `key`: the key to use for the memcache store `time`: the time to expiry sent to memcache if no key is given, the function name, args, and kwargs are used to create a unique key so that the same function can return different results when called with different arguments (as expected). usage: NOTE: actual usage is simpler as: @gaecache() def some_function(): ... but doctest doesnt seem to like that. >>> import time >>> def slow_fn(): ... time...
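To make the idea concrete without an App Engine account, here is a sketch of the same decorator pattern against a small in-process stand-in for memcache. DictCache is hypothetical and only mimics the get(key) / set(key, value, time=...) shape. The key is built with pickle.dumps as the note above suggests. This is not the original decorator's code.

```python
import functools
import pickle
from time import time as now


class DictCache(object):
    """Hypothetical in-process stand-in for a memcache-like get/set client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires = self._store.get(key, (None, None))
        if expires is not None and now() >= expires:
            del self._store[key]  # entry is stale: drop it and report a miss
            return None
        return value

    def set(self, key, value, time=0):
        # time is seconds until expiry; 0 means keep it indefinitely
        expires = now() + time if time else None
        self._store[key] = (value, expires)


cache = DictCache()


def gaecache(key=None, time=0):
    """Cache a function's result under key, or under its name and arguments."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            cache_key = key
            if cache_key is None:
                cache_key = pickle.dumps(
                    (fn.__name__, args, sorted(kwargs.items())))
            result = cache.get(cache_key)
            if result is None:  # a miss (or a cached None) calls fn again
                result = fn(*args, **kwargs)
                cache.set(cache_key, result, time=time)
            return result
        return wrapper
    return decorator


@gaecache()
def square(x):
    return x * x


print(square(3), square(3))
```

One limitation of this sketch: a function that legitimately returns None is never served from the cache, because None is also how a miss is reported.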

genedex.fasta with numpy.memmap

October 03, 2008

EDIT: added job posting to comments. I've been working a bit on genedex, and I'm still not happy with the way it stores features, which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree, but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema to allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py . That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on th...

choosing django

August 06, 2008

I prefer sqlalchemy and genshi (or mako) and was therefore looking at using turbogears, but I saw a demo of the django admin, and that sold me. Certainly the templating language did not. Before this, I'd only used web.py in my projects. These are the things I've liked/noted: The first and most important: community. (Oddly enough, as I write this there are 666 projects tagged as django. 'turbogears', 'tg2', 'tg' give fewer than 50 projects combined.) Think someone might have already written what you need? Yep. Also, a great site: http://www.djangosnippets.org/ , where I've learned a lot just by reading, and saved myself a lot of time by extending ideas there. And the development is active. Second, django.contrib.* User authentication is simple, and check google-code for various alterations on the theme. admin. This was what first made me decide on django. And now, new-forms admin is in trunk. This gives you a pretty nice CRUD interface for models ...

pylab matplotlib imagemap

June 20, 2008

UPDATE 7-10-08: + add example for scatter plot + link to ken-ichi === Figuring out how to make a client side image map from a matplotlib image has stumped me more than once. Andrew Dalke does have a working example. Below, I have the minimal example. It's simple once you get the steps right: just use mpl's transform() to convert the data into the image's coordinate system. Then flip the y-axis as required by the imagemap, then do the normal imagemap stuff and save the html. The only real gotcha is to make sure to put the dpi in the call to savefig(). import pylab import sys import random name = 'imap' # make some fake data xs = range(15) ys = [random.choice(xs) for i in range(len(xs))] xys = zip(xs, ys) # can also use : f = pylab.subplot(121) f, = pylab.plot(xs, ys, 'ro') dpi = f.figure.get_dpi() height = f.figure.get_figheight() * dpi # convert the x,y coords into image coords. transform = f.get_transform() icoords = transform.transform(xys) # the minimal ...

binary search over intervals

June 04, 2008

[EDIT: update location of code repo] This isn't particularly advanced or clever; it's just a simple implementation--less code than anything else I could come up with. Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order. Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here, that means storing not only t...
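The guessing game is small enough to check directly. This sketch (the function name and the always-guess-the-midpoint strategy are my own framing, not code from the post) counts the guesses needed for every possible target between 0 and 100 and reports the worst case.

```python
def guesses_needed(target, lo=0, hi=100):
    """Count midpoint guesses needed to find target in lo..hi inclusive."""
    count = 0
    while lo <= hi:
        guess = (lo + hi) // 2
        count += 1
        if guess == target:
            return count
        if guess < target:
            lo = guess + 1   # friend says "higher"
        else:
            hi = guess - 1   # friend says "lower"
    raise ValueError("target is outside the range")


# Worst case over every number the friend could have picked.
print(max(guesses_needed(t) for t in range(101)))  # prints 7
```

The first two guesses for any target above 50 are 50 and then 75, matching the game as described.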

wherecamp

May 21, 2008

I agree with every report I've seen. Wherecamp was awesome. I've been telling people, and I'm still sure it's true, that I met exactly zero people who I'd physically seen before. Ordinarily, I avoid meetings, but this is a good format and seems to attract good people. It's fun to meet and work with people who are really into what they do. The talks are less "talky" and more like chat sessions--which is possible when the groups are small. There was also plenty of time to hack, which was the original reason I went. During and after, I learned some simple things which I'm trying to incorporate into my usual workflow: In the shell, background a job with "ctrl + z" then get back to it with %i where i is the number shown in the output from "jobs". That's a trick from jlivni. From crschmidt, I added: alias doctest="nosetests --with-doctest --doctest-extension=.txt" to my .bash_aliases. Which lets me do: doctest tes...