Wednesday, December 17, 2008

landsummary.com

One of the things I like least about my real job, and much of the contract work I do, is that I'm usually the only programmer working on each task. So it's been very fun to work on a project with Josh Livni (his writeup). We got together one afternoon, and by the time we left, we had a reasonable start on what we call landsummary; we've since put in a fair bit of work sprinkled here and there. Josh set up an AWS server--it's nice for me to have fewer sys-admin duties too.
What's actually on display is fairly modest. It takes a user-drawn square, circle, or arbitrary polygon and uses that to summarize the NLCD dataset along with some census data. The things that make it more than lame are that it's very fast, it can easily be extended to summarize any raster dataset, and we have a sorta cool API (not documented) which allows us or anyone to query the data with a WKT polygon and request a particular service -- currently nlcd, population, weather.precip, and a couple of services with environmental engineering applications. All of them use the same libraries--PostGIS for the census-related stuff, and GDAL/gdal_array for the raster (currently just NLCD) queries. Josh handled all the census data; I know very little about that, except that whenever I've tried previously, it's been a pain to work with, and now Josh has a nice setup for it. I took the lead on the raster summaries. For that, I wrote a little library that wraps gdal_array, so you can take a GDAL datasource:

>>> g = AGoodle("something.tif")
>>> a = g.read_array_bbox([xmin, ymin, xmax, ymax])

And then 'a' is a numpy array with all the niceties that entails. So, if we want to get just the food cells, which have values of 81 and 82 in the NLCD dataset, it's just:

>>> a[(a == 81) | (a == 82)]

For arbitrary polygons, we use a surprisingly fast function from matplotlib to mask anything that's not inside a list of vertices (the polygon), then do any summary stats on the masked array.
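
The gist of that masking step is something like this (a simplified sketch, not the actual landsummary code--newer matplotlib exposes the point-in-polygon test as Path.contains_points):

import numpy as np
from matplotlib.path import Path

def summarize_polygon(a, verts):
    """a: 2D raster chunk; verts: polygon vertices in array (col, row) coords."""
    ny, nx = a.shape
    cols, rows = np.meshgrid(np.arange(nx), np.arange(ny))
    pts = np.column_stack((cols.ravel(), rows.ravel()))
    # boolean mask of cells that fall inside the polygon
    inside = Path(verts).contains_points(pts).reshape(a.shape)
    masked = np.ma.masked_array(a, mask=~inside)
    # then any summary stats, e.g. total and the count of NLCD crop cells
    return masked.sum(), ((masked == 81) | (masked == 82)).sum()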

Thanks to Josh, we have a fairly nice django project structure, with separate apps for each little analysis we've added. In my previous django projects, I've dumped everything into a single app and hacked away; the structure we have now makes it easier to keep what's needed in my brain. Also, when hacking with someone, I'm less likely to put in total crap code. Josh has already had a good laugh at some code where I found the 25 closest weather stations using:

SELECT * from stations ORDER BY ABS(lat - ?) + ABS(lon - ?) LIMIT 25

then sorted those 25 using geopy.distance to make sure it was the real distance. In my defence, I really wanted to use vanilla sqlite and so didn't have postGIS at my disposal--also, it was quite fast for only 6,000 stations. We've since dumped it all into postGIS. There's probably a couple of other gems in there.

So, back to the modest functionality part... Actually, it turns out this is a fairly difficult thing to do in McClick software--time consuming in both user and processor time. So having a way to click a point and see land-use stats and population data appear in about a second is pretty cool--and it's on the web. We've already found a couple folks with interesting applications, and we're interested in finding more--the original motivation was 'foodmiles', from this post, and there are a couple things we'll probably add in from that. People I happened to overhear in a cafe today were talking about foodmiles and seemed interested in incorporating the carbon footprint of exporting/importing food vs. growing it locally. My friend Megan also has lots of other ideas for things that firms commonly do McClick-style with the NLCD data.
There's more info on the about page, but suffice it to say we make full use of all the usual open-source GIS and science tools.

Monday, November 03, 2008

python interval tree

EDIT: added a couple points inline.

I'm obsessed with trees lately -- of the CS variety, not the plant variety. Although we are studying poplar, so I'll be using trees to study trees.
I'd tried a couple times to implement an interval tree from scratch following the Wikipedia entry.
Today I more or less did that in python. It's the simplest possible form. There's no insertion (though that's trivial to add); it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method.
The part of the Wikipedia entry that baffled me was about storing 2 copies of each node's intervals--one sorted by start and the other by stop. I didn't do that, as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer, since the non-leaf nodes hold some too). So I just brute-force each of those nodes and move to the next. I think that increases the worst case, but makes no difference in actual search time--with the benefit of halving storage.

EDIT: the version in my repo now keeps the intervals sorted by start, so it can avoid doing the brute-force search at a node when search.stop < node.intervals[0].start. This did improve performance.

The tree class takes a list of intervals and calculates a center point. From there it partitions them into left, overlapping, and right in terms of their relation to the center point. Overlapping are assigned to the current node, and left and right are recursively partitioned in that fashion until there are only `minbucket` intervals per node, or the specified `depth` has been reached AND there are fewer intervals than `maxbucket`. So a tree can have a greater `depth` than requested if it would otherwise have more than `maxbucket` intervals in a single node. The Wikipedia version doesn't have maxbucket or minbucket...

EDIT: the maxbucket actually only works on leaf-nodes, and has no effect otherwise.

I'm sure that's painfully obvious to anyone who's ever taken a CS course, but it was foggy at best for me until I implemented it. Below is the entire implementation:

class IntervalTree(object):
    __slots__ = ('intervals', 'left', 'right', 'center')

    def __init__(self, intervals, depth=16, minbucket=96, _extent=None, maxbucket=4096):

        depth -= 1
        if (depth == 0 or len(intervals) < minbucket) and len(intervals) > maxbucket:
            self.intervals = intervals
            self.left = self.right = None
            return

        left, right = _extent or \
               (min(i.start for i in intervals), max(i.stop for i in intervals))
        center = (left + right) / 2.0

        self.intervals = []
        lefts, rights = [], []

        for interval in intervals:
            if interval.stop < center:
                lefts.append(interval)
            elif interval.start > center:
                rights.append(interval)
            else: # overlapping.
                self.intervals.append(interval)

        self.left = lefts and IntervalTree(lefts, depth, minbucket, (left, center)) or None
        self.right = rights and IntervalTree(rights, depth, minbucket, (center, right)) or None
        self.center = center


    def find(self, start, stop):
        """find all elements between (or overlapping) start and stop"""
        overlapping = [i for i in self.intervals if i.stop >= start
                                                and i.start <= stop]

        if self.left and start <= self.center:
            overlapping += self.left.find(start, stop)

        if self.right and stop >= self.center:
            overlapping += self.right.find(start, stop)

        return overlapping
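
A quick usage sketch (with a made-up Feature class; the tree only cares about .start and .stop attributes):

import random

class Feature(object):
    def __init__(self, start, stop):
        self.start, self.stop = start, stop
    def __repr__(self):
        return "Feature(%i, %i)" % (self.start, self.stop)

# 10K random features up to 4Kbp long on a 50Mbp "chromosome"
starts = random.sample(xrange(50000000), 10000)
feats = [Feature(s, s + random.randint(100, 4000)) for s in starts]

tree = IntervalTree(feats)
print tree.find(1000000, 1005000)   # everything overlapping that window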

The tree itself is only 45 lines of code. I had added a couple extra attributes so that searching could do fewer checks, but it only improved performance by ~15% and I liked the simplicity. One way to improve the search speed, and the distribution on skewed data, would be to sort the intervals at the top node, so they'd then be sorted for all other nodes. Then instead of using center = (left + right)/2, it could use the center point of the middle interval at each node. That would also allow short-circuiting the brute-force search at the top of the find method with something like:

if not (start > self.intervals[-1].stop and stop < self.intervals[0].start):
overlapping = [ .. list comprehension ]

But all told, that adds 5 or so lines of code. Oh, and depending on how it's used, it's between 15 and 25 times faster than brute-force search.

EDIT: I added the above check, but it can only do the 2nd comparison (stop < self.intervals[0].start), as the first is invalid given a very long interval. Regarding speed: the smaller the search window, the better the performance improvement. The code is now >20 times as fast as brute force for a very large (speaking in terms of looking for genomic features) swath of 100K. With a search space of 50K, it's 50+ times as fast as linear search.
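
For reference, the find() with sorted intervals ends up looking something like this (a sketch of the idea, not the exact code in the repo):

    def find(self, start, stop):
        overlapping = []
        # intervals are sorted by start, so nothing here can match if the
        # query ends before the first interval begins...
        if self.intervals and stop >= self.intervals[0].start:
            for i in self.intervals:
                # ... and once an interval starts past the query, stop looking.
                if i.start > stop: break
                if i.stop >= start: overlapping.append(i)

        if self.left and start <= self.center:
            overlapping += self.left.find(start, stop)
        if self.right and stop >= self.center:
            overlapping += self.right.find(start, stop)
        return overlapping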

The full code (including a docstring with a Homer Simpson quote) is in my google code repo. If I've made obvious mistakes or you have improvements, I'd be glad to hear them.

Saturday, October 25, 2008

twill with XHTML (not viewing HTML)

Since I couldn't find this anywhere, I'll add it here for those who have the same problem:

I was trying to test a website with twill and got this at the end of my traceback:

raise BrowserStateError("not viewing HTML")
BrowserStateError: not viewing HTML

After spending a bunch of time making sure that, yes, it was spitting out HTML, I figured out that it specifically means that twill (actually mechanize) doesn't like XHTML.

You can likely fix it by adding this at the top of the script:

b = twill.get_browser()
b._browser._factory.is_html = True
twill.browser = b

Presumably, there's a real reason that check is in place, but works-4-me...

Friday, October 24, 2008

appengine memcache memoize decorator

[NOTE: see the 2nd comment below about using a tuple as a key. better to just use pickle.dumps]
I've been playing with google appengine lately. I'm working on a fun, pointless side project. Here's what I came up with for a cache decorator that pulls from memcache based on the args, kwargs and function name if no explicit key is given. The code for creating a key from those is from the recipe linked in the docstring.
"""
a decorator to use memcache on google appengine.
optional arguments:
`key`: the key to use for the memcache store
`time`: the time to expiry sent to memcache

if no key is given, the function name, args, and kwargs are
used to create a unique key so that the same function can return
different results when called with different arguments (as
expected).

usage:
NOTE: actual usage is simpler as:
@gaecache()
def some_function():
...

but doctest doesn't seem to like that.

>>> import time

>>> def slow_fn():
...     time.sleep(1.1)
...     return 2 * 2
>>> slow_fn = gaecache()(slow_fn)

this run takes over a second.
>>> t = time.time()
>>> slow_fn(), time.time() - t > 1
(4, True)

this grabs from the cache in under .01 seconds
>>> t = time.time()
>>> slow_fn(), time.time() - t < .01
(4, True)

modified from
http://code.activestate.com/recipes/466320/
and
http://code.activestate.com/recipes/325905/
"""

from google.appengine.api import memcache
import logging
import pickle

class gaecache(object):
    """
    memoize decorator to use memcache with a timeout and an optional key.
    if no key is given, the func_name, args, kwargs are used to create a key.
    """
    def __init__(self, time=3600, key=None):
        self.time = time
        self.key = key

    def __call__(self, f):
        def func(*args, **kwargs):
            if self.key is None:
                t = (f.func_name, args, kwargs.items())
                try:
                    hash(t)
                    key = t
                except TypeError:
                    try:
                        key = pickle.dumps(t)
                    except pickle.PicklingError:
                        logging.warn("cache FAIL:%s, %s", args, kwargs)
                        return f(*args, **kwargs)
            else:
                key = self.key

            data = memcache.get(key)
            if data is not None:
                logging.info("cache HIT: key:%s, args:%s, kwargs:%s", key, args, kwargs)
                return data

            logging.warn("cache MISS: key:%s, args:%s, kwargs:%s", key, args, kwargs)
            data = f(*args, **kwargs)
            memcache.set(key, data, self.time)
            return data

        func.func_name = f.func_name
        return func
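
Per the note at the top, a simpler way to build the key is to skip the hash() attempt and always pickle, then hash the pickle so the key stays short and printable (memcache keys are limited in length and can't hold arbitrary bytes). Roughly (a sketch, not the code I'm running):

import hashlib
import pickle

def cache_key(f, args, kwargs):
    # pickle everything, then md5 it so the key is short and printable
    raw = pickle.dumps((f.func_name, args, sorted(kwargs.items())))
    return hashlib.md5(raw).hexdigest()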

Friday, October 03, 2008

genedex.fasta with numpy.memmap

EDIT: added job posting to comments.
I've been working a bit on genedex, and I'm still not happy with the way it stores features: a huge pickle of dictionaries where every dictionary is a 'feature' that looks like {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to search is by location--and that is _very_ fast, thanks to rtree--but there's no way to search by name or any other attribute, and an entire organism is loaded into memory at once. That part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy-backed interface to a simple db schema to allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py. That already supports Feature.upstream(), downstream(), etc. methods, and it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on the start and stop. I could use rtree to index the sqlite database, but I'd like to move away from the LGPL. Maybe this KDTree that's already in a scipy branch with a more permissive license. Then it could do both spatial and attribute queries...
That's all tinkering...

I also cleaned up the genedex.fasta module. The usage is nice, even if the implementation isn't entirely. A fasta file can look like:

>chr1
ATGTCGTCGGCCGC
GGGCCAAGA
CAACGGAGA

>chr3
ATGGAGGAGGCTGGCGAGCGG

>chr2
ATGGCGTGC
ACGGCGGCG
CGCATGTT
CGCCT

where a line starting with > is the name of the sequence ('chr1') and everything up until the next > is the sequence. The problem is the newlines: every time you want to look at chr1 basepairs 10 to 20, you have to find where the sequence starts and account for newlines. -- Actually, one should never do that, as Biopython and pretty much any library will take care of it for you. Pygr, for example, creates a new file something.fasta.pureseq which removes all newlines and labels and indexes where the starts and stops are. genedex.fasta.Fasta now does something similar; here's example usage on the file above ('123.fasta').

from genedex import Fasta
f = Fasta('123.fasta')
print f.keys()
print f['chr1'][9:20].tostring()
print f['chr1'][9:20]

after, the fasta file looks like this:

>chr1
ATGTCGTCGGCCGCGGGCCAAGACAACGGAGA
>chr3
ATGGAGGAGGCTGGCGAGCGG
>chr2
ATGGCGTGCACGGCGGCGCGCATGTTCGCCT

with the newlines removed--so it's still a valid fasta file (you can also send an argument to the constructor and it will create a new file)--and there's a new file 123.fasta.gdx that is a python pickle containing:
{'chr3': (45L, 67L), 'chr2': (73L, 105L), 'chr1': (6L, 39L)}
which indicates the start and stop positions of each sequence in the file.
So the file remains a valid fasta file, but now it can be efficiently sliced. For now, it actually uses a numpy memmap (numpy.memmap) to take advantage of the broadcasting; other than that, it'd be simpler to just use python mmap. So when it sees f['chr1'][10:20], it acts just like it's indexing a numpy array, but it's accessing the data directly from the disk (well, not exactly, but mmap works its magic and I don't have to think about that). I like that--I can keep my fasta files valid, add only a small python pickle file, and get simple, fast, pythonic indexing into them. It does take about 12 seconds to index and flatten the entire sorghum genome (629MB, ~660 million basepairs) on first view; after that first time, it's instantaneous.
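
The core idea is small enough to sketch (a toy version, not the genedex code; it assumes the fasta has already been flattened and the .gdx pickle written as above):

import cPickle
import numpy

def get_seq(fasta_path, name, i, j):
    # {'chr1': (6L, 39L), ...} -- byte offsets of each flattened sequence
    idx = cPickle.load(open(fasta_path + '.gdx', 'rb'))
    start, stop = idx[name]
    # map just that chromosome off disk as an array of single characters
    seq = numpy.memmap(fasta_path, dtype='S1', mode='r',
                       offset=int(start), shape=(int(stop - start),))
    return seq[i:j].tostring()

print get_seq('123.fasta', 'chr1', 9, 20)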
Anyway, the source is available and easy_install-able:

svn checkout http://genedex.googlecode.com/svn/trunk/ genedex
As always, I'll gladly take any improvements, bugs, enhancements.

Also, our lab at UCB is looking for another (full-time or close to it, on-site) programmer who knows some biology, perl, and hopefully some python. If you're interested, or know anyone, send me an email. I have no real authority in the matter (or any matter) but I will have some say in this. I'd like to work with someone I can learn from. I'll add a link to the job posting in the comments below once it's posted.

Wednesday, August 06, 2008

choosing django

I prefer sqlalchemy and genshi (or mako) and was therefore looking at using turbogears, but I saw a demo of the django admin, and that sold me. Certainly the templating language did not. Before this, I'd only used web.py in my projects. These are the things I've liked/noted:


The first and most important: community. (Oddly enough, as I write this there are 666 projects tagged as django; 'turbogears', 'tg2', and 'tg' give fewer than 50 projects combined.) Think someone might have already written what you need? Yep.
Also, a great site: http://www.djangosnippets.org/, where I've learned a lot just by reading, and saved myself a lot of time, by extending ideas there.
And the development is active.

Second, django.contrib.*
  • User authentication is simple, and check google-code for various alterations on the theme.

  • admin. This was what first made me decide on django. And now, newforms-admin is in trunk. This gives you a pretty nice CRUD interface for the models in your app. In the app I've been working on, we have row _and_ field level permissions. We also need to let users edit certain fields of other users stored in the database, with which fields depending on the user viewing and the user being edited. This bit is more hacky than it should be, but quite simple.
    My biggest gripe about the admin is that it's too complicated to use custom widgets or validation. Hopefully, this will change.
    Oh, and it's too hard to have read_only privileges. (Yeah, I know the admin mantra).

  • GIS: nuf said.


i18n, t9n. Anywhere there's some text to be displayed in the app, I wrap it in ugettext_lazy (aliased to _), and later dump it and send it to someone who knows Chinese. When it comes back, I make the messages, and the app shows up in English or Chinese depending on browser preferences. Simple.
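
Marking the strings is about as simple as it gets (a toy snippet, not from the actual app):

from django.db import models
from django.utils.translation import ugettext_lazy as _

class Report(models.Model):
    title = models.CharField(_("title"), max_length=200)

def greeting():
    # evaluated lazily, so the active language at render time wins
    return _("Thanks for your report.")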


docs/help: yeah, the docs may have trouble keeping up, especially with the recent rate of change, but it's easy enough to find what you need, and the official django docs are pretty nice. And if you can handle the fact that most responses you'll get on #django will begin with "of course, ...", then it's a great place to get help.


ModelForms: I've just started using these, but, they've already saved me a lot of code.




There's a lot of other nice things, and a lot of django that I don't even know. I'd still consider myself a newb, but it's still possible to get sh*t done.

Friday, June 20, 2008

pylab matplotlib imagemap

UPDATE 7-10-08:
+ add example for scatter plot
+ link to ken-ichi
===
Figuring out how to make a client-side image map from a matplotlib image has stumped me more than once. Andrew Dalke does have a working example. Below, I have a minimal example.

It's simple once you get the steps right: use mpl's transform() to convert the data into the image's coordinate system, flip the y-axis as required by the imagemap, then do the normal imagemap stuff and save the html. The only real gotcha is to make sure to put the dpi in the call to savefig().

import pylab
import sys
import random

name = 'imap'

# make some fake data
xs = range(15)
ys = [random.choice(xs) for i in range(len(xs))]

xys = zip(xs, ys)

# can also use : f = pylab.subplot(121)
f, = pylab.plot(xs, ys, 'ro')
dpi = f.figure.get_dpi()
height = f.figure.get_figheight() * dpi

# convert the x,y coords into image coords.
transform = f.get_transform()
icoords = transform.transform(xys)


# the minimal 'template' to generate an image map.
tmpl = """
<html><head></head><body>
<img src="%s.png" usemap="#points" border="0">
<map name="points">%s</map>
</body></html>"""

# change this as needed, e.g. if not plotting points.
fmt = "<area shape='circle' coords='%i,%i,2' href='http://example.com/%i/%i' >"

# need to do height - y for the image-map
fmts = [fmt % (ix, height - iy, x, y) for (ix, iy), (x, y) in zip(icoords, xys) ]

# NOTE, this dpi is needed!
pylab.savefig('imap' + '.png', dpi=dpi)
print >> open(name + ".html", 'w'), tmpl % (name, "\n".join(fmts))

UPDATE:
When trying to figure how to do this for a pylab.scatter plot, I found Ken-ichi had also done this for a scatter plot.
As of matplotlib trunk revision 5711, the transform does not get set when the scatter plot is drawn, so I had to set it explicitly:

s = pylab.scatter(xs, ys)
s.set_transform(s.axes.transData)
transformed_xys = s.get_transform().transform(zip(xs,ys))

Wednesday, June 04, 2008

binary search over intervals

[EDIT: update location of code repo]

This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with.

Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order.
Interval search is more difficult. You're not looking for a single value, but for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So you have to arrange some protocol. In all the examples I've seen, including the explanation here, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search here, but couldn't figure out how to customize it.

Binary search is kind of a special case of interval search where intervals are of exactly length 0, i.e. start == stop. So what if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by looking for start - 2, and find the right-most index by searching for the (highest) index of the query's stop. That gives you the indices in the sorted array of all intervals that overlap the query. The lowest and highest correspond to python's bisect.bisect_left and bisect.bisect_right, respectively.
That carries over to any length. But what if the intervals are all different lengths? Well, then you can save the longest length, and for any search it's:
p_overlaps = intervals[bisect_left(start - max_len):bisect_right(stop)]
but that only gives the putative overlaps. Since we extended the left by max_len, we may have found an interval whose length is < max_len (meaning its stop is before the start of the query), so we have to explicitly test for distance:
real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]
which gives only the intervals that truly overlap. So that part is the price to pay for the simplified search. Another way, as suggested in the wikipedia article, is to store the length of each interval as part of the interval.
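
Putting those pieces together in plain python looks roughly like this (a sketch of the idea only; the real version is the cython one linked below):

import bisect

class IntervalSearcher(object):
    def __init__(self, intervals):
        # intervals are any objects with .start and .stop attributes
        self.intervals = sorted(intervals, key=lambda iv: iv.start)
        self.starts = [iv.start for iv in self.intervals]
        self.max_len = max(iv.stop - iv.start for iv in self.intervals)

    def find(self, start, stop):
        # candidates by start position only...
        lo = bisect.bisect_left(self.starts, start - self.max_len)
        hi = bisect.bisect_right(self.starts, stop)
        # ... then toss the ones whose stop is still left of the query.
        return [iv for iv in self.intervals[lo:hi] if iv.stop >= start]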

In this setup, the worst-case scenario is when a single looooong interval covers the entire range of the list of intervals. Then every search is a linear, brute-force search. However, my use for this is genomic data. There, I'll have an entire range of, say, 50 million, and the intervals (genes) are rarely longer than 4000 basepairs. So it's worth optimizing for simplicity.

My cython version of this is in my googlecode repo here. It has all the stuff I use: methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there:
1) pyrex/cython don't support closures
2) the left() method is confusing because the intervals are sorted by start, and left() has to find the nearest intervals by stop. That's where this setup increases the complexity.

On my machine, it creates a tree with 6,815 intervals in .016 seconds and does 50,000 searches on that tree in .14 seconds. It seems to scale well with the number of features: 50K searches on 68,150 features take .50 seconds. Adding an interval that covers the entire range of all other features (resulting in worst-case linear search) makes the 50K searches (on 6,815 intervals) take 1.54 seconds--which is only that good because the brute-force search in the method is pretty close to c-speed. This could be optimized by saving the stops in a separate array, or by saving long intervals in a separate array, but it's rare enough, and the code is simple enough as-is, that I'll probably leave it for now.

The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in bx-python. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger number of intervals, but that's rarely going to be an issue, it does take much less memory...

Wednesday, May 21, 2008

wherecamp

I agree with every report I've seen. Wherecamp was awesome. I've been telling people, and I'm still sure it's true, that I met exactly zero people who I'd physically seen before. Ordinarily, I avoid meetings, but this is a good format and seems to attract good people. It's fun to meet and work with people who are really into what they do. The talks are less "talky" and more like chat sessions--which is possible when the groups are small.

There was also plenty of time to hack, which was the original reason I went. During and after, I learned some simple things which I'm trying to incorporate into my usual workflow:
In the shell, background a job with "ctrl + z" then get back to it with %i where i is the number shown in the output from "jobs". That's a trick from jlivni.

From crschmidt, I added:
alias doctest="nosetests --with-doctest --doctest-extension=.txt"
to my .bash_aliases, which lets me do:
doctest tests/
or
doctest tests/test_somefile.txt
to run my doctests instead of python -c "import doctest;doctest.testfile('...')"

And springmeyer showed me a ton of django and geodjango. The admin stuff is just ... nice -- it's how making a db front end should be. I still don't know how to learn that stuff on my own; it seems like a lot of it you just have to know which modules to import, and the django book doesn't cover newforms or the new admin stuff as far as I can tell.

Quotes


On Friday night, we met up in SF to do some hacking; the hacking never went down, as I couldn't get wireless and it turned into more of a real bar trip. We were, however, talking about python. At one point it was sorta quiet, and out of the silence comes:
"Python sucks"
from a true lisp hacker in the next booth--complete with curly grey beard and spectacles. He actually turned out to be a cool guy; I think maybe he even admitted that if he couldn't use lisp, python would be a reasonable choice--I think that's about as much as you can expect from a lisper.

From crschmidt:
"I don't really know python that well"
Then who was it that basically rewrote featureserver between the hours of 2AM and 9AM when everyone else was sleeping?

Wednesday, May 14, 2008

slicehost, trac, wherecamp

I have a development "server" here beside me. It's actually a budget laptop that sold for $799 two years ago. It's a xubuntu machine hosting a trac instance, a development server for mapping stuff, postgresql/postgis, mapserver, mysql, a couple svn repos, anything I do for contracting, etc. Oh, and it's also hosting a couple of sites for the multi-national company that my gf works for! All of their servers are windows machines (long rant suppressed).
It used to get warm, so I propped it up on 4 tuna cans, 1 for each corner; now it stays cooler. Yep, it's a sweet setup.
Anyway, I pay AT&T or SBC --or whatever they are now called-- for static IP's and a supposedly faster internet connection. My 1 year contract for that is nearly up, so I'm switching to slicehost.

I'm not a sys-admin; I sorta do that for 4 gentoo (not my choice) machines at $work, and my strategy is to set up rsnapshot and never, ever emerge -u world. Ever. So far, it's mostly working. When I have the choice, I use (x/k)ubuntu; I don't care if they do magical stuff (or even that I had to redo my ssh keys today), it just works.

Anyway, I want something easy and idiot-proof. I'd heard the hype about slicehost when it came out and figured it was just that: hype. It's not. I've never used shared hosting before, but this was pretty simple. From entering my payment to ssh'ing as root into my slice took ~2 minutes. /proc/cpuinfo shows it's a machine with 4 dual-core opterons. I started with a 256 slice with back-up. You can start new slices and restore from the backups. That's cool, and less $$ than I pay AT&T for static IPs and faster uploads.

I ran a script to apt-get all the packages I use and had a base, working system in under an hour. They have a lot of articles about how to set stuff up, mostly basic (even for me), but I followed their info on setting up iptables. Predictably, I forgot to leave port 22 open and locked myself out of ssh, but they have a web-based console, so, not a problem. It seems to be idiot-proof...

Trac

Also in the theme of things that just work as they should: upgrading to Trac 0.11 (still in rc).

sudo easy_install -UZ Trac==0.11rc1
cd /path/to/trac/project/
sudo trac-admin . upgrade
sudo trac-admin . upgrade wiki
sudo apache2ctl restart

WhereCamp


I'll likely be at wherecamp, it'll be good to learn some stuff, and meet people I only know from IRC. If anyone needs a ride from the east-bay, let me know.

Thursday, May 08, 2008

openlayers, genomes and image-maps

In response to Titus' post on using imagemaps for genomic visualization:
Why are imagemaps so popular in genomics? As an extreme and unfair comparison, just imagine if http://maps.google.com were an image map.
Given a CGI script that can accept a url like
&start=1024&stop=2048&chr=3
and return an appropriate image, you can provide a substantial set of tools using openlayers, which is developed by what must be one of the largest and most active developer communities in GIS. (Yes, I am an openlayers fan-boy.)
You can do that with a small addition to openlayers, which I updated a couple weeks ago to OL version 2.6. In that update, I removed > 140 lines of code, so it's now even less of a change to OL. Maybe when 2.7 comes out, I'll figure out how to provide a patch that allows an extra argument to the OpenLayers.Map constructor to limit panning to the horizontal direction -- in which case genome-browser will cease to exist and only the single file containing OpenLayers.Layer.Genomic would be needed.

Maybe I need to make a real example of using openlayers for genomics, making better use of layers, the vector stuff, fractional zoom, markers, geocoding, projections, power steering, etc. to show how badass OL can be. As far as I know, I'm the only one using it for genomics, and 98% of the time I spent developing it was in my spare time--I only have a good excuse to hack on it at $work when something breaks. If more people were using it, I might be more motivated to figure out how to stack images vertically while still restricting scrolling to the horizontal--essentially allowing the same thing as "tracks" in other genome browsers. Maybe that's the killer feature.

And


An observation on the difference between the bio and geo programming environments as I interpret them:
There are some tools that (IMHO) are better in the geo world than in the bio world. Perhaps that's because of GDAL, the keystone for geo data formats and projections. Since any other software (in any SWIG-able language) that uses gdal can read pretty much any format and do I/O in any projection, geo developers can do things like write nice renderers or web-based map browsers rather than figuring out how to convert that MrSID image in epsg:4326 to a tif in epsg:900913. (I'm also a GDAL fan-boy.)

Contrast this to the bio world, where there's a bio-* project for nearly every common language--bio-java, bio-perl, bio-python, bio-ruby--each with its own blast parser, genbank parser, sequence objects, and alignment objects. There is no keystone, so there's more duplication of effort, and there's no parser that's used across languages as is the case in GIS. -- I'm not suggesting that shouldn't be the case, simply making an observation.

Monday, May 05, 2008

seqfind: levenshtein + bktree

I've copied this recipe that I modified before, and added the BK Tree structure in cython. It's in my repo here.
Check it out with:
svn checkout http://bpbio.googlecode.com/svn/trunk/seqfind

or easy_install with
sudo easy_install http://bpbio.googlecode.com/svn/trunk/seqfind


It's now using the Damerau-Levenshtein distance, which is more sensible for bioinformatics, where transpositions are frequent.

Bearophile's original implementation used a tuple, which made sense, but in Cython it's more efficient to use an object whose properties can be typed--a class is converted to a c-struct--so there is no conversion when appending to a python array, if I understand the generated c code correctly.

Using an object also allows arbitrary info to be passed along with the word when creating the tree. Again, this is important for bioinformatics, where the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like:

words = [Word("actgc ... acgtc", {'name': 'At2g26540'}), ...]
#and then create a tree in the same fashion as with raw strings:
tree = BKTree(words)
# and search returns a word object:
[(w.word, w.info['name']) for w in tree.find("atc", 2)]


It'll suffice to say (the obvious): it's a lot faster to search with the tree than to compare every word on every search.
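
For the curious, the BK-tree idea itself fits in a few lines of plain python (a rough sketch using vanilla levenshtein as the metric, not the seqfind cython code):

def edit_dist(a, b):
    # plain levenshtein as a stand-in metric; seqfind uses Damerau-Levenshtein
    prev = range(len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

class BKNode(object):
    def __init__(self, word):
        self.word = word
        self.children = {}   # edge distance -> child node

    def add(self, word):
        d = edit_dist(word, self.word)
        if d in self.children:
            self.children[d].add(word)
        else:
            self.children[d] = BKNode(word)

    def find(self, query, max_dist):
        d = edit_dist(query, self.word)
        hits = [(d, self.word)] if d <= max_dist else []
        # triangle inequality: only subtrees whose edge distance is within
        # max_dist of d can possibly contain a match.
        for edge, child in self.children.items():
            if d - max_dist <= edge <= d + max_dist:
                hits.extend(child.find(query, max_dist))
        return hits

root = BKNode("actgc")
for w in ("acggc", "aatgc", "ggggg"):
    root.add(w)
print root.find("actgg", 2)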

Wednesday, April 30, 2008

flash (n)back

The PNAS article linked here found that subjects could improve IQ with training. I've written a simple flash version of their protocol over a couple long evenings in haxe/flash.

The article methods list 2 stimuli: a moving box and spoken letters. The test subject is to respond (click, in my case) when the box position or the spoken letter is the same as it was 2 time steps ago, where 2 is increased as the subject gets better. I didn't do sound; I just show a big letter. Clearly, the logical thing to do is use it for a couple weeks and then implement the sound when I'm smarter. (I've never really used swfmill, but I think it'd be useful here...)

The article is ambiguous about when the letter is to sound; I've made both the letter and the box appear at the same time. The default, as in the article, is 3 seconds between events, showing the box for 0.5 seconds. I also add some indication of whether the answer was correct (green +) or not (red -). That actually makes things more difficult, as it's distracting to see something new flash on the screen. It also keeps a running total of misses (didn't click when should have), correct, and incorrect (clicked when shouldn't have). The grid size is set at 3 * 3, as that's more than difficult enough for me, and it appears to be what they used in the article. It actually only has 8 positions, since I use the center to display the letter. Another ambiguity in the article is the number of letters; I use 3. That's easily changeable in the code.

The length of the time step (time_step, 3000ms default), the amount of time to show the box and text (show_time, 500ms default) and the number of steps back (nback, 2 default) are settable via the url so the default equates to:
?time_step=3000&nback=2&show_time=500

Observations

1. It's freakin hard. I suck at it.
2. Haxe is nice, but jeez, I write ugly actionscript.
3. I like quick, pointless evening projects like this where I have a clear idea of the outcome. It's good for learning.
4. Whatever points there are against flash, it's easy to uh, "deploy".


Here's a live version of the app (untested in IE, but swfobject should do its thing). Just click in the flash movie when there's something that's the same as 2 time-steps ago.
You can make it arbitrarily hard by sending in parameters on the url, see the links in the page for examples.

And here's the code. Get it from svn via:
svn co http://bpgeo.googlecode.com/svn/trunk/nback

Sunday, April 27, 2008

levenshtein in cython

EDIT(2): fix markup for <char *> casts ... fix malloc (see comments. thanks Bao).
NOTE: using a kwarg for limit slows things down. Making it a required arg and using calloc for m2 speeds things up to nearly as fast as pylevenshtein.

Well, it seems to be popular to code up the levenshtein distance. I actually have a use for this and wanted to practice some Cython, so I've written a version. I used bearophile's recipe, wikibooks, this (from k4st on reddit), and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn).


cdef extern from "stdlib.h":
ctypedef unsigned int size_t
size_t strlen(char *s)
void *malloc(size_t size)
void free(void *ptr)
int strcmp(char *a, char *b)


cdef inline size_t imin(int a, int b, int c):
if a < b:
if c < a:
return c
return a
if c < b:
return c
return b


cpdef int levenshtein(char *a, char *b, int limit=100):
cdef int m = strlen(a), n = strlen(b)
cdef char *ctmp
cdef int i = 0, j = 0, retval
cdef int achr, bchr

if strcmp(a, b) == 0:
return 0

if m > n:
ctmp = a;
a = b;
b = ctmp;
#a, b = b, a
m, n = n, m

# short circuit.
if n - m >= limit:
return n - m

cdef char *m1 = <char *>malloc((n + 2) * sizeof(char))
cdef char *m2 = <char *>malloc((n + 2) * sizeof(char))

for i from 0 <= i <= n:
m1[i] = i
m2[i] = 0

for i from 0 <= i <= m:
m2[0] = i + 1
achr = a[i]
for j from 0 <= j <= n:
bchr = b[j]
if achr == bchr:
m2[j + 1] = m1[j]
else:
m2[j + 1] = 1 + imin(m2[j], m1[j], m1[j + 1])

m1, m2 = m2, m1

retval = m1[n + 1]
free(m2)
free(m1)
return retval

I believe that's correct (if not, let me know!); it matches output from other versions (which I used for testing), and it doesn't leak memory, so I must have done the malloc/free correctly despite my lack of C-fu.
Again, I'm not sure, but I think that's a pretty good example of how to mix python and C with cython, as it's not much longer than the other python versions and pretty readable. And it's very fast: it does 500K iterations of:
levenshtein('i ehm a gude spehlar', 'i am a good speller')
in 2.5 seconds.
bearophile's psyco-ed editDistanceFast took 3 minutes, 31 seconds to do the same (which is why I started this project...).

Literally just before I was about to post this, I found pylevenshtein, which is even faster--it does the 500K in 1.5 seconds. Ho hum. It's doing more, handling unicode, and apparently doing some optimizations... So, use that instead! -- and contribute a test-suite.
Next, I think I'll try a variant that allows transpositions (Damerau) and/or implement the BK-Tree, again cribbing from bearophile's recipe.


For reference, here are the contents of the setup.py used to build the module.

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(name='levenshtein',
      ext_modules=[Extension("levenshtein", sources=["levenshtein.pyx"])],
      cmdclass={'build_ext': build_ext})

Wednesday, April 23, 2008

numpy to tiff via gdal

EDIT: 7 months later I came back to this and found an error. The update is in the code below, with the old line commented out.

Rather than venting about a project I've recently decoupled myself from, I'll try to do something constructive... I also posted this to the gispython mailing list, but I've had to figure it out a couple times, so I'll put it here for the record. Given an N * N numpy array and a bounding box, it's actually fairly simple to make a georeferenced tiff:

from osgeo import gdal, gdal_array
import numpy
from osgeo.gdalconst import GDT_Float64

xsize, ysize = 10, 10
a = numpy.random.random((xsize, ysize)).astype(numpy.float64)

xmin, xmax = -121., -119.
ymin, ymax = 41., 43.


driver = gdal.GetDriverByName('GTiff')
# bad: out = driver.Create('a.tiff', a.shape[0], a.shape[1], 1, GDT_Float64)
# the args to Create are 'name', xsize, ysize. and .shape[0] is rows, which is y.
out = driver.Create('a.tiff', a.shape[1], a.shape[0], 1, GDT_Float64)

out.SetGeoTransform([xmin
, (xmax - xmin)/a.shape[0]
, 0
, ymin
, 0
, (ymax - ymin)/a.shape[1]])

gdal_array.BandWriteArray(out.GetRasterBand(1), a)

where the SetGeoTransform bit is (I believe) the same stuff you'd stick in a world file.

and plottable in pylab:

import pylab
tif = gdal.Open('a.tiff')
a = tif.ReadAsArray()
pylab.imshow(a)
pylab.show()

Has anyone checked out matplotlib's basemap recently? I just wish it didn't rely on geos < 3...

Friday, April 11, 2008

python script as wsgi, cgi, or standalone

EDIT:
See below for the original; I realized this could be done more cleanly with a decorator.
The decorator wrapplication takes the number of the port to use when called as a standalone server. The EMiddle class is unnecessary; it's just used as middleware to update the environ to show the request came via wsgi. If there's a cleaner way, let me know.

#!/usr/bin/python
import os

class EMiddle(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, env, start_response):
        env['hello'] = 'wsgi'
        return self.app(env, start_response)

def wrapplication(port):
    def wrapper(wsgi_app):
        if 'TERM' in os.environ:
            print "serving on port: %i" % port
            os.environ['hello'] = 'standalone'
            from wsgiref.simple_server import make_server
            make_server('', port, wsgi_app).serve_forever()

        elif 'CGI' in os.environ.get('GATEWAY_INTERFACE', ''):
            os.environ['hello'] = 'cgi'
            import wsgiref.handlers
            wsgiref.handlers.CGIHandler().run(wsgi_app)
        else:
            return EMiddle(wsgi_app)
    return wrapper

@wrapplication(3000)
def application(environ, start_response):
    start_response("200 OK", [('Content-Type', 'text/plain')])
    yield "How do you like the teaches of peaches?\n"
    yield "from " + environ['hello']




ORIGINAL VERSION:

If you write a script with an application entry point that fits the wsgi spec, it's simple to make it run via mod_wsgi, cgi, or a standalone server depending on the context. I believe this is common knowledge, but for my own reference, here's an example with the extra setup (which will work for any script) to do this:


#!/usr/bin/python

def application(environ, start_response):
    start_response("200 OK", [('Content-Type', 'text/plain')])
    yield environ['QUERY_STRING']

if __name__ == "__main__":
    try:
        from wsgiref.simple_server import make_server
        import sys
        port = int(sys.argv[1])
        print "server on port: %i" % port
        make_server('', port, application).serve_forever()
    except Exception, e:
        import wsgiref.handlers
        wsgiref.handlers.CGIHandler().run(application)

To change between wsgi and cgi toggle between
AddHandler cgi-script .py
AddHandler wsgi-script .py

For the stand alone server, just run it with an argument indicating the port:
python app.py 3000

to use the cherrypy server instead of the wsgiref, replace the make_server() line with:

from cherrypy import wsgiserver
server = wsgiserver.CherryPyWSGIServer(('0.0.0.0', port), [('/', application)], server_name='')
try:
    server.start()
except KeyboardInterrupt:
    server.stop()

That also handles the server shutdown more politely.

Monday, April 07, 2008

Genedex: query genomic features and sequence

Normally I don't write libraries; I figure people smarter than I am should do such things, and I should just use them. But I got tired enough of writing one-off scripts for genomic feature manipulation--find the upstream and downstream neighbors and get the sequence--and I saw enough of the pieces coming together, that I decided to build it. I'd complained before about how rtree didn't support 1D indices. Not only is this not a problem, it's beneficial: genomic features should have strand information, so that's the 2nd dimension, and since rtree does containment queries, it's simple to find only the features on a given strand. I realized this about the same time that the docstring for numpy's memmap went from 0 lines to about 100 and it was enhanced to take a filehandle, not just a filename. This means you can send a start position and a shape to the numpy.memmap constructor and it will create a numpy array of only that chunk--so it's possible to slice an unaltered fasta file using the numpy array syntax. That's very good.
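
The strand-as-second-dimension trick is easier to see with a tiny example (made-up feature ids and coordinates):

import rtree

idx = rtree.Rtree()
# bounds are (start, strand, stop, strand): feature 1 is 1234-3456 on the
# + strand, feature 2 is 2000-2500 on the - strand.
idx.add(1, (1234, 1, 3456, 1))
idx.add(2, (2000, -1, 2500, -1))

# everything overlapping 2000-4000 on the + strand only:
print list(idx.intersection((2000, 1, 4000, 1)))    # just feature 1
# both strands: widen the query to cover the full strand range.
print list(idx.intersection((2000, -1, 4000, 1)))   # features 1 and 2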

So, if you put those 2 simple things together, you have the start of something powerful. That's what I did. Then I gave it a crappy name: Genedex (Gendex was taken) and slapped it into googlecode. Check it out: http://code.google.com/p/genedex/. My only design goal was to keep it as simple as possible. If the list of features is under-whelming, that's good.

TDD


Also, I generally do TDD very half-assed, with asserts and maybe a couple doctests. However, I recently made fairly substantial changes to the SQLite datasource in featureserver and wrote this set of doctests while doing so. It works! and I've been using it. So I did what the featureserver (presumably crschmidt) devs did and copied the setup for the shapely doctests. It's pretty useful for design: I'd just write out the code for how I wanted the API to look and then implement it. The only thing is, the way doctests are used (at least by me) is to copy the output from executing the code into the doctest. So if your code is wrong to start with, you just copy the wrong answer into the doctest, and it's broken but the tests pass. But at least it's good for regressions, and I just had to remember not to blindly trust the output. That's true for all testing, but especially so for doctests.

So, there are now more tests than code. But since it's mostly just tying together pieces that do the real work, it's not much code. Doctests are also nice because (as the name suggests) they double as documentation. So, here's the genedex documentation:
http://genedex.googlecode.com/svn/trunk/doc/readme.html
It's pretty! It gets colored by pygments, using this script. The only major thing I'd like to add to the library is a plotting class using matplotlib, then smaller tasks like a method that takes 2 features and returns the sequence between them.
Any fixes, enhancements, ridicule, etc. will be greeted with commit access.

Sunday, April 06, 2008

comparative genomics with openlayers

Traditional genome browsers look like this. In fact, I think that's the most popular genome browser in use--gbrowse. They display information in tracks, so any layer of annotation just gets added to the bottom of the image (after making the image taller). That doesn't work for genome-browser, the hack of openlayers to support only horizontal scrolling, because if you have 2 adjacent tiles and one has more features than the other, there's no guarantee that they'll be the same height, and no guarantee that a feature appearing in both images will align correctly.

I was just hacking around, trying to test some work I'd done, and realized that you can have annotation layers with OpenLayers: just add another map and tie them together!


So that's 2 OpenLayers.Map() instances. What makes this easy is the new Map.panTo() method in OpenLayers 2.6 (which is in release candidate 1). The top map registers for 'move' and 'zoomend' events with callbacks that update the bottom map with the position/zoom of the top map.
That's it! And layers of annotation are available, along with the slippy map. OpenLayers continues to amaze.
That site with the linked maps is here.

Sunday, March 30, 2008

featureserver authentication

As the name implies, featureserver serves vector features in various formats from a number of datasources, including OGR--which means pretty much any vector format. That's extremely powerful. Really. It means, for instance, that when you're working on a really cool project and all anyone wants to know is whether they can see it in KML/Google Earth, it's no extra work: just point them to a REST-ful url like "http://example.com/featureserver/all.kml" and continue working on the cool project. Likewise for all.gml, .atom, etc. And if you have a project with spatial data, once you put it in a format that featureserver understands, it's displayable and editable in openlayers.

The next thing people want in a web application is some sort of user restrictions. In featureserver, by default, anyone can do any of the CRUD operations on any feature. I've been playing with a soon-to-be-open-sourced PPGIS (apparently the trendy acronym for that is now VGI) project where people can go to report sudden oak death. I want anyone to be able to report and view cases and add notes about existing cases, but only admins to be able to edit and delete existing reported cases. The simplest way is to use basic authentication in apache, but then anyone who goes to the site has to enter a user/password, and I think that really limits the public participation bit. If there's a way to require apache authentication for only some requests, let me know.

unrested development

Since it's possible to use featureserver as an API, you can make your own server and add authentication in python. You can do this with any framework that supports sessions; I've done it with the development version (0.3) of web.py, using its nice sessions support. That it supports intuitive syntax for GET, PUT, DELETE, and POST makes it a good fit as well. In the code below, the authentication-related urls are /login and /logout, but those are never needed unless updating or deleting an existing point. Anyone can create new features. All featureserver-related requests are made as with the original. Here's the wsgi script:

#!/usr/bin/python
import web
from FeatureServer.Server import Server
from FeatureServer.DataSource.SQLite import SQLite

urls = ( '/logout', 'logout'
        ,'/login', 'login'
        ,'/(.*)', 'features')

app = web.application(urls, globals())
session = web.session.Session(app
            , web.session.DiskStore('/tmp/sessions')
            , initializer={'authorized': False})

datasource = SQLite('fsauth', file="/tmp/fsauth.sqlite")
featureserver = Server({'fsauth': datasource })

application = app.wsgifunc()

class login(object):
    """for a real app, save usernames, hashed pws in the db"""
    def POST(self):
        pw = web.input(password=None).password
        user = web.input(user=None).user
        if (user == 'abc' and pw == '123'):
            session.authorized = True
            return '[authorized]'
        return '[NOT-authorized]'

class logout(object):
    def GET(self): session.kill()

class features(object):
    """all the featureserver routing"""
    path = "/" + datasource.name + "/" # fsauth
    format = "geojson"

    def GET(self, feature_id=''):
        if "." in feature_id:
            feature_id, self.format = feature_id.split(".")

        # get web.py parsed url
        path = self.path + feature_id
        data = dict(web.input().items())
        data['format'] = self.format

        format, rsp = featureserver.dispatchRequest(data, path, "", request_method="GET")
        web.header('Content-type', format)
        return rsp

    def PUT(self, feature_id=None):
        return self.POST(feature_id, "PUT")

    def DELETE(self, feature_id=None):
        if "." in feature_id:
            feature_id, self.format = feature_id.split(".")
        # cant delete unless authorized.
        if not session.authorized:
            web.header('Content-type', "text/plain")
            return "not logged in"
        path = self.path + feature_id
        data = dict(web.input().items())
        data['format'] = self.format
        format, rsp = featureserver.dispatchRequest(data, path, "", request_method="DELETE")
        web.header('Content-type', format)
        return rsp


    def POST(self, feature_id=None, method="POST"):
        if feature_id is None: return []
        if "." in feature_id:
            feature_id, self.format = feature_id.split(".")
        # must be an admin to do something with an existing feature.
        if not session.authorized:
            if not feature_id in ('new', 'create'):
                return 'not logged in'
        e = web.ctx.environ
        post_data = e['wsgi.input'].read(int(e['CONTENT_LENGTH']))
        path = self.path + feature_id
        format, rsp = featureserver.dispatchRequest({'format':self.format}, path, "", post_data=post_data, request_method=method)
        web.header('Content-type', format)
        return rsp

That'll be called from OpenLayers, but for a demo--since everyone else is using curl, which supports cookies:

FS="http://localhost/fsauth/"

echo "\n\nfirst borrow a feature. "
curl --url "http://featureserver.org/featureserver.cgi/scribble/35.geojson" > 35.geojson

echo "\nsee the features ... "
curl --url $FS

echo "\n\nadd a feature (no auth required.)"
curl -d @35.geojson --url "$FS/create.geojson"

echo "\n\nsee the features ... "
curl --url $FS

echo "\n\ntry to delete ... but cant"
curl -s -X DELETE $FS/1

echo "\n\nlogin ... \n"
curl -s --cookie-jar "cookies.txt" -d "password=123&user=abc" --url $FS/login > /dev/null

echo "\n\nthen delete ... "
curl -s -X DELETE -b "cookies.txt" $FS/1 > /dev/null

echo "\n\nsee the empty features ... \n"
curl --url $FS

echo "\n\nreturn the borrowed feature. thanks. :-) "
curl -s -X PUT -d @35.geojson --url "http://featureserver.org/featureserver.cgi/scribble/35.geojson"

The important point there is that the DELETE fails until the user is logged in. I'm pretty sure adding authentication makes it un-REST-ful, but ??? Anyway, I won't lose any sleep over it.

Tuesday, March 25, 2008

reading code

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric.
So, I downloaded Ulysses.
wc shows it to have 267,235 words. Some googling says the average person can read 250 words per minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours, but... good enough.
So on the rice fasta file I ran:
grep -v ">" rice.fasta | wc -c
to get rid of the 12 header lines (1 per chromosome) and count only sequence (should be accurate to within 12 characters, counting the extra newlines). That gives 372,077,765 characters. The average word size in Ulysses is 5; I rounded up to 6. So the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words.
So, at 250 words per minute, it'd take:
62,012,960 / 250 / 60 = 4,134 hours to read the rice genome. That's 172 days. Also, from what I know, the plot is hard to follow.
Genome size varies widely among plants. I have a couple ideas for pointless visualizations of this...

Monday, March 24, 2008

point partitioning

After you spend all day banging your head against your own problems, sometimes it's just nice to bang it on something else for a bit.
This question came through on the postgis mailing list and it seemed like a good diversion. I think it's a very clear description of the problem. To quote:
I have around 300,000 points, each with a lat, lon and altitude (also converted to geometry). I need to get a subset of those points, where none of them are within 5m (or some other arbitrary distance) of each other. It doesnt matter which points get picked over another, as long as whatever the data set it creates, that none of the points are within that 5m radius and that relatively most of the points are used


So, I hacked up a quick solution. It's probably inefficient--deleting keys from a dict and removing entries from an rtree index--but it's easy to understand, and (without the plotting) it runs in about 2 minutes for the requested 300,000 points.
When plotting, the image looks like this:


and the code...


import rtree
import random
import pylab


dist = 500
index = rtree.Rtree()
points = {} # xy coords. use dict to delete without changing keys.
groups = {} # store the grouped points. all keys are > dist apart

# create some random points and put them in an index.
for i in range(3000):
    x = random.random() * 10000
    y = random.random() * 10000
    pt = (x, y)
    points[i] = pt
    index.add(i, pt)

print "index created..."

while len(points.values()):
    pt = random.choice(points.values())
    print pt
    bbox = (pt[0] - dist, pt[1] - dist, pt[0] + dist, pt[1] + dist)

    idxs = index.intersection(bbox)
    # add actual distance here, to get those within dist.

    groups[pt] = []
    for idx in sorted(idxs, reverse=True):
        delpt = points[idx]
        groups[pt].append(delpt)
        index.delete(idx, delpt)
        del points[idx]

# groups contains keys where no key is within dist of any other pt.
# the values for a given key are all points within dist of that point.

for pt, subpts in groups.iteritems():
    subpts = pylab.array(subpts)
    pylab.plot(subpts[:,0], subpts[:,1], 'k.')
    pylab.plot([pt[0]], [pt[1]], 'ro')

pylab.show()

Tuesday, March 18, 2008

OGR python projection

OGR Projection



If you're using shapely and you need to do projections, you'll either have a lot of boilerplate or a function like this one. Actually, even in OGR, there's a lot of boilerplate involved in transforming...



from osgeo import ogr
from shapely.wkb import loads

def project(geom, to_epsg=900913, from_epsg=4326):
    """utility function to do quick projection with ogr,
    to and from shapely objects

    >>> from shapely.geometry import LineString
    >>> l = LineString([[-121, 43], [-122, 42]])
    >>> lp = project(l, from_epsg=4326, to_epsg=26910)
    >>> lp.wkt
    'LINESTRING (663019.0700828594854102 4762755.6415722491219640, 582818.0692490270594135 4650259.8474613213911653)'
    """

    to_srs = ogr.osr.SpatialReference()
    to_srs.ImportFromEPSG(to_epsg)

    from_srs = ogr.osr.SpatialReference()
    from_srs.ImportFromEPSG(from_epsg)

    ogr_geom = ogr.CreateGeometryFromWkb(geom.wkb)
    ogr_geom.AssignSpatialReference(from_srs)

    ogr_geom.TransformTo(to_srs)
    return loads(ogr_geom.ExportToWkb())

import doctest
doctest.testmod()

Sunday, March 16, 2008

spatially explicit metapopulation models in scipy.

Making Pretty Pictures


I started to learn to program about 5 years ago running population ecology models in mathematica. Yesterday, I found an old mma notebook with a model modified to include differential parasitoid dispersal to adjacent host cells depending on the host density in those cells. It's basically a discrete-time Nicholson-Bailey model, but in a grid of cells, where each cell contains a population of hosts (H) and parasitoids (P) that give birth, die, eat, and get eaten according to the NB model. Each generation, following birth/reproduction/predation, the hosts and parasitoids disperse. The hosts disperse equally to the 8 surrounding cells in their neighborhood. The parasitoids can move irrespective of host densities when the aggregation parameter (eta) is 0. When aggregation is 1 (eta == 1), the parasitoids move to adjacent cells in exact proportion to the host densities in each of the surrounding cells. muH and muP (I'm too lazy to figure out how to write the symbols for mu) determine the proportion of individuals in each population that disperse. I no longer have mathematica, but I downloaded all 96 megabytes of the reader to open the model:

I'm no longer a big fan of mathematica, but it does make it easy to use funky characters and show equations nicely.
There, the ListCorrelate does the dispersal. It's a convolution that "smears" out the populations according to the kernels, which are defined by muH, muP, and (for the parasitoid) eta.
It took literally 10 minutes to translate that to numpy/scipy.
I'll paste that code at the end.
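As a tiny illustration of the dispersal step on its own (not the full model--the numbers here are made up), a 3x3 host kernel keeps 1 - muH of the population in place and spreads muH evenly over the 8 neighboring cells:

import numpy
from scipy.signal import convolve2d

muH = 0.2
Hkern = numpy.ones((3, 3)) * muH / 8.  # the 8 neighbors split the dispersers
Hkern[1, 1] = 1.0 - muH                # the center cell keeps the rest

H = numpy.zeros((5, 5))
H[2, 2] = 100.                         # 100 hosts in a single cell
H = convolve2d(H, Hkern, mode="same", boundary="wrap")
# now H[2, 2] == 80.0, each neighbor holds 2.5, and the total is still 100.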
The cool thing is that you can see complex spatial patterns arise within the gridded metapopulation, even when the sum of the densities across cells is constant. So, running the model, the time series (top) and the phase portrait (bottom) are plotted below.

The time series is the average density of hosts and parasitoids across the metapopulation. It takes a while to stabilize, but they oscillate down to an equilibrium. The phase portrait just plots the mean density of hosts vs. parasitoids for each generation. The generations are discrete, so the lines are merely to show the trajectory. It's hard to see at the scale of the time series, but the parasitoid cycles lag those of the host by about a generation.
But the model is running in a spatially explicit manner, and averaging loses all of that spatial data. So, starting at 200 generations, the program saves a snapshot of the grid every third generation, with higher host densities in white and lower values in black. After the model run, there's a directory with a bunch of images. It'd be nice to play them as a movie... ImageMagick's convert is perfect for this:

convert -delay 40 -loop 0 images/*.png metapop.gif

Where that command creates a nice little (4MB) animated gif:
metapop.gif
That shows the spatiotemporal dynamics of the host population.
That's a pretty simple way to incorporate time into a visualization when you're otherwise limited to 2 dimensions... Maybe I'm too easily amused, but I could watch that all day.
Anyway, here's the hastily translated code:

import os

import numpy
import pylab
from scipy.signal import convolve2d

def doplot(hm, pm):
    pylab.plot(range(gens), hm, 'b')
    pylab.plot(range(gens), pm, 'g')
    pylab.ylim(0, pm.mean() + hm.mean())
    pylab.legend(('hosts', 'parasitoids'))
    pylab.subplot(212)
    pylab.plot(hm, pm, 'r')
    pylab.plot(hm, pm, 'k.')
    pylab.xlim(0, 1.3 * numpy.mean(hm) + 5)
    pylab.ylim(0, 1.3 * numpy.mean(pm) + 5)
    pylab.xlabel('hosts')
    pylab.ylabel('parasitoids')
    pylab.show()

def metapop(H0, P0, a=0.05, l=3., K=1000, muH=0.2, muP=0.8
            , Hrange=1, Prange=1, eta=3, size=32, gens=1000
            , mode="same", boundary="wrap"):
    E = numpy.e
    Hmeta = []
    Pmeta = []
    Pkern = numpy.ones((2 * Prange + 1, 2 * Prange + 1)
                       , dtype=numpy.double)
    Hkern = numpy.ones((2 * Hrange + 1, 2 * Hrange + 1)
                       , dtype=numpy.double)
    # muH of the hosts disperse, split evenly over the surrounding cells;
    # the center cell keeps the rest.
    Hkern *= muH / ((2 * Hrange + 1) ** 2 - 1.)
    Hkern[Hrange, Hrange] = 1.0 - muH

    for gen in range(1, gens + 1):
        Hmeta.append(H0.mean())
        Pmeta.append(P0.mean())

        # poisson search.
        f = E ** (-a * P0)

        # predation, births
        H1 = l * H0 * f * numpy.exp(-numpy.log(l) * H0 * f / K)
        P1 = H0 * (1 - f)

        # simple movement between adjacent cells.
        H0 = 0.001 + convolve2d(H1, Hkern, mode=mode, boundary=boundary)

        # biased movement by the parasitoid according to the density of hosts
        # in adjacent cells.
        heta = H0 ** eta
        B = heta / convolve2d(heta, Pkern, mode=mode
                              , boundary=boundary)

        P0 = (1. - muP) * P1 + muP * B * convolve2d(P1, Pkern, mode=mode
                                                    , boundary=boundary)
        P0 *= P1.sum() / P0.sum()
        if gen > 200 and not gen % 3:
            pylab.figure(figsize=(4, 4))
            pylab.axes([0, 0, 1, 1])
            pylab.imshow(H0, cmap=pylab.cm.gray)
            pylab.xticks([])
            pylab.yticks([])
            pylab.savefig('images/%03i.png' % gen)
            pylab.close()

    return numpy.array(Hmeta), numpy.array(Pmeta)

if __name__ == "__main__":
    gens = 300
    size = 64
    if not os.path.exists('images'):
        os.makedirs('images')  # the grid snapshots get saved here
    H0 = 1000 * numpy.abs(numpy.random.randn(size * size).reshape(size, size))
    P0 = numpy.zeros_like(H0)
    P0[0, 0] = 1.
    P0[32, 32] = 1.

    pylab.close()
    pylab.subplot(211)
    hm, pm = metapop(H0, P0, gens=gens, a=0.02)
    doplot(hm, pm)

The variable names are ugly, but it's a start for something to play with. It's interesting to see the effects of parasitoid aggregation (eta) and of the parasitoid initialization--here, a single parasitoid is introduced at cells (0, 0) and (32, 32). Should anyone actually use this, I have a derivation of the equations and a better explanation of the parameters available.

Tuesday, March 11, 2008

rtree: know your nearest neighbors

My computer spends a lot of time looking for neighbors of a given location--even more so for bio than for geo. This is what I've learned so far about the options for doing smarter searches.

SELECT * FROM (
    (SELECT * FROM (
        (SELECT * FROM feature WHERE start <= ? ORDER BY start DESC LIMIT 1)
        UNION
        (SELECT * FROM feature WHERE start >= ? ORDER BY start LIMIT 1)
    ) AS u)
    UNION
    (SELECT * FROM (
        (SELECT * FROM feature WHERE stop <= ? ORDER BY stop DESC LIMIT 1)
        UNION
        (SELECT * FROM feature WHERE stop >= ? ORDER BY stop LIMIT 1)
    ) AS v)
) AS w
ORDER BY ABS((start + stop)/2 - ?) LIMIT 1

If you fill in each ? with an integer location, that query will return the closest feature most of the time. It's verbose if not ugly, and that's only for 1 dimension. It can return the wrong feature in certain cases.... You have to write it like that in MySQL because it doesn't support functional indexes, so as soon as you do something like:
ORDER BY ABS((start + stop)/2 - ?)
it's no longer an indexed search.
It's a hard problem, even if you're using postGIS. And even if you're a postGIS badass.
Other than postGIS, there are postgres's built-in geometric types. Even so, for most things, using SQL makes me feel far from the data.
Generally, I pull my data into a numpy array and use slicing to get to a region of interest:

feats = all_feats[(all_feats['bpmax'] > bpmin) & (all_feats['bpmin'] < bpmax)]

That's usually pretty fast. It can be even faster with numexpr, and I suppose it could also be done with indexed searching in pytables pro. Still, a windowed search like that doesn't guarantee you get the nearest neighbor, you still have to pare the results down and do some arithmetic to figure out the nearest gene, and it can be the bottleneck.
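For what it's worth, here's a minimal sketch of that same window filter pushed through numexpr--the structured array and the variable names are made up for illustration:

import numpy
import numexpr

# a made-up structured array standing in for the real feature table
all_feats = numpy.zeros(1000000, dtype=[('bpmin', 'i8'), ('bpmax', 'i8')])
bp_lo, bp_hi = 150000, 180000

starts = all_feats['bpmin']
stops = all_feats['bpmax']
# numexpr picks the variables up by name and evaluates the whole expression
# in a single pass over the arrays.
mask = numexpr.evaluate("(stops > bp_lo) & (starts < bp_hi)")
feats = all_feats[mask]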
For bioinformatics libraries, there's fjoin, which does indexing but doesn't seem to be much of a library (plus the indentation is messed up in the file). There's also pygr, which presumably would be perfect for this, if I could understand it. I keep looking at pygr occasionally, but it just doesn't sink in yet.

This mention of rtree and nearest neighbor reminded me of this (see Sean's comment). Originally, I thought that between shapely, rtree, and pylab, you could have a pretty useful python-specific plottable, spatial data structure. rtree is a wrapper over spatialindex, which is pretty comprehensive, supporting Rtree, MVR tree, and TP Rtree (pdf). It's pretty low-level--you have to add each item to the index yourself--but that lets you do powerful things, like add an index to, uhhh, anything. As with most geo libraries, you have to hack it a bit to handle 1D shapes like genomics data, which only have a start and a stop, or maybe just a midpoint. But, I now have the simplest possible proof of concept to do stuff like this:

gstore = Gendex('genes')
for start, stop in features:
    gene = (start, stop)
    gstore.append(gene)
gstore.save()

and then to do nearest neighbor query:

gstore = Gendex('genes')
for loc in locations:
    neighbors = gstore.nearby(loc, 2)
    do_something(neighbors)

Ideally, it'd also allow for genomics-style upstream/downstream searches, with a syntax like:
gstore.nearby(loc, 2, direction=-1)
with -1 for upstream and 1 for downstream locations.
Gendex is a simple class that automatically creates a store of locations and indexes them. Rtree can already create a persistent index; this just adds a persistent pickle of the start, stop locations. It has a bit of syntax sugar for 1D, automatically setting the y-values to a small +/- offset, and it returns the coordinates via .nearby() instead of their indices from rtree's .nearest() search. Eventually, instead of just using a tuple location, it'll take a class with name, strand, chromosome, etc. attributes, but this is good enough for tinkering:

import os
import pickle

import rtree

class Gendex(rtree.Rtree):
    # class-level store of the raw locations; fine for a proof of concept.
    _store = []

    def __init__(self, *args, **kwargs):
        self.filename = args[0]
        super(Gendex, self).__init__(*args, **kwargs)

        if not 'overwrite' in kwargs and os.path.exists(args[0] + '.pkl'):
            self.load()

    def append(self, bds):
        self.__setitem__(len(self._store), bds)

    def __setitem__(self, i, bds):
        # if < 2 y values, search performs poorly.
        y = (i % 2) and 0.01 or -0.01
        if isinstance(bds, (list, tuple)):
            assert len(bds) == 2
            self.add(i, (bds[0], y, bds[1], y))
        else:
            self.add(i, (bds, y))
        self._store.append(bds)

    def __getitem__(self, i):
        if isinstance(i, int):
            return self._store[i]
        assert isinstance(i, (list, tuple))
        return [self._store[ii] for ii in i]

    def save(self):
        f = open(self.filename + '.pkl', 'wb')
        pickle.dump(self._store, f, -1)
        f.close()

    def load(self):
        self._store = pickle.load(open(self.filename + '.pkl', 'rb'))

    def nearby(self, bds, n):
        # accept a single 1D location and give it a y value for the 2D index.
        if isinstance(bds, (int, float)):
            bds = (bds, 0.1)
        return self.__getitem__(self.nearest(bds, n))

According to hobu, the spatial index library supports 1-D; it just has to be exposed in the rtree python wrapper. I've started to look at this, but given my limited c/c++ skillz, that may require a lot of hand-holding.

A nice lightweight bio-informatics package for feature management could be the combination of:

that's it.
Pytables could work especially nicely if this gets implemented.

Tuesday, March 04, 2008

What a Shapely genome you have!

This might be a case of "if you have a really cool hammer, everything looks like a nail," but it was fun mixing tools from different disciplines.

After finding synteny, there's a bunch of paired genes whose neighbors are also pairs. Paired (homologous) genes have similar sequence because they have some function and can't change without loss of that function. Non-gene sequence between the paired genes is mostly randomized via mutation, deletion, etc. But there is non-gene sequence that is conserved between the genes. These CNSs--conserved non-coding sequences--are usually sites that bind stuff that regulates the expression of a gene.
That looks like this.

With one gene on the top and its pair below, both yellow. Pink lines in the foreground connect putative CNSs (similar sequences) between these genes. That the lines cross is bad. CNSs occur right at the level of noise, so even though a similar sequence occurs near both genes, it could be by chance. It is possible to reduce this noise using local synteny. In the figure, that means both ends of a line should be about equidistant from the end of either yellow gene, and lines that go diagonally across the image should be removed. The goal is to find the parallel lines and remove those that cross.

Luckily, there's a "bioinformatics" library that lets me write code like this:
for aline in pink_lines:
    for bline in pink_lines:
        if aline == bline: continue
        if aline.crosses(bline):
            aline.has_crossed.add(bline)
            bline.has_crossed.add(aline)
The .has_crossed is a set() that keeps track of which, and therefore how many, lines any given line crosses. Then it's simple to find lines that crossed a lot of others and remove them.
# sort with most crosses first
pink_lines = sorted(pink_lines, cmp=lambda a, b: cmp(len(b.has_crossed), len(a.has_crossed)))

for cline in pink_lines:
    if len(cline.has_crossed) > THRESHOLD:
        cline.remove = True
        for dline in cline.has_crossed:
            # remove cline from all the lines it crossed
            dline.has_crossed.discard(cline)

pink_lines = [pl for pl in pink_lines if not getattr(pl, 'remove', False)]
There will still be some crosses remaining after that, so the code loops through pink_lines again, finds the intersections, and removes lines based on the score of the hit and on how many crosses each has.
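A rough sketch of what that second pass could look like--assuming each pink line carries a score attribute from its underlying hit; the exact tie-breaking rule here is made up:

for aline in pink_lines:
    for bline in pink_lines:
        if aline is bline or not aline.crosses(bline):
            continue
        # drop the weaker hit; break ties by dropping the one with more crosses.
        if (aline.score, -len(aline.has_crossed)) < (bline.score, -len(bline.has_crossed)):
            aline.remove = True
        else:
            bline.remove = True

pink_lines = [pl for pl in pink_lines if not getattr(pl, 'remove', False)]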

That is the simplest case. It also has to remove lines (CNSs) that are in the intron of one gene but not in the intron of the other:

if acns.within(agene) and bcns.within(bgene):
    do_something_with_intronic_cns(acns, bcns)
elif acns.within(agene) != bcns.within(bgene):
    # bad! in only 1 intron
    remove(acns, bcns)
There are also cases where the lines of the CNSs don't cross, but the actual CNSs touch. So on one homeolog there might be a CNS from basepairs 920 to 936 and another from 922 to 937. We also want to get rid of those. So that'd be:
if cnsa.overlaps(cnsb):
    remove_either(cnsa, cnsb)
Finally, the CNSs themselves have to be syntenic relative to the gene--that is, if a CNS cns_a is 9000 basepairs upstream from gene a, and its homeolog, cns_b, is 12 basepairs upstream from gene b, that's likely not a syntenic pair. As with synteny dot plots, sometimes it's good to flip the subject homeolog to the y-axis (in the images above, the subject is the bottom gene) and keep the query (the top gene in the images above) along the x-axis. Then any hit between the x and the y gene can be drawn in the x-y space. For the genes above, that looks like this:

Starting with the top image: the thick blue line in the center is the gene itself. Any large green dots without a black dot over them are what the program called as CNSs. Red CNSs are on the wrong strand and are not considered. CNSs that overlap the grey--either along the x-axis or the y-axis--occur in the introns of other genes and are not considered. Green dots covering the blue gene are intronic CNSs. The thin blue line just extends the y=x line. The purple bowtie is what I use to enforce synteny (bowtie.contains(cns)). Anything outside of the bowtie is either close to the query homeolog and far from the subject, or vice-versa. Real CNSs should fall along the diagonal, and we chose an arbitrary maximum distance from the gene of 12,000 basepairs. It's easy to see the syntenic genes line up along the y=x blue line.
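For concreteness, here's a minimal sketch of how a bowtie test like that might look in shapely--the coordinates are the (query, subject) distances of a CNS from its genes, and the 2x slack ratio and the helper name are made up for illustration:

from shapely.geometry import Point, Polygon

max_dist = 12000  # the arbitrary 12,000 bp window

# two kite-shaped halves that pinch together at the gene (0, 0) and hug the
# y=x diagonal; together they form the bowtie.
downstream = Polygon([(0, 0), (max_dist, max_dist / 2.0),
                      (max_dist, max_dist), (max_dist / 2.0, max_dist)])
upstream = Polygon([(0, 0), (-max_dist, -max_dist / 2.0),
                    (-max_dist, -max_dist), (-max_dist / 2.0, -max_dist)])

def is_syntenic(query_dist, subject_dist):
    # keep a CNS only if its pair of distances falls inside the bowtie.
    pt = Point(query_dist, subject_dist)
    return downstream.contains(pt) or upstream.contains(pt)

# 9000 bp upstream of the query but only 12 bp upstream of the subject fails:
# is_syntenic(-9000, -12) -> False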

The bottom image, filled with lines of a bilious hue, visualizes the crosses (as in the code above). The line with the black stripes has been removed because it crossed > THRESHOLD other lines. After that removal, no other lines crossed, and after removing the non-syntenic genes outside the bowtie, only 20 CNSs remained.

After all that noise is removed, the original image looks like this in our real viewer:


and those are real CNSs with lovely parallel lines. This runs for thousands of homologous gene pairs and creates a list of CNSs. There are other complexities, e.g. when the pairs are on opposite strands, or when there are other syntenic genes in the up/downstream regions.

Other than the extra dimension, MultiLineString() works perfectly for a single gene with many CNSs. The fact that it integrates so well with numpy, and therefore matplotlib, is great for visualization. Compare that to making a map file just for a quick look at some data. Anyway, a fun project. Cheers to Sean and the makers of geos.

Thursday, February 28, 2008

flash, vi, fcsh

All my flash tinkering has been in VIM--no IDE, no XML, just actionscript. It's a little tough dealing with the adobe compiler, as it takes about 11 seconds to compile a large project like modestmaps on my machine. That's not good for a guess-and-check programmer. The static typing does catch some errors, though.

The worldkit project compiles instantaneously with mtasc (the predecessor to haxe)--likewise for the as2 branch of modestmaps. The flash compiler shell (fcsh) drops the compile time for as3 modestmaps to under 3 seconds, so I've added this to my .bash_aliases:

alias fcsh="/usr/bin/rlwrap /opt/src/flex2/bin/fcsh"

The rlwrap is there to get readline in the flash shell--meaning I can just press up-arrow to get the previous compile command; by default, one has to paste or type the entire command again.
With that, it's close to a reasonable workflow.

Wednesday, February 27, 2008

synteny mapping

Living in Synteny
I've been working on automating synteny mapping between any pair of genomes. Synteny is where there's a stretch of DNA or genes in some order on chromosomeA of organismX and, due to a shared evolutionary history, you can find a similar stretch of genes in order on chromosomeB of organismY. Often there are small losses and inversions, but between closely related organisms like man and mouse, there's still a lot of synteny.
Plants can undergo polyploidy, following which, a species can have 2 entire copies of its genome. Over time, much of the duplicated cruft is lost, and the homologous chromosomes diverge, but if the divergence is not too great, it's still possible (actually common) to find synteny within the genome of a single organism--as well as between organisms.
I've written my own algorithm to find synteny, which uses python sets and numpy array slicing to do the heavy lifting. It is quite clever [wink]. And it _almost_ works, but it's... sorta non-deterministic.
The output looks like this for Arabidopsis against itself:


That's compressed from a 3000px*3000px image, but it's 25 boxes for the 5 vs. 5 chromosomes, so above and below the diagonal should be mirror images, with:
+ the diagonal dark line in the center being where each gene matches itself
+ the dark patches in the center of each box being all the crap in the centromere of each chromosome.
+ the speckled dots all over being because genes have many relatives. and there's lots of frequently appearing transposons.
+ the red lines being what my algorithm finds as diagonals.
The problem is it either extends too far into the sea of dots, or it doesn't seed on what should be a diagonal, depending on the parameters. The human eye is very good at this, but it's difficult to make a computer do it. Anyway, I'd previously tried published synteny-finding programs and found them no better than mine, but I just found dagchainer, where DAG is directed acyclic graph. It's a bit fussy in that, given too many points, it will find spurious diagonals, but it's predictable and fast, and it's a simple matter to cull the points before sending them in. DAGChainer is published and public domain, and the code is readable (made me think maybe C++ ain't so bad). Also, it doesn't rely on protein sequence as many synteny programs do, which is important when dealing with poorly annotated genomes. I had some trouble with it originally and emailed the author, Brian Haas. He asked me to send my data set, so I sent the worst parts. He did some preprocessing, found some good parameters, ran it himself, and sent me the results in under a day, including a couple of explanatory emails back and forth. So now I'm building my processing scripts around dagchainer, and it's working out great. Brian is extremely enthusiastic and helpful. Must be another one of those crazy folks who enjoy what they do.

Tuesday, February 26, 2008

open source gis and flash maps part two

Mash Flap
I started looking into flash mapping stuff lately. For the patch I submitted to worldkit, Mikel gave me commit access! So now the svn version of worldkit can be compiled with mtasc by typing "make". I feel unreasonably proud of that, given that much of what I did was some global search-and-replace in VI, followed by reading and fixing mtasc compiler errors until they went away. It was good fun.
Michal Migurski saw in that same post that I mentioned modestmaps and gave me some good ideas on getting WMS going. I just figured out how to get that working and posted a message to their overly web2.0 forums. Hopefully someone with some real actionscript skillz will clean it up. A mapping library without a good WMS interface is much less useful for most of the stuff I do.

I haven't decided whether to use modestmaps or worldkit, or both. The time stuff that Mikel has done in worldkit is very cool, and I haven't really looked at it yet--but I have a time-based project starting soon. Most of my GIS projects rely on good, hi-resolution imagery and quality roads data, and it's nice if that's just "magically" included via one of the commercial map providers. That's a plus for modestmaps. Though I _still_ do not understand whether the modestmaps usage of google, yahoo, and microsoft imagery falls within the terms of use... especially since google starts sending "X" tiles instead of imagery after browsing with modestmaps for a while. Meanwhile, I keep using OpenLayers, because that's what sane people do.