rtree: know your nearest neigbhors

March 11, 2008

My computer spends a lot of time looking for neighbors of a given location-- even more so for bio, than for geo. This is what I've learned about the options for doing smarter search so far.


SELECT * from (
 (SELECT * FROM ((SELECT * FROM feature WHERE start<= ? ORDER BY start DESC  LIMIT 1) UNION (SELECT * FROM feature where start>= ?
ORDER BY start LIMIT 1)) as u)
 UNION
 (SELECT * FROM ((SELECT * FROM feature where stop<= ? ORDER BY stop   DESC  LIMIT 1) UNION (SELECT * FROM feature where stop>= ?
ORDER BY stop LIMIT 1)) as v)
) as w
ORDER BY ABS((start + stop)/2 - ?) LIMIT 1

if you fill in ? with an integer location, that query will return the closest feature most of the time. It's verbose if not ugly, and that's only for 1 dimension. It can return the wrong feature in certain cases.... You have to write it like that in MySQL, because it doesnt support functional indexes, so as soon as you do something like:
ORDER BY ABS((start + stop)/2 - ?)
it's no longer an indexed search
It's a hard problem, even if you're using postgis. And even if you're a postGIS badass.
Other than postGIS, there postgres's builtin geometric types. Even so, for most things, using SQL makes me feel far from the data.
Generally, I pull my data into a numpy array and use slicing to get to a region of interest:


feats = all_feats[(all_feats['bpmax'] > bpmin) & (all_feats['bpmin'] < bpmax)]

that's usually pretty fast. It can be even faster with numexpr. I suppose it could be done using indexed searching in pytables pro. Still doing a windowed search like that doesn't guarantee you get the nearest neighbor and you still have to pare it down and do some arithmetic to figure out the nearest gene. And, it can be the bottleneck.
For bioinformatics libraries, there's fjoin which does indexing, but doesnt seem much of a library (plus the indentation is messed up in the file). There's also pygr, which, presumably would be perfect for this, if I could understand it. I keep looking at pygr occassionally, but it just doesn't sink in yet.

This mention of rtree and nearest neighbor reminded me of this (see Sean's comment). Originally, I thought between shapely, rtree, and pylab, you could have a pretty useful python-specific plot-table, spatial data structure. It's a wrapper over spatialindex which is pretty comprehensive, supporting Rtree, MVR tree, TP Rtree (pdf). It's pretty low-level, you have to add each item to the index, but that lets you do powerful things, like add an index to uhhh, anything. As with most geo libraries, you have to hack them a bit to hand1e 1D shapes like genomics data, which only have a start and a stop, or maybe just a midpoint. But, I now have the simplest possible proof of concept to do stuff like this:


gstore = Gendex('genes')
for start, stop in features:
   gene = (start, stop)
   gstore.append(gene)
gstore.save()

and then to do nearest neighbor query:


gstore = Gendex('genes')
for loc in locations:
   neighbors = gstore.nearby(loc, 2)
   do_something(neighbors)

ideally, it'd allow for genomics upstream, downstream searches, with a syntax like:
gstore.nearby(loc, 2, direction=-1)
with -1 for upstream, and 1 for downstream locations.
Gendex is a simple class that automatically creates a store of locations and indexes them. Rtree can already create a persistent index, this just creates a persistent pickle of the start, stop locations. It has a bit of syntax sugar for 1D, automatically setting the y-values to 0.1 or -0.1 and returns the coordinates via .nearby, instead of their indicies from rtree's .nearest() search. Eventually, instead of just using a tuple location, it'll use take a class with name, strand, chromosome, etc. attributes but this is good enough for tinkering:


class Gendex(rtree.Rtree):
    _store = []

    def __init__(self, *args, **kwargs):
        self.filename = args[0]
        super(Gendex, self).__init__(*args, **kwargs)

        if not 'overwrite' in kwargs and os.path.exists(args[0] + '.pkl'):
            self.load()

    def append(self, bds):
        self.__setitem__(len(self._store), bds)

    def __setitem__(self, i, bds):
        # if < 2 y values, search performs poorly.
        y = (i % 2) and 0.01 or -0.01
        if isinstance(bds, (list, tuple)):
            assert len(bds) == 2
            self.add(i, (bds[0], y, bds[1], y))
        else:
            self.add(i, (bds, y))
        self._store.append(bds)

    def __getitem__(self, i):
        if isinstance(i, int):
            return self._store[i]
        assert(isinstance(i, (list, tuple)))
        return [self._store[ii] for ii in i]

    def save(self):
        f = open(self.filename + '.pkl', 'wb')
        pickle.dump(self._store, f, -1)
        f.close()

    def load(self):
        self._store = pickle.load(open(self.filename + '.pkl', 'rb'))

    def nearby(self, bds, n):
        if isinstance(bds, (int, float)):
            bds = (bds, 0.1)
        #return self.nearest(bds, n)
        return self.__getitem__(self.nearest(bds, n))

According to hobu, the spatial index library supports 1-d, it just has to be added to the rtree python wrapper. I've started to look at this, but given my limited c/c++ skillz, that may require a lot of hand-holding.

A nice lightweight bio-informatics package for feature management could be the combination of:

Pytables or a memmaped numpy recarray

rtree

pylab

that's it.
Pytables could work especially nicely if this gets implemented.

Comments

Logan Warner said…

Lovvely blog you have

2:46 AM

Search This Blog

Bioinformatics

rtree: know your nearest neigbhors

Comments

Popular posts from this blog

python interval tree

levenshtein in cython

filtering paired end reads (high throughput sequencing)