
This isn't particularly advanced or clever; it's just a simple implementation--less code than anything else I could come up with.

Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order.
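The guessing game maps directly onto code. Here's a minimal sketch of that game as a binary search (the function name and the tuple return are my own, just for illustration):

```python
def guess(lo, hi, secret):
    """Play the guessing game by binary search.

    Returns (answer, number_of_guesses).
    """
    guesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        guesses += 1
        if mid == secret:
            return mid, guesses
        elif mid < secret:   # friend says "higher"
            lo = mid + 1
        else:                # friend says "lower"
            hi = mid - 1

print(guess(0, 100, 62))  # → (62, 3): 50, then 75, then 62
```

No secret between 0 and 100 takes more than 7 guesses, since each guess halves the remaining range.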

Interval search is more difficult. It's not just looking for a single value, but rather for all the intervals that overlap a given query interval. Also, you can't rely on sorting by start alone, because some intervals are longer than others, so when sorted by start, the stops may be out of order. So you have to adopt some extra bookkeeping. In all the examples I've seen, including the explanation here, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search here, but couldn't figure out how to customize it.
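A tiny example of why sorting by start isn't enough on its own (the intervals here are made up):

```python
# Sorted by start, but look at the stops:
intervals = [(1, 50), (3, 4), (10, 12)]
stops = [stop for start, stop in intervals]

print(stops)                    # [50, 4, 12] -- not in order
print(sorted(stops) == stops)   # False
```

Since the stops aren't sorted, you can't bisect on them directly, which is what forces the extra bookkeeping.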

Binary search is really a special case of interval search where the intervals are of exactly length 0, i.e. start == stop. So, what if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by searching for the query's start - 2, and find the right-most index by searching for the (highest) index of the query's stop. That will give you the indices in the sorted array of all intervals that overlap the query. The lowest and highest correspond to python's bisect.bisect_left and bisect.bisect_right respectively.
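A sketch of the fixed-length case using bisect (the list values and the closed-interval overlap convention are my assumptions):

```python
from bisect import bisect_left, bisect_right

L = 2  # every interval is exactly this long: (start, start + L)
starts = [1, 4, 6, 9, 13]            # sorted starts
intervals = [(s, s + L) for s in starts]

def overlapping(qstart, qstop):
    """All length-L intervals overlapping the closed interval [qstart, qstop]."""
    lo = bisect_left(starts, qstart - L)   # left-most start that could reach qstart
    hi = bisect_right(starts, qstop)       # one past the right-most start <= qstop
    return intervals[lo:hi]

print(overlapping(5, 8))  # → [(4, 6), (6, 8)]
```

Because every interval has the same length, both bounds come from bisecting the starts alone--no stops or centers needed.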

This carries over to any fixed length. But what if the intervals are all different lengths? Well, then you can save the longest length, and for any search it's:

p_overlaps = intervals[bisect_left(starts, query.start - max_len):bisect_right(starts, query.stop)]

but that only gives the putative overlaps. Since we extended the left bound by max_len, we may have picked up an interval whose length is < max_len (meaning its stop may still be before the start of the query), so we have to explicitly test for distance:

real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]

which gives only the intervals that actually overlap. So, that filter is the price to pay for the simplified search. Another way, as suggested in the Wikipedia article, is to store the length of each interval as part of the interval.
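Putting the two steps together, here's a minimal self-contained sketch of the whole scheme (the interval values and the distance function are my own illustration, not the Cython code):

```python
from bisect import bisect_left, bisect_right

# Intervals sorted by start; one long interval mixed in with short ones.
intervals = [(2, 40), (5, 8), (7, 12), (15, 16), (20, 31)]
starts = [start for start, stop in intervals]
max_len = max(stop - start for start, stop in intervals)  # 38 here

def distance(a, b):
    """0 if the closed intervals a and b overlap, else the gap between them."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def search(qstart, qstop):
    # Putative overlaps: widen the left edge by max_len...
    lo = bisect_left(starts, qstart - max_len)
    hi = bisect_right(starts, qstop)
    # ...then filter out the false positives the widening let in.
    return [iv for iv in intervals[lo:hi] if distance((qstart, qstop), iv) == 0]

print(search(14, 22))  # → [(2, 40), (15, 16), (20, 31)]
```

Note that (5, 8) and (7, 12) survive the bisect slice for the query (14, 22) but are dropped by the distance filter--exactly the price described above.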

In this setup, the worst-case scenario is a single looooong interval covering the entire range of the list of intervals. Then every search degrades to a linear, brute-force scan. However, my use for this is genomic data. There, I'll have an entire range of, say, 50 million, and the intervals (genes) are rarely longer than 4000 basepairs. So, it's worth optimizing for simplicity.

My cython version of this is in my googlecode repo here. It has all the stuff I use, methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there:

1) because pyrex/cython don't support closures

2) the left() method is confusing, because the intervals are sorted by start, but left() has to find the nearest intervals by stop. That's where this setup adds complexity.

On my machine, it creates a tree with 6815 intervals in 0.016 seconds and does 50,000 searches on that tree in 0.14 seconds. It seems to scale well with the number of features: 50K searches on 68,150 features take 0.50 seconds. Adding an interval that covers the entire range of all other features (forcing the worst-case linear search) makes the 50K searches (on 6815 intervals) take 1.54 seconds--which is only that fast because the brute-force method runs at close to C speed. This could be optimized by saving the stops in a separate array, or by keeping long intervals in a separate array, but the worst case is rare enough, and the code simple enough as-is, that I'll probably leave it for now.

The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in bx-python. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger numbers of intervals, but that's rarely going to be an issue, and this version does take much less memory...

## Wednesday, June 04, 2008
