binary search over intervals
[EDIT: update location of code repo]
This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with.
Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order.
Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search here, but couldn't figure out how to customize.
Binary search is kind of a special case of binary search where intervals are of exactly length 0, e.g. start == stop. So, if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by looking for start - 2 and find the right most index by searching for the (highest) index of start. That will give you the indicies in the sorted array of all intervals that overlap the query. The highest and lowest correspond to python's bisect.bisect_left, and bisect.bisect_right respectively.
This carries to any length. But, if all the intervals are different lengths. Well, then you can save the longest length, and then for any search, it's:
p_overlaps = intervals[bisect_left(start - max_len):bisect_right(stop)]
but that only gives the putative overlapping. Since we extended the left by max_len, and we may have found an interval whose length was < max_len (meaning its stop is before the start of the query) we have to explicitly test for distance:
real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]
which gives only the intervals that overlap. So, that part is the price to pay for the simplified search. Another way, as suggested in the wikipedia article is to store the length of each interval as part of the interval.
In this setup, the worst case scenario is when a single looooong interval covers the entire range of the list of intervals. Then every search is linear, brute-force search. However, my use for this is genomic data. There, I'll have an entire range of say... 50 million, and the intervals (genes) are rarely longer than 4000 in length (basepairs). So, it's useful to optimize simplicity.
My cython version of this is in my googlecode repo here. It has all the stuff I use, methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there:
1) because pyrex/cython don't support closures
2) the left() method is confusing because the intervals are sorted by start, and the left() has to find the nearest intervals by stop(). That's where complexity is increased because of this setup.
On my machine, it creates a tree with 6815 intervals in .016 seconds and does 50000 searches on that tree in .14 seconds. It seems to scale well with the number of features as 50K searches on 68150 features takes .50 seconds. Adding an interval that covers the entire range of all other features (results in worst-case linear search) makes the 50K searches (on 6185 intervals) take 1.54 seconds--which is only so good because the brute-force in method is pretty close to c-speed. This could be optimized by saving the stops in a separate array, or by saving long intervals in a separate array, but it's rare enough, and the code is simple enough as-is, that I'll probably leave it for now.
The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in bx-python. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger number of intervals, but that's rarely going to be an issue, it does take much less memory...
This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with.
Binary search is easy. Just look at the python std library implementation (and the C API version). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order.
Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here, that means storing not only the start, but the stop and often the center of every interval--and using them to do the search. That makes things considerably more complicated than binary search. I started with an implementation of interval search here, but couldn't figure out how to customize.
Binary search is kind of a special case of binary search where intervals are of exactly length 0, e.g. start == stop. So, if all intervals are of exactly length 2? Well, then you can sort by start, find the left-most index by looking for start - 2 and find the right most index by searching for the (highest) index of start. That will give you the indicies in the sorted array of all intervals that overlap the query. The highest and lowest correspond to python's bisect.bisect_left, and bisect.bisect_right respectively.
This carries to any length. But, if all the intervals are different lengths. Well, then you can save the longest length, and then for any search, it's:
p_overlaps = intervals[bisect_left(start - max_len):bisect_right(stop)]
but that only gives the putative overlapping. Since we extended the left by max_len, and we may have found an interval whose length was < max_len (meaning its stop is before the start of the query) we have to explicitly test for distance:
real_overlaps = [iv for iv in p_overlaps if distance(query_interval, iv) == 0]
which gives only the intervals that overlap. So, that part is the price to pay for the simplified search. Another way, as suggested in the wikipedia article is to store the length of each interval as part of the interval.
In this setup, the worst case scenario is when a single looooong interval covers the entire range of the list of intervals. Then every search is linear, brute-force search. However, my use for this is genomic data. There, I'll have an entire range of say... 50 million, and the intervals (genes) are rarely longer than 4000 in length (basepairs). So, it's useful to optimize simplicity.
My cython version of this is in my googlecode repo here. It has all the stuff I use, methods for left(), right(), upstream(), downstream(), nearest_neighbors(). Most of the searching work is in the binsearch_* functions -- I couldn't use the python ones. There are a couple of hacks in there:
1) because pyrex/cython don't support closures
2) the left() method is confusing because the intervals are sorted by start, and the left() has to find the nearest intervals by stop(). That's where complexity is increased because of this setup.
On my machine, it creates a tree with 6815 intervals in .016 seconds and does 50000 searches on that tree in .14 seconds. It seems to scale well with the number of features as 50K searches on 68150 features takes .50 seconds. Adding an interval that covers the entire range of all other features (results in worst-case linear search) makes the 50K searches (on 6185 intervals) take 1.54 seconds--which is only so good because the brute-force in method is pretty close to c-speed. This could be optimized by saving the stops in a separate array, or by saving long intervals in a separate array, but it's rare enough, and the code is simple enough as-is, that I'll probably leave it for now.
The "proper" way to do this is with an interval/segment tree, of which there's a very readable version in bx-python. If I'd found that earlier, I probably wouldn't have coded this... The tree is faster for larger number of intervals, but that's rarely going to be an issue, it does take much less memory...
Comments
1. Whether the data set is static.
2. Whether the intervals are disjoint.
1. Determines the type of data structure to use.
2. Determines whether a binary search is possible.
Let's assume that the data structure is static (and not argue what static means at this point). Then if the intervals are disjoint it is always possible to put one end-point in an array and do a binary search on the array. The width of an interval is irrevelant and need not be captured. The pragmatics are that when the search delta == 0, you are either at the interval queried are at most +-1 cell away in the array.
For overlapping intervals this issue is more complex. However, you can always generate a set of disjoint, half-open, intervals from a set of overlapped intervals, and can then construct a binary search array from this list. The number of disjoint intervals constructed from a given set of overlapped intervals is bounded by O(2N), where N == the number of original intervals.
I have developed an algorithm to do construct disjoint intervals from a set of overlapped intervals. Your posting was seen during my Google search for others who have solved this same problem. I hope that my comments are taken well and not viewed as criticism.
Wtire if you have questions.
art
for
2. i wanted to show that binary search is possible even if the intervals are not disjoint.
i've also updated the post to link to the latest version of my code
which i havent touched for some time.
i typically use the intersection stuff in bx-python instead.
if you see something wrong with my implementation, please let me know.
-brent