Tuesday, March 25, 2008

reading code

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric.
So, I downloaded Ulysses.
wc shows it to have 267235 words. Some googling says the average person can read 250 words per - minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough.
So on the rice fasta file i ran:
grep -v ">" rice.fasta | wc -c
to get rid of the 12 header lines (1 per chromosome) and only count sequence (should be within 12 characters counting the extra new-lines). That gives 372,077,765 characters. The average word-size in ulysses is 5. I rounded up to 6. So, the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words
So, at 250 words per minute, it'd take:
62012960 / 250 / 60 = 4,134 hours to read the rice genome. That's 172 days. Also, from what I know, the plot is hard to follow.
Genome size varies widely among plants. I have a couple ideas for pointless visualizations of this...

Monday, March 24, 2008

point partitioning

After you spend all day banging your head against your own problems, sometimes it's just nice to bang it on something else for a bit.
This question came through on the postgis mailing list and it seemed like a good diversion. I think it's a very clear description of the problem. To quote:
I have around 300,000 points, each with a lat, lon and altitude (also converted to geometry). I need to get a subset of those points, where none of them are within 5m (or some other arbitrary distance) of each other. It doesnt matter which points get picked over another, as long as whatever the data set it creates, that none of the points are within that 5m radius and that relatively most of the points are used

So, I hacked up a quick solution. It's probably inefficient -- deleting keys from a dict, and removing entries from an rtree index. But, it's easy to understand, and (without the plotting) it runs in about 2 minutes for the requested 300000 points.
When plotting, the image looks like this:

and the code...

import rtree
import random
import pylab

dist = 500
index = rtree.Rtree()
points = {} # xy coords. use dict to delete without changing keys.
groups = {} # store the grouped points. all keys are > dist apart

# create some random points and put them in an index.
for i in range(3000):
x = random.random() * 10000
y = random.random() * 10000
pt = (x, y)
points[i] = pt
index.add(i, pt)

print "index created..."

while len(points.values()):
pt = random.choice(points.values())
print pt
bbox = (pt[0] - dist, pt[1] - dist, pt[0] + dist, pt[1] + dist)

idxs = index.intersection(bbox)
# add actual distance here, to get those within dist.

groups[pt] = []
for idx in sorted(idxs, reverse=True):
delpt = points[idx]
index.delete(idx, delpt)
del points[idx]

# groups contains keys where no key is within dist of any other pt
# the values for a given key are all points with dist of that point.

for pt, subpts in groups.iteritems():
subpts = pylab.array(subpts)
pylab.plot(subpts[:,0], subpts[:,1], 'k.')
pylab.plot([pt[0]], [pt[1]], 'ro')