Showing posts from March 23, 2008

reading code

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric.
So, I downloadedUlysses.
wc shows it to have 267235 words. Some googling says the average person can read 250 words per - minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough.
So on the rice fasta file i ran:grep -v ">" rice.fasta | wc -cto get rid of the 12 header lines (1 per chromosome) and only count sequence (should be within 12 characters counting the extra new-lines). That gives 372,077,765 characters. The average word-size in ulysses is 5. I rounded up to 6. So, the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words
So, at 250 words per minute, it'd take:
62012960 / 250 / 60 = 4,134 hours to read the rice genome. That's 172 days. Also, from what I know, the plot is hard to follow.
Genome size varies wi…

point partitioning

After you spend all day banging your head against your own problems, sometimes it's just nice to bang it on something else for a bit.
This question came through on the postgis mailing list and it seemed like a good diversion. I think it's a very clear description of the problem. To quote:
I have around 300,000 points, each with a lat, lon and altitude (also converted to geometry). I need to get a subset of those points, where none of them are within 5m (or some other arbitrary distance) of each other. It doesnt matter which points get picked over another, as long as whatever the data set it creates, that none of the points are within that 5m radius and that relatively most of the points are used

So, I hacked up a quick solution. It's probably inefficient -- deleting keys from a dict, and removing entries from an rtree index. But, it's easy to understand, and (without the plotting) it runs in about 2 minutes for the requested 300000 points.
When plotting, the image looks …