Tuesday, March 25, 2008

reading code

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric.
So, I downloaded Ulysses.
wc shows it to have 267,235 words. Some googling says the average person can read 250 words per minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough.
So on the rice fasta file I ran:
grep -v ">" rice.fasta | wc -c
to get rid of the 12 header lines (1 per chromosome) and count only sequence (should be within 12 characters, counting the extra newlines). That gives 372,077,765 characters. The average word size in Ulysses is 5; I rounded up to 6. So the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words.
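For anyone who wants the count without the newline fudge, here is a minimal sketch in Python. It assumes a plain-text FASTA; the function name count_sequence_chars and the two sample records are made up for illustration and are not rice data. Unlike wc -c, it leaves out line endings, and unlike grep -v ">", it only skips lines that start with ">".

```python
import io


def count_sequence_chars(handle):
    """Count sequence characters in a FASTA stream.

    Header lines (starting with '>') are skipped and line endings
    are not counted.
    """
    total = 0
    for line in handle:
        if line.startswith(">"):
            continue
        total += len(line.rstrip("\r\n"))
    return total


# Hypothetical two-record example, not real rice data.
sample = io.StringIO(">chr1\nACGTAC\nGT\n>chr2\nTTGA\n")
n_chars = count_sequence_chars(sample)
print(n_chars)  # 6 + 2 + 4 = 12
```

On a real file, something like `with open("rice.fasta") as fh: count_sequence_chars(fh)` should do the same job, one line at a time.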
So, at 250 words per minute, it'd take:
62012960 / 250 / 60 = 4,134 hours to read the rice genome. That's 172 days. Also, from what I know, the plot is hard to follow.
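The whole back-of-the-envelope chain fits in a few lines of Python. This is just a sketch that replays the numbers quoted above; the constant and function names are my own labels, not anything standard.

```python
WORDS_PER_MINUTE = 250  # reading speed quoted in the post
CHARS_PER_WORD = 6      # Ulysses average of 5, rounded up


def reading_hours(words, wpm=WORDS_PER_MINUTE):
    """Hours needed to read `words` words at `wpm` words per minute."""
    return words / wpm / 60


ulysses_hours = reading_hours(267_235)

rice_chars = 372_077_765
rice_words = rice_chars // CHARS_PER_WORD  # whole "words" only
rice_hours = reading_hours(rice_words)
rice_days = rice_hours / 24

print(f"Ulysses: {ulysses_hours:.1f} hours")
print(f"Rice: {rice_words:,} words, {rice_hours:,.0f} hours, {rice_days:.0f} days")
```

Swapping in a different character count or reading speed gives the same rough metric for any other genome.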
Genome size varies widely among plants. I have a couple ideas for pointless visualizations of this...

2 comments:

Ken-ichi said...

Haha, bring on the pointless visualizations! I wonder what a ribosome would make of Ulysses... No doubt it would glean more meaning from it than I could.

Megan D said...

What did people do before computer programming? Wow I did not know you had a blog. But you do know your sh*t
-meg