Showing posts from December 20, 2009

genome scrubber : mask repetitive sequence

This is to describe a simple tool I've made available (svn repo) for masking repetitive sequence.

The rice (Oryza sativa) version 5 sequence looks something like this when run through pyfasta info:

>1 length:43596771
>3 length:36345490
>2 length:35925388
>4 length:35244269
>6 length:31246789
>5 length:29874162
>7 length:29688601
>11 length:28462103
>8 length:28309179
>12 length:27497214
>9 length:23011239
>10 length:22876596

372.078M basepairs in 12 sequences
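For anyone without pyfasta handy, a summary like the one above can be approximated with a few lines of plain Python. This is only a minimal sketch and not how pyfasta does it; the names `fasta_lengths` and `format_report` are made up for illustration, and it assumes the sequence name is the first whitespace-delimited token of each header line.

```python
from io import StringIO


def fasta_lengths(handle):
    """Return {sequence_name: length} for a FASTA file-like object.

    Assumption: the name is the first whitespace-delimited token
    after '>' on each header line.
    """
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # sequence line: count its characters toward the current record
            lengths[name] += len(line)
    return lengths


def format_report(lengths):
    """Format lengths longest-first, followed by a total in megabases."""
    ordered = sorted(lengths.items(), key=lambda kv: kv[1], reverse=True)
    lines = [">%s length:%d" % (name, n) for name, n in ordered]
    total = sum(lengths.values())
    lines.append("")
    lines.append("%.3fM basepairs in %d sequences" % (total / 1e6, len(lengths)))
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy input standing in for a real genome file.
    demo = StringIO(">chrA demo\nACGT\nAC\n>chrB\nACGTACGTAC\n")
    print(format_report(fasta_lengths(demo)))
```

For a real genome you would pass an open file instead of the `StringIO` object, e.g. `fasta_lengths(open(path))` with `path` pointing at your own FASTA file.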

So it's not huge (still only roughly 1/10th the size of the human genome), but it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. These are sometimes mistakenly referred to as "junk DNA". That's not true, but they do make whole-genome analyses a pain, as the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command:
/usr/bin/blastall -…