Showing posts from October 17, 2010

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.
Bloom FiltersA Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter. It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives.
The code for that wrapper is in my github, here.

Big DataIt's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/hash structure an…