(bloom) filter-ing repeated reads
In this post, I'll talk a bit about using a bloom filter as a pre- filter for large amounts of data, specifically some next-gen sequencing reads. Bloom Filters A Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter . It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here . Big Data It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer , and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/ha...