(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.

Bloom Filters

A Bloom filter is a memory-efficient way of determining whether an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the Perl module Bloom::Filter. It has a nice API, seems very fast, and allows specifying the false-positive rate. As with any Bloom filter, there's a trade-off between the amount of memory used and the expected number of false positives.
The code for that wrapper is in my github, here.
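For reference, a standard Bloom filter sized for n elements at false-positive rate p needs roughly m = -n * ln(p) / (ln 2)^2 bits, so a tighter error rate costs more memory. Here's a minimal sketch of how the wrapper is used (the element count and sequences are made up for illustration; the Elf class, its error_rate argument, add(), and the in operator all appear in the full script below):

from bloomfaster import Elf

# size the filter for ~1 million elements at a 0.1% false-positive rate
bloom = Elf(1000000, error_rate=1e-3)

bloom.add("ACTGACTG")
print "ACTGACTG" in bloom  # True: a Bloom filter has no false negatives
print "GGGGCCCC" in bloom  # almost always False (~0.1% chance of True)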

Big Data

It's a common request to filter out repeated reads from next-gen sequencing data; just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: read the records into a dict/hash structure, use it to find duplicates, and print out only the best or first record for each sequence. However, once the data gets "next-gen", this is less useful. For example, someone kindly reported a bug on my simple C++ filter because he had 84 gigabytes of reads and was running out of memory. And that is a bug: anything that's supposed to deal with next-gen sequences has to handle data at that scale.
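For modest files, that naive in-memory version is only a few lines. A minimal sketch, keeping the first record seen for each sequence and assuming the standard 4-line FASTQ layout:

import sys

seen = set()
readline = sys.stdin.readline
while True:
    header = readline().rstrip()
    if not header: break
    # each FASTQ record is 4 lines: header, sequence, '+', quality
    seq = readline().rstrip()
    plus = readline().rstrip()
    qual = readline().rstrip()
    if seq not in seen:
        seen.add(seq)
        print "\n".join((header, seq, plus, qual))

The set ends up holding every distinct sequence, which is exactly the memory problem described above.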

As a side note, my friend/co-worker came up with an elegant solution for filtering larger files: split the input into 4 files based on the first base of each read (A, C, T, or G), filter each of those files to uniqueness, and merge the results (a sketch of the splitting step is below). I like that approach, and as file sizes grow it could be extended to split into 16 files on the first 2 bases. But it does require a lot of extra file I/O.
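A rough sketch of just the splitting step, again assuming the 4-line FASTQ layout (the output file names are made up, and reads starting with anything other than A, C, G, or T are skipped here):

import sys

# one output file per first base
outs = dict((base, open("split.%s.fastq" % base, "w")) for base in "ACGT")

readline = sys.stdin.readline
while True:
    header = readline()
    if not header: break
    seq = readline()
    plus = readline()
    qual = readline()
    out = outs.get(seq[0])
    if out is not None:
        # lines still carry their newlines, so write them back verbatim
        out.write(header + seq + plus + qual)

for out in outs.values():
    out.close()

Each split.*.fastq file can then be filtered to uniqueness on its own, and the four results concatenated.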

Just Do It

So, I wrote a simple script that filters to unique FASTQ reads using a bloom filter in front of a python set. Basically, only sequences flagged as already seen by the bloom filter are added to the set. This trades speed (it iterates over the file 3 times) for memory; the amount of memory is tunable via the specified error rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It reads only from stdin and writes to stdout, with some information about total reads and the number of false positives in the bloom filter sent to stderr.
Usage looks like:
python fastq-unique.py < in.fastq > out.unique.fastq
On my machine, a run on a decent-sized file looks like this:
$ time python fastq-unique.py < /usr/local/src/methylcode/data/dme/en.wt.1.fastq > unique.fastq
36754782 records in file
checking 2525155 potential duplicates in a python set
4699 false-positive duplicates in the bloom filter
real 11m31.077s
user 11m1.937s
sys 0m23.093s

and here's the code:
"""
Reads a FastQ file from stdin and writes a file with unique records to sdout
usage:
%s < in.fastq > out.unique.fastq
see: http://hackmap.blogspot.com/2010/10/bloom-filter-ing-repeated-reads.html
"""
from bloomfaster import Elf
import collections
import sys
__doc__ %= sys.argv[0]
if len(sys.argv) > 1:
print sys.argv
print __doc__
sys.exit()
records = sum(1 for _ in sys.stdin) / 4
print >>sys.stderr, records, "records in file"
# say 1 out of 1000 is false positive.
bloom = Elf(records, error_rate=1e-3)
sys.stdin.seek(0)
readline = sys.stdin.readline
checks = []
header = readline().rstrip()
while header:
seq = readline().rstrip()
readline(); readline()
if seq in bloom:
checks.append(seq)
bloom.add(seq)
header = readline().rstrip()
# now checks contains anything that could be a duplicate according to
# the bloomfilter. for some, they were false positives.
# for actual duplicated, just choose the first, but can also sort by quality.
sys.stdin.seek(0)
checks = frozenset(checks)
print >>sys.stderr, "checking %s potential duplicates in a python set" \
% len(checks)
putative_false_positives = collections.defaultdict(int)
while True:
header = readline().rstrip()
if not header: break
seq = readline().rstrip()
plus = readline().rstrip()
qual = readline().rstrip()
# it appeared only once, so just print it.
if not seq in checks:
print "\n".join((header, seq, plus, qual))
continue
# it appeared in the bloom-filter > 1x, so track and make sure not
# to print any others with same sequence.
putative_false_positives[seq] += 1
if putative_false_positives[seq] > 1:
continue
print "\n".join((header, seq, plus, qual))
false_positives = sum(1 for count in putative_false_positives.values() \
if count == 1)
print >>sys.stderr, false_positives, "false-positive duplicates in the bloom filter"
