(bloom) filter-ing repeated reads
In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.
Bloom Filters
A Bloom filter is a memory-efficient way of determining whether an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the Perl module Bloom::Filter. It has a nice API, seems very fast, and allows specifying the false-positive rate. As with any bloom filter, there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here.
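To give a feel for the interface, here's a minimal sketch using only the calls that appear in the full script further down (the Elf constructor takes an expected element count and an error_rate; membership is tested with `in` and elements are added with add). The capacity and example sequences here are made up for illustration:

from bloomfaster import Elf

# expect about 1 million elements; allow roughly 1 in 1000 false positives
bloom = Elf(1000000, error_rate=1e-3)

bloom.add("ACTGACTGACTG")
print("ACTGACTGACTG" in bloom)  # True; a bloom filter never gives a false negative
print("GGGGGGGGGGGG" in bloom)  # almost always False, but could be a false positive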
Big Data
It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: just read the records into a dict/hash structure and use that to find duplicates, then print out only the best or first or whatever. However, once you start getting "next-gen", this is less useful. For example, someone kindly reported a bug on my simple C++ filter because he had 84 gigs of reads and was running out of memory. And that is a bug: anything that's supposed to deal with next-gen sequences has to deal with that kind of scale.

As a side note, my friend/co-worker came up with an elegant solution for filtering larger files: split into 4 files based on the first base in the read (A, C, T, or G), then filter each of those files to uniqueness and merge. I like that approach, and as the file sizes grow it could be extended to separate into 16 files by reading the first 2 bases. But it does require a lot of extra file I/O.
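The splitting step of that side-note approach is easy to sketch. Something like the following (the output file names and the handling of reads that start with anything other than A/C/G/T are my own choices, not from the post) sends each 4-line FASTQ record to one of four files keyed by the first base of its sequence:

import sys

# one output file per possible first base
outs = dict((base, open("reads.%s.fastq" % base, "w")) for base in "ACGT")

while True:
    header = sys.stdin.readline()
    if not header:
        break
    seq = sys.stdin.readline()
    plus = sys.stdin.readline()
    qual = sys.stdin.readline()
    out = outs.get(seq[0])
    if out is None:
        continue  # e.g. reads starting with 'N'; ignored in this sketch
    out.write(header + seq + plus + qual)

for f in outs.values():
    f.close()

Since duplicate reads share their first base, they always land in the same bucket, so each of the 4 files can be filtered to uniqueness independently and the results simply concatenated.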
Just Do It
So, I wrote a simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically, only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tunable by the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads and the number of false positives in the bloom-filter sent to stderr.

usage looks like:

python fastq-unique.py < in.fastq > out.unique.fastq
On my machine, a particular run with a decent sized file looks like this:
$ time python fastq-unique.py < /usr/local/src/methylcode/data/dme/en.wt.1.fastq > unique.fastq
36754782 records in file
checking 2525155 potential duplicates in a python set
4699 false-positive duplicates in the bloom filter

real    11m31.077s
user    11m1.937s
sys     0m23.093s
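As a rough back-of-the-envelope check on those numbers (my arithmetic, not from the original post): the script does one bloom-filter lookup per record, so with a 1e-3 error rate the worst case would be on the order of 0.001 * 36,754,782 ≈ 36,755 false positives; the observed 4,699 is well below that, which makes sense since the filter is only partially populated for most of the pass.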
and here's the code:
""" | |
Reads a FastQ file from stdin and writes a file with unique records to sdout | |
usage: | |
%s < in.fastq > out.unique.fastq | |
see: http://hackmap.blogspot.com/2010/10/bloom-filter-ing-repeated-reads.html | |
""" | |
from bloomfaster import Elf | |
import collections | |
import sys | |
__doc__ %= sys.argv[0] | |
if len(sys.argv) > 1: | |
print sys.argv | |
print __doc__ | |
sys.exit() | |
records = sum(1 for _ in sys.stdin) / 4 | |
print >>sys.stderr, records, "records in file" | |
# say 1 out of 1000 is false positive. | |
bloom = Elf(records, error_rate=1e-3) | |
sys.stdin.seek(0) | |
readline = sys.stdin.readline | |
checks = [] | |
header = readline().rstrip() | |
while header: | |
seq = readline().rstrip() | |
readline(); readline() | |
if seq in bloom: | |
checks.append(seq) | |
bloom.add(seq) | |
header = readline().rstrip() | |
# now checks contains anything that could be a duplicate according to | |
# the bloomfilter. for some, they were false positives. | |
# for actual duplicated, just choose the first, but can also sort by quality. | |
sys.stdin.seek(0) | |
checks = frozenset(checks) | |
print >>sys.stderr, "checking %s potential duplicates in a python set" \ | |
% len(checks) | |
putative_false_positives = collections.defaultdict(int) | |
while True: | |
header = readline().rstrip() | |
if not header: break | |
seq = readline().rstrip() | |
plus = readline().rstrip() | |
qual = readline().rstrip() | |
# it appeared only once, so just print it. | |
if not seq in checks: | |
print "\n".join((header, seq, plus, qual)) | |
continue | |
# it appeared in the bloom-filter > 1x, so track and make sure not | |
# to print any others with same sequence. | |
putative_false_positives[seq] += 1 | |
if putative_false_positives[seq] > 1: | |
continue | |
print "\n".join((header, seq, plus, qual)) | |
false_positives = sum(1 for count in putative_false_positives.values() \ | |
if count == 1) | |
print >>sys.stderr, false_positives, "false-positive duplicates in the bloom filter" |