Showing posts from September 19, 2010

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.

I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps only reads …