filtering paired end reads (high throughput sequencing)
NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.
I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps only reads that appear in both. Usage is something like:
Output
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim each with the same number of corresponding records that pass the filtering above.
Adaptors
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma delimited option -a--in a single run.
Script
It's not too pretty, but it does the job:
As always, let me know of any feedback.
I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps only reads that appear in both. Usage is something like:
ADAPTORS=GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT,GAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGT,GAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG pair_fastx_clip_trim.py --sanger -a $ADAPTORS -M 20 -t 28 -l 40 en.wt.1.fastq en.wt.2.fastqWhere the -a (adaptor) -M (length of adaptor match) -t (min quality threshold) and -l (min length after quality chops) options are copied directly from (and sent directly to) fastx toolkit. --sanger indicates that the reads have fastq qualities in the sanger encoding. If that option is not specified, qualities are assumed to be in illumina 1.3 format where the ascii offset is 64.
Output
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim each with the same number of corresponding records that pass the filtering above.
Adaptors
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma delimited option -a--in a single run.
Script
It's not too pretty, but it does the job:
As always, let me know of any feedback.
Comments
i believe fastQC finds over-represented sequences wether they are primers or not. for example it finds poly-A stuff left in the reads.
it often finds multiple primers (and potentially different primers for the _1 and _2 reads).
$ python fastq_pair_filter.py
File "fastq_pair_filter.py", line 29
% (FASTX_CLIPPER, a, M, fastq, "-Q 33" if sanger else ""))
^
SyntaxError: invalid syntax
Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 41, in main
fha, fhb = procs[0].stdout, procs[1].stdout
IndexError: list index out of range
$ python2.4 fastq_pair_filter.py
File "fastq_pair_filter.py", line 29
% (FASTX_CLIPPER, a, M, fastq, "-Q 33" if sanger else ""))
^
SyntaxError: invalid syntax
I posted the command itself on the above posts.
"$ python2.7 fastq_pair_filter.py"
Even with input files (paired end files) same error.
If possible Can you also putup usage options of the programm
Thanks
if (not fastqs and len(fastqs)) == 2:
let me know if you have any more problems.
"if (not fastqs and len(fastqs)) == 2:"
still have same error
$ python2.7 fastq_pair_filter.py
Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 41, in main
fha, fhb = procs[0].stdout, procs[1].stdout
IndexError: list index out of range
http://gist.github.com/588841
i edited the gist manually and put the new bracket in the wrong place in the gist and in my comment. the not should be out side the ()
No errors without inputs now
But with input has errors still:
$ python2.7 fastq_pair_filter.py -a ACCC,CGTA,CAGT,TTAG -t 24 SRR018062_1.fastq RR018062_2.fastq
Traceback (most recent call last):
File "fastq_pair_filter.py", line 96, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "fastq_pair_filter.py", line 22, in main
trim_cmd = "%s -t %i -l %i" % (FASTQ_QUALITY_TRIMMER, t, l)
TypeError: %d format: a number is required, not NoneType
Thanks for your time. Its seems working...
BTW how to do de-multiplex (separate barcodes) in paired-samples and place reads in seperate bins?.
for barcode splitting (which i havent done), i think you can use fastx directly:
http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage
It will be great if you can implement it. It will be complete package for PE reads for filtering, trimming and BC split.
However, in the end you still store them all in the 'seen' bitmap, right?
I don't think you need to do this, something like this pseudo code should do. Or am I missing something here?
Furthermore, I don't have a Fastq file handy at the moment, but if the headers occur in the files according to some order, you don't need an original file at all. But probably this is not the case.
"Traceback (most recent call last):
File "./pairedtrim", line 81, in
main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
File "./pairedtrim", line 46, in main
for ra, rb in gen_pairs(procs[0].stdout, procs[1].stdout):
File "./pairedtrim", line 36, in gen_pairs
assert not all(b), ("files not same length")
AssertionError: files not same length
fastq_quality_trimmer: writing nucleotides failed: Broken pipe
"
the command line I ran was:
./pairedtrim -t 18 -l 10 ../s11utegl.txt ../s12utegl.txt &
I did not use any of the adapter specifications because I am new to this and was a bit unclear if we had adaptor sequences. We did illumina paired end. Any help would be great!
use this script. I haven't updated the one in this post, will do so shortly.
https://github.com/brentp/methylcode/blob/master/bench/scripts/fastq_pair_filter.py
Also, the original untrimmed files were 7 g and 6 g and the files after using the script were 6.6 g and 5.8 g. That seems like what they should be but I am not sure.
Can you put the link again?
Thanks so much,
nike
the latest is here:
https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py
but there are problems with the way fastx trims adaptors (often it removes the entire read) that I havent' dealt with. The quality trimming and pairing stuff should work find though.
Also check out the scythe and sickle projects from the group and UC Davis.
thanks a lot,
I don't want to use the adaptors part, I'm interested in the quality and pairing control.
I have just started to use it,
I will tell you if it works or not, if you like :)
nike00
writing read1.fastq.trim and read2.fastq.trim
Is it unusual? What is the expected running time of this script on such data, and looking for 5 different adaptors?
My machine has 24 cores.
I have not been maintaining this and don't intend to do so in the near future.
There are multiple versions about (my fault) with various different bugs.
Thanks for this tool. I came here after checking Fastx_clipper tool box manually. To my understanding it did NOT work as expected/or praised.
For sure, it pulled out all those reads with adapter sequences. But, in addition, it also pulled out reads that are originally came from my genome of intrest. It was very obvious, when i align those reads (containing adatper as told by fastx tool kit) aginst NCBI_NT db or my genome of intrest.....
Following is the command i used to see what the fastx_clipper thinks as adapter containing reads....
fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -l 1 -d 60 -k -i ES003_carlos_1220_sequence_1.10K.fasta > mp1_index1containg_reads_onlyadapt
er.fa
i use -d 60 so i can pull entire read again to check if it a good job of finding adapters
Anything i am doing worng here? Or is it something been overlooked.....
I will appreciate your help.