Comments on Bioinformatics: filtering paired end reads (high throughput sequencing)

Hi BrentP, Thanks for this tool. I came here after...

2011-11-14T14:27:21.161-08:00

Hi BrentP,
Thanks for this tool. I came here after checking the fastx_clipper tool manually. As far as I can tell, it did NOT work as expected or as praised.

For sure, it pulled out all the reads with adapter sequences. But in addition, it also pulled out reads that originally came from my genome of interest. This was very obvious when I aligned those reads (the ones containing adapter, according to the fastx toolkit) against the NCBI nt database or my genome of interest.

Following is the command I used to see what fastx_clipper considers adapter-containing reads:

fastx_clipper -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -l 1 -d 60 -k -i ES003_carlos_1220_sequence_1.10K.fasta > mp1_index1containg_reads_onlyadapter.fa

I use -d 60 so I can pull the entire read again and check whether it did a good job of finding adapters.

Am I doing anything wrong here? Or is something being overlooked?

I would appreciate your help.

Hi, to any commenters having trouble, I have updat...

2011-10-21T12:10:10.521-07:00

Hi, to any commenters having trouble, I have updated the first sentence of the post to indicate that this should not be used.
I have not been maintaining this and don't intend to do so in the near future.
There are multiple versions about (my fault) with various different bugs.

Hi, Can you confirm if any of both scripts mention...

2011-10-20T07:27:28.113-07:00

Hi, can you confirm whether either of the scripts mentioned works properly? I tried pair_fastx_clip_trim.py and fastq_pair_filter.py. Both scripts crash with the adaptors option. Without the -a option, pair_fastx_clip_trim.py produces out-of-sync files (one shorter than the other) and fastq_pair_filter.py produces files with very few reads. With my 9 GB input files, the first script produces 7 GB files and the second one 150 MB files. Thanks for your help.

Hi, i just used your script on about 1 giga paired...

2011-10-10T23:23:21.261-07:00

Hi, I just used your script on about 1 giga paired reads. It has been a couple of hours since the script output:
writing read1.fastq.trim and read2.fastq.trim

Is this unusual? What is the expected running time of this script on such data when looking for 5 different adaptors?

My machine has 24 cores.

I try the one downloaded from https://gist.github....

2011-08-03T16:37:53.483-07:00

I tried the one downloaded from https://gist.github.com/588841 and found that the output pairs DO NOT MATCH. The other version does not work either: https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py
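One way to confirm a report like this is to compare the read IDs of the two output files record by record. The following is a minimal sketch, not part of the script under discussion. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and mates that share an ID up to an optional trailing /1 or /2; the function names are made up for illustration.

```python
from itertools import zip_longest


def read_headers(handle):
    """Yield the header line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line.rstrip("\n")


def pair_id(header):
    """Read ID without the part after the first space or a trailing /1 or /2."""
    name = header.split(" ")[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(handle_a, handle_b):
    """Return (record_index, header_a, header_b) for the first out-of-sync
    pair, or None if both files pair up all the way through."""
    pairs = zip_longest(read_headers(handle_a), read_headers(handle_b))
    for index, (ha, hb) in enumerate(pairs):
        # A None header means one file ended before the other.
        if ha is None or hb is None or pair_id(ha) != pair_id(hb):
            return index, ha, hb
    return None
```

Called on two open file handles, a result of None means every record lines up; anything else points at the first record where the files diverge.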

@brentp thanks a lot, I don't want to use the...

2011-07-18T10:11:10.624-07:00

@brentp

thanks a lot,
I don't want to use the adaptors part, I'm interested in the quality and pairing control.
I have just started to use it,
I will tell you if it works or not, if you like :)

nike00

@nike00 the latest is here: https://github.com/bre...

2011-07-18T09:30:20.032-07:00

@nike00
the latest is here:
https://github.com/brentp/bio-playground/blob/master/reads-utils/fastq_pair_filter.py

but there are problems with the way fastx trims adaptors (often it removes the entire read) that I haven't dealt with. The quality trimming and pairing stuff should work fine though.

Also check out the scythe and sickle projects from the group at UC Davis.

Dear brentp, I'm interested in your script, bu...

2011-07-18T09:00:20.761-07:00

Dear brentp, I'm interested in your script, but now I really don't know which is the latest version, the best working one :)

Can you put the link again?

Thanks so much,
nike

It worked, thanks!

2011-06-08T09:46:54.648-07:00

It worked, thanks!

no. that's a different one.

2011-06-07T13:32:00.995-07:00

no. that's a different one.

I am pretty sure that was the one that I used. I a...

2011-06-07T13:21:40.389-07:00

I am pretty sure that was the one that I used. I am not sure if there is actually a problem. I used the output files to do an assembly and it worked fine.

Also, the original untrimmed files were 7 g and 6 g and the files after using the script were 6.6 g and 5.8 g. That seems like what they should be but I am not sure.

@Anon, use this script. I haven't updated the...

2011-06-07T13:16:12.482-07:00

@Anon,
use this script. I haven't updated the one in this post, will do so shortly.
https://github.com/brentp/methylcode/blob/master/bench/scripts/fastq_pair_filter.py

Hi! Thanks so much for this script. I ran it and...

2011-06-07T13:09:49.444-07:00

Hi! Thanks so much for this script. I ran it and the files seemed to come out ok but I got the following message:

"Traceback (most recent call last):
  File "./pairedtrim", line 81, in <module>
    main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
  File "./pairedtrim", line 46, in main
    for ra, rb in gen_pairs(procs[0].stdout, procs[1].stdout):
  File "./pairedtrim", line 36, in gen_pairs
    assert not all(b), ("files not same length")
AssertionError: files not same length
fastq_quality_trimmer: writing nucleotides failed: Broken pipe
"

the command line I ran was:
./pairedtrim -t 18 -l 10 ../s11utegl.txt ../s12utegl.txt &

I did not use any of the adapter options because I am new to this and was unsure whether we had adaptor sequences. We did Illumina paired-end sequencing. Any help would be great!

If the length of read1 or read2 was trimmed to zer...

2011-05-26T00:46:26.531-07:00

If read1 or read2 is trimmed to zero length, could you add a function that writes the other read (the one that was not trimmed to zero length) to a third output file?
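The request above could be handled along these lines. This is a minimal sketch, not code from the script: it assumes each trimmed read is available as a (header, sequence, quality) tuple, and the names route_pair and split_pairs are hypothetical.

```python
def route_pair(rec_a, rec_b, min_len=1):
    """Decide where a trimmed pair goes.

    Records are (header, sequence, quality) tuples. A read counts as
    lost when its trimmed sequence is shorter than min_len.
    """
    keep_a = len(rec_a[1]) >= min_len
    keep_b = len(rec_b[1]) >= min_len
    if keep_a and keep_b:
        return "paired", (rec_a, rec_b)
    if keep_a:
        return "single", (rec_a,)
    if keep_b:
        return "single", (rec_b,)
    return "dropped", ()


def split_pairs(pairs, min_len=1):
    """Split pairs into read1 output, read2 output, and orphaned singles."""
    out_a, out_b, singles = [], [], []
    for rec_a, rec_b in pairs:
        dest, recs = route_pair(rec_a, rec_b, min_len)
        if dest == "paired":
            out_a.append(recs[0])
            out_b.append(recs[1])
        elif dest == "single":
            singles.extend(recs)
    return out_a, out_b, singles
```

The two paired outputs stay in sync because a record is only appended to one when its mate is appended to the other; everything else goes to the singles list or is dropped.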

So, you made 'fhr_headers' a generator whi...

2011-02-18T06:55:59.176-08:00

So, you made 'fhr_headers' a generator which I think has the advantage that the headers don't have to be loaded in memory all at once.

However, in the end you still store them all in the 'seen' bitmap, right?

I don't think you need to do this, something like this pseudo code should do. Or am I missing something here?

Furthermore, I don't have a Fastq file handy at the moment, but if the headers occur in the files according to some order, you don't need an original file at all. But probably this is not the case.
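The last point can be illustrated with a merge-style pass over the two filtered files. This is an independent sketch and not the commenter's pseudo code. It rests on the assumption, which the commenter also doubts, that some key derived from each header increases strictly in the same order in both files; the function name and the key argument are hypothetical.

```python
def paired_in_order(records_a, records_b, key):
    """Yield (a, b) for records whose keys match.

    Both inputs must be ordered by key. Only one record per file is
    held in memory at a time, and no original file or 'seen' set is needed.
    """
    iter_a, iter_b = iter(records_a), iter(records_b)
    rec_a, rec_b = next(iter_a, None), next(iter_b, None)
    while rec_a is not None and rec_b is not None:
        key_a, key_b = key(rec_a), key(rec_b)
        if key_a == key_b:
            yield rec_a, rec_b
            rec_a, rec_b = next(iter_a, None), next(iter_b, None)
        elif key_a < key_b:
            # rec_a has no mate in the other file; skip it.
            rec_a = next(iter_a, None)
        else:
            # rec_b has no mate in the other file; skip it.
            rec_b = next(iter_b, None)
```

If the headers carry no usable ordering, this approach does not apply and some form of lookup against the IDs already seen is still required.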

i suspect you could run the barcode splitter on bo...

2011-02-11T13:49:30.718-08:00

i suspect you could run the barcode splitter on both ends of the reads independently, then run each pair of resulting files through this program. it should remove any reads that are unpaired -- either because of the barcode or because of quality filtering.

fastx Barcode splitter is useful for SE reads dire...

2011-02-11T13:41:45.983-08:00

The fastx barcode splitter is directly useful for SE reads. For paired-end reads without barcodes, how do I eliminate them from the other mate?

It would be great if you could implement it. It would be a complete package for PE reads: filtering, trimming, and barcode splitting.

thanks for finding and reporting the errors. i hav...

2011-02-11T13:31:16.178-08:00

thanks for finding and reporting the errors. i have updated the gist.

for barcode splitting (which i haven't done), i think you can use fastx directly:

http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_barcode_splitter_usage

@brentp Thanks for your time. Its seems working......

2011-02-11T13:26:16.335-08:00

@brentp
Thanks for your time. It seems to be working...

BTW, how do I de-multiplex (separate barcodes) paired samples and place the reads in separate bins?

try specifying with -l 0 on the command-line, it s...

2011-02-11T11:58:25.814-08:00

try specifying -l 0 on the command-line. it should be optional, but it must be defaulting to None. i will fix this shortly.
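The failure mode described here is that "%i" cannot format None, so an option left unset breaks the command string. A minimal sketch of the fix, giving the option an integer default: the parser below is hypothetical (it uses argparse and invents the help texts); only the -t and -l option names and the format string come from the thread.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the -t and -l names come from the thread.
    parser = argparse.ArgumentParser()
    parser.add_argument("-t", type=int, default=0, help="quality threshold")
    parser.add_argument("-l", type=int, default=0, help="minimum read length")
    parser.add_argument("fastqs", nargs="*")
    return parser


def trim_command(trimmer, t, l):
    # Same format string as the line in the reported traceback;
    # %i raises TypeError if t or l is None.
    return "%s -t %i -l %i" % (trimmer, t, l)


opts = build_parser().parse_args(["-t", "24", "a.fastq", "b.fastq"])
print(trim_command("fastq_quality_trimmer", opts.t, opts.l))
```

With an explicit default of 0, leaving out -l no longer reaches the format string as None.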

Thanks No errors without inputs now But with input...

2011-02-11T11:54:26.544-08:00

Thanks.
No errors without inputs now,
but with inputs there are still errors:

$ python2.7 fastq_pair_filter.py -a ACCC,CGTA,CAGT,TTAG -t 24 SRR018062_1.fastq RR018062_2.fastq

Traceback (most recent call last):
  File "fastq_pair_filter.py", line 96, in <module>
    main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
  File "fastq_pair_filter.py", line 22, in main
    trim_cmd = "%s -t %i -l %i" % (FASTQ_QUALITY_TRIMMER, t, l)
TypeError: %d format: a number is required, not NoneType

please download from here: http://gist.github.com/...

2011-02-11T11:31:28.484-08:00

please download from here:
http://gist.github.com/588841

i edited the gist manually and put the new bracket in the wrong place, both in the gist and in my comment. the not should be outside the ().
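The difference between the two placements is easy to check directly. This is a minimal sketch that assumes the condition is meant to detect a wrong number of input files; the function names are made up for illustration.

```python
def misplaced_check(fastqs):
    # Parenthesised as in the quoted line 92: this compares the value of
    # (not fastqs and len(fastqs)) with 2, which is never equal to 2.
    return (not fastqs and len(fastqs)) == 2


def intended_check(fastqs):
    # `not` outside the parentheses: True whenever there are not
    # exactly two input files.
    return not (fastqs and len(fastqs) == 2)


for args in ([], ["a.fastq"], ["a.fastq", "b.fastq"]):
    print(args, misplaced_check(args), intended_check(args))
```

With the misplaced form the check is False for every input, so a run with no files falls through to the main function, which is consistent with the IndexError reported in the thread.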

thanks, I followed your suggestion and edited the ...

2011-02-11T11:10:19.775-08:00

thanks, I followed your suggestion and edited line 92

"if (not fastqs and len(fastqs)) == 2:"

I still get the same error

$ python2.7 fastq_pair_filter.py
Traceback (most recent call last):
  File "fastq_pair_filter.py", line 96, in <module>
    main(adaptors, opts.M, opts.t, opts.l, fastqs, opts.sanger)
  File "fastq_pair_filter.py", line 41, in main
    fha, fhb = procs[0].stdout, procs[1].stdout
IndexError: list index out of range

@Anonymous . you are right! thanks for your patien...

2011-02-11T10:53:48.023-08:00

@Anonymous. you are right! thanks for your patience. I fixed the gist so it should work now. just added parentheses at line 92 so it looks like this:

if (not fastqs and len(fastqs)) == 2:

let me know if you have any more problems.

@brentp I posted the command itself on the above ...

2011-02-11T10:46:34.845-08:00

@brentp

I posted the command itself in the posts above.

"$ python2.7 fastq_pair_filter.py"

Even with input files (paired-end files) I get the same error.

If possible, can you also put up the usage options of the program?

Thanks