Wednesday, July 14, 2010

GSNAP

Aligners

Since starting the methylcoder project, I've been using the bowtie short read aligner. It's very fast, uses very little memory, aligns Illimina, SOLID, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard).
There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle.
  1. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform(BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and, so is not a problem for most users, but, for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the reference and puts the human genome past that limit. The "solution" is to split the reference and create multiple indexes. But, with that setup, when requesting reads that map uniquely to a single best location, it's only guaranteed that the mapping is unique to each index independently and post-processing is required. This is also a problem for any large genome that will not fit into a bowtie index.
  2. Bowtie does not handle indels very well. Even the other popular aligners SOAP and BWA can only handle very short indels (up to about 3 bp) while MAQ and SOAP2 do not align reads with indels.

GSNAP

GSNAP: Genomic Short-read Nucleotide Alignment Program (from Thomas Wu at genentech) addresses both of those short-comings and seems reasonably fast. For the reference index, rather than use a suffix-array, GSNAP uses a hash of 12-mers sampled every 3 basepairs (by my reading, sampling complicates the search somewhat but reduces the size of the index). Since it can then use 12-mer seeds to span gaps, it's better at dealing with indels and can also be used to map RNA-seq data to a reference genome, with or without known, annotated splice sites! In addition, it can take known SNPs and add them to the 12-mer hash index so that reads mapping over a SNP will not be wrongly accused of having a mismatch.
Finally, it can map BS-sulfite treated reads to the genome without requiring the C=>T conversion--using the same re-indexed genome.
It's input format is a bit wonky for paired end reads. Normally, paired-end reads are specified in 2 separate files: file_1.fastq, file_2.fastq with the header (and order) indicating the pairing. GSNAP requires a single FASTA file (it will not accept FASTQ) with format:
>header
ACTCTCAGCGGGACGTTAACGCGACCGATTACGGTGACC
CCACGTGCCGACTTAGGCAGACCGACGTTACGCACCACA
where the first sequence is from the file_1.fastq and the 2nd is from file_2.fastq.
As of yesterday, MethylCoder supports GSNAP as an aligner, in addition to Bowtie. (This addition was quite simple since GSNAP does all the work anyway). MethylCoder includes a script to convert paired-end FASTQ files to GSNAP format. The script simply takes as arguments paired end FASTA or FASTQ files and puts them in a single file of the GSNAP paired end format. It's almost too simple to worry about, but I needed it so it's there...

The GSNAP paper is an interesting read. Ever heard of "galloping binary search"? I hadn't, but apparently it's used in python's timsort.