Thursday, January 22, 2015

my thoughts on golang

I've been playing with go in the evenings and over xmas break for about 8 weeks now. This post is about go the language and the tooling. I may write another post about a simple go package for bioinformatics that I've been writing which is under 1000 lines of code.

First, go is boring, and though it is pretty terse, I do miss things about python like list comprehensions; initializing a variable and writing a for loop is easy enough, but it's one of the things that I use all the time in python.  But, I can't argue with the "less is exponentially more" mantra as I was able to pick up the language very quickly. The tooling is fantastic. My project has dependencies that are wrappers to C-libs, but I can simply do:
    go get
and it just works. The project is only about 1K lines of code, but it compiles in about 0.1 seconds on my laptop. And, when the time comes, I can distribute binaries for common platforms!

vim-go is awesome! I've had various plugins for python in vim, but I think the combination of types, formatting, and interfaces lets vim-go help me to a level I've not seen with python. In addition to the obvious stuff such as using incorrect types, if I try to use a struct that doesn't satisfy an interface as required, it will tell me which methods are missing or if I should be using a pointer receiver. Nearly all of the problems I would encounter are caught at edit-time (compile-time) via vim-go. The run-time error messages are pretty readable--though my most common one is the nil pointer exception, which I've learned to track down pretty quickly--or to obviate altogether by doing things in a more go-ish (?) way.

I needed some iterators for this project and I figured that using function closures would be the fastest, both in terms of development and performance because I hadn't used channels. I had also seen that using channels as iterators is quite slow. I initially wrote the iterator as a function closure and then converted it to a channel. The linked benchmark isn't doing any work inside the iterator, so (presumably) adding the channel machinery slows things down. In my case, I'm doing some parsing in the iterator so having that done concurrently in a go-routine can speed things up quite nicely. Converting from the closure-based iterator to the channel was simple, again due to the niceties in vim-go as allowed by the language itself.

In python, I never think about interfaces (well, almost never). In go, that's the obvious thing to do. I did find myself wishing that interfaces could be defined in terms of fields, and not just methods. Initially, writing:

    func (self *Interval) Start() uint32 { return self.start }

was annoying (I wanted to just have 'start uint32'), but, in dealing with genomic data, we sometimes have 1-based and sometimes 0-based coordinates, so for another struct to satisfy the interface I had to have:

func (self *Gff) Start() uint32 {
        return uint32(self.Feature.Start() - 1)
}

So, I needed the wrapper anyway to normalize the positions. Forcing the cast to uint32 (self.Feature.Start() is uint64) actually helped me catch some problems. Writing methods like that is simple, and vim-go does let me know when I'm missing methods to satisfy an interface.

Other than my initial change from closures to go-routines, a failed attempt to convert from interface{} to a strict Interface (see below), and some small tweaks to the capacity in my slice initialization and to channel buffer sizes, I have done very little optimization, but my toy project is within 3X of a highly developed C++ project (a python version with minimal features was 20X slower than the C++ version). The compilation speed and speed of development are more than worth it for me. I did do some profiling with the easy-to-use tools; my project is very GC heavy, so I can hope that the GC changes in 1.5 and 1.6 will bring it even closer to C++.

(In go, an Interface is a declaration of a set of methods that allows a sort of duck-typing [or maybe that is exactly duck-typing], and the empty interface 'interface{}' is satisfied by all types.) I've seen that lack of generics is a common complaint. One alternative, using interface{}, can make things slow. I'm using a priority queue to implement merging of sorted streams (a la python's heapq.merge), and the heap implementation in go uses interface{}. I tried converting all occurrences of interface{} to "MyInterface" in the heap.go source file and then removing all of my casts, but actually got negligible improvement. For the most part, I found that Interfaces made my code "generic" enough, but I did have to implement min() and max() for uint32 since those are not provided--that could get annoying if I needed them for many numeric types. The most common syntax annoyance where I still waste precious brain-power is the range syntax. For channels, range yields a single value; for most other things, it yields the index and the item. This is becoming 2nd nature, but it is in contrast to python where iterators are a consistent and simple abstraction. It's a minor thing, but it is noticeable in a language where most stuff is pretty unsurprising.

Slices just work. I use them heavily and never have to waste brain-cycles on deciding how to use them. I don't know if this is common, but I haven't used arrays, only slices.

Other nits:
1. In other languages I've worked with, I can open a gzipped file and get a single filehandle. With go,
I get 2 handles to track and close. I almost wrote a wrapper for zlib's gzfile, but have been trying to focus on getting something working (for now, my code for that is horrible).
2. For the most part, I really like go fmt, but I wish they had chosen 4 space indent.
3. It'd be nice if functions could have default arguments.

My use of the parallel (ahem, concurrent) features has been pretty basic, thus far relying only on select and range on buffered channels. I haven't used the sync package or even a channel as a sort of lock, but the syntax, even after very little use feels natural.

The reason I like python is because of the simple, concise syntax, standard library (which IMO is still pretty good despite common opinion), iterators/generators, and the numpy/scipy/statsmodels/sklearn stuff. I think that go makes my code a little nicer and it seems to run faster for my common uses. It helps that go has the amazing biogo libraries for genomic data. The tooling is phenomenal and the coding has been very enjoyable.

These are my perceptions after a short time with go. I'd be happy to be relieved of any *mis*conceptions.

Wednesday, December 04, 2013

streaming merge of sorted objects

A lot of software still seems to rely on being able to read big-ish data into memory. This is not possible (or at least not desirable) for much of the data that I work with. There are very nice tools in python to allow operating on chunks of data at a time. When combined with a decent data-layout, this can be very powerful, and simpler even than reading everything into memory. This can change working on big(ish) data into something like working on small data. The output of a tool that I'm using is a file of genomic positions and a value. Something like:
chr1 1234 0.9
chr1 1239 0.12
chr1 1249 0.12
That file is for a single sample and may contain about 10 million lines. That's not too much, but with 60 samples, this can become a problem. In addition, another sample may have sites that are not in that first sample:
chr1 1221 0.91
chr1 1239 0.13
chr1 1259 0.22
Many software packages will take a matrix with rows of genomic positions and 1 column per sample (e.g. R's limma, python's scikit-learn [after transposing that shape]). Since we know the data is sorted, we can use some of python's machinery to simplify creating that matrix. We could use merge in R or in pandas (both of those can operate on chunks, but I think it's fair to say those faculties are not used much and would be difficult for this use-case), but this is a way to do it with very little memory in a way that streams the input and the output so that it can be used immediately. Specifically, we will use heapq.merge. In this case, we need to make our data comparable using a class with a __cmp__ method:
class Row(object):
    __slots__ = ('chrom', 'start', 'end', 'value', 'source')

    def __init__(self, toks, val_col=4, source=None):
        self.chrom = toks[0]
        self.start = int(toks[1])
        self.value = toks[val_col - 1]
        self.source = source

    def __cmp__(self, other):
        return cmp(self.chrom, other.chrom) or cmp(self.start, other.start)
That turns each line from the file above into something order-able by chromosome and position. Since our data in each file is already sorted, this will be used to compare data across files. For each file we wish to merge, we create a lazy iterable like this:
def gen_iterable(fname, val_col):
    # xopen: any opener that handles plain or gzipped files
    for toks in (x.rstrip("\r\n").split("\t") for x in xopen(fname)):
        yield Row(toks, val_col, source=fname)

and then a list of iterables as:
iterables = [gen_iterable(f, value_col) for f in bed_files]
where value_col would be 3 for the example data above (gets converted to a 0-based index). Since we want to have 1 output line for any position observed in any input, we can use itertools.groupby on the merged objects:
for loc, cgs in groupby(heapq.merge(*iterables),
                        lambda cg: (cg.chrom, cg.start)):

This groups the output of all the files by location. We can know which files are represented in the cgs group by their .source attribute. In this case, I fill in empty values with 0 (this may be better set as NaN or some other value in some cases):
        present = dict((c.source, c) for c in cgs)

        # if a file doesn't have a record for this site, just append 0
        values = [(present[s].value if s in present else '0') for s in bed_files]
        yield loc[0], loc[1], values
Where I have defined the source as the file the value came from. This ensures that the values are always in the same order regardless of which are missing. The yield sends back the chromosome, start, and the list of values to the caller so that they can be manipulated as needed. (Note that cgs is an iterator that is consumed by the dict above, which is why we take the chromosome and start from loc, the groupby key.) At this point, that is small data. We have a single site (in this case a CpG site with methylation values) that we can operate on. This all happens on streaming data, so the memory use is negligible. Similar ideas exist in BEDTools for sorted data (though it is more complicated to handle sorted interval data as opposed to here, where we consider only points), but this is a simple way to build a custom setup for streaming arbitrary sorted data. The full code for this example is here:
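Since the gist itself isn't embedded in this archive, here is a self-contained Python 3 sketch of the same pipeline. It is an illustration, not the original code: tuples stand in for the __cmp__-based Row class (Python 3 dropped __cmp__), and in-memory lists stand in for the files.

```python
import heapq
from itertools import groupby

def gen_rows(lines, source):
    # tuples sort by (chrom, start) first, mirroring __cmp__ above
    for line in lines:
        toks = line.rstrip("\r\n").split("\t")
        yield (toks[0], int(toks[1]), toks[2], source)

def merged_matrix(files):
    # files: {source-name: iterable of sorted "chrom\tpos\tvalue" lines}
    names = list(files)
    iterables = [gen_rows(lines, src) for src, lines in files.items()]
    merged = heapq.merge(*iterables)
    for loc, rows in groupby(merged, key=lambda r: (r[0], r[1])):
        present = {r[3]: r[2] for r in rows}
        # '0' for any sample with no record at this site
        yield loc[0], loc[1], [present.get(s, '0') for s in names]
```

Feeding it the two small samples from the top of the post yields one row per position seen in either file, with values in a fixed sample order.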

Tuesday, April 12, 2011

Adding bed/wig data to dalliance genome browser

I have been playing a bit with the dalliance genome browser. It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses.
I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed inline below.
It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below).
This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently.

Dalliance Data Tutorial

dalliance is a web-based scrolling genome-browser. It can display data from
remote DAS servers or local or remote BigWig or BigBed files.
This will cover how to set up an html page that links to remote DAS services.
It will also show how to create and serve BigWig and BigBed files.


This document will be using hg18 for this tutorial, but it is applicable to
any version available from your favorite database or DAS server.

Creating A BigBed

Getting a bed file from UCSC

  • From the UCSC table browser choose

    • genome: Human
    • assembly: NCBI36/hg18
    • group: Genes and Gene Prediction Tracks
    • track: UCSC Genes
    • table: knownGene
    • output format "selected fields from primary and related tables"
    • in text box, name it "knownGene.hg18.stuff.txt"
    • click "get output"
    • check kgXref under 'Linked Tables'
    • click 'Allow Selection From Checked Tables' at bottom of page.
    • check 'geneSymbol' from hg18.kgXref fields section
    • click 'get output' and a file named 'knownGene.hg18.stuff.txt' will be saved to your downloads directory. move it to your current directory.
  • To get this into bed format copy and paste this onto the command-line:

    grep -v '#' knownGene.hg18.stuff.txt | awk 'BEGIN { OFS = "\t"; } ;
    {   split($9, astarts, /,/);
        split($10, aends, /,/);
        ends=""; starts="";
        for(i in astarts){
            if (! astarts[i]) continue
            ends=ends""(aends[i] - astarts[i])","
            starts=starts""(astarts[i] - $4)","
        }
        print $2,$4,$5,$1","toupper($13),1,$3,$6,$5,".",$8,ends,starts
    }' | sort -k1,1 -k2,2n > knownGene.hg18.bed
  • To create a BigBed from this, do (note: if you're not on a 64-bit
    machine, you'll have to find the 32-bit binaries):

    chmod +x fetchChromSizes bedToBigBed
    ./fetchChromSizes hg18 > data/hg18.chrom.sizes
    ./bedToBigBed knownGene.hg18.bed data/hg18.chrom.sizes

 is now a BigBed file containing both the UCSC and the common
name in the name column.


UCSC also has a public mysql server so the process of downloading to a bed can be simplified to:

mysql --user=genome -A -D hg18 -P 3306 -e "select K.chrom, K.txStart, K.txEnd,, X.geneSymbol, K.strand, K.exonStarts, K.exonEnds from knownGene as K, kgXref as X where = X.kgID;" > tmp.notbed
grep -v txStart tmp.notbed | awk 'BEGIN { OFS = "\t"; } ;
    {   split($7, astarts, /,/);
        split($8, aends, /,/);
        sizes=""; starts=""; exonCount=0;
        for(i in astarts){
            if (! astarts[i]) continue
            sizes=sizes""(aends[i] - astarts[i])","
            starts=starts""(astarts[i] - $2)","
            exonCount=exonCount + 1
        }
        print $1,$2,$3,$4","$5,1,$6,$2,$3,".",exonCount,sizes,starts
    }' | sort -k1,1 -k2,2n > knownGene.hg18.bed

then proceed as in the last steps above to create the BigBed file.

Displaying A BigBed in Dalliance

From there, download dalliance:

$ git clone git://
cd dalliance

and edit test.html, adding:

{name: 'UCSC Genes',
 bwgURI: '/dalliance/'},

before the line that looks like:

{name: 'Repeats',

at around line 55.

Then edit your apache.conf (e.g. /etc/apache2/sites-enabled/000-default)
and put the following
(here i assume you cloned dalliance into /usr/local/src/dalliance-git):

Alias /dalliance "/usr/local/src/dalliance-git"
<Directory "/usr/local/src/dalliance-git">
    Header set Access-Control-Allow-Origin "*"
    Header set Access-Control-Allow-Headers "Range"
    Options Indexes MultiViews FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

Then enable mod-headers apache module. On Ubuntu, that looks like:

sudo a2enmod headers

Then point your browser to: http://yourhost/dalliance/test.html
And you should see your 'UCSC Genes' track in full glory along
with the other niceties of the dalliance browser.

Friday, October 22, 2010

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.

Bloom Filters

A Bloom Filter is a memory-efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter. It has a nice API and seems very fast. It allows specifying the false-positive rate. As with any bloom filter, there's a tradeoff between the amount of memory used and the expected number of false positives.
The code for that wrapper is in my github, here.

Big Data

It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: just read the records into a dict/hash structure and use that to find duplicates and print out only the best or first or whatever. However, once you start getting "next-gen", this is less useful. For example, someone kindly reported a bug on my simple c++ filter because he had 84 gigs of reads and was running out of memory. And that is a bug. Anything that's supposed to deal with next-gen sequences has to deal with that much data.

As a side note, my friend/co-worker came up with an elegant solution for filtering larger files: split into 4 files based on the first base in the read (A, C, T, or G) then filter each of those files to uniqueness and merge. I like that approach, and as the file sizes grow it could be extended to separate into 16 files by reading the first 2 bases. But, it does require a lot of extra file io.

Just Do It

So, I wrote a simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically, only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tunable by the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads and the number of false positives in the bloom-filter sent to stderr.
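The script itself isn't embedded in this archive, but the core idea can be sketched with a toy bloom filter (an illustrative stand-in, not the Bloom::Filter wrapper above): only reads that the filter flags as "maybe seen before" ever cost set memory.

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: n_hashes bit positions per item in a size_bits field."""
    def __init__(self, size_bits=1 << 20, n_hashes=4):
        self.size, self.n, self.bits = size_bits, n_hashes, 0

    def _positions(self, item):
        for i in range(self.n):
            digest = hashlib.md5(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def unique_seqs(seqs):
    # seqs must be re-iterable (a list here; re-open the file in practice)
    bloom, maybe_dup = BloomFilter(), set()
    for s in seqs:               # pass 1: bloom-flag possible repeats
        if s in bloom:
            maybe_dup.add(s)     # only flagged reads cost set memory
        else:
            bloom.add(s)
    emitted = set()
    for s in seqs:               # pass 2: emit first occurrence only
        if s in maybe_dup:
            if s in emitted:
                continue
            emitted.add(s)
        yield s
```

A bloom false positive just promotes one extra read into the set, so the output is still exactly the unique reads; only the memory saving degrades as the error rate rises.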
usage looks like:
python < in.fastq > out.unique.fastq
On my machine, a particular run with a decent sized file looks like this:

and here's the code:

Monday, September 20, 2010

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.

I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips adaptor sequences and trims low-quality bases from paired end reads. It simply calls the fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file, then loops through the filtered output and keeps only reads that appear in both. Usage is something like:
Where the -a (adaptor), -M (length of adaptor match), -t (min quality threshold), and -l (min length after quality chops) options are copied directly from (and sent directly to) fastx toolkit. --sanger indicates that the reads have fastq qualities in the sanger encoding. If that option is not specified, qualities are assumed to be in illumina 1.3 format, where the ascii offset is 64.
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim each with the same number of corresponding records that pass the filtering above.
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma delimited option -a--in a single run.
It's not too pretty, but it does the job:
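The script itself isn't embedded in this archive, but the pair-keeping step it describes--keep only records whose ids appear in both filtered files, in matching order--can be sketched as follows (the header parsing, stripping a trailing "/1" or "/2", is an assumption about the read-name format):

```python
def fastq_ids(path):
    """Collect read ids (header up to whitespace or '/') from a FASTQ file."""
    ids = set()
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # every 4th line is the @header
                ids.add(line[1:].split()[0].split("/")[0])
    return ids

def write_pairs(in1, in2, out1, out2):
    # keep only records whose id appears in both filtered files,
    # preserving the original (matching) order in each output
    keep = fastq_ids(in1) & fastq_ids(in2)
    for src, dst in ((in1, out1), (in2, out2)):
        with open(src) as fh, open(dst, "w") as out:
            while True:
                rec = [fh.readline() for _ in range(4)]
                if not rec[0]:
                    break
                rid = rec[0][1:].split()[0].split("/")[0]
                if rid in keep:
                    out.writelines(rec)
```

Because both inputs were sorted identically before filtering, writing the survivors in file order keeps the nth record of each output paired.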

As always, let me know of any feedback.

Sunday, September 12, 2010

ngs / high-throughput sequencing pipeline

This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know.
I'll put it in the form of a shell script that assumes you've got this software installed.
I'll also assume your data is in the FASTQ format. If it's in illumina's qseq format, you can convert to FastQ with this script by sending a list of qseq files as the command-line arguments.
If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply.
This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac.


The programs and files will be declared as follows.

#programs. (you'll want to change these)
bowtie_dir=/usr/local/src/bowtie
samtools=samtools
fastqc=fastqc
#files. (you'll want to change these)
FASTQ=a.fastq
REFERENCE=ref.fasta
CLIP=AGATCGGAAGAGC   # placeholder: use the adaptor sequence FastQC reports below
Most of the following will run as-is for any set of reads; you'll only need to change the FASTQ and REFERENCE variables above.

Seeing the Reads

If you have a file with 2GB of reads, you likely can't just read (pun intended) it and get an idea of what's going on -- though I've tried. There are a number of tools to give you a summary of the data including stats such as quality per base, nucleotide frequency, etc. While fastx toolkit will do this for you, I've found fastqc to be the best choice.
The command:
$fastqc $FASTQ
will write a folder "a_fastqc/" containing the html report in fastqc_report.html
Here's an example of the nicely formatted and informative FastQC report before quality filtering and trimming (scroll to see the interesting stuff):

from that, we can see that (as with all illumina datasets) quality scores are lower at the 3' end of the read. For this analysis, I'll choose to chop bases with a quality score under 28. In addition to the other information, we can see that there is an Illumina Primer still present on many of the reads, so we'll want to chop that out in the filtering step below.

Filtering the Reads

The fastx toolkit is not extremely fast, but it has a simple command-line interface for most common operations I use to filter reads (though it--like similar libraries--is lacking in support for filtering paired-end reads). For this case, we want to trim nucleotides with quality less than 28 from the ends of each read and then remove reads shorter than 40 bases. In addition, we want to clip the Illumina adaptor sequence. This can be done by piping 2 commands together:
# the sequence identified by FastQC above:
fastx_clipper -a $CLIP -i $FASTQ -M 30 | fastq_quality_trimmer -t 28 -l 40 > ${FASTQ}.trim
both fastx_clipper and fastq_quality_trimmer are provided in the fastx toolkit.
The first command clips the adaptor sequence from $FASTQ and pipes the output to the 2nd command, fastq_quality_trimmer, which chomps bases with quality less than 28 (-t), then discards any read with a remaining length less than 40 (-l), and sends the output to a .trim file.

Seeing the filtered Reads

We then want to re-run fastqc on the trimmed, clipped reads.
The command:
$fastqc ${FASTQ}.trim
and the output looks like:

where we can see the quality looks much better and there are no primer sequences remaining.
This took the number of reads from 14,368,483 to 11,579,512 (and many shorter--trimmed--reads in the latter as well).


Up to this point, the analysis will have been very similar for RNA-Seq, BS-Seq, and genomic reads, but you'll want to customize your filtering steps. The following will go through a simple alignment using bowtie that assumes you have genomic reads.


The first step is to build the reference bowtie index:
${bowtie_dir}/bowtie-build --quiet $REFERENCE bowtie-index
That will create the bowtie index for your reference fasta in the bowtie-index/ directory. It can take a while so you may want to download the pre-built indexes from the bowtie website if your organism is available.
Then, you can run the actual alignment as:
${bowtie_dir}/bowtie --tryhard --phred64-quals -p 4 -m 1 -S bowtie-index -q ${FASTQ}.trim ${FASTQ}.trim.sam
which tells bowtie to try hard (--tryhard), assume quality scores are the latest schema from illumina (--phred64-quals), use 4 processors (-p 4), discard any reads that map to more than one location in the reference (-m 1), use SAM output format (-S), and then where to find the bowtie index and the reads. The output is sent to ${FASTQ}.trim.sam.
We've done a lot of experimenting with different values for -m, and it can affect your results, but -m 1 seems a common choice in the literature. And it's clearly less sketchy than, for example, -m 100, which would report up to 100 alignments for any read that maps to fewer than 100 locations in the genome.

View the Alignment

From there, we want to view the alignment. Most tools can handle SAM formatted files, but will perform better with an indexed bam. To get that, we do:
# view with samtools: filter out unmapped reads and convert to a sorted, indexed bam.
${samtools} view -bu -S -F 0x0004 ${FASTQ}.trim.sam -o ${FASTQ}.trim.unsorted.bam
${samtools} sort ${FASTQ}.trim.unsorted.bam ${FASTQ}.trim
${samtools} index ${FASTQ}.trim.bam
where now ${FASTQ}.trim.bam is sorted and indexed and contains only mapped reads (the .sam file from bowtie contains unmapped reads).
You can have a quick look at the alignment stats with:
${samtools} flagstat ${FASTQ}.trim.bam
and you can see an awesome ascii curses view of the alignment with:
${samtools} tview ${FASTQ}.trim.bam
to get something that looks like:

But you'll probably want to use a "real" viewer such as IGV, MochiView, or Tablet, all of which I have had some success with.

Note that you may want to choose another aligner, because bowtie does poorly with indels. Something like GSNAP will be better suited for reads where you have more complex variants.
I welcome any feedback on these methods.

Wednesday, July 14, 2010



Since starting the methylcoder project, I've been using the bowtie short read aligner. It's very fast, uses very little memory, aligns Illumina, SOLiD, and colorspace reads, and has enough options to keep you busy (including my favorite: --tryhard).
There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle.
  1. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32-bit integers to store the offsets into that index, so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and so is not a problem for most users, but for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward and reverse copy of the genome, which doubles the size of the reference and puts the human genome past that limit. The "solution" is to split the reference and create multiple indexes. But, with that setup, when requesting reads that map uniquely to a single best location, it's only guaranteed that the mapping is unique to each index independently, and post-processing is required. This is also a problem for any large genome that will not fit into a bowtie index.
  2. Bowtie does not handle indels very well. Even the other popular aligners SOAP and BWA can only handle very short indels (up to about 3 bp) while MAQ and SOAP2 do not align reads with indels.


GSNAP: Genomic Short-read Nucleotide Alignment Program (from Thomas Wu at genentech) addresses both of those short-comings and seems reasonably fast. For the reference index, rather than use a suffix-array, GSNAP uses a hash of 12-mers sampled every 3 basepairs (by my reading, sampling complicates the search somewhat but reduces the size of the index). Since it can then use 12-mer seeds to span gaps, it's better at dealing with indels and can also be used to map RNA-seq data to a reference genome, with or without known, annotated splice sites! In addition, it can take known SNPs and add them to the 12-mer hash index so that reads mapping over a SNP will not be wrongly accused of having a mismatch.
Finally, it can map bisulfite-treated reads to the genome without requiring the C=>T conversion--using the same indexed genome.
Its input format is a bit wonky for paired end reads. Normally, paired-end reads are specified in 2 separate files: file_1.fastq, file_2.fastq, with the header (and order) indicating the pairing. GSNAP requires a single FASTA file (it will not accept FASTQ) with format:
where the first sequence is from the file_1.fastq and the 2nd is from file_2.fastq.
As of yesterday, MethylCoder supports GSNAP as an aligner, in addition to Bowtie. (This addition was quite simple since GSNAP does all the work anyway). MethylCoder includes a script to convert paired-end FASTQ files to GSNAP format. The script simply takes as arguments paired end FASTA or FASTQ files and puts them in a single file of the GSNAP paired end format. It's almost too simple to worry about, but I needed it so it's there...
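The MethylCoder script itself isn't embedded here, but the conversion it describes--my reading of the format above, with one FASTA record per pair carrying the header from end 1 and then both sequences on their own lines--can be sketched as:

```python
def fastq_records(path):
    # yield (name, sequence) from a 4-line-per-record FASTQ file
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                return
            seq = fh.readline().strip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line (dropped: the output is FASTA)
            yield header[1:].split()[0], seq

def to_gsnap(fq1, fq2, out_path):
    # one FASTA record per pair: header from end 1, then both sequences
    with open(out_path, "w") as out:
        for (name, s1), (_, s2) in zip(fastq_records(fq1), fastq_records(fq2)):
            out.write(">%s\n%s\n%s\n" % (name, s1, s2))
```

This assumes the two files list their records in the same pair order, which is the usual convention for _1/_2 FASTQ files.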

The GSNAP paper is an interesting read. Ever heard of "galloping binary search"? I hadn't, but apparently it's used in python's timsort.

Monday, June 07, 2010

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics.

This post involves grouping items either using a disjoint set or python's
itertools.groupby. I was introduced to both of these
by the post-doc who sits behind me.


Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook.
The slightly modified implementation we use is here.
Where the docstring gives a decent idea of usage:
   This class provides a lightweight way to group arbitrary objects
   together into disjoint sets when a full-blown graph data structure
   would be overkill. 

   Objects can be joined using .join(), tested for connectedness
   using .joined(), and all disjoint sets can be retrieved using list(g)
   The objects being joined must be hashable.

   >>> g = Grouper()
   >>> g.join('a', 'b')
   >>> g.join('b', 'c')
   >>> g.join('d', 'e')
   >>> list(g)
   [['a', 'b', 'c'], ['d', 'e']]
   >>> g.joined('a', 'b')
   True
   >>> g.joined('a', 'c')
   True
   >>> 'f' in g
   False
   >>> g.joined('a', 'd')
   False
Basically the idea is if there's an explicit connection between `A` and `B` and an explicit connection between `A` and `C`, the Grouper class will create a connection between `B` and `C`. In the implementation above, the explicit connections are created with `Grouper.join()`. For genomics stuff, we can use this in many places; one simple use-case is finding local duplicates. These are also called tandem duplications and are identified as a group of genes with very similar sequence co-occurring in a local chromosomal region. We want to group these into a single "family" with all members, even if there is not an explicit connection between all genes--due to slight sequence divergence or sequence alignment (BLAST) oddities.
A snippet of code from here (more nice code from Haibao) shows the use of a grouper to find syntenic regions by checking whether adjacent members of an already-sorted list are within a certain distance (window) of each other:

after this, the elements that are grouped in the `g` Grouper object are in the same window (though not strictly syntenic).
This is a very nice abstraction for anything where you have groups of related objects. It reduces the accounting involved because once you have `join`ed all the elements, querying the Grouper object with any element will return all elements in the group to which it is joined.
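For reference, here is a minimal disjoint-set sketch consistent with the docstring above (the real, linked implementation differs), along with a toy version of the windowing idea from the synteny snippet:

```python
class Grouper:
    """Minimal disjoint-set sketch matching the docstring above."""
    def __init__(self):
        self._sets = {}  # element -> the (shared) set it belongs to

    def join(self, a, b):
        sa = self._sets.setdefault(a, {a})
        sb = self._sets.setdefault(b, {b})
        if sa is not sb:
            sa.update(sb)
            for item in sb:      # re-point every member of sb at sa
                self._sets[item] = sa

    def joined(self, a, b):
        return a in self._sets and self._sets[a] is self._sets.get(b)

    def __contains__(self, item):
        return item in self._sets

    def __iter__(self):
        seen = set()
        for s in self._sets.values():
            if id(s) not in seen:
                seen.add(id(s))
                yield sorted(s)

def window_groups(positions, window):
    # join adjacent members of an already-sorted list that fall
    # within `window` of each other (the synteny idea above)
    g = Grouper()
    for a, b in zip(positions, positions[1:]):
        if b - a <= window:
            g.join(a, b)
    return list(g)
```

After the joins, querying with any member (or iterating) gives back whole groups, with no bookkeeping in the calling code.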


Python's itertools has a number of useful features. `groupby` is a means to partition an iterable into groups of similar adjacent items, where `similar` is determined by a function you supply to groupby. The example from the docs is:
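That example from the itertools docs is, roughly:

```python
from itertools import groupby

# groupby yields (key, group-iterator) pairs over runs of equal items
keys = [k for k, g in groupby('AAAABBBCCDAABBB')]
print(keys)  # ['A', 'B', 'C', 'D', 'A', 'B']

groups = [list(g) for k, g in groupby('AAAABBBCCD')]
print(groups)  # [['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['C', 'C'], ['D']]
```

Note that the second 'A' run produces a second 'A' key: groupby only merges adjacent items.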

but it's also possible to specify an arbitrary grouping function, so grouping capital letters together:
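For example, using str.isupper as the key function:

```python
from itertools import groupby

# group runs of adjacent letters by whether they are upper-case
runs = [(k, "".join(g)) for k, g in groupby("aaBBBcDD", key=str.isupper)]
print(runs)  # [(False, 'aa'), (True, 'BBB'), (False, 'c'), (True, 'DD')]
```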

So the `k`ey tells what the grouping function returned, and each group contains the adjacent items from the list that returned the same key. In some cases we want to sort the input before using groupby; then we can get the groups directly and avoid an extra nested for-loop that we would otherwise need for filtering.
An example use is grouping BLAST hits by query and subject. Here, BlastLine is a simple object that takes a line from a tab-delimited blast and converts the e-value and bitscore to float and the starts and stops to ints, etc.

and since the blasts are sorted beforehand, this gives a useful set of blasts in blast_iter--containing only blasts with a matching query and subject. To do this without groupby requires quite a bit more code and reduces readability.
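The pattern looks something like this sketch, with a hypothetical namedtuple standing in for BlastLine:

```python
from collections import namedtuple
from itertools import groupby

# hypothetical stand-in for BlastLine; the real class parses a full blast row
Hit = namedtuple("Hit", "query subject evalue")

hits = [Hit("q1", "s1", 1e-10), Hit("q2", "s1", 1e-8),
        Hit("q1", "s2", 1e-3), Hit("q1", "s1", 1e-5)]

# groupby only groups *adjacent* items, so sort on the grouping key first
keyfn = lambda h: (h.query, h.subject)
hits.sort(key=keyfn)

for (query, subject), blast_iter in groupby(hits, key=keyfn):
    pairs = list(blast_iter)  # every hit for this query/subject pair
```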

I recently saw a post about using `groupby` to parse fasta files. It's a good idea, as a fasta file consists of pairs of a single header line followed by N lines of sequence. A header line always starts with ">", so the test function for groupby is clear; it's simply a matter of getting the records out in pairwise fashion. Here's one way, building from the post above:


Here's a final example: both of these used together to find tandem duplications given a genomic Bed file (contains chromosome, start, stop) and a list of blasts. Again, this is code from Haibao from here.

So, this creates a `simple_blast` that's easier to sort and group; it iterates over the groups based on the subject
and then groups hits if they're on the same chromosome and within a given distance (where distance is measured in genes).
I like this example because I'd previously written code to do the same thing without Grouper or groupby and it was longer, slower and less readable.

Sunday, May 30, 2010

python tools for bioinformatics I: convolve for moving average

This will be the first post of a few describing python tools or idioms that I have found work well for doing bioinformatics--or, more specifically, genomics.


numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made up of (G)uanine or (C)ytosine. You often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve. It does require you to get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string by doing "np.array(sequence_string, dtype='c')".
So the workflow is to
  • get the sequence into a numpy array of dtype 'c'
  • convert the boolean array to int so True is 1, False is 0
  • decide on the window size and make a kernel of that shape that sums to 1
  • profit
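The steps above can be sketched as follows (with a made-up toy sequence; real input would come from a fasta file, and I use `np.array(list(...))` here to keep the comparisons on plain strings):

```python
import numpy as np

seq = np.array(list("ATGCGCGGCCATATAT"))           # step 1: sequence -> array
gc = ((seq == "G") | (seq == "C")).astype(int)     # step 2: boolean -> 0/1
window = 4
kernel = np.ones(window) / window                  # step 3: kernel sums to 1
gc_content = np.convolve(gc, kernel, mode="same")  # step 4: profit
```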
with a quick addition for plotting:
the first 100 kilobases of gc content for 50 and 250bp windows look like:

where, as expected, the 250 basepair window is more "smoothed".
From there, it's possible to do some analysis, for example grab the regions with the highest gc content.
That's some fairly dense code to pull out the values centered in the highest GC-content windows and then show the mean of those windows. (Arabidopsis thaliana, which I used for that example, has a fairly low genome-wide GC-content, so the 0.61 average is quite high.)

Another Example

There was a recent question on biostar where someone wanted to find regions with an abundance of a couple amino acids above some cutoff.
The main part of that looks like:
    abool = (seq == 'N') | (seq == 'Q') # convert to boolean
    mw = np.convolve(abool, kern, mode='same')
    if mw.max() < cutoff: continue

Where that operates on 'N' and 'Q' amino acids in a protein sequence instead of 'G' and 'C' as in the example above.


Finally, this approach is much more flexible than I've shown: the kernel does not have to be uniform. It can be gaussian, or take only values to the left of each nucleotide, or weight the nearest 10 nucleotides at 1 and the rest at 0.2. Even given those changes, once the kernel is chosen, the rest of the "workflow" remains unchanged.


In cases where the input array is numeric--not sequence--there is no need to do the conversion to a boolean, simply run the convolution on the original array with the chosen kernel.

In the examples above, I've shown only np.convolve with mode='same'; for most cases dealing with sequence, I think this is a good choice, but it's best to consult the documentation for each specific case.

Finally, in cases where the kernel and the input array are very large, it may be better to use fft convolution from fftconvolve in scipy. I haven't used this much, I think it may require doing your own padding and chopping at the edges...

Wednesday, May 19, 2010

numpy + GDAL = agoodle

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle, a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to landsummary, a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers.

You can see the usage on the github page, but basically it's a single call that gives you a dictionary of raster grid code keys to area values--meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for raster_epsg and poly_epsg, so you can query with a polygon that's in a different projection than the raster.

The library could use some work. There's sketchiness about "inside", so the sum of the areas in the returned dictionary will not match the area of the polygon. Internally, we use matplotlib's nxutils, which does a very fast point-in-poly test for every cell in the raster that's within the bounding box of the polygon. Any cell that passes that test will be included in the results--all or nothing; there's no correction for the case where a cell is half in the polygon. This is not a problem unless the polygon is small relative to the size of the raster cells (or the perimeter of the query polygon is large). Patches welcome.

Check out the code; we made gdal, numpy and nxutils do all the hard work, but it comes together pretty well. If there are other opensource projects out there that do this well, let me know and I'll link to them. I'm posting this because, as far as I know, no others exist.

Saturday, April 10, 2010


Disclaimer: This is about a generic indexing method that fits in < 50 lines of code, while it's cool, I'm not suggesting it be used in place of a real project. The code is here and explained below.

Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python. And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys, which indexes fastq and fasta files via an sqlite db. Titus wrote about screed and posted to the biology-in-python mailing list, which is how I originally heard about it.

Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record).
In the case of flat files, it's a mapping from an id or name to the fseek file-position at the start of a record. Given that, it's possible to make a generic file indexer that creates an index given a file and a function that advances the file pointer (fh) to the next record and returns an id.

So for the case of the FASTQ format, which contains a new record every 4 lines, the first of which is the 'id', the parser could be:

def fastq_next(fh):
    id = fh.readline() # save the id.
    # advance the file handle 3 lines.
    fh.readline(); fh.readline(); fh.readline()
    return id

So regardless of the file format, this is the interface. The function takes a file handle and 1) advances the file position to the start of the next record and 2) returns the id.
Given that, the indexer call looks like:
FileIndex.create('some.fastq', fastq_next)

all that call has to do is repeatedly send a filehandle to fastq_next, accept the id returned by fastq_next, and save a mapping of id to the (previous) file position. An implementation detail is how that mapping is saved. I use a tokyo-cabinet BTree database.
Once the index is created, usage is:

fi = FileIndex(somefile, handler)
record = fi[somekey]

where handler is a callable (including a class) that takes a pointer and returns a thing. For fastq, it could be:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

so usage for fastq looks like:

>>> fi = FileIndex('some.fastq', FastQEntry)
>>> some_id = '@HWI-EAS105_0001:1:1:7:1680#0/1'
>>> record = fi[some_id]
>>> assert isinstance(record, FastQEntry)
>>> record.seq


So, from any name, there's direct access to each record. Given the implementation of FileIndex setup, it's actually possible to create and use an index for a fastq file with 9 total lines of code, 6 of which are the class itself:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

# note the 2nd argument takes a file handle and returns a name.
FileIndex.create('some.fastq', lambda fh: FastQEntry(fh).name)
fi = FileIndex('some.fastq', FastQEntry)
print fi['@HWI-EAS105_0001:1:1:7:1680#0/1'].seq


That's decent, but what makes it cooler is that this same interface can be used to implement an index on a SAM (Sequence Alignment/Map) format file:

class SamLine(object):
    def __init__(self, fh):
        line = fh.readline().split("\t") or [None]
        self.name = line[0]
        self.ref_seqid = line[2]
        self.ref_loc = int(line[3])
        # ... other SAM format stuff omitted.

f = 'some.sam'
FileIndex.create(f, lambda fh: SamLine(fh).name, allow_multiple=True)
fi = FileIndex(f, SamLine, allow_multiple=True)
print [(, s.ref_seqid, s.ref_loc) for s in fi['23351265']]

# output: [('23351265', '2', 8524), ('23351265', '3', 14202495)]

That's it. Note one extra thing: in an alignment represented in the SAM file, I used the read id (the first column) as the id for the indexing. That does not have to be unique, so I specify allow_multiple=True and it returns an array of records for every key. (So although using tokyocabinet is an implementation detail, its putcat() functions make it much easier to support multiple entries per id.)


For the SAM file I tested, the original file is 2.5G and the index created by FileIndex is 493M. For a large fastq file of 62.5 million lines (about 15.5 million records) the file is 3.3G and the index is 1.2G (compared to 3.8G for screed). It takes about 5 minutes to index the FASTQ file (compared to 11 for screed).
On my machine, with that FASTQ file, I get about 17,000 queries per second (including object creation). For the same fastq, screed gives about 10,000 queries per second (which matches the image in Titus' post).


The full implementation right now is 44 lines, including (sparse) comments, and is shown in the gist at the end of this post. It lacks a lot of nice things (like, um, tests) and I'm not recommending anyone use it in place of a well thought out project like the biopython stuff or screed. Still, I think the concept is useful--and I am using it for SAM files.
I've put the full code in bio-playground on github, including the examples from this post. But, this is generic enough that it could be used to index anything: blast-files, fasta, ... whatever. As always, I welcome any feedback.

About Tokyo Cabinet

In the course of this, I found major problems in 3 different python wrappers for tokyo-cabinet. I reported bugs to 2 of them. So, kudos to py-tcdb. While I don't like that I have to explicitly tell it _not_ to pickle what I send it, it's well tested, and the code is very nice and ... best of all, I haven't been able to make it segfault. It is also easy_install'able.
Another thing I learned today is that you can't use tokyo-cabinet hash for more than a couple million records. A web-search shows that lots of people have asked about this and the answers always have to do with adjusting the bnum bucket parameter. That does not work. If you can create a tokyo cabinet hash database with over 5 million records that does not slow on inserting, please show me how. Until I see it, I think one has to use the BTree database in TC.

Sunday, April 04, 2010

writing and building a lua c extension module

[Update: 2010-04-09]
This package is now on luarocks under its new name: "stringy". It now includes startswith and endswith methods (more coming). On the advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo.
I've been messing with the Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from python's.
In lua, all the c functions have a signature like:
int c_function(lua_State *L)
and you use the lua c API to pull values off the lua_state stack thingy -- L.
And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you.
Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua distribution itself. There are some docs but they were difficult for me to find and it's not clear which docs are for lua-5.1--the current version. So I'm including a full example with sourcecode, simple tests, Makefile, and luarock spec.
The full gist is here.
The C code (shamelessly stolen from the wiki and book -- not my own code) actually implements 2 very useful functions string.split and string.strip. These are otherwise easily added using lua's regular expression searches, but these are faster as they're not using the regexp machinery:
Note the functions are included in a struct and "registered" with the luaopen_stringext function.
The Makefile then builds the shared library, which is then immediately usable from the lua interpreter. In a test session, string.split() returns a table (array) of the tokens. Another cool thing about lua is visible in that session: the added functions actually become methods on Lua strings! So after importing stringext, all strings have strip() and split() methods! This is because of the line in stringext.c:
luaL_openlib(L, "string", stringext, 0);
which tells it to add the methods to the "string" module.
Finally, luarocks... Luarocks are the equivalent of python eggs or ruby gems (you gotta love all these clever names). They take a rockspec file (equiv of python's setup.{py,cfg}). Mine looks like this.
I've requested that it be added to the main luarocks repository; there doesn't seem to be a way to upload your own rock directly. Still, once you write the rockspec, you can build the C extension without a Makefile by typing
luarocks make
and it handles all the appropriate flags and such.
Anyone interested can download a tar-ball of the entire distribution here.
I'll add that the lua community seems very active and helpful. I asked a question about building the extension and received quick, helpful replies.

Tuesday, February 16, 2010

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format), which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts, and DBM's, for the most part, only allow querying by keys. Below are 2 cases where I've found a nice alternative.

Big ol' Matrix

Because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns, it's more than I want to push into an sqlite db--but it fits in a 1.9 gig file of 32 bit floats. Each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for more details on how the coexpression is calculated, see the link above; I'm using this dataset specifically so I don't have to think about the raw expression data, which is difficult to deal with]

I'm using numpy to create and then read that binary file (though it's just 32bit floats, so the format will work for any language). The index of a particular gene in the array can be set as its alphabetical order. So, to find the level of coexpression between the 13656th gene and the 12355th gene is arr[13657,12356]--or using pointer arithmetic if you're not using numpy. To find the average level of coexpression of the 13656th gene with all others is arr[13657].mean(). Looking up the index of each gene is a simple matter of reading in the file of gene names (already ordered alphabetically) and creating a dictionary lookup of gene => index. The one line of code to do that looks like this:

idx_lookup = dict((accn.rstrip(), i) for i, accn in enumerate(open(path + "/names.order")))
which can take an accn name like "At2g23450" and return its integer index.
Since the binary file is 1.9G of data, it's not ideal to read the entire thing into memory. But, it's simple to memory map it with numpy into a big ol' matrix/array:

L = len(idx_lookup)
coexp = np.memmap(path + '/coexp.bin', dtype=np.float32, shape=(L, L), mode='r')

so then, given a pair like: AT1G47770,AT2G26550
the code to find the coexpression is:

ai = idx_lookup["AT1G47770"]
bi = idx_lookup["AT2G26550"]
ce = coexp[ai, bi]

this is stupid fast. Memory-mapping the array, reading in the idx_lookup, and doing 6294 queries takes 0.17 seconds on my machine--and much of that is the one-time cost of reading in idx_lookup. The code for all of this is readable here.
and svn'able via:
svn checkout

That includes the code to get the data, convert it to the .bin file, and serve it as a web script (coexp.wsgi).
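The .bin file itself is simple to produce with numpy; a toy roundtrip (hypothetical filename, tiny matrix) looks like:

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.gettempdir(), "coexp_demo.bin")

L = 4
mat = np.arange(L * L, dtype=np.float32).reshape(L, L)
mat.tofile(path)  # raw 32-bit floats: readable from any language

# map it back without reading the whole file into memory
coexp = np.memmap(path, dtype=np.float32, shape=(L, L), mode="r")
```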


In most eukaryote organisms there are on the order of 30K genes (varying from maybe a couple thousand up to 60K or so). Each gene can have multiple sub-features: CDS coding regions, mRNAs, UTR's, introns, etc. So it's a pretty small amount of data, but large enough and with enough structure that it's nice to have it in an efficient data structure. After five years, I think I am finally converging on a good abstraction. It holds 1 feature per row in a subclass of a numpy array. This hides some complexity arising from the fact that each gene feature has multiple exons and each gene can have a different number of exons (gff uses one row per cds * gene * mRNA * exon). Numpy has typed data "fields" which are something like columns. Common types are floats and ints of various precisions and python Objects. So, the exon starts and stops are stored as a list of python integers in an object field, and that list is stored as the 'locs' field in each row of the numpy array. The .flat file format looks like this:
#id  chr accn    start   end    strand  ftype   locs
1 1 AT1G01010 3631 5899 + CDS 3760,3913,3996,4276,4486,4605,4706,5095,5174,5326,5439,5630

where the columns should be mostly understandable from the header. `start` is the left-most position of the feature on the genome and `end` is the right-most. `ftype` tells the type of the feature(s) in the following column, `locs`, which is a list of starts and stops. Most often ftype is `CDS` (or exon) for coding sequence--the part of the gene that gets translated. From there, it's possible to take advantage of the speed and syntax of numpy:

>>> from flatfeature import Flat
# note also attaching the genomic fasta sequence.
>>> flat = Flat('data/thaliana_v8.flat', 'data/thaliana_v8.fasta')
# so can get coding sequence.
>>> cds_seq = flat.row_cds_sequence('AT1G01370')
# find all the gene names in a given chromosome, region:
>>> [gene['accn'] for gene in flat.get_features_in_region('1', 5000, 7000)]
['AT1G01010', 'AT1G01020']

# how many features on chromosome 4 are on the - strand:
>>> flat[(flat['seqid'] == '4') & (flat['strand'] == '-')].shape

The row_cds_sequence method, while an ugly name, shows the advantage of tying the fasta to the Flat object--it allows extracting the genomic sequence from the fasta based on the coordinates in each row. It also highlights a problem with this data structure--what about alternative splicings, which are very common? For our lab, we always use the union of all coding sequences, so I report the resulting intervals as the 'locs' column. I intend to improve the flexibility of how that is handled in the code, if not the file.

As a side note, the final example in that session shows how you can think of the numpy slicing syntax as a sort of SQL-like selection. so: flat[(flat['seqid'] == '4') & (flat['strand'] == '-')] reads like: "select * from flat where seqid = '4' and strand = '-'", but there's no database, it's all in memory, and the work is all done in C, very quickly. Actually, I think flatfeature and the huge matrix could both classify as NoSQL.
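The structured-array idea can be sketched independently of flatfeature itself; the field names and dtypes below are my assumptions for illustration, not flatfeature's exact schema:

```python
import numpy as np

dtype = [("id", int), ("seqid", "U8"), ("accn", "U16"), ("start", int),
         ("end", int), ("strand", "U1"), ("ftype", "U8"), ("locs", object)]
rows = [(1, "1", "AT1G01010", 3631, 5899, "+", "CDS", [3760, 3913, 3996, 4276]),
        (2, "4", "AT4G00010", 100, 900, "-", "CDS", [120, 300])]
flat = np.array(rows, dtype=dtype)

# "select * from flat where seqid = '4' and strand = '-'"
sel = flat[(flat["seqid"] == "4") & (flat["strand"] == "-")]
```

The boolean-mask selection runs entirely in C over the whole array, which is where the SQL-without-a-database feel comes from.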

The code for flatfeature is on github; it includes an example .flat file for arabidopsis thaliana in the data directory. I'll be tinkering with that for some time. As with all my code, any forks, patches, suggestions or ridicule will be welcome (in that order of preference). Finally, I'll add that for many cases genometools works great; they do a nice job of wrapping GFF for a number of languages including python. But for some things GFF is too much trouble.

Monday, January 18, 2010

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':
  • CG: a 'G' follows the 'C'

  • CHG: anything but a 'G' follows the 'C' and a 'G' follows that

  • CHH: no 'G' in the 2 subsequent nucleotides.

So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this post explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.


Given a numpy array `seq`, it's possible to find all the 'C's with `np.where(seq == 'C')`. From there, one can calculate the methylation type without looping or if'ing with:
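The gist of the '+'-strand part looks like this sketch (toy sequence; the full version also handles the sequence ends and the reverse strand):

```python
import numpy as np

seq = np.array(list("CCGACTTCGA"))  # toy sequence

# only consider C's with at least 2 following nucleotides (ends handled elsewhere)
(c_idx,) = np.where(seq[:-2] == "C")
cg = seq[c_idx + 1] == "G"                    # CG: a G follows the C
chg = (~cg) & (seq[c_idx + 2] == "G")         # CHG: not G, then a G
mtype = np.where(cg, 1, np.where(chg, 2, 3))  # 3 == CHH: no G in next 2
```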

which potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at, using slicing and indexing.
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation, and call it from this function:

which does some work to handle the ends of the sequence and reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:

0: not a C or G
1: CG at a 'C' (+ strand )
2: CHG at a 'C' (+ strand )
3: CHH at a 'C' (+ strand )
4: CG at a 'G' (- strand )
5: CHG at a 'G' (- strand )
6: CHH at a 'G' (- strand )

The full implementation is here. (even the reverse-complement is done without loops or ifs)

Sunday, January 10, 2010

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance.


It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.


Placement of Gaps

Some of the "bugs" were actually just "unspecified behavior". For example, given input sequences "AGEBAMAM" and "AGEBAM", multiple alignments have the same score. However, the version with the gaps grouped together is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequences. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the scoring matrix.

Gap Extension

nwalign allows use of a scoring matrix to look up the score/penalty for conversion from one letter to another, or a simpler version where the user specifies gap_open, gap_extend, and mismatch penalties and a match reward. For the simpler non-matrix path, earlier versions of nwalign did not heed the gap_extension--using the gap_open penalty regardless of the previous entries in the dynamic programming matrix. It's common to use only a gap penalty without a separate gap_extend penalty; nwalign no longer assumes that's what is preferred--if it is, one can simply specify a `gap_extend` penalty that is equal to `gap_open`.


Previous versions of nwalign only returned the alignment and didn't provide access to the score of the alignment. Recent versions allow the user to get the score via:

>>> import nwalign as nw
>>> nw.score_alignment('CEELECANTH', '-PELICAN--', gap_open=-5,
... gap_extend=-2, matrix='PAM250')
11

where `11` is the score of the alignment, given the gap_open and gap_extend penalties and the substitution scores in the file 'PAM250' (which is in the NCBI substitution matrix format).


The earliest version of nwalign was literally about 100 times faster than the perl version from the BLAST book. But there were a few places where performance has improved even more since then. The most dramatic was in the lookups for the scoring matrix values. Given, again, the 2 short strings "AGEBAMAM" and "AGEBAM", the alignment algorithm has to do the inner loop 48 (6 * 8) times, and in each of those loops it has to look up the substitution score in the matrix. For example, in the first iteration, it has to find the score for "changing" from an "A" to an "A"; in the PAM250 matrix, that score is +2. In the substitution matrix, the row and column "keys" are the amino acids and the values are the scores for changing from the row key to the column key. Previously, I stored the keys in their own array, then did a linear search to find the index. So, for both the row and column, an amino acid letter is translated to an index with a function like:

aa_keys = ['A', 'R', 'N', 'D', ... ] # or 'ARND...'

def findpos(amino_acid, aa_keys):
    i = 0
    while i < len(aa_keys):
        if aa_keys[i] == amino_acid:
            return i
        i += 1
    return -1

except in C(ython), so it was much faster; the alignment then uses the returned row and column indexes to look up the score from the substitution matrix. When the strings are long enough--as they will be with real proteins--this is a huge speed bottleneck. So, the new version trades memory for speed. Instead of using a 25 * 25 matrix to store the substitution matrix, it now uses an X * X matrix where 'X' is the ord value of the largest amino acid letter. Usually this will be no more than 'Z', so the matrix will be 90 * 90, stored efficiently as a numpy array. From there, the lookup is directly matrix[ord(x_amino), ord(y_amino)], which is extremely fast in C because the ord is unnecessary--a char can index an array. This gave at least a 3X speed improvement. I could reduce the memory used by the matrix by subtracting ord('A'), but that would be trading speed for memory since it would require 2 extra subtractions per inner loop.
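The lookup scheme can be sketched like so (the scores filled in are just examples, not a full PAM250 matrix):

```python
import numpy as np

# 91 x 91 so that ord(letter) for anything up to 'Z' (ord 90) indexes directly
matrix = np.zeros((91, 91), dtype=np.int32)
matrix[ord("A"), ord("A")] = 2   # e.g. PAM250 scores A->A as +2
matrix[ord("A"), ord("R")] = -2

def score(a, b):
    # no linear search over aa_keys: the character's ord *is* the index
    return matrix[ord(a), ord(b)]
```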

Also, the latest version uses less memory in other areas. In order to do the dynamic programming, the algo has to save N * M arrays of direction "pointers" (in the left/right/diagonal sense, not the C sense), scores, and gaps. However, the gaps are not actually needed for the entire trace-back; they are only needed 1 step back, to determine whether a current gap is a gap_open or gap_extend. So now, instead of an N * M matrix for the gaps, it's N * 1 and M * 1 arrays. For large N * M, this is enough to offset the increased memory incurred by how the substitution matrix is stored.

Running a benchmark test script with nwalign-0.1 takes 12.13 seconds and the same script on nwalign-0.3.1 takes 3.41 seconds (and actually runs in 2.99 seconds with unladen-swallow... but that's another story). The script does an alignment on 1200 * 1600 basepair sequences 100 times. As with most things, I figured most of this out by banging my head against the wall long enough that the stars aligned, so to speak; anyone who cares to look at the code and offer suggestions would be much appreciated.

Monday, January 04, 2010

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta. One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends.
With this option, a file like:


will be overwritten (flattened in-place) to:


Simple. When dealing with large files (where this is actually useful), the flattened file does not behave well when opened in an editor, because the editor will attempt to read a number of lines into the buffer and a single line may be 200 mega-bases. So, this is a pain if you're planning to sit down with a cup of joe and read through the genome, but otherwise, the fasta file should be unaffected.
This method is not currently the default (though it may be so in future versions). But, it's possible to use the commandline:
pyfasta flatten some.fasta
which will create the flattened fasta (and the index file) and a placeholder some.fasta.flat, containing the text "@flattened@" as a marker to pyfasta that it's ok to use the original (now-flattened) fasta. Once the file is flattened, there is no performance loss compared to having a separate flat file containing no headers.
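What flattening does can be shown in a few lines (my own sketch, not pyfasta's actual code):

```python
from itertools import groupby

def flatten_fasta(text):
    """Collapse each record's sequence onto one line, keeping the headers."""
    out = []
    for is_header, lines in groupby(text.splitlines(),
                                    lambda line: line.startswith(">")):
        if is_header:
            out.extend(lines)
        else:
            out.append("".join(lines))
    return "\n".join(out) + "\n"
```

So `">a\nACT\nGGT\n"` becomes `">a\nACTGGT\n"`: still valid FASTA, but each record's sequence is a single line with a known start and end offset.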

pyfasta was a fun project for me in 2009. It's a ridiculously simple little module, but when I started it, I didn't think there was a good alternative. (Though discriminating Fasta-ers should look at the sequence module in pygr, and the Bio.Seq module in BioPython which I think has improved quite a lot recently). It has over 100 tests and very close to 100% test coverage for the modules in pyfasta, and much of the code is run once for each of the 4 backends.

The source is on bitbucket.

Wednesday, December 23, 2009

genome scrubber : mask repetitive sequence

This is to describe a simple tool I've made available (svn repo) for masking repetitive sequence.

rice (Oryza sativa) version 5 sequence looks something like below when run through pyfasta info.

>1 length:43596771
>3 length:36345490
>2 length:35925388
>4 length:35244269
>6 length:31246789
>5 length:29874162
>7 length:29688601
>11 length:28462103
>8 length:28309179
>12 length:27497214
>9 length:23011239
>10 length:22876596

372.078M basepairs in 12 sequences

So, it's not huge (still only 1/10th the size of human), but it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA"; while that's not true, it does make whole-genome analyses a pain, as the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command:
/usr/bin/blastall -K 100 -a 8 -d rice.fasta -e 0.001 -i rice.fasta -m 8 -o rice_rice.blast -p blastn

nearly causes my machine to run out of memory (16G), takes a couple of days to run, and results in a blast output of 5.1G and 84 million rows--that's 84 million blast hits with an e-value below 0.001! By definition, that output is dominated by the repetitive elements. Repetitive elements are interesting, but in the case where we want to look at synteny, we have to wade through that 5.1G of stuff to find the very small chunk of data we need. This adds time to run the sequence comparison, time to parse, time to plot, time to analyze, data to store, etc.
The solution we use in our lab is to create a "masked" sequence where we 'X' out any DNA sequence occurring more than a given number of times in the original blast. So, from the output of the blast above, any basepair that is covered more than 50 times is hard-masked with 'X'. The single script to run this is here. Available from svn via:
svn checkout

when run as a script with no options, it will print some help text. To mask rice at 50X, it is run as:
python -b rice_rice.blast -f rice.fasta -o rice -c 50 -m X

This hard-masks with "X". To soft-mask (basepairs are upper-cased, except those covered more than 50 times, which are lower-cased), one can use the command-line option -m SOFT.
The resulting file: "rice.masked.50.fasta" has the same number of chromosomes, each with the same number of basepairs, but with repetitive regions masked. To show the difference, I re-ran the full-genome blast, except this time on the masked genome:
/usr/bin/blastall -K 100 -a 8 -d rice.masked.50.fasta -e 0.001 -i rice.masked.50.fasta -m 8 -o ricemasked.blast -p blastn
The resulting blast file is only 128MB and 2 million rows (contrast with 5.1G and 84 million rows)--a blast file that's 2.4% of the original size. This makes things simpler, and makes doing genomic analyses more enjoyable. This is a different tool from the dust filter that comes with BLAST and many others, because it masks based on the global coverage at each location (there are similar tools).



In order to do the full-genome mask, the script stores an array for each chromosome, with a length equal to that of the chromosome, in a pytables/hdf5 structure which acts like a numpy array. So, to increment the counts for a blast hit, the code looks like:

cache[qchr][qstart - 1: qstop] += 1
cache[schr][sstart - 1: sstop] += 1

where cache is the hdf5 structure (except in some cases it's read into memory to improve speed), 'qchr' and 'schr' are the query and subject chromosome keys whose values are the count arrays, and 'qstart'/'sstart', 'qstop'/'sstop' are the bounds of the query and subject for a particular blast line. This sends much of the work to numpy, instead of iterating over the range of qstart, qstop in python. That is repeated for every row in the blast file.
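As a self-contained illustration of that counting step, here is the same vectorized increment with plain numpy arrays standing in for the pytables/hdf5 cache (the chromosome lengths and blast rows are made up for the example):

```python
import numpy as np

# chromosome name -> per-basepair hit counts (stand-in for the hdf5 cache)
lengths = {"1": 100, "2": 80}
cache = {chrom: np.zeros(n, dtype=np.uint32) for chrom, n in lengths.items()}

# (qchr, qstart, qstop, schr, sstart, sstop) parsed from tabular (-m 8)
# blast rows; coordinates are 1-based and inclusive, as in blast output
hits = [
    ("1", 10, 30, "2", 5, 25),
    ("1", 20, 40, "1", 60, 80),
]

for qchr, qstart, qstop, schr, sstart, sstop in hits:
    # one vectorized increment over the whole hit span per side
    cache[qchr][qstart - 1:qstop] += 1
    cache[schr][sstart - 1:sstop] += 1

print(cache["1"][9:40].max())  # -> 2 (basepairs 20-30 are hit twice)
```

After all rows are processed, any position whose count exceeds the cutoff is a candidate for masking.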


Once all of the counts are created, the script uses that data to mask out the original sequence for any element in count that is greater than the cutoff. That is achieved in a single line iterated per chromosome:
masked_seq = np.where(numexpr.evaluate("hit_counts > %i" % cutoff), mask_value, seq).tostring()

where `np` is from "import numpy as np", `seq` is the original fasta sequence in upper case, `mask_value` is either a constant, like 'X', or, in the case of soft-masking, a lower-case copy of `seq`, and `hit_counts` is the array of counts. The only tricky bit there is the use of numexpr, which makes things a bit faster. The `masked_seq` is written to file after its corresponding header, and the masking is performed for the next chromosome. The entire masking takes under 10 minutes on my machine.
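The same masking step can be written with plain numpy, dropping the numexpr speedup; a minimal sketch on made-up inputs, showing both the hard and soft variants:

```python
import numpy as np

cutoff = 2
# sequence as an array of single bytes, plus its per-basepair hit counts
seq = np.frombuffer(b"ACGTACGT", dtype="S1")
hit_counts = np.array([0, 1, 3, 5, 1, 0, 4, 2], dtype=np.uint32)

# hard mask: basepairs covered more than `cutoff` times become 'X'
hard = np.where(hit_counts > cutoff, b"X", seq).tobytes()

# soft mask: the same basepairs are lower-cased instead
lower = np.char.lower(seq)
soft = np.where(hit_counts > cutoff, lower, seq).tobytes()

print(hard.decode())  # -> ACXXACXT
print(soft.decode())  # -> ACgtACgT
```

(`.tobytes()` is the modern spelling of the `.tostring()` call used above.)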


Generally, we use the masked fasta file as the final output, but there's useful information in the hdf5 file--a sort of measure of repetitiveness at each basepair in the genome. Also in the svn repo is a short matplotlib web script (for mod_wsgi/mod_python, but it can be modified to run as cgi or command-line) to generate an image given a request like:

where 'org' is the name of the organism sent to the script, `xmin`, `xmax` specify the extents in basepairs, and `seqid` the chromosome.
Here's an example from our genome viewer (built on openlayers) where I've added copy count as a layer (with admittedly poor cartography). The nice gene renderings and basepair location ruler are courtesy of Eric Lyons.
The height of the blue line indicates the copy count. The flat red line is at a copy count of 50. So, the genes in this region are repetitive, but it's possible to see that they are surrounded by long terminal repeats, and that there is some repetitive sequence in the introns (the gray between the big green exons). Scrolling along, openlayers-style, it's possible to see patterns, and spot transposons.


I've made a few genomes available and will be adding more. There's also a (currently undocumented) service to plot the genome. So if the organism is brachy and the chromosome of interest is Bd1 then:

will result in:

where the image is described above.

If you're doing full-genome analyses, give it a try and let me know any suggestions. There's a full example starting from only a genomic fasta, including the BLAST command, in the README.rst included in svn. If you get a genome that's too big to run through a full-genome BLAST (maize, ahem), you can split it with "pyfasta -split -n 4", run 4 blasts, concat the results, and send through