Friday, October 22, 2010

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.

Bloom Filters

A Bloom Filter is a memory-efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module Bloom::Filter. It has a nice API and seems very fast. It allows specifying the false positive rate, and as with any bloom filter, there's a tradeoff between the amount of memory used and the expected number of false positives.
The code for that wrapper is in my github, here.

Big Data

It's a common request to filter out repeated reads from next-gen sequencing data; just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: just read the records into a dict/hash structure and use that to find duplicates and print out only the best or first or whatever. However, once the data is truly "next-gen", this is less useful. For example, someone kindly reported a bug on my simple c++ filter because he had 84 gigs of reads and was running out of memory. And that is a bug: anything that's supposed to deal with next-gen sequences has to deal with data at that scale.

As a side note, my friend/co-worker came up with an elegant solution for filtering larger files: split into 4 files based on the first base in the read (A, C, T, or G), then filter each of those files to uniqueness and merge. I like that approach, and as file sizes grow it could be extended to separate into 16 files by reading the first 2 bases. But it does require a lot of extra file I/O.
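That split step is simple enough to sketch. This is a hypothetical helper (the function name and the assumption that no read starts with 'N' are mine), not my co-worker's actual code:

```python
from itertools import islice
import os

def split_by_first_base(fastq, outdir="."):
    # one output file per possible first base of the read
    outs = dict((b, open(os.path.join(outdir, b + ".fastq"), "w"))
                for b in "ACGT")
    with open(fastq) as fh:
        while True:
            record = list(islice(fh, 4))  # a fastq record is 4 lines
            if not record:
                break
            # record[1] is the sequence line; assumes no leading 'N'
            outs[record[1][0]].writelines(record)
    for out in outs.values():
        out.close()
```

Each of the 4 files can then be filtered to uniqueness independently and the results concatenated.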

Just Do It

So, I wrote a simple script that filters to unique FASTQ reads using a bloom-filter in front of a python set. Basically, only stuff that is flagged as appearing in the bloom-filter is added to the set. This trades speed--it iterates over the file 3 times--for memory. The amount of memory is tunable via the specified error-rate. It's not pretty, but it should be simple enough to demonstrate what's going on. It only reads from stdin and writes to stdout, with some information about total reads and the number of false positives in the bloom-filter sent to stderr.
usage looks like:
python < in.fastq > out.unique.fastq
On my machine, a particular run with a decent sized file looks like this:

and here's the code:
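The gist isn't embedded here, but the idea can be sketched as below. This is a toy stand-in (a hand-rolled bloom filter and two passes over the sequences instead of the script's three, with no FASTQ handling), not the actual script:

```python
import hashlib

class BloomFilter(object):
    """A toy bloom filter: nhashes positions per key into a bit array."""
    def __init__(self, size=1000000, nhashes=4):
        self.size = size
        self.nhashes = nhashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, key):
        for i in range(self.nhashes):
            h = hashlib.md5(("%d:%s" % (i, key)).encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

def find_candidates(seqs, bf):
    # pass 1: anything the bloom filter has already "seen" is a possible
    # repeat (or a false positive); only those go into an exact set.
    candidates = set()
    for seq in seqs:
        if seq in bf:
            candidates.add(seq)
        else:
            bf.add(seq)
    return candidates

def unique(seqs, candidates):
    # pass 2: exact de-duplication, tracking only the candidate set,
    # which is far smaller than the full set of unique reads.
    seen = set()
    for seq in seqs:
        if seq in candidates:
            if seq in seen:
                continue
            seen.add(seq)
        yield seq
```

Note that a bloom-filter false positive only costs a little extra memory in the candidate set; the output is still exactly de-duplicated.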

Monday, September 20, 2010

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it.

I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains.
At the end of this post is the code for a simple python script that clips an adaptor sequence and trims low-quality bases from paired end reads. It simply calls the fastx toolkit (which is assumed to be on your path): it uses fastx_clipper if an adaptor sequence is specified, pipes the output to fastq_quality_trimmer for each file, then loops through the filtered output and keeps only reads that appear in both. Usage is something like:
Where the -a (adaptor), -M (length of adaptor match), -t (min quality threshold), and -l (min length after quality trimming) options are copied directly from (and sent directly to) the fastx toolkit. --sanger indicates that the reads have fastq qualities in the sanger encoding. If that option is not specified, qualities are assumed to be in illumina 1.3 format, where the ascii offset is 64.
This example will create 2 new files: en.wt.1.fastq.trim and en.wt.2.fastq.trim each with the same number of corresponding records that pass the filtering above.
As described in my previous post, sometimes there are multiple adaptor sequences in the reads. This script can filter out any number of adaptors--specified in a comma delimited option -a--in a single run.
It's not too pretty, but it does the job:
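The heart of the pairing step--keep only reads whose mate also survived filtering, written out in matching order--looks something like this sketch (the '@name/1', '@name/2' id convention and the function names are my assumptions, not the script's actual code):

```python
def sync_pairs(trim1, trim2, out1, out2):
    def records(path):
        # map read id (without the /1 or /2 suffix) -> 4-line record
        recs = {}
        with open(path) as fh:
            while True:
                rec = [fh.readline() for _ in range(4)]
                if not rec[0]:
                    break
                recs[rec[0].strip().split("/")[0]] = rec
        return recs

    r1, r2 = records(trim1), records(trim2)
    with open(out1, "w") as o1, open(out2, "w") as o2:
        # dicts preserve file order on python 3.7+, so the nth record
        # in out1 is the mate of the nth record in out2
        for rid in r1:
            if rid in r2:
                o1.writelines(r1[rid])
                o2.writelines(r2[rid])
```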

As always, let me know of any feedback.

Sunday, September 12, 2010

ngs / high-throughput sequencing pipeline

This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know.
I'll put it in the form of a shell script that assumes you've got this software installed.
I'll also assume your data is in the FASTQ format. If it's in illumina's qseq format, you can convert to FastQ with this script by sending a list of qseq files as the command-line arguments.
If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply.
This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac.


The programs and files will be declared as follows.

#files. (you'll want to change these)
Most of the following will run as-is for any set of reads, you'll only need to change the FASTQ and REFERENCE variables above.

Seeing the Reads

If you have a file with 2GB of reads, you likely can't just read (pun intended) it and get an idea of what's going on -- though I've tried. There are a number of tools to give you a summary of the data including stats such as quality per base, nucleotide frequency, etc. While fastx toolkit will do this for you, I've found fastqc to be the best choice.
The command:
$fastqc $FASTQ
will write a folder "a_fastqc/" containing the html report in fastqc_report.html
Here's an example of the nicely formatted and informative FastQC report before quality filtering and trimming (scroll to see the interesting stuff):

from that, we can see that (as with all illumina datasets) quality scores are lower at the 3' end of the read. For this analysis, I'll choose to chop bases with a quality score under 28. In addition to the other information, we can see that there is an Illumina Primer still present on many of the reads, so we'll want to chop that out in the filtering step below.

Filtering the Reads

The fastx toolkit is not extremely fast, but it has a simple command-line interface for most common operations I use to filter reads (though it--like similar libraries--lacks support for filtering paired-end reads). For this case, we want to trim nucleotides with quality less than 28 from the ends of each read and then remove any read left shorter than 40 bases. In addition, we want to clip the Illumina adaptor sequence. This can be done by piping 2 commands together:
# the sequence identified by FastQC above:
fastx_clipper -a $CLIP -i $FASTQ -M 30 | fastq_quality_trimmer -t 28 -l 40 > ${FASTQ}.trim
both fastx_clipper and fastq_quality_trimmer are provided in the fastx toolkit.
The first command clips the adaptor sequence from $FASTQ and pipes the output to the 2nd command, fastq_quality_trimmer, which chomps bases with quality less than 28 (-t), discards any read with a remaining length less than 40 (-l), and sends the output to a .trim file.

Seeing the filtered Reads

We then want to re-run fastqc on the trimmed, clipped reads
The command:
$fastqc ${FASTQ}.trim
and the output looks like:

where we can see the quality looks much better and there are no primer sequences remaining.
This took the number of reads from 14,368,483 to 11,579,512 (and many shorter--trimmed--reads in the latter as well).


Up to this point, the analysis will have been very similar for RNA-Seq, BS-Seq, and genomic reads, but you'll want to customize your filtering steps. The following will go through a simple alignment using bowtie that assumes you have genomic reads.


The first step is to build the reference bowtie index:
${bowtie_dir}/bowtie-build --quiet $REFERENCE bowtie-index
That will create the bowtie index for your reference fasta in the bowtie-index/ directory. It can take a while so you may want to download the pre-built indexes from the bowtie website if your organism is available.
Then, you can run the actual alignment as:
${bowtie_dir}/bowtie --tryhard --phred64-quals -p 4 -m 1 -S bowtie-index -q ${FASTQ}.trim ${FASTQ}.trim.sam
which tells bowtie to try hard (--tryhard), assume quality scores are the latest schema from illumina (--phred64-quals), use 4 processors (-p 4), discard any reads that map to more than one location in the reference (-m 1), use SAM output format (-S), and then where to find the bowtie index and the reads. The output is sent to ${FASTQ}.trim.sam.
We've done a lot of experimenting with different values for -m, and it can affect your results, but -m 1 seems a common choice in the literature. And it's clearly less sketchy than, for example, -m 100, which would report up to 100 alignments for any read that maps to 100 or fewer locations in the genome.

View the Alignment

From there, we want to view the alignment. Most tools can handle SAM formatted files, but will perform better with an indexed bam. To get that, we do:
# view with samtools, filter out unmapped reads, and convert to a sorted, indexed bam.
${samtools} view -bu -S -F 0x0004 ${FASTQ}.trim.sam -o ${FASTQ}.trim.unsorted.bam
${samtools} sort ${FASTQ}.trim.unsorted.bam ${FASTQ}.trim
${samtools} index ${FASTQ}.trim.bam
where now ${FASTQ}.trim.bam is sorted and indexed and contains only mapped reads (the .sam file from bowtie contains unmapped reads).
You can have a quick look at the alignment stats with:
${samtools} flagstat ${FASTQ}.trim.bam
and you can see an awesome ascii curses view of the alignment with:
${samtools} tview ${FASTQ}.trim.bam
to get something that looks like:

But you'll probably want to use a "real" viewer such as IGV, MochiView, or Tablet, all of which I have had some success with.

Note that you may want to choose another aligner, because bowtie does poorly with indels. Something like GSNAP will be better suited for reads where you have more complex variants.
I welcome any feedback on these methods.

Wednesday, July 14, 2010



Since starting the methylcoder project, I've been using the bowtie short read aligner. It's very fast, uses very little memory, aligns Illumina, SOLiD, and colorspace reads, and has enough options to keep you busy (including my favorite: --tryhard).
There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle.
  1. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and so is not a problem for most users, but for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the reference and puts the human genome past that limit. The "solution" is to split the reference and create multiple indexes. But, with that setup, when requesting reads that map uniquely to a single best location, it's only guaranteed that the mapping is unique to each index independently and post-processing is required. This is also a problem for any large genome that will not fit into a bowtie index.
  2. Bowtie does not handle indels very well. Even the other popular aligners SOAP and BWA can only handle very short indels (up to about 3 bp) while MAQ and SOAP2 do not align reads with indels.


GSNAP: Genomic Short-read Nucleotide Alignment Program (from Thomas Wu at genentech) addresses both of those short-comings and seems reasonably fast. For the reference index, rather than use a suffix-array, GSNAP uses a hash of 12-mers sampled every 3 basepairs (by my reading, sampling complicates the search somewhat but reduces the size of the index). Since it can then use 12-mer seeds to span gaps, it's better at dealing with indels and can also be used to map RNA-seq data to a reference genome, with or without known, annotated splice sites! In addition, it can take known SNPs and add them to the 12-mer hash index so that reads mapping over a SNP will not be wrongly accused of having a mismatch.
Finally, it can map bisulfite-treated reads to the genome without requiring the C=>T conversion--using the same re-indexed genome.
Its input format is a bit wonky for paired end reads. Normally, paired-end reads are specified in 2 separate files: file_1.fastq, file_2.fastq, with the header (and order) indicating the pairing. GSNAP requires a single FASTA file (it will not accept FASTQ) with format:
where the first sequence is from the file_1.fastq and the 2nd is from file_2.fastq.
As of yesterday, MethylCoder supports GSNAP as an aligner, in addition to Bowtie. (This addition was quite simple since GSNAP does all the work anyway). MethylCoder includes a script to convert paired-end FASTQ files to GSNAP format. The script simply takes as arguments paired end FASTA or FASTQ files and puts them in a single file of the GSNAP paired end format. It's almost too simple to worry about, but I needed it so it's there...
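A sketch of what that conversion amounts to (my reconstruction of the idea, not the MethylCoder script itself; note the qualities are necessarily discarded since GSNAP won't accept FASTQ):

```python
def fastq_to_gsnap(fq1, fq2, out):
    # one FASTA record per pair: a header, then the end-1 sequence,
    # then the end-2 sequence on the following line.
    with open(fq1) as a, open(fq2) as b, open(out, "w") as o:
        while True:
            header = a.readline()
            if not header:
                break
            seq1 = a.readline().strip()
            a.readline(); a.readline()   # skip '+' and quality lines
            b.readline()                 # end-2 header (same name)
            seq2 = b.readline().strip()
            b.readline(); b.readline()
            name = header[1:].strip().split("/")[0]
            o.write(">%s\n%s\n%s\n" % (name, seq1, seq2))
```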

The GSNAP paper is an interesting read. Ever heard of "galloping binary search"? I hadn't, but apparently it's used in python's timsort.

Monday, June 07, 2010

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics.

This post involves grouping items either using a disjoint set or python's
itertools.groupby. I was introduced to both of these
by the post-doc who sits behind me.


Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook.
The slightly modified implementation we use is here.
Where the docstring gives a decent idea of usage:
   This class provides a lightweight way to group arbitrary objects
   together into disjoint sets when a full-blown graph data structure
   would be overkill. 

   Objects can be joined using .join(), tested for connectedness
   using .joined(), and all disjoint sets can be retrieved using list(g)
   The objects being joined must be hashable.

   >>> g = Grouper()
   >>> g.join('a', 'b')
   >>> g.join('b', 'c')
   >>> g.join('d', 'e')
   >>> list(g)
   [['a', 'b', 'c'], ['d', 'e']]
   >>> g.joined('a', 'b')
   True
   >>> g.joined('a', 'c')
   True
   >>> 'f' in g
   False
   >>> g.joined('a', 'd')
   False
Basically the idea is: if there's an explicit connection between `A` and `B`, and an explicit connection between `A` and `C`, the Grouper class will create a connection between `B` and `C`. In the implementation above, the explicit connections are created with `Grouper.join()`. For genomics stuff, we can use this in many places; one simple use-case is finding local duplicates. These are also called tandem duplications and are identified as a group of genes with very similar sequence co-occurring in a local chromosomal region. We want to group these into a single "family" with all members, even if there is not an explicit connection between all genes--due to slight sequence divergence or sequence alignment (BLAST) oddities.
A snippet of code from here (more nice code from Haibao) shows the use of a grouper to find syntenic regions by checking if adjacent members of (an already sorted list) are within a certain distance (window) of each other:
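The snippet isn't reproduced here, but the pattern is easy to show with a minimal dict-based stand-in for the Grouper class (this is not the matplotlib.cbook implementation, just enough to demonstrate the windowing idea):

```python
class Grouper(object):
    """Minimal disjoint-set: join() connects, iteration yields the sets."""
    def __init__(self):
        self._map = {}

    def join(self, a, b):
        sa = self._map.setdefault(a, set([a]))
        sb = self._map.setdefault(b, set([b]))
        if sa is not sb:
            sa |= sb
            for item in sb:
                self._map[item] = sa

    def joined(self, a, b):
        return a in self._map and self._map[a] is self._map.get(b)

    def __iter__(self):
        seen = []
        for s in self._map.values():
            if not any(s is t for t in seen):
                seen.append(s)
                yield sorted(s)

def window_groups(positions, window):
    # positions is already sorted; adjacent members closer than
    # `window` fall into the same group, and transitivity chains
    # overlapping windows together.
    g = Grouper()
    for a, b in zip(positions, positions[1:]):
        if b - a <= window:
            g.join(a, b)
    return g
```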

after this, the elements that are grouped in the `g` Grouper object are in the same window (though not strictly syntenic).
This is a very nice abstraction for anything where you have groups of related objects. It reduces the accounting involved because once you have `join`ed all the elements, querying the Grouper object with any element will return all elements in the group to which it is joined.


Python's itertools has a number of useful features. `groupby` is a means to partition an iterable into groups of adjacent, similar items, where `similar` is determined by a function you supply to groupby. The example from the docs is:

but it's also possible to specify an arbitrary grouping function, so grouping capital letters together:

So the `k`ey tells us what the grouping function returned, and each group contains the adjacent items from the list that returned that same key. Because only adjacent items are grouped, we often want to sort the input before using groupby; then we can get the full groups and avoid an extra nested for-loop that we would otherwise need for filtering.
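The gists aren't embedded here; a tiny illustration of both points (adjacency-only grouping, and sorting first when groups are scattered) with made-up data:

```python
from itertools import groupby

# groupby only merges *adjacent* items: note the two separate
# lower-case runs below.
letters = "aAbBBc"
runs = [(k, "".join(g)) for k, g in groupby(letters, key=str.isupper)]
# k is whatever the key function returned for every item in that run

# scattered groups need a sort (by the same key) first:
words = ["apple", "cherry", "avocado", "banana"]
words.sort(key=lambda w: w[0])
by_letter = dict((k, list(g)) for k, g in groupby(words, key=lambda w: w[0]))
```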
An example use is grouping BLAST hits by query and subject. Here, BlastLine is a simple object that takes a line from a tab-delimited blast and converts the e-value and bitscore to float and the starts and stops to ints, etc.

and since the blasts are sorted beforehand, this gives a useful set of blasts in blast_iter--containing only blasts with a matching query and subject. To do this without groupby requires quite a bit more code and reduces readability.
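With a stand-in for BlastLine (here just splitting the tab-delimited columns, rather than the real class with typed attributes), the grouping step looks like:

```python
from itertools import groupby

def iter_blast_pairs(lines):
    # assumes the blast lines are pre-sorted by query then subject
    hits = [line.rstrip().split("\t") for line in lines]
    for (query, subject), blast_iter in groupby(hits, key=lambda h: (h[0], h[1])):
        # blast_iter holds only hits with this query/subject pair
        yield query, subject, list(blast_iter)
```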

I recently saw a post about using `groupby` to parse fasta files. It's a good idea, as a fasta file consists of pairs of a single header line followed by N lines of sequence. A header line always starts with ">", so the test function for groupby is clear; it's simply a matter of getting them out in pairwise fashion. Here's one way, building from the post above:
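A small version of that idea (my own reconstruction of the approach, not the linked post's code):

```python
from itertools import groupby

def fasta_iter(fh):
    # groupby yields alternating groups: one of header lines
    # (starting with '>'), then one of the sequence lines that follow.
    groups = (g for _, g in groupby(fh, key=lambda line: line.startswith(">")))
    for headers in groups:
        header = next(headers)[1:].strip()
        seq = "".join(line.strip() for line in next(groups))
        yield header, seq
```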


Here's a final example, with both of these used together to find tandem duplications given a genomic Bed file (contains chromosome, start, stop) and a list of blasts. Again, this is code from Haibao, from here.

So, this creates a `simple_blast` that's easier to sort and group; it iterates over the groups based on the subject
and then groups hits if they're on the same chromosome and within a given distance (where distance is measured in genes).
I like this example because I'd previously written code to do the same thing without Grouper or groupby and it was longer, slower and less readable.

Sunday, May 30, 2010

python tools for bioinformatics I: convolve for moving average

This will be the first post of a few describing python tools or idioms that I have found work well for doing bioinformatics -- or more specifically, genomics.


numpy makes a lot of things easier--for example, a moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made up of (G)uanine or (C)ytosine. You often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve. It does require you to get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string via "np.array(sequence_string, dtype='c')".
So the workflow is to
  • get the sequence into a numpy array of dtype 'c'
  • compare it to 'G' and 'C' to get a boolean array
  • convert the boolean array to int so True is 1, False is 0
  • decide on the window size and make a kernel of that shape that sums to 1
  • convolve the int array with the kernel
  • profit
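In code, that workflow looks something like this (a sketch; the function name is mine):

```python
import numpy as np

def gc_content(seq_string, window):
    seq = np.array(seq_string, dtype='c')             # array of single bytes
    gc = ((seq == b'G') | (seq == b'C')).astype(int)  # True -> 1, False -> 0
    kernel = np.ones(window) / float(window)          # sums to 1
    return np.convolve(gc, kernel, mode='same')       # the moving average
```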
with a quick addition for plotting:
the first 100 kilobases of gc content for 50 and 250bp windows look like:

where, as expected, the 250 basepair window is more "smoothed".
From there, it's possible to do some analysis, for example grab the regions with the highest gc content.
That's some fairly dense code to pull out the values centered in the highest GC-content windows and then show the mean of those windows. (Arabidopsis thaliana, which I used for that example, has a fairly low genome-wide GC-content, so the 0.61 average is quite high.)
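The dense code isn't shown here, but the gist of it can be written as below (a guess at the computation, not the original snippet; `mw` is a moving-average array like the one produced above):

```python
import numpy as np

def top_windows(mw, n=10):
    # indices (window centers) of the n highest moving-average values,
    # and the mean over those windows
    idx = np.argsort(mw)[-n:]
    return idx, mw[idx].mean()
```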

Another Example

There was a recent question on biostar where someone wanted to find regions with an abundance of a couple amino acids above some cutoff.
The main part of that looks like:
    # seq is np.array(protein_string, dtype='c'), kern is a normalized
    # window kernel as above, and cutoff is the chosen threshold
    abool = (seq == 'N') | (seq == 'Q') # convert to boolean
    mw = np.convolve(abool, kern, mode='same')
    if mw.max() < cutoff: continue

Where that operates on 'N' and 'Q' amino acids in a protein sequence instead of 'G' and 'C' as in the example above.


Finally, this approach is much more flexible than I've shown: the kernel does not have to be uniform, it can be gaussian, or take values only to the left of each nucleotide, or weight the nearest 10 nucleotides at 1 and the rest at 0.2. Even given those changes, once the kernel is chosen, the rest of the "workflow" remains unchanged.


In cases where the input array is numeric--not sequence--there is no need to do the conversion to a boolean, simply run the convolution on the original array with the chosen kernel.

In the examples above, I've shown only np.convolve with mode='same'; for most cases dealing with sequence, I think this is a good choice, but it's best to consult the documentation for each specific case.

Finally, in cases where the kernel and the input array are very large, it may be better to use FFT convolution via scipy.signal.fftconvolve. I haven't used this much; I think it may require doing your own padding and chopping at the edges...

Wednesday, May 19, 2010

numpy + GDAL = agoodle

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle, a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL) and to query them with polygons. The simplest way to see it in action is to go to landsummary, a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers.

You can see the usage on the github page, but basically a single summarize_wkt call gives you a dictionary of raster grid code keys to area values--meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for raster_epsg and poly_epsg, so you can query with a polygon that's in a different projection than the raster.

The library could use some work. There's sketchiness about "inside", so the sum of the areas in the returned dictionary will not match the area of the polygon. Internally, we use matplotlib's nxutils, which does a very fast point-in-poly test for every cell in the raster that's within the bounding box of the polygon. Any cell that passes that test will be included in the results--all or nothing; there's no correction for the case where a cell is half in the polygon. This is not a problem unless the polygon is small relative to the size of the raster cells (or the perimeter of the query polygon is large). Patches welcome.

Check out the code; we made gdal, numpy, and nxutils do all the hard work, but it comes together pretty well. If there are other open-source projects out there that do this well, let me know and I'll link to them. I'm posting this because, as far as I know, no others exist.

Saturday, April 10, 2010


Disclaimer: This is about a generic indexing method that fits in < 50 lines of code; while it's cool, I'm not suggesting it be used in place of a real project. The code is here and explained below.

Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python. And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via sqlite db. Titus wrote about screed and posted to the biology-in-python mailing list, which is how I originally heard about it.

Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record).
In the case of flat files, it's a mapping from an id or name to the fseek file-position at the start of a record. Given that, it's possible to make a generic file indexer that creates an index given a file and a function that advances the file pointer (fh) to the next record and returns an id.

So for the case of the FASTQ format, which contains a new record every 4 lines, the first of which is the 'id', the parser could be:

def fastq_next(fh):
    id = fh.readline() # save the id.
    # advance the file handle 3 lines.
    fh.readline(); fh.readline(); fh.readline()
    return id

So regardless of the file format, this is the interface. The function takes a file handle and 1) advances the file position to the start of the next record and 2) returns the id.
Given that, the indexer call looks like:
FileIndex.create('some.fastq', fastq_next)

all that call has to do is repeatedly send a filehandle to fastq_next, accept the id returned by fastq_next, and save a mapping of id to the (previous) file position. An implementation detail is how that mapping is saved; I use a tokyo-cabinet BTree database.
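Roughly, the create/lookup pair could look like this sketch, with a pickled dict standing in for the tokyo-cabinet BTree (this is not the 44-line original; the `.fidx` extension is made up):

```python
import pickle

class FileIndex(object):
    """Sketch of the idea: persist a mapping of record id -> file offset,
    then seek to the offset and hand the file to a record handler."""
    ext = ".fidx"

    @classmethod
    def create(cls, filename, next_func):
        index = {}
        with open(filename) as fh:
            pos = fh.tell()
            while True:
                rid = next_func(fh)   # advances fh past one record
                if not rid:
                    break
                index[rid.strip()] = pos
                pos = fh.tell()
        with open(filename + cls.ext, "wb") as out:
            pickle.dump(index, out)

    def __init__(self, filename, handler):
        self.fh = open(filename)
        self.handler = handler
        with open(filename + self.ext, "rb") as idx:
            self.index = pickle.load(idx)

    def __getitem__(self, key):
        self.fh.seek(self.index[key])
        return self.handler(self.fh)
```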
Once the index is created, usage is:

fi = FileIndex(somefile, handler)
record = fi[somekey]

where handler is a callable (including a class) that takes a file handle positioned at the start of a record and returns the parsed record. For fastq, it could be:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

so usage for fastq looks like:

>>> fi = FileIndex('some.fastq', FastQEntry)
>>> some_id = '@HWI-EAS105_0001:1:1:7:1680#0/1'
>>> record = fi[some_id]
>>> assert isinstance(record, FastQEntry)
>>> record.seq


So, from any name, there's direct access to each record. Given that FileIndex setup, it's actually possible to create and use an index for a fastq file with 9 total lines of code, 6 of which are the class itself:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

# note the 2nd argument takes a file handle and returns a name.
FileIndex.create('some.fastq', lambda fh: FastQEntry(fh).name)
fi = FileIndex('some.fastq', FastQEntry)
print fi['@HWI-EAS105_0001:1:1:7:1680#0/1'].seq


That's decent, but what makes it cooler is that this same interface can be used to implement an index on a SAM (Sequence Alignment/Map) format file:

class SamLine(object):
    def __init__(self, fh):
        line = fh.readline().split("\t") or [None]
        self.name = line[0]
        self.ref_seqid = line[2]
        self.ref_loc = int(line[3])
        # ... other SAM format stuff omitted.

f = 'some.sam'
FileIndex.create(f, lambda fh: SamLine(fh).name, allow_multiple=True)
fi = FileIndex(f, SamLine, allow_multiple=True)
print [(, s.ref_seqid, s.ref_loc) for s in fi['23351265']]

# output: [('23351265', '2', 8524), ('23351265', '3', 14202495)]

That's it. Note one extra thing: in an alignment represented in the SAM file, I used the read id (the first column) as the id for the indexing. That does not have to be unique, so I specify allow_multiple=True and it returns an array of records for every key. (Although using tokyocabinet is an implementation detail, its putcat() functions make it much easier to support multiple entries per id.)


For the SAM file I tested, the original file is 2.5G and the index created by FileIndex is 493M. For a large fastq file of 62.5 million lines (about 15.5 million records) the file is 3.3G and the index is 1.2G (compared to 3.8G for screed). It takes about 5 minutes to index the FASTQ file (compared to 11 for screed).
On my machine, with that FASTQ file, I get about 17,000 queries per second (including object creation). For the same fastq, screed gives about 10,000 queries per second (which matches the image in Titus' post).


The full implementation right now is 44 lines, including (sparse) comments, and is shown in the gist at the end of this post. It lacks a lot of nice things (like, um, tests) and I'm not recommending anyone use it in place of a well thought out project like the biopython stuff or screed; still, I think the concept is useful--and I am using it for SAM files.
I've put the full code in bio-playground on github, including the examples from this post. But, this is generic enough that it could be used to index anything: blast-files, fasta, ... whatever. As always, I welcome any feedback.

About Tokyo Cabinet

In the course of this, I found major problems in 3 different python wrappers for tokyo-cabinet. I reported bugs to 2 of them. So, kudos to py-tcdb. While I don't like that I have to explicitly tell it _not_ to pickle what I send it, it's well tested, and the code is very nice and ... best of all, I haven't been able to make it segfault. It is also easy_install'able.
Another thing I learned today is that you can't use tokyo-cabinet hash for more than a couple million records. A web-search shows that lots of people have asked about this and the answers always have to do with adjusting the bnum bucket parameter. That does not work. If you can create a tokyo cabinet hash database with over 5 million records that does not slow on inserting, please show me how. Until I see it, I think one has to use the BTree database in TC.

Sunday, April 04, 2010

writing and building a lua c extension module

[Update: 2010-04-09]
This package is now on luarocks under its new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo.
I've been messing with Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from python.
In lua, all the c functions have a signature like:
int c_function(lua_State *L)
and you use the lua c API to pull values off the lua_state stack thingy -- L.
And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you.
Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua distribution itself. There are some docs, but they were difficult for me to find, and it's not clear which docs are for lua 5.1--the current version. So I'm including a full example with source code, simple tests, Makefile, and luarocks spec.
The full gist is here.
The C code (shamelessly stolen from the wiki and book -- not my own code) actually implements 2 very useful functions string.split and string.strip. These are otherwise easily added using lua's regular expression searches, but these are faster as they're not using the regexp machinery:
Note the functions are included in a struct and "registered" with the luaopen_stringext function.
The Makefile then builds the shared library: The shared library is then immediately usable from the lua interpreter. A test session looks like:. Where string.split() returns a table (array) of the tokens. Another cool thing about lua is visible in that script. The added functions actually become methods on Lua strings! So after importing stringext, all strings now have strip() and split() methods! This is because of the line in stringext.c:
luaL_openlib(L, "string", stringext, 0);
which tells it to add the methods to the "string" module.
Finally, luarocks... Luarocks are the equivalent of python eggs or ruby gems (you gotta love all these clever names). They take a rockspec file (equiv of python's setup.{py,cfg}). Mine looks like this.
I've requested that it be added to the main luarocks repository; there doesn't seem to be a way to upload your own rock directly. Still, once you write the rockspec, you can build the C extension without a Makefile by typing
luarocks make
and it handles all the appropriate flags and such.
Anyone interested can download a tar-ball of the entire distribution here.
I'll add that the lua community seems very active and helpful. I asked a question about building the extension and received quick, helpful replies.

Tuesday, February 16, 2010

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format), which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts, and DBM's, for the most part, only allow querying by keys. Here are 2 cases where I've found a nice alternative.

Big ol' Matrix

Because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns, it's more than I want to push into an sqlite db--but it fits in a 1.9 gig file of 32-bit floats. Each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [For more details on how the coexpression is calculated, see the link above; I'm using this dataset specifically so I don't have to think about the raw expression data, which is difficult to deal with.]
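To make the layout concrete, here's a minimal sketch of writing and memory-mapping such a raw float32 file (a tiny 4 x 4 matrix standing in for the real 22573 x 22573 one):

```python
import numpy as np

# toy stand-in for the 22573 x 22573 coexpression matrix
N = 4
coexp = np.arange(N * N, dtype=np.float32).reshape(N, N)

# no header, no newlines: just N*N raw float32 values on disk
coexp.tofile("coexp.bin")

# read it back without loading the whole file into memory
arr = np.memmap("coexp.bin", dtype=np.float32, shape=(N, N), mode="r")
print(arr[2, 3])  # coexpression of gene 2 with gene 3 → 11.0
```

Since it's just raw floats, the same file can be mmap'ed from C or any other language with no parsing at all.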

I'm using numpy to create and then read that binary file (though it's just 32-bit floats, so the format will work in any language). The index of a particular gene in the array is its position in alphabetical order. So, the level of coexpression between the 13656th gene and the 12355th gene (counting from 1) is arr[13655, 12354]--or use pointer arithmetic if you're not using numpy. The average level of coexpression of the 13656th gene with all others is arr[13655].mean(). Looking up the index of each gene is a simple matter of reading in the file of gene names (already ordered alphabetically) and creating a dictionary lookup of gene => index. The one line of code to do that looks like this:

idx_lookup = dict((accn.rstrip(), i) for i, accn in enumerate(open(path + "/names.order")))
which can take an accn name like "At2g23450" and return its integer index.
Since the binary file is 1.9G of data, it's not ideal to read the entire thing into memory. But, it's simple to memory map it with numpy into a big ol' matrix/array:

L = len(idx_lookup)
coexp = np.memmap(path + '/coexp.bin', dtype=np.float32, shape=(L, L), mode='r')

so then, given a pair like: AT1G47770,AT2G26550
the code to find the coexpression is:

ai = idx_lookup["AT1G47770"]
bi = idx_lookup["AT2G26550"]
ce = coexp[ai, bi]

this is stupid fast. Memory-mapping the array, reading in the idx_lookup, and doing 6294 queries takes 0.17 seconds on my machine--and much of that is the one-time cost of reading in idx_lookup. The code for all of this is readable here.
and svn'able via:
svn checkout

That includes the code to get the data, convert it to the .bin file, and serve it as a web script (coexp.wsgi).


In most eukaryotic organisms there are on the order of 30K genes (varying from maybe a couple thousand up to 60K or so). Each gene can have multiple sub-features: CDS coding regions, mRNAs, UTR's, introns, etc. So it's a pretty small amount of data, but large enough, and with enough structure, that it's nice to have it in an efficient data structure. After five years, I think I am finally converging on a good abstraction. It holds 1 feature per row in a subclass of a numpy array. This hides some complexity arising from the fact that each gene has multiple exons and each gene can have a different number of exons (gff uses one row per cds * gene * mRNA * exon). Numpy has typed data "fields", which are something like columns. Common types are floats and ints of various precisions, and python objects. So the exon starts and stops are stored as a python array of integers in an object field, and that python array is stored in the 'locs' field of each row of the numpy array. The .flat file format looks like this:
#id  chr accn    start   end    strand  ftype   locs
1 1 AT1G01010 3631 5899 + CDS 3760,3913,3996,4276,4486,4605,4706,5095,5174,5326,5439,5630

where the columns should be mostly understandable from the header. `start` is the left-most position of the feature on the genome and `end` is the right-most. `ftype` tells the type of the feature(s) in the following column, `locs`, which is a list of start,stop pairs. Most often ftype is `CDS` (or exon) for coding sequence--the part of the gene that gets translated. From there, it's possible to take advantage of the speed and syntax of numpy:

>>> from flatfeature import Flat
# note also attaching the genomic fasta sequence.
>>> flat = Flat('data/thaliana_v8.flat', 'data/thaliana_v8.fasta')
# so can get coding sequence.
>>> cds_seq = flat.row_cds_sequence('AT1G01370')
# find all the gene names in a given chromosome, region:
>>> [gene['accn'] for gene in flat.get_features_in_region('1', 5000, 7000)]
['AT1G01010', 'AT1G01020']

# how many features on chromosome 4 are on the - strand:
>>> flat[(flat['seqid'] == '4') & (flat['strand'] == '-')].shape

The row_cds_sequence, while an ugly name, shows the advantage of tying the fasta to the Flat object--it allows extracting the genomic sequence from the fasta based on the coordinates in each row. It also highlights a problem with this data structure: what about alternative splicings, which are very common? For our lab, we always use the union of all coding sequences, so I report the resulting intervals in the 'locs' column. I intend to improve the flexibility of how that is handled in the code, if not the file.
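The "union of all coding sequences" idea can be sketched like this (my illustration here, not flatfeature's actual code): merge the exon intervals from every splice variant into non-overlapping locs:

```python
def union_intervals(intervals):
    """Merge overlapping (or directly adjacent) start,stop intervals."""
    intervals = sorted(intervals)
    merged = [list(intervals[0])]
    for start, end in intervals[1:]:
        if start <= merged[-1][1] + 1:
            # overlaps or abuts the previous interval: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]

# exon intervals from two hypothetical splice variants of one gene
print(union_intervals([(100, 200), (150, 300), (400, 500)]))
# → [(100, 300), (400, 500)]
```

The merged tuples are exactly what would go in the 'locs' column.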

As a side note, the final example in that session shows how you can think of the numpy slicing syntax as a sort of SQL-like selection. So flat[(flat['seqid'] == '4') & (flat['strand'] == '-')] reads like "select * from flat where seqid = '4' and strand = '-'", but there's no database; it's all in memory, and the work is all done in C, very quickly. Actually, I think flatfeature and the huge matrix could both qualify as NoSQL.
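That SQL-ish selection works on any numpy structured array, e.g. this toy stand-in for the real thaliana data:

```python
import numpy as np

# a tiny structured array with the same fields the example uses
flat = np.array([('1', '+'), ('4', '-'), ('4', '+'), ('4', '-')],
                dtype=[('seqid', 'U2'), ('strand', 'U1')])

# "select * from flat where seqid = '4' and strand = '-'"
sel = flat[(flat['seqid'] == '4') & (flat['strand'] == '-')]
print(sel.shape)  # → (2,)
```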

The code for flatfeature is on github; it includes an example .flat file for arabidopsis thaliana in the data directory. I'll be tinkering with that for some time. As with all my code, any forks, patches, suggestions, or ridicule will be welcome (in that order of preference). Finally, I'll add that for many cases genometools works great; they do a nice job of wrapping GFF for a number of languages, including python. But for some things, GFF is too much trouble.

Monday, January 18, 2010

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':
  • CG: a 'G' follows the 'C'

  • CHG: anything but a 'G' follows the 'C' and a 'G' follows that

  • CHH: no 'G' in the 2 subsequent nucleotides.

So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges, edge-cases, and reverse-complement are handled, it becomes a few lines of code full of loops and ifs. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this post explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.


Given a numpy array `seq`, it's possible to find all the 'C's with:
np.where(seq == 'C')
. From there, one can calculate the methylation type without looping or if'ing, by indexing into the array at offsets of 1 and 2 from each 'C'. That approach potentially takes more memory but is extremely fast, and (IMO) nicely shows how to take advantage of the things numpy is good at: slicing and indexing.
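A simplified sketch of the idea (not the exact code from my gist; + strand only, using codes 1=CG, 2=CHG, 3=CHH as listed below):

```python
import numpy as np

seq = np.array(list("CCGTACTAGCAT"))
# pad with a non-nucleotide so looking 1 and 2 bases ahead is safe
padded = np.concatenate([seq, np.array(["X", "X"])])

c_idx = np.where(seq == "C")[0]               # positions of every 'C'
is_cg = padded[c_idx + 1] == "G"              # CG: a G follows the C
is_chg = ~is_cg & (padded[c_idx + 2] == "G")  # CHG: a G two bases on
mtype = np.where(is_cg, 1, np.where(is_chg, 2, 3))  # everything else: CHH
print(mtype)  # → [2 1 3 3]
```

No loops, no ifs: the branching is entirely in the boolean arrays and np.where.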
From there, I put that (slightly modified to differentiate between + and - methylation) into a function named _calc_methylation and call it from a wrapper function that does some work to handle the ends of the sequence and the reverse-complementing. The result, as shown in the docstring, is a numpy array where the values indicate the type of methylation as:

0: not a C or G
1: CG at a 'C' (+ strand)
2: CHG at a 'C' (+ strand)
3: CHH at a 'C' (+ strand)
4: CG at a 'G' (- strand)
5: CHG at a 'G' (- strand)
6: CHH at a 'G' (- strand)

The full implementation is here. (Even the reverse-complement is done without loops or ifs.)

Sunday, January 10, 2010

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance.


It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.


Placement of Gaps

Some of the "bugs" were actually just "unspecified behavior". For example, given input sequences "AGEBAMAM" and "AGEBAM", alignments like these two:

AGEBAMAM        AGEBAMAM
AGEBAM--        AGEB--AM

have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequences. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the scoring matrix.

Gap Extension

nwalign allows use of a scoring matrix to look up the score/penalty for conversion from one letter to another, or a simpler mode where the user specifies gap_open, gap_extend, and mismatch penalties and a match reward. For the simpler non-matrix path, earlier versions of nwalign did not heed the gap_extend penalty--using the gap_open penalty regardless of the previous entries in the dynamic programming matrix. It's common to use only a single gap penalty without a separate gap_extend penalty, but nwalign no longer assumes that's what is preferred--if it is, one can simply specify a `gap_extend` penalty equal to `gap_open`.
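To be concrete about the distinction: under an affine gap model, a run of k consecutive gaps is penalized as one gap_open plus (k - 1) gap_extends. A sketch of that scoring rule (my illustration, not nwalign's internals):

```python
def gap_run_penalty(k, gap_open=-5, gap_extend=-2):
    # affine model: the first gap in a run costs gap_open,
    # each additional gap in the same run costs gap_extend
    if k == 0:
        return 0
    return gap_open + (k - 1) * gap_extend

print(gap_run_penalty(1))  # → -5
print(gap_run_penalty(3))  # → -9
# setting gap_extend equal to gap_open recovers a flat per-gap penalty
print(gap_run_penalty(3, gap_open=-5, gap_extend=-5))  # → -15
```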


Previous versions of nwalign only returned the alignment and didn't provide access to the score of the alignment. Recent versions allow the user to get the score via:

>>> nw.score_alignment('CEELECANTH', '-PELICAN--', gap_open=-5,
...                    gap_extend=-2, matrix='PAM250')
11

where `11` is the score of the alignment, given the gap_open and gap_extend penalties and the substitution scores in the file 'PAM250' (which is in the NCBI substitution matrix format).


The earliest version of nwalign was literally about 100 times faster than the perl version from the BLAST book. But there were a few places where performance has improved even more since then. The most dramatic was in the lookups of scoring matrix values. Given, again, the 2 short strings "AGEBAMAM" and "AGEBAM", the alignment algorithm has to run the inner loop 48 (6 * 8) times; in each of those iterations, it has to look up the substitution score in the matrix. For example, in the first iteration, it has to find the score for "changing" an "A" to an "A"; in the PAM250 matrix, that score is +2. In the substitution matrix, the row and column "keys" are the amino acids, and the values are the scores for changing from the row key to the column key. Previously, I stored the keys in their own array and did a linear search to find the index. So, for both the row and column, an amino acid letter was translated to an index with a function like:

aa_keys = ['A', 'R', 'N', 'D', ... ] # or 'ARND...'
def findpos(amino_acid, aa_keys):
    i = 0
    while i < len(aa_keys):
        if aa_keys[i] == amino_acid:
            return i
        i += 1
    return -1

except in C(ython), so it was much faster; the alignment then uses the returned row and column indexes to look up the score in the substitution matrix. When the strings are long enough--as they will be with real proteins--this is a huge speed bottleneck. So, the new version trades memory for speed. Instead of using a 25 * 25 matrix to store the substitution matrix, it now uses an X * X matrix where 'X' is the ord value of the largest amino acid letter. Usually this will be 'Z' or less, so the matrix will be 90 * 90, stored efficiently as a numpy array. From there, the lookup is directly matrix[ord(x_amino), ord(y_amino)], which is extremely fast in C because the ord is unnecessary--a char can index an array. This gave at least a 3X speed improvement. I could reduce the memory used by the matrix by subtracting ord('A') from each index, but that would be trading speed for memory, since it would require 2 extra subtractions per inner loop.
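The idea can be sketched in python (made-up scores and a tiny alphabet here--the real code does this in Cython with a full substitution matrix):

```python
import numpy as np

aa_keys = "ARND"                 # tiny subset of the amino acid alphabet
scores = [[ 2, -2,  0,  0],      # made-up PAM-like values, row -> column
          [-2,  6,  0, -1],
          [ 0,  0,  2,  2],
          [ 0, -1,  2,  4]]

# allocate up to the largest ord so a letter can index the matrix directly
size = max(ord(c) for c in aa_keys) + 1
matrix = np.zeros((size, size), dtype=np.int32)
for i, a in enumerate(aa_keys):
    for j, b in enumerate(aa_keys):
        matrix[ord(a), ord(b)] = scores[i][j]

# O(1) lookup: no linear search through aa_keys
print(matrix[ord('A'), ord('R')])  # → -2
```

In C the ord() disappears entirely, since a char is already an integer index.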

Also, the latest version uses less memory in other areas. In order to do the dynamic programming, the algorithm has to save N * M arrays of direction "pointers" (in the left/right/diagonal sense, not the C sense), scores, and gaps. However, the gaps are not actually needed for the entire trace-back; they are only needed 1 step back, to determine whether a current gap is a gap_open or a gap_extend. So now, instead of an N * M gap matrix, it keeps N * 1 and M * 1 arrays. For large N * M, this is enough to offset the increased memory incurred by how the substitution matrix is stored.

Running a benchmark script with nwalign-0.1 takes 12.13 seconds; the same script on nwalign-0.3.1 takes 3.41 seconds (and actually runs in 2.99 seconds with unladen-swallow... but that's another story). The script does an alignment of 1200- and 1600-basepair sequences 100 times. As with most things, I figured most of this out by banging my head against the wall long enough that the stars aligned, so to speak; anyone who cares to look at the code and offer suggestions would be much appreciated.

Monday, January 04, 2010

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta. One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends.
With this option, a file like (for example):

>chr1
ACTGAC
TGACTG
>chr2
TTTTGG
CCAATT

will be overwritten (flattened in-place) to:

>chr1
ACTGACTGACTG
>chr2
TTTTGGCCAATT

simple. When dealing with large files (where this is actually useful), the flattened file does not behave well when opened in an editor, because the editor will attempt to read a number of lines into its buffer and a single line may be 200 mega-bases. So this is a pain if you're planning to sit down with a cup of joe and read through the genome, but otherwise the fasta file should be un-affected.
This method is not currently the default (though it may be in future versions). But it's possible to use the commandline:
pyfasta flatten some.fasta
which will create the flattened fasta (and the index file) and a placeholder, some.fasta.flat, containing the text "@flattened@" as a marker to pyfasta that it's OK to use the original (now-flattened) fasta. Once the file is flattened, there is no performance loss compared to having a separate flat file containing no headers.
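The flattening itself boils down to keeping headers and joining each record's sequence lines--a simplified sketch (not pyfasta's actual implementation):

```python
def flatten_fasta(lines):
    """Join each fasta record's sequence lines into a single line,
    keeping the '>' header lines, so the result is still valid FASTA."""
    out, seq = [], []
    for line in lines:
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))
                seq = []
            out.append(line.rstrip("\n"))
        else:
            seq.append(line.strip())
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n"

fasta = ">a\nACT\nGGA\n>b\nTTT\nAAA\n"
print(flatten_fasta(fasta.splitlines(True)))
# → >a
#   ACTGGA
#   >b
#   TTTAAA
```

With one sequence line per record, indexing only needs the start and end offset of each line.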

pyfasta was a fun project for me in 2009. It's a ridiculously simple little module, but when I started it, I didn't think there was a good alternative. (Though discriminating Fasta-ers should look at the sequence module in pygr, and the Bio.Seq module in BioPython which I think has improved quite a lot recently). It has over 100 tests and very close to 100% test coverage for the modules in pyfasta, and much of the code is run once for each of the 4 backends.

the source is on bitbucket.