Posts

Showing posts from 2010

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a Bloom filter as a pre-filter for large amounts of data, specifically some next-gen sequencing reads.

Bloom Filters

A Bloom filter is a memory-efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the Perl module Bloom::Filter. It has a nice API and seems very fast. It allows specifying the false-positive rate. As with any Bloom filter, there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here.

Big Data

It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer, and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language: just read the records into a dict/hash…
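The pre-filter idea in a minimal two-pass sketch, assuming a hypothetical BloomFilter class with add() and `in` support (the wrapper's real API may differ). The Bloom filter flags candidate duplicates cheaply; only those candidates need exact tracking in memory:

    # sketch only: BloomFilter is a stand-in for the real wrapper,
    # and each line of the file is treated as one read.
    def unique_reads(read_file, BloomFilter):
        # pass 1: collect candidate duplicates (true repeats + false positives)
        bf = BloomFilter(capacity=50_000_000, error_rate=0.001)
        candidates = set()
        with open(read_file) as fh:
            for read in fh:
                if read in bf:
                    candidates.add(read)
                else:
                    bf.add(read)
        # pass 2: emit each read once; exact checks only for candidates
        seen = set()
        with open(read_file) as fh:
            for read in fh:
                if read in candidates:
                    if read in seen:
                        continue
                    seen.add(read)
                yield read

Because a Bloom filter has no false negatives, every true repeat is caught in pass 1; the false-positive rate only controls how many unique reads needlessly land in the candidates set.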

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it. I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired-end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains. At the end of this post is the code for a simple python script that clips an adaptor sequence and trims low-quality bases from paired-end reads. It simply calls the fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file, then loops through the filtered output and keeps only…
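A sketch of the re-pairing step (my reconstruction, not the post's script), assuming read names in the two files match after stripping a trailing /1 or /2:

    def fastq_records(path):
        # yield (name, 4-line record) tuples from a fastq file
        with open(path) as fh:
            for header in fh:
                rec = header + next(fh) + next(fh) + next(fh)
                name = header[1:].split()[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name, rec

    def sync_pairs(filtered1, filtered2, out1, out2):
        # keep only records whose mate survived filtering in the other
        # file, preserving order so the nth records stay paired.
        # (holds one file's names in memory -- fine for a sketch)
        names1 = {name for name, _ in fastq_records(filtered1)}
        keep = {name for name, _ in fastq_records(filtered2) if name in names1}
        with open(out1, "w") as o1:
            for name, rec in fastq_records(filtered1):
                if name in keep:
                    o1.write(rec)
        with open(out2, "w") as o2:
            for name, rec in fastq_records(filtered2):
                if name in keep:
                    o2.write(rec)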

ngs / high-throughput sequencing pipeline

This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know. I'll put it in the form of a shell script that assumes you've got this software installed. I'll also assume your data is in the FASTQ format. If it's in Illumina's qseq format, you can convert to FASTQ with this script by sending a list of qseq files as the command-line arguments. If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply. This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite-treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac.

Setup

The programs and files will be declared as follows:

    #programs
    fastqc=/usr/local/src/…
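Not the script the post links to, but a minimal sketch of the qseq-to-FASTQ conversion it performs, assuming the standard 11-column qseq layout with Phred+64 qualities:

    import sys

    def qseq_to_fastq(paths, out=sys.stdout):
        # qseq is tab-separated; fields 8/9 (0-based) are sequence and
        # quality, field 10 is the chastity-filter flag ("1" = passed)
        for path in paths:
            with open(path) as fh:
                for line in fh:
                    fields = line.rstrip("\n").split("\t")
                    if fields[10] != "1":
                        continue  # skip reads failing the filter
                    name = ":".join(fields[0:6]) + "#" + fields[6] + "/" + fields[7]
                    seq = fields[8].replace(".", "N")  # qseq uses '.' for no-call
                    # shift Illumina Phred+64 qualities down to Sanger Phred+33
                    qual = "".join(chr(ord(c) - 31) for c in fields[9])
                    out.write("@%s\n%s\n+\n%s\n" % (name, seq, qual))

    if __name__ == "__main__":
        qseq_to_fastq(sys.argv[1:])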

GSNAP

Aligners

Since starting the methylcoder project, I've been using the bowtie short-read aligner. It's very fast, uses very little memory, aligns Illumina, SOLiD, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard). There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32-bit integers to store the offsets into that index, so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and so is not a problem for most users, but for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward and reverse copy of the genome, which doubles the size of the reference…
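A minimal sketch of the conversion described above (not methylcoder's actual code): the forward strand is C-to-T converted, and the reverse complement is converted separately, so both copies go into the index.

    # sketch of preparing C->T converted reference copies for BS alignment
    COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

    def revcomp(seq):
        return seq.translate(COMPLEMENT)[::-1]

    def bs_converted_copies(seq):
        # forward copy: every C becomes T
        fwd = seq.upper().replace("C", "T")
        # reverse copy: reverse-complement first, then convert C->T
        rev = revcomp(seq.upper()).replace("C", "T")
        return fwd, rev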

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics. This post involves grouping items either using a disjoint set or python's itertools.groupby. I was introduced to both of these by the post-doc who sits behind me.

Grouper

Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook. The slightly modified implementation we use is here. The docstring gives a decent idea of usage:

    """
    This class provides a lightweight way to group arbitrary objects
    together into disjoint sets when a full-blown graph data structure
    would be overkill.

    Objects can be joined using .join(), tested for connectedness
    using .joined(), and all disjoint sets can be retrieved using
    list(g)

    The objects being joined must be hashable.

    >>> g = Grouper()
    >>> g.join('a', 'b')
    >>> g.join('b…
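Since the Grouper docstring above shows its own usage, here's a small example of the other tool mentioned, itertools.groupby -- grouping sorted GFF-like features by chromosome (my example, not the post's):

    from itertools import groupby
    from operator import itemgetter

    # (chromosome, start, name) tuples; groupby only merges adjacent
    # items, so the input must be sorted by the same key it groups on.
    features = [
        ("chr1", 100, "geneA"),
        ("chr1", 900, "geneB"),
        ("chr2", 150, "geneC"),
    ]

    for chrom, feats in groupby(features, key=itemgetter(0)):
        print(chrom, [name for _, _, name in feats])
    # chr1 ['geneA', 'geneB']
    # chr2 ['geneC']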

python tools for bioinformatics I: convolve for moving average

This will be the first post of a few describing python tools or idioms that I have found work well for doing bioinformatics -- or more specifically genomics.

Convolve

numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made up of (G)uanine or (C)ytosine. You can often see moving-average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve. It does require you to get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string by doing "np.array(sequence_string, dtype='c')". So the workflow is to:

- get the sequence into a numpy array of dtype 'c'
- convert the boolean array to int so True is 1, False is 0
- decide on the window size and make a kernel of…
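A sketch of that workflow (using np.frombuffer rather than dtype='c' on a str, which was a Python 2 idiom):

    import numpy as np

    def gc_moving_average(sequence, window=50):
        # sequence string -> numpy array of single bytes
        arr = np.frombuffer(sequence.upper().encode(), dtype="S1")
        # boolean mask of G/C positions, converted so True=1, False=0
        gc = ((arr == b"G") | (arr == b"C")).astype(float)
        # kernel of 1/window so the convolution is a mean, not a sum
        kernel = np.ones(window) / window
        # mode="same" keeps the output the same length as the input
        return np.convolve(gc, kernel, mode="same")

    print(gc_moving_average("ACGCGTATAT", window=4))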

numpy + GDAL = agoodle

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle, a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to landsummary, a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers. You can see the usage on the github page but basically, it'll be something like the sketch below, which gives you a dictionary of raster grid-code keys to area values--meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for ra…
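The post's original snippet isn't preserved in this excerpt; a hedged reconstruction from the description, assuming an AGoodle class and WKT polygon input (the class and file names are my guesses -- check the github page for the real API):

    # hypothetical usage; only summarize_wkt is named in the post.
    from agoodle import AGoodle

    raster = AGoodle("landcover.tif")
    wkt = "POLYGON((-121.5 38.5, -121.4 38.5, -121.4 38.6, -121.5 38.6, -121.5 38.5))"
    # {grid_code: total area of cells of that class inside the polygon}
    summary = raster.summarize_wkt(wkt)
    for grid_code, area in summary.items():
        print(grid_code, area)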

fileindex

Disclaimer: This is about a generic indexing method that fits in < 50 lines of code; while it's cool, I'm not suggesting it be used in place of a real project. The code is here and explained below. Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python. And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via an sqlite db. Titus wrote about screed and posted to the biology-in-python mailing list, which is how I originally heard about it. Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or…
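A minimal sketch of that id-to-file-position idea (my reconstruction, not the post's fileindex code): record the byte offset where each record starts, then seek straight back to it on lookup.

    def index_fasta(path):
        # map each header name to the byte offset where its record starts
        idx = {}
        with open(path) as fh:
            while True:
                pos = fh.tell()
                line = fh.readline()
                if not line:
                    break
                if line.startswith(">"):
                    idx[line[1:].split()[0]] = pos
        return idx

    def get_record(path, idx, name):
        # seek to the record and read until the next header
        lines = []
        with open(path) as fh:
            fh.seek(idx[name])
            lines.append(fh.readline())  # the header itself
            for line in fh:
                if line.startswith(">"):
                    break
                lines.append(line)
        return "".join(lines)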

writing and building a lua c extension module

[Update: 2009-04-09] This package is now on luarocks under its new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo. [/Update] I've been messing with the Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a C extension for lua. The C API is much different from Python's. In lua, all the C functions have a signature like:

    int c_function(lua_State *L)

and you use the lua C API to pull values off the lua_State stack thingy -- L. And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you. Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua dis…

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format), which is comprehensive but requires a lot of manipulation. Relational dbs are slow and have extra overhead when I don't need the relational or ACID parts, and DBMs, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative.

Big ol' Matrix

Because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here. They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but it fits in a 1.9 gig file of 32-bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for mo…
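A sketch of the big-matrix idea with numpy's memmap (my reconstruction; the file name is a placeholder): map gene names to row/column indexes, and any pair's coexpression becomes a single O(1) lookup into the file, with nothing read until it's indexed.

    import numpy as np

    N = 22573  # number of genes; N * N 32-bit floats is ~1.9 GB on disk

    # gene_names is assumed to be the ordered list of the 22573 gene ids
    # idx = {name: i for i, name in enumerate(gene_names)}

    # memory-map the matrix rather than loading it
    coexp = np.memmap("coexpression.f32", dtype=np.float32, mode="r",
                      shape=(N, N))

    def coexpression(idx, gene_a, gene_b):
        # a key-key-value lookup: two names -> one float
        return coexp[idx[gene_a], idx[gene_b]]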

Loopless programming (for calculating methylation types)

DNA methylation is used in plants as a means of epigenetic regulation. Bisulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation, determined by the nucleotides that follow the 'C':

- CG: a 'G' follows the 'C'
- CHG: anything but a 'G' follows the 'C', and a 'G' follows that
- CHH: no 'G' in the 2 subsequent nucleotides

So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs.

Loopless

Given a numpy…
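A sketch of the loopless idea, built only from the definitions above, on the forward strand (the post's code also handles the reverse complement; here the sequence end is simply padded with 'N'):

    import numpy as np

    def methylation_type(seq_str):
        # 0 = not a C, 1 = CG, 2 = CHG, 3 = CHH (forward strand only)
        seq = np.frombuffer(seq_str.upper().encode(), dtype="S1")
        padded = np.concatenate([seq, np.array([b"N", b"N"])])
        g1 = padded[1:-1] == b"G"   # is the +1 neighbor a 'G'?
        g2 = padded[2:] == b"G"     # is the +2 neighbor a 'G'?
        is_c = seq == b"C"
        mtype = np.zeros(seq.size, dtype=np.uint8)
        mtype[is_c & g1] = 1              # CG
        mtype[is_c & ~g1 & g2] = 2        # CHG
        mtype[is_c & ~g1 & ~g2] = 3       # CHH
        return mtype

    print(methylation_type("CGTCAGCTTC"))
    # [1 0 0 2 0 0 3 0 0 3]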

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign, a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance.

Ack

It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing.

Bugs

Placement of Gaps

Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments:

    AGEBAMAM
    AGEBAM--

and

    AGEBAMAM
    AGEB--AM

have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through the…
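A small usage example, based on nwalign's global_align function (check the package docs for the current options and scoring arguments):

    import nwalign as nw

    print(nw.global_align("AGEBAMAM", "AGEBAM"))
    # expected with the 0.3 gap-placement fix:
    # ('AGEBAMAM', 'AGEBAM--')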

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta. One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends. With this option, a file like:

    >a
    actg
    actg
    actg
    >b
    aaaacccc
    aaaacccc

will overwrite…
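A usage sketch built around the keyword arg named above (the file name is a placeholder):

    from pyfasta import Fasta

    f = Fasta("some.fasta", flatten_inplace=True)
    print(sorted(f.keys()))   # sequence headers, e.g. ['a', 'b']
    print(f["a"][0:4])        # fast random access to a slice of sequence 'a'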