Posts

streaming merge of sorted objects

A lot of software still seems to rely on being able to read big-ish data into memory. This is not possible (or at least not desirable) for much of the data that I work with. There are very nice tools in python to allow operating on chunks of data at a time. When combined with a decent data-layout, this can be very powerful, and simpler even than reading everything into memory. This can change working on big(ish) data into something like working on small data. The output of a tool that I'm using is a file of genomic positions and a value. Something like: chr1 1234 0.9 chr1 1239 0.12 chr1 1249 0.12 That file is for a single sample and may contain about 10 million lines. That's not too much, but with 60 samples, this can become a problem. In addition, another sample may have sites that are not in that first sample: chr1 1221 0.91 chr1 1239 0.13 chr1 1259 0.22 Many softwares will take a matrix with rows of genomic positions and 1 column per sample (e.g. R's limma, pyt...

Adding bed/wig data to dalliance genome browser

I have been playing a bit with the dalliance genome browser . It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses. I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed in inline below. It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below). This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently. Dalliance Data Tutorial dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files. This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files. N...

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre- filter for large amounts of data, specifically some next-gen sequencing reads. Bloom Filters A Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter . It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here . Big Data It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer , and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/ha...

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it. I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains. At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps onl...

ngs / high-throughput sequencing pipeline

Image
This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know. I'll put it in the form of a shell script that assumes you've got this software installed. I'll also assume your data is in the FASTQ format . If it's in illumina's qseq format, you can convert to FastQ with this script by sending a list of qseq files as the command-line arguments. If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply. This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac. Setup The programs and files will be declared as follows. #programs fastqc=/usr/local/src/...

GSNAP

Aligners Since starting the methylcoder project, I've been using the bowtie short read aligner . It's very fast, uses very little memory, aligns Illimina, SOLID, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard). There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and, so is not a problem for most users, but, for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the r...

python tools for bioinformatics II: group-ing

This is the second post describing python tools and idioms I've found useful in bioinformatics. This post involves grouping items either using a disjoint set or python's itertools.groupby . I was introduced to both of these by the post-doc who sits behind me. Grouper Grouper is an implementation of a disjoint set, initially from this recipe by Michael Droettboom and included in matplotlib.cbook. The slightly modified implementation we use is here . Where the docstring gives a decent idea of useage: """ This class provides a lightweight way to group arbitrary objects together into disjoint sets when a full-blown graph data structure would be overkill. Objects can be joined using .join(), tested for connectedness using .joined(), and all disjoint sets can be retrieved using list(g) The objects being joined must be hashable. >>> g = Grouper() >>> g.join('a', 'b') >>> g.join('b...

python tools for bioinformatics I: convolve for moving average

Image
This will be the first post of a few describing python tools or idioms that I have found work to well for doing bioinformatics -- or more specifically genomics. Convolve numpy makes a lot of things easier--for example, moving average. We often want to know the GC-content for a given sequence, which is the proportion of DNA sequence in a given window that is made of of (G)uanine or (C)ytosine. You can often see moving average code done with a nested for loop--loop over each nucleotide and then over each surrounding nucleotide in that window. Below is the gist of the simpler, faster method using convolve . It does require you do get your sequence data into a numpy array, but that's easily done from any DNA (or protein) sequence string doing "np.array(sequence_string, dtype='c')". So the workflow is to get the sequence into a numpy array of dtype 'c' convert the boolean array to int so True is 1, False is 0 decide on the window size and make a kernel of ...

numpy + GDAL = agoodle

It's been a while since I've posted anything geo-related. Just want to let folks know about agoodle a project that makes it easy to access raster geo files as numpy arrays (thanks to GDAL), and to query them with polygons. The simplest way to see it in action is to go to landsummary , a project Josh Livni and I worked on a couple years ago, and query the map with a polygon (click "Draw Polygon" on the right side of the map first). Then you'll see a nice summary and a pie-chart (due to some google-chart-api usage by Josh) of the land-use types in the polygon you queried. The backend of that is agoodle. The map is, of course, openlayers . You can see the usage on the github page but basically, it'll be something like: which gives you a dictionary of raster grid code keys to the area values-- meaning, for each landclass, it tells you the sum area of cells of that class that fall inside the polygon. Note that summarize_wkt can also take keyword arguments for ra...

fileindex

Disclaimer: This is about a generic indexing method that fits in here and explained below. Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python . And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys to index fastq and fasta files via sqlite db. Titus wrote about screed and posted to the biology-in-python mailng list, which is how i originally heard about it. Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record). In the case of flat files, it's a mapping from and id or name to the fseek file-position at ...

writing and building a lua c extension module

[Update: 2009-04-09] This package is now on luarocks under it's new name: "stringy". It now includes startswith and endswith methods (more coming). Under advice of lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo . [/Update] I've been messing with Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a c extension for lua. The C-API is much different from python . In lua, all the c functions have a signature like: int c_function(lua_State *L) and you use the lua c API to pull values off the lua_state stack thingy -- L . And then to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you. Anyway, it's hard to find a full example of creating a C extension for lua 5.1. It actually seems more common just to provide patches for the lua dis...

organizing genomic data

In some cases, using sqlite, a DBM, or just a python pickle is as much of a pain as dealing with something like gff (generic feature format) which is comprehensive but requires a lot of manipulation. Relational db's are slow and have extra overhead when I don't need the relational or ACID parts and DBM's, for the most part, only allow querying by keys. These are 2 cases where I've found a nice alternative. Big ol' Matrix because it was already nicely pre-processed, I downloaded some Arabidopsis expression data from here . They keep the data in a hierarchy of files. I want to be able to quickly find the coexpression for any gene pair. 22573 * 22573 == 509 million entries is a bit large for a hash (aka python dictionary)--and even disregarding speed concerns it's more than I want to push into an sqlite db--but fits in a 1.9 gig file of 32 bit floats. So each entry is a measure of coexpression between 2 genes. In that sense, this is a key-key-value store. [for mo...

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation , determined by the nucleotides that follow the 'C': CG: a 'G' follows the 'C' CHG: anything but a 'G' follows the 'C' and a 'G' follows that CHH: no 'G' in the 2 subsequent nucleotides. So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs. Loopless Given a nump...

needleman-wunsch global sequence alignment -- updates and optimizations to nwalign

I've written previously about nwalign , a python package I wrote to do fast (thanks to cython) global sequence alignment. Over break, I spent some time fixing some bugs and improving performance. Ack It's actually nice to get bug reports for a relatively obscure bit of software like this as it shows it's getting used. Thanks especially to R. Christen for his patience in (repeatedly) showing me places where nwalign was not doing the right thing. Bugs Placement of Gaps Some of the "Bugs" were actually just "unspecified behavior". For example, given input text of "AGEBAMAM" and "AGEBAM", the alignments: AGEBAMAM AGEBAM-- and AGEBAMAM AGEB--AM Have the same score. However, the first version is generally more, um, polite. Previously, the behavior was deterministic, but depended on the length and order of the input sequence. As of version 0.3, nwalign will always put matches together when given a choice between 2 (or more) paths through t...

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta . One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends. With this option, a file like: >a actg actg actg >b aaaacccc aaaacccc will overwri...

genome scrubber : mask repetitive sequence

Image
This is to describe a simple tool I've made available ( svn repo ) for masking repetitive sequence. rice (Oryza Sativa) version 5 sequence looks something like below when run through pyfasta info. rice.fasta ========== >1 length:43596771 >3 length:36345490 >2 length:35925388 >4 length:35244269 >6 length:31246789 >5 length:29874162 >7 length:29688601 >11 length:28462103 >8 length:28309179 >12 length:27497214 >9 length:23011239 >10 length:22876596 372.078M basepairs in 12 sequences So, it's not huge (still only 1/10th the size of human) but, it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA", while that's not true, it does make whole-genome analyses a pain as a the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command: /usr/bin/bl...

rst2s5 with syntax highlighting

Restructured Text to S5 Presentation (lots of caffeine today so 2 posts in one day) The stub example presentation I'll be talking about is viewable as a presentation here (click on that page to advance the demo slides). There's a nice browser-based tool for presentations called S5 . In recent python docutils , there's a tool called rst2s5.py which converts restructured-text to an s5 presentation. However, it's not obvious how to get syntax highlighting for code blocks to work. So pygments , a python library that will highlight syntax for many programming languages comes with this file which they recommend you use as a starting point. That's what I did, and I've created a stub example project accessible via subversion: $ svn export http://bpgeo.googlecode.com/svn/trunk/rst2s5_template/ with a build script and a couple of example slides (and a nice theme). It's possible to change the theme by editing rst-directive.py (included in the source) and changing th...

some python ctypes stuff in Rtree

I've been working with and on the Rtree python module. It's some cool work done by Howard Butler (hobu) (originally with Sean Gillies ) to make python bindings for the Spatial Index C++ library by Marios Hadjieleftheriou which provides various tree structures for efficient spatial searches in n-dimensions. Hobu has written a C API for that along with a new ctypes wrapper to that API which appears in Rtree 0.5 and greater. There is some cool ctypes stuff in there which I'm starting to understand. From the website: ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python. as a simple example of how ctypes works we can pretend there's no atan() in python's math module and access the one from libm in c like this: import ctypes libm = ctypes.CDLL('libm.so.6') # the following 2 lines correspond to the c signature: double...

biostuff

I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around. Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio . As of yesterday, I've moved a lot of stuff from there to bitbucket . "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster any contribution. Currently, all the modules on bitbucket are also on pypi. pyfasta provides pythonic access fasta sequence files. Previously, it had been a part of genedex (which I've stopped supporting since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete) but it's been pulled out and simplified and improved. Check out the docs on pypi . nwalign is a command-line or python interface to the Needleman-W...

starting haxe. (stuff i want to remember)

I've been tinkering with a flash project recently. Actually haxe , so it's only linux tools -- VI, and the command line -- not the GUI interface people normally associate with flash. This post is a summary of how to get started with haxe using only the command line, and a project containing flash stuff I want to remember. To start, here's a gist of shell commands that will set up haxe on an ubuntu machine. The installers from the haxe website work fine for windows and mac (and I think 32 bit linux). Haxe has a slightly different syntax from actionscript 3, but for most things it is identical. These docs are very good, and better than the adobe site, I have that page open always when working with haxe. I also grabbed an "actionscript.vim" from the internet somewhere and put it in ~/.vim/syntax/ for syntax highlighting and added this line to my .vimrc: autocmd BufRead *.hx set filetype=actionscript Then compilation and code is simply a matter of following this . ...