Posts

Showing posts with the label bio

Adding bed/wig data to dalliance genome browser

I have been playing a bit with the dalliance genome browser . It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses. I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed in inline below. It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below). This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently. Dalliance Data Tutorial dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files. This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files. N...

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre- filter for large amounts of data, specifically some next-gen sequencing reads. Bloom Filters A Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter . It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here . Big Data It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer , and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/ha...

ngs / high-throughput sequencing pipeline

Image
This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know. I'll put it in the form of a shell script that assumes you've got this software installed. I'll also assume your data is in the FASTQ format . If it's in illumina's qseq format, you can convert to FastQ with this script by sending a list of qseq files as the command-line arguments. If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply. This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac. Setup The programs and files will be declared as follows. #programs fastqc=/usr/local/src/...

GSNAP

Aligners Since starting the methylcoder project, I've been using the bowtie short read aligner . It's very fast, uses very little memory, aligns Illimina, SOLID, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard). There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and, so is not a problem for most users, but, for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the r...

Loopless programming (for calculating methylation types)

DNA Methylation is used in plants as a means of epi-genetic regulation. Bi-sulfite sequencing is a method used to determine the methylation pattern of a given set of DNA. Methylation occurs at 'C'ytosines. In plants, there are 3 types of methylation , determined by the nucleotides that follow the 'C': CG: a 'G' follows the 'C' CHG: anything but a 'G' follows the 'C' and a 'G' follows that CHH: no 'G' in the 2 subsequent nucleotides. So, programmatically, it's a trivial matter to calculate the type of methylation that can occur at each cytosine in a given sequence. Though, once the edges and edge-cases and reverse-complement are handled, it becomes a few lines of code full of loops and if's. For python (and most scripting languages) that's slow and loopy and iffy. The rest of this code explains how I utilized numpy to make a fast methylation-type calculator without loops or ifs. Loopless Given a nump...

updates to pyfasta

At the end of last year, I put in quite a few updates to pyfasta . One of the nicest is the new flatten stuff. In order to provide fast access to the sequence data, pyfasta creates a separate flattened version of the sequence file containing no newlines or headers for any file that it interacts with. That flattened file is used as the basis for the index which allows fast random-access. This is an additional file, nearly the same size as the original, and can be more space overhead than one would like to incur when dealing with large files. The new "flatten_inplace" keyword arg to the pyfasta.Fasta() constructor will remove all newlines but keep headers. This will leave the fasta file in a valid FASTA format that BLAST or any sequence tools will understand, but will also allow fast access via pyfasta, since pyfasta only needs to know the file position where each sequence starts and ends. With this option, a file like: >a actg actg actg >b aaaacccc aaaacccc will overwri...

genome scrubber : mask repetitive sequence

Image
This is to describe a simple tool I've made available ( svn repo ) for masking repetitive sequence. rice (Oryza Sativa) version 5 sequence looks something like below when run through pyfasta info. rice.fasta ========== >1 length:43596771 >3 length:36345490 >2 length:35925388 >4 length:35244269 >6 length:31246789 >5 length:29874162 >7 length:29688601 >11 length:28462103 >8 length:28309179 >12 length:27497214 >9 length:23011239 >10 length:22876596 372.078M basepairs in 12 sequences So, it's not huge (still only 1/10th the size of human) but, it can be difficult to deal with the entire genome because of the large amount of repetitive sequence and transposable elements. This is sometimes mistakenly referred to as "junk DNA", while that's not true, it does make whole-genome analyses a pain as a the output is dominated by repetitive sequences matching their own families. Doing a blast of the rice genome with this command: /usr/bin/bl...

biostuff

I've been trying in 2009 to write less throw-away code. I'm not sure how successful I've been at that, but at least I'm writing more code that I keep around. Previously, I stuck anything of at least marginal quality and re-usability into my google code project bpbio . As of yesterday, I've moved a lot of stuff from there to bitbucket . "Biostuff" is where I'll put modules that are well documented and tested in hopes that using a distributed VCS and a project that doesn't contain my initials will foster any contribution. Currently, all the modules on bitbucket are also on pypi. pyfasta provides pythonic access fasta sequence files. Previously, it had been a part of genedex (which I've stopped supporting since @hobu has done so much good work on Rtree that genedex is now pretty much obsolete) but it's been pulled out and simplified and improved. Check out the docs on pypi . nwalign is a command-line or python interface to the Needleman-W...

displaying and serving big(ish) data with numpy and memmap

Image
In this case, "big" is only 8 million rows, but I have it working for 40+ million extremely fast and it will work for anything that can fit in memory (and larger with a bit of extra work). So the idea is to create simple web "service" (where "service" is just the URL scheme I pull out of my .. uh genes ... ) implemented in a wsgi script. A web request containing a width=XXX in the url will show the user wants an image. so a url like: http://128.32.8.28/salktome/?seqid=1&xmin=491520&xmax=499712&&width=512 will give an image: (geo-hackers will recognize that with a URL scheme like that, it'll be simple to put this into an openlayers slippy map.) where the each horrible color represents a different tissue, and values extending up from the middle represent the top (+) strand (W) while those on the bottom represent the - or Crick strand. The heights are the levels of expression. Without the width=XXX param, the service returns JSON of values...

biohash

This is a quick project I did a while back but, I've seen people interested in similar ideas, so I'll post my implementation here. Geohash encodes latitude/longtitude locations into a string such that "nearby places will often present similar prefixes" and the longer the string, the greater the precision. Using this python implementation by Schuyler as a reference, I ported the concept to a "biohash" which can encode intervals. It works in a similar fashion, starting with the extremes and halving the space until it finds the smallest space that contains the interval. The use to allow efficient search of intervals using a BTree index, as in any relational db. It's implemented with only a dumps() and loads() function after the pickle interface. The dumps function takes start and end args and returns a 1/0 encoded string. The loads takes a 1/0 encoded string and returns the tightest interval it can given that string. Both functions take a rng kwarg, whic...

Stupid things I did as a Bioinformatics Programmer in 2008

In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it. My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in un-structured text files or python pickles. I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly. I was passing around data in arrays and tuples, so then when I changed the order or added another element in...

python interval tree

EDIT: added a couple points inline. I'm obsessed with trees lately -- of the CS variety, not the plant variety. Although we are studying poplar, so I'll be using trees to study trees . I'd tried a couple times to implement an interval tree from scratch following the Wikipedia entry . Today I more or less did that in python. It's the simplest possible form. There's no insertion (though that's trivial to add), it just takes a list of 'things' with start and stop attributes and creates a tree with a .find() method. The wikipedia entry that baffled me was about storing 2 copies of each node's intervals --one sorted by start and the other by stop. I didn't do that as I think in most cases it won't improve lookup time. I figure if you have 1 million elements and a tree of depth 16, then you have on average 15 intervals per node (actually fewer since there are the non-leaf nodes). So I just brute force each of those nodes and move to the next. I th...

genedex.fasta with numpy.memmap

EDIT: added job posting to comments. I've been working a bit on genedex , I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree , but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema do allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py . That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on th...

binary search over intervals

[EDIT: update location of code repo] This isn't particularly advanced or clever, it's just a simple implementation--less code than anything else I could come up with. Binary search is easy. Just look at the python std library implementation (and the C API version ). When you play the game with a friend of guessing a number between 0 and 100, you guess 50, your friend tells you "higher", you guess 75. That's pretty much binary search. It takes about 7 guesses max to guess a number between 0 and 100. It just requires that the numbers be in order. Interval search is more difficult. It's not just looking for a single value, but rather for a range of values that overlap a given interval. Also, you can't sort on just start, because some intervals are longer than others, so when sorting by start, the stops may be out of order. So, you have to arrange some protocol. In all the examples I've seen, including the explanation here , that means storing not only t...

openlayers, genomes and image-maps

In response to Titus' post on using imagemaps for genomic visualization: Why are imagemaps so popular in genomics? As an extreme and unfair comparison, just imagine if http://maps.google.com was an image map. Given a CGI script that can accept a url like &start=1024&stop=2048&chr=3 and return an appropriate image, you can provide a substantial set of tools using openlayers , which is developed by what must be one of the largest and active developer communities in GIS. (Yes, I am an openlayers fan-boy.) You can do that with a small addition to openlayers which I updated a couple weeks ago to OL version 2.6. In that update, I removed > 140 lines of code . So, it's now even less of a change to OL. Maybe when 2.7 comes out, I'll figure out how to provide a patch that allows an extra argument to the OpenLayers.Map constructor that limits panning to the horizontal direction -- in which case genome-browser will cease to exist and only the single file containing ...

seqfind: levenshtein + bktree

I've copied this recipe that I modified before, and added the BK Tree structure in cython. It's in my repo here . Check it out with: svn checkout http://bpbio.googlecode.com/svn/trunk/seqfind or easy_install with sudo easy_install http://bpbio.googlecode.com/svn/trunk/seqfind It's now using the Damerau-Levenshtein distance which is more sensible for bioinformatics where transpositions are frequent. Bearophile's original implementation used a tuple, which made sense, but in Cython, it's more efficient to use an object where the properties can be typed--as a class is converted to a c-struct--so there is no conversion when appending to a python array -- if i understand the generated c code correctly. Using an object also allows arbitrary info to be passed along with the word when creating the tree, again, this is important for bio-informatics when the string is something like "actgcc ... acgtc" and it's useful to attach some annotation to it like: word...

levenshtein in cython

EDIT(2): fix markup for <char *> casts ... fix malloc (see comments. thanks Bao). NOTE: using a kwarg for limit slows things down. setting that to a required arg and using calloc for m2 speed things up to nearly as fast as the pylevenshtein. Well, it seems to be popular to code up the levenshtein . I actually have a use for this and wanted to practice some Cython , so I've written a version. I used bearophile's recipe , wikibooks , this (from k4st on reddit) and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn). cdef extern from "stdlib.h": ctypedef unsigned int size_t size_t strlen(char *s) void *malloc(size_t size) void free(void *ptr) int strcmp(char *a, char *b) cdef inline size_t imin(int a, int b, int c): if a if c return c return a if c return c return b cpdef int levenshtein(char *a, char *b, int limit=100): cdef int m = strlen(a),...

Genedex: query genomic features and sequence

Normally, I don't write libraries, I figure smarter people than I should do such things, and I should just use them. But, I got tired enough of writing one-off scripts for genomic feature manipulation-- find the upstream, downstream neighbors and get the sequence -- and I saw enough of the pieces coming together that I decided to build it. I'd complained before about how rtree didn't support 1D indicies. Not only is this not a problem, it's beneficial. Genomic features should have strand information, so that's the 2nd dimension. Then rtree does containment queries, so it's simple to find only the features on a given strand. I realized this about the same time that the docstring for numpy's memmap went from 0 lines to about 100, and it was enhanced to take a filehandle , not just a filename. This means you can send in a start position and a shape to the numpy.memmap constuctor and it can create a numpy array of only that chunk. This means that it's pos...

comparative genomics with openlayers

Image
Traditional genome browsers, look like this . In fact, I think that's the most popular genome-browser used--gbrowse. They display information in tracks, so any layer of annotation you just add on to the bottom of the image (after making the image taller). This doesnt work for genome-browser, the hack of openlayers to support only horizontal scrolling, because you if you have 2 adjacent tiles, if one has more features than the next, there's not guarantee that they'll be the same height, and no guarantee that a feature that's on both images will align correctly. I was just hacking around, trying to test some work I'd done and realized that you can have annotation layers with OpenLayers, just add another map, and tie them together! So that's 2 OpenLayers.Map() instances. What makes this easy is the new Map.panTo() methods in OpenLayers 2.6 (which is in release candidate 1). So, the top map registers for 'move' and 'zoomend' events with callbacks th...

reading code

A fasta file of rice genomic sequence is 355MB. It's not easy to understand how large that is. This is an attempt to come up with a quick metric. So, I downloaded Ulysses . wc shows it to have 267235 words. Some googling says the average person can read 250 words per - minute. So that's 267,235 / 250 / 60 = 17.8 hours. Well, it's hard to believe anyone can really read Ulysses in 18 hours but... good enough. So on the rice fasta file i ran: grep -v ">" rice.fasta | wc -c to get rid of the 12 header lines (1 per chromosome) and only count sequence (should be within 12 characters counting the extra new-lines). That gives 372,077,765 characters. The average word-size in ulysses is 5. I rounded up to 6. So, the rice sequence has the equivalent of 372,077,765 / 6 = 62,012,960 words So, at 250 words per minute, it'd take: 62012960 / 250 / 60 = 4,134 hours to read the rice genome . That's 172 days. Also, from what I know, the plot is hard to follow. Genome siz...