Posts

my thoughts on golang

I've been playing with go in the evenings and over xmas break for about 8 weeks now. This post is about go the language and the tooling. I may write another post about a simple go package for bioinformatics that I've been writing which is under 1000 lines of code. First, go is boring , and though it is pretty terse, I do miss things about python like list comprehensions; initializing a variable and writing a for loop is easy enough, but it's one of the things that I use all the time in python.  But, I can't argue with the "less is exponentially more" mantra as I was able to pick up the language very quickly. The tooling is fantastic. My project has dependencies that are wrappers to C-libs, but I can simply do:     go get code.google.com/p/biogo.boom and it just works. The project is only about 1K lines of code, but it compiles in about 0.1 seconds on my laptop. And, when the time comes, I can distribute binaries for common platforms! vim-go  is awesome...

streaming merge of sorted objects

A lot of software still seems to rely on being able to read big-ish data into memory. This is not possible (or at least not desirable) for much of the data that I work with. There are very nice tools in python to allow operating on chunks of data at a time. When combined with a decent data-layout, this can be very powerful, and simpler even than reading everything into memory. This can change working on big(ish) data into something like working on small data. The output of a tool that I'm using is a file of genomic positions and a value. Something like: chr1 1234 0.9 chr1 1239 0.12 chr1 1249 0.12 That file is for a single sample and may contain about 10 million lines. That's not too much, but with 60 samples, this can become a problem. In addition, another sample may have sites that are not in that first sample: chr1 1221 0.91 chr1 1239 0.13 chr1 1259 0.22 Many softwares will take a matrix with rows of genomic positions and 1 column per sample (e.g. R's limma, pyt...

Adding bed/wig data to dalliance genome browser

I have been playing a bit with the dalliance genome browser . It is quite useful and I have started using it to generate links to send to researchers to show regions of interest we find from bioinformatics analyses. I added a document to my github repo describing how to display a bed file in the browser. That rst is here and displayed in inline below. It uses the UCSC binaries for creating BigWig/BigBed files because dalliance can request a subset of the data without downloading the entire file given the correct apache configuration (also described below). This will require a recent version of dalliance because there was a bug in the BigBed parsing until recently. Dalliance Data Tutorial dalliance is a web-based scrolling genome-browser. It can display data from remote DAS servers or local or remote BigWig or BigBed files. This will cover how to set up an html page that links to remote DAS services. It will also show how to create and serve BigWig and BigBed files. N...

(bloom) filter-ing repeated reads

In this post, I'll talk a bit about using a bloom filter as a pre- filter for large amounts of data, specifically some next-gen sequencing reads. Bloom Filters A Bloom Filter is a memory efficient way of determining if an element is in a set. It can have false positives, but not false negatives. A while ago, I wrote a Cython/Python wrapper for the C code that powers the perl module, Bloom::Filter . It's has a nice API and seems very fast. It allows specifying the false positive rate. As with any bloom-filter there's a tradeoff between the amount of memory used and the expected number of false positives. The code for that wrapper is in my github, here . Big Data It's a common request to filter out repeated reads from next-gen sequencing data. Just see this question on biostar. My answer , and every answer in that thread, uses a tool that must read all the sequences into memory. This is an easy problem to solve in any language, just read the records into a dict/ha...

filtering paired end reads (high throughput sequencing)

NOTE: I don't recommend using this code. It is not supported and currently does not work for some sets of reads. If you use it, be prepared to fix it. I wrote last time about a pipeline for high-throughput sequence data. In it, I mentioned that the fastx toolkit works well for filtering but does not handle paired end reads. The problem is that you can filter each end (file) of reads independently, but most aligners expect that the nth record in file 1 will be the pair of the nth record in file 2. That may not be the case if one end of the pair is completely removed while the other remains. At the end of this post is the code for a simple python script that clips an adaptor sequences and trims low-quality bases from paired end reads. It simply calls fastx toolkit (which is assumed to be on your path). It uses fastx_clipper if an adaptor sequence is specified and then pipes the output to fastq_quality_trimmer for each file then loops through the filtered output and keeps onl...

ngs / high-throughput sequencing pipeline

Image
This is the minimal set of preprocessing steps I run on high-throughput sequencing data (mostly from the Illumina sequencers) and then how I prep and view the alignments. If there's something I should add or consider, please let me know. I'll put it in the form of a shell script that assumes you've got this software installed. I'll also assume your data is in the FASTQ format . If it's in illumina's qseq format, you can convert to FastQ with this script by sending a list of qseq files as the command-line arguments. If your data is in color-space, you can just tell bowtie that's the case, but the FASTX stuff below will not apply. This post will assume we're aligning genomic reads to a reference genome. I may cover bisulfite treated reads and RNA-Seq later, but the initial filtering and visualization will be the same. I also assume you're on *Nix or Mac. Setup The programs and files will be declared as follows. #programs fastqc=/usr/local/src/...

GSNAP

Aligners Since starting the methylcoder project, I've been using the bowtie short read aligner . It's very fast, uses very little memory, aligns Illimina, SOLID, and colorspace reads, and has enough options to keep you busy (including my favorite: --try-hard). There's a new short-read aligner in my feed-reader each week. I wish, as a service, they'd tell me what they do that bowtie doesn't. There are 2 scenarios I know of that bowtie doesn't handle. As with many aligners, bowtie creates an index of the reference genome. It uses a suffix array compressed with the Burrows-Wheeler Transform (BWT). It uses 32 bit integers to store the offsets into that index so it's limited to under 4 gigabases of reference sequence. This is more than enough to hold the human genome and, so is not a problem for most users, but, for BS-treated alignments, I need to store a converted (Cytosine to Thymine) forward, and reverse copy of the genome which doubles the size of the r...