Saturday, April 10, 2010

fileindex

Disclaimer: This is about a generic indexing method that fits in < 50 lines of code. While it's cool, I'm not suggesting it be used in place of a real project. The code is here and explained below.

Everyone's indexing big sequence files. Peter Cock has a biopython branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python. And there's my own pyfasta for fasta files. This morning I set out to use another one: screed from the pygr guys, which indexes fastq and fasta files via an sqlite db. Titus wrote about screed and posted to the biology-in-python mailing list, which is how I originally heard about it.

Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record).
In the case of flat files, it's a mapping from an id or name to the fseek file-position at the start of a record. Given that, it's possible to make a generic file indexer that creates an index given a file and a function that advances the file pointer (fh) to the next record and returns an id.

So for the case of the FASTQ format, which contains a new record every 4 lines, the first of which is the 'id', the parser could be:

def fastq_next(fh):
    id = fh.readline()  # save the id.
    # advance the file handle 3 lines.
    fh.readline(); fh.readline(); fh.readline()
    return id

So regardless of the file format, this is the interface. The function takes a file handle and 1) advances the file position to the start of the next record and 2) returns the id.
Given that, the indexer call looks like:
FileIndex.create('some.fastq', fastq_next)

all that call has to do is repeatedly send a filehandle to fastq_next, accept the id returned by fastq_next, and save a mapping of id to the (previous) file position. An implementation detail is how that mapping is saved. I use a tokyo-cabinet BTree database.
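To make that concrete, here's a minimal sketch of what the create/lookup cycle could look like. It's a toy stand-in (the class name SimpleFileIndex is mine) that saves a plain dict with pickle instead of a tokyo-cabinet BTree, and it assumes create() stops once the parser returns an empty id -- which is what fastq_next returns at end-of-file. The real implementation is in the gist at the end of the post.

import pickle

class SimpleFileIndex(object):
    """A toy stand-in for FileIndex: a dict + pickle instead of tokyo-cabinet."""

    @classmethod
    def create(cls, filename, next_record):
        positions = {}
        fh = open(filename)
        pos = fh.tell()
        while True:
            id = next_record(fh)         # advances fh to the start of the next record
            if not id:                   # assume an empty id means end-of-file
                break
            positions[id.strip()] = pos  # map the id to where its record started
            pos = fh.tell()
        pickle.dump(positions, open(filename + '.idx', 'wb'))

    def __init__(self, filename, handler):
        self.fh = open(filename)
        self.handler = handler
        self.positions = pickle.load(open(filename + '.idx', 'rb'))

    def __getitem__(self, key):
        self.fh.seek(self.positions[key])  # jump to the saved file position ...
        return self.handler(self.fh)       # ... and let the handler parse the record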
Once the index is created, usage is:

fi = FileIndex(somefile, handler)
record = fi[somekey]

where handler is a callable (including a class) that takes a file handle positioned at the start of a record and returns a thing. For fastq, it could be:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

so usage for fastq looks like:

>>> fi = FileIndex('some.fastq', FastQEntry)
>>> some_id = '@HWI-EAS105_0001:1:1:7:1680#0/1'
>>> record = fi[some_id]
>>> assert isinstance(record, FastQEntry)
>>> record.seq
'TATTTATTGTTATTAGTTATTTTANTANAAATANTNGANGGGGAGGAAGGNNNNNNTNNNNNNNNGANNNNANGAG'

FastQ

So, from any name, there's direct access to each record. Given the FileIndex implementation, it's actually possible to create and use an index for a fastq file in 9 total lines of code, 6 of which are the class itself:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

# note the 2nd argument takes a file handle and returns a name.
FileIndex.create('some.fastq', lambda fh: FastQEntry(fh).name)
fi = FileIndex('some.fastq', FastQEntry)
print fi['@HWI-EAS105_0001:1:1:7:1680#0/1'].seq

SAM

That's decent, but what makes it cooler is that this same interface can be used to implement an index on a SAM (Sequence Alignment/Map) format file:

class SamLine(object):
    def __init__(self, fh):
        line = fh.readline().split("\t") or [None]
        self.name = line[0]
        self.ref_seqid = line[2]
        self.ref_loc = int(line[3])
        # ... other SAM format stuff omitted.

f = 'some.sam'
FileIndex.create(f, lambda fh: SamLine(fh).name, allow_multiple=True)
fi = FileIndex(f, SamLine, allow_multiple=True)
print [(s.name, s.ref_seqid, s.ref_loc) for s in fi['23351265']]

# output: [('23351265', '2', 8524), ('23351265', '3', 14202495)]

That's it. Note one extra thing: for the SAM file, I used the read id (the first column) as the id for the indexing. That does not have to be unique, so I specify allow_multiple=True and it returns a list of records for every key. (So although using tokyo-cabinet is an implementation detail, its putcat() functions make it much easier to support multiple entries per id.)
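A rough way to picture allow_multiple, continuing the toy dict sketch from above (the function names create_multi and getitem_multi are mine; the real index gets the append behavior from tokyo-cabinet's putcat()): store a list of positions per key, and have lookups parse a record at each saved position.

def create_multi(filename, next_record):
    """Like the toy create() above, but keeps every position seen for an id."""
    positions = {}
    fh = open(filename)
    pos = fh.tell()
    while True:
        id = next_record(fh)
        if not id:
            break
        # append instead of overwrite, so duplicate ids keep every position
        positions.setdefault(id.strip(), []).append(pos)
        pos = fh.tell()
    return fh, positions

def getitem_multi(fh, positions, handler, key):
    """Return one parsed record per saved position for this key."""
    records = []
    for pos in positions[key]:
        fh.seek(pos)                 # jump to each occurrence of this id ...
        records.append(handler(fh))  # ... and parse a record there
    return records

# e.g. fh, positions = create_multi('some.sam', lambda fh: SamLine(fh).name)
#      getitem_multi(fh, positions, SamLine, '23351265')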

Performance

For the SAM file I tested, the original file is 2.5G and the index created by FileIndex is 493M. For a large fastq file of 62.5 million lines (about 15.5 million records), the file is 3.3G and the index is 1.2G (compared to 3.8G for screed). It takes about 5 minutes to index the FASTQ file (compared to 11 for screed).
On my machine, with that FASTQ file, I get about 17,000 queries per second (including object creation). For the same fastq, screed gives about 10,000 queries per second (which matches the image in Titus' post).
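If you want to reproduce that kind of number, the benchmark can be as simple as timing random lookups over ids pulled from the file beforehand; the function name and counts below are just placeholders, not the script I actually used.

import random, time

def queries_per_second(fi, ids, n=100000):
    """Time n random lookups (object creation included) against an index."""
    keys = [random.choice(ids) for _ in range(n)]
    t0 = time.time()
    for key in keys:
        fi[key]
    return n / (time.time() - t0)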

Implementation

The full implementation right now is 44 lines, including (sparse) comments, and is shown in the gist at the end of this post. It lacks a lot of nice things (like, um, tests) and I'm not recommending anyone use it in place of a well thought out project like the biopython stuff or screed. Still, I think the concept is useful--and I am using it for SAM files.
I've put the full code in bio-playground on github, including the examples from this post. But this is generic enough that it could be used to index anything: blast files, fasta, ... whatever. As always, I welcome any feedback.
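To illustrate that genericity (untested, and assuming as above that create() treats an empty id as end-of-file), here's what a next-record function for a multi-line fasta file might look like; a matching handler class would then parse the header plus sequence lines on lookup.

def fasta_next(fh):
    header = fh.readline()
    if not header.startswith('>'):
        return ''  # end-of-file (or not positioned at a fasta header)
    name = header[1:].strip()
    # consume sequence lines, but leave fh at the start of the next record
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line or line.startswith('>'):
            fh.seek(pos)
            return name

# indexed exactly like the fastq example:
# FileIndex.create('some.fasta', fasta_next)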

About Tokyo Cabinet

In the course of this, I found major problems in 3 different python wrappers for tokyo-cabinet. I reported bugs to 2 of them. So, kudos to py-tcdb. While I don't like that I have to explicitly tell it _not_ to pickle what I send it, it's well tested, and the code is very nice and ... best of all, I haven't been able to make it segfault. It is also easy_install'able.
Another thing I learned today is that you can't use a tokyo-cabinet hash database for more than a couple million records. A web search shows that lots of people have asked about this, and the answers always have to do with adjusting the bnum bucket parameter. That does not work. If you can create a tokyo-cabinet hash database with over 5 million records that does not slow down on inserts, please show me how. Until I see it, I think one has to use the BTree database in TC.

Sunday, April 04, 2010

writing and building a lua c extension module

[Update: 2010-04-09]
This package is now on luarocks under its new name: "stringy". It now includes startswith and endswith methods (more coming). On the advice of Lua gurus, I no longer add to the string base methods. And finally, it's available in my lua github repo.
[/Update]
I've been messing with the Lua programming language lately, mostly because I wanted to try to use love2d as a visualization tool. I got side-tracked into building a C extension for Lua. The C API is much different from Python's.
In Lua, all the C functions have a signature like:
int c_function(lua_State *L)
and you use the Lua C API to pull values off the lua_State stack thingy -- L.
Then, to return values, you just push them back onto the stack. I don't grok this fully yet, but it seems to handle all the memory allocation for you.
Anyway, it's hard to find a full example of creating a C extension for Lua 5.1; it actually seems more common just to provide patches for the Lua distribution itself. There are some docs, but they were difficult for me to find and it's not clear which are for lua-5.1--the current version. So I'm including a full example with source code, simple tests, a Makefile, and a luarocks rockspec.
The full gist is here.
The C code (shamelessly stolen from the wiki and book -- not my own code) actually implements 2 very useful functions: string.split and string.strip. These are otherwise easily added using Lua's regular expression searches, but these versions are faster as they don't use the regexp machinery.
Note the functions are included in a struct and "registered" with the luaopen_stringext function.
The Makefile then builds the shared library, stringext.so, which is immediately usable from the Lua interpreter. In a test session, string.split() returns a table (array) of the tokens. Another cool thing about Lua is visible in that session: the added functions actually become methods on Lua strings. So after importing stringext, all strings have strip() and split() methods! This is because of this line in stringext.c:
luaL_openlib(L, "string", stringext, 0);
which tells it to add the methods to the "string" module.
Finally, luarocks... Luarocks are the equivalent of Python eggs or Ruby gems (you gotta love all these clever names). They take a rockspec file (the equivalent of Python's setup.{py,cfg}). Mine looks like this.
I've requested that it be added to the main luarocks repository; there doesn't seem to be a way to upload your own rock directly. Still, once you write the rockspec, you can build the C extension without a Makefile by typing
luarocks make
and it handles all the appropriate flags and such.
Anyone interested can download a tar-ball of the entire distribution here.
I'll add that the Lua community seems very active and helpful. I asked a question about building the extension and received quick, helpful replies.