Saturday, April 10, 2010

fileindex

Disclaimer: this post is about a generic indexing method that fits in under 50 lines of code. While it's cool, I'm not suggesting it be used in place of a real project. The code is here and explained below.

Everyone's indexing big sequence files. Peter Cock has a bio-python branch to index via sqlite. Brad Chapman writes up a nice approach using tools from bx-python. And there's my own pyfasta for fasta files. This morning I set out to use another one: screed, from the pygr guys, which indexes fastq and fasta files via an sqlite db. Titus wrote about screed and posted to the biology-in-python mailing list, which is how I originally heard about it.

Screed and the biopython branch use sqlite to get quickly from name to thing--random access. This is a nice approach because sqlite comes with python, and it's easy to use and quite fast. Thinking simply about an index, all it really does is get you from some id (e.g. a fasta header or fastq name) to the thing (the fasta sequence or the fastq record).
In the case of flat files, it's a mapping from an id or name to the fseek file position at the start of a record. Given that, it's possible to make a generic file indexer that creates an index given a file and a function that advances the file handle (fh) to the next record and returns an id.

So for the case of the FASTQ format, which contains a new record every 4 lines, the first of which is the 'id', the parser could be:

def fastq_next(fh):
    id = fh.readline() # save the id.
    # advance the file handle 3 lines.
    fh.readline(); fh.readline(); fh.readline()
    return id

So regardless of the file format, this is the interface. The function takes a file handle and 1) advances the file position to the start of the next record and 2) returns the id.
Given that, the indexer call looks like:
FileIndex.create('some.fastq', fastq_next)

all that call has to do is repeatedly send a filehandle to fastq_next, accept the id returned by fastq_next, and save a mapping of id to the (previous) file position. An implementation detail is how that mapping is saved. I use a tokyo-cabinet BTree database.
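That loop is easy to sketch. The following is not the actual FileIndex code, just a minimal, hypothetical illustration of the idea, with a plain dict standing in for the tokyo-cabinet BTree:

```python
def build_index(path, next_record):
    """Sketch of what a create() call has to do: map each record
    id to the file offset where that record starts. (Hypothetical
    helper; a dict stands in for the tokyo-cabinet BTree.)"""
    index = {}
    with open(path) as fh:
        pos = fh.tell()
        while True:
            rec_id = next_record(fh)
            if not rec_id:  # EOF: readline() returned ""
                break
            index[rec_id.strip()] = pos
            pos = fh.tell()
    return index
```

The only requirement on next_record is the interface described above: advance the handle past one record and return its id.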
Once the index is created, usage is:

fi = FileIndex(somefile, handler)
record = fi[somekey]

where handler is a callable (including a class) that takes a file handle (positioned at the start of a record) and returns a thing. For fastq, it could be:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

so usage for fastq looks like:

>>> fi = FileIndex('some.fastq', FastQEntry)
>>> some_id = '@HWI-EAS105_0001:1:1:7:1680#0/1'
>>> record = fi[some_id]
>>> assert isinstance(record, FastQEntry)
>>> print record.seq
TATTTATTGTTATTAGTTATTTTANTANAAATANTNGANGGGGAGGAAGGNNNNNNTNNNNNNNNGANNNNANGAG
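Under the hood, a lookup like fi[some_id] doesn't have to do much: seek to the stored offset and hand the file to the handler. A hypothetical sketch (not the actual FileIndex code):

```python
def lookup(fh, index, key, handler):
    """Sketch of the seek-then-parse step behind fi[key]: jump to
    the offset stored for key, then let the handler consume the
    record from there. (Hypothetical helper for illustration.)"""
    fh.seek(index[key])
    return handler(fh)
```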

FastQ

So, from any name, there's direct access to each record. Given the FileIndex implementation, it's actually possible to create and use an index for a fastq file in 9 total lines of code, 6 of which are the class itself:

class FastQEntry(object):
    def __init__(self, fh):
        self.name = fh.readline().rstrip('\r\n')
        self.seq = fh.readline().rstrip('\r\n')
        self.l3 = fh.readline().rstrip('\r\n')
        self.qual = fh.readline().rstrip('\r\n')

# note the 2nd argument takes a file handle and returns a name.
FileIndex.create('some.fastq', lambda fh: FastQEntry(fh).name)
fi = FileIndex('some.fastq', FastQEntry)
print fi['@HWI-EAS105_0001:1:1:7:1680#0/1'].seq

SAM

That's decent, but what makes it cooler is that this same interface can be used to implement an index on a SAM (Sequence Alignment/Map) format file:

class SamLine(object):
    def __init__(self, fh):
        line = fh.readline().rstrip('\r\n').split("\t")
        # at end-of-file, readline() returns "" and line is [""].
        self.name = line[0] or None
        if self.name is None:
            return
        self.ref_seqid = line[2]
        self.ref_loc = int(line[3])
        # ... other SAM format stuff omitted.

f = 'some.sam'
FileIndex.create(f, lambda fh: SamLine(fh).name, allow_multiple=True)
fi = FileIndex(f, SamLine, allow_multiple=True)
print [(s.name, s.ref_seqid, s.ref_loc) for s in fi['23351265']]

# output: [('23351265', '2', 8524), ('23351265', '3', 14202495)]

That's it. Note one extra thing: in the SAM file, I used the read id (the first column) as the key for the indexing. Read ids are not unique, so I specify allow_multiple=True and the index returns a list of records for every key. (So although using tokyo-cabinet is an implementation detail, its putcat() functions make it much easier to support multiple entries per id.)
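The allow_multiple behavior is easy to picture: instead of one offset per key, the index keeps a list of offsets. A dict-of-lists sketch (hypothetical; tokyo-cabinet's putcat achieves the same thing by appending to the value already stored under a key):

```python
def build_multi_index(path, next_record):
    """Sketch of allow_multiple=True: each id maps to a list of
    file offsets, one per record sharing that id. (Hypothetical
    helper; a dict of lists stands in for putcat on a BTree.)"""
    index = {}
    with open(path) as fh:
        pos = fh.tell()
        while True:
            rec_id = next_record(fh)
            if not rec_id:
                break
            index.setdefault(rec_id.strip(), []).append(pos)
            pos = fh.tell()
    return index
```

Lookup then seeks to each offset in the list and builds one record per offset.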

Performance

For the SAM file I tested, the original file is 2.5G and the index created by FileIndex is 493M. For a large fastq file of 62.5 million lines (about 15.5 million records) the file is 3.3G and the index is 1.2G (compared to 3.8G for screed). It takes about 5 minutes to index the FASTQ file (compared to 11 for screed).
On my machine, with that FASTQ file, I get about 17,000 queries per second (including object creation). For the same fastq, screed gives about 10,000 queries per second (which matches the image in Titus' post).
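For reference, a queries-per-second number like that comes from a loop of this shape (a hypothetical harness, not the exact script I ran): pick random keys, time the lookups, divide.

```python
import random, time

def queries_per_second(fi, keys, n=100000):
    """Rough benchmark sketch: time n random lookups against any
    index-like object supporting fi[key], and report queries/sec."""
    sample = [random.choice(keys) for _ in range(n)]
    t0 = time.time()
    for k in sample:
        fi[k]
    elapsed = time.time() - t0
    return n / (elapsed or 1e-9)  # guard against a 0.0 timer reading
```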

Implementation

The full implementation is currently 44 lines, including (sparse) comments, and is shown in the gist at the end of this post. It lacks a lot of nice things (like, um, tests) and I'm not recommending anyone use it in place of a well thought out project like the biopython stuff or screed; still, I think the concept is useful--and I am using it for SAM files.
I've put the full code in bio-playground on github, including the examples from this post. But this is generic enough that it could be used to index anything: blast files, fasta, ... whatever. As always, I welcome any feedback.

About Tokyo Cabinet

In the course of this, I found major problems in 3 different python wrappers for tokyo-cabinet and reported bugs to 2 of them. So, kudos to py-tcdb. While I don't like that I have to explicitly tell it _not_ to pickle what I send it, it's well tested, the code is very nice, and ... best of all, I haven't been able to make it segfault. It is also easy_install'able.
Another thing I learned today is that you can't use a tokyo-cabinet hash database for more than a couple million records. A web search shows that lots of people have asked about this, and the answers always have to do with adjusting the bnum bucket parameter. That did not work for me. If you can create a tokyo-cabinet hash database with over 5 million records that does not slow down on inserts, please show me how. Until I see it, I think one has to use the BTree database in TC.

3 comments:

István Albert said...

Neat interface, a generic solution to a common problem.

Couldn't resist checking against bsddb though, so I took your code and replaced TokyoCabinet with a btree BerkeleyDB that comes by default with python (at least up to 2.6) ... and it seems that I get the fastest time:

---------- Python ----------
0.0
getting 1000000 keys...
shuffling
timing
40.9879999161
24397.3846503 queries per second

Output completed (45 sec consumed) - Normal Termination

brentp said...

@István, doh! that would have saved me a lot of time just using bsddb instead of trying various tc wrappers. just to check, i stuck 100M keys in bsddb and it does fine.
What time did you get for the original version of the timing you posted?

i think it would be pretty simple to extract out the storage so one can specify any key-value store. but i'll probably make bsddb the default.

István Albert said...

I did not check TC since I was a little lazy to install and understand it - truth be told, that's what motivated me to try bsddb in the first place ;-)