Friday, October 03, 2008

genedex.fasta with numpy.memmap

EDIT: added job posting to comments.
I've been working a bit on genedex, I'm still not happy with the way it stores features. Which is a huge pickle of dictionaries where every dictionary is a 'feature' that looks like: {'name':'At2g26540', 'start': 1234, 'stop': 3456, 'strand': 1, 'chr': 2}. So the only way to do a search is by location--and that is _very_ fast, thanks to rtree, but there's no way to search by name or any other attribute--and an entire organism is loaded into memory at once--that part actually works out ok, but it feels dirty. I quickly wrote an SQLAlchemy backed interface to a simple db schema do allow this sort of searching here: http://code.google.com/p/genedex/source/browse/trunk/genedex/models/sqla.py. That already supports Feature.upstream(), downstream(), etc. methods, but it will work nicely once python supports sqlite rtree without any extra work--for now, it just uses BTree indexes on the start and stop. I could use rtree to index the sqlite database, but I'd like to move away from the LGPL. Maybe this KDTree that's already in a scipy branch with a more permissive license. Then it could do both spatial, and attribute queries...
That's all tinkering...

I also cleaned up the genedex.fasta module. The useage is nice, if not entirely the implementation. A fasta file can look like:

>chr1
ATGTCGTCGGCCGC
GGGCCAAGA
CAACGGAGA

>chr3
ATGGAGGAGGCTGGCGAGCGG

>chr2
ATGGCGTGC
ACGGCGGCG
CGCATGTT
CGCCT

where a line starting with > is the name of the sequence ('chr1') and everything up until the next > is the sequence. The problem is the newlines, so every time you want to look at chr1 basepairs 10 to 20, you have to find where the sequence starts, and account for newlines. -- Acutally one should never do that, as Biopython, and pretty much any library will take care of that for you. Pygr for example, creates a new file something.fasta.pureseq which removes all newlines and labels and indexes where the starts and stop are. genedex.fasta.Fasta now does something similar, here's example useage on the file above ('123.fasta').

from genedex import Fasta
f = Fasta('123.fasta')
print f.keys()
print f['chr1'][9:20].tostring()
print f['chr1'][9:20]

after, the fasta file looks like this:

>chr1
ATGTCGTCGGCCGCGGGCCAAGACAACGGAGA
>chr3
ATGGAGGAGGCTGGCGAGCGG
>chr2
ATGGCGTGCACGGCGGCGCGCATGTTCGCCT

with spaces removed--so it's still a valid fasta file (you can also send an argument to the constructor and it will create a new file) and there's a new file 123.fasta.gdx that is a python pickle containing:
{'chr3': (45L, 67L), 'chr2': (73L, 105L), 'chr1': (6L, 39L)}
which indicate the start and stop positions of the sequence in the file.
So the file remains a valid fasta file, but now it can be efficiently sliced. For now, it actually uses a numpy mmmap (numpy.memmap), to take advantage of the broadcasting, other than that, it'd be simpler to just use python mmap. So, when it sees f['chr1'][10:20] it acts just like it's indexing a numpy array, but it's accessing the data directly from the disk (well, not actually, but mmap works it's magic and I dont have to think about that). I like that--I can keep my fasta files as valid, add only a small python pickle file, and get simple, fast, pythonic indexing into them. It does take about 12 seconds to index and flatten the entire sorghum genome (629MB, ~660 million basepairs) on first view, after that first time, it's instantaneous.
Anyway, the source is available and easy_install-able:

svn checkout http://genedex.googlecode.com/svn/trunk/ genedex
As always, I'll gladly take any improvements, bugs, enhancements.

Also, our lab at UCB is looking for another (full-time or close to it, on-site) programmer who knows some biology, perl, and hopefully some python. If you're interested, or know anyone, send me an email. I have no real authority in the matter (or any matter) but I will have some say in this. I'd like to work with someone I can learn from. I'll add a link to the job posting in the comments below once it's posted.

1 comment:

brentp said...

the job posting i mentioned above is up. i'm pasting the full listing here as it's not a public link:


Job Description


Posting Title:

Programmer/Analyst III-Ucb
Requisition:

009222
Department:

Plant & Microbial Biology
Location:

Main Campus-Berkeley
Salary:
Annual Salary Range: $59,676 - $85,656
First Review Date:

10/29/2008

This requisition will remain open until filled.



Job Description:

The Department of Plant and Microbial Biology (PMB) was established in 1989 in the College of Natural Resources as a research, teaching, and training center of the Agricultural Experiment Station. The Department is affiliated with the USDA Plant Gene Expression Center in Albany and has 35 regular faculty members, one cooperative extension specialist, and nine adjunct faculty members. The faculty are involved in both graduate and undergraduate teaching programs as well as extensive scholarly research. In addition to faculty, there are approximately 20 administrative staff. The Department Chair, Associate Chair, and a Chief Administrative Officer oversee the department's programs.

This position is in the Freeling laboratory, which is a part of PMB. Our research focus is comparative plant genomes.

This job involves writing programs in a computer language, designing related databases, web interfaces and content, of multimedia processes. Incumbent will design, develop, modify, test, evaluate and maintain computer programs. Work includes test-to-production processes, quality assurance, maintenance and documentation of applications. Includes web applications programming. Includes some system administration.
Responsibilities:

40%) Design, develop, modify, debug and evaluate complex programs to enhance biological research.

(20%) Learn software languages as needed. Determine which programming language is needed to efficiently code for the required application. If necessary learn new languages or develop advanced skills in a particular language.

(20%) Work with the PI to can learn cutting-edge genetic and genomic concepts that will inform algorithm coding. The programmer must be able to assimilate new concepts and information and incorporate it into programs and databases.

(5%) Analyze existing programs or work to formulate logic for new systems, devise logic procedures, prepare flowcharting, perform coding, data analysis, and test/debug programs through the application of professional programming concepts.

(5%) Provide analysis for the design and use complex relational databases. Develop and execute moderately complete test plans. Develop conversion and system implementation plans.

(5%) Gather, analyze, prepare and summarize recommendations for approval of system and programming documentation.

(5%) Modest system administration: backup, assure continuous server function, work with others here.
Requirements & Qualifications:

• B.S. or B.A. in related area and/or equivalent experience/training. Previous expert-level experience with Linux and Perl at minimum, either as part of an advanced degree program or in the workplace, or both. Sound background in science
• Requiring thorough knowledge of applications programming. Perl essential, some experience with Python, JavaScript, actionscript (Flash), C, bash, VBA would be helpful
• Must have knowledge relating to the design and development of applications programs across the laboratory
• Requiring knowledge of other related areas of IT
• Requiring advanced skills associated with programming design, modification and implementation. Requires interpersonal skills in order to work with other members of the lab
• Must be self-motivated, work independently or sometimes as part of a team, able to learn quickly, meet deadlines and demonstrate problem-solving skills. Must be available to help fix crashes. Must have advanced skills in web applications, web programming language and object oriented programming concepts
• Must have basic bioinformatics background and be able to use molecular biological and genetic textbook-type material to learn enough science to understand why one algorithm is appropriate and another is not