Sunday, February 10, 2008

python bioinformatics

There's a new article out in BMC Bioinformatics with a comparison of the speed and length of programs written in various languages. The article was sent to the biology in python (BIP) mailing list. Looking at the code, it's not that bad, but it's clear the authors are not pythonistas, and that while the reviewers did a fine job on the paper itself, there was likely no thorough review of the code.
The authors define their own max() function that needlessly overrides python's built-in, and they use the code:
line.rstrip('/n')
which indicates neither a thorough understanding of python nor a complete code review. Even a non-python programmer should have seen that the intent was to strip a newline '\n' (not the characters '/' and 'n'), and rstrip() is not an in-place operation, so the desired behavior is:
line = line.rstrip('\n')
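To spell out the pitfall: python strings are immutable, so rstrip() returns a new string rather than modifying the original, and its argument is a *set of characters* to strip, not a literal suffix. A quick sketch (the variable names here are just for illustration):

```python
# rstrip() returns a new string; the original is untouched.
line = "ACGT\n"
line.rstrip('\n')         # result is silently discarded
assert line == "ACGT\n"   # still has the newline

line = line.rstrip('\n')  # rebind the name to the stripped result
assert line == "ACGT"

# rstrip('/n') strips any trailing '/' or 'n' characters,
# not the two-character sequence -- the escape must be '\n'.
assert "gene/n".rstrip('/n') == "gene"
```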
Syntactical mistakes aside, python was given a poor review on speed. Andrew Dalke, of wide-finder (and general python-bio) fame, ran the alignment program (which seems to be Needleman-Wunsch) and found the psyco JIT'ed version ran in 1.7 seconds instead of the original 18+.
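For context, the paper's actual alignment code isn't reproduced here, but a minimal Needleman-Wunsch scorer in idiomatic python looks something like the following (the function name and the +1/-1/-1 scoring scheme are my own assumptions, not the paper's):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) using a
    rolling-row dynamic-programming table: O(len(a)*len(b)) time,
    O(len(b)) space."""
    # first row: aligning a prefix of b against nothing costs gaps
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * gap]  # first column: all gaps
        for j, cb in enumerate(b, 1):
            cur.append(max(
                prev[j - 1] + (match if ca == cb else mismatch),  # (mis)match
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            ))
        prev = cur
    return prev[-1]

print(nw_score("GATTACA", "GATTACA"))  # identical sequences: 7 matches -> 7
```

Tight inner loops like this are exactly what psyco sped up; the idiomatic version above also avoids shadowing the built-in max() the way the paper's code did.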

Given the lack of polish on the programs, even those in other languages, the question is: what's the recourse? On the BIP list, some even suggested asking the authors to retract the article. I think that's going too far, as (to my knowledge) the code runs and does more or less what it should. An alternate approach would be to propose more thorough code review for all articles, not just benchmarks. As an example of informal code review: had the authors sent their code to the python list, they would undoubtedly have received numerous suggestions for making it faster and more idiomatic. Likewise for the other languages. Clearly that's not a solution for all cases, but given reasonably active communities for Bio-Python, Perl, Java, and Ruby, a journal or author should be able to find a competent reviewer--especially in the common case where a single language was used. Journals demand a very specific format for the text of an article; should they have similar standards for any code used in it? Should they require automated tests? This presents a problem for those who use proprietary software. But one of the points of a scientific paper is to document the methods sufficiently to reproduce the results--is the actual code required to do so?
It's an interesting question, and I don't know the answer. I do notice two things:
1) A reviewer is not expected to know how to program, but she is expected to understand the science.
2) We do not have the same standards for code review as for reviewing the text.

Do the ends justify the means?

3 comments:

Dan said...

Interesting, I clocked this article over the weekend. The first thing that crossed my mind was 'Will BMC publish anything these days?' - probably because this kind of benchmarking is a) a kind of obvious thing to do and b) any competent bioinformatician uses the right tool for the right job anyway. Don't they?

However having been caught up in 'code hell' last week with a paper that had been published (but the GEO dataset not released), R code in the supplementary materials presented *in a Word document* and the code provided not working without modification (and one section so completely broken I couldn't fix it in the afternoon I spent on it), it made me wonder what the role of a reviewer is for code?

Don't all reviewers of applications download, install and attempt to use them? If not - they're derelict in their duty.

Retraction might be a step too far, but a stiff letter from the appropriate communities to the editors along the lines of your post should definitely be investigated.

sgillies said...

Reminds me a bit of the nonsense we heard about Python after the Chandler project shake-up.

brent said...

@dan, isn't it pretty common--not including any source-code with a publication? it seems in the rare cases when it's available, it's only as supplemental info or just a link to a download site.
i may (ahem) have some scripts used to generate pub data that i'd be embarrassed if anyone saw...
@sean, i'm (ab)using shapely for a bioinformatics project. i'll write it up in a few days now that there's some indication that someone besides me may read it.