Stupid things I did as a Bioinformatics Programmer in 2008
In 2008, I was good enough at programming to get my ass kicked by hard problems. I think that's the most positive way to say it.
My main bioinformatics project was a long annotation pipeline. It takes days to run, often using 8 CPUs. It's driven by a big ol' Makefile. I made the mistake of passing data between steps in unstructured text files or Python pickles.
I'd create one at the beginning of the pipeline and not notice it was messed up in a way that affected other parts until the entire pipeline was done, days later. Toward the end of the pipeline, I'd need something simple, like the strand of a BLAST hit, but I'd have to parse through an entire GFF file, or load some huge pickle into memory just to get to that. Then I'd need some annotation, and I'd have to add a slow step of doing a lookup in a script that'd otherwise run very quickly.
I was passing around data in arrays and tuples, so when I changed the order or added another element in the script that created the tuple, a downstream script that used the tuple would access the data with the wrong index. If I was lucky, my code would fail; if not, it'd silently use the strand where it should have had the start location.
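Here's a minimal sketch of one way around that index drift, assuming the intermediate files are tab-separated with one feature per row: write a header line and read the fields back by name. The field names (`chrom`, `start`, `stop`, `strand`) and the function names are made up for illustration; they aren't from the actual pipeline.

```python
import csv
import io

# Hypothetical field names, for illustration only.
FIELDS = ["chrom", "start", "stop", "strand"]


def write_features(handle, features):
    """Write dict records as tab-separated text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    for feature in features:
        writer.writerow(feature)


def read_features(handle):
    """Yield dict records, looking fields up by name rather than by position."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["start"] = int(row["start"])
        row["stop"] = int(row["stop"])
        yield row


# An in-memory buffer stands in for the file passed between two steps.
buffer = io.StringIO()
write_features(buffer, [{"chrom": "1", "start": 100, "stop": 250, "strand": "-"}])
buffer.seek(0)
features = list(read_features(buffer))
```

With this layout, if an upstream script later reorders the columns or adds a new one, `row["strand"]` downstream still means the strand.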
I hit problems where I'd run out of memory. At one point, I ran out of disk space (it's a big series of datasets). I also hit bugs in software I was using.
When I should have run just 1 chromosome to test the pipeline in 1/100th of the time (the full run compares 10 chromosomes to 10 chromosomes), I ran it over an entire genome.
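A small test input doesn't take much to build. This is a rough sketch, assuming GFF-style input where the first tab-separated column is the chromosome name; the function name and the example rows are invented for illustration.

```python
def keep_one_chromosome(lines, chrom):
    """Yield only the GFF-style lines whose first column matches chrom."""
    for line in lines:
        if line.startswith("#"):
            # Keep comment and header lines as they are.
            yield line
            continue
        if line.split("\t", 1)[0] == chrom:
            yield line


# Made-up example rows; a real input would be read from a file.
example = [
    "##gff-version 3\n",
    "1\tsrc\tgene\t100\t250\t.\t+\t.\tID=a\n",
    "2\tsrc\tgene\t300\t450\t.\t-\t.\tID=b\n",
]
subset = list(keep_one_chromosome(example, "1"))
```

Running the whole pipeline on a subset like this first would surface most of the plumbing mistakes long before a multi-day run does.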
When I should have taken the time to really fix small mistakes as I found them, I instead worked around them, making the code unnecessarily complex as a result. If I had fixed instead of fudged in those cases, I would have been more productive.
I did write tests, but not enough, and I didn't set up the project in a way that made it really testable. I'm still learning how to do that. All the other stuff may just be discipline, but testing is very difficult for me in bioinformatics. I've extracted what I could into tested libraries and added checks for the intermediate data, and every time I found a dumb error, I added an assert. (There's a discussion of pretty much exactly the problems I'm describing here: http://ivory.idyll.org/blog/sep-08/the-future-of-bioinformatics-part-1b.html )
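A check on intermediate data can be as small as the sketch below. The rules in it (strand is `+` or `-`, coordinates are positive integers, start is not after stop) are assumptions picked for illustration, and `check_feature` is a hypothetical name, not a function from the pipeline.

```python
def check_feature(feature):
    """Raise AssertionError as soon as a record looks wrong."""
    assert feature["strand"] in ("+", "-"), "bad strand: %r" % (feature["strand"],)
    assert isinstance(feature["start"], int), "start is not an int"
    assert isinstance(feature["stop"], int), "stop is not an int"
    assert 0 < feature["start"] <= feature["stop"], "start/stop out of order"
    return feature


# A record that passes every check is returned unchanged.
good = check_feature({"chrom": "1", "start": 100, "stop": 250, "strand": "+"})
```

Called on every record as it is read, a check like this fails at the step that produced the bad data rather than days later. Keep in mind that `assert` statements are skipped when Python runs with `-O`.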
I'd assume that it was OK to use a tuple, because I was only going to store start and stop, but then later I'd need to add chromosome, then strand, then score, and pretty soon I'd have code elsewhere like:
    if cns[2][4] < cns[2][5]:
        ...
where I have no idea what I'm comparing. That one's simple to fix: just use objects, or dicts at the very least. But a 2-tuple is so tempting, and a 3-tuple not so bad, and ...
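As a sketch of the "just use objects" fix, the standard library's `collections.namedtuple` costs about as little as a plain tuple. The field names below are guesses for illustration; I'm not claiming they match what positions 4 and 5 held in the snippet above.

```python
from collections import namedtuple

# Hypothetical field names; the real meaning of positions 4 and 5 is unknown.
Feature = namedtuple("Feature", ["chrom", "start", "stop", "strand", "score"])

feature = Feature(chrom="1", start=100, stop=250, strand="+", score=0.9)

# Named access says what is being compared.
is_forward_ordered = feature.start < feature.stop

# Positional access still works, so old code keeps running during a migration.
assert feature[1] == feature.start
```

Because a namedtuple is still a tuple, it can replace the existing tuples one script at a time.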
On top of all this, the project was changing as I was working on it, so I was changing what went into the start of the pipeline, what we were getting out, and the steps in between. And because of all the extra little hacks in there, I would be stuck in some function far away from the data that I needed (*).
It sucked. I can get away with those sorts of mistakes on a project that runs in 5 seconds and has a simple, relatively well-known output, but not on a complex pipeline.
It's been painful watching myself do such stupid stuff, and then reading the code afterward, but I think I've learned a lot. Much of that code still sucks, but I've moved to more rigid data structures. Every time I make a change now, I do it for real; it's not just hacked in. I have to deal with someone pacing around, waiting for the results and asking me questions like "why is it so hard to just add in the RNA?" or whatever. Also, statistically speaking, I'm probably running out of mistakes I can make...
And actually, the most difficult things have been interpersonal relationships, but that's for another post, as are the more positive, awesome things I did with projects that I set up correctly and that had good test coverage...
And finally, I did make a lot of mistakes, but this has genuinely been a difficult problem.
* See this post, especially the last 4 paragraphs. It's good to hear that someone with 43 years of programming experience has the same problems as I do.