Sunday, January 14, 2007

real-world postgresql vs mysql benchmark

At my work, we have a large MySQL database (15 MyISAM tables, 21 million rows, about 10 GB in size). After seeing benchmarks showing that Postgres outperforms MySQL on multi-core machines (our new db server has 4 CPUs), I ported the database to PostgreSQL.
We have begun using the DBIx::Class Perl module since Class::DBI is too sloooow. DBIx::Class allows closer access to the generated SQL, and it supports "prefetch"ing, which eliminates extra back-and-forth (and object creation) between the server and the client. In addition, the connection string lives in the script, not in the generated API, which makes benchmarking easy: switching between db engines only requires changing the connection string. Using this script:

use strict;
use CoGeX;

# mysql
#my $connstr = 'dbi:mysql:genomes:host:3306';
# postgresql
my $connstr = 'dbi:Pg:dbname=genomes;host=host;port=5432';
my $s = CoGeX->connect($connstr, 'user', 'pass');

my $rs = $s->resultset('Feature')->search(
    {
        'feature_type.name'  => 'CDS',
        'feature_names.name' => { like => 'At1g%' },
    },
    {
        join     => [ 'feature_names', 'feature_type' ],
        prefetch => [ 'feature_names', 'feature_type' ],
    }
);

while (my $feat = $rs->next()) {
    my $fn   = $feat->feature_names;
    my $type = $feat->feature_type->name;

    map { print $_->name . ":" . $type . "\t" } $fn->next();
    print "\n";

    # this prefetch avoids n calls, where n is the number of sequences
    foreach my $seq ($feat->sequences({}, { prefetch => "sequence_type" })) {
        print $seq->sequence_data if $seq->sequence_type->name eq 'protein';
    }
    print "\n\n";
}


fetches the protein sequence of any coding sequence (CDS) with a feature name starting with 'At1g', which in our database means any CDS on chromosome 1 of Arabidopsis.
The script consistently runs in 45 seconds on MySQL and in 29-31 seconds on Postgres. Other scripts show about the same difference: PostgreSQL finishes in roughly 60-70% of the time the MySQL versions take, or, put more dramatically, MySQL takes about 50% longer. That's pretty good for no change in the API, with all tables having the same indexing and structure.
Postgres was set up with the defaults except for these values in postgresql.conf:
shared_buffers = 40000
max_connections = 200
work_mem = 4096
effective_cache_size = 10000
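(If I'm reading the units right for this version, bare numbers are 8 kB pages for shared_buffers and effective_cache_size and kB for work_mem, so that's roughly 320 MB of shared buffers, 4 MB per sort, and an 80 MB cache-size estimate.)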
Those were the most concise recommendations for values to set that I found, though closer tuning could likely improve performance further (ahem, suggestions welcome).
An added benefit is that we can now push more work into the database server using PL/Perl or another PL language, which can further reduce network back-and-forth and avoid creating Perl objects when they are not necessary.
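As a rough sketch of what that could look like, the PL/Perl function below moves the CDS-to-protein lookup from the script above into the server. The function name and the table/column names are guesses inferred from the relationship names in our DBIx::Class schema, so treat it as an illustration rather than a drop-in function.

-- hypothetical: table/column names are inferred from the ORM relationships above
CREATE OR REPLACE FUNCTION cds_protein_seqs(text) RETURNS SETOF text AS $$
    my ($prefix) = @_;
    # one server-side query instead of many client/server round trips
    my $plan = spi_prepare(q{
        SELECT s.sequence_data
        FROM feature f
        JOIN feature_name  fn ON fn.feature_id       = f.feature_id
        JOIN feature_type  ft ON ft.feature_type_id  = f.feature_type_id
        JOIN sequence      s  ON s.feature_id        = f.feature_id
        JOIN sequence_type st ON st.sequence_type_id = s.sequence_type_id
        WHERE ft.name = 'CDS'
          AND st.name = 'protein'
          AND fn.name LIKE $1
    }, 'TEXT');
    my $rv = spi_exec_prepared($plan, $prefix . '%');
    # return the set of sequences directly; no client-side objects get built
    return [ map { $_->{sequence_data} } @{ $rv->{rows} } ];
$$ LANGUAGE plperl;

Called as SELECT * FROM cds_protein_seqs('At1g'); only the sequences themselves would travel over the wire.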