real-world postgresql vs mysql benchmark

At my work, we have a large MySQL database (15 MyISAM tables, 21 million rows, 10Gigs size). After seeing the benchmarks showing that Postgres out-performs MySQL on multi-core machines (our new db server has 4 CPU's), I ported the database to PostgreSQL.
We have begun using the DBIx perl module since Class::DBI is too sloooow. The DBIx module allows closer access to the generated SQL, and it allows "prefetch"ing which eliminates extra back-and-forth (and object creation) between the server and client. In addition, the connection string is in the script, not in the generated API. This makes it easy to benchmark as all that is required to change between db engines is to change the connection string. Using this script:

use CogeX; use strict;
# mysql
#my $connstr = 'dbi:mysql:genomes:host:3306';
# postgresql
my $connstr = 'dbi:Pg:dbname=genomes;host=host;port=5432';
my $s = CoGeX->connect($connstr, 'user', 'pass' );

my $rs = $s->resultset('Feature')->search({
'' => 'CDS' ,
'' => {like => 'At1g%' }
join => ['feature_names','feature_type'],
prefetch => ['feature_names','feature_type']

while (my $feat =$rs->next()){
my $fn = $feat->feature_names;
my $type = $feat->feature_type->name;

map { print $_->name . ":". $type . "\t" } $fn->next();
print "\n";

# this prefetch avoids n calls where n is number of sequences #
foreach my $seq ($feat->sequences({},{prefetch=>"sequence_type"})){
print $seq->sequence_data if $seq->sequence_type->name eq 'protein';
print "\n\n";

fetches the protein sequence of any coding sequence (CDS) that has a feature name that starts with 'At1g' which would be any CDS on chromosome 1 of arabidopsis in our database.
The script consistently runs in 45 seconds on MySQL and in 29-31 seconds in Postgres. Other scripts seem to have about that difference--PostgreSQL finishes in about 60-70% of the time that the MySQL scripts do. Or, more dramatically: MySQL is 50% slower. That's pretty good for no change in the API, and all tables have the same indexing and structure.
Postgres was set up as default except for these values in postgresql.conf
shared_buffers 40000
max_connections 200
work_mem 4096
effective_cache_size 10000
This was the most concise indication of values to set, though likely closer tuning could improve performance (ahem, suggestions welcome).
An added benefit, is now, we can push more work into the database server using PL/Perl or another PL language, which can further reduce network back and forth, and reduce the creation of perl objects when not necessary.


Anonymous said…
Thank you.
Anonymous said…
You should post your MySQL config along with the size of the indexes for your tables in question.

MySQL requires a lot of tuning to be very efficient (both on the server as in the queris you send). I am not surprised that a default install leads to those results.

It would be most beneficial to have the EXPLAIN output of the query you are trying to run.

Things to look out for with MySQL are to make sure indexes can be preloaded/kept in memory for maximum efficiency.
Anonymous said…

I need to update that article. A few things to set:

1) wal_buffers should be set to 128. We'll probably make this the default in the future.

2) 10000 seems low for effective_cache_size. How much RAM do you have?

--Josh Berkus
brent said…
thanks for the tips. i'll try those out. the ram is 12G, i believe.
also, i asked on #postgresql if there was an example postgresql.conf file for a more modern machine, but noone was able to point me to one. i think it would be a nice addition to your site to have some links to some rough suggested conf files a la: postgresql_2GRAM.conf, _8GRAM.conf, _multCPU.conf, etc.
for folks like me who are just running around in the dark.

agreed and noted. and i'll look into this preloading of indicies. i know nothing about that.
jrock said…
BTW, you should call DBIx::Class DBIx::Class (DBIC is fine too) and not DBIx. DBIx:: is a generic namespace for extensions to DBI.

Popular posts from this blog

python interval tree

filtering paired end reads (high throughput sequencing)

my thoughts on golang