Monday, November 26, 2007

tinycc

i've been _trying_ to learn C. tinycc, besides being tiny, compiles very quickly, allowing you to do cool things like scripting in C:
#!/usr/bin/tcc -run

#include <stdio.h>

int main(int argc, char *argv[]) {
    printf("Hello World %s, %s\n", argv[0], argv[1]);
    return 0;
}
and then make it executable and run it as

chmod +x file.c
./file.c arg_1

which makes it easier for those of us short on c-fu to guess and check.
it also allows such nice things as c in python, which is like pyinline, but uses ctypes and doesn't need write access.
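a minimal sketch of that idea (my own example, not from that project): compile a C snippet into a shared library with tcc, then load it with ctypes. nothing here needs write access outside a temp directory; the function name and paths are just made up for illustration.

import os, tempfile, subprocess
from ctypes import CDLL

c_src = """
int add(int a, int b) { return a + b; }
"""

# write the C source to a temp dir and let tcc build a shared library from it
d = tempfile.mkdtemp()
src = os.path.join(d, "add.c")
lib = os.path.join(d, "libadd.so")
open(src, "w").write(c_src)
subprocess.check_call(["tcc", "-shared", "-o", lib, src])

# load the library with ctypes and call straight into the compiled C
libadd = CDLL(lib)
print libadd.add(2, 3)   # -> 5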

Thursday, October 11, 2007

Sorting by proximity to a date in PostgreSQL

PostgreSQL has great support for dates:

=> SELECT '2007-08-23'::date - '2006-09-14'::date as days;
days
------
343

given a date column and a target date, you can find the nearest row by extracting the epoch from the difference. here i used ABS since i just want the nearest date, whether it falls before or after:

SELECT *, ABS(EXTRACT(EPOCH FROM (date - '2006-08-23'))::BIGINT) AS date_order
FROM record
WHERE well_id = 1234
ORDER BY date_order
LIMIT 1;

i suppose this could make a nice PL/PGSQL function...
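in the meantime, a parameterized version from python might look something like this (a sketch: psycopg2 and the connection string are my assumptions, and record/well_id/date are just the names from the query above).

import psycopg2

def nearest_record(conn, well_id, target_date):
    """return the row from record whose date is closest to target_date."""
    cur = conn.cursor()
    cur.execute("""
        SELECT *, ABS(EXTRACT(EPOCH FROM (date - %s::date))::BIGINT) AS date_order
        FROM record
        WHERE well_id = %s
        ORDER BY date_order
        LIMIT 1""", (target_date, well_id))
    return cur.fetchone()

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
print nearest_record(conn, 1234, '2006-08-23')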

Saturday, September 29, 2007

k-means clustering in scipy

it's fairly simple to do clustering of points with similar z-values in scipy:


import numpy
import matplotlib
matplotlib.use('Agg')
from scipy.cluster.vq import *
import pylab
pylab.close()

# generate some random xy points and
# give them some striation so there will be "real" groups.
xy = numpy.random.rand(30,2)
xy[3:8,1] -= .9
xy[22:28,1] += .9

# make some z values
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

# convert groups to rgb 3-tuples.
colors = ([([0,0,0],[1,0,0],[0,0,1])[i] for i in idx])

# show sizes and colors. each color belongs in diff cluster.
pylab.scatter(xy[:,0],xy[:,1],s=20*z+9, c=colors)
pylab.savefig('/var/www/tmp/clust.png')
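a quick sanity check i'd add after that (not in the original snippet): look at how many points landed in each cluster and where the centroids ended up.

# cluster sizes and centroid locations from the kmeans2 call above
print numpy.bincount(idx)
print res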


Tuesday, July 17, 2007

using python mapscript to create a shapefile and dbf

i always have trouble remembering how to use mapscript. it's pretty simple, but the docs are hard to find and the test cases (though excellent!) have a lot of abstraction.

here's some code that creates a shapefile and dbf (using another module), and does a quick projection at the start.


import mapscript as M
import random
from dbfpy import dbf

#########################################
# do some projection
#########################################

p = 'POINT(466666 466000)'
shape = M.shapeObj.fromWKT(p)
projInObj = M.projectionObj("init=epsg:32619")
projOutObj = M.projectionObj("init=epsg:4326")
shape.project(projInObj, projOutObj)
print shape.toWKT()


#########################################
# create a shapefile from scratch
#########################################
ms_dbf = dbf.Dbf("/tmp/t.dbf", new=True)
ms_dbf.addField(('some_field', "C", 10))

ms_shapefile = M.shapefileObj('/tmp/t.shp', M.MS_SHAPEFILE_POLYGON)

for i in xrange(10):
    ms_shape = M.shapeObj(M.MS_SHAPE_POLYGON)
    ms_line = M.lineObj()

    # build a ring of random points for this polygon
    for j in xrange(10):
        ms_line.add(M.pointObj(random.randint(0, 99), -random.randint(0, 99)))

    ms_shape.add(ms_line)
    ms_shapefile.add(ms_shape)

    # add a matching dbf record for each shape
    rec = ms_dbf.newRecord()
    rec['some_field'] = 'hi' + str(i)
    rec.store()

ms_dbf.close()

Wednesday, June 27, 2007

Note to self: using python logging module


import logging

logging.basicConfig(level=logging.DEBUG
,format='%(asctime)s [[%(levelname)s]] %(message)s'
,datefmt='%d %b %y %H:%M'
,filename='/tmp/app.log'
,filemode='a')

logging.debug('A debug message')
logging.info('Some information')
logging.warning('A shot across the bows')
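once basicConfig has been called like that, other modules can just grab a named logger and their messages end up in the same file. a small sketch (the module and logger names are made up):

# in some other module of the app, e.g. myapp/db.py (hypothetical)
import logging

log = logging.getLogger('myapp.db')

def connect():
    log.info('connecting to the database')
    log.debug('with some noisier detail')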

Thursday, April 26, 2007

Fix indentation in VIM

Often times, i get files (xml, in my case) which have the indentation completely messed up, not just mixed tabs/spaces, but really "whack".

these commands seem to magically fix it, for at least the 2 test cases i've tried:

:set filetype=xml
:filetype indent on
:e
gg=G

Wednesday, April 25, 2007

Using Python MiddleWare

just trying to figure this stuff out. it's pretty simple, but there's one level of abstraction through web.py. you can use middleware to add keys to the environ, for example (a plain-wsgi version of the same idea is sketched after the code below).
http://groovie.org/files/WSGI_Presentation.pdf


#!/usr/bin/python

import web
import random

class hi(object):
    def GET(self, who='world'):
        web.header('Content-type', 'text/html')
        print "hello %s" % who

class bye(object):
    def GET(self, who='world'):
        web.header('Content-type', 'text/plain')
        print "bye %s" % who

        # dump the environ (including the key added by the middleware)
        for c in web.ctx.env:
            print c, web.ctx.env[c]

class other(object):
    def GET(self):
        web.header('Content-type', 'text/plain')
        for c in web.ctx:
            print c, web.ctx[c]

urls = ( '/bye/(.*)', 'bye'
       , '/hi/(.*)' , 'hi'
       , '/.*'      , 'other')

class RandomWare(object):
    def __init__(self, app):
        self.your_app = app

    def __call__(self, environ, start):
        # every request gets a random number stuffed into its environ
        environ['hello'] = random.random()
        return self.your_app(environ, start)

def random_mw(app):
    return RandomWare(app)

if __name__ == "__main__":
    web.run(urls, globals(), random_mw)
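for comparison, here is the same middleware idea with web.py stripped away: just a bare wsgi app plus the wrapper. this is my own sketch using wsgiref from the standard library, not part of the original post.

import random
from wsgiref.simple_server import make_server

def app(environ, start_response):
    # the app just reports whatever the middleware put into the environ
    start_response('200 OK', [('Content-type', 'text/plain')])
    return ["hello is %s\n" % environ.get('hello')]

class RandomWare(object):
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        environ['hello'] = random.random()
        return self.app(environ, start_response)

if __name__ == '__main__':
    make_server('localhost', 8080, RandomWare(app)).serve_forever()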

Tuesday, April 24, 2007

Install, run, and benchmark mod_wsgi in < 10 minutes

svn checkout http://modwsgi.googlecode.com/svn/trunk/ modwsgi
cd modwsgi
./configure
make
sudo make install
# note where mod_wsgi.so went on your system
echo "LoadModule wsgi_module /path/to/mod_wsgi.so" >> /path/to/apache2.conf
# restart apache so the new module gets loaded
sudo apachectl restart

mkdir /var/www/wsgitest/
cd /var/www/wsgitest/

vi .htaccess
# [in .htaccess]
Options +ExecCGI
<Files hi.py>
SetHandler wsgi-script
</Files>
# [ end .htaccess]

vi hi.py
# [in hi.py]

#!/usr/bin/python

import web

class hi(object):
    def GET(self, who='world'):
        web.header('Content-type', 'text/html')
        print "hello %s" % who

class bye(object):
    def GET(self, who='world'):
        web.header('Content-type', 'text/html')
        print "bye %s" % who

urls = ( '/bye/?(.*)', 'bye'
       , '/hi/?(.*)' , 'hi' )

application = web.wsgifunc(web.webpyfunc(urls, globals()))

#[end hi.py ]


you can then browse to
http://localhost/wsgitest/hi.py/hi/there
# see "hello there"
http://localhost/wsgitest/hi.py/bye/bye%20bye
# see "bye bye bye"



(meaningless) Benchmarking:
change last line in hi.py to:
if __name__ == "__main__": web.run(urls,globals())
and save as cgi.py

cgi
$ ab -n 1000 -c 30 http://localhost/wsgitest/cgi.py/hi/there | grep 'Requests per second'

Requests per second: 4.08 [#/sec] (mean)

wsgi
$ ab -n 1000 -c 30 http://localhost/wsgitest/hi.py/hi/there | grep 'Requests per second'

Requests per second: 351.05 [#/sec] (mean)

Wednesday, March 21, 2007

vim tricks

i've been trying to learn new stuff in vim instead of doing the same old things. recently, i've been using :tabe to edit in tabs. lately, i've been trying :sp to edit in splits.
this set of tricks makes it even nicer:
http://www.vim.org/tips/tip.php?tip_id=173
now i can type ctrl+j to move down or ctrl+k to move up a split and
have that split maximized.
both tabs and splits make it simple to yank and paste between files, something i had previously been using the mouse for.

Wednesday, January 31, 2007

postgresql and mysql: benchmark? how?

so somehow, my previous post on postgres / mysql made reddit, which i happened to be reading yesterday afternoon. i didn't even realize it was my post until following the link.
there were a couple of harsh comments stating that i found what i wanted to find ... which was merited, given the sensationalist way i presented the results (50%) and the careless use of the term "benchmark". and yes, the config for mySQL was the default. still, i just presented what i found.
i was surprised no one commented on the hackish way i checked for a protein sequence in perl rather than in mysql--or on the coolness of pre-fetching in DBIx::Class (which is available as eager loading, or setting lazy=False in the mapper, in python's sqlalchemy).

re the comments on things to change in postgresql.conf... i'll try them at some point. are there any suggestions for mysql?
the machine has 12G of ram and 4 CPUs. likely the raid configuration (i don't know how it's set up) is not optimal, but that is out of my hands.

for a "real benchmark" it'd be nice to do this in sqlalchemy with the schema written in python and then just change the database engine between mysql/postgresql/sqlite. any pointers on how one would go about creating a "real benchmark"?
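a rough sketch of the engine-swapping part might look like this (the table, columns, and connection strings are made up, and the timing is just wall-clock):

import time
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String, select

# the schema lives in python; only the connection url changes per engine
metadata = MetaData()
feature = Table('feature', metadata,
    Column('id', Integer, primary_key=True),
    Column('name', String(64)))

# dialect names vary by sqlalchemy version (postgres vs postgresql)
urls = ['sqlite:///bench.db',
        'mysql://user:pass@host/genomes',
        'postgres://user:pass@host/genomes']

for url in urls:
    engine = create_engine(url)
    metadata.create_all(engine)   # same DDL issued to each backend
    conn = engine.connect()
    t0 = time.time()
    rows = conn.execute(select([feature]).where(feature.c.name.like('At1g%'))).fetchall()
    print url, len(rows), time.time() - t0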

actually, that would be a good ask reddit topic: "How to design a 'real database benchmark'?"

Sunday, January 14, 2007

real-world postgresql vs mysql benchmark

At my work, we have a large MySQL database (15 MyISAM tables, 21 million rows, 10 gigs in size). After seeing benchmarks showing that Postgres out-performs MySQL on multi-core machines (our new db server has 4 CPUs), I ported the database to PostgreSQL.
We have begun using the DBIx::Class perl module since Class::DBI is too sloooow. DBIx::Class allows closer access to the generated SQL, and it allows "prefetch"ing, which eliminates extra back-and-forth (and object creation) between the server and client. In addition, the connection string is in the script, not in the generated API. This makes it easy to benchmark, as all that is required to switch between db engines is to change the connection string. Using this script:

use CoGeX;
use strict;
# mysql
#my $connstr = 'dbi:mysql:genomes:host:3306';
# postgresql
my $connstr = 'dbi:Pg:dbname=genomes;host=host;port=5432';
my $s = CoGeX->connect($connstr, 'user', 'pass' );

my $rs = $s->resultset('Feature')->search(
    {
        'feature_type.name'  => 'CDS',
        'feature_names.name' => { like => 'At1g%' },
    },
    {
        join     => [ 'feature_names', 'feature_type' ],
        prefetch => [ 'feature_names', 'feature_type' ],
    }
);

while (my $feat = $rs->next()) {
    my $fn   = $feat->feature_names;
    my $type = $feat->feature_type->name;

    map { print $_->name . ":" . $type . "\t" } $fn->next();
    print "\n";

    # this prefetch avoids n calls where n is the number of sequences
    foreach my $seq ($feat->sequences({}, { prefetch => "sequence_type" })) {
        print $seq->sequence_data if $seq->sequence_type->name eq 'protein';
    }
    print "\n\n";
}


fetches the protein sequence of any coding sequence (CDS) that has a feature name starting with 'At1g', which in our database means any CDS on chromosome 1 of arabidopsis.
The script consistently runs in 45 seconds on MySQL and in 29-31 seconds on PostgreSQL. Other scripts show about the same difference--PostgreSQL finishes in about 60-70% of the time the MySQL versions take. Or, more dramatically: MySQL is 50% slower. That's pretty good for no change in the API, and all tables have the same indexing and structure.
Postgres was set up with the defaults except for these values in postgresql.conf:
shared_buffers = 40000
max_connections = 200
work_mem = 4096
effective_cache_size = 10000
This was the most concise indication of values to set, though likely closer tuning could improve performance (ahem, suggestions welcome).
An added benefit is that we can now push more work into the database server using PL/Perl or another PL language, which can further reduce network back-and-forth and avoid creating perl objects when not necessary.