Saturday, September 29, 2007

k-means clustering in scipy

it's fairly simple to do clustering of points with similar z-values in scipy:


import numpy
import matplotlib
matplotlib.use('Agg')
from scipy.cluster.vq import *
import pylab
pylab.close()

# generate some random xy points and
# give them some striation so there will be "real" groups.
xy = numpy.random.rand(30,2)
xy[3:8,1] -= .9
xy[22:28,1] += .9

# make some z vlues
z = numpy.sin(xy[:,1]-0.2*xy[:,1])

# whiten them
z = whiten(z)

# let scipy do its magic (k==3 groups)
res, idx = kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

# convert groups to rbg 3-tuples.
colors = ([([0,0,0],[1,0,0],[0,0,1])[i] for i in idx])

# show sizes and colors. each color belongs in diff cluster.
pylab.scatter(xy[:,0],xy[:,1],s=20*z+9, c=colors)
pylab.savefig('/var/www/tmp/clust.png')


2 comments:

Jakab Kristóf said...

In this
kmeans2(numpy.array(zip(xy[:,0],xy[:,1],z)),3)

An explanation for this line would have been helpful for me:
zip(xy[:,0],xy[:,1],z)

Anonymous said...

This:

numpy.array(zip(xy[:, 0], xy[:, 1], z))

is a weird (and slow) way of writing

numpy.column_stack([xy, z])

Anyway it's clearer in two steps:

data = numpy.column_stack([xy, z])
vq.kmeans2(data, 3)

where `data` is an array of 30 points in 3d where the x, y values are given by the original `xy` array and the z-values are the new `z` array. All together, it is passed to the kmeans algorithm.

Take a look here for an explanation:
Using scipy's kmeans2 function in python