levenshtein in cython
EDIT(2): fix markup for <char *> casts ... fix malloc (see comments. thanks Bao).
NOTE: using a kwarg for limit slows things down. setting that to a required arg and using calloc for m2 speed things up to nearly as fast as the pylevenshtein.
Well, it seems to be popular to code up the levenshtein. I actually have a use for this and wanted to practice some Cython, so I've written a version. I used bearophile's recipe, wikibooks, this (from k4st on reddit) and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn).
I believe that's correct (if not, let me know!), it matches output from other versions (which i used fort testing) and it doesn't leak memory, so I must have done the malloc/free correctly despite my lack of C-fu.
Again, I'm not sure, but I think that's a pretty good example of how to mix python and C with cython as it's not much longer than the other python versions and pretty readable. And, it's very fast, it does 500K iterations of:
levenshtein('i ehm a gude spehlar', 'i am a good speller')
in 2.5 seconds.
bearophile's pysco-ed editDistanceFast took 3 minutes, 31 seconds to do the same (which is why I started this project...).
Literally, just before I was to post this, I found pylevenshtein which is even faster, does the 500K in 1.5 seconds. ho hum. It's doing more, handling unicode, and apparently doing some optimizations.... So, use that instead! -- and contribute a test-suite.
Next, I think I'll try a variant that allows transpositions (Damerau) and/or implement the BK-Tree, again cribbing off bearophile's recipe.
for reference, here are the contents of the setup.py to build the module.
NOTE: using a kwarg for limit slows things down. setting that to a required arg and using calloc for m2 speed things up to nearly as fast as the pylevenshtein.
Well, it seems to be popular to code up the levenshtein. I actually have a use for this and wanted to practice some Cython, so I've written a version. I used bearophile's recipe, wikibooks, this (from k4st on reddit) and this for reference. It follows bearophile's code closely, using only O(m) space instead of O(mn).
cdef extern from "stdlib.h":
ctypedef unsigned int size_t
size_t strlen(char *s)
void *malloc(size_t size)
void free(void *ptr)
int strcmp(char *a, char *b)
cdef inline size_t imin(int a, int b, int c):
if a < b:
if c < a:
return c
return a
if c < b:
return c
return b
cpdef int levenshtein(char *a, char *b, int limit=100):
cdef int m = strlen(a), n = strlen(b)
cdef char *ctmp
cdef int i = 0, j = 0, retval
cdef int achr, bchr
if strcmp(a, b) == 0:
return 0
if m > n:
ctmp = a;
a = b;
b = ctmp;
#a, b = b, a
m, n = n, m
# short circuit.
if n - m >= limit:
return n - m
cdef char *m1 = <char *>malloc((n + 2) * sizeof(char))
cdef char *m2 = <char *>malloc((n + 2) * sizeof(char))
for i from 0 <= i <= n:
m1[i] = i
m2[i] = 0
for i from 0 <= i <= m:
m2[0] = i + 1
achr = a[i]
for j from 0 <= j <= n:
bchr = b[j]
if achr == bchr:
m2[j + 1] = m1[j]
else:
m2[j + 1] = 1 + imin(m2[j], m1[j], m1[j + 1])
m1, m2 = m2, m1
retval = m1[n + 1]
free(m2)
free(m1)
return retval
I believe that's correct (if not, let me know!), it matches output from other versions (which i used fort testing) and it doesn't leak memory, so I must have done the malloc/free correctly despite my lack of C-fu.
Again, I'm not sure, but I think that's a pretty good example of how to mix python and C with cython as it's not much longer than the other python versions and pretty readable. And, it's very fast, it does 500K iterations of:
levenshtein('i ehm a gude spehlar', 'i am a good speller')
in 2.5 seconds.
bearophile's pysco-ed editDistanceFast took 3 minutes, 31 seconds to do the same (which is why I started this project...).
Literally, just before I was to post this, I found pylevenshtein which is even faster, does the 500K in 1.5 seconds. ho hum. It's doing more, handling unicode, and apparently doing some optimizations.... So, use that instead! -- and contribute a test-suite.
Next, I think I'll try a variant that allows transpositions (Damerau) and/or implement the BK-Tree, again cribbing off bearophile's recipe.
for reference, here are the contents of the setup.py to build the module.
from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
setup( name = 'levenshtein',
ext_modules=[ Extension("levenshtein", sources=["levenshtein.pyx"] ), ],
cmdclass = {'build_ext': build_ext})
Comments
char *m1 = malloc((n + 1) * sizeof(char))
retval = m1[n + 1]
malloc n + 2 would be better; perhaps cython handles this well, but c may not.
it's alarming that it didnt give any errors!
cdef int alen = strlen(a), blen = strlen(b), retval
cdef char *ctmp
cdef size_t i, j
cdef size_t achr, bchr
if strcmp(a, b) == 0:
return 0
if alen > blen:
ctmp = a;
a = b;
b = ctmp;
alen, blen = blen, alen
cdef unsigned int arraysize = ( blen + 2 ) * sizeof( unsigned int )
cdef unsigned int *m1 = malloc( arraysize )
cdef unsigned int *m2 = malloc( arraysize )
cdef unsigned int *m3 = malloc( arraysize )
for i from 0 <= i <= blen:
m1[ i ] = 0
m2[ i ] = i
m3[ i ] = 0
for i from 1 <= i <= alen:
m1[0] = i + 1
achr = a[i - 1]
for j from 1 <= j <= blen:
bchr = b[j- 1]
if achr == bchr:
m1[j] = m2[j - 1]
else:
m1[j] = 1 + imin(m1[j - 1], m2[j - 1], m2[j])
if i != 1 and j != 1 and achr == b[j - 2] and bchr == a[i - 2]: # and m1[j] > m3[j - 1] + cost:
m1[j] = m3[j - 1]
m3, m2, m1 = m2, m1, m3
retval = m2[blen]
free(m3)
free(m1)
free(m2)
return retval
(sorry how can i post a code snippet? i put it on pastebin: http://pastebin.com/FN5K8yZT)
now, curiously, when i compare b'helo' to b'hleo', your code gives me a difference of 1, while the active state code gives me 2; both seem to be correct, as there is one transposition involved. but when i compare b'1234' to b'154421152421351112544221', results are 21 and 20. i can transform the longer into the shorter string by deleting 20 bytes, and damerau distance should always be equal or less levenshtein distance, so 21 doesn't seem to be correct. when i compare b'1234' to b'1243', results are 1 and 2, but comparing b'1234' to b'1243xxxxxxxxxxxxxxxxxxxx', i get 21 from both algorithms (i think this is correct for damerau, but levenshtein should be 22, no?). i am still a bit puzzled.