Reality.sys corrupted! Universe halted! Reboot?

Levenshtein distance and the “SIFT” algorithm Levenshtein distance and the “SIFT” algorithm

While reading this article, I’ve found out about a new algorithm that computes the Levenshtein distance between two strings really fast. The author’s tests on a C# implementation of the algorithm proved it to be as much as 10 to 25 times faster than the original implementation of the Levenshtein algorithm.

«Levenshtein distance» What’s that and why do we need it?

Levenshtein distance definition, from Wikipedia:

Levenshtein distance or edit distance between two strings is given by the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character.

…you can read the entire article here.

This “distance” between two strings is very useful when writing a spellchecker or other softwares that need to make assumptions wether a word was written correctly or not. As I might be involved in such a project, I took the time and looked over the algorithm, ported the C# implementation to PHP, and ran some tests.

Porting SIFT and “goto” in PHP

Porting it proved to be an easy and relaxing 5 minutes job. The only problem I’ve encountered were some “goto” calls, which unfortunately can’t be used in PHP. You can read more about goto in PHP on this blog. But with some breaks and some minor modifications, I got over it fast. You can download the class here.

Testing

There’s a native levenshtein function in PHP, but as Siderite’s blog said, the SIFT algorithm should be faster than Levenshtein. So I’ve tought that an user declared function might have a chance against the native method, this assumption being backed up by the fact that the PHP crew admits on the manual page that levenshtein is “pretty expensive”.

The tests were really simple:

<?php
$sift = new StringSift2();
$start = microTime();
echo $sift->distance( $string1, $string2 );
$end = microTime();
echo '
Took: ' . number_format( $end - $start, 10 );
?>

…and…

<?php
$start = microTime();
echo levenshtein( $string1, $string2 );
$end = microTime();
echo '
Took: ' . number_format( $end - $start, 10 );
?>

In all the tests, the native levenshtein function proved to be as much as twenty times faster (when working on really long strings) than the implementation of the SIFT algorithm, and in some cases the 2 algorithms returned different values (again, for long strings). This proves that beating the native methods in a scripting language is really hard.

Conclusion: for now, I’ll stick with the native implementation of the Levenshtein algorithm, but it’s good to know that are other alternatives out there, and, maybe someday these alternatives might prove even faster than the original algorithm. But if you use other programming languages than PHP, you should give it good try, because the tests on the C# implementation proved this algorithm to be really fast.

similarposts (presented by Monkey)

Monkey

comments...comments...8comments

  1. SideriteNo Gravatar said:

    Actually I am a bit put off that PHP is that much slower than the native C language in which it was built. (or that the SIFT2 algorithm wasn't completely and awsomely fast! :) ) Besides, I am sure I can still improve it in the direction of precision. I don't think I can make it faster, though…

  2. TudorNo Gravatar said:

    Don’t worry. I’m sure that some people will find it usefull. Or one day you will make some terrific improvements of the algorithm and it will become an industry standard.
    And you’ll be the all knowing guru of string parsing.

  3. pcdinhNo Gravatar said:

    Your implementation seems to be a bit different from the original one. So your result returned from your script gets most inaccurate in comparison with the original one and PHP native implemetation.

    I must admit that this algorithm works best for spelling checking or strings that have nearly the same prefix. However, out of that, Levenshtein is the most optimal algorithm. Anyway, SIFT solution is a special case of Levenshtein.

    // My translation into PHP5.2
    $s = new StringSift();
    $t1 = microtime(true);
    echo $s->distance(”phpvietnam”, “123456789 p-hpvietnam”);
    $t2 = microtime(true);
    echo “n”.($t2 - $t1);

    echo “n n”;
    $t3 = microtime(true);
    echo levenshtein(”phpvietnam”, “123456789 p-hpvietnam”);
    $t4 = microtime(true);
    echo “n”.($t4 - $t3);

    // Your implementation
    $s3 = new StringSift3();
    $t5 = microtime(true);
    echo $s3->distance(”phpvietnam”, “123456789 p-hpvietnam”);
    $t6 = microtime(true);
    echo “n n”;
    echo “n”.($t6 - $t5);
    ——————-
    Result:
    15.5
    0.01040506362915
    11
    0.0002591609954834
    21
    0.0039699077606201

    Your script runs faster but the result is not optimal. It may be the result of $c–;

  4. pcdinhNo Gravatar said:

    With a bit tuning, my script returns
    15.5
    0.0046000480651855
    11
    0.00024199485778809
    21
    0.0038189888000488

    Keeping much slower than the native one.

  5. TudorNo Gravatar said:

    The $c– is there because, because, well…I admit, I made a mistake. It happens, it’s human. Thanks for pointing it out for me, pcdinh.
    The thing is that it’s supposed to be $dist– instead of $c–. That’s because, in the original C# implementation, when the goto clause is executed, the dist incrementation doesn’t take place. When I use break, that variable gets incremented, so I have to decrement it first. Ma bad. I’ve uploaded the correct version.
    I’ve ran the test again, and still, the native levenshtein is much faster. It seems pretty hard to beat C implementations.
    Anyway, could you post your tunned script? Or, if it’s too long, send it to me via email?

  6. SideriteNo Gravatar said:

    On my C# application, using my own Levenshtein and Sift2 implemented algorithms I get
    [code]15/0.00125 milliseconds - Sift2
    11/0.06625 milliseconds - Levenshtein
    [/code]
    ,averaged on 100000 attempts.

    I submit your testing script may not be so good since the microtime may be too small to register correctly. I suggest putting a loop around the functions and divide by the loop size.

    But for examples such as the one pcdinh showed, the sift2 distance is only incidentally closer to Levenshtein, since it never got to find any letter match. The resulting 15 is only the length of the digit prefix + half the difference in size.

    A larger maxOffset value, like 12, should give more accurate results. However, in that case the algorithm returns a distance of 2, which is obviously wrong. I have to take into account the length of the (first?) offset when adding to the distance… I haven’t yet figured out how to fix the formula.

    Thanks for the testing and I will try to find a reasonable fix for situations like this, even if Sift is supposed to fix typos rather than whole word insertions.

  7. carlosNo Gravatar said:

    Hi! i found this article very interesting. do you have the tunned php script? I will like to test it in my application

  8. TudorNo Gravatar said:

    Nope, pcdinh hasn’t been heard from since his last post here.

yourReply