public class Levenshtein extends java.lang.Object implements Comparator
Title: Dataspace Framework
Description: An implementation of the Levenshtein distance metric. This is a fairly complicated metric, and there are a number of different ways to implement it, with different performance characteristics. As this comparator is highly performance-critical, this class contains a number of different implementations, some of them experimental.
Some general background on Levenshtein, and a StackOverflow page on faster algorithms.
To see which algorithms are implemented, see comments on individual methods.
Copyright: Copyright (c) 2013
Company: StreamScape Technologies
Modifier and Type | Field and Description |
---|---|
static java.lang.String |
NAME |
Constructor and Description |
---|
Levenshtein() |
Modifier and Type | Method and Description |
---|---|
static int |
compactDistance(java.lang.String s1,
java.lang.String s2)
Optimized version of the Wagner & Fischer algorithm that only
keeps a single column in the matrix in memory at a time.
|
double |
compare(java.lang.String s1,
java.lang.String s2) |
static int |
cutoffDistance(java.lang.String s1,
java.lang.String s2)
An optimized version of the Wagner & Fischer algorithm, which
exploits our knowledge that if the distance is above a certain
limit (0.5 when normalized) we use the lower probability.
|
static int |
distance(java.lang.String s1,
java.lang.String s2)
This is the original implementation, using the Wagner &
Fischer algorithm from 1974.
|
boolean |
isTokenized()
Returns true if the comparator breaks string values up into
tokens when comparing.
|
static int |
recursiveDistance(java.lang.String s1,
java.lang.String s2)
This is a prototype implementation of Ukkonen's optimized
version of the Wagner & Fischer algorithm.
|
public double compare(java.lang.String s1, java.lang.String s2)
compare
in interface Comparator
public boolean isTokenized()
Comparator
isTokenized
in interface Comparator
public static int distance(java.lang.String s1, java.lang.String s2)
public static int cutoffDistance(java.lang.String s1, java.lang.String s2)
On at least one use case, this optimization shaves 15% off the total execution time (ie: not just Levenshtein).
public static int recursiveDistance(java.lang.String s1, java.lang.String s2)
public static int compactDistance(java.lang.String s1, java.lang.String s2)
Copyright © 2015-2024 StreamScape Technologies. All rights reserved.