public class QGram extends java.lang.Object implements Comparator
Title: Dataspace Framework
Description: A q-gram string comparator. It supports several ways of tokenizing a string into q-grams and several formulas for computing the final similarity score. By default it uses basic q-grams and the q-gram overlap formula.
Copyright: Copyright (c) 2013
Company: StreamScape Technologies
Modifier and Type | Class and Description |
---|---|
static class | QGram.Formula: Represents the different formulas we can use to compute similarity. |
static class | QGram.Tokenizer: Represents the different ways we can tokenize a string into a set of q-grams for a given q. |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | NAME |
Constructor and Description |
---|
QGram() |
Modifier and Type | Method and Description |
---|---|
static java.util.Set | basicTokens(java.lang.String s, int q): Produces basic q-grams, so that 'gail' -> 'ga', 'ai', 'il'. |
double | compare(java.lang.String s1, java.lang.String s2) |
static double | dice(int common, java.util.Set q1, java.util.Set q2) |
static java.util.Set | endsTokens(java.lang.String s, int q): Produces q-grams with padding, so that 'gail' -> '.g', 'ga', 'ai', 'il', 'l.'. |
boolean | isTokenized(): Returns true if the comparator breaks string values up into tokens when comparing. |
static double | jaccard(int common, java.util.Set q1, java.util.Set q2) |
static double | overlap(int common, java.util.Set q1, java.util.Set q2) |
static java.util.Set | positionalTokens(java.lang.String s, int q): Produces positional q-grams, so that 'gail' -> 'ga1', 'ai2', 'il3'. |
java.util.Set | qgrams(java.lang.String s) |
void | setFormula(QGram.Formula formula): Tells the comparator what formula to use to compute the actual similarity. |
void | setQ(int q): Sets the value of q, that is, the size of the q-grams. |
void | setTokenizer(QGram.Tokenizer tokenizer): Tells the comparator what tokenizer to use to produce q-grams. |
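public boolean isTokenized()
Returns true if the comparator breaks string values up into tokens when comparing.
Specified by: isTokenized in interface Comparator
public double compare(java.lang.String s1, java.lang.String s2)
Specified by: compare in interface Comparator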
public void setQ(int q)
public void setFormula(QGram.Formula formula)
public void setTokenizer(QGram.Tokenizer tokenizer)
public java.util.Set qgrams(java.lang.String s)
public static java.util.Set basicTokens(java.lang.String s, int q)
public static java.util.Set positionalTokens(java.lang.String s, int q)
public static java.util.Set endsTokens(java.lang.String s, int q)
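The three tokenizers are documented only through their 'gail' examples, so the following is a minimal sketch that reproduces those examples for q = 2, not the class's actual implementation. The class name `QGramTokenSketch`, the generic `Set<String>` return type, the use of '.' as the padding character for any q, and the rejection of q < 1 are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the documented tokenizers; not the real QGram class.
final class QGramTokenSketch {

    // Basic q-grams: every substring of length q, e.g. "gail", q=2 -> ga, ai, il.
    static Set<String> basicTokens(String s, int q) {
        requirePositive(q);
        Set<String> tokens = new LinkedHashSet<>();
        for (int i = 0; i + q <= s.length(); i++) {
            tokens.add(s.substring(i, i + q));
        }
        return tokens;
    }

    // Positional q-grams: each q-gram is suffixed with its 1-based start
    // position, e.g. "gail", q=2 -> ga1, ai2, il3.
    static Set<String> positionalTokens(String s, int q) {
        requirePositive(q);
        Set<String> tokens = new LinkedHashSet<>();
        for (int i = 0; i + q <= s.length(); i++) {
            tokens.add(s.substring(i, i + q) + (i + 1));
        }
        return tokens;
    }

    // Padded q-grams: assumes q-1 '.' characters are added at each end before
    // taking basic q-grams, e.g. "gail", q=2 -> .g, ga, ai, il, l.
    static Set<String> endsTokens(String s, int q) {
        requirePositive(q);
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < q - 1; i++) {
            pad.append('.');
        }
        return basicTokens(pad + s + pad, q);
    }

    // Assumption: the sketch rejects q < 1; the documented class may differ.
    private static void requirePositive(int q) {
        if (q < 1) {
            throw new IllegalArgumentException("q must be at least 1");
        }
    }

    public static void main(String[] args) {
        System.out.println(basicTokens("gail", 2));
        System.out.println(positionalTokens("gail", 2));
        System.out.println(endsTokens("gail", 2));
    }
}
```

How the real methods treat strings shorter than q, or padding for q greater than 2, is not stated in this page; the sketch simply returns an empty set for a string that is too short.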
public static double overlap(int common, java.util.Set q1, java.util.Set q2)
public static double dice(int common, java.util.Set q1, java.util.Set q2)
public static double jaccard(int common, java.util.Set q1, java.util.Set q2)
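This page does not spell out the formulas behind overlap, dice and jaccard. The sketch below uses the standard textbook definitions of the overlap coefficient, the Dice coefficient and the Jaccard index, which may or may not match what the class computes. The class name `QGramFormulaSketch`, the helper `countCommon`, and the choice to return 0.0 for empty sets are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the documented formulas; not the real QGram class.
final class QGramFormulaSketch {

    // Hypothetical helper: number of q-grams present in both sets.
    static int countCommon(Set<String> q1, Set<String> q2) {
        Set<String> shared = new HashSet<>(q1);
        shared.retainAll(q2);
        return shared.size();
    }

    // Overlap coefficient: common / min(|q1|, |q2|).
    static double overlap(int common, Set<String> q1, Set<String> q2) {
        int smaller = Math.min(q1.size(), q2.size());
        return smaller == 0 ? 0.0 : (double) common / smaller; // assumed: 0.0 when a set is empty
    }

    // Dice coefficient: 2 * common / (|q1| + |q2|).
    static double dice(int common, Set<String> q1, Set<String> q2) {
        int total = q1.size() + q2.size();
        return total == 0 ? 0.0 : 2.0 * common / total;
    }

    // Jaccard index: common / (|q1| + |q2| - common), i.e. shared over union.
    static double jaccard(int common, Set<String> q1, Set<String> q2) {
        int union = q1.size() + q2.size() - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        // Basic 2-grams of "gail" and "gai", written out by hand.
        Set<String> a = new HashSet<>(Arrays.asList("ga", "ai", "il"));
        Set<String> b = new HashSet<>(Arrays.asList("ga", "ai"));
        int common = countCommon(a, b);
        System.out.println("overlap = " + overlap(common, a, b));
        System.out.println("dice    = " + dice(common, a, b));
        System.out.println("jaccard = " + jaccard(common, a, b));
    }
}
```

For the two sets in `main`, two q-grams are shared, so under these definitions the overlap coefficient is 2/2 = 1, Dice is 4/5 = 0.8, and Jaccard is 2/3. The example shows why the choice of formula matters: overlap treats a string whose q-grams are fully contained in the other as a perfect match, while Dice and Jaccard penalize the size difference.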
Copyright © 2015-2024 StreamScape Technologies. All rights reserved.