Greenplum Fuzzy String Match Extension

    The Greenplum Database installation contains the files required for the functions in this extension module and SQL scripts to define the extension functions in a database and remove the functions from a database.

    Warning: The functions soundex, metaphone, dmetaphone, and dmetaphone_alt do not work well with multibyte encodings (such as UTF-8).

    The Greenplum Database Fuzzy String Match extension is based on the PostgreSQL fuzzystrmatch module.

    Parent topic:

    The Soundex system is a method of matching similar-sounding (similar phonemes) names by converting them to the same code.

    Note: Soundex is most useful for English names.

    These functions work with Soundex codes:

    The difference function converts two strings to their Soundex codes and then reports the number of matching code positions. The result ranges from zero to four, zero being no match and four being an exact match. These are some examples:

    1. SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
    2. SELECT soundex('Anne'), soundex('Andrew'), difference('Anne', 'Andrew');
    3. CREATE TABLE s (nm text);
    4. INSERT INTO s VALUES ('john');
    5. INSERT INTO s VALUES ('joan');
    6. INSERT INTO s VALUES ('wobbly');
    7. INSERT INTO s VALUES ('jack');
    8. SELECT * FROM s WHERE soundex(nm) = soundex('john');
    9. SELECT * FROM s WHERE difference(s.nm, 'john') > 2;

    For information about the Soundex indexing system see .

    These functions calculate the Levenshtein distance between two strings:

    1. levenshtein(text <source>, text <target>, int <ins_cost>, int <del_cost>, int <sub_cost>) returns int
    2. levenshtein(text <source>, text <target>) returns int
    3. levenshtein_less_equal(text <source>, text <target>, int max_d) returns int

    Both the source and target parameters can be any non-null string, with a maximum of 255 bytes. The cost parameters ins_cost, del_cost, and sub_cost specify cost of a character insertion, deletion, or substitution, respectively. You can omit the cost parameters, as in the second version of the function; in that case the cost parameters default to 1.

    levenshtein_less_equal is accelerated version of levenshtein function for low values of distance. If actual distance is less or equal then max_d, then levenshtein_less_equal returns an accurate value of the distance. Otherwise, this function returns value which is greater than max_d. Examples:

    For information about the Levenshtein algorithm, see .

    Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes. This function calculates the metaphone code of an input string :

    1. test=# SELECT metaphone('GUMBO', 4);
    2. metaphone
    3. -----------
    4. KM
    5. (1 row)

    For information about the Metaphone algorithm, see .

    The Double Metaphone system computes two “sounds like” strings for a given input string - a “primary” and an “alternate”. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes:

    There is no length limit on the input strings. Example:

    1. test=# select dmetaphone('gumbo');
    2. dmetaphone
    3. ------------
    4. KMP
    5. (1 row)

    For information about the Double Metaphone algorithm, see .

    Greenplum Database supplies SQL scripts to install and uninstall the Fuzzy String Match extension functions.

    To install the functions in a database, run the following SQL script:

    To uninstall the functions, run the following SQL script: