Tokenizers

    Letter

    The Letter tokenizer identifies tokens as sequences of Unicode runes that belong to the Letter category. Any rune outside that category, such as a digit, punctuation mark, or space, ends the current token.

    Whitespace

    The Whitespace tokenizer identifies tokens as sequences of Unicode runes that are NOT part of the Space category. Punctuation and digits therefore stay attached to the surrounding token.

    ICU

    The ICU tokenizer uses the ICU library to tokenize the input on word boundaries.

    Exception

    The Exception tokenizer allows you to define exceptions. Exceptions are sections of the input stream that match one or more regular expressions. Each matching section is left intact as a single token. Any input that does not match these regular expressions is passed to the child tokenizer.