Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
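The steps above can be sketched in a few lines of Python. This is only an approximation for illustration, not the analyzer's actual implementation: the regular expression is a crude stand-in for the real tokenizer, the Unicode decomposition only approximates ASCII folding, and the names `fold_to_ascii`, `fingerprint` and `cluster` are hypothetical helpers introduced here.

```python
import re
import unicodedata
from collections import defaultdict


def fold_to_ascii(token: str) -> str:
    """Approximate ASCII folding: decompose accented characters, drop what is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def fingerprint(text: str, stopwords=frozenset(), separator: str = " ") -> str:
    """Return a single fingerprint token for `text` (illustrative sketch only)."""
    tokens = re.findall(r"\w+", text)            # crude stand-in for the tokenizer
    tokens = [t.lower() for t in tokens]         # lowercase
    tokens = [fold_to_ascii(t) for t in tokens]  # fold extended characters
    # Optional stop-word removal; also drop tokens that folded to nothing.
    tokens = [t for t in tokens if t and t not in stopwords]
    # Deduplicate, sort, and concatenate into one token.
    return separator.join(sorted(set(tokens)))


def cluster(values):
    """Group values whose fingerprints collide (the clustering use case)."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)
```

Because word order, case, punctuation and accents no longer matter, strings such as "Kurt Gödel" and "godel, kurt" end up with the same fingerprint and fall into the same cluster.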
Definition
It consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
Example output
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
The fingerprint analyzer accepts the following parameters:
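| separator | The character used to concatenate the terms. Defaults to a space. |
| max_output_size | The maximum token size to emit. Defaults to 255. |
| stopwords | A pre-defined stop words list like _english_, or an array containing a list of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |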
See the Stop Token Filter for more information about stop word configuration.
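As a rough illustration of how these parameters fit together, the sketch below builds the index-settings body for a custom fingerprint analyzer using the four parameters described above (`separator`, `max_output_size`, `stopwords`, `stopwords_path`). The helper name `fingerprint_analyzer_settings` and the example values are hypothetical; the resulting JSON is what would be sent as the body of a `PUT my_index` request.

```python
import json


def fingerprint_analyzer_settings(name, separator=None, max_output_size=None,
                                  stopwords=None, stopwords_path=None):
    """Build a settings body that defines a custom fingerprint analyzer.

    Only parameters that are explicitly passed are included, so the
    analyzer's own defaults apply to everything that is left out.
    """
    analyzer = {"type": "fingerprint"}
    optional = {
        "separator": separator,
        "max_output_size": max_output_size,
        "stopwords": stopwords,
        "stopwords_path": stopwords_path,
    }
    analyzer.update({key: value for key, value in optional.items() if value is not None})
    return {"settings": {"analysis": {"analyzer": {name: analyzer}}}}


# Example values are arbitrary and chosen only for illustration.
body = fingerprint_analyzer_settings(
    "my_fingerprint_analyzer", separator="+", max_output_size=100
)
print(json.dumps(body, indent=2))
```

Leaving unset parameters out of the body, rather than sending explicit values, keeps the definition short and avoids hard-coding defaults on the client side.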
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]