WARNING: Version 5.4 of Elasticsearch has passed its EOL date.
This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.
Fingerprint Analyzer
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
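The steps described above can be sketched in a few lines of Python. This is only an approximation for illustration, not the analyzer's actual implementation: the helper names `fold_to_ascii` and `fingerprint` are hypothetical, the regular expression is a rough stand-in for the real tokenizer, and stripping combining marks after Unicode decomposition only approximates ASCII folding.

```python
import re
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Approximate ASCII folding: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str) -> str:
    """Lowercase, fold, tokenize, deduplicate, sort, and join into one token."""
    normalized = fold_to_ascii(text.lower())
    # Assumption: runs of word characters stand in for the tokenizer's output.
    tokens = re.findall(r"\w+", normalized)
    return " ".join(sorted(set(tokens)))


print(fingerprint("Yes yes, Gödel said this sentence is consistent and."))
```

Two inputs that differ only in case, accents, word order, or repeated words collapse to the same string, which is what makes the result useful as a clustering key.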
Definition
It consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII Folding Token Filter
  - Stop Token Filter (disabled by default)
  - Fingerprint Token Filter
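To make the composition concrete, the same chain could plausibly be expressed as a custom analyzer. The sketch below builds such a settings body as a Python dictionary and prints it as JSON. The variable name and the analyzer name `rebuilt_fingerprint` are hypothetical; `standard`, `lowercase`, `asciifolding`, and `fingerprint` are the built-in tokenizer and filter names. The stop filter is left out because it is disabled by default. This has not been verified against a running 5.4 cluster.

```python
import json

# Hypothetical index settings that chain the built-in components in the
# order listed above. Only the tokenizer and filter names refer to
# Elasticsearch built-ins; everything else is illustrative.
rebuilt_fingerprint_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_fingerprint": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Filters run in list order.
                    "filter": ["lowercase", "asciifolding", "fingerprint"],
                }
            }
        }
    }
}

print(json.dumps(rebuilt_fingerprint_settings, indent=2))
```

The printed JSON is shaped like the body of an index-creation request, so it can serve as a starting point when the built-in analyzer needs an extra filter in the chain.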
Example output
POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
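The fingerprint analyzer accepts the following parameters:
| separator | The character used to concatenate the terms. Defaults to a space. |
| max_output_size | The maximum token size to emit. Defaults to 255. Tokens larger than this size are discarded. |
| stopwords | A pre-defined stop words list like _english_, or an array containing a list of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |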
See the Stop Token Filter for more information about stop word configuration.
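The effect of these parameters can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the analyzer's code: the function `fingerprint` is hypothetical, its keyword arguments mirror the Elasticsearch setting names `separator`, `max_output_size`, and `stopwords` as I understand them, the output size is assumed to be counted in characters, and loading stop words from a file is not modeled.

```python
import re
import unicodedata
from typing import Iterable, Optional


def fingerprint(
    text: str,
    separator: str = " ",
    max_output_size: int = 255,
    stopwords: Iterable[str] = (),
) -> Optional[str]:
    """Toy model of the parameters; returns None when the token is discarded."""
    # Lowercase, then strip combining marks as a rough stand-in for ASCII folding.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    stop = set(stopwords)
    # Deduplicate with a set while dropping any configured stop words.
    tokens = {tok for tok in re.findall(r"\w+", folded) if tok not in stop}
    output = separator.join(sorted(tokens))
    # Assumption: the size limit applies to the length of the joined token.
    if len(output) > max_output_size:
        return None
    return output


text = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint(text, separator="+"))
print(fingerprint(text, stopwords={"and", "is", "this"}))
print(fingerprint(text, max_output_size=10))
```

With a hand-picked stop word set, the second call drops "and", "is", and "this" before sorting. The third call returns `None` because the joined token is longer than the limit.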
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]