Fingerprint analyzer
The fingerprint analyzer implements a fingerprinting algorithm that is used by the OpenRefine project to assist in clustering.
Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.
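The steps described above can be approximated in plain Python. This is a minimal sketch, not the analyzer's actual implementation: the function name fingerprint and its parameters are hypothetical, a regular expression stands in for the real tokenizer, and Unicode decomposition stands in for the real character folding, so results may differ on unusual input.

```python
import re
import unicodedata


def fingerprint(text, separator=" ", stopwords=frozenset()):
    """Approximate the fingerprint analyzer's output for a piece of text."""
    # Rough stand-in for the tokenizer: split on non-word characters.
    tokens = re.findall(r"\w+", text)
    # Lowercase each token.
    tokens = [t.lower() for t in tokens]
    # Approximate folding of extended characters: decompose accented
    # characters and drop everything that is not plain ASCII.
    tokens = [
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    # Optional stop word removal; also drop tokens that folded to nothing.
    tokens = [t for t in tokens if t and t not in stopwords]
    # Sort, deduplicate, and concatenate into a single token.
    return separator.join(sorted(set(tokens)))


text = "Yes yes, Gödel said this sentence is consistent and."
print(fingerprint(text))
# Hand-picked stop words for illustration only, not a complete list.
print(fingerprint(text, stopwords={"and", "is", "this"}))
```

Because the result is sorted and deduplicated, inputs that differ only in word order, case, accents, or repetition collapse to the same token, which is what makes the output useful as a clustering key.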
Example output
resp = client.indices.analyze(
    analyzer="fingerprint",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp)

response = client.indices.analyze(
  body: {
    analyzer: 'fingerprint',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

const response = await client.indices.analyze({
  analyzer: "fingerprint",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response);

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above sentence would produce the following single term:
[ and consistent godel is said sentence this yes ]
Configuration
The fingerprint analyzer accepts the following parameters:
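| separator | The character to use to concatenate the terms. Defaults to a space. |
| max_output_size | The maximum token size to emit. Defaults to 255. Tokens larger than this size are discarded. |
| stopwords | A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_. |
| stopwords_path | The path to a file containing stop words. |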
See the Stop Token Filter for more information about stop word configuration.
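As a sketch of how these parameters fit together, the helper below builds an index settings body for a fingerprint analyzer. The helper name fingerprint_settings, the analyzer name, and the chosen values are hypothetical and for illustration only; separator, max_output_size, and stopwords are the analyzer's parameters.

```python
def fingerprint_settings(name, separator=" ", max_output_size=255, stopwords="_none_"):
    """Build an index settings body that defines one fingerprint analyzer."""
    return {
        "analysis": {
            "analyzer": {
                name: {
                    "type": "fingerprint",
                    "separator": separator,
                    "max_output_size": max_output_size,
                    "stopwords": stopwords,
                }
            }
        }
    }


# Join terms with "+" and remove a small custom list of stop words.
settings = fingerprint_settings(
    "my_fingerprint_analyzer",
    separator="+",
    stopwords=["and", "is", "this"],
)
print(settings)
```

The resulting dictionary could then be passed as the settings argument of client.indices.create when creating an index.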
Example configuration
In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "type": "fingerprint",
                    "stopwords": "_english_"
                }
            }
        }
    },
)
print(resp)

resp1 = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_fingerprint_analyzer",
    text="Yes yes, Gödel said this sentence is consistent and.",
)
print(resp1)

response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_fingerprint_analyzer: {
            type: 'fingerprint',
            stopwords: '_english_'
          }
        }
      }
    }
  }
)
puts response

response = client.indices.analyze(
  index: 'my-index-000001',
  body: {
    analyzer: 'my_fingerprint_analyzer',
    text: 'Yes yes, Gödel said this sentence is consistent and.'
  }
)
puts response

const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_fingerprint_analyzer: {
          type: "fingerprint",
          stopwords: "_english_",
        },
      },
    },
  },
});
console.log(response);

const response1 = await client.indices.analyze({
  index: "my-index-000001",
  analyzer: "my_fingerprint_analyzer",
  text: "Yes yes, Gödel said this sentence is consistent and.",
});
console.log(response1);

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}
The above example produces the following term:
[ consistent godel said sentence yes ]
Definition
The fingerprint analyzer consists of:
- Tokenizer
  - Standard Tokenizer
- Token Filters (in order)
  - Lower Case Token Filter
  - ASCII folding
  - Stop Token Filter (disabled by default)
  - Fingerprint
If you need to customize the fingerprint analyzer beyond the configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters. The following recreates the built-in fingerprint analyzer, which you can use as a starting point for further customization:
resp = client.indices.create(
    index="fingerprint_example",
    settings={
        "analysis": {
            "analyzer": {
                "rebuilt_fingerprint": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "asciifolding",
                        "fingerprint"
                    ]
                }
            }
        }
    },
)
print(resp)

response = client.indices.create(
  index: 'fingerprint_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          rebuilt_fingerprint: {
            tokenizer: 'standard',
            filter: [
              'lowercase',
              'asciifolding',
              'fingerprint'
            ]
          }
        }
      }
    }
  }
)
puts response

const response = await client.indices.create({
  index: "fingerprint_example",
  settings: {
    analysis: {
      analyzer: {
        rebuilt_fingerprint: {
          tokenizer: "standard",
          filter: ["lowercase", "asciifolding", "fingerprint"],
        },
      },
    },
  },
});
console.log(response);

PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}
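One possible customization of the rebuilt analyzer, shown here as a sketch only, is to define two named token filters and slot them into the chain: a stop filter placed before the fingerprint step, and a fingerprint filter with a non-default separator. The filter names english_stop and plus_fingerprint and the analyzer name custom_fingerprint are hypothetical.

```python
# Settings body for a customized variant of the rebuilt analyzer.
settings = {
    "analysis": {
        "filter": {
            # Stop filter using the pre-defined English stop words list.
            "english_stop": {"type": "stop", "stopwords": "_english_"},
            # Fingerprint filter that joins terms with "+" instead of a space.
            "plus_fingerprint": {"type": "fingerprint", "separator": "+"},
        },
        "analyzer": {
            "custom_fingerprint": {
                "tokenizer": "standard",
                # Order matters: stop words are removed before the remaining
                # terms are sorted, deduplicated, and concatenated.
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "english_stop",
                    "plus_fingerprint",
                ],
            }
        },
    }
}

print(settings["analysis"]["analyzer"]["custom_fingerprint"]["filter"])
```

Keeping the fingerprint filter last matters because any filter placed after it would operate on the single concatenated token rather than on individual terms.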