Fingerprint token filter

Sorts and removes duplicate tokens from a token stream, then concatenates the stream into a single output token.

For example, this filter changes the [ the, fox, was, very, very, quick ] token stream as follows:

  1. Sorts the tokens alphabetically to [ fox, quick, the, very, very, was ].
  2. Removes a duplicate instance of the very token.
  3. Concatenates the token stream into a single output token: [ fox quick the very was ]

Output tokens produced by this filter are useful for fingerprinting and clustering a body of text as described in the OpenRefine project.

This filter uses Lucene’s FingerprintFilter.

Example

The following analyze API request uses the fingerprint filter to create a single output token for the text zebra jumps over resting resting dog:

GET _analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["fingerprint"],
  "text" : "zebra jumps over resting resting dog"
}

The filter produces the following token:

[ dog jumps over resting zebra ]

Add to an analyzer

The following create index API request uses the fingerprint filter to configure a new custom analyzer.

PUT fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_fingerprint": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint" ]
        }
      }
    }
  }
}
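
To test the new analyzer, you can call the analyze API on the index. The following request is a sketch assuming the index above; it should return the same single token as the earlier example, [ dog jumps over resting zebra ]:

GET fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}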

Configurable parameters

max_output_size
(Optional, integer) Maximum character length, including whitespace, of the output token. Defaults to 255. If the concatenated token exceeds this length, the filter outputs no token.
separator
(Optional, string) Character to use to concatenate the token stream input. Defaults to a space.
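
For example, if the concatenated output would exceed max_output_size, the filter emits no token at all. The following analyze API request is a sketch that defines a transient fingerprint filter inline with a deliberately small limit; because the concatenated output is longer than 10 characters, the request returns an empty token list:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "fingerprint",
      "max_output_size": 10
    }
  ],
  "text": "zebra jumps over resting resting dog"
}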

Customize

To customize the fingerprint filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom fingerprint filter that uses + to concatenate token streams. The filter also limits output tokens to 100 characters or fewer.

PUT custom_fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_": {
          "tokenizer": "whitespace",
          "filter": [ "fingerprint_plus_concat" ]
        }
      },
      "filter": {
        "fingerprint_plus_concat": {
          "type": "fingerprint",
          "max_output_size": 100,
          "separator": "+"
        }
      }
    }
  }
}
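
Once the index exists, you can verify the custom filter with the analyze API. This request is a sketch assuming the index above; with the custom separator, the output token should look like [ dog+jumps+over+resting+zebra ]:

GET custom_fingerprint_example/_analyze
{
  "analyzer": "whitespace_fingerprint",
  "text": "zebra jumps over resting resting dog"
}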