Predicate script token filter

Removes tokens that don’t match a provided predicate script. The filter supports inline Painless scripts only. Scripts are evaluated in the analysis predicate context.

Example

The following analyze API request uses the predicate_token_filter filter to output only tokens longer than three characters from the text the fox jumps the lazy dog.

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "predicate_token_filter",
            "script": {
                "source": "\n          token.term.length() > 3\n        "
            }
        }
    ],
    text="the fox jumps the lazy dog",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'predicate_token_filter',
        script: {
          source: "\n          token.term.length() > 3\n        "
        }
      }
    ],
    text: 'the fox jumps the lazy dog'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "predicate_token_filter",
      script: {
        source: "\n          token.term.length() > 3\n        ",
      },
    },
  ],
  text: "the fox jumps the lazy dog",
});
console.log(response);
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.term.length() > 3
        """
      }
    }
  ],
  "text": "the fox jumps the lazy dog"
}

The filter produces the following tokens.

[ jumps, lazy ]

The API response contains the position and offsets of each output token. Note that the predicate_token_filter filter does not change the tokens' original positions or offsets.

Response
{
  "tokens" : [
    {
      "token" : "jumps",
      "start_offset" : 8,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lazy",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}

Configurable parameters

script

(Required, script object) Script containing a condition used to filter incoming tokens. Only tokens that match this script are included in the output.

This parameter supports inline Painless scripts only. The script is evaluated in the analysis predicate context.
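
For illustration only, the following sketch filters on token position rather than term length. It assumes the analysis predicate context exposes token.position alongside the token.term and token.type properties used elsewhere on this page; check the analysis predicate context documentation for the full list of available token properties.

resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "predicate_token_filter",
            "script": {
                # Keep only tokens at even positions (positions are 0-based
                # in the analyze API output); assumes token.position is
                # available in the analysis predicate context.
                "source": "token.position % 2 == 0"
            }
        }
    ],
    text="the fox jumps the lazy dog",
)
print(resp)

With the text above, this would keep the tokens at positions 0, 2, and 4 (the, jumps, lazy).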

Customize and add to an analyzer

To customize the predicate_token_filter filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

The following create index API request configures a new custom analyzer using a custom predicate_token_filter filter, my_script_filter.

The my_script_filter filter removes tokens of any type other than ALPHANUM.

resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": [
                        "my_script_filter"
                    ]
                }
            },
            "filter": {
                "my_script_filter": {
                    "type": "predicate_token_filter",
                    "script": {
                        "source": "\n              token.type.contains(\"ALPHANUM\")\n            "
                    }
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          my_analyzer: {
            tokenizer: 'standard',
            filter: [
              'my_script_filter'
            ]
          }
        },
        filter: {
          my_script_filter: {
            type: 'predicate_token_filter',
            script: {
              source: "\n              token.type.contains(\"ALPHANUM\")\n            "
            }
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        my_analyzer: {
          tokenizer: "standard",
          filter: ["my_script_filter"],
        },
      },
      filter: {
        my_script_filter: {
          type: "predicate_token_filter",
          script: {
            source:
              '\n              token.type.contains("ALPHANUM")\n            ',
          },
        },
      },
    },
  },
});
console.log(response);
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_script_filter"
          ]
        }
      },
      "filter": {
        "my_script_filter": {
          "type": "predicate_token_filter",
          "script": {
            "source": """
              token.type.contains("ALPHANUM")
            """
          }
        }
      }
    }
  }
}
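
To spot-check the new filter, you can run the analyze API against the index. The following sketch is not part of the reference request above; it assumes the my-index-000001 index and my_analyzer from the previous example, and that the standard tokenizer assigns numeric tokens a type other than <ALPHANUM>, so my_script_filter removes them.

resp = client.indices.analyze(
    index="my-index-000001",
    analyzer="my_analyzer",
    # The standard tokenizer gives "2" a non-ALPHANUM token type, so
    # my_script_filter should drop it and keep only "foo" and "bar".
    text="foo 2 bar",
)
print(resp)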