Character group tokenizer
The char_group tokenizer breaks text into terms whenever it encounters a character that is in a defined set. It is mostly useful for cases where simple custom tokenization is desired and the overhead of using the pattern tokenizer is not acceptable.
Configuration
The char_group tokenizer accepts the following parameters:
tokenize_on_chars
    A list containing the characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. This accepts either single characters like e.g. -, or character groups: whitespace, letter, digit, punctuation, symbol.
max_token_length
    The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
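To make the splitting behavior concrete, here is a rough, hypothetical Python sketch of what the tokenizer does conceptually. It is not the Elasticsearch implementation; it only expands the whitespace character group (the real tokenizer also supports letter, digit, punctuation, and symbol), and it approximates max_token_length by chopping long tokens into fixed-size chunks:

```python
import re

def char_group_tokenize(text, tokenize_on_chars, max_token_length=255):
    """Rough simulation of char_group tokenization (illustration only)."""
    # Build a regex character class from the configured break characters.
    classes = []
    for ch in tokenize_on_chars:
        if ch == "whitespace":
            classes.append(r"\s")  # expand the 'whitespace' group
        else:
            classes.append(re.escape(ch))  # a literal single character
    pattern = "[" + "".join(classes) + "]"
    # Every break character starts a new token; empty strings are dropped.
    tokens = [t for t in re.split(pattern, text) if t]
    # Tokens longer than max_token_length are split into chunks.
    out = []
    for t in tokens:
        out.extend(t[i:i + max_token_length]
                   for i in range(0, len(t), max_token_length))
    return out

print(char_group_tokenize("The QUICK brown-fox", ["whitespace", "-", "\n"]))
# → ['The', 'QUICK', 'brown', 'fox']
```

With the same configuration as the example below, this sketch produces the same four tokens (The, QUICK, brown, fox) that the _analyze API returns.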
Example output
resp = client.indices.analyze(
    tokenizer={
        "type": "char_group",
        "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n"
        ]
    },
    text="The QUICK brown-fox",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: {
      type: 'char_group',
      tokenize_on_chars: [
        'whitespace',
        '-',
        "\n"
      ]
    },
    text: 'The QUICK brown-fox'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: {
    type: "char_group",
    tokenize_on_chars: ["whitespace", "-", "\n"],
  },
  text: "The QUICK brown-fox",
});
console.log(response);
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
returns
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}