Character group tokenizer
The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired and the overhead of using the pattern tokenizer is not acceptable.
Configuration
The char_group tokenizer accepts the following parameters:

tokenize_on_chars
A list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. This accepts either single characters like e.g. -, or character groups: whitespace, letter, digit, punctuation, symbol.

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
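The max_token_length parameter is not exercised by the example output below, so here is a minimal sketch of a request that sets it, assuming the same Python client used in the examples that follow; the value 5 is only illustrative.

resp = client.indices.analyze(
    tokenizer={
        "type": "char_group",
        "tokenize_on_chars": ["whitespace"],
        # any token longer than 5 characters is split at 5-character intervals
        "max_token_length": 5,
    },
    text="The QUICK brown-fox",
)
print(resp)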
Example output
resp = client.indices.analyze(
    tokenizer={
        "type": "char_group",
        "tokenize_on_chars": [
            "whitespace",
            "-",
            "\n"
        ]
    },
    text="The QUICK brown-fox",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: {
      type: 'char_group',
      tokenize_on_chars: [
        'whitespace',
        '-',
        "\n"
      ]
    },
    text: 'The QUICK brown-fox'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: {
    type: "char_group",
    tokenize_on_chars: ["whitespace", "-", "\n"],
  },
  text: "The QUICK brown-fox",
});
console.log(response);
POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}
returns
{ "tokens": [ { "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "QUICK", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1 }, { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "word", "position": 2 }, { "token": "fox", "start_offset": 16, "end_offset": 19, "type": "word", "position": 3 } ] }