IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
UAX URL email tokenizer
The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.
Example output
Python:

    resp = client.indices.analyze(
        tokenizer="uax_url_email",
        text="Email me at john.smith@global-international.com",
    )
    print(resp)

Ruby:

    response = client.indices.analyze(
      body: {
        tokenizer: 'uax_url_email',
        text: 'Email me at john.smith@global-international.com'
      }
    )
    puts response

JavaScript:

    const response = await client.indices.analyze({
      tokenizer: "uax_url_email",
      text: "Email me at john.smith@global-international.com",
    });
    console.log(response);

Console:

    POST _analyze
    {
      "tokenizer": "uax_url_email",
      "text": "Email me at john.smith@global-international.com"
    }
The above sentence would produce the following terms:
[ Email, me, at, john.smith@global-international.com ]
while the standard tokenizer would produce:
[ Email, me, at, john.smith, global, international.com ]
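The term lists shown here are a condensed view of the analyze response. As a minimal sketch, assuming the response is dict-like with a `tokens` list whose entries carry a `token` field, the terms could be pulled out as follows. The helper name `terms_from_analyze` and the hand-written, abbreviated `sample_response` are illustrative assumptions, not part of the client library; a real response carries additional fields per token.

```python
def terms_from_analyze(resp):
    """Return only the term strings from an analyze-style response."""
    return [entry["token"] for entry in resp["tokens"]]


# Hand-written sample (an assumption for illustration), abbreviated to the
# fields needed here. Offsets are character positions in the input text.
sample_response = {
    "tokens": [
        {"token": "Email", "start_offset": 0, "end_offset": 5, "position": 0},
        {"token": "me", "start_offset": 6, "end_offset": 8, "position": 1},
        {"token": "at", "start_offset": 9, "end_offset": 11, "position": 2},
        {
            "token": "john.smith@global-international.com",
            "start_offset": 12,
            "end_offset": 47,
            "position": 3,
        },
    ]
}

if __name__ == "__main__":
    print(terms_from_analyze(sample_response))
```

With a live cluster, the same helper would presumably be called on the `resp` object returned by `client.indices.analyze(...)`.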
Configuration
The uax_url_email tokenizer accepts the following parameters:
| max_token_length | The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255. |
Example configuration
In this example, we configure the uax_url_email tokenizer to have a max_token_length of 5 (for demonstration purposes):
Python:

    resp = client.indices.create(
        index="my-index-000001",
        settings={
            "analysis": {
                "analyzer": {
                    "my_analyzer": {"tokenizer": "my_tokenizer"}
                },
                "tokenizer": {
                    "my_tokenizer": {
                        "type": "uax_url_email",
                        "max_token_length": 5,
                    }
                },
            }
        },
    )
    print(resp)

    resp1 = client.indices.analyze(
        index="my-index-000001",
        analyzer="my_analyzer",
        text="john.smith@global-international.com",
    )
    print(resp1)

Ruby:

    response = client.indices.create(
      index: 'my-index-000001',
      body: {
        settings: {
          analysis: {
            analyzer: {
              my_analyzer: { tokenizer: 'my_tokenizer' }
            },
            tokenizer: {
              my_tokenizer: {
                type: 'uax_url_email',
                max_token_length: 5
              }
            }
          }
        }
      }
    )
    puts response

    response = client.indices.analyze(
      index: 'my-index-000001',
      body: {
        analyzer: 'my_analyzer',
        text: 'john.smith@global-international.com'
      }
    )
    puts response

JavaScript:

    const response = await client.indices.create({
      index: "my-index-000001",
      settings: {
        analysis: {
          analyzer: {
            my_analyzer: { tokenizer: "my_tokenizer" },
          },
          tokenizer: {
            my_tokenizer: {
              type: "uax_url_email",
              max_token_length: 5,
            },
          },
        },
      },
    });
    console.log(response);

    const response1 = await client.indices.analyze({
      index: "my-index-000001",
      analyzer: "my_analyzer",
      text: "john.smith@global-international.com",
    });
    console.log(response1);

Console:

    PUT my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": { "tokenizer": "my_tokenizer" }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "uax_url_email",
              "max_token_length": 5
            }
          }
        }
      }
    }

    POST my-index-000001/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "john.smith@global-international.com"
    }
The above example produces the following terms:
[ john, smith, globa, l, inter, natio, nal.c, om ]
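The split points in this output may look surprising at first: the address is no longer kept whole, and no term ends in a dot. The following is a rough Python approximation of that behaviour, written only to reproduce the terms above. It is not the tokenizer's actual implementation; the `WORD` pattern and the function name `approximate_split` are assumptions made for illustration, and the sketch handles only ASCII letters, digits, and dots.

```python
import re

# Simplified "word" pattern (an assumption): alphanumeric runs that may be
# joined by single dots, e.g. "john.smith" or "international.com".
WORD = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*")


def approximate_split(text, max_token_length):
    """Greedily emit word-like tokens no longer than max_token_length."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = WORD.match(text, pos)
        if match is None:
            pos += 1  # skip separators such as "@", "-", or a stray "."
            continue
        # Cap the match at the length limit, then drop a trailing dot so
        # that no token ends on a separator.
        candidate = match.group()[:max_token_length].rstrip(".")
        tokens.append(candidate)
        pos += len(candidate)  # resume scanning right after this token
    return tokens


if __name__ == "__main__":
    print(approximate_split("john.smith@global-international.com", 5))
```

Under these assumptions, a limit of 5 gives the same eight terms as the example, while a large limit such as 255 leaves `john.smith`, `global`, and `international.com` intact.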