IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

Thai tokenizer

The thai tokenizer segments Thai text into words, using the Thai segmentation algorithm included with Java. Text in other languages is generally handled the same way as by the standard tokenizer.

This tokenizer may not be supported by all JREs. It is known to work with Sun/Oracle and OpenJDK. If your application needs to be fully portable, consider using the ICU Tokenizer instead.

Example output

resp = client.indices.analyze(
    tokenizer="thai",
    text="การที่ได้ต้องแสดงว่างานดี",
)
print(resp)

response = client.indices.analyze(
  body: {
    tokenizer: 'thai',
    text: 'การที่ได้ต้องแสดงว่างานดี'
  }
)
puts response

const response = await client.indices.analyze({
  tokenizer: "thai",
  text: "การที่ได้ต้องแสดงว่างานดี",
});
console.log(response);

POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}

The sentence above produces the following terms:

[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
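
When calling the API from Python, the terms can be pulled out of the analyze response. The following is a minimal sketch, assuming the response is dict-like with a `tokens` list whose entries carry a `token` field. `extract_terms` and `sample_response` are hypothetical names, and the sample is hand-written from the terms listed above, with all other per-token fields omitted, rather than captured from a live cluster.

```python
from typing import Any, List, Mapping


def extract_terms(response: Mapping[str, Any]) -> List[str]:
    """Return the token strings from an _analyze-style response."""
    return [entry["token"] for entry in response["tokens"]]


# Hand-written stand-in for an analyze response (assumption: only the
# "token" field is shown; a real response carries more fields per token).
sample_response = {
    "tokens": [
        {"token": term}
        for term in ["การ", "ที่", "ได้", "ต้อง", "แสดง", "ว่า", "งาน", "ดี"]
    ]
}

terms = extract_terms(sample_response)
print(terms)
```

In practice, `resp` from the Python example above would be passed in place of `sample_response`.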

Configuration

The thai tokenizer is not configurable.
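
Although the tokenizer itself takes no parameters, it can still be referenced by name in a custom analyzer and combined with token filters. The following is a minimal sketch: the analyzer name `my_thai_analyzer` is hypothetical, and the `lowercase` filter is chosen only for illustration. The code builds the settings body and prints it; it does not contact a cluster.

```python
import json


def thai_analyzer_settings(analyzer_name: str = "my_thai_analyzer") -> dict:
    """Build index settings for a custom analyzer that uses the thai tokenizer."""
    return {
        "analysis": {
            "analyzer": {
                analyzer_name: {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "thai",  # referenced by name; takes no parameters
                    "filter": ["lowercase"],  # illustrative choice, not required
                }
            }
        }
    }


settings = thai_analyzer_settings()
print(json.dumps(settings, indent=2, ensure_ascii=False))
```

With the Python client, a body like this would typically be supplied as the `settings` argument when creating an index.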
