IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Thai tokenizer
The thai tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages is
generally treated the same way as by the standard tokenizer.
This tokenizer may not be supported by all JREs. It is known to work with Sun/Oracle and OpenJDK. If your application needs to be fully portable, consider using the ICU Tokenizer instead.
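To check how a particular JRE behaves, the Thai word segmentation that ships with Java can be called directly through java.text.BreakIterator. The following is a minimal sketch, not the tokenizer's actual implementation: the class name ThaiSegmentationSketch and the methods segment and looksSupported are hypothetical, and the boundary probe is an assumed heuristic (it expects a dictionary-based iterator to report a word boundary between ภาษา and ไทย). Unlike the tokenizer, this sketch also returns whitespace and punctuation as separate segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentationSketch {

    // Splits the text at every word boundary reported by the JRE's
    // BreakIterator for the Thai locale. Whitespace and punctuation
    // come back as their own segments.
    public static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> segments = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            segments.add(text.substring(start, end));
        }
        return segments;
    }

    // Assumed heuristic: "ภาษาไทย" consists of two words (ภาษา + ไทย),
    // so a JRE with Thai dictionary support should report a boundary at
    // index 4. A JRE without it is expected to treat the string as one run.
    public static boolean looksSupported() {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText("ภาษาไทย");
        return words.isBoundary(4);
    }

    public static void main(String[] args) {
        System.out.println("Thai segmentation available: " + looksSupported());
        System.out.println(segment("การที่ได้ต้องแสดงว่างานดี"));
    }
}
```

If looksSupported() returns false on the target JRE, that is a hint that Thai text may come back unsegmented there, which is the situation in which the ICU Tokenizer is the more portable choice.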
Example output
POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}
The above sentence would produce the following terms:
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
Configuration
The thai tokenizer is not configurable.