Word delimiter token filter

We recommend using the word_delimiter_graph filter instead of the
word_delimiter filter.

The word_delimiter filter can produce invalid token graphs. See
Differences between word_delimiter_graph and word_delimiter.

The word_delimiter filter also uses Lucene’s WordDelimiterFilter, which is
marked as deprecated.
Splits tokens at non-alphanumeric characters. The word_delimiter filter also
performs optional token normalization based on a set of rules. By default, the
filter uses the following rules:

- Split tokens at non-alphanumeric characters. The filter uses these
  characters as delimiters. For example: Super-Duper → Super, Duper
- Remove leading or trailing delimiters from each token. For example:
  XL---42+'Autocoder' → XL, 42, Autocoder
- Split tokens at letter case transitions. For example: PowerShot → Power, Shot
- Split tokens at letter-number transitions. For example: XL500 → XL, 500
- Remove the English possessive ('s) from the end of each token. For example:
  Neil's → Neil
The word_delimiter filter was designed to remove punctuation from complex
identifiers, such as product IDs or part numbers. For these use cases, we
recommend using the word_delimiter filter with the keyword tokenizer.

Avoid using the word_delimiter filter to split hyphenated words, such as
wi-fi. Because users often search for these words both with and without
hyphens, we recommend using the synonym_graph filter instead.
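As a sketch of that alternative, the following create index API request defines
a synonym_graph filter that maps the hyphenated and unhyphenated forms to each
other. The index name, filter name, and synonym list here are illustrative
assumptions, not part of this page:

PUT /hyphen_example
{
  "settings": {
    "analysis": {
      "filter": {
        "wifi_synonyms": {
          "type": "synonym_graph",
          "synonyms": [ "wi-fi, wifi" ]
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "wifi_synonyms" ]
        }
      }
    }
  }
}

Because synonym_graph can produce multi-position tokens, an analyzer like this
is intended for search analysis rather than index analysis.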
Example

The following analyze API request uses the word_delimiter filter to split
Neil's-Super-Duper-XL500--42+AutoCoder into normalized tokens using the
filter’s default rules:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}

The filter produces the following tokens:

[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
Add to an analyzer

The following create index API request uses the word_delimiter filter to
configure a new custom analyzer.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  }
}
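Once the index exists, you can check the analyzer’s output with the analyze
API. This follow-up request is a usage sketch rather than part of the original
page; it reuses the my_index and my_analyzer names from the request above:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}

It should return the same tokens as the earlier default-rules example.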
Avoid using the word_delimiter filter with tokenizers that remove punctuation,
such as the standard tokenizer. This could prevent the word_delimiter filter
from splitting tokens correctly. It can also interfere with the filter’s
configurable parameters, such as catenate_all or preserve_original. We
recommend using the keyword or whitespace tokenizer instead.
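To see why, you can compare the two tokenizers directly. In this sketch (an
illustration, not from the original page), the standard tokenizer discards the
hyphens before the filter runs, so preserve_original has no delimiters left to
preserve:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true
    }
  ],
  "text": "super-duper-xl-500"
}

The tokens reach the filter already split into super, duper, xl, and 500, so
the original super-duper-xl-500 token can never appear in the output. Swapping
the tokenizer for keyword restores the expected behavior.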
Configurable parameters

- catenate_all
  (Optional, boolean) If true, the filter produces catenated tokens for chains
  of alphanumeric characters separated by non-alphabetic delimiters. For
  example: super-duper-xl-500 → [ super, superduperxl500, duper, xl, 500 ].
  Defaults to false.
  When used for search analysis, catenated tokens can cause problems for the
  match_phrase query and other queries that rely on token position for
  matching. Avoid setting this parameter to true if you plan to use these
  queries.
- catenate_numbers
  (Optional, boolean) If true, the filter produces catenated tokens for chains
  of numeric characters separated by non-alphabetic delimiters. For example:
  01-02-03 → [ 01, 010203, 02, 03 ]. Defaults to false.
  When used for search analysis, catenated tokens can cause problems for the
  match_phrase query and other queries that rely on token position for
  matching. Avoid setting this parameter to true if you plan to use these
  queries.
- catenate_words
  (Optional, boolean) If true, the filter produces catenated tokens for chains
  of alphabetical characters separated by non-alphabetic delimiters. For
  example: super-duper-xl → [ super, superduperxl, duper, xl ]. Defaults to
  false. See the sketch after this list.
  When used for search analysis, catenated tokens can cause problems for the
  match_phrase query and other queries that rely on token position for
  matching. Avoid setting this parameter to true if you plan to use these
  queries.
- generate_number_parts
  (Optional, boolean) If true, the filter includes tokens consisting of only
  numeric characters in the output. If false, the filter excludes these tokens
  from the output. Defaults to true.
- generate_word_parts
  (Optional, boolean) If true, the filter includes tokens consisting of only
  alphabetical characters in the output. If false, the filter excludes these
  tokens from the output. Defaults to true.
- preserve_original
  (Optional, boolean) If true, the filter includes the original version of any
  split tokens in the output. This original version includes non-alphanumeric
  delimiters. For example: super-duper-xl-500 → [ super-duper-xl-500, super,
  duper, xl, 500 ]. Defaults to false.
- protected_words
  (Optional, array of strings) Array of tokens the filter won’t split.
- protected_words_path
  (Optional, string) Path to a file that contains a list of tokens the filter
  won’t split. This path must be absolute or relative to the config location,
  and the file must be UTF-8 encoded. Each token in the file must be separated
  by a line break.
- split_on_case_change
  (Optional, boolean) If true, the filter splits tokens at letter case
  transitions. For example: camelCase → [ camel, Case ]. Defaults to true.
- split_on_numerics
  (Optional, boolean) If true, the filter splits tokens at letter-number
  transitions. For example: j2se → [ j, 2, se ]. Defaults to true.
- stem_english_possessive
  (Optional, boolean) If true, the filter removes the English possessive ('s)
  from the end of each token. For example: O'Neil's → [ O, Neil ]. Defaults to
  true.
- type_table
  (Optional, array of strings) Array of custom type mappings for characters.
  This allows you to map non-alphanumeric characters as numeric or alphanumeric
  to avoid splitting on those characters.
  For example, the following array maps the plus (+) and hyphen (-) characters
  as alphanumeric, which means they won’t be treated as delimiters:

    [ "+ => ALPHA", "- => ALPHA" ]

  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
- type_table_path
  (Optional, string) Path to a file that contains custom type mappings for
  characters. This allows you to map non-alphanumeric characters as numeric or
  alphanumeric to avoid splitting on those characters.
  For example, the contents of this file may contain the following:

    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \\u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D => ALPHANUM

  Supported types include:
  - ALPHA (Alphabetical)
  - ALPHANUM (Alphanumeric)
  - DIGIT (Numeric)
  - LOWER (Lowercase alphabetical)
  - SUBWORD_DELIM (Non-alphanumeric delimiter)
  - UPPER (Uppercase alphabetical)
  This file path must be absolute or relative to the config location, and the
  file must be UTF-8 encoded. Each mapping in the file must be separated by a
  line break.
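As a quick illustration of catenation, the following analyze API request (a
sketch, not from the original page) enables catenate_words on an inline
word_delimiter filter:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_words": true
    }
  ],
  "text": "super-duper-xl"
}

This should produce the tokens from the catenate_words entry above:
[ super, superduperxl, duper, xl ].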
Customize

To customize the word_delimiter filter, duplicate it to create the basis for a
new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a word_delimiter filter that uses
the following rules:

- Split tokens at non-alphanumeric characters, except the hyphen (-) character.
- Remove leading or trailing delimiters from each token.
- Do not split tokens at letter case transitions.
- Do not split tokens at letter-number transitions.
- Remove the English possessive ('s) from the end of each token.
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_filter": {
          "type": "word_delimiter",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
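To confirm the custom rules behave as intended, you can run the new analyzer
against a sample string. This verification request is a sketch that assumes
the index created above; the sample text is illustrative:

GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "wi-fi-enabled PowerShot2000"
}

With the hyphen mapped to ALPHA and the case-change and letter-number splits
disabled, this should yield roughly [ wi-fi-enabled, PowerShot2000 ]: the
space is still a delimiter, but the hyphenated word and the mixed-case,
mixed-digit product name stay intact.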