Note:
This book is based on Elasticsearch 2.x, and some of its content may be out of date.


Unicode Character Folding

The `lowercase` token filter is a good starting point for many languages, but it falls short when exposed to the entire tower of Babel. In the same way, the `asciifolding` token filter needs a more capable Unicode _character-folding_ counterpart to deal with the many languages of the world.

The `icu_folding` token filter (provided by the `icu` plug-in) does the same job as the `asciifolding` filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents and applies various other numeric, symbolic, and punctuation transformations.

The `icu_folding` token filter automatically applies Unicode normalization and case folding from `nfkc_cf`, so the `icu_normalizer` is not required:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥

The Arabic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.
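
To make the idea of folding more concrete, here is a minimal Python sketch that imitates a small part of this behavior using only the standard `unicodedata` module. It is an approximation for illustration, not the ICU algorithm that `icu_folding` uses: the function name `fold` is hypothetical, `nfkc_cf` is approximated by `str.casefold()` followed by NFKC normalization, and only accents and decimal digits are handled.

```python
import unicodedata


def fold(text: str) -> str:
    """Rough stand-in for character folding (hypothetical helper, not ICU)."""
    # Approximate nfkc_cf: case folding followed by NFKC normalization.
    text = unicodedata.normalize("NFKC", text.casefold())
    out = []
    # Decompose so that accents become separate combining marks.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # drop the accent, keep the base letter
        # Map any Unicode decimal digit to its ASCII equivalent.
        digit = unicodedata.decimal(ch, None)
        out.append(str(digit) if digit is not None else ch)
    return "".join(out)


print(fold("١٢٣٤٥"))         # 12345
print(fold("Crème Brûlée"))  # creme brulee
```

The real filter covers far more transformations than this sketch, so treat it only as a way to see why the digits above come out as 12345.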

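If there are particular characters that you want to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, specify a character class that represents all Unicode characters except those letters: [^åäöÅÄÖ] (^ means "everything except"):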

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": { 
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": { 
          "tokenizer": "icu_tokenizer",
          "filter":  [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}

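	The `swedish_folding` token filter customizes the `icu_folding` token filter to exclude the Swedish letters, both uppercase and lowercase.
	The `swedish_analyzer` analyzer first tokenizes words, then folds each token by using the `swedish_folding` filter, and then lowercases each token in case it includes some of the uppercase excluded letters: `Å`, `Ä`, or `Ö`.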
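
The two-step logic of this analyzer (fold everything except the protected letters, then lowercase) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not what the ICU plug-in does internally: `PROTECTED`, `fold_char`, and `swedish_fold` are hypothetical names, and folding is reduced to case folding plus accent stripping.

```python
import unicodedata

# Mirrors the UnicodeSet [^åäöÅÄÖ]: these letters are never folded.
PROTECTED = set(unicodedata.normalize("NFC", "åäöÅÄÖ"))


def fold_char(ch: str) -> str:
    """Case-fold one character and strip its accents (simplified, not ICU)."""
    decomposed = unicodedata.normalize("NFKD", ch.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def swedish_fold(token: str) -> str:
    # Normalize to NFC first so each protected letter is a single code point.
    token = unicodedata.normalize("NFC", token)
    # Step 1: fold every character except the protected letters.
    folded = "".join(ch if ch in PROTECTED else fold_char(ch) for ch in token)
    # Step 2: lowercase, which catches protected uppercase letters such as Ä.
    return folded.lower()


print(swedish_fold("Ängelholm"))  # ängelholm
print(swedish_fold("Café"))       # cafe
```

Without the protected set, `fold_char("Ä")` would return `a`, which is exactly the loss of distinction that the exclusion is meant to prevent for Swedish text.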
