Loading Sample Data

The tutorials in this section rely on the following data sets:

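The complete works of William Shakespeare, suitably parsed into fields. Download this data set by clicking here: shakespeare.json.
A set of fictitious accounts with randomly generated data. Download this data set by clicking here: accounts.zip.
A set of randomly generated log files. Download this data set by clicking here: logs.jsonl.gz.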

Two of the data sets are compressed. Use the following commands to extract the files:

unzip accounts.zip
gunzip logs.jsonl.gz

The Shakespeare data set is organized in the following schema:

{
    "line_id": INT,
    "play_name": "String",
    "speech_number": INT,
    "line_number": "String",
    "speaker": "String",
    "text_entry": "String"
}

The accounts data set is organized in the following schema:

{
    "account_number": INT,
    "balance": INT,
    "firstname": "String",
    "lastname": "String",
    "age": INT,
    "gender": "M or F",
    "address": "String",
    "employer": "String",
    "email": "String",
    "city": "String",
    "state": "String"
}

The schema for the logs data set has many different fields. The notable ones are:

{
    "memory": INT,
    "geo.coordinates": "geo_point",
    "@timestamp": "date"
}

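Before loading the Shakespeare and logs data sets, we need to set up mappings for their fields. A mapping divides the documents in an index into logical groups and specifies each field's characteristics, such as whether the field is searchable and whether it is tokenized, that is, broken up into separate words.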

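Use the following command to set up a mapping for the Shakespeare data set. The request is shown in Console syntax; to run it from a terminal (such as bash), send the same request body with curl: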

PUT /shakespeare
{
 "mappings": {
  "doc": {
   "properties": {
    "speaker": {"type": "keyword"},
    "play_name": {"type": "keyword"},
    "line_id": {"type": "integer"},
    "speech_number": {"type": "integer"}
   }
  }
 }
}
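The request above is written in Console syntax, so it cannot be pasted into bash as-is. The following is a minimal sketch of an equivalent curl call. It assumes Elasticsearch is reachable at localhost:9200 (the address used by the bulk commands in this section); the ES_URL variable and both function names are hypothetical helpers introduced only for illustration.

```shell
# Assumption: Elasticsearch listens here; override ES_URL if it does not.
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: prints the mapping body from the request above.
shakespeare_mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "doc": {
      "properties": {
        "speaker": {"type": "keyword"},
        "play_name": {"type": "keyword"},
        "line_id": {"type": "integer"},
        "speech_number": {"type": "integer"}
      }
    }
  }
}
EOF
}

# Hypothetical helper: creates the shakespeare index with that mapping.
# curl reads the request body from stdin because of "--data-binary @-".
create_shakespeare_index() {
  shakespeare_mapping_body |
    curl -H 'Content-Type: application/json' \
      -XPUT "$ES_URL/shakespeare?pretty" --data-binary @-
}
```

Calling create_shakespeare_index sends the request. Keeping the body in its own function makes it easy to inspect before anything is sent.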

This mapping specifies the following qualities for the data set:

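Because the speaker and play_name fields are keyword fields, they are not analyzed. Each string is treated as a single unit even if it contains multiple words.
The line_id and speech_number fields are integers.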

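The logs data set requires a mapping that labels the latitude/longitude pairs in the logs as geographic locations by applying the geo_point type to those fields.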

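Use the following commands to establish the geo_point mapping for the logs. The same mapping is applied to each of the three daily indices: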

PUT /logstash-2015.05.18
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}

PUT /logstash-2015.05.19
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}

PUT /logstash-2015.05.20
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
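Because the three requests above differ only in the index name, they can be issued from a loop. The following is a rough sketch under the same assumption that Elasticsearch is reachable at localhost:9200; ES_URL and the function names are hypothetical and not part of the original tutorial.

```shell
# Assumption: Elasticsearch listens here; override ES_URL if it does not.
ES_URL="${ES_URL:-localhost:9200}"

# Hypothetical helper: prints the geo_point mapping shared by all three indices.
logstash_mapping_body() {
  cat <<'EOF'
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {"type": "geo_point"}
          }
        }
      }
    }
  }
}
EOF
}

# Hypothetical helper: prints the three daily index names, one per line.
logstash_indices() {
  for day in 18 19 20; do
    echo "logstash-2015.05.$day"
  done
}

# Hypothetical helper: sends one PUT per index with the shared mapping.
create_logstash_indices() {
  for index in $(logstash_indices); do
    logstash_mapping_body |
      curl -H 'Content-Type: application/json' \
        -XPUT "$ES_URL/$index?pretty" --data-binary @-
  done
}
```

Calling create_logstash_indices issues the three requests in order.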

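The accounts data set does not require any mappings, so at this point we are ready to use the Elasticsearch bulk API to load the data sets with the following commands. Each --data-binary argument must name a file as it exists on disk, so adjust the file names if your downloads are named differently (the Shakespeare file is listed above as shakespeare.json but is loaded here as shakespeare_6.0.json):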

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl

These commands may take some time to execute, depending on the computing resources available.
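A bulk request can return HTTP 200 even when individual documents fail to index; the response body carries a top-level "errors" flag for that case. The following sketch wraps the load in a check of that flag. ES_URL, bulk_has_errors, and bulk_load are hypothetical names, and the localhost:9200 address is the one assumed by the commands above.

```shell
# Assumption: Elasticsearch listens here; override ES_URL if it does not.
ES_URL="${ES_URL:-localhost:9200}"

# Succeeds when a bulk response read from stdin reports "errors": true.
bulk_has_errors() {
  grep -q '"errors"[[:space:]]*:[[:space:]]*true'
}

# Hypothetical helper: $1 = bulk endpoint path, $2 = NDJSON file to load.
bulk_load() {
  local response
  # Stop early if curl itself fails (for example, connection refused).
  response=$(curl -sS -H 'Content-Type: application/x-ndjson' \
    -XPOST "$ES_URL/$1" --data-binary "@$2") || return 1
  if printf '%s\n' "$response" | bulk_has_errors; then
    echo "bulk load of $2 reported errors" >&2
    return 1
  fi
  echo "bulk load of $2 finished without reported errors"
}
```

For example, bulk_load bank/account/_bulk accounts.json loads the accounts file and returns a non-zero status if any item failed.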

Verify successful loading with the following command:

GET /_cat/indices?v

You should see output similar to the following:

health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank                  5   1       1000            0    418.2kb        418.2kb
yellow open   shakespeare           5   1     111396            0     17.6mb         17.6mb
yellow open   logstash-2015.05.18   5   1       4631            0     15.6mb         15.6mb
yellow open   logstash-2015.05.19   5   1       4624            0     15.7mb         15.7mb
yellow open   logstash-2015.05.20   5   1       4750            0     16.4mb         16.4mb
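To check a single index without reading the whole table, the docs.count column can be extracted by header name. This is a small sketch; docs_count is a hypothetical helper, and it looks columns up by name because the set of columns in the output can differ from the sample shown here.

```shell
# Hypothetical helper: $1 = index name; reads `_cat/indices?v` output on stdin
# and prints the docs.count value for that index.
docs_count() {
  awk -v idx="$1" '
    NR == 1 {
      # Locate the columns by header name instead of by fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == "index") ic = i
        if ($i == "docs.count") dc = i
      }
      next
    }
    $ic == idx { print $dc }
  '
}
```

A possible use, assuming Elasticsearch is reachable at localhost:9200, is curl -s 'localhost:9200/_cat/indices?v' | docs_count shakespeare, which prints the number of documents loaded into that index.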
