Loading Sample Data

edit

This tutorial requires three data sets:

  • The complete works of William Shakespeare, suitably parsed into fields. Download shakespeare.json.
  • A set of fictitious accounts with randomly generated data. Download accounts.zip.
  • A set of randomly generated log files. Download logs.jsonl.gz.

Two of the data sets are compressed. To extract the files, use these commands:

unzip accounts.zip
gunzip logs.jsonl.gz

The Shakespeare data set has this structure:

{
    "line_id": INT,
    "play_name": "String",
    "speech_number": INT,
    "line_number": "String",
    "speaker": "String",
    "text_entry": "String",
}

The accounts data set is structured as follows:

{
    "account_number": INT,
    "balance": INT,
    "firstname": "String",
    "lastname": "String",
    "age": INT,
    "gender": "M or F",
    "address": "String",
    "employer": "String",
    "email": "String",
    "city": "String",
    "state": "String"
}

The logs data set has dozens of different fields. Here are the notable fields for this tutorial:

{
    "memory": INT,
    "geo.coordinates": "geo_point"
    "@timestamp": "date"
}

Before you load the Shakespeare and logs data sets, you must set up mappings for the fields. Mappings divide the documents in the index into logical groups and specify the characteristics of the fields. These characteristics include the searchability of the field and whether it’s tokenized, or broken up into separate words.

In Kibana Dev Tools > Console, set up a mapping for the Shakespeare data set:

PUT /shakespeare
{
 "mappings": {
  "doc": {
   "properties": {
    "speaker": {"type": "keyword"},
    "play_name": {"type": "keyword"},
    "line_id": {"type": "integer"},
    "speech_number": {"type": "integer"}
   }
  }
 }
}

This mapping specifies field characteristics for the data set:

  • The speaker and play_name fields are keyword fields. These fields are not analyzed. The strings are treated as a single unit even if they contain multiple words.
  • The line_id and speech_number fields are integers.

The logs data set requires a mapping to label the latitude and longitude pairs as geographic locations by applying the geo_point type.

PUT /logstash-2015.05.18
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
PUT /logstash-2015.05.19
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}
PUT /logstash-2015.05.20
{
  "mappings": {
    "log": {
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        }
      }
    }
  }
}

The accounts data set doesn’t require any mappings.

At this point, you’re ready to use the Elasticsearch bulk API to load the data sets:

curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/shakespeare/doc/_bulk?pretty' --data-binary @shakespeare_6.0.json
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl

Or for Windows users, in Powershell:

Invoke-RestMethod "http://localhost:9200/bank/account/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "accounts.json"
Invoke-RestMethod "http://localhost:9200/shakespeare/doc/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "shakespeare_6.0.json"
Invoke-RestMethod "http://localhost:9200/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "logs.jsonl"

These commands might take some time to execute, depending on the available computing resources.

Verify successful loading:

GET /_cat/indices?v

Your output should look similar to this:

health status index               pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank                  5   1       1000            0    418.2kb        418.2kb
yellow open   shakespeare           5   1     111396            0     17.6mb         17.6mb
yellow open   logstash-2015.05.18   5   1       4631            0     15.6mb         15.6mb
yellow open   logstash-2015.05.19   5   1       4624            0     15.7mb         15.7mb
yellow open   logstash-2015.05.20   5   1       4750            0     16.4mb         16.4mb