Usage

Using the attachment type is simple: in your mapping JSON, set the type of a field to attachment. For example:

PUT /test
{
  "mappings": {
    "person" : {
      "properties" : {
        "my_attachment" : { "type" : "attachment" }
      }
    }
  }
}

In this case, the JSON to index can be:

PUT /test/person/1
{
    "my_attachment" : "... base64 encoded attachment ..."
}
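The base64 value has to be produced on the client side before the document is sent. As a minimal sketch, assuming the attachment bytes are already in memory, the body above could be built like this in Python. The helper name `build_attachment_doc` and the sample bytes are illustrative only and are not part of the plugin:

```python
import base64
import json


def build_attachment_doc(raw_bytes, field="my_attachment"):
    """Return the JSON body for the simple form: field -> base64 string."""
    # b64encode returns bytes; decode to str so json.dumps can serialize it.
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({field: encoded})


# Sample bytes stand in for the content of a real file (hypothetical data).
body = build_attachment_doc(b"Hello attachment")
print(body)
```

The resulting string would be sent as the body of the PUT request shown above.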

Alternatively, use a more elaborate JSON object if the content type, resource name, or language needs to be set explicitly:

PUT /test/person/1
{
    "my_attachment" : {
        "_content_type" : "application/pdf",
        "_name" : "resource/name/of/my.pdf",
        "_language" : "en",
        "_content" : "... base64 encoded attachment ..."
    }
}
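The object form can be assembled in the same way. The following sketch assumes that optional keys which are not set should simply be left out of the object; the function name `build_explicit_attachment_doc` and its parameters are hypothetical and chosen only for illustration:

```python
import base64
import json


def build_explicit_attachment_doc(raw_bytes, content_type=None, name=None,
                                  language=None, field="my_attachment"):
    """Return the JSON body for the object form, skipping unset hints."""
    attachment = {"_content": base64.b64encode(raw_bytes).decode("ascii")}
    # Map each optional key from the example above to the caller's value.
    optional = {
        "_content_type": content_type,
        "_name": name,
        "_language": language,
    }
    # Only include the keys the caller actually provided.
    attachment.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps({field: attachment})


# Sample bytes are a placeholder, not a real PDF.
print(build_explicit_attachment_doc(
    b"sample bytes",
    content_type="application/pdf",
    name="resource/name/of/my.pdf",
    language="en",
))
```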

The attachment type not only indexes the content of the document in the content sub-field, but also automatically adds metadata about the attachment (when available).

The supported metadata fields are:

  • date
  • title
  • name (only available if you set _name, see above)
  • author
  • keywords
  • content_type
  • content_length (the original content length before text extraction, that is, the file size)
  • language

They can be queried using the "dot notation", for example: my_attachment.author.
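As an illustration of the dot notation, the sketch below builds a match query body against the author sub-field. It assumes the field is named my_attachment as in the first mapping; the helper `author_query` and the author value are hypothetical:

```python
import json


def author_query(author, field="my_attachment"):
    """Build a match query body against the attachment's author sub-field."""
    # The metadata sub-field is addressed as "<attachment field>.author".
    return {"query": {"match": {f"{field}.author": author}}}


# "Jane Doe" is a placeholder value, not data from a real index.
print(json.dumps(author_query("Jane Doe"), indent=2))
```

The printed JSON could then be used as the body of a search request against the index, for example GET /test/person/_search.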

Both the metadata and the actual content are simple core type mappers (text, date, …), so they can be controlled in the mappings. For example:

PUT /test
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["standard"]
          }
        }
      }
    }
  },
  "mappings": {
    "person" : {
      "properties" : {
        "file" : {
          "type" : "attachment",
          "fields" : {
            "content" : {"index" : false},
            "title" : {"store" : true},
            "date" : {"store" : true},
            "author" : {"analyzer" : "my_analyzer"},
            "keywords" : {"store" : true},
            "content_type" : {"store" : true},
            "content_length" : {"store" : true},
            "language" : {"store" : true}
          }
        }
      }
    }
  }
}

In the above example, the actual content is mapped under the field name content, and we choose not to index it, so it is only available through the _all field. The other fields map to their respective metadata names, and there is no need to specify their type (such as text or date) since it is already known.