Using the Attachment Processor in a Pipeline

edit

Table 1. Attachment options

Name Required Default Description

field

yes

-

The field to get the base64 encoded field from

target_field

no

attachment

The field that will hold the attachment information

indexed_chars

no

100000

The number of chars being used for extraction to prevent huge fields. Use -1 for no limit.

indexed_chars_field

no

null

Field name from which you can overwrite the number of chars being used for extraction. See indexed_chars.

properties

no

all properties

 Array of properties to select to be stored. Can be content, title, name, author, keywords, date, content_type, content_length, language

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

Example

edit

If attaching files to JSON documents, you must first encode the file as a base64 string. On Unix-like systems, you can do this using a base64 command:

base64 -in myfile.rtf

The command returns the base64-encoded string for the file. The following base64 string is for an .rtf file containing the text Lorem ipsum dolor sit amet: e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=.

Use an attachment processor to decode the string and extract the file’s properties:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my-index-000001/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my-index-000001/_doc/my_id

The document’s attachment object contains extracted properties for the file:

{
  "found": true,
  "_index": "my-index-000001",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

To extract only certain attachment fields, specify the properties array:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title" ]
      }
    }
  ]
}

Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node.