Content extraction
Connectors use the Elastic ingest attachment processor to extract file contents. The processor extracts files using the Apache Tika text extraction library. The logic for content extraction is defined in utils.py.
While the processor is intended primarily for PDF and Microsoft Office formats, you can use it with any of the supported formats.
Enterprise Search uses an Elasticsearch ingest pipeline to power the web crawler’s binary content extraction.
The default pipeline, ent-search-generic-ingestion, is automatically created when Enterprise Search first starts.
You can view this pipeline in Kibana. You can also customize your pipeline usage. See Index-specific ingest pipelines.
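To make the role of the attachment processor more concrete, the following is a minimal sketch of what a pipeline containing that processor might look like, together with a document prepared for it. It is not the definition of ent-search-generic-ingestion and not the logic in utils.py. The helper names, the field names data and attachment, and the description text are assumptions chosen for illustration. The sketch only builds the request bodies and does not contact Elasticsearch.

```python
import base64
import json


def build_attachment_pipeline(source_field="data", target_field="attachment"):
    """Return a pipeline body with a single attachment processor.

    The field names are illustrative assumptions, not the defaults
    used by Enterprise Search.
    """
    return {
        "description": "Illustrative pipeline: extract text from binary content",
        "processors": [
            {
                "attachment": {
                    # Field holding the base64-encoded file contents.
                    "field": source_field,
                    # Field that receives the extracted text and metadata.
                    "target_field": target_field,
                }
            }
        ],
    }


def build_document(raw_bytes, source_field="data"):
    """Encode raw file bytes as base64, the form the processor reads."""
    return {source_field: base64.b64encode(raw_bytes).decode("ascii")}


pipeline_body = build_attachment_pipeline()
doc = build_document(b"Hello from a connector")

print(json.dumps(pipeline_body, indent=2))
print(json.dumps(doc, indent=2))
```

In a real deployment the default pipeline is created for you, so a body like this would only be relevant when experimenting with a custom pipeline.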
Supported file types
The following file types are supported:
- .txt
- .py
- .rst
- .html
- .markdown
- .json
- .xml
- .csv
- .md
- .ppt
- .rtf
- .docx
- .odt
- .xls
- .xlsx
- .rb
- .paper
- .sh
- .pptx
- .pdf
- .doc
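If you want to check in advance whether a file falls into one of the supported types, a small helper along these lines may be useful. This is a sketch only: the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and treating extensions as case-insensitive is an assumption rather than documented connector behavior.

```python
from pathlib import Path

# Extensions copied from the list of supported file types above.
SUPPORTED_EXTENSIONS = frozenset({
    ".txt", ".py", ".rst", ".html", ".markdown", ".json", ".xml",
    ".csv", ".md", ".ppt", ".rtf", ".docx", ".odt", ".xls", ".xlsx",
    ".rb", ".paper", ".sh", ".pptx", ".pdf", ".doc",
})


def is_supported(filename):
    """Return True if the file's final extension is in the supported set.

    Lowercasing the extension is an assumption made for convenience.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ["report.pdf", "slides.PPTX", "photos.zip", "README"]:
    print(name, is_supported(name))
```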
The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs. Expand the archive file and make individual uncompressed files available for the connector to process.
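One way to expand an archive before the connector runs is sketched below for ZIP files, using Python's standard zipfile module. The function name expand_archive and the paths in the usage comment are hypothetical, and other archive formats would need a different module or tool.

```python
import zipfile
from pathlib import Path


def expand_archive(archive_path, dest_dir):
    """Extract a ZIP archive and return the extracted files, sorted by path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    # Walk the destination so files in nested folders are included too.
    return sorted(p for p in dest.rglob("*") if p.is_file())


# Example usage (hypothetical paths):
# for path in expand_archive("reports.zip", "expanded/reports"):
#     print(path)
```

Point the connector at the destination directory rather than at the original archive so that each file is processed individually.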