Content extraction

The third-party services you sync with Workplace Search, such as Dropbox or Google Drive, usually contain a wide variety of documents and file types. Workplace Search tries to extract the content of these files to turn each source document into a searchable document.

To make the document searchable, the Workplace Search connector tries to extract text content into fields, and images into thumbnail previews. Full text content extraction is available for many types of documents, including PDFs and most Office 365 and G Suite formats. Thumbnail extraction is available for certain image formats.

Nevertheless, you might find that some of your documents do not have their content extracted, or that the extraction is imperfect. The following documentation covers the file extensions and media types supported by Workplace Search, as well as how to troubleshoot unexpected results.

Content extraction limits

There are some important facts and figures to note up front:

  • The maximum file size for content extraction is 20MB. This is not configurable.
  • The resulting text will be truncated if it exceeds 100KB.
  • Encrypted documents are skipped by the extractor.
  • Content extraction from binary formats (e.g., images, audio, videos) is currently not supported.
  • Thumbnail extraction is automatically disabled when less than 2GB of JVM heap is available. For maximum performance and stability, ensure that Enterprise Search has at least 4GB of heap.
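
The size limits above can be pictured with a short sketch. It is illustrative only and is not how Workplace Search is implemented: the function name and constants are hypothetical, and reading 20MB and 100KB as binary units measured in UTF-8 bytes is an assumption.

```python
from typing import Optional

# Assumed interpretation of the documented limits (binary units).
MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024  # 20MB limit on the source file
MAX_TEXT_BYTES = 100 * 1024             # 100KB limit on the extracted text


def apply_extraction_limits(file_size_bytes: int, extracted_text: str) -> Optional[str]:
    """Return the text that would be kept, or None if the file is too large."""
    if file_size_bytes > MAX_FILE_SIZE_BYTES:
        return None  # over the size limit, so no content is extracted
    encoded = extracted_text.encode("utf-8")
    if len(encoded) <= MAX_TEXT_BYTES:
        return extracted_text
    # Truncate on a byte budget; "ignore" drops a partial trailing character.
    return encoded[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")
```

A helper like this can be handy when estimating, before a sync, which files in a source are likely to be skipped or truncated.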

Disabling thumbnail or full-text extraction

If you need to save RAM, you can toggle off thumbnail extraction, full-text extraction, or both. However, these toggles are only available after you create the content source, and creating a content source immediately triggers a full sync, including thumbnail and full-text extraction.

If you want to avoid any thumbnail or full-text extraction you need to switch the toggle immediately after creating the content source. This ensures that the toggle is registered during the initial, metadata-only phase of the first full sync.

Enabling or disabling thumbnail extraction or full-text extraction does not interrupt or restart running jobs. Changes to these settings are picked up only by new jobs.

The toggle disables/enables any subsequent thumbnail or full-text extraction. Existing thumbnail or full-text data will not be removed.

File types

Media types, also known as MIME types, are an internet standard for describing file formats.

A MIME type has two parts. The type represents the general category into which the data falls, such as video or text. The subtype identifies the exact kind of data within that category.

Workplace Search analyzes the file to determine its type, since file extensions are not reliable. Workplace Search leverages the industry standard open-source Apache Tika toolkit for detecting and extracting text and metadata. The Elasticsearch ingest attachment plugin also uses Apache Tika and can return a number of metadata fields.
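
To illustrate why content-based detection is more reliable than trusting the extension, here is a minimal sketch that checks a few well-known file signatures before falling back to the file name. It is not Apache Tika and not Workplace Search code: the function names and the small signature table are hypothetical, and a real detector recognizes far more formats.

```python
import mimetypes
from typing import Optional

# A few well-known file signatures ("magic bytes").
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_mime_type(content: bytes) -> Optional[str]:
    """Return a MIME type based on the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES:
        if content.startswith(magic):
            return mime
    return None


def detect_mime_type(filename: str, content: bytes) -> Optional[str]:
    """Prefer the type detected from content; fall back to the extension."""
    sniffed = sniff_mime_type(content)
    if sniffed is not None:
        return sniffed
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed
```

With this sketch, a PDF that was renamed to `report.txt` is still reported as `application/pdf`, because its first bytes are `%PDF-`.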

You can add checkbox filters for file extensions and MIME types to your search experience. Learn more about customizing content source filters in Workplace Search.

Full text content extraction

The following file types are supported for full text extraction:

  • .doc
  • .docx
  • .html
  • .odt
  • .one
  • .md
  • .markdown
  • .paper
  • .pdf
  • .ppt
  • .pptx
  • .rtf
  • .txt
  • .xls
  • .xlsx

Formatted text files are normalized to decrease whitespace and minimize storage costs:

  • .md
  • .markdown
  • .paper
  • .rtf
  • .txt
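
The two lists above can be encoded as a quick lookup, for example to pre-check a folder before syncing it. This is a rough sketch only: the function name is hypothetical, matching extensions case-insensitively is an assumption, and Workplace Search itself determines the type from the file content rather than from the extension.

```python
from pathlib import PurePath

# Extensions listed as supported for full text extraction.
FULL_TEXT_EXTENSIONS = {
    ".doc", ".docx", ".html", ".odt", ".one", ".md", ".markdown", ".paper",
    ".pdf", ".ppt", ".pptx", ".rtf", ".txt", ".xls", ".xlsx",
}

# Formatted text extensions listed as whitespace-normalized.
NORMALIZED_EXTENSIONS = {".md", ".markdown", ".paper", ".rtf", ".txt"}


def extraction_support(filename: str) -> dict:
    """Summarize what the lists above say about a file name's extension."""
    ext = PurePath(filename).suffix.lower()
    return {
        "extension": ext,
        "full_text": ext in FULL_TEXT_EXTENSIONS,
        "whitespace_normalized": ext in NORMALIZED_EXTENSIONS,
    }
```

For example, `extraction_support("notes.md")` reports both full text extraction and whitespace normalization, while `extraction_support("song.mp3")` reports neither.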

Workplace Search supports these MIME types for text files:

  • application/msword
  • application/pdf
  • application/vnd.openxmlformats-officedocument.presentationml.presentation
  • application/vnd.openxmlformats-officedocument.wordprocessingml.document

Unstructured textual data has the highest likelihood of benefiting from content extraction. Structured documents, such as Excel spreadsheets, HTML, or CSV files, do not lend themselves to well-ordered text extraction.

Google Docs, Sheets and Slides

Text extraction from Google Docs, Sheets and Slides is also supported. These files do not have a native download format, so Workplace Search exports them as PDFs in order to extract their content.

However, extracting text from PDFs is not always a perfectly lossless process, and can lead to unexpected results in some cases.

You might be surprised that some of your PDFs are being transformed into searchable documents, while others are not. That’s because there are different types of PDFs.

The easiest way to tell if a particular PDF document supports full text extraction is to try to copy and paste from the document. If this works, Workplace Search can extract the file’s content.

If you cannot select the text, the PDF is actually an image. You will have to use a third-party OCR (optical character recognition) engine to scan the image for text, and then ingest the result via a custom source. The accuracy of this process varies with the quality of the image and the font used.

Thumbnail extraction

Workplace Search provides document thumbnail previews for certain file types, to help you quickly find exactly what you need.

The Workplace Search thumbnail extractor supports specific media or MIME types:

  • image/gif
  • image/jpeg
  • image/png

Thumbnail generation can be quite memory-intensive, and requires at least 2GB of JVM heap to run. Even then, thumbnail generation may be suspended if the available heap becomes insufficient.

If document thumbnails are missing, a good first step is to increase the RAM available to your server.

Content extraction from binary formats

Searching the content of binary formats (e.g., images, audio, videos) is currently not supported. You can use a third-party content extractor and add the extracted content via a custom source.

Content post-processing

Since 8.3.0, Workplace Search has used an Elasticsearch ingest pipeline to power post-processing of all documents. This pipeline is named ent_search_connector, and is automatically created when Enterprise Search first starts. As a final step in the full and incremental sync jobs, every document extracted by these connectors is sent through this ingest pipeline to clean up and compress the content of the body field. This primarily involves removing Unicode replacement characters (�) and collapsing large sections of whitespace, which helps to ensure a more meaningful display of your documents in your search experience. You can view and update this pipeline in Kibana or with Elasticsearch APIs.
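
The kind of cleanup described here can be approximated in a few lines of Python. This is a sketch of the idea only, not the contents of the ent_search_connector pipeline: the function name is hypothetical, and collapsing every whitespace run (including newlines) to a single space is an assumption about what "collapsing" means.

```python
import re

REPLACEMENT_CHAR = "\ufffd"  # the Unicode replacement character
_WHITESPACE_RUN = re.compile(r"\s+")


def clean_body(body: str) -> str:
    """Drop replacement characters, then collapse runs of whitespace."""
    without_replacement = body.replace(REPLACEMENT_CHAR, "")
    return _WHITESPACE_RUN.sub(" ", without_replacement).strip()
```

Running a sample of extracted text through a function like this gives a rough preview of how a body field might look after post-processing.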

If you make changes to the default ent_search_connector ingest pipeline, it will not be overwritten when you upgrade Enterprise Search.

Therefore, after upgrading, we recommend comparing your customized pipeline with the default ent_search_connector pipeline that comes with the newer version, to determine whether you need to incorporate the latest changes.
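
Once you have both pipeline definitions as JSON, a line-by-line diff makes the comparison easier. The sketch below uses only the Python standard library. The function name is hypothetical, and the two pipeline bodies are made-up placeholders for illustration, not the real ent_search_connector definition.

```python
import difflib
import json


def diff_pipelines(default: dict, customized: dict) -> list:
    """Return unified-diff lines between two pipeline definitions."""
    def render(pipeline: dict) -> list:
        # Sorted keys keep the diff stable regardless of key order.
        return json.dumps(pipeline, indent=2, sort_keys=True).splitlines()

    return list(difflib.unified_diff(
        render(default), render(customized),
        fromfile="default", tofile="customized", lineterm="",
    ))


# Hypothetical pipeline bodies, for illustration only.
default_pipeline = {
    "description": "example default",
    "processors": [
        {"gsub": {"field": "body", "pattern": "\ufffd", "replacement": ""}},
    ],
}
customized_pipeline = {
    "description": "example default",
    "processors": [
        {"gsub": {"field": "body", "pattern": "\ufffd", "replacement": ""}},
        {"trim": {"field": "body"}},
    ],
}

for line in diff_pipelines(default_pipeline, customized_pipeline):
    print(line)
```

An empty result means the two definitions are identical. Lines prefixed with `+` or `-` show where your customized pipeline differs from the default.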