Aryn is an AI-powered document parsing and ETL system for complex, unstructured data such as PDFs, HTML, and presentations. It can process 30+ file formats and accurately extract tables, images, and other document elements. You can use Aryn to chunk documents, extract metadata, create vector embeddings, and load your Elasticsearch vector and keyword indexes with high-quality data.

Aryn’s document ETL system has two components:

  • Aryn DocParse is a service for segmenting and labeling documents, running optical character recognition (OCR), and extracting tables and images. It can return the structured output of each document in JSON or Markdown, and it provides labeled bounding boxes for titles, tables, table rows and columns, images, and regular text. DocParse can process over 30 document formats, including PDF, Microsoft Word, Microsoft PowerPoint, and text. It leverages the Aryn Partitioner and its state-of-the-art, open-source deep learning model trained on 80k+ enterprise documents. DocParse can be used in document ETL pipelines for GenAI apps or on its own for table extraction and document processing workflows (like in this video).
  • Aryn DocPrep is a tool for creating document ETL pipelines that process complex, unstructured data and load it into vector databases and hybrid search engines like Elasticsearch. These pipelines use DocParse for document partitioning, and DocPrep generates the pipeline code in Python using the open-source, scalable Sycamore document ETL library. DocPrep and Sycamore provide scalable and reliable loading of Elasticsearch indexes, and include a variety of powerful data transforms that leverage LLMs, including entity extraction, data cleaning, semantic operations, vector embedding generation, and data enrichment.
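
To make the flow from labeled elements to index-ready documents more concrete, here is a minimal Python sketch. It does not call DocParse, DocPrep, or Sycamore. It works on a small hand-written JSON sample whose field names (`elements`, `type`, `bbox`, `text`) are assumptions chosen for illustration, not the documented DocParse schema. The helpers `group_by_label` and `to_index_actions` and the index name `docs` are likewise hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like labeled partitioner output.
# The field names here are assumptions, not the documented DocParse schema.
SAMPLE_OUTPUT = json.dumps({
    "elements": [
        {"type": "Title", "bbox": [0.10, 0.05, 0.90, 0.10], "text": "Quarterly Report"},
        {"type": "Text", "bbox": [0.10, 0.12, 0.90, 0.30], "text": "Revenue grew in Q3."},
        {"type": "Table", "bbox": [0.10, 0.35, 0.90, 0.60], "text": "Region | Revenue"},
        {"type": "Image", "bbox": [0.10, 0.65, 0.50, 0.90], "text": ""},
    ]
})


def group_by_label(raw_json):
    """Group elements by label, keeping document order within each group."""
    grouped = defaultdict(list)
    for element in json.loads(raw_json)["elements"]:
        grouped[element["type"]].append(element)
    return dict(grouped)


def to_index_actions(elements, index_name):
    """Build one bulk-style action dict per element that carries text."""
    return [
        {
            "_index": index_name,
            "_source": {
                "label": element["type"],
                "text": element["text"],
                "bbox": element["bbox"],
            },
        }
        for element in elements
        if element["text"]  # skip elements with no text, such as the image here
    ]


grouped = group_by_label(SAMPLE_OUTPUT)
tables = grouped.get("Table", [])  # e.g. for a table-extraction-only workflow
actions = to_index_actions(json.loads(SAMPLE_OUTPUT)["elements"], "docs")
```

The action dicts use the `_index` and `_source` keys that the bulk helper in the Elasticsearch Python client accepts, so a list like `actions` could be passed to that helper once a client is configured. A real pipeline would also chunk the text and attach vector embeddings before loading, which this sketch leaves out.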
