Aryn is an AI-powered document parsing and ETL system for complex, unstructured data such as PDFs, HTML, and presentations. It can process 30+ file formats and extract tables, images, and other elements with high quality. You can use Aryn to chunk documents, extract metadata, create vector embeddings, and load high-quality data into your Elasticsearch vector and keyword indexes.
Aryn’s document ETL system has two components:
- Aryn DocParse is a service for segmenting and labeling documents, running optical character recognition (OCR), and extracting tables and images. It can return the structured output of each document in JSON or Markdown, and it provides labeled bounding boxes for titles, tables, table rows and columns, images, and regular text. DocParse can process over 30 document formats, including PDFs, Microsoft Word, Microsoft PowerPoint, and text. It leverages the Aryn Partitioner and its state-of-the-art, open-source deep learning AI model trained on 80k+ enterprise documents. DocParse can be used in document ETL pipelines for GenAI apps or on its own for table extraction and document processing workflows (like in this video).
- Aryn DocPrep is a tool for creating document ETL pipelines that process complex, unstructured data and load it into vector databases and hybrid search engines like Elasticsearch. DocPrep pipelines use DocParse for document partitioning, and DocPrep generates the ETL pipeline code in Python using the open-source, scalable Sycamore document ETL library. DocPrep and Sycamore provide scalable and reliable loading of Elasticsearch indexes, and include a variety of powerful data transforms that leverage LLMs, including entity extraction, data cleaning, semantic operations, vector embedding generation, and data enrichment.
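
To make the parse, chunk, and load flow described above more concrete, here is a minimal sketch in plain Python. It does not call Aryn, Sycamore, or Elasticsearch. The element fields (`type`, `page`, `bbox`, `text`), the sample data, and the helper names `chunk_by_title` and `to_bulk_actions` are all assumptions made for illustration, and the actual DocParse JSON schema and the code DocPrep generates may look different.

```python
from typing import Any

# Hypothetical parsed elements, loosely modeled on "labeled bounding boxes"
# for titles, tables, and text. The real DocParse output schema may differ.
SAMPLE_ELEMENTS: list[dict[str, Any]] = [
    {"type": "Title", "page": 1, "bbox": [0.10, 0.05, 0.90, 0.10], "text": "Quarterly Report"},
    {"type": "Text", "page": 1, "bbox": [0.10, 0.12, 0.90, 0.30], "text": "Revenue grew in Q3."},
    {"type": "Table", "page": 1, "bbox": [0.10, 0.35, 0.90, 0.60], "text": "Region | Revenue\nEMEA | 10"},
    {"type": "Title", "page": 2, "bbox": [0.10, 0.05, 0.90, 0.10], "text": "Outlook"},
    {"type": "Text", "page": 2, "bbox": [0.10, 0.12, 0.90, 0.30], "text": "We expect steady demand."},
]


def chunk_by_title(elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group elements into chunks, starting a new chunk at each Title element."""
    chunks: list[dict[str, Any]] = []
    for el in elements:
        if el["type"] == "Title":
            chunks.append({"title": el["text"], "parts": [], "pages": {el["page"]}})
            continue
        if not chunks:
            # Content that appears before the first title gets an untitled chunk.
            chunks.append({"title": "", "parts": [], "pages": set()})
        chunks[-1]["parts"].append(el["text"])
        chunks[-1]["pages"].add(el["page"])
    return chunks


def to_bulk_actions(chunks: list[dict[str, Any]], index_name: str) -> list[dict[str, Any]]:
    """Turn chunks into bulk-style index actions (one document per chunk)."""
    return [
        {
            "_index": index_name,
            "_id": f"chunk-{i}",
            "_source": {
                "title": chunk["title"],
                "text": "\n".join(chunk["parts"]),
                "pages": sorted(chunk["pages"]),
            },
        }
        for i, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    # "aryn-docs" is a placeholder index name.
    for action in to_bulk_actions(chunk_by_title(SAMPLE_ELEMENTS), "aryn-docs"):
        print(action["_id"], action["_source"]["title"], action["_source"]["pages"])
```

The resulting dictionaries use the `_index`, `_id`, and `_source` keys that the Python Elasticsearch client's `helpers.bulk` accepts as index actions. Embedding generation and LLM-based transforms are left out of this sketch.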