Unstructured Open Source
The unstructured open source library is designed as a starting point for quick prototyping and has limits. For production scenarios, see Unstructured API services and Unstructured Platform.
The unstructured library offers an open-source toolkit designed to simplify the ingestion and pre-processing of diverse data formats, including images and text-based documents such as PDFs, HTML files, Word documents, and more. With a focus on optimizing data workflows for Large Language Models (LLMs), unstructured provides modular functions and connectors that work seamlessly together. This cohesive system ensures efficient transformation of unstructured data into structured formats, while also offering adaptability to various platforms and use cases.
Key functionality
- Precise Document Extraction: Unstructured extracts elements and metadata from documents, covering a variety of document element types and metadata fields. Learn more about Document elements and metadata.
- Extensive File Support: The platform supports a wide array of file types, including PDFs, images, HTML, and many more. Detailed information is available in the list of supported file types.
- Robust Core Functionality: Unstructured provides a suite of core functionalities critical for efficient data processing. This includes:
  - Partitioning: The partitioning functions in Unstructured extract structured content from raw, unstructured documents. This step turns unorganized data into a usable format for downstream processing and analysis.
  - Cleaning: Data preparation for NLP models often requires cleaning to ensure quality. The Unstructured library includes cleaning functions that sanitize output and remove unwanted content, which can improve the performance of NLP models. This step helps maintain the integrity of data before it is passed to downstream applications.
  - Extracting: This functionality extracts specific entities from documents. It identifies and isolates relevant pieces of information, making it easier to focus on the most pertinent data in a document.
  - Staging: Staging functions help prepare your data for ingestion into downstream systems. Please note that this functionality is being deprecated in favor of Destination Connectors.
  - Chunking: The chunking process in Unstructured is distinct from conventional methods. Instead of relying solely on text-based features to form chunks, Unstructured uses a deep understanding of document formats to partition documents into semantic units (document elements).
- High-Performance Connectors: The platform includes optimized connectors for efficient data ingestion and output. These comprise Source Connectors for data input and Destination Connectors for data export.
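As a rough illustration of how partitioning, cleaning, and chunking fit together, here is a toy sketch in plain Python. It does not use the library's own API: the `Element` class, the `partition_toy`, `clean_toy`, and `chunk_by_title_toy` functions, and the title heuristic are all hypothetical stand-ins invented for this example. It only shows the idea of forming chunks from document elements rather than from raw character counts.

```python
import re
from dataclasses import dataclass


@dataclass
class Element:
    """A toy document element: a category label plus its text."""
    category: str  # "Title" or "NarrativeText" in this sketch
    text: str


def partition_toy(raw: str) -> list[Element]:
    """Split raw text into elements, one per blank-line-separated block.

    Heuristic (an assumption of this sketch): a short single-line block
    that does not end with a period is treated as a title.
    """
    elements = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        block = block.strip()
        if not block:
            continue
        is_title = "\n" not in block and len(block) < 60 and not block.endswith(".")
        elements.append(Element("Title" if is_title else "NarrativeText", block))
    return elements


def clean_toy(element: Element) -> Element:
    """Collapse runs of whitespace, a stand-in for a cleaning step."""
    return Element(element.category, re.sub(r"\s+", " ", element.text).strip())


def chunk_by_title_toy(elements: list[Element], max_chars: int = 500) -> list[str]:
    """Group elements into chunks, starting a new chunk at each title
    or when adding the next element would exceed max_chars."""
    chunks: list[str] = []
    current: list[str] = []
    for el in elements:
        too_long = len("\n".join(current + [el.text])) > max_chars
        if current and (el.category == "Title" or too_long):
            chunks.append("\n".join(current))
            current = []
        current.append(el.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


raw = """Introduction

Unstructured   data comes in
many formats.

Methods

We partition, clean, and chunk."""

elements = [clean_toy(el) for el in partition_toy(raw)]
chunks = chunk_by_title_toy(elements)
for chunk in chunks:
    print(repr(chunk))
```

Because chunk boundaries follow the element categories, each section title stays attached to the text beneath it instead of being split at an arbitrary character offset.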
Common use cases
- Pretraining models
- Fine-tuning models
- Retrieval Augmented Generation (RAG)
- Traditional ETL
Limits
The open source library has the following limits as compared to Unstructured API services and the Unstructured Platform:
- Not designed for production scenarios.
- Significantly decreased performance on document and table extraction.
- Access only to older and less sophisticated vision transformer models.
- No access to Unstructured’s fine-tuned OCR models.
- No access to Unstructured’s by-page and by-similarity chunking strategies.
- Lack of security, SOC 2, and HIPAA compliance.
- No authentication or identity management.
- No incremental data loading.
- No ETL job scheduling or monitoring.
- No image extraction from documents.
- Less sophisticated document hierarchy detection.
- You must manage many of your own code dependencies, such as Poppler and Tesseract.
- You must manage your own infrastructure, including parallelization and other performance optimizations.
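To give a sense of what managing your own parallelization can involve, here is a minimal sketch that fans per-file work out over a thread pool using only the Python standard library. The `process_document` function is a hypothetical placeholder for whatever per-file processing you run; it is not part of the library. Threads keep the sketch simple, and CPU-heavy work would more likely call for a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def process_document(path: str) -> dict:
    """Hypothetical stand-in for per-file work (partitioning, cleaning, ...)."""
    return {"path": path, "status": "done"}


def process_all(paths: list[str], max_workers: int = 4) -> list[dict]:
    """Run process_document over all paths concurrently.

    Executor.map returns results in the same order as the inputs,
    regardless of which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_document, paths))


paths = ["a.pdf", "report.docx", "index.html"]
results = process_all(paths)
for result in results:
    print(result)
```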
Telemetry
The open source library allows you to make calls to Unstructured Serverless API services. If you do plan to make such calls, please note:
We’ve partnered with Scarf to collect anonymized user statistics to understand which features our community is using and how to prioritize product decision-making in the future.
To learn more about how we collect and use this data, please read our Privacy Policy.
To opt out of this data collection, you can set the environment variable SCARF_NO_ANALYTICS=true before running any commands that call Unstructured Serverless API services.
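If the calls are made from a Python script rather than a shell, one way to apply the opt-out is to set the variable at the top of the script. This is a sketch that assumes the variable is read when the library is imported or first used, so the assignment is placed before any such import or call.

```python
import os

# Set the opt-out flag before importing or calling anything that
# reaches Unstructured Serverless API services.
os.environ["SCARF_NO_ANALYTICS"] = "true"

# Child processes started from this script inherit the setting too.
print(os.environ["SCARF_NO_ANALYTICS"])
```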