Chunking strategies
Chunking functions use metadata and document elements detected with partition functions to split a document into appropriately sized chunks for use cases such as Retrieval Augmented Generation (RAG).
If you are familiar with chunking methods that split long text documents into smaller chunks, you’ll notice that Unstructured methods differ slightly, since the partitioning step already divides an entire document into its structural elements.
Individual elements will only be split if they exceed the desired maximum chunk size. Two or more consecutive text elements that will together fit within max_characters will be combined. After chunking, you will only have elements of the following types:
- CompositeElement: Any text element will become a CompositeElement after chunking. A composite element can be a combination of two or more original text elements that together fit within the maximum chunk size. It can also be a single element that doesn’t leave room in the chunk for any others but fits by itself. Or it can be a fragment of an original text element that was too big to fit in one chunk and required splitting.
- Table: A table element is not combined with other elements, and if it fits within max_characters it will remain as is.
- TableChunk: Large tables that exceed the max_characters chunk size are split into special TableChunk elements.
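The mapping from input elements to the three output types can be summarized in a few lines. This is only an illustrative sketch: `chunk_output_type` is a hypothetical helper, not part of the Unstructured library, and it models an element by nothing more than a table flag and its text length.

```python
def chunk_output_type(is_table: bool, text_length: int, max_characters: int) -> str:
    """Return the name of the element type a chunk would have after chunking.

    Hypothetical helper for illustration only; not a library function.
    """
    if not is_table:
        # Every text element ends up as (part of) a CompositeElement,
        # whether it was combined, kept alone, or split.
        return "CompositeElement"
    # Tables are never combined: they stay as Table when they fit,
    # and are split into TableChunk pieces when they do not.
    return "Table" if text_length <= max_characters else "TableChunk"


print(chunk_output_type(is_table=False, text_length=40, max_characters=500))
print(chunk_output_type(is_table=True, text_length=40, max_characters=500))
print(chunk_output_type(is_table=True, text_length=900, max_characters=500))
```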
“basic” chunking strategy
- The basic strategy combines sequential elements to maximally fill each chunk while respecting both the specified max_characters (hard-max) and new_after_n_chars (soft-max) option values.
- A single element that by itself exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text-splitting.
- A Table element is always isolated and never combined with another element. A Table can be oversized, like any other text element, and in that case is divided into two or more TableChunk elements using text-splitting.
- If specified, overlap is applied between chunks formed by splitting oversized elements and is also applied between other chunks when overlap_all is True.
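The behaviors above can be approximated with a short greedy loop. The following is a minimal sketch, not the library's implementation: `basic_chunks` and `split_oversized` are hypothetical names, elements are modeled as plain strings, tables and overlap_all are left out, and oversized text is cut at raw character offsets rather than at natural text boundaries. It also assumes the soft-max closes a chunk as soon as that chunk reaches new_after_n_chars.

```python
def split_oversized(text: str, max_characters: int, overlap: int = 0) -> list[str]:
    """Cut one oversized text into windows of at most max_characters.

    Each window after the first starts `overlap` characters before the end
    of the previous one. Simplification: cuts at raw character offsets.
    """
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    step = max_characters - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break
        start += step
    return pieces


def basic_chunks(texts, max_characters=500, new_after_n_chars=None,
                 overlap=0, separator="\n\n"):
    """Greedily combine consecutive texts into chunks (illustrative only)."""
    soft_max = max_characters if new_after_n_chars is None else min(
        new_after_n_chars, max_characters)
    chunks = []
    current = ""
    for text in texts:
        if len(text) > max_characters:
            # Oversized element: close the open chunk, isolate, then split.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_oversized(text, max_characters, overlap))
            continue
        candidate = text if not current else current + separator + text
        if current and len(candidate) > max_characters:
            # The next element does not fit: close the chunk, start a new one.
            chunks.append(current)
            current = text
        else:
            current = candidate
        if len(current) >= soft_max:
            # Soft-max reached: stop adding elements to this chunk.
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(basic_chunks(["aaaa", "bbbb", "c" * 12, "dd"], max_characters=10))
print(split_oversized("abcdefghij", max_characters=4, overlap=1))
```

In the first call, the two short texts fill one chunk exactly, the 12-character text is isolated and split, and the last text starts a chunk of its own. The second call shows how each split piece repeats the last character of the previous piece when overlap is 1.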
“by_title” chunking strategy
The by_title chunking strategy preserves section boundaries and optionally page boundaries as well. “Preserving” here means that a single chunk will never contain text that occurred in two different sections. When a new section starts, the existing chunk is closed and a new one started, even if the next element would fit in the prior chunk.
In addition to the behaviors of the basic strategy above, the by_title strategy has the following behaviors:
- Detect section headings. A Title element is considered to start a new section. When a Title element is encountered, the prior chunk is closed and a new chunk started, even if the Title element would fit in the prior chunk.
- Respect page boundaries. Page boundaries can optionally also be respected using the multipage_sections argument. This defaults to True, meaning that a page break does not start a new chunk. Setting this to False will separate elements that occur on different pages into distinct chunks.
- Combine small sections. In certain documents, partitioning may identify a list-item or other short paragraph as a Title element even though it does not serve as a section heading. This can produce chunks substantially smaller than desired. This behavior can be mitigated using the combine_text_under_n_chars argument. This defaults to the same value as max_characters such that sequential small sections are combined to maximally fill the chunking window. Setting this to 0 will disable section combining.
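The three behaviors above can be sketched as two small passes over the elements: one that cuts sections at Title elements (and optionally at page breaks), and one that merges small neighboring sections. This is an assumption-laden illustration, not the library's code: the `Element` dataclass, `split_sections`, `section_length`, and `combine_small_sections` are hypothetical names, and filling each resulting section up to max_characters is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str             # "Title" or "Text" (simplified stand-ins)
    text: str
    page_number: int = 1


def split_sections(elements, multipage_sections=True):
    """Start a new section at each Title, and at page breaks if requested."""
    sections = []
    for el in elements:
        starts_section = (
            not sections
            or el.kind == "Title"
            or (not multipage_sections
                and el.page_number != sections[-1][-1].page_number)
        )
        if starts_section:
            sections.append([el])
        else:
            sections[-1].append(el)
    return sections


def section_length(section, separator="\n\n"):
    return len(separator.join(el.text for el in section))


def combine_small_sections(sections, max_characters=500,
                           combine_text_under_n_chars=None,
                           multipage_sections=True):
    """Merge a section into the previous one while that one is still small."""
    limit = (max_characters if combine_text_under_n_chars is None
             else combine_text_under_n_chars)
    combined = []
    for section in sections:
        if combined:
            previous = combined[-1]
            merged = previous + section
            same_page_ok = (multipage_sections
                            or previous[-1].page_number == section[0].page_number)
            if (same_page_ok
                    and section_length(previous) < limit
                    and section_length(merged) <= max_characters):
                combined[-1] = merged
                continue
        combined.append(section)
    return combined


elements = [
    Element("Title", "Intro", 1),
    Element("Text", "Some text.", 1),
    Element("Title", "A", 1),       # e.g. a list item misread as a Title
    Element("Title", "B", 1),
    Element("Text", "More text.", 2),
]
sections = split_sections(elements)
for group in combine_small_sections(sections, max_characters=20):
    print([el.text for el in group])
```

With a limit of 0, the `section_length(previous) < limit` test can never pass, which mirrors how setting combine_text_under_n_chars to 0 disables combining.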
“by_page” chunking strategy
Only available in Unstructured API and Platform.
The by_page chunking strategy ensures that content from different pages does not end up in the same chunk. When a new page is detected, the existing chunk is completed and a new one is started, even if the next element would fit in the prior chunk.
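As a rough illustration of the page rule, the sketch below closes the open chunk whenever the page number changes. It is not the hosted implementation: `by_page_chunks` is a hypothetical function, elements are modeled as (page_number, text) pairs, and splitting of oversized elements is omitted.

```python
def by_page_chunks(elements, max_characters=500, separator="\n\n"):
    """Combine consecutive texts, never mixing pages (illustrative only).

    `elements` is an iterable of (page_number, text) pairs.
    Returns a list of (page_number, chunk_text) pairs.
    """
    chunks = []
    current = ""
    current_page = None
    for page_number, text in elements:
        candidate = text if not current else current + separator + text
        new_page = current_page is not None and page_number != current_page
        if current and (new_page or len(candidate) > max_characters):
            # Close the chunk on a page change or when the text does not fit.
            chunks.append((current_page, current))
            current = text
        else:
            current = candidate
        current_page = page_number
    if current:
        chunks.append((current_page, current))
    return chunks


pages = [(1, "First paragraph."), (1, "Second paragraph."), (2, "Third paragraph.")]
for page, chunk in by_page_chunks(pages, max_characters=100):
    print(page, repr(chunk))
```

Even though all three paragraphs would fit within 100 characters, the third one lands in its own chunk because it comes from page 2.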
“by_similarity” chunking strategy
Only available in Unstructured API and Platform.
The by_similarity chunking strategy employs the sentence-transformers/multi-qa-mpnet-base-dot-v1 embedding model to identify topically similar sequential elements and combine them into chunks.
As with other strategies, chunks will never exceed the hard-maximum chunk size set by max_characters. For this reason, not all elements that share a topic will necessarily appear in the same chunk. However, with this strategy you can guarantee that two elements with low similarity will not be combined in a single chunk.
You can control the level of topic similarity required between elements by setting the similarity_threshold parameter. similarity_threshold expects a value between 0.0 and 1.0 specifying the minimum similarity that text in consecutive elements must have to be included in the same chunk. The default is 0.5.
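The interplay between similarity_threshold and max_characters can be sketched with toy vectors in place of real embeddings. Everything here is an assumption made for illustration: `by_similarity_chunks` and `cosine_similarity` are hypothetical helpers, the hand-written two-dimensional vectors stand in for model output, and cosine similarity between consecutive elements is just one plausible way to score similarity; the source does not state which measure the service uses.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def by_similarity_chunks(items, max_characters=500, similarity_threshold=0.5,
                         separator="\n\n"):
    """Combine consecutive texts whose vectors are similar enough.

    `items` is an iterable of (text, vector) pairs; the vectors are
    placeholders for embeddings (assumption for this sketch).
    """
    chunks = []
    current = ""
    previous_vector = None
    for text, vector in items:
        candidate = text if not current else current + separator + text
        similar = (
            previous_vector is not None
            and cosine_similarity(previous_vector, vector) >= similarity_threshold
        )
        if current and similar and len(candidate) <= max_characters:
            current = candidate
        else:
            # Low similarity or no room left: close the chunk, start another.
            if current:
                chunks.append(current)
            current = text
        previous_vector = vector
    if current:
        chunks.append(current)
    return chunks


items = [
    ("cats purr", [1.0, 0.0]),
    ("cats nap", [0.9, 0.1]),
    ("stocks fell", [0.0, 1.0]),
]
print(by_similarity_chunks(items, max_characters=100))
print(by_similarity_chunks(items, max_characters=10))
```

The second call shows the hard maximum taking priority: the two cat sentences are similar, but together they exceed 10 characters, so they end up in separate chunks.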