Choose a partitioning strategy
Task
You want to use the fastest, highest-precision, yet most cost-effective partitioning strategy overall, but you are not sure which one to specify for your particular files.
Approach
The --strategy
command option (CLI) or strategy
parameter (Python/JavaScript/TypeScript) specifies the partitioning strategy to use.
Use the following decision-maker to help you determine what to set --strategy
or strategy
to.
How many of the files to be processed are image files, or PDF files with embedded images or tables in them?
- If All, you can increase precision for processing images and tables by specifying
hi_res
for--strategy
orstrategy
. For additional benefits, skip ahead to Step 2. - If Some, you can increase precision for processing images and tables by specifying
auto
for--strategy
orstrategy
. This will leave the decision up to Unstructured on a file-by-file basis about the best partitioning strategy to use. You have completed this decision-maker. See also Auto partitioning strategy logic. - If None, you can increase performance and decrease cost by specifying
fast
for--strategy
orstrategy
. You have completed this decision-maker.
--strategy
or strategy
to ocr_only
. Unstructured will use the Tesseract OCR agent as applicable. You cannot specify any other OCR agent to be used instead.For object detection, do you know which particular high-resolution object detection model you want to use?
- If No, then specify
auto
for--strategy
orstrategy
. This will decrease performance compared to answering Yes. You have completed this decision-maker. See also Auto partitioning strategy logic. - If Yes, then set
--hi-res-model-name
(CLI),hi_res_model_name
(Python), orhiResModelName
(JavaScript/TypeScript) to the available model’s name. Learn about the available models.
Code example
See Changing partition strategy for a PDF.
Auto partitioning strategy logic
Setting --strategy
or strategy
to auto
leaves the decision up to Unstructured on a file-by-file basis about which partitioning strategy to use. Specifically:
-
If the file is an image, the
hi_res
strategy is used for that file. Thelayout_v1.0.0
high-resolution object detection model is used. -
If the file is a PDF, the local processing logic or Unstructured tries to detect whether there are any embedded tables or images in that file.
- If no embedded tables or images are detected, the
fast
strategy is used for that file. No high-resolution object detection model is used. - If at least one embedded table or image is found, the
hi_res
strategy is used for that file. Thelayout_v1.0.0
high-resolution object detection model is used.
- If no embedded tables or images are detected, the
-
If
--strategy
orstrategy
is not specified, theauto
strategy is used by default.