Cleaning
Preparing data for an NLP model usually involves cleaning it before it reaches the model. Unwanted content in your output, for example, can degrade the quality of your NLP model. To help with this, the unstructured library includes cleaning functions that sanitize output before it is sent to downstream applications.
Some cleaning functions apply automatically. In the example in the Partition section, the output Philadelphia Eaglesâ\x80\x99 victory automatically gets converted to Philadelphia Eagles' victory in partition_html using the replace_unicode_quotes cleaning function. You can see how that works in the code snippet below:
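The conversion described above can be sketched with the standard library alone. This is a simplified stand-in, not the library's implementation: the name replace_quotes_sketch and its two-entry mapping are assumptions made for illustration. In the library itself the real function is expected to be importable from unstructured.cleaners.core, a path this section does not state.

```python
# Simplified stand-in for replace_unicode_quotes (hypothetical name and mapping).
# Each key is the UTF-8 encoding of a curly quote that was mis-read one byte
# per character; each value is the plain quote it should become.
MOJIBAKE_QUOTES = {
    "â\x80\x99": "'",  # bytes E2 80 99, i.e. U+2019 right single quote
    "â\x80\x98": "'",  # bytes E2 80 98, i.e. U+2018 left single quote
}


def replace_quotes_sketch(text: str) -> str:
    """Replace each mis-decoded quote sequence with a plain apostrophe."""
    for bad, good in MOJIBAKE_QUOTES.items():
        text = text.replace(bad, good)
    return text


cleaned = replace_quotes_sketch("Philadelphia Eaglesâ\x80\x99 victory")
print(cleaned)  # Philadelphia Eagles' victory
```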
Document elements in unstructured include an apply method that allows you to apply text cleaning to the document element without instantiating a new element. The apply method expects a callable that takes a string as input and produces another string as output. In the example below, we invoke the replace_unicode_quotes cleaning function using the apply method.
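The in-place behavior of apply can be illustrated with a small stand-in class. TextElement and fix_apostrophe are hypothetical names invented for this sketch; the library's own element classes and cleaner are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TextElement:
    """Hypothetical stand-in for a document element; not the library's class."""

    text: str

    def apply(self, cleaner: Callable[[str], str]) -> None:
        # Run the cleaner and store the result on this same object,
        # so no new element is created.
        cleaned = cleaner(self.text)
        if not isinstance(cleaned, str):
            raise ValueError("A cleaning function must return a string.")
        self.text = cleaned


def fix_apostrophe(text: str) -> str:
    # Illustrative cleaner: repair one mis-decoded right single quote.
    return text.replace("â\x80\x99", "'")


element = TextElement("Philadelphia Eaglesâ\x80\x99 victory")
element.apply(fix_apostrophe)
print(element.text)  # Philadelphia Eagles' victory
```

The element keeps its identity across the call; only its text attribute changes.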
Since a cleaning function is just a str -> str function, users can also easily include their own cleaning functions for custom data preparation tasks. In the example below, we remove citations from a section of text.
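A custom cleaner of this kind can be as small as one regular expression. The sketch below assumes citations look like bracketed numbers of up to three digits, such as [1] or [123]; the function name and the sample text are illustrative.

```python
import re


def remove_citations(text: str) -> str:
    """Drop bracketed numeric citations such as [1] or [123]."""
    return re.sub(r"\[\d{1,3}\]", "", text)


sample = "Cleaning improves model input.[1] Custom functions are easy to add.[12]"
print(remove_citations(sample))
# Cleaning improves model input. Custom functions are easy to add.
```

Because the function takes a string and returns a string, it can be passed to an element's apply method like any built-in cleaner.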
See below for a full list of cleaning functions in the unstructured library.
bytes_string_to_string
Converts an output string that looks like a byte string to a string using the specified encoding. This happens sometimes in partition_html when there is a character like an emoji that isn’t expected by the HTML parser. In that case, the encoded bytes get processed.
Examples:
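A minimal sketch of the idea, assuming each character of the garbled string holds exactly one raw byte value (0 to 255). The name bytes_string_sketch is hypothetical and the code is a stand-in rather than the library's source.

```python
def bytes_string_sketch(text: str, encoding: str = "utf-8") -> str:
    """Reinterpret a string whose characters are really raw bytes."""
    # Turn each character's code point back into a single byte,
    # then decode the byte sequence with the intended encoding.
    raw = bytes(ord(char) for char in text)
    return raw.decode(encoding)


fixed = bytes_string_sketch("Hello ð\x9f\x98\x80")
print(fixed)  # Hello 😀
```

The four characters after "Hello " carry the byte values F0 9F 98 80, which is the UTF-8 encoding of U+1F600.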
For more information about the bytes_string_to_string function, you can check the source code here.
clean
Cleans a section of text with options including removing bullets, extra whitespace, dashes and trailing punctuation. Optionally, you can choose to lowercase the output.
Options:
- Applies clean_bullets if bullets=True.
- Applies clean_extra_whitespace if extra_whitespace=True.
- Applies clean_dashes if dashes=True.
- Applies clean_trailing_punctuation if trailing_punctuation=True.
- Lowercases the output if lowercase=True.
Examples:
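The flag-driven behavior listed above can be sketched as follows. The keyword names mirror the options in this section, but clean_sketch itself and the simplified regular expressions inside it are assumptions, not the library's code.

```python
import re


def clean_sketch(
    text: str,
    extra_whitespace: bool = False,
    dashes: bool = False,
    bullets: bool = False,
    trailing_punctuation: bool = False,
    lowercase: bool = False,
) -> str:
    """Illustrative stand-in for clean(); each flag enables one simplified step."""
    cleaned = text.lower() if lowercase else text
    if trailing_punctuation:
        cleaned = cleaned.strip().rstrip(".,:;")
    if dashes:
        cleaned = re.sub(r"[-\u2013]", " ", cleaned)
    if extra_whitespace:
        cleaned = re.sub(r"\s+", " ", cleaned)
    if bullets:
        # Only a bullet at the very start of the text is removed.
        cleaned = re.sub(r"^\s*[\u2022\u25cf\-\*]\s+", "", cleaned)
    return cleaned.strip()


print(clean_sketch("● An excellent point!", bullets=True, lowercase=True))
# an excellent point!
print(clean_sketch("ITEM 1A:     PROPERTIES", extra_whitespace=True))
# ITEM 1A: PROPERTIES
```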
For more information about the clean function, you can check the source code here.
clean_bullets
Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the text are not removed.
Examples:
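A rough sketch of leading-bullet removal. The set of bullet characters in BULLET_PATTERN is an assumption chosen for the example; the library's pattern may cover more characters.

```python
import re

# Assumed bullet characters: bullet, black circle, white circle,
# small black square, hyphen, and asterisk.
BULLET_PATTERN = re.compile(r"^\s*[\u2022\u25cf\u25cb\u25aa\-\*]\s+")


def clean_bullets_sketch(text: str) -> str:
    """Remove one bullet at the start of the text; leave other bullets alone."""
    return BULLET_PATTERN.sub("", text, count=1)


print(clean_bullets_sketch("● An excellent point!"))   # An excellent point!
print(clean_bullets_sketch("I love Morse Code! ●●●"))  # I love Morse Code! ●●●
```

The second call returns its input unchanged because the pattern is anchored to the start of the string.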
For more information about the clean_bullets function, you can check the source code here.
clean_dashes
Removes dashes from a section of text. Also handles special characters such as \u2013.
Examples:
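One way to picture this, assuming dashes are replaced by spaces and the result is stripped (an assumption of this sketch; the hypothetical clean_dashes_sketch is not the library's code):

```python
import re


def clean_dashes_sketch(text: str) -> str:
    """Replace ASCII hyphens and en dashes (\u2013) with spaces, then strip."""
    return re.sub(r"[-\u2013]", " ", text).strip()


print(clean_dashes_sketch("ITEM 1A: RISK-FACTORS\u2013"))  # ITEM 1A: RISK FACTORS
```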
For more information about the clean_dashes function, you can check the source code here.
clean_non_ascii_chars
Removes non-ASCII characters from a string.
Examples:
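A standard-library sketch of the same effect, using an ASCII encode that ignores anything it cannot represent. The function name is hypothetical.

```python
def clean_non_ascii_sketch(text: str) -> str:
    """Drop every character outside the ASCII range."""
    # Encoding with errors="ignore" silently discards non-ASCII characters.
    return text.encode("ascii", "ignore").decode("ascii")


print(clean_non_ascii_sketch("\x88This text contains®non-ascii characters!●"))
# This text containsnon-ascii characters!
```

Characters are removed, not replaced, so words on either side of a dropped symbol run together.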
For more information about the clean_non_ascii_chars function, you can check the source code here.
clean_ordered_bullets
Removes alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:
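The sketch below approximates this with a single regular expression. It assumes each level of the bullet is one or two alphanumeric characters, that levels are separated by periods, and that the bullet is followed by whitespace; the library's actual rules may differ.

```python
import re

# One to three dot-separated levels (e.g. "1.", "1.1", "a.b.c"), with at least
# one period in the leading token, followed by whitespace.
ORDERED_BULLET = re.compile(
    r"^(?=\S*\.)[A-Za-z0-9]{1,2}(?:\.[A-Za-z0-9]{1,2}){0,2}\.?\s+"
)


def clean_ordered_bullets_sketch(text: str) -> str:
    """Strip a leading ordered bullet such as '1.1 ' or 'a.b '."""
    return ORDERED_BULLET.sub("", text, count=1)


print(clean_ordered_bullets_sketch("1.1 This is a very important point"))
# This is a very important point
print(clean_ordered_bullets_sketch("a.b This is a very important point"))
# This is a very important point
```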
For more information about the clean_ordered_bullets function, you can check the source code here.
clean_postfix
Removes the postfix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips trailing whitespace if strip is set to True. The default is True.
Examples:
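A minimal sketch of pattern-based postfix removal. The ignore_case and strip keywords and their defaults follow the options in this section; the function name and the use of re.sub are assumptions.

```python
import re


def clean_postfix_sketch(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Remove a trailing match of `pattern`, then optionally trim whitespace."""
    flags = re.IGNORECASE if ignore_case else 0
    # The non-capturing group keeps "$" bound to the whole pattern,
    # even when the pattern contains alternations.
    cleaned = re.sub(rf"(?:{pattern})$", "", text, flags=flags)
    return cleaned.rstrip() if strip else cleaned


print(clean_postfix_sketch("The end! END", r"(END|STOP)", ignore_case=True))
# The end!
```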
For more information about the clean_postfix function, you can check the source code here.
clean_prefix
Removes the prefix from a string if it matches a specified pattern.
Options:
- Ignores case if ignore_case is set to True. The default is False.
- Strips leading whitespace if strip is set to True. The default is True.
Examples:
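Prefix removal can be sketched the same way, anchored at the start of the string. Only the ignore_case and strip keywords come from this section; everything else in the sketch is an assumption.

```python
import re


def clean_prefix_sketch(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Remove a leading match of `pattern`, then optionally trim whitespace."""
    flags = re.IGNORECASE if ignore_case else 0
    # "^" anchors the whole grouped pattern to the start of the text.
    cleaned = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return cleaned.lstrip() if strip else cleaned


text = "SUMMARY: This is the best summary of all time!"
print(clean_prefix_sketch(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True))
# This is the best summary of all time!
```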
For more information about the clean_prefix function, you can check the source code here.
clean_trailing_punctuation
Removes trailing punctuation from a section of text.
Examples:
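A small sketch, assuming "trailing punctuation" means periods, commas, colons, and semicolons at the end of the text. That character set is an assumption of this example.

```python
def clean_trailing_punctuation_sketch(text: str) -> str:
    """Trim surrounding whitespace, then drop trailing . , : ; characters."""
    return text.strip().rstrip(".,:;")


print(clean_trailing_punctuation_sketch("ITEM 1A: RISK FACTORS."))
# ITEM 1A: RISK FACTORS
```

The colon inside the text is kept because only characters at the very end are removed.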
For more information about the clean_trailing_punctuation function, you can check the source code here.
group_broken_paragraphs
Groups together paragraphs that are broken up with line breaks for visual or formatting purposes. This is common in .txt files. By default, group_broken_paragraphs groups together lines split by \n. You can change that behavior with the line_split kwarg. The function considers \n\n to be a paragraph break by default. You can change that behavior with the paragraph_split kwarg.
Examples:
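The grouping logic can be sketched as two passes: split on paragraph breaks, then rejoin the lines inside each paragraph. The line_split and paragraph_split keyword names come from this section; passing them as compiled regular expressions, and the default patterns used here, are assumptions of the sketch.

```python
import re

# Assumed defaults: a single newline breaks a line, a blank line breaks a paragraph.
LINE_SPLIT = re.compile(r"\s*\n\s*")
PARAGRAPH_SPLIT = re.compile(r"\s*\n\s*\n\s*")


def group_paragraphs_sketch(
    text: str,
    line_split: re.Pattern = LINE_SPLIT,
    paragraph_split: re.Pattern = PARAGRAPH_SPLIT,
) -> str:
    """Join visually wrapped lines while keeping real paragraph breaks."""
    paragraphs = paragraph_split.split(text)
    joined = [line_split.sub(" ", p.strip()) for p in paragraphs if p.strip()]
    return "\n\n".join(joined)


text = (
    "The big brown fox\n"
    "was walking down the lane.\n"
    "\n"
    "At the end of the lane, the\n"
    "fox met a bear."
)
print(group_paragraphs_sketch(text))
```

The output has two paragraphs separated by a blank line, each on a single line.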
For more information about the group_broken_paragraphs function, you can check the source code here.
remove_punctuation
Removes ASCII and unicode punctuation from a string.
Examples:
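One way to cover both ASCII and unicode punctuation is to filter on Unicode general categories, all of which start with "P" for punctuation. The sketch below takes that approach; whether the library does the same is not stated in this section.

```python
import unicodedata


def remove_punctuation_sketch(text: str) -> str:
    """Drop every character whose Unicode category is a punctuation class."""
    return "".join(
        char for char in text if not unicodedata.category(char).startswith("P")
    )


print(remove_punctuation_sketch("\u201cA lovely quote!\u201d"))  # A lovely quote
```

Symbols such as currency signs belong to the "S" categories and are left in place by this filter.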
For more information about the remove_punctuation function, you can check the source code here.
replace_unicode_quotes
Replaces unicode quote characters such as \x91 in strings.
Examples:
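Characters such as \x91 are control codes that some legacy encodings use for curly quotes. A minimal sketch of the replacement follows; the table covers only four such codes, and their curly-quote targets are an assumption of this example.

```python
# Assumed mapping from legacy quote control codes to proper curly quotes.
LEGACY_QUOTES = str.maketrans(
    {
        "\x91": "\u2018",  # left single quotation mark
        "\x92": "\u2019",  # right single quotation mark
        "\x93": "\u201c",  # left double quotation mark
        "\x94": "\u201d",  # right double quotation mark
    }
)


def replace_legacy_quotes_sketch(text: str) -> str:
    """Swap legacy quote control codes for their curly-quote characters."""
    return text.translate(LEGACY_QUOTES)


print(replace_legacy_quotes_sketch("\x93A lovely quote!\x94"))  # “A lovely quote!”
```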
For more information about the replace_unicode_quotes function, you can check the source code here.
translate_text
The translate_text cleaning function translates text between languages. translate_text uses the Helsinki NLP MT models from transformers for machine translation. It works for Russian, Chinese, Arabic, and many other languages.
Parameters:
- text: the input string to translate.
- source_lang: the two-letter language code for the source language of the text. If source_lang is not specified, the language will be detected using langdetect.
- target_lang: the two-letter language code for the target language for translation. Defaults to "en".
Examples:
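Running a real translation requires downloading a model, so the sketch below only illustrates how the parameters above could be resolved before a model is loaded. Everything in it is hypothetical: resolve_model_name is not a library function, the detect callable stands in for langdetect, and the "Helsinki-NLP/opus-mt-{source}-{target}" naming pattern is an assumption about how the Helsinki NLP models are published.

```python
from typing import Callable, Optional


def resolve_model_name(
    text: str,
    source_lang: Optional[str] = None,
    target_lang: str = "en",
    detect: Optional[Callable[[str], str]] = None,
) -> str:
    """Pick a translation model name from two-letter language codes."""
    if source_lang is None:
        # Fall back to language detection when no source language is given.
        if detect is None:
            raise ValueError("Provide source_lang or a detect callable.")
        source_lang = detect(text)
    # Assumed naming pattern for the translation models.
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


print(resolve_model_name("Привет", source_lang="ru"))
# Helsinki-NLP/opus-mt-ru-en
```

The default target of "en" matches the parameter description above; the detection step is only used when source_lang is omitted.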
For more information about the translate_text function, you can check the source code here.