Loaders
CAMEL introduced two IO modules, Base IO and Unstructured IO, which are designed for handling various file types and processing unstructured data.
The Base IO module focuses on fundamental input/output operations on files. It includes functionality for representing, reading, and processing different file formats.
The Unstructured IO module handles, parses, and processes unstructured data. It provides tools for parsing files or URLs, cleaning data, extracting specific information, staging elements for different platforms, and chunking elements. At its core are ETL capabilities that turn unstructured data into a form usable by applications such as Retrieval-Augmented Generation (RAG).
To get started with the Base IO module, you'll need to understand how to work with the File class and its subclasses. This module reads files of various formats, extracts their contents, and represents them as File objects, each tailored to a specific file type.
from io import BytesIO
from camel.loaders import read_file
# Read a pdf file from disk
with open("test.pdf", "rb") as file:
    file_content = BytesIO(file.read())
    file_content.name = "test.pdf"
# Use the read_file function to create an object based on the file extension
file_obj = read_file(file_content)
# Once you have the File object, you can access its content
print(file_obj.docs[0]["page_content"])
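The idea of choosing a File subclass from the file extension can be pictured with a small standard-library sketch. Everything in it (the SketchFile classes, the sketch_read_file helper, and the two supported extensions) is a hypothetical illustration and not CAMEL's implementation; it only mirrors the pattern of reading the name attached to a BytesIO object and exposing the content through docs.

```python
import json
from io import BytesIO


class SketchFile:
    """Hypothetical stand-in for a parsed file: a name plus a list of docs."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs

    @classmethod
    def from_bytes(cls, file):
        raise NotImplementedError


class SketchTxtFile(SketchFile):
    @classmethod
    def from_bytes(cls, file):
        text = file.read().decode("utf-8")
        return cls(file.name, [{"page_content": text}])


class SketchJsonFile(SketchFile):
    @classmethod
    def from_bytes(cls, file):
        data = json.loads(file.read().decode("utf-8"))
        # Re-serialize the JSON so every doc has the same text-based shape
        return cls(file.name, [{"page_content": json.dumps(data)}])


# Assumed mapping from lowercase extension to handler class
_HANDLERS = {".txt": SketchTxtFile, ".json": SketchJsonFile}


def sketch_read_file(file):
    """Pick a handler based on the extension of file.name."""
    name = file.name.lower()
    for ext, handler in _HANDLERS.items():
        if name.endswith(ext):
            return handler.from_bytes(file)
    raise NotImplementedError(f"Unsupported file type: {file.name}")


buffer = BytesIO(b"hello loaders")
buffer.name = "note.txt"
file_obj = sketch_read_file(buffer)
print(file_obj.docs[0]["page_content"])
```

Keeping the dispatch in a single mapping means supporting another format only requires a new subclass and one more entry in the table.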
To get started with the Unstructured IO module, first import it and create an instance. You can then use the instance for parsing, cleaning, and extracting data, and for integrating with cloud services such as AWS S3 and Azure. Here's a basic guide to help you begin:
Use parse_file_or_url to parse a file
from camel.loaders import UnstructuredIO
# Create an instance of UnstructuredIO
uio = UnstructuredIO()
elements = uio.parse_file_or_url("test.pdf")
content = "\n\n".join(str(el) for el in elements)
print(content)
Use clean_text_data to perform various text cleaning operations
# Set example dirty text
example_dirty_text = ("\x93Some dirty text ’ with extra spaces and – dashes.")
# Set clean options
options = [
    ('replace_unicode_quotes', {}),
    ('clean_dashes', {}),
    ('clean_non_ascii_chars', {}),
    ('clean_extra_whitespace', {}),
]
cleaned_text = uio.clean_text_data(text=example_dirty_text,
                                   clean_options=options)
print(cleaned_text)
>>> Some dirty text with extra spaces and dashes.
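The clean options above are a list of (name, keyword-arguments) pairs applied in order. As a rough sketch of that pattern, and assuming nothing about how CAMEL or Unstructured implement it internally, the following standard-library code dispatches each option name to a small cleaner function. The cleaner names and the sketch_clean helper are hypothetical.

```python
import re


def replace_quotes(text):
    """Swap curly quotes for straight ASCII quotes."""
    table = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
    return text.translate(str.maketrans(table))


def dashes_to_spaces(text):
    """Replace hyphens, en dashes, and em dashes with a single space."""
    return re.sub(r"[-\u2013\u2014]", " ", text)


def collapse_whitespace(text):
    """Squeeze runs of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical registry: option name -> cleaner function
SKETCH_CLEANERS = {
    "replace_quotes": replace_quotes,
    "dashes_to_spaces": dashes_to_spaces,
    "collapse_whitespace": collapse_whitespace,
}


def sketch_clean(text, clean_options):
    """Apply each (name, kwargs) option to the text, in the order given."""
    for name, kwargs in clean_options:
        if name not in SKETCH_CLEANERS:
            raise ValueError(f"Unknown cleaning option: {name}")
        text = SKETCH_CLEANERS[name](text, **kwargs)
    return text


dirty = "\u201cSome   text\u201d \u2013 with  extra spaces"
sketch_options = [
    ("replace_quotes", {}),
    ("dashes_to_spaces", {}),
    ("collapse_whitespace", {}),
]
cleaned = sketch_clean(dirty, sketch_options)
print(cleaned)
```

Order matters in a pipeline like this: collapsing whitespace last removes the extra spaces that the dash replacement leaves behind.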
Use extract_data_from_text to extract specific information from text
# Set example email to extract
example_email_text = ("Contact me at example@email.com.")
extracted_text = uio.extract_data_from_text(
    text=example_email_text, extract_type="extract_email_address")
print(extracted_text)
>>> ['example@email.com']
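For intuition, email extraction of this kind can be approximated with a regular expression. The pattern and the sketch_extract_emails helper below are simplified assumptions made for illustration; they are not the pattern the library uses and will not cover every valid address.

```python
import re

# Deliberately simple pattern: local part, "@", domain labels, then a TLD
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def sketch_extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_PATTERN.findall(text)


found = sketch_extract_emails("Contact me at example@email.com.")
print(found)
```

Because the pattern requires an alphabetic top-level domain at the end, the sentence-final period is left out of the match.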
Use parse_file_or_url to load and parse unstructured data from a file or URL
# Set example url
example_url = (
"https://www.cnn.com/2023/01/30/sport/empire-state-building-green-"
"philadelphia-eagles-spt-intl/index.html")
elements = uio.parse_file_or_url(example_url)
print(("\n\n".join([str(el) for el in elements])))
>>> The Empire State Building was lit in green and white to celebrate the Philadelphia Eagles’ victory in the NFC Championship game on Sunday – a decision that’s sparked a bit of a backlash in the Big Apple.
>>> The Eagles advanced to the Super Bowl for the first time since 2018 after defeating the San Francisco 49ers 31-7, and the Empire State Building later tweeted how it was marking the occasion.
>>> Fly @Eagles Fly! We’re going Green and White in honor of the Eagles NFC Championship Victory. pic.twitter.com/RNiwbCIkt7— Empire State Building (@EmpireStateBldg)
>>> January 29, 2023...
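Since the same method accepts either a local path or a URL, it can be useful to check up front which one you are passing. The helper below is a hypothetical sketch built on the standard library; it is not how CAMEL decides between the two, and treating only http and https as URLs is an assumption.

```python
from urllib.parse import urlparse


def sketch_is_url(source):
    """Return True if source looks like an http(s) URL rather than a path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


sources = [
    "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-"
    "philadelphia-eagles-spt-intl/index.html",
    "test.pdf",
]
for source in sources:
    kind = "URL" if sketch_is_url(source) else "file path"
    print(f"{kind}: {source}")
```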
Use chunk_elements to chunk elements
chunks = uio.chunk_elements(elements=elements, chunk_type="chunk_by_title")
for chunk in chunks:
    print(chunk)
    print("\n" + "-" * 80)
>>> The Empire State Building was lit in green and white to celebrate the Philadelphia Eagles’ victory in the NFC Championship game on Sunday – a decision that’s sparked a bit of a backlash in the Big Apple.
>>> The Eagles advanced to the Super Bowl for the first time since 2018 after defeating the San Francisco 49ers 31-7, and the Empire State Building later tweeted how it was marking the occasion.
>>> --------------------------------------------------------------------------------
>>> Fly @Eagles Fly! We’re going Green and White in honor of the Eagles NFC Championship Victory. pic.twitter.com/RNiwbCIkt7— Empire State Building (@EmpireStateBldg)
>>> --------------------------------------------------------------------------------
>>> January 29, 2023
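The general idea behind title-based chunking is to start a new chunk whenever a title element appears and to collect the text that follows it. The sketch below shows that idea with plain (category, text) tuples; the tuple representation and the sketch_chunk_by_title helper are assumptions for illustration, and the real chunker applies further rules (such as size limits) that are not modeled here.

```python
def sketch_chunk_by_title(elements):
    """Group (category, text) pairs into chunks that each begin at a title."""
    chunks = []
    current = []
    for category, text in elements:
        # A title closes the chunk in progress, if there is one
        if category == "Title" and current:
            chunks.append(current)
            current = []
        current.append(text)
    if current:
        chunks.append(current)
    return ["\n\n".join(chunk) for chunk in chunks]


sketch_elements = [
    ("Title", "Intro"),
    ("NarrativeText", "First paragraph."),
    ("NarrativeText", "Second paragraph."),
    ("Title", "Details"),
    ("NarrativeText", "Third paragraph."),
]
sketch_chunks = sketch_chunk_by_title(sketch_elements)
for chunk in sketch_chunks:
    print(chunk)
    print("\n" + "-" * 80)
```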
Use stage_elements to stage elements
staged_element = uio.stage_elements(elements=elements,
                                    stage_type="stage_for_baseplate")
print(staged_element)
>>> {'rows': [{'data': {'type': 'UncategorizedText', 'element_id': 'e78902d05b0cb1e4c38fc7a79db450d5', 'text': 'CNN\n \xa0—'}, 'metadata': {'filetype': 'text/html', 'languages': ['eng'], 'page_number': 1, 'url': 'https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html', 'emphasized_text_contents': ['CNN'], 'emphasized_text_tags': ['span']}}, ...
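Staging amounts to reshaping parsed elements into the structure a downstream platform expects. The sketch below rebuilds the rows/data/metadata layout visible in the output above from a hypothetical SketchElement dataclass. The class, the sketch_stage_rows helper, and the hash-based element_id are assumptions for illustration, not the library's implementation.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SketchElement:
    """Hypothetical parsed element: a category, its text, and metadata."""

    type: str
    text: str
    metadata: dict = field(default_factory=dict)

    @property
    def element_id(self):
        # Assumed scheme: a 32-character id derived from a hash of the text
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:32]


def sketch_stage_rows(elements):
    """Convert elements into a {'rows': [...]} dictionary."""
    rows = []
    for el in elements:
        rows.append({
            "data": {
                "type": el.type,
                "element_id": el.element_id,
                "text": el.text,
            },
            # Copy the metadata so later edits do not affect the staged rows
            "metadata": dict(el.metadata),
        })
    return {"rows": rows}


staged = sketch_stage_rows([
    SketchElement("Title", "Hello", {"filetype": "text/html", "page_number": 1}),
    SketchElement("NarrativeText", "Some body text."),
])
print(staged["rows"][0]["data"]["type"])
```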
This is a basic guide to get you started with the Unstructured IO module. For more advanced usage, refer to the specific method documentation and the Unstructured IO Documentation.
🪐 This Wiki page is a budding planet in the universe of knowledge, still under construction. Beware of informational meteor showers and the occasional black hole of error as it orbits towards completeness. - From an anonymous cat.