Data Ingestion Pipelines
Pre-process and ingest diverse data formats, including images and text-based documents
The Data Ingestion Pipelines framework, part of the Data Utils library, offers a comprehensive suite of robust functions — commonly referred to as pipelines — designed to pre-process, transform, and prepare images and text documents for seamless ingestion into the Clarifai platform.
These ready-to-use pipelines enable efficient processing of unstructured data, including partitioning, chunking, cleaning, and extracting valuable information, ensuring the data is optimized for downstream use cases such as Retrieval Augmented Generation (RAG).
Leveraging the capabilities of the open-source Unstructured library, this framework is designed to streamline data processing workflows, making it an essential tool for working with Large Language Models (LLMs) and other AI-driven applications.
It supports the following file formats:
- PDF (.pdf)
- Text (.txt)
- Docx (.docx)
- Markdown (.md)
Prerequisites
Install Python SDK and Data Utils
Install the latest version of the `clarifai` Python SDK package. Also, install the Data Utils library.
- Bash
pip install --upgrade clarifai
pip install clarifai-datautils
Install Extra Dependencies
The Data Ingestion Pipelines framework requires additional libraries to function properly. First, create a `requirements-dev.txt` file and add the following dependencies:
- Text
unstructured[pdf] @ git+https://github.com/clarifai/unstructured.git@support_clarifai_model
llama-index-core==0.10.33
llama-index-llms-clarifai==0.1.2
pi_heif==0.18.0
markdown==3.7
python-docx==1.1.2
schema==0.7.5
Note that the requirements line `unstructured[pdf] @ git+https://github.com/clarifai/unstructured.git@support_clarifai_model` installs the `support_clarifai_model` branch from the Clarifai fork of the `unstructured` library.
Then, run the following command to install the required dependencies:
- Bash
pip install -r requirements-dev.txt
You can also install the following system dependencies if they are not already available on your system. Depending on the document types you're handling, you may not need all of them.

- `opencv-python-headless`: A lightweight build of OpenCV (Open Source Computer Vision Library) for environments where GUI functionality (such as image or video display) is not needed. Install it by running: `pip install opencv-python-headless`
- `poppler-utils`: Essential for processing and extracting data from PDF files. Install it by running: `sudo apt update && sudo apt install poppler-utils`
- `tesseract-ocr`: Required for performing OCR on images or scanned documents to extract text. Install it by running: `sudo apt update && sudo apt install tesseract-ocr`
- `libgl1-mesa-glx`: Ensures compatibility with graphical operations that some libraries (e.g., OpenCV) may require even in headless environments. Install it by running: `sudo apt update && sudo apt install libgl1-mesa-glx`
- `punkt_tab`: The Punkt sentence tokenizer data shipped with the NLTK library, used for tokenizing text. Install it by running: `nltk.download('punkt_tab')`
- `averaged_perceptron_tagger_eng`: A pre-trained NLTK model for accurate part-of-speech tagging in English. Install it by running: `nltk.download('averaged_perceptron_tagger_eng')`
Get a PAT
You need a PAT (Personal Access Token) key to authenticate your connection to the Clarifai platform. You can generate it in your Personal Settings page by navigating to the Security section.
Then, set it as an environment variable in your script.
- Python
import os
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE" # replace with your own PAT key
Create a Dataset
Create a dataset on the Clarifai platform to use for uploading your pre-processed data.
- Python
from clarifai.client.app import App
app = App(app_id="YOUR_APP_ID_HERE", user_id="YOUR_USER_ID_HERE", pat="YOUR_PAT_HERE")
# Provide the dataset name as parameter in the create_dataset function
dataset = app.create_dataset(dataset_id="annotations_dataset")
Building Pipelines
When working with unstructured documents like PDFs, building pipelines is a crucial step to automate the processing and transformation of data.
Here is an example of a basic pipeline for PDF partitioning.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
# Define the processing pipeline
pipeline = Pipeline(
name="basic_pdf",
transformations=[
PDFPartition()
]
)
# Load predefined pipeline
pipeline = Pipeline.load(name="basic_pdf")
# View the pipeline
pipeline.info()
Output Example
Pipeline: basic_pdf
<clarifai_datautils.multimodal.pipeline.PDF.PDFPartition object at 0x0000017BFFC92000>
Note that:

- The `Pipeline` and `PDFPartition` classes are imported from `clarifai_datautils.multimodal`. These are used to define and execute processing pipelines for PDF documents.
- A `Pipeline` object is created with the name `"basic_pdf"`. You can provide any arbitrary name for the pipeline; the name can be used to identify or call the pipeline.
- `PDFPartition()` uses default parameters (such as `max_characters=500`) for ingesting PDFs.
- After loading a predefined pipeline, you can view its details.
Partitioning & Chunking
Partitioning is the first step in document processing. It breaks down a raw, unstructured document into smaller, meaningful units called document elements, while preserving the document's semantic structure.
These elements — such as paragraphs, titles, tables, and images — help maintain the original context. The process involves reading the document, segmenting it into sections, categorizing those sections, and extracting the relevant text.
Chunking follows partitioning and involves grouping or rearranging document elements generated by partitioning into "chunks" based on specific size constraints or criteria. This step ensures that the resulting segments are optimized for use cases like search, summarization, or content retrieval.
Once a chunk of text or image data is uploaded to the Clarifai platform, metadata fields such as `filename`, `page_number`, `orig_elements`, and `type` are automatically added to provide detailed information about the uploaded input.
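The partition-then-chunk flow described above can be illustrated with a short, self-contained sketch. Note that `chunk_elements` and the toy element list below are hypothetical illustrations of the general technique, not part of the Data Utils API:

```python
# Conceptual sketch (NOT the library's implementation): how partitioned
# document elements can be grouped into chunks under a size constraint.

def chunk_elements(elements, max_characters=500):
    """Greedily pack consecutive element texts into chunks whose combined
    length stays within max_characters (each element is kept whole)."""
    chunks, current = [], ""
    for text in elements:
        candidate = (current + "\n" + text) if current else text
        if len(candidate) <= max_characters:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = text  # start a new chunk with this element
    if current:
        chunks.append(current)
    return chunks

# Toy "document elements" as a partitioning step might produce them
elements = [
    "Title: Annual Report",
    "Paragraph one " * 10,
    "Paragraph two " * 10,
    "Closing remarks.",
]
chunks = chunk_elements(elements, max_characters=200)
print(len(chunks))
print(all(len(c) <= 200 for c in chunks))
```

Real pipelines add more structure (e.g., respecting title boundaries with `chunking_strategy="by_title"`), but the size-constrained grouping idea is the same.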
Example
PDF Partitioning
PDF partitioning helps transform PDFs into a structured format that can be used for further processing.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
name="pipeline-1",
transformations=[
PDFPartition(chunking_strategy="by_title", max_characters=1024)
]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Run the pipeline on a PDF file
# Set `loader=False` to return the transformed chunks as elements instead of loading them into a dataset
elements = pipeline.run(files=LOCAL_PDF_PATH, loader=False)
# Print the resulting chunks (document elements)
print(elements)
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Output Example
Transforming Files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.82s/it]
[<unstructured.documents.elements.CompositeElement object at 0x0000013D30FE63F0>, <unstructured.documents.elements.CompositeElement object at 0x0000013D314540B0>, <unstructured.documents.elements.CompositeElement object at 0x0000013D31454B90>, ..., <unstructured.documents.elements.CompositeElement object at 0x0000013D31336060>]
Transforming Files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.73s/it]
Uploading Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:16<00:00, 2.42s/it]
Note that:

- The transformation step uses the `PDFPartition` object to partition the PDF into smaller chunks. `chunking_strategy` is set to `"by_title"`, meaning the document is split based on its title sections. `max_characters` limits each chunk to 1024 characters for better processing and retrieval efficiency; the default is 500 characters.
- The `loader=False` argument (the default) returns the transformed chunks as Python objects (`elements`), allowing for local inspection or further processing. Conversely, setting `loader=True` ingests the transformed chunks directly into a Clarifai dataset instead of returning them locally.
- The partitioned and chunked PDF elements are uploaded to a Clarifai dataset. The uploaded data is automatically annotated with the pipeline name on the Clarifai platform, which makes it easy to identify and distinguish between data processed through different pipelines.
You can also configure the following arguments for the `PDFPartition` object:

- Set `chunking_strategy="basic"` to chunk the document purely based on character length and sequential order rather than structural elements like section titles or page boundaries. This is useful when you simply want to group text into evenly sized chunks without preserving the document's logical structure.
- Set `ocr=True` to enable OCR for extracting text from scanned or image-based PDFs. Set it to `False` (the default) to disable OCR.
- By default, `overlap=None` or `overlap=0` ensures no overlap between chunks; that is, chunks are created without any shared text between them. To enable overlap, provide an integer value (e.g., `overlap=100`) to specify the number of overlapping characters between consecutive chunks.
- Set `overlap_all=True` to enable overlapping across all chunks. Set it to `False` (the default) to disable this behavior.
- Set `strategy="ocr_only"` to force the document to be processed with the Tesseract OCR strategy; if Tesseract is unavailable and the document contains extractable text, it falls back to the `"fast"` strategy. Set `strategy="fast"` to extract text using `pdfminer`, which is faster and suitable for text-based PDFs. Otherwise, `strategy="auto"` (the default) selects the partitioning strategy based on document characteristics and the function kwargs.
- Use `clarifai_ocr_model` to set the URL of a Clarifai OCR model for processing the document. The default is `None`.
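To make the `overlap` behavior concrete, here is a small, library-independent sketch. The function name and chunk sizes are illustrative only, not part of the Data Utils API; it shows character-based chunking where each chunk after the first begins with the last `overlap` characters of its predecessor:

```python
# Illustrative sketch of character-based chunking with overlap; NOT the
# clarifai-datautils implementation, just the general technique.

def chunk_with_overlap(text, max_characters, overlap=0):
    """Split text into chunks of at most max_characters, where each chunk
    after the first starts with the last `overlap` characters of the
    previous chunk."""
    chunks = []
    step = max_characters - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_characters])
        if start + max_characters >= len(text):
            break  # the rest of the text is already covered
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
# No overlap: chunks partition the text with no shared characters
print(chunk_with_overlap(text, max_characters=10, overlap=0))
# overlap=3: consecutive chunks share their 3 boundary characters
print(chunk_with_overlap(text, max_characters=10, overlap=3))
```

Overlap trades some storage redundancy for retrieval robustness: a sentence that straddles a chunk boundary still appears intact in at least one chunk.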
PDF Partitioning Multimodal
The `PDFPartitionMultimodal` ingestion pipeline supports multimodal scenarios, where files containing a mix of text, images, and other elements need to be processed and ingested into the Clarifai platform.
We use the Clarifai-hosted YOLOX object detection model to process PDFs containing embedded images.
- Python
from clarifai_datautils.multimodal import Pipeline
from clarifai_datautils.multimodal.pipeline.PDF import PDFPartitionMultimodal
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
name="pipeline-1",
transformations=[
PDFPartitionMultimodal(chunking_strategy="by_title", max_characters=1024)
]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Note that:

- The `PDFPartitionMultimodal` object supports the following arguments for configuration: `chunking_strategy`, `max_characters`, `overlap`, and `overlap_all`, which have been explained earlier.
You can also configure the following arguments for the `PDFPartitionMultimodal` object:

- By default, `extract_images_in_pdf=True` extracts images from a PDF file. In that case, the partitioning strategy is set to `strategy="hi_res"`, which identifies the layout of the document and gains additional information about the document elements. Set `extract_images_in_pdf=False` to disable this behavior.
- Set `extract_image_block_types=["Image"]` to specify the list of image block types to extract.
- Set `extract_image_block_to_payload=True` to convert extracted images into base64 format (returning images as bytes). Note that to use this feature, you must set the `strategy` parameter to `hi_res` and `extract_images_in_pdf` to `True`. Set `extract_image_block_to_payload=False` to disable this behavior.
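To show what a base64 image payload looks like in practice, the following self-contained sketch uses only the Python standard library; the byte string is a stand-in for real image data, not output from the pipeline:

```python
import base64

# Stand-in for the raw bytes of an extracted image (a real payload would
# contain PNG/JPEG data pulled out of the PDF).
image_bytes = b"\x89PNG\r\n\x1a\n...fake image data..."

# Encode to base64 so the image can travel inside a JSON/text payload
payload = base64.b64encode(image_bytes).decode("ascii")
print(payload[:24])

# The original bytes are recoverable on the receiving side
assert base64.b64decode(payload) == image_bytes
```

This is why base64 payloads are convenient for ingestion APIs: binary image data survives transport through text-only channels unchanged.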
Text Partitioning
Text partitioning transforms unstructured `.txt` documents into text elements, making them easier to process, analyze, and utilize in downstream applications.
- Python
from clarifai_datautils.multimodal import Pipeline, TextPartition
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
name="pipeline-1",
transformations=[
TextPartition(chunking_strategy="by_title", max_characters=1024)
]
)
LOCAL_TEXT_PATH = "YOUR_LOCAL_TEXT_PATH_HERE"  # Example: "./assets/multimodal/DA-1p.txt"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_TEXT_PATH, loader=True))
Note that:

- The `TextPartition` object supports the following arguments for configuration: `chunking_strategy`, `max_characters`, `overlap`, and `overlap_all`, which have been explained earlier.
Docx Partitioning
Docx partitioning processes `.docx` files, extracting and partitioning their contents into structured text elements.
- Python
from clarifai_datautils.multimodal import Pipeline, DocxPartition
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        DocxPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)
LOCAL_DOCX_PATH = "YOUR_LOCAL_DOCX_PATH_HERE" # Example: "./assets/multimodal/DA-1p.docx"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_DOCX_PATH, loader=True))
Note that:

- The `DocxPartition` object supports the following arguments for configuration: `chunking_strategy`, `max_characters`, `overlap`, and `overlap_all`, which have been explained earlier.
Markdown Partitioning
Markdown partitioning processes `.md` files, breaking them down into structured text elements for improved usability in downstream applications.
- Python
from clarifai_datautils.multimodal import Pipeline, MarkdownPartition
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        MarkdownPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)
LOCAL_MD_PATH = "YOUR_LOCAL_MD_PATH_HERE" # Example: "./assets/multimodal/DA-1p.md"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_MD_PATH, loader=True))
Note that:

- The `MarkdownPartition` object supports the following arguments for configuration: `chunking_strategy`, `max_characters`, `overlap`, and `overlap_all`, which have been explained earlier.
Image Summarization
The Image Summarizer pipeline lets you use a multimodal-to-text model available on the Clarifai platform to generate text summaries for the uploaded image data.
Each summary is stored as an individual input on the Clarifai platform, and you can view its metadata field to see the source image it’s associated with.
The generated summaries are concise, optimized for retrieval, and enriched with relevant keywords, making them highly effective for search and indexing.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartitionMultimodal
from clarifai_datautils.multimodal.pipeline.summarizer import ImageSummarizer
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartitionMultimodal(chunking_strategy="by_title", max_characters=1024),
        ImageSummarizer(model_url="https://clarifai.com/openai/chat-completion/models/gpt-4o")  # You can use any other multimodal-to-text model available on the Clarifai platform
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Text Cleaning
The Data Ingestion Pipelines framework allows you to prepare and refine raw text data by removing or correcting unwanted elements to improve readability, consistency, and usability for downstream applications.
The following examples use the `PDFPartition` object, but they can also be applied to any other supported partitioning objects.
Clean Extra Whitespaces
You can remove unnecessary spaces, tabs, or newlines from documents.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_extra_whitespace
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_extra_whitespace()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
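The actual cleaning logic lives in the `unstructured` library; the following plain-Python sketch (with a hypothetical helper name) approximates what this cleaner does to a chunk of text:

```python
import re

def clean_extra_whitespace(text: str) -> str:
    # Collapse runs of spaces, tabs, and newlines into a single space.
    return re.sub(r"\s+", " ", text).strip()

assert clean_extra_whitespace("ITEM 1.     BUSINESS\n\nOverview") == "ITEM 1. BUSINESS Overview"
```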
Replace Unicode Quotes
You can replace Unicode quotes with ASCII quotes for standardization.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Replace_unicode_quotes
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Replace_unicode_quotes()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
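Conceptually, the transformation maps curly Unicode quotes onto their ASCII equivalents. A rough plain-Python approximation (the helper name is hypothetical, not the library's API):

```python
def replace_unicode_quotes(text: str) -> str:
    # Map curly Unicode quotes to their plain ASCII equivalents.
    replacements = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
    for curly, ascii_quote in replacements.items():
        text = text.replace(curly, ascii_quote)
    return text

assert replace_unicode_quotes("\u201cquarterly report\u201d") == '"quarterly report"'
```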
Clean Dashes
You can remove unnecessary dashes from text.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_dashes
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_dashes()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Clean Bullets
You can remove unnecessary bullets from text.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_bullets
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_bullets()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Group Broken Paragraphs
You can merge fragmented paragraphs that were unintentionally split, restoring proper text flow and improving readability.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Group_broken_paragraphs
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Group_broken_paragraphs()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
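The idea behind paragraph grouping can be sketched in plain Python (a hypothetical approximation, not the library's implementation): blank lines stay as paragraph boundaries, while single line breaks inside a paragraph are joined back into flowing text.

```python
import re

def group_broken_paragraphs(text: str) -> str:
    # Blank lines mark true paragraph boundaries; single line breaks
    # within a paragraph are rejoined into one flowing line.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

broken = "The quick brown fox\njumps over the dog.\n\nNext paragraph."
assert group_broken_paragraphs(broken) == "The quick brown fox jumps over the dog.\n\nNext paragraph."
```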
Remove Punctuation
You can remove unnecessary punctuation from text.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Remove_punctuation
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Remove_punctuation()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Convert Byte Strings
You can convert a byte string (such as `b'hello'`) into a regular string (`'hello'`), ensuring proper text formatting and usability.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Bytes_string_to_string
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Bytes_string_to_string()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Clean Non-ASCII Characters
You can remove non-ASCII characters from text, ensuring compatibility with systems that only support standard ASCII encoding.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_non_ascii_chars
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_non_ascii_chars()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
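The effect is equivalent to encoding with the ASCII codec and discarding anything it cannot represent; a minimal sketch (hypothetical helper name):

```python
def clean_non_ascii_chars(text: str) -> str:
    # Drop every character outside the 7-bit ASCII range.
    return text.encode("ascii", errors="ignore").decode("ascii")

assert clean_non_ascii_chars("r\u00e9sum\u00e9") == "rsum"
```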
Clean Ordered Bullets
You can remove ordered bullet points (such as `1.`, `2)`, or `III.`) from text.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_ordered_bullets
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_ordered_bullets()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
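A rough regex-based sketch of the behavior (a hypothetical approximation that handles digits, Roman numerals, and letters, not the library's exact rules):

```python
import re

def clean_ordered_bullets(text: str) -> str:
    # Strip a leading ordered-bullet marker such as "1.", "2)", or "III."
    return re.sub(r"^\s*(?:\d+|[IVXLCDMivxlcdm]+|[A-Za-z])[.)]\s*", "", text)

assert clean_ordered_bullets("1. Introduction") == "Introduction"
assert clean_ordered_bullets("III. Results") == "Results"
```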
Clean Prefix
You can remove a specified prefix from a document. The `Clean_prefix` object supports the following arguments:

- `pattern` — Defines the prefix to remove. The pattern must be provided, and it can be a simple string or a regex pattern.
- `ignore_case` (optional, default is `False`) — Determines whether to ignore case. If `True`, ensures case-insensitive matching.
- `strip` (optional, default is `True`) — If `True`, removes any leading whitespace after the prefix is removed.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_prefix
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_prefix(pattern="Example", ignore_case=True, strip=True)
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
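The interplay of the three arguments can be sketched in plain Python (a hypothetical approximation of the semantics, not the library's implementation):

```python
import re

def clean_prefix(text: str, pattern: str, ignore_case: bool = False, strip: bool = True) -> str:
    # Remove the pattern only where it occurs at the start of the text.
    flags = re.IGNORECASE if ignore_case else 0
    cleaned = re.sub(rf"^{pattern}", "", text, flags=flags)
    # Optionally drop whitespace left behind once the prefix is gone.
    return cleaned.lstrip() if strip else cleaned

assert clean_prefix("Example: revenue grew 4%", "example:", ignore_case=True) == "revenue grew 4%"
```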
Clean Postfix
You can remove a specified postfix from a document using the `Clean_postfix` object, which supports the same arguments as `Clean_prefix`.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_postfix
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_postfix(pattern="Example", ignore_case=True, strip=True)
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Text Extraction
The Data Ingestion Pipelines framework allows you to identify and retrieve meaningful texts from documents.
Extract Email Addresses
You can extract email addresses from text. Note that if a chunk contains email addresses, they will be extracted and stored in the `email_address` metadata field of the uploaded input on the Clarifai platform, as previously mentioned.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractEmailAddress
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractEmailAddress()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
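Conceptually, the extractor scans each chunk for email-shaped tokens. A minimal regex-based sketch (the pattern below is an illustrative assumption; the library's matching may differ in edge cases):

```python
import re

# A simple pattern covering common email address shapes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_email_addresses(text: str) -> list[str]:
    # Return every email-address-like token found in the text.
    return EMAIL_RE.findall(text)

assert extract_email_addresses("Contact jane.doe@example.com for details.") == ["jane.doe@example.com"]
```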
Datetime With Time Zones
You can extract datetime values with time zones from text, ensuring accurate timestamp retrieval. Note that if a chunk contains such values, they will be extracted and stored in the `date_time` metadata field of the uploaded input on the Clarifai platform.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractDateTimeTz
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractDateTimeTz()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Extract IP Addresses
You can extract IP addresses from text. Note that if a chunk contains IP addresses, they will be extracted and stored in the `ip_address` metadata field of the uploaded input on the Clarifai platform.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractIpAddress
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractIpAddress()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
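As with email extraction, IP extraction amounts to scanning for address-shaped tokens. An illustrative IPv4-only sketch (an assumption for demonstration, not the library's pattern):

```python
import re

# Matches IPv4-shaped tokens; a stricter validator would also check
# that each octet falls in the 0-255 range.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_ip_addresses(text: str) -> list[str]:
    # Return every IPv4-looking token found in the text.
    return IPV4_RE.findall(text)

assert extract_ip_addresses("Gateway 192.168.0.1, host 10.0.0.254.") == ["192.168.0.1", "10.0.0.254"]
```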
Extract IP Address Names
You can extract IP addresses along with associated names from text. Note that if a chunk contains such names, they will be extracted and stored in the `ip_address_name` metadata field of the uploaded input on the Clarifai platform.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractIpAddressName
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractIpAddressName()
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Extract Text After
You can extract text appearing after a specified string in a given text input. The `ExtractTextAfter` object supports the following string arguments:

- `key` — Key to store the extracted text in the metadata field of the uploaded input on the Clarifai platform.
- `string` — The reference string after which the text will be extracted.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractTextAfter
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractTextAfter(key="example", string="Example:")
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Extract Text Before
You can extract text appearing before a specified string in a given text input. The `ExtractTextBefore` object supports the following string arguments:

- `key` — Key to store the extracted text in the metadata field of the uploaded input on the Clarifai platform.
- `string` — The reference string before which the text will be extracted.
- Python
from clarifai_datautils.multimodal import Pipeline, PDFPartition
from clarifai_datautils.multimodal.pipeline.extractors import ExtractTextBefore
from clarifai.client import Dataset
import os
# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"
# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractTextBefore(key="example", string="Example:")
    ]
)
LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"
# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")
# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")
# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
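Conceptually, both extractors behave like Python's `str.partition` on the reference string; the helpers below are hypothetical sketches of that behavior, not the library's implementation:

```python
def extract_text_after(text: str, string: str) -> str:
    # Everything after the first occurrence of the reference string
    # (empty if the reference string is absent).
    return text.partition(string)[2].strip()

def extract_text_before(text: str, string: str) -> str:
    # Everything before the first occurrence of the reference string.
    return text.partition(string)[0].strip()

assert extract_text_after("Example: margin improved", "Example:") == "margin improved"
assert extract_text_before("margin improved. Example:", "Example:") == "margin improved."
```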