Data Ingestion Pipelines

Pre-process and ingest diverse data formats, including images and text-based documents


The Data Ingestion Pipelines framework, part of the Data Utils library, offers a comprehensive suite of robust functions — commonly referred to as pipelines — designed to pre-process, transform, and prepare images and text documents for seamless ingestion into the Clarifai platform.

These ready-to-use pipelines enable efficient processing of unstructured data, including partitioning, chunking, cleaning, and extracting valuable information, ensuring the data is optimized for downstream use cases such as Retrieval Augmented Generation (RAG).

Leveraging the capabilities of the open-source Unstructured library, this framework is designed to streamline data processing workflows, making it an essential tool for working with Large Language Models (LLMs) and other AI-driven applications.

It supports these file formats:

  • PDF
  • Text (.txt)
  • Docx
  • Markdown

Prerequisites

Install Python SDK and Data Utils

Install the latest version of the clarifai Python SDK package. Also, install the Data Utils library.

pip install --upgrade clarifai
pip install clarifai-datautils

Install Extra Dependencies

The Data Ingestion Pipelines framework requires additional libraries to function properly. First, create a requirements-dev.txt file and add the following dependencies:

unstructured[pdf] @ git+https://github.com/clarifai/unstructured.git@support_clarifai_model
llama-index-core==0.10.33
llama-index-llms-clarifai==0.1.2
pi_heif==0.18.0
markdown==3.7
python-docx==1.1.2
schema==0.7.5

Note that the first line, unstructured[pdf] @ git+https://github.com/clarifai/unstructured.git@support_clarifai_model, installs the support_clarifai_model branch from the Clarifai fork of the unstructured library.

Then, run the following command to install the required dependencies:

pip install -r requirements-dev.txt

You can also install the following system dependencies if they are not already available on your system. Based on the document types you're handling, you may not need all of them.

  • opencv-python-headless — A lightweight version of OpenCV (Open Source Computer Vision Library) designed for environments where GUI functionalities (such as image or video display) are not needed. You can install it by running: pip install opencv-python-headless.
  • poppler-utils — Essential for processing and extracting data from PDF files. You can install it by running: sudo apt update && sudo apt install poppler-utils.
  • tesseract-ocr — Required for performing OCR on images or scanned documents to extract text. You can install it by running: sudo apt update && sudo apt install tesseract-ocr.
  • libgl1-mesa-glx — Ensures compatibility with graphical operations, which may be required by certain libraries (e.g., OpenCV) even in headless environments. You can install it by running: sudo apt update && sudo apt install libgl1-mesa-glx.
  • punkt_tab — The NLTK Punkt sentence tokenizer models (distributed in a tab-delimited format), used for sentence tokenization. You can install it from Python by running nltk.download('punkt_tab') (see the snippet after this list).
  • averaged_perceptron_tagger_eng — Provides a pre-trained NLTK model for accurate part-of-speech tagging in English. You can install it from Python by running nltk.download('averaged_perceptron_tagger_eng').
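Both NLTK resources are downloaded from within Python rather than with pip. A minimal snippet:

import nltk

nltk.download('punkt_tab')                       # Punkt sentence tokenizer models
nltk.download('averaged_perceptron_tagger_eng')  # English part-of-speech tagger model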

Get a PAT

You need a PAT (Personal Access Token) key to authenticate your connection to the Clarifai platform. You can generate it in your Personal Settings page by navigating to the Security section.

Then, set it as an environment variable in your script.

import os

os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"  # replace with your own PAT key

Create a Dataset

Create a dataset on the Clarifai platform to use for uploading your pre-processed data.

from clarifai.client.app import App

app = App(app_id="YOUR_APP_ID_HERE", user_id="YOUR_USER_ID_HERE", pat="YOUR_PAT_HERE")
# Provide the dataset name as parameter in the create_dataset function
dataset = app.create_dataset(dataset_id="annotations_dataset")

Building Pipelines

When working with unstructured documents like PDFs, building pipelines is a crucial step to automate the processing and transformation of data.

Here is an example of a basic pipeline for PDF partitioning.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  

# Define the processing pipeline
pipeline = Pipeline(
    name="basic_pdf",
    transformations=[
        PDFPartition()
    ]
)

# Load predefined pipeline
pipeline = Pipeline.load(name="basic_pdf")

# View the pipeline
pipeline.info()
Output Example
Pipeline: basic_pdf
<clarifai_datautils.multimodal.pipeline.PDF.PDFPartition object at 0x0000017BFFC92000>

Note that:

  • Pipeline and PDFPartition classes are imported from clarifai_datautils.multimodal. These are used to define and execute processing pipelines for PDF documents.
  • A Pipeline object is created with the name "basic_pdf". You can give the pipeline any name, which can later be used to identify or load it.
  • PDFPartition() uses default parameters (such as max_characters=500) for ingesting PDFs.
  • After loading a predefined pipeline, you can view its details.

Partitioning & Chunking

Partitioning is the first step in document processing. It breaks down a raw, unstructured document into smaller, meaningful units called document elements, while preserving the document’s semantic structure.

These elements — such as paragraphs, titles, tables, and images — help maintain the original context. The process involves reading the document, segmenting it into sections, categorizing those sections, and extracting the relevant text.

Chunking follows partitioning and involves grouping or rearranging document elements generated by partitioning into "chunks" based on specific size constraints or criteria. This step ensures that the resulting segments are optimized for use cases like search, summarization, or content retrieval.

info

Once a chunk of text or image data is uploaded to the Clarifai platform, metadata fields — such as filename, page_number, orig_elements, and type — are automatically added to provide detailed information about the uploaded input.
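When you run a pipeline with loader=False, you can inspect these fields locally before uploading anything. Here is a minimal sketch, assuming the returned chunks are unstructured elements whose metadata object exposes a to_dict() method:

from clarifai_datautils.multimodal import Pipeline, PDFPartition

# Define a minimal partitioning pipeline
pipeline = Pipeline(
    name="inspect_metadata",
    transformations=[
        PDFPartition()
    ]
)

# Return the chunks locally instead of uploading them
elements = pipeline.run(files="YOUR_LOCAL_PDF_PATH_HERE", loader=False)

# Print the metadata attached to the first few chunks (e.g., filename, page_number)
for element in elements[:3]:
    print(element.metadata.to_dict())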

Example

PDF Partitioning

PDF partitioning helps transform PDFs into a structured format that can be used for further processing.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Run the pipeline on a PDF file
# Set `loader=False` to return the transformed chunks as elements instead of loading them into a dataset
elements = pipeline.run(files=LOCAL_PDF_PATH, loader=False)

# Print the resulting chunks (document elements)
print(elements)

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
Output Example
Transforming Files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.82s/it]
[<unstructured.documents.elements.CompositeElement object at 0x0000013D30FE63F0>, <unstructured.documents.elements.CompositeElement object at 0x0000013D314540B0>, <unstructured.documents.elements.CompositeElement object at 0x0000013D31454B90>, ..., <unstructured.documents.elements.CompositeElement object at 0x0000013D31336060>]
Transforming Files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00, 6.73s/it]
Uploading Dataset: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:16<00:00, 2.42s/it]

Note that:

  • The transformation step uses the PDFPartition object to partition the PDF into smaller chunks.
    • chunking_strategy is set to "by_title", meaning the document is split based on its title sections.
    • max_characters limits each chunk to 1024 characters for better processing and retrieval efficiency. The default is 500 characters.
  • The loader=False argument (the default) returns the transformed chunks as Python objects (elements), allowing local inspection or further processing. Conversely, loader=True ingests the transformed chunks directly into a Clarifai dataset instead of returning them locally.
  • The partitioned and chunked PDF elements are uploaded to a Clarifai dataset. The uploaded data is automatically annotated with the pipeline name on the Clarifai platform. This makes it easy to identify and distinguish between data processed through different pipelines.
tips

You can also configure the following arguments for the PDFPartition object (combined in the sketch after this list):

  • Set chunking_strategy="basic" for the document to be chunked purely based on character length and sequential order rather than structural elements like section titles or page boundaries. It's useful when you simply want to group text into evenly sized chunks without preserving the document’s logical structure.
  • Set ocr=True to enable OCR for extracting text from scanned or image-based PDFs. Set it to False, which is the default, to disable OCR.
  • By default, overlap=None (equivalent to overlap=0), so chunks are created without any shared text between them. To enable overlap, provide an integer value (e.g., overlap=100) specifying the number of overlapping characters between consecutive chunks.
  • Set overlap_all=True to enable overlapping across all chunks. Set it to False, which is the default, to disable this behavior.
  • Set strategy="ocr_only" to force the document to be processed with the Tesseract OCR strategy; if Tesseract is unavailable and the document contains extractable text, it falls back to the "fast" strategy. Set strategy="fast" to extract text using pdfminer, which is faster and suitable for text-based PDFs. The default, strategy="auto", selects the partitioning strategy based on document characteristics and the function kwargs.
  • Use clarifai_ocr_model to set the URL of a Clarifai OCR model for processing the document. The default is None.
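For reference, here is a hypothetical configuration that combines several of these options; the values are illustrative, not recommendations:

from clarifai_datautils.multimodal import Pipeline, PDFPartition

pipeline = Pipeline(
    name="ocr_pdf",
    transformations=[
        PDFPartition(
            chunking_strategy="basic",  # chunk purely by character length
            max_characters=800,         # cap each chunk at 800 characters
            overlap=100,                # share 100 characters between consecutive chunks
            overlap_all=True,           # apply the overlap across all chunks
            ocr=True                    # enable OCR for scanned or image-based PDFs
        )
    ]
)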

PDF Partitioning Multimodal

The PDFPartitionMultimodal ingestion pipeline supports multimodal scenarios, where files containing a mix of text, images, and other elements are to be processed and ingested into the Clarifai platform.

We use the Clarifai-hosted YOLOX object detection model to process PDFs that contain embedded images.

from clarifai_datautils.multimodal import Pipeline
from clarifai_datautils.multimodal.pipeline.PDF import PDFPartitionMultimodal
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartitionMultimodal(chunking_strategy="by_title", max_characters=1024)
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Note that:

  • The PDFPartitionMultimodal object supports the following arguments for configuration: chunking_strategy, max_characters, overlap, and overlap_all, which have been explained earlier.
tips

You can also configure the following arguments for the PDFPartitionMultimodal object (combined in the sketch after this list):

  • By default, extract_images_in_pdf=True extracts images from a PDF file. In that case, the partitioning strategy is set as strategy="hi_res", which is intended to identify the layout of the document and gain additional information about the document elements. Otherwise, set extract_images_in_pdf=False to disable this behavior.
  • Set extract_image_block_types=["Image"] to specify that you want to extract a list of image block types.
  • Set extract_image_block_to_payload=True to allow for the conversion of extracted images from a PDF into base64 format (return images as bytes). Note that to use this feature, you must set the strategy parameter to hi_res and extract_images_in_pdf to True. Otherwise, set extract_image_block_to_payload=False to disable this behavior.
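Putting these options together, a hypothetical configuration for extracting embedded images might look as follows; the values are illustrative:

from clarifai_datautils.multimodal import Pipeline
from clarifai_datautils.multimodal.pipeline.PDF import PDFPartitionMultimodal

pipeline = Pipeline(
    name="multimodal-images",
    transformations=[
        PDFPartitionMultimodal(
            chunking_strategy="by_title",
            max_characters=1024,
            extract_images_in_pdf=True,           # implies the "hi_res" partitioning strategy
            extract_image_block_types=["Image"],  # image block types to extract
            extract_image_block_to_payload=True   # return extracted images as base64 bytes
        )
    ]
)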

Text Partitioning

Text partitioning transforms unstructured .txt documents into text elements, making them easier to process, analyze, and utilize in downstream applications.

from clarifai_datautils.multimodal import Pipeline, TextPartition  
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        TextPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)

LOCAL_TEXT_PATH = "YOUR_LOCAL_TEXT_PATH_HERE" # Example: "./assets/multimodal/DA-1p.txt"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_TEXT_PATH, loader=True))

Note that:

  • The TextPartition object supports the following arguments for configuration: chunking_strategy, max_characters, overlap, and overlap_all, which have been explained earlier.

Docx Partitioning

Docx partitioning processes .docx files, extracting and partitioning their contents into structured text elements.

from clarifai_datautils.multimodal import Pipeline, DocxPartition  
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        DocxPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)

LOCAL_DOCX_PATH = "YOUR_LOCAL_DOCX_PATH_HERE" # Example: "./assets/multimodal/DA-1p.docx"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_DOCX_PATH, loader=True))

Note that:

  • The DocxPartition object supports the following arguments for configuration: chunking_strategy, max_characters, overlap, and overlap_all, which have been explained earlier.

Markdown Partitioning

Markdown partitioning processes .md files, breaking them down into structured text elements for improved usability in downstream applications.

from clarifai_datautils.multimodal import Pipeline, MarkdownPartition  
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        MarkdownPartition(chunking_strategy="by_title", max_characters=1024)
    ]
)

LOCAL_MD_PATH = "YOUR_LOCAL_MD_PATH_HERE" # Example: "./assets/multimodal/DA-1p.md"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed text chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_MD_PATH, loader=True))

Note that:

  • The MarkdownPartition object supports the following arguments for configuration: chunking_strategy, max_characters, overlap, and overlap_all, which have been explained earlier.

Image Summarization

The Image Summarizer pipeline enables you to use a Clarifai multimodal-to-text model to generate text summaries for uploaded image data.

Each summary is stored as an individual input on the Clarifai platform, and you can view its metadata field to see the source image it’s associated with.

The generated summaries are concise, optimized for retrieval, and enriched with relevant keywords, making them highly effective for search and indexing.

from clarifai_datautils.multimodal import Pipeline, PDFPartitionMultimodal    
from clarifai_datautils.multimodal.pipeline.summarizer import ImageSummarizer
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartitionMultimodal(chunking_strategy="by_title", max_characters=1024),
        ImageSummarizer(model_url="https://clarifai.com/openai/chat-completion/models/gpt-4o")  # You can use any other multimodal-to-text model available on the Clarifai platform
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Text Cleaning

The Data Ingestion Pipelines framework allows you to prepare and refine raw text data by removing or correcting unwanted elements to improve readability, consistency, and usability for downstream applications.

note

The following examples use the PDFPartition object, but they can also be applied to any other supported partitioning objects.

Clean Extra Whitespaces

You can remove unnecessary spaces, tabs, or newlines from documents.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_extra_whitespace
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_extra_whitespace()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
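To preview the cleaner's effect without uploading, you can also run the same pipeline with loader=False. A minimal sketch, assuming the returned chunks are unstructured elements whose text attribute holds the cleaned content:

# Reusing the pipeline defined above; return the chunks locally instead of uploading
for element in pipeline.run(files=LOCAL_PDF_PATH, loader=False)[:3]:
    print(repr(element.text))  # extra whitespace should now be collapsed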

Replace Unicode Quotes

You can replace Unicode quotes with ASCII quotes for standardization.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Replace_unicode_quotes
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Replace_unicode_quotes()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Dashes

You can remove unnecessary dashes from texts.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_dashes
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_dashes()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Bullets

You can remove unnecessary bullets from texts.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_bullets
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_bullets()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Group Broken Paragraphs

You can merge fragmented paragraphs that were unintentionally split, restoring proper text flow and improving readability.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Group_broken_paragraphs
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Group_broken_paragraphs()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Remove Punctuation

You can remove unnecessary punctuation from texts.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Remove_punctuation
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Remove_punctuation()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Convert Byte Strings

You can convert a byte string (such as b'hello') into a regular string ('hello'), ensuring proper text formatting and usability.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Bytes_string_to_string
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Bytes_string_to_string()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Non-ASCII Characters

You can remove non-ASCII characters from text, ensuring compatibility with systems that only support standard ASCII encoding.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_non_ascii_chars
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_non_ascii_chars()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Ordered Bullets

You can remove ordered bullet points (such as 1., 2), or III.) from text.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_ordered_bullets
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_ordered_bullets()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Prefix

You can remove a specified prefix from a document. The Clean_prefix object supports the following arguments:

  • pattern — Defines the prefix to remove. The pattern must be provided, and it can be a simple string or a regex pattern.
  • ignore_case (optional, default is False) — Determines whether to ignore case. If True, ensures case-insensitive matching.
  • strip (optional, default is True) — If True, removes any leading whitespace after the prefix is removed.
from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_prefix
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_prefix(pattern="Example", ignore_case=True, strip=True)
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Clean Postfix

You can remove a specified postfix from a document using the Clean_postfix object, which supports the same arguments as Clean_prefix.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.cleaners import Clean_postfix
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        Clean_postfix(pattern="Example", ignore_case=True, strip=True)
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Text Extraction

The Data Ingestion Pipelines framework allows you to identify and retrieve meaningful texts from documents.

Extract Email Addresses

You can extract email addresses from texts. Note that if a chunk contains the addresses, they will be extracted and stored in the email_address metadata field of the uploaded input on the Clarifai platform, as previously mentioned.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractEmailAddress
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractEmailAddress()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))
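To verify what will land in the email_address field before uploading, you can run the same pipeline with loader=False. A minimal sketch, assuming the extracted values are attached to each element's metadata:

# Reusing the pipeline defined above; return the chunks locally instead of uploading
elements = pipeline.run(files=LOCAL_PDF_PATH, loader=False)

for element in elements:
    meta = element.metadata.to_dict()
    if meta.get("email_address"):  # populated only for chunks that contain addresses
        print(meta["email_address"])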

Datetime With Time Zones

You can extract datetime values with time zones from texts, ensuring accurate timestamp retrieval. Note that if a chunk contains the values, they will be extracted and stored in the date_time metadata field of the uploaded input on the Clarifai platform.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractDateTimeTz
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractDateTimeTz()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Extract IP Addresses

You can extract IP addresses from texts. Note that if a chunk contains the addresses, they will be extracted and stored in the ip_address metadata field of the uploaded input on the Clarifai platform.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractIpAddress
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractIpAddress()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Extract IP Address Names

You can extract IP addresses along with associated names from texts. Note that if a chunk contains the names, they will be extracted and stored in the ip_address_name metadata field of the uploaded input on the Clarifai platform.

from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractIpAddressName
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractIpAddressName()
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Extract Text After

You can extract text appearing after a specified string in a given text input. The ExtractTextAfter object supports the following string arguments:

  • key — Key to store the extracted text in the metadata field of the uploaded input on the Clarifai platform.
  • string — The reference string after which the text will be extracted.
from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractTextAfter
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractTextAfter(key="example", string="Example:")
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))

Extract Text Before

You can extract text appearing before a specified string in a given text input. The ExtractTextBefore object supports the following string arguments:

  • key — Key to store the extracted text in the metadata field of the uploaded input on the Clarifai platform.
  • string — The reference string before which the text will be extracted.
from clarifai_datautils.multimodal import Pipeline, PDFPartition  
from clarifai_datautils.multimodal.pipeline.extractors import ExtractTextBefore
from clarifai.client import Dataset
import os

# Set the Clarifai Personal Access Token (PAT) for authentication
os.environ["CLARIFAI_PAT"] = "YOUR_PAT_HERE"

# Define the processing pipeline
pipeline = Pipeline(
    name="pipeline-1",
    transformations=[
        PDFPartition(chunking_strategy="by_title", max_characters=1024),
        ExtractTextBefore(key="example", string="Example:")
    ]
)

LOCAL_PDF_PATH = "YOUR_LOCAL_PDF_PATH_HERE" # Example: "./assets/multimodal/DA-1p.pdf"

# Initialize dataset object for ingesting data
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="YOUR_DATASET_ID_HERE")

# Alternative: Initialize the dataset using a dataset URL (commented out)
# dataset = Dataset("DATASET_URL_HERE")

# Use Python SDK to upload the processed PDF chunks to Clarifai
dataset.upload_dataset(pipeline.run(files=LOCAL_PDF_PATH, loader=True))