Data Ingestion Using Unstructured.io
Learn about the data ingestion process in Unstructured.io
Unstructured.io provides a powerful platform for handling the ingestion of unstructured data. Central to this process are the source and destination connectors, which facilitate the movement of data from its origin to a storage or processing system.
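The relationship between a source connector, the processing step, and a destination connector can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Unstructured.io API: the names `Document`, `SourceConnector`, `DestinationConnector`, `InMemorySource`, `InMemoryDestination`, `partition`, and `run_pipeline` are all hypothetical, and the paragraph-splitting `partition` function is a simple stand-in for real document partitioning.

```python
from dataclasses import dataclass, field
from typing import Iterable, Protocol


@dataclass
class Document:
    """A raw item pulled from a source (hypothetical type for illustration)."""
    name: str
    text: str


class SourceConnector(Protocol):
    """Anything that can yield raw documents from where they originate."""
    def read(self) -> Iterable[Document]: ...


class DestinationConnector(Protocol):
    """Anything that can store processed elements."""
    def write(self, elements: list[dict]) -> None: ...


@dataclass
class InMemorySource:
    """Toy source that serves documents from a list."""
    docs: list[Document]

    def read(self) -> Iterable[Document]:
        return iter(self.docs)


@dataclass
class InMemoryDestination:
    """Toy destination that collects elements in a list."""
    stored: list[dict] = field(default_factory=list)

    def write(self, elements: list[dict]) -> None:
        self.stored.extend(elements)


def partition(doc: Document) -> list[dict]:
    """Split a document into paragraph elements (stand-in for real partitioning)."""
    return [
        {"source": doc.name, "text": part.strip()}
        for part in doc.text.split("\n\n")
        if part.strip()
    ]


def run_pipeline(source: SourceConnector, destination: DestinationConnector) -> int:
    """Move every document from source to destination; return the element count."""
    total = 0
    for doc in source.read():
        elements = partition(doc)
        destination.write(elements)
        total += len(elements)
    return total
```

Because the pipeline depends only on the two small interfaces, either end can be swapped without touching the processing step. The same separation lets an S3 source be paired with a Clarifai destination.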
Source Connectors
Source connectors interface with a variety of unstructured data sources so that you can ingest their contents into the Clarifai platform.
Below is an example of using S3 as a source connector.
# Standard library module, used here to read credentials from environment variables
import os

# S3 source connector configuration classes
from unstructured.ingest.connector.fsspec.s3 import S3AccessConfig, SimpleS3Config

# Shared ingest configuration classes
from unstructured.ingest.interfaces import (
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

# Runner that drives ingestion from S3
from unstructured.ingest.runner import S3Runner

# Clarifai destination connector configuration classes
from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)

# Base writer interface and the Clarifai writer implementation
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)


def clarifai_writer() -> Writer:
    # The snippet calls clarifai_writer() without defining it. This definition is
    # an assumption based on the Clarifai destination connector; the environment
    # variable name and the app/user IDs are placeholders to replace with your own.
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key=os.environ["CLARIFAI_PAT"]),
            app_id="YOUR_APP_ID",
            user_id="YOUR_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )


if __name__ == "__main__":
    # S3 credentials; the environment variable names are an assumption
    access_key = os.environ["AWS_ACCESS_KEY_ID"]
    secret_access = os.environ["AWS_SECRET_ACCESS_KEY"]

    # Creating an instance of ClarifaiWriter
    writer = clarifai_writer()

    # Creating an instance of S3Runner with various configurations
    runner = S3Runner(
        processor_config=ProcessorConfig(
            verbose=True,  # Enable verbose output
            output_dir="s3-output-local",  # Directory to store output locally
            num_processes=2,  # Number of processes to use
        ),
        read_config=ReadConfig(),  # Configuration for reading data
        partition_config=PartitionConfig(),  # Configuration for partitioning data
        connector_config=SimpleS3Config(
            access_config=S3AccessConfig(
                key=access_key,  # S3 access key
                secret=secret_access,  # S3 secret access key
            ),
            remote_url="s3 URL",  # Placeholder: replace with your bucket URL (s3://...)
        ),
        writer=writer,  # Writer to use for output
        writer_kwargs={},  # Additional arguments for the writer
    )

    # Running the S3Runner
    runner.run()
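After a run, it can be useful to check what was written to the local output directory (`s3-output-local` in the example above). The sketch below is a rough illustration that uses only the standard library. It assumes that each output file is a JSON array of element objects carrying a `type` field. That layout is an assumption about the runner's output rather than something stated here, and `summarize_output` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(output_dir: str) -> Counter:
    """Count elements by their "type" field across all JSON files in output_dir."""
    counts: Counter = Counter()
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            elements = json.load(handle)
        # Assumption: each file holds a JSON array of element objects.
        if not isinstance(elements, list):
            continue
        for element in elements:
            if isinstance(element, dict):
                counts[element.get("type", "Unknown")] += 1
    return counts


if __name__ == "__main__":
    # Directory name taken from the ProcessorConfig in the example above.
    for element_type, count in summarize_output("s3-output-local").most_common():
        print(f"{element_type}: {count}")
```

An empty result suggests that nothing was partitioned, which is a prompt to recheck the S3 URL and credentials before looking at the destination.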