Chat With Dropbox Using Unstructured.io

Learn how to chat with data from Dropbox


Dropbox is a cloud storage service that allows users to store, sync, and share files online. It provides seamless file synchronization across devices, enabling access to updated files from anywhere with an internet connection. Users can easily share files and folders with others, even if they don't have a Dropbox account. Using Dropbox as a source connector, you can ingest data into a Clarifai app and then leverage the full capabilities of the Clarifai platform. In this example, we are going to chat with the data ingested into the Clarifai app using RAG.

Prerequisites

  • Set up the Clarifai Python SDK along with your PAT. Refer to the installation and configuration instructions with the PAT token here.
note

Guide to get your PAT

import os
os.environ['CLARIFAI_PAT'] = "YOUR_PAT"
  • Install the required packages.
! pip install "unstructured[clarifai]"
! pip install "unstructured[dropbox]"

Initialization

First, let us set up the data we are going to ingest into the app. The data we are going to use is stored in Dropbox. To access it using Unstructured.io, we have to provide a Dropbox access token.

info

Set up a Dropbox access token. Refer to this page for instructions.

DROPBOX_ACCESS_TOKEN="YOUR_ACCESS_TOKEN"
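
The runner configuration later in this guide reads the token with os.getenv("DROPBOX_ACCESS_TOKEN"). As a minimal sketch, assuming you keep the token in a plain Python variable as above, you can export it as an environment variable so that lookup succeeds:

import os

# Expose the token as an environment variable so that
# os.getenv("DROPBOX_ACCESS_TOKEN") in the runner config below can find it.
os.environ["DROPBOX_ACCESS_TOKEN"] = DROPBOX_ACCESS_TOKEN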

After setting up the access tokens, let’s import some necessary libraries.

import os  # Importing the os module for environment variable access

# Importing necessary configurations and classes from unstructured.ingest.connector.fsspec.dropbox
from unstructured.ingest.connector.fsspec.dropbox import DropboxAccessConfig, SimpleDropboxConfig

# Importing configuration classes from unstructured.ingest.interfaces
from unstructured.ingest.interfaces import (
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

# Importing the DropboxRunner class from unstructured.ingest.runner
from unstructured.ingest.runner import DropboxRunner

# Importing necessary configurations and classes from unstructured.ingest.connector.clarifai
from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)

# Importing base writer and ClarifaiWriter from unstructured.ingest.runner.writers.clarifai
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)

Next, we will write a function that sets up the ingestion configuration required to upload the data into our app on the Clarifai platform.

def clarifai_writer() -> Writer:
    # This function defines a writer for the Clarifai service.
    # It returns an instance of the ClarifaiWriter class.
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            # Configuration for accessing the Clarifai API.
            access_config=ClarifaiAccessConfig(
                api_key="PAT"  # API key for accessing the Clarifai service.
            ),
            # Configuration specific to the Clarifai application.
            app_id="app_id",  # The ID of the Clarifai application.
            user_id="user_id",  # The ID of the Clarifai user.
        ),
        write_config=ClarifaiWriteConfig(),  # Configuration for writing data to Clarifai.
    )
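
Replace the placeholders above with your actual PAT, app ID, and user ID. As a minimal sketch, assuming you set the CLARIFAI_PAT environment variable earlier, you could read the PAT from the environment instead of hardcoding it:

def clarifai_writer() -> Writer:
    # Same writer as above, but the PAT comes from the environment.
    # "YOUR_APP_ID" and "YOUR_USER_ID" are placeholders for your own values.
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key=os.environ["CLARIFAI_PAT"]),
            app_id="YOUR_APP_ID",
            user_id="YOUR_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )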

Data Ingestion

In data ingestion, there are two important concepts: the Source Connector and the Destination Connector. For our use case, the Source Connector fetches the data from Dropbox and the Destination Connector sends the transformed data to the Clarifai app.

Click here to learn more about Ingestion.

if __name__ == "__main__":
    # Creating a writer instance using the clarifai_writer function
    writer = clarifai_writer()

    # Creating an instance of DropboxRunner with the required configurations
    runner = DropboxRunner(
        processor_config=ProcessorConfig(
            verbose=True,  # Enable verbose output
            output_dir="dropbox-output",  # Directory to store output locally
            num_processes=2,  # Number of processes to use
        ),
        read_config=ReadConfig(),  # Configuration for reading data
        partition_config=PartitionConfig(),  # Configuration for partitioning data
        connector_config=SimpleDropboxConfig(
            access_config=DropboxAccessConfig(token=os.getenv("DROPBOX_ACCESS_TOKEN")),  # Access config using environment variable for Dropbox token
            remote_url="dropbox file URL",  # URL of the Dropbox file or folder
            recursive=True,  # Whether to recursively read files in the directory
        ),
        writer=writer,  # Writer to use for output
        writer_kwargs={},  # Additional arguments for the writer
    )

    # Running the DropboxRunner
    runner.run()
Output
2024-06-11 10:03:55,063 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/dropbox/a5d8d1c6ed
2024-06-11 10:03:55,068 MainProcess INFO running pipeline: DocFactory -> Reader -> Partitioner -> Writer -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "dropbox-output", "num_processes": 2, "raise_on_error": false}
2024-06-11 10:03:55,152 MainProcess INFO Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "dropbox-output", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/dropbox/a5d8d1c6ed", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"remote_url": "dropbox://test", "uncompress": false, "recursive": true, "file_glob": null, "access_config": {"token": "*******"}, "protocol": "dropbox", "path_without_protocol": "test", "dir_path": "test", "file_path": ""}}
2024-06-11 10:03:55,568 MainProcess INFO processing 2 docs via 2 processes
2024-06-11 10:03:55,571 MainProcess INFO Calling Reader with 2 docs
2024-06-11 10:03:55,573 MainProcess INFO Running source node to download data associated with ingest docs
2024-06-11 10:04:03,339 MainProcess INFO Calling Partitioner with 2 docs
2024-06-11 10:04:03,341 MainProcess INFO Running partition node to extract content from json files. Config: {"pdf_infer_table_structure": false, "strategy": "auto", "ocr_languages": null, "encoding": null, "additional_partition_args": {}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": "*******", "hi_res_model_name": null}, partition kwargs: {}]
2024-06-11 10:04:03,346 MainProcess INFO Creating /root/.cache/unstructured/ingest/pipeline/partitioned
2024-06-11 10:04:14,063 MainProcess INFO Calling Copier with 1 docs
2024-06-11 10:04:14,067 MainProcess INFO Running copy node to move content to desired output location
2024-06-11 10:04:15,970 MainProcess INFO uploading elements from 1 document(s) to the destination
2024-06-11 10:04:15,972 MainProcess INFO Calling Writer with 1 docs
2024-06-11 10:04:15,975 MainProcess INFO Running write node to upload content. Destination connector: {"write_config": {"batch_size": 50}, "connector_config": {"access_config": {"api_key": "*******"}, "app_id": "unst-clf", "user_id": "8tzpjy1a841y", "dataset_id": null}, "_client": null}]
2024-06-11 10:04:16,425 MainProcess INFO Extending 506 json elements from content in dropbox-output/Crawfords_Auto_Repair_Guide.txt.json
2024-06-11 10:04:16,445 MainProcess INFO writing 506 objects to destination app unst-clf
2024-06-11 10:04:19 INFO clarifai.client.input: input.py:706
Inputs Uploaded
code: SUCCESS
description: "Ok"
details: "All inputs successfully added"
req_id: "2216655c3ae641f1b3789c45f367fdd0"

Chat

In the final step, we are going to perform information retrieval using RAG on the data we ingested from Dropbox into the Clarifai app using Unstructured.io. You can use a workflow with a RAG prompter to initialize RAG. After successfully creating a workflow, you can get its URL from the Clarifai portal. Once you create the RAG object from the workflow URL, you can start retrieving text from the data we ingested using Unstructured.io.

from clarifai.rag import RAG

WORKFLOW_URL = 'rag_workflow_url'
# Creating the RAG object with the prebuilt workflow
rag_object_from_url = RAG(workflow_url=WORKFLOW_URL)
result = rag_object_from_url.chat(messages=[{"role": "human", "content": "what is brake fluid"}])
# Extract the content of the response and split it by the newline character ('\n') into the list 'answer'.
answer = result[0]["content"].split('\n')
print(answer)
Output
'Brake fluid is a type of hydraulic fluid used in hydraulic brake and hydraulic clutch applications in vehicles. 
It is responsible for transferring force into pressure, and to amplify braking force.
The level of brake fluid in a vehicle can be checked via a clear reservoir, and it should ideally be between the minimum and maximum level marks. If the fluid is low, it could indicate a potential issue and might require a visit to a mechanic. Most vehicles also have a dashboard light that illuminates when the brake fluid is low.'
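
You can reuse the same RAG object for further questions. Below is a minimal sketch using the same chat call shown above, with a hypothetical follow-up question:

# Hypothetical follow-up question using the same RAG object
follow_up = rag_object_from_url.chat(
    messages=[{"role": "human", "content": "how do I check the brake fluid level"}]
)
print(follow_up[0]["content"])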