Chat With Dropbox Using Unstructured.io

Learn how to chat with data from Dropbox


Dropbox is a cloud storage service that allows users to store, sync, and share files online. It provides seamless file synchronization across devices, enabling access to updated files from anywhere with an internet connection. Users can easily share files and folders with others, even if they don't have a Dropbox account. Using Dropbox as a source connector, you can ingest data into a Clarifai app and then leverage the full capabilities of the Clarifai platform. In this example, we are going to chat with the data ingested into the Clarifai app using RAG.

Prerequisites

  • Set up the Clarifai Python SDK along with your PAT. Refer to the installation and configuration instructions with the PAT token here.
note

Guide to get your PAT

import os
os.environ['CLARIFAI_PAT'] = "YOUR_PAT"
  • Install the required packages.
! pip install "unstructured[clarifai]"
! pip install "unstructured[dropbox]"

Initialization

First, let us set up the data we are going to ingest into the app. The data we are going to use is stored in Dropbox. To access it using Unstructured.io, we have to provide a Dropbox access token.

info

Set up a Dropbox access token. Refer to this page for instructions.

DROPBOX_ACCESS_TOKEN="YOUR_ACCESS_TOKEN"
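
The runner configuration later in this guide reads the token with os.getenv("DROPBOX_ACCESS_TOKEN"). As a minimal sketch, assuming you keep the token in a plain Python variable as above, you can export it as an environment variable so that lookup succeeds:

import os

# Expose the token as an environment variable so that
# os.getenv("DROPBOX_ACCESS_TOKEN") in the runner config below can find it.
os.environ["DROPBOX_ACCESS_TOKEN"] = DROPBOX_ACCESS_TOKEN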

After setting up the access tokens, let’s import some necessary libraries.

import os  # Importing the os module for environment variable access

# Importing necessary configurations and classes from unstructured.ingest.connector.fsspec.dropbox
from unstructured.ingest.connector.fsspec.dropbox import DropboxAccessConfig, SimpleDropboxConfig

# Importing configuration classes from unstructured.ingest.interfaces
from unstructured.ingest.interfaces import (
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

# Importing the DropboxRunner class from unstructured.ingest.runner
from unstructured.ingest.runner import DropboxRunner

# Importing necessary configurations and classes from unstructured.ingest.connector.clarifai
from unstructured.ingest.connector.clarifai import (
    ClarifaiAccessConfig,
    ClarifaiWriteConfig,
    SimpleClarifaiConfig,
)

# Importing base writer and ClarifaiWriter from unstructured.ingest.runner.writers.clarifai
from unstructured.ingest.runner.writers.base_writer import Writer
from unstructured.ingest.runner.writers.clarifai import (
    ClarifaiWriter,
)

Next, we will write a function that sets up the ingestion configuration required to upload the data into our app on the Clarifai platform.

def clarifai_writer() -> Writer:
    # This function defines a writer for the Clarifai service.
    # It returns an instance of the ClarifaiWriter class.
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            # Configuration for accessing the Clarifai API.
            access_config=ClarifaiAccessConfig(
                api_key="PAT"  # API key for accessing the Clarifai service.
            ),
            # Configuration specific to the Clarifai application.
            app_id="app_id",  # The ID of the Clarifai application.
            user_id="user_id",  # The ID of the Clarifai user.
        ),
        write_config=ClarifaiWriteConfig(),  # Configuration for writing data to Clarifai.
    )
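
Replace the placeholders above with your actual PAT, app ID, and user ID. As a minimal sketch, assuming you set the CLARIFAI_PAT environment variable earlier, you could read the PAT from the environment instead of hardcoding it:

def clarifai_writer() -> Writer:
    # Same writer as above, but the PAT comes from the environment.
    # "YOUR_APP_ID" and "YOUR_USER_ID" are placeholders for your own values.
    return ClarifaiWriter(
        connector_config=SimpleClarifaiConfig(
            access_config=ClarifaiAccessConfig(api_key=os.environ["CLARIFAI_PAT"]),
            app_id="YOUR_APP_ID",
            user_id="YOUR_USER_ID",
        ),
        write_config=ClarifaiWriteConfig(),
    )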

Data Ingestion

In data ingestion, there are two important concepts: the Source Connector and the Destination Connector. For our use case, the Source Connector fetches the data from Dropbox and the Destination Connector sends the transformed data to the Clarifai app.

Click here to learn more about Ingestion.

if __name__ == "__main__":
    # Creating a writer instance using the clarifai_writer function
    writer = clarifai_writer()

    # Creating an instance of DropboxRunner with the required configurations
    runner = DropboxRunner(
        processor_config=ProcessorConfig(
            verbose=True,  # Enable verbose output
            output_dir="dropbox-output",  # Directory to store output locally
            num_processes=2,  # Number of processes to use
        ),
        read_config=ReadConfig(),  # Configuration for reading data
        partition_config=PartitionConfig(),  # Configuration for partitioning data
        connector_config=SimpleDropboxConfig(
            access_config=DropboxAccessConfig(token=os.getenv("DROPBOX_ACCESS_TOKEN")),  # Access config using environment variable for Dropbox token
            remote_url="dropbox file URL",  # URL of the Dropbox file or folder
            recursive=True,  # Whether to recursively read files in the directory
        ),
        writer=writer,  # Writer to use for output
        writer_kwargs={},  # Additional arguments for the writer
    )

    # Running the DropboxRunner
    runner.run()
Output
2024-06-11 10:03:55,063 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/dropbox/a5d8d1c6ed
2024-06-11 10:03:55,068 MainProcess INFO running pipeline: DocFactory -> Reader -> Partitioner -> Writer -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "dropbox-output", "num_processes": 2, "raise_on_error": false}
2024-06-11 10:03:55,152 MainProcess INFO Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "dropbox-output", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/dropbox/a5d8d1c6ed", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"remote_url": "dropbox://test", "uncompress": false, "recursive": true, "file_glob": null, "access_config": {"token": "*******"}, "protocol": "dropbox", "path_without_protocol": "test", "dir_path": "test", "file_path": ""}}
2024-06-11 10:03:55,568 MainProcess INFO processing 2 docs via 2 processes
2024-06-11 10:03:55,571 MainProcess INFO Calling Reader with 2 docs
2024-06-11 10:03:55,573 MainProcess INFO Running source node to download data associated with ingest docs
2024-06-11 10:04:03,339 MainProcess INFO Calling Partitioner with 2 docs
2024-06-11 10:04:03,341 MainProcess INFO Running partition node to extract content from json files. Config: {"pdf_infer_table_structure": false, "strategy": "auto", "ocr_languages": null, "encoding": null, "additional_partition_args": {}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": "*******", "hi_res_model_name": null}, partition kwargs: {}]
2024-06-11 10:04:03,346 MainProcess INFO Creating /root/.cache/unstructured/ingest/pipeline/partitioned
2024-06-11 10:04:14,063 MainProcess INFO Calling Copier with 1 docs
2024-06-11 10:04:14,067 MainProcess INFO Running copy node to move content to desired output location
2024-06-11 10:04:15,970 MainProcess INFO uploading elements from 1 document(s) to the destination
2024-06-11 10:04:15,972 MainProcess INFO Calling Writer with 1 docs
2024-06-11 10:04:15,975 MainProcess INFO Running write node to upload content. Destination connector: {"write_config": {"batch_size": 50}, "connector_config": {"access_config": {"api_key": "*******"}, "app_id": "unst-clf", "user_id": "8tzpjy1a841y", "dataset_id": null}, "_client": null}]
2024-06-11 10:04:16,425 MainProcess INFO Extending 506 json elements from content in dropbox-output/Crawfords_Auto_Repair_Guide.txt.json
2024-06-11 10:04:16,445 MainProcess INFO writing 506 objects to destination app unst-clf
2024-06-11 10:04:19 INFO clarifai.client.input: input.py:706
Inputs Uploaded
code: SUCCESS
description: "Ok"
details: "All inputs successfully added"
req_id: "2216655c3ae641f1b3789c45f367fdd0"

Chat

In the final step, we are going to perform information retrieval using RAG on the data we ingested from Dropbox into the Clarifai app using Unstructured.io. You can use a workflow with a RAG prompter to initialize RAG. After successfully creating a workflow, you can get its URL from the Clarifai portal. Once you create the RAG object from the workflow URL, you can start retrieving text from the data we ingested using Unstructured.io.

from clarifai.rag import RAG

WORKFLOW_URL = 'rag_workflow_url'
# Creating the RAG object with the prebuilt workflow
rag_object_from_url = RAG(workflow_url=WORKFLOW_URL)
result = rag_object_from_url.chat(messages=[{"role": "human", "content": "what is brake fluid"}])
# Extract the content of the response and split it by the newline character ('\n') into the list 'answer'.
answer = result[0]["content"].split('\n')
print(answer)
Output
'Brake fluid is a type of hydraulic fluid used in hydraulic brake and hydraulic clutch applications in vehicles. 
It is responsible for transferring force into pressure, and to amplify braking force.
The level of brake fluid in a vehicle can be checked via a clear reservoir, and it should ideally be between the minimum and maximum level marks. If the fluid is low, it could indicate a potential issue and might require a visit to a mechanic. Most vehicles also have a dashboard light that illuminates when the brake fluid is low.'
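
You can reuse the same RAG object for further questions. Below is a minimal sketch using the same chat call shown above, with a hypothetical follow-up question:

# Hypothetical follow-up question using the same RAG object
follow_up = rag_object_from_url.chat(
    messages=[{"role": "human", "content": "how do I check the brake fluid level"}]
)
print(follow_up[0]["content"])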