Upload Data

Seamlessly upload your data from Databricks into Clarifai


You can ingest datasets from Databricks into your Clarifai environment. This allows you to easily take advantage of Clarifai's AI capabilities to analyze and extract insights from your data, without having to manually move it between the two platforms.

Once your data has been uploaded, you can put those capabilities to work. For example, you can use Clarifai to identify and classify objects and scenes in images.

Let’s illustrate how you can seamlessly transfer data from a Databricks volume to Clarifai.

Prerequisites

  • A Databricks notebook development environment, with your Databricks workspace enabled for Unity Catalog
  • Get the path URL of the Databricks location containing your data
  • Get your PAT (Personal Access Token) from the Clarifai portal under the Settings/Security section
  • Get your Clarifai user ID
  • Get the ID of the Clarifai app where you want to upload the data
  • Get the ID of a dataset within your app
  • Install the Clarifai PySpark package by running pip install clarifai-pyspark
  • Install Protocol Buffers by running pip install protobuf==4.24.2. It’s a cross-platform serialization protocol that describes the structure of the data to be sent. (Both installs are combined in the example cell below.)
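
For example, in a Databricks notebook you can install both packages in a single cell:

%pip install clarifai-pyspark protobuf==4.24.2
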
info

You can learn how to authenticate with the Clarifai platform here.

Upload Data From a Volume Folder

You can upload images or text files stored in a Databricks volume to your Clarifai application. It’s important to ensure that the folder exclusively contains either images or text files.

###################################################################################
# In this section, we set the user authentication, user and app ID, dataset ID,
# and Databricks folder path. Change these strings to run your own example.
##################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
# URL path of your Databricks folder; Example: "/Volumes/test1/default/volume1/folder1"
FOLDER_PATH = "YOUR_DATABRICKS_FOLDER_PATH_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# This creates a new dataset in the app if it doesn't already exist
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Upload from a volume folder
dataset_obj.upload_dataset_from_folder(
    folder_path=FOLDER_PATH,
    input_type="image",
    # input_type="text", # Or, specify to upload text data
    labels=False,  # Set to True if the folder name serves as the label for all the images within it
)
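
Before running the upload, you can optionally list the folder from your notebook to confirm it holds only one kind of file. This quick check uses Databricks utilities, not the Clarifai API:

# List the folder contents; it should contain only images (or only text files)
display(dbutils.fs.ls(FOLDER_PATH))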

Upload Data From CSV

You can upload data from a CSV file stored in a Databricks volume to your Clarifai application. The CSV file must include two essential columns: inputid and input.

You can also include additional supported columns such as concepts, metadata, and geopoints. The input column is versatile: it can hold a file URL, a file path, or raw text.
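
For illustration, a raw-text CSV with the optional concepts column might look like this (the values are made up):

inputid,input,concepts
msg-001,"Congratulations! You have won a prize.",spam
msg-002,"Are we still meeting for lunch today?",ham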

###################################################################################
# In this section, we set the user authentication, user and app ID, dataset ID,
# and Databricks CSV path. Change these strings to run your own example.
##################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
# URL path of your Databricks CSV file; Example: "/Volumes/test1/default/volume1/SMS_train_1.csv"
CSV_PATH = "YOUR_DATABRICKS_CSV_PATH_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# This creates a new dataset in the app if it doesn't already exist
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Upload from CSV
dataset_obj.upload_dataset_from_csv(
    csv_path=CSV_PATH,
    input_type="text",
    # input_type="image", # Or, specify to upload image data
    labels=False,  # Set to True if "concepts" column exists
    csv_type="raw",  # Or, specify as "url" or "filepath"
    source="volume",  # Specify as "s3" to use a CSV file directly from your AWS S3 bucket
)
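
Before uploading, you can optionally sanity-check the CSV with standard PySpark (the spark session is pre-created in Databricks notebooks):

# Confirm the CSV has the required "inputid" and "input" columns
df = spark.read.csv(CSV_PATH, header=True)
df.printSchema()
df.show(5, truncate=False)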

Upload From Delta Table

You can upload data from a Delta table in a Databricks volume to your Clarifai application. The table must include two essential columns: inputid and input.

You can also include additional supported columns such as concepts, metadata, and geopoints. The input column is versatile: it can hold a file URL, a file path, or raw text.

################################################################################### 
# In this section, we set the user authentication, user and app ID, dataset ID,
# and Databricks delta table path. Change these strings to run your own example.
##################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
# URL path of your Databricks delta table
TABLE_PATH = "YOUR_DATABRICKS_TABLE_PATH_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# This creates a new dataset in the app if it doesn't already exist
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Upload from delta table
dataset_obj.upload_dataset_from_table(
    table_path=TABLE_PATH,
    input_type="text",
    # input_type="image", # Or, specify to upload image data
    labels=False,  # Set to True if "concepts" column exists
    table_type="raw",  # Or, specify as "url" or "filepath"
)
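
As with the CSV example, you can optionally preview the Delta table first using standard PySpark:

# Confirm the Delta table has the required "inputid" and "input" columns
df = spark.read.format("delta").load(TABLE_PATH)
df.printSchema()
df.show(5, truncate=False)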

Upload From Dataframe

You can upload data from a PySpark DataFrame in your Databricks environment to your Clarifai application. The DataFrame must include two essential columns: inputid and input.

You can also include additional supported columns such as concepts, metadata, and geopoints. The input column is versatile: it can hold a file URL, a file path, or raw text.

######################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
#####################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# This creates a new dataset in the app if it doesn't already exist
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)
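
# The upload call below expects a PySpark DataFrame named spark_dataframe with
# at least the "inputid" and "input" columns. The DataFrame built here is only
# an illustrative placeholder; replace it with your own. (The spark session is
# pre-created in Databricks notebooks.)
spark_dataframe = spark.createDataFrame(
    [
        ("text-1", "This is a sample text input"),
        ("text-2", "Another sample text input"),
    ],
    schema=["inputid", "input"],
)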

# Upload from dataframe
dataset_obj.upload_dataset_from_dataframe(
    dataframe=spark_dataframe,
    input_type="text",
    # input_type="image", # Or, specify to upload image data
    labels=False,  # Set to True if "concepts" column exists
    csv_type="raw",  # Or, specify as "url" or "filepath"
)

Upload With Custom Dataloader

You can use the custom dataloader option if your dataset is stored in a different format or requires preprocessing. This gives you the flexibility to provide a dataloader tailored to your specific requirements.

For reference, you can explore a variety of data loader examples here.

Ensure that the files and folders needed by the dataloader are stored in Databricks volume storage so they can be accessed during the upload.
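
To give a sense of the expected shape of such a module, below is a minimal sketch of a visual-classification dataloader following the Clarifai Python SDK's dataloader pattern. The base class, feature type, task string, and folder layout are assumptions based on that pattern; check the examples linked above for the exact interface.

# dataset.py -- a minimal dataloader sketch (assumed interface; verify against the examples)
import os

# These imports follow the Clarifai Python SDK's dataloader pattern
from clarifai.datasets.upload.base import ClarifaiDataLoader
from clarifai.datasets.upload.features import VisualClassificationFeatures

class MyImageDataLoader(ClarifaiDataLoader):
    """Loads images from label-named subfolders of a volume (hypothetical path)."""

    def __init__(self, split="train"):
        self.split = split
        self.image_dir = "/Volumes/test1/default/volume1/images"  # hypothetical
        self.data = []
        self.load_data()

    @property
    def task(self):
        # Task string per the SDK pattern; verify the expected value
        return "visual_classification"

    def load_data(self):
        # Treat each subfolder name as the label for the images inside it
        for label in os.listdir(self.image_dir):
            folder = os.path.join(self.image_dir, label)
            for fname in os.listdir(folder):
                self.data.append((os.path.join(folder, fname), label))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        image_path, label = self.data[index]
        return VisualClassificationFeatures(image_path=image_path, labels=[label])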

################################################################################### 
# In this section, we set the user authentication, user and app ID, dataset ID,
# and volume module path. Change these strings to run your own example.
##################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
# URL path of your Databricks volume
VOLUME_MODULE_PATH = "YOUR_VOLUME_MODULE_PATH_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# This creates a new dataset in the app if it doesn't already exist
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Upload with custom dataloader
dataset_obj.upload_dataset_from_dataloader(
    task="visual-classification",
    split="train",
    module_dir=VOLUME_MODULE_PATH,
)
info

You can get examples for integrating Clarifai with Databricks here.