Fetch Data

Seamlessly retrieve your data from Clarifai into Databricks


After using Clarifai for tasks such as image recognition and analysis, you may want to bring the results or the processed data into Databricks for deeper exploration, analysis, or integration with other data sources.

This page shows how to transfer data from Clarifai into the Databricks environment.

Prerequisites

  • Databricks notebook development environment
  • Get your PAT (Personal Access Token) from the Clarifai portal, under the Settings/Security section
  • Get your Clarifai user ID
  • Get the ID of the Clarifai app you want to fetch data from
  • Get the ID of the dataset in that app that contains the data
  • Install the Clarifai PySpark package by running pip install clarifai-pyspark
  • Install Protocol Buffers by running pip install protobuf==4.24.2. Protocol Buffers is a cross-platform serialization format that describes the structure of the data to be sent

You can learn how to authenticate with the Clarifai platform here.
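
The examples on this page paste the PAT and IDs directly into the notebook. As an alternative, here is a minimal sketch that reads them from environment variables and rejects empty or placeholder values. Only CLARIFAI_PAT appears in the examples on this page; the other variable names and the load_clarifai_settings helper are assumptions made for illustration, not part of the Clarifai package.

```python
import os

# The examples on this page use placeholders such as "YOUR_PAT_HERE".
PLACEHOLDER_SUFFIX = "_HERE"

# CLARIFAI_PAT is the variable used by the examples on this page.
# The other three names are hypothetical and chosen for this sketch only.
SETTING_NAMES = (
    "CLARIFAI_PAT",
    "CLARIFAI_USER_ID",
    "CLARIFAI_APP_ID",
    "CLARIFAI_DATASET_ID",
)


def load_clarifai_settings(env=os.environ):
    """Return the settings as a dict, or raise if any is missing or a placeholder."""
    settings = {}
    missing = []
    for name in SETTING_NAMES:
        value = env.get(name, "").strip()
        if not value or value.endswith(PLACEHOLDER_SUFFIX):
            missing.append(name)
        else:
            settings[name] = value
    if missing:
        raise ValueError("Missing or placeholder values: " + ", ".join(missing))
    return settings
```

The returned values could then be passed as user_id, app_id, and dataset_id in the examples that follow.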

Retrieve Data Files in JSON Format

You can retrieve detailed information about the input data in your Clarifai app’s dataset. You’ll get a JSON response containing comprehensive details about the dataset files.

Use the input_type parameter to restrict retrieval to one data file type. You can specify "image", "video", "audio", or "text" to obtain the details relevant to that type.

#######################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
#######################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Retrieve data files in JSON format
inputs_response = list(
    dataset_obj.list_inputs(
        input_type="image"  # Or, specify as "video", "audio", or "text"
    )
)
print(inputs_response)
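
Since input_type accepts only the four values named above, it can help to check the value before calling list_inputs(). The following is a small sketch of such a check; normalize_input_type is a hypothetical helper, not a function of the Clarifai package.

```python
# The four values this page lists for list_inputs().
LIST_INPUTS_TYPES = ("image", "video", "audio", "text")


def normalize_input_type(input_type, allowed=LIST_INPUTS_TYPES):
    """Lower-case and trim the value, then reject anything outside `allowed`."""
    cleaned = input_type.strip().lower()
    if cleaned not in allowed:
        raise ValueError(
            f"Unsupported input_type {input_type!r}; expected one of {allowed}"
        )
    return cleaned
```

For example, dataset_obj.list_inputs(input_type=normalize_input_type("Image")) would pass "image" to the call.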

Retrieve Data Files as a Dataframe

You can retrieve detailed information about your data files in a structured dataframe format. The dataframe includes key columns like input_id, image_url/text_url, image_info/text_info, input_created_at, and input_modified_at.

Specify the input_type parameter to tailor the results to one type, either "image" or "text".

Note that the JSON response may include attributes beyond the columns in the dataframe.

#######################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
######################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark

# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Retrieve data files as a dataframe
inputs_df = dataset_obj.export_inputs_to_dataframe(
    input_type="image"  # Or, specify as "text"
)
print(inputs_df)
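
The column names described above depend on the input type: image_url and image_info for images, text_url and text_info for text. The sketch below spells out the key columns for each type. expected_columns is a hypothetical helper, and the column order is an assumption based on the order in which the columns are listed on this page.

```python
# The two types this page lists for export_inputs_to_dataframe().
DATAFRAME_INPUT_TYPES = ("image", "text")


def expected_columns(input_type):
    """Return the key dataframe columns described on this page for one input type."""
    if input_type not in DATAFRAME_INPUT_TYPES:
        raise ValueError(f"input_type must be one of {DATAFRAME_INPUT_TYPES}")
    return [
        "input_id",
        f"{input_type}_url",   # image_url or text_url
        f"{input_type}_info",  # image_info or text_info
        "input_created_at",
        "input_modified_at",
    ]
```

Comparing this list against the columns of the returned dataframe is one way to confirm that the export matches the requested type.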

Retrieve Data Files to Databricks Volume

You can effortlessly download image and text files from your Clarifai app’s dataset to your Databricks volume.

Specify the path in the volume where the retrieved data will be stored, and pass the response obtained from the list_inputs() function as the input_response parameter.

##################################################################################
# In this section, we set the user authentication, user and app ID, dataset ID,
# and destination volume path. Change these strings to run your own example.
##################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
# URL path of your Databricks volume
DESTINATION_VOLUME_PATH = "YOUR_DATABRICKS_VOLUME_PATH_HERE"

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark
# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)

# Retrieve data files in JSON format
inputs_response = list(
    dataset_obj.list_inputs(
        input_type="image"  # Or, specify as "text"
    )
)
# For images
dataset_obj.export_images_to_volume(
    path=DESTINATION_VOLUME_PATH,
    input_response=inputs_response,
)
# For text
# dataset_obj.export_text_to_volume(
#     path=DESTINATION_VOLUME_PATH,
#     input_response=inputs_response,
# )
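
The example above switches between images and text by commenting lines in and out. As an alternative, here is a minimal sketch that picks the export method from the input type. export_dataset_to_volume is a hypothetical wrapper; it assumes only the list_inputs(), export_images_to_volume(), and export_text_to_volume() calls shown on this page, with the same keyword arguments.

```python
def export_dataset_to_volume(dataset_obj, input_type, volume_path):
    """List the inputs of one type and export them with the matching method.

    Returns the number of inputs passed to the export call.
    """
    # Method names are the ones used in the example on this page.
    exporters = {
        "image": "export_images_to_volume",
        "text": "export_text_to_volume",
    }
    if input_type not in exporters:
        raise ValueError(f"input_type must be one of {sorted(exporters)}")

    inputs_response = list(dataset_obj.list_inputs(input_type=input_type))
    export = getattr(dataset_obj, exporters[input_type])
    export(path=volume_path, input_response=inputs_response)
    return len(inputs_response)
```

With the objects from the example above, a call might look like export_dataset_to_volume(dataset_obj, "text", DESTINATION_VOLUME_PATH).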


You can get examples for integrating Clarifai with Databricks here.