Fetch Annotations
Seamlessly retrieve annotations from Clarifai into Databricks
The Clarifai platform allows you to annotate your inputs, enriching them with valuable labels and metadata.
You can fetch annotations from a Clarifai application into Databricks, which makes labeled data from Clarifai available to machine learning workflows that run in Databricks.
Annotated data improves the quality of training data, which in turn improves the accuracy and performance of machine learning models.
The sections below show how to transfer annotations from Clarifai into the Databricks environment.
Prerequisites
- Databricks notebook development environment
- Get your PAT (Personal Access Token) from the Clarifai portal under the Settings/Security section
- Get your Clarifai user ID
- Get the ID of the Clarifai app from which you want to fetch the annotations
- Get the ID of the dataset in that app that contains the annotations
- Install the Clarifai PySpark package by running pip install clarifai-pyspark
- Install Protocol Buffers by running pip install protobuf==4.24.2. It’s a cross-platform serialization protocol that describes the structure of the data to be sent
You can learn how to authenticate with the Clarifai platform here.
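The prerequisites above boil down to four values: the PAT, the user ID, the app ID, and the dataset ID. As a minimal sketch, a small check can catch a missing value before any request is sent. The helper name find_unset is hypothetical and not part of clarifai-pyspark, and the check assumes the YOUR_..._HERE placeholder convention used in this guide's examples.

```python
# Hypothetical helper, not part of clarifai-pyspark.
# Assumes unset values are empty or still start with "YOUR_",
# as in the placeholder strings used in this guide's examples.
PLACEHOLDER_PREFIX = "YOUR_"


def find_unset(**settings):
    """Return the names of settings that are empty or still hold a placeholder."""
    return [
        name
        for name, value in settings.items()
        if not value or value.startswith(PLACEHOLDER_PREFIX)
    ]
```

You could call it as find_unset(PAT=PAT, USER_ID=USER_ID, APP_ID=APP_ID, DATASET_ID=DATASET_ID) and stop early if the returned list is not empty.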
Retrieve Annotations in JSON Format
You can retrieve the annotations in your Clarifai app’s dataset as a JSON response that contains the details of each annotation.
Optionally, you can pass a list of input IDs to fetch the annotations for those inputs only.
- Python
######################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
#####################################################################################
# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################
# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark
# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)
# Retrieve annotations in JSON format
annotations_response = list(dataset_obj.list_annotations(input_ids=None))
print(annotations_response)
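The example above passes input_ids=None, which fetches annotations for the whole dataset. To restrict the call to specific inputs, a list of IDs can be passed instead. The wrapper below is a sketch only: fetch_annotations is a hypothetical name, and the only library call it relies on is list_annotations(input_ids=...) as used above. Dropping duplicate IDs first is a design choice made here, not a documented requirement.

```python
def fetch_annotations(dataset_obj, input_ids=None):
    """Return annotations as a list, optionally limited to the given input IDs.

    dataset_obj is the object returned by cspark_obj.dataset(...).
    """
    if input_ids is not None:
        # Drop duplicate IDs while keeping their original order
        input_ids = list(dict.fromkeys(input_ids))
    return list(dataset_obj.list_annotations(input_ids=input_ids))
```

For instance, fetch_annotations(dataset_obj, ["INPUT_ID_1", "INPUT_ID_2"]) would request annotations for two inputs, where both IDs are placeholders for IDs from your own app.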
Retrieve Annotations as a Dataframe
You can retrieve detailed information about your annotations in a structured dataframe format. The dataframe includes key columns like annotation_id, annotation, annotation_user_id, input_id, annotation_created_at, and annotation_modified_at.
Note that the JSON response may include additional attributes that are not among the dataframe’s columns.
Optionally, you can pass a list of input IDs to fetch the annotations for those inputs only.
- Python
######################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
######################################################################################
# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################
# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark
# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)
# Retrieve annotations as a dataframe
annotations_df = dataset_obj.export_annotations_to_dataframe(input_ids=None)
print(annotations_df)
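Before building on the exported dataframe, it can help to confirm that the key columns listed in this section are present. The following is a sketch under one assumption: the returned object exposes a .columns attribute, as both Spark and pandas dataframes do. The name missing_columns is hypothetical, and the column list is copied from the description above.

```python
# Key columns named in this section
EXPECTED_COLUMNS = [
    "annotation_id",
    "annotation",
    "annotation_user_id",
    "input_id",
    "annotation_created_at",
    "annotation_modified_at",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df.columns."""
    present = set(df.columns)
    return [name for name in expected if name not in present]
```

An empty result from missing_columns(annotations_df) means all of the listed columns are available.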
Retrieve Inputs With Annotations as a Dataframe
You can retrieve inputs together with their annotations in a single structured dataframe, so that input details and their associated annotations arrive in one call.
The resulting dataframe combines the information from the annotations dataframe and the inputs dataframe.
- Python
#####################################################################################
# In this section, we set the user authentication, user and app ID, and dataset ID.
# Change these strings to run your own example.
#####################################################################################
# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################
# Import the required packages
import os
from clarifaipyspark.client import ClarifaiPySpark
# Set Clarifai PAT as environment variable
os.environ["CLARIFAI_PAT"] = PAT
# Create a Clarifai-PySpark client object to connect to your app on Clarifai
cspark_obj = ClarifaiPySpark(user_id=USER_ID, app_id=APP_ID)
# Specify the dataset
dataset_obj = cspark_obj.dataset(dataset_id=DATASET_ID)
# Retrieve inputs with annotations as a dataframe
dataset_df = dataset_obj.export_dataset_to_dataframe(
    input_type="image"  # Or, specify as "text"
)
print(dataset_df)
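The example above exports one input type per call. If a dataset holds both images and text, one possible approach is to run the export once per type and keep the results side by side. This is a sketch, not a feature of the library: export_by_input_type is a hypothetical name, and it uses only the export_dataset_to_dataframe(input_type=...) call shown above with the two values mentioned there.

```python
def export_by_input_type(dataset_obj, input_types=("image", "text")):
    """Return a dict that maps each input type to its exported dataframe.

    dataset_obj is the object returned by cspark_obj.dataset(...).
    """
    return {
        input_type: dataset_obj.export_dataset_to_dataframe(input_type=input_type)
        for input_type in input_types
    }
```

The result could then be read as frames = export_by_input_type(dataset_obj), with frames["image"] and frames["text"] holding the two dataframes.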
You can get examples for integrating Clarifai with Databricks here.