Managing Datasets

Learn how to interact with datasets using Clarifai SDKs


A dataset is a structured collection of data that serves as the foundation for training and evaluating machine learning models. Effective dataset management is essential to ensure data quality, consistency, and accessibility throughout the entire development lifecycle, from initial data collection to model deployment.

Clarifai's SDKs let you organize, annotate, and use your datasets programmatically in your AI applications.

Creating Datasets

You can use the Clarifai SDKs to create a dataset within your app. To do so, call the API with a unique dataset ID. You can then tailor the dataset to the specific needs of your application.

Visit this link for more information.

from clarifai.client.app import App

app = App(app_id="test_app", user_id="user_id", pat="YOUR_PAT")
# Provide the dataset ID as a parameter to the create_dataset function
dataset = app.create_dataset(dataset_id="first_dataset")
Output
2024-01-19 14:22:26 INFO     clarifai.client.app:                                                        app.py:310
Dataset created
code: SUCCESS
description: "Ok"
req_id: "1dd6eeb1a82394a9a92becee55faf50e"
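If a script may run more than once, it can help to look for the dataset before creating it. The following is a minimal sketch rather than part of the SDK: `ensure_dataset` is a hypothetical helper, and it assumes that `app.list_datasets()` yields objects that expose an `id` attribute.

```python
def ensure_dataset(app, dataset_id):
    """Return (dataset, created): reuse the dataset if it exists, else create it.

    `app` is expected to behave like clarifai.client.app.App.
    Assumption: list_datasets() yields objects exposing an `id` attribute.
    """
    for existing in app.list_datasets():
        if existing.id == dataset_id:
            return existing, False  # found, nothing created
    return app.create_dataset(dataset_id=dataset_id), True


# Usage with the real client (requires network access and a valid PAT):
# from clarifai.client.app import App
# app = App(app_id="test_app", user_id="user_id", pat="YOUR_PAT")
# dataset, created = ensure_dataset(app, "first_dataset")
```

The helper only relies on the two methods it calls, so it can be exercised with a stand-in object before pointing it at a real app.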

Create a Dataset Version

You can use the Clarifai SDKs to create a new version of an existing dataset, identified by its dataset ID. Versions let you manage and track different iterations of a dataset.

Visit this page for more information.

from clarifai.client.dataset import Dataset
# Create a dataset object
dataset = Dataset(dataset_id='first_dataset', user_id='user_id', app_id='test_app', pat='YOUR_PAT')
# Create a new version of the dataset
dataset_version = dataset.create_version(description='dataset_version_description')
Output
2024-01-19 14:26:31 INFO     clarifai:                                                                dataset.py:96
Dataset Version created
code: SUCCESS
description: "Ok"
req_id: "14802ff0826d6487dc454aa39877667e"
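Because the description is the only free-text field passed to `create_version` in the example above, one option is to record when and why each version was taken. This is a sketch under that assumption: `describe_version` and `snapshot` are hypothetical helpers, and the description format is arbitrary.

```python
from datetime import datetime, timezone


def describe_version(note, now=None):
    """Build a version description that records when the snapshot was taken."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%Y-%m-%d %H:%M} UTC - {note}"


def snapshot(dataset, note):
    """Create a new version of `dataset` with a timestamped description.

    `dataset` is expected to behave like clarifai.client.dataset.Dataset.
    """
    return dataset.create_version(description=describe_version(note))


# Usage with the real client (requires network access and a valid PAT):
# snapshot(dataset, "after relabeling batch 3")
```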

Patch a Dataset

You can apply patch operations to a dataset — merging, removing, or overwriting data. While all these actions support overwriting by default, they have specific behaviors when handling lists of objects.

  • The merge action replaces a key:value pair with key:new_value, or appends to an existing list. For dictionaries, it merges entries that share the same id field.
  • The remove action is only used to delete the dataset's cover image on the platform UI.
  • The overwrite action completely replaces an existing object with a new one.

Below is an example of patching a dataset to update its description, notes, and image URL.

from clarifai.client.app import App

app = App(app_id="YOUR_APP_ID_HERE", user_id="YOUR_USER_ID_HERE", pat="YOUR_PAT_HERE")

# Update the dataset by merging the new description and notes
app.patch_dataset(dataset_id='YOUR_DATASET_ID_HERE', action='merge', description='Demo testing', notes="Hi Guys! This note is for Demo")

# Update the dataset's image URL with a new one
app.patch_dataset(dataset_id='YOUR_DATASET_ID_HERE', action='merge', image_url='https://samples.clarifai.com/metro-north.jpg')

# Remove the dataset's image by specifying the 'remove' action
app.patch_dataset(dataset_id='YOUR_DATASET_ID_HERE', action='remove', image_url='https://samples.clarifai.com/metro-north.jpg')
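The difference between the merge and overwrite actions described above can be illustrated locally on plain dictionaries. This is only a sketch of one reading of those rules, not the platform's implementation: the `merge` and `overwrite` functions, the record layout, and the `tags` key are all hypothetical.

```python
def merge(current, patch):
    """Illustrative 'merge': replace scalars, append to lists, and combine
    list entries (dictionaries) that share the same "id"."""
    result = dict(current)
    for key, new_value in patch.items():
        old_value = result.get(key)
        if isinstance(old_value, list) and isinstance(new_value, list):
            result[key] = _merge_lists(old_value, new_value)
        else:
            result[key] = new_value
    return result


def _merge_lists(old_items, new_items):
    # Copy dictionaries so the original record is left untouched.
    merged = [dict(i) if isinstance(i, dict) else i for i in old_items]
    for item in new_items:
        match = None
        if isinstance(item, dict) and "id" in item:
            match = next((m for m in merged
                          if isinstance(m, dict) and m.get("id") == item["id"]), None)
        if match is not None:
            match.update(item)   # same id: merge fields into the existing entry
        else:
            merged.append(item)  # no matching id: append to the list
    return merged


def overwrite(current, patch):
    """Illustrative 'overwrite': every patched key is replaced wholesale."""
    return {**current, **patch}


record = {"description": "old", "tags": [{"id": "a", "v": 1}]}
patch = {"description": "Demo testing", "tags": [{"id": "a", "w": 2}, {"id": "b"}]}
print(merge(record, patch))      # entry "a" keeps "v" and gains "w"
print(overwrite(record, patch))  # entry "a" keeps only what the patch supplied
```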

List Dataset Inputs

You can list the inputs in a dataset by providing the dataset ID.

from clarifai.client.input import Inputs

# Replace "user_id", "app_id", "pat", and "dataset_id" with your own values
input_obj = Inputs(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", pat="YOUR_PAT_HERE")

inputs_generator = input_obj.list_inputs(dataset_id="YOUR_DATASET_ID_HERE")

inputs = list(inputs_generator)

print(inputs)
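Calling `list()` on the generator, as above, materializes every input in the dataset. For a quick look at a large dataset, it may be enough to take only the first few items. The sketch below uses `itertools.islice` for that; `preview_ids` is a hypothetical helper and assumes each listed input exposes an `id` attribute.

```python
from itertools import islice


def preview_ids(inputs, limit=5):
    """Return the IDs of at most `limit` inputs without consuming the rest.

    Assumption: each listed input exposes an `id` attribute.
    """
    return [item.id for item in islice(inputs, limit)]


# Usage with the real client (requires network access and a valid PAT):
# inputs_generator = input_obj.list_inputs(dataset_id="YOUR_DATASET_ID_HERE")
# print(preview_ids(inputs_generator, limit=3))
```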

Merge Dataset

Here's an example of merging a dataset with the ID merge_dataset_id into another dataset with the ID dataset_id using the merge_dataset method of the Dataset class.

Note that all inputs from the source dataset (merge_dataset_id) will be added to the target dataset (dataset_id).

from clarifai.client.dataset import Dataset

# Replace "user_id", "app_id", "pat", and "dataset_id" with your own values
dataset = Dataset(user_id="YOUR_USER_ID_HERE", app_id="YOUR_APP_ID_HERE", dataset_id="dataset_id", pat="YOUR_PAT_HERE")

dataset.merge_dataset(merge_dataset_id="merge_dataset_id")
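When several source datasets need to be folded into one target, the same call can be repeated in a loop. The following sketch assumes nothing beyond the `merge_dataset(merge_dataset_id=...)` call shown above; `merge_sources` is a hypothetical helper that also skips duplicate IDs and the target's own ID.

```python
def merge_sources(target, target_id, source_ids):
    """Merge each source dataset into `target`, skipping the target itself.

    `target` is expected to behave like clarifai.client.dataset.Dataset.
    Returns the IDs that were actually merged, in order.
    """
    merged = []
    for source_id in dict.fromkeys(source_ids):  # drop duplicates, keep order
        if source_id == target_id:
            continue  # do not try to merge a dataset into itself
        target.merge_dataset(merge_dataset_id=source_id)
        merged.append(source_id)
    return merged


# Usage with the real client (requires network access and a valid PAT):
# merge_sources(dataset, "dataset_id", ["merge_dataset_id", "another_dataset_id"])
```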

Export Dataset

With the API, you can export a dataset as a compressed zip file. This is useful for archiving your data for backup, sharing datasets across applications, or analyzing the data externally.

Visit this page for more information.

info

The clarifai-data-protobuf.zip file can be downloaded from the dataset section in the portal.

from clarifai.client.dataset import Dataset
import os

os.environ["CLARIFAI_PAT"]="YOUR_PAT"

# The “clarifai-data-protobuf.zip” file can be downloaded from the dataset section in the portal.
Dataset().export(save_path='path to output.zip file', local_archive_path='path to clarifai-data-protobuf.zip file')
Output
2024-01-22 15:14:03 INFO     clarifai.datasets.export.inputs_annotations:  path:           inputs_annotations.py:48
Downloads/clarifai-data-protobuf.zip
INFO clarifai.datasets.export.inputs_annotations: Obtained file inputs_annotations.py:56
name list. 1 entries.
Downloading Dataset: 100%|██████████| 13/13 [00:00<00:00, 6486.55it/s]
2024-01-22 15:14:04 INFO clarifai.datasets.export.inputs_annotations: Downloaded 0 inputs_annotations.py:221
inputs+annotations to output.zip
INFO clarifai.datasets.export.inputs_annotations: closing file inputs_annotations.py:92
objects.
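After an export, a quick sanity check is to open the resulting zip file and count what it contains. The sketch below uses only Python's standard `zipfile` module and makes no assumption about the layout of the exported archive; `summarize_archive` is a hypothetical helper.

```python
import zipfile


def summarize_archive(path):
    """Return (entry_count, total_uncompressed_bytes) for a zip archive."""
    with zipfile.ZipFile(path) as archive:
        infos = [info for info in archive.infolist() if not info.is_dir()]
        return len(infos), sum(info.file_size for info in infos)


# Usage, once the export has written the file:
# count, size = summarize_archive("output.zip")
# print(f"{count} files, {size} bytes uncompressed")
```

An archive with zero entries, as the "Downloaded 0 inputs+annotations" line in the log above suggests, would show up here as a count of 0.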

SDH Enabled Inputs Download

This feature lets you list and download inputs that are served through Secure Data Hosting (SDH). The example below lists one image input and writes its bytes to a local file.

Visit this page for more information.

from clarifai.client.input import Inputs

# No pat argument is passed here; this assumes the CLARIFAI_PAT environment variable is set
input_obj = Inputs(user_id='user_id', app_id='test_app')

# List inputs
input_generator = input_obj.list_inputs(page_no=1, per_page=1, input_type='image')
inputs_list = list(input_generator)

# Download the inputs and write the first one to disk
input_bytes = input_obj.download_inputs(inputs_list)
with open('demo.jpg', 'wb') as f:
    f.write(input_bytes[0])
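The example above writes only the first downloaded input. To keep every item returned by `download_inputs`, the payloads can be written out in a loop. This is a sketch that assumes `input_bytes` is a list of `bytes` objects, as the indexing above suggests; `save_all`, the file naming scheme, and the `.jpg` suffix are illustrative choices.

```python
from pathlib import Path


def save_all(input_bytes, out_dir, suffix=".jpg"):
    """Write each payload to out_dir as input_000.jpg, input_001.jpg, and so on.

    Assumption: `input_bytes` is a list of bytes objects. The suffix is not
    derived from the data, so pass one that matches the input type you listed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, payload in enumerate(input_bytes):
        path = out / f"input_{index:03d}{suffix}"
        path.write_bytes(payload)
        paths.append(path)
    return paths


# Usage with the real client (requires network access and a valid PAT):
# saved = save_all(input_obj.download_inputs(inputs_list), "downloads")
```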

Delete Dataset Version

You can delete a specific version of a dataset through the API by providing its version ID. This is useful when refining your dataset collection, freeing storage, or removing inaccurate data.

Visit this page for more information.

caution

Be certain that you want to delete a particular dataset version as the operation cannot be undone.

from clarifai.client.dataset import Dataset


# Create a dataset object
dataset = Dataset(dataset_id='first_dataset', user_id='user_id', app_id='test_app')
# Delete the dataset version
dataset.delete_version(version_id='dataset_version')
Output
2024-01-22 15:22:25 INFO clarifai: dataset.py:124
Dataset Version Deleted
code: SUCCESS
description: "Ok"
details: "Dataset version \'a4b032e9083f4cbfbdfe5617b1a4d5e7\' deleted"
req_id: "1fac439af87c37dae27684fcbe49b80b"
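Since the deletion cannot be undone, a script may want to require an explicit confirmation before calling `delete_version`. The following is one possible guard, not an SDK feature: `delete_version_checked` is a hypothetical wrapper that deletes only when the caller repeats the version ID.

```python
def delete_version_checked(dataset, version_id, confirm):
    """Delete a dataset version only if `confirm` repeats the version ID.

    `dataset` is expected to behave like clarifai.client.dataset.Dataset.
    """
    if confirm != version_id:
        raise ValueError("confirmation does not match the version ID; nothing deleted")
    dataset.delete_version(version_id=version_id)


# Usage with the real client (requires network access and a valid PAT):
# delete_version_checked(dataset, "dataset_version", confirm="dataset_version")
```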

Delete Dataset

You can delete a dataset through the API by providing its dataset ID. Note that deleting a dataset also deletes all of its associated dataset versions.

Visit this page for more information.

caution

Be certain that you want to delete a particular dataset as the operation cannot be undone.

from clarifai.client.app import App

app = App(app_id="test_app", user_id="user_id")
# Provide the dataset ID as a parameter to the delete_dataset function
app.delete_dataset(dataset_id="demo_dataset")
Output
2024-01-22 15:24:29 INFO     clarifai.client.app:                                                        app.py:617
Dataset Deleted
code: SUCCESS
description: "Ok"
details: "Dataset \'demo_dataset\' deleted"
req_id: "97a8c49418da156d1b0227f9fa5f8dda"
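When cleaning up several datasets at once, a dry run makes it possible to review what would be removed before anything is deleted. This sketch relies only on the `delete_dataset(dataset_id=...)` call shown above; `delete_datasets` and its `dry_run` flag are hypothetical.

```python
def delete_datasets(app, dataset_ids, dry_run=True):
    """Delete the given datasets; with dry_run=True, only return the plan.

    `app` is expected to behave like clarifai.client.app.App.
    """
    plan = list(dict.fromkeys(dataset_ids))  # drop duplicates, keep order
    if not dry_run:
        for dataset_id in plan:
            app.delete_dataset(dataset_id=dataset_id)
    return plan


# Usage with the real client (requires network access and a valid PAT):
# print(delete_datasets(app, ["demo_dataset"]))          # review the plan first
# delete_datasets(app, ["demo_dataset"], dry_run=False)  # then actually delete
```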