Upload Data to Dataset via API

Learn how to upload data to a dataset via the API


Uploading data to a dataset in Clarifai is essential for training and evaluating your machine learning models.

Whether you're working with images, videos, text, audio, or other data types, we provide flexible and efficient methods to upload data from various sources.

info

Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.

tip

Click here to learn more about working with the Dataset class.

note

Customize Batch Size

When uploading inputs to the Clarifai platform, there are limits on the size and number of inputs per upload, as detailed here. However, by using methods from the Dataset class — such as Dataset.upload_from_folder() or Dataset.upload_from_csv() — you can bypass these restrictions and efficiently upload larger volumes of inputs.

For example, when uploading images in bulk, these methods process and upload them incrementally in multiple batches, capping each batch at 128 images and 128MB in size so that every upload stays within those restrictions.

You can also customize the batch_size parameter, which controls how many inputs and annotations are uploaded concurrently. For example, if your images folder exceeds 128MB, you can lower the value so that each batch contains an appropriate number of images while staying within the 128MB per-batch limit.

The default batch_size is set to 32, but you can customize it to any value between 1 (minimum) and 128 (maximum).

Here is an example:

dataset.upload_from_folder(folder_path='/path/to/your/folder', input_type='image', labels=True, batch_size=50)
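To pick a value, a quick back-of-the-envelope calculation helps. The sketch below (the ~3MB average image size is an assumption for illustration, not from the platform) computes the largest batch size that respects both per-batch limits:

```python
import math

def max_batch_size(avg_input_mb, size_limit_mb=128, count_limit=128):
    """Largest batch_size that keeps one batch within both the 128MB
    size limit and the 128-inputs-per-batch limit described above."""
    return min(count_limit, math.floor(size_limit_mb / avg_input_mb))

# With images averaging ~3MB each, a batch of 42 stays under 128MB:
print(max_batch_size(3))  # → 42
```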

Add Inputs to a Dataset

After uploading inputs to the Clarifai platform, you can add them to a dataset by specifying their input IDs.

curl --location --request POST "https://api.clarifai.com/v2/users/YOUR_USER_ID_HERE/apps/YOUR_APP_ID_HERE/datasets/YOUR_DATASET_ID_HERE/inputs" \
--header "Authorization: Key YOUR_PAT_HERE" \
--header "Content-Type: application/json" \
--data-raw '{
    "dataset_inputs": [
        {
            "input": {
                "id": "YOUR_EXISTING_INPUT_ID_HERE"
            }
        }
    ]
}'

Upload From Folder

The upload_from_folder method lets you bulk-upload images or text files from a local folder directly into a Clarifai dataset.

from clarifai.client.dataset import Dataset

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Create a dataset object
dataset = Dataset(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
    dataset_id="YOUR_DATASET_ID"
)

# Upload data from a folder
dataset.upload_from_folder(
    folder_path="/path/to/your/folder",
    input_type="image",  # or "text" for text files
    labels=True  # Set to False to upload without concepts
)

Note that:

  • The upload_from_folder method only supports "image" and "text" input types.
  • Ensure your dataset (dataset_id) already exists before calling this method.
  • Large datasets should be uploaded with an appropriate batch_size (default 128).
  • If labels=True, the folder name is assigned as the input’s concept label.
  • The filename (without extension) is used as the input_id in Clarifai.
  • When uploading text data, the target app should be configured to accept text inputs. Set the primary input type to Text/Document when creating the app.
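To illustrate the naming conventions above, here is a sketch that builds a compatible folder locally (the folder name "dog" and the filenames are hypothetical placeholders):

```python
from pathlib import Path

# With labels=True, every file in "dog/" would receive the concept label "dog",
# and each filename without its extension ("puppy1", "puppy2") becomes the input_id.
folder = Path("dog")
folder.mkdir(exist_ok=True)
(folder / "puppy1.jpg").touch()
(folder / "puppy2.jpg").touch()

print(sorted(p.name for p in folder.iterdir()))  # → ['puppy1.jpg', 'puppy2.jpg']
```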

Upload From CSV

The upload_from_csv method lets you bulk-upload data into a Clarifai dataset using a CSV file. This method is useful when your data is already structured in tabular form with URLs, local file paths, or raw text.

from clarifai.client.dataset import Dataset

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Create a dataset object
dataset = Dataset(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
    dataset_id="YOUR_DATASET_ID"
)

# Upload local image files
dataset.upload_from_csv(
    csv_path="path_to_your_csv_file.csv",
    input_type="image",
    csv_type="file_path",
    labels=True  # Set to False to upload without concepts
)

'''
# Upload image data from URLs
dataset.upload_from_csv(
    csv_path="sample_images.csv",
    input_type="image",
    csv_type="url",
    labels=True
)

# Upload raw text data
dataset.upload_from_csv(
    csv_path="sample_texts.csv",
    input_type="text",
    csv_type="raw",
    labels=True
)

# Upload video data from file paths
dataset.upload_from_csv(
    csv_path="sample_videos.csv",
    input_type="video",
    csv_type="file_path",
    labels=True
)

# Upload audio data from URLs
dataset.upload_from_csv(
    csv_path="sample_audio.csv",
    input_type="audio",
    csv_type="url",
    labels=True
)
'''
Example CSV Files

File-path based (for local files):

inputid,input,concepts,metadata
img1,"data/metro-north.jpg","train","{'source': 'local'}"
img2,"data/puppy.jpeg","dog","{'source': 'local'}"

Raw text dataset (only valid with input_type="text"):

inputid,input,concepts,metadata
txt1,"The sky is clear and blue","weather","{'lang': 'en'}"
txt2,"The puppy is playing in the garden","dog","{'lang': 'en'}"

With geopoints:

inputid,input,concepts,metadata,geopoints
img1,"data/metro-north.jpg","train","{'source': 'clarifai-samples'}","-73.935242,40.730610"
img2,"data/puppy.jpeg","dog","{'source': 'clarifai-samples'}","-118.243683,34.052235"

Note that:

  • The upload_from_csv method supports "image", "text", "video", and "audio" input types.
  • The csv_type parameter defines how the CSV file will be interpreted. It can be:
    • "url" — Inputs are hosted online, and the CSV provides URLs.
    • "file_path" — Inputs are stored locally, and the CSV provides file paths.
    • "raw" — Only valid for text datasets; the CSV provides raw text strings.
  • If labels=True (default), the CSV must include a concepts column with labels. If False, inputs are uploaded without labels.
  • The batch_size (default = 128) parameter defines the maximum number of inputs to upload concurrently in one batch.
  • The CSV file must include column headers. These are the supported headers:
    • inputid — Unique identifier for the input.
    • input — URL, file path, or raw text depending on csv_type.
    • concepts — Concept labels (if labels=True).
    • metadata — JSON metadata, formatted with single quotes inside. Example: "{'source': 'web'}".
    • geopoints — Geolocation in "longitude,latitude" format.
  • All the data in the CSV file should be enclosed in double quotes.
  • When uploading text data, ensure the target app is configured to accept text inputs. Set the primary input type to Text/Document when creating the app.
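A CSV that follows these rules can be generated with Python's standard csv module. This sketch (filenames and values are illustrative) writes a file-path dataset with every field enclosed in double quotes:

```python
import csv

# Write a minimal file-path CSV with the supported headers.
# csv.QUOTE_ALL encloses every field in double quotes; metadata keeps
# single quotes inside the braces, as in the examples above.
rows = [
    {"inputid": "img1", "input": "data/metro-north.jpg",
     "concepts": "train", "metadata": "{'source': 'local'}"},
    {"inputid": "img2", "input": "data/puppy.jpeg",
     "concepts": "dog", "metadata": "{'source': 'local'}"},
]
with open("sample_images.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["inputid", "input", "concepts", "metadata"],
        quoting=csv.QUOTE_ALL,
    )
    writer.writeheader()
    writer.writerows(rows)
```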

Upload Image Data With Annotations

You can upload image data together with bounding box annotations into a Clarifai dataset, adding richer context and detail to your visual data.

from clarifai.client.input import Inputs
import time

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID"
)

# Upload image data from a specified URL with a unique input ID
input_id = "bbox_example"
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    image_url="https://samples.clarifai.com/BarackObama.jpg"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Define bounding box coordinates (format: [left, top, right, bottom])
bbox_points = [0.1, 0.1, 0.8, 0.9]

# Generate a bounding box annotation with the label "face" and the bounding box coordinates
annotation = input_client.get_bbox_proto(
    input_id=input_id,
    label="face",
    bbox=bbox_points
)

# Upload the generated annotation to associate it with the previously uploaded image
input_client.upload_annotations([annotation])
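If your annotations start out in pixel coordinates, they must be converted to the normalized 0-1 range before building the proto. A minimal helper sketch (the image dimensions below are illustrative, not from the sample image):

```python
def normalize_bbox(left_px, top_px, right_px, bottom_px, width, height):
    """Convert pixel coordinates to the normalized [left, top, right, bottom]
    format used above (each value relative to the image dimensions)."""
    return [left_px / width, top_px / height, right_px / width, bottom_px / height]

# A 1000x800 image with a box from (100, 80) to (800, 720):
print(normalize_bbox(100, 80, 800, 720, 1000, 800))  # → [0.1, 0.1, 0.8, 0.9]
```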

Upload Image Data With Mask Annotations

You can add masks to image data in a Clarifai dataset by providing polygon coordinates with the image, enabling precise region-based annotations.

from clarifai.client.input import Inputs
import time

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID"
)

# Upload image data from a specified URL with a unique input ID
input_id = "mask_example"
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    image_url="https://samples.clarifai.com/BarackObama.jpg"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Define polygon points for the mask
# Coordinates are normalized (0.0 to 1.0) relative to image width and height
mask_points = [
    [0.30, 0.20],  # top-left forehead
    [0.70, 0.20],  # top-right forehead
    [0.85, 0.45],  # right cheek
    [0.70, 0.80],  # right jaw
    [0.30, 0.80],  # left jaw
    [0.15, 0.45]   # left cheek
]

# Create a mask annotation with the label "obama"
annotation = input_client.get_mask_proto(
    input_id=input_id,
    label="obama",
    polygons=mask_points
)

# Upload the generated annotation to associate it with the previously uploaded image
input_client.upload_annotations([annotation])
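Polygon points follow the same normalized convention, so it can be worth sanity-checking them before building the proto. A small validation sketch (not part of the SDK):

```python
def validate_polygon(points):
    """Check that a polygon has at least 3 vertices and that every
    coordinate lies in the normalized 0.0-1.0 range."""
    assert len(points) >= 3, "a polygon needs at least 3 points"
    for x, y in points:
        assert 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0, f"point out of range: {(x, y)}"
    return True

mask_points = [
    [0.30, 0.20], [0.70, 0.20], [0.85, 0.45],
    [0.70, 0.80], [0.30, 0.80], [0.15, 0.45],
]
print(validate_polygon(mask_points))  # → True
```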

Upload Video Data With Annotations

You can upload videos in a Clarifai dataset with enriched annotations by including bounding box coordinates that define regions of interest within individual frames, adding valuable context to your video content.

from clarifai.client.input import Inputs
import time
from clarifai_grpc.grpc.api import resources_pb2 as r

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
)

# Upload video data from a specified URL with a unique input ID
input_id = "video_bbox_example"  # change per test if re-running
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    video_url="https://samples.clarifai.com/beer.mp4"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Bounding box coordinates for the annotation: (top_row, left_col, bottom_row, right_col), relative 0-1
bbox_points = [0.1, 0.1, 0.8, 0.9]

# Build a video frame annotation with one region (Clarifai requires frame -> region -> bbox)
bbox_pb = r.BoundingBox(
    top_row=bbox_points[0],
    left_col=bbox_points[1],
    bottom_row=bbox_points[2],
    right_col=bbox_points[3],
)
region = r.Region(
    region_info=r.RegionInfo(bounding_box=bbox_pb),
    data=r.Data(concepts=[r.Concept(id="glass", name="glass", value=1.0)])
)
frame = r.Frame(
    frame_info=r.FrameInfo(time=0, index=0),  # first frame (time in ms)
    data=r.Data(regions=[region])
)
annotation = r.Annotation(
    input_id=input_id,
    data=r.Data(frames=[frame])
)

# Upload the annotation associated with the video
input_client.upload_annotations([annotation])

# -------- OPTIONAL: multiple labels or multiple frames --------
'''
# Example: multiple labels on separate frames (one annotation per frame)
labels = ["glass", "person", "dog"]
annotations = []
for i, lab in enumerate(labels):
    bb = r.BoundingBox(top_row=0.1 + i * 0.05, left_col=0.1, bottom_row=0.4 + i * 0.05, right_col=0.4)
    reg = r.Region(
        region_info=r.RegionInfo(bounding_box=bb),
        data=r.Data(concepts=[r.Concept(id=lab, name=lab, value=1.0)])
    )
    frm = r.Frame(frame_info=r.FrameInfo(time=i * 1000, index=i), data=r.Data(regions=[reg]))
    annotations.append(r.Annotation(input_id=input_id, data=r.Data(frames=[frm])))

input_client.upload_annotations(annotations)
'''
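FrameInfo takes its time in milliseconds, so annotations aimed at specific frames need a frame-index-to-time conversion. A small helper sketch (the 25fps value is an assumption for illustration):

```python
def frame_time_ms(frame_index, fps):
    """Millisecond timestamp of a frame, matching the time-in-ms
    convention of FrameInfo in the examples above."""
    return int(frame_index * 1000 / fps)

# At 25fps, frame 0 starts at 0ms and frame 50 at 2000ms:
print(frame_time_ms(0, 25), frame_time_ms(50, 25))  # → 0 2000
```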

Upload Text Data With Annotations

You can upload text data in a Clarifai dataset and enrich it by attaching metadata, categorizing the content, or adding detailed annotations to enhance structure and context.

from clarifai.client.input import Inputs

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
)

# Define input details
input_id = "text_example"
concepts = ["mobile", "camera"]

# Upload data from a URL and annotate it with concepts
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    text_url="https://samples.clarifai.com/featured-models/Llama2_Conversational-agent.txt",
    labels=concepts
)

Batch Upload Image Data While Tracking Status

You can actively monitor the status of your dataset upload, giving you clear visibility into the progress and making it easy to track and analyze the data transfer process.

from clarifai.client.dataset import Dataset
from clarifai.datasets.upload.utils import load_module_dataloader


# Replace with your "user_id", "app_id", and "dataset_id"
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset")

# Create a dataloader object
cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

# Set get_upload_status=True to display the upload status
dataset.upload_dataset(dataloader=cifar_dataloader, get_upload_status=True)

Retry Upload From Log File

You can retry uploads for failed inputs directly from the logs. When using the upload_dataset function, any failed inputs are automatically logged to a file, which can later be used to resume and retry the upload process seamlessly.

info

Set retry_duplicates to True if you want to retry duplicate inputs with new input IDs in the current dataset.

# Import load_module_dataloader to build the dataloader object from dataset.py in the local data folder
from clarifai.datasets.upload.utils import load_module_dataloader
from clarifai.client.dataset import Dataset


# Replace with your "user_id", "app_id", and "dataset_id"
dataset = Dataset(user_id="user_id", app_id="app_id", dataset_id="dataset_id")

cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

dataset.retry_upload_from_logs(
    dataloader=cifar_dataloader,
    log_file_path='path to log file',
    retry_duplicates=True,
    log_warnings=True
)