Upload Data to Dataset via API
Learn how to upload data to a dataset via the API
Uploading data to a dataset in Clarifai is essential for training and evaluating your machine learning models.
Whether you're working with images, videos, text, audio, or other data types, we provide flexible and efficient methods to upload data from various sources.
Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.
Click here to learn more about working with the Dataset class.
Customize Batch Size
When uploading inputs to the Clarifai platform, there are limits on the size and number of inputs per upload, as detailed here. However, methods of the Dataset class, such as Dataset.upload_from_folder() or Dataset.upload_from_csv(), let you work around these restrictions and efficiently upload larger volumes of inputs.
For example, when uploading images in bulk, these methods process and upload them incrementally in multiple batches, ensuring that each batch contains at most 128 images and does not exceed 128MB in size, so every batch stays within the upload restrictions.
You can also customize the batch_size variable, which controls how many inputs and annotations are uploaded concurrently. For example, if your images folder exceeds 128MB, you can set the variable so that each batch contains an appropriate number of images while staying within the 128MB-per-batch limit.
The default batch_size is set to 32, but you can customize it to any value between 1 (minimum) and 128 (maximum).
Here is an example:
dataset.upload_from_folder(folder_path='/path/to/your/folder', input_type='image', labels=True, batch_size=50)
Add Inputs to a Dataset
After uploading inputs to the Clarifai platform, you can add them to a dataset by specifying their input IDs.
- cURL
curl --location --request POST "https://api.clarifai.com/v2/users/YOUR_USER_ID_HERE/apps/YOUR_APP_ID_HERE/datasets/YOUR_DATASET_ID_HERE/inputs" \
--header "Authorization: Key YOUR_PAT_HERE" \
--header "Content-Type: application/json" \
--data-raw '{
    "dataset_inputs": [
        {
            "input": {
                "id": "YOUR_EXISTING_INPUT_ID_HERE"
            }
        }
    ]
}'
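For reference, here is a minimal Python sketch (not part of the SDK) that calls the same endpoint with the requests library; the IDs are placeholders, and the PAT is read from the environment as in the SDK examples below:

import os
import requests

# Placeholders: replace with your own user, app, dataset, and input IDs
USER_ID = "YOUR_USER_ID_HERE"
APP_ID = "YOUR_APP_ID_HERE"
DATASET_ID = "YOUR_DATASET_ID_HERE"
INPUT_ID = "YOUR_EXISTING_INPUT_ID_HERE"

url = (
    f"https://api.clarifai.com/v2/users/{USER_ID}/apps/{APP_ID}"
    f"/datasets/{DATASET_ID}/inputs"
)

headers = {
    "Authorization": f"Key {os.environ['CLARIFAI_PAT']}",
    "Content-Type": "application/json",
}

# Same payload as the cURL example above
payload = {"dataset_inputs": [{"input": {"id": INPUT_ID}}]}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())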
Upload From Folder
The upload_from_folder method lets you bulk-upload images or text files from a local folder directly into a Clarifai dataset.
- Python SDK
- Node.js SDK
from clarifai.client.dataset import Dataset

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Create a dataset object
dataset = Dataset(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
    dataset_id="YOUR_DATASET_ID"
)

# Upload data from a folder
dataset.upload_from_folder(
    folder_path="/path/to/your/folder",
    input_type="image",  # or "text" for text files
    labels=True          # Set to False to upload without concepts
)
const { Dataset } = require("clarifai-nodejs");
const path = require("path");

(async () => {
  const dataset = new Dataset({
    datasetId: "YOUR_DATASET_ID",
    authConfig: {
      pat: process.env.CLARIFAI_PAT,
      userId: "YOUR_USER_ID",
      appId: "YOUR_APP_ID",
    },
  });

  await dataset.uploadFromFolder({
    folderPath: path.resolve(__dirname, "/path/to/your/folder"),
    inputType: "image",
    labels: true,
  });
})();
Note that:

- The upload_from_folder method only supports "image" and "text" input types.
- Ensure your dataset (dataset_id) already exists before calling this method.
- Large datasets should be uploaded with an appropriate batch_size (default 128).
- If labels=True, the folder name is assigned as the input's concept label (see the example layout after this list).
- The filename (without extension) is used as the input_id in Clarifai.
- When uploading text data, the target app should be configured to accept text inputs. Set the primary input type to Text/Document when creating the app.
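For instance, assuming the labeling behavior described in the notes above, a hypothetical folder named dogs uploaded with labels=True would look like this:

/path/to/your/dogs        # folder name "dogs" becomes the concept label when labels=True
├── dog1.jpg              # uploaded with input_id "dog1"
└── dog2.jpg              # uploaded with input_id "dog2"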
Upload From CSV
The upload_from_csv method lets you bulk-upload data into a Clarifai dataset using a CSV file. This method is useful when your data is already structured in tabular form with URLs, local file paths, or raw text.
- Python SDK
- Node.js SDK
from clarifai.client.dataset import Dataset

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Create a dataset object
dataset = Dataset(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
    dataset_id="YOUR_DATASET_ID"
)

# Upload local image files
dataset.upload_from_csv(
    csv_path="path_to_your_csv_file.csv",
    input_type="image",
    csv_type="file_path",
    labels=True  # Set to False to upload without concepts
)

'''
# Upload image data from URLs
dataset.upload_from_csv(
    csv_path="sample_images.csv",
    input_type="image",
    csv_type="url",
    labels=True
)

# Upload raw text data
dataset.upload_from_csv(
    csv_path="sample_texts.csv",
    input_type="text",
    csv_type="raw",
    labels=True
)

# Upload video data from file paths
dataset.upload_from_csv(
    csv_path="sample_videos.csv",
    input_type="video",
    csv_type="file_path",
    labels=True
)

# Upload audio data from URLs
dataset.upload_from_csv(
    csv_path="sample_audio.csv",
    input_type="audio",
    csv_type="url",
    labels=True
)
'''
const { Dataset } = require("clarifai-nodejs");
const path = require("path");

(async () => {
  const dataset = new Dataset({
    datasetId: "YOUR_DATASET_ID",
    authConfig: {
      pat: process.env.CLARIFAI_PAT,
      userId: "YOUR_USER_ID",
      appId: "YOUR_APP_ID",
    },
  });

  await dataset.uploadFromCSV({
    csvPath: path.resolve(__dirname, "path/to/your/file.csv"),
    inputType: "image",
    csvType: "file", // can also be "url" or "raw"
    labels: true,
  });
})();
Example CSV Files
File-path based (for local files)
inputid,input,concepts,metadata
img1,"data/metro-north.jpg","train","{'source': 'local'}"
img2,"data/puppy.jpeg","dog","{'source': 'local'}"
Raw text dataset (only valid with input_type="text")
inputid,input,concepts,metadata
txt1,"The sky is clear and blue","weather","{'lang': 'en'}"
txt2,"The puppy is playing in the garden","dog","{'lang': 'en'}"
With geopoints
inputid,input,concepts,metadata,geopoints
img1,"data/metro-north.jpg","train","{'source': 'clarifai-samples'}","-73.935242,40.730610"
img2,"data/puppy.jpeg","dog","{'source': 'clarifai-samples'}","-118.243683,34.052235"
Note that:

- The upload_from_csv method supports "image", "text", "video", and "audio" file types.
- The csv_type parameter defines how the CSV file will be interpreted. It can be:
  - "url": inputs are hosted online, and the CSV provides URLs.
  - "file_path": inputs are stored locally, and the CSV provides file paths.
  - "raw": only valid for text datasets; the CSV provides raw text strings.
- If labels=True (default), the CSV must include a concepts column with labels. If False, inputs are uploaded without labels.
- The batch_size parameter (default = 128) defines the maximum number of inputs to upload concurrently in one batch.
- The CSV file must include column headers. These are the supported headers:
  - inputid: unique identifier for the input.
  - input: URL, file path, or raw text, depending on csv_type.
  - concepts: concept labels (if labels=True).
  - metadata: JSON metadata, formatted with single quotes inside. Example: "{'source': 'web'}".
  - geopoints: geolocation in "longitude,latitude" format.
- All the data in the CSV file should be enclosed in double quotes.
- When uploading text data, ensure the target app is configured to accept text inputs. Set the primary input type to Text/Document when creating the app.
Upload Image Data With Annotations
You can upload image data together with bounding box annotations into a Clarifai dataset, adding richer context and detail to your visual data.
- Python SDK
- Node.js SDK
from clarifai.client.input import Inputs
import time

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID"
)

# Upload image data from a specified URL with a unique input ID
input_id = "bbox_example"
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    image_url="https://samples.clarifai.com/BarackObama.jpg"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Define bounding box coordinates (format: [left, top, right, bottom])
bbox_points = [0.1, 0.1, 0.8, 0.9]

# Generate a bounding box annotation with the specified label ("face") and bounding box coordinates
annotation = input_client.get_bbox_proto(
    input_id=input_id,
    label="face",
    bbox=bbox_points
)

# Upload the generated annotation to associate it with the previously uploaded image
input_client.upload_annotations([annotation])
import { Input } from "clarifai-nodejs";

const imageUrl = "https://samples.clarifai.com/BarackObama.jpg";

const input = new Input({
  authConfig: {
    userId: process.env.CLARIFAI_USER_ID,
    pat: process.env.CLARIFAI_PAT,
    appId: "test_app",
  },
});

await input.uploadFromUrl({
  inputId: "bbox",
  imageUrl,
});

const bboxPoints = [0.1, 0.1, 0.8, 0.9];

const annotation = Input.getBboxProto({
  inputId: "bbox",
  label: "face",
  bbox: bboxPoints,
});

await input.uploadAnnotations({
  batchAnnot: [annotation],
});
Upload Image Data With Mask Annotations
You can add masks to image data in a Clarifai dataset by providing polygon coordinates with the image, enabling precise region-based annotations.
- Python SDK
- Node.js SDK
from clarifai.client.input import Inputs
import time

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID"
)

# Upload image data from a specified URL with a unique input ID
input_id = "mask_example"
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    image_url="https://samples.clarifai.com/BarackObama.jpg"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Define polygon points for the mask
# Coordinates are normalized (0.0 to 1.0) relative to image width and height
mask_points = [
    [0.30, 0.20],  # top-left forehead
    [0.70, 0.20],  # top-right forehead
    [0.85, 0.45],  # right cheek
    [0.70, 0.80],  # right jaw
    [0.30, 0.80],  # left jaw
    [0.15, 0.45]   # left cheek
]

# Create a mask annotation with the label "obama"
annotation = input_client.get_mask_proto(
    input_id=input_id,
    label="obama",
    polygons=mask_points
)

# Upload the generated annotation to associate it with the previously uploaded image
input_client.upload_annotations([annotation])
import { Input, Polygon } from "clarifai-nodejs";

const imageUrl = "https://samples.clarifai.com/BarackObama.jpg";

const input = new Input({
  authConfig: {
    userId: process.env.CLARIFAI_USER_ID,
    pat: process.env.CLARIFAI_PAT,
    appId: process.env.CLARIFAI_APP_ID,
  },
});

await input.uploadFromUrl({
  inputId: "mask",
  imageUrl,
});

const maskPoints: Polygon[] = [[[0.87, 0.66], [0.45, 1.0], [0.82, 0.42]]];

const annotation = Input.getMaskProto({
  inputId: "mask",
  label: "obama",
  polygons: maskPoints,
});

await input.uploadAnnotations({
  batchAnnot: [annotation],
});
Upload Video Data With Annotations
You can upload videos to a Clarifai dataset with enriched annotations by including bounding box coordinates that define regions of interest within individual frames, adding valuable context to your video content.
- Python SDK
- Node.js SDK
from clarifai.client.input import Inputs
import time
from clarifai_grpc.grpc.api import resources_pb2 as r

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
)

# Upload video data from a specified URL with a unique input ID
input_id = "video_bbox_example"  # change per test if re-running
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    video_url="https://samples.clarifai.com/beer.mp4"
)

# Poll until the input is processed successfully
status = None
for _ in range(10):  # max retries
    inp = input_client.get_input(input_id)
    status = inp.status.code
    if status == 30000:  # SUCCESS
        break
    time.sleep(2)

if status != 30000:
    raise RuntimeError("Input not processed, cannot add annotations yet.")

# Bounding box coordinates for the annotation: (top_row, left_col, bottom_row, right_col), relative 0 to 1
bbox_points = [0.1, 0.1, 0.8, 0.9]

# Build a video frame annotation with one region (Clarifai requires frame -> region -> bbox)
bbox_pb = r.BoundingBox(
    top_row=bbox_points[0],
    left_col=bbox_points[1],
    bottom_row=bbox_points[2],
    right_col=bbox_points[3],
)

region = r.Region(
    region_info=r.RegionInfo(bounding_box=bbox_pb),
    data=r.Data(concepts=[r.Concept(id="glass", name="glass", value=1.0)])
)

frame = r.Frame(
    frame_info=r.FrameInfo(time=0, index=0),  # first frame (time in ms)
    data=r.Data(regions=[region])
)

annotation = r.Annotation(
    input_id=input_id,
    data=r.Data(frames=[frame])
)

# Upload the annotation associated with the video
input_client.upload_annotations([annotation])

# -------- OPTIONAL: multiple labels or multiple frames --------
'''
# Example: multiple labels on separate frames (one annotation per frame)
labels = ["glass", "person", "dog"]
annotations = []
for i, lab in enumerate(labels):
    bb = r.BoundingBox(top_row=0.1 + i * 0.05, left_col=0.1, bottom_row=0.4 + i * 0.05, right_col=0.4)
    reg = r.Region(
        region_info=r.RegionInfo(bounding_box=bb),
        data=r.Data(concepts=[r.Concept(id=lab, name=lab, value=1.0)])
    )
    frm = r.Frame(frame_info=r.FrameInfo(time=i * 1000, index=i), data=r.Data(regions=[reg]))
    annotations.append(r.Annotation(input_id=input_id, data=r.Data(frames=[frm])))

input_client.upload_annotations(annotations)
'''
import { Input } from "clarifai-nodejs";

const videoUrl = "https://samples.clarifai.com/beer.mp4";

const input = new Input({
  authConfig: {
    userId: process.env.CLARIFAI_USER_ID,
    pat: process.env.CLARIFAI_PAT,
    appId: "test_app",
  },
});

await input.uploadFromUrl({
  inputId: "video-bbox",
  videoUrl,
});

const bboxPoints = [0.1, 0.1, 0.8, 0.9];

const annotation = Input.getBboxProto({
  inputId: "video-bbox",
  label: "glass",
  bbox: bboxPoints,
});

await input.uploadAnnotations({
  batchAnnot: [annotation],
});
Upload Text Data With Annotations
You can upload text data to a Clarifai dataset and enrich it by attaching metadata, categorizing the content, or adding detailed annotations to enhance structure and context.
- Python SDK
- Node.js SDK
from clarifai.client.input import Inputs

# Set PAT as an environment variable before running this script
# export CLARIFAI_PAT=YOUR_PAT_HERE  # Unix-like systems
# set CLARIFAI_PAT=YOUR_PAT_HERE     # Windows

# Initialize the Inputs client
input_client = Inputs(
    user_id="YOUR_USER_ID",
    app_id="YOUR_APP_ID",
)

# Define input details
input_id = "text_example"
concepts = ["mobile", "camera"]

# Upload data from a URL and annotate it with concepts
input_client.upload_from_url(
    input_id=input_id,
    dataset_id="YOUR_DATASET_ID",  # Optional: specify dataset ID to add the input to a dataset
    text_url="https://samples.clarifai.com/featured-models/Llama2_Conversational-agent.txt",
    labels=concepts
)
import { Input } from "clarifai-nodejs";

const textUrl =
  "https://samples.clarifai.com/featured-models/Llama2_Conversational-agent.txt";
const concepts = ["mobile", "camera"];

const input = new Input({
  authConfig: {
    userId: process.env.CLARIFAI_USER_ID,
    pat: process.env.CLARIFAI_PAT,
    appId: "test_app",
  },
});

await input.uploadFromUrl({
  inputId: "text1",
  textUrl,
  labels: concepts,
});
Batch Upload Image Data While Tracking Status
You can actively monitor the status of your dataset upload, giving you clear visibility into the progress and making it easy to track and analyze the data transfer process.
- Python SDK
from clarifai.client.dataset import Dataset
from clarifai.datasets.upload.utils import load_module_dataloader

# Replace "user_id", "app_id", and "dataset_id" with your own values
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset")

# Create a dataloader object
cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

# Set get_upload_status=True to display the upload status
dataset.upload_dataset(dataloader=cifar_dataloader, get_upload_status=True)
Retry Upload From Log File
You can retry uploads for failed inputs directly from the logs. When using the upload_dataset function, any failed inputs are automatically logged to a file, which can later be used to resume and retry the upload process seamlessly.
Set retry_duplicates to True if you want to re-upload duplicate inputs with new input IDs in the current dataset.
- Python SDK
# Import load_module_dataloader to load the dataloader object defined in dataset.py in the local data folder
from clarifai.datasets.upload.utils import load_module_dataloader
from clarifai.client.dataset import Dataset

# Replace "user_id", "app_id", and "dataset_id" with your own values
dataset = Dataset(user_id="user_id", app_id="app_id", dataset_id="dataset_id")

cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

dataset.retry_upload_from_logs(
    dataloader=cifar_dataloader,
    log_file_path='path to log file',
    retry_duplicates=True,
    log_warnings=True
)