
Upload Data to Dataset via API

Learn how to upload data to a dataset via the API


Uploading data to a dataset in Clarifai is essential for training and evaluating your machine learning models.

Whether you're working with images, videos, text, audio, or other data types, we provide flexible and efficient methods to upload data from various sources.

info

Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.

tip

Click here to learn more about the different methods of uploading data to a dataset.

Customize Batch Size

When uploading inputs to the Clarifai platform, there are limits on the size and number of inputs per upload. However, the methods of the Dataset class, such as Dataset.upload_from_folder(), Dataset.upload_from_url(), and Dataset.upload_dataset(), handle these limits for you, so you can efficiently upload larger volumes of inputs.

For example, when uploading images in bulk, these methods process and upload them incrementally in multiple batches. Each batch contains at most 128 images and does not exceed 128MB in size, which keeps every request within the upload limits.

You can also customize the batch_size parameter, which allows for concurrent upload of inputs and annotations. For example, if your images folder exceeds 128MB, you can set batch_size so that each batch contains an appropriate number of images while staying within the 128MB per-batch limit.

The default batch_size is set to 32, but you can customize it to any value between 1 (minimum) and 128 (maximum).

Here is an example:

dataset.upload_from_folder(folder_path='./images', input_type='image', labels=True, batch_size=50)
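To illustrate the batching described above, here is a minimal sketch in plain Python of how a list of inputs could be split into batches. It is not the SDK's own implementation: `clamp_batch_size` and `chunk_inputs` are hypothetical helper names, and clamping out-of-range values is an assumption made for this example. Only the 1 to 128 range and the default of 32 come from the text above, and the 128MB size limit is not modeled.

```python
from typing import Iterator, List, Sequence

# Values taken from the documentation above.
MIN_BATCH_SIZE = 1
MAX_BATCH_SIZE = 128
DEFAULT_BATCH_SIZE = 32


def clamp_batch_size(batch_size: int = DEFAULT_BATCH_SIZE) -> int:
    """Hypothetical helper: force batch_size into the documented 1-128 range."""
    return max(MIN_BATCH_SIZE, min(MAX_BATCH_SIZE, batch_size))


def chunk_inputs(items: Sequence[str],
                 batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive batches of at most batch_size items."""
    size = clamp_batch_size(batch_size)
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    paths = [f"img_{i}.jpg" for i in range(120)]
    # 120 inputs with batch_size=50 give batches of 50, 50, and 20 items.
    print([len(batch) for batch in chunk_inputs(paths, batch_size=50)])
```

With batch_size=50, as in the example above, a folder of 120 images would be sent in three batches rather than one request.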

Add Inputs to a Dataset

You can add inputs to a dataset by specifying their input IDs.

curl --location --request POST "https://api.clarifai.com/v2/users/YOUR_USER_ID_HERE/apps/YOUR_APP_ID_HERE/datasets/YOUR_DATASET_ID_HERE/inputs" \
--header "Authorization: Key YOUR_PAT_HERE" \
--header "Content-Type: application/json" \
--data-raw '{
"dataset_inputs": [
{
"input": {
"id": "YOUR_INPUT_ID_HERE"
}
}
]
}'
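The same request can be assembled in Python using only the standard library. The sketch below mirrors the URL, headers, and JSON body of the cURL command above; `build_add_inputs_request` is a hypothetical helper name, and passing several input IDs in one `dataset_inputs` list is an assumption based on the list shape of the payload.

```python
import json
import urllib.request
from typing import List


def build_add_inputs_request(user_id: str, app_id: str, dataset_id: str,
                             input_ids: List[str], pat: str) -> urllib.request.Request:
    """Hypothetical helper: build (but do not send) the POST request shown above."""
    url = (f"https://api.clarifai.com/v2/users/{user_id}/apps/{app_id}"
           f"/datasets/{dataset_id}/inputs")
    # One {"input": {"id": ...}} entry per input ID, as in the cURL body.
    payload = {"dataset_inputs": [{"input": {"id": input_id}} for input_id in input_ids]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Key {pat}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_add_inputs_request(
        "YOUR_USER_ID_HERE", "YOUR_APP_ID_HERE", "YOUR_DATASET_ID_HERE",
        ["YOUR_INPUT_ID_HERE"], "YOUR_PAT_HERE")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.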

Upload Image Data

You can upload image data in bulk either from a folder or by using a CSV file.

from clarifai.client.dataset import Dataset


# Create a dataset object
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset", pat="YOUR_PAT")

# Upload data from a folder; set labels=False to upload without concepts
dataset.upload_from_folder(folder_path='./images', input_type='image', labels=True)

Upload Text Data

You can upload text data in bulk either from a folder or by using a CSV file.

from clarifai.client.dataset import Dataset

# Create a dataset object
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset", pat="YOUR_PAT")

# Upload data from a folder; set labels=False to upload without concepts
dataset.upload_from_folder(folder_path='./data', input_type='text', labels=True)

Upload Audio Data

You can upload audio data in bulk either from a folder or by using a CSV file.

from clarifai.client.dataset import Dataset


# Create a dataset object
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset", pat="YOUR_PAT")

# Upload data from a CSV file; set labels=False to upload without concepts
dataset.upload_from_csv(csv_path='/Users/adithyansukumar/Desktop/data/test.csv', input_type='audio', csv_type='url', labels=True)

Upload Video Data

You can upload video data in bulk either from a folder or by using a CSV file.

from clarifai.client.dataset import Dataset


# Create a dataset object
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset", pat="YOUR_PAT")

# Upload data from a CSV file; set labels=False to upload without concepts
dataset.upload_from_csv(csv_path='/Users/adithyansukumar/Desktop/data/test.csv', input_type='video', csv_type='url', labels=True)

Upload Image Data With Annotations

You can upload image data along with bounding box annotations, allowing you to add depth and contextual information to your visual data.

from clarifai.client.input import Inputs


url = "https://samples.clarifai.com/BarackObama.jpg"
# Replace "user_id" and "app_id" with your own values
input_object = Inputs(user_id="user_id", app_id="test_app", pat="YOUR_PAT")

# Upload image data from a specified URL with a unique input ID "bbox"
input_object.upload_from_url(input_id="bbox", image_url=url)

# Define bounding box coordinates for the annotation (left, top, right, bottom)
bbox_points = [.1, .1, .8, .9]

# Generate a bounding box annotation proto with specified label ("face") and bounding box coordinates
annotation = input_object.get_bbox_proto(input_id="bbox", label="face", bbox=bbox_points)

# Upload the generated annotation to associate with the previously uploaded image
input_object.upload_annotations([annotation])
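The values in bbox_points above all lie between 0 and 1, which suggests they are fractions of the image width and height rather than pixels. Under that assumption, a box measured in pixels can be converted with a small helper like the sketch below. `pixel_bbox_to_normalized` is a hypothetical name and is not part of the SDK.

```python
from typing import List, Sequence


def pixel_bbox_to_normalized(bbox_px: Sequence[float],
                             width: int, height: int) -> List[float]:
    """Hypothetical helper: convert a (left, top, right, bottom) box in pixels
    into fractions of the image size, assuming the API expects values in [0, 1]."""
    left, top, right, bottom = bbox_px
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("bbox must satisfy 0 <= left < right <= width "
                         "and 0 <= top < bottom <= height")
    return [left / width, top / height, right / width, bottom / height]


if __name__ == "__main__":
    # A 640x480 image with a box from (64, 48) to (512, 432).
    print(pixel_bbox_to_normalized([64, 48, 512, 432], width=640, height=480))
```

The result could then be passed as the bbox argument of get_bbox_proto in the example above.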

Upload Image Data With Mask Annotations

You can add masks to image data by providing polygon coordinates along with the image, enabling precise region-based annotations.

from clarifai.client.input import Inputs


url = "https://samples.clarifai.com/BarackObama.jpg"
# Replace "USER_ID" and "APP_ID" with your own values
input_object = Inputs(user_id="USER_ID", app_id="APP_ID", pat="YOUR_PAT")

# Upload image data from a specified URL with a unique input ID "mask"
input_object.upload_from_url(input_id="mask", image_url=url)

# Define the mask as a list of polygon points
mask = [[0.87, 0.66], [0.45, 1.0], [0.82, 0.42]]

annotation = input_object.get_mask_proto(input_id="mask", label="obama", polygons=mask)

# Upload the generated annotation to associate with the previously uploaded image
input_object.upload_annotations([annotation])
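Before building a mask annotation, it can help to check the polygon locally. The sketch below assumes a polygon needs at least three points and that each coordinate is a fraction between 0 and 1, as the sample values above suggest. Neither rule is confirmed by the source, and `validate_polygon` is a hypothetical helper, not an SDK function.

```python
from typing import List, Sequence


def validate_polygon(points: Sequence[Sequence[float]]) -> List[List[float]]:
    """Hypothetical helper: check a list of coordinate pairs and return them
    as floats. Assumes at least 3 points, each coordinate within [0, 1]."""
    if len(points) < 3:
        raise ValueError("a polygon needs at least 3 points")
    cleaned = []
    for point in points:
        if len(point) != 2:
            raise ValueError(f"each point must have 2 coordinates, got {point!r}")
        first, second = float(point[0]), float(point[1])
        if not (0.0 <= first <= 1.0 and 0.0 <= second <= 1.0):
            raise ValueError(f"coordinates must be within [0, 1], got {point!r}")
        cleaned.append([first, second])
    return cleaned


if __name__ == "__main__":
    mask = [[0.87, 0.66], [0.45, 1.0], [0.82, 0.42]]
    print(validate_polygon(mask))
```

Running a check like this before calling get_mask_proto surfaces malformed polygons without a round trip to the API.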

Upload Video Data With Annotations

You can upload videos with enriched annotations by including bounding box coordinates that define regions of interest within individual frames, adding valuable context to your video content.

from clarifai.client.input import Inputs

url = "https://samples.clarifai.com/beer.mp4"
# Replace "user_id" and "app_id" with your own values
input_object = Inputs(user_id="user_id", app_id="test_app", pat="YOUR_PAT")

# Upload a video from a URL with the input ID "video_bbox"
input_object.upload_from_url(input_id="video_bbox", video_url=url)

# Define bounding box coordinates for the annotation
bbox_points = [.1, .1, .8, .9]

# Create an annotation for the same input ID using the bounding box coordinates
annotation = input_object.get_bbox_proto(input_id="video_bbox", label="glass", bbox=bbox_points)

# Upload the annotation associated with the video
input_object.upload_annotations([annotation])

Upload Text Data With Annotations

You can enrich your uploaded text data by attaching metadata, categorizing the content, or adding detailed annotations to enhance structure and context.

from clarifai.client.input import Inputs

url = "https://samples.clarifai.com/featured-models/Llama2_Conversational-agent.txt"
concepts = ["mobile","camera"]
# Replace "user_id" and "app_id" with your own values
input_object = Inputs(user_id="user_id", app_id="test_app", pat="YOUR_PAT")

# Upload text from a URL and attach the concepts as labels
input_object.upload_from_url(input_id="text1", text_url=url, labels=concepts)

Batch Upload Image Data While Tracking Status

You can actively monitor the status of your dataset upload, giving you clear visibility into the progress and making it easy to track and analyze the data transfer process.

from clarifai.client.dataset import Dataset
from clarifai.datasets.upload.utils import load_module_dataloader


# Replace "user_id", "app_id", and "dataset_id" with your own values
dataset = Dataset(user_id="user_id", app_id="test_app", dataset_id="first_dataset")

# Create the dataloader object
cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

# Set get_upload_status=True to show the upload status
dataset.upload_dataset(dataloader=cifar_dataloader, get_upload_status=True)

Retry Upload From Log File

You can retry uploads for failed inputs directly from the logs. When using the upload_dataset function, any failed inputs are automatically logged to a file, which can later be used to resume and retry the upload process seamlessly.

info

Set retry_duplicates to True if you want to retry duplicate inputs with a new input ID in the current dataset.

# Import load_module_dataloader to load the dataloader object defined in dataset.py in the local data folder
from clarifai.datasets.upload.utils import load_module_dataloader
from clarifai.client.dataset import Dataset


# Replace "user_id", "app_id", and "dataset_id" with your own values
dataset = Dataset(user_id="user_id", app_id="app_id", dataset_id="dataset_id")

cifar_dataloader = load_module_dataloader('./image_classification/cifar10')

dataset.retry_upload_from_logs(dataloader=cifar_dataloader, log_file_path='path to log file', retry_duplicates=True, log_warnings=True)