
Datasets Creation

Learn how to create datasets and their versions


Create via the UI

Create a New Dataset

To create a new dataset, head to the individual page of your application. Then, select the Datasets option in the collapsible left sidebar.

You'll be redirected to the Datasets manager page, where you can create new datasets and view already created ones. Click the Create Dataset button in the upper-right corner of the page.

On the New Dataset page, provide an ID and, optionally, a short description of the dataset. Then, click the Create button.

You'll be redirected to the created dataset's page, where you can perform various tasks. The following sections describe each of them.

View in Leaderboard

To view the dataset in the Leaderboard, click the View in Leaderboard button in the upper section of the page. You'll be redirected to the Leaderboard page, where you can assess how the dataset performs when used to evaluate a model.

Click here to learn how the leaderboard feature works.

Edit Visibility

To change the visibility of the dataset, whether from private to public or vice versa, click the Edit Visibility button located in the upper section of the page. A pop-up window will appear, allowing you to modify the dataset's visibility settings.

Note: If you make a specific version of a dataset public, your user ID and app ID will also be set to public.

After editing the settings, click the Confirm button.

Upload Inputs

To add inputs to the dataset, click the Upload Inputs button in the upper section of the page. You'll be redirected to the Inputs-Manager page, where you can upload inputs to the dataset.

Click here to learn how to upload inputs to our platform.

Create Labeling Task

To create a labeling task, click the Create Labeling Task button in the upper section of the page. You'll be redirected to the New Labeling Task page, where you can create a new labeling task to label the inputs in your dataset either manually or automatically.

Click here to learn how to create a labeling task.

Train Model

To train a model, click the Train Model button in the upper section of the page. You'll be redirected to a page where you can create a new custom model for your use case.

Click here to learn how to create and train a model.

Create Dataset Version

A dataset can change over time for various reasons, such as when inputs are added or removed. With dataset versioning, you can assign a unique identifier to a specific version of a dataset.

A dataset version can help you achieve many things, such as:

  • Refer to a specific dataset version and recreate the same results. This gives you a clear record of which data was used at a particular point in time.
  • Ensure everyone on your team is working on the same dataset. This reduces confusion and errors, and leads to more consistent results.
  • Track the changes you've made to a dataset over time. This can help you determine whether the quality and quantity of your data are improving.

After adding inputs to a dataset, you can create a version that bookmarks the state of your data so that you can apply a specific version of the dataset for future iterations.
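
To make the idea of a version as a bookmark concrete, here is a minimal, purely illustrative sketch in plain Python. `ToyDataset` and its methods are hypothetical names invented for this example; they are not part of the Clarifai SDK and do not reflect how the platform stores versions internally. The sketch only shows the behavior described above: a version freezes the set of inputs at one point in time, so later changes to the dataset leave it untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a dataset (not an SDK class)."""

    dataset_id: str
    inputs: Set[str] = field(default_factory=set)
    versions: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def create_version(self, version_id: str) -> FrozenSet[str]:
        """Freeze the current input IDs under a unique version ID."""
        if version_id in self.versions:
            raise ValueError(f"version {version_id!r} already exists")
        snapshot = frozenset(self.inputs)
        self.versions[version_id] = snapshot
        return snapshot


ds = ToyDataset("first_dataset")
ds.inputs.update({"img-1", "img-2"})
ds.create_version("v1")

# Later changes to the dataset do not alter the bookmarked version.
ds.inputs.add("img-3")
ds.inputs.discard("img-1")
ds.create_version("v2")

# Comparing two versions shows what changed between them.
added = ds.versions["v2"] - ds.versions["v1"]
removed = ds.versions["v1"] - ds.versions["v2"]
```

Because each snapshot is immutable, comparing two versions is enough to see which inputs were added or removed between them.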

To create a new dataset version, go to the individual page of your created dataset and click the New version button.

Finally, click the Update status button.

The total number of inputs and their respective annotations are displayed in the Overview tab.

Note: If you click the Explore Inputs button, you'll be redirected to the Inputs-Manager page, where all the inputs in your dataset are displayed.

The versions of the datasets you've created are displayed in the Versions tab.

The Versions tab has a chart that displays the total number of inputs in your dataset over time, with data plotted against specific dates.

By default, it shows the annotation metrics based on the dataset version, with each annotation type represented by a distinct color. This makes it easy to track and compare trends across different dates.

You can switch the chart to display metrics by input type by clicking the by inputs button in the upper-right corner. Each type of input is marked with its own color.

If you hover over the chart, a tooltip appears with detailed information for that specific date, including the exact count for each annotation type or input type. The hovered item is highlighted in the tooltip, which makes it quick to identify.

Additionally, a table is provided below the chart. It lists each dataset version alongside the creation date, description, input count, and annotation count. The table includes a search function for easy lookup and allows the sorting of columns for streamlined navigation.

Auto-Generated Versions

As you navigate through the Versions tab, you might come across auto-generated dataset versions.

Dataset versions named "auto-generated-*" can be created in cases such as:

  • When you train a model and select only a dataset, without a corresponding dataset version.
  • During model evaluation, when auto-generated dataset versions store the ground truths and predictions that are then used to calculate the evaluation metrics.
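
If you work with a list of version IDs outside the UI, a small helper can separate auto-generated versions from the ones you created yourself. The sketch below assumes only the "auto-generated-" prefix from the naming pattern above; `split_versions` is a hypothetical helper and the sample IDs are made up for illustration.

```python
from typing import Iterable, List, Tuple

# Prefix taken from the "auto-generated-*" naming pattern.
AUTO_PREFIX = "auto-generated-"


def split_versions(version_ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (manual, auto) lists of version IDs, preserving input order."""
    manual: List[str] = []
    auto: List[str] = []
    for version_id in version_ids:
        if version_id.startswith(AUTO_PREFIX):
            auto.append(version_id)
        else:
            manual.append(version_id)
    return manual, auto


# Hypothetical version IDs, for illustration only.
manual, auto = split_versions(["v1", "auto-generated-1700000000", "release-2"])
```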

Create via the API

Info: Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.

Create a Dataset

You can create a dataset by specifying a unique dataset ID.

from clarifai.client.app import App

app = App(app_id="test_app", user_id="user_id", pat="YOUR_PAT")
# Pass the unique dataset ID to the create_dataset function
dataset = app.create_dataset(dataset_id="first_dataset")
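
Rather than hard-coding the PAT in a script, you may prefer to read credentials from the environment. The following is a minimal sketch under stated assumptions: the helper names `clarifai_credentials` and `create_dataset_from_env` are hypothetical, and the environment variable names are a choice made for this example, so adjust them to match your own setup. The `App` constructor arguments and the `create_dataset` call are the same ones used in the example above.

```python
import os
from typing import Dict, Mapping

# Variable names are assumptions for this sketch; rename them as needed.
REQUIRED_VARS = {
    "user_id": "CLARIFAI_USER_ID",
    "app_id": "CLARIFAI_APP_ID",
    "pat": "CLARIFAI_PAT",
}


def clarifai_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the App() keyword arguments from environment variables."""
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for kwarg, name in REQUIRED_VARS.items()}


def create_dataset_from_env(dataset_id: str, env: Mapping[str, str] = os.environ):
    """Create a dataset using credentials read from the environment."""
    # Imported lazily so the credential helper works without the SDK installed.
    from clarifai.client.app import App

    app = App(**clarifai_credentials(env))
    return app.create_dataset(dataset_id=dataset_id)
```

Calling `create_dataset_from_env("first_dataset")` would then behave like the example above, while failing early with a clear message if a credential is missing.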

Create a Dataset Version

After making changes to a dataset, such as adding new inputs, you can create a new version to reflect those updates, as previously explained.

from clarifai.client.dataset import Dataset
# Create a dataset object
dataset = Dataset(dataset_id='first_dataset', user_id='user_id', app_id='test_app', pat='YOUR_PAT')
# Create a new version of the dataset
dataset_version = dataset.create_version(description='dataset_version_description')
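
Since the description is the main human-readable record of why a version exists, it can help to build it in a consistent format. The sketch below is one possible convention, not something the platform requires: `version_description` is a hypothetical helper, and the date, input count, and reason fields are assumptions chosen for illustration.

```python
from datetime import datetime, timezone
from typing import Optional


def version_description(
    reason: str, input_count: int, when: Optional[datetime] = None
) -> str:
    """Build a description such as '2024-01-15 | 120 inputs | added cat images'."""
    if input_count < 0:
        raise ValueError("input_count must be non-negative")
    # Default to the current UTC time when no timestamp is supplied.
    when = when or datetime.now(timezone.utc)
    return f"{when:%Y-%m-%d} | {input_count} inputs | {reason}"


description = version_description(
    "added cat images", 120, datetime(2024, 1, 15, tzinfo=timezone.utc)
)
```

The resulting string can be passed as the `description` argument of `dataset.create_version`, as in the example above.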