Dataset Export

This Python script allows you to export labeled data from Clarifai's Scribe labeler. The exported data can be saved as a ZIP archive or in a filesystem directory. The script provides two main classes, DatasetExportReader and InputDownloader, that handle the data export and download processes, respectively.

You can find the script in our clarifai-python-utils repository on GitHub, under clarifai/dataset_export.

Installation

Before running this script, make sure you have the following dependencies installed:

  • Pillow
  • protobuf
  • requests
  • tqdm

You can install the dependencies using pip:

pip install Pillow protobuf requests tqdm
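As a quick sanity check after installing, the sketch below reports which of the four dependencies cannot be found in the current environment. It is only an illustration and not part of the script: the helper name missing_packages and the REQUIRED mapping are made up for this example. Note that two packages are imported under names that differ from their pip names (Pillow as PIL, protobuf as google.protobuf).

```python
import importlib.util

# pip package name -> module name used at import time.
REQUIRED = {
    "Pillow": "PIL",
    "protobuf": "google.protobuf",
    "requests": "requests",
    "tqdm": "tqdm",
}


def missing_packages(required):
    """Return the pip names whose import module cannot be found (hypothetical helper)."""
    missing = []
    for pip_name, module_name in required.items():
        try:
            found = importlib.util.find_spec(module_name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (for example "google") is absent.
            found = False
        if not found:
            missing.append(pip_name)
    return missing


print("missing:", missing_packages(REQUIRED) or "none")
```

If the list is not empty, rerun the pip command above for the names it reports.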

Usage

To use the script, run it with the following command:

python <script_name> <archive-url> [<save-path>]

Replace <script_name> with the name of the Python script, <archive-url> with the URL of the exported archive, and <save-path> (optional) with the path where you want to save the downloaded data. If <save-path> is omitted, the output is saved to "output.zip".

Classes

DatasetExportReader

This class unpacks the ZIP file of an exported dataset version. It downloads the archive to disk, then reads the dataset version export in memory without extracting all of its contents, yielding each api.Input object.

with DatasetExportReader(archive_url=archive_url) as reader:
    # Iterate over the reader to access each api.Input in the dataset
    for input_ in reader:
        print(input_.id)
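To illustrate what "reading without extracting all contents" can look like, here is a minimal sketch built on the standard zipfile module. It is not the DatasetExportReader implementation: the function iter_zip_members and the member names part-0 and part-1 are invented for the example, and a small in-memory archive stands in for a downloaded export.

```python
import io
import zipfile


def iter_zip_members(archive):
    """Yield (name, bytes) for each file in a ZIP, one member at a time."""
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # Read only this member; nothing is extracted to disk.
            with zf.open(info) as member:
                yield info.filename, member.read()


# Build a small in-memory archive to stand in for a downloaded export.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("part-0", b"first batch")
    zf.writestr("part-1", b"second batch")
buffer.seek(0)

members = list(iter_zip_members(buffer))
for name, payload in members:
    print(name, len(payload))
```

Because the function is a generator, only one member's bytes need to be held at a time when the caller processes the items as they arrive.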

InputDownloader

This class takes an iterator or a list of api.Input instances and provides a method to download all of the inputs they describe (currently only images). It can write the downloaded inputs to a new ZIP archive or to a filesystem directory.

input_downloader = InputDownloader(reader)
input_downloader.download_image_archive(save_path=save_path)
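The two output targets, a ZIP archive or a directory, can be sketched as follows. This is an assumption-laden illustration rather than the InputDownloader code: the helper save_files is hypothetical, the file contents are placeholder bytes, and choosing the target by a ".zip" suffix is a guess about how such a switch could work, not documented behavior.

```python
import os
import tempfile
import zipfile


def save_files(files, save_path):
    """Write (name, bytes) pairs to a ZIP archive or a directory (hypothetical helper).

    Assumption: a path ending in ".zip" selects archive output;
    any other path is treated as a directory.
    """
    if save_path.lower().endswith(".zip"):
        with zipfile.ZipFile(save_path, "w") as zf:
            for name, data in files:
                zf.writestr(name, data)
    else:
        for name, data in files:
            target = os.path.join(save_path, name)
            # Create intermediate folders such as "train/" on demand.
            os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
            with open(target, "wb") as fh:
                fh.write(data)


# Placeholder bytes standing in for downloaded images.
files = [("train/a.png", b"image-a"), ("train/b.png", b"image-b")]
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "output.zip")
dir_path = os.path.join(workdir, "output")
save_files(files, zip_path)
save_files(files, dir_path)
print(sorted(os.listdir(workdir)))
```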

Example

Here's an example of how to use this script with an archive URL and an optional save path:

import sys

# Require at least the archive URL; the save path is optional.
if len(sys.argv) < 2:
    print(f"usage: {sys.argv[0]} <archive-url> [<save-path>]")
    sys.exit(2)
archive_url = sys.argv[1]
save_path = sys.argv[2] if len(sys.argv) > 2 else "output.zip"

# Read the exported archive and download its images to save_path.
with DatasetExportReader(archive_url=archive_url) as reader:
    InputDownloader(reader).download_image_archive(save_path=save_path)

When executed, this script will download the labeled images from the archive URL and save them to the specified save path.