
Upload Your First Model

Upload a model from Hugging Face to the Clarifai platform


The Clarifai platform allows you to upload custom models for a wide range of use cases. With just a few simple steps, you can get your models up and running and leverage the platform’s powerful capabilities.

Let's demonstrate how you can upload the Llama-3.2-1B-Instruct model from Hugging Face to the Clarifai platform.

tip

To learn more about how to upload different types of models, check out this comprehensive guide.

Step 1: Perform Prerequisites

Install Clarifai Package

Install the latest version of the clarifai Python SDK. This also installs the Clarifai Command Line Interface (CLI), which we'll use for uploading the model.

 pip install --upgrade clarifai 
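Optionally, you can confirm the package (and its version) installed correctly before moving on. A quick check using Python's standard importlib.metadata module:

# Optional sanity check that the clarifai package is installed
from importlib.metadata import version

print(version("clarifai"))  # prints the installed SDK version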

Set a PAT Key

You need to set the CLARIFAI_PAT (Personal Access Token) as an environment variable. You can generate the PAT key in your personal settings page by navigating to the Security section.

This token is essential for authenticating your connection to the Clarifai platform.

 export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE 
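Before moving on, you can verify that the variable is visible to Python. A minimal check; CLARIFAI_PAT is the variable name the SDK reads:

import os

# Warn early if the token isn't available in the current shell environment
if not os.environ.get("CLARIFAI_PAT"):
    print("CLARIFAI_PAT is not set - export it before uploading the model.")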

Get a Hugging Face Access Token

To download models from the Hugging Face platform, you'll need to authenticate your connection. You can create a Hugging Face account, then generate an access token to authorize your downloads.

Follow the guide here to generate one.
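If you'd like to confirm the token works before adding it to your configuration later on, here is a minimal sketch using the huggingface_hub library (installed alongside transformers in most environments; the token value below is a placeholder):

from huggingface_hub import whoami

# Replace with your actual access token; whoami() raises an error if the token is invalid
print(whoami(token="YOUR_HF_ACCESS_TOKEN_HERE"))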

Step 2: Create Files

Create a project directory and organize your files as shown below to meet the requirements for uploading models to the Clarifai platform; a short snippet for scaffolding this layout follows the list.

your_model_directory/
├── 1/
│   └── model.py
├── requirements.txt
└── config.yaml
  • your_model_directory/ – The main directory containing your model files.
    • 1/ – A subdirectory that holds the model code (note that this folder must be named 1).
      • model.py – Contains the code that defines your model, including loading the model and running inference.
    • requirements.txt – Lists the Python libraries and dependencies required to run your model.
    • config.yaml – Contains model metadata and configuration details necessary for building the Docker image, defining compute resources, and uploading the model to Clarifai.
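If you prefer, you can scaffold this layout from Python; your_model_directory is just a placeholder name, so use whatever fits your project:

from pathlib import Path

# Create the expected folder layout with empty placeholder files
root = Path("your_model_directory")
(root / "1").mkdir(parents=True, exist_ok=True)
(root / "1" / "model.py").touch()
(root / "requirements.txt").touch()
(root / "config.yaml").touch()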

Add the following snippets to each of the respective files.

model.py

from typing import List, Iterator
from threading import Thread
import os
import torch

from clarifai.runners.models.model_class import ModelClass
from clarifai.utils.logging import logger
from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.utils.openai_convertor import openai_response
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


class MyModel(ModelClass):
    """A custom runner for the llama-3.2-1b-instruct LLM that integrates with the Clarifai platform."""

    def load_model(self):
        """Load the model here."""
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        logger.info(f"Running on device: {self.device}")

        # Download the checkpoints configured in config.yaml
        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        self.checkpoints = builder.download_checkpoints(stage="runtime")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.checkpoints)
        self.tokenizer.pad_token = self.tokenizer.eos_token  # Set pad token to eos token
        self.model = AutoModelForCausalLM.from_pretrained(
            self.checkpoints,
            low_cpu_mem_usage=True,
            device_map=self.device,
            torch_dtype=torch.bfloat16,
        )
        self.streamer = TextIteratorStreamer(tokenizer=self.tokenizer)
        self.chat_template = None
        logger.info("Done loading!")

    @ModelClass.method
    def predict(self,
                prompt: str = "",
                chat_history: List[dict] = None,
                max_tokens: int = 512,
                temperature: float = 0.7,
                top_p: float = 0.8) -> str:
        """Predict the response for the given prompt and chat history using the model."""
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            })

        # Tokenize using the chat template
        inputs = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self.model.device)

        generation_kwargs = {
            "input_ids": inputs["input_ids"],
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
        }

        output = self.model.generate(**generation_kwargs)
        generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

    @ModelClass.method
    def generate(self,
                 prompt: str = "",
                 chat_history: List[dict] = None,
                 max_tokens: int = 512,
                 temperature: float = 0.7,
                 top_p: float = 0.8) -> Iterator[str]:
        """Stream generated text tokens from a prompt + optional chat history."""
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": [{"type": "text", "text": prompt}]
            })

        response = self.chat(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        for each in response:
            yield each['choices'][0]['delta']['content']

    @ModelClass.method
    def chat(self,
             messages: List[dict],
             max_tokens: int = 512,
             temperature: float = 0.7,
             top_p: float = 0.8) -> Iterator[dict]:
        """
        Stream back JSON dicts for assistant messages.
        Example return format:
        {"role": "assistant", "content": [{"type": "text", "text": "response here"}]}
        """
        # Tokenize using the chat template
        inputs = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self.model.device)

        generation_kwargs = {
            "input_ids": inputs["input_ids"],
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
            "streamer": self.streamer
        }

        # Run generation in a background thread so tokens can be streamed as they arrive
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        # Yield each streamed chunk in OpenAI-compatible format
        for token_text in self.streamer:
            yield openai_response(token_text)

        thread.join()

    def test(self):
        """Test the model here."""
        try:
            print("Testing predict...")
            # Test predict
            print(self.predict(prompt="What is the capital of India?"))
        except Exception as e:
            print("Error in predict", e)

        try:
            print("Testing generate...")
            # Test generate
            for each in self.generate(prompt="What is the capital of India?"):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in generate", e)

        try:
            print("Testing chat...")
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of India?"},
            ]
            for each in self.chat(messages=messages):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in chat", e)

requirements.txt

torch==2.5.1
tokenizers>=0.21.0
transformers>=4.47.0
accelerate>=1.2.0
scipy==1.10.1
optimum>=1.23.3
protobuf==5.27.3
einops>=0.8.0
requests==2.32.3
clarifai>=11.3.0

config.yaml

important

In the model section of the config.yaml file, specify your model ID, Clarifai user ID, and Clarifai app ID. These will define where your model will be uploaded on the Clarifai platform. You also need to specify the hf_token to authenticate your connection to Hugging Face, as described earlier.

model:
  id: "llama_3_2_1b_instruct"
  user_id: "user_id"
  app_id: "app_id"
  model_type_id: "text-to-text"

build_info:
  python_version: "3.11"

inference_compute_info:
  cpu_limit: "1"
  cpu_memory: "13Gi"
  num_accelerators: 1
  accelerator_type: ["NVIDIA-*"]
  accelerator_memory: "18Gi"

checkpoints:
  type: "huggingface"
  repo_id: "unsloth/Llama-3.2-1B-Instruct"
  hf_token: "hf_token"
  when: "runtime"
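Because YAML is indentation-sensitive, it can help to confirm the file parses before uploading. A small optional check, assuming PyYAML is available in your environment (pip install pyyaml if it isn't):

import yaml

# Parse config.yaml and list its top-level sections to confirm the structure is valid
with open("config.yaml") as f:
    config = yaml.safe_load(f)

print(list(config.keys()))  # expected: ['model', 'build_info', 'inference_compute_info', 'checkpoints']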

Step 3: Upload the Model

Once your custom model is ready, upload it to the Clarifai platform by navigating to the directory containing the model and running the following command:

 clarifai model upload 

Congratulations — you've just uploaded your first model to the Clarifai platform!

Now, you can deploy the model to a cluster and nodepool, which lets you run inference with it in a cost-efficient, scalable way.
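Once the model is deployed, you can call it from the Python SDK. The sketch below assumes the SDK's Model client exposes the class methods you defined in model.py (predict and generate), that the URL points at your own user ID, app ID, and model ID, and that CLARIFAI_PAT is set as in Step 1:

from clarifai.client import Model

# URL pattern: https://clarifai.com/<user_id>/<app_id>/models/<model_id>
model = Model(url="https://clarifai.com/user_id/app_id/models/llama_3_2_1b_instruct")

# Single response
print(model.predict(prompt="What is the capital of India?"))

# Streaming response
for chunk in model.generate(prompt="What is the capital of India?"):
    print(chunk, end="")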