
Hugging Face

Download and run Hugging Face models locally and make them available via a public API


Hugging Face is an open-source platform for sharing, exploring, and collaborating on a wide range of pre-trained models and related assets.

With Clarifai’s Local Runners, you can run these models directly on your machine, expose them securely via a public URL, and tap into Clarifai’s powerful platform — all while preserving the speed, privacy, and control of local deployment.

Note: After initializing a model using the Hugging Face toolkit, you can upload it to Clarifai to leverage the platform’s capabilities.

Step 1: Perform Prerequisites

Get User ID and PAT

Start by logging in to your existing Clarifai account or signing up for a new one. Once logged in, you’ll need your Personal Access Token (PAT) for authentication:

  • In the collapsible left sidebar, select Settings and choose Secrets to generate or copy your PAT.

You can then set the PAT as an environment variable using CLARIFAI_PAT.

export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE

Install Clarifai CLI

Install the latest version of the Clarifai CLI tool. It includes built-in support for Local Runners.

pip install --upgrade clarifai

Note: You'll need Python 3.11 or 3.12 installed to successfully run the Local Runners.

Get Hugging Face Token

A Hugging Face access token is required to authenticate with Hugging Face services, especially when downloading models from private or restricted repositories.

You can create one by following these instructions. Once you have it, provide the token either in your model’s config.yaml file (as described below) or as an environment variable.

Note: If hf_token is not specified in the config.yaml file, the CLI automatically falls back to the HF_TOKEN environment variable to authenticate with Hugging Face.

export HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"

Install Hugging Face Hub

The huggingface_hub library is used under the hood to fetch files from the Hugging Face Hub. While you won’t interact with it directly, it’s required for downloading the models and resources automatically.

pip install huggingface_hub

Step 2: Initialize a Model

With the Clarifai CLI, you can download and set up any supported Hugging Face model directly in your local environment. Use the --model-name flag to specify a Hugging Face model:

clarifai model init --toolkit huggingface --model-name Qwen/Qwen2-0.5B

This creates a Qwen2-0.5B/ directory with all required files pre-configured. The CLI auto-selects an appropriate GPU instance based on the model's VRAM requirements.
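A rough rule of thumb for such VRAM estimates (an illustrative assumption, not the CLI's exact heuristic): bf16 weights take about 2 bytes per parameter, plus some overhead for activations and the KV cache.

```python
def estimated_vram_gib(num_params: float, bytes_per_param: int = 2, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes scaled by an activation/KV-cache overhead factor."""
    return num_params * bytes_per_param * overhead / 2**30

# e.g. a 0.5B-parameter model in bf16:
print(f"{estimated_vram_gib(0.5e9):.2f} GiB")  # → 1.12 GiB
```

Larger models scale linearly, which is why an 8B-parameter model already needs a multi-GPU-class instance.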

Note: You can initialize a model in a specific location by passing a MODEL_PATH.

Example Output
clarifai model init --toolkit huggingface --model-name google/gemma-2b
[INFO] Initializing model with huggingface toolkit...
[INFO] Updated Hugging Face model repo_id to: google/gemma-2b
Instance: g6e.2xlarge (Estimated 21.6 GiB VRAM, fits g6e.2xlarge (45 GiB))

Model initialized in ./gemma-2b

Test locally:
clarifai model serve ./gemma-2b
clarifai model serve ./gemma-2b --mode env # auto-create venv and install deps
clarifai model serve ./gemma-2b --mode container # run inside Docker

Deploy to Clarifai:
clarifai model deploy ./gemma-2b
clarifai list-instances # list available instances

You can customize or optimize the model by editing the generated files as needed.

tip

To initialize with a default model (unsloth/Llama-3.2-1B-Instruct), omit --model-name:

clarifai model init --toolkit huggingface
Supported Models
- unsloth/Llama-3.2-1B-Instruct
- Qwen/Qwen2-0.5B
- Qwen/Qwen3-1.7B
- Qwen/Qwen3-0.6B
- Qwen/Qwen3-4B-Thinking-2507
- Qwen/Qwen3-4B-Instruct-2507
- HuggingFaceTB/SmolLM-1.7B-Instruct
- stabilityai/stablelm-zephyr-3b
- microsoft/Phi-3-mini-4k-instruct
- google/gemma-3n-E2B-it

Note: Some models are quite large and require substantial memory or GPU resources. Ensure your machine has sufficient compute capacity to load and run the model locally before initializing it.

The generated structure includes:

├── 1/
│   └── model.py
├── requirements.txt
└── config.yaml

model.py

Example: model.py
from typing import List, Iterator
from threading import Thread
import os
import torch

from clarifai.runners.models.model_class import ModelClass
from clarifai.utils.logging import logger
from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.utils.openai_convertor import openai_response
from clarifai.runners.utils.data_utils import Param
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


class MyModel(ModelClass):
    """A custom runner for the llama-3.2-1b-instruct LLM that integrates with the Clarifai platform."""

    def load_model(self):
        """Load the model here."""
        if torch.backends.mps.is_available():
            self.device = 'mps'
        elif torch.cuda.is_available():
            self.device = 'cuda'
        else:
            self.device = 'cpu'

        logger.info(f"Running on device: {self.device}")

        # Load checkpoints
        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        self.checkpoints = builder.config['checkpoints']['repo_id']
        logger.info(f"Loading model from: {self.checkpoints}")
        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.checkpoints)
        self.tokenizer.pad_token = self.tokenizer.eos_token  # Set pad token to eos token
        self.model = AutoModelForCausalLM.from_pretrained(
            self.checkpoints,
            low_cpu_mem_usage=True,
            device_map=self.device,
            torch_dtype=torch.bfloat16,
        )
        self.streamer = TextIteratorStreamer(tokenizer=self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        self.chat_template = None
        logger.info("Done loading!")

    @ModelClass.method
    def predict(self,
                prompt: str = "",
                chat_history: List[dict] = None,
                max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
                temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
                top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")) -> str:
        """
        Predict the response for the given prompt and chat history using the model.
        """
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": prompt
            })

        inputs = self.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(self.model.device)

        generation_kwargs = {
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
        }

        output = self.model.generate(**inputs, **generation_kwargs)
        generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

    @ModelClass.method
    def generate(self,
                 prompt: str = "",
                 chat_history: List[dict] = None,
                 max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
                 temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
                 top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")) -> Iterator[str]:
        """Stream generated text tokens from a prompt + optional chat history."""
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": prompt
            })
        logger.info(f"Generating response for messages: {messages}")
        response = self.chat(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )

        for each in response:
            if 'choices' in each and 'delta' in each['choices'][0] and 'content' in each['choices'][0]['delta']:
                yield each['choices'][0]['delta']['content']

    @ModelClass.method
    def chat(self,
             messages: List[dict],
             max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
             temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
             top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")
             ) -> Iterator[dict]:
        """
        Stream back JSON dicts for assistant messages.
        Example return format:
        {"role": "assistant", "content": [{"type": "text", "text": "response here"}]}
        """
        # Tokenize using chat template
        inputs = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(self.model.device)

        generation_kwargs = {
            "input_ids": inputs,
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
            "streamer": self.streamer
        }

        # Run generation on a background thread so we can stream tokens as they arrive
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        # Yield OpenAI-style response chunks as they are produced
        for chunk in openai_response(self.streamer):
            yield chunk

        thread.join()

    def test(self):
        """Test the model here."""
        try:
            print("Testing predict...")
            # Test predict
            print(self.predict(prompt="What is the capital of India?"))
        except Exception as e:
            print("Error in predict", e)

        try:
            print("Testing generate...")
            # Test generate
            for each in self.generate(prompt="What is the capital of India?"):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in generate", e)

        try:
            print("Testing chat...")
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of India?"},
            ]
            for each in self.chat(messages=messages):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in chat", e)

The model.py file, which is located inside the 1 folder, defines the logic for the Hugging Face model, including how predictions are made.
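In the generate method, the OpenAI-style chunks yielded by chat are filtered down to their text deltas. That extraction step can be exercised in isolation (the chunk dicts below are made up for illustration):

```python
def extract_deltas(chunks):
    """Yield the incremental text content from OpenAI-style streaming chunks."""
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if choices and "content" in choices[0].get("delta", {}):
            yield choices[0]["delta"]["content"]

chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},  # role-only chunk, no text yet
    {"choices": [{"delta": {"content": "New "}}]},
    {"choices": [{"delta": {"content": "Delhi"}}]},
]
print("".join(extract_deltas(chunks)))  # → New Delhi
```

Chunks without a content delta (such as the initial role announcement or the final stop chunk) are simply skipped.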

config.yaml

Example: config.yaml
build_info:
  python_version: '3.11'
checkpoints:
  repo_id: google/gemma-2b
  type: huggingface
  when: runtime
compute:
  instance: g6e.2xlarge
model:
  id: gemma-2b

The config.yaml file defines your Hugging Face model’s configuration in a simplified format:

  • model.id — A unique identifier for your model. Auto-generated from the Hugging Face model name when you use --model-name.
  • build_info.python_version — The Python version to use (default: 3.11).
  • compute.instance — The GPU instance type, auto-selected based on the model’s VRAM requirements. Run clarifai list-instances to see all available options.
  • checkpoints — Defines how to retrieve model weights. If you’re using a gated repository, add your Hugging Face access token via hf_token or set the HF_TOKEN environment variable.
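For a gated repository, the checkpoints section might look like this (the token value is a placeholder, not a real credential):

```yaml
checkpoints:
  repo_id: google/gemma-2b
  type: huggingface
  when: runtime
  hf_token: YOUR_HF_ACCESS_TOKEN_HERE
```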

user_id and app_id are auto-filled from your active context at deploy time. You don’t need to add them manually.

Tip: Use when: runtime (the default) for large models to reduce image size and improve load times.

requirements.txt

Example: requirements.txt
torch==2.5.1
tokenizers>=0.21.0
transformers>=4.47.0
accelerate>=1.2.0
scipy>=1.10.0
optimum>=1.23.3
protobuf==5.27.3
einops>=0.8.0
requests==2.32.3
clarifai>=11.4.1
timm

The requirements.txt file lists the Python dependencies your model needs. Install them with:

pip install -r requirements.txt

Step 3: Log In to Clarifai

Run the following command to log in to the Clarifai platform, create a configuration context, and establish a connection:

clarifai login

You’ll be prompted to provide a few details for authentication:

  • User ID – Enter your Clarifai user ID.
  • PAT – Enter your Clarifai PAT. If you’ve already set the CLARIFAI_PAT environment variable, type ENVVAR to use it automatically.
  • Context name – Assign a custom name to this configuration context, or press Enter to accept the default name, "default". This is helpful if you manage multiple environments or configurations.
Example Output
clarifai login                                      
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security

Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): ENVVAR

> Verifying token...
[INFO] 13:39:42.773825 Validating the Context Credentials... | thread=8800297152
[INFO] 13:39:46.740886 ✅ Context is valid | thread=8800297152

> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.

Enter a name for this context [default]: default
✅ Success! You are now logged in.
Credentials saved to the 'default' context.

💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 13:41:01.395603 Login successful for user 'alfrick' in context 'default' | thread=8800297152

Step 4: Serve the Model Locally

Start the model using clarifai model serve:

clarifai model serve

Note: The older clarifai model local-runner command still works as an alias.

If the necessary context configurations aren’t detected, the CLI will guide you through creating them using default values.

This setup ensures that all required components (compute clusters, nodepools, and deployments) are properly included in your configuration context, as described here.

Example Output
clarifai model local-runner
[INFO] 15:04:30.110675 Checking setup for local runner... | thread=8800297152
[INFO] 15:04:30.110764 Current context: default | thread=8800297152
[INFO] 15:04:30.110803 Current user_id: alfrick | thread=8800297152
[INFO] 15:04:30.133269 Current compute_cluster_id: local-runner-compute-cluster | thread=8800297152
[INFO] 15:04:32.213980 Failed to get compute cluster with ID local-runner-compute-cluster: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "ComputeCluster with ID \'local-runner-compute-cluster\' not found. Check your request fields."
req_id: "sdk-python-11.6.4-adac6224603147b4a6422e7ab3d8999f"
| thread=8800297152
Compute cluster not found. Do you want to create a new compute cluster alfrick/local-runner-compute-cluster? (y/n): y
[INFO] 15:04:39.978695
Compute Cluster created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-02d952ca15d4431ebf1d998247b5559f"
| thread=8800297152
[INFO] 15:04:39.986611 Current nodepool_id: local-runner-nodepool | thread=8800297152
[INFO] 15:04:41.235547 Failed to get nodepool with ID local-runner-nodepool: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Nodepool not found. Check your request fields."
req_id: "sdk-python-11.6.4-f1af74b390d54ee68b1c0d7025c412a8"
| thread=8800297152
Nodepool not found. Do you want to create a new nodepool alfrick/local-runner-compute-cluster/local-runner-nodepool? (y/n): y
[INFO] 15:04:43.256490
Nodepool created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-3c0ebab572cf495996ce5337da2cc24e"
| thread=8800297152
[INFO] 15:04:43.269204 Current app_id: local-runner-app | thread=8800297152
[INFO] 15:04:43.580198 Current model_id: local-runner-model | thread=8800297152
[INFO] 15:04:46.328600 Current model version 9d38bb9398944de4bdef699835f17ec9 | thread=8800297152
[INFO] 15:04:46.329168 Create the local runner tying this
alfrick/local-runner-app/models/local-runner-model model (version: 9d38bb9398944de4bdef699835f17ec9) to the
alfrick/local-runner-compute-cluster/local-runner-nodepool nodepool. | thread=8800297152
[INFO] 15:04:47.564767
Runner created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-5a17241947ba4ac593f29eacecb4d61d"
with id: 7dbd2b733acb4684a4cb8d3b11ee626a | thread=8800297152
[INFO] 15:04:47.573245 Current runner_id: 7dbd2b733acb4684a4cb8d3b11ee626a | thread=8800297152
[INFO] 15:04:47.828638 Failed to get deployment with ID local-runner-deployment: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Deployment with ID \'local-runner-deployment\' not found. Check your request fields."
req_id: "sdk-python-11.6.4-1f51c07b0cc54f14893175401f6fda1d"
| thread=8800297152
Deployment not found. Do you want to create a new deployment alfrick/local-runner-compute-cluster/local-runner-nodepool/local-runner-deployment? (y/n): y
[INFO] 15:04:50.307460
Deployment created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-6afe596327fb42bc818940fb324cc8bc"
| thread=8800297152
[INFO] 15:04:50.315366 Current deployment_id: local-runner-deployment | thread=8800297152
[INFO] 15:04:50.315542 Full url for the model: https://clarifai.com/users/alfrick/apps/local-runner-app/models/local-runner-model/versions/9d38bb9398944de4bdef699835f17ec9 | thread=8800297152
[INFO] 15:04:50.318223 Current model section of config.yaml: {'app_id': 'items-app', 'id': 'first-local-runner-model', 'model_type_id': 'text-to-text', 'user_id': 'alfrick'} | thread=8800297152
Do you want to backup config.yaml to config.yaml.bk then update the config.yaml with the new model information? (y/n): y
[INFO] 15:04:53.150497 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:00.624489

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# About to start up the local runner in this terminal...
# Here is a code snippet to call this model once it start from another terminal:
| thread=8800297152
[INFO] 15:05:00.624556
# Clarifai Model Client Script
# Set the environment variables `CLARIFAI_DEPLOYMENT_ID` and `CLARIFAI_PAT` to run this script.
# Example usage:
import os

from clarifai.client import Model
from clarifai.runners.utils import data_types

model = Model(
    "https://clarifai.com/alfrick/local-runner-app/models/local-runner-model",
    deployment_id='local-runner-deployment',  # Only needed for dedicated deployed models
    base_url='https://api.clarifai.com',
)


# Example model prediction from different model methods:

response = model.predict(prompt="What is the future of AI?", max_tokens=512, temperature=0.7, top_p=0.8)
print(response)

response = model.generate(prompt="What is the future of AI?", max_tokens=512, temperature=0.7, top_p=0.8)
for res in response:
    print(res)

response = model.chat(max_tokens=512, temperature=0.7, top_p=0.8)
for res in response:
    print(res)

| thread=8800297152
[INFO] 15:05:00.624593 Now starting the local runner... | thread=8800297152
[INFO] 15:05:01.132806 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:01.170311 Running on device: mps | thread=8800297152
[INFO] 15:05:01.790161 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:01.791351 Loading model from: unsloth/Llama-3.2-1B-Instruct | thread=8800297152
tokenizer_config.json: 54.7kB [00:00, 49.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████| 17.2M/17.2M [00:07<00:00, 2.27MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████| 454/454 [00:00<00:00, 1.33MB/s]
chat_template.jinja: 3.83kB [00:00, 7.36MB/s]
config.json: 100%|████████████████████████████████████████████████████████████| 894/894 [00:00<00:00, 3.20MB/s]
`torch_dtype` is deprecated! Use `dtype` instead!

Step 5: Test Your Runner

Once the local runner starts, it provides a sample client code snippet you can use for quick testing.

You can run the snippet in a separate terminal within the same directory to see the model’s response.

Here’s an example snippet:

# Before running this script, set the environment variables:
# CLARIFAI_DEPLOYMENT_ID (optional – only required for dedicated deployments)
# CLARIFAI_PAT (your Personal Access Token)

from clarifai.client import Model

# Initialize the model
model = Model(
    "https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
    # deployment_id="local-runner-deployment",  # Uncomment if using a deployed model
)

# Run a basic prediction
response = model.predict(
    prompt="What is the future of AI?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)

print(response)

'''
--- Additional examples ---

# Using the generate method
response = model.generate(
    prompt="What is the future of AI?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)
for res in response:
    print(res)

# Using the chat method
response = model.chat(
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)
for res in response:
    print(res)
'''

Deploy to Cloud

After testing locally, you can deploy your Hugging Face model to Clarifai's cloud compute with a single command. All infrastructure (compute cluster, nodepool, deployment) is created automatically.

Step 1: Initialize

If you haven't already, scaffold a Hugging Face model project:

clarifai model init --toolkit huggingface --model-name Qwen/Qwen2-0.5B

The CLI auto-selects an appropriate GPU instance based on the model's VRAM requirements.

Step 2: Deploy

If your config.yaml already has a compute.instance value (set during init), you can deploy directly:

clarifai model deploy ./Qwen2-0.5B

To override the configured instance (for example, to use a larger GPU), pass the --instance flag; it always takes priority over the config:

clarifai model deploy ./Qwen2-0.5B --instance g5.xlarge

Browse available GPU instances with clarifai list-instances or clarifai model deploy --instance-info.

Step 3: Monitor and Manage

# Check deployment status
clarifai model status --deployment <deployment-id>

# Stream live logs
clarifai model logs --deployment <deployment-id>

# Run predictions
clarifai model predict user/app/models/Qwen2-0.5B "Explain AI in one sentence"

# Clean up when done
clarifai model undeploy --deployment <deployment-id>

For the full deploy options reference, see the CLI Reference.