Hugging Face
Download and run Hugging Face models locally and make them available via a public API
Hugging Face is an open-source platform for sharing, exploring, and collaborating on a wide range of pre-trained models and related assets.
With Clarifai’s Local Runners, you can run these models directly on your machine, expose them securely via a public URL, and tap into Clarifai’s powerful platform — all while preserving the speed, privacy, and control of local deployment.
Step 1: Perform Prerequisites
Sign Up or Log In
Log in to your existing Clarifai account or sign up for a new one. Once logged in, you’ll need the following credentials for setup:
- App ID – Navigate to the application you want to use to run the model and select the Overview option in the collapsible left sidebar. Get the app ID from there.
- User ID – In the collapsible left sidebar, select Settings and choose Account from the dropdown list. Then, locate your user ID.
- Personal Access Token (PAT) – From the same Settings option, choose Secrets to generate or copy your PAT. This token is used to authenticate your connection with the Clarifai platform.
You can then set the PAT as an environment variable using CLARIFAI_PAT.
- Unix-Like Systems
- Windows
export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
set CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
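If you'd like to confirm that the variable is visible before moving on, a quick check like the one below works. This is a minimal sketch that only reads the environment and makes no Clarifai API calls:
- Python
import os

pat = os.environ.get("CLARIFAI_PAT")
if not pat:
    raise SystemExit("CLARIFAI_PAT is not set in this shell")
print(f"CLARIFAI_PAT found ({len(pat)} characters)")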
Install Clarifai CLI
Install the latest version of the Clarifai CLI tool. It includes built-in support for Local Runners.
- Bash
pip install --upgrade clarifai
Note: You'll need Python 3.11 or 3.12 installed to successfully run the Local Runners.
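If you're unsure which interpreter your shell uses, this quick check (a minimal sketch) prints the version and warns if it falls outside the supported range:
- Python
import sys

print(sys.version)
if sys.version_info[:2] not in {(3, 11), (3, 12)}:
    print("Warning: Local Runners require Python 3.11 or 3.12")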
Get Hugging Face Token
A Hugging Face access token is required to authenticate with Hugging Face services, especially when downloading models from private or restricted repositories.
You can create one by following these instructions. Once you have it, provide the token either in your model’s config.yaml
file (as described below) or as an environment variable.
- Unix-Like Systems
- Windows
export HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"
set HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"
Install Hugging Face Hub
The huggingface_hub
library is used under the hood to fetch files from the Hugging Face Hub. While you won’t interact with it directly, it’s required for downloading the models and resources automatically.
- Bash
pip install huggingface_hub
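If you want to verify that your token is picked up correctly, huggingface_hub provides login and whoami helpers you can call directly. The snippet below is a minimal sketch that assumes HF_TOKEN was exported as shown above:
- Python
import os
from huggingface_hub import login, whoami

login(token=os.environ["HF_TOKEN"])  # registers the token for subsequent downloads
print(whoami()["name"])              # prints the account name associated with the token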
Step 2: Initialize a Model
With the Clarifai CLI, you can download and set up any supported Hugging Face model directly in your local environment.
For example, the command below initializes the default model (unsloth/Llama-3.2-1B-Instruct
) in your current directory.
- Bash
clarifai model init --toolkit huggingface
Example Output
clarifai model init --toolkit huggingface
[INFO] 14:15:28.372128 Parsed GitHub repository: owner=Clarifai, repo=runners-examples, branch=huggingface, folder_path= | thread=8800297152
[INFO] 14:15:29.583471 Files to be downloaded are:
1. 1/model.py
2. config.yaml
3. requirements.txt | thread=8800297152
Press Enter to continue...
[INFO] 14:15:31.611534 Initializing model from GitHub repository: https://github.com/Clarifai/runners-examples | thread=8800297152
[INFO] 14:15:34.840210 Successfully cloned repository from https://github.com/Clarifai/runners-examples (branch: huggingface) | thread=8800297152
[INFO] 14:15:34.845345 Model initialization complete with GitHub repository | thread=8800297152
[INFO] 14:15:34.845394 Next steps: | thread=8800297152
[INFO] 14:15:34.845423 1. Review the model configuration | thread=8800297152
[INFO] 14:15:34.845444 2. Install any required dependencies manually | thread=8800297152
[INFO] 14:15:34.845466 3. Test the model locally using 'clarifai model local-test' | thread=8800297152
The command above generates a new model directory structure that is compatible with the Clarifai platform. You can customize or optimize the model by editing the generated files as needed.
You can use the --model-name parameter to initialize any supported Hugging Face model. Example: clarifai model init --toolkit huggingface --model-name Qwen/Qwen2-0.5B.
Supported Models
- unsloth/Llama-3.2-1B-Instruct
- Qwen/Qwen2-0.5B
- Qwen/Qwen3-1.7B
- Qwen/Qwen3-0.6B
- Qwen/Qwen3-4B-Thinking-2507
- Qwen/Qwen3-4B-Instruct-2507
- HuggingFaceTB/SmolLM-1.7B-Instruct
- stabilityai/stablelm-zephyr-3b
- microsoft/Phi-3-mini-4k-instruct
- google/gemma-3n-E2B-it
The generated structure includes:
├── 1/
│ └── model.py
├── requirements.txt
└── config.yaml
model.py
model.py Example
from typing import List, Iterator
from threading import Thread
import os

import torch
from clarifai.runners.models.model_class import ModelClass
from clarifai.utils.logging import logger
from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.utils.openai_convertor import openai_response
from clarifai.runners.utils.data_utils import Param
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer


class MyModel(ModelClass):
    """A custom runner for the llama-3.2-1b-instruct LLM that integrates with the Clarifai platform."""

    def load_model(self):
        """Load the model here."""
        if torch.backends.mps.is_available():
            self.device = 'mps'
        elif torch.cuda.is_available():
            self.device = 'cuda'
        else:
            self.device = 'cpu'
        logger.info(f"Running on device: {self.device}")

        # Load checkpoints
        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        self.checkpoints = builder.config['checkpoints']['repo_id']
        logger.info(f"Loading model from: {self.checkpoints}")

        # Load model and tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(self.checkpoints)
        self.tokenizer.pad_token = self.tokenizer.eos_token  # Set pad token to eos token
        self.model = AutoModelForCausalLM.from_pretrained(
            self.checkpoints,
            low_cpu_mem_usage=True,
            device_map=self.device,
            torch_dtype=torch.bfloat16,
        )
        self.streamer = TextIteratorStreamer(tokenizer=self.tokenizer, skip_prompt=True, skip_special_tokens=True)
        self.chat_template = None
        logger.info("Done loading!")

    @ModelClass.method
    def predict(self,
                prompt: str = "",
                chat_history: List[dict] = None,
                max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
                temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
                top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")) -> str:
        """
        Predict the response for the given prompt and chat history using the model.
        """
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": prompt
            })
        inputs = self.tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(self.model.device)

        generation_kwargs = {
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
        }
        output = self.model.generate(**inputs, **generation_kwargs)
        generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)

    @ModelClass.method
    def generate(self,
                 prompt: str = "",
                 chat_history: List[dict] = None,
                 max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
                 temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
                 top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")) -> Iterator[str]:
        """Stream generated text tokens from a prompt + optional chat history."""
        # Construct chat-style messages
        messages = chat_history if chat_history else []
        if prompt:
            messages.append({
                "role": "user",
                "content": prompt
            })
        logger.info(f"Generating response for messages: {messages}")
        response = self.chat(
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p
        )
        for each in response:
            if 'choices' in each and 'delta' in each['choices'][0] and 'content' in each['choices'][0]['delta']:
                yield each['choices'][0]['delta']['content']

    @ModelClass.method
    def chat(self,
             messages: List[dict],
             max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
             temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
             top_p: float = Param(default=0.8, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")
             ) -> Iterator[dict]:
        """
        Stream back JSON dicts for assistant messages.

        Example return format:
        {"role": "assistant", "content": [{"type": "text", "text": "response here"}]}
        """
        # Tokenize using chat template
        inputs = self.tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        ).to(self.model.device)

        generation_kwargs = {
            "input_ids": inputs,
            "do_sample": True,
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "top_p": top_p,
            "eos_token_id": self.tokenizer.eos_token_id,
            "streamer": self.streamer
        }

        # Run generation in a background thread and stream tokens as they arrive
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()

        # Yield response chunks in OpenAI-compatible format
        for chunk in openai_response(self.streamer):
            yield chunk
        thread.join()

    def test(self):
        """Test the model here."""
        try:
            print("Testing predict...")
            # Test predict
            print(self.predict(prompt="What is the capital of India?"))
        except Exception as e:
            print("Error in predict", e)

        try:
            print("Testing generate...")
            # Test generate
            for each in self.generate(prompt="What is the capital of India?"):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in generate", e)

        try:
            print("Testing chat...")
            messages = [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "What is the capital of India?"},
            ]
            for each in self.chat(messages=messages):
                print(each, end="")
            print()
        except Exception as e:
            print("Error in chat", e)
The model.py
file, which is located inside the 1
folder, defines the logic for the Hugging Face model, including how predictions are made.
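To see what the chat-template step in predict() and chat() actually produces, you can render a message list to plain text with the same tokenizer outside the runner. This is a small illustrative sketch, assuming transformers is installed and the default unsloth/Llama-3.2-1B-Instruct checkpoint:
- Python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of India?"},
]

# Render the messages into the prompt string the model is asked to continue
prompt_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt_text)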
config.yaml
config.yaml Example
build_info:
  python_version: '3.11'
checkpoints:
  hf_token: hf_token
  repo_id: unsloth/Llama-3.2-1B-Instruct
  type: huggingface
  when: runtime
inference_compute_info:
  accelerator_memory: 44Gi
  accelerator_type:
  - NVIDIA-*
  cpu_limit: '1'
  cpu_memory: 13Gi
  num_accelerators: 1
model:
  app_id: your-app-id-here
  id: hf-local-runner-model
  model_type_id: text-to-text
  user_id: your-user-id-here
The config.yaml
file specifies the model’s configuration, including compute resource requirements, checkpoints, and other essential settings.
- In the model section, you need to specify a unique model ID (any name you choose), along with your Clarifai user ID and app ID, which together determine where your model will run on the Clarifai platform.
- In the checkpoints section, you can provide your Hugging Face token using the hf_token parameter if you need to access private or restricted repositories. This section also includes the when parameter, which controls when model checkpoints are downloaded and stored. The available options are runtime (the default), which downloads checkpoints when the model is loaded; build, which downloads checkpoints during the image build process; and upload, which downloads checkpoints before the model is uploaded.
Note: For large models, it is strongly recommended to set when: runtime. Doing so helps prevent image sizes from becoming unnecessarily large, which keeps build times shorter, uploads faster, and inference more efficient on the Clarifai platform. By contrast, choosing build or upload can significantly increase the image size, leading to slower uploads and higher cold start latency.
requirements.txt
requirements.txt Example
torch==2.5.1
tokenizers>=0.21.0
transformers>=4.47.0
accelerate>=1.2.0
scipy>=1.10.0
optimum>=1.23.3
protobuf==5.27.3
einops>=0.8.0
requests==2.32.3
clarifai>=11.4.1
timm
The requirements.txt
file lists Python dependencies needed by your model. You need to install them by running the following command:
- Bash
pip install -r requirements.txt
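After installation, a quick import check (a minimal sketch) confirms the core libraries resolved correctly and shows which accelerator is available, mirroring the device selection in model.py:
- Python
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())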
Step 3: Log In to Clarifai
Run the following command to log in to the Clarifai platform, create a configuration context, and establish a connection:
clarifai login
You’ll be prompted to provide a few details for authentication:
- User ID – Enter your Clarifai user ID.
- PAT – Enter your Clarifai PAT. If you've already set the CLARIFAI_PAT environment variable, type ENVVAR to use it automatically.
- Context name – Assign a custom name to this configuration context, or press Enter to accept the default name, "default". This is helpful if you manage multiple environments or configurations.
Example Output
clarifai login
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security
Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): ENVVAR
> Verifying token...
[INFO] 13:39:42.773825 Validating the Context Credentials... | thread=8800297152
[INFO] 13:39:46.740886 ✅ Context is valid | thread=8800297152
> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.
Enter a name for this context [default]: default
✅ Success! You are now logged in.
Credentials saved to the 'default' context.
💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 13:41:01.395603 Login successful for user 'alfrick' in context 'default' | thread=8800297152
Step 4: Start Your Local Runner
Start a local runner with the following command:
clarifai model local-runner
If the required context configurations aren’t found, the CLI will walk you through creating them with default values.
This process ensures that your configuration context includes all the necessary components, such as compute clusters, nodepools, and deployments, which are described here.
Simply review each prompt and confirm to continue.
Example Output
clarifai model local-runner
[INFO] 15:04:30.110675 Checking setup for local runner... | thread=8800297152
[INFO] 15:04:30.110764 Current context: default | thread=8800297152
[INFO] 15:04:30.110803 Current user_id: alfrick | thread=8800297152
[INFO] 15:04:30.133269 Current compute_cluster_id: local-runner-compute-cluster | thread=8800297152
[INFO] 15:04:32.213980 Failed to get compute cluster with ID local-runner-compute-cluster: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "ComputeCluster with ID \'local-runner-compute-cluster\' not found. Check your request fields."
req_id: "sdk-python-11.6.4-adac6224603147b4a6422e7ab3d8999f"
| thread=8800297152
Compute cluster not found. Do you want to create a new compute cluster alfrick/local-runner-compute-cluster? (y/n): y
[INFO] 15:04:39.978695
Compute Cluster created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-02d952ca15d4431ebf1d998247b5559f"
| thread=8800297152
[INFO] 15:04:39.986611 Current nodepool_id: local-runner-nodepool | thread=8800297152
[INFO] 15:04:41.235547 Failed to get nodepool with ID local-runner-nodepool: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Nodepool not found. Check your request fields."
req_id: "sdk-python-11.6.4-f1af74b390d54ee68b1c0d7025c412a8"
| thread=8800297152
Nodepool not found. Do you want to create a new nodepool alfrick/local-runner-compute-cluster/local-runner-nodepool? (y/n): y
[INFO] 15:04:43.256490
Nodepool created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-3c0ebab572cf495996ce5337da2cc24e"
| thread=8800297152
[INFO] 15:04:43.269204 Current app_id: local-runner-app | thread=8800297152
[INFO] 15:04:43.580198 Current model_id: local-runner-model | thread=8800297152
[INFO] 15:04:46.328600 Current model version 9d38bb9398944de4bdef699835f17ec9 | thread=8800297152
[INFO] 15:04:46.329168 Create the local runner tying this
alfrick/local-runner-app/models/local-runner-model model (version: 9d38bb9398944de4bdef699835f17ec9) to the
alfrick/local-runner-compute-cluster/local-runner-nodepool nodepool. | thread=8800297152
[INFO] 15:04:47.564767
Runner created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-5a17241947ba4ac593f29eacecb4d61d"
with id: 7dbd2b733acb4684a4cb8d3b11ee626a | thread=8800297152
[INFO] 15:04:47.573245 Current runner_id: 7dbd2b733acb4684a4cb8d3b11ee626a | thread=8800297152
[INFO] 15:04:47.828638 Failed to get deployment with ID local-runner-deployment: code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Deployment with ID \'local-runner-deployment\' not found. Check your request fields."
req_id: "sdk-python-11.6.4-1f51c07b0cc54f14893175401f6fda1d"
| thread=8800297152
Deployment not found. Do you want to create a new deployment alfrick/local-runner-compute-cluster/local-runner-nodepool/local-runner-deployment? (y/n): y
[INFO] 15:04:50.307460
Deployment created
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.6.4-6afe596327fb42bc818940fb324cc8bc"
| thread=8800297152
[INFO] 15:04:50.315366 Current deployment_id: local-runner-deployment | thread=8800297152
[INFO] 15:04:50.315542 Full url for the model: https://clarifai.com/users/alfrick/apps/local-runner-app/models/local-runner-model/versions/9d38bb9398944de4bdef699835f17ec9 | thread=8800297152
[INFO] 15:04:50.318223 Current model section of config.yaml: {'app_id': 'items-app', 'id': 'first-local-runner-model', 'model_type_id': 'text-to-text', 'user_id': 'alfrick'} | thread=8800297152
Do you want to backup config.yaml to config.yaml.bk then update the config.yaml with the new model information? (y/n): y
[INFO] 15:04:53.150497 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:00.624489
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# About to start up the local runner in this terminal...
# Here is a code snippet to call this model once it start from another terminal:
| thread=8800297152
[INFO] 15:05:00.624556
# Clarifai Model Client Script
# Set the environment variables `CLARIFAI_DEPLOYMENT_ID` and `CLARIFAI_PAT` to run this script.
# Example usage:
import os
from clarifai.client import Model
from clarifai.runners.utils import data_types
model = Model("https://clarifai.com/alfrick/local-runner-app/models/local-runner-model",
deployment_id = 'local-runner-deployment', # Only needed for dedicated deployed models
base_url='https://api.clarifai.com',
)
# Example model prediction from different model methods:
response = model.predict(prompt="What is the future of AI?", max_tokens=512, temperature=0.7, top_p=0.8)
print(response)
response = model.generate(prompt="What is the future of AI?", max_tokens=512, temperature=0.7, top_p=0.8)
for res in response:
print(res)
response = model.chat(max_tokens=512, temperature=0.7, top_p=0.8)
for res in response:
print(res)
| thread=8800297152
[INFO] 15:05:00.624593 Now starting the local runner... | thread=8800297152
[INFO] 15:05:01.132806 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:01.170311 Running on device: mps | thread=8800297152
[INFO] 15:05:01.790161 Hugging Face repo access validated | thread=8800297152
[INFO] 15:05:01.791351 Loading model from: unsloth/Llama-3.2-1B-Instruct | thread=8800297152
tokenizer_config.json: 54.7kB [00:00, 49.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████| 17.2M/17.2M [00:07<00:00, 2.27MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████| 454/454 [00:00<00:00, 1.33MB/s]
chat_template.jinja: 3.83kB [00:00, 7.36MB/s]
config.json: 100%|████████████████████████████████████████████████████████████| 894/894 [00:00<00:00, 3.20MB/s]
`torch_dtype` is deprecated! Use `dtype` instead!
Step 5: Test Your Runner
Once the local runner starts, it provides a sample client code snippet you can use for quick testing.
You can run the snippet in a separate terminal within the same directory to see the model’s response.
Here’s an example snippet:
- Python SDK
# Before running this script, set the environment variables:
#   CLARIFAI_DEPLOYMENT_ID (optional – only required for dedicated deployments)
#   CLARIFAI_PAT (your Personal Access Token)

from clarifai.client import Model

# Initialize the model
model = Model(
    "https://clarifai.com/alfrick/local-runner-app/models/local-runner-model",
    # deployment_id="local-runner-deployment",  # Uncomment if using a deployed model
)

# Run a basic prediction
response = model.predict(
    prompt="What is the future of AI?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)
print(response)

'''
--- Additional examples ---

# Using the generate method (streams the response)
response = model.generate(
    prompt="What is the future of AI?",
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)
for res in response:
    print(res)

# Using the chat method (takes a list of role/content messages)
response = model.chat(
    messages=[{"role": "user", "content": "What is the future of AI?"}],
    max_tokens=512,
    temperature=0.7,
    top_p=0.8,
)
for res in response:
    print(res)
'''