vLLM
Run vLLM models locally and make them accessible through a public API
vLLM is an open-source, high-performance inference engine that allows you to serve large language models (LLMs) locally with remarkable speed and efficiency. It supports OpenAI-compatible APIs, making it easy to integrate with the Clarifai platform.
With Clarifai’s Local Runners, you can seamlessly deploy vLLM-powered models on your own machine, expose them through a secure public URL, and take full advantage of Clarifai’s AI capabilities — while retaining control, privacy, and performance.
Note: After downloading the model using the vLLM toolkit, you can upload it to Clarifai to leverage the platform’s capabilities.
Step 1: Perform Prerequisites
Sign Up or Log In
First, either log in to your existing Clarifai account or sign up for a new one. Once logged in, you'll need these credentials to set up your project:
- App ID – Navigate to the application you'll use for your model. In the collapsible left sidebar, select the Overview option; you'll find the app ID there.
- User ID – In the collapsible left sidebar, go to Settings and select the Account option. Then, find your user ID.
- Personal Access Token (PAT) – This token is essential to authenticate your connection with the Clarifai platform. To create or copy your PAT, go to Settings and choose the Secrets option.
Then, store it as an environment variable for secure authentication:
- Unix-like Systems
- Windows
export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
set CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
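If you want to confirm the variable is actually visible before moving on, a minimal Python check like the one below works; it is purely a convenience, not a required step.
- Python
import os

# Fails fast if CLARIFAI_PAT was not exported in the current shell
pat = os.environ.get("CLARIFAI_PAT")
if not pat:
    raise SystemExit("CLARIFAI_PAT is not set - export it before using the Clarifai CLI or SDK.")
print("CLARIFAI_PAT found, length:", len(pat))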
Install the Clarifai CLI
Install the Clarifai CLI to access Local Runners and manage your deployments.
- Bash
pip install --upgrade clarifai
Note: Ensure you have Python 3.11 or 3.12 installed to run Local Runners successfully.
Install vLLM
Install the vLLM package.
- Bash
pip install vllm
vLLM supports models from the Hugging Face Hub (e.g., LLaMA, Mistral, Falcon, etc.) and serves them via a local OpenAI-compatible API.
Note: You need a Hugging Face access token to download models from private or restricted repositories. You can generate one from your Hugging Face account settings.
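If you're unsure whether your token grants access to the repository you plan to use, you can check it with the huggingface_hub library before initializing the model. This is a minimal sketch, assuming your token is stored in an HF_TOKEN environment variable (any variable name works):
- Python
import os
from huggingface_hub import model_info

# Raises an error if the token is invalid or the repo is gated/private and inaccessible
info = model_info("HuggingFaceH4/zephyr-7b-beta", token=os.environ.get("HF_TOKEN"))
print("Access OK:", info.id)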
Install the OpenAI Package
Install the openai client library, which will be used to send requests to your vLLM server.
- Bash
pip install openai
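To see the OpenAI-compatible flow end to end before wiring it into Clarifai, you can start a standalone vLLM server (for example, vllm serve HuggingFaceH4/zephyr-7b-beta, which listens on port 8000 by default) and query it with the client. This is a minimal sketch, assuming the server is running locally with no API key enforced:
- Python
from openai import OpenAI

# Points at a standalone vLLM server started with `vllm serve <model>` (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="HuggingFaceH4/zephyr-7b-beta",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)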
Step 2: Initialize a Model
Use the Clarifai CLI to initialize a vLLM-based model directory. This setup prepares all required files for local execution and Clarifai integration.
You can further customize or optimize the model by modifying the generated files as necessary.
- Bash
clarifai model init --toolkit vllm
You can initialize any model supported by vLLM. If you want to initialize a specific model, pass its Hugging Face repo ID with the --model-name flag.
- Bash
clarifai model init --toolkit vllm --model-name HuggingFaceH4/zephyr-7b-beta
Example Output
clarifai model init --toolkit vllm --model-name HuggingFaceH4/zephyr-7b-beta
[INFO] 12:37:30.485152 Parsed GitHub repository: owner=Clarifai, repo=runners-examples, branch=vllm, folder_path= | thread=8309383360
[INFO] 12:37:38.228774 Files to be downloaded are:
1. 1/model.py
2. config.yaml
3. requirements.txt | thread=8309383360
Press Enter to continue...
[INFO] 12:37:40.819580 Initializing model from GitHub repository: https://github.com/Clarifai/runners-examples | thread=8309383360
[INFO] 12:40:28.485625 Successfully cloned repository from https://github.com/Clarifai/runners-examples (branch: vllm) | thread=8309383360
[INFO] 12:40:28.494056 Updated Hugging Face model repo_id to: HuggingFaceH4/zephyr-7b-beta | thread=8309383360
[INFO] 12:40:28.494107 Model initialization complete with GitHub repository | thread=8309383360
[INFO] 12:40:28.494133 Next steps: | thread=8309383360
[INFO] 12:40:28.494152 1. Review the model configuration | thread=8309383360
[INFO] 12:40:28.494169 2. Install any required dependencies manually | thread=8309383360
[INFO] 12:40:28.494186 3. Test the model locally using 'clarifai model local-test' | thread=8309383360
You’ll get a folder structure similar to:
├── 1/
│   └── model.py
├── requirements.txt
└── config.yaml
model.py
Example: model.py
import os
import sys
from typing import List, Iterator

from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.runners.utils.data_utils import Param
from clarifai.utils.logging import logger
from openai import OpenAI

PYTHON_EXEC = sys.executable


def vllm_openai_server(checkpoints, **kwargs):
  """Start vLLM OpenAI compatible server."""
  from clarifai.runners.utils.model_utils import execute_shell_command, wait_for_server, terminate_process

  # Start building the command
  cmds = [
      PYTHON_EXEC, '-m', 'vllm.entrypoints.openai.api_server', '--model', checkpoints,
  ]
  # Add all parameters from kwargs to the command
  for key, value in kwargs.items():
    if value is None:  # Skip None values
      continue
    param_name = key.replace('_', '-')
    if isinstance(value, bool):
      if value:  # Only add the flag if True
        cmds.append(f'--{param_name}')
    else:
      cmds.extend([f'--{param_name}', str(value)])

  # Create server instance
  server = type('Server', (), {
      'host': kwargs.get('host', '0.0.0.0'),
      'port': kwargs.get('port', 23333),
      'backend': "vllm",
      'process': None
  })()

  try:
    server.process = execute_shell_command(" ".join(cmds))
    logger.info("Waiting for " + f"http://{server.host}:{server.port}")
    wait_for_server(f"http://{server.host}:{server.port}")
    logger.info("Server started successfully at " + f"http://{server.host}:{server.port}")
  except Exception as e:
    logger.error(f"Failed to start vllm server: {str(e)}")
    if server.process:
      terminate_process(server.process)
    raise RuntimeError(f"Failed to start vllm server: {str(e)}")
  return server


class VLLMModel(OpenAIModelClass):
  """
  A Model that integrates with the Clarifai platform and uses vLLM framework for inference to run the Llama 3.1 8B model with tool calling capabilities.
  """

  client = True  # This will be set in load_model method
  model = True  # This will be set in load_model method

  def load_model(self):
    """Load the model here and start the server."""
    os.path.join(os.path.dirname(__file__))
    # This is the path to the chat template file and you can get this chat template from the vLLM repo (https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.1_json.jinja)
    chat_template = 'examples/tool_chat_template_llama3.1_json.jinja'
    server_args = {
        'max_model_len': 2048,
        'dtype': 'auto',
        'task': 'auto',
        'kv_cache_dtype': 'auto',
        'tensor_parallel_size': 1,
        'quantization': None,
        'cpu_offload_gb': 5.0,
        'chat_template': None,
        'port': 23333,
        'host': 'localhost',
    }

    model_path = os.path.dirname(os.path.dirname(__file__))
    builder = ModelBuilder(model_path, download_validation_only=True)
    model_config = builder.config
    stage = model_config["checkpoints"]['when']
    checkpoints = builder.config["checkpoints"]['repo_id']
    if stage in ["build", "runtime"]:
      checkpoints = builder.download_checkpoints(stage=stage)

    # Start server
    self.server = vllm_openai_server(checkpoints, **server_args)

    # Client initialization
    self.client = OpenAI(
        api_key="notset",
        base_url=f'http://{self.server.host}:{self.server.port}/v1')
    self.model = self.client.models.list().data[0].id

  @OpenAIModelClass.method
  def predict(self,
              prompt: str,
              chat_history: List[dict] = None,
              tools: List[dict] = None,
              tool_choice: str = None,
              max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
              temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
              top_p: float = Param(default=0.95, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")
              ) -> str:
    """
    This method is used to predict the response for the given prompt and chat history using the model and tools.
    """
    if tools is not None and tool_choice is None:
      tool_choice = "auto"
    messages = build_openai_messages(prompt=prompt, messages=chat_history)
    response = self.client.chat.completions.create(
        model=self.model,
        messages=messages,
        tools=tools,
        tool_choice=tool_choice,
        max_completion_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p)
    if response.choices[0] and response.choices[0].message.tool_calls:
      import json
      # If the response contains tool calls, return them as a JSON string
      tool_calls = response.choices[0].message.tool_calls
      tool_calls_json = json.dumps([tc.to_dict() for tc in tool_calls], indent=2)
      return tool_calls_json
    else:
      # Otherwise, return the content of the first choice
      return response.choices[0].message.content

  @OpenAIModelClass.method
  def generate(self,
               prompt: str,
               chat_history: List[dict] = None,
               tools: List[dict] = None,
               tool_choice: str = None,
               max_tokens: int = Param(default=512, description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance."),
               temperature: float = Param(default=0.7, description="A decimal number that determines the degree of randomness in the response"),
               top_p: float = Param(default=0.95, description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.")
               ) -> Iterator[str]:
    """
    This method is used to stream generated text tokens from a prompt + optional chat history and tools.
    """
    messages = build_openai_messages(prompt=prompt, messages=chat_history)
    response = self.client.chat.completions.create(
        model=self.model,
        messages=messages,
        tools=tools,
        tool_choice=tool_choice,
        max_completion_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        stream=True)
    for chunk in response:
      if chunk.choices:
        if chunk.choices[0].delta.tool_calls:
          # If the chunk contains tool calls, yield them as a JSON string
          import json
          tool_calls = chunk.choices[0].delta.tool_calls
          tool_calls_json = [tc.to_dict() for tc in tool_calls]
          # Convert to JSON string
          json_string = json.dumps(tool_calls_json, indent=2)
          # Yield the JSON string
          yield json_string
        else:
          # Otherwise, yield the content of the first choice
          text = (chunk.choices[0].delta.content
                  if (chunk and chunk.choices[0].delta.content) is not None else '')
          yield text
The model.py file inside the 1/ directory defines how your model performs inference through the vLLM runtime, using the OpenAI-compatible API endpoint served locally.
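The predict and generate methods accept chat_history and tools in the standard OpenAI message and tool-schema formats. As an illustration (the weather tool below is a made-up example, not part of the generated code), the inputs could look like this:
- Python
# Example inputs for the predict/generate methods defined in model.py
chat_history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi, who are you?"},
    {"role": "assistant", "content": "I'm an assistant running on vLLM."},
]

# Hypothetical tool definition following the OpenAI tool-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]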
config.yaml
Example: config.yaml
build_info:
  python_version: '3.12'
checkpoints:
  hf_token: hf_token
  repo_id: HuggingFaceH4/zephyr-7b-beta
  type: huggingface
  when: runtime
inference_compute_info:
  accelerator_memory: 5Gi
  accelerator_type:
  - NVIDIA-*
  cpu_limit: '1'
  cpu_memory: 5Gi
  num_accelerators: 1
model:
  app_id: APP_ID
  id: MODEL_ID
  model_type_id: text-to-text
  user_id: USER_ID
The config.yaml file defines the model's configuration, including compute requirements, checkpoint sources, and other essential runtime settings.
- In the model section, provide a unique model ID (any name you prefer), along with your Clarifai user ID and app ID. These values specify where your model will be deployed within the Clarifai platform.
- The checkpoints section defines how to retrieve the model's weights from Hugging Face. If you're using a private or restricted repository, be sure to include your Hugging Face access token to enable secure downloading.
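As an optional sanity check (not something the CLI requires), you can load the file and confirm the placeholders have been replaced before starting the runner. This is a minimal sketch using PyYAML (pip install pyyaml if it isn't already present):
- Python
import yaml

# Load the generated config and flag placeholder values that still need editing
with open("config.yaml") as f:
    config = yaml.safe_load(f)

model_section = config["model"]
for key in ("id", "user_id", "app_id"):
    value = model_section.get(key, "")
    if not value or value.upper() in ("MODEL_ID", "USER_ID", "APP_ID"):
        print(f"model.{key} still looks like a placeholder: {value!r}")

print("Checkpoint repo:", config["checkpoints"]["repo_id"])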
requirements.txt
Example: requirements.txt
# Core dependencies
torch==2.5.1
vllm==0.8.0
transformers==4.50.1
accelerate==1.2.0
optimum==1.23.3
einops==0.8.0
tokenizers==0.21.0
packaging>=24.0
ninja>=1.11.1
# Qwen and vision-language utilities
qwen-vl-utils==0.0.8
timm>=1.0.9
# Audio and video support
soundfile>=0.13.1
librosa>=0.10.2
scipy==1.15.2
# Additional utilities
psutil>=5.9.0
backoff==2.2.1
peft>=0.13.2
openai>=1.14.0
clarifai>=10.0.0
The requirements.txt file lists the Python dependencies required by your model. If you haven't installed them yet, run the following command:
- Bash
pip install -r requirements.txt
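If you want to double-check that the pinned packages resolved correctly in your environment, a small optional script like this can help:
- Python
from importlib.metadata import version, PackageNotFoundError

# Spot-check a few of the pinned packages from requirements.txt
for package in ("vllm", "torch", "transformers", "openai", "clarifai"):
    try:
        print(f"{package}=={version(package)}")
    except PackageNotFoundError:
        print(f"{package} is not installed")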
Step 3: Log In to Clarifai
Run the following command to authenticate your local environment with Clarifai.
clarifai login
You’ll be prompted for your user ID, PAT, and an optional context name.
Example Output
clarifai login
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security
Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): d6570db0fe964ce7a96c357ce84803b1
> Verifying token...
[INFO] 11:15:43.091990 Validating the Context Credentials... | thread=8309383360
[INFO] 11:15:46.647300 ✅ Context is valid | thread=8309383360
> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.
Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.
💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 11:15:54.361216 Login successful for user 'alfrick' in context 'default' | thread=8309383360
Step 4: Start the Local Runner
Launch the Local Runner to start serving your vLLM model locally.
clarifai model local-runner
If any configuration is missing, the CLI will prompt you to define or confirm it.
This runner will use vLLM’s backend to serve model predictions and make them accessible via a Clarifai-managed public API endpoint.
Example Output
clarifai model local-runner
[INFO] 12:10:30.600393 Hugging Face repo access validated | thread=8309383360
[INFO] 12:10:31.857533 Detected OpenAI chat completions for Clarifai model streaming - validating stream_options... | thread=8309383360
[ERROR] 12:10:31.857843 Missing configuration to track usage for OpenAI chat completion calls. Go to your model scripts and make sure to set both: 1) stream_options={'include_usage': True}2) set_output_context | thread=8309383360
[INFO] 12:10:31.858162 > Checking local runner requirements... | thread=8309383360
[INFO] 12:10:31.881003 Checking 19 dependencies... | thread=8309383360
[INFO] 12:10:31.882531 ✅ All 19 dependencies are installed! | thread=8309383360
[INFO] 12:10:31.883637 > Verifying local runner setup... | thread=8309383360
[INFO] 12:10:31.883679 Current context: default | thread=8309383360
[INFO] 12:10:31.883712 Current user_id: alfrick | thread=8309383360
[INFO] 12:10:31.883740 Current PAT: d6570**** | thread=8309383360
[INFO] 12:10:31.886325 Current compute_cluster_id: local-runner-compute-cluster | thread=8309383360
[WARNING] 12:10:32.878933 Failed to get compute cluster with ID 'local-runner-compute-cluster':
code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "ComputeCluster with ID \'local-runner-compute-cluster\' not found. Check your request fields."
req_id: "sdk-python-11.8.2-475bf6a34c264ab2910fff508c10fe93"
| thread=8309383360
Compute cluster not found. Do you want to create a new compute cluster alfrick/local-runner-compute-cluster? (y/n): y
[INFO] 12:11:00.503967 Compute Cluster with ID 'local-runner-compute-cluster' is created:
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.8.2-1eb39e52ebb2407d822d96a17f60ba79"
| thread=8309383360
[INFO] 12:11:00.514530 Current nodepool_id: local-runner-nodepool | thread=8309383360
[WARNING] 12:11:01.457525 Failed to get nodepool with ID 'local-runner-nodepool':
code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Nodepool not found. Check your request fields."
req_id: "sdk-python-11.8.2-42bb02d8ae984f08af1620c548219830"
| thread=8309383360
Nodepool not found. Do you want to create a new nodepool alfrick/local-runner-compute-cluster/local-runner-nodepool? (y/n): y
[INFO] 12:11:05.019304 Nodepool with ID 'local-runner-nodepool' is created:
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.8.2-9eba74dedf464e5d9c55309596745d01"
| thread=8309383360
[INFO] 12:11:05.032966 Current app_id: local-runner-app | thread=8309383360
[INFO] 12:11:05.308616 Current model_id: local-runner-model | thread=8309383360
[WARNING] 12:11:07.729547 Attempting to patch latest version: d2ce23ed22144da1b683161fafa2d5d0 | thread=8309383360
[INFO] 12:11:08.986955 Successfully patched version d2ce23ed22144da1b683161fafa2d5d0 | thread=8309383360
[INFO] 12:11:08.992377 Current model version d2ce23ed22144da1b683161fafa2d5d0 | thread=8309383360
[INFO] 12:11:08.992504 Creating the local runner tying this 'alfrick/local-runner-app/models/local-runner-model' model (version: d2ce23ed22144da1b683161fafa2d5d0) to the 'alfrick/local-runner-compute-cluster/local-runner-nodepool' nodepool. | thread=8309383360
[INFO] 12:11:09.973825 Runner with ID '8d5c067f543847e8aa6741623d7c84e4' is created:
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.8.2-35702721f9324bbe8d862226c7dc2e9a"
| thread=8309383360
[INFO] 12:11:09.985978 Current runner_id: 8d5c067f543847e8aa6741623d7c84e4 | thread=8309383360
[WARNING] 12:11:10.250570 Failed to get deployment with ID local-runner-deployment:
code: CONN_DOES_NOT_EXIST
description: "Resource does not exist"
details: "Deployment with ID \'local-runner-deployment\' not found. Check your request fields."
req_id: "sdk-python-11.8.2-971bef87a5e948b198acd29d338d0ec8"
| thread=8309383360
Deployment not found. Do you want to create a new deployment alfrick/local-runner-compute-cluster/local-runner-nodepool/local-runner-deployment? (y/n): y
[INFO] 12:11:14.749509 Deployment with ID 'local-runner-deployment' is created:
code: SUCCESS
description: "Ok"
req_id: "sdk-python-11.8.2-7720d23dc2b947ffb719d3062b92579f"
| thread=8309383360
[INFO] 12:11:14.761115 Current deployment_id: local-runner-deployment | thread=8309383360
[INFO] 12:11:14.763204 Current model section of config.yaml: {'id': 'MODEL_ID2332', 'user_id': 'alfrick', 'app_id': 'items-app', 'model_type_id': 'text-to-text'} | thread=8309383360
Do you want to backup config.yaml to config.yaml.bk then update the config.yaml with the new model information? (y/n): y
[INFO] 12:11:17.610493 Checking 19 dependencies... | thread=8309383360
[INFO] 12:11:17.612804 ✅ All 19 dependencies are installed! | thread=8309383360
[INFO] 12:11:17.612887 ✅ Starting local runner... | thread=8309383360
[INFO] 12:11:17.612959 No secrets path configured, running without secrets | thread=8309383360
[INFO] 12:11:18.154539 Hugging Face repo access validated | thread=8309383360
[INFO] 12:11:19.116921 Detected OpenAI chat completions for Clarifai model streaming - validating stream_options... | thread=8309383360
[ERROR] 12:11:19.117170 Missing configuration to track usage for OpenAI chat completion calls. Go to your model scripts and make sure to set both: 1) stream_options={'include_usage': True}2) set_output_context | thread=8309383360
[INFO] 12:11:19.729580 Hugging Face repo access validated | thread=8309383360
[INFO] 12:11:20.675478 Detected OpenAI chat completions for Clarifai model streaming - validating stream_options... | thread=8309383360
[ERROR] 12:11:20.675945 Missing configuration to track usage for OpenAI chat completion calls. Go to your model scripts and make sure to set both: 1) stream_options={'include_usage': True}2) set_output_context | thread=8309383360
[INFO] 12:11:21.238145 Hugging Face token validated | thread=8309383360
[INFO] 12:11:22.887162 Total download size: 13815.09 MB | thread=8309383360
[INFO] 12:11:22.887555 Downloading model checkpoints... | thread=8309383360
/Users/macbookpro/Desktop/code/venv/lib/python3.12/site-packages/huggingface_hub/file_download.py:982: UserWarning: `local_dir_use_symlinks` parameter is deprecated and will be ignored. The process to download files to a local folder has been updated and do not rely on symlinks anymore. You only need to pass a destination folder as`local_dir`.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
warnings.warn(
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 638/638 [00:00<00:00, 4.34MB/s]
eval_results.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 553/553 [00:00<00:00, 5.22MB/s]
README.md: 25.6kB [00:00, 31.3MB/s] | 0.00/638 [00:00<?, ?B/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████ █████████████| 42.0/42.0 [00:00<00:00, 506kB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 111/111 [00:00<00:00, 1.07MB/s]
all_results.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 728/728 [00:00<00:00, 11.9MB/s]
.gitattributes: 1.52kB [00:00, 13.8MB/s] | 0.00/111 [00:00<?, ?B/s]
Fetching 23 files: 4%|███▌ | 1/23 [00:01<00:25, 1.17s/it]
.gitattributes: 0.00B [00:00, ?B/s]
model-00003-of-00008.safetensors: 0%| | 0.00/1.98G [00:00<?, ?B/s]
model-00004-of-00008.safetensors: 0%| | 0.00/1.95G [00:00<?, ?B/s]
model-00005-of-00008.safetensors: 0%| | 0.00/1.98G [00:00<?, ?B/s]
model-00006-of-00008.safetensors: 0%| | 0.00/1.95G [00:00<?, ?B/s]
model-00008-of-00008.safetensors: 0%| | 0.00/816M [00:00<?, ?B/s]
model-00002-of-00008.safetensors: 0%| | 0.00/1.95G [00:00<?, ?B/s]
model-00007-of-00008.safetensors: 0%| | 0.00/1.98G [00:00<?, ?B/s]
Step 5: Test Your Runner
Once your Local Runner is running and the model has finished downloading, you can test it using the OpenAI-compatible API format.
- Python (OpenAI)
import os
from openai import OpenAI
# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
base_url="https://api.clarifai.com/v2/ext/openai/v1", # Clarifai's OpenAI-compatible API endpoint
api_key=os.environ["CLARIFAI_PAT"] # Ensure CLARIFAI_PAT is set as an environment variable
)
# Make a chat completion request to a Clarifai-hosted model
response = client.chat.completions.create(
model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the future of AI?"}
],
)
# Print the model's response
print(response.choices[0].message.content)
The script sends a sample prompt to your locally running vLLM model and prints the response. Your model is now serving predictions from your own machine through Clarifai's publicly accessible API.
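Because the runner also implements a streaming generate method, you can request a streamed response through the same endpoint. This is a minimal sketch, assuming the standard stream=True option is supported for your model URL:
- Python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

# Stream tokens from the locally served model as they are generated
stream = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
    messages=[{"role": "user", "content": "Write a haiku about local inference."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()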