vLLM
Serve any HuggingFace LLM locally or deploy to cloud compute
vLLM is a high-performance open-source inference engine for LLMs. With Clarifai, you can deploy vLLM models to cloud GPUs with a single command.
The workflow is: init → deploy
Step 1: Install Clarifai
- Bash
pip install --upgrade clarifai
Note: Python 3.11 or 3.12 is required. The `openai` package is included with `clarifai`.
Step 2: Log In
- CLI
clarifai login
You'll be prompted for your user ID and PAT. This saves your credentials locally so you don't need to set environment variables manually.
Example Output
clarifai login
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security
Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): d6570db0fe964ce7a96c357ce84803b1
> Verifying token...
[INFO] 11:15:43.091990 Validating the Context Credentials... | thread=8309383360
[INFO] 11:15:46.647300 ✅ Context is valid | thread=8309383360
> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.
Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.
💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 11:15:54.361216 Login successful for user 'alfrick' in context 'default' | thread=8309383360
Step 3: Initialize a Model
Scaffold a model project using a HuggingFace model name:
- CLI
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
The CLI auto-selects the optimal GPU instance based on the model's VRAM requirements. You can initialize any model supported by vLLM — just change --model-name to a different HuggingFace repo ID.
Example Output
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
[INFO] Initializing model with vllm toolkit...
[INFO] Updated Hugging Face model repo_id to: Qwen/Qwen3-0.6B
Instance: g4dn.xlarge (Estimated 7.9 GiB VRAM (1.4 GiB weights + 4.4 GiB KV cache for 40960 ctx), fits g4dn.xlarge (15 GiB))
Model initialized in ./Qwen3-0.6B
Test locally:
clarifai model serve ./Qwen3-0.6B
clarifai model serve ./Qwen3-0.6B --mode env # auto-create venv and install deps
clarifai model serve ./Qwen3-0.6B --mode container # run inside Docker
Deploy to Clarifai:
clarifai model deploy ./Qwen3-0.6B
clarifai list-instances # list available instances
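The VRAM estimate in the output above can be reproduced with back-of-envelope arithmetic. A rough sketch (the ~0.75B parameter count for Qwen/Qwen3-0.6B is an assumption here; the CLI's estimator additionally budgets KV cache and overhead):

```python
# Approximate the weight memory the CLI reports for Qwen/Qwen3-0.6B.
params = 0.75e9          # assumed parameter count (~0.75B)
bytes_per_param = 2      # fp16/bf16 weights
weights_gib = params * bytes_per_param / 2**30
print(f"{weights_gib:.1f} GiB")  # → 1.4 GiB
```

The KV cache portion scales with context length and layer count, which is why the estimator also factors in the 40960-token context window.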
This creates a ./Qwen3-0.6B/ directory:
Qwen3-0.6B/
├── 1/
│   └── model.py          # vLLM inference logic
├── requirements.txt      # Lightweight deps (vLLM is pre-installed in the base image)
└── config.yaml           # Model config (user_id/app_id auto-filled from login)
Note: For private or gated models (Llama, Gemma, etc.), set `HF_TOKEN` in your environment before initializing: `export HF_TOKEN=your_token_here`
model.py
import os
import sys
from typing import Iterator, List
from openai import OpenAI
from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger
PYTHON_EXEC = sys.executable
def vllm_openai_server(checkpoints, **kwargs):
    """Start vLLM OpenAI-compatible server."""
    from clarifai.runners.utils.model_utils import (
        execute_shell_command,
        terminate_process,
        wait_for_server,
    )

    cmds = [
        PYTHON_EXEC,
        '-m',
        'vllm.entrypoints.openai.api_server',
        '--model',
        checkpoints,
    ]
    for key, value in kwargs.items():
        if value is None:
            continue
        param_name = key.replace('_', '-')
        if isinstance(value, bool):
            if value:
                cmds.append(f'--{param_name}')
        else:
            cmds.extend([f'--{param_name}', str(value)])

    server = type(
        'Server',
        (),
        {
            'host': kwargs.get('host', '0.0.0.0'),
            'port': kwargs.get('port', 23333),
            'process': None,
        },
    )()

    try:
        server.process = execute_shell_command(" ".join(cmds))
        url = f"http://{server.host}:{server.port}"
        logger.info(f"Waiting for vLLM server at {url}")
        wait_for_server(url)
        logger.info(f"vLLM server started at {url}")
    except Exception as e:
        logger.error(f"Failed to start vLLM server: {e}")
        if server.process:
            terminate_process(server.process)
        raise RuntimeError(f"Failed to start vLLM server: {e}")

    return server
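The flag-building loop in `vllm_openai_server` can be exercised in isolation. A minimal standalone sketch of the same kwargs-to-CLI-flags conversion (the helper name `build_vllm_args` is illustrative, not part of the Clarifai SDK):

```python
def build_vllm_args(checkpoints, **kwargs):
    """Convert keyword args into vLLM server CLI flags (mirrors the loop above)."""
    cmds = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", checkpoints]
    for key, value in kwargs.items():
        if value is None:
            continue  # unset options are omitted entirely
        flag = f"--{key.replace('_', '-')}"
        if isinstance(value, bool):
            if value:
                cmds.append(flag)  # boolean flags take no value
        else:
            cmds.extend([flag, str(value)])
    return cmds

args = build_vllm_args(
    "Qwen/Qwen3-0.6B",
    tensor_parallel_size=1,
    trust_remote_code=True,
    port=None,
)
print(args)
```

Note how `port=None` is dropped and the boolean `trust_remote_code` becomes a bare `--trust-remote-code` flag, while `tensor_parallel_size=1` becomes `--tensor-parallel-size 1`.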
class VLLMModel(OpenAIModelClass):
    client = True
    model = True

    def load_model(self):
        server_args = {
            'tensor_parallel_size': 1,
            'port': 23333,
            'host': 'localhost',
        }
        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        config = builder.config
        stage = config["checkpoints"]["when"]
        checkpoints = config["checkpoints"]["repo_id"]
        if stage in ["build", "runtime"]:
            checkpoints = builder.download_checkpoints(stage=stage)

        self.server = vllm_openai_server(checkpoints, **server_args)
        self.client = OpenAI(
            api_key="notset",
            base_url=f"http://{self.server.host}:{self.server.port}/v1",
        )
        self.model = self.client.models.list().data[0].id

    @OpenAIModelClass.method
    def predict(
        self,
        prompt: str = "",
        chat_history: List[dict] = None,
        tools: List[dict] = None,
        tool_choice: str = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> str:
        """Return a single completion."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"
        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        if response.choices and response.choices[0].message.tool_calls:
            import json

            tool_calls = response.choices[0].message.tool_calls
            return json.dumps([tc.to_dict() for tc in tool_calls], indent=2)
        return response.choices[0].message.content

    @OpenAIModelClass.method
    def generate(
        self,
        prompt: str = "",
        chat_history: List[dict] = None,
        tools: List[dict] = None,
        tool_choice: str = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> Iterator[str]:
        """Stream a completion response."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"
        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        )
        for chunk in response:
            if chunk.choices:
                if chunk.choices[0].delta.tool_calls:
                    import json

                    tool_calls_json = [tc.to_dict() for tc in chunk.choices[0].delta.tool_calls]
                    yield json.dumps(tool_calls_json, indent=2)
                else:
                    yield chunk.choices[0].delta.content or ''
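When the model decides to call a tool, `predict` returns the tool calls serialized as a JSON array rather than plain text. A sketch of parsing that payload on the client side (the `get_weather` tool and its arguments are hypothetical examples):

```python
import json

# Hypothetical serialized tool-call payload, in the shape predict() returns:
# a JSON array of OpenAI-style tool_call dicts.
payload = json.dumps([
    {
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
    }
])

calls = json.loads(payload)
for call in calls:
    fn = call["function"]
    fn_args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    print(fn["name"], fn_args)  # → get_weather {'city': 'Paris'}
```

Note the double decode: the payload is JSON, and each tool call's `arguments` field is itself a JSON-encoded string, following the OpenAI chat completions convention.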
config.yaml
build_info:
  image: vllm/vllm-openai:latest
checkpoints:
  repo_id: Qwen/Qwen3-0.6B
  type: huggingface
  when: runtime
compute:
  instance: g4dn.xlarge
model:
  id: qwen3-06b
The config.yaml has four sections:
- `model.id` — Auto-generated from the HuggingFace model name
- `build_info.image` — `vllm/vllm-openai:latest` (vLLM, PyTorch, and CUDA pre-installed)
- `compute.instance` — Auto-selected based on estimated VRAM. Run `clarifai list-instances` to see all options
- `checkpoints` — HuggingFace model weights config. Add `hf_token` here or set `HF_TOKEN` for gated models
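As a sanity check before deploying, the four sections can be validated programmatically. A minimal sketch using a plain dict in place of the parsed YAML (field values copied from the file above; this is not a Clarifai API, just illustrative validation logic):

```python
# Stand-in for the parsed config.yaml contents.
config = {
    "build_info": {"image": "vllm/vllm-openai:latest"},
    "checkpoints": {"repo_id": "Qwen/Qwen3-0.6B", "type": "huggingface", "when": "runtime"},
    "compute": {"instance": "g4dn.xlarge"},
    "model": {"id": "qwen3-06b"},
}

# Every section must be present, and the checkpoint stage must be one
# model.py knows how to handle (see load_model above).
for section in ("build_info", "checkpoints", "compute", "model"):
    assert section in config, f"missing section: {section}"
assert config["checkpoints"]["when"] in ("build", "runtime")
print("config OK")  # → config OK
```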
requirements.txt
clarifai
openai
Step 4: Deploy to Cloud
Deploy to dedicated cloud compute. Clarifai automatically provisions all required infrastructure.
Since config.yaml already has a compute.instance value (auto-selected during init), you can deploy directly:
- CLI
clarifai model deploy ./Qwen3-0.6B
To override the instance type:
- CLI
clarifai model deploy ./Qwen3-0.6B --instance a10g
To see all available instance types and pricing:
- CLI
clarifai list-instances
Tip: If you have a local GPU and want to test before deploying, run `clarifai model serve ./Qwen3-0.6B --mode env` first.
Step 5: Run Inference
- Python
import os
from openai import OpenAI

# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible API endpoint
    api_key=os.environ["CLARIFAI_PAT"]  # Ensure CLARIFAI_PAT is set as an environment variable
)

# Make a chat completion request to the Clarifai-hosted model deployed above
response = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/main/models/Qwen3-0.6B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the future of AI?"}
    ],
)

# Print the model's response
print(response.choices[0].message.content)
Or use the Clarifai CLI:
clarifai model predict https://clarifai.com/<user-id>/main/models/Qwen3-0.6B "Explain AI in one sentence"
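The model's `generate` method streams text deltas; on the client side you typically pass `stream=True` and accumulate `delta.content` chunks. A minimal sketch of that accumulation logic, using stand-in objects in place of a live stream so it runs offline:

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)

def fake_chunk(text):
    """Stand-in mimicking the shape of an OpenAI streaming chunk."""
    delta = SimpleNamespace(content=text, tool_calls=None)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

stream = [fake_chunk("AI is "), fake_chunk(None), fake_chunk("evolving.")]
print(collect_stream(stream))  # → AI is evolving.
```

With a real client, replace the list of fakes with `client.chat.completions.create(..., stream=True)`; empty deltas (like the `None` chunk here) are skipped the same way.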
Manage Your Deployment
# Stream live logs
clarifai model logs --deployment <deployment-id>
# Check status
clarifai model status --deployment <deployment-id>
# Remove deployment when done (stops billing)
clarifai model undeploy --deployment <deployment-id>
For the full CLI reference, see CLI Reference.