vLLM

Serve any HuggingFace LLM locally or deploy to cloud compute


vLLM is a high-performance open-source inference engine for LLMs. With Clarifai, you can deploy vLLM models to cloud GPUs with a single command.

The workflow is: init → deploy

Step 1: Install Clarifai

pip install --upgrade clarifai

Note: Python 3.11 or 3.12 is required. The openai package is included with clarifai.

Step 2: Log In

clarifai login

You'll be prompted for your user ID and PAT. This saves your credentials locally so you don't need to set environment variables manually.

Example Output
clarifai login
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security

Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): d6570db0fe964ce7a96c357ce84803b1

> Verifying token...
[INFO] 11:15:43.091990 Validating the Context Credentials... | thread=8309383360
[INFO] 11:15:46.647300 ✅ Context is valid | thread=8309383360

> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.

Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.

💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 11:15:54.361216 Login successful for user 'alfrick' in context 'default' | thread=8309383360

Step 3: Initialize a Model

Scaffold a model project using a HuggingFace model name:

clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B

The CLI auto-selects a GPU instance sized to the model's estimated VRAM requirements. You can initialize any model supported by vLLM — just change --model-name to a different HuggingFace repo ID.
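The instance estimate can be reproduced roughly by hand. A back-of-envelope sketch (not Clarifai's exact formula; the Qwen3-0.6B shape values of 28 layers, 8 KV heads, head dim 128, and ~0.6B parameters are taken from the model's HuggingFace config):

```python
# Rough VRAM estimate for serving a model at bf16 (2 bytes per value).
params = 0.6e9                             # total parameters (~0.6B for Qwen3-0.6B)
layers, kv_heads, head_dim = 28, 8, 128    # from the model config
ctx = 40960                                # max context length used in the estimate

GIB = 1024 ** 3
weights_gib = params * 2 / GIB             # bf16 weights
# KV cache: 2 (K and V) * layers * tokens * kv_heads * head_dim * 2 bytes
kv_gib = 2 * layers * ctx * kv_heads * head_dim * 2 / GIB

print(f"weights ~ {weights_gib:.1f} GiB, KV cache ~ {kv_gib:.1f} GiB")
# weights ~ 1.1 GiB, KV cache ~ 4.4 GiB
```

The KV-cache term lands on the 4.4 GiB figure the CLI reports; the CLI's weight estimate (1.4 GiB) is somewhat higher than this raw lower bound.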

Example Output
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
[INFO] Initializing model with vllm toolkit...
[INFO] Updated Hugging Face model repo_id to: Qwen/Qwen3-0.6B
Instance: g4dn.xlarge (Estimated 7.9 GiB VRAM (1.4 GiB weights + 4.4 GiB KV cache for 40960 ctx), fits g4dn.xlarge (15 GiB))

Model initialized in ./Qwen3-0.6B

Test locally:
clarifai model serve ./Qwen3-0.6B
clarifai model serve ./Qwen3-0.6B --mode env # auto-create venv and install deps
clarifai model serve ./Qwen3-0.6B --mode container # run inside Docker

Deploy to Clarifai:
clarifai model deploy ./Qwen3-0.6B
clarifai list-instances # list available instances

This creates a ./Qwen3-0.6B/ directory:

Qwen3-0.6B/
├── 1/
│   └── model.py       # vLLM inference logic
├── requirements.txt   # Lightweight deps (vLLM is pre-installed in the base image)
└── config.yaml        # Model config (user_id/app_id auto-filled from login)

Note: For private or gated models (Llama, Gemma, etc.), set HF_TOKEN in your environment before initializing:

export HF_TOKEN=your_token_here
model.py

import json
import os
import sys
from typing import Iterator, List, Optional

from openai import OpenAI

from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger

PYTHON_EXEC = sys.executable


def vllm_openai_server(checkpoints, **kwargs):
    """Start vLLM's OpenAI-compatible server as a subprocess."""
    from clarifai.runners.utils.model_utils import (
        execute_shell_command,
        terminate_process,
        wait_for_server,
    )

    cmds = [
        PYTHON_EXEC,
        '-m',
        'vllm.entrypoints.openai.api_server',
        '--model',
        checkpoints,
    ]
    # Convert kwargs to CLI flags: booleans become bare flags, everything
    # else becomes `--key value` (underscores mapped to hyphens).
    for key, value in kwargs.items():
        if value is None:
            continue
        param_name = key.replace('_', '-')
        if isinstance(value, bool):
            if value:
                cmds.append(f'--{param_name}')
        else:
            cmds.extend([f'--{param_name}', str(value)])

    # Lightweight holder for the server's host, port, and process handle.
    server = type(
        'Server',
        (),
        {
            'host': kwargs.get('host', '0.0.0.0'),
            'port': kwargs.get('port', 23333),
            'process': None,
        },
    )()

    try:
        server.process = execute_shell_command(" ".join(cmds))
        url = f"http://{server.host}:{server.port}"
        logger.info(f"Waiting for vLLM server at {url}")
        wait_for_server(url)
        logger.info(f"vLLM server started at {url}")
    except Exception as e:
        logger.error(f"Failed to start vLLM server: {e}")
        if server.process:
            terminate_process(server.process)
        raise RuntimeError(f"Failed to start vLLM server: {e}")
    return server


class VLLMModel(OpenAIModelClass):
    # Placeholders; both are set in load_model().
    client = True
    model = True

    def load_model(self):
        server_args = {
            'tensor_parallel_size': 1,
            'port': 23333,
            'host': 'localhost',
        }

        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        config = builder.config
        stage = config["checkpoints"]["when"]
        checkpoints = config["checkpoints"]["repo_id"]
        if stage in ["build", "runtime"]:
            checkpoints = builder.download_checkpoints(stage=stage)

        self.server = vllm_openai_server(checkpoints, **server_args)
        self.client = OpenAI(
            api_key="notset",
            base_url=f"http://{self.server.host}:{self.server.port}/v1",
        )
        self.model = self.client.models.list().data[0].id

    @OpenAIModelClass.method
    def predict(
        self,
        prompt: str = "",
        chat_history: Optional[List[dict]] = None,
        tools: Optional[List[dict]] = None,
        tool_choice: Optional[str] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> str:
        """Return a single completion."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"

        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
        )

        if response.choices and response.choices[0].message.tool_calls:
            tool_calls = response.choices[0].message.tool_calls
            return json.dumps([tc.to_dict() for tc in tool_calls], indent=2)
        return response.choices[0].message.content

    @OpenAIModelClass.method
    def generate(
        self,
        prompt: str = "",
        chat_history: Optional[List[dict]] = None,
        tools: Optional[List[dict]] = None,
        tool_choice: Optional[str] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> Iterator[str]:
        """Stream a completion response."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"

        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        )
        for chunk in response:
            if chunk.choices:
                delta = chunk.choices[0].delta
                if delta.tool_calls:
                    yield json.dumps([tc.to_dict() for tc in delta.tool_calls], indent=2)
                else:
                    yield delta.content or ''
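The flag-construction loop in vllm_openai_server is worth calling out: every entry of server_args becomes a vLLM CLI flag. A standalone sketch of that conversion (build_flags is an illustrative name, not a Clarifai helper):

```python
def build_flags(**kwargs):
    """Mirror the flag-building loop: underscores become hyphens,
    booleans become bare flags, and None values are skipped."""
    cmds = []
    for key, value in kwargs.items():
        if value is None:
            continue
        param_name = key.replace('_', '-')
        if isinstance(value, bool):
            if value:
                cmds.append(f'--{param_name}')
        else:
            cmds.extend([f'--{param_name}', str(value)])
    return cmds


print(build_flags(tensor_parallel_size=1, enable_prefix_caching=True, seed=None))
# ['--tensor-parallel-size', '1', '--enable-prefix-caching']
```

This is why server_args in load_model uses underscored keys such as tensor_parallel_size: they map directly onto vLLM's --tensor-parallel-size flag.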
config.yaml

build_info:
  image: vllm/vllm-openai:latest
checkpoints:
  repo_id: Qwen/Qwen3-0.6B
  type: huggingface
  when: runtime
compute:
  instance: g4dn.xlarge
model:
  id: qwen3-06b

The config.yaml has 4 sections:

  • model.id — Auto-generated from the HuggingFace model name
  • build_info.image — vllm/vllm-openai:latest (vLLM, PyTorch, and CUDA pre-installed)
  • compute.instance — Auto-selected based on estimated VRAM. Run clarifai list-instances to see all options
  • checkpoints — HuggingFace model weights config. Add hf_token here or set HF_TOKEN for gated models
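For a gated model, the checkpoints section carries the token. An illustrative fragment (the repo_id is an example, and the token value is a placeholder):

```yaml
checkpoints:
  type: huggingface
  repo_id: meta-llama/Llama-3.2-1B-Instruct   # example gated repo
  when: runtime
  hf_token: hf_xxxxxxxxxxxxxxxx               # or omit and set HF_TOKEN in the environment
```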
requirements.txt
clarifai
openai

Step 4: Deploy to Cloud

Deploy to dedicated cloud compute. Clarifai automatically provisions all required infrastructure.

Since config.yaml already has a compute.instance value (auto-selected during init), you can deploy directly:

clarifai model deploy ./Qwen3-0.6B

To override the instance type:

clarifai model deploy ./Qwen3-0.6B --instance a10g

To see all available instance types and pricing:

clarifai list-instances

Tip: If you have a local GPU and want to test before deploying, run clarifai model serve ./Qwen3-0.6B --mode env first.

Step 5: Run Inference

import os
from openai import OpenAI

# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible API endpoint
    api_key=os.environ["CLARIFAI_PAT"],  # Ensure CLARIFAI_PAT is set as an environment variable
)

# Make a chat completion request to the deployed model
response = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the future of AI?"},
    ],
)

# Print the model's response
print(response.choices[0].message.content)
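Streaming works through the same endpoint, since the model class also exposes a streaming generate method. A sketch using stream=True (collect_deltas and stream_chat are illustrative helper names; the URL placeholders follow the config.yaml values, with user and app IDs taken from your login):

```python
import os
from typing import Iterable


def collect_deltas(chunks: Iterable) -> str:
    """Join the text deltas of an OpenAI streaming response into one string."""
    return "".join(
        c.choices[0].delta.content
        for c in chunks
        if c.choices and c.choices[0].delta.content
    )


def stream_chat(prompt: str) -> str:
    """Stream a completion from the deployed model, returning the full text."""
    from openai import OpenAI  # deferred import; collect_deltas needs no SDK

    client = OpenAI(
        base_url="https://api.clarifai.com/v2/ext/openai/v1",
        api_key=os.environ["CLARIFAI_PAT"],
    )
    stream = client.chat.completions.create(
        model="https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_deltas(stream)
```

Call print(stream_chat("What is the future of AI?")) to get the full reply once the stream finishes; to render tokens as they arrive, iterate the stream directly and print each delta.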

Or use the Clarifai CLI:

clarifai model predict https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b "Explain AI in one sentence"

Manage Your Deployment

# Stream live logs
clarifai model logs --deployment <deployment-id>

# Check status
clarifai model status --deployment <deployment-id>

# Remove deployment when done (stops billing)
clarifai model undeploy --deployment <deployment-id>

For the full CLI reference, see CLI Reference.