vLLM

Serve any HuggingFace LLM locally or deploy to cloud compute


vLLM is a high-performance open-source inference engine for LLMs. With Clarifai, you can deploy vLLM models to cloud GPUs with a single command.

The workflow is: init → deploy

Step 1: Install Clarifai

pip install --upgrade clarifai

Note: Python 3.11 or 3.12 is required. The openai package is included with clarifai.

Step 2: Log In

clarifai login

You'll be prompted for your user ID and PAT. This saves your credentials locally so you don't need to set environment variables manually.

Example Output
clarifai login
Enter your Clarifai user ID: alfrick
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/alfrick/settings/security

Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): d6570db0fe964ce7a96c357ce84803b1

> Verifying token...
[INFO] 11:15:43.091990 Validating the Context Credentials... | thread=8309383360
[INFO] 11:15:46.647300 ✅ Context is valid | thread=8309383360

> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.

Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.

💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 11:15:54.361216 Login successful for user 'alfrick' in context 'default' | thread=8309383360

Step 3: Initialize a Model

Scaffold a model project using a HuggingFace model name:

clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B

The CLI auto-selects a GPU instance sized to the model's estimated VRAM requirements. You can initialize any model supported by vLLM — just change --model-name to a different HuggingFace repo ID.
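The instance estimate can be reproduced roughly by hand. A back-of-envelope sketch (not Clarifai's exact formula; the Qwen3-0.6B shape values of 28 layers, 8 KV heads, head dim 128, and ~0.6B parameters are taken from the model's HuggingFace config):

```python
# Rough VRAM estimate for serving a model at bf16 (2 bytes per value).
params = 0.6e9                             # total parameters (~0.6B for Qwen3-0.6B)
layers, kv_heads, head_dim = 28, 8, 128    # from the model config
ctx = 40960                                # max context length used in the estimate

GIB = 1024 ** 3
weights_gib = params * 2 / GIB             # bf16 weights
# KV cache: 2 (K and V) * layers * tokens * kv_heads * head_dim * 2 bytes
kv_gib = 2 * layers * ctx * kv_heads * head_dim * 2 / GIB

print(f"weights ~ {weights_gib:.1f} GiB, KV cache ~ {kv_gib:.1f} GiB")
# weights ~ 1.1 GiB, KV cache ~ 4.4 GiB
```

The KV-cache term lands on the 4.4 GiB figure the CLI reports; the CLI's weight estimate (1.4 GiB) is somewhat higher than this raw lower bound.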

Example Output
clarifai model init --toolkit vllm --model-name Qwen/Qwen3-0.6B
[INFO] Initializing model with vllm toolkit...
[INFO] Updated Hugging Face model repo_id to: Qwen/Qwen3-0.6B
Instance: g4dn.xlarge (Estimated 7.9 GiB VRAM (1.4 GiB weights + 4.4 GiB KV cache for 40960 ctx), fits g4dn.xlarge (15 GiB))

Model initialized in ./Qwen3-0.6B

Test locally:
clarifai model serve ./Qwen3-0.6B
clarifai model serve ./Qwen3-0.6B --mode env # auto-create venv and install deps
clarifai model serve ./Qwen3-0.6B --mode container # run inside Docker

Deploy to Clarifai:
clarifai model deploy ./Qwen3-0.6B
clarifai list-instances # list available instances

This creates a ./Qwen3-0.6B/ directory:

Qwen3-0.6B/
├── 1/
│   └── model.py       # vLLM inference logic
├── requirements.txt   # Lightweight deps (vLLM is pre-installed in the base image)
└── config.yaml        # Model config (user_id/app_id auto-filled from login)

Note: For private or gated models (Llama, Gemma, etc.), set HF_TOKEN in your environment before initializing:

export HF_TOKEN=your_token_here
model.py

import json
import os
import sys
from typing import Iterator, List, Optional

from openai import OpenAI

from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger

PYTHON_EXEC = sys.executable


def vllm_openai_server(checkpoints, **kwargs):
    """Start vLLM's OpenAI-compatible server as a subprocess."""
    from clarifai.runners.utils.model_utils import (
        execute_shell_command,
        terminate_process,
        wait_for_server,
    )

    cmds = [
        PYTHON_EXEC,
        '-m',
        'vllm.entrypoints.openai.api_server',
        '--model',
        checkpoints,
    ]
    # Convert kwargs to CLI flags: booleans become bare flags, everything
    # else becomes `--key value` (underscores mapped to hyphens).
    for key, value in kwargs.items():
        if value is None:
            continue
        param_name = key.replace('_', '-')
        if isinstance(value, bool):
            if value:
                cmds.append(f'--{param_name}')
        else:
            cmds.extend([f'--{param_name}', str(value)])

    # Lightweight holder for the server's host, port, and process handle.
    server = type(
        'Server',
        (),
        {
            'host': kwargs.get('host', '0.0.0.0'),
            'port': kwargs.get('port', 23333),
            'process': None,
        },
    )()

    try:
        server.process = execute_shell_command(" ".join(cmds))
        url = f"http://{server.host}:{server.port}"
        logger.info(f"Waiting for vLLM server at {url}")
        wait_for_server(url)
        logger.info(f"vLLM server started at {url}")
    except Exception as e:
        logger.error(f"Failed to start vLLM server: {e}")
        if server.process:
            terminate_process(server.process)
        raise RuntimeError(f"Failed to start vLLM server: {e}")
    return server


class VLLMModel(OpenAIModelClass):
    # Placeholders; both are set in load_model().
    client = True
    model = True

    def load_model(self):
        server_args = {
            'tensor_parallel_size': 1,
            'port': 23333,
            'host': 'localhost',
        }

        model_path = os.path.dirname(os.path.dirname(__file__))
        builder = ModelBuilder(model_path, download_validation_only=True)
        config = builder.config
        stage = config["checkpoints"]["when"]
        checkpoints = config["checkpoints"]["repo_id"]
        if stage in ["build", "runtime"]:
            checkpoints = builder.download_checkpoints(stage=stage)

        self.server = vllm_openai_server(checkpoints, **server_args)
        self.client = OpenAI(
            api_key="notset",
            base_url=f"http://{self.server.host}:{self.server.port}/v1",
        )
        self.model = self.client.models.list().data[0].id

    @OpenAIModelClass.method
    def predict(
        self,
        prompt: str = "",
        chat_history: Optional[List[dict]] = None,
        tools: Optional[List[dict]] = None,
        tool_choice: Optional[str] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> str:
        """Return a single completion."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"

        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
        )

        if response.choices and response.choices[0].message.tool_calls:
            tool_calls = response.choices[0].message.tool_calls
            return json.dumps([tc.to_dict() for tc in tool_calls], indent=2)
        return response.choices[0].message.content

    @OpenAIModelClass.method
    def generate(
        self,
        prompt: str = "",
        chat_history: Optional[List[dict]] = None,
        tools: Optional[List[dict]] = None,
        tool_choice: Optional[str] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate.",
        ),
        temperature: float = Param(
            default=0.7,
            description="Sampling temperature (higher = more random).",
        ),
        top_p: float = Param(
            default=0.95,
            description="Nucleus sampling threshold.",
        ),
    ) -> Iterator[str]:
        """Stream a completion response."""
        if tools is not None and tool_choice is None:
            tool_choice = "auto"

        messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            tools=tools,
            tool_choice=tool_choice,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        )
        for chunk in response:
            if chunk.choices:
                delta = chunk.choices[0].delta
                if delta.tool_calls:
                    yield json.dumps([tc.to_dict() for tc in delta.tool_calls], indent=2)
                else:
                    yield delta.content or ''
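The flag-construction loop in vllm_openai_server is worth calling out: every entry of server_args becomes a vLLM CLI flag. A standalone sketch of that conversion (build_flags is an illustrative name, not a Clarifai helper):

```python
def build_flags(**kwargs):
    """Mirror the flag-building loop: underscores become hyphens,
    booleans become bare flags, and None values are skipped."""
    cmds = []
    for key, value in kwargs.items():
        if value is None:
            continue
        param_name = key.replace('_', '-')
        if isinstance(value, bool):
            if value:
                cmds.append(f'--{param_name}')
        else:
            cmds.extend([f'--{param_name}', str(value)])
    return cmds


print(build_flags(tensor_parallel_size=1, enable_prefix_caching=True, seed=None))
# ['--tensor-parallel-size', '1', '--enable-prefix-caching']
```

This is why server_args in load_model uses underscored keys such as tensor_parallel_size: they map directly onto vLLM's --tensor-parallel-size flag.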
config.yaml

build_info:
  image: vllm/vllm-openai:latest
checkpoints:
  repo_id: Qwen/Qwen3-0.6B
  type: huggingface
  when: runtime
compute:
  instance: g4dn.xlarge
model:
  id: qwen3-06b

The config.yaml has 4 sections:

  • model.id — Auto-generated from the HuggingFace model name
  • build_info.image — vllm/vllm-openai:latest (vLLM, PyTorch, and CUDA pre-installed)
  • compute.instance — Auto-selected based on estimated VRAM. Run clarifai list-instances to see all options
  • checkpoints — HuggingFace model weights config. Add hf_token here or set HF_TOKEN for gated models
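For a gated model, the checkpoints section carries the token. An illustrative fragment (the repo_id is an example, and the token value is a placeholder):

```yaml
checkpoints:
  type: huggingface
  repo_id: meta-llama/Llama-3.2-1B-Instruct   # example gated repo
  when: runtime
  hf_token: hf_xxxxxxxxxxxxxxxx               # or omit and set HF_TOKEN in the environment
```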
requirements.txt
clarifai
openai

Step 4: Deploy to Cloud

Deploy to dedicated cloud compute. Clarifai automatically provisions all required infrastructure.

Since config.yaml already has a compute.instance value (auto-selected during init), you can deploy directly:

clarifai model deploy ./Qwen3-0.6B

To override the instance type:

clarifai model deploy ./Qwen3-0.6B --instance a10g

To see all available instance types and pricing:

clarifai list-instances

Tip: If you have a local GPU and want to test before deploying, run clarifai model serve ./Qwen3-0.6B --mode env first.

Step 5: Run Inference

import os
from openai import OpenAI

# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible API endpoint
    api_key=os.environ["CLARIFAI_PAT"],  # Ensure CLARIFAI_PAT is set as an environment variable
)

# Make a chat completion request to the deployed model
response = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the future of AI?"},
    ],
)

# Print the model's response
print(response.choices[0].message.content)
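Streaming works through the same endpoint, since the model class also exposes a streaming generate method. A sketch using stream=True (collect_deltas and stream_chat are illustrative helper names; the URL placeholders follow the config.yaml values, with user and app IDs taken from your login):

```python
import os
from typing import Iterable


def collect_deltas(chunks: Iterable) -> str:
    """Join the text deltas of an OpenAI streaming response into one string."""
    return "".join(
        c.choices[0].delta.content
        for c in chunks
        if c.choices and c.choices[0].delta.content
    )


def stream_chat(prompt: str) -> str:
    """Stream a completion from the deployed model, returning the full text."""
    from openai import OpenAI  # deferred import; collect_deltas needs no SDK

    client = OpenAI(
        base_url="https://api.clarifai.com/v2/ext/openai/v1",
        api_key=os.environ["CLARIFAI_PAT"],
    )
    stream = client.chat.completions.create(
        model="https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_deltas(stream)
```

Call print(stream_chat("What is the future of AI?")) to get the full reply once the stream finishes; to render tokens as they arrive, iterate the stream directly and print each delta.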

Or use the Clarifai CLI:

clarifai model predict https://clarifai.com/<user-id>/<app-id>/models/qwen3-06b "Explain AI in one sentence"

Manage Your Deployment

# Stream live logs
clarifai model logs --deployment <deployment-id>

# Check status
clarifai model status --deployment <deployment-id>

# Remove deployment when done (stops billing)
clarifai model undeploy --deployment <deployment-id>

For the full CLI reference, see CLI Reference.