
SGLang

Serve any HuggingFace LLM locally or deploy to cloud compute


SGLang is a high-performance open-source inference engine for LLMs with support for structured generation and multimodal reasoning. With Clarifai, you can deploy SGLang models to cloud GPUs with a single command.

The workflow is: init → deploy

Important: SGLang requires Linux with an NVIDIA GPU (Ampere or newer, compute capability >= 8.0). It does not run on macOS or Windows. For local testing on non-Linux machines, use vLLM or Ollama instead.
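If you're unsure whether a local GPU qualifies, `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` prints each GPU's compute capability as a "major.minor" string (e.g. "8.6" for an A10G). A small sketch that checks such a string against the 8.0 cutoff — the helper name is ours, not a Clarifai or SGLang API:

```python
def meets_sglang_requirement(compute_cap: str, minimum: float = 8.0) -> bool:
    """Return True if a 'major.minor' compute-capability string is >= minimum."""
    major, minor = compute_cap.strip().split(".")
    return int(major) + int(minor) / 10 >= minimum

print(meets_sglang_requirement("8.6"))  # Ampere (e.g. A10G): True
print(meets_sglang_requirement("7.5"))  # Turing (e.g. T4): False
```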

Step 1: Install Clarifai

pip install --upgrade clarifai

Note: Python 3.11 or 3.12 is required. The openai package is included with clarifai.

Step 2: Log In

clarifai login

You'll be prompted for your user ID and PAT. This saves your credentials locally so you don't need to set environment variables manually.
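If you'd rather keep the token out of the interactive prompt, you can export it as an environment variable and type "ENVVAR" when asked for the PAT (as the prompt below suggests). `CLARIFAI_PAT` is the variable name the inference example in Step 5 reads; the value here is a placeholder:

```shell
# Store the PAT in the environment instead of pasting it at the prompt.
export CLARIFAI_PAT="your-personal-access-token"
# Confirm it is set without echoing the secret itself.
[ -n "$CLARIFAI_PAT" ] && echo "CLARIFAI_PAT is set"
```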

Example Output
clarifai login
Enter your Clarifai user ID: user-id
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/user-id/settings/security

Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): XXXXXXXXXX

> Verifying token...
[INFO] 12:10:55.558733 Validating the Context Credentials... | thread=8729403584
[INFO] 12:10:56.693295 ✅ Context is valid | thread=8729403584

> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.

Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.

💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 12:10:59.177368 Login successful for user 'alfrick' in context 'default' | thread=8729403584

Step 3: Initialize a Model

Scaffold a model project using a HuggingFace model name:

clarifai model init --toolkit sglang --model-name Qwen/Qwen3-4B

The CLI auto-selects an Ampere+ GPU instance based on the model's VRAM requirements. You can initialize any model supported by SGLang — just change --model-name to a different HuggingFace repo ID.
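The VRAM estimate in the init output is roughly: weights ≈ parameter count × 2 bytes (bf16), plus a KV cache of 2 tensors (K and V) × layers × context length × KV heads × head dim × 2 bytes. A back-of-envelope sketch using Qwen3-4B's published config (~4.0B parameters, 36 layers, 8 KV heads, head dim 128 — treat these numbers as illustrative, not a guarantee of how the CLI computes its estimate):

```python
GiB = 1024 ** 3

# Weights: ~4.0B parameters at 2 bytes each (bf16/fp16).
weights_gib = 4.0e9 * 2 / GiB

# KV cache: 2 tensors (K and V) x layers x context x kv_heads x head_dim x 2 bytes.
layers, ctx, kv_heads, head_dim = 36, 40960, 8, 128
kv_cache_gib = 2 * layers * ctx * kv_heads * head_dim * 2 / GiB

print(f"weights ~ {weights_gib:.1f} GiB, KV cache ~ {kv_cache_gib:.1f} GiB")
# -> weights ~ 7.5 GiB, KV cache ~ 5.6 GiB (matching the CLI's estimate above)
```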

Example Output
clarifai model init --toolkit sglang --model-name Qwen/Qwen3-4B
[INFO] Initializing model with sglang toolkit...
[INFO] Updated Hugging Face model repo_id to: Qwen/Qwen3-4B
Instance: g5.xlarge (Estimated 15.9 GiB VRAM (7.5 GiB weights + 5.6 GiB KV cache for 40960 ctx), fits g5.xlarge (22 GiB))

Model initialized in ./Qwen3-4B

Test locally:
clarifai model serve ./Qwen3-4B
clarifai model serve ./Qwen3-4B --mode env # auto-create venv and install deps
clarifai model serve ./Qwen3-4B --mode container # run inside Docker

Deploy to Clarifai:
clarifai model deploy ./Qwen3-4B
clarifai list-instances # list available instances

This creates a ./Qwen3-4B/ directory:

Qwen3-4B/
├── 1/
│   └── model.py         # SGLang inference logic
├── requirements.txt     # Lightweight deps (SGLang is pre-installed in the base image)
└── config.yaml          # Model config (user_id/app_id auto-filled from login)

Note: For private or gated models (Llama, Gemma, etc.), set HF_TOKEN in your environment before initializing:

export HF_TOKEN=your_token_here
model.py
import os
import sys

sys.path.append(os.path.dirname(__file__))
from typing import Iterator, List

from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger
from openai import OpenAI
from openai_server_starter import OpenAI_APIServer

class SglangModel(OpenAIModelClass):
    """
    A custom runner that integrates with the Clarifai platform and uses a local
    SGLang server to process inputs, including text.
    """

    client = True  # This will be set in the load_model method
    model = True  # This will be set in the load_model method

    def load_model(self):
        """Load the model here and start the server."""

        # Use downloaded checkpoints, or set an HF repo ID instead to download
        # the checkpoint at runtime. For example:
        # checkpoints = "Qwen/Qwen2-7B-Instruct"

        # Server args were generated by the `upload` module.
        server_args = {
            'dtype': 'auto',
            'kv_cache_dtype': 'auto',
            'tp_size': 1,
            'load_format': 'auto',
            'context_length': None,
            'device': 'cuda',
            'port': 23333,
            'host': '0.0.0.0',
            'mem_fraction_static': 0.9,
            'max_total_tokens': '8192',
            'max_prefill_tokens': None,
            'schedule_policy': 'fcfs',
            'schedule_conservativeness': 1.0,
            'checkpoints': 'runtime',
        }

        # If "checkpoints" names a download stage ("build" or "runtime"),
        # resolve it to the local path of the downloaded checkpoints.
        stage = server_args.get("checkpoints")
        if stage in ["build", "runtime"]:
            # checkpoints = os.path.join(os.path.dirname(__file__), "checkpoints")
            config_path = os.path.dirname(os.path.dirname(__file__))
            builder = ModelBuilder(config_path, download_validation_only=True)
            checkpoints = builder.download_checkpoints(stage=stage)
            server_args.update({"checkpoints": checkpoints})

        if server_args.get("additional_list_args") == ['']:
            server_args.pop("additional_list_args")

        # Start the SGLang OpenAI-compatible server.
        # This line was generated by the `upload` module.
        self.server = OpenAI_APIServer.from_sglang_backend(**server_args)

        # Create a client pointed at the local server.
        self.client = OpenAI(
            api_key="notset",
            base_url=SglangModel.make_api_url(self.server.host, self.server.port),
        )
        self.model = self._get_model()

        logger.info(f"OpenAI {self.model} model loaded successfully!")

    def _get_model(self):
        try:
            return self.client.models.list().data[0].id
        except Exception as e:
            raise ConnectionError("Failed to retrieve model ID from API") from e

    @staticmethod
    def make_api_url(host: str, port: int, version: str = "v1") -> str:
        return f"http://{host}:{port}/{version}"

    @OpenAIModelClass.method
    def predict(
        self,
        prompt: str,
        chat_history: List[dict] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
        ),
        temperature: float = Param(
            default=0.7,
            description="A decimal number that determines the degree of randomness in the response.",
        ),
        top_p: float = Param(
            default=0.8,
            description="An alternative to sampling with temperature, where the model considers the tokens with top_p probability mass.",
        ),
    ) -> str:
        """Run a single chat completion and return the full response text."""
        openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=openai_messages,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        if response.usage and response.usage.prompt_tokens and response.usage.completion_tokens:
            self.set_output_context(
                prompt_tokens=response.usage.prompt_tokens,
                completion_tokens=response.usage.completion_tokens,
            )
        return response.choices[0].message.content

    @OpenAIModelClass.method
    def generate(
        self,
        prompt: str,
        chat_history: List[dict] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
        ),
        temperature: float = Param(
            default=0.7,
            description="A decimal number that determines the degree of randomness in the response.",
        ),
        top_p: float = Param(
            default=0.8,
            description="An alternative to sampling with temperature, where the model considers the tokens with top_p probability mass.",
        ),
    ) -> Iterator[str]:
        """Stream the chat completion, yielding response text chunk by chunk."""
        openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
        for chunk in self.client.chat.completions.create(
            model=self.model,
            messages=openai_messages,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        ):
            if chunk.choices:
                yield chunk.choices[0].delta.content or ''

    # This method is needed to test the model with the test-locally CLI command.
    def test(self):
        """Test the model here."""
        try:
            print("Testing predict...")
            print(self.predict(prompt="Hello, how are you?"))
        except Exception as e:
            print("Error in predict", e)

        try:
            print("Testing generate...")
            for each in self.generate(prompt="Hello, how are you?"):
                print(each, end=" ")
        except Exception as e:
            print("Error in generate", e)
config.yaml
build_info:
  image: lmsysorg/sglang:latest
checkpoints:
  repo_id: Qwen/Qwen3-4B
  type: huggingface
  when: runtime
compute:
  instance: g5.xlarge
model:
  id: qwen3-4b

The config.yaml has 4 sections:

  • model.id — Auto-generated from the HuggingFace model name
  • build_info.image — lmsysorg/sglang:latest (SGLang, PyTorch, and CUDA pre-installed)
  • compute.instance — Auto-selected based on estimated VRAM (Ampere+ only). Run clarifai list-instances to see all options
  • checkpoints — HuggingFace model weights config. Add hf_token here or set HF_TOKEN for gated models
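For a gated model, the same file gains an hf_token entry under checkpoints, as the section description above notes. A sketch with hypothetical values (substitute your own token, repo ID, and instance; gated repos also require you to request access on HuggingFace first):

```yaml
build_info:
  image: lmsysorg/sglang:latest
checkpoints:
  repo_id: meta-llama/Llama-3.2-3B-Instruct   # gated repo: request access on HuggingFace first
  type: huggingface
  when: runtime
  hf_token: hf_xxxxxxxxxxxxxxxx               # or omit this and set HF_TOKEN instead
compute:
  instance: g5.xlarge
model:
  id: llama-3-2-3b-instruct
```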
requirements.txt
clarifai
openai

Step 4: Deploy to Cloud

Deploy to dedicated cloud compute. Clarifai automatically provisions all required infrastructure.

Since config.yaml already has a compute.instance value (auto-selected during init), you can deploy directly:

clarifai model deploy ./Qwen3-4B

To override the instance type:

clarifai model deploy ./Qwen3-4B --instance g5.xlarge

To see all available instance types and pricing:

clarifai list-instances

Tip: If you have a local Linux GPU and want to test before deploying, run clarifai model serve ./Qwen3-4B --mode env first.

Step 5: Run Inference

import os
from openai import OpenAI

# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible API endpoint
    api_key=os.environ["CLARIFAI_PAT"],  # Ensure CLARIFAI_PAT is set as an environment variable
)

# Make a chat completion request to the deployed model
response = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/main/models/Qwen3-4B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the future of AI?"},
    ],
)

# Print the model's response
print(response.choices[0].message.content)

Or use the Clarifai CLI:

clarifai model predict https://clarifai.com/<user-id>/main/models/Qwen3-4B "Explain AI in one sentence"

Manage Your Deployment

# Stream live logs
clarifai model logs --deployment <deployment-id>

# Check status
clarifai model status --deployment <deployment-id>

# Remove deployment when done (stops billing)
clarifai model undeploy --deployment <deployment-id>

For the full CLI reference, see CLI Reference.