SGLang
Run models using the SGLang runtime format and make them available via a public API
SGLang is an open-source runtime and programming framework designed for structured generation and high-performance inference of large language models (LLMs) and vision-language models.
It provides a flexible way to execute models with advanced capabilities like multi-step prompting, structured outputs, and multimodal reasoning — all while maximizing throughput and minimizing latency.
With Clarifai’s Local Runners, you can download and run these models on your own machine using the SGLang runtime format, expose them securely via a public URL, and tap into Clarifai’s powerful platform — all while retaining the privacy, performance, and control of local execution.
Note: The SGLang toolkit specifies a runtime format for running models sourced from external providers such as Hugging Face. After initializing a model with the toolkit, you can upload it to Clarifai to leverage the platform’s capabilities.
Step 1: Perform Prerequisites
Sign Up or Log In
Log in to your existing Clarifai account or sign up for a new one. Once you’re logged in, gather the following credentials required for setup:
- App ID – Go to the application you want to use to run the model. In the collapsible left sidebar, select Overview and copy the app ID displayed there.
- User ID – In the collapsible left sidebar, open Settings, then choose Account from the dropdown list to locate your user ID.
- Personal Access Token (PAT) – From the same Settings menu, select Secrets to create or copy your PAT. This token is used to authenticate your connection with the Clarifai platform.
Then, set your PAT as an environment variable:
- Unix-Like Systems
- Windows
export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
set CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
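Optionally, you can confirm the variable is visible to new processes with a quick Python check (nothing Clarifai-specific is required for this):
- Python
import os

# Optional sanity check: confirm the PAT environment variable is set.
pat = os.environ.get("CLARIFAI_PAT")
print("CLARIFAI_PAT is set" if pat else "CLARIFAI_PAT is not set")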
Install Clarifai CLI
Install the latest Clarifai CLI, which includes built-in support for Local Runners:
- Bash
pip install --upgrade clarifai
Note: The Local Runners require Python 3.11 or 3.12.
Install SGLang
Install SGLang to enable its runtime execution environment.
- Bash
pip install sglang
Tip: GPU acceleration (CUDA) is highly recommended for optimal performance.
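To quickly confirm that a CUDA-capable GPU is visible from Python, you can run a short check like the one below (a minimal sketch; it assumes PyTorch is installed, which SGLang pulls in as a dependency):
- Python
import torch

# Report whether a CUDA device is available and, if so, which one.
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device detected; inference will be much slower on CPU.")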
Install OpenAI
Install the openai package, which is needed to perform inference with models that use the OpenAI-compatible format.
- Bash
pip install openai
Get Hugging Face Token
If you want to initialize a Hugging Face model for use with SGLang, you’ll need a Hugging Face access token to authenticate with Hugging Face services — especially when accessing private or restricted repositories.
You can create one by following these instructions. Once you have the token, include it either in your model’s config.yaml file (as described below) or set it as an environment variable.
Note: If hf_token is not specified in the config.yaml file, the CLI will automatically use the HF_TOKEN environment variable for authentication with Hugging Face.
- Unix-Like Systems
- Windows
export HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"
set HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"
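To verify the token before initializing a model, you can ask Hugging Face who it belongs to (a minimal sketch; it assumes the huggingface_hub package is installed, which is typically present alongside SGLang):
- Python
import os
from huggingface_hub import whoami

# Validate the HF_TOKEN environment variable against the Hugging Face API.
info = whoami(token=os.environ["HF_TOKEN"])
print("Authenticated to Hugging Face as:", info.get("name"))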
Step 2: Initialize a Model
With the Clarifai CLI, you can initialize a model configured to run using the SGLang runtime format. The command sets up a Clarifai-compatible project directory with the appropriate files.
You can customize or optimize the model by editing the generated files as needed. For example, the command below initializes a default Hugging Face model (HuggingFaceTB/SmolLM2-135M-Instruct) in your current directory.
- Bash
clarifai model init --toolkit sglang
Note: You can initialize a model in a specific location by passing a MODEL_PATH.
Example Output
clarifai model init --toolkit sglang
[INFO] 20:14:19.494294 Parsed GitHub repository: owner=Clarifai, repo=runners-examples, branch=sglang, folder_path= | thread=8729403584
[INFO] 20:14:20.762093 Files to be downloaded are:
1. 1/model.py
2. 1/openai_server_starter.py
3. Dockerfile
4. README.md
5. config.yaml
6. requirements.txt | thread=8729403584
Press Enter to continue...
[INFO] 20:14:24.640395 Initializing model from GitHub repository: https://github.com/Clarifai/runners-examples | thread=8729403584
[INFO] 20:14:33.997825 Successfully cloned repository from https://github.com/Clarifai/runners-examples (branch: sglang) | thread=8729403584
[INFO] 20:14:34.006824 Updated Hugging Face model repo_id to: None | thread=8729403584
[INFO] 20:14:34.006878 Model initialization complete with GitHub repository | thread=8729403584
[INFO] 20:14:34.006909 Next steps: | thread=8729403584
[INFO] 20:14:34.006929 1. Review the model configuration | thread=8729403584
[INFO] 20:14:34.006946 2. Install any required dependencies manually | thread=8729403584
[INFO] 20:14:34.006966 3. Test the model locally using 'clarifai model local-test' | thread=8729403584
You can use the --model-name parameter to initialize any supported Hugging Face model. This sets the model’s repo_id, specifying which Hugging Face repository to initialize from.
- Bash
clarifai model init --toolkit sglang --model-name unsloth/Llama-3.2-1B-Instruct
Note: Large models require significant GPU memory. Ensure your machine has enough compute capacity to run them efficiently.
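As a rough way to gauge whether a model will fit on your GPU, you can estimate the memory needed for the weights from the parameter count. The helper below is a back-of-the-envelope sketch only; it ignores the KV cache, activations, and the headroom controlled by mem_fraction_static:
- Python
def approx_weight_vram_gib(num_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Approximate GPU memory for the weights alone (fp16/bf16 is about 2 bytes per parameter)."""
    return num_params_billions * 1e9 * bytes_per_param / 1024**3

# Example: a 1B-parameter model in bf16 needs roughly 2 GiB just for its weights.
print(f"{approx_weight_vram_gib(1.0):.1f} GiB")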
The generated structure includes:
├── 1/
│   ├── model.py
│   └── openai_server_starter.py
├── Dockerfile
├── README.md
├── config.yaml
└── requirements.txt
model.py
Example: 1/model.py
import os
import sys
sys.path.append(os.path.dirname(__file__))
from typing import Iterator, List
from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger
from openai import OpenAI
from openai_server_starter import OpenAI_APIServer
##################
class SglangModel(OpenAIModelClass):
"""
A custom runner that integrates with the Clarifai platform and uses Server inference
to process inputs, including text.
"""
client = True # This will be set in load_model method
model = True # This will be set in load_model method
def load_model(self):
"""Load the model here and start the server."""
os.path.join(os.path.dirname(__file__))
# Use downloaded checkpoints.
# Or if you intend to download checkpoint at runtime, set hf id instead. For example:
# checkpoints = "Qwen/Qwen2-7B-Instruct"
# server args were generated by `upload` module
server_args = {
'dtype': 'auto',
'kv_cache_dtype': 'auto',
'tp_size': 1,
'load_format': 'auto',
'context_length': None,
'device': 'cuda',
'port': 23333,
'host': '0.0.0.0',
'mem_fraction_static': 0.9,
'max_total_tokens': '8192',
'max_prefill_tokens': None,
'schedule_policy': 'fcfs',
'schedule_conservativeness': 1.0,
'checkpoints': 'runtime',
}
# if checkpoints == "checkpoints" => assign to checkpoints var aka local checkpoints path
stage = server_args.get("checkpoints")
if stage in ["build", "runtime"]:
# checkpoints = os.path.join(os.path.dirname(__file__), "checkpoints")
config_path = os.path.dirname(os.path.dirname(__file__))
builder = ModelBuilder(config_path, download_validation_only=True)
checkpoints = builder.download_checkpoints(stage=stage)
server_args.update({"checkpoints": checkpoints})
if server_args.get("additional_list_args") == ['']:
server_args.pop("additional_list_args")
# Start server
        # This line was generated by the `upload` module
self.server = OpenAI_APIServer.from_sglang_backend(**server_args)
# Create client
self.client = OpenAI(
api_key="notset", base_url=SglangModel.make_api_url(self.server.host, self.server.port)
)
self.model = self._get_model()
logger.info(f"OpenAI {self.model} model loaded successfully!")
def _get_model(self):
try:
return self.client.models.list().data[0].id
except Exception as e:
raise ConnectionError("Failed to retrieve model ID from API") from e
@staticmethod
def make_api_url(host: str, port: int, version: str = "v1") -> str:
return f"http://{host}:{port}/{version}"
@OpenAIModelClass.method
def predict(
self,
prompt: str,
chat_history: List[dict] = None,
max_tokens: int = Param(
default=512,
description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
),
temperature: float = Param(
default=0.7,
description="A decimal number that determines the degree of randomness in the response",
),
top_p: float = Param(
default=0.8,
description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.",
),
) -> str:
"""This is the method that will be called when the runner is run. It takes in an input and
returns an output.
"""
openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
response = self.client.chat.completions.create(
model=self.model,
messages=openai_messages,
max_completion_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
)
if response.usage and response.usage.prompt_tokens and response.usage.completion_tokens:
self.set_output_context(
prompt_tokens=response.usage.prompt_tokens,
completion_tokens=response.usage.completion_tokens,
)
return response.choices[0].message.content
@OpenAIModelClass.method
def generate(
self,
prompt: str,
chat_history: List[dict] = None,
max_tokens: int = Param(
default=512,
description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
),
temperature: float = Param(
default=0.7,
description="A decimal number that determines the degree of randomness in the response",
),
top_p: float = Param(
default=0.8,
description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.",
),
) -> Iterator[str]:
"""Example yielding a whole batch of streamed stuff back."""
openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
for chunk in self.client.chat.completions.create(
model=self.model,
messages=openai_messages,
max_completion_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stream=True,
):
if chunk.choices:
text = (
chunk.choices[0].delta.content
if (chunk and chunk.choices[0].delta.content) is not None
else ''
)
yield text
# This method is needed to test the model with the test-locally CLI command.
def test(self):
"""Test the model here."""
try:
print("Testing predict...")
# Test predict
print(
self.predict(
prompt="Hello, how are you?",
)
)
except Exception as e:
print("Error in predict", e)
try:
print("Testing generate...")
# Test generate
for each in self.generate(
prompt="Hello, how are you?",
):
print(each, end=" ")
except Exception as e:
print("Error in generate", e)
This is the main runner script that defines how your model loads, runs, and handles inference.
- It subclasses OpenAIModelClass, meaning it exposes OpenAI-compatible endpoints for inference.
- The load_model() method spins up a local SGLang backend server (via OpenAI_APIServer.from_sglang_backend) and initializes the model checkpoint.
- The predict() and generate() methods define how text generation requests are processed, supporting both standard predictions and streaming outputs (see the usage sketch after this list).
- The test() method lets you verify locally that everything is working before deployment.
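Below is a hypothetical variant of the test() method showing how chat_history (a list of OpenAI-style role/content dictionaries, as consumed by build_openai_messages) and the sampling parameters can be passed to both methods; the prompts and values are placeholders:
- Python
    # Hypothetical replacement for SglangModel.test(); runs after load_model() has started the server.
    def test(self):
        history = [
            {"role": "user", "content": "What is SGLang?"},
            {"role": "assistant", "content": "SGLang is a runtime for fast, structured LLM inference."},
        ]
        # Single-shot prediction with custom sampling parameters.
        print(self.predict(
            prompt="How does it differ from a plain HTTP server?",
            chat_history=history,
            max_tokens=256,
            temperature=0.3,
        ))
        # Streaming generation: chunks are yielded as they arrive.
        for chunk in self.generate(prompt="Summarize that in one sentence.", chat_history=history):
            print(chunk, end="", flush=True)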
openai_server_starter.py
Example: 1/openai_server_starter.py
import os
import signal
import subprocess
import sys
import threading
from typing import List
import psutil
from clarifai.utils.logging import logger
PYTHON_EXEC = sys.executable
def kill_process_tree(parent_pid, include_parent: bool = True, skip_pid: int = None):
"""Kill the process and all its child processes."""
if parent_pid is None:
parent_pid = os.getpid()
include_parent = False
try:
itself = psutil.Process(parent_pid)
except psutil.NoSuchProcess:
return
children = itself.children(recursive=True)
for child in children:
if child.pid == skip_pid:
continue
try:
child.kill()
except psutil.NoSuchProcess:
pass
if include_parent:
try:
itself.kill()
# Sometime processes cannot be killed with SIGKILL (e.g, PID=1 launched by kubernetes),
# so we send an additional signal to kill them.
itself.send_signal(signal.SIGQUIT)
except psutil.NoSuchProcess:
pass
class OpenAI_APIServer:
def __init__(self, **kwargs):
self.server_started_event = threading.Event()
self.process = None
self.backend = None
self.server_thread = None
def __del__(self, *exc):
# This is important
# close the server when exit the program
self.close()
def close(self):
if self.process:
try:
kill_process_tree(self.process.pid)
except:
self.process.terminate()
if self.server_thread:
self.server_thread.join()
def wait_for_startup(self):
self.server_started_event.wait()
def validate_if_server_start(self, line: str):
line_lower = line.lower()
if self.backend in ["vllm", "sglang", "lmdeploy"]:
if self.backend == "vllm":
return (
"application startup complete" in line_lower
or "vllm api server on" in line_lower
)
else:
return f" running on http://{self.host}:" in line.strip()
elif self.backend == "llamacpp":
return "waiting for new tasks" in line_lower
elif self.backend == "tgi":
return "Connected" in line.strip()
def _start_server(self, cmds):
try:
env = os.environ.copy()
env["VLLM_USAGE_SOURCE"] = "production-docker-image"
self.process = subprocess.Popen(
cmds,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
)
for line in self.process.stdout:
logger.info("Server Log: " + line.strip())
if self.validate_if_server_start(line):
self.server_started_event.set()
# break
except Exception as e:
if self.process:
self.process.terminate()
            raise RuntimeError(f"Failed to start server: {e}")
def start_server_thread(self, cmds: str):
try:
# Start the server in a separate thread
self.server_thread = threading.Thread(
target=self._start_server, args=(cmds,), daemon=None
)
self.server_thread.start()
# Wait for the server to start
self.wait_for_startup()
except Exception as e:
raise Exception(e)
@classmethod
def from_sglang_backend(
cls,
checkpoints,
dtype: str = "auto",
kv_cache_dtype: str = "auto",
tp_size: int = 1,
quantization: str = None,
load_format: str = "auto",
context_length: str = None,
device: str = "cuda",
port=23333,
host="0.0.0.0",
chat_template: str = None,
mem_fraction_static: float = 0.8,
max_running_requests: int = None,
max_total_tokens: int = None,
max_prefill_tokens: int = None,
schedule_policy: str = "fcfs",
schedule_conservativeness: float = 1.0,
cpu_offload_gb: int = 0,
additional_list_args: List[str] = [],
):
"""Start SGlang OpenAI compatible server.
Args:
checkpoints (str): model id or path.
dtype (str, optional): Dtype used for the model {"auto", "half", "float16", "bfloat16", "float", "float32"}. Defaults to "auto".
kv_cache_dtype (str, optional): Dtype of the kv cache, defaults to the dtype. Defaults to "auto".
tp_size (int, optional): The number of GPUs the model weights get sharded over. Mainly for saving memory rather than for high throughput. Defaults to 1.
quantization (str, optional): Quantization format {"awq","fp8","gptq","marlin","gptq_marlin","awq_marlin","bitsandbytes","gguf","modelopt","w8a8_int8"}. Defaults to None.
load_format (str, optional): The format of the model weights to load:\n* `auto`: will try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors format is not available.\n* `pt`: will load the weights in the pytorch bin format. \n* `safetensors`: will load the weights in the safetensors format. \n* `npcache`: will load the weights in pytorch format and store a numpy cache to speed up the loading. \n* `dummy`: will initialize the weights with random values, which is mainly for profiling.\n* `gguf`: will load the weights in the gguf format. \n* `bitsandbytes`: will load the weights using bitsandbytes quantization."\n* `layered`: loads weights layer by layer so that one can quantize a layer before loading another to make the peak memory envelope smaller.\n. Defaults to "auto".\n
context_length (str, optional): The model's maximum context length. Defaults to None (will use the value from the model's config.json instead). Defaults to None.
device (str, optional): The device type {"cuda", "xpu", "hpu", "cpu"}. Defaults to "cuda".
port (int, optional): Port number. Defaults to 23333.
host (str, optional): Host name. Defaults to "0.0.0.0".
chat_template (str, optional): The buliltin chat template name or the path of the chat template file. This is only used for OpenAI-compatible API server.. Defaults to None.
mem_fraction_static (float, optional): The fraction of the memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. Defaults to 0.8.
max_running_requests (int, optional): The maximum number of running requests.. Defaults to None.
max_total_tokens (int, optional): The maximum number of tokens in the memory pool. If not specified, it will be automatically calculated based on the memory usage fraction. This option is typically used for development and debugging purposes.. Defaults to None.
max_prefill_tokens (int, optional): The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model's maximum context length. Defaults to None.
schedule_policy (str, optional): The scheduling policy of the requests {"lpm", "random", "fcfs", "dfs-weight"}. Defaults to "fcfs".
schedule_conservativeness (float, optional): How conservative the schedule policy is. A larger value means more conservative scheduling. Use a larger value if you see requests being retracted frequently. Defaults to 1.0.
cpu_offload_gb (int, optional): How many GBs of RAM to reserve for CPU offloading. Defaults to 0.
additional_list_args (List[str], optional): additional args to run subprocess cmd e.g. ["--arg-name", "arg value"]. See more at [github](https://github.com/sgl-project/sglang/blob/1baa9e6cf90b30aaa7dae51c01baa25229e8f7d5/python/sglang/srt/server_args.py#L298). Defaults to [].
Returns:
_type_: _description_
"""
from clarifai.runners.utils.model_utils import execute_shell_command, wait_for_server
cmds = [
PYTHON_EXEC,
"-m",
"sglang.launch_server",
"--model-path",
checkpoints,
"--dtype",
str(dtype),
"--device",
str(device),
"--kv-cache-dtype",
str(kv_cache_dtype),
"--tp-size",
str(tp_size),
"--load-format",
str(load_format),
"--mem-fraction-static",
str(mem_fraction_static),
"--schedule-policy",
str(schedule_policy),
"--schedule-conservativeness",
str(schedule_conservativeness),
"--port",
str(port),
"--host",
host,
"--trust-remote-code",
]
if chat_template:
cmds += ["--chat-template", chat_template]
if quantization:
cmds += [
"--quantization",
quantization,
]
if context_length:
cmds += [
"--context-length",
context_length,
]
if max_running_requests:
cmds += [
"--max-running-requests",
max_running_requests,
]
if max_total_tokens:
cmds += [
"--max-total-tokens",
max_total_tokens,
]
if max_prefill_tokens:
cmds += [
"--max-prefill-tokens",
max_prefill_tokens,
]
if additional_list_args:
cmds += additional_list_args
print("CMDS to run `sglang` server: ", " ".join(cmds), "\n")
_self = cls()
_self.host = host
_self.port = port
_self.backend = "sglang"
# _self.start_server_thread(cmds)
# new_path = os.environ["PATH"] + ":/sbin"
# _self.process = subprocess.Popen(cmds, text=True, stderr=subprocess.STDOUT, env={**os.environ, "PATH": new_path})
_self.process = execute_shell_command(" ".join(cmds))
logger.info("Waiting for " + f"http://{_self.host}:{_self.port}")
wait_for_server(f"http://{_self.host}:{_self.port}")
logger.info("Done")
return _self
This utility handles starting, monitoring, and shutting down the backend SGLang server. It acts as your server controller, ensuring the backend is ready before the runner starts sending requests.
- It wraps around subprocess management for launching sglang.launch_server.
- It ensures the server runs properly, logs startup messages, and handles safe termination.
- The OpenAI_APIServer class can also be extended to support other backends like vLLM, llama.cpp, or TGI, but here it’s used for SGLang (see the sketch below).
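For orientation, here is a sketch of how this controller could be driven on its own (run it from the 1/ directory so openai_server_starter is importable; the checkpoint ID and server settings are illustrative, not required values):
- Python
from openai import OpenAI
from openai_server_starter import OpenAI_APIServer

# Launch an SGLang-backed, OpenAI-compatible server and block until it is ready.
server = OpenAI_APIServer.from_sglang_backend(
    checkpoints="HuggingFaceTB/SmolLM2-135M-Instruct",  # HF repo ID or local checkpoint path
    mem_fraction_static=0.9,
    port=23333,
    host="0.0.0.0",
)

# Query it the same way model.py does: discover the served model, then send a chat request.
client = OpenAI(api_key="notset", base_url=f"http://{server.host}:{server.port}/v1")
model_id = client.models.list().data[0].id
reply = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(reply.choices[0].message.content)

server.close()  # terminates the launched subprocess tree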
Dockerfile
Example: Dockerfile
# syntax=docker/dockerfile:1.13-labs
FROM --platform=$TARGETPLATFORM lmsysorg/sglang:v0.5.3-cu129 as final
COPY --link requirements.txt /home/nonroot/requirements.txt
# Update clarifai package so we always have latest protocol to the API. Everything should land in /venv
RUN ["pip", "install", "--no-cache-dir", "-r", "/home/nonroot/requirements.txt"]
RUN ["pip", "show", "--no-cache-dir", "clarifai"]
# Set the NUMBA cache dir to /tmp
# Set the TORCHINDUCTOR cache dir to /tmp
# The CLARIFAI* variables will be set by the templating system.
ENV NUMBA_CACHE_DIR=/tmp/numba_cache \
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_cache \
HOME=/tmp \
DEBIAN_FRONTEND=noninteractive
#####
# Copy the files needed to download
#####
# This creates the directory that the HF downloader will populate, with nonroot:nonroot permissions.
COPY --chown=nonroot:nonroot downloader/unused.yaml /home/nonroot/main/1/checkpoints/.cache/unused.yaml
#####
# Download checkpoints if config.yaml has checkpoints.when = "build"
COPY --link=true config.yaml /home/nonroot/main/
RUN ["python", "-m", "clarifai.cli", "model", "download-checkpoints", "/home/nonroot/main", "--out_path", "/home/nonroot/main/1/checkpoints", "--stage", "build"]
#####
# Copy in the actual files like config.yaml, requirements.txt, and most importantly 1/model.py
# for the actual model.
# If checkpoints aren't downloaded since a checkpoints: block is not provided, then they will
# be in the build context and copied here as well.
COPY --link=true 1 /home/nonroot/main/1
# At this point we only need these for validation in the SDK.
COPY --link=true requirements.txt config.yaml /home/nonroot/main/
# Add the model directory to the python path.
ENV PYTHONPATH=${PYTHONPATH}:/home/nonroot/main \
CLARIFAI_PAT=${CLARIFAI_PAT} \
CLARIFAI_USER_ID=${CLARIFAI_USER_ID} \
CLARIFAI_RUNNER_ID=${CLARIFAI_RUNNER_ID} \
CLARIFAI_NODEPOOL_ID=${CLARIFAI_NODEPOOL_ID} \
CLARIFAI_COMPUTE_CLUSTER_ID=${CLARIFAI_COMPUTE_CLUSTER_ID} \
CLARIFAI_API_BASE=${CLARIFAI_API_BASE:-https://api.clarifai.com}
USER root
RUN echo "nonroot:x:65532:65532:nonroot user:/home/nonroot:/sbin/nologin" >> /etc/passwd
USER nonroot
# Finally run the clarifai entrypoint to start the runner loop and local runner server.
# Note(zeiler): we may want to make this a clarifai CLI call.
ENTRYPOINT ["python", "-m", "clarifai.runners.server"]
CMD ["--model_path", "/home/nonroot/main"]
#############################
The Dockerfile defines the container environment used to run your model runner on Clarifai’s infrastructure.
- It builds on the official SGLang base image (lmsysorg/sglang:v0.5.3-cu129), which includes CUDA and the SGLang dependencies.
- It installs any Python packages listed in requirements.txt.
- It copies your model files (model.py, config.yaml, etc.) into the container.
- Optionally, it downloads checkpoints during build time if checkpoints.when = "build".
- It starts the Clarifai runner loop using python -m clarifai.runners.server.
config.yaml
Example: config.yaml
model:
id: SmolLM2-135M-Instruct
user_id: YOUR_USER_ID
app_id: YOUR_APP_ID
model_type_id: text-to-text
build_info:
python_version: '3.11'
inference_compute_info:
cpu_limit: '3'
cpu_memory: 14Gi
num_accelerators: 1
accelerator_type:
- NVIDIA-L40S
accelerator_memory: 42Gi
checkpoints:
repo_id: HuggingFaceTB/SmolLM2-135M-Instruct
type: huggingface
when: runtime
This is the configuration file for your SGLang model runner.
- It specifies model identifiers (model.id, user_id, app_id), which together determine where your model will run on the Clarifai platform. Your Clarifai user ID is set by default from your active context.
- It defines compute resources (CPU, GPU type, and memory).
- The checkpoints section tells the runner where and when to load model weights (see the sanity check after this list).
Tip: Use when: runtime for large models to reduce image size and improve load times.
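If you edit these values, a quick sanity check that the file still parses and contains the fields the runner relies on might look like this (a minimal sketch; it assumes PyYAML is available, which the Clarifai SDK installs):
- Python
import yaml

# Parse config.yaml and print the fields that determine where and how the model runs.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print("Model ID:   ", cfg["model"]["id"])
print("User / App: ", cfg["model"]["user_id"], "/", cfg["model"]["app_id"])
print("Accelerator:", cfg["inference_compute_info"]["accelerator_type"])
print("Checkpoints:", cfg["checkpoints"]["repo_id"], "loaded at", cfg["checkpoints"]["when"])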
requirements.txt
Example: requirements.txt
clarifai
openai
This file lists the Python dependencies required for the runner to work. If you haven’t installed them yet, run the following command to install them:
- Bash
pip install -r requirements.txt
Step 3: Log In to Clarifai
Log in and create a configuration context:
clarifai login
Enter the requested details:
- User ID – Your Clarifai user ID
- PAT – Your personal access token (or type ENVVAR to use the environment variable)
- Context name – Optional name for the config context (default: "default")
Example Output
clarifai login
Enter your Clarifai user ID: user-id
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/user-id/settings/security
Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): XXXXXXXXXX
> Verifying token...
[INFO] 12:10:55.558733 Validating the Context Credentials... | thread=8729403584
[INFO] 12:10:56.693295 ✅ Context is valid | thread=8729403584
> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.
Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.
💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 12:10:59.177368 Login successful for user 'alfrick' in context 'default' | thread=8729403584
Step 4: Start Your Local Runner
Next, start your Local Runner, which connects to the SGLang runtime to execute your model locally.
clarifai model local-runner
If any configuration contexts or defaults are missing, the CLI will automatically guide you through setting them up.
This process ensures that all required components — such as compute clusters, nodepools, and deployments — are correctly configured in your context, enabling seamless local execution of your SGLang model. For more details, see Local Runners documentation.
Example Output
Step 5: Test Your Runner
After the Local Runner starts, you can use it to perform inference with your SGLang-based model.
You can run a test snippet in a separate terminal, within the same directory, to verify that your model is running and responding correctly.
Here’s an example snippet:
- Python SDK
import os
from openai import OpenAI
# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
base_url="https://api.clarifai.com/v2/ext/openai/v1", # Clarifai's OpenAI-compatible API endpoint
api_key=os.environ["CLARIFAI_PAT"] # Ensure CLARIFAI_PAT is set as an environment variable
)
# Make a chat completion request to a Clarifai-hosted model
response = client.chat.completions.create(
model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the future of AI?"}
],
)
# Print the model's response
print(response.choices[0].message.content)
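Because the runner’s generate() method supports streaming, you can also request a streamed response through the same OpenAI-compatible endpoint. The snippet below is a minimal streaming variant of the example above (same placeholder model URL):
- Python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key=os.environ["CLARIFAI_PAT"],
)

# Stream the completion and print chunks as they arrive.
stream = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
    messages=[{"role": "user", "content": "What is the future of AI?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)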