
SGLang

Run models using the SGLang runtime format and make them available via a public API


SGLang is an open-source runtime and programming framework designed for structured generation and high-performance inference of large language models (LLMs) and vision-language models.

It provides a flexible way to execute models with advanced capabilities like multi-step prompting, structured outputs, and multimodal reasoning — all while maximizing throughput and minimizing latency.

With Clarifai’s Local Runners, you can download and run these models on your own machine using the SGLang runtime format, expose them securely via a public URL, and tap into Clarifai’s powerful platform — all while retaining the privacy, performance, and control of local execution.

Note: The SGLang toolkit specifies a runtime format for running models from external sources such as Hugging Face. After initializing a model with the toolkit, you can upload it to Clarifai to leverage the platform’s capabilities.

Step 1: Perform Prerequisites

Sign Up or Log In

Log in to your existing Clarifai account or sign up for a new one. Once you’re logged in, gather the following credentials required for setup:

  • App ID – Go to the application you want to use to run the model. In the collapsible left sidebar, select Overview and copy the app ID displayed there.
  • User ID – In the collapsible left sidebar, open Settings, then choose Account from the dropdown list to locate your user ID.
  • Personal Access Token (PAT) – From the same Settings menu, select Secrets to create or copy your PAT. This token is used to authenticate your connection with the Clarifai platform.

Then, set your PAT as an environment variable:

export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE

Install Clarifai CLI

Install the latest Clarifai CLI which includes built-in support for Local Runners:

pip install --upgrade clarifai

Note: The Local Runners require Python 3.11 or 3.12.

Install SGLang

Install SGLang to enable its runtime execution environment.

pip install sglang

Tip: GPU acceleration (CUDA) is highly recommended for optimal performance.
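
If you want to confirm that a CUDA-capable GPU is visible before starting the runner, a quick check like the one below can help. It assumes PyTorch is installed, which the SGLang stack pulls in as a dependency.

# Sanity check that a CUDA-capable GPU is visible.
# Assumes PyTorch is available (installed as part of the SGLang stack).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))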

Install OpenAI

Install the openai package, which is needed to perform inference with models that use the OpenAI-compatible format.

pip install openai

Get Hugging Face Token

If you want to initialize a Hugging Face model for use with SGLang, you’ll need a Hugging Face access token to authenticate with Hugging Face services — especially when accessing private or restricted repositories.

You can create one by following these instructions. Once you have the token, include it either in your model’s config.yaml file (as described below) or set it as an environment variable.

Note: If hf_token is not specified in the config.yaml file, the CLI will automatically use the HF_TOKEN environment variable for authentication with Hugging Face.

export HF_TOKEN="YOUR_HF_ACCESS_TOKEN_HERE"
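
Alternatively, you can keep the token in the model configuration itself. Below is a minimal sketch, assuming the hf_token field sits in the checkpoints block of the generated config.yaml (shown in full later in this guide) alongside repo_id:

checkpoints:
  repo_id: HuggingFaceTB/SmolLM2-135M-Instruct
  type: huggingface
  hf_token: YOUR_HF_ACCESS_TOKEN_HERE
  when: runtime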

Step 2: Initialize a Model

With the Clarifai CLI, you can initialize a model configured to run using the SGLang runtime format. The command sets up a Clarifai-compatible project directory with the appropriate files.

You can customize or optimize the model by editing the generated files as needed. For example, the command below initializes a default Hugging Face model (HuggingFaceTB/SmolLM2-135M-Instruct) in your current directory.

clarifai model init --toolkit sglang
Example Output
clarifai model init --toolkit sglang
[INFO] 20:14:19.494294 Parsed GitHub repository: owner=Clarifai, repo=runners-examples, branch=sglang, folder_path= | thread=8729403584
[INFO] 20:14:20.762093 Files to be downloaded are:
1. 1/model.py
2. 1/openai_server_starter.py
3. Dockerfile
4. README.md
5. config.yaml
6. requirements.txt | thread=8729403584
Press Enter to continue...
[INFO] 20:14:24.640395 Initializing model from GitHub repository: https://github.com/Clarifai/runners-examples | thread=8729403584
[INFO] 20:14:33.997825 Successfully cloned repository from https://github.com/Clarifai/runners-examples (branch: sglang) | thread=8729403584
[INFO] 20:14:34.006824 Updated Hugging Face model repo_id to: None | thread=8729403584
[INFO] 20:14:34.006878 Model initialization complete with GitHub repository | thread=8729403584
[INFO] 20:14:34.006909 Next steps: | thread=8729403584
[INFO] 20:14:34.006929 1. Review the model configuration | thread=8729403584
[INFO] 20:14:34.006946 2. Install any required dependencies manually | thread=8729403584
[INFO] 20:14:34.006966 3. Test the model locally using 'clarifai model local-test' | thread=8729403584
Tip: You can use the --model-name parameter to initialize any supported Hugging Face model. This sets the model’s repo_id, specifying which Hugging Face repository to initialize from.

clarifai model init --toolkit sglang --model-name unsloth/Llama-3.2-1B-Instruct

Note: Large models require significant GPU memory. Ensure your machine has enough compute capacity to run them efficiently.
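
To gauge whether a model will fit on your GPU before initializing it, you can check the available memory first. On NVIDIA hardware, for example:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv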

The generated structure includes:


├── 1/
│   ├── model.py
│   └── openai_server_starter.py
├── Dockerfile
├── README.md
├── config.yaml
└── requirements.txt

model.py

Example: 1/model.py
import os
import sys

sys.path.append(os.path.dirname(__file__))
from typing import Iterator, List

from clarifai.runners.models.model_builder import ModelBuilder
from clarifai.runners.models.openai_class import OpenAIModelClass
from clarifai.runners.utils.data_utils import Param
from clarifai.runners.utils.openai_convertor import build_openai_messages
from clarifai.utils.logging import logger
from openai import OpenAI
from openai_server_starter import OpenAI_APIServer
##################


class SglangModel(OpenAIModelClass):
    """
    A custom runner that integrates with the Clarifai platform and uses an OpenAI-compatible
    server for inference to process inputs, including text.
    """

    client = True  # This will be set in load_model method
    model = True  # This will be set in load_model method

    def load_model(self):
        """Load the model here and start the server."""
        os.path.join(os.path.dirname(__file__))

        # Use downloaded checkpoints.
        # Or if you intend to download checkpoint at runtime, set hf id instead. For example:
        # checkpoints = "Qwen/Qwen2-7B-Instruct"

        # server args were generated by `upload` module
        server_args = {
            'dtype': 'auto',
            'kv_cache_dtype': 'auto',
            'tp_size': 1,
            'load_format': 'auto',
            'context_length': None,
            'device': 'cuda',
            'port': 23333,
            'host': '0.0.0.0',
            'mem_fraction_static': 0.9,
            'max_total_tokens': '8192',
            'max_prefill_tokens': None,
            'schedule_policy': 'fcfs',
            'schedule_conservativeness': 1.0,
            'checkpoints': 'runtime',
        }

        # if checkpoints == "checkpoints" => assign to checkpoints var aka local checkpoints path
        stage = server_args.get("checkpoints")
        if stage in ["build", "runtime"]:
            # checkpoints = os.path.join(os.path.dirname(__file__), "checkpoints")
            config_path = os.path.dirname(os.path.dirname(__file__))
            builder = ModelBuilder(config_path, download_validation_only=True)
            checkpoints = builder.download_checkpoints(stage=stage)
            server_args.update({"checkpoints": checkpoints})

        if server_args.get("additional_list_args") == ['']:
            server_args.pop("additional_list_args")

        # Start server
        # This line was generated by `upload` module
        self.server = OpenAI_APIServer.from_sglang_backend(**server_args)

        # Create client
        self.client = OpenAI(
            api_key="notset", base_url=SglangModel.make_api_url(self.server.host, self.server.port)
        )
        self.model = self._get_model()

        logger.info(f"OpenAI {self.model} model loaded successfully!")

    def _get_model(self):
        try:
            return self.client.models.list().data[0].id
        except Exception as e:
            raise ConnectionError("Failed to retrieve model ID from API") from e

    @staticmethod
    def make_api_url(host: str, port: int, version: str = "v1") -> str:
        return f"http://{host}:{port}/{version}"

    @OpenAIModelClass.method
    def predict(
        self,
        prompt: str,
        chat_history: List[dict] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
        ),
        temperature: float = Param(
            default=0.7,
            description="A decimal number that determines the degree of randomness in the response",
        ),
        top_p: float = Param(
            default=0.8,
            description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.",
        ),
    ) -> str:
        """This is the method that will be called when the runner is run. It takes in an input and
        returns an output.
        """
        openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
        response = self.client.chat.completions.create(
            model=self.model,
            messages=openai_messages,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
        )
        if response.usage and response.usage.prompt_tokens and response.usage.completion_tokens:
            self.set_output_context(
                prompt_tokens=response.usage.prompt_tokens,
                completion_tokens=response.usage.completion_tokens,
            )
        return response.choices[0].message.content

    @OpenAIModelClass.method
    def generate(
        self,
        prompt: str,
        chat_history: List[dict] = None,
        max_tokens: int = Param(
            default=512,
            description="The maximum number of tokens to generate. Shorter token lengths will provide faster performance.",
        ),
        temperature: float = Param(
            default=0.7,
            description="A decimal number that determines the degree of randomness in the response",
        ),
        top_p: float = Param(
            default=0.8,
            description="An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass.",
        ),
    ) -> Iterator[str]:
        """Example yielding a whole batch of streamed stuff back."""
        openai_messages = build_openai_messages(prompt=prompt, messages=chat_history)
        for chunk in self.client.chat.completions.create(
            model=self.model,
            messages=openai_messages,
            max_completion_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            stream=True,
        ):
            if chunk.choices:
                text = (
                    chunk.choices[0].delta.content
                    if chunk.choices[0].delta.content is not None
                    else ''
                )
                yield text

    # This method is needed to test the model with the test-locally CLI command.
    def test(self):
        """Test the model here."""
        try:
            print("Testing predict...")
            # Test predict
            print(
                self.predict(
                    prompt="Hello, how are you?",
                )
            )
        except Exception as e:
            print("Error in predict", e)

        try:
            print("Testing generate...")
            # Test generate
            for each in self.generate(
                prompt="Hello, how are you?",
            ):
                print(each, end=" ")
        except Exception as e:
            print("Error in generate", e)

This is the main runner script that defines how your model loads, runs, and handles inference.

  • It subclasses OpenAIModelClass, meaning it exposes OpenAI-compatible endpoints for inference.
  • The load_model() method spins up a local SGLang backend server (via OpenAI_APIServer.from_sglang_backend) and initializes the model checkpoint.
  • The predict() and generate() methods define how text generation requests are processed — supporting both standard predictions and streaming outputs.
  • The test() method lets you verify locally that everything is working before deployment.
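
Before starting the runner, you can exercise the test() method directly with the command suggested in the CLI output above:

clarifai model local-test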

openai_server_starter.py

Example: 1/openai_server_starter.py
import os
import signal
import subprocess
import sys
import threading
from typing import List

import psutil
from clarifai.utils.logging import logger

PYTHON_EXEC = sys.executable


def kill_process_tree(parent_pid, include_parent: bool = True, skip_pid: int = None):
    """Kill the process and all its child processes."""
    if parent_pid is None:
        parent_pid = os.getpid()
        include_parent = False

    try:
        itself = psutil.Process(parent_pid)
    except psutil.NoSuchProcess:
        return

    children = itself.children(recursive=True)
    for child in children:
        if child.pid == skip_pid:
            continue
        try:
            child.kill()
        except psutil.NoSuchProcess:
            pass

    if include_parent:
        try:
            itself.kill()

            # Sometimes processes cannot be killed with SIGKILL (e.g., PID=1 launched by Kubernetes),
            # so we send an additional signal to kill them.
            itself.send_signal(signal.SIGQUIT)
        except psutil.NoSuchProcess:
            pass


class OpenAI_APIServer:
    def __init__(self, **kwargs):
        self.server_started_event = threading.Event()
        self.process = None
        self.backend = None
        self.server_thread = None

    def __del__(self, *exc):
        # This is important:
        # close the server when the program exits.
        self.close()

    def close(self):
        if self.process:
            try:
                kill_process_tree(self.process.pid)
            except Exception:
                self.process.terminate()
        if self.server_thread:
            self.server_thread.join()

    def wait_for_startup(self):
        self.server_started_event.wait()

    def validate_if_server_start(self, line: str):
        line_lower = line.lower()
        if self.backend in ["vllm", "sglang", "lmdeploy"]:
            if self.backend == "vllm":
                return (
                    "application startup complete" in line_lower
                    or "vllm api server on" in line_lower
                )
            else:
                return f" running on http://{self.host}:" in line.strip()
        elif self.backend == "llamacpp":
            return "waiting for new tasks" in line_lower
        elif self.backend == "tgi":
            return "Connected" in line.strip()

    def _start_server(self, cmds):
        try:
            env = os.environ.copy()
            env["VLLM_USAGE_SOURCE"] = "production-docker-image"
            self.process = subprocess.Popen(
                cmds,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
            )
            for line in self.process.stdout:
                logger.info("Server Log: " + line.strip())
                if self.validate_if_server_start(line):
                    self.server_started_event.set()
                    # break
        except Exception as e:
            if self.process:
                self.process.terminate()
            raise RuntimeError(f"Failed to start server: {e}")

    def start_server_thread(self, cmds: str):
        try:
            # Start the server in a separate thread
            self.server_thread = threading.Thread(
                target=self._start_server, args=(cmds,), daemon=None
            )
            self.server_thread.start()

            # Wait for the server to start
            self.wait_for_startup()
        except Exception as e:
            raise Exception(e)

    @classmethod
    def from_sglang_backend(
        cls,
        checkpoints,
        dtype: str = "auto",
        kv_cache_dtype: str = "auto",
        tp_size: int = 1,
        quantization: str = None,
        load_format: str = "auto",
        context_length: str = None,
        device: str = "cuda",
        port=23333,
        host="0.0.0.0",
        chat_template: str = None,
        mem_fraction_static: float = 0.8,
        max_running_requests: int = None,
        max_total_tokens: int = None,
        max_prefill_tokens: int = None,
        schedule_policy: str = "fcfs",
        schedule_conservativeness: float = 1.0,
        cpu_offload_gb: int = 0,
        additional_list_args: List[str] = [],
    ):
        """Start an SGLang OpenAI-compatible server.

        Args:
            checkpoints (str): Model ID or path.
            dtype (str, optional): Dtype used for the model {"auto", "half", "float16", "bfloat16", "float", "float32"}. Defaults to "auto".
            kv_cache_dtype (str, optional): Dtype of the KV cache; defaults to the model dtype. Defaults to "auto".
            tp_size (int, optional): The number of GPUs the model weights get sharded over. Mainly for saving memory rather than for high throughput. Defaults to 1.
            quantization (str, optional): Quantization format {"awq", "fp8", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf", "modelopt", "w8a8_int8"}. Defaults to None.
            load_format (str, optional): The format of the model weights to load:
                * `auto`: try to load the weights in the safetensors format and fall back to the pytorch bin format if safetensors is not available.
                * `pt`: load the weights in the pytorch bin format.
                * `safetensors`: load the weights in the safetensors format.
                * `npcache`: load the weights in pytorch format and store a numpy cache to speed up loading.
                * `dummy`: initialize the weights with random values, mainly for profiling.
                * `gguf`: load the weights in the gguf format.
                * `bitsandbytes`: load the weights using bitsandbytes quantization.
                * `layered`: load weights layer by layer so that one can quantize a layer before loading another, keeping the peak memory envelope smaller.
                Defaults to "auto".
            context_length (str, optional): The model's maximum context length. Defaults to None (uses the value from the model's config.json instead).
            device (str, optional): The device type {"cuda", "xpu", "hpu", "cpu"}. Defaults to "cuda".
            port (int, optional): Port number. Defaults to 23333.
            host (str, optional): Host name. Defaults to "0.0.0.0".
            chat_template (str, optional): The builtin chat template name or the path of the chat template file. This is only used for the OpenAI-compatible API server. Defaults to None.
            mem_fraction_static (float, optional): The fraction of memory used for static allocation (model weights and KV cache memory pool). Use a smaller value if you see out-of-memory errors. Defaults to 0.8.
            max_running_requests (int, optional): The maximum number of running requests. Defaults to None.
            max_total_tokens (int, optional): The maximum number of tokens in the memory pool. If not specified, it will be automatically calculated based on the memory usage fraction. Typically used for development and debugging. Defaults to None.
            max_prefill_tokens (int, optional): The maximum number of tokens in a prefill batch. The real bound will be the maximum of this value and the model's maximum context length. Defaults to None.
            schedule_policy (str, optional): The scheduling policy of the requests {"lpm", "random", "fcfs", "dfs-weight"}. Defaults to "fcfs".
            schedule_conservativeness (float, optional): How conservative the schedule policy is. A larger value means more conservative scheduling. Use a larger value if you see requests being retracted frequently. Defaults to 1.0.
            cpu_offload_gb (int, optional): How many GBs of RAM to reserve for CPU offloading. Defaults to 0.
            additional_list_args (List[str], optional): Additional args to pass to the subprocess cmd, e.g. ["--arg-name", "arg value"]. See more at [github](https://github.com/sgl-project/sglang/blob/1baa9e6cf90b30aaa7dae51c01baa25229e8f7d5/python/sglang/srt/server_args.py#L298). Defaults to [].

        Returns:
            OpenAI_APIServer: a server instance wrapping the launched SGLang process.
        """

        from clarifai.runners.utils.model_utils import execute_shell_command, wait_for_server

        cmds = [
            PYTHON_EXEC,
            "-m",
            "sglang.launch_server",
            "--model-path",
            checkpoints,
            "--dtype",
            str(dtype),
            "--device",
            str(device),
            "--kv-cache-dtype",
            str(kv_cache_dtype),
            "--tp-size",
            str(tp_size),
            "--load-format",
            str(load_format),
            "--mem-fraction-static",
            str(mem_fraction_static),
            "--schedule-policy",
            str(schedule_policy),
            "--schedule-conservativeness",
            str(schedule_conservativeness),
            "--port",
            str(port),
            "--host",
            host,
            "--trust-remote-code",
        ]
        if chat_template:
            cmds += ["--chat-template", chat_template]
        if quantization:
            cmds += [
                "--quantization",
                quantization,
            ]
        if context_length:
            cmds += [
                "--context-length",
                context_length,
            ]
        if max_running_requests:
            cmds += [
                "--max-running-requests",
                max_running_requests,
            ]
        if max_total_tokens:
            cmds += [
                "--max-total-tokens",
                max_total_tokens,
            ]
        if max_prefill_tokens:
            cmds += [
                "--max-prefill-tokens",
                max_prefill_tokens,
            ]

        if additional_list_args:
            cmds += additional_list_args

        print("CMDS to run `sglang` server: ", " ".join(cmds), "\n")
        _self = cls()

        _self.host = host
        _self.port = port
        _self.backend = "sglang"
        # _self.start_server_thread(cmds)
        # new_path = os.environ["PATH"] + ":/sbin"
        # _self.process = subprocess.Popen(cmds, text=True, stderr=subprocess.STDOUT, env={**os.environ, "PATH": new_path})
        _self.process = execute_shell_command(" ".join(cmds))

        logger.info("Waiting for " + f"http://{_self.host}:{_self.port}")
        wait_for_server(f"http://{_self.host}:{_self.port}")
        logger.info("Done")

        return _self

This utility handles starting, monitoring, and shutting down the backend SGLang server. It acts as your server controller, ensuring the backend is ready before the runner starts sending requests.

  • It wraps around subprocess management for launching sglang.launch_server.
  • It ensures the server runs properly, logs startup messages, and handles safe termination.
  • The class OpenAI_APIServer can also be extended to support other backends like vLLM, llama.cpp, or TGI, but here it’s used for SGLang.
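
For debugging, you can reproduce the same pattern model.py uses to talk to this backend once the server is up. The sketch below assumes the default port (23333) from the generated server_args and that you are on the same machine as the server:

# Minimal sketch: query the locally launched SGLang backend the same way model.py does.
# Assumes the default port 23333 from the generated server_args.
from openai import OpenAI

client = OpenAI(api_key="notset", base_url="http://localhost:23333/v1")

# The first (and only) served model is what _get_model() picks up.
print([m.id for m in client.models.list().data])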

Dockerfile

Example: Dockerfile
# syntax=docker/dockerfile:1.13-labs
FROM --platform=$TARGETPLATFORM lmsysorg/sglang:v0.5.3-cu129 as final
COPY --link requirements.txt /home/nonroot/requirements.txt

# Update clarifai package so we always have latest protocol to the API. Everything should land in /venv
RUN ["pip", "install", "--no-cache-dir", "-r", "/home/nonroot/requirements.txt"]
RUN ["pip", "show", "--no-cache-dir", "clarifai"]

# Set the NUMBA cache dir to /tmp
# Set the TORCHINDUCTOR cache dir to /tmp
# The CLARIFAI* will be set by the templating system.
ENV NUMBA_CACHE_DIR=/tmp/numba_cache \
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_cache \
HOME=/tmp \
DEBIAN_FRONTEND=noninteractive

#####
# Copy the files needed to download
#####
# This creates the directory that the HF downloader will populate, with nonroot:nonroot permissions.
COPY --chown=nonroot:nonroot downloader/unused.yaml /home/nonroot/main/1/checkpoints/.cache/unused.yaml

#####
# Download checkpoints if config.yaml has checkpoints.when = "build"
COPY --link=true config.yaml /home/nonroot/main/
RUN ["python", "-m", "clarifai.cli", "model", "download-checkpoints", "/home/nonroot/main", "--out_path", "/home/nonroot/main/1/checkpoints", "--stage", "build"]
#####

# Copy in the actual files like config.yaml, requirements.txt, and most importantly 1/model.py
# for the actual model.
# If checkpoints aren't downloaded since a checkpoints: block is not provided, then they will
# be in the build context and copied here as well.
COPY --link=true 1 /home/nonroot/main/1
# At this point we only need these for validation in the SDK.
COPY --link=true requirements.txt config.yaml /home/nonroot/main/

# Add the model directory to the python path.
ENV PYTHONPATH=${PYTHONPATH}:/home/nonroot/main \
CLARIFAI_PAT=${CLARIFAI_PAT} \
CLARIFAI_USER_ID=${CLARIFAI_USER_ID} \
CLARIFAI_RUNNER_ID=${CLARIFAI_RUNNER_ID} \
CLARIFAI_NODEPOOL_ID=${CLARIFAI_NODEPOOL_ID} \
CLARIFAI_COMPUTE_CLUSTER_ID=${CLARIFAI_COMPUTE_CLUSTER_ID} \
CLARIFAI_API_BASE=${CLARIFAI_API_BASE:-https://api.clarifai.com}

USER root
RUN echo "nonroot:x:65532:65532:nonroot user:/home/nonroot:/sbin/nologin" >> /etc/passwd
USER nonroot


# Finally run the clarifai entrypoint to start the runner loop and local runner server.
# Note(zeiler): we may want to make this a clarifai CLI call.
ENTRYPOINT ["python", "-m", "clarifai.runners.server"]
CMD ["--model_path", "/home/nonroot/main"]
#############################

The Dockerfile defines the container environment used to run your model runner on Clarifai’s infrastructure.

  • It builds on the official SGLang base image (lmsysorg/sglang:v0.5.3-cu129), which includes CUDA and SGLang dependencies.
  • It installs any Python packages listed in requirements.txt.
  • It copies your model files (model.py, config.yaml, etc.) into the container.
  • Optionally, it downloads checkpoints during build time if checkpoints.when = "build".
  • It starts the Clarifai runner loop using python -m clarifai.runners.server.
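
You typically don't need to build this image yourself, since Clarifai builds it when the model is uploaded. If you want to inspect the environment locally, though, a standard Docker build works; the image tag below is just an example:

docker build -t sglang-local-runner .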

config.yaml

Example: config.yaml
model:
  id: SmolLM2-135M-Instruct
  user_id: YOUR_USER_ID
  app_id: YOUR_APP_ID
  model_type_id: text-to-text

build_info:
  python_version: '3.11'

inference_compute_info:
  cpu_limit: '3'
  cpu_memory: 14Gi
  num_accelerators: 1
  accelerator_type:
    - NVIDIA-L40S
  accelerator_memory: 42Gi

checkpoints:
  repo_id: HuggingFaceTB/SmolLM2-135M-Instruct
  type: huggingface
  when: runtime

This is the configuration file for your SGLang model runner.

  • It specifies model identifiers (model.id, user_id, app_id), which together determine where your model will run on the Clarifai platform. Your Clarifai user ID is set by default from your active context.
  • It defines compute resources (CPU, GPU type, and memory).
  • The checkpoints section tells the runner where and when to load model weights (a build-time variant is sketched just after this list).

    Tip: Use when: runtime for large models to reduce image size and improve load times.
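
If you prefer to bake the weights into the image instead (the checkpoints.when = "build" path referenced in the Dockerfile above), only the when field changes; for example:

checkpoints:
  repo_id: HuggingFaceTB/SmolLM2-135M-Instruct
  type: huggingface
  when: build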

requirements.txt

Example: requirements.txt
clarifai
openai

This file lists all the Python dependencies required for the runner to work. If you haven’t installed them yet, run the following command to install the dependencies:

pip install -r requirements.txt

Step 3: Log In to Clarifai

Log in and create a configuration context:

clarifai login

Enter the requested details:

  • User ID – Your Clarifai user ID
  • PAT – Your personal access token (or type ENVVAR to use the environment variable)
  • Context name – Optional name for the config context (default: "default")
Example Output
clarifai login
Enter your Clarifai user ID: user-id
> To authenticate, you'll need a Personal Access Token (PAT).
> You can create one from your account settings: https://clarifai.com/user-id/settings/security

Enter your Personal Access Token (PAT) value (or type "ENVVAR" to use an environment variable): XXXXXXXXXX

> Verifying token...
[INFO] 12:10:55.558733 Validating the Context Credentials... | thread=8729403584
[INFO] 12:10:56.693295 ✅ Context is valid | thread=8729403584

> Let's save these credentials to a new context.
> You can have multiple contexts to easily switch between accounts or projects.

Enter a name for this context [default]:
✅ Success! You are now logged in.
Credentials saved to the 'default' context.

💡 To switch contexts later, use `clarifai config use-context <name>`.
[INFO] 12:10:59.177368 Login successful for user 'alfrick' in context 'default' | thread=8729403584

Step 4: Start Your Local Runner

Next, start your Local Runner, which connects to the SGLang runtime to execute your model locally.

clarifai model local-runner

If any configuration contexts or defaults are missing, the CLI will automatically guide you through setting them up.

This process ensures that all required components — such as compute clusters, nodepools, and deployments — are correctly configured in your context, enabling seamless local execution of your SGLang model. For more details, see the Local Runners documentation.

Example Output

Step 5: Test Your Runner

After the Local Runner starts, you can use it to perform inference with your SGLang-based model.

You can run a test snippet in a separate terminal, within the same directory, to verify that your model is running and responding correctly.

Here’s an example snippet:

import os
from openai import OpenAI

# Initialize the OpenAI client, pointing to Clarifai's API
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",  # Clarifai's OpenAI-compatible API endpoint
    api_key=os.environ["CLARIFAI_PAT"],  # Ensure CLARIFAI_PAT is set as an environment variable
)

# Make a chat completion request to a Clarifai-hosted model
response = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the future of AI?"},
    ],
)

# Print the model's response
print(response.choices[0].message.content)
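
Because the generated model also implements a streaming generate() method, the same endpoint can be consumed with stream=True. The sketch below reuses the client and model URL from the snippet above:

# Stream tokens as they are produced (same client and model URL as above).
stream = client.chat.completions.create(
    model="https://clarifai.com/<user-id>/local-runner-app/models/local-runner-model",
    messages=[{"role": "user", "content": "What is the future of AI?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)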