vLLM
Download and serve vLLM models locally and expose them via a public API
vLLM is an open-source, high-performance inference engine that allows you to serve large language models (LLMs) locally with remarkable speed and efficiency. It supports OpenAI-compatible APIs, making it easy to integrate with the Clarifai platform.
With Clarifai’s Local Runners, you can seamlessly deploy vLLM-powered models on your own machine, expose them through a secure public URL, and take full advantage of Clarifai’s AI capabilities — while retaining control, privacy, and performance.
Note: After initializing a model using the vLLM toolkit, you can upload it to Clarifai to leverage the platform’s capabilities.
Step 1: Perform Prerequisites
Sign Up or Log In
First, either log in to your existing Clarifai account or sign up for a new one. Once logged in, you'll need these credentials to set up your project:
- App ID – Navigate to the application you'll use for your model. In the collapsible left sidebar, select the Overview option; the app ID is displayed there.
- User ID – In the collapsible left sidebar, go to Settings and select the Account option. Then, find your user ID.
- Personal Access Token (PAT) – This token is essential to authenticate your connection with the Clarifai platform. To create or copy your PAT, go to Settings and choose the Secrets option.
Then, store it as an environment variable for secure authentication:
- Unix-like Systems

```bash
export CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
```

- Windows

```bash
set CLARIFAI_PAT=YOUR_PERSONAL_ACCESS_TOKEN_HERE
```
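Once the variable is set, you can sanity-check it from Python before moving on. This is an optional sketch; the helper name is ours, not part of the Clarifai SDK:

```python
import os

def get_clarifai_pat() -> str:
    """Read the Clarifai PAT from the environment, failing fast if it is missing."""
    pat = os.environ.get("CLARIFAI_PAT")
    if not pat:
        raise RuntimeError("CLARIFAI_PAT is not set; export it before starting the runner.")
    return pat
```

Failing fast here saves you from a less obvious authentication error later in the workflow.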
Install the Clarifai CLI
Install the Clarifai CLI to access Local Runners and manage your deployments.
- Bash

```bash
pip install --upgrade clarifai
```
Note: Ensure you have Python 3.11 or 3.12 installed to run Local Runners successfully.
Install vLLM
Install the vLLM package.
- Bash

```bash
pip install vllm
```
vLLM supports models from the Hugging Face Hub (e.g., LLaMA, Mistral, Falcon, etc.) and serves them via a local OpenAI-compatible API.
Note: You need a Hugging Face access token to download models from private or gated repositories. You can create one in your Hugging Face account settings.
Install the OpenAI Package
Install the openai client library — it will be used to send requests to your vLLM server.
- Bash

```bash
pip install openai
```
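As a rough sketch of what those requests look like, the snippet below builds an OpenAI-compatible chat-completions payload and posts it with Python's standard library. It assumes a vLLM server is already listening on the default `http://localhost:8000/v1` (for example, one started with `vllm serve HuggingFaceH4/zephyr-7b-beta`); the helper names are illustrative, and the `openai` client performs the same call via `client.chat.completions.create(...)`.

```python
import json
from urllib import request

VLLM_BASE_URL = "http://localhost:8000/v1"  # vLLM's default OpenAI-compatible endpoint

def build_chat_payload(prompt: str, model: str = "HuggingFaceH4/zephyr-7b-beta") -> dict:
    """Assemble the JSON body expected by the /chat/completions route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }

def chat(prompt: str) -> str:
    """POST the payload to a running vLLM server and return the reply text."""
    req = request.Request(
        f"{VLLM_BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the payload follows the OpenAI schema, the same code works unchanged against any OpenAI-compatible endpoint, including the public URL Local Runners exposes later.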
Step 2: Initialize a Model
Use the Clarifai CLI to initialize a vLLM-based model directory. This setup prepares all required files for local execution and Clarifai integration.
You can further customize or optimize the model by modifying the generated files as necessary.
- Bash

```bash
clarifai model init --toolkit vllm
```
Note: You can initialize a model in a specific location by passing a MODEL_PATH argument.
You can initialize any model supported by vLLM. If you want to initialize a specific vLLM model, use the --model-name flag.
- Bash

```bash
clarifai model init --toolkit vllm --model-name HuggingFaceH4/zephyr-7b-beta
```
Example Output

```text
clarifai model init --toolkit vllm --model-name HuggingFaceH4/zephyr-7b-beta
[INFO] 12:37:30.485152 Parsed GitHub repository: owner=Clarifai, repo=runners-examples, branch=vllm, folder_path= | thread=8309383360
[INFO] 12:37:38.228774 Files to be downloaded are:
1. 1/model.py
2. config.yaml
3. requirements.txt | thread=8309383360
Press Enter to continue...
[INFO] 12:37:40.819580 Initializing model from GitHub repository: https://github.com/Clarifai/runners-examples | thread=8309383360
[INFO] 12:40:28.485625 Successfully cloned repository from https://github.com/Clarifai/runners-examples (branch: vllm) | thread=8309383360
[INFO] 12:40:28.494056 Updated Hugging Face model repo_id to: HuggingFaceH4/zephyr-7b-beta | thread=8309383360
[INFO] 12:40:28.494107 Model initialization complete with GitHub repository | thread=8309383360
[INFO] 12:40:28.494133 Next steps: | thread=8309383360
[INFO] 12:40:28.494152 1. Review the model configuration | thread=8309383360
[INFO] 12:40:28.494169 2. Install any required dependencies manually | thread=8309383360
[INFO] 12:40:28.494186 3. Test the model locally using 'clarifai model local-test' | thread=8309383360
```
Note: Some models are quite large and require substantial memory or GPU resources. Ensure your machine has sufficient compute capacity to load and run the model locally before initializing it.
You’ll get a folder structure similar to:

```text
├── 1/
│   └── model.py
├── config.yaml
└── requirements.txt
```