Request Routing

How Clarifai routes prediction requests for optimal performance

When you send a prediction request to a deployed model, Clarifai's routing system determines where that request is processed.

Routing happens at two levels:

First, selecting the right nodepool (which hardware).
Then, selecting the right replica (which pod within that hardware).

This page covers both levels, what Clarifai optimizes automatically, and what you can configure.

Nodepool Selection

When a request arrives, Clarifai picks a nodepool in this order:

Explicit routing — If you specify a deployment_id (or compute_cluster_id + nodepool_id) in the request, the request goes directly to that nodepool.
Deployment lookup — If you don't specify routing, Clarifai finds active deployments for the model version and picks the best nodepool based on queue depth and latency.
Shared compute fallback — For Clarifai-owned models with no user deployment, the request falls back to Clarifai's shared infrastructure.
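
The three-step order above can be sketched as a small decision function. This is only an illustration, not Clarifai's implementation: the `Deployment` class, the `SHARED_NODEPOOL` placeholder, and the rule for combining queue depth and latency (lowest queue depth first, latency as tie-breaker) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical view of one active deployment for a model version."""
    nodepool_id: str
    queue_depth: int    # requests currently waiting
    latency_ms: float   # recently observed latency


SHARED_NODEPOOL = "clarifai-shared"  # placeholder name, not a real identifier


def select_nodepool(
    explicit_nodepool: Optional[str],
    deployments: List[Deployment],
    clarifai_owned: bool,
) -> Optional[str]:
    # 1. Explicit routing: the caller named a nodepool, so use it directly.
    if explicit_nodepool is not None:
        return explicit_nodepool
    # 2. Deployment lookup: pick the least-loaded active deployment.
    #    Assumed ranking: queue depth first, then latency.
    if deployments:
        best = min(deployments, key=lambda d: (d.queue_depth, d.latency_ms))
        return best.nodepool_id
    # 3. Shared compute fallback: only for Clarifai-owned models.
    if clarifai_owned:
        return SHARED_NODEPOOL
    return None  # nothing can serve this request
```

Returning `None` in the last case is a choice made for the sketch; the source does not say what happens to a request for a user-owned model that has no deployment.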

Multi-Nodepool Deployments

A single deployment can span multiple nodepools with priority ordering. When you add nodepools to a deployment, you assign each a priority rank (1 being highest).

Requests are routed to the highest-priority nodepool that has available capacity, with overflow spilling to the next.

Note: You can configure multi-nodepool deployments in the UI by dragging nodepools into your preferred priority order, or via the API using deployment_nodepools with sequence values.

This enables several powerful patterns:

Cross-cloud failover — Place nodepools in different cloud providers (AWS, GCP) so traffic fails over automatically if one region has issues.
Cross-hardware routing — Since Clarifai's Compute Orchestration supports any hardware, you can route between NVIDIA and AMD GPUs, or even between GPU and CPU nodepools, within a single deployment. This means you can mix x86 and ARM architectures, or combine spot and on-demand instances for cost optimization.
Tiered performance — Use a high-performance GPU nodepool as primary and a cheaper instance as overflow.
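
The priority-with-overflow rule described earlier in this section can be sketched as follows. The `NodepoolState` type and its `free_capacity` field are hypothetical; how Clarifai measures available capacity is not specified here.

```python
from typing import List, NamedTuple, Optional


class NodepoolState(NamedTuple):
    """Hypothetical snapshot of one nodepool inside a deployment."""
    priority: int        # 1 is the highest priority
    nodepool_id: str
    free_capacity: int   # assumed measure of remaining capacity


def pick_by_priority(nodepools: List[NodepoolState]) -> Optional[str]:
    # Walk the nodepools from highest priority (lowest rank number) down
    # and return the first one that still has room.
    for pool in sorted(nodepools, key=lambda p: p.priority):
        if pool.free_capacity > 0:
            return pool.nodepool_id
    return None  # every nodepool is saturated
```

With a full primary GPU nodepool at priority 1 and a cheaper nodepool at priority 2, this returns the second nodepool, which is the overflow behavior used by the tiered-performance pattern.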

Explicit Request-Level Routing

You can override the default routing behavior for individual requests by specifying a particular deployment ID or nodepool.

Python SDK

# Route to a specific deployment
response = model.predict(inputs, deployment_id="my-deployment")

# Route to a specific nodepool
response = model.predict(inputs, compute_cluster_id="my-cluster", nodepool_id="my-nodepool")

Automatic Replica Optimization

Once a nodepool is selected, Clarifai routes the request to the replica most likely to deliver the fastest response.

This is fully automatic — you don't configure it, but understanding what it does helps you design better deployments.

KV Cache Routing (Prefix Cache Routing)

For models backed by inference engines with KV caching (such as vLLM and SGLang), Clarifai detects shared prompt prefixes and routes requests to replicas that already hold those prefixes in their KV cache.

This avoids redundant recomputation of attention keys and values, resulting in higher throughput and lower time-to-first-token (TTFT).

Note: This is fully automatic — no configuration or code changes required.

This is especially effective for:

Shared system prompts — Multiple users hitting the same model with the same system instruction share the same prefix. Rather than recomputing it on every replica, requests are routed to replicas that already have that prefix cached.
RAG pipelines — Users querying the same knowledge base receive similar retrieved-document prefixes, which are automatically reused across requests.
Multi-turn conversations — Follow-up messages share the prior conversation as a prefix, so subsequent turns reuse cached state from earlier turns.

When a matching prefix is found, the request routes to the best-matched replica. When no prefix match exists, the system falls back gracefully — there is no performance penalty compared to routing without this feature.
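
To make the idea concrete, here is a minimal sketch of longest-prefix matching across replicas. It is an illustration under stated assumptions, not Clarifai's router: the `replica_cache` mapping (replica ID to token sequences it has recently served) is hypothetical, and real inference engines typically track cached prefixes at block granularity rather than token by token.

```python
from typing import Dict, List, Optional, Sequence


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens shared by two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_prefix_replica(
    prompt_tokens: Sequence[int],
    replica_cache: Dict[str, List[Sequence[int]]],
) -> Optional[str]:
    """Return the replica holding the longest matching prefix, or None."""
    best_id, best_len = None, 0
    for replica_id, cached_prompts in replica_cache.items():
        for cached in cached_prompts:
            length = common_prefix_len(prompt_tokens, cached)
            if length > best_len:
                best_id, best_len = replica_id, length
    # None means no replica shares any prefix; the caller falls back
    # to its normal replica selection.
    return best_id
```

A shared system prompt shows up here as a long common prefix between the incoming prompt and sequences a replica has already processed, so that replica wins the match.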

Example Output

{
    "status": {
        "code": 10000,
        "description": "Success",
        "req_id": "b5894b9daa7742d0ad73eeecbbce1aed"
    },
    "outputs": [{
        "status": {
            "code": 10000
        },
        "input": {
            "id": "9208857317bb4abe98acf2e13416e4a5",
            "data": {
                "text": {
                    "url": "https://samples.clarifai.com/placeholder.gif"
                }
            }
        },
        "data": {
            "text": {
                "raw": "**Photosynthesis** is the biological process by which plants, algae, and some bacteria convert **light energy** (usually from the sun) into **chemical energy** stored in sugar molecules.\n\n### The Basics\nIt takes place primarily in the **chloroplasts** of plant cells, using a green pigment called **chlorophyll** to capture sunlight.\n\nThe overall chemical equation is:\n\n**6CO₂ + 6H₂O + light energy → C₆H₁₂O₆ + 6O₂**\n\n*(Carbon dioxide + Water + Sunlight → Glucose + Oxygen)*\n\n### The Two Main Stages\n1.  **Light-Dependent Reactions:** Chlorophyll absorbs sunlight, which splits water molecules (H₂O). This releases **oxygen** as a byproduct and creates energy-carrying molecules (ATP and NADPH).\n2.  **Light-Independent Reactions (Calvin Cycle):** The plant uses that stored energy to convert carbon dioxide (CO₂) from the air into **glucose**, a simple sugar it uses for energy and growth.\n\n### Why It Matters\nPhotosynthesis is the foundation of most life on Earth. It produces the **oxygen** we breathe and is the primary source of energy for nearly all food chains."
            },
            "string_value": "**Photosynthesis** is the biological process by which plants, algae, and some bacteria convert **light energy** (usually from the sun) into **chemical energy** stored in sugar molecules.\n\n### The Basics\nIt takes place primarily in the **chloroplasts** of plant cells, using a green pigment called **chlorophyll** to capture sunlight.\n\nThe overall chemical equation is:\n\n**6CO₂ + 6H₂O + light energy → C₆H₁₂O₆ + 6O₂**\n\n*(Carbon dioxide + Water + Sunlight → Glucose + Oxygen)*\n\n### The Two Main Stages\n1.  **Light-Dependent Reactions:** Chlorophyll absorbs sunlight, which splits water molecules (H₂O). This releases **oxygen** as a byproduct and creates energy-carrying molecules (ATP and NADPH).\n2.  **Light-Independent Reactions (Calvin Cycle):** The plant uses that stored energy to convert carbon dioxide (CO₂) from the air into **glucose**, a simple sugar it uses for energy and growth.\n\n### Why It Matters\nPhotosynthesis is the foundation of most life on Earth. It produces the **oxygen** we breathe and is the primary source of energy for nearly all food chains."
        },
        "prompt_tokens": 13,
        "completion_tokens": 680,
        "cached_tokens": 11
    }],
    "runner_selector": {
        "runner": {
            "worker": {
                "model": {
                    "id": "Kimi-K2_6",
                    "app_id": "",
                    "model_version": {
                        "id": "2280341feaf14301a1d7b3a52f0e3f29"
                    }
                }
            }
        },
        "deployment": {
            "id": "local-Kimi-K2_6",
            "user_id": "moonshotai",
            "nodepools": [{
                "id": "local-runner-nodepool",
                "compute_cluster": {
                    "id": "local-runner-compute-cluster",
                    "user_id": "moonshotai"
                }
            }]
        }
    }
}

Session Awareness

Clarifai tracks which replicas have recently served each user and favors routing subsequent requests from the same user to the same replica.

This improves cache reuse for conversational workloads without any client-side session management.

KV cache routing (prefix cache routing) and session awareness work together in a priority chain: prefix cache match → session affinity → random selection.

Prefix cache routing handles shared prefixes across users (e.g., shared system prompts), while session awareness handles reuse within a single user's conversation.

If neither signal applies, the system falls back to random selection with no degradation.
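
The priority chain can be written as a short fallthrough. This is a sketch only: the function name and parameters are hypothetical, and it assumes the prefix match and the session's previous replica have already been computed elsewhere.

```python
import random
from typing import List, Optional


def choose_replica(
    replicas: List[str],
    prefix_match: Optional[str] = None,
    session_replica: Optional[str] = None,
    rng: random.Random = random.Random(),
) -> str:
    # 1. Prefix cache match: a replica already holds this prompt's prefix.
    if prefix_match in replicas:
        return prefix_match
    # 2. Session affinity: the replica that recently served this user.
    if session_replica in replicas:
        return session_replica
    # 3. Random selection among the running replicas.
    return rng.choice(replicas)
```

Checking membership in `replicas` at each step covers the case where the preferred replica has since been scaled down, in which case the next signal in the chain applies.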

Autoscaling

Autoscaling controls how many replicas are running within a nodepool. You configure this per-nodepool within a deployment:

Setting	Default	Description
Min Replicas	`0`	Minimum instances always running. Set `≥ 1` to avoid cold starts
Max Replicas	`5`	Ceiling for autoscaling
Scale Up Delay	`300s`	Wait before adding replicas after traffic increases
Scale Down Delay	`300s`	Wait before removing replicas after traffic drops
Scale to Zero Delay	`1800s`	Idle time before scaling to zero (only when min replicas = 0)
Traffic History	`300s`	How far back to look at traffic data for scaling decisions
Disable Packing	`false`	When `true`, restricts to one replica per node for isolation
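
For reference, the defaults in the table can be captured in a small configuration object. The class and field names below are hypothetical and are not the SDK's or API's actual field names; the validation rule (max replicas must be at least 1 and at least min replicas) is an assumption added for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutoscaleConfig:
    """Illustrative mirror of the autoscaling table; names are hypothetical."""
    min_replicas: int = 0
    max_replicas: int = 5
    scale_up_delay_s: int = 300
    scale_down_delay_s: int = 300
    scale_to_zero_delay_s: int = 1800
    traffic_history_s: int = 300
    disable_packing: bool = False

    def __post_init__(self) -> None:
        # Assumed sanity checks, not documented platform behavior.
        if self.min_replicas < 0:
            raise ValueError("min_replicas must be >= 0")
        if self.max_replicas < max(self.min_replicas, 1):
            raise ValueError("max_replicas must be >= max(min_replicas, 1)")

    def can_scale_to_zero(self) -> bool:
        # The scale-to-zero delay only applies when min replicas is 0.
        return self.min_replicas == 0


# A deployment that keeps one warm replica and can grow to ten.
warm = AutoscaleConfig(min_replicas=1, max_replicas=10)
```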

Note: You can configure autoscaling in the UI during deployment, via the Python SDK, or via the CLI. For advanced autoscaling settings (scale-to-zero delays, traffic history, packing), use the UI deployment flow or the API.

    clarifai model deploy ./my-model --min-replicas 1 --max-replicas 10

Cold Start Reduction

Clarifai pre-warms popular instance types automatically so new replicas are ready faster when traffic increases or spills across nodepools. For specific GPU types not covered by default, set min_replicas ≥ 1 on your deployment to keep your preferred hardware warm and eliminate cold starts entirely.

More replicas means more capacity and better cache distribution across your deployment.

Cache and Scaling

When new replicas scale up, they start with an empty KV cache — prefix cache routing effectiveness improves as replicas warm up over time. When replicas scale down, cached state on those replicas is lost. For latency-sensitive workloads, setting min_replicas ≥ 1 ensures you always have warm replicas with populated caches.

Prediction Caching

Prediction caching stores and reuses previously computed model outputs for identical requests, eliminating redundant inference and significantly reducing latency and compute costs.

For repeated requests with the same input, model, and model version, you can bypass inference entirely by enabling prediction caching with the use_predict_cache parameter.

cURL

curl -X POST "https://api.clarifai.com/v2/users/moonshotai/apps/chat-completion/models/Kimi-K2_6/outputs" \
  -H "Authorization: Key YOUR_PAT_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "data": {
          "text": {
            "raw": "How do I check if a Python object is an instance of a class?"
          }
        }
      }
    ],
    "use_predict_cache": true
  }'

Example Output

{
    "status": {
        "code": 10000,
        "description": "Success",
        "req_id": "412fdd20b78b41a1a4180e90bb2a2781"
    },
    "outputs": [{
        "status": {
            "code": 10000
        },
        "input": {
            "id": "5f288bb26ac3404680d5ce0c6e0faf82",
            "data": {
                "text": {
                    "url": "https://samples.clarifai.com/placeholder.gif"
                }
            }
        },
        "data": {
            "text": {
                "raw": "Use the built-in **`isinstance()`** function.\n\n### Basic syntax\n\n```python\nisinstance(object, classinfo)\n```\n\n- `object`: The variable/instance you want to check.\n- `classinfo`: A class, type, or tuple of classes/types.\n- Returns: `True` or `False`.\n\n### Examples\n\n```python\nclass Animal:\n    pass\n\nclass Dog(Animal):\n    pass\n\ndog = Dog()\n\n# Check against the exact class\nprint(isinstance(dog, Dog))      # True\n\n# Check against a parent class (works with inheritance)\nprint(isinstance(dog, Animal))   # True\n\n# Check against a built-in type\nprint(isinstance(dog, str))      # False\nprint(isinstance(\"hello\", str))  # True\n```\n\n### Checking against multiple classes\n\nPass a **tuple** of classes as the second argument:\n\n```python\nprint(isinstance(dog, (Dog, str, list)))  # True\n```\n\n### `isinstance()` vs. `type()`\n\nPrefer `isinstance()` over `type()` for type-checking because it respects **inheritance**:\n\n```python\nprint(type(dog) is Dog)      # True\nprint(type(dog) is Animal)   # False  ← type() doesn't consider subclasses\n\nprint(isinstance(dog, Animal))  # True  ← isinstance() does\n```\n\n### Summary\n\n| Task | Method |\n|------|--------|\n| Check if an object is an instance of a class (or subclass) | `isinstance(obj, MyClass)` |\n| Check against multiple types | `isinstance(obj, (A, B, C))` |\n| Check exact type only (ignores inheritance) | `type(obj) is MyClass` |\n\n**Tip:** In Python, explicit type-checking is sometimes discouraged in favor of \"duck typing\" (trying to use the object and catching exceptions), but `isinstance()` is the correct tool when you truly need it."
            },
            "string_value": "Use the built-in **`isinstance()`** function.\n\n### Basic syntax\n\n```python\nisinstance(object, classinfo)\n```\n\n- `object`: The variable/instance you want to check.\n- `classinfo`: A class, type, or tuple of classes/types.\n- Returns: `True` or `False`.\n\n### Examples\n\n```python\nclass Animal:\n    pass\n\nclass Dog(Animal):\n    pass\n\ndog = Dog()\n\n# Check against the exact class\nprint(isinstance(dog, Dog))      # True\n\n# Check against a parent class (works with inheritance)\nprint(isinstance(dog, Animal))   # True\n\n# Check against a built-in type\nprint(isinstance(dog, str))      # False\nprint(isinstance(\"hello\", str))  # True\n```\n\n### Checking against multiple classes\n\nPass a **tuple** of classes as the second argument:\n\n```python\nprint(isinstance(dog, (Dog, str, list)))  # True\n```\n\n### `isinstance()` vs. `type()`\n\nPrefer `isinstance()` over `type()` for type-checking because it respects **inheritance**:\n\n```python\nprint(type(dog) is Dog)      # True\nprint(type(dog) is Animal)   # False  ← type() doesn't consider subclasses\n\nprint(isinstance(dog, Animal))  # True  ← isinstance() does\n```\n\n### Summary\n\n| Task | Method |\n|------|--------|\n| Check if an object is an instance of a class (or subclass) | `isinstance(obj, MyClass)` |\n| Check against multiple types | `isinstance(obj, (A, B, C))` |\n| Check exact type only (ignores inheritance) | `type(obj) is MyClass` |\n\n**Tip:** In Python, explicit type-checking is sometimes discouraged in favor of \"duck typing\" (trying to use the object and catching exceptions), but `isinstance()` is the correct tool when you truly need it."
        },
        "prompt_tokens": 23,
        "completion_tokens": 768,
        "cached_tokens": 21
    }],
    "runner_selector": {
        "runner": {
            "worker": {
                "model": {
                    "id": "Kimi-K2_6",
                    "app_id": "",
                    "model_version": {
                        "id": "2280341feaf14301a1d7b3a52f0e3f29"
                    }
                }
            }
        },
        "deployment": {
            "id": "local-Kimi-K2_6",
            "user_id": "moonshotai",
            "nodepools": [{
                "id": "local-runner-nodepool",
                "compute_cluster": {
                    "id": "local-runner-compute-cluster",
                    "user_id": "moonshotai"
                }
            }]
        }
    }
}

When enabled, Clarifai caches the full prediction response for identical input + model + version combinations. Subsequent matching requests are served directly from cache without invoking model compute.

Note: Prediction caching is different from KV cache routing (prefix cache routing).

Prediction caching skips inference entirely by returning a previously computed response.

Prefix cache routing still performs inference, but accelerates it by reusing cached prompt state.
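
The difference is easiest to see in a toy version of prediction caching, where the cache key is derived from the input, model, and model version, and a hit never reaches the model. This is a simplified sketch, not Clarifai's implementation: `PredictCache`, its key format, and the `run_model` callback are hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class PredictCache:
    """Toy response cache keyed on model, version, and inputs."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    @staticmethod
    def key(model_id: str, version_id: str, inputs: Any) -> str:
        # Serialize deterministically so identical requests hash the same.
        payload = json.dumps(
            {"model": model_id, "version": version_id, "inputs": inputs},
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def predict(
        self,
        model_id: str,
        version_id: str,
        inputs: Any,
        run_model: Callable[[Any], Any],
    ) -> Any:
        k = self.key(model_id, version_id, inputs)
        if k in self._store:
            return self._store[k]      # cache hit: inference is skipped
        result = run_model(inputs)     # cache miss: run inference once
        self._store[k] = result
        return result
```

Changing any part of the key, such as the model version, produces a different hash and therefore a fresh inference call.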

Inference Cost Savings and Cache Hits

For requests using KV cache routing, if a request is routed to a replica that already contains cached prompt state, the cached portion of the prompt is billed at a 90% discount (10% of standard prompt token pricing).
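
As a worked example of the discount, the sketch below computes the prompt-side cost of one request. It assumes that `cached_tokens` counts a subset of `prompt_tokens` (consistent with the example outputs on this page, where 11 of 13 prompt tokens are cached); the function name and the per-token price are hypothetical.

```python
from decimal import Decimal

CACHED_RATE = Decimal("0.10")  # cached prompt tokens bill at 10% of standard


def prompt_cost(
    prompt_tokens: int, cached_tokens: int, price_per_token: Decimal
) -> Decimal:
    """Prompt-side cost with the cached portion discounted by 90%."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return (
        uncached * price_per_token
        + cached_tokens * price_per_token * CACHED_RATE
    )


# 13 prompt tokens, 11 of them cached, at a hypothetical price of 1 unit
# per token: 2 * 1 + 11 * 1 * 0.10 = 3.10 units instead of 13.
cost = prompt_cost(13, 11, Decimal("1"))
```

Completion tokens are outside the scope of this calculation; the discount described here applies only to the cached portion of the prompt.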

The Clarifai Python SDK automatically extracts cached_tokens from the model response metadata and forwards it to the output proto.

If the model response reports no cached tokens (cached_tokens is absent or zero), the field is omitted from the output. You can therefore treat the presence of cached_tokens as a reliable indicator that a cache hit occurred.
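
A small helper for checking this on a parsed JSON response might look like the following. It assumes the response has the shape of the example outputs on this page, with cached_tokens appearing on each entry of `outputs`; the helper names are hypothetical.

```python
from typing import Any, Dict


def cached_token_count(response: Dict[str, Any]) -> int:
    """Sum cached_tokens across all outputs; a missing field counts as 0."""
    total = 0
    for output in response.get("outputs", []):
        total += int(output.get("cached_tokens", 0) or 0)
    return total


def had_cache_hit(response: Dict[str, Any]) -> bool:
    return cached_token_count(response) > 0


# Trimmed-down response in the same shape as the example output.
sample = {
    "status": {"code": 10000},
    "outputs": [
        {"prompt_tokens": 13, "completion_tokens": 680, "cached_tokens": 11}
    ],
}
hit = had_cache_hit(sample)
```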

If you're unsure whether a specific model reports cached tokens, contact us for confirmation.
