
Advanced Inference Options

Learn how to use our advanced inference operations


The advanced inference operations allow you to fine-tune how outputs are generated, giving you greater control to shape results according to your specific tasks and requirements.

info

Before using the Python SDK, CLI, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.

Perform Batch Predictions

You can process multiple inputs in a single request, streamlining the prediction workflow and saving both time and resources.

info

The batch size should not exceed 128. Learn more here.

from clarifai.client.input import Inputs
from clarifai.client.model import Model

model_url = "https://clarifai.com/openai/chat-completion/models/gpt-4o-mini"
prompt = "What's the future of AI?"

# Here is an example of creating an input proto list of size 16
proto_list = []
for i in range(16):
    proto_list.append(Inputs.get_input_from_bytes(input_id=f"demo_{i}", text_bytes=prompt.encode()))

# Pass the input proto list as a parameter to the predict function
model_prediction = Model(url=model_url, pat="YOUR_PAT").predict(proto_list)

# Check the length of the predictions to confirm that all inputs were processed successfully
print(len(model_prediction.outputs))

Customize the Base URL

You can obtain model predictions by customizing the base_url. This allows you to easily adapt your endpoint to different environments, providing a flexible and seamless way to access model predictions.

info

This feature is particularly useful for enterprises using on-premises deployments, allowing the base_url to be configured to point to their respective servers.

from clarifai.client.model import Model

# Your PAT (Personal Access Token) can be found in the Account's Security section
# Specify the correct user_id/app_id pairings
# Since you're making inferences outside your app's scope
#USER_ID = "cohere"
#APP_ID = "embed"

# You can set the model using model URL or model ID.
# Change these to whatever model you want to use
# eg : MODEL_ID = 'cohere-embed-english-v3_0'
# You can also set a particular model version by specifying the version ID
# eg: MODEL_VERSION_ID = 'model_version'
# Model class objects can be initialized by providing the model URL, or by specifying the respective user_id, app_id, and model_id

# eg : model = Model(user_id="clarifai", app_id="main", model_id=MODEL_ID)

input_text = """In India Green Revolution commenced in the early 1960s that led to an increase in food grain production, especially in Punjab, Haryana, and Uttar Pradesh. Major milestones in this undertaking were the development of high-yielding varieties of wheat. The Green revolution is revolutionary in character due to the introduction of new technology, new ideas, the new application of inputs like HYV seeds, fertilizers, irrigation water, pesticides, etc. As all these were brought suddenly and spread quickly to attain dramatic results thus it is termed as a revolution in green agriculture.
"""
# The predict API gives you the flexibility to generate predictions for data provided in URL, filepath, or bytes format.

# Example for prediction through URL:
# model_prediction = Model(model_url).predict_by_url(URL, input_type="text")

# Example for prediction through Filepath:
# model_prediction = Model(model_url).predict_by_filepath(filepath, input_type="text")

model_url = "https://clarifai.com/cohere/embed/models/cohere-embed-english-v3_0"

# You can pass the new base_url as a parameter while initializing the Model object
model_prediction = Model(url=model_url, pat="YOUR_PAT", base_url="NEW_BASE_URL").predict_by_bytes(
    input_text.encode(), input_type="text"
)

embeddings = model_prediction.outputs[0].data.embeddings[0].vector

num_dimensions = model_prediction.outputs[0].data.embeddings[0].num_dimensions

print(embeddings[:10])

Add Root Certificate

A root certificate provides an additional layer of security when communicating through APIs. As a self-signed certificate that verifies the legitimacy of other certificates, it establishes a chain of trust — ensuring that you are connecting to authentic APIs and that your data remains encrypted.

You can add your own root certificates to further strengthen data security and protect user privacy.

from clarifai.client.model import Model

# Your PAT (Personal Access Token) can be found in the Account's Security section
# Specify the correct user_id/app_id pairings
# Since you're making inferences outside your app's scope
#USER_ID = "clarifai"
#APP_ID = "main"

# You can set the model using model URL or model ID.
# Change these to whatever model you want to use
# eg : MODEL_ID = "general-image-recognition"
# You can also set a particular model version by specifying the version ID
# eg: MODEL_VERSION_ID = "aa7f35c01e0642fda5cf400f543e7c40"
# Model class objects can be initialized by providing the model URL, or by specifying the respective user_id, app_id, and model_id

# eg : model = Model(user_id="clarifai", app_id="main", model_id=MODEL_ID)

model_url = "https://clarifai.com/clarifai/main/models/general-image-recognition"
image_url = "https://samples.clarifai.com/metro-north.jpg"

# The predict API gives you the flexibility to generate predictions for data provided in URL, filepath, or bytes format.

# Example for prediction through Bytes:
# model_prediction = model.predict_by_bytes(input_bytes, input_type="image")

# Example for prediction through Filepath:
# model_prediction = Model(model_url).predict_by_filepath(filepath, input_type="image")

model_prediction = Model(url=model_url, pat="YOUR_PAT", root_certificates_path="PATH_TO_ROOT_CERTIFICATE").predict_by_url(
    image_url, input_type="image"
)

# Get the output
print(model_prediction.outputs[0].data)

Prompt Types

A prompt is a piece of text or set of instructions that you provide to generative AI models, such as Large Language Models (LLMs), to generate a specific response or action.

Generative AI models are a type of artificial intelligence system that are designed to create new content, such as text, images, audio, or even videos, based on patterns learned from existing data.

There are several prompting techniques you can use to communicate with generative AI models. For example, zero-shot prompting leverages a model’s inherent language understanding capabilities to generate responses without any specific preparation or examples.

You can learn about other prompting techniques here.

Here are some examples of prompts.

Question Answering

Prompt: 
Who was president of the United States in 1955?

Grammar Correction

Prompt: 
Correct this to standard English: She no went to the market.

Sample Response:
She did not go to the market.

Summarize

Prompt: 
Summarize this: Jupiter is the fifth planet from the Sun and the largest in the Solar System. It is a
gas giant with a mass one-thousandth that of the Sun, but two-and-a-half times
that of all the other planets in the Solar System combined. Jupiter is one of the
brightest objects visible to the naked eye in the night sky, and has been known to
ancient civilizations since before recorded history. It is named after the Roman
god Jupiter.[19] When viewed from Earth, Jupiter can be bright enough for its
reflected light to cast visible shadows,[20] and is on average the third-brightest
natural object in the night sky after the Moon and Venus.

Sample Response:
Jupiter is the fifth planet from the Sun and is very big and bright. It can be seen
with our eyes in the night sky and it has been known since ancient times. Its
name comes from the Roman god Jupiter. It is usually the third brightest object in
the night sky after the Moon and Venus.

Translation

Prompt: 
Translate this into 1. French, 2. Spanish, and 3. Japanese: What rooms do you have available?

Sample Response:
1. Quels sont les chambres que vous avez disponibles?
2. ¿Qué habitaciones tienes disponibles?
3. どの部屋が利用可能ですか?

tip

Click here for more prompting examples.

Types of Inference Parameters

When making predictions using the models on our platform, some of them offer the ability to specify various inference parameters to influence their output.

These parameters control the behavior of the model during the prediction process, affecting aspects like creativity, coherence, and the diversity of the output.

Let’s talk about them.

Max Tokens (or Max Length)

Max Tokens specifies the maximum number of tokens (roughly words or pieces of words) the model is allowed to generate in a single response. It limits the length of the output, preventing the model from generating overly long responses. As such, lower token limits result in faster responses.

This inference parameter helps in controlling the verbosity of the output, especially in applications where concise responses are required.

Here is a usage example:

inference_params = dict(max_tokens=100)
Model(model_url).predict(inputs, inference_params=inference_params)

Minimum Prediction Value

The min_value parameter specifies the minimum prediction confidence required to include a result in the output. For example, if you only want to see concepts with a probability score of 0.95 or higher, this parameter lets you accomplish that.

Also note that if you don't specify max_concepts, you will only see the top 20 results. If your result can contain more values, you will need to increase the maximum number of concepts as well.

Here is a usage example:

output_config = dict(min_value=0.6)
Model(model_url).predict(inputs, output_config=output_config)

Maximum Concepts

The max_concepts parameter specifies how many concepts and their associated probability scores the Predict endpoint should return. If not set, the endpoint defaults to returning the top 20 concepts.

You can currently set max_concepts to any value between 1 and 200.

If your use case requires more than 200 concepts, please reach out to our support team for assistance.

output_config = dict(max_concepts=3)
Model(model_url).predict(inputs, output_config=output_config)

Select Concepts

The select_concepts parameter specifies the concepts to include in the prediction results. By adding this parameter to your predict calls, you can receive prediction values for only the concepts you want. You can specify particular concepts by their id, their name, or both.

Concept names and ids are case sensitive, so they must be exact matches.

Note: You can use the GetModelOutputInfo endpoint to retrieve an entire list of concepts from a given model, and get their ids and names.

caution

If you submit a request with a concept id or name that is not an exact match, you will receive an invalid model argument error. However, if some of the requested concepts match while others do not, the API will respond with a Mixed Success.

output_config = dict(select_concepts=["concept_name"])
Model(model_url).predict(inputs, output_config=output_config)

Temperature

Temperature is a decimal number (between 0 and 1) that controls the degree of randomness in the response.

A low temperature (e.g., 0.2) makes the model more deterministic, leading to a more conservative and predictable output. A high temperature (e.g., 0.8) increases the randomness, allowing for more creative and varied responses.

Adjusting temperature is useful when you want to balance between creative responses and focused, precise answers.

Here is a usage example:

inference_params = dict(temperature=0.2)
Model(model_url).predict(inputs, inference_params=inference_params)

Top-p (Nucleus)

Top-p sampling is an alternative to temperature sampling that controls output diversity by considering the smallest set of tokens whose cumulative probability is greater than or equal to a specified threshold p (e.g., 0.9).

Rather than restricting the model to a fixed number of top tokens, this method dynamically adjusts the selection based on token probabilities, ensuring that the most likely tokens are always included while maintaining flexibility in the number of tokens considered.

It’s useful when you want to dynamically control the diversity of the generated output without setting a fixed limit on the number of tokens.
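Here is a usage example. The top_p parameter name is an assumption here; check the model's listing for the exact name and supported range it accepts.

inference_params = dict(top_p=0.9)  # assumed parameter name; confirm in the model's listing
Model(model_url).predict(inputs, inference_params=inference_params)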

Top-k

Top-k sampling limits the model to only consider the top k most probable tokens when generating the next word, ignoring all others.

A low k (e.g., 10) reduces diversity by restricting the choice of tokens, leading to more focused outputs. A high k increases diversity by allowing a broader range of possible tokens, leading to more varied outputs.

It’s useful when you want to prevent the model from choosing rare, less likely words, but still allow for some diversity.
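Here is a usage example, assuming the model accepts a top_k inference parameter (the exact name and allowed range can vary between models):

inference_params = dict(top_k=40)  # assumed parameter name; confirm in the model's listing
Model(model_url).predict(inputs, inference_params=inference_params)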

Reasoning Effort

The Reasoning Effort parameter controls how much internal reasoning the model performs before generating a response.

You can set it to:

  • Low – Prioritizes faster responses and minimal token usage.

  • Medium – Strikes a balance between response time and depth of reasoning.

  • High – Emphasizes thorough reasoning, which may lead to slower but more detailed answers.

You can adjust this setting based on your needs — whether you value speed, detail, or a balance of both.
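Here is a sketch of how this might be set, assuming the model exposes a reasoning_effort inference parameter that accepts "low", "medium", or "high"; the exact name and accepted values depend on the model:

inference_params = dict(reasoning_effort="medium")  # assumed parameter name and values
Model(model_url).predict(inputs, inference_params=inference_params)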

Number of Beams

The Number of Beams inference parameter is integral to a method called beam search. Beam search is a search algorithm that keeps track of the top n (beam width) sequences at each step of generation, considering multiple possibilities before selecting the best one.

It helps produce more coherent and optimized outputs by exploring multiple potential sequences. This parameter is particularly useful in tasks where the quality and coherence of the entire sequence are crucial, such as translation or summarization.
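Here is a sketch, assuming the model exposes the beam width as a num_beams inference parameter (some models use a different name, so check the model's listing):

inference_params = dict(num_beams=4)  # assumed parameter name; beam width of 4
Model(model_url).predict(inputs, inference_params=inference_params)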

Do Sample

This parameter determines whether the model should sample from the probability distribution of the next token or select the token with the highest probability.

If set to true, the model samples from the probability distribution, introducing randomness and allowing for more creative and diverse outputs. If set to false, the model selects the token with the highest probability, leading to more deterministic and predictable responses.

Sampling is typically enabled (set to true) when you want the model to generate varied and creative text. When precision and consistency are more important, sampling may be disabled (set to false).
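Here is a sketch, assuming the model accepts a do_sample inference parameter:

inference_params = dict(do_sample=True)  # assumed parameter name; enables sampling
Model(model_url).predict(inputs, inference_params=inference_params)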

Return Full Text

This parameter determines whether the entire generated text should be returned or just a portion of it.

If set to true, the model returns the full text, including both the prompt (if provided) and the generated continuation. If set to false, the model returns only the newly generated text, excluding the prompt.

It’s useful when you need the complete context, including the prompt, in the output. This can be important for understanding the generated response in the context of the input.
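Here is a sketch, assuming the model accepts a return_full_text inference parameter:

inference_params = dict(return_full_text=True)  # assumed parameter name; include the prompt in the output
Model(model_url).predict(inputs, inference_params=inference_params)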

System Prompt

A system prompt is a special input prompt provided to guide the model's behavior throughout the conversation or task. It sets the tone, style, or context for the model’s responses.

It influences how the model generates responses by setting expectations or providing instructions that the model follows.

It’s often used in conversational AI to define the role the model should play (e.g., a helpful assistant, a friendly chatbot) or in specialized tasks where specific behavior or output style is desired.

It helps steer the model's responses in a consistent and contextually appropriate direction.
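Here is a usage example, assuming the model accepts a system_prompt inference parameter (many chat-style LLMs do, but the exact name may vary):

inference_params = dict(system_prompt="You are a concise, helpful assistant.")  # assumed parameter name
Model(model_url).predict(inputs, inference_params=inference_params)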

Prompt Template

A prompt template serves as a pre-configured piece of text used to instruct an LLM. It acts as a structured query or input that guides the model in generating the desired response. You can use a template to tailor your prompts for different use cases.

Many LLMs require prompts to follow a specific template format. To streamline this process, we provide the prompt_template inference parameter, which automatically applies the correct template format for the LLM. This means that you do not need to manually format your prompts when using an LLM through our UI or SDK.

By default, the prompt_template is set to the LLM's standard template, allowing you to simply enter your prompts without worrying about formatting. The prompts will be automatically adjusted to fit the required template.

If you need more flexibility, you can customize the prompt_template parameter. When modifying this variable, make sure it includes the placeholder {prompt}, which will be replaced with the user's prompt input.

For example, the Openchat-3.6-8b model supports the following chat template format:

GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:

Let’s break down the meaning of the template:

  • GPT4 Correct User: — This delimiter indicates the start of a user's input.
  • {prompt} — This placeholder will be replaced by the actual input or question from the user. It must be included in the prompt template. It works just like the prompter node in a workflow builder, which must contain the {data.raw.text} substring. When your text data is provided at inference time, all occurrences of the {prompt} variable within the template will be replaced with the prompt text.
  • <|end_of_turn|> — This delimiter indicates the end of a user's input.
  • GPT4 Correct Assistant: — This indicates the start of the assistant's (or the language model's) response, which should be a corrected or refined version of the user's input or an appropriate answer to the user's question.

You can also add the <|start_of_turn|> delimiter, which specifically indicates the start of a turn; in this case, a user’s input.

Here is an example:

GPT4 Correct User: <|start_of_turn|> {prompt}<|end_of_turn|>GPT4 Correct Assistant:

Another example is the Llama 3.1-8b-Instruct model, which supports the following chat template format:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The main purpose of this format is to clearly delineate the roles and contributions of different participants in the conversation: system, user, and assistant.

Let’s break down its meaning:

  • <|begin_of_text|> — This delimiter marks the beginning of the text content.
  • <|start_header_id|>system<|end_header_id|> — This indicates the beginning of a system-level instruction or context.
  • {system_prompt} — This placeholder is for the actual system-level instruction or context.
  • <|eot_id|> — This indicates the end of a text unit; in this case, the system prompt.
  • <|start_header_id|>user<|end_header_id|> — This marks the beginning of a user's input.
  • {prompt} — As earlier described, this placeholder represents the actual prompt or query from the user.
  • <|eot_id|> — This marks the end of a text unit; in this case, the user's input.
  • <|start_header_id|>assistant<|end_header_id|> — This indicates the beginning of the assistant's response.
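As a usage sketch, you could pass a custom template through the prompt_template inference parameter. The template string below reuses the Openchat format shown earlier and is only illustrative; the model you call may expect a different format:

# The template must contain the {prompt} placeholder
custom_template = "GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
inference_params = dict(prompt_template=custom_template)
Model(model_url).predict(inputs, inference_params=inference_params)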

Predict By Model Version ID

Every time you train a custom model, it creates a new model version. By specifying version_id in your predict call, you can continue to predict on a previous version, for consistent prediction results. Clarifai also updates its pre-built models on a regular basis.

If you are looking for consistent results from your predict calls, use version_id. If the model version_id is not specified, predict will default to the most current model.

Below is an example of how you would set a model version ID and receive predictions from Clarifai's general-image-recognition model.

#######################################################################################################
# In this section, we set the user authentication, user and app ID, model ID, model version ID, and
# URL of the image we want as an input. Change these strings to run your own example.
#######################################################################################################

# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = 'YOUR_PAT_HERE'
# Specify the correct user_id/app_id pairings
# Since you're making inferences outside your app's scope
USER_ID = 'clarifai'
APP_ID = 'main'
# Change these to whatever you want to process
MODEL_ID = 'general-image-recognition'
MODEL_VERSION_ID = 'aa7f35c01e0642fda5cf400f543e7c40'
IMAGE_URL = 'https://samples.clarifai.com/metro-north.jpg'

############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################

from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_code_pb2

channel = ClarifaiChannel.get_grpc_channel()
stub = service_pb2_grpc.V2Stub(channel)

metadata = (('authorization', 'Key ' + PAT),)

userDataObject = resources_pb2.UserAppIDSet(user_id=USER_ID, app_id=APP_ID)

post_model_outputs_response = stub.PostModelOutputs(
    service_pb2.PostModelOutputsRequest(
        user_app_id=userDataObject,  # The userDataObject is created in the overview and is required when using a PAT
        model_id=MODEL_ID,
        version_id=MODEL_VERSION_ID,
        inputs=[
            resources_pb2.Input(
                data=resources_pb2.Data(
                    image=resources_pb2.Image(
                        url=IMAGE_URL
                    )
                )
            )
        ]
    ),
    metadata=metadata
)
if post_model_outputs_response.status.code != status_code_pb2.SUCCESS:
    print(post_model_outputs_response.status)
    raise Exception("Post model outputs failed, status: " + post_model_outputs_response.status.description)

# Since we have one input, one output will exist here
output = post_model_outputs_response.outputs[0]

print("Predicted concepts:")
for concept in output.data.concepts:
    print("%s %.2f" % (concept.name, concept.value))

# Uncomment this line to print the raw output
#print(output)

Output Example
Predicted concepts:
train 1.00
railway 1.00
subway system 1.00
station 1.00
locomotive 1.00
transportation system 1.00
travel 0.99
commuter 0.98
platform 0.98
light 0.97
train station 0.97
blur 0.97
city 0.96
road 0.96
urban 0.96
traffic 0.96
street 0.95
public 0.93
tramway 0.93
business 0.93