MultiModal as Input

Learn how to perform inference with multimodal inputs using Clarifai Python SDK

Multi-modal inputs refer to feeding multiple types of data into a single model for processing and analysis. These data types, or modalities, can be diverse, such as text, images, audio, video, sensor data, or any other form of structured or unstructured data.

[Image,Text] to Text

Leverage the power of the Predict API to seamlessly process multimodal inputs and obtain accurate predictions. In this example, we demonstrate the capability to send both image and text inputs to a model, showcasing the versatility of the Predict API in handling diverse data types.

Python

from clarifai.client.model import Model
from clarifai.client.input import Inputs


# Your PAT (Personal Access Token) can be found in the Account's Security section
# Specify the correct user_id/app_id pairings
# Since you're making inferences outside your app's scope
#USER_ID = "openai"
#APP_ID = "chat-completion"

# You can set the model using model URL or model ID.
# Change these to whatever model you want to use
# eg : MODEL_ID = 'openai-gpt-4-vision'
# You can also set a particular model version by specifying the  version ID
# eg: MODEL_VERSION_ID = 'model_version'
#  Model class objects can be inititalised by providing its URL or also by defining respective user_id, app_id and model_id

# eg : model = Model(user_id="clarifai", app_id="main", model_id=MODEL_ID)

prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
model_url = "https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision"
inference_params = dict(temperature=0.2, max_tokens=100)
multi_inputs = Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)
# Predicts the model based on the given inputs.
model_prediction = Model(url=model_url, pat="YOUR_PAT").predict(
    inputs=[
        multi_inputs
    ],
    inference_params=inference_params,
)

print(model_prediction.outputs[0].data.text.raw)

Output

The time of day in the image appears to be either dawn or dusk, given the light in the sky. It's not possible to determine the exact time without additional context, but the sky has a mix of light and dark hues, which typically occurs during sunrise or sunset. The presence of snow and the lighting at the train station suggest that it might be winter, and depending on the location, this could influence whether it's morning or evening.

MultiModal as Input

[Image,Text] to Text​

[Image,Text] to Text