MultiModal as Input
Learn how to perform inference with multimodal inputs using Clarifai SDKs
Multimodal inputs combine more than one type of data in a single request to one model. These data types, or modalities, can include text, images, audio, video, sensor data, or any other structured or unstructured data.
[Image,Text] to Text
Use the Predict API to send an image and a text prompt together as one input and receive a text response. The examples in this section pass both modalities to a vision-capable chat model and print the text it returns.
Predict Via Image URL
- Python
- Typescript
from clarifai.client.model import Model
from clarifai.client.input import Inputs
# Your PAT (Personal Access Token) can be found in the Security section of your account.
# Because the model lives outside your own app, specify the user_id/app_id of its owner:
# USER_ID = "openai"
# APP_ID = "chat-completion"
# You can select the model by its URL or by its model ID.
# Change these to whatever model you want to use.
# e.g. MODEL_ID = "openai-gpt-4-vision"
# You can also pin a particular model version by specifying its version ID.
# e.g. MODEL_VERSION_ID = "model_version"
# A Model object can be initialized from its URL, or from its user_id, app_id, and model_id.
# e.g. model = Model(user_id="clarifai", app_id="main", model_id=MODEL_ID)
prompt = "What time of day is it?"
image_url = "https://samples.clarifai.com/metro-north.jpg"
model_url = "https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision"
inference_params = dict(temperature=0.2, max_tokens=100)
multi_inputs = Inputs.get_multimodal_input(input_id="", image_url=image_url, raw_text=prompt)
# Run the prediction on the combined image and text input.
model_prediction = Model(url=model_url, pat="YOUR_PAT").predict(
    inputs=[
        multi_inputs
    ],
    inference_params=inference_params,
)
print(model_prediction.outputs[0].data.text.raw)
Output
The time of day in the image appears to be either dawn or dusk, given the light in the sky. It's not possible to determine the exact time without additional context, but the sky has a mix of light and dark hues, which typically occurs during sunrise or sunset. The presence of snow and the lighting at the train station suggest that it might be winter, and depending on the location, this could influence whether it's morning or evening.
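The comments in the Python example note that a model can be addressed either by its URL or by its `user_id`, `app_id`, and `model_id`. As a rough sketch of how the two forms relate, the hypothetical helper below composes a URL of the same shape as `model_url` above. `build_model_url` is an illustrative name and not part of the Clarifai SDK, and the `/versions/<id>` suffix for a pinned version is an assumption about the public URL layout that should be checked against the model page.

```python
from typing import Optional


def build_model_url(user_id: str, app_id: str, model_id: str,
                    model_version_id: Optional[str] = None) -> str:
    """Compose a model URL from its identifying parts (hypothetical helper)."""
    for name, value in (("user_id", user_id), ("app_id", app_id), ("model_id", model_id)):
        if not value:
            raise ValueError(f"{name} must be a non-empty string")
    url = f"https://clarifai.com/{user_id}/{app_id}/models/{model_id}"
    # Assumption: a specific version is addressed with a /versions/<id> suffix.
    if model_version_id:
        url += f"/versions/{model_version_id}"
    return url


# Same URL as the one hard-coded in the example above.
model_url = build_model_url("openai", "chat-completion", "openai-gpt-4-vision")
print(model_url)
```

The resulting string can be passed as `Model(url=model_url, ...)`, exactly as in the example.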
import { Model, Input } from "clarifai-nodejs";
/**
  Your PAT (Personal Access Token) can be found in the Security section of your account.
  Because the model lives outside your own app, specify the userId/appId of its owner:
    USER_ID = "openai"
    APP_ID = "chat-completion"
  You can select the model by its URL or by its model ID.
  Change these to whatever model you want to use.
    e.g. MODEL_ID = "openai-gpt-4-vision"
  You can also pin a particular model version by specifying its version ID.
    e.g. MODEL_VERSION_ID = "model_version"
  A Model object can be initialised from its URL, or from its userId, appId, and modelId.
  e.g.
    const model = new Model({
      authConfig: {
        userId: "clarifai",
        appId: "main",
        pat: process.env.CLARIFAI_PAT,
      },
      modelId: MODEL_ID,
    });
*/
const prompt = "What time of day is it?";
const imageUrl = "https://samples.clarifai.com/metro-north.jpg";
const modelUrl =
  "https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision";
const inferenceParams = { temperature: 0.2, maxTokens: 100 };
const multiInputs = Input.getMultimodalInput({
  inputId: "",
  imageUrl,
  rawText: prompt,
});
/*
  The Predict API can generate predictions from data supplied as a URL, a file path, or bytes.
  Example for prediction through bytes:
    const modelPrediction = await model.predictByBytes({
      inputBytes: Bytes,
      inputType: "image"
    });
  Example for prediction through a file path:
    const modelPrediction = await model.predictByFilepath({
      filepath,
      inputType: "image",
    });
*/
const model = new Model({
  url: modelUrl,
  authConfig: { pat: process.env.CLARIFAI_PAT },
});
const modelPrediction = await model.predict({
  inputs: [multiInputs],
  inferenceParams,
});
console.log(modelPrediction?.[0]?.data?.text?.raw);
Predict Via Local Image
- Python
- Typescript
from clarifai.client.model import Model
from clarifai.client.input import Inputs
IMAGE_FILE_LOCATION = "LOCAL IMAGE PATH"
with open(IMAGE_FILE_LOCATION, "rb") as f:
    file_bytes = f.read()
prompt = "What time of day is it?"
inference_params = dict(temperature=0.2, max_tokens=100)
multi_inputs = Inputs.get_multimodal_input(input_id="", image_bytes=file_bytes, raw_text=prompt)
# No pat is passed here, so the SDK falls back to the CLARIFAI_PAT environment variable.
model_prediction = Model("https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision").predict(
    inputs=[multi_inputs],
    inference_params=inference_params,
)
print(model_prediction.outputs[0].data.text.raw)
Output
The time of day in the image appears to be either dawn or dusk, given the light in the sky. It's not possible to determine the exact time without additional context, but the sky has a mix of light and dark hues, which typically occurs during sunrise or sunset. The presence of snow and the lighting at the train station suggest that it might be winter, and depending on the location, this could influence whether it's morning or evening.
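The local-image Python example constructs `Model` without a `pat` argument, which relies on the `CLARIFAI_PAT` environment variable being set. A minimal sketch of making that dependency explicit is shown below; `resolve_pat` is a hypothetical helper (not an SDK function) that prefers an explicitly supplied token and otherwise reads the environment, failing early with a clear message instead of at request time.

```python
import os


def resolve_pat(explicit_pat=None, env=os.environ):
    """Return the PAT to use: the explicit argument first, then CLARIFAI_PAT."""
    pat = explicit_pat or env.get("CLARIFAI_PAT")
    if not pat:
        raise RuntimeError(
            "No PAT found: pass one explicitly or set the CLARIFAI_PAT environment variable"
        )
    return pat


# Intended use with the SDK (left commented so the sketch runs without network access):
# from clarifai.client.model import Model
# model = Model(url=model_url, pat=resolve_pat())
```

Reading the token from the environment also keeps it out of source files, which is generally safer than hard-coding `"YOUR_PAT"`.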
import { Model, Input } from "clarifai-nodejs";
import * as fs from "fs";
const IMAGE_FILE_LOCATION = "LOCAL IMAGE PATH";
// Read the local image into a Buffer
const imageBytes = fs.readFileSync(IMAGE_FILE_LOCATION);
const prompt = "What time of day is it?";
const inferenceParams = { temperature: 0.2, maxTokens: 100 };
const multiInputs = Input.getMultimodalInput({
  inputId: "",
  imageBytes,
  rawText: prompt,
});
const model = new Model({
  url: "https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision",
  authConfig: { pat: process.env.CLARIFAI_PAT },
});
const modelPrediction = await model.predict({
  inputs: [multiInputs],
  inferenceParams,
});
console.log(modelPrediction?.[0]?.data?.text?.raw);
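The URL and local-image variants differ only in which keyword is passed to `Inputs.get_multimodal_input`: `image_url` in the first case and `image_bytes` in the second. As an illustrative sketch, the hypothetical helper below (`image_kwargs` is not part of the SDK) picks the right keyword from a single `source` value, so one code path can serve both cases. It assumes that anything that is not an http(s) URL or a bytes object is a local file path.

```python
from pathlib import Path
from urllib.parse import urlparse


def image_kwargs(source):
    """Map an image source to the keyword expected by get_multimodal_input.

    Returns {"image_url": ...} for http(s) URLs and {"image_bytes": ...}
    for raw bytes or for a local file path (the file is read from disk).
    """
    if isinstance(source, (bytes, bytearray)):
        return {"image_bytes": bytes(source)}
    text = str(source)
    if urlparse(text).scheme in ("http", "https"):
        return {"image_url": text}
    # Assumption: any other string or Path refers to a local file.
    return {"image_bytes": Path(text).read_bytes()}


# Intended use with the SDK (commented out so the sketch runs offline):
# from clarifai.client.input import Inputs
# multi_inputs = Inputs.get_multimodal_input(
#     input_id="", raw_text="What time of day is it?", **image_kwargs(source)
# )
```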