Multimodal-to-Text
Make multimodal-to-text predictions
Input: Text, images, etc.
Output: Text
Multimodal-to-text models allow you to generate textual descriptions or responses from multimodal inputs. "Multimodal" refers to the integration of information from multiple modalities, such as text, images, and other types of data.
For example, a multimodal-to-text model might take a combination of text and images as input and generate a description that captures the content of both modalities, producing a coherent, human-like response that draws on multiple types of information.
Vision Language Models (VLMs), also called Large Vision Language Models (LVLMs), integrate capabilities from both Large Language Models (LLMs) and vision models. These models, such as GPT-4 Vision and Qwen-VL-Chat, are designed to understand both text and visual content, enabling you to perform tasks that require interpreting multimodal information.
You can use LVLMs to process and understand data from different modalities, such as text and images. They can generate coherent textual outputs for your natural language processing needs.
The initialization code used in the following examples is outlined in detail on the client installation page.
You can use hyperparameters to customize the behavior of the models. Optionally, you can also provide an API key from a third-party model provider instead of relying on the default Clarifai keys.
Below is an example of how you would make a multimodal-to-text prediction using the GPT-4 Vision model.
- Python
- JavaScript (REST)
- NodeJS
- Java
- PHP
- cURL
##############################################################################################################
# In this section, we set the user authentication, user and app ID, model details, and input details.
# Change these values to run your own example.
##############################################################################################################
# Your PAT (Personal Access Token) can be found in the Account's Security section
PAT = "YOUR_PAT_HERE"
# Specify the correct user_id/app_id pairings
# Since you're making inferences outside your app's scope
USER_ID = "openai"
APP_ID = "chat-completion"
# Change these to whatever model and inputs you want to use
MODEL_ID = "openai-gpt-4-vision"
MODEL_VERSION_ID = "266df29bc09843e0aee9b7bf723c03c2"
RAW_TEXT = "Write a caption for the image"
# To use a hosted text file, assign the URL variable
# TEXT_FILE_URL = "https://samples.clarifai.com/negative_sentence_12.txt"
# Or, to use a local text file, assign the location variable
# TEXT_FILE_LOCATION = "YOUR_TEXT_FILE_LOCATION_HERE"
IMAGE_URL = "https://samples.clarifai.com/metro-north.jpg"
# Or, to use a local image file, assign the location variable
# IMAGE_FILE_LOCATION = "YOUR_IMAGE_FILE_LOCATION_HERE"
############################################################################
# YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
############################################################################
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_code_pb2
from google.protobuf.struct_pb2 import Struct
channel = ClarifaiChannel.get_grpc_channel()
stub = service_pb2_grpc.V2Stub(channel)
metadata = (("authorization", "Key " + PAT),)
userDataObject = resources_pb2.UserAppIDSet(user_id=USER_ID, app_id=APP_ID)
# To use a local text file, uncomment the following lines
# with open(TEXT_FILE_LOCATION, "rb") as f:
#     file_bytes = f.read()
# To use a local image file, uncomment the following lines
# with open(IMAGE_FILE_LOCATION, "rb") as f:
#     image_bytes = f.read()

params = Struct()
params.update(
    {
        "temperature": 0.5,
        "max_tokens": 2048,
        "top_p": 0.95,
        # "api_key": "ADD_THIRD_PARTY_KEY_HERE"
    }
)

post_model_outputs_response = stub.PostModelOutputs(
    service_pb2.PostModelOutputsRequest(
        user_app_id=userDataObject,  # The userDataObject is created in the overview and is required when using a PAT
        model_id=MODEL_ID,
        version_id=MODEL_VERSION_ID,  # This is optional. Defaults to the latest model version
        inputs=[
            resources_pb2.Input(
                data=resources_pb2.Data(
                    text=resources_pb2.Text(
                        raw=RAW_TEXT
                        # url=TEXT_FILE_URL
                        # raw=file_bytes
                    ),
                    image=resources_pb2.Image(
                        url=IMAGE_URL
                        # base64=image_bytes
                    ),
                )
            )
        ],
        model=resources_pb2.Model(
            model_version=resources_pb2.ModelVersion(
                output_info=resources_pb2.OutputInfo(params=params)
            )
        ),
    ),
    metadata=metadata,
)

if post_model_outputs_response.status.code != status_code_pb2.SUCCESS:
    print(post_model_outputs_response.status)
    raise Exception(f"Post model outputs failed, status: {post_model_outputs_response.status.description}")

# Since we have one input, one output will exist here
output = post_model_outputs_response.outputs[0]
print(output.data.text.raw)
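
If you prefer a higher-level interface than raw gRPC, the Clarifai Python SDK (the `clarifai` package) can make the same prediction in a few lines. The snippet below is a minimal sketch, assuming the SDK exposes `Model.predict` and `Inputs.get_multimodal_input` as in recent releases; consult the SDK reference for the exact signatures.

```python
# A minimal sketch using the Clarifai Python SDK (pip install clarifai).
# Assumes Model.predict and Inputs.get_multimodal_input behave as in recent SDK releases.
from clarifai.client.model import Model
from clarifai.client.input import Inputs

PAT = "YOUR_PAT_HERE"
MODEL_URL = "https://clarifai.com/openai/chat-completion/models/openai-gpt-4-vision"

# The same hyperparameters shown in the gRPC example above
inference_params = dict(temperature=0.5, max_tokens=2048, top_p=0.95)

model_prediction = Model(url=MODEL_URL, pat=PAT).predict(
    inputs=[
        Inputs.get_multimodal_input(
            input_id="",
            image_url="https://samples.clarifai.com/metro-north.jpg",
            raw_text="Write a caption for the image",
        )
    ],
    inference_params=inference_params,
)

# Since we have one input, one output will exist here
print(model_prediction.outputs[0].data.text.raw)
```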
<!-- index.html file -->
<script>
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// In this section, we set the user authentication, user and app ID, model details, and input details.
// Change these values to run your own example.
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Your PAT (Personal Access Token) can be found in the Account's Security section
const PAT = "YOUR_PAT_HERE";
// Specify the correct user_id/app_id pairings
// Since you're making inferences outside your app's scope
const USER_ID = "openai";
const APP_ID = "chat-completion";
// Change these to whatever model and inputs you want to use
const MODEL_ID = "openai-gpt-4-vision";
const MODEL_VERSION_ID = "266df29bc09843e0aee9b7bf723c03c2";
const RAW_TEXT = "Write a caption for the image";
// To use a hosted text file, assign the URL variable
// const TEXT_FILE_URL = "https://samples.clarifai.com/negative_sentence_12.txt";
const IMAGE_URL = "https://samples.clarifai.com/metro-north.jpg";
// To use image bytes, assign its variable
// const IMAGE_BYTES_STRING = "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQUGBwwIBwcHBw8LCwkMEQ8SEhEPERETFhwXExQaFRERGCEYGh0dHx8fExciJCIeJBweHx7/2wBDAQUFBQcGBw4ICA4eFBEUHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh4eHh7/wAARCAAoACgDASIAAhEBAxEB/8QAGQAAAgMBAAAAAAAAAAAAAAAAAAYDBQcE/8QAMBAAAQMDAwMDAgQHAAAAAAAAAQIDBAAFEQYSIQcTMTJBURRhCBYikSNScXKhsdH/xAAZAQACAwEAAAAAAAAAAAAAAAAFBgIDBAf/xAAtEQABAwMBBgQHAQAAAAAAAAABAgMRAAQhMQUSE0FRYQaBocEUFiJCcrHR8P/aAAwDAQACEQMRAD8A3+RYY1unSYzCS0ttZUkAgktn0q5yT7jPyDUC4wdGwycH5U2Kt9ZQ7VI1qw5PkvQy3CSVPpf7aQjuKyFH25xzn3pHn3TVNy01Hl2hyy6YdkSpKsS9sl/6RlI3rRu3dxWd6spwnAGPIJTfl925fcLaoSDHXvyo6i9SlCQrU9wKln3OyWiaDN1RAbW3kKbSd7gPtwMkH/tTWy9afuy1iPfnXMAblITwkE4yf08cn3pSbYt1uts24XH6fUbiLAuY1MWyGkLEmUW0rcCRvUpQ5CtwKQCPgi4S1ZbDe4sd9NntDEe79m3uOBLTr0IR9jzodSMqUpTu9JJ8owD7UTT4ZCfv9PbP7860m+s+HBSrejWRuz2kAxoesGYxTW/Zlpkwo1vkuSly3UgKWQUhHJUvIHsAaKTemF8XE6sWmxyZkiaZrMh1jv8ArQNpUVqB8FW0njHqx4zRVVhsph1KlKk5xQ+7uHmikaSJrQerMByet2IwvtuTLa4xv2k7Rk84H9x/esHv92d01boenLXGcuiWrFIhLlpbcaQ2/JdK3VJCkAq2pAR7Zz7YxWudY9fxNIdQbNGkR5TyX4aisNNpUMFZAzkj4NK0jq9ZpbLr0PSlzkhrlZDaQlP3P8Q4/ap3F87bPucJEkx/hHv60b2TYXLrKN5sramYECSQRk9M6c6zmJ+eb5Hi22M7cnWGIQgFLbX0zSo4PDa1YBcTgDyMjJ/qbGPabH08SJt1Uzc9QqRliGg5QySPKvgc+TyfYDmmTUWpNYz7ctxoQdPQshCktupckDJUPUcJT6DwMq8YyaQ9VL0pCS8zapcq4SVOBZmPDO8/cnknlWcDBwn4NYnPjLkQ+qE9OtOVlYpeVHDCEkkkJyT+SuQzy5Y0ru6Ez511/Efa5s1fdkOtyVurIxgdlQAA9gOKKPwolU7remU5hCGYEgo38KUv9I/0TRTDYJCWQBSF4rIN/CRgAR0iTpVD1j1g/qDqJcJqlKcjB9bcda142MpOEJAzgeMnjyTSyze5KEuNRpDoDvC0oe4X9iAeaKKFK+oya6fbOqYbDTeEiAPKpHdS3gBLYc7RQkp3ApQog+cq8nwPJrljzxnPZbUfnugn/NFFRgEVch9xKsH0H8pg6e3x3T3UC1ajaZITGkJLoS4MKbOUrzz/ACKVRRRVzVwtoQmhG1NkWu0HuI+JI8u/Kv/Z";
///////////////////////////////////////////////////////////////////////////////////
// YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
///////////////////////////////////////////////////////////////////////////////////
const raw = JSON.stringify({
  "inputs": [
    {
      "data": {
        "text": {
          "raw": RAW_TEXT
          // "url": TEXT_FILE_URL
        },
        "image": {
          "url": IMAGE_URL
          // "base64": IMAGE_BYTES_STRING
        }
      }
    }
  ],
  "model": {
    "model_version": {
      "output_info": {
        "params": {
          "temperature": 0.5,
          "max_tokens": 2048,
          "top_p": 0.95
          // "api_key": "ADD_THIRD_PARTY_KEY_HERE"
        }
      }
    }
  }
});

const requestOptions = {
  method: "POST",
  headers: {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": "Key " + PAT
  },
  body: raw
};

fetch(`https://api.clarifai.com/v2/users/${USER_ID}/apps/${APP_ID}/models/${MODEL_ID}/versions/${MODEL_VERSION_ID}/outputs`, requestOptions)
  .then(response => response.text())
  .then(result => console.log(result))
  .catch(error => console.log('error', error));
</script>
// index.js file
////////////////////////////////////////////////////////////////////////////////////////////////////////////
// In this section, we set the user authentication, user and app ID, model details, and input details.
// Change these values to run your own example.
////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Your PAT (Personal Access Token) can be found in the Account's Security section
const PAT = "YOUR_PAT_HERE";
// Specify the correct user_id/app_id pairings
// Since you're making inferences outside your app's scope
const USER_ID = "openai";
const APP_ID = "chat-completion";
// Change these to whatever model and inputs you want to use
const MODEL_ID = "openai-gpt-4-vision";
const MODEL_VERSION_ID = "266df29bc09843e0aee9b7bf723c03c2";
const RAW_TEXT = "Write a caption for the image";
// To use a hosted text file, assign the URL variable
// const TEXT_FILE_URL = "https://samples.clarifai.com/negative_sentence_12.txt";
// Or, to use a local text file, assign the location variable
// const TEXT_FILE_LOCATION = "YOUR_TEXT_FILE_LOCATION_HERE";
const IMAGE_URL = "https://samples.clarifai.com/metro-north.jpg";
// Or, to use a local image file, assign the location variable
// const IMAGE_FILE_LOCATION = "YOUR_IMAGE_FILE_LOCATION_HERE";
/////////////////////////////////////////////////////////////////////////////
// YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
/////////////////////////////////////////////////////////////////////////////
const { ClarifaiStub, grpc } = require("clarifai-nodejs-grpc");
const stub = ClarifaiStub.grpc();
// This will be used by every Clarifai endpoint call
const metadata = new grpc.Metadata();
metadata.set("authorization", "Key " + PAT);
// To use a local text file, uncomment the following lines
// const fs = require("fs");
// const fileBytes = fs.readFileSync(TEXT_FILE_LOCATION);
// To use a local image file, uncomment the following lines
// const fs = require("fs"); // Skip this line if fs is already required above
// const imageBytes = fs.readFileSync(IMAGE_FILE_LOCATION);
stub.PostModelOutputs(
  {
    user_app_id: {
      "user_id": USER_ID,
      "app_id": APP_ID
    },
    model_id: MODEL_ID,
    version_id: MODEL_VERSION_ID, // This is optional. Defaults to the latest model version
    inputs: [
      {
        "data": {
          "text": {
            "raw": RAW_TEXT,
            // "url": TEXT_FILE_URL,
            // "raw": fileBytes
          },
          "image": {
            "url": IMAGE_URL,
            // "base64": imageBytes
          }
        }
      }
    ],
    model: {
      "model_version": {
        "output_info": {
          "params": {
            "temperature": 0.5,
            "max_tokens": 2048,
            "top_p": 0.95
            // "api_key": "ADD_THIRD_PARTY_KEY_HERE"
          }
        }
      }
    }
  },
  metadata,
  (err, response) => {
    if (err) {
      throw new Error(err);
    }
    if (response.status.code !== 10000) {
      throw new Error("Post model outputs failed, status: " + response.status.description);
    }
    // Since we have one input, one output will exist here.
    const output = response.outputs[0];
    console.log(output.data.text.raw);
  }
);
package com.clarifai.example;
import com.clarifai.grpc.api.*;
import com.clarifai.channel.ClarifaiChannel;
import com.clarifai.credentials.ClarifaiCallCredentials;
import com.clarifai.grpc.api.status.StatusCode;
import com.google.protobuf.Struct;
import com.google.protobuf.Value;
import com.google.protobuf.ByteString;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
public class ClarifaiExample {
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// In this section, we set the user authentication, user and app ID, model details, and input details.
// Change these values to run your own example.
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Your PAT (Personal Access Token) can be found in the Account's Security section
static final String PAT = "YOUR_PAT_HERE";
// Specify the correct user_id/app_id pairings
// Since you're making inferences outside your app's scope
static final String USER_ID = "openai";
static final String APP_ID = "chat-completion";
// Change these to whatever model and inputs you want to use
static final String MODEL_ID = "openai-gpt-4-vision";
static final String MODEL_VERSION_ID = "266df29bc09843e0aee9b7bf723c03c2";
static final String RAW_TEXT = "Write a caption for the image";
// To use a hosted text file, assign the URL variable
// static final String TEXT_FILE_URL = "https://samples.clarifai.com/negative_sentence_12.txt";
// Or, to use a local text file, assign the location variable
// static final String TEXT_FILE_LOCATION = "YOUR_TEXT_FILE_LOCATION_HERE";
static final String IMAGE_URL = "https://samples.clarifai.com/metro-north.jpg";
// Or, to use a local image file, assign the location variable
// static final String IMAGE_FILE_LOCATION = "YOUR_IMAGE_FILE_LOCATION_HERE";
///////////////////////////////////////////////////////////////////////////////////
// YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
///////////////////////////////////////////////////////////////////////////////////
public static void main(String[] args) throws IOException {
    V2Grpc.V2BlockingStub stub = V2Grpc.newBlockingStub(ClarifaiChannel.INSTANCE.getGrpcChannel())
        .withCallCredentials(new ClarifaiCallCredentials(PAT));

    Struct.Builder params = Struct.newBuilder()
        .putFields("temperature", Value.newBuilder().setNumberValue(0.5).build())
        .putFields("max_tokens", Value.newBuilder().setNumberValue(2048).build())
        .putFields("top_p", Value.newBuilder().setNumberValue(0.95).build());
        // .putFields("api_key", Value.newBuilder().setStringValue("ADD_THIRD_PARTY_KEY_HERE").build());

    MultiOutputResponse postModelOutputsResponse = stub.postModelOutputs(
        PostModelOutputsRequest.newBuilder()
            .setUserAppId(UserAppIDSet.newBuilder().setUserId(USER_ID).setAppId(APP_ID))
            .setModelId(MODEL_ID)
            .setVersionId(MODEL_VERSION_ID) // This is optional. Defaults to the latest model version
            .addInputs(
                Input.newBuilder().setData(
                    Data.newBuilder()
                        .setText(
                            Text.newBuilder().setRaw(RAW_TEXT)
                            // Text.newBuilder().setUrl(TEXT_FILE_URL)
                            // Text.newBuilder().setRawBytes(ByteString.copyFrom(Files.readAllBytes(
                            //     new File(TEXT_FILE_LOCATION).toPath()
                            // )))
                        )
                        .setImage(
                            Image.newBuilder().setUrl(IMAGE_URL)
                            // Image.newBuilder().setBase64(ByteString.copyFrom(Files.readAllBytes(
                            //     new File(IMAGE_FILE_LOCATION).toPath()
                            // )))
                        )
                )
            )
            .setModel(Model.newBuilder()
                .setModelVersion(ModelVersion.newBuilder()
                    .setOutputInfo(OutputInfo.newBuilder().setParams(params))
                )
            )
            .build()
    );

    if (postModelOutputsResponse.getStatus().getCode() != StatusCode.SUCCESS) {
        throw new RuntimeException("Post model outputs failed, status: " + postModelOutputsResponse.getStatus());
    }

    // Since we have one input, one output will exist here
    Output output = postModelOutputsResponse.getOutputs(0);
    System.out.println(output.getData().getText().getRaw());
}
}
<?php
require __DIR__ . "/vendor/autoload.php";
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// In this section, we set the user authentication, user and app ID, model details, and input details.
// Change these values to run your own example.
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
// Your PAT (Personal Access Token) can be found in the Account's Security section
$PAT = "YOUR_PAT_HERE";
// Specify the correct user_id/app_id pairings
// Since you're making inferences outside your app's scope
$USER_ID = "openai";
$APP_ID = "chat-completion";
// Change these to whatever model and inputs you want to use
$MODEL_ID = "openai-gpt-4-vision";
$MODEL_VERSION_ID = "266df29bc09843e0aee9b7bf723c03c2";
$RAW_TEXT = "Write a caption for the image";
// To use a hosted text file, assign the URL variable
// $TEXT_FILE_URL = "https://samples.clarifai.com/negative_sentence_12.txt";
// Or, to use a local text file, assign the location variable
// $TEXT_FILE_LOCATION = "YOUR_TEXT_FILE_LOCATION_HERE";
$IMAGE_URL = "https://samples.clarifai.com/metro-north.jpg";
// To use a local file, assign the location variable
// $IMAGE_FILE_LOCATION = "YOUR_IMAGE_FILE_LOCATION_HERE";
///////////////////////////////////////////////////////////////////////////////////
// YOU DO NOT NEED TO CHANGE ANYTHING BELOW THIS LINE TO RUN THIS EXAMPLE
///////////////////////////////////////////////////////////////////////////////////
use Clarifai\ClarifaiClient;
use Clarifai\Api\Data;
use Clarifai\Api\Text;
use Clarifai\Api\Image;
use Clarifai\Api\Input;
use Clarifai\Api\Model;
use Clarifai\Api\ModelVersion;
use Clarifai\Api\OutputInfo;
use Clarifai\Api\PostModelOutputsRequest;
use Clarifai\Api\Status\StatusCode;
use Clarifai\Api\UserAppIDSet;
use Google\Protobuf\Struct;
use Google\Protobuf\Value;
$client = ClarifaiClient::grpc();
$metadata = ["Authorization" => ["Key " . $PAT]];
$userDataObject = new UserAppIDSet([
    "user_id" => $USER_ID,
    "app_id" => $APP_ID,
]);

// Create a Struct instance and populate the inference parameters
$params = new Struct();
$params->setFields([
    "temperature" => (new Value())->setNumberValue(0.5),
    "max_tokens" => (new Value())->setNumberValue(2048),
    "top_p" => (new Value())->setNumberValue(0.95),
    // "api_key" => (new Value())->setStringValue("ADD_THIRD_PARTY_KEY_HERE"),
]);

// To use a local text file, uncomment the following line
// $textData = file_get_contents($TEXT_FILE_LOCATION); // Get the text bytes data from the location
// To use a local image file, uncomment the following line
// $imageData = file_get_contents($IMAGE_FILE_LOCATION);

// Let's make an RPC call to the Clarifai platform. It uses the opened gRPC client channel to communicate a
// request and then waits for the response
[$response, $status] = $client->PostModelOutputs(
    // The request object carries the request along with the request status and other metadata related to the request itself
    new PostModelOutputsRequest([
        "user_app_id" => $userDataObject,
        "model_id" => $MODEL_ID,
        "version_id" => $MODEL_VERSION_ID, // This is optional. Defaults to the latest model version
        "inputs" => [
            new Input([
                // The Input object wraps the Data object in order to meet the API specification
                "data" => new Data([
                    // The Data object is constructed around the Text object. It offers a container that has additional
                    // text-independent metadata. In this particular use case, no other metadata needs to be specified
                    "text" => new Text([
                        // In the Clarifai platform, a text is defined by a special Text object
                        "raw" => $RAW_TEXT,
                        // "url" => $TEXT_FILE_URL,
                        // "raw" => $textData,
                    ]),
                    "image" => new Image([
                        // In the Clarifai platform, an image is defined by a special Image object
                        "url" => $IMAGE_URL,
                        // "base64" => $imageData,
                    ]),
                ]),
            ]),
        ],
        "model" => new Model([
            "model_version" => new ModelVersion([
                "output_info" => new OutputInfo(["params" => $params]),
            ]),
        ]),
    ]),
    $metadata
)->wait();

// A response is returned, and the first thing we do is check its status
// A successful response will have a status code of 0; otherwise, there is some error
if ($status->code !== 0) {
    throw new Exception("Error: {$status->details}");
}

// In addition to the RPC response status, there is a Clarifai API status that reports if the operation was a success or failure
// (not just that the communication was successful)
if ($response->getStatus()->getCode() != StatusCode::SUCCESS) {
    throw new Exception("Failure response: " . $response->getStatus()->getDescription() . " " . $response->getStatus()->getDetails());
}

// Since we have one input, one output will exist here
echo $response->getOutputs()[0]->getData()->getText()->getRaw();
?>
curl -X POST "https://api.clarifai.com/v2/users/openai/apps/chat-completion/models/openai-gpt-4-vision/versions/266df29bc09843e0aee9b7bf723c03c2/outputs" \
-H "Authorization: Key YOUR_PAT_HERE" \
-H "Content-Type: application/json" \
-d '{
  "inputs": [
    {
      "data": {
        "text": {
          "raw": "Write a caption for the image"
        },
        "image": {
          "url": "https://samples.clarifai.com/metro-north.jpg"
        }
      }
    }
  ],
  "model": {
    "model_version": {
      "output_info": {
        "params": {
          "temperature": 0.5,
          "max_tokens": 2048,
          "top_p": 0.95
        }
      }
    }
  }
}'
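
Whichever client you use, the generated text comes back nested under `outputs[0].data.text.raw` in the response. As an illustration, here is a minimal Python `requests` sketch of the same REST call as the cURL example above, extracting just that field:

```python
# A minimal sketch of the REST call above using Python's requests library,
# shown to illustrate the response structure.
import requests

PAT = "YOUR_PAT_HERE"
url = (
    "https://api.clarifai.com/v2/users/openai/apps/chat-completion"
    "/models/openai-gpt-4-vision/versions/266df29bc09843e0aee9b7bf723c03c2/outputs"
)
payload = {
    "inputs": [
        {
            "data": {
                "text": {"raw": "Write a caption for the image"},
                "image": {"url": "https://samples.clarifai.com/metro-north.jpg"},
            }
        }
    ],
    "model": {
        "model_version": {
            "output_info": {
                "params": {"temperature": 0.5, "max_tokens": 2048, "top_p": 0.95}
            }
        }
    },
}

response = requests.post(url, headers={"Authorization": f"Key {PAT}"}, json=payload)
data = response.json()

# A status code of 10000 means SUCCESS
if data["status"]["code"] != 10000:
    raise RuntimeError(f"Request failed: {data['status']['description']}")

# The generated caption lives at outputs[0].data.text.raw
print(data["outputs"][0]["data"]["text"]["raw"])
```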
Text Output Example
"Early morning solitude: A lone traveler waits on a snowy platform as dawn breaks."