
Create and Train Models

Learn how to easily create and train new models


The Clarifai platform streamlines the entire process of creating and training AI models, making it fast and efficient.

With just a single click, your model is not only trained but also automatically deployed, ready to enhance your business solutions instantly.

You can either build custom models tailored to your specific needs or jumpstart your projects with Clarifai's pre-optimized models, which are designed for immediate use.

Custom Models

When you train a custom model, you are telling the system to look at all the inputs with the concepts you've provided and learn from them. Then, when the model encounters new inputs, it can correctly generate predictions by applying the learned knowledge.

The train operation is asynchronous. It may take some time for your model to be fully trained and ready. Your model will be trained on all inputs that have been processed, and a new version will be created.
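
For example, here is a minimal sketch of triggering a training run with the clarifai-grpc Python client. YOUR_PAT, YOUR_USER_ID, YOUR_APP_ID, and the model ID my-custom-model are placeholders to replace with your own values; the PostModelVersions call starts training and returns immediately because the operation is asynchronous.

```python
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_code_pb2

# Open a gRPC channel to the Clarifai API.
stub = service_pb2_grpc.V2Stub(ClarifaiChannel.get_grpc_channel())
metadata = (("authorization", "Key YOUR_PAT"),)  # placeholder personal access token
user_app_id = resources_pb2.UserAppIDSet(user_id="YOUR_USER_ID", app_id="YOUR_APP_ID")

# Kick off an asynchronous training run; this creates a new model version.
response = stub.PostModelVersions(
    service_pb2.PostModelVersionsRequest(
        user_app_id=user_app_id,
        model_id="my-custom-model",  # placeholder: ID of your trainable model
    ),
    metadata=metadata,
)

if response.status.code != status_code_pb2.SUCCESS:
    raise RuntimeError(f"Training request failed: {response.status.description}")

# Training continues in the background; poll GetModelVersion until the
# new version's status reports that training has completed.
```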

Clarifai Models

Clarifai models are designed to be fast, scalable, and highly flexible, providing the ability to quickly deploy solutions that can adapt to your specific business needs. They can simplify complex tasks, reduce development time, and deliver reliable, accurate results.

Our model library is continually expanding and evolving. You can explore the Community platform to find a model that best fits your needs.

Our models are fully "trainable" machine learning models developed in-house and rigorously tested, ready to make predictions right out of the box.

We offer models across a wide range of categories, including generative models like large language models (LLMs), as well as classification, detection, and segmentation models.

Try our Hosted Models

Before training your own model, we recommend trying the models hosted on our platform to see if they meet your needs.
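
To get a feel for a hosted model, you can call it directly. Below is a minimal sketch using the clarifai-grpc Python client against Clarifai's public general-image-recognition model; YOUR_PAT is a placeholder for your personal access token, and the image URL points to Clarifai's public sample.

```python
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc
from clarifai_grpc.grpc.api.status import status_code_pb2

stub = service_pb2_grpc.V2Stub(ClarifaiChannel.get_grpc_channel())
metadata = (("authorization", "Key YOUR_PAT"),)  # placeholder personal access token

# Predict with a publicly hosted model from the clarifai/main app.
response = stub.PostModelOutputs(
    service_pb2.PostModelOutputsRequest(
        user_app_id=resources_pb2.UserAppIDSet(user_id="clarifai", app_id="main"),
        model_id="general-image-recognition",
        inputs=[
            resources_pb2.Input(
                data=resources_pb2.Data(
                    image=resources_pb2.Image(url="https://samples.clarifai.com/metro-north.jpg")
                )
            )
        ],
    ),
    metadata=metadata,
)

if response.status.code != status_code_pb2.SUCCESS:
    raise RuntimeError(response.status.description)

# Each predicted concept comes back with a confidence score.
for concept in response.outputs[0].data.concepts:
    print(f"{concept.name}: {concept.value:.3f}")
```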

Tip: Read a comparison of GPT-5 and other models — covering features, pricing, and use cases.

Model Types

Whether you build a custom model or use one of Clarifai’s pre-built models, selecting the right model type is essential for your specific use case.

Different model types are optimized for different tasks and produce distinct outputs based on your input data and desired AI functionality.

Choosing the appropriate model type enables you to fully leverage the Clarifai platform and power your business with advanced AI capabilities.

Notes
  • You can use the List Model Types method to view the complete list of available model types and find one suited to your needs.
  • To create a model with a specific type, specify the desired model_type_id in the request body, as shown in the sketch below.
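
A minimal sketch of both steps with the clarifai-grpc Python client, assuming placeholder credentials (YOUR_PAT, YOUR_USER_ID, YOUR_APP_ID) and a hypothetical model ID my-text-classifier:

```python
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc

stub = service_pb2_grpc.V2Stub(ClarifaiChannel.get_grpc_channel())
metadata = (("authorization", "Key YOUR_PAT"),)  # placeholder personal access token
user_app_id = resources_pb2.UserAppIDSet(user_id="YOUR_USER_ID", app_id="YOUR_APP_ID")

# List every model type available on the platform.
types_response = stub.ListModelTypes(
    service_pb2.ListModelTypesRequest(user_app_id=user_app_id),
    metadata=metadata,
)
for model_type in types_response.model_types:
    print(model_type.id, "-", model_type.title)

# Create a new model of a specific type by setting model_type_id.
create_response = stub.PostModels(
    service_pb2.PostModelsRequest(
        user_app_id=user_app_id,
        models=[
            resources_pb2.Model(
                id="my-text-classifier",          # placeholder model ID
                model_type_id="text-classifier",  # a type from the list below
            )
        ],
    ),
    metadata=metadata,
)
```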

Broadly, you can create and train models on our platform using either transfer learning, which quickly trains a lightweight classifier on top of a pre-trained embedding model, or deep fine-tuning, which retrains the underlying network on your own data.

List of Model Types

Model ID | Title | Description
embedding-classifier | Transfer Learning Classifier | Classify images or texts based on the embedding model that has indexed them in your app. Transfer learning leverages feature representations from a pre-trained model based on massive amounts of data, so you don't have to train a new model from scratch and can learn new things very quickly with minimal training data.
audio-embedder | Audio Embedder | Embed an audio signal into a vector representing a high-level understanding from our AI models. These embeddings enable similarity search and training on top of them.
visual-detector-embedder | Visual Detector + Embedder | Detect bounding box regions in images or video frames where things occur, then embed them into a high-level understanding from our AI models to enable visual search and training on top of them.
optical-character-recognizer | Optical Character Recognizer (OCR) | Detect bounding box regions in images or video frames where text is present, then output the text read along with a score.
image-to-image | Image to Image | Given an image, apply a transformation to the input and return the post-processed image as output.
image-to-text | Image To Text | Takes in cropped regions containing text and returns the text it sees.
text-to-image | Text To Image | Takes in a prompt and generates an image.
clusterer | Clusterer | Cluster semantically similar images and video frames together in embedding space. This is the basis for good visual search within your app at scale, or for grouping your data together without the need for annotated concepts.
image-color-recognizer | Image Color Recognizer | Recognize standard color formats and the proportion of an image that each color covers.
concept-thresholder | Concept Thresholder | Threshold input concepts according to both a threshold and an operator (>, >=, =, <=, or <).
region-thresholder | Region Thresholder | Threshold regions based on the concepts they contain, using a threshold per concept and an overall operator (>, >=, =, <=, or <).
concept-synonym-mapper | Concept Synonym Mapper | Map input concepts to output concepts by following synonym concept relations in the knowledge graph of your app.
annotation-writer | Annotation Writer | Write the input data to the database in the form of an annotation with a specified status, as if a specific user created the annotation.
image-crop | Image Cropper | Crop the input image according to each input region present in the input.
random-sample | Random Sampler | Randomly sample, allowing the input to pass to the output. This is done with the condition keep_fraction > rand(), where keep_fraction is the fraction to allow through on average.
visual-keypointer | Visual Keypoint | Detect keypoints in images or video frames.
email | Email Alert | Send an email if any data fields are input to this model.
sms | SMS Alert | Send an SMS if any data fields are input to this model.
object-counter | Object Counter | Count the number of regions that match this model's active concepts, frame by frame.
image-align | Image Align | Align images using keypoints.
input-searcher | Cross-App Input Searcher | Triggers a visual search in another app, based on the model configs, if concept(s) are found in images, and returns the matched search hits as regions.
input-filter | Input Filter | If the input going through this model does not match what we are filtering for, it will not be passed on in the workflow branch.
text-to-audio | Text to Audio | Given text input, produce an audio file containing the spoken version of the input.
regex-based-classifier | Regex Based Classifier | Classify text using a regex. If the regex matches, the text is classified as the provided concepts.
prompter | Prompter | Prompt template where input text will be inserted into placeholders marked with {data.text.raw}.
rag-prompter | RAG Prompter | A prompt template that performs a semantic search in the app using the incoming text.
image-prompter | Image Prompter | A prompter model that helps create a multimodal input from the input image and text.
concept-to-text-mapper | Concept To Text Mapper | Map concepts to text.
mcp | MCP | Process MCP messages with any input and output.
openai | OpenAI | Process Clarifai models with OpenAI-format messages.
any-to-any | Any To Any | Process any input and output with any data type.
image-tiling-operator | Image Tiling Operator | Operator for tiling images into a fixed number of equal-sized tiles.
isolation-operator | Isolation Operator | Operator that computes the distance between detections and assigns an isolation label.
language-id-operator | Language Identification Operator | Operator for language identification using the langdetect library.
text-aggregation-operator | Text Aggregation Operator | Operator that combines text detections into a text body for the whole image.
tiling-region-aggregator-operator | Tiling Region Aggregator Operator | Operator to be used as a follow-up to the image-tiling-operator and a visual detector. It transforms the detections on each of the tiles back to the original image and performs non-maximum suppression.
barcode-operator | Barcode Operator | Operator that detects and recognizes barcodes in the image. It assigns regions with the barcode text for each detected barcode.
keyword-filter-operator | Keyword Filter Operator | This operator is initialized with a set of words and then determines which of them are found in the input text.
raft-operator | RAFT Operator | Calls an LLM to generate questions and answers based on a text input chunk. The output is the chat-formatted instruction for fine-tuning.
tokens-to-entity-operator | Tokens to Entity Operator | Operator that combines text tokens into entities, e.g. New + York -> New York.
track-representation-operator | Track Representation Operator | Takes the embedding of each track frame and aggregates them to form a track embedding.
byte-tracker | BYTE Track | A multi-object tracker that aims to keep track of all boxes per frame, forming them into tracklets.
centroid-tracker | Centroid Tracker | Relies on the Euclidean distance between centroids of regions in different video frames to assign the same track ID to detections of the same object.
kalman-filter-tracker | Kalman Filter Hungarian Tracker | Relies on the Kalman Filter algorithm to estimate the next position of an object, matched to detections using the Hungarian algorithm.
text-classifier | Text Classifier | Classify text into a set of concepts.
visual-detector | Visual Detector | Detect bounding box regions in images or video frames where things occur, then classify the objects, descriptive words, or topics within the boxes.
multimodal-to-text | Multimodal To Text | Generate text from text, images, or both as input, allowing the model to understand and respond to questions about those images.
text-embedder | Text Embedder | Embed text into a vector representing a high-level understanding from our AI models. These embeddings enable similarity search and training on top of them.
visual-embedder | Visual Embedder | Embed images and video frames into a vector representing a high-level understanding from our AI models. These embeddings enable visual search and training on top of them.
visual-segmenter | Visual Segmenter | Segment a per-pixel mask in images where things are located, then classify the objects, descriptive words, or topics within the masks.
zero-shot-text-classifier | Zero Shot Text Classifier | Classify text into a set of concepts provided by the user, using a pretrained model.
multimodal-embedder | Multimodal Embedder | Embed text or images into a vector representing a high-level understanding from our AI models, e.g. CLIP. These embeddings enable similarity search and training on top of them.
text-to-text | Text Generator | Generate or convert text based on text input, e.g. prompt completion, translation, or summarization.
visual-anomaly-heatmap | Visual Anomaly | Visual anomaly detection with an image-level score and an anomaly heatmap.
zero-shot-image-classifier | Zero Shot Image Classifier | Classify images into a set of concepts provided by the user, using a pretrained model.
audio-classifier | Audio Classifier | Classify audio into a set of concepts.
text-token-classifier | Text Token Classifier | Classify tokens from a set of entity classes.
visual-classifier | Visual Classifier | Classify images and video frames into a set of concepts.
zero-shot-image-segmenter | Zero Shot Image Segmenter | Dynamically segment a per-pixel mask in images where things are located, then classify the objects, descriptive words, or topics within the masks.
audio-to-text | Audio To Text | Convert an audio signal into a string of text.
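
For example, the embedding-classifier type at the top of this list is the transfer-learning path mentioned above: you create a model of that type, then train a version on concepts already annotated in your app. A minimal sketch with the clarifai-grpc Python client, assuming placeholder credentials and hypothetical model and concept IDs (my-pets-classifier, cat, dog):

```python
from clarifai_grpc.channel.clarifai_channel import ClarifaiChannel
from clarifai_grpc.grpc.api import resources_pb2, service_pb2, service_pb2_grpc

stub = service_pb2_grpc.V2Stub(ClarifaiChannel.get_grpc_channel())
metadata = (("authorization", "Key YOUR_PAT"),)  # placeholder personal access token
user_app_id = resources_pb2.UserAppIDSet(user_id="YOUR_USER_ID", app_id="YOUR_APP_ID")

# Create a transfer-learning classifier on top of the app's base embeddings.
stub.PostModels(
    service_pb2.PostModelsRequest(
        user_app_id=user_app_id,
        models=[
            resources_pb2.Model(
                id="my-pets-classifier",  # placeholder model ID
                model_type_id="embedding-classifier",
            )
        ],
    ),
    metadata=metadata,
)

# Train a version of it to predict the chosen concepts; in a real script,
# check each response's status code as in the earlier examples.
stub.PostModelVersions(
    service_pb2.PostModelVersionsRequest(
        user_app_id=user_app_id,
        model_id="my-pets-classifier",
        model_versions=[
            resources_pb2.ModelVersion(
                output_info=resources_pb2.OutputInfo(
                    data=resources_pb2.Data(
                        concepts=[
                            resources_pb2.Concept(id="cat"),  # hypothetical concept IDs
                            resources_pb2.Concept(id="dog"),
                        ]
                    )
                )
            )
        ],
    ),
    metadata=metadata,
)
```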