Generative AI Glossary

A Glossary of Generative AI Terms for Using the Clarifai Platform Effectively


Adversarial Autoencoder (AAE)

A type of autoencoder that combines the adversarial loss central to GANs with the standard autoencoder architecture. This combination empowers the model to learn complex data distributions effectively.

Audio Synthesis

This involves using AI to create new, artificial sounds or voice outputs. Such sounds can be as simple as a specific tone or as complex as a mimicked form of speech.

Autoregressive Models

These are generative models that produce data by conditioning each element's probability on previous elements in a sequence. For example, WaveNet and PixelCNN are autoregressive models for creating music and images, respectively.


Autoencoder

An autoencoder is an artificial neural network utilized for learning efficient encodings of input data. It has two crucial components: an encoder that compresses the input data and a decoder that reconstructs the data from its reduced form.

Autoregressive Generative Models

These models predict the distribution of each element in a sequence from the elements that precede it, implicitly defining a distribution over whole sequences via the chain rule of conditional probability. The main architectures for autoregressive models are causal convolutional networks and recurrent neural networks.
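As a toy illustration of the chain rule factorization, the sketch below scores a sequence with a bigram model. The probability table is invented for the example and is not from any real model:

```python
import math

# Toy bigram "model": P(next token | previous token) over a tiny vocabulary.
# These probabilities are made up purely for illustration.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"a": 0.2, "b": 0.5, "</s>": 0.3},
    "b":   {"a": 0.7, "b": 0.1, "</s>": 0.2},
}

def sequence_log_prob(tokens):
    """log P(sequence) = sum of log P(token_i | token_{i-1}) (chain rule)."""
    prev, total = "<s>", 0.0
    for tok in tokens + ["</s>"]:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

# P("a b") = P(a|<s>) * P(b|a) * P(</s>|b) = 0.6 * 0.5 * 0.2
print(round(math.exp(sequence_log_prob(["a", "b"])), 3))  # 0.06
```

Real autoregressive models condition on the full prefix (not just the previous token) and learn these conditional distributions with a neural network.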


BERT (Bidirectional Encoder Representations from Transformers)

BERT, developed by Google, is a pre-trained transformer-based language model. It stands out for its bidirectional training approach, which allows it to understand the context of a word based on all of its surroundings (left and right of the word).


BLOOM

Developed by the BigScience research workshop, BLOOM is a large-scale language model that can execute a vast array of natural language understanding and generation tasks accurately.



ChatGPT

Developed by OpenAI, ChatGPT is a specialized large-scale language model that generates human-like text. It's a popular choice for developing AI-powered chatbots due to its convincing conversation-generation capabilities.

CLIP (Contrastive Language–Image Pretraining)

Developed by OpenAI, CLIP is a model trained on image–text pairs to learn a shared representation of images and natural language. It can match images to text descriptions without task-specific training, making it useful for zero-shot image classification and as a building block in text-to-image systems.

Closed-Book QA

Closed-book QA, also known as zero-shot QA, refers to the ability of an LLM to answer questions without access to any additional information or context beyond its internal knowledge base.

Open-Book QA

Closed-book QA stands in contrast to open-book QA, where the LLM can access and process external sources of information, such as documents, web pages, or knowledge bases.

Conditional GANs (cGANs)

These are a type of GAN where a conditional variable is introduced to the input layer, allowing the model to generate data conditioned on certain factors. This augmentation provides the model with the capability to generate data with desired characteristics.


Cross-Modal Learning

Cross-modal learning refers to using information from one modality to understand or make predictions in another modality. This could involve translating or transforming the data in some way. For example, a cross-modal learning system might be designed to accept text input and output a related image or vice versa.


CycleGAN

A type of GAN that can translate an image from a source domain to a target domain without paired examples. It's particularly useful in tasks like photo enhancement, image colorization, and style transfer for unpaired photo-to-photo translation.



DALL-E 2

This is an updated version of DALL-E, an AI model developed by OpenAI to generate images from textual descriptions. It's an excellent example of a multi-modal AI system.

Data Distribution

In machine learning, data distribution refers to the overall layout or spread of data points within a dataset. In the case of generative models such as GANs, the generator seeks to mimic the actual data distribution.


Deepfakes

Synthetic media in which a person in an existing image or video is replaced with someone else's likeness using machine learning techniques. While they could serve interactive entertainment purposes, deepfakes may mislead viewers, often with unintended consequences.


Diffusion

In AI, 'diffusion' refers to a technique for generating new data by gradually adding random noise to real data and then learning to reverse that process. A neural network is trained to predict and remove the noise step by step, so that new samples can be generated starting from pure noise.
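The forward (noising) half of the process can be sketched in a few lines. This is a simplified illustration of one noising step with a fixed noise level `beta`; the signal values and schedule are invented for the example:

```python
import random

def forward_diffusion_step(x, beta, rng):
    """One forward diffusion step: shrink the signal and mix in Gaussian noise.

    x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * noise
    """
    return [
        (1 - beta) ** 0.5 * xi + beta ** 0.5 * rng.gauss(0.0, 1.0)
        for xi in x
    ]

rng = random.Random(0)
x = [1.0, -1.0, 0.5]          # a tiny made-up "data point"
for _ in range(100):          # after many steps, x approaches pure noise
    x = forward_diffusion_step(x, beta=0.05, rng=rng)
print(len(x))  # 3 — same shape, but the original signal is mostly destroyed
```

A diffusion model is then trained to invert this corruption, denoising one step at a time.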


Discriminator

In a GAN, the discriminator is the component that tries to differentiate real data instances from the fictitious ones fabricated by the generator. It helps refine the generator's ability to create realistic data.



Embedding

An embedding represents data in a new form, often a vector space, facilitating comparisons and calculations with other data points. Similar items should have similar embeddings, making it an essential feature for many AI tasks, like recommendation systems and natural language processing.
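The standard way to compare embeddings is cosine similarity. The sketch below uses tiny hypothetical 4-dimensional vectors; real embeddings typically have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented so that "cat" and "kitten" point
# in similar directions while "car" points elsewhere.
cat = [0.9, 0.1, 0.8, 0.2]
kitten = [0.85, 0.15, 0.75, 0.25]
car = [0.1, 0.9, 0.2, 0.8]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```

Recommendation and semantic-search systems rank candidates by exactly this kind of similarity score.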

Emergence/Emergent Behavior

In artificial intelligence, emergence refers to complex phenomena that arise from simple rules or processes. Radical concepts like "sharp left turns" and "intelligence explosions" denote sudden, dramatic developments in AI, often related to the emergence of AGI.


Few-Shot Learning

A machine learning method where the model learns to perform a task from a few examples per class. For instance, it can correctly categorize new data after being shown only a few samples from each category.


Fine-Tuning

A form of transfer learning wherein a pre-trained model is slightly modified or adjusted to perform a new task. This process allows for more efficient use of the pre-trained models by adjusting them to solve tasks similar to the ones they were originally trained on.

Foundation Model

In AI, foundation models are large-scale AI models trained on diverse and extensive data meant to be fine-tuned or adapted for more specific tasks. These are called foundation models, as they offer a robust and broad foundation that can be built upon for various AI tasks.


Generative Models for Images

These are generative models like GANs, VAEs, and DALL-E, trained on image data and capable of generating new images that reflect the patterns found in the training data.

Generative Pre-Trained Transformer (GPT)

GPT is a family of neural network models trained to generate content. These models are pre-trained on vast amounts of text data, allowing them to generate coherent and relevant text based on user prompts. GPT models can automate content creation and analyze customer feedback for insights, fostering personalized interactions.


Generator

In a Generative Adversarial Network, the generator is the component that creates new instances of data by learning to mimic the real data distribution.

GPT-1, GPT-2, GPT-3, and GPT-4

Progressive versions of the generative pre-trained transformers developed by OpenAI. Each model sees improvements and expansions on its predecessors, offering advanced text generation capabilities and greater application versatility. GPT-3, for instance, is an extremely sophisticated model known for its wide-ranging applicability, including translation, question-answering, and text completion tasks.


GPT-J

GPT-J is an open-source large language model developed by EleutherAI in 2021. It is a generative pre-trained transformer model with 6 billion parameters, similar to GPT-3, but with some architectural differences. GPT-J was trained on a large-scale dataset called The Pile, a mixture of sources from different domains.


GPT-Neo

GPT-Neo is a family of transformer-based language models from EleutherAI based on the GPT architecture. It is an open-source alternative to GPT-3 that can generate natural language texts using deep learning. The GPT-Neo model comes in 125M, 1.3B, and 2.7B parameter variants. This allows users to choose the model size that best fits their specific use case and computational constraints.


Grounding

Grounding is the process of linking a model's output to factual and verifiable information sources. This technique enhances the accuracy and reliability of the model, especially in applications where factual correctness is critical. Grounding reduces the risk of the model generating unfounded or incorrect content.



Hallucination

In AI, a hallucination occurs when a model makes erroneous conclusions and generates content that doesn't correspond to reality. These erroneous outputs indicate problems in the workings of the AI model. Identifying and mitigating hallucinations requires ongoing vigilance from teams to maintain the accuracy and reliability of AI systems.


Image Translation

A task in computer vision where the goal is to map or translate one image into another, often using models such as GANs. For example, translating a daytime scene into a nighttime scene.


Inpainting

A generative task where the AI is meant to fill in missing or corrupted parts of an image. Typical applications include photo restoration and the completion of unfinished art.



LangChain

LangChain is an open-source framework for building applications powered by large language models, chaining model calls together with prompts, external data, and tools. Related prompting techniques improve reasoning capability: "chain-of-thought" prompting breaks a task into smaller, discrete steps, while the more complex "tree-of-thought" approach allows logical steps to branch and backtrack.

Large Language Models (LLMs)

Large-scale AI models trained on extensive text data, such as GPT-3 and BERT. They can respond to prompts, generate text, answer questions, create poetry, and even generate code. This ability can enable personalized and authentic customer interactions and assist in automating customer-facing content.

Latent Space

In generative models, latent space refers to a compressed representation of the input data. In a GAN, for example, the random noise fed into the generator is a point in latent space that the generator maps to an output.

Llama 2

Llama 2 is a collection of pre-trained and fine-tuned large language models (LLMs) created and publicly released by Meta AI. It is available in three model sizes: 7, 13, and 70 billion parameters. Llama 2-Chat is a fine-tuned version of Llama 2, specifically optimized for dialogue-based scenarios.


Machine Learning Bias

Bias in machine learning can occur from intentionally or unintentionally biased data or algorithms making incorrect assumptions, leading to skewed decisions. Understanding and addressing this bias ensures fair and accurate treatment for all customers.


Midjourney

Midjourney is a text-to-image AI service developed by an independent research lab of the same name. It allows users to generate images from textual descriptions, producing a wide range of art forms, from realistic to abstract styles, and is especially known for its high-quality, well-structured, and detailed images.

Mistral 7B

Mistral 7B, introduced by Mistral AI, is an LLM that has garnered attention due to its efficiency and strong performance. It is a 7.3-billion-parameter model, making it smaller than models like GPT-3 (175 billion parameters) but still powerful for various tasks. Despite its size, Mistral 7B has shown impressive performance on various benchmarks, even surpassing some larger models in specific areas.

Mixture of Experts

It is a machine learning method where specialized models, or “experts”, handle different parts of data distribution. The final prediction is a blend of these expert outputs, adjusted by a “gating” system that determines each expert’s relevance. This leverages individual strengths to form a more robust model.
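A minimal sketch of this blending, with two made-up scalar "experts" and a toy gating function (in real mixture-of-experts layers, the experts are neural sub-networks and the gate is learned):

```python
import math

def softmax(scores):
    """Convert raw gating scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, experts, gate):
    """Blend expert outputs, weighted by the gating function's softmax."""
    weights = softmax(gate(x))
    return sum(w * expert(x) for w, expert in zip(weights, experts))

# Two toy "experts" (invented for illustration).
experts = [lambda x: 2 * x, lambda x: x ** 2]
# Toy gating scores: favor expert 0 for small |x|, expert 1 for large |x|.
gate = lambda x: [-abs(x), abs(x) - 2]

y = mixture_of_experts(1.0, experts, gate)
print(y)  # 1.5 — equal gate scores here, so an even blend of 2.0 and 1.0
```

Modern sparse variants route each input to only the top-scoring experts, which keeps compute low while total parameter count grows.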


Modalities

Modalities refer to the various types of data that a model can process and interpret. These include text, images, audio, video, and other forms of sensory data. Each modality represents a unique form of information, offering distinct insights and characteristics that can be utilized in AI applications.

Mode Collapse

This phrase refers to a situation when the Generator in a Generative Adversarial Network begins to produce the same output (or a narrow set of outputs) repetitively rather than generating diverse outputs. It destabilizes the learning process and poses a challenge in GAN training.

Multi-Modal AI

This type of AI has the capability to process and understand inputs from different data types, like text, speech, images, and videos. Thus, these AI models can deal with diverse data inputs, enhancing their applicability in various contexts.


Multimodal Learning

A multimodal learning model makes predictions by accepting and analyzing various types of input, such as audio and video data, improving its understanding of scenarios like movie scenes.


NeRF (Neural Radiance Fields)

A method for creating a three-dimensional scene from two-dimensional images using a neural network. NeRF can create a photorealistic rendering, synthesize views, and offer more capabilities in understanding and reconstructing scenes from 2D images.



Outpainting

A generative task where the AI is asked to extend the existing content of an image. It fills the areas beyond the image boundaries with plausible content that seamlessly connects with the original image context.


Parameter Efficient Fine-Tuning (PEFT)

Full parameter fine-tuning traditionally involves adjusting all parameters across all layers of a pre-trained model. While it typically yields optimal performance, it is resource-intensive and time-consuming, demanding significant GPU resources and time. On the other hand, PEFT offers a way to fine-tune models with minimal resources and costs. One notable PEFT method is Low-Rank Adaptation (LoRA).


LoRA (Low-Rank Adaptation)

LoRA is a game-changer for fine-tuning LLMs on resource-constrained devices or environments. It achieves this by exploiting inherent low-rank structures within the model's parameters. These structures capture essential patterns and relationships in the data, allowing LoRA to focus on these during fine-tuning, rather than modifying the entire parameter space.
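The core idea can be sketched with plain list-of-lists arithmetic: instead of updating a frozen weight matrix W directly, LoRA learns a low-rank update B @ A and adds it on top. The 4x4 matrix and rank-1 factors below are invented for illustration:

```python
def matmul(A, B):
    """Plain list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

# Frozen pretrained weight W (4x4 identity here) and a rank-1 update:
#     W' = W + B @ A
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[0.1], [0.2], [0.3], [0.4]]   # 4x1 trainable factor
A = [[1.0, 0.0, 1.0, 0.0]]         # 1x4 trainable factor

W_adapted = add(W, matmul(B, A))

# Only the factors are trained: 8 numbers instead of all 16 entries of W.
trainable = sum(len(r) for r in B) + sum(len(r) for r in A)
print(trainable)  # 8
```

For large models the savings are dramatic: a rank-r update to a d x d matrix trains 2rd parameters instead of d squared.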


Parameters

Parameters are the fundamental elements that define the behavior and output of a model. They are akin to settings or dials that can be adjusted to control various aspects of the model's performance, such as its responsiveness, creativity, and accuracy. Parameters play a crucial role in fine-tuning the model to achieve desired results, whether it's generating text, images, or other forms of content. They are essential for optimizing the model to suit specific tasks or applications.

Plugins / tools

AI agents built on LLMs may have the ability to use 'tools' via APIs that give them new capabilities. For example, LLMs equipped with web search capabilities can access data not present in their training dataset, which can significantly reduce the risk of hallucinations.

Pre-Encoded Knowledge QA

This involves utilizing the model's built-in pre-encoded knowledge base to respond to questions. The model is provided with a large collection of facts and relationships, which it uses to generate answers when given prompts or questions. The pre-existing knowledge base equips the model with the ability to answer questions that demand a good understanding of the world.


Prompt

A prompt is the initial input or direction given to an AI model to execute a task or answer a query. It sets the starting context for the model's generation process.

Prompt Tokens / Sampled Tokens / Completion

These terms relate to how the AI uses tokens or units of data as input or output. A prompt token starts the model's data generation process, the model chooses sampled tokens during this process, and completion signifies the model's output following the prompt.



Quantization

Quantization is a model compression method that involves converting the weights and activations within an LLM from a high-precision data representation to a lower-precision one, without sacrificing significant accuracy. This means transitioning from a data type capable of holding more information, such as a 32-bit floating-point number (FP32), to one with less capacity, such as an 8-bit or 4-bit integer (INT8 or INT4).
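A minimal sketch of symmetric INT8 quantization, using a handful of made-up weight values. Real quantization schemes add per-channel scales, zero points, and calibration, but the round-trip below shows the basic idea:

```python
def quantize_int8(values):
    """Symmetric int8 quantization: map floats into integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]

weights = [0.32, -1.27, 0.05, 0.9]   # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(max_err < scale)  # True: round-trip error is bounded by one step
```

Each weight now fits in one byte instead of four, at the cost of a small, bounded rounding error.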


RLHF (Reinforcement Learning from Human Feedback)

This technique incorporates human feedback into the learning process of an AI model. Evaluators provide feedback on the model's outputs, which can help improve the model's performance over time.


Self-Supervised Learning

This is a type of machine learning in which the training signal is derived from the unlabeled data itself rather than from human-provided labels. For example, a language model can be trained to predict masked or next words in a sentence, turning raw text into its own supervision. This contrasts with supervised learning, which usually involves training data that has been labeled, classified, and categorized.

Sequence Generation

This is a task in natural language processing where the model generates a sequence of words or symbols, such as in text generation. It's one of the capabilities of autoregressive language models such as GPT.

Style Transfer

This generative model application involves capturing the artistic style of one image (the style source) and transferring it onto another image (the content source).


StyleGAN

Developed by NVIDIA, StyleGAN is a GAN-based model known for its high-quality and consistent outputs. In particular, it gained attention for its capability to generate hyper-realistic images of human faces.

Super Resolution

An application of generative models that involves increasing the resolution of an image. Using these models, lower-quality images can be enhanced successfully.

Symbolic Artificial Intelligence

A type of AI that leverages symbolic reasoning mechanisms to solve problems and represent knowledge. It typically involves the usage of symbols to represent concepts and the relationships between them.

System Prompt

This refers to the predefined instructions that set the general behavior of an AI system, like a chatbot. It begins every interaction and influences how the AI responds to user inputs.



Temperature

It is a parameter that controls the randomness and creativity of a model's output. A higher temperature setting results in more varied and unpredictable responses, fostering creativity. Conversely, a lower temperature yields more deterministic and predictable outputs, enhancing coherence and reliability. This parameter is essential for fine-tuning the balance between novelty and accuracy in generated content.
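Mechanically, temperature divides the model's raw scores (logits) before the softmax that turns them into token probabilities. The sketch below uses three hypothetical logits to show the effect:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=5.0)

# Low temperature concentrates probability on the top token;
# high temperature flattens the distribution toward uniform.
print(cold[0] > hot[0])           # True
print(max(hot) - min(hot) < 0.2)  # True: nearly uniform
```

Sampling from the "cold" distribution almost always picks the top token; sampling from the "hot" one produces much more varied output.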


Tokenization

In the context of neural networks, tokenization is the process of encoding text into numerical values. Tokens may represent letters, groups of letters, or whole words.
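A toy word-level tokenizer makes the idea concrete. Production tokenizers (e.g. BPE-based ones) split text into subword units instead, but the encode/decode round trip looks the same:

```python
def build_vocab(text):
    """Assign an integer id to each distinct word (toy word-level scheme)."""
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn text into the integer ids the model actually consumes."""
    return [vocab[w] for w in text.split()]

def decode(ids, vocab):
    """Map ids back to words to recover readable text."""
    inverse = {i: w for w, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab("the cat sat on the mat")
ids = encode("the cat sat on the mat", vocab)
print(ids)                 # [0, 1, 2, 3, 0, 4] — "the" reuses id 0
print(decode(ids, vocab))  # the cat sat on the mat
```

Note that repeated words map to the same id, which is exactly how a model sees recurring tokens.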


Translation

Translation refers to the process of automatically converting text or speech from one language (source language) to another language (target language), preserving the original meaning as closely as possible.


Variational Autoencoder (VAE)

Unlike traditional autoencoders, which can learn any function to reconstruct data, a VAE places additional constraints on encoded representations, so they learn parameters of a probability distribution representing the data.
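The sampling step at the heart of a VAE is often written as the "reparameterization trick": draw z = mu + sigma * eps with eps from a standard normal. The two-dimensional encoder outputs below are invented for illustration:

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample a latent vector z = mu + sigma * eps (reparameterization trick).

    The encoder outputs a mean and log-variance per latent dimension;
    writing the sample this way keeps it differentiable in a real VAE.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(42)
mu = [0.0, 1.0]         # hypothetical encoder means
log_var = [0.0, -2.0]   # sigma = 1.0 and about 0.37, respectively
z = reparameterize(mu, log_var, rng)
print(len(z))  # 2 — one sample per latent dimension
```

The decoder then maps z back to data space; nearby points in latent space decode to similar outputs.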


Weakly Supervised Learning

This is an approach to supervised learning in which the training data is noisy, limited, or imprecise; however, these weakly labeled samples are often easier and cheaper to obtain, resulting in larger effective training sets.


Zero-Shot Learning

A type of machine learning where the model can make predictions about data it has never encountered during its training. It leverages similarities between what it has seen and the novel data to make predictions.