Clusterer
Learn about our clusterer model type
Input: Images and videos
Output: Clusters
Clusterer is a type of deep fine-tuned model designed to identify and group similar images or video frames within a dataset. The primary goal of clustering is to discover patterns or relationships among data points based on their inherent characteristics or features, without requiring explicit labels or predefined categories.
Cluster models are often used in conjunction with embedding models to perform visual searches. This is done by first using an embedding model to represent each image as a vector in a lower-dimensional space. The cluster model then uses the mathematical structure of this space to determine which images are "clustered together."
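The idea of grouping images by their position in embedding space can be sketched with a tiny k-means routine. This is only an illustration under stated assumptions: the two-dimensional vectors stand in for real embeddings, the kmeans helper is hypothetical, and nothing here reflects how the clusterer model is implemented internally.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Tiny k-means for illustration: returns (centroids, labels)."""
    # Deterministic initialization: pick k evenly spaced vectors as centroids.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Made-up 2-D "embeddings": the first three lie close together, as do the last three.
embeddings = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],
]
centroids, labels = kmeans(embeddings, k=2)
print(labels)
```

Vectors that share a label are the ones "clustered together" in this toy space.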
The cluster model type can be used in a wide range of applications, including:
- Customer segmentation in marketing: Cluster models can be used to group customers with similar purchasing behaviors, demographics, or preferences.
- Anomaly detection in network security: Cluster models can identify unusual patterns in network traffic data, helping detect potential security threats or cyberattacks. Unusual clusters can indicate unauthorized access or malicious activity.
- Document clustering in natural language processing: In textual data analysis, cluster models can group similar documents based on their content. This aids in tasks like topic modeling, content summarization, and document organization.
You may choose the clusterer model type in cases where:
- You want to perform visual searches accurately, quickly, and easily. Cluster models and embedding models do not require any labels or custom concepts to be trained. This makes them much more scalable and flexible than traditional methods for visual search, which often require a large amount of labeled data to train.
- You need a cluster model to learn new features not recognized by the existing Clarifai models. In that case, you may need to "deep fine-tune" your custom model and integrate it directly within your workflows.
- You have a custom-tailored dataset, accurate labels, and the expertise and time to fine-tune models.
Example Use Case
If you want to find all images of cats in your dataset, you can simply use the cluster model to find all images that are clustered together with the embedding of a cat image.
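Once every input has a cluster assignment, the lookup described above reduces to a filter. The sketch below assumes you have already collected a mapping from input ID to cluster ID; the IDs, the mapping, and the same_cluster_as helper are all hypothetical.

```python
def same_cluster_as(query_id, assignments):
    """Return the IDs of all inputs that share a cluster with query_id."""
    cluster_id = assignments[query_id]
    return [
        input_id
        for input_id, assigned in assignments.items()
        if assigned == cluster_id and input_id != query_id
    ]


# Hypothetical mapping of input ID -> cluster ID.
assignments = {
    "cat_01": "A",
    "cat_02": "A",
    "dog_01": "B",
    "cat_03": "A",
}
matches = same_cluster_as("cat_01", assignments)
print(matches)
```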
Create and Train a Clusterer
Let's demonstrate how to create and train a clustering model using our API.
Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.
Step 1: App Creation
Let's start by creating an app.
- Python SDK
from clarifai.client.user import User
# Replace "user_id" with your own user ID
client = User(user_id="user_id")
app = client.create_app(app_id="demo_train", base_workflow="Universal")
Step 2: Dataset Upload
Next, let’s upload the dataset that will be used to train the model to the app.
You can find the dataset we used here.
- Python SDK
import os
# Construct the path to the dataset CSV file
CSV_PATH = os.path.join(os.getcwd().split('/models/model_train')[0], 'datasets/upload/data/imdb.csv')
# Create a Clarifai dataset with the specified dataset_id
dataset = app.create_dataset(dataset_id="text_dataset")
# Upload the inputs and their labels from the CSV file
dataset.upload_from_csv(csv_path=CSV_PATH, input_type='text', csv_type='raw', labels=True)
Step 3: Model Creation
Let's list all the available trainable model types in the Clarifai platform.
- Python SDK
print(app.list_trainable_model_types())
Output
['visual-classifier',
'visual-detector',
'visual-segmenter',
'visual-embedder',
'clusterer',
'text-classifier',
'embedding-classifier',
'text-to-text']
Next, let's select the clusterer model type and use it to create a model.
- Python SDK
MODEL_ID = "model_clusterer"
MODEL_TYPE_ID = "clusterer"
# Create a model by passing the model ID and model type ID as parameters
model = app.create_model(model_id=MODEL_ID, model_type_id=MODEL_TYPE_ID)
Step 4: Patch Model (optional)
After creating a model, you can perform patch operations on it by merging, removing, or overwriting data. By default, all actions support overwriting, but they handle lists of objects in specific ways.
- The merge action updates a key:value pair with key:new_value or appends to an existing list. For dictionaries, it merges entries that share the same id field.
- The remove action is only used to delete the model's cover image on the platform UI.
- The overwrite action completely replaces an existing object with a new one.
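The difference between merging and overwriting can be illustrated on plain Python dictionaries. This is a local sketch only: merge_patch and overwrite_patch are hypothetical helpers, not SDK functions, and they do not model the id-based merging of dictionary entries or whatever the platform does on the server side.

```python
def merge_patch(current, patch):
    """Illustrative merge: scalar values are replaced, lists are appended to."""
    result = dict(current)
    for key, value in patch.items():
        if isinstance(value, list) and isinstance(result.get(key), list):
            result[key] = result[key] + value  # append to the existing list
        else:
            result[key] = value  # key:value becomes key:new_value
    return result


def overwrite_patch(current, patch):
    """Illustrative overwrite: every patched field replaces the old one."""
    result = dict(current)
    result.update(patch)
    return result


current = {"description": "old description", "languages": ["en"]}
patch = {"description": "new description", "languages": ["fr"]}

merged = merge_patch(current, patch)
overwritten = overwrite_patch(current, patch)
print(merged)
print(overwritten)
```

In this sketch, merging keeps "en" and adds "fr", while overwriting leaves only "fr".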
Below is an example of performing patch operations on a model, such as updating its description and notes.
- Python SDK
from clarifai.client.app import App
app = App(app_id="YOUR_APP_ID_HERE", user_id="YOUR_USER_ID_HERE", pat="YOUR_PAT_HERE")
# Update the details of the model
app.patch_model(model_id="model_clusterer", action="merge", description="description", notes="notes", toolkits=["OpenAI"], use_cases=["llm"], languages=["en"], image_url="https://samples.clarifai.com/metro-north.jpg")
# Remove the model's cover image by specifying the 'remove' action
app.patch_model(model_id='model_clusterer', action='remove', image_url='https://samples.clarifai.com/metro-north.jpg')
Step 5: Set Up Model Parameters
You can customize the model parameters as needed before starting the training process.
- Python SDK
# Get the params for the selected template
model_params = model.get_params()
print(model_params)
Output
{'train_params': {'base_embed_model': None,
'coarse_clusters': 32.0,
'eval_holdout_fraction': 0.2,
'query_holdout_fraction': 0.1,
'to_be_indexed_queries_fraction': 0.25,
'max_num_query_embeddings': 100.0,
'num_results_per_query': [1.0, 5.0, 10.0, 20.0],
'max_visited': 32.0,
'quota': 1000.0,
'beta': 1.0}}
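Because the parameters come back as a nested dictionary, a small guard can catch misspelled keys before any values are changed. The sketch below copies the dictionary printed above and edits it locally; override_train_params is a hypothetical helper, and applying the new values to the model is left to the SDK.

```python
def override_train_params(model_params, **overrides):
    """Return a copy of model_params with selected train_params replaced."""
    train_params = dict(model_params["train_params"])
    unknown = set(overrides) - set(train_params)
    if unknown:
        # Reject keys that are not part of the template's parameters.
        raise KeyError(f"Unknown training parameters: {sorted(unknown)}")
    train_params.update(overrides)
    return {**model_params, "train_params": train_params}


# The parameter dictionary as printed in the output above.
model_params = {
    "train_params": {
        "base_embed_model": None,
        "coarse_clusters": 32.0,
        "eval_holdout_fraction": 0.2,
        "query_holdout_fraction": 0.1,
        "to_be_indexed_queries_fraction": 0.25,
        "max_num_query_embeddings": 100.0,
        "num_results_per_query": [1.0, 5.0, 10.0, 20.0],
        "max_visited": 32.0,
        "quota": 1000.0,
        "beta": 1.0,
    }
}

updated = override_train_params(model_params, coarse_clusters=64.0)
print(updated["train_params"]["coarse_clusters"])
```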
Step 6: Initiate Model Training
To initiate the model training process, call the model.train() method. The Clarifai API also provides features for monitoring training status and saving training logs to a local file.
If the training status code returns MODEL_TRAINED (21100), it means the model has successfully completed training and is ready for use.
- Python SDK
import time

# Start the training
model_version_id = model.train()

# Check the training status every two minutes until it succeeds or fails
while True:
    status = model.training_status(version_id=model_version_id, training_logs=False)
    if status.code == 21106:  # MODEL_TRAINING_FAILED
        print(status)
        break
    elif status.code == 21100:  # MODEL_TRAINED
        print(status)
        break
    else:
        print("Current Status:", status)
        print("Waiting---")
        time.sleep(120)
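The polling loop above runs until training either succeeds or fails. One possible variation, sketched here, adds an upper bound on the waiting time. The wait_for_training helper and its parameters are hypothetical, and the demo uses stand-in status objects (with -1 as a placeholder for any in-progress code) instead of calling the API.

```python
import time
from types import SimpleNamespace

MODEL_TRAINED = 21100
MODEL_TRAINING_FAILED = 21106


def wait_for_training(get_status, poll_seconds=120, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until training ends; return True on success, False on failure."""
    waited = 0
    while True:
        status = get_status()
        if status.code == MODEL_TRAINED:
            return True
        if status.code == MODEL_TRAINING_FAILED:
            return False
        if waited >= timeout_seconds:
            raise TimeoutError(f"Training still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with stand-in statuses; with the SDK, get_status could be
# lambda: model.training_status(version_id=model_version_id, training_logs=False)
fake_statuses = iter([
    SimpleNamespace(code=-1),
    SimpleNamespace(code=-1),
    SimpleNamespace(code=MODEL_TRAINED),
])
slept = []  # records each requested pause instead of actually sleeping
succeeded = wait_for_training(lambda: next(fake_statuses), sleep=slept.append)
print(succeeded, slept)
```

Passing sleep as a parameter keeps the helper testable without real delays.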
Step 7: Model Prediction
After the model is trained and ready to use, you can run some predictions with it.
- Python SDK
TEXT = b"This is a great place to work"
# get the predictions
model_prediction = model.predict_by_bytes(TEXT, input_type="text")
print(model_prediction.outputs[0].data.clusters)
Output
[id: "22_5"
projection: 0.010116016492247581
projection: -0.035988882184028625
]
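Each entry in the output carries an id and a list of projection values, so the result can be flattened into plain Python structures for logging or storage. The sketch below uses a stand-in object with the same two fields shown above; summarize_clusters is a hypothetical helper, and no meaning is assumed for the id string or the projection values.

```python
from types import SimpleNamespace


def summarize_clusters(clusters):
    """Convert cluster objects into plain dictionaries."""
    return [
        {"id": cluster.id, "projection": list(cluster.projection)}
        for cluster in clusters
    ]


# Stand-in for model_prediction.outputs[0].data.clusters, using the values printed above.
clusters = [
    SimpleNamespace(
        id="22_5",
        projection=[0.010116016492247581, -0.035988882184028625],
    )
]
summary = summarize_clusters(clusters)
print(summary)
```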