Text Classifier
Learn about our text classifier model type
Input: Text
Output: Concepts
The text classifier is a deep fine-tuned model type designed to automatically categorize text data into predefined categories or concepts. Text classification is a common task in natural language processing (NLP) with a wide range of applications, including sentiment analysis, spam detection, and topic categorization.
The text classifier model type also comes with various templates that give you the control to choose the specific architecture used by your neural network, as well as define a set of hyperparameters you can use to fine-tune the way your model learns.
You may choose a text classifier model type in cases where:
- You need an automated way to process and categorize large amounts of textual data, enabling applications that require efficient and accurate text categorization.
- You need a text classification model to learn new features not recognized by the existing Clarifai models. In that case, you may need to "deep fine-tune" your custom model and integrate it directly within your workflows.
- You have a custom-tailored dataset, accurate labels, and the expertise and time to fine-tune models.
Example Use Case
A company wants to monitor customer sentiment towards its products by analyzing online reviews. They receive a large number of product reviews on their website and social media platforms. To efficiently understand customer opinions, they can employ a text classifier model to automatically classify these reviews as positive, negative, or neutral.
You can explore the step-by-step tutorial on fine-tuning the GPT-Neo LoRA template for text classification tasks here.
Create and Train Text Classifier
Let's demonstrate how to create and train a text classifier model using our API.
Before using the Python SDK, Node.js SDK, or any of our gRPC clients, ensure they are properly installed on your machine. Refer to their respective installation guides for instructions on how to install and initialize them.
Step 1: App Creation
Let's start by creating an app.
- Python SDK
from clarifai.client.user import User
# Replace "user_id" with your own user ID
client = User(user_id="user_id")
app = client.create_app(app_id="demo_train", base_workflow="Universal")
Step 2: Dataset Upload
Next, let’s upload the dataset that will be used to train the model to the app.
You can find the dataset we used here.
- Python SDK
import os
# Build the path to the CSV file inside the local examples folder
CSV_PATH = os.path.join(os.getcwd().split('/models/model_train')[0], 'datasets/upload/data/imdb.csv')
# Create a Clarifai dataset with the specified dataset_id
dataset = app.create_dataset(dataset_id="text_dataset")
# Upload the raw text inputs and their labels from the CSV file
dataset.upload_from_csv(csv_path=CSV_PATH, input_type='text', csv_type='raw', labels=True)
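If you want to try the upload step with your own data, the following is a minimal sketch of writing a small labeled CSV file. The column names `input` and `concepts` are an assumption about the schema that `csv_type='raw'` expects, so verify them against the SDK's dataset upload documentation. The file name and sample reviews are hypothetical; the `id-pos` and `id-neg` labels are the concept IDs used later in this walkthrough.

```python
import csv
import os
import tempfile

# Hypothetical sample rows: (review text, concept label).
rows = [
    ("This movie was a delight from start to finish.", "id-pos"),
    ("Two hours of my life I will never get back.", "id-neg"),
]


def write_reviews_csv(path, rows):
    """Write (text, label) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Assumed column names; check the SDK's expected schema before uploading.
        writer.writerow(["input", "concepts"])
        writer.writerows(rows)


# Write to a temporary folder so the sketch does not touch your project files.
csv_path = os.path.join(tempfile.mkdtemp(), "my_reviews.csv")
write_reviews_csv(csv_path, rows)

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as f:
    loaded = list(csv.DictReader(f))
print(len(loaded), "rows written to", csv_path)
```

The resulting path could then be passed as `csv_path` to `upload_from_csv`, provided the assumed column names match what the SDK expects.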
Step 3: Model Creation
Let's list all the available trainable model types in the Clarifai platform.
- Python SDK
print(app.list_trainable_model_types())
Output
['visual-classifier',
'visual-detector',
'visual-segmenter',
'visual-embedder',
'clusterer',
'text-classifier',
'embedding-classifier',
'text-to-text']
Next, let's select the text-classifier model type and use it to create a model.
- Python SDK
MODEL_ID = "model_text_classifier"
MODEL_TYPE_ID = "text-classifier"
# Create a model by passing the model ID and the model type ID as parameters
model = app.create_model(model_id=MODEL_ID, model_type_id=MODEL_TYPE_ID)
Step 4: Template Selection
Let's list all the available training templates in the Clarifai platform.
- Python SDK
print(model.list_training_templates())
Output
['HF_GPTNeo_125m_lora',
'HF_GPTNeo_2p7b_lora',
'HF_Llama_2_13b_chat_GPTQ_lora',
'HF_Llama_2_7b_chat_GPTQ_lora',
'HF_Mistral_7b_instruct_GPTQ_lora',
'HuggingFace_AdvancedConfig']
Next, let's choose the 'HuggingFace_AdvancedConfig' template to use for training our model.
- Python SDK
# get the model parameters
model_params = model.get_params(template='HuggingFace_AdvancedConfig')
Step 5: Set Up Model Parameters
You can customize the model parameters as needed before starting the training process.
- Python SDK
# get the model parameters
model_params = model.get_params(template='HuggingFace_AdvancedConfig')
concepts = [concept.id for concept in app.list_concepts()]
# update the concept field in model parameters
model.update_params(dataset_id = 'text_dataset',concepts = ["id-pos","id-neg"])
Output
{'dataset_id': 'text_dataset',
'dataset_version_id': '',
'concepts': ['id-pos', 'id-neg'],
'train_params': {'invalid_data_tolerance_percent': 5.0,
'template': 'HuggingFace_AdvancedConfig',
'model_config': {'problem_type': 'multi_label_classification',
'pretrained_model_name_or_path': 'bert-base-cased',
'torch_dtype': 'torch.float32'},
'tokenizer_config': {},
'trainer_config': {'num_train_epochs': 1.0,
'auto_find_batch_size': True,
'output_dir': 'checkpoint'}},
'inference_params': {'select_concepts': []}}
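The snippet above collects the app's concept IDs in `concepts` but passes a hard-coded list to `update_params`. A small check like the one sketched below can catch a misspelled concept ID before training starts. The helper `missing_concepts` is hypothetical (not part of the SDK), and the `available` list stands in for the result of `[concept.id for concept in app.list_concepts()]`.

```python
def missing_concepts(requested, available):
    """Return the requested concept IDs that are not available, in order."""
    available_set = set(available)
    return [c for c in requested if c not in available_set]


# Stand-in for: [concept.id for concept in app.list_concepts()]
available = ["id-pos", "id-neg"]
# The concept IDs we intend to pass to model.update_params(...)
requested = ["id-pos", "id-neg"]

missing = missing_concepts(requested, available)
if missing:
    raise ValueError(f"Concepts not found in the app: {missing}")
print("All requested concepts exist:", requested)
```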
Step 6: Initiate Model Training
To initiate the model training process, call the model.train() method. The Clarifai API also provides features for monitoring training status and saving training logs to a local file.
If the training status code is MODEL_TRAINED, the model has successfully completed training and is ready for use.
- Python SDK
import time
# Start the training
model_version_id = model.train()
# Poll the training status until it succeeds or fails
while True:
    status = model.training_status(version_id=model_version_id, training_logs=False)
    if status.code == 21106:  # MODEL_TRAINING_FAILED
        print(status)
        break
    elif status.code == 21100:  # MODEL_TRAINED
        print(status)
        break
    else:
        print("Current Status:", status)
        print("Waiting---")
        time.sleep(120)
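The loop above runs until training either succeeds or fails. If you would rather give up after a fixed amount of time, one option is to wrap the polling in a helper with a timeout. The following is a sketch only: `wait_for_training`, its parameters, and the `IN_PROGRESS` placeholder are hypothetical, while the two status codes are the ones used in the loop above. The demo uses a fake status source so it runs without network access.

```python
import time

MODEL_TRAINED = 21100          # success code used in the loop above
MODEL_TRAINING_FAILED = 21106  # failure code used in the loop above
IN_PROGRESS = -1               # placeholder for any other code, not a real Clarifai code


def wait_for_training(get_status_code, poll_seconds=120,
                      timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll get_status_code() until training ends or the timeout elapses.

    Returns True if the model trained, False if training failed,
    and raises TimeoutError if neither happened in time.
    """
    waited = 0
    while True:
        code = get_status_code()
        if code == MODEL_TRAINED:
            return True
        if code == MODEL_TRAINING_FAILED:
            return False
        if waited >= timeout_seconds:
            raise TimeoutError(f"Training still running after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status source: two in-progress polls, then success.
fake_codes = iter([IN_PROGRESS, IN_PROGRESS, MODEL_TRAINED])
result = wait_for_training(lambda: next(fake_codes),
                           poll_seconds=1, sleep=lambda seconds: None)
print("Trained:", result)
```

With the SDK objects from this walkthrough, the status source could be `lambda: model.training_status(version_id=model_version_id, training_logs=False).code`.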
Step 7: Model Prediction
After the model is trained and ready to use, you can run some predictions with it.
- Python SDK
# Getting the predictions
TEXT = b"This is a great place to work"
model_prediction = model.predict_by_bytes(TEXT, input_type="text")
# Get the output
print('Input: ',TEXT)
for concept in model_prediction.outputs[0].data.concepts:
    print(concept.id, ':', round(concept.value, 2))
Output
Input: b'This is a great place to work'
id-neg : 0.56
id-pos : 0.39
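Note that the two scores above do not sum to 1, which is consistent with the `multi_label_classification` problem type in the training parameters, where each concept is scored independently. If your application needs a single label, one possible approach is to take the highest-scoring concept and apply a cutoff. The sketch below works on plain `(id, value)` pairs; the helper `top_concept` and the 0.5 threshold are hypothetical choices, not SDK behavior.

```python
def top_concept(scores, threshold=0.5):
    """Return the (concept_id, value) pair with the highest value.

    Returns None if there are no scores or the best one is below the threshold.
    """
    if not scores:
        return None
    best = max(scores, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Scores copied from the prediction output above.
scores = [("id-neg", 0.56), ("id-pos", 0.39)]

print("Top concept:", top_concept(scores))
print("Sum of scores:", round(sum(value for _, value in scores), 2))
```

With SDK objects, the pairs could be built as `[(c.id, c.value) for c in model_prediction.outputs[0].data.concepts]`.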
Step 8: Model Evaluation
Let’s evaluate the model using both the training and test datasets. We’ll start by reviewing the evaluation metrics for the training dataset.
- Python SDK
# Evaluate the model using the specified dataset ID 'text_dataset' and evaluation ID 'one'.
model.evaluate(dataset_id='text_dataset', eval_id='one')
# Retrieve the evaluation result for the evaluation ID 'one'.
result = model.get_eval_by_id(eval_id="one")
# Print the summary of the evaluation result.
print(result.summary)
Output
macro_avg_roc_auc: 0.6499999761581421
macro_std_roc_auc: 0.07468751072883606
macro_avg_f1_score: 0.75
macro_avg_precision: 0.6000000238418579
macro_avg_recall: 0.5
Before evaluating the model on the test dataset, ensure it is uploaded using the data loader. Once uploaded, proceed with the evaluation.
- Python SDK
import os
# Build the path to the test CSV file inside the local examples folder
CSV_PATH = os.path.join(os.getcwd().split('/models/model_train')[0], 'datasets/upload/data/test_imdb.csv')
# Create a Clarifai dataset with the specified dataset_id
test_dataset = app.create_dataset(dataset_id="test_text_dataset")
# Upload the raw text inputs and their labels from the CSV file
test_dataset.upload_from_csv(csv_path=CSV_PATH, input_type='text', csv_type='raw', labels=True)
# Evaluate the model using the specified test text dataset identified as 'test_text_dataset'
# and the evaluation identifier 'two'.
model.evaluate(dataset_id='test_text_dataset', eval_id='two')
# Retrieve the evaluation result with the identifier 'two'.
result = model.get_eval_by_id("two")
# Print the summary of the evaluation result.
print(result.summary)
Output
macro_avg_roc_auc: 0.6161290407180786
macro_std_roc_auc: 0.1225806474685669
macro_avg_f1_score: 0.7207207679748535
macro_avg_precision: 0.5633803009986877
macro_avg_recall: 0.5
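A quick way to read the two summaries side by side is to subtract the test metrics from the training metrics; a large positive gap can hint at overfitting. The sketch below uses the values printed above, copied into plain dictionaries. The helper `metric_gaps` is hypothetical and only illustrates the arithmetic.

```python
# Values copied from the two evaluation summaries above.
train_summary = {
    "macro_avg_roc_auc": 0.6499999761581421,
    "macro_avg_f1_score": 0.75,
    "macro_avg_precision": 0.6000000238418579,
    "macro_avg_recall": 0.5,
}
test_summary = {
    "macro_avg_roc_auc": 0.6161290407180786,
    "macro_avg_f1_score": 0.7207207679748535,
    "macro_avg_precision": 0.5633803009986877,
    "macro_avg_recall": 0.5,
}


def metric_gaps(train, test):
    """Return train minus test for every metric present in both summaries."""
    return {name: train[name] - test[name] for name in train if name in test}


gaps = metric_gaps(train_summary, test_summary)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.4f}")
```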
Finally, to gain deeper insights into the model’s performance, use the EvalResultCompare method to compare results across multiple datasets.
- Python SDK
from clarifai.utils.evaluation import EvalResultCompare
# Creating an instance of EvalResultCompare class with specified models and datasets
eval_result = EvalResultCompare(models=[model], datasets=[dataset, test_dataset])
# Printing a detailed summary of the evaluation result
print(eval_result.detailed_summary())
Output
(   Concept  Accuracy (ROC AUC)  Total Labeled  Total Predicted  True Positives  \
0   id-pos               0.725             80                0               0
0   id-neg               0.575            120              200             120
0   id-pos               0.739             31                0               0
0   id-neg               0.494             40               71              40

   False Negatives  False Positives  Recall  Precision        F1  \
0               80                0     0.0     1.0000  0.000000
0                0               80     1.0     0.6000  0.750000
0               31                0     0.0     1.0000  0.000000
0                0               31     1.0     0.5634  0.720737

              Dataset
0       text_dataset2
0       text_dataset2
0  test_text_dataset3
0  test_text_dataset3  ,
                Total Concept  Accuracy (ROC AUC)  Total Labeled  \
0       Dataset:text_dataset2            0.650000            200
0  Dataset:test_text_dataset3            0.616129             71

   Total Predicted  True Positives  False Negatives  False Positives   Recall  \
0              200             120               80               80  0.60000
0               71              40               31               31  0.56338

   Precision        F1
0   0.760000  0.670588
0   0.754028  0.644909  )
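To see how the per-concept columns in the first table relate to each other, the sketch below recomputes precision, recall, and F1 from the true positive, false positive, and false negative counts of the two `text_dataset2` rows. It uses the standard definitions; treating precision as 1.0 when nothing was predicted is an assumption made here to match the `id-pos` row, not documented SDK behavior. The counts also suggest that every input in that dataset was predicted as `id-neg`, since `id-pos` has 0 total predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    predicted = tp + fp  # everything the model assigned to this concept
    labeled = tp + fn    # everything actually labeled with this concept
    # Assumed convention: precision is 1.0 when the concept was never predicted.
    precision = tp / predicted if predicted else 1.0
    recall = tp / labeled if labeled else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (true positives, false positives, false negatives) from the text_dataset2 rows.
rows = {
    "id-pos": (0, 0, 80),
    "id-neg": (120, 80, 0),
}

results = {concept: precision_recall_f1(*counts) for concept, counts in rows.items()}
for concept, (p, r, f1) in results.items():
    print(f"{concept}: precision={p:.4f} recall={r:.1f} f1={f1:.6f}")
```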