Introduction to Multiclass Text Classification with LLMs
Multiclass text classification (MTC) is a natural language processing (NLP) task where text is categorized into multiple predefined categories or classes. Traditional approaches rely on training machine learning models, requiring labeled data and iterative fine-tuning. However, with the advent of large language models (LLMs), this task can now be approached differently. Instead of building and training a custom model, we can utilize pre-trained LLMs to classify text using carefully designed prompts, allowing rapid deployment with minimal data requirements and enabling flexibility to adjust classes without retraining.
Approaches for MTC-LLM
In MTC-LLM, we generally have two main approaches for utilizing LLMs to achieve classification.
Single Classifier with a Multi-Class Prompt
Using a single LLM prompt for multi-class text classification involves providing a single, comprehensive prompt that instructs the model on all possible classes, expecting it to classify the text into one of these categories. This approach is simple and straightforward, as it requires only one prompt, making implementation fast and computationally efficient. It also reduces costs, as each classification requires just one LLM call, saving on both usage costs and processing time.
However, this approach has notable limitations. When classes are similar, the model may struggle to make precise distinctions, reducing accuracy in nuanced tasks. Additionally, handling all categories within a single prompt can lead to lengthy and complex instructions, which may introduce ambiguity and diminish the model’s reliability. Another critical drawback is the approach’s inability to detect hierarchical relationships within a taxonomy; without recognizing these layers, the model may miss important contextual distinctions between classes that depend on hierarchical categorization.
Hierarchical Sequence of Binary Classifiers
The hierarchical sequence of binary classifiers approach structures classification as a decision tree, where each node represents a binary decision point. Starting from the top node, the model proceeds through a series of binary classifications, with each LLM call determining whether the text belongs to a specific class. This process continues down the hierarchy until a final classification is achieved.
This method provides high accuracy since each binary decision allows the model to make precise, focused choices, which is particularly valuable for distinguishing among nuanced classes. It is also highly adaptable to complex hierarchies, accommodating cases where broad classes may require further subclass distinctions for an accurate classification.
However, this approach comes with increased costs and latency, as multiple LLM calls are needed to reach a final classification, making it more expensive and time-consuming. Additionally, managing this approach requires structuring and maintaining numerous prompts and class definitions, adding to its complexity. For use cases where accuracy is prioritized over cost—such as in high-stakes applications like customer service—this hierarchical method is generally the recommended approach.
Example Use Case: Intent Detection for Airline Customer Service
Let’s consider an airline company using an automated system to respond to customer emails. The goal is to detect the intent behind each email accurately, enabling the system to route the message to the appropriate department or generate a relevant response. This system leverages a hierarchical sequence of binary classifiers, providing a structured approach to intent detection. At each level of the hierarchy, binary classifiers assess whether a specific intent is present, progressively narrowing down the scope of inquiry to arrive at a precise classification.
High-Level Intent Classification
At the first stage of the hierarchy, the system categorizes emails into high-level intents to streamline processing and ensure accurate responses. These high-level intents include:
General Queries
This intent captures broad, information-seeking emails unrelated to specific complaints or actions. These emails are generally routed to informational workflows or knowledge bases, allowing for automated responses with the required details. Examples include:
- What is your baggage policy for international flights
- Can you provide details about your frequent flyer program?
- What are the requirements for traveling with a pet?
Booking Issues
Emails under this intent are related to the booking process or flight details. These emails are generally routed to booking support workflows, where sub-classification helps further refine the action required, such as new bookings, modifications, or cancellations.
Examples include:
- I want to book a flight to London for next month
- Can you confirm my ticket for flight number ABC123
- I need to reschedule my flight due to a personal emergency.
Customer Complaints
This category identifies emails expressing dissatisfaction or grievances. These emails are prioritized for customer service escalation, ensuring timely resolution and acknowledgment. Examples include:
- My flight was delayed, and I missed my connection.
- I was charged twice for the same ticket.
- The in-flight entertainment system was not working.
Refund Requests
This category is specific to emails where customers request refunds for canceled flights, overcharges, or other issues. These emails are routed to the refund processing team, where workflows validate the claim and initiate the refund process. Examples include:
- I canceled my flight last week but haven’t received my refund yet
- I was overcharged for baggage fees. Please issue a refund
Special Assistance Requests
Emails in this category pertain to special accommodations or requests from passengers. These are routed to workflows that handle special services and ensure the requests are appropriately addressed.
Examples include:
- I need wheelchair assistance at the airport
- Can you provide a meal suitable for someone with a gluten allergy
- I’m traveling with a child and need a bassinet seat
Lost and Found Inquiries
This intent captures emails related to lost items or baggage issues. These emails are routed to the airline’s lost and found or baggage resolution teams.
Examples include:
- I left my laptop on flight XYZ123. How can I retrieve it?
- My checked luggage did not arrive at my destination
- I need to report a lost wallet at the airport
Hierarchical Sub-Classification
Once the high-level intent is identified, a second layer of binary classifiers operates within each category to refine the classification further. For example:
Booking Issues Sub-Classifiers
- New Bookings
- Modifications to Existing Bookings
- Cancellations
Customer Complaints Sub-Classifiers
- Flight Delays
- Billing Issues
- Service Quality
Refund Requests Sub-Classifiers
- Flight Cancellations
- Baggage Fees
- Duplicate Charges
Special Assistance Requests Sub-Classifiers
- Mobility Assistance
- Dietary Preferences
- Family Travel Needs
Lost and Found Sub-Classifiers
- Lost Items in Cabin
- Missing Baggage
- Items Lost at the Airport
Benefits of this Approach
Scalability
The hierarchical design enables seamless addition of new intents or sub-intents as customer needs evolve, without disrupting the existing classification framework.
Efficiency
By filtering out irrelevant categories at each stage, the system minimizes computational overhead and ensures that only relevant workflows are triggered for each email.
Improved Accuracy
Binary classification simplifies the decision-making process, leading to higher precision and recall compared to a flat multiclass classifier.
Enhanced Customer Experience
Automated responses tailored to specific intents ensure quicker resolutions and more accurate handling of customer inquiries, enhancing overall satisfaction.
Cost-Effectiveness
Automating intent detection reduces reliance on human intervention for routine tasks, freeing up resources for more complex customer service needs.
By categorizing emails into high-level intents like general queries, booking issues, complaints, refunds, special assistance requests, and lost and found inquiries, this automated system ensures efficient routing and resolution. Hierarchical sub-classification adds an extra layer of precision, enabling the airline to deliver fast, accurate, and customer-centric responses while optimizing operational efficiency.
The table below is a representation of the complete taxonomy of the intent detection system organized into primary and secondary intents. This taxonomy enables the chatbot to understand and respond more accurately to customer intents, from broad categories down to specific, actionable concerns. Each level helps direct the inquiry to the appropriate team or resource for faster, more effective resolution.
Level | Category | Sub-Category |
High-Level Intent | General Queries | |
Sub-Intent | General Queries | Baggage Policy |
Sub-Intent | General Queries | Frequent Flyer Program |
Sub-Intent | General Queries | Travel with Pets |
High-Level Intent | Booking Issues | |
Sub-Intent | Booking Issues | New Bookings |
Sub-Intent | Booking Issues | Modifications to Existing Bookings |
Sub-Intent | Booking Issues | Cancellations |
High-Level Intent | Customer Complaints | |
Sub-Intent | Customer Complaints | Flight Delays |
Sub-Intent | Customer Complaints | Billing Issues |
Sub-Intent | Customer Complaints | Service Quality |
High-Level Intent | Refund Requests | |
Sub-Intent | Refund Requests | Flight Cancellations |
Sub-Intent | Refund Requests | Baggage Fees |
Sub-Intent | Refund Requests | Duplicate Charges |
High-Level Intent | Special Assistance Requests | |
Sub-Intent | Special Assistance Requests | Mobility Assistance |
Sub-Intent | Special Assistance Requests | Dietary Preferences |
Sub-Intent | Special Assistance Requests | Family Travel Needs |
High-Level Intent | Lost and Found Inquiries | |
Sub-Intent | Lost and Found Inquiries | Lost Items in Cabin |
Sub-Intent | Lost and Found Inquiries | Missing Baggage |
Sub-Intent | Lost and Found Inquiries | Items Lost at the Airport |
The diagram below provides a depiction of this architecture.
Prompt Structure for a Binary Classifier
Here’s a sample structure for a binary classifier prompt, where the LLM determines if a customer message is related to a Booking Inquiry.
You are an AI language model tasked with classifying whether a customer’s message to the Acme airline company is a “BOOKING INQUIRY.”
Definition:
A “BOOKING INQUIRY” is a message that directly involves:
- Booking a flight: Questions or assistance requests about reserving a new flight.
- Modifying a reservation: Any request to change an existing booking, such as altering dates, times, destinations, or passenger details.
- Managing a reservation: Tasks like seat selection, cancellations, refunds, or upgrading class, which are tied to the customer’s reservation.
- Resolving issues related to booking: Problems like errors in the booking process, confirmation issues, or requests for help with travel-related arrangements.
Messages must demonstrate a clear and specific relationship to these areas to qualify as “BOOKING INQUIRY.” General questions about unrelated travel aspects (e.g., baggage fees, flight status, or policies) are classified as “NOT A BOOKING INQUIRY.”
Instructions (Chain-of-Thought Process):
For each customer message, follow this reasoning process:
- Step 1: Understand the Context – Read the message carefully. If the message is in a language other than English, translate it to English first for proper analysis.
- Step 2: Identify Booking-Related Keywords or Phrases – Look for keywords or phrases related to booking (e.g., “book a flight,” “cancel reservation,” “change my seat”). Determine if the message is directly addressing the reservation process or related issues.
- Step 3: Match to Definition – Compare the content of the message to the definition of “BOOKING INQUIRY.” Determine if it fits one of the following categories:
- Booking a flight
- Modifying an existing reservation
- Managing or resolving booking-related issues
- Step 4: Evaluate Confidence Level – Decide if the message aligns strongly with the definition and the criteria for “BOOKING INQUIRY.” If there is ambiguity or insufficient information classify it as “NOT A BOOKING INQUIRY.”
- Step 5: Provide a Clear Explanation – Based on your analysis, explain your decision in step-by-step reasoning, ensuring the classification is well-justified.
Examples:
Positive Examples:
- Input Message – “I’d like to change my seat for my flight next week.”
Decision: true
Reasoning: The message explicitly mentions “change my seat,” which is directly related to modifying a reservation. It aligns with the definition of “BOOKING INQUIRY” as it involves managing a booking.
- Input Message – “Can I cancel my reservation and get a refund?”
Decision: true
Reasoning: The message includes “cancel my reservation” and “get a refund,” which are part of managing an existing booking. This request is a clear match with the definition of “BOOKING INQUIRY.”
Negative Examples:
- Input Message: “How much does it cost to add extra baggage?”
Decision: false
Reasoning: The message asks about baggage costs, which relates to general travel policies rather than reservations or bookings. There is no indication of booking, modifying, or managing a reservation.
- Input Message: “What’s the delay on flight AA123?”
Decision: false
Reasoning: The message focuses on the status of a flight, not the reservation or booking process. It does not meet the definition of “BOOKING INQUIRY.”
Output: Provide your classification output in the following JSON format:
{
“decision”: true/false,
“reasoning”: “Step-by-step reasoning for the decision.”
}
Example Code for Binary Classifier Using boto3 and Bedrock
In this section, we’ll explore a Python script that implements hierarchical intent detection on user messages by interfacing with a language model (LLM) via AWS Bedrock runtime. The script is designed for flexibility and can be customized to work with other LLM frameworks.
“`python
import json
import boto3
from pathlib import Path
from typing import List
def get_prompt(intent: str) -> str:
“””
Retrieve the prompt template for a given intent from the ‘prompts’ directory.
Assumes that prompt files are stored in a ‘./prompts/’ directory relative to this file,
and that the filenames are in the format ‘{INTENT}-prompt.txt’, e.g., ‘GENERAL_QUERIES-prompt.txt’.
Parameters:
intent (str): The intent for which to retrieve the prompt template.
Returns:
str: The content of the prompt template file corresponding to the specified intent.
“””
# Determine the path to the ‘prompts’ directory relative to this file.
project_root = Path(file).parent
full_path = project_root / “prompts”
# Open and read the prompt file for the specified intent.
with open(full_path / f”{intent}-prompt.txt”) as file:
prompt = file.read()
return prompt
def intent_detection(message: str, decision_list: List[str]) -> str:
“””
Recursively detect the intent of a message by querying an LLM (Anthropic Claude v2).
This function iterates over a list of intents, formats a prompt for each,
and queries the LLM to determine if the message matches the intent.
If a match is found, it may recursively check for more specific sub-intents.
Assumptions:
- The prompts explicitly ask the model to return a ‘decision’ with a single response: True or False in JSON format.
Example: {‘decision’: True}
- The prompts contain a variable called ‘input_text’ that is formatted with the user’s message.
- If the model is not able to detect the intent, it will return ‘UNKNOWN’.
Parameters:
message (str): The user’s message for which to detect the intent.
decision_list (List[str]): A list of intent names to evaluate.
Returns:
str: The detected intent name, or ‘UNKNOWN’ if no intent is matched.
“””
The Future of Big Data
With some guidance, you can craft a data platform that is right for your organization’s needs and gets the most return from your data capital.
# Create a client for AWS Bedrock runtime to interact with the LLM.
client = boto3.client(“bedrock-runtime”, region_name=”us-east-1″)
for intent in decision_list:
# Retrieve and format the prompt template with the user’s message.
prompt_template = get_prompt(intent)
prompt = prompt_template.format(input_text=message)
# Construct the request body for the LLM API call.
body = json.dumps(
{
“anthropic_version”: “bedrock-2023-05-31”,
“max_tokens”: 4096,
“temperature”: 0.0,
“messages”: [
{
“role”: “user”,
“content”: [
{“type”: “text”, “text”: prompt}
]
}
]
}
)
# Invoke the LLM model with the constructed body.
raw_response = client.invoke_model(
modelId=”anthropic.claude-3-5-sonet-20240620-v1:0″,
body=body
)
# Read and parse the response from the LLM.
response = raw_response.get(“body”).read()
response_body = json.loads(response)
llm_text_response = response_body.get(“content”)[0].get(“text”)
# Parse the LLM’s text response to JSON.
llm_response_json = json.loads(llm_text_response)
# Check if the LLM decided that the message matches the current intent.
if llm_response_json.get(“decision”, False):
transitional_intent = intent
break # Exit the loop as we’ve found a matching intent.
else:
# If not matched, set the transitional intent to ‘UNKNOWN’.
transitional_intent = “UNKNOWN”
# Define the root intents that may have more specific sub-intents.
root_intents = [“GENERAL_QUERIES”, “BOOKING_ISSUES”, “CUSTOMER_COMPLAINTS”]
# If a matching root intent is found, recursively check for more specific intents.
if transitional_intent in root_intents:
# Mapping of root intents to their related sub-intents.
intent_definition = {
“GENERAL_QUERIES_related_intents”: [
“DESTINATION_INFORMATION”,
“LOYALTY_PROGRAM_DETAILS”,
“FLIGHT_SCHEDULES”,
“AIRLINE_POLICIES”,
“CHECK_IN_PROCEDURES”,
“IN_FLIGHT_SERVICES”,
“CANCELLATION_POLICY”
],
“BOOKING_ISSUES_related_intents”: [
“FLIGHT_CHANGE”,
“SEAT_SELECTION”,
“BAGGAGE”
],
“CUSTOMER_COMPLAINTS_related_intents”: [
“DELAY”,
“SERVICE_DISSATISFACTION”,
“SAFETY_CONCERNS”
]
}
# Recursively call intent_detection with the related sub-intents.
return intent_detection(
message,
intent_definition.get(f”{transitional_intent}_related_intents”)
)
else:
# Return the detected intent or ‘UNKNOWN’ if none matched.
return transitional_intent
def main(message: str) -> str:
“””
Main function to initiate intent detection on a user’s message.
Parameters:
message (str): The user’s message for which to detect the intent.
Returns:
str: The detected intent name, or ‘UNKNOWN’ if no intent is matched.
“””
# Start intent detection with the root intents.
return intent_detection(
message=message,
decision_list=[
“GENERAL_QUERIES”,
“BOOKING_ISSUES”,
“CUSTOMER_COMPLAINTS”
]
)
if name == “main“:
message = “””\
Hello,
I’m planning to travel next month and wanted to ask about your airline’s policies. Could you please provide information on:
Your refund and cancellation policies.
Rules regarding carrying liquids or other restricted items.
Any COVID-19 safety measures still in place.
Looking forward to your response.
“””
print(main(message=message))
This module is part of an automated email processing system designed to analyze customer messages, detect their intent, and generate structured responses based on the analysis. The system employs a large language model API to perform Natural Language Processing (NLP), classifying emails into primary intents such as “General Queries,” “Booking Issues,” or “Customer Complaints.”
Evaluation Guidelines
To comprehensively evaluate the performance of a hierarchical sequence of binary classifiers for multiclass text classification using LLMs, a well-constructed ground truth dataset is critical. This dataset should be meticulously designed to serve multiple purposes, ensuring both the overall system and individual classifiers are assessed accurately.
Dataset Design Considerations
- Balanced Dataset for Overall Evaluation:
The ground truth dataset must encompass a balanced representation of all intent categories to evaluate the system holistically. This enables the calculation of critical overall metrics such as accuracy, macro-precision, macro-recall, and micro-precision. A balanced dataset ensures that no specific category disproportionately influences these metrics, providing a fair measure of the system’s performance across all intents.
- Per-Classifier Evaluation:
Each binary classifier in the hierarchy should also be evaluated individually. To achieve this, the dataset must contain balanced positive and negative samples for each classifier. This balance is essential to calculate metrics such as accuracy, precision, recall, and F1-score for each individual classifier, enabling targeted performance analysis and iterative improvements at every level of the hierarchy.
- Negative Sample Creation:
Designing negative samples is a critical aspect of the dataset preparation process. Negative samples should be created using common sense principles to reflect real-world scenarios accurately:
- Diversity: Negative samples should be diverse to simulate various input conditions, preventing classifiers from overfitting to narrow definitions of “positive” and “negative” examples.
- Relevance for Lower-Level Classifiers: For classifiers deeper in the hierarchy, negative samples need not include examples from unrelated categories. For instance, in a “Flight Change” classifier, negative samples can exclude intents related to “Safety Concerns” or “In-Flight Entertainment.” This specificity helps avoid unnecessary complexity and confusion, focusing the classifier on its immediate decision boundary.
Metrics for Evaluation
- Overall System Metrics:
- Accuracy: The ratio of correctly classified samples to total samples, indicating the system’s general performance.
- Macro and Micro Precision & Recall: Macro metrics weigh each class equally, providing insights into system performance for underrepresented categories. Micro metrics, on the other hand, weigh classes proportionally to their sample sizes, offering a perspective on system performance for frequently occurring categories.
- Classifier-Level Metrics:
- Each binary classifier must be evaluated independently using accuracy, precision, recall, and F1-score. These metrics help pinpoint weaknesses in individual classifiers, which can then be addressed through retraining, hyperparameter tuning, or data augmentation.
- Cost per Classification:
- Tracking the computational or financial cost per classification is vital, especially in scenarios where resource efficiency is a priority. This metric helps balance the trade-off between model performance and operational budget constraints.
Additional Considerations
- Dataset Size:
- The dataset should be large enough to capture variations in intent expressions while ensuring each classifier receives sufficient positive and negative samples for robust training and evaluation.
- Data Augmentation:
- Techniques such as paraphrasing, synonym replacement, or noise injection can be employed to expand the dataset and improve classifier generalization.
- Cross-Validation:
- Employing techniques like k-fold cross-validation can ensure that the evaluation metrics are not biased by a specific train-test split, providing a more reliable assessment of the system’s performance.
- Real-World Testing:
- In addition to ground truth datasets, testing the system on real-world, unstructured data can reveal gaps in performance and help fine-tune classifiers to handle practical scenarios effectively.
By adhering to these principles, the evaluation process will yield a thorough understanding of both the end-to-end system’s performance and the individual strengths and weaknesses of each classifier, guiding data-driven refinements and ensuring robust, scalable deployment.
Additional Best Practices for Multiclass Text Classification Using LLMs
Prompt Caching
Prompt caching is a powerful technique for improving efficiency and reducing latency in applications with repeated queries or predictable user interactions. By caching prompts and their corresponding LLM-generated outputs, systems can avoid redundant API calls, thereby improving response times and lowering operational costs.
- Implementation Across Popular LLM Suites
- Anthropic: Anthropic’s models support prompt caching is done by marking specific parts of your prompt—such as tool definitions, system instructions, or lengthy context—with the cache_control parameter in your API requests. For example, you might include the entire text of a book in your prompt and cache it, allowing you to ask multiple questions about the text without reprocessing it each time. To enable this feature, include the header anthropic-beta: prompt-caching-2024-07-31 in your API calls, as prompt caching is currently in beta. By structuring your prompts with static content at the beginning and dynamic, user-specific content at the end, and by strategically marking cacheable sections, you can optimize performance, reduce latency, and lower operational costs when working with Anthropic’s language models.
- ChatGPT (OpenAI): To implement OpenAI’s Prompt Caching and optimize your application’s performance, structure your prompts so that static or repetitive content—like system prompts and common instructions—is placed at the beginning, while dynamic, user-specific information is appended at the end. This setup leverages exact prefix matching, increasing the likelihood of cache hits for prompts longer than 1,024 tokens. When the prefix of a prompt matches a cached entry, the system reuses the cached processing results, reducing latency by up to 80% and cutting costs by 50% for lengthy prompts. The caching mechanism operates automatically, requiring no additional code changes, and is specific to your organization to maintain data privacy. Cached prompts remain active for 5 to 10 minutes of inactivity and can persist up to an hour during off-peak periods. By following these implementation strategies, you can enhance API efficiency and reduce operational costs when interacting with OpenAI’s language models.
- Gemini (Google): Context caching in the Gemini API enables you to reduce processing time and costs by caching large input tokens that are reused across multiple requests. To implement this, you first upload your content (such as large documents or files) using the Files API. Then, you create a cache with a specified Time to Live (TTL) using the CachedContent.create() method, which stores the tokenized content for a duration you choose. When generating responses, you construct a GenerativeModel that references this cached content, allowing the model to access the cached tokens without reprocessing them. This is particularly effective for applications like chatbots with extensive system instructions or repetitive analysis tasks, as it minimizes redundant token processing and optimizes overall performance.
- Best Practices for Implementing Caching with Large Language Models (LLMs):
- Structure Prompts Effectively:
- Static Content First: Place static or repetitive content—such as system prompts, instructions, context, or examples—at the beginning of your prompt.
- Dynamic Content Last: Append variable or user-specific information at the end. This increases the likelihood of cache hits due to exact prefix matching.
- Leverage Exact Prefix Matching:
- Ensure that the cached sections of your prompts are identical across requests. Even minor differences can prevent cache hits.
- Use consistent formatting, wording, and structure for the static parts of your prompts.
- Utilize Caching for Long Prompts:
- Caching is most beneficial for prompts that exceed certain token thresholds (e.g., 1,024 tokens).
- For lengthy prompts with repetitive elements, caching can significantly reduce latency and cost.
- Mark Cacheable Sections Appropriately:
- Use available API features (such as cache_control parameters or specific headers) to designate cacheable sections in your prompts.
- Clearly define cache boundaries to optimize caching efficiency.
- Set Appropriate Time to Live (TTL):
- Adjust the TTL based on how frequently the cached content is accessed.
- Longer TTLs are advantageous for content that is reused often, while shorter TTLs prevent stale data in dynamic environments.
- Be Mindful of Model and API Constraints:
- Ensure that you’re using models that support caching features.
- Be aware of minimum token counts and other limitations specific to the LLM you’re using.
- Understand Pricing and Cost Implications:
- Familiarize yourself with the pricing model for caching, including any costs for cache writes, reads, and storage duration.
- Balance the cost of caching against the benefits of reduced processing time and lower per-request costs.
- Handle Cache Invalidation and Updates:
- Implement mechanisms to update or invalidate caches when the underlying content changes.
- Be prepared to handle cache misses gracefully by processing the full prompt when necessary.
Temperature Settings
The temperature parameter is critical in controlling the randomness and creativity of an LLM’s output.
Low Temperature (e.g., 0.2)
A low temperature setting makes the model’s outputs more deterministic by prioritizing higher-probability tokens. This is ideal for:
- Classification-oriented tasks requiring consistent responses.
- Scenarios where factual accuracy is critical.
- Narrow decision boundaries, such as binary classifiers in the hierarchy.
High Temperature (e.g., 0.8–1.0)
Higher temperature settings introduce more randomness, making the model explore diverse possibilities. This is useful for:
- Generating creative text, brainstorming ideas, or handling ambiguous inputs.
- Scenarios where the intent is not well-defined and may benefit from exploratory responses.
Best Practices for Multiclass Hierarchies
- Use low temperatures for top-level binary classifiers where intent boundaries are clear.
- Experiment with slightly higher temperatures for ambiguous or nuanced intent categories to capture edge cases during evaluation phases.
Adding Reasoning to the Prompt
Encouraging LLMs to reason step-by-step improves their ability to handle ambiguous or complex cases. This can be achieved by explicitly prompting the model to break down the classification process. For instance:
- Use phrases like “First, analyze the input for relevant keywords. Then, decide the most appropriate intent based on the following rules.”
- This approach helps mitigate errors in cases where multiple intents may appear similar by providing a logical framework for decision-making.
Prompt Optimization with Meta-Prompting
Meta-prompts are prompts about prompts. They guide the LLM to follow specific rules or adhere to structured formats for better interpretability and accuracy. Examples include:
- Defining constraints, such as “Respond only with ‘Yes’ or ‘No.'”
- Setting explicit rules, such as “If the input mentions scheduling changes, classify as ‘Flight Change.'”
- Clarifying ambiguous instructions, such as “If unsure, classify as ‘Miscellaneous’ and provide an explanation.”
Fine-Tuning Other Key LLM Parameters
- Max Tokens – Control the length of the output to avoid excessive verbosity or truncation. For classification tasks, limit the tokens to the minimal response necessary (e.g., “Yes,” “No,” or a concise class label).
- Top-p Sampling (Nucleus Sampling) – Instead of selecting tokens based on temperature alone, top-p sampling chooses from a subset of tokens whose cumulative probability adds up to a specified threshold. For deterministic tasks, set top-p close to 0.9 to balance precision and diversity.
- Stop Sequences – Define stop sequences to terminate outputs gracefully, ensuring outputs do not contain unnecessary or irrelevant continuations.
Iterative Prompt Refinement
Iterative prompt refinement is a crucial process for continuously improving the performance of LLMs in hierarchical multiclass classification tasks. By systematically analyzing errors, refining prompts, and validating changes, you can ensure the system evolves to handle complex and ambiguous scenarios more effectively. A structured “prompt refinement pipeline” can greatly enhance this process by combining meta-prompts and ground truth datasets for evaluation.
The Prompt Refinement Pipeline
A prompt refinement pipeline is an automated or semi-automated framework that systematically refines, tests, and evaluates prompts. It consists of the following components:
- Meta-Prompt for Refinement
Use an LLM itself to refine existing prompts by generating more concise, effective, or logically robust alternatives. A meta-prompt asks the model to analyze and improve a given prompt. For example:
- Input Meta-Prompt:
- “The following prompt is used for a binary classifier in a hierarchical text classification task. Suggest improvements to make it more specific, avoid ambiguity, and handle edge cases better. Also, propose an explanation for why your suggestions improve the prompt. Current prompt: [insert prompt].”
- Output: The model may suggest rewording, adding explicit constraints, or including step-by-step reasoning logic. These suggestions can then be iteratively tested.
- Ground Truth Dataset for Evaluation
Use a ground truth dataset to validate refined prompts against pre-labeled examples. This ensures that improvements suggested by the meta-prompt are objectively tested. Key steps include:
- Evaluate the refined prompt on classification accuracy, precision, recall, and F1-score using the ground truth dataset.
- Compare these metrics against the original prompt to ensure genuine improvement.
- Use misclassified examples to further identify weaknesses and refine prompts iteratively.
- Automated Testing and Feedback Loop
Implement an automated system to:
- Test the refined prompt on a validation set.
- Log performance metrics, including correct classifications, errors, and cases where ambiguity persists.
- Highlight specific prompts or input types that consistently underperform for further manual refinement.
- Version Control and Experimentation
Maintain a version-controlled repository for prompts. Track:
- Changes made during each refinement cycle.
- Associated performance metrics.
- Rationale behind prompt modifications. This documentation provides a knowledge base for future refinements and prevents regressions.
Benefits of a Prompt Refinement Pipeline
- Systematic Improvement
A structured approach ensures refinements are not ad hoc but are guided by data-driven insights and measurable results.
- Scalability
By automating key aspects of the refinement process, the pipeline scales effectively with larger datasets and more complex classification hierarchies.
- Model-Agnostic
The pipeline can be used with various LLMs, such as Anthropic’s models, OpenAI’s ChatGPT, or Google Gemini. This flexibility enables organizations to adopt or switch LLM providers without losing the benefits of the refinement process.
- Increased Robustness
Leveraging ground truth datasets ensures that prompts are evaluated on real-world examples, helping the model handle diverse and ambiguous scenarios with greater reliability.
- Meta-Prompt Benefits
Meta-prompts provide an efficient mechanism to leverage LLM capabilities for self-improvement. By incorporating LLM-generated suggestions, the system continuously evolves in response to new challenges or requirements.
- Error Analysis
The feedback loop enables a focused analysis of misclassifications, guiding the creation of targeted prompts that address specific failure cases or edge conditions.
Iterative Workflow for Prompt Refinement Pipeline
Baseline Testing – Start with an initial prompt and evaluate it on the ground truth dataset. Log performance metrics.
Meta-Prompt Refinement – Use a meta-prompt to generate improved versions of the initial prompt. Select the most promising refinement.
Validation and Comparison – Test the refined prompt on the dataset, comparing results to the baseline. Identify improvements and areas where performance remains suboptimal.
Targeted Refinements – For consistently misclassified samples, manually analyze and refine the prompt further. Re-evaluate until significant performance gains are achieved.
Deployment and Monitoring- Deploy the improved prompt into production and monitor real-world performance. Incorporate newly encountered edge cases into subsequent iterations of the refinement pipeline.
A prompt refinement pipeline provides a robust framework for systematically improving the performance of LLMs in hierarchical multiclass classification tasks. By combining meta-prompts, ground truth datasets, and automated evaluation, this approach ensures continuous improvement, scalability, and adaptability to new challenges, resulting in a more reliable and efficient classification system.
References
- For further reading on MTC-LLM, the following papers and blogs provide valuable insights
- Brown, T. B., et al. (2020). “Language Models are Few-Shot Learners.” *NeurIPS*
- OpenAI. “Best Practices for Prompt Engineering with GPT-4.”
- Anthropic. “Building Reliable Classification with Claude.”
- https://www.vellum.ai/llm-parameters-guide