Multiclass Text Classification Using LLM (MTC-LLM): A Comprehensive Guide

by Luis Pacheco and Uday Yallapragada

Introduction to Multiclass Text Classification with LLMs

Multiclass text classification (MTC) is a natural language processing (NLP) task in which each text is assigned to one of several predefined categories, or classes. Traditional approaches rely on training machine learning models, which requires labeled data and iterative fine-tuning. With the advent of large language models (LLMs), the task can be approached differently: instead of building and training a custom model, we can use pre-trained LLMs to classify text with carefully designed prompts. This allows rapid deployment with minimal data requirements and lets us adjust classes without retraining.

Approaches for MTC-LLM 

In MTC-LLM, we generally have two main approaches for utilizing LLMs to achieve classification. 

Single Classifier with a Multi-Class Prompt 

Using a single LLM prompt for multi-class text classification involves providing a single, comprehensive prompt that instructs the model on all possible classes, expecting it to classify the text into one of these categories. This approach is simple and straightforward, as it requires only one prompt, making implementation fast and computationally efficient. It also reduces costs, as each classification requires just one LLM call, saving on both usage costs and processing time. 

However, this approach has notable limitations. When classes are similar, the model may struggle to make precise distinctions, reducing accuracy in nuanced tasks. Additionally, handling all categories within a single prompt can lead to lengthy and complex instructions, which may introduce ambiguity and diminish the model’s reliability. Another critical drawback is the approach’s inability to detect hierarchical relationships within a taxonomy; without recognizing these layers, the model may miss important contextual distinctions between classes that depend on hierarchical categorization. 
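As a rough illustration of the single-prompt approach, the sketch below builds one prompt that lists every class and maps the model's reply back to a known label. The function names, the class descriptions, and the prompt wording are assumptions made for this example, and the LLM call itself is left out.

```python
from typing import Dict


def build_multiclass_prompt(classes: Dict[str, str], message: str) -> str:
    """Build one prompt that lists every class and asks for exactly one label."""
    class_lines = "\n".join(f"- {name}: {desc}" for name, desc in classes.items())
    return (
        "Classify the customer message into exactly one of these classes.\n"
        f"{class_lines}\n"
        "Respond with the class name only.\n\n"
        f"Message: {message}"
    )


def parse_label(response_text: str, classes: Dict[str, str]) -> str:
    """Map the raw model reply to a known class, or 'UNKNOWN' if it is not one."""
    label = response_text.strip().upper()
    return label if label in classes else "UNKNOWN"


# Hypothetical class definitions, for illustration only.
CLASSES = {
    "GENERAL_QUERIES": "broad, information-seeking questions",
    "BOOKING_ISSUES": "new bookings, modifications, or cancellations",
    "CUSTOMER_COMPLAINTS": "dissatisfaction or grievances",
}

prompt = build_multiclass_prompt(CLASSES, "I want to change my flight date.")
print(prompt)
```

Every class description shares the same prompt, so the prompt grows with the number of classes, which is the source of the length and ambiguity concerns described above.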

Hierarchical Sequence of Binary Classifiers 

The hierarchical sequence of binary classifiers approach structures classification as a decision tree, where each node represents a binary decision point. Starting from the top node, the model proceeds through a series of binary classifications, with each LLM call determining whether the text belongs to a specific class. This process continues down the hierarchy until a final classification is achieved. 

This method provides high accuracy since each binary decision allows the model to make precise, focused choices, which is particularly valuable for distinguishing among nuanced classes. It is also highly adaptable to complex hierarchies, accommodating cases where broad classes may require further subclass distinctions for an accurate classification. 

However, this approach comes with increased costs and latency, as multiple LLM calls are needed to reach a final classification, making it more expensive and time-consuming. Additionally, managing this approach requires structuring and maintaining numerous prompts and class definitions, adding to its complexity. For use cases where accuracy is prioritized over cost—such as in high-stakes applications like customer service—this hierarchical method is generally the recommended approach. 
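The traversal described in this section can be sketched as a loop over a small tree, counting one call per binary decision to make the cost trade-off visible. This is a minimal sketch under stated assumptions: the two-level `TAXONOMY`, the `classify` function, and the `keyword_stub` that stands in for the LLM call are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical two-level taxonomy: root intent -> sub-intents.
TAXONOMY: Dict[str, List[str]] = {
    "BOOKING_ISSUES": ["NEW_BOOKINGS", "MODIFICATIONS", "CANCELLATIONS"],
    "CUSTOMER_COMPLAINTS": ["FLIGHT_DELAYS", "BILLING_ISSUES", "SERVICE_QUALITY"],
}


def classify(message: str, is_match: Callable[[str, str], bool]) -> Tuple[List[str], int]:
    """Walk the tree with yes/no checks; return the matched path and the call count."""
    path: List[str] = []
    calls = 0
    candidates = list(TAXONOMY)  # start at the root intents
    while candidates:
        matched = None
        for intent in candidates:
            calls += 1  # one binary decision corresponds to one LLM call
            if is_match(message, intent):
                matched = intent
                break
        if matched is None:
            break  # nothing matched at this level
        path.append(matched)
        candidates = TAXONOMY.get(matched, [])  # descend, or stop at a leaf
    return path, calls


def keyword_stub(message: str, intent: str) -> bool:
    """Stand-in for the LLM call: a crude keyword check, for illustration only."""
    keywords = {
        "BOOKING_ISSUES": "booking",
        "CANCELLATIONS": "cancel",
        "CUSTOMER_COMPLAINTS": "unhappy",
        "FLIGHT_DELAYS": "delay",
    }
    keyword = keywords.get(intent)
    return keyword is not None and keyword in message.lower()


path, calls = classify("Please cancel my booking for Friday.", keyword_stub)
print(path, calls)
```

In this example the message needs four binary decisions (one at the root level, three among the sub-intents) where a single multi-class prompt would need one call.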

Example Use Case: Intent Detection for Airline Customer Service 

Let’s consider an airline company using an automated system to respond to customer emails. The goal is to detect the intent behind each email accurately, enabling the system to route the message to the appropriate department or generate a relevant response. This system leverages a hierarchical sequence of binary classifiers, providing a structured approach to intent detection. At each level of the hierarchy, binary classifiers assess whether a specific intent is present, progressively narrowing down the scope of inquiry to arrive at a precise classification. 

 High-Level Intent Classification 

At the first stage of the hierarchy, the system categorizes emails into high-level intents to streamline processing and ensure accurate responses. These high-level intents include: 

General Queries – This intent captures broad, information-seeking emails unrelated to specific complaints or actions. These emails are generally routed to informational workflows or knowledge bases, allowing for automated responses with the required details.

Booking Issues – Emails under this intent are related to the booking process or flight details. These emails are generally routed to booking support workflows, where sub-classification helps further refine the action required, such as new bookings, modifications, or cancellations.

Customer Complaints – This category identifies emails expressing dissatisfaction or grievances. These emails are prioritized for customer service escalation, ensuring timely resolution and acknowledgment.

Refund Requests – This category is specific to emails where customers request refunds for canceled flights, overcharges, or other issues. These emails are routed to the refund processing team, where workflows validate the claim and initiate the refund process.

Special Assistance Requests – Emails in this category pertain to special accommodations or requests from passengers. These are routed to workflows that handle special services and ensure the requests are appropriately addressed.

Lost and Found Inquiries – This intent captures emails related to lost items or baggage issues. These emails are routed to the airline’s lost and found or baggage resolution teams. 

Hierarchical Sub-Classification 

Once the high-level intent is identified, a second layer of binary classifiers operates within each category to refine the classification further. For example: 

Booking Issues Sub-Classifiers 

  •    New Bookings 
  •   Modifications to Existing Bookings   
  •    Cancellations   

Customer Complaints Sub-Classifiers  

  •    Flight Delays   
  •    Billing Issues   
  •    Service Quality   

Refund Requests Sub-Classifiers 

  •    Flight Cancellations   
  •    Baggage Fees   
  •    Duplicate Charges   

Special Assistance Requests Sub-Classifiers 

  •    Mobility Assistance   
  •    Dietary Preferences   
  •    Family Travel Needs   

Lost and Found Sub-Classifiers  

  •    Lost Items in Cabin   
  •    Missing Baggage   
  •    Items Lost at the Airport   

Benefits of this Approach 

 Scalability – The hierarchical design enables seamless addition of new intents or sub-intents as customer needs evolve, without disrupting the existing classification framework. 

Efficiency – By filtering out irrelevant categories at each stage, the system minimizes computational overhead and ensures that only relevant workflows are triggered for each email.

Improved Accuracy – Binary classification simplifies the decision-making process, leading to higher precision and recall compared to a flat multiclass classifier.

Enhanced Customer Experience – Automated responses tailored to specific intents ensure quicker resolutions and more accurate handling of customer inquiries, enhancing overall satisfaction.

Cost-Effectiveness – Automating intent detection reduces reliance on human intervention for routine tasks, freeing up resources for more complex customer service needs. 

By categorizing emails into high-level intents like general queries, booking issues, complaints, refunds, special assistance requests, and lost and found inquiries, this automated system ensures efficient routing and resolution. Hierarchical sub-classification adds an extra layer of precision, enabling the airline to deliver fast, accurate, and customer-centric responses while optimizing operational efficiency. 

The table below presents the complete taxonomy of the intent detection system, organized into primary and secondary intents. This taxonomy enables the system to understand and respond more accurately to customer intents, from broad categories down to specific, actionable concerns. Each level helps direct the inquiry to the appropriate team or resource for faster, more effective resolution.

 

| Level | Category | Sub-Category |
|---|---|---|
| High-Level Intent | General Queries | |
| Sub-Intent | General Queries | Baggage Policy |
| Sub-Intent | General Queries | Frequent Flyer Program |
| Sub-Intent | General Queries | Travel with Pets |
| High-Level Intent | Booking Issues | |
| Sub-Intent | Booking Issues | New Bookings |
| Sub-Intent | Booking Issues | Modifications to Existing Bookings |
| Sub-Intent | Booking Issues | Cancellations |
| High-Level Intent | Customer Complaints | |
| Sub-Intent | Customer Complaints | Flight Delays |
| Sub-Intent | Customer Complaints | Billing Issues |
| Sub-Intent | Customer Complaints | Service Quality |
| High-Level Intent | Refund Requests | |
| Sub-Intent | Refund Requests | Flight Cancellations |
| Sub-Intent | Refund Requests | Baggage Fees |
| Sub-Intent | Refund Requests | Duplicate Charges |
| High-Level Intent | Special Assistance Requests | |
| Sub-Intent | Special Assistance Requests | Mobility Assistance |
| Sub-Intent | Special Assistance Requests | Dietary Preferences |
| Sub-Intent | Special Assistance Requests | Family Travel Needs |
| High-Level Intent | Lost and Found Inquiries | |
| Sub-Intent | Lost and Found Inquiries | Lost Items in Cabin |
| Sub-Intent | Lost and Found Inquiries | Missing Baggage |
| Sub-Intent | Lost and Found Inquiries | Items Lost at the Airport |
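One way to hold the taxonomy from the table in code is a plain dictionary keyed by high-level intent. The sketch below is only one possible representation; the names `INTENT_TAXONOMY` and `parent_of` are assumptions made for this example.

```python
from typing import Dict, List

# Taxonomy from the table above: high-level intent -> sub-intents.
INTENT_TAXONOMY: Dict[str, List[str]] = {
    "General Queries": ["Baggage Policy", "Frequent Flyer Program", "Travel with Pets"],
    "Booking Issues": ["New Bookings", "Modifications to Existing Bookings", "Cancellations"],
    "Customer Complaints": ["Flight Delays", "Billing Issues", "Service Quality"],
    "Refund Requests": ["Flight Cancellations", "Baggage Fees", "Duplicate Charges"],
    "Special Assistance Requests": [
        "Mobility Assistance",
        "Dietary Preferences",
        "Family Travel Needs",
    ],
    "Lost and Found Inquiries": [
        "Lost Items in Cabin",
        "Missing Baggage",
        "Items Lost at the Airport",
    ],
}


def parent_of(sub_intent: str) -> str:
    """Return the high-level intent that owns a sub-intent, or 'UNKNOWN'."""
    for parent, children in INTENT_TAXONOMY.items():
        if sub_intent in children:
            return parent
    return "UNKNOWN"


print(parent_of("Missing Baggage"))
```

Keeping the taxonomy in one structure means a new intent or sub-intent is a data change, which fits the scalability benefit described earlier.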

 

The diagram below provides a depiction of this architecture. 

 

 

[Figure: MTC-LLM architecture diagram]

Prompt Structure for a Binary Classifier 

Here’s a sample structure for a binary classifier prompt, where the LLM determines if a customer message is related to a Booking Inquiry. 

You are an AI language model tasked with classifying whether a customer's message to the Acme airline company is a "BOOKING INQUIRY."  

Definition: 

A "BOOKING INQUIRY" is a message that directly involves: 

Booking a flight: Questions or assistance requests about reserving a new flight. 
Modifying a reservation: Any request to change an existing booking, such as altering dates, times, destinations, or passenger details. 
Managing a reservation: Tasks like seat selection, cancellations, refunds, or upgrading class, which are tied to the customer's reservation. 
Resolving issues related to booking: Problems like errors in the booking process, confirmation issues, or requests for help with travel-related arrangements. 

Messages must demonstrate a clear and specific relationship to these areas to qualify as "BOOKING INQUIRY." General questions about unrelated travel aspects (e.g., baggage fees, flight status, or policies) are classified as "NOT A BOOKING INQUIRY." 

Instructions (Chain-of-Thought Process): 

For each customer message, follow this reasoning process: 

Step 1: Understand the Context - Read the message carefully. If the message is in a language other than English, translate it to English first for proper analysis. 
Step 2: Identify Booking-Related Keywords or Phrases - Look for keywords or phrases related to booking (e.g., "book a flight," "cancel reservation," "change my seat"). Determine if the message is directly addressing the reservation process or related issues. 
Step 3: Match to Definition - Compare the content of the message to the definition of "BOOKING INQUIRY." Determine if it fits one of the following categories: 
Booking a flight 
Modifying an existing reservation 
Managing or resolving booking-related issues 
Step 4: Evaluate Confidence Level - Decide if the message aligns strongly with the definition and the criteria for "BOOKING INQUIRY." If there is ambiguity or insufficient information, classify it as "NOT A BOOKING INQUIRY."
Step 5: Provide a Clear Explanation - Based on your analysis, explain your decision in step-by-step reasoning, ensuring the classification is well-justified. 

Examples: 

Positive Examples: 

Input Message: "I’d like to change my seat for my flight next week."
Decision: true 
Reasoning: The message explicitly mentions "change my seat," which is directly related to modifying a reservation. It aligns with the definition of "BOOKING INQUIRY" as it involves managing a booking. 

Input Message: "Can I cancel my reservation and get a refund?"
Decision: true 
Reasoning: The message includes "cancel my reservation" and "get a refund," which are part of managing an existing booking. This request is a clear match with the definition of "BOOKING INQUIRY." 

Negative Examples: 

Input Message: "How much does it cost to add extra baggage?" 
Decision: false 
Reasoning: The message asks about baggage costs, which relates to general travel policies rather than reservations or bookings. There is no indication of booking, modifying, or managing a reservation. 

Input Message: "What’s the delay on flight AA123?" 
Decision: false 
Reasoning: The message focuses on the status of a flight, not the reservation or booking process. It does not meet the definition of "BOOKING INQUIRY." 

Output: Provide your classification output in the following JSON format:
{
  "decision": true/false,
  "reasoning": "Step-by-step reasoning for the decision."
}
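Models sometimes wrap the requested JSON in extra text, so it can help to extract and validate the object before acting on it. The sketch below is one defensive way to read the output format requested by the prompt; `parse_decision` is a hypothetical helper, not part of any library.

```python
import json
from typing import Any, Dict


def parse_decision(llm_text: str) -> Dict[str, Any]:
    """Extract the {"decision", "reasoning"} object from a model reply."""
    start = llm_text.find("{")
    end = llm_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    # Parse only the span between the first '{' and the last '}'.
    parsed = json.loads(llm_text[start : end + 1])
    # The prompt asks for a boolean, so reject strings such as "true".
    if not isinstance(parsed.get("decision"), bool):
        raise ValueError("'decision' must be true or false")
    return parsed


reply = 'Here is my answer:\n{"decision": true, "reasoning": "Mentions changing a seat."}\nDone.'
print(parse_decision(reply)["decision"])
```

Raising an error on a malformed reply, rather than silently treating it as false, lets the caller decide whether to retry or route the message to a fallback.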

 

 

Example Code for Binary Classifier Using boto3 and Bedrock 

This section provides a Python script that implements hierarchical intent detection on user messages by calling a large language model (LLM) through the AWS Bedrock runtime. The script is designed for flexibility and can be adapted to other LLM frameworks.

This module is part of an automated email processing system designed to analyze customer messages, detect their intent, and generate structured responses based on the analysis. The system employs a large language model API to perform Natural Language Processing (NLP), classifying emails into primary intents such as “General Queries,” “Booking Issues,” or “Customer Complaints.”

```python 

import json 
import boto3 
from pathlib import Path 
from typing import List 

def get_prompt(intent: str) -> str: 

    """ 
    Retrieve the prompt template for a given intent from the 'prompts' directory. 
    Assumes that prompt files are stored in a './prompts/' directory relative to this file, 
    and that the filenames are in the format '{INTENT}-prompt.txt', e.g., 'GENERAL_QUERIES-prompt.txt'. 

    Parameters: 
        intent (str): The intent for which to retrieve the prompt template. 
 
    Returns: 
        str: The content of the prompt template file corresponding to the specified intent. 
    """ 

    # Determine the path to the 'prompts' directory relative to this file. 
    project_root = Path(__file__).parent 
    full_path = project_root / "prompts" 

 
    # Open and read the prompt file for the specified intent. 
    with open(full_path / f"{intent}-prompt.txt") as file: 
        prompt = file.read() 

    return prompt 

 

def intent_detection(message: str, decision_list: List[str]) -> str: 

    """ 
    Recursively detects the intent of a message by querying an LLM. 
    This function iterates over a list of intents, formats a prompt for each, 
    and queries the LLM to determine if the message matches the intent. 
    If a match is found, it may recursively check for more specific sub-intents.  

    Parameters: 
        message (str): The user's message for which to detect the intent. 
        decision_list (List[str]): A list of intent names to evaluate. 

    Returns: 
        str: The detected intent name, or 'UNKNOWN' if no intent is matched. 
    """ 

    # Create a client for AWS Bedrock runtime to interact with the LLM. 
    client = boto3.client("bedrock-runtime", region_name="us-east-1") 

    # Default result when decision_list is empty or no intent matches.
    transitional_intent = "UNKNOWN"

    for intent in decision_list:

        # Retrieve and format the prompt template with the user's message. 
        prompt_template = get_prompt(intent) 
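        # Use str.replace rather than str.format: the template contains literal
        # JSON braces, which str.format would try to interpret as fields.
        # Assumes the template includes an {input_text} placeholder.
        prompt = prompt_template.replace("{input_text}", message)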


        # Construct the request body for the LLM API call. 
        body = json.dumps( 
            { 
                "anthropic_version": "bedrock-2023-05-31", 
                "max_tokens": 4096, 
                "temperature": 0.0, 
                "messages": [ 
                    { 
                        "role": "user", 
                        "content": [ 
                            {"type": "text", "text": prompt} 
                        ] 
                    } 
                ] 
            } 
        ) 

        # Invoke the LLM model with the constructed body. 
        raw_response = client.invoke_model( 
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
            body=body 
        ) 

        # Read and parse the response from the LLM. 
        response = raw_response.get("body").read() 
        response_body = json.loads(response) 
        llm_text_response = response_body.get("content")[0].get("text") 

        # Parse the LLM's text response to JSON. 
        llm_response_json = json.loads(llm_text_response) 

        # Check if the LLM decided that the message matches the current intent. 
        if llm_response_json.get("decision", False): 
            transitional_intent = intent 
            break  # Exit the loop as we've found a matching intent. 
        else: 
            # If not matched, set the transitional intent to 'UNKNOWN'. 
            transitional_intent = "UNKNOWN" 

 
    # Define the root intents that may have more specific sub-intents. 
    root_intents = ["GENERAL_QUERIES", "BOOKING_ISSUES", "CUSTOMER_COMPLAINTS"] 

    # If a matching root intent is found, recursively check for more specific intents. 
    if transitional_intent in root_intents: 

        # Mapping of root intents to their related sub-intents. 
        intent_definition = { 
            "GENERAL_QUERIES_related_intents": [ 
                "DESTINATION_INFORMATION", 
                "LOYALTY_PROGRAM_DETAILS", 
                "FLIGHT_SCHEDULES", 
                "AIRLINE_POLICIES", 
                "CHECK_IN_PROCEDURES", 
                "IN_FLIGHT_SERVICES", 
                "CANCELLATION_POLICY" 
            ], 

            "BOOKING_ISSUES_related_intents": [ 
                "FLIGHT_CHANGE", 
                "SEAT_SELECTION", 
                "BAGGAGE" 
            ], 

            "CUSTOMER_COMPLAINTS_related_intents": [ 
                "DELAY", 
                "SERVICE_DISSATISFACTION", 
                "SAFETY_CONCERNS" 
            ] 
        } 

        # Recursively call intent_detection with the related sub-intents. 
        return intent_detection( 
            message, 
            intent_definition.get(f"{transitional_intent}_related_intents") 
        ) 

    else: 
        # Return the detected intent or 'UNKNOWN' if none matched. 
        return transitional_intent 
 

def main(message: str) -> str: 

    """ 
    Main function to initiate intent detection on a user's message. 
    Parameters: 
        message (str): The user's message for which to detect the intent.  
    Returns: 
        str: The detected intent name, or 'UNKNOWN' if no intent is matched. 
    """ 

    # Start intent detection with the root intents. 

    return intent_detection( 
        message=message, 
        decision_list=[ 
            "GENERAL_QUERIES", 
            "BOOKING_ISSUES", 
            "CUSTOMER_COMPLAINTS" 
        ] 
    ) 

if __name__ == "__main__": 
    message = """\ 
Hello, 
I'm planning to travel next month and wanted to ask about your airline's policies. Could you please provide information on: 
Your refund and cancellation policies. 
Rules regarding carrying liquids or other restricted items. 
Any COVID-19 safety measures still in place. 
Looking forward to your response. 
    """ 
    print(main(message=message))
```

 

Evaluation Guidelines 

To comprehensively evaluate the performance of a hierarchical sequence of binary classifiers for multiclass text classification using LLMs, a well-constructed ground truth dataset is critical. This dataset should be meticulously designed to serve multiple purposes, ensuring both the overall system and individual classifiers are assessed accurately. 

Dataset Design Considerations 

  • Balanced Dataset for Overall Evaluation: The ground truth dataset must encompass a balanced representation of all intent categories to evaluate the system holistically. This enables the calculation of critical overall metrics such as accuracy, macro-precision, macro-recall, and micro-precision. A balanced dataset ensures that no specific category disproportionately influences these metrics, providing a fair measure of the system’s performance across all intents.
  • Per-Classifier Evaluation: Each binary classifier in the hierarchy should also be evaluated individually. To achieve this, the dataset must contain balanced positive and negative samples for each classifier. This balance is essential to calculate metrics such as accuracy, precision, recall, and F1-score for each individual classifier, enabling targeted performance analysis and iterative improvements at every level of the hierarchy.
  • Negative Sample Creation: Designing negative samples is a critical aspect of the dataset preparation process. Negative samples should be created using common sense principles to reflect real-world scenarios accurately: 
    • Diversity: Negative samples should be diverse to simulate various input conditions, preventing classifiers from overfitting to narrow definitions of “positive” and “negative” examples. 
    • Relevance for Lower-Level Classifiers: For classifiers deeper in the hierarchy, negative samples need not include examples from unrelated categories. For instance, in a “Flight Change” classifier, negative samples can exclude intents related to “Safety Concerns” or “In-Flight Entertainment.” This specificity helps avoid unnecessary complexity and confusion, focusing the classifier on its immediate decision boundary. 
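The per-classifier balancing and negative-sample guidance above can be sketched as a small helper that draws negatives only from sibling intents. This is a minimal sketch assuming the labeled data is a list of (text, label) pairs; `build_eval_set` and the sample messages are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def build_eval_set(
    labeled: Sequence[Tuple[str, str]],
    target: str,
    siblings: Sequence[str],
    seed: int = 0,
) -> List[Tuple[str, bool]]:
    """Return a balanced list of (text, is_positive) pairs for one binary classifier.

    Negatives come only from sibling intents, so unrelated categories
    are excluded from a lower-level classifier's evaluation set.
    """
    positives = [text for text, label in labeled if label == target]
    negative_pool = [text for text, label in labeled if label in siblings]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    k = min(len(positives), len(negative_pool))  # equal counts on both sides
    chosen_pos = rng.sample(positives, k)
    chosen_neg = rng.sample(negative_pool, k)
    return [(t, True) for t in chosen_pos] + [(t, False) for t in chosen_neg]


labeled = [
    ("Move my flight to Monday", "FLIGHT_CHANGE"),
    ("Change my departure date", "FLIGHT_CHANGE"),
    ("I want an aisle seat", "SEAT_SELECTION"),
    ("Can I pick seat 12A?", "SEAT_SELECTION"),
    ("How many bags can I check?", "BAGGAGE"),
    ("The cabin smelled of smoke", "SAFETY_CONCERNS"),
]
eval_set = build_eval_set(labeled, "FLIGHT_CHANGE", ["SEAT_SELECTION", "BAGGAGE"])
print(eval_set)
```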

Metrics for Evaluation 

  • Overall System Metrics: 
    • Accuracy: The ratio of correctly classified samples to total samples, indicating the system’s general performance. 
    • Macro and Micro Precision & Recall: Macro metrics weigh each class equally, providing insights into system performance for underrepresented categories. Micro metrics, on the other hand, weigh classes proportionally to their sample sizes, offering a perspective on system performance for frequently occurring categories. 
  • Classifier-Level Metrics: 
    • Each binary classifier must be evaluated independently using accuracy, precision, recall, and F1-score. These metrics help pinpoint weaknesses in individual classifiers, which can then be addressed by refining the prompt, revising its few-shot examples, or adjusting generation parameters.
  • Cost per Classification: 
    • Tracking the computational or financial cost per classification is vital, especially in scenarios where resource efficiency is a priority. This metric helps balance the trade-off between model performance and operational budget constraints. 
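The overall system metrics listed above can be computed from predicted and true labels in a few lines. The sketch below uses only the standard library; the `evaluate` function and the toy labels are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List


def evaluate(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Compute accuracy plus macro and micro precision/recall for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precisions = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    total_tp = sum(tp.values())
    return {
        "accuracy": ratio(total_tp, len(y_true)),
        # Macro: average the per-class scores, so each class counts equally.
        "macro_precision": ratio(sum(precisions), len(labels)),
        "macro_recall": ratio(sum(recalls), len(labels)),
        # Micro: pool the counts, so frequent classes dominate.
        "micro_precision": ratio(total_tp, total_tp + sum(fp.values())),
        "micro_recall": ratio(total_tp, total_tp + sum(fn.values())),
    }


y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

When every sample has exactly one true and one predicted label, micro precision and micro recall both equal accuracy, so the macro figures carry the extra information about underrepresented classes.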

Additional Considerations 

  • Dataset Size: The dataset should be large enough to capture variations in intent expressions while ensuring each classifier receives sufficient positive and negative samples for robust prompt development and evaluation.
  • Data Augmentation: Techniques such as paraphrasing, synonym replacement, or noise injection can be employed to expand the dataset and improve classifier generalization. 
  • Cross-Validation:  Employing techniques like k-fold cross-validation can ensure that the evaluation metrics are not biased by a specific train-test split, providing a more reliable assessment of the system’s performance. 
  • Real-World Testing:  In addition to ground truth datasets, testing the system on real-world, unstructured data can reveal gaps in performance and help fine-tune classifiers to handle practical scenarios effectively. 

By adhering to these principles, the evaluation process will yield a thorough understanding of both the end-to-end system’s performance and the individual strengths and weaknesses of each classifier, guiding data-driven refinements and ensuring robust, scalable deployment. 

Additional Best Practices for Multiclass Text Classification Using LLMs 

Prompt Caching 

Prompt caching is a powerful technique for improving efficiency and reducing latency in applications that send the same prompt content repeatedly. By caching the processed form of a repeated prompt prefix, the provider avoids reprocessing the same tokens on every call, which improves response times and lowers operational costs.

Implementation Across Popular LLM Suites

  • Anthropic: Anthropic’s models support prompt caching, which is enabled by marking specific parts of your prompt (such as tool definitions, system instructions, or lengthy context) with the cache_control parameter in your API requests. For example, you might include the entire text of a book in your prompt and cache it, allowing you to ask multiple questions about the text without reprocessing it each time. While the feature was in beta, it required the header anthropic-beta: prompt-caching-2024-07-31 in API calls; check the current Anthropic documentation to see whether the header is still needed. By structuring your prompts with static content at the beginning and dynamic, user-specific content at the end, and by strategically marking cacheable sections, you can optimize performance, reduce latency, and lower operational costs when working with Anthropic’s language models.
  • ChatGPT (OpenAI): To implement OpenAI’s Prompt Caching and optimize your application’s performance, structure your prompts so that static or repetitive content—like system prompts and common instructions—is placed at the beginning, while dynamic, user-specific information is appended at the end. This setup leverages exact prefix matching, increasing the likelihood of cache hits for prompts longer than 1,024 tokens. When the prefix of a prompt matches a cached entry, the system reuses the cached processing results, reducing latency by up to 80% and cutting costs by 50% for lengthy prompts. The caching mechanism operates automatically, requiring no additional code changes, and is specific to your organization to maintain data privacy. Cached prompts remain active for 5 to 10 minutes of inactivity and can persist up to an hour during off-peak periods. By following these implementation strategies, you can enhance API efficiency and reduce operational costs when interacting with OpenAI’s language models. 
  • Gemini (Google): Context caching in the Gemini API enables you to reduce processing time and costs by caching large input tokens that are reused across multiple requests. To implement this, you first upload your content (such as large documents or files) using the Files API. Then, you create a cache with a specified Time to Live (TTL) using the CachedContent.create() method, which stores the tokenized content for a duration you choose. When generating responses, you construct a GenerativeModel that references this cached content, allowing the model to access the cached tokens without reprocessing them. This is particularly effective for applications like chatbots with extensive system instructions or repetitive analysis tasks, as it minimizes redundant token processing and optimizes overall performance. 

Best Practices for Implementing Caching with Large Language Models (LLMs):
  • Structure Prompts Effectively 
    • Static Content First: Place static or repetitive content—such as system prompts, instructions, context, or examples—at the beginning of your prompt. 
    • Dynamic Content Last: Append variable or user-specific information at the end. This increases the likelihood of cache hits due to exact prefix matching. 
  • Leverage Exact Prefix Matching 
    • Ensure that the cached sections of your prompts are identical across requests. Even minor differences can prevent cache hits. 
    • Use consistent formatting, wording, and structure for the static parts of your prompts. 
  • Utilize Caching for Long Prompts
    • Caching is most beneficial for prompts that exceed certain token thresholds (e.g., 1,024 tokens). 
    • For lengthy prompts with repetitive elements, caching can significantly reduce latency and cost. 
  • Mark Cacheable Sections Appropriately 
    • Use available API features (such as cache_control parameters or specific headers) to designate cacheable sections in your prompts. 
    • Clearly define cache boundaries to optimize caching efficiency. 
  • Set Appropriate Time to Live (TTL)
    • Adjust the TTL based on how frequently the cached content is accessed. 
    • Longer TTLs are advantageous for content that is reused often, while shorter TTLs prevent stale data in dynamic environments. 
  • Be Mindful of Model and API Constraints
    • Ensure that you’re using models that support caching features. 
    • Be aware of minimum token counts and other limitations specific to the LLM you’re using. 
  • Understand Pricing and Cost Implications: 
    • Familiarize yourself with the pricing model for caching, including any costs for cache writes, reads, and storage duration. 
    • Balance the cost of caching against the benefits of reduced processing time and lower per-request costs. 
  • Handle Cache Invalidation and Updates: 
    • Implement mechanisms to update or invalidate caches when the underlying content changes. 
    • Be prepared to handle cache misses gracefully by processing the full prompt when necessary. 
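The "static content first, dynamic content last" guidance can be sketched as a request payload in which the shared classifier instructions are marked cacheable and only the customer message changes. This is a minimal sketch assuming the layout of Anthropic's Messages API (a `system` list of text blocks with a `cache_control` field); `build_cached_request` is a hypothetical helper, the model name is a placeholder, and the field layout should be verified against the current provider documentation.

```python
from typing import Any, Dict


def build_cached_request(model: str, static_instructions: str, message: str) -> Dict[str, Any]:
    """Build a request body with a cacheable static prefix and a dynamic suffix."""
    return {
        "model": model,
        "max_tokens": 512,
        "temperature": 0.0,
        # Static part: identical across requests, marked as cacheable.
        "system": [
            {
                "type": "text",
                "text": static_instructions,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Dynamic part: the customer message goes last, after the cached prefix.
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": message}]}
        ],
    }


instructions = "You are classifying whether a message is a BOOKING INQUIRY. ..."
first = build_cached_request("your-model-id", instructions, "Can I change my seat?")
second = build_cached_request("your-model-id", instructions, "Where is my bag?")
print(first["system"] == second["system"])
```

Because the two requests share a byte-identical system block, the second one is a candidate for a cache hit, provided the prefix meets the provider's minimum token threshold.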

Temperature Settings

The temperature parameter is critical in controlling the randomness and creativity of an LLM’s output. 
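To make the effect concrete, the sketch below applies temperature scaling to a set of made-up token scores and converts them to probabilities with a softmax. The logits are hypothetical, and real APIs typically treat a temperature of 0 as greedy selection rather than dividing by zero, so this function rejects non-positive values.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At the lower temperature almost all of the probability mass sits on the top-scoring token, which is why low settings give more consistent classification output.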

Low Temperature (e.g., 0.2) 

A low temperature setting makes the model’s outputs more deterministic by prioritizing higher-probability tokens. This is ideal for: 

  • Classification-oriented tasks requiring consistent responses. 
  • Scenarios where factual accuracy is critical. 
  • Narrow decision boundaries, such as binary classifiers in the hierarchy.

High Temperature (e.g., 0.8–1.0)

Higher temperature settings introduce more randomness, making the model explore diverse possibilities. This is useful for: 

  • Generating creative text, brainstorming ideas, or handling ambiguous inputs. 
  • Scenarios where the intent is not well-defined and may benefit from exploratory responses.

Best Practices for Multiclass Hierarchies

  • Use low temperatures for top-level binary classifiers where intent boundaries are clear.
  • Experiment with slightly higher temperatures for ambiguous or nuanced intent categories to capture edge cases during the evaluation phase.

Adding Reasoning to the Prompt

Encouraging LLMs to reason step-by-step improves their ability to handle ambiguous or complex cases. This can be achieved by explicitly prompting the model to break down the classification process. For instance: 

  • Use phrases like “First, analyze the input for relevant keywords. Then, decide the most appropriate intent based on the following rules.” 
  • This approach helps mitigate errors in cases where multiple intents may appear similar by providing a logical framework for decision-making. 

Prompt Optimization with Meta-Prompting 

Meta-prompts are prompts about prompts. They guide the LLM to follow specific rules or adhere to structured formats for better interpretability and accuracy. Examples include: 

  • Defining constraints, such as “Respond only with ‘Yes’ or ‘No.’”
  • Setting explicit rules, such as “If the input mentions scheduling changes, classify as ‘Flight Change.’”
  • Clarifying ambiguous instructions, such as “If unsure, classify as ‘Miscellaneous’ and provide an explanation.” 

Fine-Tuning Other Key LLM Parameters 

  • Max Tokens – Control the length of the output to avoid excessive verbosity or truncation. For classification tasks, limit the tokens to the minimal response necessary (e.g., “Yes,” “No,” or a concise class label). 
  • Top-p Sampling (Nucleus Sampling) – Instead of relying on temperature alone, top-p sampling restricts the choice to the smallest set of most probable tokens whose cumulative probability reaches a specified threshold. Lower values make outputs more deterministic; a value around 0.9 keeps some diversity, so for strictly deterministic classification prefer a lower top-p or a temperature of 0.
  • Stop Sequences – Define stop sequences to terminate outputs gracefully, ensuring outputs do not contain unnecessary or irrelevant continuations. 
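The two less familiar parameters, top-p and stop sequences, can be illustrated on toy data. The following is a simplified sketch rather than how any particular provider implements them; the function names and the example token probabilities are assumptions.

```python
def nucleus_filter(token_probs, top_p):
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    # Renormalize so the kept tokens form a valid distribution
    return {token: prob / total for token, prob in kept.items()}

def truncate_at_stop(text, stop_sequences):
    """Cut the text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Hypothetical next-token probabilities for a Yes/No classifier
probs = {"Yes": 0.6, "No": 0.3, "Maybe": 0.1}
narrow = nucleus_filter(probs, 0.5)   # only the top token survives
wider = nucleus_filter(probs, 0.85)   # the top two tokens survive
label = truncate_at_stop("Yes\nExplanation: the user asked to reschedule.", ["\n"])
```

A lower threshold leaves fewer candidate tokens, and a newline stop sequence keeps only the label when the model appends an explanation.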

Iterative Prompt Refinement 

Iterative prompt refinement is a crucial process for continuously improving the performance of LLMs in hierarchical multiclass classification tasks. By systematically analyzing errors, refining prompts, and validating changes, you can ensure the system evolves to handle complex and ambiguous scenarios more effectively. A structured “prompt refinement pipeline” can greatly enhance this process by combining meta-prompts and ground truth datasets for evaluation. 

The Prompt Refinement Pipeline 

A prompt refinement pipeline is an automated or semi-automated framework that systematically refines, tests, and evaluates prompts. It consists of the following components: 

Meta-Prompt for Refinement 

Use an LLM itself to refine existing prompts by generating more concise, effective, or logically robust alternatives. A meta-prompt asks the model to analyze and improve a given prompt. For example: 

  • Input Meta-Prompt: 
    • “The following prompt is used for a binary classifier in a hierarchical text classification task. Suggest improvements to make it more specific, avoid ambiguity, and handle edge cases better. Also, propose an explanation for why your suggestions improve the prompt. Current prompt: [insert prompt].” 
  • Output: The model may suggest rewording, adding explicit constraints, or including step-by-step reasoning logic. These suggestions can then be iteratively tested. 
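In code, this step amounts to filling the meta-prompt with the current prompt and collecting a few candidate rewrites. The sketch below assumes a hypothetical `llm` callable that takes a prompt string and returns the model's text; the function name and the default number of candidates are illustrative choices.

```python
META_PROMPT = (
    "The following prompt is used for a binary classifier in a hierarchical "
    "text classification task. Suggest improvements to make it more specific, "
    "avoid ambiguity, and handle edge cases better. Also, propose an explanation "
    "for why your suggestions improve the prompt. Current prompt: {prompt}"
)

def suggest_refinements(current_prompt, llm, n_candidates=3):
    """Ask the model for several candidate rewrites of the current prompt.

    `llm` is a hypothetical callable: it receives a prompt string and
    returns the model's response as a string.
    """
    request = META_PROMPT.format(prompt=current_prompt)
    return [llm(request) for _ in range(n_candidates)]
```

Requesting several candidates makes sense only when the model is sampled with some randomness; each candidate is then evaluated separately before one is adopted.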
Ground Truth Dataset for Evaluation 

Use a ground truth dataset to validate refined prompts against pre-labeled examples. This ensures that improvements suggested by the meta-prompt are objectively tested. Key steps include: 

  • Evaluate the refined prompt on classification accuracy, precision, recall, and F1-score using the ground truth dataset. 
  • Compare these metrics against the original prompt to ensure genuine improvement. 
  • Use misclassified examples to further identify weaknesses and refine prompts iteratively. 
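The metrics named above can be computed with a few lines of plain Python. This is a minimal sketch with an assumed function name and toy labels; in practice a library such as scikit-learn would usually be used instead.

```python
def classification_report(y_true, y_pred):
    """Compute accuracy plus per-label precision, recall, and F1, and macro F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_label = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(labels)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_label": per_label}

# Toy ground truth and predictions from a hypothetical refined prompt
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
report = classification_report(y_true, y_pred)
```

Running the same function on the original and the refined prompt's predictions gives two reports that can be compared metric by metric.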
Automated Testing and Feedback Loop 

Implement an automated system to: 

  • Test the refined prompt on a validation set. 
  • Log performance metrics, including correct classifications, errors, and cases where ambiguity persists. 
  • Highlight specific prompts or input types that consistently underperform for further manual refinement. 
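A feedback loop of this kind could look roughly like the sketch below. It assumes a hypothetical `classify(prompt, text)` callable that returns a label and a validation set given as `(text, expected_label)` pairs; both are placeholders for whatever the real system provides.

```python
from collections import Counter

def run_validation(classify, prompt, validation_set):
    """Run a prompt over labeled examples and collect metrics and errors.

    `classify(prompt, text)` is a hypothetical callable returning a label.
    `validation_set` is a list of (text, expected_label) pairs.
    """
    errors = []
    correct = 0
    for text, expected in validation_set:
        predicted = classify(prompt, text)
        if predicted == expected:
            correct += 1
        else:
            errors.append({"text": text, "expected": expected, "predicted": predicted})
    accuracy = correct / len(validation_set) if validation_set else 0.0
    # Count errors per expected label to show which classes underperform
    errors_by_label = Counter(e["expected"] for e in errors)
    return {"accuracy": accuracy, "errors": errors, "errors_by_label": errors_by_label}
```

The `errors_by_label` counter points to the classes that consistently underperform, and the `errors` list provides the concrete inputs to review by hand.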
Version Control and Experimentation 

Maintain a version-controlled repository for prompts. Track: 

  • Changes made during each refinement cycle. 
  • Associated performance metrics. 
  • Rationale behind prompt modifications. 

This documentation provides a knowledge base for future refinements and helps prevent regressions. 
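Alongside a Git repository, the tracked items can be captured in a small record per refinement cycle. The classes below are a hypothetical sketch (the names `PromptVersion` and `PromptHistory` and the `f1` metric key are assumptions), intended only to show what might be stored and how a regression could be flagged.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PromptVersion:
    version: int
    prompt: str
    rationale: str   # why the prompt was changed
    metrics: dict    # e.g. {"f1": 0.82}
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class PromptHistory:
    """In-memory log of prompt versions; a real setup would persist this."""

    def __init__(self):
        self.versions = []

    def record(self, prompt, rationale, metrics):
        entry = PromptVersion(len(self.versions) + 1, prompt, rationale, dict(metrics))
        self.versions.append(entry)
        return entry

    def is_regression(self, metric="f1"):
        """True if the latest version scores lower than the previous one."""
        if len(self.versions) < 2:
            return False
        return self.versions[-1].metrics[metric] < self.versions[-2].metrics[metric]
```

Storing the rationale next to the metrics makes it possible to see later not only that a change helped or hurt, but what the change was meant to achieve.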
Benefits of a Prompt Refinement Pipeline 
  • Systematic Improvement – A structured approach ensures refinements are not ad hoc but are guided by data-driven insights and measurable results. 
  • Scalability – By automating key aspects of the refinement process, the pipeline scales effectively with larger datasets and more complex classification hierarchies. 
  • Model-Agnostic – The pipeline can be used with various LLMs, such as Anthropic’s models, OpenAI’s ChatGPT, or Google Gemini. This flexibility enables organizations to adopt or switch LLM providers without losing the benefits of the refinement process. 
  • Increased Robustness – Leveraging ground truth datasets ensures that prompts are evaluated on real-world examples, helping the model handle diverse and ambiguous scenarios with greater reliability. 
  • Meta-Prompt Benefits – Meta-prompts provide an efficient mechanism to leverage LLM capabilities for self-improvement. By incorporating LLM-generated suggestions, the system continuously evolves in response to new challenges or requirements. 
  • Error Analysis – The feedback loop enables a focused analysis of misclassifications, guiding the creation of targeted prompts that address specific failure cases or edge conditions. 
Iterative Workflow for Prompt Refinement Pipeline 
  • Baseline Testing – Start with an initial prompt and evaluate it on the ground truth dataset. Log performance metrics. 
  • Meta-Prompt Refinement – Use a meta-prompt to generate improved versions of the initial prompt. Select the most promising refinement. 
  • Validation and Comparison – Test the refined prompt on the dataset, comparing results to the baseline. Identify improvements and areas where performance remains suboptimal. 
  • Targeted Refinements – For consistently misclassified samples, manually analyze and refine the prompt further. Re-evaluate until significant performance gains are achieved. 
  • Deployment and Monitoring – Deploy the improved prompt into production and monitor real-world performance. Incorporate newly encountered edge cases into subsequent iterations of the refinement pipeline. 
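The first three steps of this workflow can be sketched as a simple loop. The version below is an illustration under assumptions, not a definitive implementation: `propose` and `evaluate` are hypothetical callables standing in for the meta-prompt step and the ground truth evaluation, and the `max_rounds` and `min_gain` stopping parameters are arbitrary choices.

```python
def refine_prompt(initial_prompt, propose, evaluate, max_rounds=3, min_gain=0.01):
    """Iteratively adopt the best-scoring candidate prompt.

    `propose(prompt)` returns a list of candidate prompts (meta-prompt step).
    `evaluate(prompt)` returns a score on the ground truth dataset.
    Both are hypothetical callables supplied by the caller.
    """
    best_prompt = initial_prompt
    best_score = evaluate(best_prompt)           # baseline testing
    history = [(best_prompt, best_score)]
    for _ in range(max_rounds):
        candidates = propose(best_prompt)        # meta-prompt refinement
        if not candidates:
            break
        scored = [(evaluate(c), c) for c in candidates]
        top_score, top_prompt = max(scored, key=lambda pair: pair[0])
        history.append((top_prompt, top_score))  # validation and comparison
        if top_score - best_score < min_gain:
            break                                # no meaningful improvement
        best_prompt, best_score = top_prompt, top_score
    return best_prompt, best_score, history
```

The manual analysis of stubborn misclassifications and the production monitoring steps stay outside the loop; their findings would feed back in as new ground truth examples or as a revised starting prompt.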

A prompt refinement pipeline provides a robust framework for systematically improving the performance of LLMs in hierarchical multiclass classification tasks. By combining meta-prompts, ground truth datasets, and automated evaluation, this approach ensures continuous improvement, scalability, and adaptability to new challenges, resulting in a more reliable and efficient classification system. 

Uday Yallapragada

Uday Yallapragada is a seasoned senior solutions architect at Perficient with over 20 years of experience in machine learning and software development. Uday brings a wealth of expertise in generative AI, machine learning, natural language processing and data science pipelines. He led successful execution of solutions for several use cases ranging from retrieval augmented generation, recommendation engines and text classifiers to network analysis. Uday has a passion for designing and developing responsible end-to-end AI and ML solutions.
