Prerequisites
- A labeled dataset of 50-200 examples (CSV or JSON)
- An API key for one modern LLM (Claude 4.x, GPT-5, or Gemini 2.x)
- Python 3.10+ with one of the official SDKs
Step 1: Define your label set precisely
Write out each label with a one-sentence definition and two boundary examples (one clear case, one near-miss from an adjacent label). Ambiguous labels are the #1 cause of classification failure -- models can't hit a moving target.
LABELS:
billing: customer is asking about charges, invoices, or payments.
example: "Why was I charged twice?"
near-miss: "My account is locked" -> account, not billing
technical: customer is reporting the product doesn't work as expected.
example: "Login button is broken"
near-miss: "I forgot my password" -> account, not technical
account: access, password, profile, or subscription state.
example: "How do I reset my password?"
near-miss: "Reset my billing plan" -> billing, not account
Step 2: Write the prompt with structured output
Force JSON output that matches a fixed schema. This removes ad-hoc parsing bugs and lets you validate every response programmatically.
SYSTEM:
You classify customer support tickets. Return only JSON matching:
{"label": "billing" | "technical" | "account", "confidence": number 0-1, "reasoning": string}
Labels:
billing: charges, invoices, payments
technical: product not working as expected
account: access, password, profile, subscription state
Choose the label the ticket primarily concerns. When ambiguous, prefer the
category that describes what the user wants to happen, not what caused the
problem. Include a one-sentence reason.
USER:
{ticket_text}
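Validate the model's reply before trusting it. A minimal hand-rolled sketch of that check (`parse_classification` is an illustrative name; in practice a library like jsonschema or pydantic is sturdier):

```python
import json

VALID_LABELS = {"billing", "technical", "account"}

def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and enforce the schema; raise ValueError on any violation."""
    result = json.loads(raw)
    if result.get("label") not in VALID_LABELS:
        raise ValueError(f"unknown label: {result.get('label')!r}")
    conf = result.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError(f"confidence out of range: {conf!r}")
    if not isinstance(result.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return result
```

Rejecting malformed replies loudly (rather than defaulting to a label) keeps bad outputs from silently polluting your eval numbers.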
Step 3: Add self-consistency voting
Run the prompt five times at temperature=0.7 and take the majority label. It's a cheap reliability upgrade: a single sampled answer can flip on borderline tickets, but the vote is far more stable.
import collections

def classify(ticket, n=5):
    # Sample n independent classifications at nonzero temperature.
    labels = []
    for _ in range(n):
        result = call_llm(prompt=prompt.format(ticket_text=ticket),
                          temperature=0.7)
        labels.append(result["label"])
    # Majority vote; the agreement rate doubles as a confidence score.
    winner, count = collections.Counter(labels).most_common(1)[0]
    return {
        "label": winner,
        "confidence": count / n,  # fraction of runs that agreed
        "n": n,
    }
Step 4: Build the eval harness
Split your labeled data: 80% training pool (the source of few-shot examples), 20% held-out test. Run the classifier on the test set, compute accuracy and per-label F1. Your ship gate: accuracy doesn't regress on new prompt variants.
import json
from sklearn.metrics import accuracy_score, classification_report

with open("test.jsonl") as f:
    test = [json.loads(line) for line in f]

preds = [classify(t["text"])["label"] for t in test]
y_true = [t["label"] for t in test]

print(f"Accuracy: {accuracy_score(y_true, preds):.3f}")
print(classification_report(y_true, preds))  # per-label precision, recall, F1
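The 80/20 split itself is worth doing stratified so every label appears in both halves. A sketch using scikit-learn (assumes your labeled data is a list of dicts with "text" and "label" keys):

```python
from sklearn.model_selection import train_test_split

def split_dataset(examples, test_size=0.2, seed=42):
    """Stratify on the label so rare classes show up in both train and test."""
    labels = [ex["label"] for ex in examples]
    train, test = train_test_split(
        examples, test_size=test_size, stratify=labels, random_state=seed)
    return train, test
```

Fixing the random seed keeps the test set stable across prompt iterations, which is what makes "accuracy doesn't regress" a meaningful gate.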
Step 5: Add few-shot examples where accuracy lags
Look at per-label F1. For the one or two worst-performing labels, pull three boundary examples from your training pool into the prompt as worked examples. Re-run the eval. Iterate.
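Assembling the few-shot block can be mechanical. A sketch (the "Ticket:"/"Label:" rendering is an illustrative format, not required -- use whatever your prompt template expects):

```python
def build_few_shot_block(examples):
    """Render labeled training examples as worked examples for the prompt.

    Each example is a dict with "text" and "label" keys.
    """
    lines = []
    for ex in examples:
        lines.append(f'Ticket: "{ex["text"]}"')
        lines.append(f'Label: {ex["label"]}')
    return "\n".join(lines)
```

Keep the worked examples in the system prompt, above the label definitions' near-misses, so the boundary cases do double duty.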
Failure modes to watch
- Label drift: if new ticket types appear, your labels go stale. Budget a monthly review.
- Length bias: very short tickets ("help") can't be classified reliably -- add a "needs_info" label or a low-confidence fallback.
- Over-confident hallucinations: LLMs report high confidence on wrong answers. Don't trust the confidence field alone; use the self-consistency agreement rate.
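One way to wire the agreement rate into a fallback. A sketch: the 0.6 threshold and the "needs_review" label are illustrative choices, and `result` is the dict the Step 3 classifier returns:

```python
def apply_fallback(result, min_agreement=0.6):
    """Route low-agreement classifications to a human queue instead of guessing.

    result: the dict returned by classify() in Step 3, where "confidence"
    is the self-consistency agreement rate (votes for winner / total votes).
    """
    if result["confidence"] < min_agreement:
        return {"label": "needs_review", "agreement": result["confidence"]}
    return result
```

Tune the threshold against your test set: raise it until the tickets it diverts are mostly ones the classifier actually got wrong.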
Variations
- Hierarchical labels: "billing" → "billing/refund" → "billing/refund/duplicate-charge". Classify top-down for deeper taxonomies.
- Multi-label: change schema to an array of labels.
- Extraction + classification: same prompt also extracts order IDs, account emails, dates. One call, more data.
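The hierarchical variation can be sketched as a loop over a nested taxonomy dict (the taxonomy structure and `classify_fn` signature here are illustrative assumptions -- `classify_fn` stands in for one LLM call restricted to the options at the current level):

```python
TAXONOMY = {
    "billing": {"refund": {"duplicate-charge": {}, "late-fee": {}}},
    "technical": {},
    "account": {},
}

def classify_hierarchical(ticket, taxonomy, classify_fn):
    """Walk the taxonomy one level at a time.

    classify_fn(ticket, options) -> one option string; each call only ever
    sees the siblings at the current depth, which keeps the label set small.
    """
    path, level = [], taxonomy
    while level:  # empty dict = leaf, stop descending
        choice = classify_fn(ticket, list(level))
        path.append(choice)
        level = level[choice]
    return "/".join(path)
```

Classifying top-down this way means each prompt stays small even when the full taxonomy has dozens of leaves, at the cost of one LLM call per level.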
