Prerequisites
- A labeled dataset of 50-200 examples (CSV or JSON)
- An API key for one modern LLM (Claude 4.x, GPT-5, or Gemini 2.x)
- Python 3.10+ with one of the official SDKs
Step 1: Define your label set precisely
Write out each label with a one-sentence definition and two boundary examples (one clear case, one near-miss from an adjacent label). Ambiguous labels are the #1 cause of classification failure -- models can't hit a moving target.
LABELS:
billing: customer is asking about charges, invoices, or payments.
example: "Why was I charged twice?"
near-miss: "My account is locked" -> account, not billing
technical: customer is reporting the product doesn't work as expected.
example: "Login button is broken"
near-miss: "I forgot my password" -> account, not technical
account: access, password, profile, or subscription state.
example: "How do I reset my password?"
near-miss: "Reset my billing plan" -> billing, not account
Step 2: Write the prompt with structured output
Force JSON output that matches a fixed schema. This removes ad-hoc parsing bugs and lets you validate every response programmatically.
SYSTEM:
You classify customer support tickets. Return only JSON matching:
{"label": "billing" | "technical" | "account", "confidence": number 0-1, "reasoning": string}
Labels:
billing: charges, invoices, payments
technical: product not working as expected
account: access, password, profile, subscription state
Choose the label the ticket primarily concerns. When ambiguous, prefer the
category that describes what the user wants to happen, not what caused the
problem. Include a one-sentence reason.
USER:
{ticket_text}
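Validate the model's reply before trusting it. A minimal hand-rolled sketch of that check (`parse_classification` is an illustrative name; in practice a library like jsonschema or pydantic is sturdier):

```python
import json

VALID_LABELS = {"billing", "technical", "account"}

def parse_classification(raw: str) -> dict:
    """Parse the model's JSON reply and enforce the schema; raise ValueError on any violation."""
    result = json.loads(raw)
    if result.get("label") not in VALID_LABELS:
        raise ValueError(f"unknown label: {result.get('label')!r}")
    conf = result.get("confidence")
    if not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        raise ValueError(f"confidence out of range: {conf!r}")
    if not isinstance(result.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return result
```

Rejecting malformed replies loudly (rather than defaulting to a label) keeps bad outputs from silently polluting your eval numbers.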
Step 3: Add self-consistency voting
Run the prompt five times at temperature=0.7 and take the majority label. It's a cheap reliability upgrade: a single sampled answer can flip on borderline tickets, but the vote is far more stable.
import collections

def classify(ticket, n=5):
    # Sample n independent classifications at nonzero temperature.
    labels = []
    for _ in range(n):
        result = call_llm(prompt=prompt.format(ticket_text=ticket),
                          temperature=0.7)
        labels.append(result["label"])
    # Majority vote; the agreement rate doubles as a confidence score.
    winner, count = collections.Counter(labels).most_common(1)[0]
    return {
        "label": winner,
        "confidence": count / n,  # fraction of runs that agreed
        "n": n,
    }
Step 4: Build the eval harness
Split your labeled data: 80% training pool (the source of few-shot examples), 20% held-out test. Run the classifier on the test set, compute accuracy and per-label F1. Your ship gate: accuracy doesn't regress on new prompt variants.
import json
from sklearn.metrics import accuracy_score, classification_report

with open("test.jsonl") as f:
    test = [json.loads(line) for line in f]

preds = [classify(t["text"])["label"] for t in test]
y_true = [t["label"] for t in test]

print(f"Accuracy: {accuracy_score(y_true, preds):.3f}")
print(classification_report(y_true, preds))  # per-label precision, recall, F1
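The 80/20 split itself is worth doing stratified so every label appears in both halves. A sketch using scikit-learn (assumes your labeled data is a list of dicts with "text" and "label" keys):

```python
from sklearn.model_selection import train_test_split

def split_dataset(examples, test_size=0.2, seed=42):
    """Stratify on the label so rare classes show up in both train and test."""
    labels = [ex["label"] for ex in examples]
    train, test = train_test_split(
        examples, test_size=test_size, stratify=labels, random_state=seed)
    return train, test
```

Fixing the random seed keeps the test set stable across prompt iterations, which is what makes "accuracy doesn't regress" a meaningful gate.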
Step 5: Add few-shot examples where accuracy lags
Look at per-label F1. For the one or two worst-performing labels, pull three boundary examples from your training pool into the prompt as worked examples. Re-run the eval. Iterate.
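Assembling the few-shot block can be mechanical. A sketch (the "Ticket:"/"Label:" rendering is an illustrative format, not required -- use whatever your prompt template expects):

```python
def build_few_shot_block(examples):
    """Render labeled training examples as worked examples for the prompt.

    Each example is a dict with "text" and "label" keys.
    """
    lines = []
    for ex in examples:
        lines.append(f'Ticket: "{ex["text"]}"')
        lines.append(f'Label: {ex["label"]}')
    return "\n".join(lines)
```

Keep the worked examples in the system prompt, above the label definitions' near-misses, so the boundary cases do double duty.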
Failure modes to watch
- Label drift: if new ticket types appear, your labels go stale. Budget a monthly review.
- Length bias: very short tickets ("help") can't be classified reliably -- add a "needs_info" label or a low-confidence fallback.
- Over-confident hallucinations: LLMs report high confidence on wrong answers. Don't trust the confidence field alone; use the self-consistency agreement rate.
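One way to wire the agreement rate into a fallback. A sketch: the 0.6 threshold and the "needs_review" label are illustrative choices, and `result` is the dict the Step 3 classifier returns:

```python
def apply_fallback(result, min_agreement=0.6):
    """Route low-agreement classifications to a human queue instead of guessing.

    result: the dict returned by classify() in Step 3, where "confidence"
    is the self-consistency agreement rate (votes for winner / total votes).
    """
    if result["confidence"] < min_agreement:
        return {"label": "needs_review", "agreement": result["confidence"]}
    return result
```

Tune the threshold against your test set: raise it until the tickets it diverts are mostly ones the classifier actually got wrong.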
Variations
- Hierarchical labels: "billing" → "billing/refund" → "billing/refund/duplicate-charge". Classify top-down for deeper taxonomies.
- Multi-label: change schema to an array of labels.
- Extraction + classification: same prompt also extracts order IDs, account emails, dates. One call, more data.
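The hierarchical variation can be sketched as a loop over a nested taxonomy dict (the taxonomy structure and `classify_fn` signature here are illustrative assumptions -- `classify_fn` stands in for one LLM call restricted to the options at the current level):

```python
TAXONOMY = {
    "billing": {"refund": {"duplicate-charge": {}, "late-fee": {}}},
    "technical": {},
    "account": {},
}

def classify_hierarchical(ticket, taxonomy, classify_fn):
    """Walk the taxonomy one level at a time.

    classify_fn(ticket, options) -> one option string; each call only ever
    sees the siblings at the current depth, which keeps the label set small.
    """
    path, level = [], taxonomy
    while level:  # empty dict = leaf, stop descending
        choice = classify_fn(ticket, list(level))
        path.append(choice)
        level = level[choice]
    return "/".join(path)
```

Classifying top-down this way means each prompt stays small even when the full taxonomy has dozens of leaves, at the cost of one LLM call per level.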
