May 25, 2026

Adaptive Few-Shot Prompting: Use Semantic Retrieval to Pick Your Examples Instead of Hardcoding Them

New research shows that dynamically selecting in-context examples by embedding similarity outperforms static few-shot by up to +9.31 BLEU on translation and +7.3% F1 on NER -- here's how to build it

Static few-shot prompting is the hardcoded config file of prompt engineering. You pick three to five examples, paste them into every request, and hope they're relevant enough. For simple tasks, that works. But the moment your inputs get diverse, those fixed examples start dragging down your results.

Recent research puts a number on the gap. A March 2026 study from Emory and NIH tested dynamic example selection for biomedical named entity recognition and found it added +7.3% F1 over static structured prompting in 5-shot settings (Ge et al., npj Artificial Intelligence, 2026). An AAAI 2025 paper on machine translation showed +9.31 BLEU when selecting examples by embedding similarity instead of using fixed demonstrations (arXiv 2501.01679). These aren't marginal gains. They're the difference between "pretty good" and "actually useful."

The fix is straightforward: embed your example pool into a vector store, retrieve the most similar examples per input at inference time, and template them into your prompt. You're picking examples that look like the current task instead of examples that looked good when you wrote the prompt three weeks ago.

Why Static Selection Breaks Down

Think about a sentiment classifier. You hardcode five examples: two positive reviews, two negative, one neutral. Works fine when the input is a product review. Breaks when someone submits a sarcastic tweet, a formal complaint letter, or a review in mixed English and Spanish.

Static examples teach the model a pattern. Dynamic examples teach the model the right pattern for this specific input. The closer your examples are to the actual query, the stronger the in-context learning signal.

A study on code vulnerability detection made this concrete. Retrieval-augmented few-shot prompting hit 48.60% subset accuracy at 10 shots versus random prompting's 38.90% -- a 25% relative improvement. It also outperformed fine-tuned Gemini on F1 score, with zero training cost (Trad & Chehab, arXiv 2512.04106). You read that right: picking better examples beat fine-tuning the model.

The Build: Vector Store + kNN Retrieval + Prompt Template

The architecture has three pieces:

1. Example pool. A collection of input-output pairs for your task. For classification, that's text + label. For translation, source + target. For content generation, brief + output. Start with 50-100 examples. More is better, but even 20 outperforms static selection if they're diverse.

2. Embedding index. Embed the input side of each example into a vector store. FAISS, Chroma, Pinecone -- pick whatever you already run. You embed the pool once at index time; at inference you only embed the incoming query and search the index.

3. Retrieval + templating. For each new input, embed it, find the k-nearest examples, and slot them into your prompt template.
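
The three pieces above can be sketched in a few functions. This is a minimal illustration, not a production setup: `toy_embed` is a stand-in bag-of-words embedder and the in-memory list is a stand-in for a vector store, both chosen only so the snippet runs without dependencies. The function names and the sample pool are hypothetical. In a real system you would swap in your embedding model and your FAISS, Chroma, or Pinecone client.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple

SparseVec = Dict[str, float]
Example = Dict[str, str]  # {"input": ..., "label": ...}
Embedder = Callable[[str], SparseVec]


def toy_embed(text: str) -> SparseVec:
    """Stand-in embedder (assumption): unit-length bag-of-words counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: SparseVec, b: SparseVec) -> float:
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def build_index(pool: List[Example], embed: Embedder = toy_embed) -> List[Tuple[SparseVec, Example]]:
    """Embed the input side of every example once, at index time."""
    return [(embed(ex["input"]), ex) for ex in pool]


def retrieve(query: str, index: List[Tuple[SparseVec, Example]],
             k: int = 3, embed: Embedder = toy_embed) -> List[Example]:
    """Embed the query and return the k most similar examples."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [ex for _, ex in ranked[:k]]


# Hypothetical example pool for the feedback classifier.
pool = [
    {"input": "The app crashes when I open settings", "label": "bug_report"},
    {"input": "Please add a dark mode option", "label": "feature_request"},
    {"input": "Love the new dashboard, great work", "label": "praise"},
    {"input": "Upload fails with an error on Android", "label": "bug_report"},
    {"input": "How do I reset my password?", "label": "question"},
]
index = build_index(pool)
nearest = retrieve("The app crashes every time I upload a file on Android", index, k=2)
for ex in nearest:
    print(ex["label"], "<-", ex["input"])
# bug_report <- The app crashes when I open settings
# bug_report <- Upload fails with an error on Android
```

Keeping the embedder as a parameter means the index and retrieval code do not change when you move from the toy embedder to a real model.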

Here's what this looks like in practice.

The Prompt (classification task with 3 dynamic examples):

You are a customer feedback classifier. Categorize each message as one of: [bug_report, feature_request, praise, complaint, question].

Here are examples of similar feedback and their correct categories:

Example 1:
Feedback: "{retrieved_example_1_input}"
Category: {retrieved_example_1_label}

Example 2:
Feedback: "{retrieved_example_2_input}"
Category: {retrieved_example_2_label}

Example 3:
Feedback: "{retrieved_example_3_input}"
Category: {retrieved_example_3_label}

Now classify this feedback:
Feedback: "{user_input}"
Category:

Why This Works: The three examples aren't fixed. They're the three closest matches to the current input from your example pool. If the user submits a message about login failures, the retrieved examples will be other messages about authentication issues, crashes, or error states. If they submit a feature idea, the examples will be other feature requests. The model sees demonstrations that are structurally and semantically similar to the task at hand.

Expected Output:

bug_report

When the input is "The app crashes every time I try to upload a file larger than 10MB on Android," the retrieved examples would be other bug reports about crashes, file handling, or mobile issues. The model doesn't have to generalize from a praise example about "great UI design" to understand this is a bug report.
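
Filling the template is plain string assembly. The sketch below assumes the retrieved examples arrive as dicts with `input` and `label` keys. That shape and the `build_prompt` name are illustrative choices, not something the template itself requires. It works for any k, not just three.

```python
from typing import Dict, List

INSTRUCTIONS = (
    "You are a customer feedback classifier. Categorize each message as one of: "
    "[bug_report, feature_request, praise, complaint, question]."
)


def build_prompt(user_input: str, retrieved: List[Dict[str, str]]) -> str:
    """Slot the retrieved examples and the new input into the prompt template."""
    lines = [
        INSTRUCTIONS,
        "",
        "Here are examples of similar feedback and their correct categories:",
        "",
    ]
    for i, ex in enumerate(retrieved, start=1):
        example_input = ex["input"]
        example_label = ex["label"]
        lines += [
            f"Example {i}:",
            f'Feedback: "{example_input}"',
            f"Category: {example_label}",
            "",
        ]
    lines += ["Now classify this feedback:", f'Feedback: "{user_input}"', "Category:"]
    return "\n".join(lines)


# Hypothetical retrieval result for a crash report.
retrieved = [
    {"input": "App freezes on the login screen", "label": "bug_report"},
    {"input": "Upload button does nothing", "label": "bug_report"},
]
prompt = build_prompt(
    "The app crashes every time I try to upload a file larger than 10MB on Android",
    retrieved,
)
print(prompt)
```

Ending the prompt on "Category:" nudges the model to answer with the label alone, which keeps parsing simple.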

The LangChain Shortcut

If you're already in the LangChain ecosystem, this is a solved problem. SemanticSimilarityExampleSelector implements exactly this pattern. IBM published a production tutorial in March 2026 showing this approach for a sales intelligence agent, where dynamic few-shot selection guided LLM reasoning over regional sales metrics.

LangSmith now ships dynamic few-shot as a first-class feature in open beta. You upload examples, it handles embedding, retrieval, and injection. The static FewShotPromptTemplate is no longer the recommended path.

But you don't need LangChain. The pattern is simple enough to build with any embedding API and a vector store client. Fifteen lines of Python gets you there.

What to Embed Matters More Than Which Embedder

Here's a finding that saves you money. The Emory/NIH study tested four retrieval methods: TF-IDF, SBERT, ColBERT, and DPR. TF-IDF and SBERT performed best. The simpler, cheaper methods won.
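
To give a sense of how little machinery TF-IDF retrieval needs, here is a dependency-free sketch. It is not the study's implementation. The `TfidfRetriever` class, the tokenizer, and the sample inputs are assumptions for illustration, and the smoothed IDF formula is one common variant. In practice you would more likely reach for an existing library.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class TfidfRetriever:
    """Ranks a fixed pool of example inputs against a query by TF-IDF cosine."""

    def __init__(self, inputs: List[str]) -> None:
        self.inputs = inputs
        docs = [Counter(tokenize(text)) for text in inputs]
        n_docs = len(docs)
        doc_freq = Counter(tok for doc in docs for tok in doc)
        # Smoothed IDF: rarer terms get more weight, common terms stay above zero.
        self.idf = {
            tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()
        }
        self.vectors = [self._weigh(doc) for doc in docs]

    def _weigh(self, counts: Counter) -> Dict[str, float]:
        # Terms never seen in the pool carry no information and are dropped.
        vec = {tok: c * self.idf[tok] for tok, c in counts.items() if tok in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {tok: w / norm for tok, w in vec.items()}

    def top_k(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        q = self._weigh(Counter(tokenize(query)))
        scored = [
            (sum(w * vec.get(tok, 0.0) for tok, w in q.items()), i)
            for i, vec in enumerate(self.vectors)
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(self.inputs[i], score) for score, i in scored[:k]]


# Hypothetical example inputs.
inputs = [
    "Bluetooth keeps disconnecting from my headset",
    "Please add export to CSV",
    "The checkout page is slow to load",
    "WiFi connection drops every few minutes",
]
retriever = TfidfRetriever(inputs)
best_text, best_score = retriever.top_k("bluetooth disconnecting again", k=1)[0]
print(best_text)
# Bluetooth keeps disconnecting from my headset
```

Note the limitation this exposes: the WiFi example scores zero against a Bluetooth query because the two share no tokens, even though they describe the same kind of problem.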

An EMNLP 2023 paper called Skill-KNN took this further. Instead of embedding raw input text, they first rewrote each example into a "skill description" that captured the reasoning pattern, then embedded that. This rewrite-then-retrieve approach consistently outperformed raw-text kNN across 5 datasets and 6 LLMs, including GPT-4 (EMNLP 2023, Skill-KNN).

The takeaway: if your raw text has a lot of surface noise (domain jargon, formatting variance, mixed languages), preprocess before embedding. A short summarization step that extracts the core task pattern before retrieval can make a cheap embedding model outperform an expensive one on raw text.

The Prompt (rewrite-then-retrieve preprocessing):

Summarize the core task pattern in this customer message in one sentence. 
Focus on what the customer needs, not specific product names or details.

Message: "{raw_input}"
Task pattern:

Why This Works: Stripping surface-level details before embedding means your similarity search finds examples that match on task structure, not keyword overlap. Complaints about "Bluetooth disconnecting on my XR-500" and "WiFi drops on the Pro Max" look different on the surface but share the same task pattern: a connectivity reliability issue on specific hardware.

Expected Output:

Customer is reporting an intermittent hardware connectivity failure on a specific device model.

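You embed that summary, not the raw message, and use it as the query for your vector search. Apply the same rewrite to the examples in your pool at index time, so queries and examples are compared in the same form.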
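
Here is one way the rewrite-then-retrieve flow could be wired together. Treat it as a sketch under stated assumptions: `stub_summarize` returns canned summaries in place of a real LLM call, which would send `REWRITE_PROMPT.format(raw_input=message)` to your model. Jaccard token overlap stands in for embedding similarity. All names and the sample pool are hypothetical.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

Example = Dict[str, str]

REWRITE_PROMPT = (
    "Summarize the core task pattern in this customer message in one sentence.\n"
    "Focus on what the customer needs, not specific product names or details.\n\n"
    'Message: "{raw_input}"\n'
    "Task pattern:"
)

# Canned summaries standing in for LLM output (assumption, for a runnable demo).
CONNECTIVITY = (
    "Customer is reporting an intermittent hardware connectivity failure "
    "on a specific device model."
)
CANNED = {
    "Bluetooth keeps disconnecting on my XR-500": CONNECTIVITY,
    "WiFi drops on the Pro Max": CONNECTIVITY,
    "Can you add CSV export to reports?": "Customer is requesting a new data export capability.",
}


def stub_summarize(message: str) -> str:
    """Placeholder for an LLM call that fills REWRITE_PROMPT with the message."""
    return CANNED[message]


def tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def build_summary_index(pool: List[Example],
                        summarize: Callable[[str], str]) -> List[Tuple[str, Example]]:
    """Rewrite each example input once, at index time."""
    return [(summarize(ex["input"]), ex) for ex in pool]


def retrieve_by_summary(raw_query: str, index: List[Tuple[str, Example]],
                        summarize: Callable[[str], str], k: int = 1) -> List[Example]:
    """Rewrite the query the same way, then rank examples by summary similarity."""
    query_summary = summarize(raw_query)
    ranked = sorted(index, key=lambda item: jaccard(query_summary, item[0]), reverse=True)
    return [ex for _, ex in ranked[:k]]


pool = [
    {"input": "Can you add CSV export to reports?", "label": "feature_request"},
    {"input": "Bluetooth keeps disconnecting on my XR-500", "label": "bug_report"},
]
index = build_summary_index(pool, stub_summarize)
match = retrieve_by_summary("WiFi drops on the Pro Max", index, stub_summarize, k=1)[0]
print(match["input"])
# Bluetooth keeps disconnecting on my XR-500
```

On raw text the WiFi query and the Bluetooth example share only the word "on". After the rewrite their summaries are identical, so the match is exact.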

When NOT to Use Dynamic Few-Shot

This technique isn't universal. DeepSeek's R1 paper and multiple independent studies found that few-shot examples consistently degrade performance on reasoning-heavy tasks with models like o1 and R1. These models have built-in chain-of-thought reasoning, and external examples bias them toward surface pattern copying instead of genuine problem-solving. DeepSeek officially recommends zero-shot for R1.

The split is clean: for generation, classification, translation, NER, and structured output tasks, dynamic few-shot selection gives you measurable gains. For math, logic, and complex coding tasks on reasoning-tuned models, skip the examples entirely and let the model think.

Classification is the exception that works everywhere. Even R1 hit 91.39% F1 on 5-class sentiment classification with just 5 shots (arXiv 2509.23196).

The Practical Threshold

A useful rule of thumb: if you need more than 8 static examples to cover your input space, switch to dynamic selection. Below that, static is fine. Above that, you're padding every prompt with irrelevant examples and paying for tokens that hurt more than they help.
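
The threshold and the task split from the previous section can be written down as a small routing function. This is only a sketch of the heuristics as stated in this article. The task labels, the `choose_strategy` name, and the idea of encoding the rule in code at all are illustrative assumptions, and you should tune the threshold against your own evals.

```python
from enum import Enum


class Strategy(Enum):
    ZERO_SHOT = "zero_shot"
    STATIC_FEW_SHOT = "static_few_shot"
    DYNAMIC_FEW_SHOT = "dynamic_few_shot"


# Task labels are hypothetical; map them to whatever taxonomy you use.
REASONING_HEAVY_TASKS = {"math", "logic", "complex_coding"}
STATIC_EXAMPLE_LIMIT = 8  # rule-of-thumb threshold from the text above


def choose_strategy(task: str, reasoning_model: bool, examples_needed: int) -> Strategy:
    """Pick a prompting strategy from the task type, model type, and pool size.

    examples_needed is how many static examples it would take to cover
    the input space reasonably well.
    """
    # Reasoning-tuned models on reasoning-heavy tasks: skip examples entirely.
    if reasoning_model and task in REASONING_HEAVY_TASKS:
        return Strategy.ZERO_SHOT
    # Too many examples to paste into every prompt: retrieve per input instead.
    if examples_needed > STATIC_EXAMPLE_LIMIT:
        return Strategy.DYNAMIC_FEW_SHOT
    return Strategy.STATIC_FEW_SHOT


print(choose_strategy("math", reasoning_model=True, examples_needed=20).value)
# zero_shot
print(choose_strategy("classification", reasoning_model=True, examples_needed=20).value)
# dynamic_few_shot
print(choose_strategy("translation", reasoning_model=False, examples_needed=5).value)
# static_few_shot
```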

Dynamic selection also tends to use fewer examples per request. Three well-chosen examples often outperform eight generic ones, which means shorter prompts, lower latency, and lower cost.

The research is clear. Static few-shot is a starting point, not a destination. Once your task has any real input diversity, semantic retrieval for example selection is the move. The tools exist, the pattern is proven, and the gains are significant.

