Research
Papers
A curated digest of recent prompt-engineering, agentic, and AI-security research. Each paper: a 3-sentence TL;DR, why it matters for practitioners, and how to put it to work.
8 papers
- Development Oct 2023 arXiv: 2310.06770
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Benchmark that tests whether an LLM can take a real GitHub issue and a full repository and produce a passing code change. Went from 2% solve rate in 2023 to 70%+ by 2025 -- the clearest quantitative record of agentic coding progress we have.
- Techniques May 2023 arXiv: 2305.10601
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Generalizes chain-of-thought by having the model explore multiple reasoning branches, score each, and prune. Dramatically better on puzzle-like problems (24, crosswords, creative writing with constraints) at the cost of 5-10x the tokens.
- Security Feb 2023 arXiv: 2310.03025
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
The paper that named and benchmarked indirect prompt injection. Demonstrated end-to-end attacks against Bing Chat, GitHub Copilot Chat, and other production LLM integrations via poisoned web content, email, and code comments. The practical wake-up call for agent security.
- Agentic Oct 2022 arXiv: 2210.03629
ReAct: Synergizing Reasoning and Acting in Language Models
Introduced the Reason-Act-Observe loop: models that alternate between thinking and calling tools dramatically outperform models that only think OR only act. The founding paper of modern agentic LLM architectures.
- Techniques Mar 2022 arXiv: 2307.11760
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Sample the model N times at nonzero temperature, take the mode of the final answers. Cheapest known reliability upgrade for discrete-answer tasks -- easily doubles accuracy on math reasoning at the cost of N-x compute.
- Techniques Jan 2022 arXiv: 2201.11903
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Appending "let's think step by step" or showing worked-example reasoning in the prompt dramatically improves LLM accuracy on math and multi-step problems. The paper that named and formalized chain-of-thought. Still the cited reference for CoT despite being from 2022.
- Development Jun 2021 arXiv: 2106.09685
LoRA: Low-Rank Adaptation of Large Language Models
Fine-tune a giant model by training tiny adapter matrices alongside it, leaving the base frozen. Cuts training memory by 10-100x, lets you host one base model with many LoRAs for different tasks, runs on a single consumer GPU.
- Applied May 2020 arXiv: 2005.11401
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Coined the term RAG. Showed that combining a language model with retrieval over a knowledge base beats fine-tuning on knowledge-intensive tasks while staying updateable without retraining. The foundation of every modern document-chat, support bot, and enterprise search app.
