Research

Papers

A curated digest of recent prompt-engineering, agentic, and AI-security research. Each paper: a 3-sentence TL;DR, why it matters for practitioners, and how to put it to work.

8 papers

Development Oct 2023 arXiv: 2310.06770

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Benchmark that tests whether an LLM can take a real GitHub issue and a full repository and produce a passing code change. Went from 2% solve rate in 2023 to 70%+ by 2025 -- the clearest quantitative record of agentic coding progress we have.

Carlos E. Jimenez, John Yang, Alexander Wettig +4
Techniques May 2023 arXiv: 2305.10601

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Generalizes chain-of-thought by having the model explore multiple reasoning branches, score each, and prune. Dramatically better on puzzle-like problems (24, crosswords, creative writing with constraints) at the cost of 5-10x the tokens.

Shunyu Yao, Dian Yu, Jeffrey Zhao +4
Security Feb 2023 arXiv: 2310.03025

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

The paper that named and benchmarked indirect prompt injection. Demonstrated end-to-end attacks against Bing Chat, GitHub Copilot Chat, and other production LLM integrations via poisoned web content, email, and code comments. The practical wake-up call for agent security.

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra +3
Agentic Oct 2022 arXiv: 2210.03629

ReAct: Synergizing Reasoning and Acting in Language Models

Introduced the Reason-Act-Observe loop: models that alternate between thinking and calling tools dramatically outperform models that only think OR only act. The founding paper of modern agentic LLM architectures.

Shunyu Yao, Jeffrey Zhao, Dian Yu +4
Techniques Mar 2022 arXiv: 2307.11760

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Sample the model N times at nonzero temperature, take the mode of the final answers. Cheapest known reliability upgrade for discrete-answer tasks -- easily doubles accuracy on math reasoning at the cost of N-x compute.

Xuezhi Wang, Jason Wei, Dale Schuurmans +5
Techniques Jan 2022 arXiv: 2201.11903

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Appending "let's think step by step" or showing worked-example reasoning in the prompt dramatically improves LLM accuracy on math and multi-step problems. The paper that named and formalized chain-of-thought. Still the cited reference for CoT despite being from 2022.

Jason Wei, Xuezhi Wang, Dale Schuurmans +6
Development Jun 2021 arXiv: 2106.09685

LoRA: Low-Rank Adaptation of Large Language Models

Fine-tune a giant model by training tiny adapter matrices alongside it, leaving the base frozen. Cuts training memory by 10-100x, lets you host one base model with many LoRAs for different tasks, runs on a single consumer GPU.

Edward J. Hu, Yelong Shen, Phillip Wallis +5
Applied May 2020 arXiv: 2005.11401

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Coined the term RAG. Showed that combining a language model with retrieval over a knowledge base beats fine-tuning on knowledge-intensive tasks while staying updateable without retraining. The foundation of every modern document-chat, support bot, and enterprise search app.

Patrick Lewis, Ethan Perez, Aleksandara Piktus +9

Papers

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

ReAct: Synergizing Reasoning and Acting in Language Models

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

LoRA: Low-Rank Adaptation of Large Language Models

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks