LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

Published June 17, 2021 · arXiv:2106.09685

TL;DR

Fine-tune a giant model by training tiny adapter matrices alongside it, leaving the base frozen. Per the paper, this cuts the number of trainable parameters by up to 10,000x and GPU memory by roughly 3x, lets you host one base model with many LoRAs for different tasks, and runs on a single consumer GPU.
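The core trick is easy to see in a few lines of NumPy. This is a minimal sketch (sizes, names, and init scales are illustrative assumptions, not from the paper): the base weight `W` stays frozen, and only two small matrices `B` and `A` would be trained.

```python
import numpy as np

# Hypothetical sizes for illustration: a 1024x1024 weight adapted with rank r = 8.
d, r = 1024, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable, r x d
B = np.zeros((d, r))                     # trainable, d x r (zero init: delta starts at 0)

x = rng.standard_normal(d)

# Forward pass: base output plus the low-rank update (B @ A) @ x
y = W @ x + B @ (A @ x)

# Parameter savings: full fine-tuning would touch d*d weights;
# LoRA trains only the 2*d*r adapter entries.
full_params = d * d      # 1,048,576
lora_params = 2 * d * r  # 16,384 -> 64x fewer at these sizes
```

Because `B` starts at zero, the adapted model is exactly the base model at step 0, and training only ever moves the low-rank delta.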

Why it matters

LoRA turned fine-tuning from a data-center exercise into something individual developers can do on a 3080. Combined with 4-bit quantization (QLoRA), it's the practical ceiling for hobbyist and small-team model customization. The Hugging Face PEFT library is organized around this idea.

If you're considering fine-tuning in 2026, your default is LoRA, not full fine-tuning. Full fine-tuning makes sense only for infra teams at scale or when adapter-based approaches aren't enough.

How you'd use this

When prompt engineering plateaus on a narrow task with lots of examples, LoRA-train an adapter on 5-10k labeled examples. Swap in at inference time via PEFT. Many concurrent LoRAs can coexist on one base model -- a nice multi-tenant story.
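The multi-tenant story falls out of the math: every task contributes only a small `(B, A)` pair, so swapping tasks means swapping two small matrices, not reloading the model. A pure-NumPy sketch (task names and sizes are made up for illustration; a real deployment would go through PEFT's adapter loading):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4

# One frozen base weight, shared by every tenant.
W = rng.standard_normal((d, d))

# Hypothetical per-task adapters: each task owns only a (B, A) pair.
adapters = {
    "support-bot": (rng.standard_normal((d, r)) * 0.01,
                    rng.standard_normal((r, d)) * 0.01),
    "sql-gen":     (rng.standard_normal((d, r)) * 0.01,
                    rng.standard_normal((r, d)) * 0.01),
}

def forward(x, task):
    """Base forward plus the selected task's low-rank delta."""
    B, A = adapters[task]
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d)
y_support = forward(x, "support-bot")
y_sql = forward(x, "sql-gen")
```

Each adapter here is `2*d*r = 512` floats against the base's `d*d = 4,096`, which is why dozens of tenants can share one hosted model cheaply.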

Read the authors' abstract

We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
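One consequence of the formulation in the abstract is worth seeing concretely: because the injected update is just `B @ A`, it can be folded back into the frozen weight after training, so the deployed model pays no extra inference latency. A sketch, with illustrative sizes and the paper's `alpha / r` scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, r, alpha = 32, 32, 4, 8  # sizes and alpha are illustrative assumptions

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable rank-decomposition factor
B = rng.standard_normal((d_out, r)) * 0.01 # trainable rank-decomposition factor

x = rng.standard_normal(d_in)

# During training: W stays frozen, the update routes through B @ A,
# scaled by alpha / r as in the paper.
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))

# At deployment: fold the adapter into the base weight once.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x  # identical output, no extra matmul at inference
```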