TL;DR
Fine-tune a giant model by training tiny adapter matrices alongside it, leaving the base frozen. The paper reports up to ~10,000x fewer trainable parameters and ~3x lower GPU memory versus full fine-tuning on GPT-3, lets you host one base model with many LoRAs for different tasks, and runs on a single consumer GPU.
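To see where the savings come from, here's a back-of-the-envelope sketch (not PEFT's actual implementation) of the trainable-parameter counts for one linear layer. The hidden size and rank below are illustrative assumptions: d=4096 is typical of a 7B-class model's attention projections, and r=8 is a common small LoRA rank.

```python
# LoRA replaces a full d x d weight update with two low-rank factors:
# delta_W = B @ A, where B is (d x r) and A is (r x d), r << d.
d = 4096          # hypothetical hidden size of one projection layer
r = 8             # hypothetical LoRA rank

full_update = d * d              # training W directly: d^2 trainable params
lora_update = d * r + r * d      # training B and A instead: 2*d*r params

print(full_update // lora_update)  # -> 256: ~256x fewer trainable params here
```

The gradient and optimizer state (the dominant training-memory cost for Adam-style optimizers) shrink by the same factor, which is why the base model's weights can stay frozen in a lower-precision format.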
Why it matters
LoRA turned fine-tuning from a data-center exercise into something individual developers can do on a 3080. Combined with 4-bit quantization (QLoRA), it's the practical standard for hobbyist / small-team model customization. The entire Hugging Face PEFT library is organized around this idea.
If you're considering fine-tuning in 2026, your default is LoRA, not full fine-tuning. Full fine-tuning makes sense only for infra teams at scale or when adapter-based approaches aren't enough.
How you'd use this
When prompt engineering plateaus on a narrow task with lots of examples, LoRA-train an adapter on 5-10k labeled examples. Swap in at inference time via PEFT. Many concurrent LoRAs can coexist on one base model -- a nice multi-tenant story.
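The multi-tenant story can be sketched in a few lines of numpy: one frozen base weight shared across tasks, with a cheap per-task (A, B) pair selected at inference time. The task names and sizes here are made up for illustration; a real deployment would use PEFT's adapter loading rather than raw matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                                 # toy hidden size and rank

W = rng.normal(size=(d, d))                  # frozen base weight, shared by all tasks
adapters = {                                 # one small (A, B) pair per task
    "summarize": (rng.normal(size=(r, d)), np.zeros((d, r))),
    "classify":  (rng.normal(size=(r, d)), np.zeros((d, r))),
}

def forward(x, task):
    """Base path plus the selected task's low-rank path: x W^T + x (BA)^T."""
    A, B = adapters[task]
    return x @ W.T + x @ A.T @ B.T

x = rng.normal(size=(1, d))
# B is initialized to zero (as in the paper), so before training every
# adapter reproduces the base model exactly:
assert np.allclose(forward(x, "summarize"), x @ W.T)
```

Because only the (A, B) pairs differ per tenant, swapping tasks is a dictionary lookup, not a model reload.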
Read the authors' abstract
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture.
