LoRA (Hu et al., 2021) exploits the fact that weight updates during fine-tuning are usually low-rank. Instead of updating billions of base-model parameters, you add two small matrices A and B (rank 8-64) whose product approximates the update. Base model stays frozen; only A and B train.
Result: an adapter that's under 1% the size of the base model, trainable on a single consumer GPU, and swappable at inference time. You can host one base model and dozens of LoRAs for different tasks. QLoRA combines LoRA with 4-bit quantization of the frozen base for an even smaller memory footprint.
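The update itself is just the product of two small matrices. A minimal NumPy sketch (dimensions illustrative, not from any particular model) of the effective weight and the parameter savings for a single projection matrix:

```python
import numpy as np

d = 4096      # hidden size of one projection (illustrative)
r = 16        # LoRA rank
alpha = 32    # LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

# Effective weight: frozen base plus the scaled low-rank update.
# At init B is zero, so the model starts identical to the base.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size                # 16,777,216 if fine-tuning W directly
lora_params = A.size + B.size       # 131,072 -- roughly 0.8% of full
```

Because only A and B receive gradients, optimizer state (the dominant memory cost in Adam-style training) shrinks by the same factor as the parameter count.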
Example
# Hugging Face + PEFT: train a LoRA adapter
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling factor (alpha / r scales the update)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)

# base_model: any Hugging Face model loaded beforehand
# (e.g. via transformers.AutoModelForCausalLM.from_pretrained)
model = get_peft_model(base_model, lora_cfg)

# Only LoRA params train. Base model stays frozen.
model.print_trainable_parameters()
# "trainable params: 8M || all params: 7B || trainable%: 0.11%"

When to use it
- Fine-tuning an open-weight model on consumer GPUs
- Hosting many task-specific variants off one base (cheap multi-tenancy)
- Experimenting with fine-tuning before committing to full training
When NOT to use it
- Using a closed-weight model -- you can't LoRA what you can't access
- The task needs representations the base model fundamentally lacks (use full fine-tuning or a different base)
