LoRA (Hu et al., 2021) exploits the fact that weight updates during fine-tuning are usually low-rank. Instead of updating billions of base-model parameters, you add two small matrices A and B (rank 8-64) whose product approximates the update. Base model stays frozen; only A and B train.
Result: an adapter that's under 1% the size of the base model, trainable on a single consumer GPU, and swappable at inference time. You can host one base model and dozens of LoRAs for different tasks. QLoRA combines LoRA with 4-bit quantization of the frozen base for an even smaller memory footprint.
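The update itself is just the product of two small matrices. A minimal NumPy sketch (dimensions illustrative, not from any particular model) of the effective weight and the parameter savings for a single projection matrix:

```python
import numpy as np

d = 4096      # hidden size of one projection (illustrative)
r = 16        # LoRA rank
alpha = 32    # LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

# Effective weight: frozen base plus the scaled low-rank update.
# At init B is zero, so the model starts identical to the base.
W_eff = W + (alpha / r) * (B @ A)

full_params = W.size                # 16,777,216 if fine-tuning W directly
lora_params = A.size + B.size       # 131,072 -- roughly 0.8% of full
```

Because only A and B receive gradients, optimizer state (the dominant memory cost in Adam-style training) shrinks by the same factor as the parameter count.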
Example
# Hugging Face + PEFT: train a LoRA adapter
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                 # rank of the update matrices
    lora_alpha=32,                        # scaling factor (alpha / r scales the update)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)

# base_model: any Hugging Face model loaded beforehand
# (e.g. via transformers.AutoModelForCausalLM.from_pretrained)
model = get_peft_model(base_model, lora_cfg)

# Only LoRA params train. Base model stays frozen.
model.print_trainable_parameters()
# "trainable params: 8M || all params: 7B || trainable%: 0.11%"

When to use it
- Fine-tuning an open-weight model on consumer GPUs
- Hosting many task-specific variants off one base (cheap multi-tenancy)
- Experimenting with fine-tuning before committing to full training
When NOT to use it
- Using a closed-weight model -- you can't LoRA what you can't access
- The task needs representations the base model fundamentally lacks (use full fine-tuning or a different base)
