<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ajith Kumar</title>
    <description>The latest articles on Forem by Ajith Kumar (@ajith_kumar_593bb762c09ce).</description>
    <link>https://forem.com/ajith_kumar_593bb762c09ce</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3597028%2F46a7961a-3692-48ac-ab35-55f10635be5a.png</url>
      <title>Forem: Ajith Kumar</title>
      <link>https://forem.com/ajith_kumar_593bb762c09ce</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ajith_kumar_593bb762c09ce"/>
    <language>en</language>
    <item>
      <title>PEFT (LoRA) – Fine-Tuning LLMs Without Big GPUs</title>
      <dc:creator>Ajith Kumar</dc:creator>
      <pubDate>Wed, 05 Nov 2025 10:51:26 +0000</pubDate>
      <link>https://forem.com/ajith_kumar_593bb762c09ce/peft-lora-fine-tuning-llms-without-big-gpus-368h</link>
      <guid>https://forem.com/ajith_kumar_593bb762c09ce/peft-lora-fine-tuning-llms-without-big-gpus-368h</guid>
      <description>&lt;p&gt;Large Language Models (LLMs) can have billions of parameters. &lt;br&gt;
Fine-tuning them usually requires high-end GPUs and large memory. &lt;br&gt;
&lt;strong&gt;Parameter-Efficient Fine-Tuning (PEFT)&lt;/strong&gt; offers a solution to adapt such models using fewer resources.&lt;/p&gt;

&lt;h3&gt;What is LoRA?&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;LoRA (Low-Rank Adaptation)&lt;/strong&gt; is a PEFT technique in which, instead of updating the full model, &lt;br&gt;
we train only small, low-rank matrices inserted into the model's layers; the original weights stay frozen.&lt;/p&gt;
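
&lt;p&gt;The idea can be sketched in a few lines of NumPy (a toy illustration, not a real library API; the dimensions, rank &lt;code&gt;r&lt;/code&gt;, and scaling factor &lt;code&gt;alpha&lt;/code&gt; are made-up assumptions):&lt;/p&gt;

```python
import numpy as np

# Frozen pretrained weight (d_in x d_out) -- never updated during fine-tuning
d_in, d_out, r = 64, 64, 4
W = np.random.randn(d_in, d_out)

# Trainable low-rank factors: only these small matrices are updated
A = np.random.randn(d_in, r) * 0.01  # down-projection (d_in x r)
B = np.zeros((r, d_out))             # up-projection, initialized to zero

alpha = 8  # scaling factor applied to the LoRA update

def lora_forward(x):
    # Output = frozen path + scaled low-rank update path
    return x @ W + (alpha / r) * (x @ A @ B)

x = np.random.randn(1, d_in)
y = lora_forward(x)
print(y.shape)  # (1, 64)
```

&lt;p&gt;Because &lt;code&gt;B&lt;/code&gt; starts at zero, the model initially behaves exactly like the frozen base model, and training only nudges the small update path.&lt;/p&gt;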

&lt;h3&gt;Why Does This Work?&lt;/h3&gt;

&lt;p&gt;Most weight matrices in large models are highly redundant, so the updates needed for a new task tend to have low effective rank. &lt;br&gt;
LoRA exploits this by approximating the weight update as a product of two much smaller matrices, drastically reducing the number of trainable parameters.&lt;/p&gt;
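
&lt;p&gt;A quick NumPy experiment shows the intuition (matrix sizes and rank here are arbitrary): a matrix built from a few factors is recovered almost exactly by a truncated SVD of low rank.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 512x512 "update" matrix that secretly has rank 8
delta_W = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 512))

# Truncated SVD: keep only the top r singular values
U, S, Vt = np.linalg.svd(delta_W)
r = 8
approx = (U[:, :r] * S[:r]) @ Vt[:r, :]

# The rank-8 reconstruction matches the full 512x512 matrix almost exactly
err = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
print(f"relative error: {err:.2e}")
```

&lt;p&gt;If the true update really is (close to) low rank, storing two small factors loses almost nothing compared to storing the full matrix.&lt;/p&gt;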

&lt;h3&gt;Key Benefits&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requires much less GPU memory&lt;/li&gt;
&lt;li&gt;Faster training&lt;/li&gt;
&lt;li&gt;Can store multiple task adapters without duplicating full models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Example Comparison&lt;/h3&gt;

&lt;p&gt;If a model has 10 billion parameters, traditional fine-tuning updates all 10B of them. &lt;br&gt;
LoRA might train only around 10–50 million parameters, making it extremely resource-efficient.&lt;/p&gt;
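
&lt;p&gt;Back-of-the-envelope arithmetic shows where the savings come from (the layer count, hidden size, and rank below are made-up round numbers, not a real model's configuration):&lt;/p&gt;

```python
# Hypothetical model: 100 layers, hidden size 5000,
# LoRA applied to two 5000x5000 projection matrices per layer with rank r=8
layers, d, r, targets = 100, 5000, 8, 2

full_params = layers * targets * d * d            # full fine-tuning of those matrices
lora_params = layers * targets * (d * r + r * d)  # A (d x r) plus B (r x d) per matrix

print(f"full:  {full_params:,}")   # 5,000,000,000
print(f"lora:  {lora_params:,}")   # 16,000,000
print(f"ratio: {full_params // lora_params}x fewer trainable parameters")
```

&lt;p&gt;With rank 8, the trainable-parameter count drops by roughly 300x in this toy setup, which is why LoRA fits on consumer GPUs.&lt;/p&gt;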

&lt;h3&gt;Where LoRA is Used&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Chatbot customization&lt;/li&gt;
&lt;li&gt;Domain-specific summarization&lt;/li&gt;
&lt;li&gt;Speech and vision-language models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh82fghbwavplygtvwljy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh82fghbwavplygtvwljy.jpg" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How Self-Attention Actually Works (Simple Explanation)</title>
      <dc:creator>Ajith Kumar</dc:creator>
      <pubDate>Wed, 05 Nov 2025 10:48:50 +0000</pubDate>
      <link>https://forem.com/ajith_kumar_593bb762c09ce/how-self-attention-actually-works-simple-explanation-4086</link>
      <guid>https://forem.com/ajith_kumar_593bb762c09ce/how-self-attention-actually-works-simple-explanation-4086</guid>
      <description>&lt;p&gt;Self-attention is one of the core ideas behind modern Transformer models such as BERT, GPT, and T5. &lt;br&gt;
It allows a model to understand relationships between words in a sequence, regardless of where they appear.&lt;/p&gt;

&lt;h3&gt;Why Self-Attention?&lt;/h3&gt;

&lt;p&gt;Earlier models like RNNs and LSTMs processed words in order, making it difficult to learn long-range dependencies. &lt;br&gt;
Self-attention solves this by allowing every word to look at every other word in the sentence at the same time.&lt;/p&gt;

&lt;h3&gt;Key Idea&lt;/h3&gt;

&lt;p&gt;Each word in a sentence is transformed into three vectors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query (Q)&lt;/strong&gt; – What the word is looking for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key (K)&lt;/strong&gt; – What information the word exposes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value (V)&lt;/strong&gt; – The actual information carried by the word&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model computes similarity scores between words using dot products of queries and keys, scaled by the square root of the key dimension to keep the scores numerically stable. &lt;br&gt;
These scores are then normalized with softmax to determine how much attention one word should pay to another.&lt;/p&gt;
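
&lt;p&gt;Put together, scaled dot-product attention fits in a few lines (a minimal NumPy sketch with made-up dimensions; real models add masking, batching, and learned projections trained by backpropagation):&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # project each word into Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every word with every other
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention distribution
    return weights @ V                   # weighted mix of the value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8         # e.g. 5 words, 16-dim embeddings
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

&lt;p&gt;Each output row is a blend of all value vectors, weighted by how relevant every other word is to that position.&lt;/p&gt;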

&lt;h3&gt;Example&lt;/h3&gt;

&lt;p&gt;In the sentence &lt;em&gt;"The cat chased the mouse"&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When focusing on the word “chased,” it may attend more to “cat” (the subject) and “mouse” (the object)&lt;/li&gt;
&lt;li&gt;Attention weights tell the model which words are relevant for understanding a given word&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Multi-Head Attention&lt;/h3&gt;

&lt;p&gt;Instead of one set of Q, K, and V, the model uses multiple heads. &lt;br&gt;
Each head focuses on different relationships (syntax, meaning, etc.).&lt;/p&gt;
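
&lt;p&gt;Running several heads and concatenating their outputs can be sketched as follows (toy NumPy version with arbitrary dimensions; real implementations batch all heads into one tensor and add a final output projection):&lt;/p&gt;

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    # Each head has its own (Wq, Wk, Wv); head outputs are concatenated
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)  # (seq_len, n_heads * d_head)

rng = np.random.default_rng(1)
seq_len, d_model, n_heads, d_head = 5, 16, 4, 4
X = rng.standard_normal((seq_len, d_model))
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

out = multi_head_attention(X, heads)
print(out.shape)  # (5, 16) -- 4 heads of 4 dims each
```

&lt;p&gt;Because each head has independent projections, one head can specialize in, say, subject–verb links while another tracks nearby-word context.&lt;/p&gt;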

&lt;h3&gt;Benefits of Self-Attention&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Learns long-range relationships easily&lt;/li&gt;
&lt;li&gt;Can process words in parallel (faster than RNNs)&lt;/li&gt;
&lt;li&gt;Works well for multilingual and domain-specific language tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaojhr304z0e0f9kmtau.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaojhr304z0e0f9kmtau.jpg" alt=" " width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
