<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rushi Chaudhari</title>
    <description>The latest articles on Forem by Rushi Chaudhari (@rushichaudhari).</description>
    <link>https://forem.com/rushichaudhari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F341175%2F8d25a7e2-275f-45e9-83e7-76da8f15c718.png</url>
      <title>Forem: Rushi Chaudhari</title>
      <link>https://forem.com/rushichaudhari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rushichaudhari"/>
    <language>en</language>
    <item>
      <title>Training LLMs on Mixed GPUs: My Experiments and What I Learnt</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Fri, 28 Nov 2025 16:59:38 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/training-llms-on-mixed-gpus-my-experiments-and-what-i-learnt-1k7n</link>
      <guid>https://forem.com/rushichaudhari/training-llms-on-mixed-gpus-my-experiments-and-what-i-learnt-1k7n</guid>
      <description>&lt;p&gt;Over the last few months I have become very interested in large language models. At the same time, the GPU market is changing: Nvidia is still the leader, but AMD, Intel, and even Chinese companies are shipping cheaper GPUs. The main challenge is that CUDA remains the dominant software stack and Nvidia's drivers are not open source, so using non‑Nvidia GPUs is still not smooth.&lt;/p&gt;

&lt;p&gt;As someone who runs a homelab, I wanted a setup where I could use different GPUs together. But even mixing two Nvidia GPUs from different generations is hard: upgrading from an RTX 3090 to an RTX 5090 may require a different CUDA version, a different Python version, and a different PyTorch version. New architectures like Blackwell also take time to land in mainstream frameworks.&lt;/p&gt;

&lt;p&gt;So many people end up buying a second GPU of the same model just to do multi‑GPU training.&lt;/p&gt;

&lt;p&gt;I wanted to avoid that and see if mixed‑GPU training is possible.&lt;/p&gt;

&lt;h2&gt;System Architecture Diagram&lt;/h2&gt;

&lt;p&gt;The system auto-generates a topology diagram after you configure and run the coordinator once. The generated file is saved as &lt;code&gt;architecture.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42p1m77gei2rfwkjmoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42p1m77gei2rfwkjmoa.png" alt="architecture" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What Current ML Systems Support&lt;/h2&gt;

&lt;p&gt;I looked into several existing systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSpeed&lt;/li&gt;
&lt;li&gt;Megatron‑LM&lt;/li&gt;
&lt;li&gt;PyTorch Distributed + TorchGpipe&lt;/li&gt;
&lt;li&gt;vLLM&lt;/li&gt;
&lt;li&gt;Colossal‑AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these are powerful, but none properly support mixing CUDA and ROCm GPUs in one training job.&lt;/p&gt;

&lt;p&gt;There is a project called UCC (Unified Collective Communication) that tries to help, but its PyTorch integration (torch‑ucc) is experimental and has been archived:&lt;br&gt;
&lt;a href="https://github.com/openucx/torch-ucc" rel="noopener noreferrer"&gt;https://github.com/openucx/torch-ucc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UCX developers have also said that CUDA and ROCm support exists “in theory”, but mixed setups were never fully tested:&lt;br&gt;
&lt;a href="https://github.com/openucx/ucx/discussions/9985" rel="noopener noreferrer"&gt;https://github.com/openucx/ucx/discussions/9985&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So true heterogeneous GPU training is still not ready in major frameworks.&lt;/p&gt;

&lt;h2&gt;Papers Trying to Solve This&lt;/h2&gt;

&lt;p&gt;I found some research papers that aim to solve heterogeneous GPU training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HetHub
&lt;a href="https://arxiv.org/pdf/2405.16256" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2405.16256&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HyperPipe
&lt;a href="https://ieeexplore.ieee.org/document/11033309" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/11033309&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cephalo
&lt;a href="https://dl.acm.org/doi/10.1145/3721145.3730418" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3721145.3730418&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HeterMoE
&lt;a href="https://arxiv.org/pdf/2504.03871" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2504.03871&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Zorse
&lt;a href="https://arxiv.org/abs/2507.10392" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2507.10392&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These papers show that the idea is possible, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;None of these are open source&lt;/li&gt;
&lt;li&gt;Real‑world implementations are still missing&lt;/li&gt;
&lt;li&gt;Homelab users cannot use these systems directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of all these limitations, I decided to build my own simple framework.&lt;/p&gt;

&lt;h2&gt;How my HeteroGPU framework enables mixed‑GPU pipeline training in homelabs&lt;/h2&gt;

&lt;p&gt;My goal was very simple:&lt;/p&gt;

&lt;p&gt;I wanted to run LLM training across different GPUs in my homelab, even if they belong to different generations or vendors, without depending on complicated distributed frameworks.&lt;/p&gt;

&lt;p&gt;My HeteroGPU framework does this by providing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Layer‑based pipeline parallelism&lt;br&gt;
The model is split by layers so it can run across GPUs with different VRAM sizes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple coordinator–worker design&lt;br&gt;
The main machine holds the first part of the model and remote machines run the later layers. They communicate through a lightweight socket interface over 10Gb Ethernet (Thunderbolt is not implemented yet); see the sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for mixed GPU speeds&lt;br&gt;
A faster GPU can take more layers; a slower GPU takes fewer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Small and hackable codebase&lt;br&gt;
Ideal for homelab experimentation, unlike large frameworks like DeepSpeed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Profiler inspired by Cephalo&lt;br&gt;
Helps decide how to split layers between GPUs based on compute speed, memory capacity, and communication delay.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Works even when GPUs require different drivers or CUDA versions&lt;br&gt;
Because each machine only loads its own shard locally and communicates via raw tensors over the network, you do not need unified CUDA versions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
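
&lt;p&gt;To make the coordinator–worker exchange concrete, here is a minimal sketch of the kind of framing such a socket interface needs. The length-prefix-plus-pickle format and the function names are my illustration, not the actual HeteroShard wire protocol:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative framing (assumed, not HeteroShard's actual protocol):
# each message is a 4-byte length prefix followed by a pickled CPU tensor.
import pickle
import socket
import struct

import torch

def send_tensor(sock, tensor):
    # Serialize on CPU so the bytes are device-agnostic; the receiver
    # moves them onto its own GPU (CUDA or ROCm) after unpickling.
    payload = pickle.dumps(tensor.detach().cpu())
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    buf = b""
    while len(buf) &lt; n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_tensor(sock, device):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length)).to(device)

# Coordinator side, e.g.:
#   sock = socket.create_connection(("192.168.1.166", 9999))
#   send_tensor(sock, hidden_states)
#   grads = recv_tensor(sock, "cuda")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Shipping tensors through CPU memory is what keeps the link vendor-neutral: neither side needs to know whether the other GPU speaks CUDA or ROCm.&lt;/p&gt;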

&lt;p&gt;This makes heterogeneous pipeline training practical for home users who may have a strong Nvidia GPU as the main device, an older GPU on another machine, or even an integrated GPU like Strix Halo. With this design, training becomes possible even when a single GPU cannot fit the model.&lt;/p&gt;

&lt;h2&gt;Quick Explanation of Parallelism&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data Parallelism: Copy the whole model to each GPU and split the batch.&lt;/li&gt;
&lt;li&gt;Tensor / Model Parallelism: Split each layer across GPUs. Very communication heavy.&lt;/li&gt;
&lt;li&gt;Pipeline Parallelism: Split the model layer‑wise. GPU 1 runs early layers, GPU 2 runs later layers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pipeline parallelism is the easiest to apply to mixed GPUs. The main drawback is pipeline bubbles: one GPU often waits while the other works. But it still allows training when a model cannot fit into one GPU.&lt;/p&gt;
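
&lt;p&gt;As a toy, single-process illustration of the pipeline idea (assuming two visible devices; the even split is a placeholder that a profiler would normally choose):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy pipeline split: early layers on one device, later layers on
# another, with activations handed across between the stages.
import torch
import torch.nn as nn

layers = [nn.Linear(512, 512) for _ in range(8)]
split = 4  # in practice, chosen from GPU speed and VRAM profiles
stage0 = nn.Sequential(*layers[:split]).to("cuda:0")
stage1 = nn.Sequential(*layers[split:]).to("cuda:1")

x = torch.randn(2, 512, device="cuda:0")
hidden = stage0(x)                  # early layers on GPU 0
out = stage1(hidden.to("cuda:1"))   # later layers on GPU 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;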

&lt;h2&gt;My Experiments With LLaMA Finetuning&lt;/h2&gt;

&lt;p&gt;I tested the same training script on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RTX 5090 single GPU&lt;/li&gt;
&lt;li&gt;AMD Strix Halo single GPU&lt;/li&gt;
&lt;li&gt;Two‑machine pipeline setup&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The results below show how mixed GPU training behaves in practice.&lt;/p&gt;

&lt;h2&gt;RTX 5090 (Single GPU)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;» python examples/alpaca_example_singlemachine.py
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
Epoch 0 | Step 10 | Loss 2.4383 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8139 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.4709 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.2903 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2693 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2671 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.

Sample generation:
 &amp;lt;s&amp;gt;You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a haiku about GPUs.
&amp;lt;|assistant|&amp;gt;
In the lab, the GPU
Is the heart of the machine,
Running calculations.
&amp;lt;/s&amp;gt;

Total training time: 289.11 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training time: 289 seconds&lt;br&gt;
Loss dropped smoothly from 2.43 to 1.26. &lt;br&gt;
Fast and stable.&lt;/p&gt;

&lt;h2&gt;Strix Halo (Single GPU)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python examples/alpaca_example_singlemachine.py 
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695

Epoch 0 | Step 10 | Loss 2.4027 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8115 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.2460 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.4227 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2628 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2507 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.

Sample generation:
 &amp;lt;s&amp;gt;You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a haiku about GPUs.
&amp;lt;|assistant|&amp;gt;
A GPU, a powerful tool
For processing data and computing
A helpful aid for many a task.
&amp;lt;/s&amp;gt;

Total training time: 3242.91 seconds
(.venv) [alpha@toolbx HeteroShard]$ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training time: 3243 seconds&lt;br&gt;
Loss also went down correctly, but the run was extremely slow: about 11 times slower than the 5090. This shows the large performance gap between GPU types.&lt;/p&gt;

&lt;h2&gt;Distributed Pipeline Training (Two GPUs)&lt;/h2&gt;

&lt;p&gt;Full coordinator and worker logs:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;» python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: doraemon-arch (192.168.1.153)
✓ Role: COORDINATOR

======================================================================
COORDINATOR MODE - LLAMA 8B 4-BIT TRAINING
======================================================================

Device: cuda
Worker: worker1 (192.168.1.166:9999)
Split: Layers 0-15 (local) | 16-31 (remote)

Connecting to worker...
✓ Connected

Loading tokenizer...
Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695

Creating local shard...
✓ Local shard ready (Embedding + Layers 0-15)

Loading dataset...
✓ Dataset: 100 examples

======================================================================
TRAINING
======================================================================
Steps: 25 | Batch: 1 | Accum: 4

/mnt/sdc3/Documents/hetrogpu/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
Epoch 0 | Step 1/25 | Loss 2.3243 | LR 0.000020
Epoch 0 | Step 2/25 | Loss 2.4754 | LR 0.000040
Epoch 0 | Step 3/25 | Loss 2.4923 | LR 0.000060
Epoch 0 | Step 4/25 | Loss 2.7389 | LR 0.000080
Epoch 0 | Step 5/25 | Loss 2.1877 | LR 0.000100
Epoch 0 | Step 6/25 | Loss 2.0371 | LR 0.000120
Epoch 0 | Step 7/25 | Loss 2.3928 | LR 0.000140
Epoch 0 | Step 8/25 | Loss 1.5122 | LR 0.000160
Epoch 0 | Step 9/25 | Loss 1.9724 | LR 0.000180
Epoch 0 | Step 10/25 | Loss 2.2792 | LR 0.000200
Epoch 0 | Step 11/25 | Loss 1.9573 | LR 0.000198
Epoch 0 | Step 12/25 | Loss 1.4388 | LR 0.000192
Epoch 0 | Step 13/25 | Loss 1.8510 | LR 0.000183
Epoch 0 | Step 14/25 | Loss 1.6279 | LR 0.000170
Epoch 0 | Step 15/25 | Loss 1.4549 | LR 0.000155
Epoch 0 | Step 16/25 | Loss 1.2129 | LR 0.000138
Epoch 0 | Step 17/25 | Loss 1.3626 | LR 0.000119
Epoch 0 | Step 18/25 | Loss 1.2285 | LR 0.000101
Epoch 0 | Step 19/25 | Loss 1.4700 | LR 0.000082
Epoch 0 | Step 20/25 | Loss 1.3244 | LR 0.000065
Epoch 0 | Step 21/25 | Loss 1.4875 | LR 0.000050
Epoch 0 | Step 22/25 | Loss 1.4656 | LR 0.000037
Epoch 0 | Step 23/25 | Loss 1.0804 | LR 0.000028
Epoch 0 | Step 24/25 | Loss 1.5531 | LR 0.000022
Epoch 0 | Step 25/25 | Loss 1.0947 | LR 0.000020

✓ Training complete!
Total training time: 184.59 seconds
Saved LoRA adapters to: ./lora_unsloth_sft_distributed/lora

Sample generation:
 You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a short haiku about distributed training.
&amp;lt;|assistant|&amp;gt;
Distributed training,
Like a symphony,
All the parts work together.






--- 



$ python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: toolbx (192.168.1.166)
✓ Role: WORKER 1

======================================================================
WORKER MODE - LLAMA 8B 4-BIT (LAYERS 16-31)
======================================================================

Device: cuda
Port: 9999

Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Creating remote shard...
✓ Remote shard ready (Layers 16-31)

Listening on 0.0.0.0:9999...
✓ Connected to coordinator at ('192.168.1.153', 46384)

[Step 0] Waiting for data...
/torch-therock/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1035: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
[Step 0] Loss: 1.6613
[Step 0] ✓ Complete

[Step 1] Waiting for data...
[Step 1] Loss: 2.5880
[Step 1] ✓ Complete

[Step 2] Waiting for data...
[Step 2] Loss: 3.1850
[Step 2] ✓ Complete

[Step 3] Waiting for data...
[Step 3] Loss: 1.8631
[Step 3] ✓ Complete

[Step 4] Waiting for data...
[Step 4] Loss: 2.3016
[Step 4] ✓ Complete

[Step 5] Waiting for data...
[Step 5] Loss: 2.4796
[Step 5] ✓ Complete

[Step 6] Waiting for data...
[Step 6] Loss: 2.7196
[Step 6] ✓ Complete

[Step 7] Waiting for data...
[Step 7] Loss: 2.4008
[Step 7] ✓ Complete

[Step 8] Waiting for data...
[Step 8] Loss: 1.9301
[Step 8] ✓ Complete

[Step 9] Waiting for data...
[Step 9] Loss: 1.9098
[Step 9] ✓ Complete

[Step 10] Waiting for data...
[Step 10] Loss: 3.0177
[Step 10] ✓ Complete

[Step 11] Waiting for data...
[Step 11] Loss: 3.1114
[Step 11] ✓ Complete

[Step 12] Waiting for data...
[Step 12] Loss: 1.7507
[Step 12] ✓ Complete

[Step 13] Waiting for data...
[Step 13] Loss: 3.0108
[Step 13] ✓ Complete

[Step 14] Waiting for data...
[Step 14] Loss: 2.5046
[Step 14] ✓ Complete

[Step 15] Waiting for data...
[Step 15] Loss: 3.6894
[Step 15] ✓ Complete

[Step 16] Waiting for data...
[Step 16] Loss: 1.8336
[Step 16] ✓ Complete

[Step 17] Waiting for data...
[Step 17] Loss: 1.5026
[Step 17] ✓ Complete

[Step 18] Waiting for data...
[Step 18] Loss: 3.4676
[Step 18] ✓ Complete

[Step 19] Waiting for data...
[Step 19] Loss: 1.9469
[Step 19] ✓ Complete

[Step 20] Waiting for data...
[Step 20] Loss: 2.0781
[Step 20] ✓ Complete

[Step 21] Waiting for data...
[Step 21] Loss: 1.7651
[Step 21] ✓ Complete

[Step 22] Waiting for data...
[Step 22] Loss: 2.0139
[Step 22] ✓ Complete

[Step 23] Waiting for data...
[Step 23] Loss: 2.2912
[Step 23] ✓ Complete

[Step 24] Waiting for data...
[Step 24] Loss: 2.6897
[Step 24] ✓ Complete

[Step 25] Waiting for data...
[Step 25] Loss: 2.8378
[Step 25] ✓ Complete

[Step 26] Waiting for data...
[Step 26] Loss: 1.9898
[Step 26] ✓ Complete

[Step 27] Waiting for data...
[Step 27] Loss: 2.0538
[Step 27] ✓ Complete

[Step 28] Waiting for data...
[Step 28] Loss: 1.6081
[Step 28] ✓ Complete

[Step 29] Waiting for data...
[Step 29] Loss: 1.4623
[Step 29] ✓ Complete

[Step 30] Waiting for data...
[Step 30] Loss: 1.2606
[Step 30] ✓ Complete

[Step 31] Waiting for data...
[Step 31] Loss: 1.7178
[Step 31] ✓ Complete

[Step 32] Waiting for data...
[Step 32] Loss: 1.9203
[Step 32] ✓ Complete

[Step 33] Waiting for data...
[Step 33] Loss: 1.6814
[Step 33] ✓ Complete

[Step 34] Waiting for data...
[Step 34] Loss: 2.5819
[Step 34] ✓ Complete

[Step 35] Waiting for data...
[Step 35] Loss: 1.7061
[Step 35] ✓ Complete

[Step 36] Waiting for data...
[Step 36] Loss: 2.3311
[Step 36] ✓ Complete

[Step 37] Waiting for data...
[Step 37] Loss: 2.2990
[Step 37] ✓ Complete

[Step 38] Waiting for data...
[Step 38] Loss: 1.8855
[Step 38] ✓ Complete

[Step 39] Waiting for data...
[Step 39] Loss: 2.6010
[Step 39] ✓ Complete

[Step 40] Waiting for data...
[Step 40] Loss: 2.3807
[Step 40] ✓ Complete

[Step 41] Waiting for data...
[Step 41] Loss: 2.0204
[Step 41] ✓ Complete

[Step 42] Waiting for data...
[Step 42] Loss: 1.7209
[Step 42] ✓ Complete

[Step 43] Waiting for data...
[Step 43] Loss: 1.7073
[Step 43] ✓ Complete

[Step 44] Waiting for data...
[Step 44] Loss: 1.1900
[Step 44] ✓ Complete

[Step 45] Waiting for data...
[Step 45] Loss: 1.8439
[Step 45] ✓ Complete

[Step 46] Waiting for data...
[Step 46] Loss: 1.1291
[Step 46] ✓ Complete

[Step 47] Waiting for data...
[Step 47] Loss: 1.5923
[Step 47] ✓ Complete

[Step 48] Waiting for data...
[Step 48] Loss: 1.9110
[Step 48] ✓ Complete

[Step 49] Waiting for data...
[Step 49] Loss: 1.1971
[Step 49] ✓ Complete

[Step 50] Waiting for data...
[Step 50] Loss: 3.0576
[Step 50] ✓ Complete

[Step 51] Waiting for data...
[Step 51] Loss: 1.2383
[Step 51] ✓ Complete

[Step 52] Waiting for data...
[Step 52] Loss: 1.6820
[Step 52] ✓ Complete

[Step 53] Waiting for data...
[Step 53] Loss: 1.7755
[Step 53] ✓ Complete

[Step 54] Waiting for data...
[Step 54] Loss: 1.2515
[Step 54] ✓ Complete

[Step 55] Waiting for data...
[Step 55] Loss: 1.8027
[Step 55] ✓ Complete

[Step 56] Waiting for data...
[Step 56] Loss: 1.2692
[Step 56] ✓ Complete

[Step 57] Waiting for data...
[Step 57] Loss: 1.6293
[Step 57] ✓ Complete

[Step 58] Waiting for data...
[Step 58] Loss: 1.1256
[Step 58] ✓ Complete

[Step 59] Waiting for data...
[Step 59] Loss: 1.7956
[Step 59] ✓ Complete

[Step 60] Waiting for data...
[Step 60] Loss: 1.3114
[Step 60] ✓ Complete

[Step 61] Waiting for data...
[Step 61] Loss: 1.4944
[Step 61] ✓ Complete

[Step 62] Waiting for data...
[Step 62] Loss: 0.9233
[Step 62] ✓ Complete

[Step 63] Waiting for data...
[Step 63] Loss: 1.1224
[Step 63] ✓ Complete

[Step 64] Waiting for data...
[Step 64] Loss: 1.4849
[Step 64] ✓ Complete

[Step 65] Waiting for data...
[Step 65] Loss: 1.0226
[Step 65] ✓ Complete

[Step 66] Waiting for data...
[Step 66] Loss: 1.3064
[Step 66] ✓ Complete

[Step 67] Waiting for data...
[Step 67] Loss: 1.6367
[Step 67] ✓ Complete

[Step 68] Waiting for data...
[Step 68] Loss: 1.6595
[Step 68] ✓ Complete

[Step 69] Waiting for data...
[Step 69] Loss: 1.3235
[Step 69] ✓ Complete

[Step 70] Waiting for data...
[Step 70] Loss: 0.8673
[Step 70] ✓ Complete

[Step 71] Waiting for data...
[Step 71] Loss: 1.0639
[Step 71] ✓ Complete

[Step 72] Waiting for data...
[Step 72] Loss: 1.6803
[Step 72] ✓ Complete

[Step 73] Waiting for data...
[Step 73] Loss: 1.5877
[Step 73] ✓ Complete

[Step 74] Waiting for data...
[Step 74] Loss: 1.3728
[Step 74] ✓ Complete

[Step 75] Waiting for data...
[Step 75] Loss: 1.2393
[Step 75] ✓ Complete

[Step 76] Waiting for data...
[Step 76] Loss: 1.4007
[Step 76] ✓ Complete

[Step 77] Waiting for data...
[Step 77] Loss: 0.9818
[Step 77] ✓ Complete

[Step 78] Waiting for data...
[Step 78] Loss: 1.3658
[Step 78] ✓ Complete

[Step 79] Waiting for data...
[Step 79] Loss: 1.5493
[Step 79] ✓ Complete

[Step 80] Waiting for data...
[Step 80] Loss: 1.3884
[Step 80] ✓ Complete

[Step 81] Waiting for data...
[Step 81] Loss: 1.3920
[Step 81] ✓ Complete

[Step 82] Waiting for data...
[Step 82] Loss: 1.9356
[Step 82] ✓ Complete

[Step 83] Waiting for data...
[Step 83] Loss: 1.2340
[Step 83] ✓ Complete

[Step 84] Waiting for data...
[Step 84] Loss: 1.2280
[Step 84] ✓ Complete

[Step 85] Waiting for data...
[Step 85] Loss: 1.7844
[Step 85] ✓ Complete

[Step 86] Waiting for data...
[Step 86] Loss: 1.2704
[Step 86] ✓ Complete

[Step 87] Waiting for data...
[Step 87] Loss: 1.5795
[Step 87] ✓ Complete

[Step 88] Waiting for data...
[Step 88] Loss: 0.9333
[Step 88] ✓ Complete

[Step 89] Waiting for data...
[Step 89] Loss: 0.9236
[Step 89] ✓ Complete

[Step 90] Waiting for data...
[Step 90] Loss: 1.0831
[Step 90] ✓ Complete

[Step 91] Waiting for data...
[Step 91] Loss: 1.3817
[Step 91] ✓ Complete

[Step 92] Waiting for data...
[Step 92] Loss: 1.3752
[Step 92] ✓ Complete

[Step 93] Waiting for data...
[Step 93] Loss: 1.9094
[Step 93] ✓ Complete

[Step 94] Waiting for data...
[Step 94] Loss: 1.6458
[Step 94] ✓ Complete

[Step 95] Waiting for data...
[Step 95] Loss: 1.2820
[Step 95] ✓ Complete

[Step 96] Waiting for data...
[Step 96] Loss: 1.5715
[Step 96] ✓ Complete

[Step 97] Waiting for data...
[Step 97] Loss: 0.8391
[Step 97] ✓ Complete

[Step 98] Waiting for data...
[Step 98] Loss: 0.9126
[Step 98] ✓ Complete

[Step 99] Waiting for data...
[Step 99] Loss: 1.0555
[Step 99] ✓ Complete

[Step 100] Waiting for data...
Connection closed.
(.venv) [alpha@toolbx HeteroShard]$ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Training time: 184 seconds&lt;/p&gt;

&lt;p&gt;The model was split as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layers 0–15 on the main machine&lt;/li&gt;
&lt;li&gt;Layers 16–31 on the worker machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both GPUs handled their parts. The worker log cycles through “Waiting for data”, “Loss”, and “Complete”; it counts 100 micro-steps while the coordinator reports 25 optimizer steps, because gradients are accumulated over 4 micro-batches. The waiting lines are the pipeline stalls, which are expected. Still, the total time was faster than the single 5090.&lt;/p&gt;

&lt;h2&gt;What I Learnt From These Runs&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Mixed‑GPU pipeline training works in real life, not just in papers.&lt;/li&gt;
&lt;li&gt;Speed depends on the slowest GPU, so good splitting is important.&lt;/li&gt;
&lt;li&gt;Distributed training has waiting time and communication cost, but can still beat a single strong GPU.&lt;/li&gt;
&lt;li&gt;Consumer GPUs vary hugely in speed, which is why homelab users need flexible systems.&lt;/li&gt;
&lt;li&gt;A simple framework like HeteroGPU can achieve things that big frameworks do not support yet.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;My Profiler System&lt;/h2&gt;

&lt;p&gt;The profiler I added does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs tiny batches on each GPU&lt;/li&gt;
&lt;li&gt;Measures latency and memory usage&lt;/li&gt;
&lt;li&gt;Builds simple linear models to predict performance&lt;/li&gt;
&lt;li&gt;Measures communication cost&lt;/li&gt;
&lt;li&gt;Chooses the best pipeline split&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the idea in the Cephalo paper:&lt;br&gt;
&lt;a href="https://dl.acm.org/doi/10.1145/3721145.3730418" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3721145.3730418&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows the system to work even when one GPU is fast but has little VRAM and another is slow but has plenty.&lt;/p&gt;
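
&lt;p&gt;A stripped-down version of that measurement loop is sketched below: time a block on a few tiny batches, then fit latency as a linear function of batch size. The function and the least-squares fit are illustrative, not the repo's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative profiler sketch: time one block (any module that takes a
# hidden-state tensor) at a few batch sizes, then fit
# latency = a * batch_size + b for use in layer assignment.
import time

import torch

def profile_block(block, hidden_size, device, batch_sizes=(1, 2, 4)):
    block = block.to(device)
    samples = []
    for bs in batch_sizes:
        x = torch.randn(bs, 128, hidden_size, device=device)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            block(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        samples.append((bs, time.perf_counter() - start))
    # Least-squares fit of latency = a * batch_size + b.
    n = len(samples)
    sx = sum(b for b, _ in samples)
    sy = sum(t for _, t in samples)
    sxy = sum(b * t for b, t in samples)
    sxx = sum(b * b for b, _ in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;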

&lt;h2&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;Now I plan to experiment with one of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HeterMoE: &lt;a href="https://arxiv.org/pdf/2504.03871" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2504.03871&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zorse: &lt;a href="https://arxiv.org/abs/2507.10392" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2507.10392&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MoE (Mixture‑of‑Experts) models are naturally suited for heterogeneous hardware, so they may perform better in mixed GPU clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Github repo&lt;/strong&gt;: &lt;a href="https://github.com/0xrushi/HeteroShard" rel="noopener noreferrer"&gt;https://github.com/0xrushi/HeteroShard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nvidia</category>
      <category>amd</category>
      <category>dgx</category>
    </item>
    <item>
      <title>Is Google Colab Pro Really Worth It?</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Wed, 01 May 2024 05:31:01 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/is-google-colab-pro-really-worth-it-5531</link>
      <guid>https://forem.com/rushichaudhari/is-google-colab-pro-really-worth-it-5531</guid>
      <description>&lt;p&gt;In late 2022, Google revamped its widely-used Colab platform, transitioning from a subscription-based system to a pay-as-you-go model under the new Colab Pro and Pro+ schemes. This change introduced "compute units," which serve as the new currency within the platform, where the consumption rate depends on the virtual machine's configuration and the use of specialized accelerators like TPUs or GPUs.&lt;/p&gt;

&lt;p&gt;Here's a breakdown of how the compute units are consumed based on different GPUs, assuming an allocation of 100 units (the arithmetic is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T4&lt;/strong&gt;: Consumes 1.96 units per hour, providing about 51 hours of use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V100&lt;/strong&gt;: Requires 5 units per hour, totaling about 20 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A100&lt;/strong&gt;: Demands 15 units per hour, which amounts to roughly 6.7 hours.&lt;/li&gt;
&lt;/ul&gt;
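
&lt;p&gt;The arithmetic is just the allocation divided by the hourly burn rate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hours of runtime from a 100-unit allocation at each GPU's burn rate
# (the per-hour rates listed above).
rates = {"T4": 1.96, "V100": 5.0, "A100": 15.0}
units = 100
for gpu, per_hour in rates.items():
    print(f"{gpu}: {units / per_hour:.1f} hours")
# T4: 51.0 hours | V100: 20.0 hours | A100: 6.7 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;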

&lt;p&gt;It's important to note that the T4 GPU is available for free; however, its availability under the Colab Pro tier is not guaranteed, often necessitating the use of costlier alternatives.&lt;/p&gt;

&lt;p&gt;This shift has introduced a layer of complexity that many users find disappointing, especially when there are more straightforward options available on the market. For comparison, here's a quick overview of pricing and availability from various smaller cloud providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Labs, Jarvislabs.ai, tensordock, genesis cloud, paperspace, Vast.ai, and FluidStack&lt;/strong&gt; offer a range of GPU options like NVIDIA A100 PCIe and V100 at varying price points and hourly rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Competitive Analysis of Cloud Computing Providers&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43d4xmavsodv1b6t7nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43d4xmavsodv1b6t7nr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When considering cost-effectiveness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the A100 GPU, &lt;strong&gt;Paperspace&lt;/strong&gt; offers the lowest price at $1.15 per hour.&lt;/li&gt;
&lt;li&gt;For the V100 GPU, &lt;strong&gt;Vast.ai&lt;/strong&gt; provides the most affordable rate at $0.16 per hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it's essential to highlight that major cloud services like AWS, Azure, and Google Cloud were excluded from this comparison due to their higher prices, despite offering better scalability and integration.&lt;/p&gt;

&lt;p&gt;Potential users should be aware that the availability of instances on smaller clouds can be unpredictable, making them more suitable for personal projects rather than enterprise solutions. Additionally, these platforms may not always have complete libraries installed (e.g., Hugging Face on Paperspace), which could extend setup times.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Before subscribing to Colab Pro, thoroughly explore and compare alternative cloud platforms that may offer better rates or features suited to your needs.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gpu</category>
      <category>cloud</category>
      <category>google</category>
    </item>
    <item>
      <title>Exploring Low-Rank Adaptation (LoRA) from scratch</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Thu, 25 Apr 2024 05:11:48 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/exploring-low-rank-adaptation-lora-from-scratch-2jc1</link>
      <guid>https://forem.com/rushichaudhari/exploring-low-rank-adaptation-lora-from-scratch-2jc1</guid>
      <description>&lt;p&gt;Notebook link: &lt;a href="https://github.com/0xrushi/deep-learning-notebooks/blob/main/GPU/Exploring%20Low-Rank%20Adaptation%20(LoRA)%20from%20scratch.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've been exploring LoRA and was seeking a straightforward implementation example. Most resources I've found either focus on training large models with PEFT and the loralib package, or show basic implementations on CNNs or ANNs, as outlined in sources like [2].&lt;/p&gt;

&lt;p&gt;I came across some examples using LoRA with BERT, DistilBERT, and others involving a Linear() layer. However, I'm specifically interested in applying it to GPT-2, which uses a Conv1D() layer instead of Linear().&lt;/p&gt;

&lt;p&gt;Deep learning models these days have significantly more layers. One major challenge with fine-tuning large models like GPT is their size: they often don't fit into the limited VRAM available. To address this, researchers at Microsoft developed the Low-Rank Adaptation (LoRA) technique, which leverages low-rank matrix decomposition. They showed that common pre-trained models can be effectively fine-tuned by training only a small number of additional low-rank parameters instead of modifying every original parameter. This approach not only reduces VRAM requirements but can be just as effective as fine-tuning the full set of parameters.&lt;/p&gt;

&lt;p&gt;LoRA approximates a layer's weight changes during training, ΔW, in a low-rank format.&lt;/p&gt;

&lt;p&gt;For instance, whereas in regular fine-tuning we compute the weight update of a weight matrix W as ΔW, in LoRA we approximate ΔW by the product of two smaller matrices, AB, as illustrated in the figure below. (If you are familiar with PCA or SVD, think of this as decomposing ΔW into A and B.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv09feiaqhaov34350hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv09feiaqhaov34350hu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With LoRA, the transformation in a particular layer originally involved just 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;W⋅xW \cdot x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the weight matrix and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the input. This operation now includes an additional term, resulting in 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Wx+(WAWB)⋅xWx + (W_A W_B) \cdot x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original Operation&lt;/strong&gt;: The operation 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WxWx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 involves 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, a large matrix typically with dimensions like 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;768×768768 \times 768&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;768&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;768&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 as seen in models like BERT or GPT-2. The computational complexity of this operation is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(d2)O(d^2)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dd&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the dimension of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (assuming square matrices for simplicity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA Operation&lt;/strong&gt;: In the LoRA approach, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAW_A&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WBW_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 are smaller matrices with dimensions 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;d×rd \times r&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;r×dr \times d&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 respectively, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;rr&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is much smaller than 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dd&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (indicating low rank). The product 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAWBW_A W_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, therefore, has the same dimension as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 but is composed of two smaller matrices. This configuration reduces the computational load significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, the product 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAWBW_A W_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is computed, which involves a complexity of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(d2⋅r)O(d^2 \cdot r)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/li&gt;
&lt;li&gt;Then, this product multiplies the input 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, resulting in 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(WAWB)x(W_A W_B)x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, with a computational complexity similar to the original operation 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WxWx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, but the initial reduction in complexity due to the lower rank matrices helps to manage overall computational demands effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, consider a weight matrix W in a specific layer, sized at 2,000x10,000, totaling 20 million parameters. If we opt for a rank r=3, we would set up two new matrices: a 2,000x3 matrix A and a 3x10,000 matrix B. Together, matrices A and B contain just 6,000 + 30,000 = 36,000 parameters, over 555 times fewer than the 20 million parameters typically involved in standard fine-tuning with ΔW.&lt;/p&gt;
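
&lt;p&gt;To make this concrete, here is a minimal from-scratch LoRA wrapper around a frozen weight matrix, checking the parameter count for the 2,000x10,000 example above. This is an illustrative sketch, not the PEFT/loralib implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal from-scratch LoRA sketch: y = x @ W + (x @ A) @ B, with W
# frozen and only A (d x r) and B (r x k) trained. Illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # small init
        self.B = nn.Parameter(torch.zeros(r, k))  # so A @ B starts at zero

    def forward(self, x):
        return x @ self.W + (x @ self.A) @ self.B

layer = LoRALinear(d=2000, k=10000, r=3)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                  # 36000 = 2000*3 + 3*10000
print(2000 * 10000 // trainable)  # ~555x fewer than updating all of W
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;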

&lt;p&gt;We'll use the News Articles dataset from Kaggle to explore experiments with GPT-2. Below are code snippets showing the data loading and preprocessing steps.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pytorch&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lightning&lt;/span&gt; &lt;span class="n"&gt;lightning&lt;/span&gt; &lt;span class="n"&gt;accelerate&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextDataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataCollatorForLanguageModeling&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GPT2Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;Data Preprocess&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleaning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s\W&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\W,\s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[!@#$_]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;co&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\w*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# dataset link https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./Articles.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Articles.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cleaning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data_collator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mlm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataCollatorForLanguageModeling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mlm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_collator&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Download the pretrained GPT-2 model
&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Contrary to the examples referenced, this model doesn't use nn.Linear() layers but Hugging Face's Conv1D() layers, which are mathematically equivalent: Conv1D stores the same weight matrix, just transposed relative to nn.Linear. The concept remains the same, though the implementation differs; a quick numerical check below makes this concrete.&lt;/p&gt;

&lt;p&gt;Note that we freeze the base module's parameters so that only the LoRA weights get trained.&lt;/p&gt;
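
&lt;p&gt;Here is a minimal sketch of that equivalence; the Conv1D import path assumes a recent transformers release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

import torch
from transformers.pytorch_utils import Conv1D

# Hugging Face's Conv1D(nf, nx) computes x @ W + b, i.e. a Linear layer
# whose weight is stored transposed
conv = Conv1D(nf=4, nx=3)               # weight shape: (nx, nf) = (3, 4)
linear = torch.nn.Linear(3, 4)          # weight shape: (out, in) = (4, 3)
linear.weight.data = conv.weight.data.T
linear.bias.data = conv.bias.data

x = torch.randn(2, 3)
print(torch.allclose(conv(x), linear(x)))  # True

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that established, let's create a LoRA wrapper for Conv1D.&lt;/p&gt;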

&lt;h1&gt;
  
  
  Conv1D LoRA Wrapper
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A wrapper module that applies LORA to the weights of a convolutional layer.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initializes the LoRAConv1DWrapper instance.

        Parameters:
            module (nn.Module): The base module whose weights are to be adapted.
            rank (int): The rank for the low-rank matrices A and B. If set to 0, LoRA is effectively disabled.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank must be a non-negative integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;

        &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# self.print_trainable_parameters()
&lt;/span&gt;
            &lt;span class="c1"&gt;# freeze the base module's parameters, only focus on updating lora weights
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating LoRAConv1DWrapper with no rank adaptation: rank &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reset_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initializes or resets the parameters of the LoRA matrices A and B to their default values.
        This method typically mirrors the initialization logic of the base module.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# initialize A matrix
&lt;/span&gt;            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kaiming_uniform_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# initialize B matrix to 0
&lt;/span&gt;            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Prints the number of trainable parameters in the base module and the additional parameters added by LoRA.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;base_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;lora_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trainable parameters in base module: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trainable parameters in LoRA (base module frozen): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lora_params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Performs a forward pass through the LoRAConv1DWrapper, applying low-rank adaptations to the base module&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s weights.

        Parameters:
            x (torch.Tensor): The input tensor to the module.

        Returns:
            torch.Tensor: The output of the module after applying the low-rank adapted forward pass.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Compute the base module's forward pass with adapted weights
&lt;/span&gt;            &lt;span class="c1"&gt;# print(self.W_A.shape)
&lt;/span&gt;            &lt;span class="c1"&gt;# print(self.W_B.shape)
&lt;/span&gt;            &lt;span class="n"&gt;adapted_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapted_weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Perform a standard forward pass using the base module's original weights and bias
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
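
&lt;p&gt;Before rewriting the whole model, we can try the wrapper on a single layer (a minimal sketch, reusing the &lt;code&gt;model&lt;/code&gt; loaded above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# wrap one attention projection and inspect the parameter counts
layer = model.transformer.h[0].attn.c_attn
wrapped = LoRAConv1DWrapper(layer, rank=2)
wrapped.print_trainable_parameters()
# c_attn holds 768*2304 + 2304 = 1,771,776 parameters, all frozen;
# LoRA adds only 2*2304 + 768*2 = 6,144 trainable ones

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To LoRA-fy the whole model, we replace every Conv1D sub-layer in each GPT-2 block:&lt;/p&gt;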
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# Set LoRA hyperparameters
&lt;/span&gt;  &lt;span class="n"&gt;lora_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
  &lt;span class="n"&gt;lora_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
  &lt;span class="n"&gt;lora_dropout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
  &lt;span class="c1"&gt;# flag to apply LoRA to Transformer layers
&lt;/span&gt;  &lt;span class="n"&gt;lora_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
  &lt;span class="c1"&gt;# flag to apply LoRA to MLP layers
&lt;/span&gt;  &lt;span class="n"&gt;lora_mlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

  &lt;span class="c1"&gt;# Apply LoRA modifications to the GPT2 layers
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lora_attn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_attn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lora_mlp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_fc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (c_proj): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (c_proj): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Trains a GPT-2 model using the Hugging Face Transformers library.

    This function initializes a model, tokenizer, and data collator. It sets up training arguments and
    creates a Trainer instance to manage the training process.

    Parameters:
    - train_file_path (str): The file path to the training dataset.
    - model_name (str): The name of the pre-trained GPT-2 model to use. This can be a model identifier
        from Hugging Face&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s model hub (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2-medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or the path to a local directory containing model files.
    - output_dir (str): The directory where the model checkpoints will be saved during training.
    - overwrite_output_dir (bool): Set to True to overwrite the output directory, or False to continue training from the last checkpoint.
    - per_device_train_batch_size (int): Batch size per device during training.
    - num_train_epochs (int): Total number of training epochs.
    - save_steps (int): The number of training steps to perform before saving a checkpoint.

    Returns:
    None

    Saves the tokenizer and model to the specified output directory. Trains the model using the
    given dataset, saving the final model configuration to the output directory after training.

    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
  &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2Tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;data_collator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_data_collator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# # comment this to skip LoRA
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As the printed architecture above shows, every Conv1D has successfully been replaced by a LoRAConv1DWrapper layer.&lt;/p&gt;
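
&lt;p&gt;As a quick sanity check (a minimal sketch, reusing the &lt;code&gt;model&lt;/code&gt; object from above), we can compare trainable and total parameter counts; besides the LoRA factors, only the unwrapped layers (embeddings, layer norms, lm_head) remain trainable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# count trainable vs. total parameters after the LoRA rewrite
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;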


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="c1"&gt;# some constants&lt;br&gt;
&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Articles.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;num_train_epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;save_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Training without LoRA, 5 Epochs
&lt;/h2&gt;

&lt;p&gt;The initial loss is lower than with LoRA because all of the weights are being updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtg2nyb1qylcl24qbah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtg2nyb1qylcl24qbah.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Training with LoRA, 5 Epochs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpbe0jpgpbeedbjk2pmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpbe0jpgpbeedbjk2pmn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's train for more epochs with LoRA; this might help reduce the loss further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training with LoRA, 12 Epochs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff13hvdnbezvwfu7ch1i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff13hvdnbezvwfu7ch1i3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training without LoRA starts at a lower loss than training with LoRA, probably because all of the weights are updated. LoRA is gentler on GPU memory, since far fewer parameters (and their optimizer states) are trained, but it may need more epochs to reach the same loss.&lt;/p&gt;
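
&lt;p&gt;The memory argument is easy to quantify with a rough back-of-the-envelope sketch, using the rank-2 shapes from update_model_layers above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# LoRA parameters added per GPT-2 small block at rank 2
rank = 2
# (in_features, out_features) for c_attn, attn c_proj, c_fc, mlp c_proj
shapes = [(768, 2304), (768, 768), (768, 3072), (3072, 768)]
per_block = sum(rank * out_f + in_f * rank for in_f, out_f in shapes)
print(per_block * 12)  # 294,912 trainable parameters vs. ~124M in the base model

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;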

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;[1] &lt;a href="https://www.linkedin.com/pulse/more-efficient-finetuning-implementing-lora-from-scratch-george-davis/" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/more-efficient-finetuning-implementing-lora-from-scratch-george-davis/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://lightning.ai/lightning-ai/studios/code-lora-from-scratch" rel="noopener noreferrer"&gt;https://lightning.ai/lightning-ai/studios/code-lora-from-scratch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1" rel="noopener noreferrer"&gt;https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] LoRA explained (and a bit about precision and quantization)&lt;br&gt;
 &lt;a href="https://youtu.be/t509sv5MT0w" rel="noopener noreferrer"&gt;https://youtu.be/t509sv5MT0w&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lora</category>
      <category>llm</category>
      <category>language</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building Your Own Personal Assistant With ChatGPT</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Sun, 19 Feb 2023 03:28:28 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/building-your-own-personal-assistant-with-chatgpt-98i</link>
      <guid>https://forem.com/rushichaudhari/building-your-own-personal-assistant-with-chatgpt-98i</guid>
      <description>&lt;p&gt;If you've ever used Siri, Alexa, or Google Assistant, you know how powerful and convenient having a personal assistant can be. What if you could build your own personal assistant, tailored to your specific needs? Thanks to the power of OpenAI's ChatGPT language model and the open-source community, you can!&lt;/p&gt;

&lt;p&gt;In this post, we'll explore a GitHub project called "ChatGPT-chan," which provides a collection of tools to help you build your own personal assistant. Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ChatGPT-chan?
&lt;/h2&gt;

&lt;p&gt;ChatGPT-chan is an open-source project that provides tools for building conversational interfaces, automating tasks, and more. It leverages the power of OpenAI's ChatGPT language model to understand natural language inputs and provide intelligent responses.&lt;/p&gt;

&lt;p&gt;The project consists of three main components: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emotion Classifier: A machine learning model that can detect the emotion in a given text input. &lt;/li&gt;
&lt;li&gt;Stable Diffusion Model: A machine learning model that generates realistic images based on text prompts. &lt;/li&gt;
&lt;li&gt;ChatGPT Wrapper: A Python library that provides a simple API for integrating the above models and creating conversational interfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project also includes a demo that showcases the power of the ChatGPT wrapper. Check out the demo video &lt;a href="https://odysee.com/@rushi:2/chatgptchandemo2:4" rel="noopener noreferrer"&gt;here&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rushic24/chatgpt-chan" rel="noopener noreferrer"&gt;github: chatgpt-chan&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set Up ChatGPT-chan
&lt;/h2&gt;

&lt;p&gt;To get ChatGPT-chan up and running, you'll find comprehensive guidance in the project's GitHub repository. Here’s a simplified setup process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Emotion Classifier and Stable Diffusion Model either on a server or locally.&lt;/li&gt;
&lt;li&gt;Clone the ChatGPT-chan repository and install all necessary dependencies.&lt;/li&gt;
&lt;li&gt;Modify the configuration file to connect to the Emotion Classifier and Stable Diffusion Model servers.&lt;/li&gt;
&lt;li&gt;Launch the ChatGPT Wrapper.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This streamlined overview should help you initiate the setup quickly, with detailed steps available in the GitHub readme.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build Your Own Personal Assistant?
&lt;/h2&gt;

&lt;p&gt;Building your own personal assistant can be a fun and rewarding project, but it also has practical applications. For example, you could use it to automate tasks in your daily life, such as setting reminders or sending messages. You could also integrate it into your own projects to provide a natural language interface for your users.&lt;/p&gt;

&lt;p&gt;Another benefit of building your own personal assistant is that you have full control over the data it collects and how it's used. With commercial personal assistants, you may not know what data is being collected and how it's being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you're interested in building your own personal assistant, ChatGPT-chan is a great place to start. It provides powerful tools for understanding natural language inputs and generating intelligent responses, all using open-source software.&lt;/p&gt;

&lt;p&gt;While the setup process can be a bit involved, the end result is a powerful tool that you can use to automate tasks and provide natural language interfaces for your projects. Give it a try and see what you can build!&lt;/p&gt;

</description>
      <category>puzzlegames</category>
      <category>firstpost</category>
      <category>showdev</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
