<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Rushi Chaudhari</title>
    <description>The latest articles on Forem by Rushi Chaudhari (@rushichaudhari).</description>
    <link>https://forem.com/rushichaudhari</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F341175%2F8d25a7e2-275f-45e9-83e7-76da8f15c718.png</url>
      <title>Forem: Rushi Chaudhari</title>
      <link>https://forem.com/rushichaudhari</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rushichaudhari"/>
    <language>en</language>
    <item>
      <title>Training LLMs on Mixed GPUs: My Experiments and What I Learnt</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Fri, 28 Nov 2025 16:59:38 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/training-llms-on-mixed-gpus-my-experiments-and-what-i-learnt-1k7n</link>
      <guid>https://forem.com/rushichaudhari/training-llms-on-mixed-gpus-my-experiments-and-what-i-learnt-1k7n</guid>
      <description>&lt;p&gt;Over the last few months I have become very interested in large language models. At the same time, the GPU market is changing: Nvidia is still the leader, but AMD, Intel, and even Chinese companies are shipping cheaper GPUs. The main challenge is that CUDA remains the dominant software stack and Nvidia's drivers are not open source, so using non‑Nvidia GPUs is still not smooth.&lt;/p&gt;

&lt;p&gt;As someone who runs a homelab, I wanted a setup where I could use different GPUs together. But even mixing two Nvidia GPUs from different generations is hard: upgrading from an RTX 3090 to an RTX 5090 may require a different CUDA version, a different Python version, and a different PyTorch version. New architectures like Blackwell also take time to land in mainstream frameworks.&lt;/p&gt;

&lt;p&gt;So many people end up buying a second GPU of the same model just to do multi‑GPU training.&lt;/p&gt;

&lt;p&gt;I wanted to avoid that and see if mixed‑GPU training is possible.&lt;/p&gt;

&lt;h2&gt;System Architecture Diagram&lt;/h2&gt;

&lt;p&gt;The system auto-generates a topology diagram after you configure and run the coordinator once. The generated file is saved as &lt;code&gt;architecture.png&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42p1m77gei2rfwkjmoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq42p1m77gei2rfwkjmoa.png" alt="architecture" width="800" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;What Current ML Systems Support&lt;/h2&gt;

&lt;p&gt;I looked into several existing systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DeepSpeed&lt;/li&gt;
&lt;li&gt;Megatron‑LM&lt;/li&gt;
&lt;li&gt;PyTorch Distributed + TorchGpipe&lt;/li&gt;
&lt;li&gt;vLLM&lt;/li&gt;
&lt;li&gt;Colossal‑AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these are powerful, but none properly support mixing CUDA and ROCm GPUs in one training job.&lt;/p&gt;

&lt;p&gt;There is a project called UCC (Unified Collective Communication) that tries to help, but its PyTorch integration (torch‑ucc) is experimental and has been archived:&lt;br&gt;
&lt;a href="https://github.com/openucx/torch-ucc" rel="noopener noreferrer"&gt;https://github.com/openucx/torch-ucc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UCX developers have also said that CUDA and ROCm support exists “in theory”, but mixed setups were never fully tested:&lt;br&gt;
&lt;a href="https://github.com/openucx/ucx/discussions/9985" rel="noopener noreferrer"&gt;https://github.com/openucx/ucx/discussions/9985&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So true heterogeneous GPU training is still not ready in major frameworks.&lt;/p&gt;

&lt;h2&gt;Papers Trying to Solve This&lt;/h2&gt;

&lt;p&gt;I found some research papers that aim to solve heterogeneous GPU training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HetHub
&lt;a href="https://arxiv.org/pdf/2405.16256" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2405.16256&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HyperPipe
&lt;a href="https://ieeexplore.ieee.org/document/11033309" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/11033309&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cephalo
&lt;a href="https://dl.acm.org/doi/10.1145/3721145.3730418" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3721145.3730418&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HeterMoE
&lt;a href="https://arxiv.org/pdf/2504.03871" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2504.03871&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Zorse
&lt;a href="https://arxiv.org/abs/2507.10392" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2507.10392&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These papers show that the idea is possible, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;None of these are open source&lt;/li&gt;
&lt;li&gt;Real‑world implementations are still missing&lt;/li&gt;
&lt;li&gt;Homelab users cannot use these systems directly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of all these limitations, I decided to build my own simple framework.&lt;/p&gt;

&lt;h2&gt;How my HeteroGPU framework enables mixed‑GPU pipeline training in homelabs&lt;/h2&gt;

&lt;p&gt;My goal was very simple:&lt;/p&gt;

&lt;p&gt;I wanted to run LLM training across different GPUs in my homelab, even if they belong to different generations or vendors, without depending on complicated distributed frameworks.&lt;/p&gt;

&lt;p&gt;My HeteroGPU framework does this by providing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Layer‑based pipeline parallelism&lt;br&gt;
The model is split by layers so it can run across GPUs with different VRAM sizes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simple coordinator–worker design&lt;br&gt;
The main machine holds the first part of the model and remote machines run the later layers. They communicate through a lightweight socket interface over 10Gb Ethernet (Thunderbolt is not implemented yet); see the sketch after this list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for mixed GPU speeds&lt;br&gt;
A faster GPU can take more layers; a slower GPU takes fewer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Small and hackable codebase&lt;br&gt;
Ideal for homelab experimentation, unlike large frameworks like DeepSpeed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Profiler inspired by Cephalo&lt;br&gt;
Helps decide how to split layers between GPUs based on compute speed, memory capacity, and communication delay.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Works even when GPUs require different drivers or CUDA versions&lt;br&gt;
Because each machine only loads its own shard locally and communicates via raw tensors over the network, you do not need unified CUDA versions.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
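
&lt;p&gt;To make the coordinator–worker exchange concrete, here is a minimal sketch of the kind of framing such a socket interface needs. The length-prefix-plus-pickle format and the function names are my illustration, not the actual HeteroShard wire protocol:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative framing (assumed, not HeteroShard's actual protocol):
# each message is a 4-byte length prefix followed by a pickled CPU tensor.
import pickle
import socket
import struct

import torch

def send_tensor(sock, tensor):
    # Serialize on CPU so the bytes are device-agnostic; the receiver
    # moves them onto its own GPU (CUDA or ROCm) after unpickling.
    payload = pickle.dumps(tensor.detach().cpu())
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    buf = b""
    while len(buf) &lt; n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_tensor(sock, device):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length)).to(device)

# Coordinator side, e.g.:
#   sock = socket.create_connection(("192.168.1.166", 9999))
#   send_tensor(sock, hidden_states)
#   grads = recv_tensor(sock, "cuda")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Shipping tensors through CPU memory is what keeps the link vendor-neutral: neither side needs to know whether the other GPU speaks CUDA or ROCm.&lt;/p&gt;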

&lt;p&gt;This makes heterogeneous pipeline training practical for home users who may have a strong Nvidia GPU as the main device, an older GPU on another machine, or even an integrated GPU like Strix Halo. With this design, training becomes possible even when a single GPU cannot fit the model.&lt;/p&gt;

&lt;h2&gt;Quick Explanation of Parallelism&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data Parallelism: Copy the whole model to each GPU and split the batch.&lt;/li&gt;
&lt;li&gt;Tensor / Model Parallelism: Split each layer across GPUs. Very communication heavy.&lt;/li&gt;
&lt;li&gt;Pipeline Parallelism: Split the model layer‑wise. GPU 1 runs early layers, GPU 2 runs later layers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pipeline parallelism is the easiest to apply to mixed GPUs. The main drawback is pipeline bubbles: one GPU often waits while the other works. But it still allows training when a model cannot fit into one GPU.&lt;/p&gt;
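
&lt;p&gt;As a toy, single-process illustration of the pipeline idea (assuming two visible devices; the even split is a placeholder that a profiler would normally choose):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy pipeline split: early layers on one device, later layers on
# another, with activations handed across between the stages.
import torch
import torch.nn as nn

layers = [nn.Linear(512, 512) for _ in range(8)]
split = 4  # in practice, chosen from GPU speed and VRAM profiles
stage0 = nn.Sequential(*layers[:split]).to("cuda:0")
stage1 = nn.Sequential(*layers[split:]).to("cuda:1")

x = torch.randn(2, 512, device="cuda:0")
hidden = stage0(x)                  # early layers on GPU 0
out = stage1(hidden.to("cuda:1"))   # later layers on GPU 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;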

&lt;h2&gt;My Experiments With LLaMA Finetuning&lt;/h2&gt;

&lt;p&gt;I tested the same training script on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;RTX 5090 single GPU&lt;/li&gt;
&lt;li&gt;AMD Strix Halo single GPU&lt;/li&gt;
&lt;li&gt;Two‑machine pipeline setup&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The results below show how mixed GPU training behaves in practice.&lt;/p&gt;

&lt;h2&gt;RTX 5090 (Single GPU)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;» python examples/alpaca_example_singlemachine.py
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695
Epoch 0 | Step 10 | Loss 2.4383 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8139 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.4709 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.2903 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2693 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2671 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.

Sample generation:
 &amp;lt;s&amp;gt;You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a haiku about GPUs.
&amp;lt;|assistant|&amp;gt;
In the lab, the GPU
Is the heart of the machine,
Running calculations.
&amp;lt;/s&amp;gt;

Total training time: 289.11 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training time: 289 seconds&lt;br&gt;
Loss dropped smoothly from 2.43 to 1.26. &lt;br&gt;
Fast and stable.&lt;/p&gt;

&lt;h2&gt;Strix Halo (Single GPU)&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python examples/alpaca_example_singlemachine.py 
Using device: cuda
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695

Epoch 0 | Step 10 | Loss 2.4027 | LR 0.000020
Epoch 0 | Step 20 | Loss 1.8115 | LR 0.000040
Epoch 0 | Step 30 | Loss 1.2460 | LR 0.000060
Epoch 0 | Step 40 | Loss 1.4227 | LR 0.000080
Epoch 0 | Step 50 | Loss 1.2628 | LR 0.000100
Epoch 0 | Step 60 | Loss 1.2507 | LR 0.000120
Saved LoRA adapters to: ./lora_unsloth_sft/lora
Training complete.

Sample generation:
 &amp;lt;s&amp;gt;You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a haiku about GPUs.
&amp;lt;|assistant|&amp;gt;
A GPU, a powerful tool
For processing data and computing
A helpful aid for many a task.
&amp;lt;/s&amp;gt;

Total training time: 3242.91 seconds
(.venv) [alpha@toolbx HeteroShard]$ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training time: 3243 seconds&lt;br&gt;
Loss also went down correctly, but the run was extremely slow: about 11 times slower than the 5090. This shows the large performance gap between GPU types.&lt;/p&gt;

&lt;h2&gt;Distributed Pipeline Training (Two GPUs)&lt;/h2&gt;

&lt;p&gt;Full coordinator and worker logs:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;» python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: doraemon-arch (192.168.1.153)
✓ Role: COORDINATOR

======================================================================
COORDINATOR MODE - LLAMA 8B 4-BIT TRAINING
======================================================================

Device: cuda
Worker: worker1 (192.168.1.166:9999)
Split: Layers 0-15 (local) | 16-31 (remote)

Connecting to worker...
✓ Connected

Loading tokenizer...
Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
trainable params: 13,631,488 || all params: 8,043,892,736 || trainable%: 0.1695

Creating local shard...
✓ Local shard ready (Embedding + Layers 0-15)

Loading dataset...
✓ Dataset: 100 examples

======================================================================
TRAINING
======================================================================
Steps: 25 | Batch: 1 | Accum: 4

/mnt/sdc3/Documents/hetrogpu/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1044: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
Epoch 0 | Step 1/25 | Loss 2.3243 | LR 0.000020
Epoch 0 | Step 2/25 | Loss 2.4754 | LR 0.000040
Epoch 0 | Step 3/25 | Loss 2.4923 | LR 0.000060
Epoch 0 | Step 4/25 | Loss 2.7389 | LR 0.000080
Epoch 0 | Step 5/25 | Loss 2.1877 | LR 0.000100
Epoch 0 | Step 6/25 | Loss 2.0371 | LR 0.000120
Epoch 0 | Step 7/25 | Loss 2.3928 | LR 0.000140
Epoch 0 | Step 8/25 | Loss 1.5122 | LR 0.000160
Epoch 0 | Step 9/25 | Loss 1.9724 | LR 0.000180
Epoch 0 | Step 10/25 | Loss 2.2792 | LR 0.000200
Epoch 0 | Step 11/25 | Loss 1.9573 | LR 0.000198
Epoch 0 | Step 12/25 | Loss 1.4388 | LR 0.000192
Epoch 0 | Step 13/25 | Loss 1.8510 | LR 0.000183
Epoch 0 | Step 14/25 | Loss 1.6279 | LR 0.000170
Epoch 0 | Step 15/25 | Loss 1.4549 | LR 0.000155
Epoch 0 | Step 16/25 | Loss 1.2129 | LR 0.000138
Epoch 0 | Step 17/25 | Loss 1.3626 | LR 0.000119
Epoch 0 | Step 18/25 | Loss 1.2285 | LR 0.000101
Epoch 0 | Step 19/25 | Loss 1.4700 | LR 0.000082
Epoch 0 | Step 20/25 | Loss 1.3244 | LR 0.000065
Epoch 0 | Step 21/25 | Loss 1.4875 | LR 0.000050
Epoch 0 | Step 22/25 | Loss 1.4656 | LR 0.000037
Epoch 0 | Step 23/25 | Loss 1.0804 | LR 0.000028
Epoch 0 | Step 24/25 | Loss 1.5531 | LR 0.000022
Epoch 0 | Step 25/25 | Loss 1.0947 | LR 0.000020

✓ Training complete!
Total training time: 184.59 seconds
Saved LoRA adapters to: ./lora_unsloth_sft_distributed/lora

Sample generation:
 You are a helpful assistant.
&amp;lt;|user|&amp;gt;
Write a short haiku about distributed training.
&amp;lt;|assistant|&amp;gt;
Distributed training,
Like a symphony,
All the parts work together.






--- 



$ python examples/demo_llama8b4bit_distributed.py --config hetero_config.json
📍 This machine: toolbx (192.168.1.166)
✓ Role: WORKER 1

======================================================================
WORKER MODE - LLAMA 8B 4-BIT (LAYERS 16-31)
======================================================================

Device: cuda
Port: 9999

Loading model...
`torch_dtype` is deprecated! Use `dtype` instead!
g++ (GCC) 15.2.1 20250813
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Creating remote shard...
✓ Remote shard ready (Layers 16-31)

Listening on 0.0.0.0:9999...
✓ Connected to coordinator at ('192.168.1.153', 46384)

[Step 0] Waiting for data...
/torch-therock/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:1035: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. Starting in PyTorch 2.9, calling checkpoint without use_reentrant will raise an exception. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
[Step 0] Loss: 1.6613
[Step 0] ✓ Complete

[Step 1] Waiting for data...
[Step 1] Loss: 2.5880
[Step 1] ✓ Complete

[Step 2] Waiting for data...
[Step 2] Loss: 3.1850
[Step 2] ✓ Complete

[Step 3] Waiting for data...
[Step 3] Loss: 1.8631
[Step 3] ✓ Complete

[Step 4] Waiting for data...
[Step 4] Loss: 2.3016
[Step 4] ✓ Complete

[Step 5] Waiting for data...
[Step 5] Loss: 2.4796
[Step 5] ✓ Complete

[Step 6] Waiting for data...
[Step 6] Loss: 2.7196
[Step 6] ✓ Complete

[Step 7] Waiting for data...
[Step 7] Loss: 2.4008
[Step 7] ✓ Complete

[Step 8] Waiting for data...
[Step 8] Loss: 1.9301
[Step 8] ✓ Complete

[Step 9] Waiting for data...
[Step 9] Loss: 1.9098
[Step 9] ✓ Complete

[Step 10] Waiting for data...
[Step 10] Loss: 3.0177
[Step 10] ✓ Complete

[Step 11] Waiting for data...
[Step 11] Loss: 3.1114
[Step 11] ✓ Complete

[Step 12] Waiting for data...
[Step 12] Loss: 1.7507
[Step 12] ✓ Complete

[Step 13] Waiting for data...
[Step 13] Loss: 3.0108
[Step 13] ✓ Complete

[Step 14] Waiting for data...
[Step 14] Loss: 2.5046
[Step 14] ✓ Complete

[Step 15] Waiting for data...
[Step 15] Loss: 3.6894
[Step 15] ✓ Complete

[Step 16] Waiting for data...
[Step 16] Loss: 1.8336
[Step 16] ✓ Complete

[Step 17] Waiting for data...
[Step 17] Loss: 1.5026
[Step 17] ✓ Complete

[Step 18] Waiting for data...
[Step 18] Loss: 3.4676
[Step 18] ✓ Complete

[Step 19] Waiting for data...
[Step 19] Loss: 1.9469
[Step 19] ✓ Complete

[Step 20] Waiting for data...
[Step 20] Loss: 2.0781
[Step 20] ✓ Complete

[Step 21] Waiting for data...
[Step 21] Loss: 1.7651
[Step 21] ✓ Complete

[Step 22] Waiting for data...
[Step 22] Loss: 2.0139
[Step 22] ✓ Complete

[Step 23] Waiting for data...
[Step 23] Loss: 2.2912
[Step 23] ✓ Complete

[Step 24] Waiting for data...
[Step 24] Loss: 2.6897
[Step 24] ✓ Complete

[Step 25] Waiting for data...
[Step 25] Loss: 2.8378
[Step 25] ✓ Complete

[Step 26] Waiting for data...
[Step 26] Loss: 1.9898
[Step 26] ✓ Complete

[Step 27] Waiting for data...
[Step 27] Loss: 2.0538
[Step 27] ✓ Complete

[Step 28] Waiting for data...
[Step 28] Loss: 1.6081
[Step 28] ✓ Complete

[Step 29] Waiting for data...
[Step 29] Loss: 1.4623
[Step 29] ✓ Complete

[Step 30] Waiting for data...
[Step 30] Loss: 1.2606
[Step 30] ✓ Complete

[Step 31] Waiting for data...
[Step 31] Loss: 1.7178
[Step 31] ✓ Complete

[Step 32] Waiting for data...
[Step 32] Loss: 1.9203
[Step 32] ✓ Complete

[Step 33] Waiting for data...
[Step 33] Loss: 1.6814
[Step 33] ✓ Complete

[Step 34] Waiting for data...
[Step 34] Loss: 2.5819
[Step 34] ✓ Complete

[Step 35] Waiting for data...
[Step 35] Loss: 1.7061
[Step 35] ✓ Complete

[Step 36] Waiting for data...
[Step 36] Loss: 2.3311
[Step 36] ✓ Complete

[Step 37] Waiting for data...
[Step 37] Loss: 2.2990
[Step 37] ✓ Complete

[Step 38] Waiting for data...
[Step 38] Loss: 1.8855
[Step 38] ✓ Complete

[Step 39] Waiting for data...
[Step 39] Loss: 2.6010
[Step 39] ✓ Complete

[Step 40] Waiting for data...
[Step 40] Loss: 2.3807
[Step 40] ✓ Complete

[Step 41] Waiting for data...
[Step 41] Loss: 2.0204
[Step 41] ✓ Complete

[Step 42] Waiting for data...
[Step 42] Loss: 1.7209
[Step 42] ✓ Complete

[Step 43] Waiting for data...
[Step 43] Loss: 1.7073
[Step 43] ✓ Complete

[Step 44] Waiting for data...
[Step 44] Loss: 1.1900
[Step 44] ✓ Complete

[Step 45] Waiting for data...
[Step 45] Loss: 1.8439
[Step 45] ✓ Complete

[Step 46] Waiting for data...
[Step 46] Loss: 1.1291
[Step 46] ✓ Complete

[Step 47] Waiting for data...
[Step 47] Loss: 1.5923
[Step 47] ✓ Complete

[Step 48] Waiting for data...
[Step 48] Loss: 1.9110
[Step 48] ✓ Complete

[Step 49] Waiting for data...
[Step 49] Loss: 1.1971
[Step 49] ✓ Complete

[Step 50] Waiting for data...
[Step 50] Loss: 3.0576
[Step 50] ✓ Complete

[Step 51] Waiting for data...
[Step 51] Loss: 1.2383
[Step 51] ✓ Complete

[Step 52] Waiting for data...
[Step 52] Loss: 1.6820
[Step 52] ✓ Complete

[Step 53] Waiting for data...
[Step 53] Loss: 1.7755
[Step 53] ✓ Complete

[Step 54] Waiting for data...
[Step 54] Loss: 1.2515
[Step 54] ✓ Complete

[Step 55] Waiting for data...
[Step 55] Loss: 1.8027
[Step 55] ✓ Complete

[Step 56] Waiting for data...
[Step 56] Loss: 1.2692
[Step 56] ✓ Complete

[Step 57] Waiting for data...
[Step 57] Loss: 1.6293
[Step 57] ✓ Complete

[Step 58] Waiting for data...
[Step 58] Loss: 1.1256
[Step 58] ✓ Complete

[Step 59] Waiting for data...
[Step 59] Loss: 1.7956
[Step 59] ✓ Complete

[Step 60] Waiting for data...
[Step 60] Loss: 1.3114
[Step 60] ✓ Complete

[Step 61] Waiting for data...
[Step 61] Loss: 1.4944
[Step 61] ✓ Complete

[Step 62] Waiting for data...
[Step 62] Loss: 0.9233
[Step 62] ✓ Complete

[Step 63] Waiting for data...
[Step 63] Loss: 1.1224
[Step 63] ✓ Complete

[Step 64] Waiting for data...
[Step 64] Loss: 1.4849
[Step 64] ✓ Complete

[Step 65] Waiting for data...
[Step 65] Loss: 1.0226
[Step 65] ✓ Complete

[Step 66] Waiting for data...
[Step 66] Loss: 1.3064
[Step 66] ✓ Complete

[Step 67] Waiting for data...
[Step 67] Loss: 1.6367
[Step 67] ✓ Complete

[Step 68] Waiting for data...
[Step 68] Loss: 1.6595
[Step 68] ✓ Complete

[Step 69] Waiting for data...
[Step 69] Loss: 1.3235
[Step 69] ✓ Complete

[Step 70] Waiting for data...
[Step 70] Loss: 0.8673
[Step 70] ✓ Complete

[Step 71] Waiting for data...
[Step 71] Loss: 1.0639
[Step 71] ✓ Complete

[Step 72] Waiting for data...
[Step 72] Loss: 1.6803
[Step 72] ✓ Complete

[Step 73] Waiting for data...
[Step 73] Loss: 1.5877
[Step 73] ✓ Complete

[Step 74] Waiting for data...
[Step 74] Loss: 1.3728
[Step 74] ✓ Complete

[Step 75] Waiting for data...
[Step 75] Loss: 1.2393
[Step 75] ✓ Complete

[Step 76] Waiting for data...
[Step 76] Loss: 1.4007
[Step 76] ✓ Complete

[Step 77] Waiting for data...
[Step 77] Loss: 0.9818
[Step 77] ✓ Complete

[Step 78] Waiting for data...
[Step 78] Loss: 1.3658
[Step 78] ✓ Complete

[Step 79] Waiting for data...
[Step 79] Loss: 1.5493
[Step 79] ✓ Complete

[Step 80] Waiting for data...
[Step 80] Loss: 1.3884
[Step 80] ✓ Complete

[Step 81] Waiting for data...
[Step 81] Loss: 1.3920
[Step 81] ✓ Complete

[Step 82] Waiting for data...
[Step 82] Loss: 1.9356
[Step 82] ✓ Complete

[Step 83] Waiting for data...
[Step 83] Loss: 1.2340
[Step 83] ✓ Complete

[Step 84] Waiting for data...
[Step 84] Loss: 1.2280
[Step 84] ✓ Complete

[Step 85] Waiting for data...
[Step 85] Loss: 1.7844
[Step 85] ✓ Complete

[Step 86] Waiting for data...
[Step 86] Loss: 1.2704
[Step 86] ✓ Complete

[Step 87] Waiting for data...
[Step 87] Loss: 1.5795
[Step 87] ✓ Complete

[Step 88] Waiting for data...
[Step 88] Loss: 0.9333
[Step 88] ✓ Complete

[Step 89] Waiting for data...
[Step 89] Loss: 0.9236
[Step 89] ✓ Complete

[Step 90] Waiting for data...
[Step 90] Loss: 1.0831
[Step 90] ✓ Complete

[Step 91] Waiting for data...
[Step 91] Loss: 1.3817
[Step 91] ✓ Complete

[Step 92] Waiting for data...
[Step 92] Loss: 1.3752
[Step 92] ✓ Complete

[Step 93] Waiting for data...
[Step 93] Loss: 1.9094
[Step 93] ✓ Complete

[Step 94] Waiting for data...
[Step 94] Loss: 1.6458
[Step 94] ✓ Complete

[Step 95] Waiting for data...
[Step 95] Loss: 1.2820
[Step 95] ✓ Complete

[Step 96] Waiting for data...
[Step 96] Loss: 1.5715
[Step 96] ✓ Complete

[Step 97] Waiting for data...
[Step 97] Loss: 0.8391
[Step 97] ✓ Complete

[Step 98] Waiting for data...
[Step 98] Loss: 0.9126
[Step 98] ✓ Complete

[Step 99] Waiting for data...
[Step 99] Loss: 1.0555
[Step 99] ✓ Complete

[Step 100] Waiting for data...
Connection closed.
(.venv) [alpha@toolbx HeteroShard]$ 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Training time: 184 seconds&lt;/p&gt;

&lt;p&gt;The model was split as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layers 0–15 on the main machine&lt;/li&gt;
&lt;li&gt;Layers 16–31 on the worker machine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both GPUs handled their parts. The worker log cycles through “Waiting for data”, “Loss”, and “Complete”; it counts 100 micro-steps while the coordinator reports 25 optimizer steps, because gradients are accumulated over 4 micro-batches. The waiting lines are the pipeline stalls, which are expected. Still, the total time was faster than the single 5090.&lt;/p&gt;

&lt;h2&gt;What I Learnt From These Runs&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Mixed‑GPU pipeline training works in real life, not just in papers.&lt;/li&gt;
&lt;li&gt;Speed depends on the slowest GPU, so good splitting is important.&lt;/li&gt;
&lt;li&gt;Distributed training has waiting time and communication cost, but can still beat a single strong GPU.&lt;/li&gt;
&lt;li&gt;Consumer GPUs vary hugely in speed, which is why homelab users need flexible systems.&lt;/li&gt;
&lt;li&gt;A simple framework like HeteroGPU can achieve things that big frameworks do not support yet.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;My Profiler System&lt;/h2&gt;

&lt;p&gt;The profiler I added does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs tiny batches on each GPU&lt;/li&gt;
&lt;li&gt;Measures latency and memory usage&lt;/li&gt;
&lt;li&gt;Builds simple linear models to predict performance&lt;/li&gt;
&lt;li&gt;Measures communication cost&lt;/li&gt;
&lt;li&gt;Chooses the best pipeline split&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the idea in the Cephalo paper:&lt;br&gt;
&lt;a href="https://dl.acm.org/doi/10.1145/3721145.3730418" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3721145.3730418&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This allows the system to work even when one GPU is fast but has little VRAM and another is slow but has plenty.&lt;/p&gt;
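
&lt;p&gt;A stripped-down version of that measurement loop is sketched below: time a block on a few tiny batches, then fit latency as a linear function of batch size. The function and the least-squares fit are illustrative, not the repo's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative profiler sketch: time one block (any module that takes a
# hidden-state tensor) at a few batch sizes, then fit
# latency = a * batch_size + b for use in layer assignment.
import time

import torch

def profile_block(block, hidden_size, device, batch_sizes=(1, 2, 4)):
    block = block.to(device)
    samples = []
    for bs in batch_sizes:
        x = torch.randn(bs, 128, hidden_size, device=device)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            block(x)
        if device.startswith("cuda"):
            torch.cuda.synchronize()
        samples.append((bs, time.perf_counter() - start))
    # Least-squares fit of latency = a * batch_size + b.
    n = len(samples)
    sx = sum(b for b, _ in samples)
    sy = sum(t for _, t in samples)
    sxy = sum(b * t for b, t in samples)
    sxx = sum(b * b for b, _ in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;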

&lt;h2&gt;Next Steps&lt;/h2&gt;

&lt;p&gt;Now I plan to experiment with one of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HeterMoE: &lt;a href="https://arxiv.org/pdf/2504.03871" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2504.03871&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Zorse: &lt;a href="https://arxiv.org/abs/2507.10392" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2507.10392&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MoE (Mixture‑of‑Experts) models are naturally suited for heterogeneous hardware, so they may perform better in mixed GPU clusters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Github repo&lt;/strong&gt;: &lt;a href="https://github.com/0xrushi/HeteroShard" rel="noopener noreferrer"&gt;https://github.com/0xrushi/HeteroShard&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nvidia</category>
      <category>amd</category>
      <category>dgx</category>
    </item>
    <item>
      <title>Is Google Colab Pro Really Worth It?</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Wed, 01 May 2024 05:31:01 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/is-google-colab-pro-really-worth-it-5531</link>
      <guid>https://forem.com/rushichaudhari/is-google-colab-pro-really-worth-it-5531</guid>
      <description>&lt;p&gt;In late 2022, Google revamped its widely-used Colab platform, transitioning from a subscription-based system to a pay-as-you-go model under the new Colab Pro and Pro+ schemes. This change introduced "compute units," which serve as the new currency within the platform, where the consumption rate depends on the virtual machine's configuration and the use of specialized accelerators like TPUs or GPUs.&lt;/p&gt;

&lt;p&gt;Here's a breakdown of how the compute units are consumed based on different GPUs, assuming an allocation of 100 units (the arithmetic is sketched after the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;T4&lt;/strong&gt;: Consumes 1.96 units per hour, providing about 51 hours of use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V100&lt;/strong&gt;: Requires 5 units per hour, totaling about 20 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A100&lt;/strong&gt;: Demands 15 units per hour, which amounts to roughly 6.7 hours.&lt;/li&gt;
&lt;/ul&gt;
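
&lt;p&gt;The arithmetic is just the allocation divided by the hourly burn rate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hours of runtime from a 100-unit allocation at each GPU's burn rate
# (the per-hour rates listed above).
rates = {"T4": 1.96, "V100": 5.0, "A100": 15.0}
units = 100
for gpu, per_hour in rates.items():
    print(f"{gpu}: {units / per_hour:.1f} hours")
# T4: 51.0 hours | V100: 20.0 hours | A100: 6.7 hours
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;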

&lt;p&gt;It's important to note that the T4 GPU is available for free; however, its availability under the Colab Pro tier is not guaranteed, often necessitating the use of costlier alternatives.&lt;/p&gt;

&lt;p&gt;This shift has introduced a layer of complexity that many users find disappointing, especially when there are more straightforward options available on the market. For comparison, here's a quick overview of pricing and availability from various smaller cloud providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lambda Labs, Jarvislabs.ai, tensordock, genesis cloud, paperspace, Vast.ai, and FluidStack&lt;/strong&gt; offer a range of GPU options like NVIDIA A100 PCIe and V100 at varying price points and hourly rates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Competitive Analysis of Cloud Computing Providers&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43d4xmavsodv1b6t7nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43d4xmavsodv1b6t7nr.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When considering cost-effectiveness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the A100 GPU, &lt;strong&gt;Paperspace&lt;/strong&gt; offers the lowest price at $1.15 per hour.&lt;/li&gt;
&lt;li&gt;For the V100 GPU, &lt;strong&gt;Vast.ai&lt;/strong&gt; provides the most affordable rate at $0.16 per hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it's essential to highlight that major cloud services like AWS, Azure, and Google Cloud were excluded from this comparison due to their higher prices, despite offering better scalability and integration.&lt;/p&gt;

&lt;p&gt;Potential users should be aware that the availability of instances on smaller clouds can be unpredictable, making them more suitable for personal projects rather than enterprise solutions. Additionally, these platforms may not always have complete libraries installed (e.g., Hugging Face on Paperspace), which could extend setup times.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Before subscribing to Colab Pro, thoroughly explore and compare alternative cloud platforms that may offer better rates or features suited to your needs.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gpu</category>
      <category>cloud</category>
      <category>google</category>
    </item>
    <item>
      <title>Exploring Low-Rank Adaptation (LoRA) from scratch</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Thu, 25 Apr 2024 05:11:48 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/exploring-low-rank-adaptation-lora-from-scratch-2jc1</link>
      <guid>https://forem.com/rushichaudhari/exploring-low-rank-adaptation-lora-from-scratch-2jc1</guid>
      <description>&lt;p&gt;Notebook link: &lt;a href="https://github.com/0xrushi/deep-learning-notebooks/blob/main/GPU/Exploring%20Low-Rank%20Adaptation%20(LoRA)%20from%20scratch.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've been exploring LoRA and was seeking a straightforward implementation example. Most resources I've found either focus on training large models with PEFT and the loralib package, or show basic implementations on CNNs or ANNs, as outlined in sources like [2].&lt;/p&gt;

&lt;p&gt;I came across some examples using LoRA with BERT, DistilBERT, and others involving a Linear() layer. However, I'm specifically interested in applying it to GPT-2, which uses a Conv1D() layer instead of Linear().&lt;/p&gt;

&lt;p&gt;Deep learning models these days have significantly more layers. One major challenge with fine-tuning large models like GPT is their size: they often don't fit into the limited VRAM available. To address this, researchers at Microsoft developed the Low-Rank Adaptation (LoRA) technique, which leverages low-rank matrix decomposition. They showed that common pre-trained models can be effectively fine-tuned by training only a small number of additional low-rank parameters instead of modifying every original parameter. This approach not only reduces VRAM requirements but can be just as effective as fine-tuning the full set of parameters.&lt;/p&gt;

&lt;p&gt;LoRA approximates a layer's weight changes during training, ΔW, in a low-rank format.&lt;/p&gt;

&lt;p&gt;For instance, whereas in regular fine-tuning we compute the weight update of a weight matrix W as ΔW, in LoRA we approximate ΔW by the product of two smaller matrices, AB, as illustrated in the figure below. (If you are familiar with PCA or SVD, think of this as decomposing ΔW into A and B.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv09feiaqhaov34350hu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxv09feiaqhaov34350hu.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With LoRA, the transformation in a particular layer originally involved just 

&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;W⋅xW \cdot x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the weight matrix and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the input. This operation now includes an additional term, resulting in 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;Wx+(WAWB)⋅xWx + (W_A W_B) \cdot x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original Operation&lt;/strong&gt;: The operation 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WxWx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 involves 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, a large matrix typically with dimensions like 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;768×768768 \times 768&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;768&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;768&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 as seen in models like BERT or GPT-2. The computational complexity of this operation is 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(d2)O(d^2)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dd&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is the dimension of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (assuming square matrices for simplicity).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LoRA Operation&lt;/strong&gt;: In the LoRA approach, 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAW_A&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WBW_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 are smaller matrices with dimensions 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;d×rd \times r&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 and 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;r×dr \times d&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;×&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 respectively, where 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;rr&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is much smaller than 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;dd&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 (indicating low rank). The product 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAWBW_A W_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, therefore, has the same dimension as 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WW&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 but is composed of two smaller matrices. This configuration reduces the computational load significantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, the product 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WAWBW_A W_B&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
 is computed, which involves a complexity of 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;O(d2⋅r)O(d^2 \cdot r)&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;O&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;d&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;2&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;⋅&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;r&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
.&lt;/li&gt;
&lt;li&gt;Then, this product multiplies the input 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;xx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, resulting in 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;(WAWB)x(W_A W_B)x&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, with a computational complexity similar to the original operation 
&lt;span class="katex-element"&gt;
  &lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;WxWx&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;W&lt;/span&gt;&lt;span class="mord mathnormal"&gt;x&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/span&gt;
, but the initial reduction in complexity due to the lower rank matrices helps to manage overall computational demands effectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, consider a weight matrix W in a specific layer, sized at 2,000x10,000, totaling 20 million parameters. If we opt for a rank r=3, we would set up two new matrices: a 2,000x3 matrix A and a 3x10,000 matrix B. Together, matrices A and B contain just 6,000 + 30,000 = 36,000 parameters, over 555 times fewer than the 20 million parameters typically involved in standard fine-tuning with ΔW.&lt;/p&gt;
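
&lt;p&gt;To make this concrete, here is a minimal from-scratch LoRA wrapper around a frozen weight matrix, checking the parameter count for the 2,000x10,000 example above. This is an illustrative sketch, not the PEFT/loralib implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal from-scratch LoRA sketch: y = x @ W + (x @ A) @ B, with W
# frozen and only A (d x r) and B (r x k) trained. Illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d, k, r):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d, k), requires_grad=False)
        self.A = nn.Parameter(torch.randn(d, r) * 0.01)  # small init
        self.B = nn.Parameter(torch.zeros(r, k))  # so A @ B starts at zero

    def forward(self, x):
        return x @ self.W + (x @ self.A) @ self.B

layer = LoRALinear(d=2000, k=10000, r=3)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                  # 36000 = 2000*3 + 3*10000
print(2000 * 10000 // trainable)  # ~555x fewer than updating all of W
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;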

&lt;p&gt;We'll use the News Articles dataset from Kaggle to explore experiments with GPT-2. Below are code snippets showing the data loading and preprocessing steps.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;pytorch&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;lightning&lt;/span&gt; &lt;span class="n"&gt;lightning&lt;/span&gt; &lt;span class="n"&gt;accelerate&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextDataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DataCollatorForLanguageModeling&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GPT2Tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TrainingArguments&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;init&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;Data Preprocess&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleaning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s\W&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\W,\s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\d+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\s+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[!@#$_]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;co&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[\w*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# dataset link https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./Articles.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;text_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Articles.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="n"&gt;article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cleaning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
  &lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;article&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;text_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TextDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;block_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data_collator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mlm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_collator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DataCollatorForLanguageModeling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;mlm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mlm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_collator&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Download the pretrained GPT-2 model
&lt;/h1&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Contrary to the examples referenced, this model doesn't use nn.Linear() layers but Hugging Face's Conv1D() layers, which are mathematically equivalent: Conv1D stores the same weight matrix, just transposed relative to nn.Linear. The concept remains the same, though the implementation differs; a quick numerical check below makes this concrete.&lt;/p&gt;

&lt;p&gt;Note that we freeze the base module's parameters so that only the LoRA weights get trained.&lt;/p&gt;
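
&lt;p&gt;Here is a minimal sketch of that equivalence; the Conv1D import path assumes a recent transformers release:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

import torch
from transformers.pytorch_utils import Conv1D

# Hugging Face's Conv1D(nf, nx) computes x @ W + b, i.e. a Linear layer
# whose weight is stored transposed
conv = Conv1D(nf=4, nx=3)               # weight shape: (nx, nf) = (3, 4)
linear = torch.nn.Linear(3, 4)          # weight shape: (out, in) = (4, 3)
linear.weight.data = conv.weight.data.T
linear.bias.data = conv.bias.data

x = torch.randn(2, 3)
print(torch.allclose(conv(x), linear(x)))  # True

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With that established, let's create a LoRA wrapper for Conv1D.&lt;/p&gt;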

&lt;h1&gt;
  
  
  Conv1D LoRA Wrapper
&lt;/h1&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch.nn.functional&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A wrapper module that applies LORA to the weights of a convolutional layer.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initializes the LoRAConv1DWrapper instance.

        Parameters:
            module (nn.Module): The base module whose weights are to be adapted.
            rank (int): The rank for the low-rank matrices A and B. If set to 0, LoRA is effectively disabled.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank must be a non-negative integer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;

        &lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;in_features&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Parameter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;out_features&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
                &lt;span class="n"&gt;requires_grad&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# self.print_trainable_parameters()
&lt;/span&gt;
            &lt;span class="c1"&gt;# freeze the base module's parameters, only focus on updating lora weights
&lt;/span&gt;            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requires_grad&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating LoRAConv1DWrapper with no rank adaptation: rank &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_parameters&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reset_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Initializes or resets the parameters of the LoRA matrices A and B to their default values.
        This method typically mirrors the initialization logic of the base module.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# initialize A matrix
&lt;/span&gt;            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kaiming_uniform_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="c1"&gt;# initialize B matrix to 0
&lt;/span&gt;            &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;init&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;print_trainable_parameters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Prints the number of trainable parameters in the base module and the additional parameters added by LoRA.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;base_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;lora_params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;numel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trainable parameters in base module: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trainable parameters in LoRA (base module frozen): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;lora_params&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Performs a forward pass through the LoRAConv1DWrapper, applying low-rank adaptations to the base module&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s weights.

        Parameters:
            x (torch.Tensor): The input tensor to the module.

        Returns:
            torch.Tensor: The output of the module after applying the low-rank adapted forward pass.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lora_rank&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Compute the base module's forward pass with adapted weights
&lt;/span&gt;            &lt;span class="c1"&gt;# print(self.W_A.shape)
&lt;/span&gt;            &lt;span class="c1"&gt;# print(self.W_B.shape)
&lt;/span&gt;            &lt;span class="n"&gt;adapted_weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_B&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;W_A&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapted_weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Perform a standard forward pass using the base module's original weights and bias
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
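
&lt;p&gt;Before rewriting the whole model, we can try the wrapper on a single layer (a minimal sketch, reusing the &lt;code&gt;model&lt;/code&gt; loaded above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# wrap one attention projection and inspect the parameter counts
layer = model.transformer.h[0].attn.c_attn
wrapped = LoRAConv1DWrapper(layer, rank=2)
wrapped.print_trainable_parameters()
# c_attn holds 768*2304 + 2304 = 1,771,776 parameters, all frozen;
# LoRA adds only 2*2304 + 768*2 = 6,144 trainable ones

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To LoRA-fy the whole model, we replace every Conv1D sub-layer in each GPT-2 block:&lt;/p&gt;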
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# Set LoRA hyperparameters
&lt;/span&gt;  &lt;span class="n"&gt;lora_r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
  &lt;span class="n"&gt;lora_alpha&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
  &lt;span class="n"&gt;lora_dropout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;
  &lt;span class="c1"&gt;# flag to apply LoRA to Transformer layers
&lt;/span&gt;  &lt;span class="n"&gt;lora_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
  &lt;span class="c1"&gt;# flag to apply LoRA to MLP layers
&lt;/span&gt;  &lt;span class="n"&gt;lora_mlp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;

  &lt;span class="c1"&gt;# Apply LoRA modifications to the GPT2 layers
&lt;/span&gt;  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lora_attn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_attn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_attn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lora_mlp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
          &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_fc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_fc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoRAConv1DWrapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;c_proj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (c_proj): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (c_proj): LoRAConv1DWrapper(
            (base_module): Conv1D()
          )
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Trains a GPT-2 model using the Hugging Face Transformers library.

    This function initializes a model, tokenizer, and data collator. It sets up training arguments and
    creates a Trainer instance to manage the training process.

    Parameters:
    - train_file_path (str): The file path to the training dataset.
    - model_name (str): The name of the pre-trained GPT-2 model to use. This can be a model identifier
        from Hugging Face&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s model hub (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2-medium&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;) or the path to a local directory containing model files.
    - output_dir (str): The directory where the model checkpoints will be saved during training.
    - overwrite_output_dir (bool): Set to True to overwrite the output directory, or False to continue training from the last checkpoint.
    - per_device_train_batch_size (int): Batch size per device during training.
    - num_train_epochs (int): Total number of training epochs.
    - save_steps (int): The number of training steps to perform before saving a checkpoint.

    Returns:
    None

    Saves the tokenizer and model to the specified output directory. Trains the model using the
    given dataset, saving the final model configuration to the output directory after training.

    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
  &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2Tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;train_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;data_collator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_data_collator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPT2LMHeadModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# # comment this to skip LoRA
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update_model_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;training_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;TrainingArguments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Trainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;training_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data_collator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As the printed architecture above shows, every Conv1D has successfully been replaced by a LoRAConv1DWrapper layer.&lt;/p&gt;
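
&lt;p&gt;As a quick sanity check (a minimal sketch, reusing the &lt;code&gt;model&lt;/code&gt; object from above), we can compare trainable and total parameter counts; besides the LoRA factors, only the unwrapped layers (embeddings, layer norms, lm_head) remain trainable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# count trainable vs. total parameters after the LoRA rewrite
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;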


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="c1"&gt;# some constants&lt;br&gt;
&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Articles.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gpt2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;num_train_epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;br&gt;
&lt;span class="n"&gt;save_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;p&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;train_file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;overwrite_output_dir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;per_device_train_batch_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_train_epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;br&gt;
    &lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;save_steps&lt;/span&gt;&lt;br&gt;
&lt;span class="p"&gt;)&lt;/span&gt;&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Training without LoRA, 5 Epochs
&lt;/h2&gt;

&lt;p&gt;The initial loss is lower than with LoRA because all of the weights are being updated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtg2nyb1qylcl24qbah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrtg2nyb1qylcl24qbah.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Training with LoRA, 5 Epochs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpbe0jpgpbeedbjk2pmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpbe0jpgpbeedbjk2pmn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's train for more epochs with LoRA; this might help reduce the loss further.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training with LoRA, 12 Epochs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff13hvdnbezvwfu7ch1i3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff13hvdnbezvwfu7ch1i3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Training without LoRA starts at a lower loss than training with LoRA, probably because all of the weights are updated. LoRA is gentler on GPU memory, since far fewer parameters (and their optimizer states) are trained, but it may need more epochs to reach the same loss.&lt;/p&gt;
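
&lt;p&gt;The memory argument is easy to quantify with a rough back-of-the-envelope sketch, using the rank-2 shapes from update_model_layers above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

# LoRA parameters added per GPT-2 small block at rank 2
rank = 2
# (in_features, out_features) for c_attn, attn c_proj, c_fc, mlp c_proj
shapes = [(768, 2304), (768, 768), (768, 3072), (3072, 768)]
per_block = sum(rank * out_f + in_f * rank for in_f, out_f in shapes)
print(per_block * 12)  # 294,912 trainable parameters vs. ~124M in the base model

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;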

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;p&gt;[1] &lt;a href="https://www.linkedin.com/pulse/more-efficient-finetuning-implementing-lora-from-scratch-george-davis/" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/more-efficient-finetuning-implementing-lora-from-scratch-george-davis/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] &lt;a href="https://lightning.ai/lightning-ai/studios/code-lora-from-scratch" rel="noopener noreferrer"&gt;https://lightning.ai/lightning-ai/studios/code-lora-from-scratch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] &lt;a href="https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1" rel="noopener noreferrer"&gt;https://towardsdatascience.com/implementing-lora-from-scratch-20f838b046f1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[4] LoRA explained (and a bit about precision and quantization)&lt;br&gt;
 &lt;a href="https://youtu.be/t509sv5MT0w" rel="noopener noreferrer"&gt;https://youtu.be/t509sv5MT0w&lt;/a&gt;&lt;/p&gt;

</description>
      <category>lora</category>
      <category>llm</category>
      <category>language</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building Your Own Personal Assistant With ChatGPT</title>
      <dc:creator>Rushi Chaudhari</dc:creator>
      <pubDate>Sun, 19 Feb 2023 03:28:28 +0000</pubDate>
      <link>https://forem.com/rushichaudhari/building-your-own-personal-assistant-with-chatgpt-98i</link>
      <guid>https://forem.com/rushichaudhari/building-your-own-personal-assistant-with-chatgpt-98i</guid>
      <description>&lt;p&gt;If you've ever used Siri, Alexa, or Google Assistant, you know how powerful and convenient having a personal assistant can be. What if you could build your own personal assistant, tailored to your specific needs? Thanks to the power of OpenAI's ChatGPT language model and the open-source community, you can!&lt;/p&gt;

&lt;p&gt;In this post, we'll explore a GitHub project called "ChatGPT-chan," which provides a collection of tools to help you build your own personal assistant. Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ChatGPT-chan?
&lt;/h2&gt;

&lt;p&gt;ChatGPT-chan is an open-source project that provides tools for building conversational interfaces, automating tasks, and more. It leverages the power of OpenAI's ChatGPT language model to understand natural language inputs and provide intelligent responses.&lt;/p&gt;

&lt;p&gt;The project consists of three main components: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Emotion Classifier: A machine learning model that can detect the emotion in a given text input. &lt;/li&gt;
&lt;li&gt;Stable Diffusion Model: A machine learning model that generates realistic images based on text prompts. &lt;/li&gt;
&lt;li&gt;ChatGPT Wrapper: A Python library that provides a simple API for integrating the above models and creating conversational interfaces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project also includes a demo that showcases the power of the ChatGPT wrapper. Check out the demo video &lt;a href="https://odysee.com/@rushi:2/chatgptchandemo2:4" rel="noopener noreferrer"&gt;here&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/rushic24/chatgpt-chan" rel="noopener noreferrer"&gt;github: chatgpt-chan&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  How to Set Up ChatGPT-chan
&lt;/h2&gt;

&lt;p&gt;To get ChatGPT-chan up and running, you'll find comprehensive guidance in the project's GitHub repository. Here’s a simplified setup process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Emotion Classifier and Stable Diffusion Model either on a server or locally.&lt;/li&gt;
&lt;li&gt;Clone the ChatGPT-chan repository and install all necessary dependencies.&lt;/li&gt;
&lt;li&gt;Modify the configuration file to connect to the Emotion Classifier and Stable Diffusion Model servers.&lt;/li&gt;
&lt;li&gt;Launch the ChatGPT Wrapper.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This streamlined overview should help you initiate the setup quickly, with detailed steps available in the GitHub readme.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Build Your Own Personal Assistant?
&lt;/h2&gt;

&lt;p&gt;Building your own personal assistant can be a fun and rewarding project, but it also has practical applications. For example, you could use it to automate tasks in your daily life, such as setting reminders or sending messages. You could also integrate it into your own projects to provide a natural language interface for your users.&lt;/p&gt;

&lt;p&gt;Another benefit of building your own personal assistant is that you have full control over the data it collects and how it's used. With commercial personal assistants, you may not know what data is being collected and how it's being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;If you're interested in building your own personal assistant, ChatGPT-chan is a great place to start. It provides powerful tools for understanding natural language inputs and generating intelligent responses, all using open-source software.&lt;/p&gt;

&lt;p&gt;While the setup process can be a bit involved, the end result is a powerful tool that you can use to automate tasks and provide natural language interfaces for your projects. Give it a try and see what you can build!&lt;/p&gt;

</description>
      <category>puzzlegames</category>
      <category>firstpost</category>
      <category>showdev</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
