Forem: Ahmed Elnaggar

Fine-Tune Any HuggingFace Model like Gemma on TPUs with TorchAX

Ahmed Elnaggar — Mon, 27 Apr 2026 08:45:25 +0000

What if you could fine-tune any HuggingFace model on TPUs — using PyTorch code?

Here is what the end result looks like:

import torchax as tx
import torchax.train

# One function: forward → loss → gradients → optimizer update
step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)

# Training loop
for batch in dataloader:
    loss, params, opt_state = step_fn(params, buffers, opt_state, batch, batch["labels"])

Your PyTorch model. JAX's training primitives. Running on TPU. No rewrite needed.

In the first part of this series, we ran HuggingFace models on JAX for fast inference. Now we take the next step: training. We will instruction-tune Gemma 3 1B on the Databricks Dolly 15k dataset using LoRA and torchax's functional training API — all on a free Colab TPU.

Why Train on TPUs?

Google's Tensor Processing Units (TPUs) are purpose-built for matrix operations — the bread and butter of deep learning. Free Colab gives you access to a TPU v2-8 with ~15GB of high-bandwidth memory. That is enough to fine-tune a 1B parameter model with LoRA.

But training on TPUs traditionally meant rewriting your model in JAX (Flax, Equinox) or using PyTorch/XLA. torchax offers a third path: keep your PyTorch model, but use JAX's functional training primitives.

How torchax Training Differs from Standard PyTorch

Standard PyTorch	torchax
`loss.backward()`	`jax.value_and_grad(loss_fn)(params, ...)`
`optimizer.step()`	`optax.apply_updates(params, updates)`
Model holds its own state	Params and buffers are separate pytrees
Eager execution	JIT-compiled training steps

The key difference: functional training. Instead of calling loss.backward() and optimizer.step() on a stateful model, torchax separates the model into immutable weight pytrees and passes them through pure functions. This is what enables JAX's jax.jit to compile the entire training step into a single optimized program.

Prerequisites & Setup

What you need:

Python 3.10+
Basic familiarity with PyTorch and HuggingFace transformers
A Google Colab account (free tier works with LoRA)

Zero-setup option: Click the Colab badge above. The notebook handles all installation automatically.

Local setup:

# PyTorch CPU (torchax handles the accelerator via JAX)
pip install torch --index-url https://download.pytorch.org/whl/cpu

# JAX + all training dependencies in a single pip call
pip install -U 'jax[tpu]' torchax transformers flax peft datasets optax   # TPU
# pip install -U 'jax[cuda12]' torchax transformers flax peft datasets optax  # GPU

Colab note: The notebook installs packages and automatically restarts the runtime, since Colab pre-loads an older JAX that stays cached in memory until restart.

Key Concepts for Training

Before writing code, let's understand the four concepts that make torchax training work.

1. Param/Buffer Separation

JAX's jax.value_and_grad needs to know which inputs to differentiate. In standard PyTorch, the model owns its weights. In torchax training, we explicitly separate:

params — trainable parameters (get gradients)
buffers — everything else (frozen weights, running stats, constants)

params = {n: p for n, p in model.named_parameters() if p.requires_grad}
frozen = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers = dict(model.named_buffers())
buffers.update(frozen)

For LoRA, params contains only the small adapter weights (well under 1% of the model; 0.22% in the run shown in Step 2). For full fine-tuning, it contains everything.

2. optax Optimizers

Unlike PyTorch optimizers (which carry hidden mutable state), optax optimizers are pure functions:

# PyTorch: hidden state inside optimizer
optimizer.step()

# optax: explicit state, no hidden pockets
updates, new_opt_state = optimizer.update(grads, opt_state, params)
new_params = optax.apply_updates(params, updates)

This functional design means the optimizer state is just another pytree that flows through the training step — perfect for jax.jit.
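
To make the "explicit state" idea concrete, here is a minimal pure-Python sketch of an optimizer with the same shape as the optax calls above. It is not optax: `sgd_momentum`, `init`, `update`, and `apply_updates` are illustrative stand-ins written for this example, and the parameters are plain floats rather than arrays.

```python
# Pure-Python sketch of an optax-style optimizer: all state is passed in and returned.
# Every name here is an illustrative stand-in, not the optax implementation.

def sgd_momentum(learning_rate, momentum=0.9):
    def init(params):
        # One velocity entry per parameter, starting at zero.
        return {name: 0.0 for name in params}

    def update(grads, state):
        # Build a new state instead of mutating the old one.
        new_state = {n: momentum * state[n] + grads[n] for n in grads}
        updates = {n: -learning_rate * new_state[n] for n in grads}
        return updates, new_state

    return init, update

def apply_updates(params, updates):
    # Return fresh params; the input dict is left untouched.
    return {n: params[n] + updates[n] for n in params}

init, update = sgd_momentum(learning_rate=0.1)
params = {"w": 1.0, "b": -2.0}
opt_state = init(params)
grads = {"w": 0.5, "b": -1.0}

updates, new_opt_state = update(grads, opt_state)
new_params = apply_updates(params, updates)
```

Because nothing is mutated, `params` and `opt_state` still hold their old values after the call, which is the property that lets a compiler treat the whole update as one pure function.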

3. make_train_step

torchax.train.make_train_step() is the central API. It composes three pieces into a single JIT-compilable function:

model_fn — a pure function: (weights, buffers, batch) → output
loss_fn — extracts the scalar loss: (output, labels) → loss
optimizer — an optax optimizer

The result is step_fn(params, buffers, opt_state, batch, labels) → (loss, new_params, new_opt_state).

Under the hood, this uses jax.value_and_grad for efficient gradient computation and optax.apply_updates for weight updates — all compiled into a single XLA program.
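
The composition can be sketched without any framework. The toy `make_train_step` below is an assumption-laden stand-in, not the torchax function: it uses finite-difference gradients and plain SGD instead of `jax.value_and_grad` and optax, takes a learning rate instead of an optimizer, and therefore carries no `opt_state`. What it does share with the real API is the shape of the idea: two pure functions go in, one step function comes out.

```python
def make_train_step(model_fn, loss_fn, learning_rate, eps=1e-6):
    """Toy stand-in: finite-difference gradients + plain SGD, no JIT."""
    def step_fn(params, buffers, batch, labels):
        def loss_of(p):
            return loss_fn(model_fn(p, buffers, batch), labels)

        loss = loss_of(params)
        grads = {}
        for name in params:
            bumped = dict(params)                 # copy, never mutate the input
            bumped[name] = params[name] + eps
            grads[name] = (loss_of(bumped) - loss) / eps
        new_params = {n: params[n] - learning_rate * grads[n] for n in params}
        return loss, new_params
    return step_fn

def model_fn(weights, buffers, batch):
    # y = w * scale * x + b, where `scale` is a frozen value kept in buffers
    return [weights["w"] * buffers["scale"] * x + weights["b"] for x in batch]

def loss_fn(outputs, labels):
    # Mean squared error
    return sum((o - y) ** 2 for o, y in zip(outputs, labels)) / len(labels)

step_fn = make_train_step(model_fn, loss_fn, learning_rate=0.05)
params, buffers = {"w": 0.0, "b": 0.0}, {"scale": 1.0}
batch, labels = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]

losses = []
for _ in range(200):
    loss, params = step_fn(params, buffers, batch, labels)
    losses.append(loss)
```

Only `params` is rebound each iteration; `buffers` passes through unchanged, which mirrors how frozen weights are treated in the real training step.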

4. Full Fine-Tuning vs LoRA

Aspect	Full Fine-Tuning	LoRA
Trainable params	All (~1B for Gemma 3 1B)	Small adapters (<1%)
Memory	~18-20 GB	~5-7 GB
Speed	Slower	Faster
Quality	Higher ceiling	Nearly as good
Free Colab TPU	Tight / may OOM	Fits comfortably

LoRA (Low-Rank Adaptation) freezes the base model and adds small trainable matrices to attention layers. Instead of updating the full weight matrix W, it learns a low-rank decomposition: W + (α/r) × B·A where A and B are tiny matrices.

For free Colab, LoRA is the recommended path.
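
A small worked example may help with the formula. The sketch below uses nested Python lists instead of tensors, and the helper names (`matmul`, `lora_effective_weight`, `param_counts`) and the matrix sizes are made up for illustration.

```python
def matmul(a, b):
    # Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A for small nested-list matrices."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

def param_counts(d_out, d_in, r):
    """Parameters in the full matrix vs. in the two LoRA factors."""
    return d_out * d_in, r * (d_in + d_out)

# Tiny worked case: a 2x3 weight with rank-1 adapters.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]          # shape (d_out, r) = (2, 1)
A = [[0.5, 0.0, -0.5]]      # shape (r, d_in) = (1, 3)
W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)

# Hypothetical 1024 x 1024 projection with rank 8.
full, lora = param_counts(d_out=1024, d_in=1024, r=8)
```

For the hypothetical 1024 x 1024 projection, the two rank-8 factors hold 16,384 values against 1,048,576 in the full matrix, which is where the memory savings in the table come from.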

Step 1: Load and Prepare the Dataset

We use Databricks Dolly 15k — 15,000 human-written instruction-response pairs across 7 categories (QA, summarization, brainstorming, etc.).

import datasets as hf_datasets
from transformers import AutoTokenizer

MODEL_NAME = "google/gemma-3-1b-it"
DATASET_NAME = "databricks/databricks-dolly-15k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

raw_dataset = hf_datasets.load_dataset(DATASET_NAME, split="train")

Each example has an instruction, optional context, response, and category. We format these into Gemma's chat template:

def format_example(example):
    user_content = example["instruction"]
    if example.get("context", ""):
        user_content += f"\n\nContext: {example['context']}"

    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["response"]},
    ]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": text}

Then tokenize and create dataloaders:

from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

# Subset, split, tokenize
subset = raw_dataset.shuffle(seed=42).select(range(2200))
split = subset.train_test_split(test_size=200, seed=42)

def tokenize_example(example):
    formatted = format_example(example)
    return tokenizer(formatted["text"], padding="max_length", max_length=512, truncation=True)

train_tokenized = split["train"].map(tokenize_example, remove_columns=split["train"].column_names)
eval_tokenized = split["test"].map(tokenize_example, remove_columns=split["test"].column_names)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
train_dataloader = DataLoader(train_tokenized, shuffle=True, collate_fn=collator, batch_size=2)
eval_dataloader = DataLoader(eval_tokenized, shuffle=False, collate_fn=collator, batch_size=2)
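
The collator's label handling is worth seeing in isolation, since later steps rely on it when they drop attention_mask. This is a simplified pure-Python sketch: the token ids are hypothetical, 0 stands in for the pad token, and the one-position shift that causal language models apply between inputs and labels is left out.

```python
IGNORE_INDEX = -100

def make_labels(input_ids, pad_token_id):
    """Copy input_ids, replacing padding positions with the ignore index."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def masked_mean(per_token_losses, labels):
    """Average only over positions whose label is not ignored."""
    kept = [l for l, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / max(len(kept), 1)

input_ids = [2, 105, 2364, 107, 0, 0]      # hypothetical ids; 0 plays the pad token
labels = make_labels(input_ids, pad_token_id=0)
# The large losses on the two padded positions do not affect the average.
loss = masked_mean([0.5, 1.5, 1.0, 3.0, 9.0, 9.0], labels)
```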

Step 2: Load the Model and Apply LoRA

Here is where the torchax pattern matters: load the model with torchax disabled, then enable it before moving to JAX.

import torch
import torchax as tx
import peft
import transformers  # needed for transformers.AutoModelForCausalLM below

# Load model with torchax disabled to avoid intercepting init ops
with tx.disable_temporarily():
    model = transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16
    )

# Sync pad_token_id so loss computation properly ignores padding
model.config.pad_token_id = tokenizer.pad_token_id

Why disable? HuggingFace model initialization uses operations (like in-place tensor filling) that torchax does not support. Disabling torchax during loading keeps everything on CPU, then we move to JAX after.

Now apply LoRA:

peft_config = peft.LoraConfig(
    task_type=peft.TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,                             # Rank of the LoRA matrices
    lora_alpha=16,                   # Scaling factor
    lora_dropout=0.0,                # 0.0 for bfloat16 numerical stability
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # All attention layers
)
model = peft.get_peft_model(model, peft_config)
model.print_trainable_parameters()
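# Example output (exact counts depend on the checkpoint and PEFT version):
# trainable params: 5,767,168 || all params: 2,619,206,656 || trainable%: 0.22%

In this run only 0.22% of the parameters are trainable. That is the point of LoRA: the optimizer only ever touches a tiny slice of the model.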

Finally, enable torchax and move to the JAX device:

tx.enable_accuracy_mode()  # Float32 accumulation for bfloat16 stability
tx.enable_globally()
device = torch.device("jax")
model.to(device)
model.train()

Step 3: Baseline Evaluation

Before training, we measure the model's performance to compare against later:

import math

def evaluate_loss(model, dataloader, device, max_batches=50):
    model.eval()
    total_loss, total_batches = 0.0, 0
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i >= max_batches:
                break
            # Drop attention_mask — Gemma's sliding window attention produces NaN
            # with padded masks on torchax/JAX. Labels already mask padding with -100.
            batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}
            outputs = model(**batch)
            total_loss += outputs.loss.item()
            total_batches += 1
    model.train()
    avg_loss = total_loss / max(total_batches, 1)
    return avg_loss, math.exp(min(avg_loss, 100))

baseline_loss, baseline_ppl = evaluate_loss(model, eval_dataloader, device)
print(f"Baseline loss: {baseline_loss:.4f}, perplexity: {baseline_ppl:.2f}")
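
Perplexity is just the exponential of the average cross-entropy loss, so the two numbers move together. A quick sketch of the relationship, with loss values chosen only for illustration:

```python
import math

def perplexity(avg_loss, cap=100.0):
    # Same clamp as evaluate_loss: avoids overflow if the loss is huge.
    return math.exp(min(avg_loss, cap))

# Hypothetical loss values, not measurements.
examples = {loss: perplexity(loss) for loss in (3.0, 2.0, 1.0)}

# A model that guesses uniformly over V tokens has loss ln(V) and perplexity V.
uniform_ppl = perplexity(math.log(1000))
```

A drop in loss from 3.0 to 2.0 takes perplexity from about 20 to about 7.4, so small loss changes can look large on the perplexity scale.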

We also generate sample responses for qualitative comparison. For fast generation, we register StaticCache as a JAX pytree and use KV-cached decoding — only the new token is processed each step instead of the full sequence (~50x faster):

from transformers.cache_utils import StaticCache
from jax.tree_util import register_pytree_node

def _flatten_static_cache(cache):
    return (cache.key_cache, cache.value_cache), (
        cache.config, cache.max_batch_size, cache.max_cache_len,
        getattr(cache, "device", None), getattr(cache, "dtype", None),
    )

def _unflatten_static_cache(aux, children):
    config, max_batch_size, max_cache_len, dev, dtype = aux
    kwargs = {}
    if dev is not None: kwargs["device"] = dev
    if dtype is not None: kwargs["dtype"] = dtype
    sc = StaticCache(config, max_batch_size, max_cache_len, **kwargs)
    sc.key_cache, sc.value_cache = children
    return sc

register_pytree_node(StaticCache, _flatten_static_cache, _unflatten_static_cache)

The generation function uses prefill (process full prompt) then per-token decode with the cache and a tqdm progress bar:

from tqdm.auto import tqdm

def generate_response(model, tokenizer, instruction, device, max_new_tokens=100):
    messages = [{"role": "user", "content": instruction}]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)
    seq_len = input_ids.shape[1]

    kv = StaticCache(config=model.config, max_batch_size=1,
                     max_cache_len=seq_len + max_new_tokens,
                     device=device, dtype=torch.bfloat16)
    pos = torch.arange(seq_len, device=device)

    model.eval()
    with torch.no_grad():
        # Prefill: process full prompt, populate cache
        logits, kv = model(input_ids, cache_position=pos, past_key_values=kv,
                           return_dict=False, use_cache=True)
        tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
        generated = [tok[:, 0].item()]
        pos = torch.tensor([seq_len], device=device)

        # Decode: one token at a time using cached keys/values
        for _ in tqdm(range(max_new_tokens - 1), desc="Generating", leave=False):
            logits, kv = model(tok, cache_position=pos, past_key_values=kv,
                               return_dict=False, use_cache=True)
            tok = torch.argmax(logits[:, -1], dim=-1)[:, None]
            tid = tok[:, 0].item()
            if tid == tokenizer.eos_token_id:
                break
            generated.append(tid)
            pos += 1

    model.train()
    return tokenizer.decode(generated, skip_special_tokens=True)

Step 4: Set Up Functional Training

This is where torchax diverges from standard PyTorch. We separate the model, create an optax optimizer, and compose everything into a JIT-compiled training step.

Separate params and buffers

import optax
import torchax.train

params = {n: p for n, p in model.named_parameters() if p.requires_grad}
buffers = dict(model.named_buffers())
frozen_params = {n: p for n, p in model.named_parameters() if not p.requires_grad}
buffers.update(frozen_params)

Create the optimizer

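# decay_steps is the total schedule length (warmup included). Match it to the run
# so the learning rate does not reach zero before training ends.
total_steps = len(train_dataloader)
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=1e-4, warmup_steps=50, decay_steps=total_steps
)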
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, weight_decay=0.01),
)
opt_state = tx.interop.call_jax(optimizer.init, params)

Note tx.interop.call_jax — this bridges optax's JAX calls with torchax tensors.
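
To see what a warmup-plus-cosine schedule does step by step, here is a pure-Python sketch. It follows my understanding of how optax's schedule behaves (linear warmup, then cosine decay, with decay_steps counting the warmup) but it is a reimplementation for illustration, not optax code. It uses the warmup and peak values from above and, as an assumed example, a 500-step schedule.

```python
import math

def warmup_cosine(step, init_value, peak_value, warmup_steps, decay_steps, end_value=0.0):
    """Linear warmup to peak_value, then cosine decay to end_value.
    decay_steps is the total schedule length, warmup included."""
    if step < warmup_steps:
        return init_value + (peak_value - init_value) * step / warmup_steps
    # Fraction of the decay phase completed, clamped so later steps stay at end_value.
    progress = min((step - warmup_steps) / (decay_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end_value + (peak_value - end_value) * cosine

def lr_at(step):
    return warmup_cosine(step, init_value=0.0, peak_value=1e-4,
                         warmup_steps=50, decay_steps=500)

samples = {s: lr_at(s) for s in (0, 25, 50, 275, 500, 800)}
```

The learning rate climbs to 1e-4 at step 50, is halfway down at step 275, and stays at the end value for every step past 500. The practical takeaway is that decay_steps should cover the whole run, otherwise the tail of training makes no progress.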

Define model_fn and loss_fn

def model_fn(weights, buffers, batch):
    """Stateless forward pass using functional_call."""
    return torch.func.functional_call(
        model, {**weights, **buffers}, args=(), kwargs=batch
    )

def loss_fn(model_output, labels):
    """Extract loss from HuggingFace model output."""
    return model_output.loss

torch.func.functional_call runs the model as a pure function — no hidden state, just inputs and outputs. This is what enables JAX to trace and compile it.

Compose into a training step

step_fn = tx.train.make_train_step(model_fn, loss_fn, optimizer)

That single line creates a function that does: forward pass → loss computation → gradient calculation → optimizer update — all compiled into one XLA program.

Step 5: The Training Loop

import time
from tqdm.auto import tqdm

torch.manual_seed(42)
train_losses = []
start_time = time.time()

for epoch in range(1):
    pbar = tqdm(enumerate(train_dataloader), total=len(train_dataloader))
    for step, batch in pbar:
        # Drop attention_mask — Gemma's sliding window attention produces NaN with
        # padded masks on torchax/JAX. Labels already mask padding with -100.
        batch = {k: v.to(device) for k, v in batch.items() if k != "attention_mask"}

        loss, params, opt_state = step_fn(
            params, buffers, opt_state, batch, batch["labels"]
        )

        train_losses.append(loss.item())
        pbar.set_postfix({"loss": f"{loss.item():.4f}"})

elapsed = time.time() - start_time
print(f"Training complete! {len(train_losses)} steps in {elapsed:.0f}s")

What to expect:

Step 1: ~30-60 seconds (JAX compiles the entire training step)
Steps 2+: ~1-3 seconds each (running the compiled program)
Total: ~20-40 minutes for 2000 samples with LoRA on free Colab TPU

The first step is slow because JAX traces through the entire model, loss computation, gradient calculation, and optimizer update — then compiles it all into a single optimized XLA program. Every subsequent step reuses this compiled program.

Step 6: Evaluate the Improvement

After training, we compare against our baseline:

# Load trained params back into model
with torch.no_grad():
    for name, param in params.items():
        parts = name.split(".")
        obj = model
        for part in parts[:-1]:
            obj = getattr(obj, part)
        setattr(obj, parts[-1], torch.nn.Parameter(param))

final_loss, final_ppl = evaluate_loss(model, eval_dataloader, device)

print(f"{'Metric':<20} {'Before':>10} {'After':>10}")
print(f"{'Loss':<20} {baseline_loss:>10.4f} {final_loss:>10.4f}")
print(f"{'Perplexity':<20} {baseline_ppl:>10.2f} {final_ppl:>10.2f}")

You should see loss decrease and perplexity improve after training. The qualitative comparison (generated responses before vs. after) is even more telling — the fine-tuned model produces more focused, instruction-following responses.
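
The parameter write-back at the top of this step walks dotted names such as those returned by named_parameters(). The same walk can be tried on a plain object tree. In this sketch `set_by_dotted_name`, the SimpleNamespace "model", and the parameter names are all hypothetical.

```python
from types import SimpleNamespace

def set_by_dotted_name(root, dotted_name, value):
    """Follow every path component but the last, then set the last as an attribute."""
    *parents, leaf = dotted_name.split(".")
    obj = root
    for part in parents:
        obj = getattr(obj, part)
    setattr(obj, leaf, value)

# A stand-in for a nested module hierarchy.
model = SimpleNamespace(
    layers=SimpleNamespace(attn=SimpleNamespace(lora_A=0.0, lora_B=0.0))
)
trained = {"layers.attn.lora_A": 0.25, "layers.attn.lora_B": -0.5}
for name, value in trained.items():
    set_by_dotted_name(model, name, value)
```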

Step 7: Save and Reload

Save

Convert JAX arrays back to CPU tensors and save using HuggingFace's standard format:

import numpy as np

save_dir = "./fine_tuned_model"

with torch.no_grad():
    cpu_state_dict = {
        name: torch.tensor(np.array(p)).contiguous()
        for name, p in params.items()
    }
    # safe_serialization=False avoids a safetensors/torchax C-extension conflict on reload
    model.save_pretrained(save_dir, state_dict=cpu_state_dict, safe_serialization=False)

tokenizer.save_pretrained(save_dir)

For LoRA, this saves only the tiny adapter weights (~20MB). For full fine-tuning, it saves the entire model (~4GB).

Reload

with tx.disable_temporarily():
    # For LoRA: load base model + adapters separately
    reloaded_model = transformers.AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16
    )
    # torch_device="cpu" forces PEFT to load adapter weights on CPU,
    # avoiding a safetensors/torchax C-extension conflict.
    reloaded_model = peft.PeftModel.from_pretrained(reloaded_model, save_dir, torch_device="cpu")

reloaded_model.to(device)
reloaded_model.eval()

The pattern is the same as loading: disable torchax, load on CPU, then move to JAX. For LoRA models, you load the base model first, then attach the saved adapters with PeftModel.from_pretrained(). The torch_device="cpu" ensures PEFT loads weights through PyTorch's standard path rather than safetensors' C extension, which conflicts with torchax.

Full Fine-Tuning: When LoRA Is Not Enough

The notebook supports full fine-tuning by changing one setting:

TRAINING_MODE = "full"

This trains all parameters instead of just the LoRA adapters. The trade-off is much higher memory usage. To make it fit on free Colab TPU:

AdaFactor optimizer — uses ~50% less memory than AdamW (stores only row/column statistics instead of per-parameter moments)
Reduced sequence length — MAX_SEQ_LEN = 256 halves activation memory
Smaller batch size — BATCH_SIZE = 1 with higher gradient accumulation steps

USE_ADAFACTOR = True
USE_GRADIENT_CHECKPOINTING = True

if TRAINING_MODE == "full" and USE_ADAFACTOR:
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.adafactor(learning_rate=schedule),
    )
else:
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0),
        optax.adamw(learning_rate=schedule, weight_decay=0.01),
    )

Full fine-tuning gives a higher quality ceiling but LoRA gets you 90%+ of the way with a fraction of the compute.
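
The "row/column statistics" point can be checked with simple counting. This is a rough sketch of optimizer state only, assuming a factored second-moment estimate with no momentum term; the matrix shape is hypothetical, and weights, gradients, and activations are not included.

```python
def adamw_state_elems(shape):
    rows, cols = shape
    return 2 * rows * cols   # first and second moment for every parameter

def factored_state_elems(shape):
    rows, cols = shape
    return rows + cols       # one statistic per row plus one per column

shape = (4096, 1024)  # hypothetical weight matrix
adam = adamw_state_elems(shape)
factored = factored_state_elems(shape)
```

For this one matrix the factored state is thousands of times smaller. The saving for a whole training run is much more modest, because the weights, gradients, and activations take the same memory under either optimizer.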

Troubleshooting

Error	Cause	Fix
`OutOfMemoryError`	Model + optimizer too large	Switch to LoRA, reduce `BATCH_SIZE` or `MAX_SEQ_LEN`
`TypeError: not a valid JAX type`	Custom HuggingFace type not registered	Register with `jax.tree_util.register_pytree_node()`
`Loss is NaN`	Numerical instability in bfloat16	1. Call `tx.enable_accuracy_mode()` before `tx.enable_globally()`. 2. Lower the learning rate (the schedule in this post peaks at 1e-4). 3. Set `lora_dropout=0.0`. 4. Add `optax.clip_by_global_norm(1.0)`.
`Slow first step`	Normal — JAX JIT compilation	Wait ~30-60s; subsequent steps are fast
`make_train_step error`	API mismatch	Update: `pip install -U torchax`

The Big Picture: Inference + Training

With the inference tutorial and this training tutorial, you now have the complete torchax story:

Run any HuggingFace model on TPU (model.to("jax"))
Benchmark with JIT compilation (10-100x speedup)
Fine-tune with LoRA or full training (make_train_step)
Save and reload for production inference

All using PyTorch code. No JAX rewrite needed.

Resources

Notebooks:
- Full training tutorial — all the code from this post, ready to run
- Training quickstart — same pipeline in ~10 cells
- Inference tutorial — Part 1 of this series
References:
- torchax PEFT LoRA example — the official example this tutorial builds on
- Han Qi's tutorial series — the original 3-part series on torchax + HuggingFace

Credits

Han Qi (@qihqi) — author of torchax, PEFT training example, and the original tutorial series
torchax team at Google — library development
HuggingFace — transformers, PEFT, and datasets ecosystem
Databricks — Dolly 15k dataset
JAX team at Google — JAX, XLA, and TPU support

Run Any HuggingFace Model on TPUs: A Beginner's Guide to TorchAX

Ahmed Elnaggar — Sun, 29 Mar 2026 17:04:14 +0000

What if you could run any HuggingFace model on TPUs — without rewriting a single line of model code?

Here is what the end result looks like:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it", torch_dtype="bfloat16")

import torchax
torchax.enable_globally()  # Enable AFTER loading the model

model.to("jax")  # That's it. Now running on JAX.

Five lines. Your PyTorch model is now executing on JAX — with access to TPUs, JIT compilation, and automatic parallelism across devices.

In this tutorial, we will go from zero to building a working chatbot powered by a HuggingFace model running on JAX. Along the way, you will learn key JAX concepts, see real benchmarks, and understand why this approach exists.

Why This Matters: The HuggingFace + JAX Problem

In 2025, HuggingFace began phasing out native JAX (Flax) and TensorFlow support in its transformers library to focus development on PyTorch. This left thousands of JAX users, especially those running on Google Cloud TPUs, without a straightforward way to use HuggingFace's massive model collection.

What is JAX?

If you are new to JAX, think of it as Google's high-performance numerical computing library. It looks like NumPy on the surface, but under the hood it offers three powerful capabilities:

JIT Compilation — JAX can compile your Python code into optimized machine code using the XLA compiler. The first run is slower (compilation), but every subsequent call is dramatically faster.
TPU Support — JAX is the native programming model for Google's Tensor Processing Units. If you want to use TPUs, JAX is the most natural path.
Automatic Parallelism — JAX can automatically distribute computation across multiple devices (TPUs or GPUs) using a single-program model called GSPMD. You describe what should be sharded; the compiler figures out how.

Enter TorchAX

torchax is a library from Google that bridges PyTorch and JAX. It works by creating a special torch.Tensor subclass that secretly holds a jax.Array inside. When PyTorch operations are called on this tensor, torchax intercepts them and executes the JAX equivalent instead.

Think of it like a Trojan horse: PyTorch thinks it is working with regular tensors, but the computation is actually happening on JAX.

PyTorch Model
    |
    v
torchax.Tensor (looks like torch.Tensor)
    |
    v
jax.Array (actual computation on TPU/GPU)

This means you can take any PyTorch model — including HuggingFace models — and run it on JAX without modifying the model code at all.
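
The wrapping idea can be sketched in a few lines of plain Python. This is a toy model of the pattern, not how torchax is implemented: `BackendArray` plays the role of jax.Array, `WrappedTensor` plays the role of the tensor subclass, and the dispatch table is invented for the example.

```python
class BackendArray:
    """Stands in for jax.Array: the object that really does the math."""
    def __init__(self, values):
        self.values = list(values)

BACKEND_OPS = {
    "add": lambda a, b: BackendArray(x + y for x, y in zip(a.values, b.values)),
    "mul": lambda a, b: BackendArray(x * y for x, y in zip(a.values, b.values)),
}
calls = []  # records which backend ops were dispatched

class WrappedTensor:
    """Stands in for the tensor subclass: frontend API, backend payload."""
    def __init__(self, inner):
        self._inner = inner

    def _dispatch(self, op, other):
        calls.append(op)
        # Unwrap both operands, run the backend op, wrap the result again.
        return WrappedTensor(BACKEND_OPS[op](self._inner, other._inner))

    def __add__(self, other):
        return self._dispatch("add", other)

    def __mul__(self, other):
        return self._dispatch("mul", other)

    def tolist(self):
        return self._inner.values

x = WrappedTensor(BackendArray([1.0, 2.0]))
y = WrappedTensor(BackendArray([3.0, 4.0]))
z = (x + y) * y
```

The caller only ever writes `x + y`, yet every operation ends up in the backend table. That is the same division of labour the diagram above describes.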

Credits: This tutorial builds on the excellent 3-part blog series by Han Qi (@qihqi), the author of torchax, and on the torchax documentation. We expand on those tutorials with beginner-friendly explanations, a different model (Gemma instead of Llama), benchmarks, and a complete Colab-ready notebook.

TorchAX vs. the Alternatives

Before diving into code, it helps to understand where torchax fits in the broader ecosystem:

Approach	Effort	Performance	Best For
Rewrite in Flax/Equinox	High (full rewrite)	Native JAX speed	New projects starting in JAX
torch-xla (PyTorch/XLA)	Low (add XLA device)	Good (XLA compiled)	PyTorch training on TPUs
torchax	Low (change device to 'jax')	Great (JAX JIT + interop)	Running HF models on JAX, mixing PyTorch + JAX
ONNX export	Medium (export + runtime)	Variable	Cross-framework deployment

When should you use torchax? When you have a PyTorch model (especially from HuggingFace) and want to leverage JAX's JIT compilation, TPU support, or interop with JAX libraries — without rewriting the model.

Prerequisites & Setup

What you need:

Python 3.10+
Basic familiarity with PyTorch (loading models, running inference)
A Google Colab account (free tier works for the 1B model)

Zero-setup option: Click the Colab badge above. The notebook handles all installation automatically.

Local setup:

# 1. Install PyTorch (CPU version — torchax handles the accelerator)
pip install torch --index-url https://download.pytorch.org/whl/cpu  # Linux
# pip install torch  # macOS

# 2. Install JAX for your accelerator
pip install -U 'jax[tpu]'     # Google Cloud TPU (quotes keep shells like zsh from expanding the brackets)
# pip install -U 'jax[cuda12]'  # NVIDIA GPU
# pip install -U jax            # CPU only

# 3. Install torchax, transformers, and flax (for JAX compatibility)
pip install -U torchax transformers flax

Key Concepts for Beginners

Before we write code, let's demystify three JAX concepts you will encounter throughout this tutorial.

Pytrees: JAX's Data Containers

A pytree is any nested structure of Python containers (dicts, lists, tuples) with arrays as leaves. JAX uses pytrees everywhere — model weights are pytrees, function inputs/outputs are pytrees.

Think of a pytree like a shipping box with labeled compartments. JAX knows how to open standard boxes (dicts, lists, tuples), pull out all the arrays, do math on them, and put them back.

The catch: JAX does not know how to open custom boxes. HuggingFace defines custom output types like CausalLMOutputWithPast — we need to teach JAX how to unpack and repack these. This is called pytree registration, and we will see it in action shortly.
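
Registration boils down to two functions: one that takes a custom box apart and one that rebuilds it. Here is a pure-Python sketch of that round trip. `ToyOutput`, `flatten`, `unflatten`, and `tree_map` are written for this example and only imitate the shape of the JAX API.

```python
from dataclasses import dataclass

@dataclass
class ToyOutput:
    logits: list
    hidden: list
    name: str          # metadata, not an array

def flatten(out):
    children = (out.logits, out.hidden)   # the array-like leaves
    aux = out.name                        # static data needed to rebuild
    return children, aux

def unflatten(aux, children):
    logits, hidden = children
    return ToyOutput(logits=logits, hidden=hidden, name=aux)

def tree_map(fn, out):
    # Open the box, transform every leaf value, close the box again.
    children, aux = flatten(out)
    return unflatten(aux, tuple([fn(v) for v in leaf] for leaf in children))

original = ToyOutput(logits=[1.0, 2.0], hidden=[3.0], name="demo")
doubled = tree_map(lambda v: 2 * v, original)
```

The split between children (values the math touches) and aux (everything else needed to rebuild the object) is the design decision you make for each custom type.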

JIT Compilation: Translate Once, Run Fast Forever

JIT (Just-In-Time) compilation is like translating a recipe from English to machine code. The first time you call a JIT-compiled function, JAX traces through it, records all the operations, and compiles an optimized version. Subsequent calls skip the tracing and run the compiled version directly.

First call:  Python code → trace → compile → execute  (slow)
Second call: compiled code → execute                   (fast!)

The speedup can be 10-100x or more. The trade-off is that the compiled function is specialized for the input shapes it was traced with — if shapes change, JAX recompiles.
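
The shape-specialization trade-off is easy to imitate. The decorator below is a toy, not jax.jit: it does no compilation at all and simply counts how often a new input length would have forced one. `toy_jit` and its `stats` attribute are invented for this sketch.

```python
def toy_jit(fn):
    cache = {}
    stats = {"compiles": 0}

    def wrapper(xs):
        key = len(xs)                 # stand-in for the input shape
        if key not in cache:
            stats["compiles"] += 1    # a real JIT would trace and compile here
            cache[key] = fn
        return cache[key](xs)

    wrapper.stats = stats
    return wrapper

@toy_jit
def total(xs):
    return sum(xs)

total([1, 2, 3])      # new shape (3,): compiles
total([4, 5, 6])      # same shape: reuses the cached program
total([1, 2, 3, 4])   # new shape (4,): compiles again
```

Three calls, two compilations. Keeping input shapes fixed, for example by padding every sequence to the same length, is the usual way to stay on the fast path.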

Static vs. Dynamic Values

When JAX traces a function for JIT, it treats inputs as abstract shapes, not concrete values. If your code has a branch like if use_cache:, JAX cannot evaluate it during tracing because use_cache is abstract. This causes a ConcretizationTypeError.

The fix: mark such values as static (compile-time constants) so JAX knows their actual value during tracing. We will see two ways to do this: closures and static_argnums.
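
The failure and the closure fix can both be reproduced with a toy tracer. Nothing here is JAX: `Tracer` and `trace` are minimal stand-ins that only imitate one behaviour, namely that an abstract value cannot be used in a Python `if`.

```python
class Tracer:
    """Toy abstract value: has no concrete truth value during tracing."""
    def __bool__(self):
        raise TypeError("abstract value used in a Python `if`")

def forward(x, use_cache):
    if use_cache:                 # needs a concrete bool
        return ("cached", x)
    return ("uncached", x)

def trace(fn, *args):
    """Call fn with every argument replaced by an abstract Tracer."""
    return fn(*[Tracer() for _ in args])

# Passing the flag as a traced argument fails...
try:
    trace(forward, 1.0, False)
    traced_flag_ok = True
except TypeError:
    traced_flag_ok = False

# ...while closing over it keeps it a plain Python constant.
def forward_no_cache(x):
    return forward(x, use_cache=False)

mode, _ = trace(forward_no_cache, 1.0)
```

In the closure version the tracer never sees use_cache at all, so the branch is resolved in ordinary Python before any tracing happens.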

Step 1: Your First Forward Pass

Let's load a model and run it on JAX. We will use Gemma 3 1B IT — a small, instruction-tuned model from Google that runs comfortably on free Colab hardware.

import torch
import torchax
import jax
import time

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cpu"
)

# Enable torchax globally AFTER model loading
# This prevents intercepting unsupported initialization ops
torchax.enable_globally()

# Move model weights to the JAX device
model.to("jax")

# Tokenize an input prompt
prompt = "The secret to baking a good cake is"
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to("jax")

# Run a forward pass (eager mode)
start = time.perf_counter()
with torch.no_grad():
    outputs = model(input_ids, use_cache=False)
elapsed = time.perf_counter() - start

print(f"Output logits shape: {outputs.logits.shape}")
print(f"Eager forward pass: {elapsed:.3f}s")

What happened:

We load the model on CPU first, then call torchax.enable_globally(). This ordering is important — enabling torchax before model loading can intercept unsupported initialization ops and cause errors.
model.to("jax") moves every parameter from CPU to the JAX device — just like model.to("cuda") for GPUs.
The forward pass runs through PyTorch's code path, but every operation is executed by JAX under the hood.

The output logits tensor has shape (1, sequence_length, vocab_size). Each position contains a score for every token in the vocabulary — the highest score is the model's prediction for the next token.
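
Reading a prediction out of that tensor takes one indexing step and one argmax. A sketch with nested lists in place of a tensor, using a made-up four-token vocabulary and made-up scores:

```python
def argmax(scores):
    # Index of the largest score
    return max(range(len(scores)), key=scores.__getitem__)

vocab = ["<pad>", "flour", "butter", "patience"]   # hypothetical 4-token vocabulary
# logits[batch][position][vocab_id], shape (1, 3, 4)
logits = [[
    [0.1, 2.0, 0.3, 0.0],
    [0.0, 0.2, 1.5, 0.1],
    [0.3, 0.4, 0.2, 2.2],
]]
shape = (len(logits), len(logits[0]), len(logits[0][0]))
next_id = argmax(logits[0][-1])    # scores at the last position only
next_token = vocab[next_id]
```

Only the last position matters for choosing the next token; the earlier rows are predictions for positions the prompt has already filled in.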

Step 2: Speed It Up with JIT Compilation

The eager forward pass works, but it is slow — every operation goes through Python one at a time. Let's compile the model for dramatically faster inference.

The extract_jax Approach

The torchax.extract_jax() function converts a PyTorch model into a pure JAX function:

# Extract a JAX-callable function and the model weights as a pytree
weights, jax_func = torchax.extract_jax(model)

This returns two things:

weights — the model's state_dict as a pytree of jax.Arrays
jax_func — a function with signature jax_func(weights, args_tuple, kwargs_dict)

Register HuggingFace Output Types as Pytrees

Before we can JIT this function, we need to teach JAX about HuggingFace's custom types:

from jax.tree_util import register_pytree_node
from transformers import modeling_outputs, cache_utils

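# Register CausalLMOutputWithPast.
# ModelOutput behaves like an ordered dict that skips None fields, so keep the
# field names as aux data and rebuild by keyword rather than by position.
def output_flatten(v):
    return tuple(v.values()), tuple(v.keys())

def output_unflatten(aux, children):
    return modeling_outputs.CausalLMOutputWithPast(**dict(zip(aux, children)))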

register_pytree_node(
    modeling_outputs.CausalLMOutputWithPast,
    output_flatten,
    output_unflatten,
)

# Register DynamicCache
def _flatten_dynamic_cache(cache):
    return (cache.key_cache, cache.value_cache), None

def _unflatten_dynamic_cache(aux, children):
    c = cache_utils.DynamicCache()
    c.key_cache, c.value_cache = children
    return c

register_pytree_node(
    cache_utils.DynamicCache,
    _flatten_dynamic_cache,
    _unflatten_dynamic_cache,
)

Handle Static Arguments with a Closure

The use_cache flag is a boolean that JAX cannot trace. We wrap it in a closure to make it a compile-time constant:

def forward_no_cache(weights, input_ids):
    return jax_func(weights, (input_ids,), {"use_cache": False})

jitted_forward = jax.jit(forward_no_cache)

Benchmark: Eager vs. JIT

# Convert input to a native JAX array for jax.jit
jax_input_ids = jax.device_put(inputs["input_ids"].numpy())

# Warm up (first call triggers compilation)
res = jitted_forward(weights, jax_input_ids)
jax.block_until_ready(res)

# Benchmark 3 runs
for i in range(3):
    start = time.perf_counter()
    res = jitted_forward(weights, jax_input_ids)
    jax.block_until_ready(res)
    elapsed = time.perf_counter() - start
    print(f"Run {i}: {elapsed:.4f}s")

Expected output (times will vary by hardware):

Run 0: 0.0142s  # Already compiled from warm-up
Run 1: 0.0038s
Run 2: 0.0035s

The JIT-compiled version runs orders of magnitude faster than eager mode. This is the power of XLA compilation — operations are fused, memory is optimized, and the accelerator runs a single optimized program.

Step 3: The Simpler API — torchax.compile

The extract_jax + manual JIT approach gives you full control, but for most cases there is a simpler way. The catch is that torchax.compile() uses jax.jit under the hood, so we need to avoid passing dynamic boolean flags like use_cache. We wrap the model in a thin module that bakes in these constants:

import torch.nn as nn

class NoCacheModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def forward(self, input_ids):
        # Return only logits to avoid HuggingFace output class pytree issues
        return self.base_model(input_ids, use_cache=False, return_dict=False)[0]

# One-liner: compile the wrapped model
compiled_model = torchax.compile(NoCacheModel(model))

# Use it like a normal PyTorch model
with torch.no_grad():
    logits = compiled_model(input_ids)

Under the hood, torchax.compile() wraps your model in a JittableModule and applies jax.jit. The first call triggers compilation; subsequent calls are fast. The NoCacheModel wrapper ensures that boolean flags are constants (not traced) and that the output is a plain tensor (not a custom HuggingFace type that needs pytree registration).

Step 4: Text Classification

Let's use our JIT-compiled model for a practical task — sentiment classification. Since Gemma is an instruction-tuned model, we can use prompt engineering:

def classify_sentiment(text, model, tokenizer):
    prompt = f"""Classify the following text as POSITIVE or NEGATIVE.
Text: "{text}"
Classification:"""

    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("jax")

    with torch.no_grad():
        outputs = model(input_ids, use_cache=False)

    # Get the predicted next token
    next_token_logits = outputs.logits[0, -1, :]
    next_token_id = torch.argmax(next_token_logits).item()
    prediction = tokenizer.decode([next_token_id]).strip()
    return prediction

# Test it
texts = [
    "This movie was absolutely fantastic, I loved every minute!",
    "The service was terrible and the food was cold.",
    "A perfectly average experience, nothing special.",
]

for text in texts:
    result = classify_sentiment(text, model, tokenizer)
    print(f"Text: {text[:50]}...  =>  {result}")
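
The decoded next token is not guaranteed to be a full label. Depending on the tokenizer, it may be a fragment such as "POS" or a differently cased word. Below is a minimal sketch of a post-processing step under that assumption; normalize_label is a hypothetical helper, not part of transformers or torchax.

```python
def normalize_label(raw_prediction, labels=("POSITIVE", "NEGATIVE")):
    """Map a decoded token (possibly a fragment) onto one of the known labels."""
    cleaned = raw_prediction.strip().upper()
    if not cleaned:
        return "UNKNOWN"
    for label in labels:
        # Accept both a prefix of the label ("POS") and text that starts with it.
        if label.startswith(cleaned) or cleaned.startswith(label):
            return label
    return "UNKNOWN"

for raw in ["POS", " negative", "The", ""]:
    print(repr(raw), "->", normalize_label(raw))
```

You would call it on the value returned by classify_sentiment. Anything that maps to UNKNOWN is a hint that the prompt needs tightening.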

Step 5: Text Generation (Autoregressive Decoding)

Classification is useful, but the real power of LLMs is generating text. Let's understand how this works.

How Autoregressive Decoding Works

An LLM predicts one token at a time. Given an input of length n, it produces scores for the next token. We pick one (e.g., the highest-scoring token via greedy decoding), append it to the input, and repeat:

Iteration 1: input (1, n)     → output (1, n)     → pick token
Iteration 2: input (1, n+1)   → output (1, n+1)   → pick token
Iteration 3: input (1, n+2)   → output (1, n+2)   → pick token
...

The problem: input shapes change every iteration. JIT compilation specializes for fixed shapes, so changing shapes means recompilation every step — worse than eager mode.
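
You can observe the recompilation directly on a toy function. The sketch below counts how often JAX traces toy_step (a hypothetical stand-in for a decoding step) by incrementing a Python counter, which only runs during tracing and not on cached calls.

```python
import jax
import jax.numpy as jnp

trace_count = 0

@jax.jit
def toy_step(token_ids):
    global trace_count
    trace_count += 1  # Python side effect: executes only while JAX traces
    return token_ids.sum()

# Growing inputs, as in naive autoregressive decoding: one trace per length.
for length in (5, 6, 7):
    toy_step(jnp.zeros((1, length), dtype=jnp.int32))
traces_growing = trace_count

# Fixed-shape inputs: the first call traces, later calls reuse the compiled code.
for _ in range(3):
    toy_step(jnp.zeros((1, 16), dtype=jnp.int32))
traces_fixed = trace_count - traces_growing

print(traces_growing, traces_fixed)
```

Three different lengths cost three traces, while three calls at one fixed shape cost a single trace. For a real model each trace also means a full XLA compilation, which is what makes the naive loop so slow.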

The KV Cache Solution

The KV (Key-Value) cache stores intermediate computations from previous tokens so the model only needs to process the new token each iteration:

Iteration 1: input (1, n)              → output + kv_cache(n)
Iteration 2: input (1, 1) + cache(n)   → output + kv_cache(n+1)
Iteration 3: input (1, 1) + cache(n+1) → output + kv_cache(n+2)

With a DynamicCache, the cache grows each step — shapes still change. With a StaticCache, the cache has a fixed maximum length — shapes stay constant, making it JIT-friendly.
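
The fixed-shape idea can be sketched in a few lines of plain JAX. This is a toy buffer, not the transformers StaticCache implementation: the names write_kv, MAX_LEN, and HEAD_DIM are assumptions made for illustration.

```python
import jax
import jax.numpy as jnp

MAX_LEN, HEAD_DIM = 8, 4
trace_count = 0

@jax.jit
def write_kv(cache, new_row, position):
    """Write one token's entry into a preallocated cache of constant shape."""
    global trace_count
    trace_count += 1  # executes only while JAX traces
    return cache.at[position].set(new_row)

cache = jnp.zeros((MAX_LEN, HEAD_DIM))
for position in range(3):
    new_row = jnp.full((HEAD_DIM,), float(position + 1))
    cache = write_kv(cache, new_row, position)

print(cache.shape, trace_count)
```

The cache keeps its (MAX_LEN, HEAD_DIM) shape on every step and only the write position changes, so the function is traced once. A growing cache built with concatenation would change shape on each step and trigger a new trace every time.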

Implementation with StaticCache

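from jax.tree_util import register_pytree_node
from transformers.cache_utils import StaticCache

# Register StaticCache as a pytree so it can cross the jax.jit boundary.
# Assumes a transformers version where StaticCache exposes key_cache / value_cache.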
def _flatten_static_cache(cache):
    return (
        cache.key_cache, cache.value_cache
    ), (cache.config, cache.max_batch_size, cache.max_cache_len,
        getattr(cache, "device", None), getattr(cache, "dtype", None))

def _unflatten_static_cache(aux, children):
    config, max_batch_size, max_cache_len, device, dtype = aux
    kwargs = {}
    if device is not None: kwargs["device"] = device
    if dtype is not None: kwargs["dtype"] = dtype
    cache = StaticCache(config, max_batch_size, max_cache_len, **kwargs)
    cache.key_cache, cache.value_cache = children
    return cache

register_pytree_node(
    StaticCache,
    _flatten_static_cache,
    _unflatten_static_cache,
)

def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("jax")
    batch_size, seq_length = input_ids.shape

    # Create a static cache with fixed maximum length
    past_key_values = StaticCache(
        config=model.config,
        max_batch_size=1,
        max_cache_len=seq_length + max_new_tokens,
        device="jax",
        dtype=model.dtype,
    )
    cache_position = torch.arange(seq_length, device="jax")

    # Prefill: process the full prompt
    with torch.no_grad():
        logits, past_key_values = model(
            input_ids,
            cache_position=cache_position,
            past_key_values=past_key_values,
            return_dict=False,
            use_cache=True,
        )

    next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
    generated_ids = [next_token[:, 0].item()]
    cache_position = torch.tensor([seq_length], device="jax")

    # Decode: generate one token at a time
    for _ in range(max_new_tokens - 1):
        with torch.no_grad():
            logits, past_key_values = model(
                next_token,
                cache_position=cache_position,
                past_key_values=past_key_values,
                return_dict=False,
                use_cache=True,
            )
        next_token = torch.argmax(logits[:, -1], dim=-1)[:, None]
        token_id = next_token[:, 0].item()

        if token_id == tokenizer.eos_token_id:
            break
        generated_ids.append(token_id)
        cache_position += 1

    return tokenizer.decode(generated_ids, skip_special_tokens=True)

# Generate!
result = generate_text(model, tokenizer, "The secret to baking a good cake is")
print(result)
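
The pytree registration used for StaticCache follows a general pattern: flatten returns the array children plus static auxiliary data, and unflatten rebuilds the object from the two. Here is a self-contained sketch of that pattern with a hypothetical ToyCache class, which is not a transformers type.

```python
import jax
import jax.numpy as jnp
from jax.tree_util import register_pytree_node, tree_leaves

class ToyCache:
    """Hypothetical container: two arrays plus one static setting."""
    def __init__(self, max_len, keys=None, values=None):
        self.max_len = max_len
        self.keys = jnp.zeros((max_len, 2)) if keys is None else keys
        self.values = jnp.zeros((max_len, 2)) if values is None else values

def _flatten(cache):
    # children: arrays that JAX traces; aux: static data kept outside the trace
    return (cache.keys, cache.values), (cache.max_len,)

def _unflatten(aux, children):
    (max_len,) = aux
    keys, values = children
    return ToyCache(max_len, keys, values)

register_pytree_node(ToyCache, _flatten, _unflatten)

@jax.jit
def bump_keys(cache):
    # The registered type can now be passed into and returned from jitted code.
    return ToyCache(cache.max_len, cache.keys + 1.0, cache.values)

out = bump_keys(ToyCache(4))
print(type(out).__name__, out.keys.shape, len(tree_leaves(out)))
```

Anything placed in the auxiliary tuple is treated as part of the tree structure, so it should be hashable and should not change between calls you want to share a compilation.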

Step 6: Distributed Inference (Tensor Parallelism)

If you have access to multiple devices (e.g., a TPU v2-8, which exposes 8 cores to JAX, or a multi-GPU machine), you can shard the model weights across devices for faster inference.

How Tensor Parallelism Works

In tensor parallelism, we split weight matrices across devices:

Column-parallel: Q, K, V, Gate, and Up projections are split along the output dimension
Row-parallel: O and Down projections are split along the input dimension
Pairing a column-parallel projection with a row-parallel one means each attention or MLP block needs only a single all-reduce, after the row-parallel projection

XLA's GSPMD partitioner, which jax.jit relies on, handles the communication automatically. You only specify how each weight should be sharded.
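
To see why one all-reduce is enough, here is a small NumPy simulation of a column-parallel layer followed by a row-parallel layer. It is a sketch on a single host with made-up sizes. The weights are stored as (in, out) for readability, which is the transpose of PyTorch's nn.Linear layout, and the final sum plays the role of the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))         # (batch, hidden)
w_up = rng.normal(size=(8, 16))     # column-parallel: split along the output dim
w_down = rng.normal(size=(16, 8))   # row-parallel: split along the input dim

# Unsharded reference: up projection, ReLU, down projection.
reference = np.maximum(x @ w_up, 0) @ w_down

n_devices = 4
up_shards = np.split(w_up, n_devices, axis=1)      # four (8, 4) pieces
down_shards = np.split(w_down, n_devices, axis=0)  # four (4, 8) pieces

# Each simulated device works on its own shard with no communication.
partials = [np.maximum(x @ up, 0) @ down for up, down in zip(up_shards, down_shards)]

# The only communication step: sum the partial outputs (the all-reduce).
combined = sum(partials)

print(np.allclose(reference, combined))
```

The elementwise activation can be applied per shard because each device holds complete columns of the intermediate result. That is why no communication is needed between the two projections.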

Sharding the Weights

from jax.sharding import PartitionSpec as P, NamedSharding

# Create a device mesh
mesh = jax.make_mesh((jax.device_count(),), ("axis",))

def shard_weights(mesh, weights):
    sharded = {}
    for name, tensor in weights.items():
        if any(k in name for k in ["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"]):
            spec = P("axis", None)  # Column-parallel
        elif any(k in name for k in ["o_proj", "down_proj", "lm_head", "embed_tokens"]):
            spec = P(None, "axis")  # Row-parallel
        else:
            spec = P()  # Replicate (e.g., layer norms)
        sharded[name] = jax.device_put(tensor, NamedSharding(mesh, spec))
    return sharded

# Apply sharding
weights, jax_func = torchax.extract_jax(model)
weights = shard_weights(mesh, weights)

# Replicate the input across all devices
# (convert the torch tensor to NumPy first, as in the benchmark above)
input_ids_sharded = jax.device_put(
    inputs["input_ids"].numpy(), NamedSharding(mesh, P())
)

With sharded weights, the same jax.jit-compiled function now runs in parallel across all devices. The XLA compiler automatically inserts the necessary all-reduce operations.
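
One practical detail: JAX generally expects a sharded dimension to divide evenly by the number of devices along that mesh axis, so it can help to check shapes before calling jax.device_put. The sketch below is a pure-Python pre-check that mirrors the name matching in shard_weights. The helper names and the example shapes are hypothetical and are not real Gemma dimensions.

```python
COLUMN_PARALLEL = ("q_proj", "k_proj", "v_proj", "gate_proj", "up_proj")
ROW_PARALLEL = ("o_proj", "down_proj", "lm_head", "embed_tokens")

def sharded_axis(name):
    """Return the tensor axis that shard_weights would split, or None if replicated."""
    if any(key in name for key in COLUMN_PARALLEL):
        return 0  # matches P("axis", None)
    if any(key in name for key in ROW_PARALLEL):
        return 1  # matches P(None, "axis")
    return None

def find_unshardable(shapes, n_devices):
    """List (name, shape, axis) entries whose sharded axis is not divisible."""
    problems = []
    for name, shape in shapes.items():
        axis = sharded_axis(name)
        if axis is not None and shape[axis] % n_devices != 0:
            problems.append((name, shape, axis))
    return problems

# Made-up shapes for illustration only.
example_shapes = {
    "layers.0.self_attn.q_proj.weight": (16, 12),
    "layers.0.self_attn.k_proj.weight": (6, 12),
    "layers.0.mlp.down_proj.weight": (12, 32),
    "layers.0.input_layernorm.weight": (12,),
}

print(find_unshardable(example_shapes, n_devices=4))
print(find_unshardable(example_shapes, n_devices=2))
```

With real weights you would build the shapes dictionary from the extracted weights, for example {name: w.shape for name, w in weights.items()}, and pass jax.device_count() as n_devices.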

Note: Tensor parallelism requires a multi-device environment. Check jax.device_count() first: if your runtime exposes a single device, treat this section as illustration only and use a multi-core TPU (such as a v2-8) or a multi-GPU setup to run it.

Step 7: Build a Mini Chatbot

Let's wrap everything into a simple chat function using Gemma's instruction template:

def chat(model, tokenizer, user_message, max_new_tokens=100):
    # Gemma instruction format
    prompt = f"<start_of_turn>user\n{user_message}<end_of_turn>\n<start_of_turn>model\n"
    response = generate_text(model, tokenizer, prompt, max_new_tokens)
    return response

# Example conversation
questions = [
    "What is JAX and why would I use it?",
    "Explain tensor parallelism in simple terms.",
    "Write a haiku about machine learning.",
]

for q in questions:
    print(f"User: {q}")
    print(f"Gemma: {chat(model, tokenizer, q)}")
    print()
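
The chat function above handles a single user message. A multi-turn conversation uses the same turn markers repeated per message. Below is a minimal sketch of a prompt builder under that assumption; build_gemma_prompt is a hypothetical helper, and in practice the tokenizer's apply_chat_template method is the usual way to get the exact template.

```python
def build_gemma_prompt(turns):
    """Format a list of (role, text) pairs, with role "user" or "model"."""
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the prompt open on a model turn so generation continues from there.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

history = [
    ("user", "What is JAX?"),
    ("model", "A library for array computing with automatic differentiation."),
    ("user", "Why would I use it?"),
]
print(build_gemma_prompt(history))
```

For a single user message the output matches the prompt string used in chat, so the result can be passed to generate_text unchanged.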

Swapping to a Larger Model

Everything above uses google/gemma-3-1b-it (1B parameters). To use a larger model, change the model name:

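# 4B model: needs more memory (Colab Pro or multi-device)
model_name = "google/gemma-3-4b-it"

Gemma 3 ships in 1B, 4B, 12B, and 27B sizes, so the next step up from 1B is the 4B checkpoint. The rest of the code remains largely the same, but the 4B and larger checkpoints are multimodal, so loading them may need small adjustments. Larger models produce higher quality outputs but require more memory and compute, and they benefit significantly from tensor parallelism on multi-device setups.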

Other models that work well with torchax include any standard HuggingFace AutoModelForCausalLM architecture — GPT-2, Llama, Mistral, Phi, and more.

Troubleshooting

TypeError: ... is not a valid JAX type
You need to register the type as a pytree. See the registration examples above for CausalLMOutputWithPast, DynamicCache, and StaticCache.

ConcretizationTypeError: Abstract tracer value encountered
A traced value is being used where JAX needs a concrete Python value, for example a boolean flag in an if statement. Either (1) mark the argument as static via static_argnums or static_argnames in jax.jit, or (2) bake it into a closure as a constant.

UserWarning: A large amount of constants were captured
Model weights are being inlined as constants in the compiled graph. Pass them as explicit function arguments instead of closing over them.

RuntimeError: No available devices
Ensure JAX can see your accelerator: print(jax.devices()). In Colab, check that your runtime type is set to TPU or GPU.

Conclusion

In this tutorial, we went from zero to a working chatbot running a HuggingFace model on JAX:

Forward pass — moved a PyTorch model to JAX with model.to("jax")
JIT compilation — compiled for 10-100x speedup with jax.jit
Text classification — used prompt engineering for sentiment analysis
Text generation — implemented autoregressive decoding with StaticCache
Distributed inference — sharded weights across devices with tensor parallelism
Chatbot — wrapped generation in an instruction-following chat function

The key insight: torchax lets you use the entire HuggingFace ecosystem — models, tokenizers, configs — while running on JAX's high-performance backend. No model rewrites needed.

Resources

torchax GitHub — library source and documentation
torchax Docs — official getting started guide
Original tutorial series by Han Qi — the 3-part blog series this tutorial builds on
JAX Documentation — JIT compilation, pytrees, distributed arrays
HuggingFace LLM Inference Optimization — StaticCache and torch.compile docs
Companion GitHub repo — all code, notebooks, and diagrams

Credits

This tutorial would not be possible without the work of:

Han Qi (@qihqi) — author of torchax and the original HuggingFace + JAX tutorial series
The torchax team at Google — for building and maintaining the library
The HuggingFace team — for the transformers ecosystem
The JAX team at Google — for JAX, XLA, and TPU support

What model will you try running on TPUs first? Let me know in the comments!