<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: PEPPERCORN</title>
    <description>The latest articles on Forem by PEPPERCORN (@peppercorn_llm).</description>
    <link>https://forem.com/peppercorn_llm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3910738%2F8084bbca-3641-4d19-85b2-f53a184e1f84.jpg</url>
      <title>Forem: PEPPERCORN</title>
      <link>https://forem.com/peppercorn_llm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/peppercorn_llm"/>
    <language>en</language>
    <item>
      <title>[Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 22:52:50 +0000</pubDate>
      <link>https://forem.com/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</link>
      <guid>https://forem.com/peppercorn_llm/day-3-i-had-a-local-llm-analyze-a-year-of-my-credit-card-statements-4eab</guid>
      <description>&lt;h1&gt;
  
  
  [Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;Day 3: I'm going to hand a year of credit card statements over to a local LLM and see what it can do.&lt;/p&gt;

&lt;p&gt;This is experiment #3.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What I'm using today: DGX Spark + &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; + &lt;a href="https://qwenlm.github.io/" rel="noopener noreferrer"&gt;Qwen2.5&lt;/a&gt; (comparing 7B vs 72B). Ollama is the de-facto local-LLM runtime, and Qwen2.5 is a multilingual model from Alibaba (China) that handles Japanese reasonably well, apparently.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Today's setup
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: 12 months of credit card statements from a single card.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume&lt;/strong&gt;: 383 transactions, ¥2,761,555 in total spend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goal&lt;/strong&gt;: get the AI to spot waste patterns and propose savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison axes&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model size&lt;/strong&gt;: 7B (light) vs 72B (heavy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Input format&lt;/strong&gt;: raw CSV vs pandas-aggregated summary&lt;/li&gt;
&lt;li&gt;→ &lt;strong&gt;4 patterns&lt;/strong&gt; total&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;: "If you ask an AI to aggregate raw data, the numbers come out way off." / "If you pre-aggregate with a spreadsheet tool first and then feed the AI, you get fast and accurate results." A small but practical finding.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Get the CSVs onto the DGX
&lt;/h2&gt;

&lt;p&gt;Log into the credit card company's web statements page on myPC1 (my Windows laptop), download 12 months of CSVs, then push them to the DGX.&lt;/p&gt;

&lt;p&gt;I deliberately skipped GitHub for the transfer this time — once you push something, it's in the history forever, and credit card data shouldn't be there even briefly. Instead, I used &lt;strong&gt;direct PC-to-PC transfer over SSH&lt;/strong&gt; (one command, finishes in seconds; details in the collapsibles at the end). The &lt;code&gt;.gitignore&lt;/code&gt; excludes &lt;code&gt;private-data/&lt;/code&gt; too, so accidental commits are ruled out.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Install Ollama
&lt;/h2&gt;

&lt;p&gt;Ollama is the de-facto runtime for local LLMs. One command should be enough.&lt;/p&gt;

&lt;p&gt;There was a small password hiccup during install (details below), but eventually it was up and running.&lt;/p&gt;

&lt;p&gt;The DGX Spark specs really show through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory: 121 GB&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Default context window: ~262,144 tokens&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: "throw a whole book at it, no problem" territory. Reassuring.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Two model sizes: Qwen2.5 7B vs 72B
&lt;/h2&gt;

&lt;p&gt;The strategy: &lt;strong&gt;same model family, different sizes&lt;/strong&gt;. That way the differences come from size, not architecture.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7B (light)&lt;/strong&gt;: ~4.7 GB, downloads in 5 minutes. Fast.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;72B (heavy)&lt;/strong&gt;: ~47 GB, 25 minutes to download. Slow but smart.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What does "B" mean?&lt;/strong&gt; Short for &lt;em&gt;Billion&lt;/em&gt;: the number of "weights" (learned parameters) inside the AI. Roughly speaking, more weights means more it can remember. So &lt;strong&gt;7B has 7 billion weights, 72B has 72 billion&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Loading both onto the DGX simultaneously, memory usage looks like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI model&lt;/th&gt;
&lt;th&gt;Memory occupied&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:72b&lt;/td&gt;
&lt;td&gt;61 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;8.2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;69 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;69 GB. Spacious!&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Prepping the CSVs
&lt;/h2&gt;

&lt;p&gt;Once I had the CSVs in hand, &lt;strong&gt;three small headaches&lt;/strong&gt; before they were ready for the AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Headache 1&lt;/strong&gt;: An older encoding (Windows Japanese flavor) → needs converting to modern UTF-8&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 2&lt;/strong&gt;: Some merchant names contain commas, which breaks naive CSV parsing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headache 3&lt;/strong&gt;: Each file has a "monthly total" line at the end that isn't actually data&lt;/li&gt;
&lt;/ul&gt;
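&lt;p&gt;Headache 1, for instance, boils down to a decode/encode pair. A minimal in-memory sketch (the header fields mirror the statement's actual columns; the round trip itself is the point):&lt;/p&gt;

```python
# Simulate a statement header as downloaded: CP932, the Windows Japanese encoding.
raw = "利用日,利用金額\n".encode("cp932")

# Decode with the legacy codec, then re-encode as UTF-8 for modern tooling.
text = raw.decode("cp932")
utf8_bytes = text.encode("utf-8")

print(text.split(",")[0])
```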

&lt;p&gt;Details in the collapsible. After cleanup, the 12 files merge into a single dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;383&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Period&lt;/td&gt;
&lt;td&gt;12 months (1 year)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total spend&lt;/td&gt;
&lt;td&gt;¥2,761,555&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg per tx&lt;/td&gt;
&lt;td&gt;¥7,210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Median per tx&lt;/td&gt;
&lt;td&gt;¥3,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Largest single tx&lt;/td&gt;
&lt;td&gt;¥209,283 (overseas flight)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smallest&lt;/td&gt;
&lt;td&gt;¥-3,980 (refund)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
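&lt;p&gt;The summary table above is straight arithmetic once the rows are merged. A minimal sketch with Python's standard &lt;code&gt;statistics&lt;/code&gt; module, using a handful of made-up amounts in place of the real 383 rows:&lt;/p&gt;

```python
import statistics

# Hypothetical amounts in yen; refunds are negative, as in the real data.
amounts = [3000, 12800, -3980, 209283, 7210, 3000]

summary = {
    "transactions": len(amounts),
    "total": sum(amounts),
    "average": round(statistics.mean(amounts)),
    "median": statistics.median(amounts),
    "largest": max(amounts),
    "smallest": min(amounts),
}
print(summary)
```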

&lt;p&gt;Now to feed this to 7B and 72B and see what each of them says.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Experiment 1: Throw the raw CSV at the AI
&lt;/h2&gt;

&lt;p&gt;No tricks: &lt;strong&gt;all 383 rows, straight at the AI&lt;/strong&gt;. Prompt is the full ask: "As a household budget consultant, output category breakdown / monthly trend / waste patterns / savings suggestions / lifestyle hypothesis."&lt;/p&gt;
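&lt;p&gt;For reference, a run like this goes through Ollama's HTTP API. A sketch of just the request assembly (the prompt wording here is a paraphrase, not my exact prompt, and nothing is actually sent):&lt;/p&gt;

```python
import json

# Illustrative CSV stub; the real input was all 383 rows.
csv_text = "date,merchant,amount\n2025/04/01,AMAZON.CO.JP,3480\n"
prompt = (
    "As a household budget consultant, analyze the following card statement CSV. "
    "Report category breakdown, monthly trend, waste patterns, savings "
    "suggestions, and a lifestyle hypothesis.\n\n" + csv_text
)

# Request body for Ollama's /api/generate endpoint.
payload = {"model": "qwen2.5:7b", "prompt": prompt, "stream": False}
body = json.dumps(payload)
print(body[:60])
```

&lt;p&gt;POSTing that body to &lt;code&gt;http://localhost:11434/api/generate&lt;/code&gt; returns the model's full answer in one JSON object when &lt;code&gt;stream&lt;/code&gt; is false.&lt;/p&gt;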

&lt;h3&gt;
  
  
  7B's answer (75 seconds)
&lt;/h3&gt;

&lt;p&gt;...this is where &lt;strong&gt;the numbers go wildly off&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon Downloads&lt;/td&gt;
&lt;td&gt;¥2,014,386 (257 tx)&lt;/td&gt;
&lt;td&gt;¥80,323 (50 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outdoor brand&lt;/td&gt;
&lt;td&gt;¥495,740&lt;/td&gt;
&lt;td&gt;¥154,820&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A local recreation venue&lt;/td&gt;
&lt;td&gt;"¥49,574" cited&lt;/td&gt;
&lt;td&gt;(a different small charge actually exists)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of the numbers line up. The Amazon total is roughly 3× too high, Amazon Downloads is about 25× too high, and the charge cited for the venue corresponds to a different transaction entirely.&lt;/p&gt;

&lt;p&gt;Reading 383 rows of CSV and computing totals turned out to be a heavy lift for the 7B model.&lt;/p&gt;

&lt;h3&gt;
  
  
  72B's answer (12m 9s)
&lt;/h3&gt;

&lt;p&gt;What if we throw size at the problem? After 12 minutes of patience:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 72B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥635,792 (104 tx)&lt;/td&gt;
&lt;td&gt;¥693,663 (166 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥193,629 (21 tx)&lt;/td&gt;
&lt;td&gt;¥176,850 (24 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Travel&lt;/td&gt;
&lt;td&gt;¥487,555 (43 tx)&lt;/td&gt;
&lt;td&gt;¥416,268 (8 tx)&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Not exact, but the errors shrink to roughly 10–20% (Amazon ~8% off, AI/dev tools ~9%, travel ~17%), and there are no fabricated venues. A real improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However — when asked about the monthly trend, here's what 72B said:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Month 1: ¥316,789 → Month 2: ¥229,600 → Month 3: ¥237,500 → ... → Month 12: ¥291,500&lt;br&gt;
(Gradually increasing.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual range is ¥69,961 (low) to ¥493,072 (high) — a chaotic up-and-down waveform. "Gradually increasing" isn't quite right. Even 72B isn't great at aggregating distributed data over a long CSV.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Experiment 2: Aggregate first, then feed the AI
&lt;/h2&gt;

&lt;p&gt;If the AI struggles with aggregation, do the aggregation in a different tool first and only hand the AI the result.&lt;/p&gt;

&lt;p&gt;The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📥 Raw CSV (22,132 chars, 383 rows)
       ↓
🔧 Pre-aggregate with a spreadsheet tool (Python's pandas)
       ↓
📋 Aggregate summary (1,884 chars, ~90% smaller)
       ↓
🤖 Hand it to the AI (let it interpret and propose)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Python's &lt;strong&gt;pandas&lt;/strong&gt; = a library for tabular data analysis: spreadsheet-like, but ~10,000× more powerful than Excel functions.&lt;/p&gt;
&lt;/blockquote&gt;
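&lt;p&gt;The pre-aggregation step looks roughly like this (column names are English stand-ins for the statement's actual headers, and the rows are made up):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical cleaned dataset; the real one has 383 rows from 12 monthly CSVs.
df = pd.DataFrame({
    "date": pd.to_datetime(["2025/04/01", "2025/04/15", "2025/05/02", "2025/06/20"]),
    "merchant": ["AMAZON.CO.JP", "AIRLINE", "AMAZON.CO.JP", "SUBSCRIPTION"],
    "amount": [3480, 209283, 5200, 1480],
})

# Aggregate once, deterministically, before the LLM ever sees the data.
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()
by_merchant = df.groupby("merchant")["amount"].agg(["sum", "count"])

# Render a compact text summary for the prompt (the real one was ~1,884 chars).
summary_lines = ["Monthly totals:"]
for period, total in monthly.items():
    summary_lines.append(f"  {period}: {total} yen")
summary = "\n".join(summary_lines)
print(summary)
```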

&lt;h3&gt;
  
  
  7B + pre-aggregated input (50 seconds)
&lt;/h3&gt;

&lt;p&gt;Numbers are &lt;strong&gt;fully accurate&lt;/strong&gt; now.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;What 7B said&lt;/th&gt;
&lt;th&gt;Real data&lt;/th&gt;
&lt;th&gt;Match?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Amazon total&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;¥693,663&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/dev tools&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;¥176,850&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly max&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;¥493,072&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly min&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;¥69,961&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quoting straight from the pre-aggregated numbers, the hallucinations vanished.&lt;/p&gt;

&lt;p&gt;And 7B did this in 50 seconds — better quality than the 72B + raw CSV at 12 minutes. Quietly remarkable.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before (raw CSV)&lt;/th&gt;
&lt;th&gt;After (aggregated)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Numbers&lt;/td&gt;
&lt;td&gt;wildly off&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;exact&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verdict&lt;/td&gt;
&lt;td&gt;not usable as-is&lt;/td&gt;
&lt;td&gt;quote directly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  72B + pre-aggregated input (12m 13s)
&lt;/h3&gt;

&lt;p&gt;72B's numbers also match exactly (well, since they're being quoted from pre-aggregated data, that's expected). The proposal quality was the strongest of the four patterns:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reduce Amazon dependency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: online shopping (Amazon family) is 25.1% of total (¥693,663).&lt;/li&gt;
&lt;li&gt;Suggestion: stick to essentials only, regular review, avoid impulse buys.&lt;/li&gt;
&lt;li&gt;Expected savings: ¥57,805/month average (25% reduction) → ¥693,660/year&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;...wait, hold on. Annual Amazon spend was ¥693,663. The "savings" 72B suggests is ¥693,660. That's basically the &lt;strong&gt;same number&lt;/strong&gt;. So the proposal is effectively "stop buying on Amazon entirely (100%)" — definitely not 25%. Apparently 72B's percentage arithmetic isn't bulletproof either.&lt;/p&gt;
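&lt;p&gt;That slip takes three lines of arithmetic to catch, which is exactly the kind of check that should live outside the model:&lt;/p&gt;

```python
amazon_annual = 693_663          # yen, from the aggregated data
claimed_monthly_saving = 57_805  # what 72B proposed as a "25% reduction"

actual_25pct_annual = round(amazon_annual * 0.25)    # what 25% really is
claimed_annual_saving = claimed_monthly_saving * 12  # essentially 100% of spend

print(actual_25pct_annual, claimed_annual_saving)
```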

&lt;p&gt;That aside, the &lt;strong&gt;lifestyle hypothesis&lt;/strong&gt; section was kind of striking. Here's what 72B observed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heavy reliance on apps and subscriptions&lt;/strong&gt;: "App/subscription" category is 10.5% of total&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent international travel&lt;/strong&gt;: "Travel/airline" is 15.1%, with notable overseas charges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequent online shopping&lt;/strong&gt;: "Online (Amazon)" is 25.1% of total&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's just one card's data, so this isn't a complete picture — but if I fed an AI my full household financials, &lt;strong&gt;the analysis and advice would probably go a lot deeper&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary: 4 patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Numerical accuracy&lt;/th&gt;
&lt;th&gt;Proposal quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;75s&lt;/td&gt;
&lt;td&gt;❌ Numbers way off&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Raw CSV&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12m 9s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;△ Misread monthly trend&lt;/td&gt;
&lt;td&gt;○&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;50s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;○ Some repetition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;72B&lt;/td&gt;
&lt;td&gt;Aggregated&lt;/td&gt;
&lt;td&gt;12m 13s&lt;/td&gt;
&lt;td&gt;✅ Exact&lt;/td&gt;
&lt;td&gt;◎ Best (mind the % math)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Quietly notable: &lt;strong&gt;72B takes ~12 minutes regardless of input size&lt;/strong&gt; (shrinking the prompt didn't change wall-clock time much). Output generation is the bottleneck. Which strengthens the case for "small model + pre-aggregate" as the cost-effective default.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Cross-check: the actual graphs
&lt;/h2&gt;

&lt;p&gt;Before trusting any of the AI output, let me put the real numbers on charts using the spreadsheet tool (pandas).&lt;/p&gt;

&lt;h3&gt;
  
  
  Monthly spending
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wvfzqh0st6qv1323fgr.png" alt="Monthly spending" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Average ¥230,130/month, but the range is ¥69,961 (lowest) to ¥493,072 (highest) — about a 7× spread. The 72B's "gradually increasing" claim was a bit off the mark; the reality is bouncy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Category share
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wepa7rudozlx1igsp4o.png" alt="Categories" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Other" being 32% is because my categorization rule is sloppy. I just wrote a simple "if the merchant name contains keyword X, bucket Y" rule, and lots of merchants didn't match any keyword and ended up in "Other." &lt;strong&gt;Reading meaning from a merchant name&lt;/strong&gt; is exactly the kind of thing AI is good at, so next time I'll let the AI do the categorization itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top 15 merchants
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqynqrvxdlol28s3mr63m.png" alt="Top merchants" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon at ¥421,978 (105 tx) is far and away #1. Amazon really is too convenient...&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekday rhythm
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftmwt0stf6hralf5vl8kp.png" alt="Weekday pattern" width="800" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tuesday alone is ¥692,549 — way above the rest. Probably because that's when most of the subscription auto-charges land.&lt;/p&gt;
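&lt;p&gt;That weekday chart is another aggregation the spreadsheet side handles. A stdlib-only sketch with made-up transactions:&lt;/p&gt;

```python
from datetime import date
from collections import defaultdict

# Hypothetical (date, amount) pairs; the real data has 383 of them.
transactions = [
    (date(2025, 4, 1), 1480),   # a Tuesday subscription charge
    (date(2025, 4, 8), 1480),   # the next Tuesday
    (date(2025, 4, 5), 3000),   # a Saturday purchase
]

by_weekday = defaultdict(int)
for d, amount in transactions:
    by_weekday[d.strftime("%A")] += amount

print(dict(by_weekday))
```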




&lt;h2&gt;
  
  
  8. Today's takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Separate "aggregation" from "interpretation"
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AI is bad at&lt;/th&gt;
&lt;th&gt;AI is good at&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-row sum/average (numbers go wildly off)&lt;/td&gt;
&lt;td&gt;Categorization (interpreting fuzzy meaning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Percentage math (saw "25% off → 100% off")&lt;/td&gt;
&lt;td&gt;Pattern recognition / hypothesis generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed aggregation like monthly totals&lt;/td&gt;
&lt;td&gt;Narrative interpretation, savings proposals&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;→ &lt;strong&gt;Aggregation is the spreadsheet tool's job; interpretation is the AI's.&lt;/strong&gt; Split the work that way and the results are both fast and accurate. "Data prep matters before analysis" — yeah, that old saying really is true. Note to self.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sometimes input quality beats raw size
&lt;/h3&gt;

&lt;p&gt;"7B + pre-aggregated input in 50 seconds" outperformed "72B + raw CSV in 12 minutes". &lt;strong&gt;Sometimes you don't need a bigger model — you need cleaner input.&lt;/strong&gt; Felt that one today.&lt;/p&gt;

&lt;h3&gt;
  
  
  The local-LLM angle
&lt;/h3&gt;

&lt;p&gt;Feeding 12 months of raw credit card data to an AI without a single byte going to the cloud — it was surprisingly stress-free. This is one of the spots local LLMs really shine. If you have personal data, or anything you'd rather not send to the cloud, this is the place for it.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SCP transfer to the DGX (mDNS, no IP needed)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync auto-configures a Host alias in &lt;code&gt;~/AppData/Local/NVIDIA Corporation/Sync/config/ssh_config&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host spark-XXXX.local
  Hostname spark-XXXX.local
  User [user]
  Port 22
  IdentityFile "...\\nvsync.key"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Which means I can SSH/SCP using &lt;code&gt;spark-XXXX.local&lt;/code&gt; without ever looking up an IP. The &lt;code&gt;.local&lt;/code&gt; suffix uses mDNS (Multicast DNS) for hostname resolution within the LAN.&lt;/p&gt;

&lt;p&gt;Transfer command (one line, from PowerShell on the Windows side):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;scp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C:\Users\[user]\Desktop\docs\dgx\csv"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;spark-XXXX.local:/home/&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="nx"&gt;/personal/dgx-100-experiments/private-data/credit-card-csv&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ol start="2"&gt;
&lt;li&gt;Ollama install + the sudo-TTY catch + GPU detection log&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ollama install:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Running this through Claude Code's Bash, it errored at the sudo password prompt — an interactive TTY is required there:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo: a terminal is required to read the password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Reopened a separate SSH session, ran the same command manually, and it went through.&lt;/p&gt;

&lt;p&gt;Once installed, systemd auto-starts the service. The GPU detection log via &lt;code&gt;journalctl -u ollama&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inference compute id=GPU-986c194b... name=CUDA0 description="NVIDIA GB10"
total="121.7 GiB" available="79.0 GiB"
default_num_ctx=262144
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;VRAM (DGX Spark unified memory): &lt;strong&gt;121.7 GiB&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Default context: &lt;strong&gt;262,144 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared with a typical RTX 4090 (24 GB VRAM, 8K–32K default context), the gap is significant.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Loading both models simultaneously&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:7b   &lt;span class="c"&gt;# 4.7 GB&lt;/span&gt;
ollama pull qwen2.5:72b  &lt;span class="c"&gt;# 47 GB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After loading both, &lt;code&gt;ollama ps&lt;/code&gt; shows:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME           SIZE      PROCESSOR    CONTEXT    
qwen2.5:72b    61 GB     100% GPU     32768
qwen2.5:7b     8.2 GB    100% GPU     32768
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Total ~69 GB used out of 79 GB available. Both models stay resident, switching between them is instant.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Custom CSV parser for the credit card data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three quirks needed handling: CP932 encoding, no quotes (commas in some merchant names break parsing), and a trailing summary row in each file.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# skip blank/summary rows
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;merchant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;merchant&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;fields&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_one&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cp932&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# skip header (cardholder metadata)
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COLUMNS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用日&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y/%m/%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;利用金額&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
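Once each monthly file parses, a year of statements is just a concat plus a group-by. Here's a minimal sketch of the aggregation step (my own helper, not the article's code; column names follow the statement CSVs above: 利用日 = transaction date, 利用金額 = amount in yen):

```python
import pandas as pd

def monthly_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Sum spending per calendar month, ready to paste into an LLM prompt."""
    return (
        df.groupby(df["利用日"].dt.to_period("M"))["利用金額"]
        .sum()
        .rename("合計")  # "total"
        .reset_index()
    )

# Tiny example with two fabricated transactions:
df = pd.DataFrame(
    {
        "利用日": pd.to_datetime(["2025/04/01", "2025/04/15"], format="%Y/%m/%d"),
        "利用金額": [1200, 800],
    }
)
print(monthly_summary(df))  # one row: 2025-04, 2000
```

Feeding the model this compact table instead of raw statement lines keeps the prompt well under the context limit.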
&lt;ol start="2"&gt;
&lt;li&gt;Japanese fonts in matplotlib&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;japanize-matplotlib&lt;/code&gt; doesn't work on Python 3.12 — it imports &lt;code&gt;distutils&lt;/code&gt;, which was removed from the standard library.&lt;/p&gt;

&lt;p&gt;The modern replacement is &lt;code&gt;matplotlib-fontja&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib-fontja
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib_fontja&lt;/span&gt;  &lt;span class="c1"&gt;# noqa: F401  ← just importing it sets up IPAexGothic
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
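With the import in place, Japanese axis labels render instead of tofu boxes. A quick sketch with made-up numbers (the try/except fallback is my addition so the snippet still runs where the package isn't installed):

```python
import matplotlib

matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

try:
    import matplotlib_fontja  # noqa: F401  (registers IPAexGothic when available)
except ImportError:
    pass  # falls back to the default font; Japanese glyphs may show as boxes

months = ["1月", "2月", "3月"]       # illustrative data, not my real statements
totals = [52000, 48000, 61000]

fig, ax = plt.subplots()
ax.bar(months, totals)
ax.set_title("月別利用金額")  # monthly spend
ax.set_ylabel("円")          # yen
fig.savefig("monthly.png")
```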
&lt;ol start="3"&gt;
&lt;li&gt;Calling Ollama from Python&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The official &lt;code&gt;ollama&lt;/code&gt; Python client is straightforward:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen2.5:72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Streaming makes long generation easier to watch unfold.&lt;/p&gt;
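If you also want the full text afterwards (say, to save the analysis to a file), the chunks can be accumulated while printing. A small helper sketch; it only assumes chunks shaped like the ones the ollama client yields:

```python
def collect_stream(stream) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in stream:
        piece = chunk["message"]["content"]
        print(piece, end="", flush=True)
        parts.append(piece)
    return "".join(parts)

# Works with any iterable of ollama-style chunks:
fake = [{"message": {"content": "Hello, "}}, {"message": {"content": "world"}}]
text = collect_stream(fake)  # prints "Hello, world" and returns it
```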




&lt;h2&gt;
  
  
  Tomorrow: Day 4
&lt;/h2&gt;

&lt;p&gt;Day 4 plan: &lt;strong&gt;let a local AI sort 20,000 iPhone photos&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The actual goal is to have a local image-recognition model (CLIP family?) clean up my photo library so I can stop paying iCloud for storage upgrades...!&lt;/p&gt;




&lt;h1&gt;
  
  
  #100ExperimentsWithDGX #LocalLLM #Ollama
&lt;/h1&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>ollama</category>
    </item>
    <item>
      <title>[Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Tue, 05 May 2026 00:06:00 +0000</pubDate>
      <link>https://forem.com/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</link>
      <guid>https://forem.com/peppercorn_llm/day-2-i-trained-an-ai-on-22-photos-of-my-cat-now-it-draws-her-in-any-scene-3a92</guid>
      <description>&lt;h1&gt;
  
  
  [Day 2] I Trained an AI on 22 Photos of My Cat — Now It Draws Her in Any Scene
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So, yesterday I generated "some cat"
&lt;/h2&gt;

&lt;p&gt;Day 1 ended with "I made my DGX draw a cat" — but the cat that came out was just "a cat from somewhere". Today, the goal is to teach the AI about my actual cat (who's currently being looked after at my parents' place back in Japan).&lt;/p&gt;

&lt;p&gt;This is what people call LoRA training.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LoRA: A technique that teaches an AI model "specific features" using a small set of images, without touching the base model itself. Apparently. The output is a small "diff" file (tens of MB).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is experiment #2.&lt;/p&gt;




&lt;h2&gt;
  
  
  The training data
&lt;/h2&gt;

&lt;p&gt;Source material: 22 photos of my cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9tmru213ymne73f61pv.jpg" alt="Training photo collage" width="800" height="1058"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a mix of angles — front-facing, full body, sleepy poses, varying lighting — to give the AI a fair shot at recognizing the cat's defining features (tuxedo black-and-white pattern, white socks, the black smudge on the nose).&lt;/p&gt;




&lt;h2&gt;
  
  
  Training pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Pre-processing
&lt;/h3&gt;

&lt;p&gt;iPhone HEIC files don't work directly with most AI tools, so the first step is converting them to JPG. 10 of the 22 were HEIC.&lt;/p&gt;

&lt;p&gt;Then resize to 512px on the short side for training. &lt;strong&gt;This is where I tripped over a sneaky bug&lt;/strong&gt; — details in the collapsible section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Captions
&lt;/h3&gt;

&lt;p&gt;Every image gets a text description like "ohwx cat, sitting on a wooden floor, indoor, soft lighting". The four-letter &lt;code&gt;ohwx&lt;/code&gt; is a meaningless token that becomes the trigger word for "my specific cat" after training.&lt;/p&gt;

&lt;p&gt;Drafting 22 captions by hand would be tedious — but Claude can read images directly, so it drafted them while I just reviewed. The accuracy was uncanny. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnmzgmz033je98xlpyilb.jpg" alt="Cat on a kitchen counter" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, walking on a metal kitchen counter, side profile, indoor kitchen with spice bottles and shelves in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7nm6622l9qfb0py7z2ag.jpg" alt="Mid-yawn cat" width="512" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, in a loaf pose on a gray carpet, mouth open showing teeth, mid-yawn, indoor with shelves and warm lights in the background&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fisxhfub5fudklbriga1p.jpg" alt="Cat by a window" width="683" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ohwx cat, sitting on a wooden floor by a balcony window, viewed from behind, sharp sunlight casting long shadows, indoor&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sugoi (amazing).&lt;/p&gt;
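For the curious: sd-scripts picks captions up as same-named .txt files next to the images (the caption_extension setting in dataset.toml below). A sketch of how the reviewed captions land on disk; file names and the temp directory here are illustrative:

```python
import tempfile
from pathlib import Path

def write_captions(image_dir: Path, captions: dict[str, str]) -> None:
    """Write one .txt caption next to each image file (IMG_0001.jpg -> IMG_0001.txt)."""
    for name, caption in captions.items():
        (image_dir / name).with_suffix(".txt").write_text(caption, encoding="utf-8")

captions = {
    "IMG_0001.jpg": "ohwx cat, sitting on a wooden floor, indoor, soft lighting",
    "IMG_0002.jpg": "ohwx cat, in a loaf pose on a gray carpet, mid-yawn",
}

out_dir = Path(tempfile.mkdtemp())
write_captions(out_dir, captions)
```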

&lt;h3&gt;
  
  
  3. Kohya_ss training
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Kohya_ss&lt;/code&gt; is the de-facto LoRA training tool. Set up a TOML config, run one command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;accelerate launch train_network.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--config_file&lt;/span&gt; configs/train.toml &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--dataset_config&lt;/span&gt; configs/dataset.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Training logs scroll by, and the loss value gradually drops. Lower loss = the model is learning, apparently.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Done
&lt;/h3&gt;

&lt;p&gt;1100 steps in 13 minutes 3 seconds on the DGX Spark.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 1: just typing "ohwx cat" gives me my cat
&lt;/h2&gt;

&lt;p&gt;The first thing I tried was a "without LoRA vs with LoRA" comparison. Same prompt — "ohwx cat as a chef in a kitchen, ..." — first without the LoRA, then with it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzfjjk84dv92xns5nt3v.jpg" alt="Without (left) vs With (right) LoRA" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Left: no LoRA. Right: with LoRA.&lt;/p&gt;

&lt;p&gt;Without LoRA, &lt;code&gt;ohwx&lt;/code&gt; is gibberish to the model, so it's ignored and only "a chef in a kitchen" carries weight. Result: a human chef. A nice woman cooking in a pink kitchen.&lt;/p&gt;

&lt;p&gt;With LoRA, &lt;code&gt;ohwx&lt;/code&gt; becomes a real token that points at my cat. Same prompt, but now my cat is the chef.&lt;/p&gt;

&lt;p&gt;This was the moment that hit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Result 2: novel scene reproduction
&lt;/h2&gt;

&lt;p&gt;The training set has no photo of the cat sitting on a wooden floor in this exact composition. So I tried it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fngsfgj9etl9pv39axg2z.png" alt="My cat sitting on a wooden floor" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;White socks: present. Nose smudge: present.&lt;/p&gt;




&lt;h2&gt;
  
  
  My cat, in places she's never been
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;ohwx cat&lt;/code&gt; in various scenes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sunny balcony
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxbhlskto67vvbgmx2hdm.png" alt="Cat on a sunny balcony" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cozy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chef (reprise)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2awj3mdsj8u788bedxhl.png" alt="Cat as a chef" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chef hat fits suspiciously well. Cooking ability unverified.&lt;/p&gt;

&lt;h3&gt;
  
  
  Autumn forest
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1l3r9hwnvh3kqk8qc1n.png" alt="Cat in an autumn forest" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A painterly take.&lt;/p&gt;

&lt;h3&gt;
  
  
  Astronaut
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frps74rdecajbrews1tz4.png" alt="Cat as an astronaut" width="512" height="768"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A doppelgänger via the helmet glass — but sci-fi all the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Today's takeaway
&lt;/h2&gt;

&lt;p&gt;"Build your own AI from your own data" turned out to be way more accessible than I'd assumed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (Claude explains)
&lt;/h2&gt;

&lt;p&gt;The technical bits, written up by my AI pair.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HEIC → JPG conversion and the EXIF orientation trap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Reading iPhone HEIC files in Python is straightforward with &lt;code&gt;pillow-heif&lt;/code&gt;. JPG conversion is a few lines:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pillow_heif&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register_heif_opener&lt;/span&gt;
&lt;span class="nf"&gt;register_heif_opener&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.HEIC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;oriented&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ImageOps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exif_transpose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;img&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# ← critical line
&lt;/span&gt;    &lt;span class="n"&gt;rgb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oriented&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMG_1234.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  What I tripped on
&lt;/h3&gt;

&lt;p&gt;My first version skipped &lt;code&gt;ImageOps.exif_transpose()&lt;/code&gt;. Result: 8 of 22 photos came out rotated 90° in the resized output.&lt;/p&gt;

&lt;p&gt;iPhones save portrait shots with the actual pixels stored landscape-ways, plus an EXIF Orientation tag saying "rotate 90° on display". Pillow's default &lt;code&gt;Image.open()&lt;/code&gt; ignores that tag — you have to call &lt;code&gt;exif_transpose()&lt;/code&gt; explicitly.&lt;/p&gt;

&lt;p&gt;Caught it before training started. If I hadn't, the LoRA would have learned "sideways cat" and generation would be weird.&lt;/p&gt;
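The resize step itself is a few lines once the orientation fix is baked into the helper so it can't be forgotten. A sketch (my own helper, not the article's script) that scales to 512 on the short side:

```python
from PIL import Image, ImageOps

def resize_short_side(img: Image.Image, target: int = 512) -> Image.Image:
    """EXIF-correct the orientation, then scale so the short side equals target."""
    img = ImageOps.exif_transpose(img)  # apply the Orientation tag first
    w, h = img.size
    scale = target / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

# A 3024x4032 portrait shot becomes 512x683:
src = Image.new("RGB", (3024, 4032))
print(resize_short_side(src).size)  # → (512, 683)
```

Doing the transpose inside the same function is the point: every image that passes through is guaranteed upright before training ever sees it.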

&lt;ol start="2"&gt;
&lt;li&gt;Kohya_ss setup on ARM64 (DGX Spark)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are two repos commonly referred to as "Kohya_ss":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bmaltais/kohya_ss&lt;/code&gt; — GUI wrapper, xformers dependency (clashes with ARM64)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kohya-ss/sd-scripts&lt;/code&gt; — the actual training engine, CLI/TOML driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DGX Spark is ARM64, so I went with the latter:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 https://github.com/kohya-ss/sd-scripts.git ~/Kohya_ss
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/Kohya_ss
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
pip &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu128
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;DGX Spark uses CUDA 12.8 + ARM64 (sbsa), so the PyTorch &lt;code&gt;cu128&lt;/code&gt; channel works directly. Surprisingly painless.&lt;/p&gt;
&lt;h3&gt;
  
  
  Training config (TOML)
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# train.toml (excerpt)&lt;/span&gt;
&lt;span class="py"&gt;pretrained_model_name_or_path&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../Realistic_Vision_V6.0_NV_B1.safetensors"&lt;/span&gt;
&lt;span class="py"&gt;vae&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".../vae-ft-mse-840000-ema-pruned.safetensors"&lt;/span&gt;

&lt;span class="py"&gt;network_module&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"networks.lora"&lt;/span&gt;
&lt;span class="py"&gt;network_dim&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;
&lt;span class="py"&gt;network_alpha&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;

&lt;span class="py"&gt;optimizer_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"AdamW8bit"&lt;/span&gt;
&lt;span class="py"&gt;unet_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt;
&lt;span class="py"&gt;text_encoder_lr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5e-5&lt;/span&gt;
&lt;span class="py"&gt;lr_scheduler&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"cosine_with_restarts"&lt;/span&gt;

&lt;span class="py"&gt;max_train_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="py"&gt;save_every_n_epochs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

&lt;span class="py"&gt;mixed_precision&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"bf16"&lt;/span&gt;
&lt;span class="py"&gt;sdpa&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="py"&gt;cache_latents&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# dataset.toml&lt;/span&gt;
&lt;span class="nn"&gt;[general]&lt;/span&gt;
&lt;span class="py"&gt;shuffle_caption&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="py"&gt;caption_extension&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;".txt"&lt;/span&gt;
&lt;span class="py"&gt;keep_tokens&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="nn"&gt;[[datasets]]&lt;/span&gt;
&lt;span class="py"&gt;resolution&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;
&lt;span class="py"&gt;batch_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;enable_bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nn"&gt;[[datasets.subsets]]&lt;/span&gt;
  &lt;span class="py"&gt;image_dir&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"/path/to/cat-photos-512"&lt;/span&gt;
  &lt;span class="py"&gt;num_repeats&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;22 photos × 10 repeats × 10 epochs ÷ batch 2 = 1100 steps. 13 minutes.&lt;/p&gt;

&lt;p&gt;Base model: Realistic Vision V6.0 B1 noVAE (a photo-realistic SD 1.5 derivative). External VAE: sd-vae-ft-mse-original. The combination is good at fur detail.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Hitting the ComfyUI HTTP API for batch generation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Clicking through the GUI for one image at a time gets old fast. ComfyUI exposes an HTTP API that's easy to drive from Python — &lt;code&gt;urllib.request&lt;/code&gt; from the standard library is enough (no extra deps).&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;COMFY_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COMFY_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The workflow is ComfyUI's API format (a dict of node IDs with their connections). To use a LoRA, insert a &lt;code&gt;LoraLoader&lt;/code&gt; node between the checkpoint loader and KSampler.&lt;/p&gt;
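&lt;p&gt;As a rough sketch of that splice in Python: the node IDs (&lt;code&gt;"4"&lt;/code&gt; for the checkpoint loader, &lt;code&gt;"3"&lt;/code&gt; for the KSampler, &lt;code&gt;"10"&lt;/code&gt; for the new node) and the LoRA filename below are assumptions; check the IDs in your own exported API-format JSON.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: splice a LoraLoader into an API-format workflow dict.
# Node IDs and the LoRA filename are assumptions -- match them to yours.
workflow = {
    "4": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "v1-5-pruned-emaonly-fp16.safetensors"}},
    "3": {"class_type": "KSampler",
          "inputs": {"model": ["4", 0], "seed": 0, "steps": 20}},
}

workflow["10"] = {
    "class_type": "LoraLoader",
    "inputs": {
        "lora_name": "my_cat_lora.safetensors",  # hypothetical file
        "strength_model": 0.8,
        "strength_clip": 0.8,
        "model": ["4", 0],  # checkpoint's model output
        "clip": ["4", 1],   # checkpoint's CLIP output
    },
}
workflow["3"]["inputs"]["model"] = ["10", 0]  # KSampler now sees the LoRA'd model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;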

&lt;p&gt;DGX Spark generates one 512×768 image in about 3 seconds. With seed/strength/prompt parametrized in a script, all 12 grid images came out in under a minute.&lt;/p&gt;
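&lt;p&gt;That parametrization can look something like this. The node IDs are again assumptions, and the commented-out line marks where &lt;code&gt;queue_prompt&lt;/code&gt; and &lt;code&gt;wait_for_history&lt;/code&gt; from the snippet above would submit each variant:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import copy
import itertools

# Minimal stand-in workflow; a real one has more nodes. Node IDs "3"
# (KSampler) and "10" (LoraLoader) are assumptions -- match yours.
base = {
    "3": {"class_type": "KSampler", "inputs": {"seed": 0}},
    "10": {"class_type": "LoraLoader", "inputs": {"strength_model": 0.8}},
}

variants = []
for seed, strength in itertools.product(range(4), [0.4, 0.6, 0.8]):
    wf = copy.deepcopy(base)  # deepcopy so edits don't leak between runs
    wf["3"]["inputs"]["seed"] = seed
    wf["10"]["inputs"]["strength_model"] = strength
    variants.append(wf)
    # wait_for_history(queue_prompt(wf))  # submit with the helpers above

print(len(variants))  # 12 = 4 seeds x 3 strengths
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;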




&lt;h2&gt;
  
  
  Tomorrow: Day 3
&lt;/h2&gt;

&lt;p&gt;Day 3 plan: have a local AI analyze my credit card history.&lt;/p&gt;

&lt;p&gt;The kind of data I'd rather not send to a cloud AI, but absolutely want to understand. Quintessential local-AI territory.&lt;/p&gt;




&lt;h1&gt;
  
  
  #100ExperimentsWithDGX #LocalLLM
&lt;/h1&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>lora</category>
    </item>
    <item>
      <title>[Day 1] DGX Spark Came Home — I Made It Draw a Cat</title>
      <dc:creator>PEPPERCORN</dc:creator>
      <pubDate>Mon, 04 May 2026 03:20:48 +0000</pubDate>
      <link>https://forem.com/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</link>
      <guid>https://forem.com/peppercorn_llm/day-1-dgx-spark-came-home-i-made-it-draw-a-cat-30f7</guid>
      <description>&lt;h1&gt;
  
  
  [Day 1] DGX Spark Came Home — I Made It Draw a Cat
&lt;/h1&gt;

&lt;h2&gt;
  
  
  So... what is "local LLM" again?
&lt;/h2&gt;

&lt;p&gt;Honestly, I'm still figuring out what "local LLM" even means. But somehow, through a series of decisions I won't fully justify here, I ended up buying an NVIDIA DGX Spark — and now it's sitting in my house.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DGX Spark: NVIDIA's "supercomputer for the home" — a small but seriously expensive box with the latest-gen AI chip inside. Apparently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I really want to figure out is: when should I use local AI vs. cloud AI? Reading articles about it doesn't seem to help, so I'm going full hands-on. Goal: 100 experiments, one per day-ish, until I have an evidence-based answer.&lt;/p&gt;

&lt;p&gt;This is experiment #1.&lt;/p&gt;




&lt;h2&gt;
  
  
  First, the hardware
&lt;/h2&gt;

&lt;p&gt;So this is what showed up at my door — solidly packed in a sturdy cardboard box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetwrk2l7jv4q4qg6387t.jpg" alt="DGX Spark box" width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I opened it, I was surprised at how small it actually is. "This is the AI machine?" kind of small.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqq8k08mii298z3qtorrr.jpg" alt="DGX Spark hardware (mesh sides)" width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Boot up → Initial OS setup
&lt;/h2&gt;

&lt;p&gt;Power on, and an Ubuntu-based DGX OS 7.5.0 boots up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Welcome screen
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn3ibxlwj605xrrk9zfi.jpg" alt="Get started screen" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Get started" — yes, please.&lt;/p&gt;

&lt;h3&gt;
  
  
  Language and timezone
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkivl91tfq5u5wv55omq2.jpg" alt="Language and timezone" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Standard Linux installer territory — same as Ubuntu?&lt;/p&gt;

&lt;h3&gt;
  
  
  Privacy settings
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v6rcegdhblxy68puwzv.jpg" alt="Privacy settings" width="800" height="755"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagnostic data sharing prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  System update
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaogr37pvykiys42avjz.jpg" alt="Update started" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The moment I plugged it in, it started updating itself. Modern Linux being Linux.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup complete
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fho3zwtbliqbvvabytjwt.jpg" alt="Setup complete" width="712" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I picked a username and let the hostname auto-assign. DGX-side prep done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting from my Windows PC
&lt;/h2&gt;

&lt;p&gt;Plugging a monitor into the DGX every time would be tedious, so I want to SSH in from my regular Windows machine (which I've nicknamed "myPC1").&lt;/p&gt;

&lt;p&gt;NVIDIA provides a desktop app called NVIDIA Sync that's supposed to make SSH setup painless. So I install it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq5uezyis10h8hluz2xb0.jpg" alt="NVIDIA Sync install" width="643" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…and that's where I fell into a trap big-time. Windows OpenSSH refused to connect with a "your SSH config has weird permissions, can't trust it" error.&lt;/p&gt;

&lt;p&gt;Full troubleshooting steps are in the collapsible "Tech details" section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inside the DGX, finally
&lt;/h2&gt;

&lt;p&gt;After much wrestling, I made it inside. Here's the rough lay of the land:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPU&lt;/td&gt;
&lt;td&gt;NVIDIA GB10 Grace Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;128GB (unified between CPU and GPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;4TB SSD (basically empty)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;20 cores (perf + efficiency combo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Idle power&lt;/td&gt;
&lt;td&gt;4W (yes, four)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;128GB of memory is apparently 8–16x what's in a typical laptop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up image generation → 🐱
&lt;/h2&gt;

&lt;p&gt;This is the main event. I'm setting up ComfyUI to generate the first cat from this DGX.&lt;/p&gt;

&lt;p&gt;The ComfyUI interface looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F53sxhno991z6dni6ozdv.jpg" alt="ComfyUI connected" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The "boxes connected by cables" view is intimidating at first, but the default workflow is pre-wired. You just type a prompt and hit Queue Prompt.&lt;/p&gt;

&lt;p&gt;So:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a cute fluffy cat sitting on a sunny windowsill, photorealistic, high detail, beautiful lighting, soft fur, cinematic, masterpiece, best quality&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A few seconds later...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flj7bnj1taf0oplft2itq.png" alt="ComfyUI cat 1" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐱 There it is — the very first cat my DGX has ever drawn!&lt;/p&gt;

&lt;p&gt;Tweaked the prompt and made some more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5lb3ptrj1ue8cdejtno.png" alt="ComfyUI cat 2" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Eyes a bit unsettling but yeah, fluffy cat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frrkiqc4px9sjeys77cms.png" alt="ComfyUI cat 3" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going a touch dark there.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6e77vf9oc2i6hem4fdm.png" alt="ComfyUI cat 4" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;…is this a cat? It feels artistic though.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhwdpwdcz28b6dsmhsz51.png" alt="ComfyUI cat 5" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distinctive composition.&lt;/p&gt;

&lt;p&gt;Each masterpiece takes a few to a dozen seconds. That speed means I can iterate on prompts without thinking about cost — which turned out to be quite addictive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech details (let the AI explain it)
&lt;/h2&gt;

&lt;p&gt;The rest is the technical stuff. Read on if you're curious.&lt;/p&gt;

&lt;p&gt;I'm a non-engineer poking at this stuff for the first time, so I had Claude (my AI pair programmer for this challenge) write up the technical details. Hopefully useful for anyone walking the same path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How to actually get SSH working on Windows&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NVIDIA Sync should generate an SSH keypair, register the public key on the DGX side at &lt;code&gt;~/.ssh/authorized_keys&lt;/code&gt;, and let you connect without a password.&lt;/p&gt;

&lt;p&gt;If it doesn't work, the cause is usually permissions on Windows SSH config files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Symptom
&lt;/h3&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ssh spark-XXXX.local
Bad permissions. Try removing permissions for user: [PC]\CodexSandboxUsers
on file C:/Users/[user]/.ssh/config.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;If you've installed Codex CLI or similar sandboxing tools in the past, the &lt;code&gt;[PC]\CodexSandboxUsers&lt;/code&gt; group may have inherited permissions on &lt;code&gt;~/.ssh/&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Fix (run from an elevated PowerShell)
&lt;/h3&gt;

&lt;p&gt;Use environment variables to avoid hard-coding your username/PC name.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Take ownership&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;takeown&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/f&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/grant:r&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;:F"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="c"&gt;# Disable inheritance and remove the bad user&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;USERPROFILE&lt;/span&gt;&lt;span class="s2"&gt;\.ssh\config"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Use &lt;code&gt;/inheritance:d&lt;/code&gt; rather than &lt;code&gt;/inheritance:r&lt;/code&gt; — &lt;code&gt;:r&lt;/code&gt; strips all existing permissions and locks you out of your own file.&lt;/p&gt;
&lt;h3&gt;
  
  
  NVIDIA Sync's internal config files need the same treatment
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;~/.ssh/config&lt;/code&gt; &lt;code&gt;Include&lt;/code&gt;s an NVIDIA Sync config file, and that one inherits the same problem.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\nvsync.key"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/inheritance:d&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;icacls&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$key&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/remove&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;COMPUTERNAME&lt;/span&gt;&lt;span class="s2"&gt;\CodexSandboxUsers"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Ghost SIDs that icacls can't remove
&lt;/h3&gt;

&lt;p&gt;If you have SIDs from deleted user accounts lingering, &lt;code&gt;icacls /remove&lt;/code&gt; won't touch them. You need PowerShell ACL manipulation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;LOCALAPPDATA&lt;/span&gt;&lt;span class="s2"&gt;\NVIDIA Corporation\Sync\config\ssh_config"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Get-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Access&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Where-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IdentityReference&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Value&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-like&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S-1-5-*"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-and&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;IdentityReference&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Translate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;System.Security.Principal.NTAccount&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-isnot&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;System.Security.Principal.NTAccount&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nv"&gt;$badRules&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ForEach-Object&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;RemoveAccessRule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;$_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Out-Null&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="n"&gt;Set-Acl&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-Path&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;-AclObject&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;$acl&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;After this, &lt;code&gt;ssh spark-XXXX.local&lt;/code&gt; connects on the first try (replace XXXX with your hostname).&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Commands to check DGX specs&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# GPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;nvidia-smi
NVIDIA-SMI 580.142    Driver Version: 580.142    CUDA Version: 13.0
GPU 0: NVIDIA GB10    36C    P8    4W / N/A

&lt;span class="c"&gt;# OS&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
Linux spark-XXXX 6.17.0-1014-nvidia ... aarch64 GNU/Linux

&lt;span class="c"&gt;# Memory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;free &lt;span class="nt"&gt;-h&lt;/span&gt;
Mem: 121Gi  2.6Gi  118Gi

&lt;span class="c"&gt;# Storage&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt;
/dev/nvme0n1p2  3.7T  47G  3.5T  2%  /

&lt;span class="c"&gt;# CPU&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;lscpu
Architecture:  aarch64
CPU&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:        20
Model name:    Cortex-X925 + Cortex-A725
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Notable bits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CUDA 13.0 (latest)&lt;/li&gt;
&lt;li&gt;aarch64 (ARM64) architecture — yes, the DGX is ARM&lt;/li&gt;
&lt;li&gt;121Gi (≈128GB) unified memory&lt;/li&gt;
&lt;li&gt;20 cores in big.LITTLE layout (10 perf + 10 efficient)&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;ComfyUI installation steps&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Following the official NVIDIA &lt;a href="https://build.nvidia.com/spark" rel="noopener noreferrer"&gt;Comfy UI playbook&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Virtual environment&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv comfyui-env
&lt;span class="nb"&gt;source &lt;/span&gt;comfyui-env/bin/activate

&lt;span class="c"&gt;# PyTorch with CUDA 13.0&lt;/span&gt;
pip3 &lt;span class="nb"&gt;install &lt;/span&gt;torch torchvision &lt;span class="nt"&gt;--index-url&lt;/span&gt; https://download.pytorch.org/whl/cu130

&lt;span class="c"&gt;# ComfyUI itself&lt;/span&gt;
git clone https://github.com/comfyanonymous/ComfyUI.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ComfyUI
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt

&lt;span class="c"&gt;# Model (SD 1.5, ~2GB)&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;models/checkpoints/
wget https://huggingface.co/Comfy-Org/stable-diffusion-v1-5-archive/resolve/main/v1-5-pruned-emaonly-fp16.safetensors

&lt;span class="c"&gt;# Launch server&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/ComfyUI
python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Key packages installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;torch 2.11.0+cu130&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cuDNN 9.19&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NCCL 2.28&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;transformers 5.7.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;comfyui-frontend-package 1.42.15&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Open &lt;code&gt;http://spark-XXXX.local:8188&lt;/code&gt; from your Windows PC's browser to access ComfyUI (XXXX is your hostname).&lt;/p&gt;
&lt;h3&gt;
  
  
  Download speed
&lt;/h3&gt;

&lt;p&gt;The 2GB model came down at 40.6 MB/s from HuggingFace's CDN, finishing in 50 seconds. That's roughly 325 Mbps, about a third of my home 1Gbps line.&lt;/p&gt;
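&lt;p&gt;Sanity-checking those numbers (plain arithmetic, nothing more):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Back-of-the-envelope check on the download rate.
size_mb, seconds = 2000, 50            # ~2GB model, 50s wall clock
rate_mb_s = size_mb / seconds          # 40.0 MB/s, matching the ~40.6 observed
rate_mbps = rate_mb_s * 8              # 320.0 Mbit/s, about a third of 1 Gbps
print(rate_mb_s, rate_mbps)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;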




&lt;h2&gt;
  
  
  Tomorrow: Day 2
&lt;/h2&gt;

&lt;p&gt;Day 2 plan: Train a LoRA on photos of my actual cat.&lt;/p&gt;

&lt;p&gt;Today's SD 1.5 only knows "some cat from somewhere". With LoRA fine-tuning, I should be able to teach it about my specific cat. That kind of personalization feels like the killer feature of running locally.&lt;/p&gt;




&lt;h1&gt;
  
  
  #100ExperimentsWithDGX #LocalLLM
&lt;/h1&gt;

</description>
      <category>localllm</category>
      <category>ai</category>
      <category>dgxspark</category>
      <category>comfyui</category>
    </item>
  </channel>
</rss>
