<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sam Hartley</title>
    <description>The latest articles on Forem by Sam Hartley (@samhartley_dev).</description>
    <link>https://forem.com/samhartley_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3811539%2Fdd554e30-699d-42a3-a82a-77673790a186.png</url>
      <title>Forem: Sam Hartley</title>
      <link>https://forem.com/samhartley_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/samhartley_dev"/>
    <language>en</language>
    <item>
      <title>I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sat, 09 May 2026 08:05:49 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-38oj</link>
      <guid>https://forem.com/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-38oj</guid>
      <description>&lt;h1&gt;
  
  
  I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings
&lt;/h1&gt;

&lt;p&gt;A few weeks ago I wrote about renting out my single RTX 3060 on Vast.ai for passive income. The experiment worked better than I expected, so I did what any reasonable person would do: I went and dug out the five other GPUs sitting in my storage room.&lt;/p&gt;

&lt;p&gt;This is the honest follow-up. What actually happened when I went from 1 GPU to 6.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backstory
&lt;/h2&gt;

&lt;p&gt;I had a bunch of GPUs from an older setup — two RTX 3070s, one RTX 3080, and two more RTX 3060s. They were collecting dust. The PC they came from got upgraded, the cards went into cardboard boxes, the boxes went under a shelf.&lt;/p&gt;

&lt;p&gt;Total VRAM across all six: around 62GB. Combined retail value when new: probably $3,000+. Income generated while sitting in boxes: $0/month.&lt;/p&gt;

&lt;p&gt;The math wasn't complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Expansion Actually Took
&lt;/h2&gt;

&lt;p&gt;Here's what I underestimated: it's not just "plug cards in, profit."&lt;/p&gt;

&lt;h3&gt;
  
  
  The hardware side
&lt;/h3&gt;

&lt;p&gt;You can't just stack 6 GPUs into a regular PC case. I had to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PCIe slots and bandwidth.&lt;/strong&gt; A standard ATX board has maybe 2-3 real x16 slots. For 6 cards, you're looking at risers, which means a mining-style open frame or a server chassis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power.&lt;/strong&gt; Each card pulls 150-250W under load, so six cards means 900-1,500W in GPU draw alone — before the CPU, drives, and RAM. My existing 850W PSU was not going to cut it (rough budget math after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooling.&lt;/strong&gt; Cards in a tight case thermal-throttle each other. Open frame was the answer.&lt;/li&gt;
&lt;/ul&gt;
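
&lt;p&gt;For a sanity check, here's the back-of-the-envelope math I mean — a sketch in Python, where the base-system draw and the PSU headroom factor are my assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-the-envelope PSU sizing using the per-card draw range above
n_cards = 6
low_w, high_w = 150, 250                      # watts per card under load (from above)
gpu_w = (n_cards * low_w, n_cards * high_w)   # (900, 1500)
base_w = 150        # assumed: CPU + drives + RAM + fans
headroom = 1.25     # assumed: keep PSUs at ~80% load
psu_w = tuple(int((g + base_w) * headroom) for g in gpu_w)
print(f"GPU draw {gpu_w[0]}-{gpu_w[1]}W, PSU capacity needed: {psu_w[0]}-{psu_w[1]}W")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;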

&lt;p&gt;I ended up using an open-air mining frame I found used for cheap, two PSUs daisy-chained (a sketchy-but-common approach in the mining world), and PCIe risers.&lt;/p&gt;

&lt;p&gt;Setup time: about a full weekend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The software side
&lt;/h3&gt;

&lt;p&gt;Getting all six cards recognized wasn't plug-and-play either. I run Windows on the main PC (easier driver support for NVIDIA), and Vast.ai has a Windows daemon that mostly works — except when it doesn't.&lt;/p&gt;

&lt;p&gt;A few issues I hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two risers were flaky and caused cards to drop off&lt;/li&gt;
&lt;li&gt;One 3070 had a driver conflict until I did a clean DDU reinstall&lt;/li&gt;
&lt;li&gt;Vast.ai's host dashboard showed 5 GPUs after setup; it took me an hour to figure out why the sixth wasn't being detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total debugging time before everything was stable: another weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Earnings Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cards&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Weekly Earnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before (1 card)&lt;/td&gt;
&lt;td&gt;RTX 3060&lt;/td&gt;
&lt;td&gt;12GB&lt;/td&gt;
&lt;td&gt;~$12-18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After (6 cards)&lt;/td&gt;
&lt;td&gt;3060 × 3, 3070 × 2, 3080 × 1&lt;/td&gt;
&lt;td&gt;62GB&lt;/td&gt;
&lt;td&gt;~$65-95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Not exactly linear scaling. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demand is unpredictable.&lt;/strong&gt; Sometimes 4 of my 6 cards are rented simultaneously. Sometimes 1. The RTX 3080 gets picked up more often than the 3060s — despite having slightly less VRAM (10GB vs 12GB), its much higher memory bandwidth makes it noticeably faster for LLM inference jobs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not all hours are equal.&lt;/strong&gt; Utilization spikes during US business hours and drops off outside them. I'm in Turkey, so "overnight for me" overlaps with "peak US working hours" — the cards earn while I sleep, which actually helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing matters more than I thought.&lt;/strong&gt; I dropped my per-card price slightly and saw utilization go up noticeably. A few cents per hour makes a real difference when renters are comparing a dozen similar options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Monthly Run Rate
&lt;/h2&gt;

&lt;p&gt;Across all six cards, I'm averaging around &lt;strong&gt;$280-340/month&lt;/strong&gt; before electricity.&lt;/p&gt;

&lt;p&gt;Power costs are real. Six GPUs under load is serious wattage. My electricity bill went up — I haven't calculated the exact delta yet because my bill is shared (I'm not the only one using power in my building), but I'd estimate $40-60/month in additional costs.&lt;/p&gt;
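
&lt;p&gt;The $40-60 figure comes from napkin math like this — a sketch where the average draw and the $/kWh tariff are assumptions, not measurements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Electricity estimate (assumed numbers — swap in your own tariff)
avg_draw_kw = 0.45      # assumed average draw across rented cards, 24/7
hours_per_month = 720
price_per_kwh = 0.15    # assumed $/kWh
print(f"~${avg_draw_kw * hours_per_month * price_per_kwh:.0f}/month")  # ~$49
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;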

&lt;p&gt;&lt;strong&gt;Net: roughly $220-280/month in real passive income.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Is that life-changing? No. Is it meaningful for money that was doing nothing? Absolutely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with a proper open-frame rig, not a cobbled-together case.&lt;/strong&gt;&lt;br&gt;
The mining frame was cheap but took time to source. If I were doing this again I'd budget for it from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Get a proper high-wattage PSU setup.&lt;/strong&gt;&lt;br&gt;
Running two PSUs linked together works but it's inelegant. A server PSU with the right adapter is cleaner and safer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test each card individually before combining them.&lt;/strong&gt;&lt;br&gt;
I wasted time troubleshooting "which card is the problem" when I could've confirmed each one worked before building the full rig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set minimum job duration.&lt;/strong&gt;&lt;br&gt;
Short jobs (under an hour) rack up overhead — container spin-up time, handshaking — without much earnings. I set a minimum of 2 hours and earnings-per-hour improved.&lt;/p&gt;
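
&lt;p&gt;The overhead math is easy to see in a sketch (the listing price and spin-up time below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Spin-up overhead is unpaid time the card is still occupied
def effective_rate(price_per_hr, job_hours, overhead_hours):
    """Earnings per hour of wall-clock time the card is tied up."""
    return price_per_hr * job_hours / (job_hours + overhead_hours)

price, overhead = 0.12, 0.15   # assumed $/hr listing price, ~9 min spin-up/teardown
print(f"{effective_rate(price, 0.5, overhead):.3f}")  # 30-min job -&gt; $0.092/hr
print(f"{effective_rate(price, 2.0, overhead):.3f}")  # 2-hour job -&gt; $0.112/hr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;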

&lt;h2&gt;
  
  
  The Unexpected Part
&lt;/h2&gt;

&lt;p&gt;I expected this to be a boring passive income setup. It mostly is. But I've learned a surprising amount about how the AI inference market actually works by watching what gets rented and when.&lt;/p&gt;

&lt;p&gt;Most renters are running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning jobs (need sustained GPU hours)&lt;/li&gt;
&lt;li&gt;LLM inference (need VRAM more than raw compute)&lt;/li&gt;
&lt;li&gt;Image generation (FLUX, Stable Diffusion variants)&lt;/li&gt;
&lt;li&gt;Dev environments (people testing stuff without committing to a cloud contract)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Watching the demand patterns is actually interesting data about what the AI dev community is building right now. The 3080 almost always goes first — 10GB VRAM hits a sweet spot for smaller Llama and Mistral models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is It Worth It?
&lt;/h2&gt;

&lt;p&gt;Depends on your situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yes, if:&lt;/strong&gt; You already have the GPUs and they're sitting idle. The marginal cost of setting this up is mostly your time, and the monthly return is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maybe, if:&lt;/strong&gt; You'd have to buy the GPUs. At current used-market prices, payback period is 6-12 months depending on utilization. That's not terrible but it's not obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No, if:&lt;/strong&gt; You're renting out your daily-driver GPU. The rental platform can grab your card at inconvenient times. Keep at least one card reserved for your own use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm looking at listing the Ubuntu server I already have running as a CPU-only Vast.ai host for smaller workloads. Less money per unit but zero additional hardware cost.&lt;/p&gt;

&lt;p&gt;Also thinking about whether it makes sense to eventually get into the dedicated hosting side rather than the rental marketplace — more stable income, more setup required. Still researching.&lt;/p&gt;

&lt;p&gt;For now, 6 cards, ~$250/month net, and a weekend's worth of setup. I'll take it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the rig setup or Vast.ai specifics? Drop them in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;Check out my automation work on Fiverr&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow along on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>passiveincome</category>
      <category>selfhosted</category>
      <category>ai</category>
    </item>
    <item>
      <title>My 3-Machine AI Lab: How I Divide Work Between a Mac Mini, a Windows PC, and an Ubuntu Box</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Thu, 07 May 2026 08:06:27 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3gfi</link>
      <guid>https://forem.com/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3gfi</guid>
      <description>&lt;p&gt;I keep seeing posts about running AI on a single machine. "Just use Ollama on your laptop!" Sure, that works — until you want to run a 30B model while your IDE is indexing, your test suite is running, and you're editing a video.&lt;/p&gt;

&lt;p&gt;I have three machines. Not because I'm rich — because I kept old hardware alive and gave each one a job. Here's how I split AI workloads across a Mac Mini M4, a Windows PC with an RTX 3060, and an old Ubuntu box, and why one machine was never enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Machines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Specs&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;10-core, 16GB RAM&lt;/td&gt;
&lt;td&gt;Orchestrator, coding, light inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC&lt;/td&gt;
&lt;td&gt;AMD 9970X, 128GB RAM, RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;Heavy inference, image generation, GPU rental&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu Box&lt;/td&gt;
&lt;td&gt;Older CPU, 16GB RAM&lt;/td&gt;
&lt;td&gt;Background services, OmniParser, CPU inference fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total cost for the additional machines: the Windows PC was a workstation I already had, the Ubuntu box is a repurposed laptop. The Mac Mini is the only machine I bought specifically for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Three? Because One Keeps Running Into Walls.
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you try to do everything on a single machine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Mac Mini alone:&lt;/strong&gt; Run a 9B model → fine. Try a 30B model → the fans sound like a jet engine and inference drops to 2 tokens/second because Apple Silicon shares RAM between CPU and GPU. Start a build while inference is running? Everything crawls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Windows PC alone:&lt;/strong&gt; The RTX 3060 crushes inference — a 30B model at 4-bit runs at 8-12 tok/s. But the machine is also my GPU rental host (I rent it on Vast.ai for passive income). When someone rents it, I can't use it. And running heavy inference while coding in an IDE on the same machine? Stutter city.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Ubuntu box alone:&lt;/strong&gt; It's slow. CPU-only inference on a 7B model takes minutes to produce a full response. But it runs 24/7 without complaints, never reboots for Windows updates, and costs almost nothing in power.&lt;/p&gt;

&lt;p&gt;The answer was never "pick the best one." It was "give each one the right job."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Division of Labor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mac Mini: The Brain
&lt;/h3&gt;

&lt;p&gt;This is where I actually &lt;em&gt;work&lt;/em&gt;. Code editing, git operations, web browsing, writing. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with small models&lt;/strong&gt; — Qwen 3 4B for quick chat, granite3.2-vision for screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My automation agents&lt;/strong&gt; — OpenClaw runs here 24/7, orchestrating the other machines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev environment&lt;/strong&gt; — VS Code, terminal, browser, all the tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the Mac Mini is for &lt;em&gt;interactive&lt;/em&gt; work. Fast response time matters more than throughput. A 4B model answering in 0.3 seconds beats a 30B model answering in 8 seconds when I'm in the middle of a thought.&lt;/p&gt;

&lt;p&gt;What it does NOT do: heavy batch processing, image generation, anything that pins the CPU for more than a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows PC: The Muscle
&lt;/h3&gt;

&lt;p&gt;The RTX 3060 with 12GB VRAM is the workhorse. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with big models&lt;/strong&gt; — Qwen3-Coder 30B for complex code, DeepSeek R1 8B for reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLUX image generation&lt;/strong&gt; — when I need AI photos (I run a Fiverr gig for brand character photos — $0 generation cost when it's your own GPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vast.ai host&lt;/strong&gt; — when I'm not using it, someone else pays to rent it. ~$50-130/month passive income.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduling is simple: during my working hours (roughly 9am-11pm Istanbul time), I pause Vast.ai rentals and use the GPU myself. Overnight, it goes back on the rental market.&lt;/p&gt;

&lt;p&gt;128GB of RAM means I can load truly large models or run multiple services simultaneously. The GPU handles inference; the CPU cores handle data prep and post-processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ubuntu Box: The Backbone
&lt;/h3&gt;

&lt;p&gt;This machine's superpower is reliability. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OmniParser&lt;/strong&gt; — a UI parsing service that my automation agents use to "see" screens (runs on port 8100, I SSH-tunnel to it from the Mac)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU inference fallback&lt;/strong&gt; — when the Windows GPU is rented out and the Mac is busy, the Ubuntu box runs minicpm-v for vision tasks. Slow (~180 seconds per query) but it works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background cron jobs&lt;/strong&gt; — data scraping, health checks, monitoring scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing ground&lt;/strong&gt; — I deploy new services here first because if something crashes, my main workflow isn't affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the machine I interact with least directly, but it's the one I'd miss most if it went down.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Talk to Each Other
&lt;/h2&gt;

&lt;p&gt;This was the hardest part to figure out. Three machines that don't communicate are just three isolated computers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mac Mini (192.168.1.102) ←→ Windows PC (192.168.1.106) ←→ Ubuntu (192.168.1.100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All on the same local network. No VPN needed at home, but I set up Tailscale for remote access when I'm not home.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama routing:&lt;/strong&gt;&lt;br&gt;
The Mac Mini's Ollama is configured with &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt; so it's accessible from the other machines. Same for the Windows PC. My automation agent on the Mac Mini routes requests based on model size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ollama_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route to the right machine based on model size.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;small_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moondream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;big_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;big_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Windows GPU
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;small_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# Mac Mini local
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# Ubuntu fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. No Kubernetes, no service mesh. Just a function that knows which machine has which model.&lt;/p&gt;
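
&lt;p&gt;For completeness, here's roughly how that function gets used — a sketch with the &lt;code&gt;requests&lt;/code&gt; library (the &lt;code&gt;ask&lt;/code&gt; helper name is mine; error handling kept minimal):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def ask(model, prompt):
    """Send a prompt to whichever machine hosts the model."""
    url = get_ollama_url(model) + "/api/generate"
    resp = requests.post(url, json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("qwen3:4b", "Summarize: the meeting moved to Thursday."))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;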

&lt;p&gt;&lt;strong&gt;SSH tunnels for services:&lt;/strong&gt;&lt;br&gt;
The Ubuntu box runs OmniParser on port 8100, but I need it on the Mac Mini. Instead of exposing it publicly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-reconnecting SSH tunnel (runs via launchd on Mac)&lt;/span&gt;
ssh &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveInterval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60 &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveCountMax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-L&lt;/span&gt; 8100:localhost:8100 exp@192.168.1.100 &lt;span class="nt"&gt;-N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;localhost:8100&lt;/code&gt; on the Mac Mini reaches OmniParser on Ubuntu. Clean, secure, zero config on the Ubuntu side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistakes I Made
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Trying to make the Mac Mini do everything.&lt;/strong&gt;&lt;br&gt;
For the first month, I ran all inference on the Mac Mini. The 16GB unified memory seems like a lot until a 30B model is eating 20GB and your IDE is using another 4GB and macOS starts swapping to SSD. The SSD is fast, but swap is still swap. Code completion went from instant to 3-second delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Not setting up Ollama access from day one.&lt;/strong&gt;&lt;br&gt;
I installed Ollama separately on each machine and used them in isolation for weeks. The moment I realized I could call the Windows Ollama from the Mac Mini was a lightbulb moment. The models were always there — I just wasn't using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Ignoring the Ubuntu box because it's "slow."&lt;/strong&gt;&lt;br&gt;
CPU inference is slow. But "slow" is relative. A background task that takes 3 minutes on CPU doesn't matter if you're not waiting for it. I was treating every request as if it needed to be instant. Turns out, most don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: No fallback plan.&lt;/strong&gt;&lt;br&gt;
When the Windows PC rebooted for an update, my entire inference pipeline died because I had no fallback routing. Now: if Windows is down, Ubuntu takes over. If Ubuntu is down, the Mac runs a tiny model. Something always answers.&lt;/p&gt;
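
&lt;p&gt;The fallback chain is nothing clever — a sketch of the idea, using Ollama's &lt;code&gt;/api/tags&lt;/code&gt; endpoint as a cheap liveness probe:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Ordered preference: Windows GPU, then Ubuntu CPU, then a tiny model on the Mac
CHAIN = [
    ("http://192.168.1.106:11434", "qwen3-coder:30b"),  # Windows GPU
    ("http://192.168.1.100:11434", "minicpm-v"),        # Ubuntu CPU fallback
    ("http://localhost:11434", "qwen3:4b"),             # Mac Mini tiny model
]

def pick_host():
    for endpoint, model in CHAIN:
        try:
            requests.get(endpoint + "/api/tags", timeout=2)  # liveness probe
            return endpoint, model
        except requests.RequestException:
            continue
    raise RuntimeError("no Ollama host reachable")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;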

&lt;h2&gt;
  
  
  What This Setup Enables
&lt;/h2&gt;

&lt;p&gt;Running three machines isn't about having the biggest, fastest setup. It's about &lt;strong&gt;never being blocked.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows GPU is rented out? Route to Mac or Ubuntu.&lt;/li&gt;
&lt;li&gt;Mac Mini is busy with a build? Use the Windows GPU via network.&lt;/li&gt;
&lt;li&gt;Ubuntu is running a heavy OmniParser job? Queue it, the Mac handles the request.&lt;/li&gt;
&lt;li&gt;Need to generate 50 images for a client? Windows GPU does it in batch while I keep coding on the Mac.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the cost breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini power&lt;/td&gt;
&lt;td&gt;~$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC power (idle + rental hours)&lt;/td&gt;
&lt;td&gt;~$15-25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box power&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vast.ai earnings (offset)&lt;/td&gt;
&lt;td&gt;~$50-130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Net&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-$12 to -$102 (net profit)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The setup literally pays for itself. The Windows PC's GPU rental income exceeds the total power cost of all three machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Need Three Machines?
&lt;/h2&gt;

&lt;p&gt;Probably not. But you might need &lt;em&gt;two&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're running AI on a laptop: get a cheap desktop (even without a GPU) and put it on your network as a background worker. Ollama on a second machine means your main machine stays responsive.&lt;/p&gt;

&lt;p&gt;If you have a gaming PC: you already have a GPU. Install Ollama on it, set &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt;, and call it from your laptop. You now have a two-machine AI lab for zero additional cost.&lt;/p&gt;

&lt;p&gt;The Ubuntu box in my setup could be an $80 Raspberry Pi 5 with 8GB RAM for the same purpose. It doesn't need to be fast. It needs to be &lt;em&gt;there&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;I have five more GPUs in storage (2× RTX 3070, 1× RTX 3080, 2× RTX 3060). The plan is to build an open-air rig and add them all to the Windows PC for both local inference and Vast.ai rental. That'd give me 62GB total VRAM — enough to run a 70B model locally or rent out as a cluster.&lt;/p&gt;

&lt;p&gt;The Ubuntu box is getting a dedicated role as a monitoring and alerting server. Right now my health checks are scattered across cron jobs on all three machines. Centralizing them makes debugging easier.&lt;/p&gt;

&lt;p&gt;And the Mac Mini? It stays the brain. More RAM would be nice (16GB is tight with Ollama + IDE + browser), but the M4's efficiency is hard to beat for interactive work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;One machine is a workstation. Two machines is a lab. Three machines is a distributed system that happens to fit on a desk.&lt;/p&gt;

&lt;p&gt;The magic isn't in the hardware — it's in the routing. Knowing which machine should handle which task, building fallback paths, and making sure a single machine going down doesn't kill your workflow.&lt;/p&gt;

&lt;p&gt;Start with what you have. Add a second machine when you feel the friction. Route intelligently. The three-machine setup happened organically — each machine earned its place by solving a problem the others couldn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. New post every few days — follow along if that's your thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My GPU rental story: &lt;a href="https://dev.to/samhartley/i-rented-out-my-gpu-for-passive-income-heres-what-happened-after-my-first-week-2l8a"&gt;One card → six cards → $250/month passive&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>infrastructure</category>
      <category>llm</category>
      <category>productivity</category>
    </item>
    <item>
      <title>From Idea to Deployed Tool in 3 Hours — How AI Coding Agents Changed My Workflow</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Tue, 05 May 2026 08:06:19 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/from-idea-to-deployed-tool-in-3-hours-how-ai-coding-agents-changed-my-workflow-3dd0</link>
      <guid>https://forem.com/samhartley_dev/from-idea-to-deployed-tool-in-3-hours-how-ai-coding-agents-changed-my-workflow-3dd0</guid>
      <description>&lt;p&gt;I used to think AI coding assistants were autocomplete on steroids. Fancy IntelliSense. Then I tried using one as an actual junior developer — someone who writes the first draft while I review and refine.&lt;/p&gt;

&lt;p&gt;Two months later, my workflow is unrecognizable. I just shipped a complete B2B configuration tool — interactive maps, zone polygons, dynamic forms, the works — in under three hours. Here's what changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Old Way
&lt;/h2&gt;

&lt;p&gt;Before AI agents, my process for a new tool looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Research&lt;/strong&gt; — How does Leaflet.js work? What's the API for geo polygons? Stack Overflow, docs, tutorials. 45 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate&lt;/strong&gt; — HTML structure, CSS grid, JavaScript imports, event listeners. 30 minutes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core logic&lt;/strong&gt; — The actual thing the tool needs to do. 2–3 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt; — Why doesn't the map render? Why is the polygon offset? 1–2 hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Polish&lt;/strong&gt; — Styling, responsive layout, edge cases. 1 hour.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Total: 6–8 hours for a medium-complexity tool.&lt;/strong&gt; And that's if I know the stack. If it's something new (like Garmin's Monkey C or a mapping library I haven't used), double it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The New Way
&lt;/h2&gt;

&lt;p&gt;Last week a client asked for a heat zone map for 21 European countries. Click a country, see the heating zones, pick one, get the right configuration. With polygon boundaries, country-specific defaults, and a responsive UI.&lt;/p&gt;

&lt;p&gt;Here's how it went:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 0–0.5: Prompt engineering&lt;/strong&gt;&lt;br&gt;
I wrote a detailed spec. Not "make a map" — that's useless. I described:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data structure (country → zones → polygon coordinates)&lt;/li&gt;
&lt;li&gt;The UI flow (dropdown → map render → zone selection)&lt;/li&gt;
&lt;li&gt;The tech stack (Leaflet.js, vanilla JS, no frameworks)&lt;/li&gt;
&lt;li&gt;Edge cases (what happens when a country has no zones?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Hour 0.5–1.5: First draft from the agent&lt;/strong&gt;&lt;br&gt;
I fed the spec to my CLI coding agent (I use Codex; Claude Code works the same way). It generated the full HTML file — 800+ lines — with Leaflet integration, zone polygons, event handlers, and the selection logic.&lt;/p&gt;

&lt;p&gt;Was it perfect? No. The polygon coordinates were placeholder circles. The styling was bare-bones. But the &lt;em&gt;architecture&lt;/em&gt; was right. The map rendered. The flow worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 1.5–2.5: My turn — polish and fix&lt;/strong&gt;&lt;br&gt;
I replaced the placeholder polygons with real GeoJSON-ish coordinates for all 21 countries. Tweaked the CSS for mobile. Added validation. Fixed a bug where the map didn't re-center when switching countries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hour 2.5–3: Integration and deploy&lt;/strong&gt;&lt;br&gt;
Hooked it into the existing project structure. Git commit. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total: 3 hours.&lt;/strong&gt; And the heavy lifting — the Leaflet setup, the polygon rendering logic, the event wiring — was handled by the agent. I did the creative/problem-solving work: defining the problem, validating the output, fixing the edge cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works (and What Doesn't)
&lt;/h2&gt;

&lt;p&gt;After two months of daily use, here's my honest assessment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;✅ What works brilliantly:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate and plumbing&lt;/strong&gt; — Setting up projects, imports, basic structure. The agent is faster and makes fewer typos than me.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API integration patterns&lt;/strong&gt; — "Here's an endpoint, here's the expected response, write the fetch and parse logic." It gets this right 90% of the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring&lt;/strong&gt; — "Rename this function and update all callers across 5 files." Instant, error-free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exploring unfamiliar territory&lt;/strong&gt; — I hadn't used Leaflet in years. The agent got me to a working state without reading docs for an hour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;❌ What doesn't work (yet):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex business logic&lt;/strong&gt; — Anything with nuanced rules, edge cases, or domain-specific constraints. The agent generates something plausible that breaks in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI/UX design&lt;/strong&gt; — It makes functional UIs. They look like a developer made them (because one did). You'll still need a human eye for polish.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging its own mistakes&lt;/strong&gt; — When the agent writes a bug, it's often subtle. You need to understand the code to catch it. This is &lt;em&gt;not&lt;/em&gt; a replacement for knowing how to code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large-scale architecture&lt;/strong&gt; — It works file-by-file. Designing a system with proper separation of concerns, caching strategies, and scalability? That's still on you.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Mindset Shift
&lt;/h2&gt;

&lt;p&gt;The biggest change isn't speed. It's &lt;em&gt;how I think about problems&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Before: "How do I implement this?" → Research → Code → Debug.&lt;br&gt;
Now: "How do I describe this so an agent can implement a first draft?" → Spec → Review → Refine.&lt;/p&gt;

&lt;p&gt;I'm the architect and editor now, not the typist. The agent is the junior dev who writes fast but needs supervision.&lt;/p&gt;

&lt;p&gt;This matters because it scales. I can explore 3 approaches in the time it used to take to build 1. I can say "what if we used a canvas instead of Leaflet?" and get a working comparison in 10 minutes. The cost of experimentation dropped to near-zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;If you want to try this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Claude Code or Codex CLI&lt;/strong&gt; — The terminal interface lets you iterate fast. Chat-based tools (ChatGPT, etc.) add copy-paste round-trips that make iterating on code too slow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write detailed specs&lt;/strong&gt; — The agent is only as good as your prompt. Include examples, expected outputs, and constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review every line&lt;/strong&gt; — Don't blindly commit. The agent writes plausible-looking bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep a "golden test"&lt;/strong&gt; — A known input/output pair you can run after every change. Catches regressions instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate in small chunks&lt;/strong&gt; — "Add the map" → review → "Add zone polygons" → review. Don't ask for 500 lines at once.&lt;/li&gt;
&lt;/ol&gt;
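
&lt;p&gt;A golden test can be as small as this — a sketch where &lt;code&gt;zone_defaults&lt;/code&gt; is a hypothetical stand-in for whatever entry point your tool exposes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal golden test: one verified input/output pair, frozen, run after every agent edit.
# zone_defaults() is a toy stand-in — swap in your real function.
def zone_defaults(country, zone):
    table = {("DE", 2): {"flow_temp": 55, "curve": 1.2}}
    return table[(country, zone)]

def test_golden():
    # Captured once from a known-good run, then never edited casually
    assert zone_defaults("DE", 2) == {"flow_temp": 55, "curve": 1.2}

if __name__ == "__main__":
    test_golden()
    print("golden test passed")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;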

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Two months in, my main learning: &lt;strong&gt;the spec is everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My fastest sessions happen when I spend 10 minutes writing a bullet-point spec before touching the agent. My slowest sessions happen when I vague-prompt my way through and spend an hour fixing misunderstandings.&lt;/p&gt;

&lt;p&gt;The second learning: &lt;strong&gt;agents excel at breadth, humans at depth.&lt;/strong&gt; Use the agent to explore options. Use your brain to pick the right one.&lt;/p&gt;




&lt;p&gt;If you're using AI coding agents, what's your experience? I'm curious if the "architect + editor" model resonates, or if you've found a different pattern that works. Drop your thoughts below — I'm still figuring this out too.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>codex</category>
      <category>development</category>
    </item>
    <item>
      <title>I Built a Model Router That Picks the Right AI for Every Task — Here's Why You Should Too</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sun, 03 May 2026 08:01:54 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la</link>
      <guid>https://forem.com/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-5la</guid>
      <description>&lt;p&gt;I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.&lt;/p&gt;

&lt;p&gt;That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.&lt;/p&gt;

&lt;p&gt;So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Need something → open terminal&lt;/li&gt;
&lt;li&gt;Think about which model is good enough&lt;/li&gt;
&lt;li&gt;Think about which machine has it&lt;/li&gt;
&lt;li&gt;Type the full URL: &lt;code&gt;curl http://192.168.1.106:11434/api/generate -d '{"model": "qwen3-coder:30b", ...}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."&lt;/p&gt;

&lt;p&gt;Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Router
&lt;/h2&gt;

&lt;p&gt;I started with a Python function. Nothing fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Model registry: what's available where
&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minicpm-v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;picture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prove&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.&lt;/p&gt;
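
&lt;p&gt;Calling it looks like this — a sketch (the &lt;code&gt;ask&lt;/code&gt; wrapper is mine; Ollama's non-streaming &lt;code&gt;/api/generate&lt;/code&gt; returns the text under &lt;code&gt;response&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def ask(prompt):
    cfg = route(prompt)
    resp = requests.post(
        cfg["endpoint"] + "/api/generate",
        json={"model": cfg["model"], "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask("implement a retry decorator"))  # "implement" routes to the 30B coder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;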

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Before the router, my typical day looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick questions (summarize this, what does this error mean, rephrase this)&lt;/li&gt;
&lt;li&gt;5 code tasks (write a function, debug this, add tests)&lt;/li&gt;
&lt;li&gt;2 vision tasks (what's in this screenshot)&lt;/li&gt;
&lt;li&gt;1 deep reasoning task (complex analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting.&lt;/p&gt;

&lt;p&gt;With routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick → 4B model on Mac Mini, 0.3s each → &lt;strong&gt;9 seconds total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;5 code → 30B on Windows GPU, 8-12s each → &lt;strong&gt;~50 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;2 vision → vision model on GPU, 4-13s each → &lt;strong&gt;~15 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 reasoning → 8B model on GPU, 5-8s → &lt;strong&gt;~6 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~80 seconds&lt;/strong&gt; instead of ~5 minutes. And my GPU is free for most of that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fallback System
&lt;/h2&gt;

&lt;p&gt;The real value hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic.&lt;/p&gt;

&lt;p&gt;Now the router has fallback logic — when a machine is down, it finds the next best option. Code tasks go to the small model (worse but functional). Vision tasks try Ubuntu. Quick tasks keep humming on the Mac Mini. Something always answers.&lt;/p&gt;
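
&lt;p&gt;The fallback is a thin wrapper around &lt;code&gt;route()&lt;/code&gt; — a sketch, probing Ollama's &lt;code&gt;/api/tags&lt;/code&gt; to see who's alive:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def is_up(endpoint):
    """Cheap liveness probe against an Ollama host."""
    try:
        requests.get(endpoint + "/api/tags", timeout=2)
        return True
    except requests.RequestException:
        return False

def route_with_fallback(prompt):
    choice = route(prompt)
    if is_up(choice["endpoint"]):
        return choice
    # Degrade gracefully: quick model locally, then the Ubuntu box
    for key in ("quick", "fallback"):
        if is_up(MODELS[key]["endpoint"]):
            return MODELS[key]
    raise RuntimeError("every machine is down")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;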

&lt;h2&gt;
  
  
  The Token Cost Angle
&lt;/h2&gt;

&lt;p&gt;I track tokens per model per day. Not because I'm cheap (these are all free — local), but because it tells me where I'm spending compute:&lt;/p&gt;

&lt;p&gt;After a week: &lt;strong&gt;72% of my requests went to the quick model.&lt;/strong&gt; Only 15% needed the big code model.&lt;/p&gt;

&lt;p&gt;I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.&lt;/p&gt;
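
&lt;p&gt;The tracking itself is a few lines — a sketch that logs the token counts Ollama already returns in non-streaming responses (&lt;code&gt;prompt_eval_count&lt;/code&gt; and &lt;code&gt;eval_count&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json, time

def log_usage(model, resp, path="usage.jsonl"):
    """Append one line per request; resp is the parsed /api/generate JSON."""
    with open(path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "model": model,
            "prompt_tokens": resp.get("prompt_eval_count", 0),
            "output_tokens": resp.get("eval_count", 0),
        }) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;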

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the router on day one.&lt;/strong&gt; It took an afternoon. I wasted months manually routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with keyword routing.&lt;/strong&gt; I considered embeddings, classifiers, even using an LLM to pick the LLM. Keywords work for 90% of cases. Ship the simple thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Make the fallback automatic.&lt;/strong&gt; My first version just errored when the GPU machine was down. A degraded response is infinitely better than no response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log everything.&lt;/strong&gt; You can't optimize what you don't measure. The 72% stat jumped out immediately once I started tracking.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Keywords
&lt;/h2&gt;

&lt;p&gt;The keyword router works but has blind spots. So I'm adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Confidence scoring:&lt;/strong&gt; If a prompt matches multiple categories, try the cheaper model first. Auto-retry with bigger model if quality seems low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware routing:&lt;/strong&gt; If the last 3 prompts were about the same codebase, keep using the code model even without explicit keywords.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware fallback:&lt;/strong&gt; When local models can't handle something, the router should know whether a cloud API call is "worth it" or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Do You Need This?
&lt;/h2&gt;

&lt;p&gt;If you run one model on one machine: no. You're fine.&lt;/p&gt;

&lt;p&gt;If you have more than 2 models: &lt;strong&gt;yes.&lt;/strong&gt; Otherwise you'll default to the biggest one every time, and that's a waste.&lt;/p&gt;

&lt;p&gt;If you have multiple machines: &lt;strong&gt;absolutely.&lt;/strong&gt; The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.&lt;/p&gt;

&lt;p&gt;Start with 2 models and a simple if/else. Add more as you grow. The architecture stays the same.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Built a Model Router That Picks the Right AI for Every Task — Here's Why You Should Too</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Fri, 01 May 2026 08:02:21 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-1603</link>
      <guid>https://forem.com/samhartley_dev/i-built-a-model-router-that-picks-the-right-ai-for-every-task-heres-why-you-should-too-1603</guid>
      <description>&lt;p&gt;I caught myself doing something stupid the other day. I used a 30B parameter model to summarize a two-paragraph email. The GPU spun up, the fans kicked in, and 8 seconds later I had my summary. The same summary my 4B model would've given me in 0.3 seconds.&lt;/p&gt;

&lt;p&gt;That's when I realized: I'd been treating every AI request the same way. Big model for everything. Small model for nothing. No intelligence in the routing at all.&lt;/p&gt;

&lt;p&gt;So I built a model router. Not a fancy orchestrator with Kubernetes and service meshes — just a simple function that looks at what you're asking and sends it to the right model on the right machine. It's the single most impactful thing I've done for my local AI setup this year.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I run three machines with Ollama (Mac Mini M4, Windows PC with RTX 3060, Ubuntu box). Between them, I have maybe 8 models installed. Before the router, my workflow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Need something → open terminal&lt;/li&gt;
&lt;li&gt;Think about which model is good enough&lt;/li&gt;
&lt;li&gt;Think about which machine has it&lt;/li&gt;
&lt;li&gt;Type the full URL: &lt;code&gt;curl http://192.168.1.106:11434/api/generate -d '{"model": "qwen3-coder:30b", ...}'&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Wait&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 2-4 happened every single time. I was spending more mental energy on routing than on the actual task. And half the time I'd default to the biggest model just because "it's probably better."&lt;/p&gt;

&lt;p&gt;Here's the thing: it's usually not better. A 4B model summarizing text is 95% as good as a 30B model summarizing text. But it's 25x faster and uses 1/10th the resources. The big model should earn its keep on tasks where size actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Router
&lt;/h2&gt;

&lt;p&gt;I started with a Python function. Nothing fancy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Model registry: what's available where
&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# Quick chat, summaries, simple Q&amp;amp;A → Mac Mini (fast response)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Vision / image analysis → Windows GPU (needs VRAM)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Code generation, complex reasoning → Windows GPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Deep reasoning, math, logic puzzles → Windows GPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c1"&gt;# Fallback for when Windows is rented out → Ubuntu CPU
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;minicpm-v&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Decide which model should handle this prompt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Vision tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;picture&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;see this&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;what&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Code tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;class&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;refactor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;implement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;javascript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;typescript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api endpoint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Reasoning tasks
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prove&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;solve&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;calculate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;math&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compare&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;evaluate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Default: quick model for everything else
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. That's the whole router. It classifies prompts by keywords and sends them to the appropriate model. Is it perfect? No. Does it need to be? Also no.&lt;/p&gt;
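
&lt;p&gt;For the curious, calling it is one more tiny function. This is a sketch rather than my exact script (the &lt;code&gt;ask_via&lt;/code&gt; helper, its timeout, and the defaults are illustrative): pick a route, then POST to that machine's Ollama &lt;code&gt;/api/generate&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

def ask_via(target: dict, prompt: str) -&amp;gt; str:
    """Send a prompt to a specific Ollama endpoint/model pair."""
    resp = requests.post(
        f"{target['endpoint']}/api/generate",
        json={"model": target["model"], "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def ask(prompt: str) -&amp;gt; str:
    """Route, then ask."""
    return ask_via(route(prompt), prompt)

print(ask("Summarize this email: ..."))  # lands on the 4B quick model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
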

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;Before the router, my typical day looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick questions (summarize this, what does this error mean, rephrase this)&lt;/li&gt;
&lt;li&gt;5 code tasks (write a function, debug this, add tests)&lt;/li&gt;
&lt;li&gt;2 vision tasks (what's in this screenshot)&lt;/li&gt;
&lt;li&gt;1 deep reasoning task (complex analysis)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without routing, I'd use the big model for everything. 38 requests to the 30B model. Each taking 5-15 seconds. Total wait time: ~4-5 minutes of just... waiting. While my GPU is maxed out and I can't run anything else.&lt;/p&gt;

&lt;p&gt;With routing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;30 quick → 4B model on Mac Mini, 0.3s each → &lt;strong&gt;9 seconds total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;5 code → 30B on Windows GPU, 8-12s each → &lt;strong&gt;~50 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;2 vision → vision model on GPU, 4-13s each → &lt;strong&gt;~15 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;1 reasoning → 8B model on GPU, 5-8s → &lt;strong&gt;~6 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total: ~80 seconds&lt;/strong&gt; instead of ~5 minutes. And my GPU is free for most of that time, so I could be generating images or renting it out simultaneously.&lt;/p&gt;

&lt;p&gt;The router doesn't just save time. It frees capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fallback System
&lt;/h2&gt;

&lt;p&gt;The real value of routing hit me when the Windows PC went down for a Windows Update mid-session. Before, this was catastrophic — all my "important" models were on that machine.&lt;/p&gt;

&lt;p&gt;Now the router has fallback logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;route_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route with health checks and fallbacks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if primary endpoint is alive
&lt;/span&gt;    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Primary is down, find fallback
&lt;/span&gt;
    &lt;span class="c1"&gt;# Fallback chain based on task type
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Can't run 30B on Mac, but 4B is better than nothing
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Try Ubuntu's slow vision model
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;endpoint&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/api/tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fallback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# No vision available, use text
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Quick model can reason a bit
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;MODELS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Windows reboots, my workflow doesn't stop. The router sends code tasks to the small model (worse but functional), vision tasks to Ubuntu (slow but works), and quick tasks keep humming along on the Mac Mini. Something always answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Cost Angle
&lt;/h2&gt;

&lt;p&gt;I also track tokens per model per day. Not because I'm cheap (these are all free — they run locally), but because it tells me where I'm spending compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple token counter per route
&lt;/span&gt;&lt;span class="n"&gt;token_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;track_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;token_log&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;
    &lt;span class="c1"&gt;# Log daily totals to a file
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token-usage-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;route&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()})&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a week of tracking, I found: &lt;strong&gt;72% of my requests were going to the quick model.&lt;/strong&gt; Only 15% actually needed the big code model. 8% were vision. 5% reasoning.&lt;/p&gt;

&lt;p&gt;I was using a sledgehammer for 72% of my nails. No wonder the GPU always felt busy.&lt;/p&gt;
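
&lt;p&gt;Getting those percentages out of the logs is a few lines of tallying. A minimal sketch, assuming the JSONL format from the snippet above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from collections import Counter
from pathlib import Path

requests_per_route = Counter()
tokens_per_route = Counter()

# Walk every daily log file and tally requests and tokens per route
for log_file in Path(".").glob("token-usage-*.jsonl"):
    for line in log_file.read_text().splitlines():
        entry = json.loads(line)
        requests_per_route[entry["route"]] += 1
        tokens_per_route[entry["route"]] += entry["tokens"]

total = sum(requests_per_route.values())
for name, count in requests_per_route.most_common():
    print(f"{name}: {count / total:.0%} of requests, {tokens_per_route[name]} tokens")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
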

&lt;h2&gt;
  
  
  What I'd Do Differently If I Started Over
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Build the router on day one, not month three.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I wasted months manually routing. The router took an afternoon to write. If you're running multiple models, build the routing first. You can always make it smarter later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with keyword routing. Don't over-engineer it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I considered using embeddings to classify prompts. I considered training a tiny classifier. I considered using an LLM to decide which LLM to use (yes, really). Keyword matching works for 90% of cases. Ship the simple thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Make the fallback automatic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My first version just errored when the GPU machine was down. The fallback logic was an afterthought. It should've been the first thing I built — because machines go down, and a degraded response is infinitely better than no response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't optimize what you don't measure. Once I started logging which routes were used, the 72% quick-model stat jumped out immediately. Without logs, I would've kept thinking I needed the big model "most of the time."&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Keywords: What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;The keyword router works, but it has blind spots. A prompt like "help me understand why this function is slow" could be code OR reasoning. So I'm adding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidence scoring:&lt;/strong&gt; If the prompt matches multiple categories, route to the cheaper model first. If the response quality seems low (response is very short, model seems confused), auto-retry with the bigger model.&lt;/p&gt;
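
&lt;p&gt;Roughly what I have in mind, as a sketch (not tested in anger; the &lt;code&gt;looks_low_quality&lt;/code&gt; heuristic is a placeholder I'll have to tune, and it reuses &lt;code&gt;ask_via()&lt;/code&gt; and &lt;code&gt;MODELS&lt;/code&gt; from earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def looks_low_quality(answer: str) -&amp;gt; bool:
    # Placeholder heuristics: suspiciously short, or the model hedging hard
    return len(answer.strip()) &amp;lt; 40 or "i'm not sure" in answer.lower()

def ask_with_escalation(prompt: str) -&amp;gt; str:
    """Cheap model first; escalate to the 30B if the answer looks weak."""
    answer = ask_via(MODELS["quick"], prompt)
    if looks_low_quality(answer):
        answer = ask_via(MODELS["code"], prompt)
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
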

&lt;p&gt;&lt;strong&gt;Context-aware routing:&lt;/strong&gt; If the last 3 prompts were all about the same codebase, keep using the code model even if the current prompt doesn't have code keywords. Context continuity matters.&lt;/p&gt;
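
&lt;p&gt;The context-aware part can be as dumb as a sliding window. A sketch, building on the &lt;code&gt;route()&lt;/code&gt; and &lt;code&gt;MODELS&lt;/code&gt; from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import deque

recent = deque(maxlen=3)  # names of the last few routes

def route_with_context(prompt: str) -&amp;gt; dict:
    """Keep using the code model if the last few prompts were code-heavy."""
    chosen = route(prompt)
    if chosen is MODELS["quick"] and recent.count("code") &amp;gt;= 2:
        chosen = MODELS["code"]  # assume we're still in a coding session
    # Record which named route we ended up on
    recent.append(next(k for k, v in MODELS.items() if v is chosen))
    return chosen
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
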

&lt;p&gt;&lt;strong&gt;Cost-aware routing for cloud fallback:&lt;/strong&gt; When local models can't handle something, I fall back to cloud APIs. The router should know: "this task isn't worth a $0.03 GPT-4 call, use the local model." Or: "this is worth it, spend the API credits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Need This?
&lt;/h2&gt;

&lt;p&gt;If you run one model on one machine: no. You're fine. The router solves a multi-model, multi-machine problem.&lt;/p&gt;

&lt;p&gt;If you're starting to accumulate models (a small one for chat, a big one for code, a vision model): &lt;strong&gt;yes.&lt;/strong&gt; The moment you have more than 2 models, you need routing. Otherwise you'll default to the biggest one every time, and that's a waste.&lt;/p&gt;

&lt;p&gt;If you have multiple machines: &lt;strong&gt;absolutely.&lt;/strong&gt; The router isn't just about picking the right model — it's about picking the right machine. Code tasks go where the GPU is. Quick tasks stay local. Background tasks go to the always-on box.&lt;/p&gt;

&lt;p&gt;The beauty is that the router grows with you. Start with 2 models and a simple if/else. Add more models, add more rules. Add more machines, add more endpoints. The architecture stays the same.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. If that's your jam, I post every few days.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My AI lab setup: &lt;a href="https://dev.to/samhartley/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-3b8a"&gt;3 machines, 1 desk, zero cloud dependency&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ollama</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
    <item>
      <title>My 3-Machine AI Lab: How I Divide Work Between a Mac Mini, a Windows PC, and an Ubuntu Box</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Wed, 29 Apr 2026 08:02:28 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-1p8i</link>
      <guid>https://forem.com/samhartley_dev/my-3-machine-ai-lab-how-i-divide-work-between-a-mac-mini-a-windows-pc-and-an-ubuntu-box-1p8i</guid>
      <description>&lt;p&gt;I keep seeing posts about running AI on a single machine. "Just use Ollama on your laptop!" Sure, that works — until you want to run a 30B model while your IDE is indexing, your test suite is running, and you're editing a video.&lt;/p&gt;

&lt;p&gt;I have three machines. Not because I'm rich — because I kept old hardware alive and gave each one a job. Here's how I split AI workloads across a Mac Mini M4, a Windows PC with an RTX 3060, and an old Ubuntu box, and why one machine was never enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Machines
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Specs&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4&lt;/td&gt;
&lt;td&gt;10-core, 16GB RAM&lt;/td&gt;
&lt;td&gt;Orchestrator, coding, light inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC&lt;/td&gt;
&lt;td&gt;AMD 9970X, 128GB RAM, RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;Heavy inference, image generation, GPU rental&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu Box&lt;/td&gt;
&lt;td&gt;Older CPU, 16GB RAM&lt;/td&gt;
&lt;td&gt;Background services, OmniParser, CPU inference fallback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total cost for the additional machines: the Windows PC was a workstation I already had, the Ubuntu box is a repurposed laptop. The Mac Mini is the only machine I bought specifically for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Three? Because One Keeps Running Into Walls.
&lt;/h2&gt;

&lt;p&gt;Here's what happens when you try to do everything on a single machine:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Mac Mini alone:&lt;/strong&gt; Run a 9B model → fine. Try a 30B model → the fans sound like a jet engine and inference drops to 2 tokens/second because Apple Silicon shares RAM between CPU and GPU. Start a build while inference is running? Everything crawls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Windows PC alone:&lt;/strong&gt; The RTX 3060 crushes inference — a 30B model at 4-bit runs at 8-12 tok/s. But the machine is also my GPU rental host (I rent it on Vast.ai for passive income). When someone rents it, I can't use it. And running heavy inference while coding in an IDE on the same machine? Stutter city.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the Ubuntu box alone:&lt;/strong&gt; It's slow. CPU-only inference on a 7B model takes 30+ seconds to finish even a short answer. But it runs 24/7 without complaints, never reboots for Windows updates, and costs almost nothing in power.&lt;/p&gt;

&lt;p&gt;The answer was never "pick the best one." It was "give each one the right job."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Division of Labor
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mac Mini: The Brain
&lt;/h3&gt;

&lt;p&gt;This is where I actually &lt;em&gt;work&lt;/em&gt;. Code editing, git operations, web browsing, writing. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with small models&lt;/strong&gt; — Qwen 3 4B for quick chat, granite3.2-vision for screenshots&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;My automation agents&lt;/strong&gt; — OpenClaw runs here 24/7, orchestrating the other machines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dev environment&lt;/strong&gt; — VS Code, terminal, browser, all the tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight: the Mac Mini is for &lt;em&gt;interactive&lt;/em&gt; work. Fast response time matters more than throughput. A 4B model answering in 0.3 seconds beats a 30B model answering in 8 seconds when I'm in the middle of a thought.&lt;/p&gt;

&lt;p&gt;What it does NOT do: heavy batch processing, image generation, anything that pins the CPU for more than a minute.&lt;/p&gt;

&lt;h3&gt;
  
  
  Windows PC: The Muscle
&lt;/h3&gt;

&lt;p&gt;The RTX 3060 with 12GB VRAM is the workhorse. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ollama with big models&lt;/strong&gt; — Qwen3-Coder 30B for complex code, DeepSeek R1 8B for reasoning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FLUX image generation&lt;/strong&gt; — when I need AI photos (I run a Fiverr gig for brand character photos — $0 generation cost when it's your own GPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vast.ai host&lt;/strong&gt; — when I'm not using it, someone else pays to rent it. ~$50-130/month passive income.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduling is simple: during my working hours (roughly 9am-11pm Istanbul time), I pause Vast.ai rentals and use the GPU myself. Overnight, it goes back on the rental market.&lt;/p&gt;

&lt;p&gt;128GB of RAM means I can load truly large models or run multiple services simultaneously. The GPU handles inference; the CPU cores handle data prep and post-processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ubuntu Box: The Backbone
&lt;/h3&gt;

&lt;p&gt;This machine's superpower is reliability. It runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OmniParser&lt;/strong&gt; — a UI parsing service that my automation agents use to "see" screens (runs on port 8100, I SSH-tunnel to it from the Mac)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU inference fallback&lt;/strong&gt; — when the Windows GPU is rented out and the Mac is busy, the Ubuntu box runs minicpm-v for vision tasks. Slow (~180 seconds per query) but it works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background cron jobs&lt;/strong&gt; — data scraping, health checks, monitoring scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing ground&lt;/strong&gt; — I deploy new services here first because if something crashes, my main workflow isn't affected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the machine I interact with least directly, but it's the one I'd miss most if it went down.&lt;/p&gt;

&lt;h2&gt;
  
  
  How They Talk to Each Other
&lt;/h2&gt;

&lt;p&gt;This was the hardest part to figure out. Three machines that don't communicate are just three isolated computers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The network:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mac Mini (192.168.1.102) ←→ Windows PC (192.168.1.106) ←→ Ubuntu (192.168.1.100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All on the same local network. No VPN needed on the LAN; Tailscale handles remote access when I'm away from home.&lt;/p&gt;
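
&lt;p&gt;Before routing anything, I like a quick reachability check against each machine's Ollama. A minimal sketch using the IPs above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

ENDPOINTS = {
    "mac-mini": "http://192.168.1.102:11434",
    "windows": "http://192.168.1.106:11434",
    "ubuntu": "http://192.168.1.100:11434",
}

for name, url in ENDPOINTS.items():
    try:
        r = requests.get(f"{url}/api/tags", timeout=2)  # lists installed models
        print(f"{name}: up, {len(r.json()['models'])} models")
    except requests.RequestException:
        print(f"{name}: DOWN")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
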

&lt;p&gt;&lt;strong&gt;Ollama routing:&lt;/strong&gt;&lt;br&gt;
The Mac Mini's Ollama is configured with &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt; so it's accessible from the other machines. Same for the Windows PC. My automation agent on the Mac Mini routes requests based on model size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_ollama_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Route to the right machine based on model size.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;small_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;granite3.2-vision:2b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;moondream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;big_models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3-coder:30b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepseek-r1:8b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;big_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.106:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Windows GPU
&lt;/span&gt;    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;small_models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;       &lt;span class="c1"&gt;# Mac Mini local
&lt;/span&gt;    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://192.168.1.100:11434&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;   &lt;span class="c1"&gt;# Ubuntu fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple. No Kubernetes, no service mesh. Just a function that knows which machine has which model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SSH tunnels for services:&lt;/strong&gt;&lt;br&gt;
The Ubuntu box runs OmniParser on port 8100, but I need it on the Mac Mini. Instead of exposing it publicly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Auto-reconnecting SSH tunnel (runs via launchd on Mac)&lt;/span&gt;
ssh &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveInterval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60 &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;ServerAliveCountMax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-L&lt;/span&gt; 8100:localhost:8100 exp@192.168.1.100 &lt;span class="nt"&gt;-N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;localhost:8100&lt;/code&gt; on the Mac Mini reaches OmniParser on Ubuntu. Clean, secure, zero config on the Ubuntu side.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mistakes I Made
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Trying to make the Mac Mini do everything.&lt;/strong&gt;&lt;br&gt;
For the first month, I ran all inference on the Mac Mini. The 16GB unified memory seems like a lot until a 30B model is eating 20GB and your IDE is using another 4GB and macOS starts swapping to SSD. The SSD is fast, but swap is still swap. Code completion went from instant to 3-second delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 2: Not setting up Ollama access from day one.&lt;/strong&gt;&lt;br&gt;
I installed Ollama separately on each machine and used them in isolation for weeks. The moment I realized I could call the Windows Ollama from the Mac Mini was a lightbulb moment. The models were always there — I just wasn't using them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 3: Ignoring the Ubuntu box because it's "slow."&lt;/strong&gt;&lt;br&gt;
CPU inference is slow. But "slow" is relative. A background task that takes 3 minutes on CPU doesn't matter if you're not waiting for it. I was treating every request as if it needed to be instant. Turns out, most don't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mistake 4: No fallback plan.&lt;/strong&gt;&lt;br&gt;
When the Windows PC rebooted for an update, my entire inference pipeline died because I had no fallback routing. Now: if Windows is down, Ubuntu takes over. If Ubuntu is down, the Mac runs a tiny model. Something always answers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Setup Enables
&lt;/h2&gt;

&lt;p&gt;Running three machines isn't about having the biggest, fastest setup. It's about &lt;strong&gt;never being blocked.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows GPU is rented out? Route to Mac or Ubuntu.&lt;/li&gt;
&lt;li&gt;Mac Mini is busy with a build? Use the Windows GPU via network.&lt;/li&gt;
&lt;li&gt;Ubuntu is running a heavy OmniParser job? Queue it, the Mac handles the request.&lt;/li&gt;
&lt;li&gt;Need to generate 50 images for a client? Windows GPU does it in batch while I keep coding on the Mac.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the cost breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini power&lt;/td&gt;
&lt;td&gt;~$8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Windows PC power (idle + rental hours)&lt;/td&gt;
&lt;td&gt;~$15-25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ubuntu box power&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vast.ai earnings (offset)&lt;/td&gt;
&lt;td&gt;~$50-130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Net&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$17 to $97 net profit&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The setup literally pays for itself. The Windows PC's GPU rental income exceeds the total power cost of all three machines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do You Need Three Machines?
&lt;/h2&gt;

&lt;p&gt;Probably not. But you might need &lt;em&gt;two&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you're running AI on a laptop: get a cheap desktop (even without a GPU) and put it on your network as a background worker. Ollama on a second machine means your main machine stays responsive.&lt;/p&gt;

&lt;p&gt;If you have a gaming PC: you already have a GPU. Install Ollama on it, set &lt;code&gt;OLLAMA_HOST=0.0.0.0&lt;/code&gt;, and call it from your laptop. You now have a two-machine AI lab for zero additional cost.&lt;/p&gt;
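
&lt;p&gt;Concretely, calling the gaming PC from your laptop is one POST request. A sketch; the IP is a stand-in for wherever your PC sits on the LAN, and the model is whatever you've pulled there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

GAMING_PC = "http://192.168.1.50:11434"  # stand-in: your PC's LAN address

resp = requests.post(f"{GAMING_PC}/api/generate", json={
    "model": "qwen3:4b",  # any model you've pulled on the PC
    "prompt": "Explain PCIe lanes in one paragraph.",
    "stream": False,
}, timeout=120)
print(resp.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
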

&lt;p&gt;The Ubuntu box in my setup could be an $80 Raspberry Pi 5 with 8GB RAM for the same purpose. It doesn't need to be fast. It needs to be &lt;em&gt;there&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Adding Next
&lt;/h2&gt;

&lt;p&gt;I have five more GPUs in storage (2× RTX 3070, 1× RTX 3080, 2× RTX 3060). The plan is to build an open-air rig and add them all to the Windows PC for both local inference and Vast.ai rental. That'd give me 62GB total VRAM — enough to run a 70B model locally or rent out as a cluster.&lt;/p&gt;

&lt;p&gt;The Ubuntu box is getting a dedicated role as a monitoring and alerting server. Right now my health checks are scattered across cron jobs on all three machines. Centralizing them makes debugging easier.&lt;/p&gt;

&lt;p&gt;And the Mac Mini? It stays the brain. More RAM would be nice (16GB is tight with Ollama + IDE + browser), but the M4's efficiency is hard to beat for interactive work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;One machine is a workstation. Two machines is a lab. Three machines is a distributed system that happens to fit on a desk.&lt;/p&gt;

&lt;p&gt;The magic isn't in the hardware — it's in the routing. Knowing which machine should handle which task, building fallback paths, and making sure a single machine going down doesn't kill your workflow.&lt;/p&gt;

&lt;p&gt;Start with what you have. Add a second machine when you feel the friction. Route intelligently. The three-machine setup happened organically — each machine earned its place by solving a problem the others couldn't.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, home lab setups, and turning hardware into income. New post every few days — follow along if that's your thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My GPU rental story: &lt;a href="https://dev.to/samhartley/i-rented-out-my-gpu-for-passive-income-heres-what-happened-after-my-first-week-2l8a"&gt;One card → six cards → $250/month passive&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>homelab</category>
      <category>devops</category>
    </item>
    <item>
      <title>3 Months Running Everything Locally — What Broke, What Worked, What I'd Do Differently</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Thu, 09 Apr 2026 08:03:48 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/3-months-running-everything-locally-what-broke-what-worked-what-id-do-differently-3e1b</link>
      <guid>https://forem.com/samhartley_dev/3-months-running-everything-locally-what-broke-what-worked-what-id-do-differently-3e1b</guid>
      <description>&lt;p&gt;It's been about three months since I made the switch. No more ChatGPT Plus. No more Claude subscription. No more Copilot. Everything I use day-to-day now runs on hardware sitting in my apartment — a Mac mini M4 and a PC with a mix of consumer GPUs.&lt;/p&gt;

&lt;p&gt;I wrote the enthusiastic "I ditched OpenAI" post back in early March. This is the honest follow-up, because a lot of people asked me to come back after the honeymoon phase.&lt;/p&gt;

&lt;p&gt;Some of it worked better than I expected. Some of it was genuinely annoying. Here's the unfiltered version.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup (short version)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mac mini M4&lt;/strong&gt;, 16 GB RAM — runs the orchestrator, small models, all the glue code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PC with an RTX 3060 12GB&lt;/strong&gt; (plus spare 3070s and 3080s in a drawer) — runs the heavier models via Ollama&lt;/li&gt;
&lt;li&gt;Everything talks over my LAN. Nothing leaves the house unless I tell it to.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I use this stuff for coding help, writing drafts, summarizing articles, transcribing voice notes, and a handful of personal automations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What actually worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Coding help for "normal" tasks
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;qwen3-coder:30b&lt;/code&gt; on the 3060 handles maybe 80% of what I used to ask GPT-4 for. Refactoring a function, explaining a gnarly regex, writing a quick shell script, sketching a React component. It's fast enough that I don't miss the cloud.&lt;/p&gt;

&lt;p&gt;The latency is actually &lt;em&gt;better&lt;/em&gt; than cloud APIs because there's no round trip to the US. I'll type a prompt and have tokens streaming back in under a second.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Voice notes and transcription
&lt;/h3&gt;

&lt;p&gt;I didn't expect this to be the killer app, but it is. I talk to my Mac mini while I'm cooking, and Whisper on-device dumps the text into a daily markdown file. Zero cost, zero privacy worries, zero "oh I forgot to renew my API key."&lt;/p&gt;

&lt;p&gt;This alone probably saved me more time than the coding stuff.&lt;/p&gt;
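
&lt;p&gt;The core of it is only a few lines. A sketch using the open-source &lt;code&gt;openai-whisper&lt;/code&gt; package; my real script watches a folder, so the file name and paths here are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import date
from pathlib import Path

import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # small enough to run happily on the Mac mini
result = model.transcribe("voice-note.m4a")

# Append the transcript to today's daily markdown file
daily = Path(f"notes/{date.today()}.md")
daily.parent.mkdir(exist_ok=True)
with daily.open("a") as f:
    f.write(result["text"].strip() + "\n\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
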

&lt;h3&gt;
  
  
  3. Batch jobs that used to rack up API bills
&lt;/h3&gt;

&lt;p&gt;Summarizing 200 PDFs. Tagging a folder of screenshots. Generating alt text for a blog's image archive. These were the things that made me nervous to click "run" on OpenAI. Now I just let them chew overnight on the PC. Electricity is cheaper than tokens.&lt;/p&gt;
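
&lt;p&gt;The skeleton of those overnight jobs is boring on purpose. A sketch with &lt;code&gt;pypdf&lt;/code&gt; and a local Ollama endpoint; the folder names, model choice, and truncation limit are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

import requests
from pypdf import PdfReader  # pip install pypdf

OLLAMA = "http://192.168.1.106:11434"  # the GPU box
Path("summaries").mkdir(exist_ok=True)

for pdf in Path("inbox").glob("*.pdf"):
    text = "\n".join(page.extract_text() or "" for page in PdfReader(pdf).pages)
    resp = requests.post(f"{OLLAMA}/api/generate", json={
        "model": "qwen3:4b",
        "prompt": f"Summarize this document in 5 bullet points:\n\n{text[:8000]}",
        "stream": False,
    }, timeout=600)
    Path("summaries", pdf.stem + ".md").write_text(resp.json()["response"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
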

&lt;h2&gt;
  
  
  What broke or annoyed me
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The "it's almost right" gap on hard stuff
&lt;/h3&gt;

&lt;p&gt;For anything genuinely hard — like debugging a weird async bug in a codebase the model hasn't seen, or reasoning about a tricky algorithm — the gap between local 30B models and frontier cloud models is still real. It's not huge, but it's there.&lt;/p&gt;

&lt;p&gt;I caught myself a few times thinking "I bet GPT-5 would get this in one shot" while I was on my fifth prompt with a local model. That's the honest truth.&lt;/p&gt;

&lt;p&gt;My workaround: I keep a very small pay-as-you-go budget for the 2-3 times a month I actually need a frontier model. Probably $5/month total. Way cheaper than any subscription.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context windows
&lt;/h3&gt;

&lt;p&gt;Local models with huge contexts exist, but they get &lt;em&gt;slow&lt;/em&gt;. Pasting a 50-file codebase into an 8B model and waiting is painful. I ended up writing a little script that does smart file selection instead of just dumping everything — basically a poor man's RAG. Worked better than I expected, but it was a weekend of yak-shaving I wasn't planning on.&lt;/p&gt;
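
&lt;p&gt;For the curious, the heart of that script is just keyword-overlap scoring. A rough sketch of the idea (the regex and thresholds are arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
from pathlib import Path

def pick_files(question: str, repo: str, top_n: int = 5) -&amp;gt; list[Path]:
    """Score files by keyword overlap with the question; keep the top N."""
    words = set(re.findall(r"[a-z_]{3,}", question.lower()))
    scored = []
    for path in Path(repo).rglob("*.py"):
        text = path.read_text(errors="ignore").lower()
        score = sum(text.count(w) for w in words)
        if score:
            scored.append((score, path))
    return [p for _, p in sorted(scored, reverse=True)[:top_n]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
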

&lt;h3&gt;
  
  
  3. Model sprawl
&lt;/h3&gt;

&lt;p&gt;I have like 14 models pulled right now. &lt;code&gt;qwen3-coder&lt;/code&gt; for code. &lt;code&gt;deepseek-r1&lt;/code&gt; for reasoning. A vision model for screenshots. Whisper for audio. A small embedding model. A translator. Each one made sense at the time. Now my Ollama directory is 180 GB and I can't remember what half of them are for.&lt;/p&gt;

&lt;p&gt;I need to do a spring cleaning. I keep putting it off.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The "is it plugged in" problem
&lt;/h3&gt;

&lt;p&gt;My PC is in another room. Sometimes my wife moves stuff and unplugs the switch. Sometimes Windows decides to reboot for updates at 3am. Sometimes the LAN cable gets bumped.&lt;/p&gt;

&lt;p&gt;Cloud APIs just... work. Local stuff requires you to be your own SRE. I have health checks now. I have a Telegram bot that pings me when Ollama stops responding. This is not the kind of "home lab tinkering" I signed up for, but here we are.&lt;/p&gt;
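
&lt;p&gt;If you want the same safety net, the health check really is about ten lines. A sketch; the bot token and chat ID are stand-ins (get yours from @BotFather):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

OLLAMA = "http://192.168.1.106:11434"  # the PC in the other room
BOT_TOKEN = "123456:ABC..."            # stand-in
CHAT_ID = "987654321"                  # stand-in

def check() -&amp;gt; None:
    try:
        requests.get(f"{OLLAMA}/api/tags", timeout=5)
    except requests.RequestException:
        requests.post(
            f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
            json={"chat_id": CHAT_ID, "text": "Ollama on the PC stopped responding"},
            timeout=10,
        )

check()  # run from cron every few minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
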

&lt;h2&gt;
  
  
  What I'd do differently if I started today
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Skip the "run everything locally or nothing" purity thing.&lt;/strong&gt; I wasted a few weeks trying to make local models do things they're just not good at yet. The sweet spot is a hybrid setup: local for the 90% of boring stuff, a small cloud budget for the 10% that genuinely needs a bigger brain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buy one strong GPU instead of three mediocre ones.&lt;/strong&gt; I have a drawer full of 3070s and 3080s I thought I'd use for "multi-GPU inference." In practice, a single 12GB card running a good 30B model handles almost everything I need, and juggling multiple cards adds complexity I don't want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Put the models behind one API, not five.&lt;/strong&gt; Ollama, llama.cpp, a Whisper server, a vision endpoint, embeddings... I should have stood up a single gateway that routes to the right backend. Instead I have five different base URLs in five different config files. It's fine, but it's ugly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write down what each model is for.&lt;/strong&gt; Past-me assumed future-me would remember. Future-me does not remember.&lt;/p&gt;

&lt;h2&gt;
  
  
  Would I go back?
&lt;/h2&gt;

&lt;p&gt;No. But I'd stop telling people "local AI is ready, ditch the subscriptions" like it's some binary choice. It's not.&lt;/p&gt;

&lt;p&gt;If you're a developer who likes tinkering, has a decent GPU already, and wants to cut a $20-40/month subscription — yeah, it's great. You'll learn a lot, you'll own your stack, and you won't feel weird about pasting private code into someone else's API.&lt;/p&gt;

&lt;p&gt;If you just want the best possible model for your work and don't want to babysit anything — stay on the cloud. There's no shame in it.&lt;/p&gt;

&lt;p&gt;I'm in the first camp, but I was wrong to pretend the second camp was being lazy. They were being reasonable.&lt;/p&gt;




&lt;p&gt;Anyone else running a mixed setup like this? I'm curious what other people's "local for X, cloud for Y" split looks like. Drop it in the comments — I'm especially interested in how you're handling the context-window problem without building your own RAG from scratch.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>selfhosted</category>
      <category>homelab</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Let AI Coding Agents Build My Side Projects for a Month — Here's My Honest Take</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:04:00 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-let-ai-coding-agents-build-my-side-projects-for-a-month-heres-my-honest-take-52l3</link>
      <guid>https://forem.com/samhartley_dev/i-let-ai-coding-agents-build-my-side-projects-for-a-month-heres-my-honest-take-52l3</guid>
      <description>&lt;p&gt;Last month I ran an experiment: instead of writing code myself, I delegated as much as possible to AI coding agents. Not just autocomplete — full autonomous agents that read files, run commands, and ship features.&lt;/p&gt;

&lt;p&gt;I've been running a home lab (Mac Mini M4 + a Windows PC with GPUs + an Ubuntu box) for a while now, and I already had &lt;a href="https://dev.to/samhartley/how-i-automated-my-entire-dev-workflow-with-ai-agents-running-247-on-a-mac-mini-1gdi"&gt;my dev workflow automated with AI agents&lt;/a&gt;. But this time I pushed further: what if the agents didn't just &lt;em&gt;help&lt;/em&gt; me code, but actually &lt;em&gt;wrote&lt;/em&gt; the code?&lt;/p&gt;

&lt;p&gt;Here's what happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I used a mix of tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (CLI) — my go-to for complex, multi-file tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex&lt;/strong&gt; (OpenAI) — good for one-shot generation with clear specs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local models via Ollama&lt;/strong&gt; — for quick iterations without burning API credits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow: I'd describe what I wanted in plain English, point the agent at the right directory, and let it work. Sometimes I'd review the output. Sometimes I'd just run the tests and ship it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Worked
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Boilerplate and scaffolding — 10/10
&lt;/h3&gt;

&lt;p&gt;This is where agents shine. "Set up a FastAPI project with SQLite, async endpoints, and Pydantic models for a subscriber management system." Done in 90 seconds. Would've taken me 20 minutes of copy-pasting from old projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Refactoring — 8/10
&lt;/h3&gt;

&lt;p&gt;"Refactor this 400-line script into separate modules with proper error handling." The agents were surprisingly good at this. They understood the intent, split things logically, and even added type hints I'd been too lazy to write.&lt;/p&gt;

&lt;p&gt;The two points I'm docking: they sometimes over-abstract. I'd ask for "cleaner code" and get an enterprise-grade factory pattern for a 50-line script. You still need taste.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Writing tests — 9/10
&lt;/h3&gt;

&lt;p&gt;This was the biggest win I didn't expect. I hate writing tests. The agents &lt;em&gt;love&lt;/em&gt; writing tests. "Write pytest tests for this module, cover edge cases" → comprehensive test suite in under a minute.&lt;/p&gt;

&lt;p&gt;The catch: you need to actually &lt;em&gt;read&lt;/em&gt; the tests. I caught a few that were testing the wrong thing — they'd pass, but they weren't testing what mattered. Still, it's a massive time saver.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Debugging — 7/10
&lt;/h3&gt;

&lt;p&gt;Hit or miss. For straightforward bugs ("this function returns None when it should return a list"), agents crush it. For subtle timing issues or race conditions? They'd suggest fixes that looked right but didn't address the root cause.&lt;/p&gt;

&lt;p&gt;My rule now: agents for the first pass, then I debug the debugger's output.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Documentation — 9/10
&lt;/h3&gt;

&lt;p&gt;README files, docstrings, API docs. Agents are better at this than I am, honestly. They're more thorough, more consistent, and they don't get lazy halfway through.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Didn't Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Complex architecture decisions
&lt;/h3&gt;

&lt;p&gt;"Should I use Redis or just in-memory caching for this?" The agent will give you a perfectly reasonable answer either way. That's the problem — it doesn't &lt;em&gt;know&lt;/em&gt; your constraints the way you do. How many users? What's your memory budget? Are you running on a Raspberry Pi or a 128GB server?&lt;/p&gt;

&lt;p&gt;I stopped asking agents for architecture advice. I make the decision, then let them implement it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-service orchestration
&lt;/h3&gt;

&lt;p&gt;When I needed to coordinate between my Telegram bot, a background worker, and a database — and they all needed to agree on a shared state model — the agent would nail each piece individually but miss the integration points. I'd end up with three perfectly written services that didn't quite talk to each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anything involving my specific hardware
&lt;/h3&gt;

&lt;p&gt;"Configure this for my RTX 3060 with 12GB VRAM running on Windows with WSL2." The agents would give me generic CUDA setup instructions instead of what actually works on my rig. Local knowledge is still human knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Over the month, across 4 side projects:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before (manual)&lt;/th&gt;
&lt;th&gt;After (agent-assisted)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time to MVP&lt;/td&gt;
&lt;td&gt;~2 weeks&lt;/td&gt;
&lt;td&gt;~4 days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lines written by me&lt;/td&gt;
&lt;td&gt;~80%&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bugs in first deploy&lt;/td&gt;
&lt;td&gt;~8-12&lt;/td&gt;
&lt;td&gt;~5-8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time spent reviewing agent code&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;~2h/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall velocity&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;td&gt;~2.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 2.5x multiplier is real but misleading. I'm faster at &lt;em&gt;producing code&lt;/em&gt;, but I spend more time &lt;em&gt;reviewing&lt;/em&gt; code. The net gain is still significant — maybe 1.8x when you account for review time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Changed About My Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. I write specs, not code.&lt;/strong&gt;&lt;br&gt;
My job shifted from "programmer" to "technical product manager." I write clear descriptions of what I want, define the interfaces, and let the agent fill in the implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. I review harder.&lt;/strong&gt;&lt;br&gt;
When I wrote the code myself, I'd eyeball it and move on. When an agent writes it, I actually read every line. Paradoxically, this has made me a better reviewer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. I prototype faster, throw away more.&lt;/strong&gt;&lt;br&gt;
Since generating a prototype takes minutes instead of hours, I build 2-3 approaches and pick the best one. This was a luxury I couldn't afford before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. I still write the hard parts.&lt;/strong&gt;&lt;br&gt;
State machines, complex business logic, performance-critical paths — I write these myself. Not because the agents can't, but because I need to &lt;em&gt;understand&lt;/em&gt; them deeply to maintain them later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Here's what nobody in the "AI will replace developers" discourse talks about: &lt;strong&gt;you need to be a good developer to use AI coding agents well.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every time the agent produced something subtly wrong, I caught it because I knew what right looked like. Every time it over-engineered a solution, I could simplify it because I understood the problem. Every time it picked the wrong library or pattern, I could redirect it because I had opinions forged by years of mistakes.&lt;/p&gt;

&lt;p&gt;Junior devs using these tools will ship faster. They'll also ship more bugs, more over-engineering, and more "it works but nobody can maintain it" code. The tools amplify whatever skill level you bring.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;If you're not using AI coding agents yet, start with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tests first.&lt;/strong&gt; Let agents write your test suites. Low risk, high reward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate second.&lt;/strong&gt; Project setup, CRUD endpoints, config files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactoring third.&lt;/strong&gt; Point it at your worst file and say "clean this up."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex features last.&lt;/strong&gt; Only after you trust the tool and know its limits.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And always, always review the output. The agent is your fastest junior developer. It's also your most confident one — and confidence without experience is how bugs get shipped.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about building things with AI, self-hosting, and turning side projects into income. If you're into that, I post a new article every few days.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Running my own AI setup locally? &lt;a href="https://dev.to/samhartley/i-ditched-openai-and-run-ai-locally-for-free-heres-how-57o6"&gt;Here's how I do it for $0/month.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>productivity</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sat, 04 Apr 2026 15:44:11 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-10d5</link>
      <guid>https://forem.com/samhartley_dev/i-expanded-my-gpu-rental-fleet-to-6-cards-heres-what-happened-to-my-earnings-10d5</guid>
      <description>&lt;h1&gt;
  
  
  I Expanded My GPU Rental Fleet to 6 Cards — Here's What Happened to My Earnings
&lt;/h1&gt;

&lt;p&gt;A few weeks ago I wrote about renting out my single RTX 3060 on Vast.ai for passive income. The experiment worked better than I expected, so I did what any reasonable person would do: I went and dug out the five other GPUs sitting in my storage room.&lt;/p&gt;

&lt;p&gt;This is the honest follow-up. What actually happened when I went from 1 GPU to 6.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Backstory
&lt;/h2&gt;

&lt;p&gt;I had a bunch of GPUs from an older setup — two RTX 3070s, one RTX 3080, and two more RTX 3060s. They were collecting dust. The PC they came from got upgraded, the cards went into cardboard boxes, the boxes went under a shelf.&lt;/p&gt;

&lt;p&gt;Total VRAM across all six: around 62GB. Combined retail value when new: probably $3,000+. Current value sitting in boxes: $0/month.&lt;/p&gt;

&lt;p&gt;The math wasn't complicated.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Expansion Actually Took
&lt;/h2&gt;

&lt;p&gt;Here's what I underestimated: it's not just "plug cards in, profit."&lt;/p&gt;

&lt;h3&gt;
  
  
  The hardware side
&lt;/h3&gt;

&lt;p&gt;You can't just stack 6 GPUs into a regular PC case. I had to think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PCIe slots and bandwidth.&lt;/strong&gt; A standard ATX board has maybe 2-3 real x16 slots. For 6 cards, you're looking at risers, which means a mining-style open frame or a server chassis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power.&lt;/strong&gt; Each card pulls 150-250W under load. Six cards = potentially 1,200-1,500W just in GPU power. Plus CPU, drives, RAM. My existing 850W PSU was not going to cut it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cooling.&lt;/strong&gt; Cards in a tight case thermal-throttle each other. Open frame was the answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ended up using an open-air mining frame I found used for cheap, two PSUs daisy-chained (a sketchy-but-common approach in the mining world), and PCIe risers.&lt;/p&gt;

&lt;p&gt;Setup time: about a full weekend.&lt;/p&gt;

&lt;h3&gt;
  
  
  The software side
&lt;/h3&gt;

&lt;p&gt;Getting all six cards recognized wasn't plug-and-play either. I run Windows on the main PC (easier driver support for NVIDIA), and Vast.ai has a Windows daemon that mostly works — except when it doesn't.&lt;/p&gt;

&lt;p&gt;A few issues I hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two risers were flaky and caused cards to drop off&lt;/li&gt;
&lt;li&gt;One 3070 had a driver conflict until I did a clean DDU reinstall&lt;/li&gt;
&lt;li&gt;Vast.ai's host dashboard showed only 5 GPUs after setup; it took me an hour to work out why the sixth wasn't being detected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total debugging time before everything was stable: another weekend.&lt;/p&gt;
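
&lt;p&gt;After that hour of head-scratching I added a boot-time sanity check. A minimal sketch in Python; it assumes nvidia-smi is on the PATH, and the expected count is obviously specific to my rig:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# count_gpus.py: fail loudly when fewer cards show up than expected
# (a sketch; assumes nvidia-smi is installed and on PATH)
import subprocess
import sys

EXPECTED = 6  # my fleet size

out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
gpus = [line for line in out.splitlines() if line.startswith("GPU ")]

print(f"{len(gpus)}/{EXPECTED} GPUs visible")
for line in gpus:
    print(" ", line)

sys.exit(0 if len(gpus) == EXPECTED else 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;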

&lt;h2&gt;
  
  
  The Earnings Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Cards&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Weekly Earnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before (1 card)&lt;/td&gt;
&lt;td&gt;RTX 3060&lt;/td&gt;
&lt;td&gt;12GB&lt;/td&gt;
&lt;td&gt;~$12-18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After (6 cards)&lt;/td&gt;
&lt;td&gt;3060 × 3, 3070 × 2, 3080 × 1&lt;/td&gt;
&lt;td&gt;62GB&lt;/td&gt;
&lt;td&gt;~$65-95&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Not exactly linear scaling. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Demand is unpredictable.&lt;/strong&gt; Sometimes 4 of my 6 cards are rented simultaneously. Sometimes 1. The RTX 3080 gets picked up more often than the 3060s: it's the fastest card in the fleet, and its 10GB of VRAM is still enough for the LLM inference jobs that dominate demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not all hours are equal.&lt;/strong&gt; Utilization spikes during US business hours and sags when the US sleeps. Turkey is far enough ahead that "overnight for me" overlaps with peak US working hours, which actually helps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing matters more than I thought.&lt;/strong&gt; I dropped my per-card price slightly and saw utilization go up noticeably. A few cents per hour makes a real difference when renters are comparing a dozen similar options.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current Monthly Run Rate
&lt;/h2&gt;

&lt;p&gt;Across all six cards, I'm averaging around &lt;strong&gt;$280-340/month&lt;/strong&gt; before electricity.&lt;/p&gt;

&lt;p&gt;Power costs are real. Six GPUs under load is serious wattage. My electricity bill went up — I haven't calculated the exact delta yet because my bill is shared (I'm not the only one using power in my building), but I'd estimate $40-60/month in additional costs.&lt;/p&gt;
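
&lt;p&gt;If you want to sanity-check that estimate, the back-of-envelope math looks like this. The wattage, utilization, and $/kWh figures are assumptions, not my metered numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# back-of-envelope power cost (assumed figures, not my metered bill)
watts_per_card = 200   # rough average under load
cards = 6
utilization = 0.4      # fraction of the month the cards are actually working
rate_kwh = 0.12        # assumed $/kWh

kwh = watts_per_card * cards / 1000 * 720 * utilization  # 720 hours/month
print(f"~{kwh:.0f} kWh -&gt; ${kwh * rate_kwh:.0f}/month")  # ~346 kWh -&gt; $41/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;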

&lt;p&gt;&lt;strong&gt;Net: roughly $220-280/month in real passive income.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Is that life-changing? No. Is it meaningful for money that was doing nothing? Absolutely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Start with a proper open-frame rig, not a cobbled-together case.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The mining frame was cheap but took time to source. If I were doing this again I'd budget for it from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Get a proper high-wattage PSU setup.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Running two PSUs linked together works but it's inelegant. A server PSU with the right adapter is cleaner and safer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Test each card individually before combining them.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I wasted time troubleshooting "which card is the problem" when I could've confirmed each one worked before building the full rig.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set minimum job duration.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Short jobs (under an hour) rack up overhead — container spin-up time, handshaking — without much earnings. I set a minimum of 2 hours and earnings-per-hour improved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unexpected Part
&lt;/h2&gt;

&lt;p&gt;I expected this to be a boring passive income setup. It mostly is. But I've learned a surprising amount about how the AI inference market actually works by watching what gets rented and when.&lt;/p&gt;

&lt;p&gt;Most renters are running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tuning jobs (need sustained GPU hours)&lt;/li&gt;
&lt;li&gt;LLM inference (need VRAM more than raw compute)&lt;/li&gt;
&lt;li&gt;Image generation (FLUX, Stable Diffusion variants)&lt;/li&gt;
&lt;li&gt;Dev environments (people testing stuff without committing to a cloud contract)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Watching the demand patterns is actually interesting data about what the AI dev community is building right now. The 3080 almost always goes first — 10GB VRAM hits a sweet spot for smaller Llama and Mistral models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is It Worth It?
&lt;/h2&gt;

&lt;p&gt;Depends on your situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yes, if:&lt;/strong&gt; You already have the GPUs and they're sitting idle. The marginal cost of setting this up is mostly your time, and the monthly return is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maybe, if:&lt;/strong&gt; You'd have to buy the GPUs. At current used-market prices, the payback period is 6-12 months depending on utilization. That's not terrible, but it's not an obvious win either.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No, if:&lt;/strong&gt; You're renting out your daily-driver GPU. The rental platform can grab your card at inconvenient times. Keep at least one card reserved for your own use.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;I'm looking at adding the Ubuntu server I already have running as a CPU-only Vast.ai host for smaller workloads. Less money per unit, but zero additional hardware cost.&lt;/p&gt;

&lt;p&gt;Also thinking about whether it makes sense to eventually get into the dedicated hosting side rather than the rental marketplace — more stable income, more setup required. Still researching.&lt;/p&gt;

&lt;p&gt;For now, 6 cards, ~$250/month net, and a weekend's worth of setup. I'll take it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Questions about the rig setup or Vast.ai specifics? Drop them in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="http://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;Check out my automation work on Fiverr&lt;/a&gt;&lt;br&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow along on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>passiveincome</category>
      <category>selfhosted</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Use Telegram as My DevOps Dashboard — No Web UI, No VPN, Just Works</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:02:32 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-use-telegram-as-my-devops-dashboard-no-web-ui-no-vpn-just-works-10bn</link>
      <guid>https://forem.com/samhartley_dev/i-use-telegram-as-my-devops-dashboard-no-web-ui-no-vpn-just-works-10bn</guid>
      <description>&lt;p&gt;I have a bunch of things running 24/7 on a Mac Mini. GPU rental jobs, a Garmin watch face updater, a Fiverr inbox monitor, a funding rate tracker, a few cron jobs. &lt;/p&gt;

&lt;p&gt;For a while I ran a Grafana dashboard to keep an eye on them. It looked impressive. I never opened it.&lt;/p&gt;

&lt;p&gt;What I actually do is check my phone. So I built the monitoring layer there.&lt;/p&gt;

&lt;p&gt;Here's the setup: a lightweight Telegram bot that serves as my entire DevOps interface. Status checks, alerts, and even simple commands — all from the Telegram app I already have open.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not a Proper Dashboard?
&lt;/h2&gt;

&lt;p&gt;Honest answer: dashboards are for teams. If you're a solo dev with a few projects, a fancy web UI creates more overhead than it solves.&lt;/p&gt;

&lt;p&gt;Problems I had with Grafana:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPN required to reach it from outside my home network&lt;/li&gt;
&lt;li&gt;Needs to stay running (another thing to maintain)&lt;/li&gt;
&lt;li&gt;I never actually opened the browser tab&lt;/li&gt;
&lt;li&gt;It didn't &lt;em&gt;push&lt;/em&gt; me information — I had to pull it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telegram flips this: it pushes alerts to me. I glance at my phone, see what's happening, and move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Services (cron jobs, Python scripts, shell scripts)
  ↓
Central alert script: notify.sh
  ↓
Telegram Bot API → my phone
  ↓ (optional)
Command bot → runs queries on server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Outbound alerts&lt;/strong&gt; — services send me messages when things happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inbound commands&lt;/strong&gt; — I can ask the bot questions from my phone&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Part 1: Dead Simple Alert Script
&lt;/h2&gt;

&lt;p&gt;Every service on my server can call this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# notify.sh — send a Telegram message from any script&lt;/span&gt;
&lt;span class="c"&gt;# Usage: ./notify.sh "Your GPU job finished"&lt;/span&gt;

&lt;span class="nv"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_bot_token_here"&lt;/span&gt;
&lt;span class="nv"&gt;CHAT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_chat_id_here"&lt;/span&gt;
&lt;span class="nv"&gt;MESSAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://api.telegram.org/bot&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/sendMessage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;chat_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CHAT_ID&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MESSAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nv"&gt;parse_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"HTML"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Any script can now send me a message in one line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./notify.sh &lt;span class="s2"&gt;"✅ GPU rental job completed — earned &lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;&lt;span class="s2"&gt;.40"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"⚠️ Funding rate dropped below threshold on LYN_USDT"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"📬 New Fiverr inquiry from user987"&lt;/span&gt;
./notify.sh &lt;span class="s2"&gt;"❌ Garmin watch face API returned 503"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I spent maybe 20 minutes on this. It replaced a monitoring stack I spent days configuring.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Examples from My Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPU rental monitor:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Runs every 30 min&lt;/span&gt;
&lt;span class="nv"&gt;earnings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;./check_gpu_earnings.sh&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$earnings&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
  ./notify.sh &lt;span class="s2"&gt;"💰 GPU earned: &lt;/span&gt;&lt;span class="nv"&gt;$earnings&lt;/span&gt;&lt;span class="s2"&gt; today"&lt;/span&gt;
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Funding rate watcher:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Python script, runs every 15 min via cron
&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_funding_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LYN_USDT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# negative = people paying longs
&lt;/span&gt;    &lt;span class="nf"&gt;notify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔥 LYN funding rate: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;% — worth checking&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Daily summary (9 AM cron):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;msg&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"📊 Daily Summary — &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;

GPU Jobs: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_gpu_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; completed
Funding Earned: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_funding_total&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;
Fiverr Inquiries: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_fiverr_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;
Watch Face Updates: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;get_garmin_count&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;

Server uptime: &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uptime&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

./notify.sh &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$msg&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I wake up, check my phone, and immediately know if anything needs attention. No browser, no VPN, no dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Part 2: The Command Interface
&lt;/h2&gt;

&lt;p&gt;Outbound alerts are great. But sometimes I want to query the server from my phone.&lt;/p&gt;

&lt;p&gt;I wrote a simple Python bot that listens for commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;telebot&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;

&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;ALLOWED_USER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123456789&lt;/span&gt;  &lt;span class="c1"&gt;# your Telegram user ID
&lt;/span&gt;
&lt;span class="n"&gt;bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;telebot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;TeleBot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BOT_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;COMMANDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uptime &amp;amp;&amp;amp; free -h &amp;amp;&amp;amp; df -h /&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./check_gpu_status.sh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/funding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3 check_funding_rates.py --summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/services&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ps aux | grep -E &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(python|node|ollama)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; | grep -v grep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nd"&gt;@bot.message_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;commands&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;handle_command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_USER&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not authorized.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;

    &lt;span class="n"&gt;cmd_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# handle /status@botname format
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cmd_text&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;COMMANDS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cmd_text&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
            &lt;span class="n"&gt;shell&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reply_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;```
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endraw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;polling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;none_stop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now from Telegram I can type &lt;code&gt;/status&lt;/code&gt; and get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt; 11:23:15 up 14 days, 3:41,  1 user
Mem:   16Gi   8.2Gi   7.8Gi
/dev/sda1        245G   82G  163G  34%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or &lt;code&gt;/funding&lt;/code&gt; and get the current rate snapshot. &lt;/p&gt;

&lt;p&gt;The key detail: the &lt;code&gt;ALLOWED_USER&lt;/code&gt; check. Only my Telegram ID can run commands. Everyone else gets "Not authorized." Bots are public in the sense that anyone who finds yours can message it, so you need to validate the sender yourself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping It Running
&lt;/h2&gt;

&lt;p&gt;The command bot needs to stay alive. I use a simple systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Telegram DevOps Bot&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /home/user/telegram-bot/bot.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;systemctl enable telegram-bot &amp;amp;&amp;amp; systemctl start telegram-bot&lt;/code&gt; — and it survives reboots.&lt;/p&gt;

&lt;p&gt;On macOS (my setup) I use a launchd plist, same concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Actually Get Alerts For
&lt;/h2&gt;

&lt;p&gt;Not everything. Alert fatigue is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert on:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Job completions (GPU task done, funding cycle closed)&lt;/li&gt;
&lt;li&gt;❌ Errors that need action&lt;/li&gt;
&lt;li&gt;📬 New customer inquiries (Fiverr inbox)&lt;/li&gt;
&lt;li&gt;⚠️ Thresholds crossed (rate drops, disk usage, memory spikes)&lt;/li&gt;
&lt;li&gt;📊 Daily summaries (once a day, morning)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Silence:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Routine successful runs (no news is good news)&lt;/li&gt;
&lt;li&gt;Health checks that pass&lt;/li&gt;
&lt;li&gt;Regular cron completions with no anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: every message I receive from the bot is something I actually care about. If I'm ignoring 80% of notifications, I'm alerting on the wrong things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Cost
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Telegram Bot API: &lt;strong&gt;free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;curl&lt;/code&gt; command: comes with your OS&lt;/li&gt;
&lt;li&gt;Python + telebot library: &lt;strong&gt;free&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Running this bot: negligible CPU, ~20MB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My entire monitoring setup costs $0/month and runs on the same Mac Mini as everything else. No SaaS, no cloud logging, no dashboard subscription.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Months Later
&lt;/h2&gt;

&lt;p&gt;I send about 15-20 alerts per day. Daily summary at 9 AM, event-driven messages the rest of the day. I check my phone, see green checkmarks and earnings summaries, and know the server is doing its job.&lt;/p&gt;

&lt;p&gt;The one time the GPU host went offline, I got a message within 5 minutes. Fixed it from my phone during lunch.&lt;/p&gt;

&lt;p&gt;That's the whole point: not more tooling, just the right interface for how I actually work.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Got a monitoring setup you like? Drop it in the comments — always curious what others are running.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;→ &lt;a href="http://www.fiverr.com/s/XLyg" rel="noopener noreferrer"&gt;I build these kinds of automation setups on Fiverr&lt;/a&gt;&lt;br&gt;
→ &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Follow CelebiBots on Telegram&lt;/a&gt;&lt;/p&gt;

</description>
      <category>telegram</category>
      <category>devops</category>
      <category>selfhosted</category>
      <category>automation</category>
    </item>
    <item>
      <title>I Rented Out My GPU for Passive Income — Here’s What Happened After My First Week</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Sat, 21 Mar 2026 08:02:08 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/i-rented-out-my-gpu-for-passive-income-heres-what-happened-after-my-first-week-2689</link>
      <guid>https://forem.com/samhartley_dev/i-rented-out-my-gpu-for-passive-income-heres-what-happened-after-my-first-week-2689</guid>
      <description>&lt;p&gt;I had an RTX 3060 sitting on a shelf.&lt;/p&gt;

&lt;p&gt;Not broken. Not old. Just... not doing anything. My Windows PC runs models when I need them, but most of the time it's idle. The fans spin, the power draw ticks along, and that 12GB of VRAM just sits there.&lt;/p&gt;

&lt;p&gt;A week ago I connected it to &lt;a href="https://vast.ai" rel="noopener noreferrer"&gt;Vast.ai&lt;/a&gt; — a GPU marketplace where people rent compute time. No code required. You install a daemon, set a price, and wait for someone to rent your machine.&lt;/p&gt;

&lt;p&gt;Here's what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Didn't Just Mine Crypto
&lt;/h2&gt;

&lt;p&gt;First thing people ask: "Why not just mine?"&lt;/p&gt;

&lt;p&gt;Short answer: it's 2026, the margins are brutal, and I didn't want to deal with it. GPU compute rental is different — you're renting raw processing power, and the demand right now is AI inference and training. People building LLMs, running diffusion models, doing batch jobs.&lt;/p&gt;

&lt;p&gt;The upside: no mining pool setup, no daily coin price anxiety, no special software. Your machine runs Docker containers, gets paid per second of use, you get a payout.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup (Genuinely About 90 Minutes)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Created a Vast.ai account&lt;/li&gt;
&lt;li&gt;Installed the host daemon on Windows (it's a one-click installer)&lt;/li&gt;
&lt;li&gt;Set my RTX 3060 12GB at &lt;strong&gt;$0.15/hour&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Went to bed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No configuration rabbit holes, no drivers to hunt down. The daemon manages everything — spinning up containers, cleaning up after renters, reporting uptime.&lt;/p&gt;

&lt;p&gt;I set the minimum rental duration to 1 hour so I wouldn't get hit with a dozen 5-minute jobs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Week Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Day&lt;/th&gt;
&lt;th&gt;Hours Rented&lt;/th&gt;
&lt;th&gt;Earnings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Day 1&lt;/td&gt;
&lt;td&gt;3.2h&lt;/td&gt;
&lt;td&gt;$0.48&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 2&lt;/td&gt;
&lt;td&gt;11.5h&lt;/td&gt;
&lt;td&gt;$1.73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 3&lt;/td&gt;
&lt;td&gt;0h&lt;/td&gt;
&lt;td&gt;$0.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 4&lt;/td&gt;
&lt;td&gt;16.8h&lt;/td&gt;
&lt;td&gt;$2.52&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 5&lt;/td&gt;
&lt;td&gt;9.1h&lt;/td&gt;
&lt;td&gt;$1.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 6&lt;/td&gt;
&lt;td&gt;22.0h&lt;/td&gt;
&lt;td&gt;$3.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Day 7&lt;/td&gt;
&lt;td&gt;14.4h&lt;/td&gt;
&lt;td&gt;$2.16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Week 1 total: ~$11.56&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Annualized naively? About $600/year. Which would be great except day 3 was $0 and utilization is inconsistent.&lt;/p&gt;

&lt;p&gt;A more realistic steady-state: &lt;strong&gt;$50–100/month&lt;/strong&gt; depending on demand. At $0.15/hour there's a hard ceiling anyway: even 100% utilization only comes to about $108/month.&lt;/p&gt;
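
&lt;p&gt;The quick math, if you want to re-run it with your own price:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# one card at $0.15/hour; utilization is the only real variable
rate = 0.15
hours_month = 720  # 24 * 30
for util in (0.3, 0.5, 0.7, 1.0):
    print(f"{util:.0%} utilization -&gt; ${rate * hours_month * util:.0f}/month")
# 30% -&gt; $32, 50% -&gt; $54, 70% -&gt; $76, 100% -&gt; $108
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;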




&lt;h2&gt;
  
  
  What People Actually Rent It For
&lt;/h2&gt;

&lt;p&gt;Vast.ai shows you the jobs (anonymized). Mine has been used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running &lt;code&gt;vllm&lt;/code&gt; inference servers (Mistral, Qwen, LLaMA variants)&lt;/li&gt;
&lt;li&gt;Stable Diffusion batch jobs&lt;/li&gt;
&lt;li&gt;Some kind of PyTorch training run that lasted 8 hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 3060 with 12GB VRAM is actually sweet for inference — fits most 7B–13B models at 4-bit quantization without breaking a sweat. It's not the fastest card, but it's affordable to rent, which means demand is there.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Downsides
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You can't use your GPU while it's rented.&lt;/strong&gt; Sounds obvious, but the practical implication: if you need your machine for local inference and someone's rented it, tough luck. I started routing heavy tasks to my Mac Mini during rental periods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Electricity.&lt;/strong&gt; My RTX 3060 at load pulls about 150W. At Turkish electricity rates, that's roughly $8–15/month in power at typical utilization. So the net is lower than the gross numbers above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's genuinely passive but not predictable.&lt;/strong&gt; Day 3 was $0. Day 6 was near-full utilization. There's no way to forecast demand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Payouts have a minimum.&lt;/strong&gt; Vast.ai pays out once you hit a threshold. Nothing to worry about, just something to know going in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm Going to Try Next
&lt;/h2&gt;

&lt;p&gt;The obvious play is adding more GPUs. I have a few more in storage — an RTX 3080 and some older 3060s. If I rack those up, the math gets interesting:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Monthly (50% util)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 (current)&lt;/td&gt;
&lt;td&gt;$0.15/h&lt;/td&gt;
&lt;td&gt;~$54&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3080 10GB&lt;/td&gt;
&lt;td&gt;$0.20/h&lt;/td&gt;
&lt;td&gt;~$72&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2x RTX 3070 8GB&lt;/td&gt;
&lt;td&gt;$0.16/h&lt;/td&gt;
&lt;td&gt;~$115&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's ~$240/month without doing anything after setup. At 70% utilization: ~$340.&lt;/p&gt;
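
&lt;p&gt;For anyone re-running those numbers, the table is just this formula, with utilization as the big unknown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# the formula behind the table: monthly = hourly rate * 720 h * utilization
fleet = [
    ("RTX 3060", 1, 0.15),
    ("RTX 3080 10GB", 1, 0.20),
    ("RTX 3070 8GB", 2, 0.16),
]

def fleet_monthly(util):
    return sum(count * rate * 720 * util for _, count, rate in fleet)

print(f"50% util: ${fleet_monthly(0.5):.0f}/month")  # $241
print(f"70% util: ${fleet_monthly(0.7):.0f}/month")  # $338
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;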

&lt;p&gt;The real work is physical — pulling GPUs from storage, getting them into a rig, managing thermals. But the software side is almost zero maintenance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Should You Try This?
&lt;/h2&gt;

&lt;p&gt;If you have a spare GPU collecting dust: &lt;strong&gt;yes, probably.&lt;/strong&gt; The setup is low friction, the risk is near-zero (worst case, you uninstall the daemon and move on), and even modest earnings beat $0.&lt;/p&gt;

&lt;p&gt;If you're thinking about buying a GPU specifically for this: &lt;strong&gt;do the math carefully.&lt;/strong&gt; At current rates, an RTX 3060 costs ~$300–350 used. Payback period at $50/month is 6–7 months, which is fine — but don't expect to fund your retirement from a single card.&lt;/p&gt;

&lt;p&gt;The real value for me isn't the income (yet). It's that I now have a system running, I understand the demand patterns, and I know the path to scale looks viable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Platform: &lt;a href="https://vast.ai" rel="noopener noreferrer"&gt;Vast.ai&lt;/a&gt; (there's also RunPod if you want alternatives)&lt;/li&gt;
&lt;li&gt;Time to set up: ~90 minutes&lt;/li&gt;
&lt;li&gt;Technical skill required: Know how to install software on Windows&lt;/li&gt;
&lt;li&gt;Ongoing maintenance: Almost none&lt;/li&gt;
&lt;li&gt;Realistic earnings: $50–130/month per mid-tier GPU (the top end assumes a faster card than a 3060, or near-constant rental)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions if you try this and run into something weird. The daemon is pretty solid but there's always an edge case or two.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write about running AI locally, automation side projects, and occasionally making money from hardware that would otherwise just collect dust. If any of this is useful, feel free to follow.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>passiveincome</category>
      <category>ai</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Local LLMs vs Cloud APIs — A Real Cost Comparison (2026)</title>
      <dc:creator>Sam Hartley</dc:creator>
      <pubDate>Thu, 19 Mar 2026 08:03:20 +0000</pubDate>
      <link>https://forem.com/samhartley_dev/local-llms-vs-cloud-apis-a-real-cost-comparison-2026-2igh</link>
      <guid>https://forem.com/samhartley_dev/local-llms-vs-cloud-apis-a-real-cost-comparison-2026-2igh</guid>
      <description>&lt;p&gt;"Just use ChatGPT" — sure, until your API bill hits $500/month.&lt;/p&gt;

&lt;p&gt;I've been running both local and cloud AI for over a year. Here are the real numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cloud:&lt;/strong&gt; OpenAI GPT-4o, Anthropic Claude Sonnet, Google Gemini Pro&lt;br&gt;
&lt;strong&gt;Local:&lt;/strong&gt; Ollama with Qwen 3.5 9B (Mac Mini M4) + Qwen 3 Coder 30B (RTX 3060)&lt;/p&gt;

&lt;p&gt;Workload: ~500 queries/day — code review, content generation, customer support, data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monthly Cloud API Costs
&lt;/h2&gt;

&lt;p&gt;For 500 queries/day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI GPT-4o (200 queries): ~$90/month&lt;/li&gt;
&lt;li&gt;Anthropic Claude Sonnet (200 queries): ~$72/month&lt;/li&gt;
&lt;li&gt;Google Gemini Pro (100 queries): ~$25/month&lt;/li&gt;
&lt;li&gt;Total: ~$187/month&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Monthly Local Setup Costs
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Mac Mini M4 (already owned): $0&lt;/li&gt;
&lt;li&gt;RTX 3060 12GB (used, eBay): $150 one-time&lt;/li&gt;
&lt;li&gt;Electricity 24/7: ~$12/month&lt;/li&gt;
&lt;li&gt;Total: ~$12/month ongoing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Break-even: less than 1 month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality Comparison (What Surprised Me)
&lt;/h2&gt;

&lt;p&gt;For 80% of daily tasks, local models are good enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;General chat: Qwen 3.5 9B lands at roughly 90% of GPT-4o quality&lt;/li&gt;
&lt;li&gt;Code generation: Qwen 3 Coder 30B is close to Claude Sonnet (~85-90%)&lt;/li&gt;
&lt;li&gt;Simple Q&amp;amp;A and extraction: any 7B model matches cloud (~95%+)&lt;/li&gt;
&lt;li&gt;Complex multi-step reasoning: cloud still wins here&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach I Use
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User query
  -&amp;gt; Simple? (Q&amp;amp;A, formatting, extraction)
       -&amp;gt; Local Qwen 3.5 9B  (free, instant)
  -&amp;gt; Code-heavy?
       -&amp;gt; Local Qwen 3 Coder 30B  (free, ~12s)
  -&amp;gt; Complex reasoning?
       -&amp;gt; Cloud Claude Sonnet  ($0.003-0.015 per query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: cloud costs dropped from ~$187/month to ~$25/month.&lt;/p&gt;
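
&lt;p&gt;The router itself is unglamorous. Here's a minimal sketch of the dispatch logic: the keyword classifier stands in for my real (slightly smarter) one, the local call uses Ollama's standard HTTP API, and the model tags are whatever you've pulled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# minimal dispatch sketch; the keyword checks stand in for a real classifier
import requests

def ask_local(model: str, prompt: str) -&gt; str:
    # Ollama's standard local HTTP API
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return r.json()["response"]

def ask_cloud(prompt: str) -&gt; str:
    # placeholder: wire up your paid API client of choice here
    raise NotImplementedError

def route(query: str) -&gt; str:
    q = query.lower()
    if any(k in q for k in ("refactor", "traceback", "def ", "function")):
        return ask_local("qwen3-coder:30b", query)  # code-heavy: local coder
    if any(k in q for k in ("why", "plan", "design", "trade-off")):
        return ask_cloud(query)                     # complex reasoning: paid
    return ask_local("qwen3.5:9b", query)           # default: small local model
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The design choice that matters is the default: anything the classifier isn't sure about falls through to the free local model, not the paid one.&lt;/p&gt;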

&lt;h2&gt;
  
  
  Hidden Costs of Cloud
&lt;/h2&gt;

&lt;p&gt;Things people forget:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rate limits — hit the ceiling during a deadline? Too bad.&lt;/li&gt;
&lt;li&gt;Latency — 500-2000ms per request vs 100-500ms local&lt;/li&gt;
&lt;li&gt;Privacy — your code and data live on someone else's server&lt;/li&gt;
&lt;li&gt;Vendor lock-in — OpenAI changes pricing, you're stuck&lt;/li&gt;
&lt;li&gt;Downtime — their outage = your workflow stops&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Hidden Costs of Local
&lt;/h2&gt;

&lt;p&gt;Being fair:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initial hardware — $150-500 for a GPU (pays off in under a month)&lt;/li&gt;
&lt;li&gt;Setup time — 30 minutes with Ollama these days&lt;/li&gt;
&lt;li&gt;Storage — models are 4-40GB each&lt;/li&gt;
&lt;li&gt;Power — $10-15/month for 24/7 operation&lt;/li&gt;
&lt;li&gt;No frontier models — you won't run GPT-4 locally yet&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started in 10 Minutes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
ollama pull qwen3.5:9b

&lt;span class="c"&gt;# Start chatting&lt;/span&gt;
ollama run qwen3.5:9b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time: 10 minutes. Total cost: $0.&lt;/p&gt;




&lt;p&gt;Need help setting up a local AI server? I do this professionally.&lt;/p&gt;

&lt;p&gt;Follow along: &lt;a href="https://t.me/celebibot_en" rel="noopener noreferrer"&gt;Telegram @celebibot_en&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Sam Hartley — building AI things that actually work.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>selfhosted</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
