<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: RubberDuckOps</title>
    <description>The latest articles on Forem by RubberDuckOps (@rubberduckops).</description>
    <link>https://forem.com/rubberduckops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3696079%2F378d44fb-35dc-4075-8f33-a31d1e10ce94.png</url>
      <title>Forem: RubberDuckOps</title>
      <link>https://forem.com/rubberduckops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rubberduckops"/>
    <language>en</language>
    <item>
      <title>CPU Inference on AMD EPYC 9334: Real Numbers for LLM and TTS Workloads</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Wed, 06 May 2026 13:58:10 +0000</pubDate>
      <link>https://forem.com/leaseweb/cpu-inference-on-amd-epyc-9334-real-numbers-for-llm-and-tts-workloads-54e7</link>
      <guid>https://forem.com/leaseweb/cpu-inference-on-amd-epyc-9334-real-numbers-for-llm-and-tts-workloads-54e7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like — and when CPU inference makes sense.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why inference is where your budget actually disappears
&lt;/h2&gt;

&lt;p&gt;Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously — and cost per query scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.&lt;/p&gt;

&lt;p&gt;The hardware decision for inference is also different from training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  The two metrics that actually matter for LLM inference
&lt;/h2&gt;

&lt;p&gt;When a prompt hits an LLM, two stages happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt; — the model processes all input tokens, runs them through its layers, and builds the KV cache. Compute-bound. Ends when the first output token is generated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode&lt;/strong&gt; — the model generates each subsequent token one at a time, reading from the KV cache. Memory-bandwidth-bound.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These stages have different performance profiles, which is why benchmarks report two numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token (TTFT)&lt;/strong&gt; — elapsed time from prompt submission to first output token. Lower is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per second (tok/s)&lt;/strong&gt; — decode throughput. Higher is better, especially for batch and streaming workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For TTS, the standard metric is &lt;strong&gt;real-time factor (RTF)&lt;/strong&gt; — the ratio of processing time to audio duration. RTF below 1.0 means the model generates audio faster than real time. Above 1.0 and it can't keep up.&lt;/p&gt;
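
&lt;p&gt;To make the LLM metrics concrete, here's a minimal sketch of how TTFT and tok/s fall out of a streamed generation loop. The &lt;code&gt;generate_stream()&lt;/code&gt; call is a hypothetical stand-in for whatever runtime you use; only the timing arithmetic is the point.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: TTFT and decode throughput around a streaming generator.
# generate_stream(prompt) is a hypothetical stand-in for your runtime's streaming
# API (llama.cpp bindings, an OpenAI-compatible endpoint, ...) yielding tokens.
import time

def measure(prompt, generate_stream):
    t_start = time.perf_counter()
    first_token_at = None
    now = t_start
    n_tokens = 0
    for _token in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now           # prefill ends with the first token
        n_tokens += 1

    # Assumes at least two tokens were produced.
    ttft = first_token_at - t_start        # time to first token, seconds
    decode_time = now - first_token_at     # time spent on tokens 2..n
    tok_per_s = (n_tokens - 1) / decode_time if decode_time else 0.0
    return ttft, tok_per_s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;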




&lt;h2&gt;
  
  
  Hardware and software setup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD EPYC 9334 × 2 (dual socket)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Zen 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cores / threads per socket&lt;/td&gt;
&lt;td&gt;32 / 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base clock&lt;/td&gt;
&lt;td&gt;2.7 GHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 cache&lt;/td&gt;
&lt;td&gt;128 MB per socket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDP&lt;/td&gt;
&lt;td&gt;210W per socket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;64 GB DDR5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two tools were used: &lt;code&gt;llama-bench&lt;/code&gt; (part of llama.cpp) for local model evaluation, and &lt;code&gt;OpenLLM&lt;/code&gt; with &lt;code&gt;llmperf&lt;/code&gt; for API-level throughput testing.&lt;/p&gt;

&lt;p&gt;Test configuration: LLM runs used a 512-token prompt, 128 generated tokens, and 24 CPU threads. TTS runs used a 180-character input, 32 CPU threads, and 30 inference runs per model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Models tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-0528-Qwen3-8B-Q4_K_M&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;20B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7b-Q4_K_M&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.2-Q4_K_M&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kokoro (ONNX Runtime)&lt;/td&gt;
&lt;td&gt;82M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft SpeechT5&lt;/td&gt;
&lt;td&gt;150M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coqui XTTS-v2&lt;/td&gt;
&lt;td&gt;400M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  LLM results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Time to first token
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantisation&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;4.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;8.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;4.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~4.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Switching GPT-OSS-20B to FP16 had minimal effect on TTFT. For DeepSeek, the same switch nearly doubled it (4.1s to 8.1s).&lt;/p&gt;

&lt;h3&gt;
  
  
  Decode throughput
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantisation&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;27.8 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;8.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;18.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;26.2 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~20 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Q4 vs FP16 gap is significant for DeepSeek — a 3.4× throughput drop. For sustained batch workloads on CPU, &lt;strong&gt;Q4 quantisation is the practical default&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU and memory utilisation
&lt;/h3&gt;

&lt;p&gt;CPU utilisation stayed between 20% and 30% across all runs. Q4 models leave substantial DRAM headroom — useful for multi-tenant deployments where you want concurrent instances on the same node. DeepSeek at FP16 consumed close to 16 GB, which limits that option considerably.&lt;/p&gt;
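
&lt;p&gt;The FP16 figure lines up with simple weight-size arithmetic. A rough back-of-envelope, assuming weights dominate memory and Q4_K_M averages around 4.5 bits per weight (actual GGUF file sizes vary slightly by model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope weight memory only; KV cache and runtime overhead come on top.
params = 8e9                       # 8B-parameter model

fp16_gb = params * 2 / 1e9         # 2 bytes per weight: about 16 GB
q4_gb = params * 4.5 / 8 / 1e9     # roughly 4.5 bits per weight for Q4_K_M: about 4.5 GB

print(f"FP16 ~{fp16_gb:.0f} GB, Q4_K_M ~{q4_gb:.1f} GB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;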

&lt;h3&gt;
  
  
  GPU reference point
&lt;/h3&gt;

&lt;p&gt;For comparison, the same FP16 throughput test ran on an Nvidia L4 GPU. The L4 produced &lt;strong&gt;16.7 tok/s on DeepSeek-R1-8B&lt;/strong&gt; and &lt;strong&gt;58.6 tok/s on GPT-OSS-20B&lt;/strong&gt;, versus 8.1 and 26.2 on the EPYC 9334: roughly double the throughput. If throughput is your primary constraint, that gap matters. If cost predictability or workload type is the constraint, the CPU case still holds.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTS results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kokoro (82M, ONNX)&lt;/td&gt;
&lt;td&gt;0.162&lt;/td&gt;
&lt;td&gt;~0.5 GB&lt;/td&gt;
&lt;td&gt;6× faster than real time. Tight p50/p95 spread.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft SpeechT5 (150M)&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;~1.4 GB&lt;/td&gt;
&lt;td&gt;Comfortably real time. Good for single-speaker synthesis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coqui XTTS-v2 (400M)&lt;/td&gt;
&lt;td&gt;1.41&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;td&gt;Cannot serve real-time audio. Strong fit for batch jobs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Kokoro is the standout — 82M parameters, RTF of 0.162, and consistent latency under load. XTTS-v2 is the most capable (voice cloning, multilingual) but at RTF 1.41 it belongs in overnight queues or batch audio generation, not streaming pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to reproduce this
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;Note:&lt;/strong&gt; the commands below are representative of the setup described above; exact flags may have differed in the original runs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# LLM benchmark — llama-bench (part of llama.cpp)&lt;/span&gt;
llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# TTS benchmark — run per model, 30 iterations&lt;/span&gt;
&lt;span class="c"&gt;# Kokoro: ONNX Runtime&lt;/span&gt;
&lt;span class="c"&gt;# SpeechT5 + XTTS-v2: standard Python inference loop&lt;/span&gt;
&lt;span class="c"&gt;# Input: 180-character text string, 32 threads&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
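
&lt;p&gt;The TTS loop itself is simple. A minimal sketch of the per-model measurement, where &lt;code&gt;synthesize()&lt;/code&gt; is a hypothetical stand-in for each model's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical RTF measurement loop: 30 runs on the same 180-character input.
# synthesize() stands in for the model call (Kokoro via ONNX Runtime, SpeechT5 and
# XTTS-v2 via their Python APIs) and is assumed to return a waveform plus sample rate.
import statistics
import time

TEXT = "A fixed 180-character test sentence would go here..."

def benchmark(synthesize, runs=30):
    rtfs = []
    for _ in range(runs):
        t0 = time.perf_counter()
        waveform, sample_rate = synthesize(TEXT)
        elapsed = time.perf_counter() - t0
        audio_seconds = len(waveform) / sample_rate
        rtfs.append(elapsed / audio_seconds)      # RTF: processing time / audio duration
    rtfs.sort()
    return statistics.median(rtfs), rtfs[int(0.95 * (runs - 1))]   # p50, approximate p95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;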





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# API-level throughput — OpenLLM + llmperf&lt;/span&gt;
openllm start /path/to/model.gguf &lt;span class="nt"&gt;--backend&lt;/span&gt; llama-cpp

llmperf run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &amp;lt;model-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-concurrent-requests&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-output-tokens&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-input-tokens&lt;/span&gt; 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models sourced from HuggingFace. Search the model name directly (e.g. &lt;code&gt;bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF&lt;/code&gt;) and pull the &lt;code&gt;Q4_K_M&lt;/code&gt; variant for llama.cpp tests.&lt;/p&gt;
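
&lt;p&gt;If you'd rather script the download, &lt;code&gt;huggingface_hub&lt;/code&gt; does the same job. A small sketch; the exact GGUF filename is an assumption, so check the repo's file listing first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fetch a single GGUF file from HuggingFace; the filename is an assumption,
# verify it against the repo's file listing before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF",
    filename="DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf",
)
print(path)  # local cache path, pass it to llama-bench -m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;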

&lt;p&gt;No special system prep was applied — no NUMA pinning or hugepage configuration. Results reflect default OS settings on the HPE ProLiant DL385 Gen11.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to use CPU vs GPU for inference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CPU inference is a good fit for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch summarisation and document processing&lt;/li&gt;
&lt;li&gt;Audio transcription queues&lt;/li&gt;
&lt;li&gt;Overnight report generation&lt;/li&gt;
&lt;li&gt;Lightweight TTS (Kokoro, SpeechT5)&lt;/li&gt;
&lt;li&gt;Edge deployments with cost or availability constraints&lt;/li&gt;
&lt;li&gt;Multi-tenant setups with 7B–20B Q4 models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPU is still the right call for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time, latency-critical workloads at scale&lt;/li&gt;
&lt;li&gt;High-concurrency serving (maximise throughput)&lt;/li&gt;
&lt;li&gt;Models above 20B without quantisation&lt;/li&gt;
&lt;li&gt;Real-time TTS with complex models (XTTS-v2)&lt;/li&gt;
&lt;li&gt;Streaming use cases where TTFT &amp;lt; 1s is required&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;The EPYC 9334 handles 7B–20B parameter models at Q4 quantisation with predictable throughput and acceptable latency for a broad class of production workloads. It doesn't replace a GPU for every inference job. For the workloads listed above, it doesn't need to.&lt;/p&gt;

&lt;p&gt;If you're running batch inference or TTS queues and paying GPU rates, it's worth running these numbers against your actual workload before assuming a GPU is necessary.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Do we actually need more GPUs, or just the right one?</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:58:04 +0000</pubDate>
      <link>https://forem.com/leaseweb/do-we-actually-need-more-gpus-or-just-the-right-one-f44</link>
      <guid>https://forem.com/leaseweb/do-we-actually-need-more-gpus-or-just-the-right-one-f44</guid>
      <description>&lt;p&gt;Last week I was at a tech meetup in Berlin where we got into something: are teams actually making deliberate infrastructure decisions, or just reacting to AI hype? Three practitioners shared their real experience. Here's what stuck with me.&lt;/p&gt;




&lt;h2&gt;
  
  
  Don't lock in before you understand your workload
&lt;/h2&gt;

&lt;p&gt;Ömer from #Youzu talked through their migration off hyperscalers after getting trapped by credits and tight service coupling. Not a hypothetical; they went through it.&lt;/p&gt;

&lt;p&gt;His takeaway: decouple early so you can move workloads freely. Know roughly where you're heading before you build, then migrate toward full control progressively. Portability isn't a nice-to-have, it's insurance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question the GPU arms race
&lt;/h2&gt;

&lt;p&gt;David from #SteliaAI made a point that a lot of teams need to hear right now: most people are provisioning for scale they don't have yet.&lt;/p&gt;

&lt;p&gt;His suggestion was almost counterintuitively simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with half compute, half control plane&lt;/li&gt;
&lt;li&gt;Get customers&lt;/li&gt;
&lt;li&gt;Then revisit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't optimize for a scale you haven't reached yet, because the shape of your workload will change by the time you get there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability is not optional
&lt;/h2&gt;

&lt;p&gt;Felix from #Cloudeteer made the case that GPU utilization metrics alone don't tell the full story. You can be running at 100% capacity and still be producing wrong outputs.&lt;/p&gt;

&lt;p&gt;Traces — not just metrics — are what let you catch problems before they fail silently. If your AI stack doesn't have tracing today, you're flying blind with a full tank.&lt;/p&gt;
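
&lt;p&gt;For what that can look like in practice, here's a minimal OpenTelemetry sketch: a single span around an inference call, with a console exporter standing in for whatever backend you actually use. Attribute names and the model stub are illustrative, not a prescription.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal tracing sketch around an inference call: one span carrying model and
# token-count attributes. Exporter and attribute names are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference")

def fake_generate(prompt):
    return ["token"] * 42          # stub standing in for the real model call

def run_inference(prompt):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model.name", "my-7b-q4")        # illustrative attributes
        output = fake_generate(prompt)
        span.set_attribute("output.token_count", len(output))
        return output

run_inference("hello")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;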




&lt;h2&gt;
  
  
  The thread running through all three talks
&lt;/h2&gt;

&lt;p&gt;AI hype is driving infrastructure decisions that don't match actual workload needs. Every speaker arrived at the same place from a different direction: start lean, stay observable, don't couple yourself to a provider before you understand what you're building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now, over to you
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Are you provisioning GPUs reactively or from a clear workload map?&lt;/li&gt;
&lt;li&gt;Have you ever scaled back after realizing you over-provisioned?&lt;/li&gt;
&lt;li&gt;Where does observability fit in your AI stack today?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your experience in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
    <item>
      <title>Looking for a European dedicated server or VPS? Here's what to consider.</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Tue, 03 Mar 2026 13:29:53 +0000</pubDate>
      <link>https://forem.com/leaseweb/looking-for-a-european-dedicated-server-or-vps-heres-what-to-consider-55h4</link>
      <guid>https://forem.com/leaseweb/looking-for-a-european-dedicated-server-or-vps-heres-what-to-consider-55h4</guid>
      <description>&lt;p&gt;Hardware costs are rising across the industry. RAM, SSDs, AI infrastructure demand: it's affecting everyone. If you're evaluating your infrastructure options right now, here's an honest look at what matters and where we fit in.&lt;br&gt;
Disclaimer: I'm on the infrastructure team at Leaseweb, which is EU-native and Netherlands-owned.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually matters when choosing a provider
&lt;/h2&gt;

&lt;p&gt;Price per spec is the obvious starting point, but it's rarely the whole story. A few things worth thinking through:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is your data?&lt;/strong&gt; If you're in a regulated industry or just care about GDPR, EU-native infrastructure matters. Not the EU region of a US company; actually EU-owned and operated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How predictable is your bill?&lt;/strong&gt; Hourly cloud pricing looks cheap until you're running sustained workloads. At high utilisation, dedicated or longer-term contracts almost always win on cost. The maths changes fast above 60-70% utilisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when something breaks?&lt;/strong&gt; Support SLAs vary wildly in this space. Worth checking what you're actually getting before you need it.&lt;/p&gt;
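
&lt;p&gt;On the bill-predictability point, the break-even is easy to sanity-check for your own numbers. A quick sketch with an illustrative on-demand rate (the €112/month dedicated figure appears further down in this post; the hourly rate is purely a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Purely illustrative: compare on-demand hourly pricing with a flat monthly rate
# at different utilisation levels. The hourly rate is a made-up placeholder;
# plug in real quotes for your own workload.
HOURLY_RATE = 0.25        # EUR per hour, on-demand (placeholder)
MONTHLY_FLAT = 112.0      # EUR per month, flat dedicated rate
HOURS_PER_MONTH = 730

for utilisation in (0.3, 0.5, 0.7, 0.9):
    on_demand = HOURLY_RATE * HOURS_PER_MONTH * utilisation
    print(f"{utilisation:.0%} utilisation: on-demand {on_demand:.0f} EUR vs flat {MONTHLY_FLAT:.0f} EUR")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;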




&lt;h2&gt;
  
  
  What we offer
&lt;/h2&gt;

&lt;p&gt;VPS from €3.59/month (2 vCPU, 4 GB RAM, 80 GB NVMe). Good for staging environments, isolated workloads, smaller production setups.&lt;/p&gt;

&lt;p&gt;AMD EPYC dedicated servers from €112/month. Full root access, unmanaged, API-ready for IaC. EU sovereign data centres, DDoS protection standard.&lt;/p&gt;

&lt;p&gt;Contract terms run from 1 month up to 3 years. The longer the commitment, the better the rate, up to 25% off. You lock in your pricing upfront, with no surprises for the duration of your term.&lt;/p&gt;

&lt;p&gt;We're not the cheapest option in every category. On price-performance for sustained EU workloads, we're worth a spot on your shortlist. One thing worth checking: if inter-node throughput is critical to your workload, compare our network specs against your requirements before committing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is it worth switching if you're happy where you are?
&lt;/h2&gt;

&lt;p&gt;Honestly, maybe not. Switching has friction, and if your current setup is working, the disruption cost is real.&lt;br&gt;
But if you're evaluating options, spinning up a test environment costs nothing.&lt;/p&gt;




&lt;p&gt;Drop a comment with your current setup: vCPU count, RAM, storage, region. Happy to spec match or answer any questions you have!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>webdev</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
