<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: RubberDuckOps</title>
    <description>The latest articles on Forem by RubberDuckOps (@rubberduckops).</description>
    <link>https://forem.com/rubberduckops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3696079%2F378d44fb-35dc-4075-8f33-a31d1e10ce94.png</url>
      <title>Forem: RubberDuckOps</title>
      <link>https://forem.com/rubberduckops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rubberduckops"/>
    <language>en</language>
    <item>
      <title>CPU Inference on AMD EPYC 9334: Real Numbers for LLM and TTS Workloads</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Wed, 06 May 2026 13:58:10 +0000</pubDate>
      <link>https://forem.com/leaseweb/cpu-inference-on-amd-epyc-9334-real-numbers-for-llm-and-tts-workloads-54e7</link>
      <guid>https://forem.com/leaseweb/cpu-inference-on-amd-epyc-9334-real-numbers-for-llm-and-tts-workloads-54e7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like — and when CPU inference makes sense.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why inference is where your budget actually disappears
&lt;/h2&gt;

&lt;p&gt;Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously — and cost per query scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.&lt;/p&gt;

&lt;p&gt;The hardware decision for inference is also different from training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.&lt;/p&gt;




&lt;h2&gt;
  
  
  The two metrics that actually matter for LLM inference
&lt;/h2&gt;

&lt;p&gt;When a prompt hits an LLM, two stages happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prefill&lt;/strong&gt; — the model processes all input tokens, runs them through its layers, and builds the KV cache. Compute-bound. Ends when the first output token is generated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decode&lt;/strong&gt; — the model generates each subsequent token one at a time, reading from the KV cache. Memory-bandwidth-bound.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These stages have different performance profiles, which is why benchmarks report two numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to first token (TTFT)&lt;/strong&gt; — elapsed time from prompt submission to first output token. Lower is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens per second (tok/s)&lt;/strong&gt; — decode throughput. Higher is better, especially for batch and streaming workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For TTS, the standard metric is &lt;strong&gt;real-time factor (RTF)&lt;/strong&gt; — the ratio of processing time to audio duration. RTF below 1.0 means the model generates audio faster than real time. Above 1.0 and it can't keep up.&lt;/p&gt;
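
&lt;p&gt;To make the LLM metrics concrete, here's a minimal sketch of how TTFT and tok/s fall out of a streamed generation loop. The &lt;code&gt;generate_stream()&lt;/code&gt; call is a hypothetical stand-in for whatever runtime you use; only the timing arithmetic is the point.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch only: TTFT and decode throughput around a streaming generator.
# generate_stream(prompt) is a hypothetical stand-in for your runtime's streaming
# API (llama.cpp bindings, an OpenAI-compatible endpoint, ...) yielding tokens.
import time

def measure(prompt, generate_stream):
    t_start = time.perf_counter()
    first_token_at = None
    now = t_start
    n_tokens = 0
    for _token in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now           # prefill ends with the first token
        n_tokens += 1

    # Assumes at least two tokens were produced.
    ttft = first_token_at - t_start        # time to first token, seconds
    decode_time = now - first_token_at     # time spent on tokens 2..n
    tok_per_s = (n_tokens - 1) / decode_time if decode_time else 0.0
    return ttft, tok_per_s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;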




&lt;h2&gt;
  
  
  Hardware and software setup
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Specification&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD EPYC 9334 × 2 (dual socket)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture&lt;/td&gt;
&lt;td&gt;Zen 4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cores / threads per socket&lt;/td&gt;
&lt;td&gt;32 / 64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Base clock&lt;/td&gt;
&lt;td&gt;2.7 GHz&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 cache&lt;/td&gt;
&lt;td&gt;128 MB per socket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TDP&lt;/td&gt;
&lt;td&gt;210W per socket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;64 GB DDR5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two tools were used: &lt;code&gt;llama-bench&lt;/code&gt; (part of llama.cpp) for local model evaluation, and &lt;code&gt;OpenLLM&lt;/code&gt; with &lt;code&gt;llmperf&lt;/code&gt; for API-level throughput testing.&lt;/p&gt;

&lt;p&gt;Test configuration: LLM runs used a 512-token prompt, 128 generated tokens, and 24 CPU threads. TTS runs used a 180-character input, 32 CPU threads, and 30 inference runs per model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Models tested
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-0528-Qwen3-8B-Q4_K_M&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;20B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7b-Q4_K_M&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B-Instruct-v0.2-Q4_K_M&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kokoro (ONNX Runtime)&lt;/td&gt;
&lt;td&gt;82M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft SpeechT5&lt;/td&gt;
&lt;td&gt;150M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coqui XTTS-v2&lt;/td&gt;
&lt;td&gt;400M&lt;/td&gt;
&lt;td&gt;TTS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  LLM results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Time to first token
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantisation&lt;/th&gt;
&lt;th&gt;TTFT&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;4.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;8.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;3.6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;4.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~4.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Switching GPT-OSS-20B to FP16 had minimal effect on TTFT. For DeepSeek, the same switch nearly doubled it (4.1s to 8.1s).&lt;/p&gt;

&lt;h3&gt;
  
  
  Decode throughput
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantisation&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;27.8 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-R1-8B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;8.1 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;18.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS-20B&lt;/td&gt;
&lt;td&gt;FP16&lt;/td&gt;
&lt;td&gt;26.2 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama-2-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral-7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~20 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Q4 vs FP16 gap is significant for DeepSeek — a 3.4× throughput drop. For sustained batch workloads on CPU, &lt;strong&gt;Q4 quantisation is the practical default&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU and memory utilisation
&lt;/h3&gt;

&lt;p&gt;CPU utilisation stayed between 20% and 30% across all runs. Q4 models leave substantial DRAM headroom — useful for multi-tenant deployments where you want concurrent instances on the same node. DeepSeek at FP16 consumed close to 16 GB, which limits that option considerably.&lt;/p&gt;
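
&lt;p&gt;The FP16 figure lines up with simple weight-size arithmetic. A rough back-of-envelope, assuming weights dominate memory and Q4_K_M averages around 4.5 bits per weight (actual GGUF file sizes vary slightly by model):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Back-of-envelope weight memory only; KV cache and runtime overhead come on top.
params = 8e9                       # 8B-parameter model

fp16_gb = params * 2 / 1e9         # 2 bytes per weight: about 16 GB
q4_gb = params * 4.5 / 8 / 1e9     # roughly 4.5 bits per weight for Q4_K_M: about 4.5 GB

print(f"FP16 ~{fp16_gb:.0f} GB, Q4_K_M ~{q4_gb:.1f} GB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;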

&lt;h3&gt;
  
  
  GPU reference point
&lt;/h3&gt;

&lt;p&gt;For comparison, the same FP16 throughput test ran on an Nvidia L4 GPU. The L4 produced &lt;strong&gt;16.7 tok/s on DeepSeek-R1-8B&lt;/strong&gt; and &lt;strong&gt;58.6 tok/s on GPT-OSS-20B&lt;/strong&gt;, versus 8.1 and 26.2 on the EPYC 9334: roughly double the throughput. If throughput is your primary constraint, that gap matters. If cost predictability or workload type is the constraint, the CPU case still holds.&lt;/p&gt;




&lt;h2&gt;
  
  
  TTS results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kokoro (82M, ONNX)&lt;/td&gt;
&lt;td&gt;0.162&lt;/td&gt;
&lt;td&gt;~0.5 GB&lt;/td&gt;
&lt;td&gt;6× faster than real time. Tight p50/p95 spread.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microsoft SpeechT5 (150M)&lt;/td&gt;
&lt;td&gt;0.6&lt;/td&gt;
&lt;td&gt;~1.4 GB&lt;/td&gt;
&lt;td&gt;Comfortably real time. Good for single-speaker synthesis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coqui XTTS-v2 (400M)&lt;/td&gt;
&lt;td&gt;1.41&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;td&gt;Cannot serve real-time audio. Strong fit for batch jobs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Kokoro is the standout — 82M parameters, RTF of 0.162, and consistent latency under load. XTTS-v2 is the most capable (voice cloning, multilingual) but at RTF 1.41 it belongs in overnight queues or batch audio generation, not streaming pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to reproduce this
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;📌 &lt;strong&gt;Note:&lt;/strong&gt; the commands below are representative of the setup described above; exact flags may have differed in the original runs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# LLM benchmark — llama-bench (part of llama.cpp)&lt;/span&gt;
llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; 24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# TTS benchmark — run per model, 30 iterations&lt;/span&gt;
&lt;span class="c"&gt;# Kokoro: ONNX Runtime&lt;/span&gt;
&lt;span class="c"&gt;# SpeechT5 + XTTS-v2: standard Python inference loop&lt;/span&gt;
&lt;span class="c"&gt;# Input: 180-character text string, 32 threads&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
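
&lt;p&gt;The TTS loop itself is simple. A minimal sketch of the per-model measurement, where &lt;code&gt;synthesize()&lt;/code&gt; is a hypothetical stand-in for each model's actual API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical RTF measurement loop: 30 runs on the same 180-character input.
# synthesize() stands in for the model call (Kokoro via ONNX Runtime, SpeechT5 and
# XTTS-v2 via their Python APIs) and is assumed to return a waveform plus sample rate.
import statistics
import time

TEXT = "A fixed 180-character test sentence would go here..."

def benchmark(synthesize, runs=30):
    rtfs = []
    for _ in range(runs):
        t0 = time.perf_counter()
        waveform, sample_rate = synthesize(TEXT)
        elapsed = time.perf_counter() - t0
        audio_seconds = len(waveform) / sample_rate
        rtfs.append(elapsed / audio_seconds)      # RTF: processing time / audio duration
    rtfs.sort()
    return statistics.median(rtfs), rtfs[int(0.95 * (runs - 1))]   # p50, approximate p95
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;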





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# API-level throughput — OpenLLM + llmperf&lt;/span&gt;
openllm start /path/to/model.gguf &lt;span class="nt"&gt;--backend&lt;/span&gt; llama-cpp

llmperf run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &amp;lt;model-name&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-concurrent-requests&lt;/span&gt; 1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-output-tokens&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-input-tokens&lt;/span&gt; 512
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Models sourced from HuggingFace. Search the model name directly (e.g. &lt;code&gt;bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF&lt;/code&gt;) and pull the &lt;code&gt;Q4_K_M&lt;/code&gt; variant for llama.cpp tests.&lt;/p&gt;
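
&lt;p&gt;If you'd rather script the download, &lt;code&gt;huggingface_hub&lt;/code&gt; does the same job. A small sketch; the exact GGUF filename is an assumption, so check the repo's file listing first:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Fetch a single GGUF file from HuggingFace; the filename is an assumption,
# verify it against the repo's file listing before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/DeepSeek-R1-0528-Qwen3-8B-GGUF",
    filename="DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf",
)
print(path)  # local cache path, pass it to llama-bench -m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;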

&lt;p&gt;No special system prep was applied — no NUMA pinning or hugepage configuration. Results reflect default OS settings on the HPE ProLiant DL385 Gen11.&lt;/p&gt;




&lt;h2&gt;
  
  
  When to use CPU vs GPU for inference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CPU inference is a good fit for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Batch summarisation and document processing&lt;/li&gt;
&lt;li&gt;Audio transcription queues&lt;/li&gt;
&lt;li&gt;Overnight report generation&lt;/li&gt;
&lt;li&gt;Lightweight TTS (Kokoro, SpeechT5)&lt;/li&gt;
&lt;li&gt;Edge deployments with cost or availability constraints&lt;/li&gt;
&lt;li&gt;Multi-tenant setups with 7B–20B Q4 models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GPU is still the right call for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time, latency-critical workloads at scale&lt;/li&gt;
&lt;li&gt;High-concurrency serving (maximise throughput)&lt;/li&gt;
&lt;li&gt;Models above 20B without quantisation&lt;/li&gt;
&lt;li&gt;Real-time TTS with complex models (XTTS-v2)&lt;/li&gt;
&lt;li&gt;Streaming use cases where TTFT &amp;lt; 1s is required&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;The EPYC 9334 handles 7B–20B parameter models at Q4 quantisation with predictable throughput and acceptable latency for a broad class of production workloads. It doesn't replace a GPU for every inference job. For the workloads listed above, it doesn't need to.&lt;/p&gt;

&lt;p&gt;If you're running batch inference or TTS queues and paying GPU rates, it's worth running these numbers against your actual workload before assuming a GPU is necessary.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>llm</category>
      <category>benchmark</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Do we actually need more GPUs, or just the right one?</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:58:04 +0000</pubDate>
      <link>https://forem.com/leaseweb/do-we-actually-need-more-gpus-or-just-the-right-one-f44</link>
      <guid>https://forem.com/leaseweb/do-we-actually-need-more-gpus-or-just-the-right-one-f44</guid>
      <description>&lt;p&gt;Last week I was at a tech meetup in Berlin where we got into something: are teams actually making deliberate infrastructure decisions, or just reacting to AI hype? Three practitioners shared their real experience. Here's what stuck with me.&lt;/p&gt;




&lt;h2&gt;
  
  
  Don't lock in before you understand your workload
&lt;/h2&gt;

&lt;p&gt;Ömer from #Youzu talked through their migration off hyperscalers after getting trapped by credits and tight service coupling. Not a hypothetical; they went through it.&lt;/p&gt;

&lt;p&gt;His takeaway: decouple early so you can move workloads freely. Know roughly where you're heading before you build, then migrate toward full control progressively. Portability isn't a nice-to-have, it's insurance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Question the GPU arms race
&lt;/h2&gt;

&lt;p&gt;David from #SteliaAI made a point that a lot of teams need to hear right now: most people are provisioning for scale they don't have yet.&lt;/p&gt;

&lt;p&gt;His suggestion was almost counterintuitively simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with half compute, half control plane&lt;/li&gt;
&lt;li&gt;Get customers&lt;/li&gt;
&lt;li&gt;Then revisit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't optimize for a scale you haven't reached yet, because the shape of your workload will change by the time you get there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Observability is not optional
&lt;/h2&gt;

&lt;p&gt;Felix from #Cloudeteer made the case that GPU utilization metrics alone don't tell the full story. You can be running at 100% capacity and still be producing wrong outputs.&lt;/p&gt;

&lt;p&gt;Traces — not just metrics — are what let you catch problems before they fail silently. If your AI stack doesn't have tracing today, you're flying blind with a full tank.&lt;/p&gt;
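
&lt;p&gt;For what that can look like in practice, here's a minimal OpenTelemetry sketch: a single span around an inference call, with a console exporter standing in for whatever backend you actually use. Attribute names and the model stub are illustrative, not a prescription.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal tracing sketch around an inference call: one span carrying model and
# token-count attributes. Exporter and attribute names are illustrative only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inference")

def fake_generate(prompt):
    return ["token"] * 42          # stub standing in for the real model call

def run_inference(prompt):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("model.name", "my-7b-q4")        # illustrative attributes
        output = fake_generate(prompt)
        span.set_attribute("output.token_count", len(output))
        return output

run_inference("hello")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;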




&lt;h2&gt;
  
  
  The thread running through all three talks
&lt;/h2&gt;

&lt;p&gt;AI hype is driving infrastructure decisions that don't match actual workload needs. Every speaker arrived at the same place from a different direction: start lean, stay observable, don't couple yourself to a provider before you understand what you're building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now, over to you
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Are you provisioning GPUs reactively or from a clear workload map?&lt;/li&gt;
&lt;li&gt;Have you ever scaled back after realizing you over-provisioned?&lt;/li&gt;
&lt;li&gt;Where does observability fit in your AI stack today?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your experience in the comments.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
    <item>
      <title>Looking for a European dedicated server or VPS? Here's what to consider.</title>
      <dc:creator>RubberDuckOps</dc:creator>
      <pubDate>Tue, 03 Mar 2026 13:29:53 +0000</pubDate>
      <link>https://forem.com/leaseweb/looking-for-a-european-dedicated-server-or-vps-heres-what-to-consider-55h4</link>
      <guid>https://forem.com/leaseweb/looking-for-a-european-dedicated-server-or-vps-heres-what-to-consider-55h4</guid>
      <description>&lt;p&gt;Hardware costs are rising across the industry. RAM, SSDs, AI infrastructure demand: it's affecting everyone. If you're evaluating your infrastructure options right now, here's an honest look at what matters and where we fit in.&lt;br&gt;
Disclaimer: I'm on the infrastructure team at Leaseweb, which is EU-native and Netherlands-owned.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually matters when choosing a provider
&lt;/h2&gt;

&lt;p&gt;Price per spec is the obvious starting point, but it's rarely the whole story. A few things worth thinking through:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where is your data?&lt;/strong&gt; If you're in a regulated industry or just care about GDPR, EU-native infrastructure matters. Not the EU region of a US company; actually EU-owned and operated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How predictable is your bill?&lt;/strong&gt; Hourly cloud pricing looks cheap until you're running sustained workloads. At high utilisation, dedicated or longer-term contracts almost always win on cost. The maths changes fast above 60-70% utilisation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when something breaks?&lt;/strong&gt; Support SLAs vary wildly in this space. Worth checking what you're actually getting before you need it.&lt;/p&gt;
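
&lt;p&gt;On the bill-predictability point, the break-even is easy to sanity-check for your own numbers. A quick sketch with an illustrative on-demand rate (the €112/month dedicated figure appears further down in this post; the hourly rate is purely a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Purely illustrative: compare on-demand hourly pricing with a flat monthly rate
# at different utilisation levels. The hourly rate is a made-up placeholder;
# plug in real quotes for your own workload.
HOURLY_RATE = 0.25        # EUR per hour, on-demand (placeholder)
MONTHLY_FLAT = 112.0      # EUR per month, flat dedicated rate
HOURS_PER_MONTH = 730

for utilisation in (0.3, 0.5, 0.7, 0.9):
    on_demand = HOURLY_RATE * HOURS_PER_MONTH * utilisation
    print(f"{utilisation:.0%} utilisation: on-demand {on_demand:.0f} EUR vs flat {MONTHLY_FLAT:.0f} EUR")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;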




&lt;h2&gt;
  
  
  What we offer
&lt;/h2&gt;

&lt;p&gt;VPS from €3.59/month (2 vCPU, 4 GB RAM, 80 GB NVMe). Good for staging environments, isolated workloads, smaller production setups.&lt;/p&gt;

&lt;p&gt;AMD EPYC dedicated servers from €112/month. Full root access, unmanaged, API-ready for IaC. EU sovereign data centres, DDoS protection standard.&lt;/p&gt;

&lt;p&gt;Contract terms run from 1 month up to 3 years. The longer the commitment, the better the rate, up to 25% off. You lock in your pricing upfront, with no surprises for the duration of your term.&lt;/p&gt;

&lt;p&gt;We're not the cheapest option in every category. On price-performance for sustained EU workloads, we're worth a spot on your shortlist. One thing worth checking: if inter-node throughput is critical to your workload, compare our network specs against your requirements before committing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Is it worth switching if you're happy where you are?
&lt;/h2&gt;

&lt;p&gt;Honestly, maybe not. Switching has friction, and if your current setup is working, the disruption cost is real.&lt;br&gt;
But if you're evaluating options, spinning up a test environment costs nothing.&lt;/p&gt;




&lt;p&gt;Drop a comment with your current setup: vCPU count, RAM, storage, region. Happy to spec match or answer any questions you have!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>webdev</category>
      <category>infrastructure</category>
    </item>
  </channel>
</rss>
