<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: DevOps Daily</title>
    <description>The latest articles on Forem by DevOps Daily (@devopsdaily).</description>
    <link>https://forem.com/devopsdaily</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F382434%2F3b4f7f10-38d4-4f4f-8351-1dcb0c1bdfc7.png</url>
      <title>Forem: DevOps Daily</title>
      <link>https://forem.com/devopsdaily</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/devopsdaily"/>
    <language>en</language>
    <item>
      <title>Claude Code Hidden Features You Probably Missed</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Wed, 01 Apr 2026 17:21:58 +0000</pubDate>
      <link>https://forem.com/devopsdaily/claude-code-hidden-features-you-probably-missed-3ej0</link>
      <guid>https://forem.com/devopsdaily/claude-code-hidden-features-you-probably-missed-3ej0</guid>
      <description>&lt;p&gt;Most people use Claude Code to write code, fix bugs, and maybe generate a commit message. That's fine, but you're leaving a lot on the table.&lt;/p&gt;

&lt;p&gt;Boris Cherny, the creator of Claude Code, recently shared a &lt;a href="https://x.com/bcherny/status/2038454336355999749" rel="noopener noreferrer"&gt;thread on X&lt;/a&gt; about features that even daily users tend to overlook. Some of these genuinely changed how I work. Here's a rundown of the ones worth knowing about.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Claude Code has mobile sessions, automated scheduling, voice input, parallel agents, git worktrees, hooks, and a browser extension. Most people use about 20% of what it can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move Your Session Anywhere with /teleport
&lt;/h2&gt;

&lt;p&gt;You can start a session on your laptop and pick it up on your phone. Or move it to the web. The &lt;code&gt;/teleport&lt;/code&gt; command transfers your full session context between devices.&lt;/p&gt;

&lt;p&gt;The reverse also works. If you're reviewing something on your phone during a commute, you can &lt;code&gt;/teleport&lt;/code&gt; it back to your terminal when you sit down.&lt;/p&gt;

&lt;p&gt;There's also &lt;code&gt;/remote-control&lt;/code&gt; which lets you connect to a running session from another device without transferring it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your laptop&lt;/span&gt;
/teleport

&lt;span class="c"&gt;# On your phone or web - enter the code to pick up the session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when you kick off a long-running task on your workstation and want to check progress from your phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate Repetitive Tasks with /loop and /schedule
&lt;/h2&gt;

&lt;p&gt;This one is a genuine workflow changer. You can tell Claude Code to run a task on a recurring schedule for up to a week.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review PRs every 30 minutes&lt;/span&gt;
/loop 30m review open PRs and post comments

&lt;span class="c"&gt;# Run a health check every hour&lt;/span&gt;
/schedule every 1h check &lt;span class="k"&gt;if &lt;/span&gt;the staging environment is healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think about what you do repeatedly: reviewing PRs, checking CI status, monitoring deployments, updating dependencies. You can automate all of it without writing a single script.&lt;/p&gt;

&lt;p&gt;Some practical examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review all open PRs every morning at 9 AM&lt;/li&gt;
&lt;li&gt;Monitor a Slack channel for feedback and create GitHub issues&lt;/li&gt;
&lt;li&gt;Run your test suite after every push and report failures&lt;/li&gt;
&lt;li&gt;Check for dependency updates weekly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hooks for Deterministic Automation
&lt;/h2&gt;

&lt;p&gt;Hooks let you run code at specific points in Claude Code's lifecycle. Unlike the AI-driven &lt;code&gt;/loop&lt;/code&gt; command, hooks are deterministic - they always run the same way.&lt;/p&gt;

&lt;p&gt;You configure them in your settings and they fire on events like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session start&lt;/strong&gt; - set up your environment, load context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before bash commands&lt;/strong&gt; - validate or log commands before execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On permission requests&lt;/strong&gt; - auto-approve specific patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous operation&lt;/strong&gt; - keep Claude running without manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is powerful for teams. You can enforce standards (like running linters before every commit) without relying on each engineer to remember.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git Worktrees for Parallel Sessions
&lt;/h2&gt;

&lt;p&gt;If you've ever wanted Claude to work on two different branches at the same time, worktrees make this possible. Each session gets its own isolated copy of the repo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start a session in a worktree&lt;/span&gt;
claude &lt;span class="nt"&gt;--worktree&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters: you can have Claude refactoring module A while simultaneously building feature B. Neither session interferes with the other.&lt;/p&gt;

&lt;p&gt;This pairs well with &lt;code&gt;/batch&lt;/code&gt;, which fans out work across dozens of parallel agents. Need to update 50 files? &lt;code&gt;/batch&lt;/code&gt; can process them concurrently instead of one at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Voice Input with /voice
&lt;/h2&gt;

&lt;p&gt;You can dictate to Claude instead of typing. This sounds gimmicky until you try it for longer explanations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/voice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's particularly useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explaining complex requirements ("I need a migration that handles both the old and new schema formats, with a rollback path if...")&lt;/li&gt;
&lt;li&gt;Code reviews ("Look at the authentication flow in this PR and tell me if...")&lt;/li&gt;
&lt;li&gt;Brainstorming ("What's the best way to structure this API given these constraints...")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typing detailed prompts takes time. Talking is faster for anything longer than a few sentences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chrome Extension for Frontend Work
&lt;/h2&gt;

&lt;p&gt;Claude Code has a Chrome extension that lets the AI see what your app looks like in the browser. Instead of describing UI bugs, Claude can verify its own output visually.&lt;/p&gt;

&lt;p&gt;This closes the feedback loop for frontend work. Claude makes a change, checks the browser, adjusts if something looks off. You stop being the human screenshot tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  /branch and --fork-session for Experiments
&lt;/h2&gt;

&lt;p&gt;Want to try two different approaches to the same problem? &lt;code&gt;/branch&lt;/code&gt; creates a copy of your current session so you can explore a different path without losing your progress.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fork the current session&lt;/span&gt;
/branch

&lt;span class="c"&gt;# Or fork when starting&lt;/span&gt;
claude &lt;span class="nt"&gt;--fork-session&lt;/span&gt; &amp;lt;session-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is like git branches but for your AI conversation. Try approach A in one branch, approach B in another, then pick the winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  /btw for Side Questions
&lt;/h2&gt;

&lt;p&gt;When Claude is working on a long task, you might have an unrelated question. Instead of interrupting the main task, &lt;code&gt;/btw&lt;/code&gt; lets you ask a side question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/btw what&lt;span class="s1"&gt;'s the difference between SIGTERM and SIGKILL?
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude answers your side question and goes right back to what it was doing. No context switching, no lost progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  --bare for SDK Speed
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Code in scripts or CI pipelines, the &lt;code&gt;--bare&lt;/code&gt; flag skips loading plugins and extra features, making startup up to 10x faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--bare&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"generate a migration for adding user roles"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters when you're calling Claude from automation scripts where every second counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  --add-dir for Multi-Repo Work
&lt;/h2&gt;

&lt;p&gt;Working across multiple repositories? You can give Claude access to all of them in a single session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--add-dir&lt;/span&gt; ~/projects/api &lt;span class="nt"&gt;--add-dir&lt;/span&gt; ~/projects/frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Claude can see your API schema and your frontend code at the same time. No more copying types between repos or explaining your API structure manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Agents with --agent
&lt;/h2&gt;

&lt;p&gt;You can create custom agent configurations with their own system prompts and tool permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--agent&lt;/span&gt; reviewer    &lt;span class="c"&gt;# Uses your custom reviewer agent config&lt;/span&gt;
claude &lt;span class="nt"&gt;--agent&lt;/span&gt; deployer    &lt;span class="c"&gt;# Uses your custom deployer agent config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define these in your &lt;code&gt;.claude/agents/&lt;/code&gt; directory. Each agent can have different instructions, different tool access, and different behaviors. A code reviewer agent doesn't need write access. A deployment agent doesn't need to browse the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means for DevOps
&lt;/h2&gt;

&lt;p&gt;These features shift Claude Code from "AI code assistant" to "AI DevOps team member." The combination of scheduling, hooks, parallel sessions, and multi-repo access means you can automate workflows that previously required custom tooling.&lt;/p&gt;

&lt;p&gt;Here's a realistic DevOps setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;/schedule&lt;/code&gt; reviews all PRs every morning&lt;/li&gt;
&lt;li&gt;Hooks enforce linting and security scanning on every session&lt;/li&gt;
&lt;li&gt;Worktrees let you debug production while shipping features&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--add-dir&lt;/code&gt; gives Claude access to your infra and app repos simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/loop&lt;/code&gt; monitors your staging environment and alerts you on issues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight from Boris's thread: "There is no one right way to use Claude Code." The tool is intentionally flexible. Experiment with these features and build the workflow that fits your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;If you haven't updated Claude Code recently, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Many of these features are recent additions. The mobile app, scheduling, and hooks in particular have been added in the last few months.&lt;/p&gt;

&lt;p&gt;For more DevOps tools and guides, check out our &lt;a href="https://dev.to/exercises"&gt;exercises&lt;/a&gt; and &lt;a href="https://dev.to/quizzes"&gt;quizzes&lt;/a&gt; to sharpen your skills.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was inspired by &lt;a href="https://x.com/bcherny/status/2038454336355999749" rel="noopener noreferrer"&gt;Boris Cherny's thread on X&lt;/a&gt;. Boris is the creator of Claude Code at Anthropic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>linux</category>
    </item>
    <item>
      <title>🎄 Advent of DevOps: 25 Days to Level Up Your DevOps Game!</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Sun, 30 Nov 2025 22:00:00 +0000</pubDate>
      <link>https://forem.com/devopsdaily/advent-of-devops-25-days-to-level-up-your-devops-game-2fb5</link>
      <guid>https://forem.com/devopsdaily/advent-of-devops-25-days-to-level-up-your-devops-game-2fb5</guid>
      <description>&lt;p&gt;Hey DevOps enthusiasts! 👋&lt;/p&gt;

&lt;p&gt;Remember how exciting advent calendars were as a kid? Each day bringing a new surprise behind those little doors? Well, we're bringing that same excitement to the DevOps world, but instead of chocolate (sorry! 🍫), you're getting something even better: &lt;strong&gt;real-world DevOps skills that will make you a better engineer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎁 What is Advent of DevOps?
&lt;/h2&gt;

&lt;p&gt;Think "Advent of Code" meets real-world DevOps challenges. Starting December 1st, we're releasing &lt;strong&gt;25 daily hands-on challenges&lt;/strong&gt; that cover everything you need to know to thrive in modern DevOps environments.&lt;/p&gt;

&lt;p&gt;Each day unlocks a new practical challenge focusing on tools and techniques you'll actually use in production. No theory-heavy lectures, no boring slides—just pure, hands-on learning that you can apply immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 What's Inside?
&lt;/h2&gt;

&lt;p&gt;Here's a taste of what you'll tackle over 25 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐳 &lt;strong&gt;Containerization &amp;amp; Orchestration&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;⚙️ &lt;strong&gt;CI/CD &amp;amp; Automation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Infrastructure as Code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🔒 &lt;strong&gt;Security &amp;amp; Observability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;☁️ &lt;strong&gt;Cloud &amp;amp; Scaling&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Why Join?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🎯 Real-World Skills&lt;/strong&gt;: Every challenge is based on actual scenarios you'll face in production&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📈 Progressive Learning&lt;/strong&gt;: Start easy, level up gradually. Whether you're a beginner or seasoned pro, there's something for you&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎮 Fun &amp;amp; Engaging&lt;/strong&gt;: Gamified progress tracking makes learning addictive (in a good way!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌟 Community-Driven&lt;/strong&gt;: Share solutions, learn from others, and grow together&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⏰ Learn at Your Pace&lt;/strong&gt;: Can't keep up daily? No problem! All challenges remain available year-round&lt;/p&gt;

&lt;h2&gt;
  
  
  🎄 How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick Your Challenge&lt;/strong&gt;: Start with Day 1 or jump to what interests you most&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get Hands-On&lt;/strong&gt;: Each challenge includes clear tasks, starter code, and success criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build &amp;amp; Learn&lt;/strong&gt;: Complete the challenge at your own pace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share &amp;amp; Celebrate&lt;/strong&gt;: Post your wins and solutions with the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level Up&lt;/strong&gt;: Review reference solutions and explanations to deepen your understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each challenge includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Clear task description&lt;/li&gt;
&lt;li&gt;🎯 Success criteria&lt;/li&gt;
&lt;li&gt;🔧 Starter code (when applicable)&lt;/li&gt;
&lt;li&gt;💡 Solution &amp;amp; explanation&lt;/li&gt;
&lt;li&gt;🔗 Additional resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🌟 Join the Community
&lt;/h2&gt;

&lt;p&gt;This isn't just about solo learning—it's about growing together! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share your progress:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow us on X/Twitter: &lt;a href="https://x.com/thedevopsdaily" rel="noopener noreferrer"&gt;@thedevopsdaily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Use hashtag: &lt;strong&gt;#AdventOfDevOps&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Share on LinkedIn, dev.to, wherever you hang out!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contribute:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found a cool solution? Share it!&lt;/li&gt;
&lt;li&gt;Have ideas for challenges? We're open-source!&lt;/li&gt;
&lt;li&gt;Check out our &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and contribute&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Ready to Start?
&lt;/h2&gt;

&lt;p&gt;Don't wait for December 1st to check it out—head over to the page now and get familiar with what's coming:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://devops-daily.com/advent-of-devops" rel="noopener noreferrer"&gt;devops-daily.com/advent-of-devops&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mark your calendar 📅, set your reminders ⏰, and get ready to transform your DevOps skills one day at a time!&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 Who Should Join?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOps Engineers&lt;/strong&gt; looking to sharpen their skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; wanting to understand the ops side better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Administrators&lt;/strong&gt; transitioning to DevOps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Students &amp;amp; Career Changers&lt;/strong&gt; building practical experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone&lt;/strong&gt; curious about modern infrastructure practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No gatekeeping here: if you're interested in DevOps, you're welcome! 🙌&lt;/p&gt;

&lt;h2&gt;
  
  
  🎊 Let's Make This December Special
&lt;/h2&gt;

&lt;p&gt;Learning doesn't have to be boring. It doesn't have to be stressful. And it definitely doesn't have to be lonely.&lt;/p&gt;

&lt;p&gt;This December, join hundreds (thousands?) of DevOps practitioners around the world in leveling up together. One challenge at a time, one skill at a time, one day at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you on December 1st! 🎄✨&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. - Can't wait? Start exploring the challenges now at &lt;a href="https://devops-daily.com/advent-of-devops" rel="noopener noreferrer"&gt;devops-daily.com/advent-of-devops&lt;/a&gt;. They're already live and ready for early birds! 🐦&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.P.S. - This is completely free, open-source, and community-driven. No paywalls, no upsells, just pure learning. If you find value, give us a star on &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and spread the word! ⭐&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow DevOps Daily:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐦 X/Twitter: &lt;a href="https://x.com/thedevopsdaily" rel="noopener noreferrer"&gt;@thedevopsdaily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 GitHub: &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;The-DevOps-Daily/devops-daily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 Website: &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;devops-daily.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy DevOps-ing! 🚀&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>beginners</category>
      <category>adventofcode</category>
    </item>
    <item>
      <title>Building a DDoS Attack Simulator to Understand Defense Strategies</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 21 Nov 2025 09:53:22 +0000</pubDate>
      <link>https://forem.com/devopsdaily/building-a-ddos-attack-simulator-to-understand-defense-strategies-lg4</link>
      <guid>https://forem.com/devopsdaily/building-a-ddos-attack-simulator-to-understand-defense-strategies-lg4</guid>
      <description>&lt;p&gt;I created an educational content piece for DevOps Daily and realized something: most explanations of DDoS attacks are either too abstract or too technical. We talk about "request floods" and "mitigation strategies," but it's hard to visualize what's actually happening.&lt;/p&gt;

&lt;p&gt;So I built an interactive simulator to help bridge that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Learning About DDoS 📚
&lt;/h2&gt;

&lt;p&gt;When you're reading about DDoS protection, you see phrases like "distributes load across multiple servers" or "rate limiting prevents abuse." But what does that actually mean when thousands of requests are hitting your infrastructure?&lt;/p&gt;

&lt;p&gt;I wanted something that would help people - especially those newer to infrastructure work - actually see these concepts in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Simulator Does 🎮
&lt;/h2&gt;

&lt;p&gt;You can try it here: &lt;a href="https://devops-daily.com/games/ddos-simulator" rel="noopener noreferrer"&gt;devops-daily.com/games/ddos-simulator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It lets you simulate three common attack types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Flood&lt;/strong&gt; 🌊 - overwhelming with legitimate-looking requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SYN Flood&lt;/strong&gt; 🔄 - exploiting TCP handshake mechanics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDP Flood&lt;/strong&gt; 📦 - connectionless packet storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part is watching how different defense mechanisms respond. You can toggle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firewall&lt;/strong&gt; 🛡️ - blocks about 30% based on signatures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Balancer&lt;/strong&gt; ⚖️ - reduces impact by 50%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Rate Limit&lt;/strong&gt; 🚦 - blocks high-frequency traffic&lt;/li&gt;
&lt;/ul&gt;
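
&lt;p&gt;To get a feel for how the defenses listed above might combine, here is a toy model in Python. It is not the simulator's actual code: the function name, the 10,000 requests-per-second figure, and the idea that each defense simply scales down the traffic reaching it are all assumptions made for illustration. Only the 30% and 50% figures come from the list above.&lt;/p&gt;

```python
# A toy model, NOT the simulator's real code. Assumption: defenses apply in
# sequence and each one scales down whatever traffic reaches it.

def requests_reaching_server(attack_rps, firewall=False, load_balancer=False):
    """Return the requests per second that still hit a single server."""
    remaining = attack_rps
    if firewall:
        remaining = remaining * (100 - 30) // 100  # firewall blocks about 30%
    if load_balancer:
        remaining = remaining * (100 - 50) // 100  # load balancer halves the impact
    return remaining


# Try every combination of the two toggles against a hypothetical flood.
for firewall, load_balancer in [(False, False), (True, False), (False, True), (True, True)]:
    rps = requests_reaching_server(10_000, firewall, load_balancer)
    print(f"firewall={firewall} load_balancer={load_balancer} rps={rps}")
```

&lt;p&gt;Under those assumptions, turning on both defenses still lets 3,500 of the 10,000 requests per second through.&lt;/p&gt;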

&lt;h2&gt;
  
  
  What I Learned Building It 💡
&lt;/h2&gt;

&lt;p&gt;A few things became clear while working on this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attack intensity matters less than you'd think.&lt;/strong&gt; The attack type and your defense configuration matter way more. A moderate SYN flood with no defenses is worse than an intense HTTP flood with proper rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single defenses aren't enough.&lt;/strong&gt; This is obvious in theory, but seeing it play out makes it concrete. A firewall alone, or a load balancer alone, only gets you so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization helps understanding.&lt;/strong&gt; Watching the server health bar drop while packets animate across the screen creates an intuition that documentation doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Might Find This Useful ⚙️
&lt;/h2&gt;

&lt;p&gt;If you're:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning about infrastructure security&lt;/li&gt;
&lt;li&gt;Trying to explain DDoS concepts to your team&lt;/li&gt;
&lt;li&gt;Deciding what protections to implement&lt;/li&gt;
&lt;li&gt;Just curious how attacks and defenses interact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It might be helpful to play around with it for a bit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next 🚀
&lt;/h2&gt;

&lt;p&gt;I'm planning to add more waves with additional attack vectors and defense mechanisms. Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application-layer attacks&lt;/li&gt;
&lt;li&gt;CDN protection&lt;/li&gt;
&lt;li&gt;Anycast routing&lt;/li&gt;
&lt;li&gt;More realistic traffic patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have thoughts on what would be useful to include, I'd be interested to hear them.&lt;/p&gt;




&lt;p&gt;The goal here is education, not creating chaos. Understanding how attacks work helps you build better defenses. 🛡️&lt;/p&gt;

&lt;p&gt;If you try it out, let me know what you think or if anything is unclear.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>systemdesign</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Right-Sizing Kubernetes Resources with VPA and Karpenter</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 22 Aug 2025 17:02:04 +0000</pubDate>
      <link>https://forem.com/devopsdaily/right-sizing-kubernetes-resources-with-vpa-and-karpenter-22ah</link>
      <guid>https://forem.com/devopsdaily/right-sizing-kubernetes-resources-with-vpa-and-karpenter-22ah</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Setting CPU and memory requests too high in Kubernetes wastes money and reduces cluster efficiency. This guide shows you how to identify overprovisioned workloads, use Vertical Pod Autoscaler (VPA) to right-size your pods, and implement Karpenter for smarter node scaling. You'll also learn to monitor costs and validate your improvements with real metrics.&lt;/p&gt;

&lt;p&gt;When you pad resource requests with too much headroom in Kubernetes, your cluster reserves more capacity than workloads actually need. This leads to underutilized nodes and higher cloud bills. The problem gets worse at scale: imagine 200 pods each requesting 2 CPU cores but only using 200m. That's 400 reserved cores when actual demand is closer to 40 cores.&lt;/p&gt;
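
&lt;p&gt;The arithmetic behind that example can be sketched in a few lines of Python. The figures are the ones from the paragraph above; the function and variable names are made up for illustration.&lt;/p&gt;

```python
# Illustrative only: names are hypothetical, numbers come from the example above.

def reserved_and_used_cores(pod_count, request_millicores, usage_millicores):
    """Return (reserved, used) CPU cores for a set of identical pods."""
    reserved = pod_count * request_millicores / 1000
    used = pod_count * usage_millicores / 1000
    return reserved, used


reserved, used = reserved_and_used_cores(
    pod_count=200,
    request_millicores=2000,  # each pod requests 2 CPU cores
    usage_millicores=200,     # each pod actually uses about 200m
)
idle = reserved - used
print(f"reserved={reserved:.0f} cores, used={used:.0f} cores, idle={idle:.0f} cores")
```

&lt;p&gt;With these inputs, 360 of the 400 reserved cores sit idle, which is 90% of the reservation.&lt;/p&gt;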

&lt;p&gt;The solution involves right-sizing both your pods and nodes. You'll use monitoring data to understand actual usage, apply VPA to adjust pod requests automatically, and leverage Karpenter to provision nodes that match your workload requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster (version 1.20 or higher) with metrics-server installed&lt;/li&gt;
&lt;li&gt;kubectl configured with admin access to your cluster&lt;/li&gt;
&lt;li&gt;Prometheus and Grafana deployed for monitoring (or similar observability stack)&lt;/li&gt;
&lt;li&gt;Basic understanding of Kubernetes resource requests and limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll also need the ability to install cluster-wide components like VPA and Karpenter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying Overprovisioned Workloads
&lt;/h2&gt;

&lt;p&gt;The first step is understanding how your current workloads use resources compared to what they request. You can start with kubectl to get a quick snapshot of resource usage across your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current resource usage for all nodes&lt;/span&gt;
kubectl top nodes

&lt;span class="c"&gt;# View pod resource usage across all namespaces&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cpu

&lt;span class="c"&gt;# Show the requests and limits allocated on each node&lt;/span&gt;
kubectl describe nodes | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 15 &lt;span class="s2"&gt;"Allocated resources"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These commands show you the gap between requested and actual resource usage. If you see pods consistently using 50Mi of memory while requesting 1Gi, or using 100m CPU while requesting 1000m, those are prime candidates for right-sizing.&lt;/p&gt;

&lt;p&gt;For deeper analysis, you'll want historical data from Prometheus. Here are some key queries to run in your Grafana dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Note: these cAdvisor metrics describe limits. To compare against requests,
# use kube_pod_container_resource_requests from kube-state-metrics instead.

# CPU utilization percentage (actual usage vs the CPU limit; quota / period is the limit in cores)
(rate(container_cpu_usage_seconds_total{container!=""}[5m]) * 100) /
(container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})

# Memory utilization percentage (working set vs the memory limit)
(container_memory_working_set_bytes{container!=""} * 100) /
container_spec_memory_limit_bytes{container!=""}

# Top 10 containers with the highest limit-to-usage ratio (biggest waste)
topk(10,
  (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""}) /
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these queries over a 2-week period to account for traffic variations and identify consistent patterns. Workloads running at 10-20% utilization with stable traffic are good candidates for optimization.&lt;/p&gt;
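
&lt;p&gt;Once you have averages for that period, ranking workloads by utilization is a quick way to surface candidates. The sketch below uses hypothetical sample numbers and a made-up &lt;code&gt;Workload&lt;/code&gt; class; in practice the values would come from your Prometheus queries.&lt;/p&gt;

```python
# Hypothetical sample data; in practice these numbers would come from Prometheus.
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requested_millicores: int
    avg_used_millicores: int

    def utilization_pct(self):
        """Average CPU usage as a percentage of the CPU request."""
        return 100 * self.avg_used_millicores / self.requested_millicores


workloads = [
    Workload("web-service", 1000, 150),
    Workload("batch-worker", 2000, 1600),
    Workload("api-gateway", 500, 60),
]

# Lowest utilization first: these are the strongest right-sizing candidates.
ranked = sorted(workloads, key=Workload.utilization_pct)
for w in ranked:
    print(f"{w.name}: {w.utilization_pct():.0f}% of requested CPU")
```

&lt;p&gt;With this sample data, the two workloads at 12% and 15% utilization land at the top of the list, while the one at 80% is already sized reasonably.&lt;/p&gt;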

&lt;h2&gt;
  
  
  Installing and Configuring VPA
&lt;/h2&gt;

&lt;p&gt;Vertical Pod Autoscaler analyzes your workloads and recommends optimal CPU and memory values. Start by installing VPA in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the VPA repository&lt;/span&gt;
git clone https://github.com/kubernetes/autoscaler.git
&lt;span class="nb"&gt;cd &lt;/span&gt;autoscaler/vertical-pod-autoscaler

&lt;span class="c"&gt;# Deploy VPA components&lt;/span&gt;
./hack/vpa-up.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



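&lt;p&gt;This script installs three main components: the VPA recommender (computes recommendations from usage history), the updater (evicts pods whose requests are out of date so they can be recreated), and the admission controller (sets the recommended requests on pods as they are created).&lt;/p&gt;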

&lt;p&gt;Next, create a VPA configuration for a workload you want to optimize. Start with recommendation mode to see suggested values before making changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vpa-web-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VerticalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service-vpa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apps/v1'&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Off'&lt;/span&gt; &lt;span class="c1"&gt;# Only provide recommendations, don't auto-update&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="c1"&gt;# Set boundaries to prevent extreme recommendations&lt;/span&gt;
        &lt;span class="na"&gt;maxAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4Gi'&lt;/span&gt;
        &lt;span class="na"&gt;minAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100m'&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;128Mi'&lt;/span&gt;
        &lt;span class="na"&gt;controlledResources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;memory'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the VPA configuration and wait for recommendations to generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; vpa-web-service.yaml

&lt;span class="c"&gt;# Wait a few minutes, then check recommendations&lt;/span&gt;
kubectl describe vpa web-service-vpa &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



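&lt;p&gt;The output shows recommended values for CPU and memory in the &lt;code&gt;Recommendation&lt;/code&gt; block of the &lt;code&gt;Status&lt;/code&gt; section, including a target plus lower and upper bounds for each container. By default, VPA bases its target on roughly the 90th percentile of usage over about the past 8 days, weighting recent samples more heavily, which provides a safety buffer while eliminating waste.&lt;/p&gt;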
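&lt;p&gt;If you only want the target values rather than the full description, a JSONPath query is one option. This is a minimal sketch: the helper name &lt;code&gt;vpa_targets&lt;/code&gt; is hypothetical, and it assumes the recommender has already populated &lt;code&gt;status.recommendation&lt;/code&gt; on the VPA object.&lt;/p&gt;

```shell
# Hypothetical helper: print "container cpu memory" for each VPA target recommendation.
# Arguments: VPA name, namespace.
vpa_targets() {
  vpa_name="$1"
  namespace="$2"
  kubectl get vpa "$vpa_name" -n "$namespace" \
    -o jsonpath='{range .status.recommendation.containerRecommendations[*]}{.containerName}{" "}{.target.cpu}{" "}{.target.memory}{"\n"}{end}'
}

# Usage: vpa_targets web-service-vpa production
```

&lt;p&gt;An empty result usually means the recommender has not collected enough data yet.&lt;/p&gt;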

&lt;h2&gt;
  
  
  Applying VPA Recommendations Safely
&lt;/h2&gt;

&lt;p&gt;Once you have solid recommendations, you can apply them gradually. Start with non-critical workloads and monitor for any issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Update your deployment with VPA recommendations&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.21&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;250m'&lt;/span&gt; &lt;span class="c1"&gt;# Reduced from 1000m based on VPA recommendation&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;512Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Reduced from 2Gi based on VPA recommendation&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;500m'&lt;/span&gt; &lt;span class="c1"&gt;# Set limits 2x requests for burst capacity&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After updating requests, monitor your workloads for at least a week. Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased pod restarts or OOMKilled events&lt;/li&gt;
&lt;li&gt;Higher response times or error rates&lt;/li&gt;
&lt;li&gt;Pods getting evicted under memory pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything runs smoothly, you can switch VPA to automatic mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update VPA to automatically apply changes&lt;/span&gt;
kubectl patch vpa web-service-vpa &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"updatePolicy":{"updateMode":"Auto"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Auto mode, VPA evicts pods whose requests differ significantly from the recommendation, and the replacement pods start with the new values. Make sure you have proper PodDisruptionBudgets in place to maintain availability during updates.&lt;/p&gt;
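&lt;p&gt;For the example Deployment, a PodDisruptionBudget could be created along these lines. This is a sketch, assuming the &lt;code&gt;app=web-service&lt;/code&gt; label and 3 replicas from the Deployment shown earlier; the PDB name &lt;code&gt;web-service-pdb&lt;/code&gt; and the minimum of 2 available pods are illustrative choices, not values from a real cluster.&lt;/p&gt;

```shell
# Hypothetical helper: keep at least 2 of the 3 web-service replicas running
# while pods are evicted. The PDB name and the minimum are assumptions.
create_web_service_pdb() {
  kubectl create poddisruptionbudget web-service-pdb \
    --namespace production \
    --selector=app=web-service \
    --min-available=2
}

# Usage: create_web_service_pdb
```

&lt;p&gt;With this budget in place, evictions should proceed one pod at a time for a 3-replica Deployment.&lt;/p&gt;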

&lt;h2&gt;
  
  
  Setting Up Karpenter for Node Optimization
&lt;/h2&gt;

&lt;p&gt;While VPA optimizes individual pods, Karpenter optimizes your entire node infrastructure. Instead of fixed node groups, Karpenter provisions nodes dynamically based on your workload requirements.&lt;/p&gt;

&lt;p&gt;First, install Karpenter in your cluster. The exact steps depend on your cloud provider, but here's the process for AWS EKS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Karpenter using Helm&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; karpenter oci://public.ecr.aws/karpenter/karpenter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; &lt;span class="s2"&gt;"v0.32.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"karpenter"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="s2"&gt;"settings.clusterName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="s2"&gt;"settings.interruptionQueueName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a NodePool that defines what types of nodes Karpenter can provision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# karpenter-nodepool.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePool&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Template for nodes Karpenter will create&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Instance requirements - Karpenter will pick the best fit&lt;/span&gt;
      &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/arch&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amd64'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spot'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on-demand'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Allow both for cost optimization&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.large'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.xlarge'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.2xlarge'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r6i.large'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r6i.xlarge'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

      &lt;span class="c1"&gt;# Node configuration&lt;/span&gt;
      &lt;span class="na"&gt;nodeClassRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;

      &lt;span class="c1"&gt;# Optional taint: only pods that tolerate it can schedule here; remove it for a truly general-purpose pool&lt;/span&gt;
      &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/unschedulable&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;

  &lt;span class="c1"&gt;# Scaling and disruption policies&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# Maximum CPU across all nodes in this pool&lt;/span&gt;
  &lt;span class="na"&gt;disruption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consolidationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WhenUnderutilized&lt;/span&gt; &lt;span class="c1"&gt;# consolidateAfter is only valid with WhenEmpty in v1beta1, so it is omitted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the corresponding EC2NodeClass for AWS-specific configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# karpenter-nodeclass.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
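  &lt;span class="c1"&gt;# AMI and instance configuration&lt;/span&gt;
  &lt;span class="na"&gt;amiFamily&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AL2&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KarpenterNodeRole-${CLUSTER_NAME}'&lt;/span&gt; &lt;span class="c1"&gt;# Required in v1beta1; this role name is an assumption, use your node IAM role&lt;/span&gt;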
  &lt;span class="na"&gt;subnetSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${CLUSTER_NAME}'&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${CLUSTER_NAME}'&lt;/span&gt;

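  &lt;span class="c1"&gt;# Optional user data; for the AL2 family Karpenter merges it with its own bootstrap script, so calling bootstrap.sh here is usually redundant&lt;/span&gt;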
  &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;#!/bin/bash&lt;/span&gt;
    &lt;span class="s"&gt;/etc/eks/bootstrap.sh ${CLUSTER_NAME}&lt;/span&gt;

  &lt;span class="c1"&gt;# Tags for cost tracking&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



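&lt;p&gt;Apply both configurations. &lt;code&gt;kubectl&lt;/code&gt; does not expand &lt;code&gt;${CLUSTER_NAME}&lt;/code&gt;, so replace it with your cluster name in the NodeClass file first:&lt;br&gt;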
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; karpenter-nodepool.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; karpenter-nodeclass.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter now watches for unschedulable pods and provisions appropriately sized nodes for them. When you deploy workloads with right-sized resource requests (thanks to VPA), Karpenter can select smaller, more cost-effective instances.&lt;/p&gt;
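&lt;p&gt;To see which instances Karpenter actually picked, you can list the nodes it created together with their instance type and capacity type. This sketch assumes the v1beta1 APIs used above, where Karpenter labels its nodes with &lt;code&gt;karpenter.sh/nodepool&lt;/code&gt;; the helper name &lt;code&gt;karpenter_nodes&lt;/code&gt; is hypothetical.&lt;/p&gt;

```shell
# Hypothetical helper: list nodes created for a NodePool, with extra columns
# for instance type and capacity type (spot or on-demand).
# Assumes nodes carry the karpenter.sh/nodepool label (v1beta1 APIs).
karpenter_nodes() {
  nodepool="$1"
  kubectl get nodes -l "karpenter.sh/nodepool=$nodepool" \
    -L node.kubernetes.io/instance-type \
    -L karpenter.sh/capacity-type
}

# Usage: karpenter_nodes general-purpose
```

&lt;p&gt;Comparing this list before and after right-sizing gives a rough view of whether smaller instances are being chosen.&lt;/p&gt;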

&lt;h2&gt;
  
  
  Monitoring Cost Impact
&lt;/h2&gt;

&lt;p&gt;To validate your optimizations, you need visibility into resource costs. Kubecost provides detailed insights into how much each workload costs and how much capacity you're wasting.&lt;/p&gt;

&lt;p&gt;Install Kubecost in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add the Kubecost Helm repository&lt;/span&gt;
helm repo add kubecost https://kubecost.github.io/cost-analyzer/

&lt;span class="c"&gt;# Install Kubecost with Prometheus integration&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;kubecost kubecost/cost-analyzer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kubecost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;kubecostToken&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-token-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.server.global.external_labels.cluster_id&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the Kubecost UI by port-forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; kubecost deployment/kubecost-cost-analyzer 9090:9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Kubecost dashboard, focus on these key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency scores&lt;/strong&gt;: The percentage of requested resources actually being used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle costs&lt;/strong&gt;: Money spent on provisioned but unused resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing recommendations&lt;/strong&gt;: Suggestions for adjusting requests and limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace costs&lt;/strong&gt;: Spend per namespace, which helps identify which teams or applications drive costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track these metrics before and after implementing VPA and Karpenter to quantify your savings.&lt;/p&gt;
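&lt;p&gt;As a rough mental model, efficiency is usage divided by requests. The sketch below computes that ratio as a rounded percentage; it is a simplification for illustration, not Kubecost's own cost-weighted formula, and the function name and sample values are assumptions.&lt;/p&gt;

```shell
# Simplified efficiency: used / requested as a rounded whole-number percentage.
# Both arguments must use the same unit (for example millicores or Mi).
efficiency_pct() {
  used="$1"
  requested="$2"
  echo $(( (used * 100 + requested / 2) / requested ))
}

efficiency_pct 150 1000   # CPU: 150m used of 1000m requested
efficiency_pct 400 2048   # memory: 400Mi used of 2048Mi (2Gi) requested
```

&lt;p&gt;For these sample values the script prints 15 and 20, meaning most of the requested capacity sits idle.&lt;/p&gt;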

&lt;h2&gt;
  
  
  Real-World Optimization Example
&lt;/h2&gt;

&lt;p&gt;Let's walk through optimizing a typical microservice deployment. You start with a Node.js API that was conservatively configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before optimization&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1000m'&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2Gi'&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2000m'&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4Gi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this workload for two weeks, your monitoring shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average CPU usage: 150m (15% of requests)&lt;/li&gt;
&lt;li&gt;Average memory usage: 400Mi (20% of requests)&lt;/li&gt;
&lt;li&gt;Peak CPU usage: 300m&lt;/li&gt;
&lt;li&gt;Peak memory usage: 800Mi&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on this data, VPA recommends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Requests based on the VPA recommendation; limits set at 2x requests&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200m'&lt;/span&gt; &lt;span class="c1"&gt;# Above the 150m average; the 300m peak bursts into the limit&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;512Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Above the 400Mi average&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;400m'&lt;/span&gt; &lt;span class="c1"&gt;# 2x requests for burst capacity&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Headroom above the 800Mi peak&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost impact for 20 replicas of this service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt;: 20 CPU cores, 40Gi memory requested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt;: 4 CPU cores, 10Gi memory requested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: 80% reduction in requested CPU and 75% reduction in requested memory&lt;/li&gt;
&lt;/ul&gt;
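&lt;p&gt;The arithmetic behind these totals can be checked with a few lines of shell. The per-pod values come from the before and after requests in this example, with memory converted to Mi (2Gi = 2048Mi); the function names are made up for illustration.&lt;/p&gt;

```shell
# Total requested amount for N replicas, and percent reduction between two totals.
total_request() { echo $(( $1 * $2 )); }               # replicas, per-pod amount
reduction_pct() { echo $(( ($1 - $2) * 100 / $1 )); }  # before, after

replicas=20
cpu_before=$(total_request "$replicas" 1000)   # millicores per pod before
cpu_after=$(total_request "$replicas" 200)     # millicores per pod after
mem_before=$(total_request "$replicas" 2048)   # Mi per pod before
mem_after=$(total_request "$replicas" 512)     # Mi per pod after

echo "CPU: ${cpu_before}m to ${cpu_after}m ($(reduction_pct "$cpu_before" "$cpu_after")% less)"
echo "Memory: ${mem_before}Mi to ${mem_after}Mi ($(reduction_pct "$mem_before" "$mem_after")% less)"
```

&lt;p&gt;With these inputs the script reports an 80% reduction for CPU (20 cores down to 4) and a 75% reduction for memory (40Gi down to 10Gi).&lt;/p&gt;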

&lt;p&gt;With Karpenter managing nodes, this workload now runs on smaller instances, further reducing costs by eliminating the need for oversized nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Resource Quotas and Guardrails
&lt;/h2&gt;

&lt;p&gt;As you roll out right-sizing across your organization, implement quotas to prevent teams from reverting to oversized requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# namespace-quota.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceQuota&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend-team-quota&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50'&lt;/span&gt; &lt;span class="c1"&gt;# Total CPU requests across all pods&lt;/span&gt;
    &lt;span class="na"&gt;requests.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Total memory requests&lt;/span&gt;
    &lt;span class="na"&gt;limits.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100'&lt;/span&gt; &lt;span class="c1"&gt;# Total CPU limits&lt;/span&gt;
    &lt;span class="na"&gt;limits.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Total memory limits&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100'&lt;/span&gt; &lt;span class="c1"&gt;# Maximum number of pods&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create LimitRanges to enforce reasonable defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# limit-range.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LimitRange&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-limits&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Default limits if not specified&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;500m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt;
      &lt;span class="na"&gt;defaultRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Default requests if not specified&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;256Mi'&lt;/span&gt;
      &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Maximum allowed values&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;8Gi'&lt;/span&gt;
      &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Minimum required values&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;64Mi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These guardrails help maintain optimization gains while giving teams flexibility within reasonable bounds.&lt;/p&gt;
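&lt;p&gt;Once the quota and limit range are applied, you can check how much of the budget a namespace is already using. A minimal sketch, with a hypothetical helper name and the &lt;code&gt;backend&lt;/code&gt; namespace from the examples above:&lt;/p&gt;

```shell
# Hypothetical helper: show used vs. hard quota values and the
# default/min/max container limits for a namespace.
show_guardrails() {
  namespace="$1"
  kubectl describe resourcequota -n "$namespace"
  kubectl describe limitrange -n "$namespace"
}

# Usage: show_guardrails backend
```

&lt;p&gt;A namespace that sits close to its hard limits is a good candidate for another right-sizing pass before the quota is raised.&lt;/p&gt;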

&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;When implementing VPA and Karpenter, you might encounter some challenges. Here are solutions to the most common problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPA recommendations seem too aggressive&lt;/strong&gt;: VPA sometimes suggests very low values during low-traffic periods. Check that your monitoring data covers representative traffic patterns, and raise &lt;code&gt;minAllowed&lt;/code&gt; if you need a higher floor. You can also limit what VPA is allowed to change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="na"&gt;controlledValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RequestsOnly&lt;/span&gt; &lt;span class="c1"&gt;# Only adjust requests, leave limits alone&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Karpenter nodes aren't scaling down&lt;/strong&gt;: This usually happens when pods can't be evicted. Check for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Look for pods without PodDisruptionBudgets&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; wide | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; Terminating

&lt;span class="c"&gt;# Check for pods using local storage or host networking&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 hostNetwork

&lt;span class="c"&gt;# Verify PodDisruptionBudgets allow eviction&lt;/span&gt;
kubectl get pdb &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
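
&lt;p&gt;To narrow this down to one node that refuses to scale down, the checks above could be scoped as follows. This is a sketch; &lt;code&gt;&amp;lt;node-name&amp;gt;&lt;/code&gt; is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the pods still scheduled on a specific node
kubectl get pods --all-namespaces --field-selector spec.nodeName=&amp;lt;node-name&amp;gt;

# Show how many disruptions each PodDisruptionBudget currently allows
kubectl get pdb --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A budget that allows zero disruptions blocks eviction of the pods it covers, which in turn keeps their node alive.&lt;/p&gt;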



&lt;p&gt;&lt;strong&gt;Pods getting OOMKilled after VPA optimization&lt;/strong&gt;: This indicates VPA recommendations were too low. Temporarily increase memory requests and check for memory leaks in your application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check recent OOM events&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp | &lt;span class="nb"&gt;grep &lt;/span&gt;OOMKilled

&lt;span class="c"&gt;# Monitor memory usage patterns&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
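
&lt;p&gt;Events expire after a while, so it can also be worth confirming the kill on the pod itself. A minimal sketch, with &lt;code&gt;&amp;lt;pod-name&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;namespace&amp;gt;&lt;/code&gt; as placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print why each container's previous instance terminated
kubectl get pod &amp;lt;pod-name&amp;gt; -n &amp;lt;namespace&amp;gt; -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A reason of &lt;code&gt;OOMKilled&lt;/code&gt; means the previous container instance was killed for running out of memory.&lt;/p&gt;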



&lt;p&gt;You can make VPA more conservative by bounding its recommendations. A &lt;code&gt;minAllowed&lt;/code&gt; floor keeps memory requests from dropping too low, and &lt;code&gt;maxAllowed&lt;/code&gt; caps them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="na"&gt;minAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;256Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Example floor, tune to your workload&lt;/span&gt;
        &lt;span class="na"&gt;maxAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Set a reasonable upper bound&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Now that you have VPA and Karpenter working together, consider these additional optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling&lt;/strong&gt;: Combine with VPA to handle both vertical and horizontal scaling, but avoid letting both act on the same CPU or memory metric for a workload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaler tuning&lt;/strong&gt;: If using multiple node provisioners, configure them to work together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost alerts&lt;/strong&gt;: Set up notifications when resource costs exceed thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular reviews&lt;/strong&gt;: Schedule monthly reviews of VPA recommendations and cost reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also explore more advanced Karpenter features like multiple NodePools for different workload types (CPU-intensive, memory-intensive, GPU workloads) and spot instance strategies for non-critical workloads.&lt;/p&gt;

&lt;p&gt;The key is to treat right-sizing as an ongoing process. As your applications evolve and traffic patterns change, continue monitoring and adjusting to maintain optimal resource utilization.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>docker</category>
    </item>
    <item>
      <title>The 5-Minute Kubernetes Cluster Health Check</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 15 Aug 2025 10:30:08 +0000</pubDate>
      <link>https://forem.com/devopsdaily/the-5-minute-kubernetes-cluster-health-check-b89</link>
      <guid>https://forem.com/devopsdaily/the-5-minute-kubernetes-cluster-health-check-b89</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;You can check your Kubernetes cluster's health in under 5 minutes using five key commands: checking node status, monitoring resource usage, reviewing pod health across namespaces, investigating problem pods, and examining cluster events. This quick routine helps catch issues before they escalate into critical problems.&lt;/p&gt;

&lt;p&gt;Kubernetes is great until it's not. One bad node, a pod stuck in CrashLoopBackOff, or a resource spike can ruin your day. The good news? You don't need to spend an hour digging through dashboards to spot trouble early. With a few quick commands, you can get a solid read on your cluster's health in under 5 minutes.&lt;/p&gt;

&lt;p&gt;Here's how to do it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make Sure Your Nodes Are Happy
&lt;/h2&gt;

&lt;p&gt;Start by checking the overall status of your cluster nodes. This gives you a baseline view of your infrastructure's health.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command displays all nodes in your cluster along with their detailed information. You'll see each node's status, roles, age, version, internal and external IPs, OS image, kernel version, and container runtime.&lt;/p&gt;

&lt;p&gt;What you want to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STATUS&lt;/strong&gt; should be &lt;code&gt;Ready&lt;/code&gt; for all nodes&lt;/li&gt;
&lt;li&gt;No mystery nodes suddenly showing up in your cluster&lt;/li&gt;
&lt;li&gt;Roles, IPs, and ages that make sense for your environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you spot &lt;code&gt;NotReady&lt;/code&gt;, that's your cue to dig deeper. A node in this state might be experiencing network issues, resource exhaustion, or kubelet problems.&lt;/p&gt;
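
&lt;p&gt;One way to dig deeper is to inspect the node's conditions and recent events. A minimal sketch, with &lt;code&gt;&amp;lt;node-name&amp;gt;&lt;/code&gt; as a placeholder for the affected node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Full detail: conditions, capacity, allocated resources, and recent events
kubectl describe node &amp;lt;node-name&amp;gt;

# Just the condition types and their status, one per line
kubectl get node &amp;lt;node-name&amp;gt; -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a healthy node, &lt;code&gt;Ready&lt;/code&gt; is &lt;code&gt;True&lt;/code&gt; and pressure conditions such as &lt;code&gt;MemoryPressure&lt;/code&gt; and &lt;code&gt;DiskPressure&lt;/code&gt; are &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;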

&lt;h2&gt;
  
  
  Check Resource Usage at a Glance
&lt;/h2&gt;

&lt;p&gt;Next, get a quick overview of resource consumption across your nodes to identify potential bottlenecks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command shows CPU and memory usage for each node in your cluster. It provides both absolute values and percentages, making it easy to spot resource pressure.&lt;/p&gt;

&lt;p&gt;Keep an eye out for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU or memory regularly above 80% on any node&lt;/li&gt;
&lt;li&gt;One node doing all the heavy lifting while others are barely working&lt;/li&gt;
&lt;li&gt;Sudden spikes that don't match your expected workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No &lt;code&gt;metrics-server&lt;/code&gt; running? Install it with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metrics-server backs &lt;code&gt;kubectl top&lt;/code&gt; and is required for Horizontal Pod Autoscalers that scale on CPU or memory usage.&lt;/p&gt;
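
&lt;p&gt;To confirm that it is running, something like the following should work. This sketch assumes the default manifest, which deploys &lt;code&gt;metrics-server&lt;/code&gt; into the &lt;code&gt;kube-system&lt;/code&gt; namespace; adjust the names if your installation differs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Check that the deployment exists and its replicas are ready
kubectl get deployment metrics-server -n kube-system

# Check that the metrics API is registered and available
kubectl get apiservice v1beta1.metrics.k8s.io
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;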

&lt;h2&gt;
  
  
  Look at All Pods Across All Namespaces
&lt;/h2&gt;

&lt;p&gt;Get a bird's-eye view of all pods running in your cluster to quickly identify any that are misbehaving.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command lists every pod across all namespaces, showing their current status, restart count, and age. It's like taking the pulse of your entire application ecosystem.&lt;/p&gt;

&lt;p&gt;Healthy pods should be &lt;code&gt;Running&lt;/code&gt; or &lt;code&gt;Completed&lt;/code&gt;. If you see states like &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, &lt;code&gt;ImagePullBackOff&lt;/code&gt;, &lt;code&gt;Pending&lt;/code&gt;, or &lt;code&gt;Error&lt;/code&gt;, note the namespace and pod name for further investigation.&lt;/p&gt;

&lt;p&gt;Also watch the &lt;strong&gt;RESTARTS&lt;/strong&gt; column closely. If a pod has restarted a dozen times in the last hour, something's definitely off. Frequent restarts often indicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application crashes due to bugs or configuration issues&lt;/li&gt;
&lt;li&gt;Failing health checks (readiness or liveness probes)&lt;/li&gt;
&lt;li&gt;Resource limits being exceeded&lt;/li&gt;
&lt;li&gt;Dependencies being unavailable&lt;/li&gt;
&lt;/ul&gt;
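
&lt;p&gt;On a busy cluster the full pod list gets long. As a rough sketch, these two filters can surface the likely troublemakers faster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Pods whose phase is neither Running nor Succeeded (for example Pending or Failed)
kubectl get pods --all-namespaces --field-selector=status.phase!=Running,status.phase!=Succeeded

# All pods ordered by the restart count of their first container (highest last)
kubectl get pods --all-namespaces --sort-by='.status.containerStatuses[0].restartCount'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first filter will miss pods in &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, because their phase is still reported as &lt;code&gt;Running&lt;/code&gt;. That is why sorting by restart count is a useful second pass.&lt;/p&gt;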

&lt;h2&gt;
  
  
  Zoom In on Problem Pods
&lt;/h2&gt;

&lt;p&gt;When you spot problematic pods, dig deeper to understand what's causing the issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;pod-name&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;namespace&amp;gt;&lt;/code&gt; with the actual values from your problem pods. This command provides detailed information about the pod's configuration, current state, and recent events.&lt;/p&gt;

&lt;p&gt;Check for these common issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Events at the bottom&lt;/strong&gt; (often the smoking gun that reveals the root cause)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failing readiness or liveness probes&lt;/strong&gt; that prevent the pod from receiving traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image pull errors&lt;/strong&gt; indicating registry access problems or incorrect image names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limit issues&lt;/strong&gt; where the pod exceeds its memory or CPU constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The events section is particularly valuable because it shows a chronological history of what happened to the pod, including scheduling decisions, volume mounts, and error conditions.&lt;/p&gt;
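
&lt;p&gt;When the events point to a crashing container, its logs are usually the next stop. A minimal sketch, using the same placeholders as above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Last 50 log lines from the current container instance
kubectl logs &amp;lt;pod-name&amp;gt; -n &amp;lt;namespace&amp;gt; --tail=50

# Logs from the previous instance, useful after a crash or restart
kubectl logs &amp;lt;pod-name&amp;gt; -n &amp;lt;namespace&amp;gt; --previous
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For pods with more than one container, add &lt;code&gt;-c &amp;lt;container-name&amp;gt;&lt;/code&gt; to pick the one you want.&lt;/p&gt;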

&lt;h2&gt;
  
  
  Check the Cluster's Event Log
&lt;/h2&gt;

&lt;p&gt;Get insight into what's been happening across your entire cluster by examining recent events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command shows cluster-wide events sorted by when they occurred, giving you a timeline of recent activity. Events provide context about system-level operations and can reveal patterns or issues that affect multiple components.&lt;/p&gt;

&lt;p&gt;Events will tell you what's been happening behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed volume mounts that prevent pods from starting&lt;/li&gt;
&lt;li&gt;DNS resolution errors affecting service communication&lt;/li&gt;
&lt;li&gt;Scheduling issues when pods can't be placed on nodes&lt;/li&gt;
&lt;li&gt;Node pressure warnings indicating resource constraints&lt;/li&gt;
&lt;/ul&gt;
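
&lt;p&gt;Routine events can drown out the interesting ones. One possible way to cut the noise is to show only warnings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Only Warning events, across all namespaces, oldest first
kubectl get events --all-namespaces --field-selector type=Warning --sort-by=.metadata.creationTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;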

&lt;h2&gt;
  
  
  Try k9s for a Better View
&lt;/h2&gt;

&lt;p&gt;If you want something more interactive than command-line tools, give &lt;strong&gt;&lt;a href="https://k9scli.io/" rel="noopener noreferrer"&gt;k9s&lt;/a&gt;&lt;/strong&gt; a try. It's a terminal-based UI for Kubernetes that provides real-time cluster information in an intuitive interface.&lt;/p&gt;

&lt;p&gt;k9s lets you browse resources, view logs, and drill into problems without typing long commands. You can navigate between different resource types using simple keystrokes, filter resources, and even perform actions like scaling deployments or deleting pods.&lt;/p&gt;

&lt;p&gt;Once you try k9s, it's hard to go back to plain kubectl for exploratory tasks. It's particularly useful when you need to quickly jump between different namespaces or resource types during troubleshooting.&lt;/p&gt;

&lt;p&gt;Five minutes a day is all it takes to stay ahead of most cluster problems. Make this health check part of your daily routine and you'll catch issues before they blow up and before your pager goes off at 3 a.m. Regular monitoring helps you understand your cluster's normal behavior, making it easier to spot anomalies when they occur.&lt;/p&gt;
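
&lt;p&gt;To make the routine repeatable, the checks could be wrapped in a small script. This is only a sketch: the file name &lt;code&gt;cluster-health.sh&lt;/code&gt; is made up for illustration, and it assumes &lt;code&gt;kubectl&lt;/code&gt; already points at the cluster you want to check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# cluster-health.sh (hypothetical name): runs the quick checks in order
set -u

echo "== Nodes =="
kubectl get nodes -o wide

echo "== Node resource usage (needs metrics-server) =="
kubectl top nodes || echo "metrics not available"

echo "== Pods that are not Running or Completed =="
kubectl get pods --all-namespaces | grep -Ev 'Running|Completed'

echo "== Most recent events =="
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp | tail -n 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;grep&lt;/code&gt; filter keeps the header row and drops healthy pods, so an output with only the header means nothing needs attention.&lt;/p&gt;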

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>linux</category>
    </item>
    <item>
      <title>What’s the Most Underrated DevOps Skill You’ve Learned (and How Did You Learn It)?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Tue, 05 Aug 2025 07:43:56 +0000</pubDate>
      <link>https://forem.com/devopsdaily/whats-the-most-underrated-devops-skill-youve-learned-and-how-did-you-learn-it-5a7i</link>
      <guid>https://forem.com/devopsdaily/whats-the-most-underrated-devops-skill-youve-learned-and-how-did-you-learn-it-5a7i</guid>
      <description>&lt;p&gt;When we think about DevOps skills, we usually picture Kubernetes, Terraform, CI/CD pipelines, or cloud automation.&lt;/p&gt;

&lt;p&gt;But some of the most valuable skills are the ones that never make it into a certification or a tech stack diagram.&lt;/p&gt;

&lt;p&gt;It could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staying calm during a production incident and knowing how to prioritize actions&lt;/li&gt;
&lt;li&gt;Communicating effectively with teams under pressure&lt;/li&gt;
&lt;li&gt;Spotting patterns in logs and metrics that others might miss&lt;/li&gt;
&lt;li&gt;Finding ways to optimize cloud costs without slowing down delivery&lt;/li&gt;
&lt;li&gt;Automating the boring stuff so you can focus on the real problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, one of the most underrated skills I've learned is knowing when &lt;em&gt;not&lt;/em&gt; to automate something. Sometimes the "manual but reliable" approach saves you from a lot of complexity and maintenance overhead later.&lt;/p&gt;

&lt;p&gt;What about you?&lt;br&gt;
What's the most underrated DevOps skill you've picked up along the way, and how did you learn it?&lt;/p&gt;

&lt;p&gt;P.S. You might find some useful DevOps resources at &lt;a href="http://devops-daily.com/" rel="noopener noreferrer"&gt;devops-daily.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>beginners</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What's the One DevOps "Best Practice" You Secretly Ignore (and Why)?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Wed, 30 Jul 2025 14:04:41 +0000</pubDate>
      <link>https://forem.com/devopsdaily/whats-the-one-devops-best-practice-you-secretly-ignore-and-why-2460</link>
      <guid>https://forem.com/devopsdaily/whats-the-one-devops-best-practice-you-secretly-ignore-and-why-2460</guid>
      <description>&lt;p&gt;We've all read the books, followed the gurus, and tried to tick every box in the DevOps checklist, but let’s be honest:&lt;/p&gt;

&lt;p&gt;There's always that one best practice that just doesn’t work for your team, your stack, or your sanity.&lt;/p&gt;

&lt;p&gt;Maybe you don't write as many tests as you should.&lt;br&gt;
Maybe you still SSH into production (👀).&lt;br&gt;
Maybe you use &lt;code&gt;latest&lt;/code&gt; tags on your Docker images and pray.&lt;/p&gt;

&lt;p&gt;No judgment here, just real talk from the trenches.&lt;/p&gt;

&lt;p&gt;What's your "ignored" DevOps best practice, and why do you skip it?&lt;/p&gt;

&lt;p&gt;Bonus points if you share how it's actually worked out for you.&lt;/p&gt;




&lt;p&gt;🛠️ Posted by the team behind &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>linux</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Complete DevOps Roadmap for 2025 🚀</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Sat, 26 Jul 2025 13:13:38 +0000</pubDate>
      <link>https://forem.com/devopsdaily/the-complete-devops-roadmap-for-2025-4n1h</link>
      <guid>https://forem.com/devopsdaily/the-complete-devops-roadmap-for-2025-4n1h</guid>
      <description>&lt;p&gt;The DevOps landscape continues to evolve rapidly, and 2025 presents incredible opportunities for aspiring engineers. Organizations are increasingly adopting DevOps practices to deliver software faster, more reliably, and at scale. The demand for skilled DevOps professionals has never been higher.&lt;/p&gt;

&lt;p&gt;Whether you're a developer looking to expand into operations, a system administrator aiming to modernize your skills, or a complete beginner drawn to this exciting field, this comprehensive roadmap will guide your journey to DevOps mastery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps in 2025? 🌟
&lt;/h2&gt;

&lt;p&gt;DevOps represents a fundamental shift in how software is built, deployed, and maintained. It's not just about tools; it's about culture, collaboration, and continuous improvement. Here's why it matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Faster Delivery&lt;/strong&gt;: Teams deploy multiple times per day instead of monthly releases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛡️ Better Reliability&lt;/strong&gt;: Automated testing and monitoring catch issues early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Improved Collaboration&lt;/strong&gt;: Breaks down silos between development and operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Enhanced Automation&lt;/strong&gt;: Reduces manual work and human error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📈 Career Growth&lt;/strong&gt;: High demand for skilled professionals across all industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But beyond the benefits, DevOps offers intellectually rewarding work where you solve complex problems and see immediate impact on product delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps in the Age of AI: Why Infrastructure Matters More Than Ever 🤖
&lt;/h2&gt;

&lt;p&gt;With AI transforming every industry, you might wonder: "Is DevOps still a smart career choice?" The answer is a resounding &lt;strong&gt;yes&lt;/strong&gt;, and here's why:&lt;/p&gt;

&lt;h3&gt;
  
  
  🏗️ AI Runs on Infrastructure
&lt;/h3&gt;

&lt;p&gt;Every AI application, from ChatGPT to autonomous vehicles, depends on robust, scalable infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🚀 Model Training&lt;/strong&gt;: Requires massive computational resources and distributed systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Real-time Inference&lt;/strong&gt;: Needs low-latency, highly available services
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Data Pipelines&lt;/strong&gt;: AI models need continuous data flow and processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Model Deployment&lt;/strong&gt;: Rolling out AI models safely requires sophisticated CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤝 AI Enhances DevOps (Doesn't Replace It)
&lt;/h3&gt;

&lt;p&gt;Rather than replacing DevOps engineers, AI is becoming a powerful tool in our toolkit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Intelligent Monitoring&lt;/strong&gt;: AI helps predict system failures before they happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛠️ Automated Remediation&lt;/strong&gt;: Smart systems can fix common issues automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📈 Resource Optimization&lt;/strong&gt;: AI optimizes cloud costs and performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔐 Security Enhancement&lt;/strong&gt;: AI-powered threat detection and response&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎯 The Human Element Remains Critical
&lt;/h3&gt;

&lt;p&gt;While AI can automate many tasks, DevOps engineers provide irreplaceable value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🧠 Strategic Thinking&lt;/strong&gt;: Designing architecture and making technology choices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Complex Problem Solving&lt;/strong&gt;: Debugging unique issues and system design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Cross-team Collaboration&lt;/strong&gt;: Bridging technical and business requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Compliance &amp;amp; Governance&lt;/strong&gt;: Ensuring systems meet regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌐 Growing Complexity Requires Expertise
&lt;/h3&gt;

&lt;p&gt;As AI adoption accelerates, infrastructure becomes more complex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔀 Multi-cloud Strategies&lt;/strong&gt;: Managing resources across different providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚓ Container Orchestration&lt;/strong&gt;: Running AI workloads at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔒 Security Challenges&lt;/strong&gt;: Protecting sensitive AI models and data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Observability Needs&lt;/strong&gt;: Understanding performance of AI-driven systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 9-Stage DevOps Learning Journey 🗺️
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stage 1: Master the Fundamentals 💻
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Foundation Skills Every DevOps Engineer Needs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before diving into advanced tools, you need rock-solid fundamentals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🐧 Linux/Unix Systems&lt;/strong&gt;: Command line proficiency is essential&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📜 Shell Scripting (Bash)&lt;/strong&gt;: Automate repetitive tasks efficiently
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔀 Version Control (Git)&lt;/strong&gt;: Collaborate effectively with development teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🐍 Basic Programming&lt;/strong&gt;: Python or Go for automation scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🌐 Networking Fundamentals&lt;/strong&gt;: Understand how services communicate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;💡 Pro Tip&lt;/strong&gt;: Don't rush this stage. These skills form the foundation for everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Personal development environment, system monitoring scripts, automated backup solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Infrastructure as Code 🏗️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Manage Infrastructure Through Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure as Code (IaC) transforms how we manage infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;⚙️ Terraform&lt;/strong&gt;: Industry standard for multi-cloud infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Ansible&lt;/strong&gt;: Configuration management and application deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;☁️ CloudFormation&lt;/strong&gt;: AWS-native infrastructure provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✅ Infrastructure Testing&lt;/strong&gt;: Validate changes before deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real Impact&lt;/strong&gt;: Companies achieve consistent, reproducible deployments while reducing manual configuration errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-environment infrastructure, automated web application stacks, infrastructure testing pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Containerization &amp;amp; Orchestration 📦
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Package and Orchestrate Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Container technology has revolutionized application deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🐳 Docker Fundamentals&lt;/strong&gt;: Package applications consistently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔗 Container Networking&lt;/strong&gt;: Understand service communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚓ Kubernetes&lt;/strong&gt;: Orchestrate containers at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Helm Charts&lt;/strong&gt;: Simplify Kubernetes application deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔒 Container Security&lt;/strong&gt;: Protect your containerized workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;: Containers solve the "it works on my machine" problem and enable consistent deployments across environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Microservices e-commerce platform, container CI/CD pipeline, production-ready Kubernetes cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 4: CI/CD Pipelines ⚡
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Automate Your Deployment Process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Continuous Integration and Deployment revolutionize software delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🚀 GitHub Actions&lt;/strong&gt;: Automate workflows directly in your repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Jenkins&lt;/strong&gt;: Build complex, enterprise-grade pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🦊 GitLab CI&lt;/strong&gt;: Integrated DevOps platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎯 ArgoCD&lt;/strong&gt;: GitOps-style deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧪 Testing Automation&lt;/strong&gt;: Integrate quality gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Game Changer&lt;/strong&gt;: Teams can deploy changes safely and frequently, with automatic rollback capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-stage CI/CD pipeline, GitOps deployment system, blue-green deployment strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 5: Cloud Platforms ☁️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Master Modern Cloud Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud expertise is essential in today's landscape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🌐 AWS Fundamentals&lt;/strong&gt;: Learn the most widely adopted cloud platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔷 Azure Services&lt;/strong&gt;: Microsoft's comprehensive cloud ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔵 Google Cloud Platform&lt;/strong&gt;: Strong in data and AI services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🌍 Multi-Cloud Strategy&lt;/strong&gt;: Many organizations use multiple providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💰 Cost Optimization&lt;/strong&gt;: Control and reduce cloud spending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Industry Reality&lt;/strong&gt;: Most organizations have moved to cloud-first strategies, making these skills essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-cloud architecture, serverless application suite, cost optimization dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 6: Monitoring &amp;amp; Observability 📊
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ensure System Reliability and Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability provides visibility into system behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;📈 Prometheus &amp;amp; Grafana&lt;/strong&gt;: Industry-standard metrics and visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 ELK Stack&lt;/strong&gt;: Centralized logging and analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Distributed Tracing&lt;/strong&gt;: Track requests across microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ APM Tools&lt;/strong&gt;: Application performance monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎯 SLO/SLI Design&lt;/strong&gt;: Define and measure service reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critical Importance&lt;/strong&gt;: You can't improve what you can't measure. Monitoring prevents small issues from becoming major outages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Complete observability stack, SLO monitoring dashboard, performance analysis tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 7: Security &amp;amp; Compliance 🛡️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integrate Security Throughout the Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security must be built-in, not bolted-on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔐 DevSecOps Practices&lt;/strong&gt;: Shift security left in the development process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛡️ Container Security&lt;/strong&gt;: Secure runtime and images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔑 Secrets Management&lt;/strong&gt;: Handle credentials safely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Compliance Automation&lt;/strong&gt;: Automate checks for SOC 2 and GDPR requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Security Scanning&lt;/strong&gt;: Integrate vulnerability detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Modern Approach&lt;/strong&gt;: Security teams collaborate with development from day one, rather than reviewing at the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Secure CI/CD pipeline, zero-trust network, compliance automation systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 8: Database Management 🗄️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Handle Data Persistence and Reliability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data management remains critical across all applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🗃️ SQL &amp;amp; NoSQL&lt;/strong&gt;: Master both relational and document databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🤖 Database Automation&lt;/strong&gt;: Automate deployments and migrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💾 Backup Strategies&lt;/strong&gt;: Ensure data recovery capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Performance Tuning&lt;/strong&gt;: Optimize database performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;☁️ Cloud Databases&lt;/strong&gt;: Leverage managed database services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Universal Need&lt;/strong&gt;: Every application needs data persistence, making these skills valuable across all projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Database migration pipeline, multi-database architecture, database monitoring system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 9: Continuous Learning 🎓
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Embrace Lifelong Growth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technology evolves rapidly, making continuous learning essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🌍 Open Source Contribution&lt;/strong&gt;: Build your reputation in the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✍️ Technical Writing&lt;/strong&gt;: Share knowledge and build authority&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Mentoring&lt;/strong&gt;: Guide others and develop leadership skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎤 Conference Participation&lt;/strong&gt;: Stay current with industry trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛠️ Side Projects&lt;/strong&gt;: Experiment with new technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term Success&lt;/strong&gt;: The most successful DevOps engineers are those who adapt and grow with the technology landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Open source contributions, technical blog series, mentorship programs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Learning Approach 🎯
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hands-On Projects Beat Theory 🔨
&lt;/h3&gt;

&lt;p&gt;Don't just read about tools; build with them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a complete development environment&lt;/li&gt;
&lt;li&gt;Create infrastructure across multiple cloud providers&lt;/li&gt;
&lt;li&gt;Build and deploy a real application end-to-end&lt;/li&gt;
&lt;li&gt;Implement comprehensive monitoring and alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Learn in Public 📢
&lt;/h3&gt;

&lt;p&gt;Document your journey and help others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write blog posts about your learnings and challenges&lt;/li&gt;
&lt;li&gt;Share code and configurations on GitHub&lt;/li&gt;
&lt;li&gt;Participate in DevOps communities and forums&lt;/li&gt;
&lt;li&gt;Help others troubleshoot problems you've solved&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Focus on Problem-Solving 🧩
&lt;/h3&gt;

&lt;p&gt;DevOps is about solving business problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand why tools exist, not just how to use them&lt;/li&gt;
&lt;li&gt;Practice troubleshooting and debugging systematically&lt;/li&gt;
&lt;li&gt;Learn to communicate with both technical and business stakeholders&lt;/li&gt;
&lt;li&gt;Think about reliability, scalability, and maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Embrace AI as a Tool 🤖
&lt;/h3&gt;

&lt;p&gt;Learn to work alongside AI rather than compete with it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI-powered tools to enhance your productivity&lt;/li&gt;
&lt;li&gt;Understand how to deploy and manage AI workloads&lt;/li&gt;
&lt;li&gt;Learn about MLOps practices and AI model lifecycle management&lt;/li&gt;
&lt;li&gt;Focus on the strategic and creative aspects that AI can't replace&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Industry Trends to Watch in 2025 🔮
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🏗️ Platform Engineering Rise
&lt;/h3&gt;

&lt;p&gt;Organizations are investing in internal developer platforms to improve developer experience and reduce cognitive load.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔄 GitOps Adoption
&lt;/h3&gt;

&lt;p&gt;Git-based deployment workflows are becoming the standard for managing infrastructure and applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤖 AI/ML Integration &amp;amp; Infrastructure Demands
&lt;/h3&gt;

&lt;p&gt;AI is transforming DevOps in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🧠 AI-Powered Tools&lt;/strong&gt;: Intelligent monitoring, predictive scaling, and automated incident response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🚀 MLOps Emergence&lt;/strong&gt;: New discipline combining ML and DevOps practices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ GPU Infrastructure&lt;/strong&gt;: Managing specialized hardware for AI workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 AI Model Pipelines&lt;/strong&gt;: Deploying and updating AI models safely at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌱 Sustainability Focus
&lt;/h3&gt;

&lt;p&gt;Green DevOps practices are becoming important as organizations focus on reducing their environmental impact, especially with energy-intensive AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔒 Security-First Mindset
&lt;/h3&gt;

&lt;p&gt;Security considerations are moving earlier in the development lifecycle, making DevSecOps skills increasingly valuable, particularly for protecting AI models and data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Next Steps 🚶‍♂️
&lt;/h2&gt;

&lt;p&gt;Starting your DevOps journey can feel overwhelming, but remember: every expert was once a beginner. Here's how to begin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;📚 Start with the Fundamentals&lt;/strong&gt;: Master Linux, Git, and basic programming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⏰ Practice Consistently&lt;/strong&gt;: Dedicate time each day to hands-on learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Join Communities&lt;/strong&gt;: Connect with other learners and experienced practitioners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔨 Build Projects&lt;/strong&gt;: Apply your knowledge to real-world scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Stay Curious&lt;/strong&gt;: Technology evolves rapidly, so embrace continuous learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📝 Document Everything&lt;/strong&gt;: Keep notes and share your learning journey&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Learning Resources 🎮
&lt;/h2&gt;

&lt;p&gt;While this roadmap provides the structure, hands-on practice is essential. For an interactive experience with curated resources, practice labs, and detailed guidance for each skill, check out the &lt;a href="https://devops-daily.com/roadmap" rel="noopener noreferrer"&gt;complete DevOps roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The interactive version includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 Curated learning resources for each skill&lt;/li&gt;
&lt;li&gt;💻 Hands-on project ideas with difficulty levels&lt;/li&gt;
&lt;li&gt;🎯 Skills assessment and progress tracking&lt;/li&gt;
&lt;li&gt;🔗 Direct links to tutorials, documentation, and practice platforms&lt;/li&gt;
&lt;li&gt;🏆 Achievement badges and learning milestones&lt;/li&gt;
&lt;li&gt;💡 Real-world examples and use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion 🎉
&lt;/h2&gt;

&lt;p&gt;The DevOps field offers tremendous opportunities for those willing to invest in learning and skill development. Even in an AI-driven world, infrastructure expertise becomes more valuable, not less. As AI applications proliferate, they all depend on the robust, scalable systems that DevOps engineers build and maintain.&lt;/p&gt;

&lt;p&gt;With the right roadmap and consistent effort, you can build a rewarding career that combines technical challenges with meaningful business impact. The rise of AI doesn't diminish the importance of DevOps; it amplifies it.&lt;/p&gt;

&lt;p&gt;Remember, the goal isn't to master everything at once. Focus on building a strong foundation, then gradually expand your expertise. The industry rewards competence, problem-solving ability, and continuous learning, all qualities that define successful DevOps engineers.&lt;/p&gt;

&lt;p&gt;The journey may seem long, but every step builds upon the previous one. Start where you are, use what you have, and do what you can. Your future self will thank you for starting today.&lt;/p&gt;

&lt;p&gt;Start your journey now. The DevOps community is welcoming and always ready to help newcomers succeed! 🌟&lt;/p&gt;

&lt;p&gt;What's your current position on this roadmap? Share your DevOps learning journey in the comments below! 💬&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The 10 Most Common DevOps Mistakes (And How to Avoid Them in 2025)</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Mon, 21 Jul 2025 11:00:00 +0000</pubDate>
      <link>https://forem.com/devopsdaily/the-10-most-common-devops-mistakes-and-how-to-avoid-them-in-2025-52gi</link>
      <guid>https://forem.com/devopsdaily/the-10-most-common-devops-mistakes-and-how-to-avoid-them-in-2025-52gi</guid>
      <description>&lt;p&gt;DevOps isn't just about shipping code faster; it's about doing it smarter, safer, and saner. But let's be real: even the best teams make mistakes. Some are harmless. Others take down production on a Friday afternoon (yes, &lt;em&gt;that&lt;/em&gt; Friday deploy).&lt;/p&gt;

&lt;p&gt;Here are 10 common DevOps mistakes in 2025, how to avoid them, and a few moments that might hit a little too close to home.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Treating Infrastructure as Code Like a One-Off Script
&lt;/h2&gt;

&lt;p&gt;You wrote Terraform once, it worked, and now it lives untouched in a dusty repo folder. That's not IaC, that's tech debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control your IaC.&lt;/li&gt;
&lt;li&gt;Apply formatting and linting.&lt;/li&gt;
&lt;li&gt;Preview changes with &lt;code&gt;terraform plan&lt;/code&gt; and test them with tools like &lt;code&gt;terratest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav950b4q5xcks0dkuu72.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav950b4q5xcks0dkuu72.gif" alt="Please don't do this" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Not Enforcing Version Control on CI/CD Configs
&lt;/h2&gt;

&lt;p&gt;Your pipeline files are changing, but without versioning, there's no easy way to debug regressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store all CI/CD config files (like GitHub Actions, GitLab CI, etc.) in version control.&lt;/li&gt;
&lt;li&gt;Treat pipeline logic like any other critical code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g181s0hwsy4rdow1ihb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g181s0hwsy4rdow1ihb.gif" alt="Where did that config go?" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Poor Secrets Management
&lt;/h2&gt;

&lt;p&gt;Hardcoding secrets in code or using unencrypted &lt;code&gt;.env&lt;/code&gt; files is a fast way to land on Hacker News for the wrong reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Vault, Doppler, AWS Secrets Manager, or SOPS.&lt;/li&gt;
&lt;li&gt;Rotate secrets regularly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j89typdjgcgu0chtfhm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j89typdjgcgu0chtfhm.gif" alt="It's fine" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. No Rollback Strategy
&lt;/h2&gt;

&lt;p&gt;You deploy. Something breaks. And there's no plan B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use blue-green or canary deployments.&lt;/li&gt;
&lt;li&gt;Automate rollbacks on failure.&lt;/li&gt;
&lt;li&gt;Always have a &lt;code&gt;rollback.sh&lt;/code&gt; or previous image ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka8tilxaily9knheymoh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka8tilxaily9knheymoh.gif" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Ignoring Observability Until It's Too Late
&lt;/h2&gt;

&lt;p&gt;Monitoring isn't just about uptime. You can't fix what you can't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add metrics, logs, and traces from day one.&lt;/li&gt;
&lt;li&gt;Use tools like Prometheus, Grafana, and OpenTelemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxey2qv78t2o1vyc6lf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxey2qv78t2o1vyc6lf.gif" width="498" height="318"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Too Many Tools, Not Enough Integration
&lt;/h2&gt;

&lt;p&gt;Your stack has 25 tools. None of them talk to each other. And your alert fatigue is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidate tools where possible.&lt;/li&gt;
&lt;li&gt;Favor tools that integrate well with your existing stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rk344iw8r99olhtb3x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rk344iw8r99olhtb3x.gif" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Manual Approval for Every Tiny Change
&lt;/h2&gt;

&lt;p&gt;A typo fix shouldn't need a 3-person review and a Slack war.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up clear policies: auto-approve safe changes, gate critical ones.&lt;/li&gt;
&lt;li&gt;Use GitHub environments, OPA, or custom bots to help.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwt9dt67t62dc3sdgwt.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwt9dt67t62dc3sdgwt.gif" alt="The sloth from Zootopia slowly stamping papers" width="498" height="498"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. No Documentation = Single Point of Failure
&lt;/h2&gt;

&lt;p&gt;"Ask Alex, they built it." Alex is on vacation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write docs as you go.&lt;/li&gt;
&lt;li&gt;Use tools like Backstage, Docusaurus, or just plain Markdown.&lt;/li&gt;
&lt;li&gt;Encourage a culture of async knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63q52iz1e7ep44s7ru4y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63q52iz1e7ep44s7ru4y.gif" width="426" height="212"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Skipping Tests for Infrastructure Changes
&lt;/h2&gt;

&lt;p&gt;You test app code, but deploy infra changes directly to prod? Bold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use staging or preview environments.&lt;/li&gt;
&lt;li&gt;Test IaC with &lt;code&gt;checkov&lt;/code&gt;, &lt;code&gt;terratest&lt;/code&gt;, or &lt;code&gt;kitchen&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t1epuvom2iwo2d9n1cq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t1epuvom2iwo2d9n1cq.gif" width="373" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Forgetting Security in Your Pipelines
&lt;/h2&gt;

&lt;p&gt;If your pipeline can deploy to prod, attackers might be able to as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use least privilege for pipeline credentials.&lt;/li&gt;
&lt;li&gt;Run security checks like &lt;code&gt;trivy&lt;/code&gt;, &lt;code&gt;semgrep&lt;/code&gt;, and &lt;code&gt;snyk&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jcp8vru23tsw8bu70ol.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jcp8vru23tsw8bu70ol.jpg" width="506" height="500"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;DevOps is a journey. These mistakes are all lessons learned the hard way by teams around the world, and probably by you too, if you've been around long enough.&lt;/p&gt;

&lt;p&gt;Want to avoid these mistakes before they cost you time, sleep, or your weekend? We're building checklists, guides, and battle-tested content at &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt;. Come hang out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PS&lt;/strong&gt;: Got a DevOps horror story or lesson to share? Drop it in the comments or tag us on Twitter.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>security</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What's Your Go-To Stack for Personal Projects in 2025?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 18 Jul 2025 17:07:49 +0000</pubDate>
      <link>https://forem.com/devopsdaily/whats-your-go-to-stack-for-personal-projects-in-2025-3pg2</link>
      <guid>https://forem.com/devopsdaily/whats-your-go-to-stack-for-personal-projects-in-2025-3pg2</guid>
      <description>&lt;p&gt;When you're building a side project in 2025, what's your default stack these days?&lt;/p&gt;

&lt;p&gt;Are you still loving the reliability of Laravel or Ruby on Rails, or have you fully embraced Next.js, Bun, or something even more bleeding edge? Maybe you're mixing in tools like Supabase, Neon, or HTMX?&lt;/p&gt;

&lt;p&gt;Curious to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your go-to stack for quick MVPs or weekend builds?&lt;/li&gt;
&lt;li&gt;Do you keep it simple or try to mirror production setups?&lt;/li&gt;
&lt;li&gt;What are you hosting it on?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Been thinking about this a lot while working on something for &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt; and it made me wonder what others are using this year.&lt;/p&gt;

&lt;p&gt;Drop your stack below. Someone might discover their next favorite combo from your setup!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
      <category>devops</category>
    </item>
    <item>
      <title>A Day in the Life of a DevOps Engineer</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 11 Jul 2025 13:00:00 +0000</pubDate>
      <link>https://forem.com/devopsdaily/a-day-in-the-life-of-a-devops-engineer-58ba</link>
      <guid>https://forem.com/devopsdaily/a-day-in-the-life-of-a-devops-engineer-58ba</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;This post follows a DevOps engineer through a typical workday. You'll see how they handle morning deployments, infrastructure scaling, security alerts, and emergency hotfixes. The story covers real scenarios with tools like Kubernetes, Docker, Jenkins, and monitoring systems while showing how DevOps work directly impacts business operations. If you're curious about what DevOps engineers actually do day-to-day, this realistic walkthrough will give you insights into the challenges, responsibilities, and satisfying moments of the role.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Day at a Glance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;05:47 AM ⚠️  PagerDuty Alert - API Response Time Critical
07:30 AM 🔧  Emergency Hotfix Deployment
11:30 AM 🔒  Security Incident Response
02:00 PM 📊  Performance Review &amp;amp; Feature Flag Deployment
06:00 PM 🔄  Kubernetes Cluster Maintenance
10:30 PM 🚨  Database Performance Emergency
12:00 AM 💤  Crisis Resolved, Systems Stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;The phone buzzes at 5:47 AM. Not the alarm - that's set for 6:00 AM. It's PagerDuty. The production API response time has crossed the 2-second threshold, and customers are starting to complain on social media.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sound familiar?&lt;/strong&gt; Welcome to Monday morning in the life of a DevOps engineer.&lt;/p&gt;

&lt;p&gt;Rolling out of bed, laptop in hand, connecting to the VPN before the coffee even starts brewing. The monitoring dashboard shows a clear pattern: response times started climbing around 5:30 AM, right when the European market opened. The weekend's supposedly "minor" feature deployment is now causing 40% of API calls to time out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Incident Severity Assessment:
┌─────────────────────────────────────────────────────────────┐
│ 🔴 CRITICAL: 40% API timeout rate                           │
│ 📱 Social media complaints increasing                       │
│ 🌍 European market affected (peak hours)                    │
│ ⏰ US market opens in 3 hours                               │
│ 💰 Revenue impact: ~$2,000/minute                           │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This is why DevOps engineers sleep with their phones next to the bed.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Morning Fire Fighting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🔥 Crisis Mode Activated&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first instinct is to check the application logs. The ELK stack reveals the story immediately. The new payment processing feature is making synchronous calls to a third-party service, and those calls are taking 8-12 seconds to complete. When European users woke up and started making purchases, the connection pool got exhausted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Payment Flow Issue:
User Request → API Gateway → Payment Service → Third-Party Provider
     ↓              ↓              ↓              ↓
   Fast         Fast         SLOW (8-12s)     TIMEOUT

Connection Pool: [████████████████████] 200/200 (FULL!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A quick check shows 200 active connections - they've hit the maximum pool size. This needs an immediate fix while working on the root cause. The temporary solution is to scale up the payment service pods from 3 to 6, buying time to implement a proper fix.&lt;/p&gt;

&lt;p&gt;Watching the metrics after applying the scaling change, response times start dropping within two minutes. The immediate crisis is over, but this is just a band-aid. The real fix needs to happen in the application code, requiring coordination with the development team.&lt;/p&gt;
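
&lt;p&gt;To see why slow calls exhaust a fixed pool so quickly, Little's law helps: the number of requests in flight equals the arrival rate multiplied by the time each request takes. The Python sketch below is only an illustration. It assumes each in-flight payment call holds one connection for its full duration, and &lt;code&gt;max_throughput&lt;/code&gt; is a hypothetical helper, not part of the team's tooling.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def max_throughput(pool_size, call_seconds):
    """Little's law: in-flight = rate x latency, so rate = in-flight / latency."""
    return pool_size / call_seconds


POOL_SIZE = 200  # from the incident: 200/200 connections in use

# The third-party calls were taking 8-12 seconds each.
for latency in (8, 12):
    rate = max_throughput(POOL_SIZE, latency)
    print(f"{latency}s calls: {rate:.1f} requests/second before the pool is full")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumptions, a 200-connection pool sustains only about 17 to 25 payment requests per second. Adding pods raises that ceiling if each pod brings its own connections, but the slow synchronous call remains the real bottleneck.&lt;/p&gt;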

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Sometimes the best solution is the fastest solution. Scaling infrastructure horizontally bought time to implement a proper fix without losing customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💡 Pro Tip&lt;/strong&gt;: Always have a rollback plan ready. In this case, the scaling approach was reversible if it didn't work, keeping options open during the crisis.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Deployment Coordination
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;📞 Emergency War Room&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By 7:30 AM, the first video call of the day begins with the lead developer and product manager. They're discussing the hotfix strategy while pulling up the deployment pipeline in Jenkins.&lt;/p&gt;

&lt;p&gt;"The payment timeout issue affects roughly 30% of our European customers," the product manager explains, checking analytics. "We need this fixed before the US market opens, or we're looking at significant revenue loss."&lt;/p&gt;

&lt;p&gt;The developer has already pushed a fix to the staging branch - making the third-party payment calls asynchronous with proper error handling. The DevOps engineer's job is to get this through the pipeline safely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hotfix Deployment Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. Code Review    ✅ (expedited, focused review)            │
│ 2. Build &amp;amp; Test   ✅ (automated, 5 minutes)                 │
│ 3. Staging Deploy ✅ (integration tests passing)            │
│ 4. Smoke Tests    ✅ (payments working correctly)           │
│ 5. Production     🟡 (waiting for approval)                 │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The staging deployment goes smoothly. Integration tests pass, and end-to-end tests confirm that payments are now processing correctly with the new asynchronous flow. The green light for production deployment comes at 8:45 AM.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Production hotfixes require extra caution. Even with time pressure, proper testing in staging prevented a second incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎯 Reality Check&lt;/strong&gt;: In emergency situations, communication becomes even more critical. Clear status updates kept all stakeholders informed and aligned.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Infrastructure Scaling Challenges
&lt;/h2&gt;

&lt;p&gt;With the payment crisis resolved, attention turns to a brewing infrastructure problem. The marketing team is launching a major campaign next week, expecting a 3x increase in traffic. The current Kubernetes cluster can barely handle normal peak loads.&lt;/p&gt;

&lt;p&gt;Reviewing the current infrastructure setup in Terraform reveals t3.medium instances, which are cost-effective for normal operations but won't handle the expected load surge. The scaling strategy has to absorb the traffic spike without breaking the budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current Infrastructure:
┌─────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                          │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐             │
│ │ t3.medium   │ │ t3.medium   │ │ t3.medium   │             │
│ │ Node 1      │ │ Node 2      │ │ Node 3      │             │
│ └─────────────┘ └─────────────┘ └─────────────┘             │
│                                                             │
│ Campaign Week (3x traffic) = 💥 OVERLOAD                    │
└─────────────────────────────────────────────────────────────┘

Solution: Pre-provisioned c5.xlarge nodes (scaled to 0 until needed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The plan involves creating a new node group with c5.xlarge instances, pre-created but kept at zero capacity until the campaign starts. This way, they can scale up quickly when needed and scale down immediately after to control costs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Planning for predictable traffic spikes is cheaper than dealing with unexpected outages. Pre-provisioning resources that can be quickly activated saves both money and stress.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Security Alert Response
&lt;/h2&gt;

&lt;p&gt;At 11:30 AM, the security monitoring tool flags something suspicious. The intrusion detection system shows unusual network traffic patterns from one of the application servers. Security incidents can escalate quickly, so immediate attention is required.&lt;/p&gt;

&lt;p&gt;Initial investigation shows someone is trying to access the MySQL database directly from an external IP. A quick check of security groups and firewall rules shows they look correct - database access should only be allowed from application servers within the VPC. But the logs show connection attempts from a completely different IP range.&lt;/p&gt;

&lt;p&gt;Digging deeper into the application logs reveals the issue. A developer accidentally committed database credentials to a public GitHub repository three days ago. The credentials were scraped by automated tools and are now being used for unauthorized access attempts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security Incident Timeline:
Day 1: Dev commits credentials → GitHub (public repo)
Day 2: Automated scrapers find credentials
Day 3: Credentials posted on dark web forums
Day 4: Unauthorized access attempts begin ← WE ARE HERE

Threat Actor → Internet → Firewall → Database (attempting access)
                             ↓
                       ⚠️  BLOCKED (but trying)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first step is clear: rotate the database credentials and update the Kubernetes secret. That contains the incident, but a longer-term solution is still needed: automated secret scanning in the CI/CD pipeline and security training for the development team.&lt;/p&gt;
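
&lt;p&gt;As a rough idea of what automated secret scanning involves, here is a minimal Python sketch. The patterns and the &lt;code&gt;find_secrets&lt;/code&gt; helper are hypothetical and far too naive for production use; a real pipeline would rely on a dedicated scanner with a maintained rule set.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Hypothetical patterns for illustration only; real scanners ship far larger rule sets.
SECRET_PATTERNS = [
    re.compile(r"(?i)(password|passwd|secret|api_key)\s*[:=]\s*\S+"),
    re.compile(r"mysql://[^\s:/]+:[^\s@]+@"),  # credentials embedded in a URL
]


def find_secrets(text):
    """Return (line_number, line) pairs that match any pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append((number, line.strip()))
    return hits


sample = "DB_HOST=db.internal\nDB_PASSWORD=hunter2\n"
for number, line in find_secrets(sample):
    print(f"line {number}: possible secret: {line}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a CI job, a check like this would run against each change and fail the build when it reports any hits, so the credentials never reach a public repository.&lt;/p&gt;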

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Security incidents are rarely just technical problems. They're usually process problems that require both immediate fixes and long-term prevention strategies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Monitoring and Alerting Improvements
&lt;/h2&gt;

&lt;p&gt;After lunch, focus shifts to improving the monitoring setup. The morning's payment issue could have been caught earlier with better alerting. A review of the metrics Prometheus currently collects shows that it only covers basics like CPU and memory usage.&lt;/p&gt;

&lt;p&gt;The priority becomes working with the developer to add business-specific metrics that would have caught the payment timeout issue earlier. Custom metrics for payment processing duration, active connections, and success/failure rates are implemented.&lt;/p&gt;

&lt;p&gt;With these metrics in place, new alerting rules are created that would have triggered within minutes of the morning's incident, giving time to respond before customers were affected.&lt;/p&gt;
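
&lt;p&gt;The logic behind such an alerting rule can be sketched in a few lines of Python. This is not Prometheus rule syntax, only an illustration of the condition being evaluated. The thresholds (a 2-second 95th percentile and a 5% failure rate) and the &lt;code&gt;should_alert&lt;/code&gt; helper are assumptions made for the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from statistics import quantiles


def p95(values):
    """95th percentile via 20-quantiles (needs at least two samples)."""
    return quantiles(values, n=20)[18]


def should_alert(durations_s, failed, total, max_p95_s=2.0, max_failure_rate=0.05):
    """Fire when payments are too slow or fail too often in the sampled window."""
    slow = p95(durations_s) > max_p95_s
    failing = total > 0 and failed / total > max_failure_rate
    return slow or failing


# Hypothetical window: most payments are fast, a few take 9 seconds.
durations = [0.3] * 95 + [9.0] * 5
print(should_alert(durations, failed=2, total=100))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Alerting on a percentile rather than an average matters here: a handful of 8-12 second calls barely moves the mean, yet they are exactly the calls that fill the connection pool.&lt;/p&gt;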

&lt;h2&gt;
  
  
  Afternoon Deployment Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🚀 Major Feature Release&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The afternoon brings a scheduled deployment of the new user dashboard feature. This is a major feature that's been in development for six weeks, and the product team is eager to get it in front of users.&lt;/p&gt;

&lt;p&gt;The staging environment looks good, but something concerning appears in the performance tests. The new dashboard is making 47 database queries per page load. With the expected traffic increase from the marketing campaign, this could cause serious performance problems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database Query Analysis:
┌─────────────────────────────────────────────────────────────┐
│ Current Dashboard: 3 queries per page                       │
│ New Dashboard: 47 queries per page                          │
│                                                             │
│ Expected Traffic: 10,000 concurrent users                   │
│ Query Load: 470,000 queries/second                          │
│ Database Capacity: 50,000 queries/second                    │
│                                                             │
│ Result: 💥 DATABASE MELTDOWN                                │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
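
&lt;p&gt;The arithmetic behind the analysis above can be sketched in a few lines. The figures only work out if every concurrent user loads one page per second, so that rate is written here as an explicit parameter. That parameter and the function name are assumptions for illustration, not part of the original analysis.&lt;/p&gt;

```python
def query_load(
    concurrent_users: int,
    queries_per_page: int,
    page_loads_per_user_per_second: float = 1.0,  # assumption implied by the figures
) -> float:
    """Estimated database queries per second for one dashboard version."""
    return concurrent_users * page_loads_per_user_per_second * queries_per_page


DATABASE_CAPACITY_QPS = 50_000  # from the analysis above

current_load = query_load(10_000, 3)
new_load = query_load(10_000, 47)

print(f"Current dashboard: {current_load:,.0f} queries/second")  # 30,000
print(f"New dashboard:     {new_load:,.0f} queries/second")      # 470,000
print(f"Multiple of capacity: {new_load / DATABASE_CAPACITY_QPS:.1f}x")  # 9.4x
```

&lt;p&gt;If users load pages less often than once per second, the absolute numbers shrink, but the new dashboard still issues about 15 times as many queries per page as the current one.&lt;/p&gt;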



&lt;p&gt;An emergency meeting with the development team follows. The conversation is tense - the marketing campaign is already scheduled, and delaying the dashboard feature would mean missing the promotional opportunity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Dilemma:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Ship on time → Happy marketing team, potential system failure&lt;/li&gt;
&lt;li&gt;❌ Delay feature → Disappointed stakeholders, stable system&lt;/li&gt;
&lt;li&gt;🤔 Find middle ground → ???&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The message to the team, backed by the performance metrics, is direct: "We can't deploy this as-is. Each page load is hitting the database 47 times. With 10,000 concurrent users, that's 470,000 database queries per second. Our database will fall over."&lt;/p&gt;

&lt;p&gt;The lead developer looks at the query analysis. "Most of these are N+1 queries. We can fix the worst ones with some eager loading, but it'll take at least two days to properly optimize."&lt;/p&gt;
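
&lt;p&gt;For readers who haven't met the N+1 pattern, here is a minimal sketch of the difference between per-row queries and eager loading. It uses Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module with made-up &lt;code&gt;users&lt;/code&gt; and &lt;code&gt;widgets&lt;/code&gt; tables. The real dashboard's schema and ORM are not shown in this story, so everything named below is hypothetical.&lt;/p&gt;

```python
import sqlite3


class CountingDB:
    """Thin wrapper that counts how many queries are issued."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.queries = 0
        self.conn.executescript(
            """
            CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE widgets (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
            INSERT INTO users VALUES (1, 'ana'), (2, 'ben'), (3, 'cho');
            INSERT INTO widgets VALUES
                (1, 1, 'sales'), (2, 1, 'traffic'), (3, 2, 'errors'), (4, 3, 'latency');
            """
        )

    def run(self, sql: str, params: tuple = ()) -> list[tuple]:
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def load_n_plus_one(db: CountingDB) -> dict[str, list[str]]:
    """One query for the users, then one more query per user."""
    result = {}
    for user_id, name in db.run("SELECT id, name FROM users ORDER BY id"):
        rows = db.run(
            "SELECT title FROM widgets WHERE user_id = ? ORDER BY id", (user_id,)
        )
        result[name] = [title for (title,) in rows]
    return result


def load_eager(db: CountingDB) -> dict[str, list[str]]:
    """A single JOIN fetches users and their widgets together."""
    result: dict[str, list[str]] = {}
    rows = db.run(
        "SELECT u.name, w.title FROM users u "
        "JOIN widgets w ON w.user_id = u.id ORDER BY u.id, w.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


slow_db, fast_db = CountingDB(), CountingDB()
slow = load_n_plus_one(slow_db)
fast = load_eager(fast_db)

print(slow_db.queries, fast_db.queries)  # 4 1
print(slow == fast)                      # True
```

&lt;p&gt;With three users the gap is 4 queries versus 1. The per-row version grows with every extra row, which is how a single page can end up issuing dozens of queries.&lt;/p&gt;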

&lt;p&gt;A compromise is proposed: deploy the feature with a feature flag, initially enabled for only 10% of users. This gives real-world performance data while limiting the impact on the infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature Flag Strategy:
┌─────────────────────────────────────────────────────────────┐
│ Incoming Users: 10,000 concurrent                           │
│                                                             │
│ 90% → Old Dashboard (stable, fast)                          │
│ 10% → New Dashboard (testing, monitored)                    │
│                                                             │
│ Database Load: Manageable vs. Catastrophic                  │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
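
&lt;p&gt;One common way to implement a percentage rollout like this is to hash each user ID into a stable bucket and enable the flag for the lowest buckets. The sketch below assumes that approach. The flag name, the user IDs, and both function names are hypothetical, and the story doesn't say which feature flag system the team actually used.&lt;/p&gt;

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """True when the user's bucket falls inside the rollout percentage."""
    return rollout_percent > rollout_bucket(flag_name, user_id)


users = [f"user-{n}" for n in range(10_000)]
on_new_dashboard = [u for u in users if is_enabled("new-dashboard", u, 10)]
print(f"{len(on_new_dashboard) / len(users):.1%} of users see the new dashboard")
```

&lt;p&gt;Because the bucket depends only on the flag name and the user ID, a given user sees the same dashboard on every request, and raising the percentage later only adds users without reshuffling the ones already enrolled.&lt;/p&gt;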



&lt;p&gt;The deployment goes ahead with the feature flag in place. Database performance is monitored closely as the feature rolls out to the limited user group. The impact is manageable at 10% traffic, but the metrics confirm concerns about a full rollout.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Feature flags aren't just for A/B testing. They're a powerful risk management tool that lets you test production performance without betting the entire infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔄 DevOps Wisdom&lt;/strong&gt;: The best compromise is often a gradual rollout. It satisfies business needs while protecting system stability.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Evening Infrastructure Maintenance
&lt;/h2&gt;

&lt;p&gt;As the day winds down, planned maintenance tasks need attention. The Kubernetes cluster needs a version upgrade, and several security patches need to be applied to the worker nodes.&lt;/p&gt;

&lt;p&gt;The upgrade process requires careful coordination to avoid downtime. Each node is drained in turn, system updates are applied, the kubelet is restarted on the new version, and the node is uncordoned and returned to service.&lt;/p&gt;
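
&lt;p&gt;The per-node sequence can be sketched as an ordered list of commands. The script below only builds and prints the commands; it doesn't run anything. The node names, the SSH access, and the update script path are assumptions, since the step that applies updates depends on the distribution and on how the cluster was installed.&lt;/p&gt;

```python
import shlex


def node_upgrade_plan(node: str, update_command: list[str]) -> list[list[str]]:
    """Ordered commands for upgrading one node; nothing is executed here.

    update_command is whatever applies system and kubelet updates on the
    node. It is distribution-specific, so the caller supplies it.
    """
    return [
        # Evict workloads; draining also cordons the node.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Apply updates on the node itself (assumes SSH access).
        ["ssh", node, *update_command],
        ["ssh", node, "sudo", "systemctl", "restart", "kubelet"],
        # Let the scheduler place pods on the node again.
        ["kubectl", "uncordon", node],
    ]


# Hypothetical node names and a placeholder update script.
for node in ["worker-1", "worker-2"]:
    for argv in node_upgrade_plan(node, ["sudo", "/usr/local/bin/apply-updates.sh"]):
        print(shlex.join(argv))
```

&lt;p&gt;Finishing one node before starting the next keeps most of the cluster's capacity available throughout the upgrade. Note that &lt;code&gt;--delete-emptydir-data&lt;/code&gt; discards pod-local scratch data, so it only suits workloads that can tolerate that.&lt;/p&gt;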

&lt;p&gt;The upgrade process takes about 90 minutes, but it goes smoothly. Application metrics are monitored throughout the process - response times stay normal, and no alerts fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  Late Night Emergency
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🌙 10:30 PM - Not Again...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At 10:30 PM, just as the day seems to be over, the phone buzzes again. This time it's a critical alert: the main application database is reporting high CPU usage and slow query performance. The European overnight batch processing jobs are running much longer than usual.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Every DevOps engineer knows this feeling - the dreaded "just one more alert" before bed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Connecting to the database server immediately reveals the problem. One of the batch jobs is running a query that's been executing for 3 hours. The query is scanning a table with 50 million rows without using an index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database Performance Crisis:
┌─────────────────────────────────────────────────────────────┐
│ Query: SELECT * FROM user_activities WHERE...               │
│ Status: Running for 3 hours ⏱️                              │
│ Rows Scanned: 50,000,000 (NO INDEX!)                        │
│ CPU Usage: ████████████████████████████████████ 95%         │
│ Other Queries: ⏳ WAITING... WAITING... WAITING...          │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The batch job developer probably tested with a small dataset and didn't realize the performance implications. A tough decision emerges: kill the long-running query to restore database performance, meaning the batch job will need to restart from the beginning, or let it finish but risk affecting the morning's application performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Midnight Decision Matrix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Option 1: Kill Query + Create Index
├─ Pros: Immediate relief, proper fix
├─ Cons: Batch job restarts (3 hours lost)
└─ Risk: Low

Option 2: Let Query Finish
├─ Pros: Batch job completes
├─ Cons: Database stays slow
└─ Risk: High (morning traffic impact)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The choice is made to kill the query and create the missing database index. The index creation takes 45 minutes on the large table, but once it's complete, the batch job can restart and finish in just 20 minutes instead of hours.&lt;/p&gt;
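
&lt;p&gt;To show why the index makes such a difference, here is a small sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module in place of the production database. The original query's &lt;code&gt;WHERE&lt;/code&gt; clause isn't shown, so the &lt;code&gt;user_id&lt;/code&gt; filter, the table columns, and the index name are all hypothetical.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_activities (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT)"
)
conn.executemany(
    "INSERT INTO user_activities (user_id, action) VALUES (?, ?)",
    [(n % 500, "login") for n in range(5_000)],
)

# Hypothetical filter: the original WHERE clause is not shown.
QUERY = "SELECT * FROM user_activities WHERE user_id = ?"


def plan(sql: str) -> str:
    """Return SQLite's plan description for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return "; ".join(row[-1] for row in rows)


before = plan(QUERY)
conn.execute("CREATE INDEX idx_user_activities_user_id ON user_activities (user_id)")
after = plan(QUERY)

print("before:", before)  # a full table scan
print("after: ", after)   # a search using the new index
```

&lt;p&gt;Other databases expose the same information through their own &lt;code&gt;EXPLAIN&lt;/code&gt; output. Checking the plan against a production-sized table before shipping a batch job is the cheap way to catch a missing index.&lt;/p&gt;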

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Insight&lt;/strong&gt;: Sometimes you have to make tough decisions with incomplete information. The ability to quickly assess risk and choose the least harmful option is crucial in DevOps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚡ Late Night Wisdom&lt;/strong&gt;: The best decisions aren't always the easiest ones. Protecting tomorrow's users was worth the short-term pain of restarting the batch job.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Reflection and Planning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🌅 Midnight - Systems Stable, Lessons Learned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By midnight, the laptop finally closes. The day started with a production crisis, included a security incident, featured a challenging deployment decision, and ended with a database performance emergency. Each situation required different skills: quick problem-solving, technical analysis, team coordination, and risk assessment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Daily Impact Summary:
┌─────────────────────────────────────────────────────────────┐
│ 🔧 Issues Resolved: 4 critical, 2 medium priority           │
│ 👥 Customers Affected: Minimal (thanks to quick response)   │
│ 💰 Revenue Protected: ~$50,000 (prevented outages)          │
│ 🛠️ Systems Improved: 3 (monitoring, security, indexing)     │
│ 📈 Infrastructure: Scaled and optimized                     │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tomorrow will bring new challenges. The marketing campaign is getting closer, and the infrastructure scaling plan needs to be finalized. The dashboard feature needs performance optimization before it can be fully rolled out. The development team needs security training to prevent credential leaks. The enhanced monitoring metrics still need to be deployed.&lt;/p&gt;

&lt;p&gt;But tonight, millions of users were able to make purchases, view their dashboards, and access the application without interruption. The infrastructure held up under pressure, the team collaborated effectively during crises, and the systems are more robust than they were this morning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tomorrow's Action Items:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Finalize campaign infrastructure scaling&lt;/li&gt;
&lt;li&gt;✅ Optimize dashboard database queries&lt;/li&gt;
&lt;li&gt;✅ Implement automated secret scanning&lt;/li&gt;
&lt;li&gt;✅ Deploy enhanced monitoring metrics&lt;/li&gt;
&lt;li&gt;✅ Schedule security training session&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;This is the reality of DevOps work - part firefighting, part planning, part collaboration, and part continuous improvement. It's demanding and sometimes stressful, but it's also rewarding to know that your work directly enables the business to serve its customers.&lt;/p&gt;

&lt;p&gt;The phone is on silent for the next six hours, but somewhere, monitoring systems are keeping watch, automated processes are handling routine tasks, and the infrastructure is quietly supporting thousands of users around the world. That's the real success of DevOps - building systems that work reliably, even when you're not watching.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Human Side of DevOps
&lt;/h2&gt;

&lt;p&gt;Being a DevOps engineer means being part detective, part architect, part diplomat, and part firefighter. Every day brings new challenges, but also new opportunities to make systems better, faster, and more reliable. The work never ends, but neither does the satisfaction of building technology that makes a real difference in people's lives.&lt;/p&gt;

&lt;p&gt;The morning payment issue wasn't just about fixing code - it was about understanding the business impact of technical decisions. When European customers couldn't complete their purchases, it affected real people trying to buy gifts, pay bills, or run their businesses. The quick response prevented thousands of failed transactions and potential customer churn.&lt;/p&gt;

&lt;p&gt;The security incident required more than just technical fixes. It highlighted the need for better developer education and process improvements. The conversation with the development team wasn't about blame - it was about learning and preventing similar issues in the future.&lt;/p&gt;

&lt;p&gt;The deployment decision for the dashboard feature showcased the constant balance between business needs and technical constraints. The marketing campaign couldn't be delayed, but releasing a feature that would crash the database wasn't an option. The feature flag solution satisfied both requirements while providing valuable data for future improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Broader Impact
&lt;/h2&gt;

&lt;p&gt;DevOps work extends far beyond keeping servers running. It's about enabling the entire organization to move faster and more reliably. The monitoring improvements implemented today will prevent future incidents. The infrastructure scaling plan will support business growth. The security training will protect customer data.&lt;/p&gt;

&lt;p&gt;Each technical decision has ripple effects throughout the organization. The choice to scale up the payment service immediately instead of waiting for a code fix meant that the customer service team didn't get flooded with complaint calls. The decision to implement feature flags for the dashboard deployment gave the product team valuable usage data while protecting system stability.&lt;/p&gt;

&lt;p&gt;The database performance fix at midnight wasn't just about query optimization - it was about ensuring that the morning's business reports would be ready on time, that the analytics team could access their data, and that the automated systems could process customer orders without delay.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skills Beyond Technology
&lt;/h2&gt;

&lt;p&gt;While technical skills are essential, DevOps engineering requires much more. Communication skills are crucial for coordinating with development teams, explaining technical issues to business stakeholders, and writing clear documentation for on-call procedures.&lt;/p&gt;

&lt;p&gt;Problem-solving skills go beyond debugging code. They involve understanding complex systems, identifying root causes of issues, and designing solutions that prevent future problems. The ability to work under pressure while maintaining clear thinking is essential when production systems are down and customers are affected.&lt;/p&gt;

&lt;p&gt;Risk assessment becomes second nature - every change, every deployment, every infrastructure modification needs to be evaluated for potential impact. The ability to make quick decisions with incomplete information is valuable when incidents are unfolding and time is critical.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Satisfaction of Reliability
&lt;/h2&gt;

&lt;p&gt;The most rewarding aspect of DevOps work isn't the dramatic incident responses or the complex technical solutions. It's the quiet satisfaction of building systems that work consistently, day after day, serving users around the world without interruption.&lt;/p&gt;

&lt;p&gt;When a deployment goes smoothly, when monitoring catches an issue before it affects users, when an infrastructure upgrade happens without downtime - these moments of seamless operation represent the true success of DevOps practices.&lt;/p&gt;

&lt;p&gt;The tools and technologies will continue to evolve, but the core mission remains the same: bridge the gap between development and operations, automate repetitive tasks, monitor everything that matters, and respond quickly when things go wrong. It's challenging work, but for those who enjoy solving complex problems and working with cutting-edge technology, there's nothing quite like it.&lt;/p&gt;

&lt;p&gt;The best DevOps engineers are those who can see the bigger picture - understanding how their technical decisions impact users, businesses, and teams. They're the ones who can remain calm during crises, think strategically about infrastructure improvements, and communicate effectively with both technical and non-technical stakeholders.&lt;/p&gt;

&lt;p&gt;This is what a day in the life of a DevOps engineer really looks like - not just managing servers and writing scripts, but being a crucial part of the technology ecosystem that powers modern business operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways for Aspiring DevOps Engineers
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🎯 Essential Skills Demonstrated Today:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crisis Management&lt;/strong&gt;: Quick thinking under pressure while maintaining system stability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Assessment&lt;/strong&gt;: Evaluating trade-offs between speed and reliability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-team Communication&lt;/strong&gt;: Coordinating with developers, product managers, and business stakeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Versatility&lt;/strong&gt;: From Kubernetes to databases to security incidents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Impact Awareness&lt;/strong&gt;: Understanding how technical decisions affect revenue and customers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;🛠️ Core Tools in Action:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: ELK Stack, Prometheus, PagerDuty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Kubernetes, Docker, Terraform, AWS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD&lt;/strong&gt;: Jenkins, automated testing pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Intrusion detection, credential management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databases&lt;/strong&gt;: MySQL, query optimization, indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;📚 Want to Learn More?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If this day-in-the-life resonates with you, here are some next steps:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🚀 Getting Started&lt;/strong&gt;: Practice with containerization (&lt;a href="https://devops-daily.com/guides/introduction-to-docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;), learn &lt;a href="https://devops-daily.com/guides/introduction-to-kubernetes" rel="noopener noreferrer"&gt;Kubernetes basics&lt;/a&gt;, and get comfortable with the Linux command line&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🔍 Dive Deeper&lt;/strong&gt;: Set up monitoring in a personal project, practice incident response scenarios, learn infrastructure as code&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;💼 Career Path&lt;/strong&gt;: Consider starting as a systems administrator, junior DevOps engineer, or SRE to build foundational skills&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;The Reality Check&lt;/strong&gt;: DevOps isn't just about tools and automation. It's about building reliable systems that let businesses focus on serving their customers. Every alert, every deployment, every optimization contributes to that mission.&lt;/p&gt;

&lt;p&gt;The most rewarding part? Knowing that somewhere in the world, users are seamlessly making purchases, accessing services, and getting value from applications - all because the infrastructure you built and maintain is working exactly as it should.&lt;/p&gt;

&lt;p&gt;Take a look at the &lt;a href="https://devops-daily.com/roadmap" rel="noopener noreferrer"&gt;DevOps Roadmap&lt;/a&gt; to see how you can build your skills and career in this exciting field.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;That's the real satisfaction of DevOps work - building the invisible foundation that makes everything else possible.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>docker</category>
      <category>database</category>
    </item>
    <item>
      <title>Thinking of Launching a SaaS in 2025? Here's My #1 Piece of Advice</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Thu, 10 Jul 2025 09:22:57 +0000</pubDate>
      <link>https://forem.com/devopsdaily/thinking-of-launching-a-saas-in-2025-heres-my-1-piece-of-advice-1al5</link>
      <guid>https://forem.com/devopsdaily/thinking-of-launching-a-saas-in-2025-heres-my-1-piece-of-advice-1al5</guid>
      <description>&lt;p&gt;If I had to start over today, I'd keep it painfully simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build stuff but also focus on marketing&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most developers (myself included) fall into the trap of tweaking features, polishing deploy scripts, and 'optimizing infra' way too early.&lt;/p&gt;

&lt;p&gt;But in the early days, the real work is figuring out what people actually care about, and how fast you can get paid to solve it.&lt;/p&gt;

&lt;p&gt;Here's what I wish someone had told me earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launch before you're ready.&lt;/li&gt;
&lt;li&gt;Start charging as soon as it works.&lt;/li&gt;
&lt;li&gt;One Droplet, managed Postgres, maybe Redis. Don't over-engineer.&lt;/li&gt;
&lt;li&gt;Use boring tech you can support solo.&lt;/li&gt;
&lt;li&gt;Spend more time on writing and talking than coding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anyway, if you're building something in 2025, what are you doing differently?&lt;/p&gt;

&lt;p&gt;Let's trade notes.&lt;/p&gt;

&lt;p&gt;(P.S. I recently redesigned &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt; so I have to start doing marketing 🤪)&lt;/p&gt;

</description>
      <category>saas</category>
      <category>webdev</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
