<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Last9</title>
    <description>The latest articles on Forem by Last9 (@last9).</description>
    <link>https://forem.com/last9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2738%2F17712055-4a3b-4eb5-8dbd-8ef654bc7184.png</url>
      <title>Forem: Last9</title>
      <link>https://forem.com/last9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/last9"/>
    <language>en</language>
    <item>
      <title>The 6 Questions to Ask Before Adding a High-Cardinality Label</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Mon, 19 Jan 2026 20:38:17 +0000</pubDate>
      <link>https://forem.com/last9/the-6-questions-to-ask-before-adding-a-high-cardinality-label-3pef</link>
      <guid>https://forem.com/last9/the-6-questions-to-ask-before-adding-a-high-cardinality-label-3pef</guid>
      <description>&lt;p&gt;Last month, a team I was talking to added a &lt;code&gt;pod_id&lt;/code&gt; label to debug a networking issue. Seemed harmless - only 200 pods.&lt;/p&gt;

&lt;p&gt;But with 50 metrics per pod and 2-minute pod churn during deployments, they created &lt;strong&gt;150,000 new series per hour&lt;/strong&gt;. Prometheus memory climbed from 8GB to 32GB in a week. They didn't notice until it OOMKilled during a production incident.&lt;/p&gt;

&lt;p&gt;The fix took 10 minutes. The outage took 3 hours. The postmortem took a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;p&gt;Before adding any label that could explode, ask:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Which system stores this?
&lt;/h3&gt;

&lt;p&gt;Prometheus pays cardinality costs at write time (memory). ClickHouse pays at query time (aggregation). Know your failure mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Is this for alerting or investigation?
&lt;/h3&gt;

&lt;p&gt;Alerting &lt;strong&gt;must&lt;/strong&gt; be bounded. Investigation can be unbounded - but maybe shouldn't live in Prometheus.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What's the expected cardinality?
&lt;/h3&gt;

&lt;p&gt;distinct_values × other_label_combinations = series count&lt;br&gt;
  200 pods × 50 metrics × 10 endpoints = 100,000 series. Per deployment.&lt;/p&gt;
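&lt;p&gt;One way to check the actual numbers instead of guessing (a standard cardinality-exploration query, not from the original post; it can be heavy on large instances):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Top 10 metric names by active series count
topk(10, count by (__name__)({__name__=~".+"}))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;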
&lt;h3&gt;
  
  
  4. What's the growth rate?
&lt;/h3&gt;

&lt;p&gt;Will this 10x in a year? Containers, request IDs, user IDs - these grow with traffic.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Is there a fallback?
&lt;/h3&gt;

&lt;p&gt;Can you drop this label via &lt;code&gt;metric_relabel_configs&lt;/code&gt; if it explodes? Test this before you need it.&lt;/p&gt;
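&lt;p&gt;A minimal sketch of that escape hatch, assuming the offending label is &lt;code&gt;pod_id&lt;/code&gt; and a made-up job name (adjust both for your setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: "my-service"        # hypothetical job
    metric_relabel_configs:
      - regex: pod_id             # label to drop before ingestion
        action: labeldrop
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;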
&lt;h3&gt;
  
  
  6. Who owns this label?
&lt;/h3&gt;

&lt;p&gt;When it causes problems at 3am, who gets paged?&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics to Watch
&lt;/h2&gt;

&lt;p&gt;Before cardinality bites, watch these:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  prometheus_tsdb_head_series              # Active series count
  prometheus_tsdb_head_chunks_created_total # Rate of new chunks
  prometheus_tsdb_symbol_table_size_bytes  # Memory for interned strings
  process_resident_memory_bytes            # Actual memory usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt; grows faster than expected, you have a problem brewing.&lt;/p&gt;
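&lt;p&gt;One way to turn that into an early warning (a sketch; tune the window and threshold to your own baseline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fires when active series grew by more than 100k in the last hour
delta(prometheus_tsdb_head_series[1h]) &gt; 100000
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;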




&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;I wrote a full breakdown of how Prometheus and ClickHouse handle cardinality differently at the storage engine level - head blocks, posting lists, Gorilla encoding, columnar storage, GROUP BY explosions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/high-cardinality-metrics-prometheus-clickhouse/" rel="noopener noreferrer"&gt;https://last9.io/blog/high-cardinality-metrics-prometheus-clickhouse/&lt;/a&gt; - covers why they fail in completely different ways and how to design pipelines knowing that.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>monitoring</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is Bun Production-Ready in 2026? A Practical Assessment</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Fri, 16 Jan 2026 22:00:00 +0000</pubDate>
      <link>https://forem.com/last9/is-bun-production-ready-in-2026-a-practical-assessment-181h</link>
      <guid>https://forem.com/last9/is-bun-production-ready-in-2026-a-practical-assessment-181h</guid>
      <description>&lt;h1&gt;
  
  
  Is Bun Production-Ready in 2026? A Practical Assessment
&lt;/h1&gt;

&lt;p&gt;Bun has come a long way since its initial release. With the &lt;a href="https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone" rel="noopener noreferrer"&gt;Anthropic acquisition in December 2025&lt;/a&gt;, the project now has significant backing and a clear path forward. But is it ready for your production workloads?&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Changed Recently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bun 1.3+ Features
&lt;/h3&gt;

&lt;p&gt;The recent releases have focused on developer experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config frontend dev&lt;/strong&gt;: Run &lt;code&gt;bun index.html&lt;/code&gt; directly — it handles hot reloading, ES modules, React transpilation. No Vite or Webpack config needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in database clients&lt;/strong&gt;: &lt;code&gt;Bun.SQL&lt;/code&gt; supports PostgreSQL, MySQL, and SQLite natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package security&lt;/strong&gt;: &lt;code&gt;bun pm check&lt;/code&gt; integrates with Socket.dev for vulnerability scanning.&lt;/li&gt;
&lt;/ul&gt;
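&lt;p&gt;As a quick illustration of the built-in database support, here is a minimal &lt;code&gt;bun:sqlite&lt;/code&gt; sketch (an in-memory database; the table and query are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Database } from "bun:sqlite";

const db = new Database(":memory:");        // throwaway in-memory DB
db.run("CREATE TABLE users (name TEXT)");   // hypothetical schema
db.run("INSERT INTO users VALUES ('ada')");

const row = db.query("SELECT name FROM users").get(); // first row as an object
console.log(row);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;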

&lt;h3&gt;
  
  
  The Anthropic Factor
&lt;/h3&gt;

&lt;p&gt;The acquisition signals long-term investment. Bun powers Claude Code (which hit $1B ARR), so Anthropic has strong incentive to keep it stable and performant. The team remains the same, and it stays MIT-licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Bun Makes Sense
&lt;/h2&gt;

&lt;p&gt;Based on production experience, here's where Bun shines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs and microservices (fast cold starts matter)&lt;/li&gt;
&lt;li&gt;CLI tools and scripts (native TypeScript, fast startup)&lt;/li&gt;
&lt;li&gt;Internal tooling (speed up dev cycles)&lt;/li&gt;
&lt;li&gt;SSR apps with React/Vue (built-in bundling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proceed with caution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apps heavily dependent on native Node modules&lt;/li&gt;
&lt;li&gt;Workloads requiring every Node.js API to match exactly&lt;/li&gt;
&lt;li&gt;Mission-critical systems without thorough dependency testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin your versions&lt;/strong&gt; — Bun's patch releases sometimes include new features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your dependencies&lt;/strong&gt; — Most npm packages work, but edge cases exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Node.js API coverage&lt;/strong&gt; — Some APIs have gaps or behave slightly differently&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're evaluating Bun for a new project, check out this &lt;a href="https://last9.io/blog/getting-started-with-bun-js/" rel="noopener noreferrer"&gt;comprehensive getting started guide&lt;/a&gt; that covers installation, configuration, and use cases in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Bun is production-viable for many workloads in 2026. The Anthropic backing reduces abandonment risk, and the tooling has matured. Start with lower-stakes projects, validate your specific dependencies, and scale from there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with Bun in production? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bunjs</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why High-Cardinality Metrics Break Everything</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Thu, 01 Jan 2026 20:56:38 +0000</pubDate>
      <link>https://forem.com/last9/why-high-cardinality-metrics-break-everything-36d</link>
      <guid>https://forem.com/last9/why-high-cardinality-metrics-break-everything-36d</guid>
      <description>&lt;p&gt;High-cardinality metrics are one of those ideas that sound obviously right—until you try to use them in production.&lt;/p&gt;

&lt;p&gt;In theory, they promise precision. Instead of averages and rollups, you get specificity: &lt;code&gt;per-request&lt;/code&gt;, &lt;code&gt;per-user ID&lt;/code&gt;, &lt;code&gt;per-container&lt;/code&gt;, &lt;code&gt;per-feature&lt;/code&gt; insights. The kind of detail engineers instinctively want when something is on fire.&lt;/p&gt;

&lt;p&gt;And then things start breaking.&lt;/p&gt;

&lt;p&gt;Not immediately. Not loudly. But quietly—often in ways that feel like mysterious bugs until you realize the system itself was never designed for this shape of data.&lt;/p&gt;

&lt;p&gt;What makes high-cardinality failures especially painful is that nothing crashes. Dashboards still load. Alerts still fire. Deploys continue as usual. The only early signal is often an unexplainable cost spike or queries that suddenly feel sluggish during incidents.&lt;/p&gt;

&lt;p&gt;Under the hood, the reason is mechanical. Every unique label combination creates a new time series. Each series needs storage, index entries, memory during ingestion, and ongoing compaction work. As cardinality grows, cost and query complexity don’t scale linearly—they multiply.&lt;/p&gt;

&lt;p&gt;At query time, the problem gets worse. Filters that once narrowed the search space stop being selective. Queries fan out across hundreds of thousands—or millions—of sparse, short-lived series. The query engine isn’t broken; it’s doing exactly what it was asked to do, across far more data than anyone realized they’d created.&lt;/p&gt;

&lt;p&gt;The most dangerous failure mode isn’t cost or performance—it’s trust. Charts flicker. Series appear and disappear. Queries return inconsistent shapes. Engineers stop believing what they see and quietly fall back to logs, not because logs are better, but because they’re predictable.&lt;/p&gt;

&lt;p&gt;The takeaway isn’t that high cardinality is bad. It’s that unbounded, accidental cardinality shows up later as cost surprises, slow queries, and trust erosion unless systems are explicitly designed for it.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, the full post walks through: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;why these failures are hard to detect early&lt;/li&gt;
&lt;li&gt;the systems-level mechanics behind them and&lt;/li&gt;
&lt;li&gt;the patterns teams use to make high-cardinality metrics survivable in practice&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Complete article here&lt;br&gt;
&lt;a href="https://last9.io/blog/why-high-cardinality-metrics-break/" rel="noopener noreferrer"&gt;https://last9.io/blog/why-high-cardinality-metrics-break/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>timeseries</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Log Anything vs Log Everything</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Wed, 16 Oct 2024 02:37:07 +0000</pubDate>
      <link>https://forem.com/last9/log-anything-vs-log-everything-2c50</link>
      <guid>https://forem.com/last9/log-anything-vs-log-everything-2c50</guid>
      <description>&lt;p&gt;Log Everything vs. Log Anything  ⚡️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/log-anything-vs-log-everything/" rel="noopener noreferrer"&gt;https://last9.io/blog/log-anything-vs-log-everything/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log Everything: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured, consistent logging across services&lt;/li&gt;
&lt;li&gt;High-cardinality data that adds context&lt;/li&gt;
&lt;li&gt;Events that tell a story about system behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fine wine, complex yet clear. Helps future you debug at 3 AM.&lt;/p&gt;

&lt;p&gt;Log Anything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random &lt;code&gt;console.log("here")&lt;/code&gt; sprinkled like confetti&lt;/li&gt;
&lt;li&gt;Unstructured text that's a pain to parse&lt;/li&gt;
&lt;li&gt;Arbitrary, inconsistent severity levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mystery juice, might be tasty, might be toxic. Future you curses past you at 3 AM.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Prometheus Remote Write</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Mon, 16 Sep 2024 23:33:57 +0000</pubDate>
      <link>https://forem.com/last9/prometheus-remote-write-51n6</link>
      <guid>https://forem.com/last9/prometheus-remote-write-51n6</guid>
      <description>&lt;p&gt;Ever felt like your Prometheus setup is about to burst at the seams? You're not alone. We've all been there, watching our monitoring system groan under the weight of a million time series.&lt;/p&gt;

&lt;p&gt;But fear not! I've just penned an epic saga on taming the beast that is &lt;a href="https://last9.io/blog/what-is-prometheus-remote-write/" rel="noopener noreferrer"&gt;Prometheus remote write&lt;/a&gt;. We're talking queue wizardry, cardinality kung-fu, and relabeling magic that'll make your metrics flow like butter.&lt;br&gt;
Oh, and there's a juicy example where we turned a metric firehose into a well-behaved garden sprinkler. Spoiler: It involves a 60% CPU diet and a 70% latency liposuction.&lt;/p&gt;

&lt;p&gt;Curious? Hop over to the blog post for the full scoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/optimizing-prometheus-remote-write-performance-guide/" rel="noopener noreferrer"&gt;Optimizing remote write performance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy optimizing, and may your dashboards be ever green! 🚀📉&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>promql</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Logging in Golang</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Sat, 14 Sep 2024 05:44:52 +0000</pubDate>
      <link>https://forem.com/last9/logging-in-golang-40k2</link>
      <guid>https://forem.com/last9/logging-in-golang-40k2</guid>
      <description>&lt;p&gt;Practical insights into Golang logging, including how to use the log package, popular third-party libraries, and tips for structured logging.&lt;/p&gt;

&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Golang Logging&lt;/li&gt;
&lt;li&gt;The Standard Library: log Package
&lt;em&gt;How I Learned to Stop Worrying and Love fmt.Println()&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Popular Third-Party Logging Libraries
&lt;em&gt;Because reinventing the wheel is so 2000s&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Structured Logging in Go
&lt;em&gt;JSON: It's what's for dinner&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Configuring Log Levels and Output Formats
&lt;em&gt;Choosing your adventure&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Integrating with Observability Platforms
&lt;em&gt;Because logs are lonely without metrics and traces&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Best Practices and Performance Considerations
&lt;em&gt;How to not shoot yourself in the foot&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Real-World Examples
&lt;em&gt;I've actually used this stuff myself&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion
&lt;em&gt;Log everything, but log it right, with a schema (OTel)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
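&lt;p&gt;To make the structured-logging point concrete, here is a minimal sketch using the standard library's &lt;code&gt;log/slog&lt;/code&gt; package (Go 1.21+; the field names are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler: one structured record per line, easy to ship and query
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    logger.Info("request handled", "method", "GET", "status", 200)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;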

&lt;p&gt;&lt;a href="https://last9.io/blog/golang-logging-guide-for-developers/" rel="noopener noreferrer"&gt;Golang logging guide&lt;/a&gt; for developers &lt;/p&gt;

</description>
      <category>logging</category>
      <category>go</category>
      <category>programming</category>
    </item>
    <item>
      <title>Prometheus Alternatives</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Tue, 07 Feb 2023 11:29:16 +0000</pubDate>
      <link>https://forem.com/last9/prometheus-alternatives-3j7b</link>
      <guid>https://forem.com/last9/prometheus-alternatives-3j7b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczoaoav9153n8wgj5rc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczoaoav9153n8wgj5rc4.jpg" alt="Prometheus Alternatives" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is a popular open-source platform for metrics and alerting, created at SoundCloud in 2012 and officially released as open source in 2015. Designed for both dynamic service-oriented architectures and system monitoring, Prometheus focuses on reliability, multidimensional data collection, and data visualization.&lt;/p&gt;

&lt;p&gt;While &lt;a href="https://last9.io/blog/prometheus-monitoring" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is an excellent option for tracking metrics, other open-source and SAAS alternatives in the ecosystem might better suit your needs.&lt;/p&gt;

&lt;p&gt;This article compares Prometheus with InfluxDB, Zabbix, Datadog, Graphite, and Grafana based on their data model and storage, architecture, APIs and access methods, partitioning, compatible operating systems, pricing, visualization, alerting, supported programming languages, use cases, and supported workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prometheus Alternatives&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The following is an overview of each tool compared in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Prometheus?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As mentioned above, &lt;a href="https://github.com/prometheus/prometheus" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Prometheus&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring and alerting system that helps developers monitor applications, tools, databases, and even networks. It has a comprehensive set of built-in features for collecting metric data and acts as a full-stack observability and monitoring system for microservices and &lt;a href="https://dev.to/prathamesh/kubernetes-monitoring-with-prometheus-and-grafana-2ic3-temp-slug-4793421"&gt;cloud-native applications&lt;/a&gt;. It joined the &lt;a href="https://www.cncf.io/projects/prometheus/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Cloud Native Computing Foundation (CNCF)&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; in 2016 as the second hosted project after &lt;a href="https://www.cncf.io/projects/kubernetes/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Kubernetes&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. While Prometheus is an excellent tool for DevOps and SRE teams, it can run into scalability issues where tools such as &lt;a href="https://dev.to/prathamesh/thanos-vs-cortex-2hh5-temp-slug-2519808"&gt;Thanos&lt;/a&gt;, &lt;a href="https://dev.to/prathamesh/thanos-vs-cortex-2hh5-temp-slug-2519808"&gt;Cortex&lt;/a&gt;, and &lt;a href="https://last9.io/products/levitate/" rel="noopener noreferrer"&gt;Levitate&lt;/a&gt; can help.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;InfluxDB&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;InfluxDB&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a leading time series database that comes in three editions: an open-source version called InfluxDB and two commercial versions called InfluxDB Cloud and InfluxDB Enterprise. It provides a complete set of data tools for ingesting, processing, and manipulating multiple data points. It includes the InfluxDB user interface (InfluxDB UI) and Flux, a functional scripting and query language.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Zabbix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.zabbix.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Zabbix&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a scalable, accessible, open-source monitoring solution used for both small environments and enterprise-level distributed systems with millions of metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Datadog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Datadog&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring and analytics platform used for event monitoring and measuring the performance of cloud applications and infrastructure. It combines real-time metrics from disparate sources such as applications, servers, databases, and containers with end-to-end tracing to deliver alerts and visualizations. It can collect data from various data sources with its built-in integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Graphite&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Created by Chris Davis at Orbitz in 2006 and released as open source in 2008, &lt;a href="https://graphiteapp.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Graphite&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring solution that collects time series data from applications, servers, infrastructure, and networks. It focuses on storing passive time series data and analyzing it through the Graphite web UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana is a data visualization tool developed by Grafana Labs. It is available as open source, as a managed service (Grafana Cloud), or as an enterprise edition. Grafana can combine data from many sources into a single dashboard, solving the problem of visualizing time series data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Grafana the same as Prometheus?
&lt;/h3&gt;

&lt;p&gt;We keep seeing this common question. Prometheus is a time series database, while Grafana is a data visualization tool that supports Prometheus, Graphite, and InfluxDB as data sources. So they are not the same, but they work better together: Grafana is the de facto standard for visualizing Prometheus data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prometheus Alternatives in action&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section compares Prometheus to InfluxDB, Zabbix, Datadog, and Graphite using the following criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data model and storage&lt;/li&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;APIs and access methods&lt;/li&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;Compatible operating systems&lt;/li&gt;
&lt;li&gt;Supported programming languages&lt;/li&gt;
&lt;li&gt;Open Source vs. Proprietary&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Model and Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus captures and accumulates metric data as time series data and stores it in a local database. A metric name and optional key-value pairs are unique identifiers or labels for each time series.&lt;/p&gt;

&lt;p&gt;Data can be queried in real-time using the &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Prometheus Query Language&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; (PromQL) and presented in tabular or graphical form.&lt;/p&gt;

&lt;p&gt;Prometheus supports the float64 data type with limited support for strings and millisecond resolution timestamps. Prometheus also supports long-term storage to different layers via &lt;a href="https://dev.to/prathamesh/how-to-improve-prometheus-remote-write-performance-at-scale-34c6-temp-slug-8212458"&gt;Prometheus remote write&lt;/a&gt; protocol and can be run in an agent mode.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB maintains a time series database optimized for time-stamped data, much like Prometheus. Data elements also comprise a unique combination of timestamps, tags, fields, and measurements. Tags are indexed key-value pairs used as labels, while fields are sequenced key-value pairs, which function as secondary labels with limited use.&lt;/p&gt;

&lt;p&gt;InfluxDB uses a proprietary query language similar to SQL called &lt;a href="https://docs.influxdata.com/influxdb/v1.7/query_language/spec/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;InfluxQL&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; and supports timestamp, float64, int64, string, and bool data types.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix uses an external database to store the collected data and configuration information. It integrates with leading relational database management systems (RDBMS) such as MySQL, MariaDB, Oracle, PostgreSQL, IBM Db2, and SQLite, which allows Zabbix to store more complex data types such as system logs. Zabbix stores raw data collected from hosts in history tables, while trends tables store consolidated hourly data.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka to process incoming data points and a mix of Redis, Cassandra, and S3 to store and query time series. It also uses Elasticsearch to store and query events (such as alerts and deployments) that are not represented as a time series and uses PostgreSQL for metadata.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like Prometheus, Graphite stores time series data using its specialized database, but data collection is passive. Data is collected from collection daemons or other monitoring tools (including Prometheus) and sent to Graphite's Carbon component.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dev.to/prathamesh/prometheus-vs-influxdb-23do-temp-slug-126429"&gt;InfluxDB&lt;/a&gt; and Graphite both use time series databases similar to Prometheus. Graphite, however, doesn't store raw data as Prometheus does. InfluxDB offers full support for strings and timestamps as well as int64 and bool data types, while Prometheus only provides full support for float64. Zabbix integrates with more familiar RDBMS database engines and is suitable for storing historical data. At the same time, Datadog uses several data models and storage types to store both time-series and non-time-series data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus servers are standalone and run independently of each other. They rely on local on-disk storage rather than network or remote storage services for the core functionality of scraping, rule processing, and alerting. Data is stored locally for 15 days by default, but Prometheus can be integrated with remote solutions such as &lt;a href="https://last9.io/products/levitate/" rel="noopener noreferrer"&gt;Levitate&lt;/a&gt; for long-term storage.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like Prometheus, open-source InfluxDB servers are standalone and use local storage for scraping, alerting, and rule processing. Commercial InfluxDB versions come with distributed storage by default that allows queries and storage to be managed by many nodes simultaneously, making it easier to perform horizontal scaling.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix architecture comprises servers that store statistical, operational, and configuration data and agents installed on the machines that collect the data. Agents monitor and report data collected from local resources and applications to Zabbix servers.&lt;/p&gt;

&lt;p&gt;Agents and servers support passive checks, where the server requests a value from the agent, and active checks, where the agent periodically sends results to the server.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka as the durable buffer between ingestion and its independent storage and query systems. Kafka is an open-source, distributed, partitioned, replicated log service developed at LinkedIn as a unified platform for handling large-scale, real-time data feeds.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite architecture is made up of three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Carbon, the primary backend daemon that listens for time series data sent to Graphite and stores it in Whisper, the backend database&lt;/li&gt;
&lt;li&gt;Whisper, a fast, file-based local time series database that creates one file per stored metric&lt;/li&gt;
&lt;li&gt;The Graphite web UI, the frontend UI for the backend storage system that renders graphs on demand&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While InfluxDB and Prometheus both use standalone servers, commercial versions of InfluxDB offer distributed storage to support horizontal scaling. The Zabbix architectural model uses servers with agents, which allows for both passive and active data checks. Datadog's use of Kafka as a durable ingestion pipeline enables it to handle large amounts of real-time data. Graphite's architecture includes a web app, which is a good choice if you want to render graphs on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;APIs and Access Methods&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus uses RESTful HTTP endpoints with responses in JSON.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The InfluxDB API provides a set of HTTP endpoints for accessing and managing system information, security and access control, resource access, data I/O, and other resources and returns JSON-formatted responses. The Enterprise version also provides support for TCP and UDP ports.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix uses the JSON-RPC 2.0 protocol. Requests and responses between clients and the API are encoded using JSON.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses the HTTP REST API. Resource-oriented URLs are used to call the API, with JSON being returned from all requests.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite data is queried over HTTP via its Metrics API or the Render URL API. The Graphite API is an alternative to the Graphite web UI that retrieves metrics from a time series database and renders graphs or generates JSON data based on these time series.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools provide support for HTTP requests and JSON-formatted responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus supports sharding. You can scale horizontally by splitting scrape targets across multiple Prometheus servers, creating several smaller instances.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB organizes data into shards to create a highly scalable approach that increases throughput and maintains performance as the data grows. Shards are placed into shard groups containing encoded and compressed time series data for a specific time range. The shard group duration defines the period for each shard group, and each group has a corresponding retention policy that applies to all the shards within the group.&lt;/p&gt;
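
&lt;p&gt;In InfluxDB 1.x InfluxQL, for example, the shard group duration can be set alongside the retention policy (the database and policy names here are examples):&lt;/p&gt;

```sql
-- InfluxQL (InfluxDB 1.x) sketch: keep data for 14 days, grouped
-- into 1-day shard groups; names are examples.
CREATE RETENTION POLICY "two_weeks" ON "mydb"
  DURATION 14d REPLICATION 1 SHARD DURATION 1d DEFAULT
```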

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Partitioning with Zabbix depends on the database being used. MySQL, PostgreSQL, IBM Db2, and MariaDB (with the Spider storage engine) offer sharding capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka partitions to scale by customer, metric, and tag set. You can isolate by customer or scale concurrently by metric. Sharding is implemented as a group of Kafka partitions.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite does not support partitioning.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools except for Graphite offer some form of support for partitioning. Prometheus, InfluxDB, and Datadog provide sharding and horizontal scaling features, while Zabbix support depends on your chosen external database.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Compatible Operating Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus supports the Linux and Windows operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB supports Linux, Windows, and macOS.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix supports Linux, Windows, macOS, IBM AIX, Solaris, and HP-UX operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog supports Windows, Linux, and macOS operating systems and cloud service providers, including Google Cloud, AWS, Red Hat OpenShift, and Microsoft Azure.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite supports Linux and Unix operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools except Graphite support Windows and Linux operating systems; Graphite only supports Linux and Unix. InfluxDB, Zabbix, and Datadog also support macOS, with Datadog providing additional support for cloud service providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Supported Programming Languages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus provides several official and unofficial client libraries for .NET, C++, Go, Haskell, Java, JavaScript (Node.js), Python, and Ruby. It also supports &lt;a href="https://dev.to/prathamesh/best-practices-using-and-writing-prometheus-exporters-34lb-temp-slug-6306814"&gt;Prometheus Exporters&lt;/a&gt; to collect data from systems that do not directly have client libraries.&lt;/p&gt;
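
&lt;p&gt;Whatever the language, client libraries and exporters ultimately serve metrics in Prometheus's text exposition format. A minimal sketch of that format (the metric name and labels are examples, not taken from any real exporter):&lt;/p&gt;

```python
# Sketch of Prometheus's text exposition format - the output a client
# library or exporter serves on /metrics. Names here are examples.
def render_counter(name, help_text, value, labels=None):
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )

print(render_counter("http_requests_total", "Total HTTP requests.",
                     1027, {"method": "post", "code": "200"}))
```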

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB supports client libraries for C++, Java, JavaScript, .NET, Perl, PHP, and Python. It can also be used directly via its REST API.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix supports Java, JavaScript, .NET, Perl, PHP, Python, R, Ruby, Elixir, Go, and Rust.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Client libraries are available in C#/.NET, Java, Python, PHP, Go, Node.js, Ruby, and Swift, along with many integrations.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite has client libraries in Python and JavaScript (Node.js).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Prometheus, InfluxDB, Zabbix, and Datadog all support the major programming languages. Graphite, however, only provides support for Python and JavaScript.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparison summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Prometheus&lt;/th&gt;
&lt;th&gt;InfluxDB&lt;/th&gt;
&lt;th&gt;Zabbix&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;Graphite&lt;/th&gt;
&lt;th&gt;Levitate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Model and Storage&lt;/td&gt;
&lt;td&gt;Multi-dimensional data model with Time series data&lt;/td&gt;
&lt;td&gt;Time series data&lt;/td&gt;
&lt;td&gt;External database stores including RDBMS&lt;/td&gt;
&lt;td&gt;Both time series and non time series data&lt;/td&gt;
&lt;td&gt;Time series data&lt;/td&gt;
&lt;td&gt;PromQL compatible time series data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API and Access methods&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported, depends on RDBMS of choice&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Managed TSDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes. Proprietary also available.&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No. Proprietary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No. Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Programming languages&lt;/td&gt;
&lt;td&gt;Tons of client libraries and exporters&lt;/td&gt;
&lt;td&gt;C++, Java, JavaScript, .NET, Perl, PHP, and Python.&lt;/td&gt;
&lt;td&gt;Java, JavaScript, .NET, Perl, PHP, Python, R, Ruby, Elixir, Go, and Rust&lt;/td&gt;
&lt;td&gt;Tons of integrations&lt;/td&gt;
&lt;td&gt;Python and JavaScript (Node.js)&lt;/td&gt;
&lt;td&gt;Can be used directly via the REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prometheus's strengths lie in its support for multidimensional data collection. It has a powerful query language that can be used for both dynamic service-oriented architectures and machine-centric monitoring. It's a good choice when you primarily want to record numeric time series.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/prathamesh/prometheus-vs-influxdb-23do-temp-slug-126429"&gt;InfluxDB and Prometheus&lt;/a&gt; use similar data compression techniques and support multidimensional data using key-value data stores; InfluxDB is better for event logging. A commercial version provides the best option if you need to process large amounts of data, as its default configuration scales horizontally.&lt;/p&gt;

&lt;p&gt;Zabbix focuses on hardware and device management and monitoring. It's a better option than Prometheus if you are more familiar with RDBMS database engines and need to store many historical and varied data types. However, the use of an external database can slow down performance.&lt;/p&gt;

&lt;p&gt;Prometheus's internal time series database provides faster connectivity to data but is not suitable for storing data types like text or event logs. Since Prometheus only keeps data for 15 days by default, it's also not a good option if you need to store historical data (unless configured for remote storage).&lt;/p&gt;

&lt;p&gt;Datadog and Prometheus can be used for application performance monitoring (APM). However, Datadog has more application monitoring capabilities than Prometheus and is geared toward monitoring infrastructure at scale. Datadog is best for monitoring infrastructure and apps and visualizing data from disparate sources in mid to large-scale environments.&lt;/p&gt;

&lt;p&gt;Graphite runs well on all hardware and cloud infrastructure, making it suitable for small businesses with limited resources and large-scale production environments. Choose Graphite when you need a solution focused on storing and analyzing historical data and fast retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Prometheus is a popular option for tracking metrics and alerting, but one of the four alternatives mentioned above might suit your needs depending on your requirements.&lt;/p&gt;

&lt;p&gt;For processing large amounts of data, choose a commercial version of InfluxDB, but if you want the familiarity of an RDBMS engine, then go with Zabbix. Datadog's wide range of monitoring features makes it the go-to choice for monitoring infrastructure in larger environments. Still, if you operate on a smaller scale, Graphite can get the job done with whatever hardware and resources you have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Last9&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, a site reliability engineering (SRE) platform. We remove the guesswork in improving the reliability of your distributed systems. Last9's &lt;a href="https://last9.io/products/levitate" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Levitate&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, a managed time series database(TSDB), helps you understand, track, and improve your organization's system dependencies to reduce the challenges of time series database management.&lt;/p&gt;

&lt;p&gt;Access the intelligence you need to deliver reliable software with Last9's reliability platform.&lt;/p&gt;




&lt;p&gt;This post was originally published on &lt;a href="https://last9.io/blog/prometheus-alternatives/" rel="noopener noreferrer"&gt;Last9 Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>influxdb</category>
      <category>grafana</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>A practical guide for implementing SLO</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Thu, 12 Jan 2023 05:30:00 +0000</pubDate>
      <link>https://forem.com/last9/a-practical-guide-for-implementing-slo-1pej</link>
      <guid>https://forem.com/last9/a-practical-guide-for-implementing-slo-1pej</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z94m9u5yy8de9o985m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z94m9u5yy8de9o985m.jpg" alt="A practical guide for implementing SLO" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a mini guide to the SLO process that SREs and DevOps teams can use as a rule of thumb. It does not necessarily automate the SLO process, but it gives a direction for using SLOs effectively.&lt;/p&gt;

&lt;p&gt;The process essentially involves three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify the level of the Service&lt;/li&gt;
&lt;li&gt;Identify the right type of the SLO&lt;/li&gt;
&lt;li&gt;Set the SLO Targets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving deep into it, let’s understand a few terminologies in the Site Reliability Engineering and Observability world.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Terminologies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Service Level Indicator(SLI)
&lt;/h3&gt;

&lt;p&gt;A Service Level Indicator (&lt;strong&gt;SLI&lt;/strong&gt;) is a measure of the service level provided by a service provider to a customer. It is a quantitative measure that captures key metrics, such as the percentage of successful requests or the percentage of requests completed within 200 milliseconds.&lt;/p&gt;
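
&lt;p&gt;As a sketch, the "requests completed within 200 milliseconds" SLI boils down to a simple ratio (the latency samples below are made up):&lt;/p&gt;

```python
import bisect

# Sketch: an SLI as the fraction of requests completed within 200 ms.
# The latency samples below are made up for illustration.
latencies_ms = [120, 90, 250, 180, 300, 150, 110, 95, 400, 130]

# Count the values at or under the 200 ms threshold via a sorted copy.
good = bisect.bisect_right(sorted(latencies_ms), 200)
sli = good / len(latencies_ms)
print(f"SLI: {sli:.0%}")  # prints: SLI: 70%
```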

&lt;h3&gt;
  
  
  Service Level Objective(SLO)
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://sre.google/sre-book/service-level-objectives/" rel="noopener noreferrer"&gt;Service Level objective&lt;/a&gt; is a codified way to define a goal for service behaviour using a Service Level indicator within a compliance target.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Level Agreement(SLA)
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://www.gartner.com/en/information-technology/glossary/sla-service-level-agreement" rel="noopener noreferrer"&gt;service level agreement&lt;/a&gt; defines the level of service expected by users in terms of customer experience. They also include penalties in case of agreement violation.&lt;/p&gt;

&lt;p&gt;Let’s go through the SLO process now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify the level of Service
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer-Facing Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A service running HTTP API, app, or gRPC workloads, where the caller expects an immediate response to the request it submits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateful Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Services like a database. In a microservices environment where multiple services call the same database, it is common to mistake the database for not being a service itself. Try answering this straightforward question next time you are unable to decide:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My service HAS a database OR my service CALLS a database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any service that does not respond with the result of the request, but instead queues it to be processed later. The only immediate response is an acknowledgment of whether the service accepted the task; the actual result is processed and made available later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operational Services are usually internal to an organization and deal with jobs like reconciliation, infrastructure bring-up, tear-down, etc. These jobs are typically asynchronous, but with a greater focus on accuracy over throughput: the job may run late, but it must be as correct as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify the right type of the SLO
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Request Based SLO
&lt;/h3&gt;

&lt;p&gt;Request-based SLOs perform &lt;strong&gt;&lt;em&gt;some&lt;/em&gt;&lt;/strong&gt; aggregation of good &lt;strong&gt;requests&lt;/strong&gt; vs. the total number of &lt;strong&gt;requests&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, there is a notion of a &lt;strong&gt;Request&lt;/strong&gt;. A request is a single operation on a component that succeeds or fails in generic terms.&lt;/li&gt;
&lt;li&gt;Secondly, the SLIs must not be pre-aggregated, because request SLOs perform an aggregation over a period of time. One can't use pre-aggregated metrics (e.g., CloudWatch or Stackdriver, which directly return P99 latency rather than total requests and per-request latency) for request SLOs.&lt;/li&gt;
&lt;li&gt;Additionally, for low-traffic services, request SLOs can be noisy because they keep flapping even when a very small percentage of requests fail. E.g., if your throughput is 10 requests in a day, setting a 99% compliance target does not make sense, because a single failed request brings compliance down to 90%, depleting the error budget.&lt;/li&gt;
&lt;/ul&gt;
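
&lt;p&gt;The low-traffic problem is easy to see with a quick calculation (the numbers are illustrative):&lt;/p&gt;

```python
# Sketch of why request-based SLOs flap at low traffic: with only
# 10 requests in the compliance window, one failure costs 10 points.
def request_slo(good, total):
    return good / total

low_traffic = request_slo(good=9, total=10)         # one failure out of 10
high_traffic = request_slo(good=9999, total=10000)  # one failure out of 10,000
print(low_traffic, high_traffic)  # prints: 0.9 0.9999
```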

&lt;h3&gt;
  
  
  Window Based SLO
&lt;/h3&gt;

&lt;p&gt;A window-based SLO is a ratio of good &lt;strong&gt;time intervals&lt;/strong&gt; vs. total &lt;strong&gt;time intervals&lt;/strong&gt;. For some sources, per-request data is not available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example,&lt;/strong&gt; In the case of a Kubernetes Cluster, the availability of a Cluster is the percentage of pods allocated vs. pods requested. Sometimes, you may not want to calculate the SLO as the overall performance of the service over a period of time.&lt;/p&gt;

&lt;p&gt;E.g., in the case of a payment service, even 2% of requests failing in a window of 5 minutes is unacceptable, because it is a business-critical service. Even though overall performance has not degraded, none of the payments in that 2% of requests was successful. Window-based SLOs are useful in such cases.&lt;/p&gt;
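
&lt;p&gt;A sketch of the window-based calculation (the window verdicts below are made up; in practice each window is judged against its own error-rate threshold):&lt;/p&gt;

```python
# Sketch of a window-based SLO: each 5-minute window is marked good
# (True) if it met its target; compliance is good windows over total.
# The verdicts below are made up for illustration.
window_results = [True, True, False, True, True, True]

compliance = sum(window_results) / len(window_results)
print(f"{compliance:.1%}")  # prints: 83.3%
```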

&lt;blockquote&gt;
&lt;p&gt;Using the above guidelines, we can create a rough flowchart to decide which type of SLO to choose depending on certain decision points.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjwob5n6zpcvrj6lr53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjwob5n6zpcvrj6lr53.png" alt="A practical guide for implementing SLO" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SLO Process&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Set the SLO Targets
&lt;/h2&gt;

&lt;p&gt;When you start thinking about setting objectives, some questions will arise:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Should I set 99.999% from the start or be conservative?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start conservatively. Look at historical numbers and calculate your 9s, or dive right in with the lowest 9, such as 90%.&lt;/li&gt;
&lt;li&gt;The baseline of the service or historical data of the customer experience can be helpful in this case.&lt;/li&gt;
&lt;li&gt;Keep your systems running against this objective for a period of time and see whether the error budget depletes.&lt;/li&gt;
&lt;li&gt;If it does, improve your system’s stability. If it doesn’t, move up to the next rung of service reliability: from 90% go to 95%, then to 99%, and so on.&lt;/li&gt;
&lt;li&gt;Keep in mind Service Level Agreements, or SLAs, that you may have with customers or with third-party upstream services you depend on. You can’t commit to a compliance target higher than the SLA a third-party dependency gives you.&lt;/li&gt;
&lt;/ul&gt;
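
&lt;p&gt;It helps to translate each rung of the ladder into an error budget before committing to it. A quick sketch:&lt;/p&gt;

```python
# Sketch: the error budget (allowed "bad" minutes) implied by a
# compliance target over a window, rounded for readability.
def error_budget_minutes(target, window_days):
    return round((1 - target) * window_days * 24 * 60, 1)

print(error_budget_minutes(0.99, 30))   # 99% over 30 days -> 432.0 minutes
print(error_budget_minutes(0.999, 30))  # 99.9% over 30 days -> 43.2 minutes
```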

&lt;p&gt;&lt;u&gt;What should be the compliance window?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generally, this is 2x of your sprint window so that you can measure the performance of the service in a large enough duration to make an informed decision in the next sprint cycle on whether to focus on new features or maintenance.&lt;/li&gt;
&lt;li&gt;If you are not sure, start with a day and expand to a week. Remember that the longer your window, the longer the effects of a broken or recovered SLO persist.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;How many ms should I set for latency?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It depends. What kind of user experience are you aiming for? Is your application a payment gateway? Is it a batch processing system where real-time feedback isn’t important?&lt;/li&gt;
&lt;li&gt;To start out, measure your P50, and P99 latencies and initially give yourself some headroom and set your SLOs against P99 latency. Depending on the stability of your systems, use the same ladder-based approach as shown above and iterate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Service Level Objectives are not a silver bullet
&lt;/h2&gt;

&lt;p&gt;Let us take a simple scenario:&lt;/p&gt;

&lt;p&gt;A user makes a request to a web application hosted on Kubernetes served via a load balancer. The request flow is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftshbm6p5qy4eoxej5mvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftshbm6p5qy4eoxej5mvi.png" alt="A practical guide for implementing SLO" width="800" height="111"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Request Flow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of setting a blind SLO on the load balancer and calling it a day, ask yourself the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where should I set the SLO - ALB, Ambassador, K8s, or all of them? Typically, SLOs are best set closest to the user, or on something that represents the end user’s experience. In the above example, one might want to set an SLO on the ALB, but if the same ALB is serving multiple backends, it might be a good idea to set the SLO on the next hop - Ambassador.&lt;/li&gt;
&lt;li&gt;If I set a latency SLO, what should be the right latency value? Look at baseline percentile numbers. Do you want to catch degradations of the P50 customer experience, the P95 customer experience, or a static number?&lt;/li&gt;
&lt;li&gt;Do I have the metrics I need to construct an SLI expression? AWS CloudWatch reports latency as pre-calculated P99 values, i.e., the data is pre-aggregated, so you cannot set request-based SLOs on it; you can only use window-based SLOs.&lt;/li&gt;
&lt;li&gt;Suppose you set an availability SLO on Ambassador with the expression &lt;code&gt;availability = 1 - (5xx / throughput)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;What happens if the Ambassador pod crashes on K8s and does not emit &lt;code&gt;5xx&lt;/code&gt; / &lt;code&gt;throughput&lt;/code&gt; signal?&lt;/li&gt;
&lt;li&gt;Does the expression become &lt;code&gt;availability = 1 - 0 / 0&lt;/code&gt;  or &lt;code&gt;availability = undefined&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;For a payment processing application, there might be a lag between the time at which the transaction was initiated v/s the time at which it was completed.&lt;/li&gt;
&lt;li&gt;How does &lt;code&gt;availability = 1 - (5xx / throughput)&lt;/code&gt; work now?&lt;/li&gt;
&lt;li&gt;How do I know whether a &lt;code&gt;5xx&lt;/code&gt; I got was for a request present in the current throughput, or a previous retry that failed?&lt;/li&gt;
&lt;/ul&gt;
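
&lt;p&gt;One way to handle the 0/0 edge case from the expression above is to treat "no throughput" as "no data" rather than as availability, as in this sketch:&lt;/p&gt;

```python
# Sketch of guarding the availability expression against the 0/0 case:
# when the pod emits no signal, throughput is 0 (or missing), and
# reporting 100% availability would be misleading.
def availability(errors_5xx, throughput):
    if not throughput:   # no data: return None and alert on it instead
        return None
    return 1 - (errors_5xx / throughput)

print(availability(2, 100))  # roughly 0.98
print(availability(0, 0))    # None - a no-data alert should fire here
```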

&lt;p&gt;This is not an exhaustive list of questions. Real-world scenarios will be complicated and that makes the task of setting achievable reliability targets involving multiple stakeholders and critical user journeys tricky.&lt;/p&gt;

&lt;h3&gt;
  
  
  So does this mean all hope is &lt;em&gt;SLOst&lt;/em&gt;?
&lt;/h3&gt;

&lt;p&gt;Of course not! SLOs are a way to gauge your system’s health and customer experience over a time period. But they are not the &lt;strong&gt;only&lt;/strong&gt; way. In the above scenario, one could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set a request-based SLO on the Ambassador.&lt;/li&gt;
&lt;li&gt;Set an uptime window SLO or an alert that checks for no-data situations for signals that are always ≥ 0 e.g. Ambassador throughput.&lt;/li&gt;
&lt;li&gt;Set relevant alerts to catch pod crashes of the application.&lt;/li&gt;
&lt;li&gt;Set alerts on load balancer 5xx to catch scenarios where ALB had an issue and the request was not forwarded to the Ambassador backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Want to know more about Last9 and how we make using SLOs dead simple? Check out &lt;a href="http://last9.io/" rel="noopener noreferrer"&gt;last9.io&lt;/a&gt;; we're building SRE tools to make running systems at scale fun and &lt;strong&gt;embarrassingly easy&lt;/strong&gt;. &lt;strong&gt;🟢&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>slo</category>
      <category>deepdives</category>
      <category>last9engineering</category>
      <category>observability</category>
    </item>
    <item>
      <title>Watermelon Metrics</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Tue, 13 Jul 2021 04:38:42 +0000</pubDate>
      <link>https://forem.com/last9/watermelon-metrics-23pf</link>
      <guid>https://forem.com/last9/watermelon-metrics-23pf</guid>
      <description>&lt;p&gt;Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.last9.io/need-for-systems-observability/"&gt;https://blog.last9.io/need-for-systems-observability/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Managing infra code ⚙️🛠🧰</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Wed, 12 Aug 2020 16:52:58 +0000</pubDate>
      <link>https://forem.com/last9/managing-infra-code-43bp</link>
      <guid>https://forem.com/last9/managing-infra-code-43bp</guid>
      <description>&lt;p&gt;Do you care about the quality of your infra code?&lt;/p&gt;

&lt;p&gt;A. As much as product code&lt;br&gt;
B. Somewhat but mostly no&lt;br&gt;
C. We create infra via UI&lt;/p&gt;

&lt;p&gt;Let's discuss how you manage infra code! Feel free to share your thoughts in the comments section.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>sre</category>
      <category>devops</category>
      <category>poll</category>
    </item>
  </channel>
</rss>
