<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Last9</title>
    <description>The latest articles on Forem by Last9 (@last9).</description>
    <link>https://forem.com/last9</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2738%2F17712055-4a3b-4eb5-8dbd-8ef654bc7184.png</url>
      <title>Forem: Last9</title>
      <link>https://forem.com/last9</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/last9"/>
    <language>en</language>
    <item>
      <title>The 6 Questions to Ask Before Adding a High-Cardinality Label</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Mon, 19 Jan 2026 20:38:17 +0000</pubDate>
      <link>https://forem.com/last9/the-6-questions-to-ask-before-adding-a-high-cardinality-label-3pef</link>
      <guid>https://forem.com/last9/the-6-questions-to-ask-before-adding-a-high-cardinality-label-3pef</guid>
      <description>&lt;p&gt;Last month, a team I was talking to added a &lt;code&gt;pod_id&lt;/code&gt; label to debug a networking issue. Seemed harmless - only 200 pods.&lt;/p&gt;

&lt;p&gt;But with 50 metrics per pod and 2-minute pod churn during deployments, they created &lt;strong&gt;150,000 new series per hour&lt;/strong&gt;. Prometheus memory climbed from 8GB to 32GB in a week. They didn't notice until it OOMKilled during a production incident.&lt;/p&gt;

&lt;p&gt;The fix took 10 minutes. The outage took 3 hours. The postmortem took a week.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Checklist
&lt;/h2&gt;

&lt;p&gt;Before adding any label that could explode, ask:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Which system stores this?
&lt;/h3&gt;

&lt;p&gt;Prometheus pays cardinality costs at write time (memory). ClickHouse pays at query time (aggregation). Know your failure mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Is this for alerting or investigation?
&lt;/h3&gt;

&lt;p&gt;Alerting &lt;strong&gt;must&lt;/strong&gt; be bounded. Investigation can be unbounded - but maybe shouldn't live in Prometheus.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. What's the expected cardinality?
&lt;/h3&gt;

&lt;p&gt;distinct_values × other_label_combinations = series count&lt;br&gt;
  200 pods × 50 metrics × 10 endpoints = 100,000 series. Per deployment.&lt;/p&gt;
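&lt;p&gt;One way to check the actual numbers instead of guessing (a standard cardinality-exploration query, not from the original post; it can be heavy on large instances):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Top 10 metric names by active series count
topk(10, count by (__name__)({__name__=~".+"}))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;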
&lt;h3&gt;
  
  
  4. What's the growth rate?
&lt;/h3&gt;

&lt;p&gt;Will this 10x in a year? Containers, request IDs, user IDs - these grow with traffic.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Is there a fallback?
&lt;/h3&gt;

&lt;p&gt;Can you drop this label via &lt;code&gt;metric_relabel_configs&lt;/code&gt; if it explodes? Test this before you need it.&lt;/p&gt;
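&lt;p&gt;A minimal sketch of that escape hatch, assuming the offending label is &lt;code&gt;pod_id&lt;/code&gt; and a made-up job name (adjust both for your setup):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scrape_configs:
  - job_name: "my-service"        # hypothetical job
    metric_relabel_configs:
      - regex: pod_id             # label to drop before ingestion
        action: labeldrop
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;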
&lt;h3&gt;
  
  
  6. Who owns this label?
&lt;/h3&gt;

&lt;p&gt;When it causes problems at 3am, who gets paged?&lt;/p&gt;
&lt;h2&gt;
  
  
  Metrics to Watch
&lt;/h2&gt;

&lt;p&gt;Before cardinality bites, watch these:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  prometheus_tsdb_head_series              # Active series count
  prometheus_tsdb_head_chunks_created_total # Rate of new chunks
  prometheus_tsdb_symbol_table_size_bytes  # Memory for interned strings
  process_resident_memory_bytes            # Actual memory usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;prometheus_tsdb_head_series&lt;/code&gt; grows faster than expected, you have a problem brewing.&lt;/p&gt;
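&lt;p&gt;One way to turn that into an early warning (a sketch; tune the window and threshold to your own baseline):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fires when active series grew by more than 100k in the last hour
delta(prometheus_tsdb_head_series[1h]) &gt; 100000
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;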




&lt;h2&gt;
  
  
  Going Deeper
&lt;/h2&gt;

&lt;p&gt;I wrote a full breakdown of how Prometheus and ClickHouse handle cardinality differently at the storage engine level - head blocks, posting lists, Gorilla encoding, columnar storage, GROUP BY explosions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/high-cardinality-metrics-prometheus-clickhouse/" rel="noopener noreferrer"&gt;https://last9.io/blog/high-cardinality-metrics-prometheus-clickhouse/&lt;/a&gt; - covers why they fail in completely different ways and how to design pipelines knowing that.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>monitoring</category>
      <category>prometheus</category>
      <category>devops</category>
    </item>
    <item>
      <title>Is Bun Production-Ready in 2026? A Practical Assessment</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Fri, 16 Jan 2026 22:00:00 +0000</pubDate>
      <link>https://forem.com/last9/is-bun-production-ready-in-2026-a-practical-assessment-181h</link>
      <guid>https://forem.com/last9/is-bun-production-ready-in-2026-a-practical-assessment-181h</guid>
      <description>&lt;h1&gt;
  
  
  Is Bun Production-Ready in 2026? A Practical Assessment
&lt;/h1&gt;

&lt;p&gt;Bun has come a long way since its initial release. With the &lt;a href="https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone" rel="noopener noreferrer"&gt;Anthropic acquisition in December 2025&lt;/a&gt;, the project now has significant backing and a clear path forward. But is it ready for your production workloads?&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Changed Recently
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bun 1.3+ Features
&lt;/h3&gt;

&lt;p&gt;The recent releases have focused on developer experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config frontend dev&lt;/strong&gt;: Run &lt;code&gt;bun index.html&lt;/code&gt; directly — it handles hot reloading, ES modules, React transpilation. No Vite or Webpack config needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in database clients&lt;/strong&gt;: &lt;code&gt;Bun.SQL&lt;/code&gt; supports PostgreSQL, MySQL, and SQLite natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Package security&lt;/strong&gt;: &lt;code&gt;bun pm check&lt;/code&gt; integrates with Socket.dev for vulnerability scanning.&lt;/li&gt;
&lt;/ul&gt;
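&lt;p&gt;As a quick illustration of the built-in database support, here is a minimal &lt;code&gt;bun:sqlite&lt;/code&gt; sketch (an in-memory database; the table and query are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Database } from "bun:sqlite";

const db = new Database(":memory:");        // throwaway in-memory DB
db.run("CREATE TABLE users (name TEXT)");   // hypothetical schema
db.run("INSERT INTO users VALUES ('ada')");

const row = db.query("SELECT name FROM users").get(); // first row as an object
console.log(row);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;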

&lt;h3&gt;
  
  
  The Anthropic Factor
&lt;/h3&gt;

&lt;p&gt;The acquisition signals long-term investment. Bun powers Claude Code (which hit $1B ARR), so Anthropic has strong incentive to keep it stable and performant. The team remains the same, and it stays MIT-licensed.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Bun Makes Sense
&lt;/h2&gt;

&lt;p&gt;Based on production experience, here's where Bun shines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good fits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;APIs and microservices (fast cold starts matter)&lt;/li&gt;
&lt;li&gt;CLI tools and scripts (native TypeScript, fast startup)&lt;/li&gt;
&lt;li&gt;Internal tooling (speed up dev cycles)&lt;/li&gt;
&lt;li&gt;SSR apps with React/Vue (built-in bundling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Proceed with caution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apps heavily dependent on native Node modules&lt;/li&gt;
&lt;li&gt;Workloads requiring every Node.js API to match exactly&lt;/li&gt;
&lt;li&gt;Mission-critical systems without thorough dependency testing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Considerations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pin your versions&lt;/strong&gt; — Bun's patch releases sometimes include new features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test your dependencies&lt;/strong&gt; — Most npm packages work, but edge cases exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check Node.js API coverage&lt;/strong&gt; — Some APIs have gaps or behave slightly differently&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you're evaluating Bun for a new project, check out this &lt;a href="https://last9.io/blog/getting-started-with-bun-js/" rel="noopener noreferrer"&gt;comprehensive getting started guide&lt;/a&gt; that covers installation, configuration, and use cases in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;Bun is production-viable for many workloads in 2026. The Anthropic backing reduces abandonment risk, and the tooling has matured. Start with lower-stakes projects, validate your specific dependencies, and scale from there.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your experience with Bun in production? Drop a comment below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bunjs</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Why High-Cardinality Metrics Break Everything</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Thu, 01 Jan 2026 20:56:38 +0000</pubDate>
      <link>https://forem.com/last9/why-high-cardinality-metrics-break-everything-36d</link>
      <guid>https://forem.com/last9/why-high-cardinality-metrics-break-everything-36d</guid>
      <description>&lt;p&gt;High-cardinality metrics are one of those ideas that sound obviously right—until you try to use them in production.&lt;/p&gt;

&lt;p&gt;In theory, they promise precision. Instead of averages and rollups, you get specificity: &lt;code&gt;per-request&lt;/code&gt;, &lt;code&gt;per-user ID&lt;/code&gt;, &lt;code&gt;per-container&lt;/code&gt;, &lt;code&gt;per-feature&lt;/code&gt; insights. The kind of detail engineers instinctively want when something is on fire.&lt;/p&gt;

&lt;p&gt;And then things start breaking.&lt;/p&gt;

&lt;p&gt;Not immediately. Not loudly. But quietly—often in ways that feel like mysterious bugs until you realize the system itself was never designed for this shape of data.&lt;/p&gt;

&lt;p&gt;What makes high-cardinality failures especially painful is that nothing crashes. Dashboards still load. Alerts still fire. Deploys continue as usual. The only early signal is often an unexplainable cost spike or queries that suddenly feel sluggish during incidents.&lt;/p&gt;

&lt;p&gt;Under the hood, the reason is mechanical. Every unique label combination creates a new time series. Each series needs storage, index entries, memory during ingestion, and ongoing compaction work. As cardinality grows, cost and query complexity don’t scale linearly—they multiply.&lt;/p&gt;

&lt;p&gt;At query time, the problem gets worse. Filters that once narrowed the search space stop being selective. Queries fan out across hundreds of thousands—or millions—of sparse, short-lived series. The query engine isn’t broken; it’s doing exactly what it was asked to do, across far more data than anyone realized they’d created.&lt;/p&gt;

&lt;p&gt;The most dangerous failure mode isn’t cost or performance—it’s trust. Charts flicker. Series appear and disappear. Queries return inconsistent shapes. Engineers stop believing what they see and quietly fall back to logs, not because logs are better, but because they’re predictable.&lt;/p&gt;

&lt;p&gt;The takeaway isn’t that high cardinality is bad. It’s that unbounded, accidental cardinality shows up later as cost surprises, slow queries, and trust erosion unless systems are explicitly designed for it.&lt;/p&gt;

&lt;p&gt;If this sounds familiar, the full post walks through: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;why these failures are hard to detect early&lt;/li&gt;
&lt;li&gt;the systems-level mechanics behind them and&lt;/li&gt;
&lt;li&gt;the patterns teams use to make high-cardinality metrics survivable in practice&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Complete article here&lt;br&gt;
&lt;a href="https://last9.io/blog/why-high-cardinality-metrics-break/" rel="noopener noreferrer"&gt;https://last9.io/blog/why-high-cardinality-metrics-break/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>timeseries</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Log Anything vs Log Everything</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Wed, 16 Oct 2024 02:37:07 +0000</pubDate>
      <link>https://forem.com/last9/log-anything-vs-log-everything-2c50</link>
      <guid>https://forem.com/last9/log-anything-vs-log-everything-2c50</guid>
      <description>&lt;p&gt;Log Everything vs. Log Anything  ⚡️&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/log-anything-vs-log-everything/" rel="noopener noreferrer"&gt;https://last9.io/blog/log-anything-vs-log-everything/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log Everything: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured, consistent logging across services&lt;/li&gt;
&lt;li&gt;High-cardinality data that adds context&lt;/li&gt;
&lt;li&gt;Events that tell a story about system behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fine wine, complex yet clear. Helps future you debug at 3 AM.&lt;/p&gt;

&lt;p&gt;Log Anything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random &lt;code&gt;console.log("here")&lt;/code&gt; sprinkled like confetti&lt;/li&gt;
&lt;li&gt;Unstructured text that's a pain to parse&lt;/li&gt;
&lt;li&gt;Arbitrary, inconsistent severity levels&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mystery juice, might be tasty, might be toxic. Future you curses past you at 3 AM.&lt;/p&gt;

</description>
      <category>logging</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Prometheus Remote Write</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Mon, 16 Sep 2024 23:33:57 +0000</pubDate>
      <link>https://forem.com/last9/prometheus-remote-write-51n6</link>
      <guid>https://forem.com/last9/prometheus-remote-write-51n6</guid>
      <description>&lt;p&gt;Ever felt like your Prometheus setup is about to burst at the seams? You're not alone. We've all been there, watching our monitoring system groan under the weight of a million time series.&lt;/p&gt;

&lt;p&gt;But fear not! I've just penned an epic saga on taming the beast that is &lt;a href="https://last9.io/blog/what-is-prometheus-remote-write/" rel="noopener noreferrer"&gt;Prometheus remote write&lt;/a&gt;. We're talking queue wizardry, cardinality kung-fu, and relabeling magic that'll make your metrics flow like butter.&lt;br&gt;
Oh, and there's a juicy example where we turned a metric firehose into a well-behaved garden sprinkler. Spoiler: It involves a 60% CPU diet and a 70% latency liposuction.&lt;/p&gt;

&lt;p&gt;Curious? Hop over to the blog post for the full scoop.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/blog/optimizing-prometheus-remote-write-performance-guide/" rel="noopener noreferrer"&gt;Optimizing remote write performance&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy optimizing, and may your dashboards be ever green! 🚀📉&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>promql</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Logging in Golang</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Sat, 14 Sep 2024 05:44:52 +0000</pubDate>
      <link>https://forem.com/last9/logging-in-golang-40k2</link>
      <guid>https://forem.com/last9/logging-in-golang-40k2</guid>
      <description>&lt;p&gt;Practical insights into Golang logging, including how to use the log package, popular third-party libraries, and tips for structured logging.&lt;/p&gt;

&lt;p&gt;Table of Contents&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Golang Logging&lt;/li&gt;
&lt;li&gt;The Standard Library: log Package
&lt;em&gt;How I Learned to Stop Worrying and Love fmt.Println()&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Popular Third-Party Logging Libraries
&lt;em&gt;Because reinventing the wheel is so 2000s&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Structured Logging in Go
&lt;em&gt;JSON: It's what's for dinner&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Configuring Log Levels and Output Formats
&lt;em&gt;Choosing your adventure&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Integrating with Observability Platforms
&lt;em&gt;Because logs are lonely without metrics and traces&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Best Practices and Performance Considerations
&lt;em&gt;How to not shoot yourself in the foot&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Real-World Examples
&lt;em&gt;I've actually used this stuff myself&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion
&lt;em&gt;Log everything, but log it right, with a schema (OTel)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
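&lt;p&gt;To make the structured-logging point concrete, here is a minimal sketch using the standard library's &lt;code&gt;log/slog&lt;/code&gt; package (Go 1.21+; the field names are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler: one structured record per line, easy to ship and query
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
    logger.Info("request handled", "method", "GET", "status", 200)
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;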

&lt;p&gt;&lt;a href="https://last9.io/blog/golang-logging-guide-for-developers/" rel="noopener noreferrer"&gt;Golang logging guide&lt;/a&gt; for developers &lt;/p&gt;

</description>
      <category>logging</category>
      <category>go</category>
      <category>programming</category>
    </item>
    <item>
      <title>Prometheus Alternatives</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Tue, 07 Feb 2023 11:29:16 +0000</pubDate>
      <link>https://forem.com/last9/prometheus-alternatives-3j7b</link>
      <guid>https://forem.com/last9/prometheus-alternatives-3j7b</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczoaoav9153n8wgj5rc4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczoaoav9153n8wgj5rc4.jpg" alt="Prometheus Alternatives" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prometheus is a popular open-source platform for metrics and alerting, created at SoundCloud in 2012 and officially released as open source in 2015. Designed for both dynamic service-oriented architectures and system monitoring, Prometheus focuses on reliability, multidimensional data collection, and data visualization.&lt;/p&gt;

&lt;p&gt;While &lt;a href="https://last9.io/blog/prometheus-monitoring" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is an excellent option for tracking metrics, other open-source and SAAS alternatives in the ecosystem might better suit your needs.&lt;/p&gt;

&lt;p&gt;This article compares Prometheus with InfluxDB, Zabbix, Datadog, Graphite, and Grafana based on their data model and storage, architecture, APIs and access methods, partitioning, compatible operating systems, pricing, visualization, alerting, supported programming languages, use cases, and supported workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prometheus Alternatives&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The following is an overview of each tool compared in this article.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Prometheus?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As mentioned above, &lt;a href="https://github.com/prometheus/prometheus" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Prometheus&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring and alerting system that helps developers monitor applications, tools, databases, and even networks. It has a comprehensive set of built-in features for collecting metric data and acts as a full-stack observability and monitoring system for microservices and &lt;a href="https://dev.to/prathamesh/kubernetes-monitoring-with-prometheus-and-grafana-2ic3-temp-slug-4793421"&gt;cloud-native applications&lt;/a&gt;. It joined the &lt;a href="https://www.cncf.io/projects/prometheus/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Cloud Native Computing Foundation (CNCF)&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; in 2016 as the second hosted project after &lt;a href="https://www.cncf.io/projects/kubernetes/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Kubernetes&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;. While Prometheus is an excellent tool for DevOps and SRE teams, it can run into scalability issues where tools such as &lt;a href="https://dev.to/prathamesh/thanos-vs-cortex-2hh5-temp-slug-2519808"&gt;Thanos&lt;/a&gt;, &lt;a href="https://dev.to/prathamesh/thanos-vs-cortex-2hh5-temp-slug-2519808"&gt;Cortex&lt;/a&gt;, and &lt;a href="https://last9.io/products/levitate/" rel="noopener noreferrer"&gt;Levitate&lt;/a&gt; can help.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;InfluxDB&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;InfluxDB&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a leading time series database that comes in three editions: an open-source version called InfluxDB and two commercial versions called InfluxDB Cloud and InfluxDB Enterprise. It provides a complete set of data tools for ingesting, processing, and manipulating multiple data points. It includes the InfluxDB user interface (InfluxDB UI) and Flux, a functional scripting and query language.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Zabbix&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.zabbix.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Zabbix&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a scalable, accessible, open-source monitoring solution used for both small environments and enterprise-level distributed systems with millions of metrics.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Datadog&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.datadoghq.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Datadog&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring and analytics platform used for event monitoring and measuring the performance of cloud applications and infrastructure. It combines real-time metrics from disparate sources such as applications, servers, databases, and containers with end-to-end tracing to deliver alerts and visualizations. It can collect data from various data sources with its built-in integrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Graphite&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Created by Chris Davis at Orbitz in 2006 and released as open source in 2008, &lt;a href="https://graphiteapp.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Graphite&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; is a monitoring solution that collects time series data from applications, servers, infrastructure, and networks. It focuses on storing passive time series data and analyzing it through the Graphite web UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana
&lt;/h2&gt;

&lt;p&gt;Grafana is a data visualization tool developed by Grafana Labs. It is available as open source, as a managed service (Grafana Cloud), or as an enterprise edition. Grafana can combine data from many sources into a single dashboard, solving the problem of visualizing time series data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Grafana the same as Prometheus?
&lt;/h3&gt;

&lt;p&gt;We keep seeing this common question. Prometheus is a time series database, while Grafana is a data visualization tool that supports Prometheus, Graphite, and InfluxDB as data sources. So they are not the same, but they work better together: Grafana is the de facto standard for visualizing Prometheus data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Prometheus Alternatives in action&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This section compares Prometheus to InfluxDB, Zabbix, Datadog, and Graphite using the following criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data model and storage&lt;/li&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;APIs and access methods&lt;/li&gt;
&lt;li&gt;Partitioning&lt;/li&gt;
&lt;li&gt;Compatible operating systems&lt;/li&gt;
&lt;li&gt;Supported programming languages&lt;/li&gt;
&lt;li&gt;Open Source vs. Proprietary&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Model and Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus captures and accumulates metric data as time series data and stores it in a local database. A metric name and optional key-value pairs are unique identifiers or labels for each time series.&lt;/p&gt;

&lt;p&gt;Data can be queried in real-time using the &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Prometheus Query Language&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; (PromQL) and presented in tabular or graphical form.&lt;/p&gt;

&lt;p&gt;Prometheus supports the float64 data type with limited support for strings and millisecond resolution timestamps. Prometheus also supports long-term storage to different layers via &lt;a href="https://dev.to/prathamesh/how-to-improve-prometheus-remote-write-performance-at-scale-34c6-temp-slug-8212458"&gt;Prometheus remote write&lt;/a&gt; protocol and can be run in an agent mode.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB maintains a time series database optimized for time-stamped data, much like Prometheus. Data elements also comprise a unique combination of timestamps, tags, fields, and measurements. Tags are indexed key-value pairs used as labels, while fields are sequenced key-value pairs, which function as secondary labels with limited use.&lt;/p&gt;

&lt;p&gt;InfluxDB uses a proprietary query language similar to SQL called &lt;a href="https://docs.influxdata.com/influxdb/v1.7/query_language/spec/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;InfluxQL&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt; and supports timestamp, float64, int64, string, and bool data types.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix uses an external database to store the collected data and configuration information. It integrates with leading relational database management systems (RDBMS) such as MySQL, MariaDB, Oracle, PostgreSQL, IBM Db2, and SQLite, which allows Zabbix to store more complex data types such as system logs. Zabbix stores raw data collected from hosts in history tables, while trends tables store consolidated hourly data.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka to process incoming data points and a mix of Redis, Cassandra, and S3 to store and query time series. It also uses Elasticsearch to store and query events (such as alerts and deployments) that are not represented as a time series and uses PostgreSQL for metadata.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Data Model and Storage&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like Prometheus, Graphite stores time series data using its specialized database, but data collection is passive. Data is collected from collection daemons or other monitoring tools (including Prometheus) and sent to Graphite's Carbon component.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dev.to/prathamesh/prometheus-vs-influxdb-23do-temp-slug-126429"&gt;InfluxDB&lt;/a&gt; and Graphite both use time series databases similar to Prometheus. Graphite, however, doesn't store raw data as Prometheus does. InfluxDB offers full support for strings and timestamps as well as int64 and bool data types, while Prometheus only provides full support for float64. Zabbix integrates with more familiar RDBMS database engines and is suitable for storing historical data. At the same time, Datadog uses several data models and storage types to store both time-series and non-time-series data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus servers are standalone and run independently of each other. They rely on local on-disk storage rather than network or remote storage services for the core functionality of scraping, rule processing, and alerting. Data is stored locally for 15 days by default, but Prometheus can be integrated with remote solutions such as &lt;a href="https://last9.io/products/levitate/" rel="noopener noreferrer"&gt;Levitate&lt;/a&gt; for long-term storage.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Like Prometheus, open-source InfluxDB servers are standalone and use local storage for scraping, alerting, and rule processing. Commercial InfluxDB versions come with distributed storage by default that allows queries and storage to be managed by many nodes simultaneously, making it easier to perform horizontal scaling.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix architecture comprises servers that store statistical, operational, and configuration data and agents installed on the machines that collect the data. Agents monitor and report data collected from local resources and applications to Zabbix servers.&lt;/p&gt;

&lt;p&gt;Agents and servers support passive checks, where the server requests a value from the agent, and active checks, where the agent periodically sends results to the server.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka as the durable buffer between ingestion and its independent storage and query systems. Kafka is an open-source, distributed, partitioned, replicated log service developed at LinkedIn as a unified platform for handling large-scale, real-time data feeds.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Architecture&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite architecture is made up of three components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Carbon, the primary backend daemon that listens for time series data sent to Graphite and stores it in Whisper, the backend database&lt;/li&gt;
&lt;li&gt;Whisper, a fast, file-based local time series database that creates one file per stored metric&lt;/li&gt;
&lt;li&gt;The Graphite web UI, the frontend UI for the backend storage system that renders graphs on demand&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;While InfluxDB and Prometheus both use standalone servers, commercial versions of InfluxDB offer distributed storage to support horizontal scaling. The Zabbix architectural model uses servers with agents, which allows for both passive and active data checks. Datadog's use of Kafka as a durable ingestion pipeline enables it to handle large amounts of real-time data. Graphite's architecture includes a web app, which is a good choice if you want to render graphs on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;APIs and Access Methods&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus uses RESTful HTTP endpoints with responses in JSON.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The InfluxDB API provides a set of HTTP endpoints for accessing and managing system information, security and access control, resource access, data I/O, and other resources and returns JSON-formatted responses. The Enterprise version also provides support for TCP and UDP ports.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix uses the JSON-RPC 2.0 protocol. Requests and responses between clients and the API are encoded using JSON.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses the HTTP REST API. Resource-oriented URLs are used to call the API, with JSON being returned from all requests.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: APIs and Access Methods&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite data is queried over HTTP via its Metrics API or the Render URL API. The Graphite API is an alternative to the Graphite web UI that retrieves metrics from a time series database and renders graphs or generates JSON data based on these time series.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools provide support for HTTP requests and JSON-formatted responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Partitioning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus supports sharding. You can scale horizontally by splitting scrape targets across multiple Prometheus servers, creating several smaller instances.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB organizes data into shards to create a highly scalable approach that increases throughput and maintains performance as the data grows. Shards are placed into shard groups containing encoded and compressed time series data for a specific time range. The shard group duration defines the period for each shard group, and each group has a corresponding retention policy that applies to all the shards within the group.&lt;/p&gt;
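
&lt;p&gt;In InfluxDB 1.x InfluxQL, for example, the shard group duration can be set alongside the retention policy (the database and policy names here are examples):&lt;/p&gt;

```sql
-- InfluxQL (InfluxDB 1.x) sketch: keep data for 14 days, grouped
-- into 1-day shard groups; names are examples.
CREATE RETENTION POLICY "two_weeks" ON "mydb"
  DURATION 14d REPLICATION 1 SHARD DURATION 1d DEFAULT
```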

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Partitioning with Zabbix depends on the database being used. MySQL, PostgreSQL, IBM Db2, and MariaDB (with the Spider storage engine) offer sharding capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog uses Kafka partitions to scale by customer, metric, and tag set. You can isolate by customer or scale concurrently by metric. Sharding is implemented as a group of Kafka partitions.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Partitioning&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite does not support partitioning.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools except for Graphite offer some form of support for partitioning. Prometheus, InfluxDB, and Datadog provide sharding and horizontal scaling features, while Zabbix support depends on your chosen external database.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Compatible Operating Systems&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus supports the Linux and Windows operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB supports Linux, Windows, and macOS.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix supports Linux, Windows, macOS, IBM AIX, Solaris, and HP-UX operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Datadog supports Windows, Linux, and macOS operating systems and cloud service providers, including Google Cloud, AWS, Red Hat OpenShift, and Microsoft Azure.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Compatible Operating Systems&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite supports Linux and Unix operating systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;All tools except Graphite support Windows and Linux operating systems; Graphite only supports Linux and Unix. InfluxDB, Zabbix, and Datadog also support macOS, with Datadog providing additional support for cloud service providers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Supported Programming Languages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prometheus provides several official and unofficial client libraries for .NET, C++, Go, Haskell, Java, JavaScript (Node.js), Python, and Ruby. It also supports &lt;a href="https://dev.to/prathamesh/best-practices-using-and-writing-prometheus-exporters-34lb-temp-slug-6306814"&gt;Prometheus Exporters&lt;/a&gt; to collect data from systems that do not directly have client libraries.&lt;/p&gt;
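
&lt;p&gt;Whatever the language, client libraries and exporters ultimately serve metrics in Prometheus's text exposition format. A minimal sketch of that format (the metric name and labels are examples, not taken from any real exporter):&lt;/p&gt;

```python
# Sketch of Prometheus's text exposition format - the output a client
# library or exporter serves on /metrics. Names here are examples.
def render_counter(name, help_text, value, labels=None):
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{label_str} {value}\n"
    )

print(render_counter("http_requests_total", "Total HTTP requests.",
                     1027, {"method": "post", "code": "200"}))
```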

&lt;h4&gt;
  
  
  &lt;strong&gt;InfluxDB: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;InfluxDB supports client libraries for C++, Java, JavaScript, .NET, Perl, PHP, and Python. It can also be used directly via its REST API.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Zabbix: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Zabbix supports Java, JavaScript, .NET, Perl, PHP, Python, R, Ruby, Elixir, Go, and Rust.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Datadog: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Client libraries are available in C#/.NET, Java, Python, PHP, Go, Node.js, Ruby, and Swift, along with many integrations.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Graphite: Supported Programming Languages&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Graphite has client libraries in Python and JavaScript (Node.js).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Prometheus, InfluxDB, Zabbix, and Datadog all support the major programming languages. Graphite, however, only provides support for Python and JavaScript.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparison summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Prometheus&lt;/th&gt;
&lt;th&gt;InfluxDB&lt;/th&gt;
&lt;th&gt;Zabbix&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;Graphite&lt;/th&gt;
&lt;th&gt;Levitate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data Model and Storage&lt;/td&gt;
&lt;td&gt;Multi-dimensional data model with Time series data&lt;/td&gt;
&lt;td&gt;Time series data&lt;/td&gt;
&lt;td&gt;External database stores including RDBMS&lt;/td&gt;
&lt;td&gt;Both time series and non time series data&lt;/td&gt;
&lt;td&gt;Time series data&lt;/td&gt;
&lt;td&gt;PromQL compatible time series data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API and Access methods&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;td&gt;HTTP API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Partitioning&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported, depends on RDBMS of choice&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Managed TSDB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open Source&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes. Proprietary also available.&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No. Proprietary&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No. Proprietary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Programming languages&lt;/td&gt;
&lt;td&gt;Tons of client libraries and exporters&lt;/td&gt;
&lt;td&gt;C++, Java, JavaScript, .NET, Perl, PHP, and Python.&lt;/td&gt;
&lt;td&gt;Java, JavaScript, .NET, Perl, PHP, Python, R, Ruby, Elixir, Go, and Rust&lt;/td&gt;
&lt;td&gt;Tons of integrations&lt;/td&gt;
&lt;td&gt;Python and JavaScript (Node.js)&lt;/td&gt;
&lt;td&gt;Can be used directly via the REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prometheus's strengths lie in its support for multidimensional data collection. It has a powerful query language that can be used for both dynamic service-oriented architectures and machine-centric monitoring. It's a good choice when you primarily want to record numeric time series.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/prathamesh/prometheus-vs-influxdb-23do-temp-slug-126429"&gt;InfluxDB and Prometheus&lt;/a&gt; use similar data compression techniques and support multidimensional data using key-value data stores; InfluxDB is better for event logging. A commercial version provides the best option if you need to process large amounts of data, as its default configuration scales horizontally.&lt;/p&gt;

&lt;p&gt;Zabbix focuses on hardware and device management and monitoring. It's a better option than Prometheus if you are more familiar with RDBMS database engines and need to store many historical and varied data types. However, the use of an external database can slow down performance.&lt;/p&gt;

&lt;p&gt;Prometheus's internal time series database provides faster connectivity to data but is not suitable for storing data types like text or event logs. Since Prometheus only keeps data for 15 days by default, it's also not a good option if you need to store historical data (unless configured for remote storage).&lt;/p&gt;

&lt;p&gt;Datadog and Prometheus can be used for application performance monitoring (APM). However, Datadog has more application monitoring capabilities than Prometheus and is geared toward monitoring infrastructure at scale. Datadog is best for monitoring infrastructure and apps and visualizing data from disparate sources in mid to large-scale environments.&lt;/p&gt;

&lt;p&gt;Graphite runs well on all hardware and cloud infrastructure, making it suitable for small businesses with limited resources and large-scale production environments. Choose Graphite when you need a solution focused on storing and analyzing historical data and fast retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Prometheus is a popular option for tracking metrics and alerting, but one of the four alternatives mentioned above might suit your needs depending on your requirements.&lt;/p&gt;

&lt;p&gt;For processing large amounts of data, choose a commercial version of InfluxDB, but if you want the familiarity of an RDBMS engine, then go with Zabbix. Datadog's wide range of monitoring features makes it the go-to choice for monitoring infrastructure in larger environments. Still, if you operate on a smaller scale, Graphite can get the job done with whatever hardware and resources you have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://last9.io/" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Last9&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, a site reliability engineering (SRE) platform. We remove the guesswork in improving the reliability of your distributed systems. Last9's &lt;a href="https://last9.io/products/levitate" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;u&gt;Levitate&lt;/u&gt;&lt;/strong&gt;&lt;/a&gt;, a managed time series database(TSDB), helps you understand, track, and improve your organization's system dependencies to reduce the challenges of time series database management.&lt;/p&gt;

&lt;p&gt;Access the intelligence you need to deliver reliable software with Last9's reliability platform.&lt;/p&gt;




&lt;p&gt;This post was originally published on &lt;a href="https://last9.io/blog/prometheus-alternatives/" rel="noopener noreferrer"&gt;Last9 Blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>influxdb</category>
      <category>grafana</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>A practical guide for implementing SLO</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Thu, 12 Jan 2023 05:30:00 +0000</pubDate>
      <link>https://forem.com/last9/a-practical-guide-for-implementing-slo-1pej</link>
      <guid>https://forem.com/last9/a-practical-guide-for-implementing-slo-1pej</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z94m9u5yy8de9o985m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8z94m9u5yy8de9o985m.jpg" alt="A practical guide for implementing SLO" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a mini guide to the SLO process that SREs and DevOps teams can use as a rule of thumb. It does not necessarily automate the SLO process, but it gives a direction for using SLOs effectively.&lt;/p&gt;

&lt;p&gt;The process essentially involves three steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identify the level of the Service&lt;/li&gt;
&lt;li&gt;Identify the right type of the SLO&lt;/li&gt;
&lt;li&gt;Set the SLO Targets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving deep into it, let’s understand a few terminologies in the Site Reliability Engineering and Observability world.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLO Terminologies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Service Level Indicator(SLI)
&lt;/h3&gt;

&lt;p&gt;A Service Level Indicator (&lt;strong&gt;SLI&lt;/strong&gt;) is a measure of the service level provided by a service provider to a customer. It is a quantitative measure that captures key metrics, such as the percentage of successful requests or the percentage of requests completed within 200 milliseconds.&lt;/p&gt;
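
&lt;p&gt;As a sketch, the "requests completed within 200 milliseconds" SLI boils down to a simple ratio (the latency samples below are made up):&lt;/p&gt;

```python
import bisect

# Sketch: an SLI as the fraction of requests completed within 200 ms.
# The latency samples below are made up for illustration.
latencies_ms = [120, 90, 250, 180, 300, 150, 110, 95, 400, 130]

# Count the values at or under the 200 ms threshold via a sorted copy.
good = bisect.bisect_right(sorted(latencies_ms), 200)
sli = good / len(latencies_ms)
print(f"SLI: {sli:.0%}")  # prints: SLI: 70%
```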

&lt;h3&gt;
  
  
  Service Level Objective(SLO)
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://sre.google/sre-book/service-level-objectives/" rel="noopener noreferrer"&gt;Service Level objective&lt;/a&gt; is a codified way to define a goal for service behaviour using a Service Level indicator within a compliance target.&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Level Agreement(SLA)
&lt;/h3&gt;

&lt;p&gt;A &lt;a href="https://www.gartner.com/en/information-technology/glossary/sla-service-level-agreement" rel="noopener noreferrer"&gt;service level agreement&lt;/a&gt; defines the level of service expected by users in terms of customer experience. They also include penalties in case of agreement violation.&lt;/p&gt;

&lt;p&gt;Let’s go through the SLO process now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify the level of Service
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Customer-Facing Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A service running HTTP API, app, or gRPC workloads, where the caller expects an immediate response to the request it submits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stateful Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Services like a database. In a microservices environment where multiple services call the same database, it is common to mistake the database for not being a service itself. Try answering this straightforward question next time you are unable to decide:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My service HAS a database OR my service CALLS a database.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any service that does not respond with the result of the request, but instead queues it to be processed later. The only immediate response is an acknowledgment of whether the service accepted the task; the actual result is processed and made available later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Operational Services are usually internal to an organization and deal with jobs like reconciliation, infrastructure bring-up, tear-down, etc. These jobs are typically asynchronous, but with a greater focus on accuracy over throughput: the job may run late, but it must be as correct as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify the right type of the SLO
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Request Based SLO
&lt;/h3&gt;

&lt;p&gt;Request-based SLOs perform &lt;strong&gt;&lt;em&gt;some&lt;/em&gt;&lt;/strong&gt; aggregation of good &lt;strong&gt;requests&lt;/strong&gt; vs. the total number of &lt;strong&gt;requests&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, there is a notion of a &lt;strong&gt;Request&lt;/strong&gt;. A request is a single operation on a component that succeeds or fails in generic terms.&lt;/li&gt;
&lt;li&gt;Secondly, the SLIs must not be pre-aggregated, because request SLOs perform an aggregation over a period of time. One can't use pre-aggregated metrics (e.g., CloudWatch or Stackdriver, which directly return P99 latency rather than total requests and per-request latency) for request SLOs.&lt;/li&gt;
&lt;li&gt;Additionally, for low-traffic services, request SLOs can be noisy because they keep flapping even when a very small percentage of requests fail. E.g., if your throughput is 10 requests in a day, setting a 99% compliance target does not make sense, because a single failed request brings compliance down to 90%, depleting the error budget.&lt;/li&gt;
&lt;/ul&gt;
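
&lt;p&gt;The low-traffic problem is easy to see with a quick calculation (the numbers are illustrative):&lt;/p&gt;

```python
# Sketch of why request-based SLOs flap at low traffic: with only
# 10 requests in the compliance window, one failure costs 10 points.
def request_slo(good, total):
    return good / total

low_traffic = request_slo(good=9, total=10)         # one failure out of 10
high_traffic = request_slo(good=9999, total=10000)  # one failure out of 10,000
print(low_traffic, high_traffic)  # prints: 0.9 0.9999
```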

&lt;h3&gt;
  
  
  Window Based SLO
&lt;/h3&gt;

&lt;p&gt;A window-based SLO is a ratio of good &lt;strong&gt;time intervals&lt;/strong&gt; vs. total &lt;strong&gt;time intervals&lt;/strong&gt;. For some sources, per-request data is not available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For example,&lt;/strong&gt; In the case of a Kubernetes Cluster, the availability of a Cluster is the percentage of pods allocated vs. pods requested. Sometimes, you may not want to calculate the SLO as the overall performance of the service over a period of time.&lt;/p&gt;

&lt;p&gt;E.g., in the case of a payment service, even 2% of requests failing in a window of 5 minutes is unacceptable, because it is a business-critical service. Even though overall performance has not degraded, none of the payments in that 2% of requests was successful. Window-based SLOs are useful in such cases.&lt;/p&gt;
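
&lt;p&gt;A sketch of the window-based calculation (the window verdicts below are made up; in practice each window is judged against its own error-rate threshold):&lt;/p&gt;

```python
# Sketch of a window-based SLO: each 5-minute window is marked good
# (True) if it met its target; compliance is good windows over total.
# The verdicts below are made up for illustration.
window_results = [True, True, False, True, True, True]

compliance = sum(window_results) / len(window_results)
print(f"{compliance:.1%}")  # prints: 83.3%
```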

&lt;blockquote&gt;
&lt;p&gt;Using the above guidelines, we can create a rough flowchart to decide which type of SLO to choose depending on certain decision points.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjwob5n6zpcvrj6lr53.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjwob5n6zpcvrj6lr53.png" alt="A practical guide for implementing SLO" width="800" height="444"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;SLO Process&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Set the SLO Targets
&lt;/h2&gt;

&lt;p&gt;When you start thinking about setting objectives, some questions will arise:&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Should I set 99.999% from the start or be conservative?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start conservatively. Look at historical numbers and calculate your 9s, or dive right in with the lowest 9, such as 90%.&lt;/li&gt;
&lt;li&gt;The baseline of the service or historical data of the customer experience can be helpful in this case.&lt;/li&gt;
&lt;li&gt;Keep your systems running against this objective for a period of time and see whether the error budget depletes.&lt;/li&gt;
&lt;li&gt;If it does, improve your system’s stability. If it doesn’t, move up to the next rung of service reliability: from 90% go to 95%, then to 99%, and so on.&lt;/li&gt;
&lt;li&gt;Keep in mind Service Level Agreements, or SLAs, that you may have with customers or with third-party upstream services you depend on. You can’t commit to a compliance target higher than the SLA a third-party dependency gives you.&lt;/li&gt;
&lt;/ul&gt;
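
&lt;p&gt;It helps to translate each rung of the ladder into an error budget before committing to it. A quick sketch:&lt;/p&gt;

```python
# Sketch: the error budget (allowed "bad" minutes) implied by a
# compliance target over a window, rounded for readability.
def error_budget_minutes(target, window_days):
    return round((1 - target) * window_days * 24 * 60, 1)

print(error_budget_minutes(0.99, 30))   # 99% over 30 days -> 432.0 minutes
print(error_budget_minutes(0.999, 30))  # 99.9% over 30 days -> 43.2 minutes
```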

&lt;p&gt;&lt;u&gt;What should be the compliance window?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generally, this is 2x of your sprint window so that you can measure the performance of the service in a large enough duration to make an informed decision in the next sprint cycle on whether to focus on new features or maintenance.&lt;/li&gt;
&lt;li&gt;If you are not sure, start with a day and expand to a week. Remember that the longer your window, the longer the effects of a broken or recovered SLO persist.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;u&gt;How many ms should I set for latency?&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It depends. What kind of user experience are you aiming for? Is your application a payment gateway? Is it a batch processing system where real-time feedback isn’t important?&lt;/li&gt;
&lt;li&gt;To start out, measure your P50, and P99 latencies and initially give yourself some headroom and set your SLOs against P99 latency. Depending on the stability of your systems, use the same ladder-based approach as shown above and iterate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Service Level Objectives are not a silver bullet
&lt;/h2&gt;

&lt;p&gt;Let us take a simple scenario:&lt;/p&gt;

&lt;p&gt;A user makes a request to a web application hosted on Kubernetes served via a load balancer. The request flow is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftshbm6p5qy4eoxej5mvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftshbm6p5qy4eoxej5mvi.png" alt="A practical guide for implementing SLO" width="800" height="111"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Request Flow&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of setting a blind SLO on the load balancer and calling it a day, ask yourself the following questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where should I set the SLO - ALB, Ambassador, K8s, or all of them? Typically, SLOs are best set closest to the user, or on something that represents the end user’s experience. In the above example, one might want to set an SLO on the ALB, but if the same ALB is serving multiple backends, it might be a good idea to set the SLO on the next hop - Ambassador.&lt;/li&gt;
&lt;li&gt;If I set a latency SLO, what should be the right latency value? Look at baseline percentile numbers. Do you want to catch degradations of the P50 customer experience, the P95 customer experience, or a static number?&lt;/li&gt;
&lt;li&gt;Do I have the metrics I need to construct an SLI expression? AWS CloudWatch reports latency as pre-calculated P99 values, i.e., the data is pre-aggregated, so you cannot set request-based SLOs on it; you can only use window-based SLOs.&lt;/li&gt;
&lt;li&gt;Suppose you set an availability SLO on Ambassador with the expression &lt;code&gt;availability = 1 - (5xx / throughput)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;What happens if the Ambassador pod crashes on K8s and does not emit &lt;code&gt;5xx&lt;/code&gt; / &lt;code&gt;throughput&lt;/code&gt; signal?&lt;/li&gt;
&lt;li&gt;Does the expression become &lt;code&gt;availability = 1 - 0 / 0&lt;/code&gt;  or &lt;code&gt;availability = undefined&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;For a payment processing application, there might be a lag between the time at which the transaction was initiated v/s the time at which it was completed.&lt;/li&gt;
&lt;li&gt;How does &lt;code&gt;availability = 1 - (5xx / throughput)&lt;/code&gt; work now?&lt;/li&gt;
&lt;li&gt;How do I know whether a &lt;code&gt;5xx&lt;/code&gt; I got was for a request present in the current throughput, or a previous retry that failed?&lt;/li&gt;
&lt;/ul&gt;
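
&lt;p&gt;One way to handle the 0/0 edge case from the expression above is to treat "no throughput" as "no data" rather than as availability, as in this sketch:&lt;/p&gt;

```python
# Sketch of guarding the availability expression against the 0/0 case:
# when the pod emits no signal, throughput is 0 (or missing), and
# reporting 100% availability would be misleading.
def availability(errors_5xx, throughput):
    if not throughput:   # no data: return None and alert on it instead
        return None
    return 1 - (errors_5xx / throughput)

print(availability(2, 100))  # roughly 0.98
print(availability(0, 0))    # None - a no-data alert should fire here
```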

&lt;p&gt;This is not an exhaustive list of questions. Real-world scenarios will be complicated and that makes the task of setting achievable reliability targets involving multiple stakeholders and critical user journeys tricky.&lt;/p&gt;

&lt;h3&gt;
  
  
  So does this mean all hope is &lt;em&gt;SLOst&lt;/em&gt;?
&lt;/h3&gt;

&lt;p&gt;Of course not! SLOs are a way to gauge your system’s health and customer experience over a time period. But they are not the &lt;strong&gt;only&lt;/strong&gt; way. In the above scenario, one could:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set a request-based SLO on the Ambassador.&lt;/li&gt;
&lt;li&gt;Set an uptime window SLO or an alert that checks for no-data situations for signals that are always ≥ 0 e.g. Ambassador throughput.&lt;/li&gt;
&lt;li&gt;Set relevant alerts to catch pod crashes of the application.&lt;/li&gt;
&lt;li&gt;Set alerts on load balancer 5xx to catch scenarios where ALB had an issue and the request was not forwarded to the Ambassador backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Want to know more about Last9 and how we make using SLOs dead simple? Check out &lt;a href="http://last9.io/" rel="noopener noreferrer"&gt;last9.io&lt;/a&gt;; we're building SRE tools to make running systems at scale fun and &lt;strong&gt;embarrassingly easy&lt;/strong&gt;. &lt;strong&gt;🟢&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>slo</category>
      <category>deepdives</category>
      <category>last9engineering</category>
      <category>observability</category>
    </item>
    <item>
      <title>Watermelon Metrics</title>
      <dc:creator>Nishant Modak</dc:creator>
      <pubDate>Tue, 13 Jul 2021 04:38:42 +0000</pubDate>
      <link>https://forem.com/last9/watermelon-metrics-23pf</link>
      <guid>https://forem.com/last9/watermelon-metrics-23pf</guid>
      <description>&lt;p&gt;Watermelon Metrics; A situation where individual dashboards look green, but the overall performance is broken and red inside.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.last9.io/need-for-systems-observability/"&gt;https://blog.last9.io/need-for-systems-observability/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>monitoring</category>
      <category>sre</category>
    </item>
    <item>
      <title>Managing infra code ⚙️🛠🧰</title>
      <dc:creator>Prathamesh Sonpatki</dc:creator>
      <pubDate>Wed, 12 Aug 2020 16:52:58 +0000</pubDate>
      <link>https://forem.com/last9/managing-infra-code-43bp</link>
      <guid>https://forem.com/last9/managing-infra-code-43bp</guid>
      <description>&lt;p&gt;Do you care about the quality of your infra code?&lt;/p&gt;

&lt;p&gt;A. As much as product code&lt;br&gt;
B. Somewhat but mostly no&lt;br&gt;
C. We create infra via UI&lt;/p&gt;

&lt;p&gt;Let's discuss how you manage infra code! Feel free to share your thoughts in the comments section.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>sre</category>
      <category>devops</category>
      <category>poll</category>
    </item>
  </channel>
</rss>
