<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Catchpoint</title>
    <description>The latest articles on Forem by Catchpoint (@catchpoint).</description>
    <link>https://forem.com/catchpoint</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F11080%2F8b0483cf-31ab-4146-8ca4-96d86da83c4d.png</url>
      <title>Forem: Catchpoint</title>
      <link>https://forem.com/catchpoint</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/catchpoint"/>
    <language>en</language>
    <item>
      <title>Semantic Caching: What We Measured, Why It Matters</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 27 Oct 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/semantic-caching-what-we-measured-why-it-matters-802</link>
      <guid>https://forem.com/catchpoint/semantic-caching-what-we-measured-why-it-matters-802</guid>
      <description>&lt;p&gt;Semantic caching promises to make AI systems faster and cheaper by reducing duplicate calls to large language models (LLMs). But what happens when it doesn’t work as expected?&lt;/p&gt;

&lt;p&gt;We built a test environment to find out. Through a caching system, we evaluated how semantically similar queries would behave. When the cache worked, response times were fast. When it didn’t, things got expensive. In fact, a single semantic cache miss increased latency by more than 2.5x. These failures didn’t show up in the API logs. But they cost real time and real money.&lt;/p&gt;

&lt;p&gt;This blog shares our findings, why semantic caching matters, what causes it to fail, and how to monitor it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is semantic caching?
&lt;/h2&gt;

&lt;p&gt;Traditional caching stores responses based on exact matches. If you’ve asked the same question before—verbatim—you’ll get a fast answer from cache. But if you phrase it differently, even slightly, it’s treated as a brand-new request.&lt;/p&gt;

&lt;p&gt;Semantic caching is different. Instead of matching by text, it matches by meaning.&lt;/p&gt;

&lt;p&gt;It uses embeddings—numerical representations of the intent behind a query—to determine whether a new request is close enough to a previous one. If two queries are semantically similar, the system can return the same result without reprocessing it through the LLM.&lt;/p&gt;

&lt;p&gt;This is especially useful in AI systems where users might ask the same thing in dozens of ways. With semantic caching, “Who’s the President of the US?” and “Who runs America?” can trigger the same cached response—saving time, compute, and cost.&lt;/p&gt;
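&lt;p&gt;As a rough illustration (not the system measured in the lab), a semantic cache reduces to an embedding function plus a nearest-neighbor check against a similarity threshold. In this minimal sketch, the embed callable is a stand-in for a real embedding model:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Serve a cached response when a new query's embedding is close
    enough (score at or above the threshold) to one seen before."""

    def __init__(self, embed, threshold=0.85):
        self.embed = embed        # callable: query text -> vector
        self.threshold = threshold
        self.entries = []         # list of (embedding, response) pairs

    def lookup(self, query):
        vec = self.embed(query)
        score, response = max(
            ((cosine_similarity(vec, e), r) for e, r in self.entries),
            default=(0.0, None),
        )
        if score >= self.threshold:
            return "hit", score, response
        return "miss", score, None

    def store(self, query, response):
        self.entries.append((self.embed(query), response))
```

&lt;p&gt;With a real embedding model, “Who’s the President of the US?” and “Who runs America?” would map to nearby vectors and share one cached response.&lt;/p&gt;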

&lt;h2&gt;
  
  
  Why semantic caching matters in Agentic AI systems
&lt;/h2&gt;

&lt;p&gt;Agentic AI systems don’t just respond to commands—they plan, reason, and act across multiple steps. Each of those steps often involves an LLM call: retrieving documents, rephrasing responses, or deciding what to do next.&lt;/p&gt;

&lt;p&gt;The problem? LLM calls are expensive, especially when repeated across variations of the same question.&lt;/p&gt;

&lt;p&gt;Semantic caching offers a way out: instead of reprocessing every variation of a query, the system can reuse results from previous, similar requests—so long as the meaning is close enough.&lt;/p&gt;

&lt;p&gt;That’s where things get risky: in an agentic AI world, a silent failure in semantic caching doesn’t just mean a slower API call—it can derail entire multistep AI workflows. When semantic cache misses occur, queries go straight to the backend LLM—creating higher latency and skyrocketing costs. Worse yet, these failures are often invisible.  &lt;/p&gt;

&lt;p&gt;Your API returns a 200 OK, but behind the scenes, your cost and performance are suffering.&lt;/p&gt;

&lt;p&gt;Bottom line: Unlike traditional caching, semantic caches introduce new risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sudden model updates change embeddings and break matches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Vector drift causes cache misses even for similar queries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Users phrase things differently, leading to unexpected misses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we observed in the lab
&lt;/h2&gt;

&lt;p&gt;To experiment with how semantic caching affects user experience and infrastructure cost, we built a testing lab environment locally.  &lt;/p&gt;

&lt;p&gt;Here’s how it worked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We created a FastAPI application running locally and exposed its &lt;strong&gt;/search&lt;/strong&gt; endpoint via a public URL so it was reachable from outside.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every incoming search query followed this logic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cosmos DB Lookup&lt;/strong&gt; → Check whether we’ve seen a similar semantic query before, using the query text as the key.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✅ &lt;strong&gt;Cache Hit&lt;/strong&gt; → Return the cached vector embedding directly, saving time and cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;❌ &lt;strong&gt;Cache Miss&lt;/strong&gt; → Call Gemini Pro to generate a new vector embedding for the query, then store it in Cosmos DB for future reuse.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We sent the embedding to Azure AI Search, which performed a vector similarity search to find the top matching documents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we returned the search results plus two custom HTTP headers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;X-Semantic-Cache&lt;/strong&gt;: hit or miss&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;X-Semantic-Score&lt;/strong&gt; (e.g. 0.8543), indicating how close the new query is to previous ones&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
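&lt;p&gt;The flow above can be sketched in a few lines of framework-agnostic Python. The gemini_embed and vector_search functions below are placeholders invented for illustration; the real lab called Gemini Pro, Cosmos DB, and Azure AI Search:&lt;/p&gt;

```python
CACHE = {}  # query text -> embedding (stand-in for the Cosmos DB lookup)

def gemini_embed(query):
    """Placeholder for the Gemini Pro embedding call."""
    return [float(ord(c)) for c in query]

def vector_search(embedding):
    """Placeholder for the Azure AI Search vector similarity search."""
    return ["doc-1", "doc-2"]

def handle_search(q):
    """Core logic of the /search endpoint: check the cache, embed on a
    miss, then run the vector search and report hit/miss in headers."""
    if q in CACHE:
        embedding, state = CACHE[q], "hit"
    else:
        embedding = gemini_embed(q)
        CACHE[q] = embedding          # store for future reuse
        state = "miss"
    headers = {"X-Semantic-Cache": state, "X-Semantic-Score": "0.8543"}
    return {"results": vector_search(embedding)}, headers
```

&lt;p&gt;The first call for a query pays the embedding cost and reports a miss; a repeat of the same key reports a hit. (The score header is hard-coded here for brevity; the lab derived it from the similarity match.)&lt;/p&gt;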

&lt;h2&gt;
  
  
  Test results
&lt;/h2&gt;

&lt;p&gt;We configured a Catchpoint test to simulate user queries to our public endpoint URL. The test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Sent randomized prompts (e.g. “NYC weather,” “New York forecast”) to trigger both cache hits and misses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Captured and parsed the semantic headers using regex in Catchpoint to track:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cache efficiency (% of hits vs. misses)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Semantic similarity scores&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualized this data in dashboards to see latency differences between hits and misses.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
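&lt;p&gt;The header parsing can be reproduced with a couple of regular expressions. This hypothetical snippet extracts both headers from raw response-header text and computes cache efficiency over a batch of samples:&lt;/p&gt;

```python
import re

HEADER_RE = re.compile(r"X-Semantic-Cache:\s*(hit|miss)", re.IGNORECASE)
SCORE_RE = re.compile(r"X-Semantic-Score:\s*([\d.]+)")

def parse_sample(raw_headers):
    """Extract (cache_state, similarity_score) from raw header text."""
    state = HEADER_RE.search(raw_headers).group(1).lower()
    score = float(SCORE_RE.search(raw_headers).group(1))
    return state, score

def cache_efficiency(samples):
    """Percentage of samples that were cache hits."""
    states = [parse_sample(s)[0] for s in samples]
    return 100.0 * states.count("hit") / len(states)
```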

&lt;p&gt;This gave us real evidence of how semantic caching reduces API calls to the expensive LLM backend and improves response times—insights we could quantify directly in the Catchpoint &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM) portal.  &lt;/p&gt;

&lt;p&gt;Below are some of the insights from our testing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx49uwc0rlfdumaftisly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx49uwc0rlfdumaftisly.png" alt="A graph with a line and a lineDescription automatically generated" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cache Hits vs. Misses: Measurable Latency Gap&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The trend line in the graph above shows that, overall, response times were about 50% to 250% higher when the semantic cache returned a MISS than when it returned a HIT.&lt;/p&gt;

&lt;p&gt;Diving deeper, we observed that the first run of a prompt went to the backend (a cache miss), with higher latency and costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ao4nts5604pjhk505v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F32ao4nts5604pjhk505v.png" alt="A screenshot of a computerDescription automatically generated" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;First Run: Cache Miss with High Latency&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The second run of the same semantic query hit the cache, cutting response times by 40%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvmsw0jds9s2haxkzw6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvmsw0jds9s2haxkzw6e.png" alt="A screenshot of a computerDescription automatically generated" width="800" height="652"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w3wg4ano2he2e329ck1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w3wg4ano2he2e329ck1.png" width="800" height="68"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Second Run: Served from Cache (hit). Same semantic score (prompt matched). Response time is 50% lower&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring strategies to make semantic caching reliable
&lt;/h2&gt;

&lt;p&gt;Semantic caching is no longer a background optimization—it’s a core pillar of agentic AI systems that reason and act in real-time. But to trust it, we need to measure how well it’s working.  &lt;/p&gt;

&lt;p&gt;Here are three ways to monitor and improve its reliability:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Test semantically similar queries
&lt;/h3&gt;




&lt;p&gt;Semantic caching lives or dies on how well it matches similar questions. Use synthetic monitoring to simulate different phrasings of the same intent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;“Who’s the President of the US?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Who runs the US government?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Commander-in-Chief of America?”&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then compare their outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Did they result in cache hits or misses?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What were the semantic similarity scores?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you visibility into whether your caching system is recognizing intent consistently. For agentic AI, it’s not enough for one query to be fast. You need confidence that &lt;em&gt;all&lt;/em&gt; user expressions of the same intent are covered.&lt;/p&gt;
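&lt;p&gt;A synthetic check of this kind can be sketched as a loop over phrasings; probe below is a hypothetical stand-in for the HTTP call that returns the cache-state and score headers:&lt;/p&gt;

```python
PHRASINGS = [
    "Who's the President of the US?",
    "Who runs the US government?",
    "Commander-in-Chief of America?",
]

def check_intent_coverage(probe, phrasings=PHRASINGS):
    """probe(text) -> (cache_state, similarity_score). Returns per-phrasing
    outcomes and whether every phrasing of the intent hit the cache."""
    results = {p: probe(p) for p in phrasings}
    all_hit = all(state == "hit" for state, _ in results.values())
    return results, all_hit
```

&lt;p&gt;A single uncovered phrasing flips the coverage flag, which is exactly the signal you want before users find the gap for you.&lt;/p&gt;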

&lt;h3&gt;
  
  
  2. Track semantic similarity scores
&lt;/h3&gt;




&lt;p&gt;Semantic caches often return a similarity score (e.g. 0.85) indicating how close a new query is to an existing cached answer. If your cache system exposes that score, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Monitor it over time&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualize trends&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alert if scores drop below thresholds&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For instance, in our tests, both requests returned the same semantic score of 0.85224825.  &lt;/p&gt;

&lt;p&gt;But if the model changes or query phrasing drifts, scores could drop, leading to unexpected misses and rising costs.&lt;/p&gt;

&lt;p&gt;Monitoring these numbers ensures your semantic cache stays reliable—and that you’re not wasting money on backend calls unnecessarily.&lt;/p&gt;
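&lt;p&gt;For example, a rolling-average alert on the score is straightforward to sketch (the window size and 0.80 threshold below are illustrative, not recommendations):&lt;/p&gt;

```python
from collections import deque

class ScoreMonitor:
    """Track recent similarity scores and flag when the rolling
    average drops below an alert threshold."""

    def __init__(self, window=10, alert_below=0.80):
        self.scores = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, score):
        """Add one observation; return True when an alert should fire."""
        self.scores.append(score)
        return self.alert_below > self.average()

    def average(self):
        return sum(self.scores) / len(self.scores)
```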

&lt;h3&gt;
  
  
  3. Measure real-world latency differences
&lt;/h3&gt;




&lt;p&gt;One of the biggest promises of semantic caching is speed. Cache hits should be significantly faster than misses.&lt;/p&gt;

&lt;p&gt;Monitoring this lets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Split metrics for cache hits vs. misses&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Show precise latency differences&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alert when cache misses cause slowdowns&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From our test results:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cache miss response time: ~5 seconds&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cache hit response time: ~2 seconds&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s a 2.5x speedup. In the world of agentic AI, that gap is the difference between a seamless conversation—and a frustrating pause.&lt;/p&gt;
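&lt;p&gt;Given per-request samples tagged with the cache-state header, the hit/miss split reduces to a small aggregation, sketched here:&lt;/p&gt;

```python
def _avg(xs):
    return sum(xs) / len(xs)

def latency_split(samples):
    """samples: list of (cache_state, seconds) pairs. Returns the average
    latency per state and the miss-over-hit slowdown factor."""
    hits = [t for state, t in samples if state == "hit"]
    misses = [t for state, t in samples if state == "miss"]
    return {"hit": _avg(hits), "miss": _avg(misses),
            "slowdown": _avg(misses) / _avg(hits)}
```

&lt;p&gt;Feeding in the numbers above (roughly 5 s misses and 2 s hits) yields the 2.5x factor reported.&lt;/p&gt;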

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;Semantic caching isn’t just a nice-to-have—it’s becoming core infrastructure for real-time AI systems. Cloud leaders like Fastly, AWS, and Azure are already baking it into their architectures. But it’s also uniquely fragile. Changes in language, embedding drift, or model updates can quietly degrade performance.&lt;/p&gt;

&lt;p&gt;By combining semantic caching with IPM, teams can ensure that their systems are not only fast—but reliably so.&lt;/p&gt;

&lt;p&gt;If you're running AI agents at scale, silent cache failures aren't just inefficiencies. They're risks. Measure them, monitor them, and mitigate them.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn more:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.catchpoint.com/solutions/agentic-ai-resilience" rel="noopener noreferrer"&gt;&lt;strong&gt;Agentic AI Resilience Monitoring&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.catchpoint.com/solutions/monitor-ai-assistant" rel="noopener noreferrer"&gt;&lt;strong&gt;AI Assistant Performance Monitoring&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
    </item>
    <item>
      <title>Here’s the proof: What the fastest sites on the web have in common</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 20 Oct 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/heres-the-proof-what-the-fastest-sites-on-the-web-have-in-common-4081</link>
      <guid>https://forem.com/catchpoint/heres-the-proof-what-the-fastest-sites-on-the-web-have-in-common-4081</guid>
      <description>&lt;p&gt;60% of Gen Z won’t engage with a slow-loading website. In today’s digital economy, that’s a deal-breaker. Whether it’s a banking portal, a travel app, or an AI-powered SaaS platform, &lt;a href="https://www.catchpoint.com/statistics" rel="noopener noreferrer"&gt;users expect performance&lt;/a&gt;. Instant loading, global reliability, and smooth interactivity aren’t just nice to have—they define the winners.&lt;/p&gt;

&lt;p&gt;At Catchpoint, we’ve spent the past several years benchmarking the web performance of the world’s most recognizable brands—from HSBC and Google Flights to Salesforce and Nike. Our industry benchmark reports span sectors including &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/learn/gen-ai-benchmark" rel="noopener noreferrer"&gt;GenAI&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/airline-website-performance-benchmark-report" rel="noopener noreferrer"&gt;airlines&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/hotel-website-performance-benchmark-report" rel="noopener noreferrer"&gt;hotels&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/travel-website-performance-benchmark-report" rel="noopener noreferrer"&gt;travel aggregators&lt;/a&gt;, and &lt;a href="https://www.catchpoint.com/asset/athletic-footwear-website-performance-benchmark-report" rel="noopener noreferrer"&gt;athletic footwear and apparel&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Each report is powered by our industry-leading &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;IPM platform&lt;/a&gt; and &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;Global Agent Network&lt;/a&gt; of more than 3,000 agents worldwide. This data-driven approach gives us a uniquely authoritative view into what separates the fastest sites on the web from the rest.&lt;/p&gt;

&lt;p&gt;This blog distills the key findings from our latest benchmark reports: six traits the fastest sites share, and how organizations in any industry can apply them to build faster, more resilient digital experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#1. Lightning-fast infrastructure: DNS &amp;amp; TTFB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One common trait is a rock-solid network foundation – &lt;a href="https://www.catchpoint.com/asset/the-need-for-speed" rel="noopener noreferrer"&gt;quick DNS lookups&lt;/a&gt; and rapid server responses. The fastest sites minimize any lag between a user’s request and the first byte of data returned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized DNS:&lt;/strong&gt; Across industries, the fastest sites keep DNS resolution times extremely low—often under 50 ms. A quick DNS lookup shaves valuable milliseconds off the initial request and helps the page begin loading faster.&lt;/p&gt;

&lt;p&gt;For instance, in our 2024 benchmark of &lt;a href="https://www.catchpoint.com/asset/athletic-footwear-website-performance-benchmark-report" rel="noopener noreferrer"&gt;athletic footwear and apparel brands&lt;/a&gt;, top performers like Kappa and Under Armour resolved DNS in under 1 ms, while others such as Skechers and On remained comfortably below the 50 ms threshold. At the other end of the spectrum, slower-loading sites like Nike and VEJA saw DNS times of 169–187 ms—over 3× slower than the best performers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eppmkx96f5ilka9l2ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eppmkx96f5ilka9l2ad.png" alt="A graph of blue barsAI-generated content may be incorrect., Picture" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This early delay often signals broader infrastructure inefficiencies. In nearly every sector we’ve analyzed, lower DNS times strongly correlate with faster overall load speeds.  &lt;/p&gt;
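To see where your own DNS stands, a quick spot check is easy to script. The sketch below is a rough illustration (a hypothetical helper, not how Catchpoint's agents measure): it times a single resolution through the OS stub resolver, so a warm OS or resolver cache can make repeat lookups look near-instant.

```python
import socket
import time

def dns_lookup_ms(hostname: str) -> float:
    """Time one DNS resolution for `hostname` via the OS stub resolver, in ms."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, 443)  # resolves A/AAAA records for the host
    return (time.perf_counter() - start) * 1000.0
```

Run it a few times against your own domain from different networks; if the first (uncached) lookup regularly lands above ~50 ms, DNS is a good place to start optimizing.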

&lt;p&gt;&lt;strong&gt;Snappy TTFB:&lt;/strong&gt; Likewise, elite sites boast backend servers that respond in a heartbeat. The top two &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt; websites all returned server responses in under 200 ms, providing a swift handoff to the browser.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcohrdrx1jgykqgz7wtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcohrdrx1jgykqgz7wtm.png" alt="A graph of blue squaresAI-generated content may be incorrect., Picture" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Banking Website Performance Benchmark report 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But in the athletic apparel space, only Puma and Under Armour hit that bar (101 ms and 107 ms, respectively).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbmjmaw4ujz24k1juuvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbmjmaw4ujz24k1juuvg.png" alt="A graph of blue bars with white textAI-generated content may be incorrect., Picture" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many others struggled: Nike, VEJA, and Kappa all had TTFB exceeding 950 ms, with Nike peaking at 1.1 seconds—more than 5× slower than the recommended threshold.&lt;/p&gt;

&lt;p&gt;This kind of server-side delay often negates gains made elsewhere, like in front-end optimization or CDN use. Across all verticals we benchmarked, fast TTFB was a strong predictor of high overall performance.&lt;/p&gt;
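A crude TTFB check can be scripted the same way. This sketch (again a hypothetical helper, not a monitoring agent) times from sending the request to receiving the status line; because `http.client` opens the connection lazily, the number also folds in DNS, TCP, and TLS setup, which a proper agent would break out into separate phases.

```python
import http.client
import time

def ttfb_ms(host: str, path: str = "/") -> float:
    """Rough time-to-first-byte: request sent -> status line received.

    Note: http.client connects lazily, so this figure also includes
    DNS, TCP, and TLS setup time.
    """
    conn = http.client.HTTPSConnection(host, timeout=10)
    try:
        start = time.perf_counter()
        conn.request("GET", path)
        conn.getresponse()  # returns once the status line and headers arrive
        return (time.perf_counter() - start) * 1000.0
    finally:
        conn.close()
```

Even a rough number like this makes the sub-200 ms bar concrete: if this measurement sits near a full second, no amount of front-end polish will hide it.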

&lt;h2&gt;
  
  
  &lt;strong&gt;#2. Globally distributed delivery&lt;/strong&gt;  
&lt;/h2&gt;

&lt;p&gt;The fastest websites don’t just perform well in one country – they invest in global delivery infrastructure, so users everywhere get quick load times. Our data shows that sites leveraging Content Delivery Networks (CDNs) and regional optimizations avoid the huge geographic performance gaps that plague others:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Localized CDNs&lt;/strong&gt;: Leading sites deploy globally distributed servers and CDN endpoints to bring content closer to users. In &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt;, for example, the highest performers (UBS, ING, HSBC, etc.) achieved consistent experiences worldwide by using localized infrastructure and optimized DNS routing. In contrast, sites without regional presence suffered severe slowdowns – in Africa and South America, some banks had page load times double those in North America. (One test showed 10.7 s waits in Africa vs ~5 s in the U.S.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiodsxfsexe4tpk4qisp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiodsxfsexe4tpk4qisp3.png" alt="A graph of blue squaresAI-generated content may be incorrect., Picture" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SaaS Website Performance Benchmark Report 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent speed everywhere:&lt;/strong&gt; The data confirms that being fast everywhere is part of being a fastest-in-class website. While lower-ranked sites still show wide regional performance gaps, top performers prove it’s possible to deliver fast, reliable experiences globally—though Africa remains a challenge for all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5if2pubk8rygruww0e2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5if2pubk8rygruww0e2.jpeg" alt="Regional disparities in document complete times across continents, showing Africa significantly lagging behind other regions, Picture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Regional disparities in document complete times across continents in the 2025 Banking Report, showing Africa significantly lagging behind other regions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b0xkjttqm5yhg84apie.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b0xkjttqm5yhg84apie.jpeg" alt="Performance ratios reveal dramatic disparities: Africa and South America take more than double the time to complete documents compared to North America, Picture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance ratios reveal dramatic disparities: In the 2025 Banking Report, Africa and South America take more than double the time to complete documents compared to North America&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#3. Lean pages, fast loads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Another hallmark of the web’s fastest sites is lightweight, efficiently coded pages that reach full load in just a few seconds. They avoid the bloat and complexity that drag others down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blazing-fast loads:&lt;/strong&gt; The elite websites consistently fully load in ~2–3 seconds or less. In banking, only 25% of sites met the ideal of &amp;lt;3 s document complete, but every top-ranked bank did. In fact, the best performers in finance all loaded in under 2 s while maintaining ~99.9% uptime. Similarly, across other industries, top sites hit very low page load times – travel sites like Skyscanner achieved sub-0.7 s Largest Contentful Paint and mere ~1.2 s full loads, indicating extremely fast content delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal baggage:&lt;/strong&gt; In contrast, slower sites tend to be weighed down by heavy content and excess scripts. Many well-known brands ranked outside the top tier due to “heavy content and front-end complexity.” These bulky pages took 5–9 seconds or more to load – or, in the case of Marriott below, over 10 seconds.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin16xgyegk1zj6sba3d3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin16xgyegk1zj6sba3d3.png" alt="A screenshot of a graphAI-generated content may be incorrect., Picture" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hotel Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;The takeaway: fastest sites keep their pages lean – optimizing images, compressing files, and capping third-party elements to avoid sluggish load times.&lt;/p&gt;

&lt;p&gt;By keeping page weight low and complexity in check, top performers ensure users aren’t left staring at blank screens. Fast complete load times not only improve user satisfaction but also reduce bounce rates – a win-win for experience and business metrics.  &lt;/p&gt;
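Keeping pages lean starts with knowing what is on them. The audit sketch below is illustrative only (the helper name, the 1,500 KB budget, and the example hosts are assumptions, not recommendations from our reports); in practice you would feed it (url, bytes) entries from a HAR export.

```python
from urllib.parse import urlparse

def audit_page_weight(resources, first_party_host, budget_kb=1500):
    """Flag heavy pages and third-party hosts from a (url, bytes) list.

    `resources` would typically come from a HAR export; `budget_kb` is an
    illustrative threshold.
    """
    total_kb = sum(size for _, size in resources) / 1024
    third_party = sorted({
        urlparse(url).hostname
        for url, _ in resources
        if urlparse(url).hostname != first_party_host
    })
    return {
        "total_kb": total_kb,
        "over_budget": total_kb > budget_kb,
        "third_party_hosts": third_party,
    }
```

Wiring a check like this into CI is one way to enforce the "cap third-party elements" habit instead of rediscovering the bloat after launch.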

&lt;h2&gt;
  
  
  &lt;strong&gt;#4. Stellar front-end optimization &amp;amp; core web vitals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Beyond just loading quickly, the fastest websites also feel fast and stable to users. They achieve this through meticulous front-end optimization, evidenced by outstanding Core Web Vitals metrics like Largest Contentful Paint (LCP) and Cumulative Layout Shift (CLS):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast content rendering (low LCP):&lt;/strong&gt; Top sites prioritize delivering meaningful content immediately. In Catchpoint’s travel industry test, all of the 10 fastest sites had LCP well within recommended standards – many under ~1 second. For example, Skyscanner’s LCP came in around 613 ms, with even the 10th-best still near 1.1 s – comfortably beating the ~2.5 s guideline. This means users see the page’s main content almost instantly on these sites.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2l4tfm1b5su2m9fzpev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2l4tfm1b5su2m9fzpev.png" alt="A graph of a documentAI-generated content may be incorrect., Picture" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Travel Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Slower sites, on the other hand, often struggle with LCP above the 2.5 s mark, making them feel sluggish even if the total load isn’t horrific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable layouts (no janky moves):&lt;/strong&gt; A striking data point – the top 10 travel websites all had a perfect CLS score of 0.00. Likewise, the best banks and retail sites exhibited virtually no unexpected layout shifts during load. In practice, this means no annoying page jumps (caused by late-loading ads or images) for the user.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpfgs3onehmqxfz1b94n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpfgs3onehmqxfz1b94n.png" alt="A graph of blue bars with white textAI-generated content may be incorrect., Picture" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Airline Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fast sites achieve this by reserving space for images/media and loading content in a controlled way. By contrast, lesser performers saw CLS values up to 0.8–1.5, causing noticeable “jank” and hurting user experience.&lt;/p&gt;

&lt;p&gt;It’s clear that front-end experience is the real differentiator among websites. Sites that not only load quickly but also render smoothly (low LCP) and stay visually stable (low CLS) ended up at the top of Catchpoint’s rankings. These Core Web Vitals were highly correlated with overall performance scores, underscoring that fastest sites focus heavily on user-centric front-end metrics.  &lt;/p&gt;
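It helps to remember that CLS is not a simple sum of every shift: per the Core Web Vitals definition, it is the largest "session window" of shifts, where shifts less than 1 s apart are grouped until the window spans 5 s. A small sketch of that aggregation (the function and its inputs are illustrative, not our measurement code):

```python
def cls_score(shifts, max_window_s=5.0, max_gap_s=1.0):
    """Cumulative Layout Shift: the largest 'session window' of shifts.

    `shifts` is a list of (timestamp_seconds, layout_shift_score) pairs.
    A window keeps growing while shifts arrive less than 1 s apart and
    the window spans less than 5 s; CLS is the biggest window total.
    """
    best = current = 0.0
    window_start = prev_time = None
    for t, score in sorted(shifts):
        new_window = (
            prev_time is None
            or t - prev_time > max_gap_s
            or t - window_start > max_window_s
        )
        if new_window:
            window_start, current = t, 0.0
        current += score
        prev_time = t
        best = max(best, current)
    return best
```

This windowing is why one late-loading ad can dominate the score: a single burst of shifts counts in full, no matter how calm the rest of the load was.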

&lt;h2&gt;
  
  
  &lt;strong&gt;#5. High availability without sacrificing speed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Leading websites understand that availability and speed must go hand in hand. It’s not enough to be up 24/7 if your site is slow or error-prone when loaded. The fastest sites manage to deliver near-perfect uptime and quick performance simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Near-perfect uptime:&lt;/strong&gt; The majority of top performers maintain 99.9% or better availability. In the 2025 Banking benchmark, 75% of sites had at least 99.9% uptime – and all of the top finishers were essentially always-online. In fact, the #1 sites in multiple industries achieved ≈100% uptime during the testing period. This level of reliability ensures users rarely encounter errors or downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22933jp3ffcskn7iuyzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22933jp3ffcskn7iuyzz.png" alt="A blue rectangular object with black textAI-generated content may be incorrect., Picture" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;2025 Banking Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability and speed:&lt;/strong&gt; Crucially, the best sites don’t treat uptime as a safety net for poor performance – availability alone is no longer enough. Several lower-ranked companies across industries had 99.9% uptime yet still fell behind due to slow pages or unstable front ends. Top sites avoid this trap by pairing high availability with fast, stable loads: they run resilient infrastructure to stay online, and they proactively monitor performance so that being “up” always means delivering a quality experience. Conversely, a few sites had serious outages (dipping to ~90% uptime) that knocked them out of contention.&lt;/p&gt;

&lt;p&gt;The bottom line: reliability is foundational, but true excellence comes from reliability + speed. The fastest websites prioritize both. Users expect a site to be available and to respond instantly – leading sites deliver on that expectation consistently.  &lt;/p&gt;
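It also helps to translate uptime percentages into concrete downtime budgets. A one-liner (assuming a 30-day month) makes the gap between "three nines" and a real outage vivid:

```python
def downtime_budget_minutes(uptime_pct: float, period_days: int = 30) -> float:
    """Minutes of allowed downtime per period at a given uptime percentage."""
    return period_days * 24 * 60 * (1 - uptime_pct / 100)

# 99.9% uptime still permits ~43 minutes of downtime a month,
# while a site dipping to ~90% is dark for roughly three full days.
```

Seen this way, the difference between the 99.9% pack and the ~90% outliers isn't a rounding error; it's minutes versus days of lost users.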

&lt;h2&gt;
  
  
  &lt;strong&gt;#6. Culture of continuous optimization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speed isn’t a one-time project. The fastest websites treat performance as an ongoing discipline—one rooted in measurement, iteration, and readiness for peak demand.&lt;/p&gt;

&lt;p&gt;Instacart is a clear example. While their average TTFB during the &lt;a href="https://www.catchpoint.com/blog/why-super-bowl-2025-was-a-triumph-for-internet-resilience" rel="noopener noreferrer"&gt;Super Bowl benchmark&lt;/a&gt; period was a sluggish 1,051 ms, that changed on game day. On February 9th, when their Super Bowl ad aired, Instacart slashed TTFB significantly—demonstrating that with the right optimizations, even high-traffic moments can be performance wins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhlnkd5kvwfasf97tof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhlnkd5kvwfasf97tof.png" alt="A graph showing a line and a red circleAI-generated content may be incorrect., Picture" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Instacart’s TTFB improvement on the day of the Super Bowl&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This kind of readiness doesn’t happen by accident. Across all industries, one of the strongest shared habits of high-performing sites is this culture of continuous optimization. When performance is treated as a living system—not just a launch checklist—organizations are better equipped to handle traffic spikes, user expectations, and the unexpected.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final thoughts: speed is strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The patterns are clear. The fastest sites on the web invest in performance at every level: from back-end infrastructure to front-end polish, from global delivery to rigorous uptime, all under a philosophy of ongoing improvement. These sites prove that it’s possible to be both fast and reliable – delighting users with sub-3-second loads, silky-smooth pages, and dependable service. Meanwhile, competitors that neglect these areas struggle with slow load times, unstable pages, or regional outages, putting them at a serious disadvantage.&lt;/p&gt;

&lt;p&gt;The good news is that these best practices are well-understood. As our benchmark reports show, the top performers simply execute them better – and reap the rewards in user experience. By adopting the common habits of the fastest websites, including smart DNS/CDN usage, lightweight design, and Core Web Vitals focus, any website can dramatically improve its speed. After all, as the data reminds us, “it’s no longer enough to be available; you have to be fast everywhere”. The fastest sites have set the bar – and they invite others to catch up.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Explore the full benchmark series&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;See how your industry stacks up: &lt;a href="https://www.catchpoint.com/resource-library?*=benchmark+report" rel="noopener noreferrer"&gt;view all benchmark reports&lt;/a&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;60% of Gen Z won’t engage with a slow-loading website. In today’s digital economy, that’s a deal-breaker. Whether it’s a banking portal, a travel app, or an AI-powered SaaS platform, &lt;a href="https://www.catchpoint.com/statistics" rel="noopener noreferrer"&gt;users expect performance&lt;/a&gt;. Instant loading, global reliability, and smooth interactivity aren’t just nice to have—they define the winners.&lt;/p&gt;

&lt;p&gt;At Catchpoint, we’ve spent the past several years benchmarking the web performance of the world’s most recognizable brands—from HSBC and Google Flights to Salesforce and Nike. Our industry benchmark reports span sectors including &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/learn/gen-ai-benchmark" rel="noopener noreferrer"&gt;GenAI&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/airline-website-performance-benchmark-report" rel="noopener noreferrer"&gt;airlines&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/hotel-website-performance-benchmark-report" rel="noopener noreferrer"&gt;hotels&lt;/a&gt;, &lt;a href="https://www.catchpoint.com/asset/travel-website-performance-benchmark-report" rel="noopener noreferrer"&gt;travel aggregators&lt;/a&gt;, and &lt;a href="https://www.catchpoint.com/asset/athletic-footwear-website-performance-benchmark-report" rel="noopener noreferrer"&gt;athletic footwear and apparel&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Each report is powered by our industry-leading &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;IPM platform&lt;/a&gt; and &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;Global Agent Network&lt;/a&gt; of more than 3,000 agents worldwide. This data-driven approach gives us a uniquely authoritative view into what separates the fastest sites on the web from the rest.&lt;/p&gt;

&lt;p&gt;This blog distils the key findings from our latest benchmark reports: six traits the fastest sites share, and how organizations across any industry can apply them to build faster, more resilient digital experiences.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#1. Lightning-fast infrastructure: DNS &amp;amp; TTFB&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One common trait is a rock-solid network foundation – &lt;a href="https://www.catchpoint.com/asset/the-need-for-speed" rel="noopener noreferrer"&gt;quick DNS lookups&lt;/a&gt; and rapid server responses. The fastest sites minimize any lag between a user’s request and the first byte of data returned:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimized DNS:&lt;/strong&gt; Across industries, the fastest sites keep DNS resolution times extremely low—often under 50 ms. A quick DNS lookup shaves valuable milliseconds off the initial request and helps the page begin loading faster.&lt;/p&gt;

&lt;p&gt;For instance, in our 2024 benchmark of &lt;a href="https://www.catchpoint.com/asset/athletic-footwear-website-performance-benchmark-report" rel="noopener noreferrer"&gt;athletic footwear and apparel brands&lt;/a&gt;, top performers like Kappa and Under Armour resolved DNS in under 1 ms, while others such as Skechers and On remained comfortably below the 50 ms threshold. At the other end of the spectrum, slower-loading sites like Nike and VEJA saw DNS times of 169–187 ms—over 3× slower than the best performers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eppmkx96f5ilka9l2ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5eppmkx96f5ilka9l2ad.png" alt="A graph of blue barsAI-generated content may be incorrect., Picture" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This early delay often signals broader infrastructure inefficiencies. In nearly every sector we’ve analyzed, lower DNS times strongly correlate with faster overall load speeds.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snappy TTFB:&lt;/strong&gt; Likewise, elite sites boast backend servers that respond in a heartbeat. The top two &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt; websites all returned server responses in under 200 ms, providing a swift handoff to the browser.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcohrdrx1jgykqgz7wtm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcohrdrx1jgykqgz7wtm.png" alt="A graph of blue squaresAI-generated content may be incorrect., Picture" width="800" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Banking Website Performance Benchmark report 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But in the athletic apparel space, only Puma and Under Armour hit that bar (101 ms and 107 ms, respectively).  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbmjmaw4ujz24k1juuvg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjbmjmaw4ujz24k1juuvg.png" alt="A graph of blue bars with white textAI-generated content may be incorrect., Picture" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many others struggled: Nike, VEJA, and Kappa all had TTFB exceeding 950 ms, with Nike peaking at 1.1 seconds—more than 5× slower than the recommended threshold.&lt;/p&gt;

&lt;p&gt;This kind of server-side delay often negates gains made elsewhere, like in front-end optimization or CDN use. Across all verticals we benchmarked, fast TTFB was a strong predictor of high overall performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#2. Globally distributed delivery&lt;/strong&gt;  
&lt;/h2&gt;

&lt;p&gt;The fastest websites don’t just perform well in one country – they invest in global delivery infrastructure, so users everywhere get quick load times. Our data shows that sites leveraging Content Delivery Networks (CDNs) and regional optimizations avoid the huge geographic performance gaps that plague others:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Localized CDNs&lt;/strong&gt;: Leading sites deploy globally distributed servers and CDN endpoints to bring content closer to users. In &lt;a href="https://www.catchpoint.com/asset/banking-website-performance-benchmark-report-2025" rel="noopener noreferrer"&gt;banking&lt;/a&gt;, for example, the highest performers (UBS, ING, HSBC, etc.) achieved consistent experiences worldwide by using localized infrastructure and optimized DNS routing. In contrast, sites without regional presence suffered severe slowdowns – in Africa and South America, some banks had page load times double those in North America. (One test showed 10.7 s waits in Africa vs ~5 s in the U.S.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiodsxfsexe4tpk4qisp3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiodsxfsexe4tpk4qisp3.png" alt="A graph of blue squaresAI-generated content may be incorrect., Picture" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SaaS Website Performance Benchmark Report 2025&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistent speed everywhere:&lt;/strong&gt; The data confirms that being fast everywhere is part of being a fastest-in-class website. While lower-ranked sites still show wide regional performance gaps, top performers prove it’s possible to deliver fast, reliable experiences globally—though Africa remains a challenge for all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5if2pubk8rygruww0e2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv5if2pubk8rygruww0e2.jpeg" alt="Regional disparities in document complete times across continents, showing Africa significantly lagging behind other regions, Picture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Regional disparities in document complete times across continents in the 2025 Banking Report, showing Africa significantly lagging behind other regions&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b0xkjttqm5yhg84apie.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3b0xkjttqm5yhg84apie.jpeg" alt="Performance ratios reveal dramatic disparities: Africa and South America take more than double the time to complete documents compared to North America, Picture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Performance ratios reveal dramatic disparities: In the 2025 Banking Report, Africa and South America take more than double the time to complete documents compared to North America&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#3. Lean pages, fast Loads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Another hallmark of the web’s fastest sites is lightweight, efficiently coded pages that reach full load in just a few seconds. They avoid the bloat and complexity that drag others down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blazing-fast loads:&lt;/strong&gt; The elite websites consistently fully load in ~2–3 seconds or less. In banking, only 25% of sites met the ideal of &amp;lt;3 s document complete, but every top-ranked bank did. In fact, the best performers in finance all loaded in under 2 s while maintaining ~99.9% uptime. Similarly, across other industries, top sites hit very low page load times – travel sites like Skyscanner achieved sub-0.7 s Largest Contentful Paint and mere ~1.2 s full loads, indicating extremely fast content delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal baggage:&lt;/strong&gt; In contrast, slower sites tend to be weighed down by heavy content and excess scripts. Many well-known brands ranked outside the top tier due to “heavy content and front-end complexity.” These bulky pages took over 5–9 seconds to load, or in the case of Marriot below, over 10 seconds.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin16xgyegk1zj6sba3d3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin16xgyegk1zj6sba3d3.png" alt="A screenshot of a graphAI-generated content may be incorrect., Picture" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hotel Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;p&gt;The takeaway: fastest sites keep their pages lean – optimizing images, compressing files, and capping third-party elements to avoid sluggish load times.&lt;/p&gt;

&lt;p&gt;By keeping page weight low and complexity in check, top performers ensure users aren’t left staring at blank screens. Fast complete load times not only improve user satisfaction but also reduce bounce rates – a win-win for experience and business metrics.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#4. Stellar front-end optimization &amp;amp; core web vitals&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Beyond just loading quickly, the fastest websites also feel fast and stable to users. They achieve this through meticulous front-end optimization, evidenced by outstanding Core Web Vitals metrics like Largest Contentful Paint (LCP) and Cumulative Layout Shift (CLS):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fast content rendering (low LCP):&lt;/strong&gt; Top sites prioritize delivering meaningful content immediately. In Catchpoint’s travel industry test, all of the 10 fastest sites had LCP well within recommended standards – many under ~1 second. For example, Skyscanner’s LCP came in around 613 ms, with even the 10th-best still near 1.1 s – comfortably beating the ~2.5 s guideline. This means users see the page’s main content almost instantly on these sites.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2l4tfm1b5su2m9fzpev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2l4tfm1b5su2m9fzpev.png" alt="A graph of a documentAI-generated content may be incorrect., Picture" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Travel Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Slower sites, on the other hand, often struggle with LCP above the 2.5 s mark, making them feel sluggish even if the total load isn’t horrific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stable layouts (no janky moves):&lt;/strong&gt; A striking data point – the top 10 travel websites all had a perfect CLS score of 0.00. Likewise, the best banks and retail sites exhibited virtually no unexpected layout shifts during load. In practice, this means no annoying page jumps (caused by late-loading ads or images) for the user.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpfgs3onehmqxfz1b94n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffpfgs3onehmqxfz1b94n.png" alt="A graph of blue bars with white textAI-generated content may be incorrect., Picture" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Airline Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fast sites achieve this by reserving space for images/media and loading content in a controlled way. By contrast, lesser performers saw CLS values up to 0.8–1.5, causing noticeable “jank” and hurting user experience.&lt;/p&gt;

&lt;p&gt;It’s clear that front-end experience is the real differentiator among websites. Sites that not only load quickly but also render smoothly (low LCP) and stay visually stable (low CLS) ended up at the top of Catchpoint’s rankings. These Core Web Vitals were highly correlated with overall performance scores, underscoring that the fastest sites focus heavily on user-centric front-end metrics.&lt;/p&gt;
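Those Core Web Vitals thresholds are concrete numbers. As a minimal sketch, using the published web.dev cut-offs (LCP "good" at or under 2.5 s, "needs improvement" up to 4 s; CLS "good" at or under 0.1, "needs improvement" up to 0.25), here is where the benchmark figures above land:

```python
# Rate Core Web Vitals against the published web.dev thresholds:
# LCP "good" <= 2500 ms, "needs improvement" <= 4000 ms, else "poor";
# CLS "good" <= 0.1, "needs improvement" <= 0.25, else "poor".

def rate_lcp(ms: float) -> str:
    if ms <= 2500:
        return "good"
    if ms <= 4000:
        return "needs improvement"
    return "poor"

def rate_cls(score: float) -> str:
    if score <= 0.1:
        return "good"
    if score <= 0.25:
        return "needs improvement"
    return "poor"

# Figures from the benchmarks above: Skyscanner's ~613 ms LCP and the
# top sites' 0.00 CLS are solidly "good"; a 0.8 CLS is firmly "poor".
print(rate_lcp(613), rate_cls(0.0))   # good good
print(rate_cls(0.8))                  # poor
```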

&lt;h2&gt;
  
  
  &lt;strong&gt;#5. High availability without sacrificing speed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Leading websites understand that availability and speed must go hand in hand. It’s not enough to be up 24/7 if your site is slow or error-prone when loaded. The fastest sites manage to deliver near-perfect uptime and quick performance simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Near-perfect uptime:&lt;/strong&gt; The majority of top performers maintain 99.9% or better availability. In the 2025 Banking benchmark, 75% of sites had at least 99.9% uptime – and all of the top finishers were essentially always-online. In fact, the #1 sites in multiple industries achieved ≈100% uptime during the testing period. This level of reliability ensures users rarely encounter errors or downtime.&lt;/p&gt;
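The gap between availability tiers is easy to put in concrete terms. A quick back-of-the-envelope calculation (plain Python, ignoring leap years) translates an uptime percentage into the downtime it permits:

```python
# Convert an availability percentage into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)

def downtime_minutes_per_year(availability_pct: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99):
    mins = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows ~{mins:,.0f} minutes of downtime per year")
```

At 99.9% that works out to roughly 8.8 hours a year; a site dipping to ~90% uptime, as a few in the benchmarks did, is absorbing over a month of cumulative downtime.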

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22933jp3ffcskn7iuyzz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22933jp3ffcskn7iuyzz.png" alt="A blue rectangular object with black textAI-generated content may be incorrect., Picture" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;2025 Banking Website Performance Benchmark Report&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reliability and speed:&lt;/strong&gt; Crucially, the best sites don’t treat uptime as a safety net for poor performance – availability alone isn’t the differentiator it used to be. Several lower-ranked companies across industries had 99.9% uptime yet still fell behind due to slow pages or unstable front ends. Top sites avoid this trap by pairing high availability with fast, stable loads. They have resilient infrastructures to stay online, and they proactively monitor performance so that being “up” always means delivering a quality experience. Conversely, a few sites had serious outages (dipping to ~90% uptime), which obviously knocked them out of contention.&lt;/p&gt;

&lt;p&gt;The bottom line: reliability is foundational, but true excellence comes from reliability + speed. The fastest websites prioritize both. Users expect a site to be available and to respond instantly – leading sites deliver on that expectation consistently.  &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;#6. Culture of continuous optimization&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Speed isn’t a one-time project. The fastest websites treat performance as an ongoing discipline—one rooted in measurement, iteration, and readiness for peak demand.&lt;/p&gt;

&lt;p&gt;Instacart is a clear example. While their average TTFB during the &lt;a href="https://www.catchpoint.com/blog/why-super-bowl-2025-was-a-triumph-for-internet-resilience" rel="noopener noreferrer"&gt;Super Bowl benchmark&lt;/a&gt; period was a sluggish 1,051 ms, that changed on game day. On February 9th, when their Super Bowl ad aired, Instacart slashed TTFB significantly—demonstrating that with the right optimizations, even high-traffic moments can be performance wins.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhlnkd5kvwfasf97tof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lhlnkd5kvwfasf97tof.png" alt="A graph showing a line and a red circleAI-generated content may be incorrect., Picture" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Instacart’s TTFB improvement on the day of the Super Bowl&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This kind of readiness doesn’t happen by accident. Across all industries, one of the strongest shared habits of high-performing sites is this culture of continuous optimization. When performance is treated as a living system—not just a launch checklist—organizations are better equipped to handle traffic spikes, user expectations, and the unexpected.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Final thoughts: speed is strategy&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The patterns are clear. The fastest sites on the web invest in performance at every level: from back-end infrastructure to front-end polish, from global delivery to rigorous uptime, all under a philosophy of ongoing improvement. These sites prove that it’s possible to be both fast and reliable – delighting users with sub-3-second loads, silky-smooth pages, and dependable service. Meanwhile, competitors that neglect these areas struggle with slow load times, unstable pages, or regional outages, putting them at a serious disadvantage.&lt;/p&gt;

&lt;p&gt;The good news is that these best practices are well-understood. As our benchmark reports show, the top performers simply execute them better – and reap the rewards in user experience. By adopting the common habits of the fastest websites, including smart DNS/CDN usage, lightweight design, and Core Web Vitals focus, any website can dramatically improve its speed. After all, as the data reminds us, “it’s no longer enough to be available; you have to be fast everywhere”. The fastest sites have set the bar – and they invite others to catch up.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Explore the full benchmark series&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;See how your industry stacks up: &lt;a href="https://www.catchpoint.com/resource-library?*=benchmark+report" rel="noopener noreferrer"&gt;view all benchmark reports&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Observability isn’t about the tool. It’s about the truth</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 13 Oct 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/observability-isnt-about-the-tool-its-about-the-truth-c3f</link>
      <guid>https://forem.com/catchpoint/observability-isnt-about-the-tool-its-about-the-truth-c3f</guid>
      <description>&lt;p&gt;An enterprise client reports latency. Your dashboards say everything is fine. They blame you. You blame them. Nobody can prove it either way.  &lt;/p&gt;

&lt;p&gt;This is where most monitoring efforts hit a wall. Too often, the conversation gets stuck on dashboards and tools instead of the one thing that really matters: &lt;strong&gt;truth&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Observability isn’t about collecting metrics or building pretty dashboards. It’s about knowing the truth — the ability to quickly get to the root of a problem when your reputation and revenue are on the line.&lt;/p&gt;

&lt;p&gt;Not vanity metrics. Not checkbox features. Just fast, end-to-end, and undeniable truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happens when two companies see the same issue differently?
&lt;/h2&gt;

&lt;p&gt;A leading financial services provider (let’s call them Company A) was suddenly under pressure. A key enterprise client—Company B—reported delays of 3 to 6 seconds when hitting APIs embedded in their customer-facing apps.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Company B: "Your APIs are slow. It’s impacting our customer experience."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Company A (relying on &lt;strong&gt;Datadog&lt;/strong&gt; APM): "Everything looks fine on our side."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A stalemate. And a textbook case of observability failure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why couldn’t Datadog find the issue?
&lt;/h2&gt;

&lt;p&gt;This isn’t a knock on Datadog. It’s an excellent Application Performance Monitoring (APM) tool—but it wasn’t built to see beyond your own infrastructure.  &lt;/p&gt;

&lt;p&gt;So even though Company A had robust APM and logging, they couldn’t see anything outside their own walls. They couldn’t install agents in Company B’s infrastructure, and they certainly couldn’t drop Real User Monitoring (RUM) scripts into someone else’s codebase.&lt;/p&gt;

&lt;p&gt;Here’s what each tool can (and can’t) do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;APM (like Datadog)&lt;/strong&gt;: Great inside the app — once traffic arrives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RUM&lt;/strong&gt;: Excellent for frontend insights — but only if you own the app.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logs&lt;/strong&gt;: Useful for what already happened — but not where packets got stuck in transit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common denominator with all three is that none of them can see what’s happening &lt;em&gt;between&lt;/em&gt; systems. Let’s get into why.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do APIs create blind spots between companies?
&lt;/h2&gt;

&lt;p&gt;APIs are the interface between companies, the digital waiters of the software world. Just like you don’t walk into a restaurant kitchen to talk to the chef, companies don’t peek behind each other’s firewalls. They interact through APIs, exchanging structured requests and responses without ever seeing what’s really cooking on the other side.&lt;/p&gt;

&lt;p&gt;And that’s where blind spots creep in.  &lt;/p&gt;

&lt;p&gt;When two systems communicate through APIs, they lack visibility into each other’s inner workings. The moment a request leaves your infrastructure, it enters the black box of “someone else’s problem”: infrastructure, networks, and dependencies you don’t own and can’t instrument.&lt;/p&gt;

&lt;p&gt;The root problem is that the Internet isn’t instrumentable. You can’t deploy agents or RUM scripts across the networks and infrastructure you don’t control. That’s why traditional observability tools stop at the edge. Beyond that lies the unknown.&lt;/p&gt;

&lt;p&gt;But delivering a great digital experience depends on multiple networks, protocols, agents, and sub-systems working together in concert. These dependencies form what we call the &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt;: DNS, CDN, BGP, ISP, last mile, backbone, and more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy583eq0rwiaythemcag7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy583eq0rwiaythemcag7.png" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Internet Stack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When performance breaks down somewhere in that chain, it doesn’t matter if it’s your fault or not—your customers still feel it. APIs, after all, were designed for efficiency, not visibility.&lt;/p&gt;

&lt;p&gt;This is where &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM) becomes essential. IPM enables deep visibility into every layer of the Internet that can impact your service. Think of it as APM for the Internet Stack: purpose-built for the systems you don't own but still rely on.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you get to the truth when APM falls short?  
&lt;/h2&gt;

&lt;p&gt;When traditional observability tools couldn’t explain the latency, IPM filled the gap. Instead of guessing, Company A used IPM to run &lt;strong&gt;synthetic API tests&lt;/strong&gt; across real-world networks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;From user ISPs: major U.S. carriers and fiber providers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From backbone and enterprise vantage points&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;From inside Company A’s own infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test simulated actual API calls, complete with traceable request IDs and timestamps. And the results were undeniable.&lt;/p&gt;
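A minimal sketch of such a probe (standard-library Python) illustrates the idea: each call carries a unique request ID and a client-side timestamp so the exact same request can later be located in server-side logs. The URL and header names here are hypothetical, not Catchpoint's actual implementation.

```python
import time
import uuid
import urllib.request

def build_probe(url: str) -> urllib.request.Request:
    """Build a synthetic API request tagged for later correlation."""
    request_id = str(uuid.uuid4())
    return urllib.request.Request(url, headers={
        "X-Request-ID": request_id,           # trace this ID in server logs
        "X-Probe-Sent-At": str(time.time()),  # client-side send timestamp
    })

def run_probe(url: str) -> dict:
    """Issue the probe and record coarse end-to-end timing."""
    req = build_probe(url)
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
    return {
        "request_id": req.get_header("X-request-id"),
        "status": resp.status,
        "elapsed_ms": (time.perf_counter() - start) * 1000,
        "bytes": len(body),
    }
```

Run from many vantage points (last mile, backbone, inside the data center), the same probe yields directly comparable timings, and the request ID makes each measurement undeniable rather than anecdotal.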

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbel82and5tcvvatejq73.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbel82and5tcvvatejq73.jpg" width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This diagram maps the full path of an API call — from the client through Akamai, to internal proxy infrastructure and upstream systems. It clearly shows where latency accumulates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;DNS, connect, and SSL times are negligible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Akamai's edge processing is fast (~48ms).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Major delays occur during origin fetch (3,143ms) and proxy fetch (2,364ms)&lt;/strong&gt;—both inside the server infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This confirms the problem isn’t with the client or CDN, but deep in the backend.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
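The arithmetic behind that conclusion is simple: sum the phases and see which ones dominate. A small sketch using the figures from the waterfall above (the DNS/connect/SSL values are illustrative stand-ins for "negligible"):

```python
# Phase timings (ms) for one API call, per the waterfall above.
# dns/connect/ssl are illustrative placeholders for "negligible".
phases = {
    "dns": 4,
    "connect": 12,
    "ssl": 20,
    "edge": 48,            # Akamai edge processing (measured)
    "origin_fetch": 3143,  # measured
    "proxy_fetch": 2364,   # measured
}

total = sum(phases.values())
worst = max(phases, key=phases.get)
backend_share = (phases["origin_fetch"] + phases["proxy_fetch"]) / total

print(f"total={total} ms, worst phase={worst}, "
      f"backend share={backend_share:.0%}")
```

With the two fetch phases accounting for nearly all of the total, no amount of client, DNS, or CDN tuning could have fixed this request.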

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxumtbnkn5yuu21ipzciq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxumtbnkn5yuu21ipzciq.jpg" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Latency breakdown across cities&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This chart tracks average response and wait times across major U.S. cities. The key insight:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Latency patterns are &lt;strong&gt;remarkably consistent across geography&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A single spike appears across multiple regions, ruling out a location-specific issue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This supports the insight that the bottleneck lives within the origin infrastructure, not in external networks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj3vzfxeqk7y9jy2aa43.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdj3vzfxeqk7y9jy2aa43.jpg" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ISP breakdown&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here, performance is analyzed by ISP (e.g., AT&amp;amp;T, Comcast, Verizon):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Despite some noise, the &lt;strong&gt;pattern is stable across providers&lt;/strong&gt;, with no single ISP showing consistently worse performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This helps eliminate ISP-side routing or congestion as a root cause.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The brief AT&amp;amp;T spike aligns with the same moment seen in city-level data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; Consistent 3–6 second latency, internally and externally.&lt;/p&gt;

&lt;p&gt;With that intelligence, they could rule out the usual suspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the ISP&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the CDN&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t DNS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the proxy (Envoy)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process of elimination worked like a proper diagnostic: isolate each layer, eliminate what’s clean, and close in on the source. Parsing response headers like &lt;code&gt;x-envoy-upstream-service-time&lt;/code&gt; confirmed the latency was occurring further upstream, deep within Company A’s own service environment. This pointed engineers in the right direction without them needing to sift through endless log lines. Trace IDs and timestamps were shared with internal teams to help pinpoint issues around application dependencies—eventually confirmed to be the root cause.&lt;/p&gt;
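That header check can be sketched in a few lines. x-envoy-upstream-service-time is a standard Envoy response header reporting, in milliseconds, how long the upstream service took; comparing it to the externally measured wait localizes the delay. The 80% cut-off here is an arbitrary illustration, not a Catchpoint rule:

```python
# If x-envoy-upstream-service-time accounts for most of the measured
# wait, the delay lives upstream of the proxy; if the header value is
# small relative to the wait, the proxy or the network path is suspect.
def locate_latency(response_headers: dict, measured_wait_ms: float) -> str:
    upstream_ms = float(response_headers.get("x-envoy-upstream-service-time", 0))
    proxy_ms = measured_wait_ms - upstream_ms
    if upstream_ms > 0.8 * measured_wait_ms:
        return "upstream service"      # Envoy itself adds almost nothing
    if proxy_ms > 0.8 * measured_wait_ms:
        return "proxy or network path"
    return "mixed"

# Matching the case above: a ~3.1 s wait, almost all of it upstream.
print(locate_latency({"x-envoy-upstream-service-time": "3050"}, 3143))
# upstream service
```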

&lt;p&gt;This methodical approach, including initial discussion and setup, took just three hours and about 15 test runs. There was no guesswork. Just clarity.&lt;/p&gt;

&lt;p&gt;After internal validation, teams began work on the improvements, which are still ongoing but already measurable where it matters most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmbpi493pzv1151e5zx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmbpi493pzv1151e5zx2.png" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backend latency has dropped significantly: both upstream service time and overall wait time have been cut nearly in half. These gains reflect steady optimization efforts that are clearly moving in the right direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What IPM delivers that APM can’t
&lt;/h2&gt;

&lt;p&gt;Let’s be clear: Datadog, New Relic, and Dynatrace are outstanding at what they do — inside your infrastructure. But they weren’t designed to monitor the Internet itself.&lt;/p&gt;

&lt;p&gt;Catchpoint IPM was. Here’s how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A vast Global Agent Network&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;3000+ agents&lt;/a&gt; across last-mile, backbone, cloud, enterprise, and on-prem environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-user network emulation, not cloud-only testbeds&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Full synthetic coverage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;HTTP/S, APIs, browser, DNS, SSL, BGP, MQTT, QUIC, and custom scripts&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advanced diagnostics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Packet loss, jitter, path tracing, hop analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Region-specific degradation detection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend visibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;WebPageTest for in-depth frontend performance analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser + mobile RUM SDKs for teams who can instrument the frontend&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Seamless integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Feeds directly into Datadog, Splunk, New Relic, Dynatrace&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhances existing observability stacks without replacing them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides end-to-end visibility across the Internet Stack via a &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;real-time dependency map&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why teams cling to familiar tools even when they’re not fit for purpose
&lt;/h2&gt;

&lt;p&gt;Familiar tools are comfortable. They’re already deployed, widely understood, and politically safe. But too often, comfort wins out over capability—especially in large, mature organizations where tooling decisions are driven by inertia, not fitness for purpose. When seconds matter and customers are impacted, you need &lt;strong&gt;clarity&lt;/strong&gt;, not comfort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who takes the blame when APIs are slow?
&lt;/h2&gt;

&lt;p&gt;In this case, Company B blamed Company A. Company A blamed Company B. But neither had data to prove their case.&lt;/p&gt;

&lt;p&gt;Meanwhile, users just saw a slow experience.&lt;/p&gt;

&lt;p&gt;End users don’t know an API call is crossing company boundaries. They only see the brand they’re interacting with. If it’s slow, they assume that brand is to blame. That’s why solving performance issues quickly is about more than technical hygiene. It’s about protecting business relationships and customer trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought: What’s the real job of observability?
&lt;/h2&gt;

&lt;p&gt;Observability isn’t about the coolest UI or the biggest vendor budget. It’s about getting to the truth, fast. And often, the truth lies &lt;strong&gt;outside&lt;/strong&gt; your four walls.&lt;/p&gt;

&lt;p&gt;In an AI-driven world, data powers decisions. But if your data is incomplete or your telemetry is limited to your own infrastructure, your AI is just guessing.&lt;/p&gt;

&lt;p&gt;Catchpoint IPM gives teams the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Validate performance from the outside in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prove or disprove internal assumptions with independent data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pinpoint root causes in minutes, not days&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the point of observability isn’t the tool.&lt;/p&gt;

&lt;p&gt;It’s the truth.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Got a latency mystery your tools can’t solve?&lt;/strong&gt; &lt;a href="https://www.catchpoint.com/learn-more" rel="noopener noreferrer"&gt;Let’s talk.&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the ISP&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the CDN&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t DNS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It wasn’t the proxy (Envoy)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process of elimination worked like a proper diagnostic: isolate each layer, eliminate what’s clean, and close in on the source. Parsing response headers like &lt;code&gt;x-envoy-upstream-service-time&lt;/code&gt; confirmed the latency was occurring further upstream, deep within Company A’s own service environment. This pointed engineers in the right direction without them needing to sift through endless log lines. Trace IDs and timestamps were shared with internal teams to help pinpoint issues around application dependencies—eventually confirmed to be the root cause.&lt;/p&gt;
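
&lt;p&gt;As a rough sketch of that header-based attribution (the numbers here are hypothetical, and in practice the headers would come from an HTTP client response rather than a literal dict):&lt;/p&gt;

```python
def upstream_share(headers, total_ms):
    """Fraction of total response time that Envoy attributes to the
    upstream service, via x-envoy-upstream-service-time (milliseconds)."""
    upstream_ms = float(headers.get("x-envoy-upstream-service-time", 0))
    return upstream_ms / total_ms

# Hypothetical numbers: if Envoy reports 5400 ms upstream out of a
# 5600 ms total response, ~96% of the latency lies beyond the proxy.
share = upstream_share({"x-envoy-upstream-service-time": "5400"}, 5600)
print(f"{share:.0%} of latency is upstream of the proxy")
```

&lt;p&gt;When that fraction is close to 1, the proxy itself is exonerated and attention shifts to whatever sits behind it.&lt;/p&gt;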

&lt;p&gt;This methodical approach, including initial discussion and setup, took just three hours and about 15 test runs. There was no guesswork. Just clarity.&lt;/p&gt;

&lt;p&gt;After internal validation, teams began work on the improvements, which are still ongoing but already measurable where it matters most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmbpi493pzv1151e5zx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmbpi493pzv1151e5zx2.png" width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backend latency has dropped significantly: both upstream service time and overall wait time have been cut nearly in half. These gains reflect steady optimization efforts that are clearly moving in the right direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  What IPM delivers that APM can’t
&lt;/h2&gt;

&lt;p&gt;Let’s be clear: Datadog, New Relic, and Dynatrace are outstanding at what they do — inside your infrastructure. But they weren’t designed to monitor the Internet itself.&lt;/p&gt;

&lt;p&gt;Catchpoint IPM was. Here’s how:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A vast Global Agent Network&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;3000+ agents&lt;/a&gt; across last-mile, backbone, cloud, enterprise, and on-prem environments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-user network emulation, not cloud-only testbeds&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Full synthetic coverage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;HTTP/S, APIs, browser, DNS, SSL, BGP, MQTT, QUIC, and custom scripts&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advanced diagnostics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Packet loss, jitter, path tracing, hop analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Region-specific degradation detection&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend visibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;WebPageTest for in-depth frontend performance analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Browser + mobile RUM SDKs for teams who can instrument the frontend&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Seamless integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Feeds directly into Datadog, Splunk, New Relic, Dynatrace&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Enhances existing observability stacks without replacing them&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provides end-to-end visibility across the Internet Stack via a &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;real-time dependency map&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why teams cling to familiar tools even when they’re not fit for purpose
&lt;/h2&gt;

&lt;p&gt;Familiar tools are comfortable. They’re already deployed, widely understood, and politically safe. But too often, comfort wins out over capability—especially in large, mature organizations where tooling decisions are influenced by inertia, not fitness for purpose. But when seconds matter and customers are impacted, you need &lt;strong&gt;clarity&lt;/strong&gt;, not comfort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who takes the blame when APIs are slow?
&lt;/h2&gt;

&lt;p&gt;In this case, Company B blamed Company A. Company A blamed Company B. But neither had data to prove their case.&lt;/p&gt;

&lt;p&gt;Meanwhile, users just saw a slow experience.&lt;/p&gt;

&lt;p&gt;End users don’t know an API call is crossing company boundaries. They only see the brand they’re interacting with. If it’s slow, they assume that brand is to blame. That’s why solving performance issues quickly is about more than technical hygiene. It’s about protecting business relationships and customer trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought: What’s the real job of observability?
&lt;/h2&gt;

&lt;p&gt;Observability isn’t about the coolest UI or the biggest vendor budget. It’s about getting to the truth, fast. And often, the truth lies &lt;strong&gt;outside&lt;/strong&gt; your four walls.&lt;/p&gt;

&lt;p&gt;In an AI-driven world, data powers decisions. But if your data is incomplete or your telemetry is limited to your own infrastructure, your AI is just guessing.&lt;/p&gt;

&lt;p&gt;Catchpoint IPM gives teams the ability to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Validate performance from the outside in&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prove or disprove internal assumptions with independent data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pinpoint root causes in minutes, not days&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the point of observability isn’t the tool.&lt;/p&gt;

&lt;p&gt;It’s the truth.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Got a latency mystery your tools can’t solve?&lt;/strong&gt; &lt;a href="https://www.catchpoint.com/learn-more" rel="noopener noreferrer"&gt;Let’s talk.&lt;/a&gt;&lt;/p&gt;


</description>
      <category>observability</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>From the source to the edge: the six agent types you can’t ignore</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 06 Oct 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/from-the-source-to-the-edge-the-six-agent-types-you-cant-ignore-dd8</link>
      <guid>https://forem.com/catchpoint/from-the-source-to-the-edge-the-six-agent-types-you-cant-ignore-dd8</guid>
      <description>&lt;p&gt;Recently, Catchpoint expanded our Global Agent Network to over &lt;a href="https://www.catchpoint.com/blog/the-power-of-over-3000-intelligent-observability-agents" rel="noopener noreferrer"&gt;3,000 agents&lt;/a&gt;. In a crowded space, this is by far one of our key differentiators. At the time of writing, no one else boasts &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;395 providers in 105 countries and 346 cities.&lt;/a&gt; As Director of ISP Strategy, I’m not here to pat myself on the back—my real question is: why? Why build such a massive, independent network — going through all the effort to place backbone agents in hard-to-reach regions like China, Russia, and several African countries?  &lt;/p&gt;

&lt;p&gt;The answer lies in how the Internet is built. It isn’t a single, monolithic network but a patchwork quilt of thousands of independent networks—ISPs, data centers, backbones, wireless carriers, and more—all stitched together by peering and transit agreements worldwide. To monitor performance accurately, you need visibility into every layer of that quilt, or what we like to call the &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmz6dtib086y2y7y1umu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmz6dtib086y2y7y1umu.png" alt="A blue and purple rectangular chart with iconsAI-generated content may be incorrect., Picture" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Internet Stack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll unpack the six types of Catchpoint agents—backbone, cloud, wireless, last-mile, enterprise, and BGP—and show you exactly when and why each matters for keeping that quilt intact. First, let’s explore how the Internet actually connects end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why multiple vantage points are key
&lt;/h2&gt;

&lt;p&gt;The “Internet” isn’t a single cloud—you can’t just tap into one place and see it all. It’s really tens of thousands of independent networks (ISPs, data centers, wireless carriers) stitched together by peering and transit deals. In a &lt;strong&gt;peering&lt;/strong&gt; arrangement, two networks exchange traffic for free. In a &lt;strong&gt;transit&lt;/strong&gt; relationship, one network pays another to carry its traffic. Peering keeps traffic local; transit carries it farther afield.&lt;/p&gt;

&lt;p&gt;Because each network makes its own choices about peering and transit, you need monitoring agents at many points to see what’s happening. A performance hiccup in one ISP’s peering location might not show up at a different ISP’s vantage point. That’s why Catchpoint places agents in dozens of key networks—so you won’t miss an issue that affects only a slice of the Internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxopl87dwp93nnveaw2ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxopl87dwp93nnveaw2ae.png" alt="A blue and pink colored networkDescription automatically generated with medium confidence, Picture" width="800" height="747"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A map of how thousands of networks (Autonomous Systems) peer and buy transit around the world.&lt;/em&gt; &lt;a href="https://www.caida.org/projects/as-core/2020/#poster" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why tiers matter
&lt;/h2&gt;

&lt;p&gt;All those peering and transit agreements naturally sort networks into tiers:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 1:&lt;/strong&gt; Very large networks that peer with each other and don’t need to buy any transit at all to reach any corner of the Internet. They are considered the backbone of the Internet, as they typically carry long-distance Internet traffic. While the membership of this group has shifted since the early days of the Internet, it has remained relatively stable, including providers such as Lumen, AT&amp;amp;T, Cogent, Verizon, Orange, GTT, NTT, and Telxius. This category features very large traditional telecom providers that have been serving their domestic markets for many years.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 2:&lt;/strong&gt; Regional providers that both peer and buy transit. The scale of their operations and the types of services they provide (IP transit, Ethernet, or dark fiber wavelengths) dictate the number of peering connections and transit providers. Most networks fall into this category if they peer at one or more Internet exchange points and have two or more upstream providers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 3:&lt;/strong&gt; Smaller, local ISPs (often single-homed) that feed to an upstream provider. They show you what your end users see on a residential or localized network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
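
&lt;p&gt;Those definitions can be condensed into a toy heuristic. This is purely illustrative; real tier classification relies on routing-table analysis (for example, CAIDA’s ASRank), not two counters:&lt;/p&gt;

```python
def classify_tier(peers, transit_providers):
    """Rough tier heuristic from the definitions above (illustrative only)."""
    if transit_providers == 0 and peers > 0:
        return "Tier 1"   # peers everywhere, buys no transit
    if peers >= 1 and transit_providers >= 2:
        return "Tier 2"   # peers at IXPs and buys transit
    return "Tier 3"       # single-homed local ISP

print(classify_tier(peers=40, transit_providers=0))  # Tier 1
print(classify_tier(peers=5, transit_providers=2))   # Tier 2
print(classify_tier(peers=0, transit_providers=1))   # Tier 3
```
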

&lt;p&gt;Putting agents in each tier matters because a Tier 1 network has more visibility into global events (total or partial outages, backbone congestion, etc.) than a regional Tier 2 or a local Tier 3 network. On the other hand, localized outages affecting a limited number of providers in a particular geographical area won’t be easily observed unless you have visibility from one of the affected networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single-homed Tier 1/Tier 2 connectivity matters
&lt;/h2&gt;

&lt;p&gt;Now that we know how networks sort into tiers, let’s look at how those tiers influence the way we connect our agents.&lt;/p&gt;

&lt;p&gt;Many data centers, hosting providers, and managed service providers use multiple upstream ISPs (Tier 1 and Tier 2) to create a single, aggregated connection to the Internet. This connectivity, normally offered as a service, is often called blended bandwidth or multihoming.  &lt;/p&gt;

&lt;p&gt;Multihoming improves redundancy as traffic can be rerouted if one ISP goes down or has packet loss/congestion. It also improves performance as different ISPs may offer better latency to different geographies.&lt;/p&gt;

&lt;p&gt;But for &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM), the use of multihomed agents instead of single-homed Tier 1/Tier 2 carriers introduces variability in your monitoring data, making it harder to identify and troubleshoot the issues affecting performance.  &lt;/p&gt;

&lt;p&gt;Here is why more than 96% of Catchpoint backbone agents use single-homed Tier 1/Tier 2 connectivity instead of blended bandwidth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Path consistency:&lt;/strong&gt; A consistent Tier 1 upstream path reduces variability, making anomalies and degradations easier to detect and attribute.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backbone visibility:&lt;/strong&gt; Tier 1 visibility is essential to observe how the core Internet behaves in relation to routing anomalies, BGP hijacks, or backbone congestion.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance stability:&lt;/strong&gt; With blended connectivity, routes may change dynamically based on load-balancing or pricing strategies (e.g., BGP-based traffic engineering), affecting performance results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
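
&lt;p&gt;A quick sketch of why path consistency matters for the data: with hypothetical RTT samples, a blended connection that flips between two upstreams shows far more spread than a single stable Tier 1 path:&lt;/p&gt;

```python
import statistics

# Hypothetical RTT samples (ms) to the same target.
single_homed = [82, 83, 81, 82, 84, 82]     # one stable Tier 1 path
blended      = [82, 118, 83, 120, 81, 119]  # routes flip between two upstreams

def jitter(samples):
    """Standard deviation of RTT: a simple proxy for path variability."""
    return statistics.stdev(samples)

print(f"single-homed stdev: {jitter(single_homed):.1f} ms")
print(f"blended stdev:      {jitter(blended):.1f} ms")
```

&lt;p&gt;Neither path is “slow” on average, but the blended one makes genuine anomalies much harder to pick out of the noise.&lt;/p&gt;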

&lt;p&gt;Now that you understand how and why Catchpoint chooses single-homed Tier 1/Tier 2 connectivity, let’s look at each of our six agent types. We’ll explain where we place them, how we build them, and—most importantly—exactly what visibility each one gives you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backbone agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Backbone agents give you a “core-of-Internet” vantage point to catch global outages, BGP hijacks, and CDN-level issues no other agent can see.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We place backbone agents in Tier 1 or Tier 2 ISPs worldwide, selecting carriers by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geography and market importance&lt;/strong&gt; (global connectivity hubs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://asrank.caida.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;CAIDA ASRank&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;&amp;amp;&lt;/strong&gt; &lt;a href="https://stats.labs.apnic.net/aspop" rel="noopener noreferrer"&gt;&lt;strong&gt;APNIC eyeball data&lt;/strong&gt;&lt;/a&gt; (to cover the most interconnected networks)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each backbone agent runs as a server cluster in a carrier-neutral data center with dedicated IP transit. Carrier neutrality ensures access to multiple international and domestic carriers via cross-connects, while colocating servers in one facility keeps costs down. In emerging markets (e.g., parts of Africa or China), where neutral data centers are scarce, we may host clusters in carrier-owned facilities—but only as a last resort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnad9i4ugzovg0ervecxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnad9i4ugzovg0ervecxy.png" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connectivity diagram of backbone agents at a data center facility&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;Measuring performance and availability from backbone agents is critical for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experience Level Objective (XLO) measurements&lt;/strong&gt;: Validate service performance when source and target share the same ISP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN performance/validation:&lt;/strong&gt; Ensure fast, reliable content delivery across the backbone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Competitive benchmarking:&lt;/strong&gt; Compare your service to peers in the same Tier 1/2 networks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Peering/ISP monitoring:&lt;/strong&gt; Detect routing changes, BGP anomalies, or unexpected transit behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geo-based DNS validation:&lt;/strong&gt; Confirm DNS resolution speed and correctness from the core network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Public cloud agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Cloud agents give you visibility right inside public-cloud data centers—so you can catch platform-specific issues before they impact users.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We run &lt;a href="https://www.catchpoint.com/global-observability-network#cloud" rel="noopener noreferrer"&gt;280+ cloud agents&lt;/a&gt; across every key availability region in AWS, Azure, Google, Oracle, Alibaba, Tencent, Akamai Compute, and OVH.  &lt;/p&gt;

&lt;p&gt;Measuring performance and availability to and from cloud agents is essential if you are hosting applications in the cloud or using any of their computing products. Cloud agents allow your SRE teams to pre-emptively detect performance degradations on public clouds that can affect how your users are experiencing your applications and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wireless agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Wireless agents simulate real-world cellular conditions, giving you a true picture of how your applications perform on 3G/4G/5G networks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We place &lt;a href="https://www.catchpoint.com/global-observability-network#wireless" rel="noopener noreferrer"&gt;wireless agents&lt;/a&gt; using AWS Wavelength and independent carriers in the US, Canada, Japan, Germany, Korea, India, and the UK (e.g., Verizon, KDDI, BT, T-Mobile 5G, AT&amp;amp;T 5G).&lt;/p&gt;

&lt;p&gt;Running wireless tests alongside backbone tests lets you compare mobile experience to core-network performance—so you can spot issues like packet loss or DNS slowdowns that only affect cellular users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last-mile agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Last-mile agents live in real homes, giving you a true end-user view of broadband performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://www.catchpoint.com/global-observability-network#last-mile" rel="noopener noreferrer"&gt;last-mile agents&lt;/a&gt; run on small customer-premise devices that connect to a residential ISP. Use these agents to troubleshoot ISP-specific issues—like throttling, DNS failures, or regional outages—that only affect subscribers on a particular network.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Enterprise agents give you visibility into your own network—from branch offices to data centers to edge locations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enterprise agents are deployed within your organization’s infrastructure. That includes office networks, private data centers, retail locations, or edge devices. These agents help you monitor internal applications, APIs, and services with the same level of granularity you get for external traffic. Combined with our Global Agent Network, enterprise agents complete the picture—giving you visibility from both outside-in and inside-out.  &lt;/p&gt;

&lt;h2&gt;
  
  
  BGP agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;BGP agents watch the real-time routing table, so you can catch hijacks, leaks, or unexpected path changes that threaten your service.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint maintains a route collector infrastructure that processes real-time routing data from 1,700+ BGP agents, with the goal of monitoring BGP activity and detecting issues such as route hijacks and leaks.&lt;/p&gt;

&lt;p&gt;In addition to using RIPE RIS and RouteViews datasets, Catchpoint operates its own private collector infrastructure, which includes agreements to receive data from 330+ BGP agents across 100 unique networks.&lt;/p&gt;

&lt;p&gt;If you share your own BGP sessions with our private collectors, you’ll gain even deeper insights in your portal—so you see exactly how routing anomalies affect your prefixes.&lt;/p&gt;
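
&lt;p&gt;Conceptually, hijack detection boils down to comparing each announced origin AS against the expected origin for a monitored prefix. The sketch below is illustrative only: the prefix and AS numbers come from documentation ranges, and real collectors parse MRT/BMP feeds from peers like RIPE RIS and RouteViews, not Python dicts:&lt;/p&gt;

```python
# Expected origin AS per monitored prefix (values are documentation
# examples, not real assignments).
EXPECTED_ORIGIN = {"203.0.113.0/24": 64500}

def flag_hijacks(updates):
    """Yield BGP updates whose origin AS doesn't match the expected one."""
    for update in updates:
        expected = EXPECTED_ORIGIN.get(update["prefix"])
        if expected is not None and update["origin_as"] != expected:
            yield update

updates = [
    {"prefix": "203.0.113.0/24", "origin_as": 64500},  # normal announcement
    {"prefix": "203.0.113.0/24", "origin_as": 64511},  # unexpected origin
]
for suspect in flag_hijacks(updates):
    print("possible hijack:", suspect)
```
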

&lt;h2&gt;
  
  
  Wrapping it up
&lt;/h2&gt;

&lt;p&gt;So, why build a network of over 3,000 agents in 105 countries and 346 cities? The simple answer is that today’s Internet isn’t one giant cloud but a patchwork quilt of independent networks.  &lt;/p&gt;

&lt;p&gt;By spreading our agents across every layer of the Internet Stack, we can reveal problems at the very moment they start—whether it’s a routing change in a distant backbone, a subtle slowdown in a public cloud region, or a local ISP hiccup affecting a handful of homes.  &lt;/p&gt;

&lt;p&gt;This broad visibility isn’t about boasting coverage; it’s about ensuring that whenever something goes wrong, you know exactly where to look. In other words, the effort we put into building and maintaining such a diverse network isn’t just a technical feat. It’s the key to preventing those 3 a.m. wake-up calls for your IT team, avoiding frantically assembled war rooms, and keeping your users happy no matter where they connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to unlock the full potential of observability?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learn more about Catchpoint’s intelligent agent network and how it can transform your monitoring strategy: &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;https://www.catchpoint.com/global-observability-network&lt;/a&gt;  &lt;/p&gt;


&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Recently, Catchpoint expanded our Global Agent Network to over &lt;a href="https://www.catchpoint.com/blog/the-power-of-over-3000-intelligent-observability-agents" rel="noopener noreferrer"&gt;3,000 agents&lt;/a&gt;. In a crowded space, this is by far one of our key differentiators. At the time of writing, no one else boasts &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;395 providers in 105 countries and 346 cities.&lt;/a&gt; As Director of ISP Strategy, I’m not here to pat myself on the back—my real question is: why? Why build such a massive, independent network — going through all the effort to place backbone agents in hard-to-reach regions like China, Russia, and several African countries?  &lt;/p&gt;

&lt;p&gt;The answer lies in how the Internet is built. It isn’t a single, monolithic network but a patchwork quilt of thousands of independent networks—ISPs, data-centers, backbones, wireless carriers, and more—all stitched together by peering and transit agreements worldwide. To monitor performance accurately, you need visibility into every layer of that quilt, or, what we like to call the &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmz6dtib086y2y7y1umu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmz6dtib086y2y7y1umu.png" alt="A blue and purple rectangular chart with iconsAI-generated content may be incorrect., Picture" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Internet Stack&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, we’ll unpack the five types of Catchpoint synthetic agents—backbone, wireless, last-mile, cloud, and BGP—and show you exactly when and why each matters for keeping that quilt intact. First, let’s explore how the Internet actually connects end to end.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why multiple vantage points are key
&lt;/h2&gt;

&lt;p&gt;The “Internet” isn’t a single cloud—you can’t just tap into one place and see it all. It’s really tens of thousands of independent networks (ISPs, data centers, wireless carriers) stitched together by peering and transit deals. In a &lt;strong&gt;peering&lt;/strong&gt; arrangement, two networks exchange traffic for free. In a &lt;strong&gt;transit&lt;/strong&gt; relationship, one network pays another to carry its traffic. Peering keeps traffic local; transit carries it farther afield.&lt;/p&gt;

&lt;p&gt;Because each network makes its own choices about peering and transit, you need monitoring agents at many points to see what’s happening. A performance hiccup in one ISP’s peering location might not show up at a different ISP’s vantage point. That’s why Catchpoint places agents in dozens of key networks—so you won’t miss an issue that affects only a slice of the Internet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxopl87dwp93nnveaw2ae.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxopl87dwp93nnveaw2ae.png" alt="A blue and pink colored networkDescription automatically generated with medium confidence, Picture" width="800" height="747"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A map of how thousands of networks (Autonomous Systems) peer and buy transit around the world.&lt;/em&gt; &lt;a href="https://www.caida.org/projects/as-core/2020/#poster" rel="noopener noreferrer"&gt;&lt;em&gt;Source&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why tiers matter
&lt;/h2&gt;

&lt;p&gt;All those peering and transit agreements naturally sort networks into tiers:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 1:&lt;/strong&gt; These are very large networks that peer with each other and don’t need to buy any transit at all to reach any corner of the Internet. They are considered the backbone of the Internet as they will typically carry long-distance Internet traffic. While the networks belonging to this group have changed since the beginning of the Internet, it’s remained relatively stable, including providers such as Lumen, AT&amp;amp;T, Cogent, Verizon, Orange, GTT, NTT, and Telxius. This category features very large traditional telecom providers that have been serving their domestic market for many years.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 2:&lt;/strong&gt; Regional providers that both peer and buy transit. The scale of operations as well as the type of services they provide (IP transit, Ethernet or Dark Fiber wavelengths) will dictate the number of peering connections and transit providers. Most networks will fall into this category if they peer at one or more Internet exchange points and have two or more upstream providers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tier 3:&lt;/strong&gt; Smaller, local ISPs (often single-homed) that feed to an upstream provider. They show you what your end users see on a residential or localized network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting agents in each tier matters because a Tier 1 network will have more visibility into global events (total or partial outages, backbone congestions, etc) as opposed to a regional Tier 2 or a local Tier 3 network. On the other hand, localized outages affecting a limited number of providers in a particular geographical area, won’t be easily observed unless having visibility from one of the affected networks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why single-homed Tier 1/Tier 2 connectivity matters
&lt;/h2&gt;

&lt;p&gt;Now that we know how networks sort into tiers, let’s look at how those tiers influence the way we connect our agents.&lt;/p&gt;

&lt;p&gt;Many data centers, hosting and managed service providers typically use multiple upstream ISPs (Tier 1 and Tier 2) to create a single, aggregated connection to the Internet. This Internet connectivity, normally offered as a service, is often called blended bandwidth or multihoming.  &lt;/p&gt;

&lt;p&gt;Multihoming improves redundancy as traffic can be rerouted if one ISP goes down or has packet loss/congestion. It also improves performance as different ISPs may offer better latency to different geographies.&lt;/p&gt;

&lt;p&gt;For &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM), however, the use of multihomed agents instead of single-homed Tier 1/Tier 2 carriers introduces variability in your monitoring data, making it harder to identify and troubleshoot the issues affecting performance.&lt;/p&gt;

&lt;p&gt;Here is why more than 96% of Catchpoint backbone agents use single-homed Tier 1/Tier 2 connectivity instead of blended bandwidth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Path consistency:&lt;/strong&gt; A consistent Tier 1 upstream path reduces variability, making anomalies and degradations easier to detect and attribute.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backbone visibility:&lt;/strong&gt; Tier 1 visibility is essential to observe how the core Internet behaves in relation to routing anomalies, BGP hijacks, or backbone congestion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance stability:&lt;/strong&gt; With blended connectivity, routes may change dynamically based on load-balancing or pricing strategies (e.g., BGP-based traffic engineering), affecting performance results.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you understand how and why Catchpoint chooses single-homed Tier 1/Tier 2 connectivity, let’s look at each of our five synthetic agent types. We’ll explain where we place them, how we build them, and—most importantly—exactly what visibility each one gives you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backbone agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Backbone agents give you a “core-of-Internet” vantage point to catch global outages, BGP hijacks, and CDN-level issues no other agent can see.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We place backbone agents in Tier 1 or Tier 2 ISPs worldwide, selecting carriers by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geography and market importance&lt;/strong&gt; (global connectivity hubs)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://asrank.caida.org/" rel="noopener noreferrer"&gt;&lt;strong&gt;CAIDA ASRank&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;&amp;amp;&lt;/strong&gt; &lt;a href="https://stats.labs.apnic.net/aspop" rel="noopener noreferrer"&gt;&lt;strong&gt;APNIC eyeball data&lt;/strong&gt;&lt;/a&gt; (to cover the most interconnected networks)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each backbone agent runs as a server cluster in a carrier-neutral data center with dedicated IP transit. Carrier neutrality ensures multiple international and domestic carriers via cross-connects, while colocating servers in one facility reduces colocation costs. In emerging markets (e.g., parts of Africa or China), where neutral data centers are scarce, we may host clusters in carrier-owned facilities—only as a last resort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnad9i4ugzovg0ervecxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnad9i4ugzovg0ervecxy.png" width="800" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Connectivity diagram of backbone agents at a data center facility&lt;/em&gt;&lt;/p&gt;


&lt;p&gt;Measuring performance and availability from backbone agents is critical for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Experience Level Objective (XLO) measurements&lt;/strong&gt;: Validate service performance when source and target share the same ISP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN performance/validation:&lt;/strong&gt; Ensure fast, reliable content delivery across the backbone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Competitive benchmarking:&lt;/strong&gt; Compare your service to peers in the same Tier 1/2 networks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Peering/ISP monitoring:&lt;/strong&gt; Detect routing changes, BGP anomalies, or unexpected transit behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Geo-based DNS validation:&lt;/strong&gt; Confirm DNS resolution speed and correctness from the core network.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Public cloud agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Cloud agents give you visibility right inside public-cloud data centers—so you can catch platform-specific issues before they impact users.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We run &lt;a href="https://www.catchpoint.com/global-observability-network#cloud" rel="noopener noreferrer"&gt;280+ cloud agents&lt;/a&gt; across every key availability region in AWS, Azure, Google, Oracle, Alibaba, Tencent, Akamai Compute, and OVH.  &lt;/p&gt;

&lt;p&gt;Measuring performance and availability to and from cloud agents is essential if you host applications in the cloud or use any of a provider’s computing products. Cloud agents let your SRE teams preemptively detect performance degradation in public clouds that can affect how your users experience your applications and services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wireless agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Wireless agents simulate real-world cellular conditions, giving you a true picture of how your applications perform on 3G/4G/5G networks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We place &lt;a href="https://www.catchpoint.com/global-observability-network#wireless" rel="noopener noreferrer"&gt;wireless agents&lt;/a&gt; using AWS Wavelength and independent carriers in the US, Canada, Japan, Germany, Korea, India, and the UK (e.g., Verizon, KDDI, BT, T-Mobile 5G, AT&amp;amp;T 5G).&lt;/p&gt;

&lt;p&gt;Running wireless tests alongside backbone tests lets you compare mobile experience to core-network performance—so you can spot issues like packet loss or DNS slowdowns that only affect cellular users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last-mile agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Last-mile agents live in real homes, giving you a true end-user view of broadband performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;a href="https://www.catchpoint.com/global-observability-network#last-mile" rel="noopener noreferrer"&gt;last-mile agents&lt;/a&gt; run on small customer-premise devices that connect to a residential ISP. Use these agents to troubleshoot ISP-specific issues—like throttling, DNS failures, or regional outages—that only affect subscribers on a particular network.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Enterprise agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Enterprise agents give you visibility into your own network—from branch offices to data centers to edge locations.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Enterprise agents are deployed within your organization’s infrastructure. That includes office networks, private data centers, retail locations, or edge devices. These agents help you monitor internal applications, APIs, and services with the same level of granularity you get for external traffic. Combined with our Global Agent Network, enterprise agents complete the picture—giving you visibility from both outside-in and inside-out.  &lt;/p&gt;

&lt;h2&gt;
  
  
  BGP agents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;BGP agents watch the real-time routing table, so you can catch hijacks, leaks, or unexpected path changes that threaten your service.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint maintains a route collector infrastructure that processes real-time routing data from 1700+ BGP agents to monitor BGP activity and detect issues such as route hijacks and leaks.&lt;/p&gt;

&lt;p&gt;In addition to using RIPE RIS and RouteViews datasets, Catchpoint operates its own private collector infrastructure which includes agreements to receive data from 330+ BGP agents from 100 unique networks.&lt;/p&gt;

&lt;p&gt;If you share your own BGP sessions with our private collectors, you’ll gain even deeper insights in your portal—so you see exactly how routing anomalies affect your prefixes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping it up
&lt;/h2&gt;

&lt;p&gt;When asked, “Why build a network of over 3,000 agents in 105 countries and 346 cities?” the simple answer is that today’s Internet isn’t one giant cloud but a patchwork quilt of independent networks.&lt;/p&gt;

&lt;p&gt;By spreading our agents across every layer of the Internet Stack, we can reveal problems at the very moment they start—whether it’s a routing change in a distant backbone, a subtle slowdown in a public cloud region, or a local ISP hiccup affecting a handful of homes.  &lt;/p&gt;

&lt;p&gt;This broad visibility isn’t about boasting coverage; it’s about ensuring that whenever something goes wrong, you know exactly where to look. In other words, the effort we put into building and maintaining such a diverse network isn’t just a technical feat. It’s the key to preventing those 3 a.m. wake-up calls for your IT team, avoiding frantically assembled war rooms, and keeping your users happy no matter where they connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to unlock the full potential of observability?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Learn more about Catchpoint's intelligent agent network and how it can transform your monitoring strategy: &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;https://www.catchpoint.com/global-observability-network&lt;/a&gt;&lt;/p&gt;


</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>ipm</category>
      <category>rum</category>
    </item>
    <item>
      <title>Real-time detection of BGP blackholing and prefix hijacks</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 22 Sep 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/real-time-detection-of-bgp-blackholing-and-prefix-hijacks-5cea</link>
      <guid>https://forem.com/catchpoint/real-time-detection-of-bgp-blackholing-and-prefix-hijacks-5cea</guid>
      <description>&lt;p&gt;&lt;strong&gt;Border Gateway Protocol (BGP)&lt;/strong&gt; remains the backbone of inter-domain routing on the Internet, but its fundamental trust model leaves it vulnerable to misconfigurations, &lt;a href="https://www.catchpoint.com/blog/bgp-hijacking" rel="noopener noreferrer"&gt;hijacks&lt;/a&gt;, and blackholing. When these issues occur, they often go undetected by the impacted networks—until users report degraded performance or service outages.&lt;/p&gt;

&lt;p&gt;This post walks through a real-world incident in which a legitimate traffic spike led to an upstream provider mistakenly blackholing a critical IP address. The scenario illustrates how BGP blackholing can silently disrupt service and how external observability enables rapid diagnosis and resolution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding BGP blackholing
&lt;/h2&gt;

&lt;p&gt;BGP blackholing is a commonly used &lt;a href="https://www.catchpoint.com/blog/dns-ddos-mitigation" rel="noopener noreferrer"&gt;DDoS mitigation tactic&lt;/a&gt;. A network under attack announces a more specific route for the targeted IP or subnet, directing that traffic to a null interface to prevent it from reaching the intended service infrastructure. While effective in protecting resources during volumetric attacks, this approach can inadvertently block legitimate traffic when applied too aggressively.&lt;/p&gt;

&lt;p&gt;Let’s walk through an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2djdsda68tdz5hc2wjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe2djdsda68tdz5hc2wjw.png" alt="Picture 924464936, Picture" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;AS2 creates BGP blackhole. No traffic reaching intended server.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this case, the &lt;code&gt;/24&lt;/code&gt; prefix &lt;code&gt;1.0.0.0/24&lt;/code&gt; was owned and announced by one autonomous system (AS1). A specific point-of-presence (PoP) within this prefix was responsible for live-streaming a global event. The virtual IP for this PoP—&lt;code&gt;1.0.0.100&lt;/code&gt;—saw a surge in traffic from viewers worldwide.&lt;/p&gt;

&lt;p&gt;The traffic passed through an upstream provider (AS2), which monitored for DDoS patterns. Seeing the sudden spike, AS2’s automated mitigation system assumed the traffic was malicious. It responded by injecting a more specific &lt;code&gt;/32&lt;/code&gt; route for &lt;code&gt;1.0.0.100&lt;/code&gt; into the global routing table and directed it to a null interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The effect was immediate: traffic destined for the live-streaming service was dropped silently by AS2, resulting in widespread loss of availability for users across multiple regions.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges in diagnosing upstream blackholing
&lt;/h2&gt;

&lt;p&gt;From AS1’s perspective, the service infrastructure remained operational, and no anomalies were observed in internal telemetry. However, users were unable to access the stream.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubimzjc8n0ovdx90nuxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubimzjc8n0ovdx90nuxh.png" alt="A diagram of a computerAI-generated content may be incorrect., Picture" width="800" height="566"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why traditional monitoring misses upstream blackholing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Because the traffic was dropped before reaching AS1’s infrastructure, no logs or packet traces indicated a problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a common limitation when relying solely on internal monitoring. In upstream blackholing scenarios, routing changes happen outside of the origin network’s control, and the only observable symptom may be an unexplained drop in traffic or availability.&lt;/p&gt;

&lt;p&gt;The diagnostic challenge is further complicated by the specificity of the blackhole route. While the legitimate route for &lt;code&gt;1.0.0.0/24&lt;/code&gt; remained active, the injected &lt;code&gt;/32&lt;/code&gt; for &lt;code&gt;1.0.0.100&lt;/code&gt; took precedence due to BGP’s longest-prefix match rule, causing traffic to be rerouted and dropped at AS2.&lt;/p&gt;
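&lt;p&gt;The longest-prefix match behavior can be sketched in a few lines (the routing table below is illustrative, and real BGP selection involves many more attributes):&lt;/p&gt;

```python
import ipaddress

# Illustrative routing table: prefix -> where traffic is sent (labels are hypothetical)
routes = {
    ipaddress.ip_network("1.0.0.0/24"): "AS1 (legitimate origin)",
    ipaddress.ip_network("1.0.0.100/32"): "AS2 (blackhole / null interface)",
}

def best_route(dst: str) -> str:
    """Select the matching route with the longest prefix, as BGP does."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in routes if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]

print(best_route("1.0.0.100"))  # the more-specific /32 blackhole wins
print(best_route("1.0.0.50"))   # other hosts in the prefix still follow the /24
```

Because the /32 is more specific than the /24, traffic to `1.0.0.100` is drawn to AS2's null interface even though AS1's legitimate route is still in the table.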

&lt;h2&gt;
  
  
  Detecting origin AS mismatches and route hijacks
&lt;/h2&gt;

&lt;p&gt;The incident was identified through external route monitoring that detected an origin AS mismatch—the &lt;code&gt;/32&lt;/code&gt; prefix was being originated by AS2 instead of the expected AS1. This deviation triggered an alert, which prompted further analysis of the BGP path and propagation behavior.&lt;/p&gt;
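&lt;p&gt;Conceptually, this kind of origin-AS check can be sketched as follows (the table of expected origins and the announcement format are hypothetical, not a real monitoring API):&lt;/p&gt;

```python
import ipaddress

# Hypothetical sketch of origin-AS validation over observed BGP announcements.
expected_origins = {"1.0.0.0/24": 1}  # prefix -> expected origin ASN (AS1)

def check_announcement(prefix: str, as_path: list) -> str:
    """Return an alert string when a prefix (or a more-specific of it) is
    originated by an unexpected AS; the origin AS is the last hop in the path."""
    announced = ipaddress.ip_network(prefix)
    origin = as_path[-1]
    for known, expected_asn in expected_origins.items():
        covering = ipaddress.ip_network(known)
        if announced.subnet_of(covering) and origin != expected_asn:
            return f"ALERT: {prefix} originated by AS{origin}, expected AS{expected_asn}"
    return ""

# The more-specific /32 originated by AS2 triggers a mismatch alert:
print(check_announcement("1.0.0.100/32", [64500, 2]))
```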

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07i1b7elx84pbzpaixsc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F07i1b7elx84pbzpaixsc.png" alt="Picture 546420361, Picture" width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Catchpoint platform BGP alert for ASN origin mismatch&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An inspection of the AS path confirmed that certain regions were receiving the incorrect &lt;code&gt;/32&lt;/code&gt; advertisement and routing traffic through AS2, which blackholed the packets. The blackhole route had propagated only to select geographies, explaining the outage pattern observed by users.&lt;/p&gt;

&lt;p&gt;Mapping the propagation of the erroneous route helped identify the scope of the impact and enabled coordination with AS2 to withdraw the blackhole announcement. Once removed, traffic to &lt;code&gt;1.0.0.100&lt;/code&gt; resumed normal routing, and the live-streaming service was restored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8dfz070jvzlhaf2clwm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8dfz070jvzlhaf2clwm.png" alt="Picture 1533534755, Picture" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Catchpoint platform showing BGP path, clearly identifying where traffic was split due to blackhole&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Broader implications
&lt;/h2&gt;

&lt;p&gt;This incident highlights the fragility of the global routing layer and the potential for automated systems to cause collateral damage, even when operating as designed. &lt;strong&gt;It also underscores the limitations of relying solely on internal data to understand end-to-end Internet performance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u043dgw7tf5splbewwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u043dgw7tf5splbewwc.png" alt="A map of the world with different colors of the world mapAI-generated content may be incorrect., Picture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Visibility into prefix propagation across the globe&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;External BGP monitoring allows operators to observe how their prefixes are being routed across the Internet and to detect anomalies such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prefix hijacks by unintended or malicious ASes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Upstream blackholing through more-specific announcements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AS path divergence and propagation anomalies&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Such visibility is critical for large-scale services that rely on third-party transit and upstream providers to reach global users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking forward
&lt;/h2&gt;

&lt;p&gt;BGP remains a powerful but fragile protocol, and incidents like this illustrate the importance of proactive, third-party observability into Internet routing. As automated mitigation systems become more prevalent, it is increasingly important for network operators to verify not just whether their services are available, but whether their prefixes are being routed as intended.&lt;/p&gt;

&lt;p&gt;For a detailed exploration of BGP monitoring techniques and best practices, &lt;a href="https://www.catchpoint.com/bgp-monitoring" rel="noopener noreferrer"&gt;&lt;strong&gt;check out our in-depth guide&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;


</description>
      <category>bgp</category>
      <category>networking</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Google’s Agent-to-Agent (A2A) Protocol is here—Now Let’s Make it Observable</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 15 Sep 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/googles-agent-to-agent-a2a-protocol-is-here-now-lets-make-it-observable-5gh7</link>
      <guid>https://forem.com/catchpoint/googles-agent-to-agent-a2a-protocol-is-here-now-lets-make-it-observable-5gh7</guid>
      <description>&lt;p&gt;Can your AI tools really work together, or are they still stuck in silos? With Google’s new &lt;a href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/" rel="noopener noreferrer"&gt;Agent-to-Agent (A2A) protocol&lt;/a&gt;, the days of isolated AI agents are numbered. This emerging standard lets specialized agents communicate, delegate, and collaborate—unlocking a new era of modular, scalable AI systems. Here’s how A2A could transform your workflows, and why making it observable is just as important as making it possible.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Why agent-to-agent is a breakthrough for collaborative AI
&lt;/h2&gt;

&lt;p&gt;To understand why A2A is such a breakthrough, it helps to look at how AI agents have evolved. Until now, most agents have relied on the Model Context Protocol (MCP), a mechanism that lets them enrich their responses by calling out to external tools, APIs, or functions in real time.&lt;/p&gt;

&lt;p&gt;MCP has been a game-changer, connecting agents to everything from knowledge bases and analytics dashboards to external services like GitHub and Jira, giving them far more context than what’s stored in their training data.&lt;/p&gt;

&lt;p&gt;However, MCP is still fundamentally a &lt;strong&gt;single-agent architecture&lt;/strong&gt;: the agent enhances itself by calling tools.&lt;/p&gt;

&lt;p&gt;Google’s &lt;strong&gt;A2A&lt;/strong&gt; protocol takes things a step further. It introduces a standard for how multiple AI agents can discover, understand, and collaborate with one another—delegating parts of a query to the agent most capable of resolving it.&lt;/p&gt;

&lt;p&gt;In a world where agents are being trained for niche domains (e.g., finance, healthcare, customer support, or DevOps), this multi-agent collaboration model could redefine how we build intelligent applications—modular, scalable, and highly specialized.&lt;/p&gt;

&lt;h2&gt;
  
  
  The industry has already gone multi—AI is next
&lt;/h2&gt;

&lt;p&gt;To appreciate why A2A is such a meaningful step, it helps to zoom out and see the broader trend across modern infrastructure:  &lt;/p&gt;

&lt;p&gt;Across DNS, CDN, cloud, and even AI, we've seen a shift from relying on a single provider to orchestrating &lt;strong&gt;multi-vendor ecosystems&lt;/strong&gt; that optimize for performance, cost, reliability, and use-case fit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DNS&lt;/strong&gt;: Where once a single DNS provider was the norm, many enterprises now use &lt;strong&gt;multi-DNS&lt;/strong&gt; strategies for faster resolution, better geographic coverage, and built-in failover.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CDN&lt;/strong&gt;: The move from one CDN to &lt;strong&gt;multi-CDN architectures&lt;/strong&gt; enables companies to route traffic based on latency, region, or cost—while improving redundancy and performance at the edge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud&lt;/strong&gt;: With AWS, Azure, GCP, and others offering differentiated services, &lt;strong&gt;multi-cloud&lt;/strong&gt; is now a strategic choice. Teams pick the best-in-class services across vendors and reduce dependency on any single provider.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This "multi" strategy is not just about risk management—it's about specialization and optimization.&lt;/p&gt;

&lt;p&gt;Now, in the AI domain, we're witnessing the same pattern. While early adopters picked a single foundation model (e.g., GPT-4, Gemini, Claude), the next generation of intelligent systems will likely be &lt;strong&gt;multi-agent systems&lt;/strong&gt;. One agent might be optimized for data interpretation, another for decision-making, and another for domain-specific compliance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inside A2A: How agents discover and delegate in real time
&lt;/h2&gt;

&lt;p&gt;Google’s A2A protocol enables a framework where agents can collaborate dynamically. Think of this scenario:&lt;/p&gt;

&lt;p&gt;A user asks: &lt;em&gt;"What’s the weather in New York?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Agent 1 receives the query but lacks access to real-time weather data. However, it knows (via the A2A protocol) that Agent 2 is specialized in live weather updates. It queries Agent 2, gets the accurate data, and serves it back to the user—seamlessly.&lt;/p&gt;

&lt;p&gt;This interaction is powered by a few key concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Host agent (client agent):&lt;/strong&gt; The initiating agent that receives the user query and delegates it if needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Remote agent:&lt;/strong&gt; An agent capable of fulfilling specialized tasks when invoked by another.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agent card:&lt;/strong&gt; A JSON-based metadata descriptor published by agents to advertise their capabilities and endpoints—helping other agents discover and route tasks intelligently.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
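
&lt;p&gt;To make the agent card concrete, here is a minimal sketch of what such a descriptor might contain. The field names are simplified for illustration and are not the exact A2A schema:&lt;/p&gt;

```python
import json

# Hypothetical, simplified agent card: the JSON metadata a remote agent
# publishes so other agents can discover its skills and endpoint.
agent_card = {
    "name": "weather-agent",
    "description": "Provides live weather updates by city",
    "url": "http://localhost:8001",  # endpoint that accepts delegated tasks
    "capabilities": {"streaming": False},
    "skills": [
        {"id": "get_weather", "description": "Current conditions for a city"},
    ],
}

print(json.dumps(agent_card, indent=2))
```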

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9su66nutbicb3no01vsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9su66nutbicb3no01vsq.png" alt="A diagram of a customer serviceAI-generated content may be incorrect., Picture" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A2A facilitates communication between a “client” agent and a “remote” agent.&lt;/p&gt;

&lt;p&gt;I tried implementing a basic A2A interaction locally using the &lt;a href="https://github.com/google/A2A" rel="noopener noreferrer"&gt;open-source specification from Google&lt;/a&gt;. It’s remarkably modular and extensible—just like APIs revolutionized service-to-service communication, A2A may do the same for agent-to-agent orchestration.&lt;/p&gt;

&lt;p&gt;Here’s a snapshot from my local implementation:  &lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;remote agent&lt;/strong&gt; listens on port 8001, ready to receive tasks. It advertises its capabilities via an Agent Card and executes incoming requests accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d8tkn2a8ety1p7uk8it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2d8tkn2a8ety1p7uk8it.png" alt="Terminal output of the remote agent listening on port 8001" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;host agent&lt;/strong&gt; first discovers the remote agent, retrieves its capabilities, and sends a query prompt to the appropriate endpoint defined in the Agent Card. It then receives and returns the final response.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz59latknf1nw6pja45xb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz59latknf1nw6pja45xb.png" alt="Terminal output of the host agent discovering and querying the remote agent" width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
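
&lt;p&gt;The discover-then-delegate flow described above can be sketched in a few lines of Python. The card URL, the &lt;code&gt;/tasks&lt;/code&gt; path, and the payload shape are assumptions for illustration, not the exact A2A wire format:&lt;/p&gt;

```python
import json
from urllib import request

# Hypothetical sketch of the host-agent flow: fetch the remote agent's
# agent card, then POST the user's prompt to the endpoint it advertises.
# Paths and payload fields here are illustrative, not the A2A spec.

def discover(card_url, fetch=None):
    """Download and parse a remote agent's agent card."""
    fetch = fetch or (lambda url: request.urlopen(url).read())
    return json.loads(fetch(card_url))

def delegate(card, prompt, post=None):
    """Send a prompt to the task endpoint advertised in the agent card."""
    post = post or _http_post
    return post(card["url"] + "/tasks", {"prompt": prompt})

def _http_post(url, payload):
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    return json.loads(request.urlopen(req).read())

# Usage against a remote agent listening on port 8001:
# card = discover("http://localhost:8001/agent-card.json")
# answer = delegate(card, "What's the weather in New York?")
```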

&lt;h2&gt;
  
  
  Achieving end-to-end visibility in multi-agent systems
&lt;/h2&gt;

&lt;p&gt;Multi-agent AI systems bring powerful new capabilities—but also new risks. In traditional architectures, observability stops at the edge of your stack. But in an A2A world, a single user request might pass through a chain of agents—each running on different systems, owned by different teams, and dependent on different APIs.&lt;/p&gt;

&lt;p&gt;Every agent interaction is essentially a service call. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;More failure points&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Greater complexity when something goes wrong&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Take a chatbot for a ticket booking app. It may rely on internal microservices for availability and payments, but call out to a weather agent or flight-status agent using A2A. If one of those agents is slow or unresponsive, the whole experience degrades. And it’s hard to fix what you can’t see.&lt;/p&gt;

&lt;p&gt;This is where visibility matters. By mapping your service and agent dependencies—internal and external—you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Pinpoint where slowdowns or errors occur&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Understand how agents interact across the chain&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Quickly isolate root causes when something fails&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools like &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;Catchpoint’s Internet Stack Map&lt;/a&gt; help teams visualize these flows. It leverages &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM) to illustrate how requests flow through internal components and out to external agent APIs, making it clear where dependencies exist and where issues could arise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F394jbgwh2ff2sw2fwvc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F394jbgwh2ff2sw2fwvc5.png" alt="Catchpoint Internet Stack Map visualizing service and agent dependencies" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint’s Internet Stack Map&lt;/p&gt;

&lt;p&gt;Just as we evolved from single-CDN to multi-CDN, or from monolithic apps to microservices, we are now entering an age of &lt;strong&gt;multi-agent intelligence&lt;/strong&gt;. And just as we learned to monitor those distributed systems, we’ll now need to monitor multi-agent systems with the same rigor.&lt;/p&gt;

&lt;p&gt;Because the future isn’t just AI—it’s AI working together. Modular, distributed, collaborative. And IPM is what makes that visibility possible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn more&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  See how Internet Stack Map can help you stay ahead of disruptions—&lt;a href="https://www.catchpoint.com/learn-more" rel="noopener noreferrer"&gt;schedule a demo today&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ai</category>
      <category>aiops</category>
      <category>monitoring</category>
      <category>a2a</category>
    </item>
    <item>
      <title>Monitoring in the Age of Complexity: 5 Assumptions CIOs Need to Rethink</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 08 Sep 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/monitoring-in-the-age-of-complexity-5-assumptions-cios-need-to-rethink-1gh4</link>
      <guid>https://forem.com/catchpoint/monitoring-in-the-age-of-complexity-5-assumptions-cios-need-to-rethink-1gh4</guid>
      <description>&lt;p&gt;In 2025, the average enterprise juggles over 150 SaaS applications, hybrid cloud infrastructures, and a workforce that expects seamless digital experiences—yet most CIOs still rely on monitoring strategies built for the data center era. The result? A $1.5 trillion annual hit to global GDP from downtime and performance lags, according to recent industry estimates. The problem isn’t the tools—it’s the thinking behind them.&lt;/p&gt;

&lt;p&gt;Monitoring isn’t just about keeping the lights on anymore. It’s a strategic lever for resilience, customer trust, and competitive edge. But outdated assumptions about what ‘good monitoring’ looks like are holding organizations back. Here are five myths CIOs and VPs must confront to lead in an era where complexity is the only constant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Myth #1: Monitoring is just an IT operations problem
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym79wjikb6vg0xkkjhah.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym79wjikb6vg0xkkjhah.jpeg" alt="Computer screens with monitoring graphs and diagrams" width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; Monitoring is a business-critical function that impacts revenue and customer experience.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://www.forrester.com/press-newsroom/forrester-2024-us-customer-experience-index/" rel="noopener noreferrer"&gt;73% of customers&lt;/a&gt; say they’ll abandon a brand after two bad digital experiences, monitoring becomes a C-suite priority. It’s not about server uptime—it’s about revenue protection and brand equity. The narrative needs to shift from reactive alerts to proactive business alignment. Monitoring should be seen as a strategic function that directly impacts customer satisfaction and business outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Align IT metrics with business outcomes:&lt;/strong&gt; Use metrics like customer churn, conversion rates, or revenue impact instead of focusing solely on technical KPIs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adopt observability practices:&lt;/strong&gt; Integrate observability tools that provide real-time insights into how IT performance affects business outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Promote cross-functional collaboration:&lt;/strong&gt; Ensure IT teams work closely with business units to prioritize monitoring efforts that directly impact customer satisfaction.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://shapeitrecruitment.co.uk/gartners-top-strategic-tech-trends-for-2023/" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; predicts that by 2026, 70% of organizations successfully applying observability will achieve shorter latency for decision-making, enabling competitive advantage for IT and business processes. This highlights the importance of aligning monitoring with business outcomes, not just IT metrics.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Myth #2: More data equals better visibility&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; More data often creates noise; actionable insights come from focusing on the right data.&lt;/p&gt;

&lt;p&gt;Modern systems generate terabytes of telemetry daily, but effective monitoring isn’t about collecting everything—it’s about identifying patterns and correlations that matter. AI-powered tools can help prioritize signal over noise, enabling faster root cause analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focus on key metrics:&lt;/strong&gt; Identify the most critical KPIs for your business and monitor those closely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leverage AI for noise reduction:&lt;/strong&gt; Use AI-driven tools to filter irrelevant data and surface actionable insights.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement distributed tracing:&lt;/strong&gt; Understand how different services interact with &lt;a href="https://www.catchpoint.com/tracing" rel="noopener noreferrer"&gt;Tracing&lt;/a&gt; to pinpoint bottlenecks or failures more effectively.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Organizations that implement AI-powered monitoring tools should see a reduction in mean time to resolution (MTTR). This is because AI can help identify patterns and anomalies in large datasets, making it easier to pinpoint the root cause of issues.&lt;/p&gt;
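
&lt;p&gt;As a toy illustration of separating signal from noise, the sketch below flags latency samples that deviate sharply from a rolling baseline. Real AIOps tooling is far more sophisticated; the function name, window, and threshold here are illustrative assumptions only:&lt;/p&gt;

```python
from statistics import mean, stdev

# Toy noise-reduction sketch: flag latency samples more than `threshold`
# standard deviations from the mean of the preceding `window` samples.
def anomalies(samples, window=20, threshold=3.0):
    """Return (index, value) pairs that deviate sharply from the recent baseline."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            flagged.append((i, samples[i]))
    return flagged

# A steady ~100 ms latency series with one 900 ms spike:
latencies = [100 + (i % 5) for i in range(30)] + [900] + [100] * 10
print(anomalies(latencies))  # only the spike at index 30 is flagged
```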

&lt;h2&gt;
  
  
  &lt;strong&gt;Myth #3: Internal metrics tell the full story&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faawkujeh78p8heb0rsij.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faawkujeh78p8heb0rsij.jpeg" alt="Illustrative image" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; Most performance issues originate outside your firewall—true visibility requires end-to-end observability.&lt;/p&gt;

&lt;p&gt;Your cloud provider’s 99.99% uptime SLA doesn’t account for the &lt;a href="https://www.catchpoint.com/blog/cloud-monitorings-blind-spot-the-user-perspective" rel="noopener noreferrer"&gt;last mile&lt;/a&gt;—where 80% of performance issues originate. True observability looks beyond the firewall to the user’s reality. It’s essential to monitor the end-to-end user experience, including external factors that could affect performance. This holistic view ensures that organizations can address issues that impact the user experience, not just internal metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expand monitoring scope:&lt;/strong&gt; Include metrics like page load times, API response times, and third-party service performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrate business metrics with XLOs:&lt;/strong&gt; Move beyond technical KPIs to monitor customer-centric metrics such as abandonment rates, user satisfaction scores, and conversion rates. These Experience Level Objectives (XLOs) bridge IT performance with business outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Internet Performance Monitoring (IPM):&lt;/strong&gt; Simulate user interactions from different geographies with &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;IPM&lt;/a&gt; to proactively identify potential issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shift towards experience-centric monitoring will enable organizations to make more informed decisions and prioritize investments based on their impact on the bottom line.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Myth #4: AI will fix monitoring automatically&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvscrmk30qxjwzipftwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvvscrmk30qxjwzipftwq.png" alt="A computer screen showing a computer and trash cans" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; AI is only as good as the data it analyzes—clean, contextual data is essential for success.&lt;/p&gt;

&lt;p&gt;AI can enhance monitoring by identifying patterns and predicting failures, but poor data quality undermines its effectiveness. Feed it garbage, and you’ll get polished garbage out. &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; warns that by 2026, 60% of AI-driven IT projects will fail without proper data readiness.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invest in data quality:&lt;/strong&gt; Establish governance frameworks to ensure clean and consistent data inputs for AI models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adopt shift-left observability:&lt;/strong&gt; Integrate monitoring into development cycles to identify issues earlier in the lifecycle.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tailor AI solutions by context:&lt;/strong&gt; Customize AI-driven monitoring strategies based on the criticality of each application or service.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Organizations must adopt a "shift-left" approach to monitoring, where monitoring is integrated into the development lifecycle from the beginning. This allows organizations to identify and address potential issues early on, reducing the risk of costly downtime and performance problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Myth #5: Downtime is the only metric that matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo6or3vq3dhzi1yfzqhl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo6or3vq3dhzi1yfzqhl.jpeg" alt="Illustrative image" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The reality:&lt;/strong&gt; Slow is the new down—performance degradation can erode trust long before outages occur.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.catchpoint.com/statistics" rel="noopener noreferrer"&gt;53% of mobile users drop off&lt;/a&gt; if a page takes more than 3 seconds to load. Performance degradation silently erodes trust before outages even hit. Monitoring must evolve from tracking availability alone to measuring user experience metrics like page load times and transaction speeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you can do:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor user experience metrics:&lt;/strong&gt; Track latency, load times, and transaction completion rates alongside traditional uptime metrics. These XLOs ensure monitoring aligns with user satisfaction and business outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use predictive analytics:&lt;/strong&gt; Leverage historical data trends to anticipate potential slowdowns before they impact users, enabling proactive intervention.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement proactive remediation plans:&lt;/strong&gt; Automate responses for common performance issues, such as traffic spikes or resource bottlenecks, to minimize user impact and ensure seamless experiences.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
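&lt;p&gt;The first two practices can be sketched as a lightweight experience-level check. The snippet below is a minimal illustration, not a Catchpoint API: the 3-second budget and 95% target are example values echoing the mobile drop-off statistic above.&lt;/p&gt;

```python
def xlo_status(load_times_s, budget_s=3.0, target=0.95):
    """Evaluate a simple Experience Level Objective (XLO): the share of
    page loads completing within the budget must meet the target."""
    within = sum(1 for t in load_times_s if budget_s >= t) / len(load_times_s)
    return {"attainment": within, "met": within >= target}

# Example: 20 sampled page loads, one of which blows the 3-second budget.
samples = [1.2, 2.8, 1.9] * 6 + [2.5, 4.1]
print(xlo_status(samples))  # attainment 0.95 -- the XLO is just met
```

&lt;p&gt;Run against rolling windows of real-user or synthetic samples, a check like this turns “slow is the new down” into an alertable signal rather than an anecdote.&lt;/p&gt;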

&lt;p&gt;Leveraging AI-powered predictive analytics to improve IT operations will enable organizations to move from a reactive to a proactive approach to monitoring, reducing downtime and improving overall system reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rethinking monitoring for the age of complexity
&lt;/h2&gt;

&lt;p&gt;As enterprises face increasing complexity, monitoring has evolved from a back-office function to a strategic enabler of resilience, customer trust, and competitive differentiation. CIOs who cling to outdated assumptions risk falling behind—not just competitors, but their own customers’ expectations. The myths addressed in this article highlight the need for a paradigm shift in how organizations approach monitoring.&lt;/p&gt;

&lt;p&gt;Modern monitoring isn’t just about uptime or data collection; it’s about aligning IT performance with business outcomes, prioritizing user experience, and leveraging predictive analytics to stay ahead of issues. By embracing these principles, CIOs can transform monitoring into a competitive advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways for CIOs&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring is strategic:&lt;/strong&gt; Elevate monitoring from an IT operations function to a C-suite priority tied directly to revenue and customer satisfaction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focus on actionable insights:&lt;/strong&gt; Collect the right data—not just more data—and use AI-driven tools to surface meaningful patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expand visibility:&lt;/strong&gt; Go beyond internal metrics to monitor end-to-end user experiences and external factors affecting performance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prioritize data quality:&lt;/strong&gt; Invest in clean, contextual data to unlock the full potential of AI-driven monitoring.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure user experience:&lt;/strong&gt; Adopt XLOs to track metrics that reflect customer satisfaction alongside technical KPIs.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ask yourself: Are you monitoring what truly matters—or just what’s easy? The answer will define your organization’s ability to thrive in an era where seamless digital experiences are the foundation of success.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dig deeper:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Check out &lt;a href="https://vmblog.com/archive/2023/02/22/goodbye-lan-the-internet-is-the-network.aspx" rel="noopener noreferrer"&gt;&lt;em&gt;Goodbye LAN – The Internet is the Network&lt;/em&gt;&lt;/a&gt; on VMblog to learn how the Internet became the new enterprise network.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.catchpoint.com/blog/mastering-ipm-monitor-what-matters-from-where-it-matters" rel="noopener noreferrer"&gt;&lt;em&gt;Mastering IPM: Monitor what matters from where it matters&lt;/em&gt;&lt;/a&gt; – learn how Internet Performance Monitoring helps you stay resilient by focusing on the right metrics, from the right vantage points.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>monitoring</category>
      <category>cio</category>
      <category>observability</category>
    </item>
    <item>
      <title>Critical Requirements for Modern API Monitoring</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 01 Sep 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/critical-requirements-for-modern-api-monitoring-58o7</link>
      <guid>https://forem.com/catchpoint/critical-requirements-for-modern-api-monitoring-58o7</guid>
      <description>&lt;p&gt;Enterprises lose millions annually due to API outages and performance degradation. Modern observability strategies are crucial to mitigate these risks.&lt;/p&gt;

&lt;p&gt;Today, almost every system is dependent on APIs. Data integration, authentication, payment processing, and many other functions rely on multiple reliable and performant APIs. Banks around the world, for example, have adopted the &lt;a href="https://stripe.com/resources/more/open-banking-apis-explained-what-they-are-and-how-they-work" rel="noopener noreferrer"&gt;Open Banking API&lt;/a&gt; for payments, credit scoring, lending origination, fraud detection, and lots more.&lt;/p&gt;

&lt;h2&gt;
  
  
  APIs are everywhere—and critical to everything
&lt;/h2&gt;

&lt;p&gt;APIs are the invisible workers of the internet. Loading a single website, using a business application, an ATM, or a mobile app likely triggers dozens, if not hundreds or even thousands, of API calls. Each one can impact the overall service: if it is slow, the service might be slow; if it returns an error, the service may fail. Understanding the interaction between your services and the APIs they consume is critical to making those services resilient.&lt;/p&gt;

&lt;p&gt;There are different ways to monitor APIs. At a minimum, every system should proactively monitor, measure, and test its critical APIs – both your own and those of third parties. &lt;a href="https://www.catchpoint.com/application-experience/api-monitoring" rel="noopener noreferrer"&gt;API monitoring systems&lt;/a&gt; have been around for some time, ranging from a basic ping that ensures an API is reachable to more advanced multi-step, scripted, proactive monitoring that checks response time, functional validation, and more. The most advanced API monitoring incorporates Chaos Engineering methodologies, such as blocking or simulating an error on particular APIs and observing the impact on the overall system.&lt;/p&gt;

&lt;h2&gt;
  
  
  API resilience isn’t optional—here’s why
&lt;/h2&gt;

&lt;p&gt;In a world where applications interconnected via APIs are geographically distributed, touch different clouds, and traverse multiple points across the internet, simple proactive monitoring is no longer enough. A traditional approach to API monitoring is not only insufficient; it may miss many important incidents and offer little help in identifying root cause.&lt;/p&gt;

&lt;p&gt;The goal is to have &lt;em&gt;resilient APIs&lt;/em&gt;. The &lt;a href="https://www.catchpoint.com/blog/reachability-availability-performance-reliability" rel="noopener noreferrer"&gt;formula for resilience&lt;/a&gt; is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reachability:&lt;/strong&gt; Can I get to the API from where I am (or, in the case of an API, from wherever its consumers are)?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Is the API functional – does it do what it is supposed to do?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Does the API respond within the expected time?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; Can I trust the API will be working consistently?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, take this formula and combine it for every API used in your system!&lt;/p&gt;

&lt;p&gt;System Resilience = minimum resilience across all APIs in use&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rmmp6rb4g7diyxw6b0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rmmp6rb4g7diyxw6b0r.png" alt="Picture 1, Picture" width="606" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s illustrate the point:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A system that depends on 100 APIs and targets five nines of availability needs far better than five nines from each API: even at 99.999% each, 100 serial dependencies compound to roughly 99.9% overall. If designed in a resilient way, it’s possible for individual API dependencies to fail without impacting the overall system, but that design has to be carefully built and verified to be resilient.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iy61svp3iulmmhqtyn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iy61svp3iulmmhqtyn7.png" alt="Picture 1, Picture" width="530" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: One misbehaving dependency can make your whole system fail&lt;/em&gt;&lt;/p&gt;
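&lt;p&gt;The compounding effect behind this example is easy to verify: across serial, hard dependencies, availability multiplies. The short sketch below assumes failures are independent (a simplification) and shows that 100 APIs at five nines each yield only about three nines overall.&lt;/p&gt;

```python
def serial_availability(per_api: float, num_apis: int) -> float:
    """Availability of a system that hard-depends on num_apis serial APIs,
    assuming failures are independent (a simplifying assumption)."""
    return per_api ** num_apis

overall = serial_availability(0.99999, 100)  # five nines, 100 dependencies
downtime_h = (1 - overall) * 365 * 24        # expected downtime per year
print(f"overall availability: {overall:.5%}")   # roughly 99.9% -- three nines
print(f"expected downtime: ~{downtime_h:.1f} hours/year")
```

&lt;p&gt;This is why resilient design – graceful degradation, fallbacks, caching – matters as much as the raw availability of each dependency.&lt;/p&gt;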

&lt;p&gt;With this goal in mind, let’s understand the requirements for API monitoring.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What a basic API Monitoring strategy should include
&lt;/h2&gt;

&lt;p&gt;These are the foundational capabilities every API monitoring strategy should include. They help teams detect issues, ensure availability, and validate performance at a basic operational level.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Response Time:&lt;/strong&gt; Measures the time taken for an API to respond to a request, helping identify latency issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error Rate:&lt;/strong&gt; Tracks the percentage of failed requests to detect anomalies or bugs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Monitors the number of API requests processed over a specific period to ensure scalability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uptime and Availability:&lt;/strong&gt; Ensures APIs are consistently reachable and operational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging:&lt;/strong&gt; Collect detailed logs of API events, including timestamps, event types (e.g., errors, warnings), and messages, to aid in troubleshooting and post-incident analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerts:&lt;/strong&gt; Set up alerts based on predefined thresholds or anomalies (e.g., response time exceeding 200ms or error rates surpassing 5%).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functional testing:&lt;/strong&gt; Verify that API endpoints return expected results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD integration:&lt;/strong&gt; The ability to integrate monitoring into pipelines (and tools like &lt;a href="https://www.jenkins.io/" rel="noopener noreferrer"&gt;Jenkins&lt;/a&gt; or &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;) for automated creation and update of tests, also known as “monitoring as code”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proactive monitoring:&lt;/strong&gt; Use of synthetic mechanisms to continuously observe API performance to detect issues as they occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scripting:&lt;/strong&gt; Support for scripting standards such as Playwright to enable testing specific customer &amp;amp; API flows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Historic data:&lt;/strong&gt; A minimum of 13-month data retention to enable comparison of performance with the same period a year ago.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-cardinality data analysis:&lt;/strong&gt; Analyze detailed data points such as unique user IDs or session-specific information for granular insights into performance trends or anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaos Engineering:&lt;/strong&gt; Introduce errors into the system on purpose during a period of low use, or in a non-production environment in order to verify resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
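&lt;p&gt;Several of these capabilities – response time, error rate, and threshold-based alerts – can be combined into one lightweight synthetic check. The sketch below is illustrative, not a real monitoring product: the probe is stubbed (a real one would issue HTTP requests), and the 200 ms and 5% thresholds mirror the example values above.&lt;/p&gt;

```python
import statistics

def check_api(probe, samples=20, latency_budget_ms=200, error_budget=0.05):
    """Run a probe repeatedly and evaluate basic API health.
    probe() returns (ok, latency_ms) for a single synthetic request."""
    latencies, errors = [], 0
    for _ in range(samples):
        ok, latency_ms = probe()
        latencies.append(latency_ms)
        errors += 0 if ok else 1
    error_rate = errors / samples
    p95 = statistics.quantiles(latencies, n=20)[-1]  # ~95th percentile
    return {
        "p95_ms": p95,
        "error_rate": error_rate,
        "alert_latency": p95 > latency_budget_ms,   # threshold alert
        "alert_errors": error_rate > error_budget,  # threshold alert
    }

# Stubbed probe: 18 fast successes, one slow failure, one fast success.
fake = iter([(True, 120.0)] * 18 + [(False, 450.0), (True, 130.0)])
result = check_api(lambda: next(fake))
print(result)  # the single slow outlier pushes p95 over budget
```

&lt;p&gt;A real deployment would run such checks on a schedule, from multiple locations, and feed the results into the alerting and logging pipeline described above.&lt;/p&gt;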

&lt;h2&gt;
  
  
  Modern, internet-centric API monitoring requirements
&lt;/h2&gt;

&lt;p&gt;Today’s systems, however, demand more than basic uptime checks or response metrics. Modern API monitoring must account for real-world complexity—geography, infrastructure, user experience, and external dependencies. The capabilities below go beyond the basics to provide deep, actionable insight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor from where it matters&lt;/strong&gt; – Most monitoring tools have agents in cloud servers, which often have very different connectivity, resources, and bandwidth than real-world systems, and are blind to geographical differences in routing, ISP congestion, etc. It is critical to &lt;a href="https://www.catchpoint.com/blog/the-power-of-over-3000-intelligent-observability-agents" rel="noopener noreferrer"&gt;monitor from all the locations from where a system will consume an API&lt;/a&gt;, using an agent with similar characteristics. It is also important to consider where your intermediate services are. For example, you might test your full application through the customer-facing API from last-mile agents, test intermediate microservices from the cloud provider where they run, and test your back-end API from a backbone agent in the city &amp;amp; ISP where your datacenter is located – or use an enterprise agent inside the datacenter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visibility into the Internet Stack&lt;/strong&gt; – While it is useful to know &lt;em&gt;when&lt;/em&gt; an API is unresponsive or slow, it is more powerful to understand &lt;em&gt;why&lt;/em&gt;. Modern API monitoring provides insight into anything in the &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt; that impacts an API – including DNS resolution, SSL, and routing – as well as visibility into the latency and performance impact introduced by systems such as internal networks, SASE implementations, or gateways.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pugweeon1qp91yoe36b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pugweeon1qp91yoe36b.png" alt="A blue and purple rectangular chart with iconsAI-generated content may be incorrect., Picture" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Internet Stack&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt; – No modern monitoring system should hard-code credentials for a secure API, so monitoring systems must support secrets management, OAuth, tokens, and other modern authentication mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthetic Code Tracing&lt;/strong&gt; – As an API is being tested, collect and analyze code execution traces to identify server-side issues, including application, connectivity, and database problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenTelemetry Support&lt;/strong&gt; – Modern observability implementations must support OTel as the standard mechanism to share and integrate data from multiple systems and to provide flexibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focus on user experience&lt;/strong&gt; – An API is only one component of overall system performance. A payment API, for example, is likely part of an online purchase transaction, and you want to ensure the entire transaction performs – from the end-user perspective. Ideally, an operations team can see a &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;visual representation of every dependency in the user transaction&lt;/a&gt;, from end-user across the internet, network, systems, and APIs, all the way into code tracing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broad support for protocols&lt;/strong&gt; – While many APIs use REST over HTTP, it may be important for your monitoring system to test from both IPv4 and IPv6 agents, as well as support modern protocols like HTTP/3 and QUIC, MQTT for IoT applications, NTP for time synchronization, or even custom or proprietary protocols that your applications use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
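&lt;p&gt;As a concrete illustration of the authentication point, here is a minimal sketch of an OAuth2 client-credentials exchange for a synthetic monitor, with secrets drawn from the environment (or a secrets manager) rather than hard-coded. The token URL and environment-variable names are hypothetical examples, not a real API:&lt;/p&gt;

```python
import json
import os
import urllib.parse
import urllib.request

# Hypothetical sketch of credential handling for a synthetic monitor:
# secrets come from the environment, never from the test script itself.
# The URL and variable names below are illustrative, not a real service.
def build_token_request(token_url: str) -> urllib.request.Request:
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": os.environ["MONITOR_CLIENT_ID"],
        "client_secret": os.environ["MONITOR_CLIENT_SECRET"],
    }).encode()
    return urllib.request.Request(token_url, data=form, method="POST")

def fetch_token(token_url: str) -> str:
    # Exchange the credentials for a short-lived bearer token that the
    # monitoring check can then send in its Authorization header.
    with urllib.request.urlopen(build_token_request(token_url)) as resp:
        return json.loads(resp.read())["access_token"]
```

&lt;p&gt;The point of the sketch is the shape, not the specifics: rotating the secret then only requires updating the secrets store, never the monitoring scripts.&lt;/p&gt;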

&lt;h2&gt;
  
  
  Traditional vs modern API monitoring at a glance
&lt;/h2&gt;

&lt;p&gt;The following table summarizes the key differences between legacy API monitoring approaches and modern, Internet Performance Monitoring strategies that support resilience and user experience.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;Traditional API Monitoring&lt;/th&gt;&lt;th&gt;Modern API Monitoring (IPM)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Scope&lt;/td&gt;&lt;td&gt;Server-centric metrics&lt;/td&gt;&lt;td&gt;End-to-end user experience + infrastructure&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Protocol Support&lt;/td&gt;&lt;td&gt;Limited to HTTP/S, REST&lt;/td&gt;&lt;td&gt;HTTP/3, QUIC, MQTT, custom protocols&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data Granularity&lt;/td&gt;&lt;td&gt;High-cardinality data available but often limited to service boundaries&lt;/td&gt;&lt;td&gt;High-cardinality traces with cross-system correlation (user IDs, sessions)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Root Cause Analysis&lt;/td&gt;&lt;td&gt;Limited to app/server layers&lt;/td&gt;&lt;td&gt;Full internet stack (DNS, SSL, routing, etc.)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Testing Perspective&lt;/td&gt;&lt;td&gt;Cloud datacenters&lt;/td&gt;&lt;td&gt;Last mile, backbone, cloud, wireless, and enterprise intelligent agents&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Performance Context&lt;/td&gt;&lt;td&gt;API performance in the context of code&lt;/td&gt;&lt;td&gt;API performance in the context of user experience&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Alerting Methodology&lt;/td&gt;&lt;td&gt;Alert thresholds based on error rates&lt;/td&gt;&lt;td&gt;Experience scores and XLOs (Experience Level Objectives)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Visualization&lt;/td&gt;&lt;td&gt;Code-centric dashboards&lt;/td&gt;&lt;td&gt;Visual representation of everything impacting a system&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;h2&gt;
  
  
  Rethink what API monitoring should do
&lt;/h2&gt;

&lt;p&gt;It is somewhat surprising that the cloud is only 15 years old. As technology and system architecture have evolved, our monitoring has to evolve with them – including API monitoring.&lt;/p&gt;

&lt;p&gt;What is today considered “owned” or “on-premises” infrastructure is most likely hosted in a colocation datacenter, relies on a DNS and an SSL provider, connects through at least two ISPs, depends on a cloud-based authentication system, and is routed through a cloud-based security provider – plus, perhaps, a few other APIs.&lt;/p&gt;

&lt;p&gt;We hear it all the time: “My APM system shows my systems are green, but my users keep complaining.” A monitoring system that only watches your “on-premises” API will not be able to spot, diagnose, or provide useful root-cause information to prevent or resolve incidents quickly.&lt;/p&gt;

&lt;p&gt;To ensure API resilience, enterprises must invest in modern monitoring solutions that provide end-to-end visibility, proactive alerting, and automated remediation capabilities.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to modernize your API monitoring?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Discover how &lt;a href="https://www.catchpoint.com/application-experience/api-monitoring" rel="noopener noreferrer"&gt;Catchpoint’s API Monitoring&lt;/a&gt; delivers the end-to-end visibility and resilience your users expect.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Enterprises lose millions annually due to API outages and performance degradation. Modern observability strategies are crucial to mitigate these risks.&lt;/p&gt;

&lt;p&gt;Today, almost every system is dependent on APIs. Data integration, authentication, payment processing, and many other functions rely on multiple reliable and performant APIs. Banks around the world, for example, have adopted the &lt;a href="https://stripe.com/resources/more/open-banking-apis-explained-what-they-are-and-how-they-work" rel="noopener noreferrer"&gt;Open Banking API&lt;/a&gt; for payments, credit scoring, lending origination, fraud detection, and lots more.&lt;/p&gt;

&lt;h2&gt;
  
  
  APIs are everywhere—and critical to everything
&lt;/h2&gt;

&lt;p&gt;APIs are the internal workers of the internet. Connecting to a single website, using a business application like an ATM, or opening a mobile app likely triggers dozens, if not hundreds or even thousands, of API calls. Each one can impact the overall service: if it is slow, the service might be slow; if it returns an error, the service may fail. Understanding the interaction between your services and the APIs they consume is critical to making your services resilient.&lt;/p&gt;

&lt;p&gt;There are different ways to monitor APIs. At a minimum, every team should proactively monitor, measure, and test its critical APIs – both its own and those of third parties. &lt;a href="https://www.catchpoint.com/application-experience/api-monitoring" rel="noopener noreferrer"&gt;API monitoring systems&lt;/a&gt; have been around for some time, ranging from a basic ping to ensure an API is reachable to more advanced multi-step, scripted, proactive monitoring that validates response time, functionality, and more. The most advanced API monitoring incorporates Chaos Engineering methodologies, such as blocking or simulating an error on particular APIs and observing the impact on the overall system.&lt;/p&gt;

&lt;h2&gt;
  
  
  API resilience isn’t optional—here’s why
&lt;/h2&gt;

&lt;p&gt;In a world where applications interconnected via APIs are geographically distributed, touch different clouds, and traverse multiple points across the internet, simple proactive monitoring is no longer enough. A traditional approach to API monitoring is not only insufficient; it may miss many important incidents and offers little help in identifying root cause.&lt;/p&gt;

&lt;p&gt;The goal is to have &lt;em&gt;resilient APIs&lt;/em&gt;. The &lt;a href="https://www.catchpoint.com/blog/reachability-availability-performance-reliability" rel="noopener noreferrer"&gt;formula for resilience&lt;/a&gt; is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reachability:&lt;/strong&gt; Can I get to the API from where I am (or, in the case of APIs, from where its consumers are)?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability:&lt;/strong&gt; Is the API functional – does it do what it is supposed to do?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Performance:&lt;/strong&gt; Does the API respond within the expected time?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reliability:&lt;/strong&gt; Can I trust the API will be working consistently?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, take this formula and combine it for every API used in your system!&lt;/p&gt;

&lt;p&gt;System Resilience = minimum resilience across all APIs in use&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rmmp6rb4g7diyxw6b0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2rmmp6rb4g7diyxw6b0r.png" alt="Picture 1, Picture" width="606" height="124"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s illustrate the point:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A system with 100 API dependencies that targets five nines of availability needs at least five nines from each API! If designed in a resilient way, it is possible for individual API dependencies to fail without impacting the overall system – but that resilience has to be carefully designed and verified.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iy61svp3iulmmhqtyn7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2iy61svp3iulmmhqtyn7.png" alt="Picture 1, Picture" width="530" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1: One misbehaving dependency can make your whole system fail&lt;/em&gt;&lt;/p&gt;
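&lt;p&gt;To see how quickly serial dependencies compound, here is a quick back-of-the-envelope calculation (an illustrative sketch assuming independent failures and no fallback logic; the numbers are a worked example, not from the article):&lt;/p&gt;

```python
# Serial composition: the system works only when every dependency works.
# Assuming independent failures, system availability is the product of
# the per-API availabilities, which erodes fast as dependencies pile up.
def system_availability(per_api: float, num_apis: int) -> float:
    return per_api ** num_apis

five_nines = 0.99999  # 99.999% availability per API
print(f"{system_availability(five_nines, 100):.5f}")  # prints 0.99900
```

&lt;p&gt;Even at five nines per dependency, 100 serial dependencies leave roughly three nines at the system level – which is exactly why graceful degradation around individual API failures has to be designed in and verified.&lt;/p&gt;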

&lt;p&gt;With this goal in mind, let’s understand the requirements for API monitoring.  &lt;/p&gt;

&lt;h2&gt;
  
  
  What a basic API Monitoring strategy should include
&lt;/h2&gt;

&lt;p&gt;These are the foundational capabilities every API monitoring strategy should include. They help teams detect issues, ensure availability, and validate performance at a basic operational level.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Response Time:&lt;/strong&gt; Measures the time taken for an API to respond to a request, helping identify latency issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error Rate:&lt;/strong&gt; Tracks the percentage of failed requests to detect anomalies or bugs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Monitors the number of API requests processed over a specific period to ensure scalability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uptime and Availability:&lt;/strong&gt; Ensures APIs are consistently reachable and operational.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging:&lt;/strong&gt; Collect detailed logs of API events, including timestamps, event types (e.g., errors, warnings), and messages, to aid in troubleshooting and post-incident analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerts:&lt;/strong&gt; Set up alerts based on predefined thresholds or anomalies (e.g., response time exceeding 200ms or error rates surpassing 5%).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Functional testing:&lt;/strong&gt; Verify that API endpoints return expected results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CI/CD integration:&lt;/strong&gt; The ability to integrate monitoring into pipelines (and tools like &lt;a href="https://www.jenkins.io/" rel="noopener noreferrer"&gt;Jenkins&lt;/a&gt; or &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt;) for automated creation and updating of tests, also known as “monitoring as code”.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Proactive monitoring:&lt;/strong&gt; Use synthetic mechanisms to continuously observe API performance and detect issues as they occur.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scripting:&lt;/strong&gt; Support for scripting standards such as Playwright to enable testing specific customer &amp;amp; API flows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Historic data:&lt;/strong&gt; A minimum of 13 months of data retention to enable comparison of performance with the same period a year ago.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High-cardinality data analysis:&lt;/strong&gt; Analyze detailed data points such as unique user IDs or session-specific information for granular insights into performance trends or anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chaos Engineering:&lt;/strong&gt; Introduce errors into the system on purpose during a period of low use, or in a non-production environment in order to verify resilience.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
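&lt;p&gt;To make the basics concrete, here is a minimal synthetic check in the spirit of the list above (a hedged sketch: the 200 ms threshold echoes the alerting example earlier, and the URL and result fields are illustrative; a functional check would additionally assert on the response body):&lt;/p&gt;

```python
import time
import urllib.request

# Minimal synthetic API check: measure total response time and flag the
# result against simple availability and latency thresholds.
def check_api(url: str, timeout_s: float = 5.0, max_latency_ms: float = 200.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
            resp.read()  # drain the body so timing includes the transfer
    except Exception as exc:
        return {"ok": False, "error": str(exc)}
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "ok": status == 200 and latency_ms <= max_latency_ms,
        "status": status,
        "latency_ms": latency_ms,
    }
```

&lt;p&gt;A real monitor would run such a check on a schedule from many vantage points and alert when &lt;code&gt;ok&lt;/code&gt; is false or latency trends upward.&lt;/p&gt;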

&lt;h2&gt;
  
  
  Modern, internet-centric API monitoring requirements
&lt;/h2&gt;

&lt;p&gt;Today’s systems, however, demand more than basic uptime checks or response metrics. Modern API monitoring must account for real-world complexity—geography, infrastructure, user experience, and external dependencies. The capabilities below go beyond the basics to provide deep, actionable insight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor from where it matters&lt;/strong&gt; – most monitoring tools have agents in cloud servers, which often have very different connectivity, resources, and bandwidth than real-world systems, and are blind to geographical differences in routing, ISP congestion, etc. It is critical to &lt;a href="https://www.catchpoint.com/blog/the-power-of-over-3000-intelligent-observability-agents" rel="noopener noreferrer"&gt;monitor from all the locations from where a system will consume an API&lt;/a&gt;, using an agent that has similar characteristics. It is also important to consider where your intermediate services are. For example, you may test your full application from the customer-facing API from last-mile agents. Then test intermediate microservices from the cloud provider where they're located, and test your back-end API from a backbone agent in the city &amp;amp; ISP where your datacenter is located or use an enterprise agent inside the datacenter.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visibility into the Internet Stack&lt;/strong&gt; – While it is useful to know &lt;em&gt;when&lt;/em&gt; an API is unresponsive or slow, it is more powerful to understand why. Modern API monitoring provides insight into anything in the &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt; that impacts an API including DNS resolution, SSL, routing, etc. – as well as visibility into the impact on latency and performance introduced by systems such as internal networks, SASE implementations, or gateways.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pugweeon1qp91yoe36b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pugweeon1qp91yoe36b.png" alt="A blue and purple rectangular chart with iconsAI-generated content may be incorrect., Picture" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Internet Stack&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authentication&lt;/strong&gt; – No modern monitoring system should have hard-coded credentials into a secure API, therefore monitoring systems must support secrets management, OAuth, tokens and modern authentication mechanisms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Synthetic Code Tracing&lt;/strong&gt; - As an API is being tested, collect and understand code execution traces to identify server-side issues including application, connectivity, and database problems.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Open Telemetry Support&lt;/strong&gt; – modern observability implementations must support OTel as the standard mechanism to share and integrate data from multiple systems and to provide flexibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Focus on User experience&lt;/strong&gt; – An API is only a component of a broader overall system performance. A payment API is likely a component of an online purchase transaction. You want to ensure the entire transaction system performs – from the end-user perspective. Ideally, an operations team would be able to see a &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;visual representation of every dependency in the user transaction&lt;/a&gt;, from end-user across the internet, network, systems, APIs – all the way into code tracing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Broad support for protocols&lt;/strong&gt; – While many APIs use REST over HTTP, it may be important for your monitoring system to be able to test from both IPv4 and IPv6 agents, as well as support modern protocols like http/3 and QUIC,  MQTT for IoT applications, NTP for time synchronization, or even custom or proprietary protocols that your applications use.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Traditional vs modern API monitoring at a glance
&lt;/h2&gt;

&lt;p&gt;The following table summarizes the key differences between legacy API monitoring approaches and modern, Internet Performance Monitoring strategies that support resilience and user experience.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feature: Scope

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Server-centric metrics&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): End-to-end user experience + infrastructure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Protocol Support

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Limited to HTTP/S, REST&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): HTTP/3, QUIC, MQTT, custom protocols&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Data Granularity

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: High-cardinality data available but often limited to service boundaries&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): High-cardinality traces with cross-system correlation (user IDs, sessions)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Root Cause Analysis

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Limited to app/server layers&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): Full internet stack (DNS, SSL, routing, etc.)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Testing Perspective

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Cloud datacenters&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): Last mile, backbone, cloud, wireless, and enterprise intelligent agents&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Performance Context

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: API performance in the context of code&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): API performance in context of user experience&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Alerting Methodology

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Alert thresholds based on error rates&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): Experience scores and XLOs&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Feature: Visualization

&lt;ul&gt;
&lt;li&gt;Traditional API Monitoring: Code-centric dashboards&lt;/li&gt;
&lt;li&gt;Modern API Monitoring (IPM): Visual representation of everything impacting a system&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Rethink what API monitoring should do
&lt;/h2&gt;

&lt;p&gt;It is somewhat surprising that the cloud is only 15 years old. As technology and system architecture has evolved, our monitoring has to evolve including API monitoring.  &lt;/p&gt;

&lt;p&gt;What is today considered “owned” or “on-premises” infrastructure is most likely in a colocation datacenter, relying on a DNS and SSL provider, connected through at least two ISPs, depending on a cloud-based authentication system, connected through a cloud-based security provider and maybe a few other APIs.&lt;/p&gt;

&lt;p&gt;We hear all the time “My APM system shows my systems are green, but my users keep complaining”. A monitoring system that only monitors your “on premises” API is not going to be able to spot, diagnose, or provide useful root-cause information to prevent or solve incidents quickly.&lt;/p&gt;

&lt;p&gt;To ensure API resilience, enterprises must invest in modern monitoring solutions that provide end-to-end visibility, proactive alerting, and automated remediation capabilities.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to modernize your API monitoring?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Discover how &lt;a href="https://www.catchpoint.com/application-experience/api-monitoring" rel="noopener noreferrer"&gt;Catchpoint’s API Monitoring&lt;/a&gt; delivers the end-to-end visibility and resilience your users expect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;‍&lt;/p&gt;


</description>
      <category>api</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Why Intelligent Traffic Steering is Critical for Performance and Cost Optimization</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 25 Aug 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/why-intelligent-traffic-steering-is-critical-for-performance-and-cost-optimization-1om9</link>
      <guid>https://forem.com/catchpoint/why-intelligent-traffic-steering-is-critical-for-performance-and-cost-optimization-1om9</guid>
<description>&lt;p&gt;In today’s world of globally distributed applications, user experience is everything. Whether your platform runs across multiple cloud providers or uses a multi-CDN setup with numerous points of presence (PoPs), efficiently routing user traffic can make or break performance. That's where intelligent traffic steering becomes not just a nice-to-have, but a must-have.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70dieewamr3bf4u3tmhx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F70dieewamr3bf4u3tmhx.png" alt="This diagram illustrates the complex web of connections between a modern core banking system and its extended digital ecosystem. It includes cloud-based services, APIs, third-party integrations, CDN layers, internal networks, and diverse user access points. Components are connected via ISPs and DNS lookups, highlighting potential performance bottlenecks and visibility gaps across digital services. " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;End-to-end ecosystem of a core banking system&lt;/p&gt;

&lt;p&gt;At our recent joint webinar with IBM NS1, we explored how Catchpoint's real-time &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Internet Performance Monitoring&lt;/a&gt; (IPM) data integrates with NS1's powerful traffic steering capabilities to solve a critical problem: ensuring traffic is routed not just to the nearest server, but to the best-performing one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The challenge: static routing in a dynamic world
&lt;/h2&gt;

&lt;p&gt;Traditionally, DNS-based routing strategies like round-robin or proximity-based decisions have been used to direct user traffic. However, these methods don't account for real-time performance metrics. A nearby server might be under heavy load or facing regional ISP issues, yet static routing would still direct users there, resulting in higher latency and degraded experience.&lt;/p&gt;

&lt;p&gt;This issue is magnified in multi-cloud and multi-CDN environments, where routing inefficiencies can also drive up cloud and infrastructure costs. Without real-time visibility, you’re essentially flying blind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8juctmxnofjfutsspop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8juctmxnofjfutsspop.png" alt="The maps on the left compare AWS and Oracle response times across different global regions, highlighting areas where performance varies significantly. Green indicates faster response times, while red signals slower performance. The lower maps offer a city-level breakdown. The chart on the right shows traffic distribution before dynamic routing, with traffic nearly evenly split: 52% to AWS and 48% to Oracle." width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Regional performance differences and traffic distribution between AWS and Oracle cloud environments&lt;/p&gt;

&lt;p&gt;The image above provides a great example of this challenge in a multi-cloud environment using AWS and Oracle. The global response time maps show noticeable regional performance differences between the two providers. Yet, because traffic is being routed using a static round-robin method (seen in the 52% AWS / 48% Oracle split), users in underperforming regions are still being directed to slower endpoints. For instance, the Oracle path shows significant latency in North America and Oceania, while AWS performs better there. Still, the lack of performance-aware steering results in an inefficient and inconsistent user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Adaptive traffic steering with Catchpoint and IBM NS1
&lt;/h2&gt;

&lt;p&gt;By combining Catchpoint’s extensive &lt;a href="https://www.catchpoint.com/global-observability-network" rel="noopener noreferrer"&gt;Global Agent Network&lt;/a&gt;—spanning 2,880+ intelligent agents and millions of connected devices—with IBM NS1’s real-time DNS traffic steering capabilities, organizations can shift from reactive to proactive, performance-driven routing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcizkpzur7ctujx5ox7yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcizkpzur7ctujx5ox7yn.png" alt="This graphic illustrates the scale and distribution of Catchpoint’s monitoring infrastructure, showcasing the world’s largest independent agent network. With 2,986 intelligent agents deployed across 106 countries and 348 cities." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint’s Global Agent Network&lt;/p&gt;

&lt;p&gt;Check out the video below for an overview of how Catchpoint and IBM NS1 work together to enable real-time, performance-aware traffic steering.  &lt;/p&gt;

&lt;p&gt;Catchpoint IPM continuously collects metrics such as DNS resolution time, connect time, SSL, wait time, and total response time using synthetic tests from backbone, last mile, cloud, and wireless networks. This real-time telemetry is then fed into NS1's filter chain logic.&lt;/p&gt;

&lt;p&gt;NS1 applies this data through its intelligent filter chains, which support decisions based on geolocation, ASN, availability, and actual performance. The diagram below illustrates how the end user’s DNS request is routed through NS1, where IPM-powered performance metrics guide the decision-making engine. This ensures traffic is not just sent to the closest server, but the best-performing server at that moment.&lt;/p&gt;
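&lt;p&gt;Conceptually, the performance-aware step of such a decision reduces to something like the following (a simplified sketch, not NS1's actual filter-chain implementation; the endpoint names and latency figures are invented for illustration):&lt;/p&gt;

```python
# Simplified sketch of a performance-aware steering decision: drop
# unavailable endpoints, then pick the one with the lowest observed
# latency from the most recent telemetry.
def pick_endpoint(candidates: dict) -> str:
    healthy = {name: m for name, m in candidates.items() if m["available"]}
    return min(healthy, key=lambda name: healthy[name]["latency_ms"])

observed = {
    "aws-us-east-1":  {"available": True,  "latency_ms": 120},
    "oracle-us-east": {"available": True,  "latency_ms": 180},
    "aws-eu-west-1":  {"available": False, "latency_ms": 40},  # down: excluded
}
best = pick_endpoint(observed)  # "aws-us-east-1"
```

&lt;p&gt;Real filter chains layer geolocation, ASN, and availability filters before a latency comparison like this one, but the principle is the same: route on measured performance, not on static proximity.&lt;/p&gt;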

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv4pzb4hpkrj26rzbcjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpv4pzb4hpkrj26rzbcjn.png" alt="This diagram shows the flow of a DNS request and how real-time performance metrics power intelligent traffic steering with NS1 and Catchpoint." width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How real time metrics power intelligent traffic steering with NS1 and Catchpoint&lt;/p&gt;

&lt;p&gt;This feedback loop enables true adaptive routing, drastically reducing latency and improving reliability without waiting for end-user impact to trigger change.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Feeding Catchpoint Data into IBM NS1 Connect
&lt;/h2&gt;

&lt;p&gt;Catchpoint enables real-time performance telemetry to be pushed directly into IBM NS1 Connect using Data Webhooks. This integration allows traffic-steering logic to be based on real, actionable insights rather than static assumptions.&lt;/p&gt;

&lt;p&gt;Metrics such as DNS time, connect time, SSL handshake duration, wait time (time to first byte), and overall response time are just the start. Users can also feed in custom data sources, such as CDN-specific server-timing metrics or application-specific performance KPIs. This flexibility ensures the traffic-steering logic is tailored to business and technical requirements, whether optimizing for speed, reliability, or cost.&lt;/p&gt;
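&lt;p&gt;To show the shape of such an integration, here is a hypothetical sketch of assembling one telemetry datapoint for a steering feed (the URL, field names, and metric keys are invented for illustration and are not the actual Catchpoint or NS1 webhook schema):&lt;/p&gt;

```python
import json
import urllib.request

# Hypothetical telemetry record: one measurement for one endpoint.
# All names below are illustrative, not a real webhook payload format.
datapoint = {
    "endpoint": "aws-us-east-1",
    "metrics": {
        "dns_ms": 18,
        "connect_ms": 42,
        "ssl_ms": 55,
        "wait_ms": 120,      # time to first byte
        "response_ms": 310,  # total response time
    },
}

def build_webhook_request(url: str, record: dict) -> urllib.request.Request:
    # Serialize the record as JSON and POST it to the steering feed.
    body = json.dumps(record).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_webhook_request("https://example.com/steering-feed", datapoint)
```

&lt;p&gt;Whatever the exact schema, the pattern is the same: each measurement cycle pushes fresh per-endpoint metrics so the steering engine always decides on current data.&lt;/p&gt;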

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh83xrm7np5ir4s5svvqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh83xrm7np5ir4s5svvqw.png" alt="This dashboard shows how real user metrics from Catchpoint are fed into IBM NS1 Connect to compare performance across different cloud providers—in this case, AWS and Oracle. The graph visualizes real-time response times over a 6-hour window, highlighting latency variations that inform traffic steering decisions. " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using Catchpoint IPM data in IBM NS1 Connect to compare multi-cloud performance&lt;/p&gt;

&lt;p&gt;All this data is consumed by NS1 Pulsar's filter chain engine, which uses it to dynamically apply routing rules. These filters can be chained together based on geography, ASN, latency, availability, or any of the performance metrics provided by Catchpoint IPM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wiy76ubfavdhg1ukafm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0wiy76ubfavdhg1ukafm.png" alt="NS1 Pulsar filter chain using Catchpoint data to route traffic, with dashboards showing traffic distribution across cloud providers and ASNs." width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NS1 Pulsar filter chain powered by Catchpoint IPM data&lt;/p&gt;

&lt;p&gt;As a result, routing decisions aren't just smart; they're precisely aligned with your most up-to-date performance landscape.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world impact: Better performance and lower costs
&lt;/h2&gt;

&lt;p&gt;In our joint implementation example, traffic was initially split evenly between AWS and Oracle cloud environments. After enabling dynamic routing with IPM-fed data, 77% of the traffic shifted to the better-performing cloud region, significantly reducing wait times and improving end-user experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdmipr4aj4ocnz1pyj3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdmipr4aj4ocnz1pyj3e.png" alt="This side-by-side comparison shows traffic distribution across AWS and Oracle before and after implementing performance-based traffic steering. Before steering, traffic was nearly evenly split (52% AWS, 48% Oracle). After steering, 77% of traffic shifted to AWS, resulting in more efficient regional routing and improved user experience." width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Impact of traffic steering on cloud traffic distribution&lt;/p&gt;

&lt;p&gt;The results speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;86.57% reduction in wait time&lt;/strong&gt; on AWS&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;85.30% reduction in wait time&lt;/strong&gt; on Oracle&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
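&lt;p&gt;For clarity, a reduction percentage is simply (before - after) / before. The before/after wait times below are hypothetical, chosen only to reproduce reductions of the magnitude reported above:&lt;/p&gt;

```python
def pct_reduction(before_ms: float, after_ms: float) -> float:
    """Percentage reduction in wait time between two measurements."""
    return (before_ms - after_ms) / before_ms * 100

# Hypothetical geometric-mean wait times before and after steering
aws = pct_reduction(350.0, 47.0)      # roughly 86.6% reduction
oracle = pct_reduction(400.0, 58.8)   # roughly 85.3% reduction
```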

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhu2pgwdmxz9odm1xmsm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhu2pgwdmxz9odm1xmsm.png" alt="This graph compares the reduction in wait time (GM Wait in ms) for AWS and Oracle environments after implementing intelligent traffic steering. AWS saw an 86.57% reduction, while Oracle experienced an 85.30% reduction, demonstrating the effectiveness of routing decisions informed by real-time performance data. " width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wait time improvements after enabling performance-based traffic steering&lt;/p&gt;

&lt;p&gt;These improvements not only enhance digital experience but also minimize the frustration and churn that come with sluggish performance.&lt;/p&gt;

&lt;p&gt;Beyond latency improvements, traffic steering also drives operational efficiency. By understanding which cloud or PoP performs best in specific regions, businesses can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Scale down underperforming infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimize cloud costs by avoiding overprovisioning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improve regional performance by steering traffic intelligently&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the power of proactive, performance-aware traffic steering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proactive, not reactive
&lt;/h2&gt;

&lt;p&gt;Most traditional monitoring tools rely on real user monitoring (RUM) data, which only becomes useful after users have already experienced performance issues. Catchpoint IPM flips this model by using &lt;a href="https://www.catchpoint.com/synthetic-monitoring" rel="noopener noreferrer"&gt;synthetic testing&lt;/a&gt; for proactive insights. This means traffic can be rerouted before users are impacted, creating a continuous feedback loop that enhances reliability and experience.&lt;/p&gt;
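&lt;p&gt;The rerouting decision itself can be thought of as a small control loop over recent synthetic samples. This sketch uses hypothetical numbers and a simplified damping rule: it picks the provider with the lowest median latency, but only switches when the improvement is large enough to avoid flapping between near-equal providers.&lt;/p&gt;

```python
from statistics import median

# Synthetic-test latency samples per provider (ms), hypothetical values
samples = {
    "aws":    [51, 47, 55, 49, 250],   # one outlier; median absorbs it
    "oracle": [92, 88, 95, 90, 91],
}

def steer(samples, current="oracle", improvement=0.2):
    """Return the provider to route to.

    Switch away from `current` only if the best candidate's median
    latency beats it by at least `improvement` (here 20%).
    """
    best = min(samples, key=lambda p: median(samples[p]))
    if best != current and median(samples[best]) < (1 - improvement) * median(samples[current]):
        return best
    return current

target = steer(samples)
```

&lt;p&gt;Using the median rather than the mean keeps a single outlier sample from triggering a spurious switch.&lt;/p&gt;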

&lt;h2&gt;
  
  
  Next up: the setup guide
&lt;/h2&gt;

&lt;p&gt;In an upcoming post, we’ll walk through the step-by-step implementation of intelligent traffic steering—from configuring Catchpoint tests to feeding metrics into NS1 and setting up intelligent filter chains. Watch this space.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn More&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Check out the &lt;a href="https://www.ibm.com/docs/en/ns1-connect" rel="noopener noreferrer"&gt;IBM NS1 Connect Documentation&lt;/a&gt; to get started with traffic steering and DNS configuration.  &lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>The $1 Million Lesson: Building a Culture of Quality Through SLAs</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 18 Aug 2025 04:00:00 +0000</pubDate>
      <link>https://forem.com/catchpoint/the-1-million-lesson-building-a-culture-of-quality-through-slas-55i8</link>
      <guid>https://forem.com/catchpoint/the-1-million-lesson-building-a-culture-of-quality-through-slas-55i8</guid>
      <description>&lt;p&gt;In the early days of DoubleClick, back when SaaS was still known as Application Service Provider (ASP), I was tasked with setting up the QoS (Quality of Service) Team. Our primary mission was to establish a monitoring system, but we quickly found ourselves managing Service Level Agreements (SLAs)—a task that became critical after we paid out over &lt;strong&gt;$1 million in penalties&lt;/strong&gt; for SLA violations to a single customer. The reason? Someone had signed a contract promising &lt;strong&gt;100% uptime&lt;/strong&gt;, an impossible commitment.  &lt;/p&gt;

&lt;p&gt;This is the story of how we took control of our SLAs, stopped the financial bleeding, and built a culture of quality around service metrics. Whether you’re managing SLAs today or just curious about how they work, this post will provide valuable insights into the challenges we faced, the solutions we implemented, and the lessons we learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are SLAs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi27ag99x6l839alc1i5e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi27ag99x6l839alc1i5e.jpeg" alt="A clipboard with a pen and paper clipsAI-generated content may be incorrect." width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;SLA (Service Level Agreement)&lt;/strong&gt; is a contractual agreement between a vendor and a customer that outlines the expected level of service. Under this legal umbrella, you’ll find &lt;strong&gt;Service Level Objectives (SLOs)&lt;/strong&gt;, which define specific metrics like uptime, speed, or transactions per second.&lt;/p&gt;
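&lt;p&gt;One way to make an availability SLO concrete is to translate it into a downtime budget. The sketch below shows why a 100% uptime promise is untenable: its budget is exactly zero minutes.&lt;/p&gt;

```python
def allowed_downtime_minutes(slo_pct: float, days: int = 30) -> float:
    """Downtime budget implied by an availability SLO over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)

# "Three nines" over a 30-day month allows about 43 minutes of downtime;
# a 100% SLO allows none, so any outage at all is a breach.
three_nines = allowed_downtime_minutes(99.9)
perfect = allowed_downtime_minutes(100.0)
```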

&lt;p&gt;At DoubleClick, we defined SLAs with the following principles in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attainable:&lt;/strong&gt; The goals should be realistic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeatable:&lt;/strong&gt; The metrics should be consistently measurable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measurable:&lt;/strong&gt; The performance should be quantifiable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Meaningful:&lt;/strong&gt; The metrics should matter to the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mutually Acceptable:&lt;/strong&gt; Both parties should agree on the terms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLAs benefit both the customer and the vendor. For customers, they provide &lt;strong&gt;objective grading criteria&lt;/strong&gt; and protection from poor service. For vendors, they set clear expectations and incentivize quality improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ground zero, discovery
&lt;/h2&gt;

&lt;p&gt;When we first tackled the SLA problem, we were in crisis mode. The first step was to compile a list of all contracts, extract the SLAs and SLOs, and document the associated penalties. We stored this information in a database and began educating stakeholders—business leaders, legal teams, and executives—about the importance of SLAs.&lt;/p&gt;

&lt;p&gt;From the beginning, we focused on &lt;strong&gt;end-user experience-based SLAs&lt;/strong&gt;. This meant measuring performance from the user’s perspective, not just from the server’s perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  A universal challenge
&lt;/h2&gt;

&lt;p&gt;Over the years, I’ve seen many companies face similar issues. Not all SRE and Dev teams fully grasp the SLAs their organization has with customers—they often focus heavily on internal SLOs while overlooking how those metrics tie directly to contractual commitments. For instance, after facing significant penalties, companies like &lt;a href="https://www.techtarget.com/searchunifiedcommunications/news/252470248/Slack-waters-down-cloud-SLA-after-82-million-payout" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; revised their SLA terms to better align internal goals with customer promises.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLA Application Performance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gw8kr33l4lp4lp5r73q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gw8kr33l4lp4lp5r73q.jpeg" alt="A person typing on a computerAI-generated content may be incorrect." width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Establishing an SLA is more than just putting a few sentences in a contract. The reason we paid $1 million is that there was no SLA management system in place. In response, we built a Service Level Management (SLM) practice that rested on four pillars: Administration, Monitoring, Reporting, and Compliance (AMRC).&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLM process
&lt;/h2&gt;

&lt;p&gt;We sat down with business partners, customers, legal, and finance teams to create a process that would prevent costly mistakes in the future. This process, which we called the SLA lifecycle, was reviewed quarterly to ensure it remained effective and aligned with our business goals.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Risk simulations with data science:&lt;/strong&gt; One of the most critical steps in our SLM process was using our in-house data scientists to run simulations. These simulations analyzed historical data from our monitoring tools to assess the risk of breaching SLAs. The goal was to set realistic SLAs that wouldn’t be breached every day, while still meeting customer expectations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“What-if” scenarios:&lt;/strong&gt; We also ran multiple “what-if” scenarios to understand the relationship between availability and revenue. These scenarios helped us evaluate the impact of downtime at different hours of the day and days of the week. For example, we could see how a 10-minute outage during peak traffic hours would affect revenue compared to the same outage during off-peak times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The SLA desk:&lt;/strong&gt; To streamline the process, we created an online tool in 2001—essentially an “SLA desk”—that allowed our sales team to request SLA portfolios for customers. These requests were reviewed and approved by our QoS team, ensuring that every SLA was realistic, measurable, and aligned with our capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
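&lt;p&gt;A "what-if" scenario of the second kind can be sketched as a simple weighted model: the revenue lost to an outage depends on when it happens. The hourly weights and revenue figure below are hypothetical stand-ins for the historical traffic and revenue data our data scientists actually used:&lt;/p&gt;

```python
# Hypothetical share of daily revenue earned in each hour of the day
# (overnight hours light, midday peak heavy); the 24 weights sum to 1.0.
hourly_weight = [0.01] * 6 + [0.05] * 4 + [0.08] * 6 + [0.04] * 6 + [0.01] * 2
daily_revenue = 500_000.0  # hypothetical daily revenue in dollars

def outage_cost(start_hour: int, minutes: int) -> float:
    """Estimate revenue lost to an outage starting at `start_hour`."""
    cost, remaining, hour = 0.0, minutes, start_hour
    while remaining > 0:
        in_this_hour = min(60, remaining)
        cost += daily_revenue * hourly_weight[hour % 24] * in_this_hour / 60
        remaining -= in_this_hour
        hour += 1
    return cost

peak = outage_cost(12, 10)      # 10-minute outage during peak traffic
off_peak = outage_cost(3, 10)   # the same outage at 3 a.m.
```

&lt;p&gt;Even this toy model makes the point: the identical 10-minute outage costs several times more at noon than overnight, which is exactly the kind of asymmetry the simulations surfaced.&lt;/p&gt;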

&lt;h2&gt;
  
  
  Aligning external and internal SLAs
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges we faced was the mismatch between external SLAs (what we promised customers) and internal SLAs (what we measured internally). For example, customers would ask for ad-serving uptime, while our tech team measured server availability.&lt;/p&gt;

&lt;p&gt;To solve this, we aligned our external and internal SLOs and set the internal objectives (the targets) significantly higher than the external commitments. This was a huge victory because it allowed us to rely on one set of metrics to understand our SLA risk position and drive operational excellence. Our tech group (Ops, Engineering, etc.) also became more sensitive to the notion of a business SLA and started to care deeply about not breaching them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring – the key to SLA success
&lt;/h2&gt;

&lt;p&gt;For availability and performance, we relied on three synthetic products. Internally, we ran SiteScope in 17 data centers, and we used two external synthetic products. We wanted as many data points as possible from as many tools as possible; the stakes were simply too high not to invest in multiple tools. The entire SLM project was not cheap to implement and run on an annual basis, but I had learned the hard way the cost of not doing it right.&lt;/p&gt;

&lt;p&gt;For monitoring, it became clear we needed to test as often as possible from as many vantage points as possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you only check your SLO endpoints once an hour, an issue can go undetected for up to 59 minutes, and a single failed check can register as a full hour of downtime, producing false downtime alerts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You also need many data points to ensure statistical significance. Smaller datasets lower precision and power, while larger ones help manage false positives and false negatives.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
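&lt;p&gt;Both points can be made concrete with a little arithmetic. The sketch below shows the worst-case blind spot for a given check frequency, and how the margin of error on a measured availability rate shrinks with more samples (normal approximation; the numbers are illustrative):&lt;/p&gt;

```python
import math

def worst_case_blind_minutes(checks_per_hour: int) -> float:
    """Longest stretch an outage can go unobserved between checks."""
    return 60 / checks_per_hour

def availability_margin_of_error(n_checks: int, availability: float = 0.999) -> float:
    """Approximate 95% margin of error for an availability rate
    measured from n independent checks (normal approximation)."""
    return 1.96 * math.sqrt(availability * (1 - availability) / n_checks)

hourly = worst_case_blind_minutes(1)     # 60.0 minutes of potential blindness
minutely = worst_case_blind_minutes(60)  # 1.0 minute

# Hourly checks give ~720 samples/month; minutely checks give ~43,200.
coarse = availability_margin_of_error(720)
fine = availability_margin_of_error(43_200)
```

&lt;p&gt;With only hourly samples, the margin of error is larger than the entire 0.1% error budget of a 99.9% SLO; minutely sampling makes the SLO actually measurable.&lt;/p&gt;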

&lt;h2&gt;
  
  
  Enter Differential Performance Measurement (DPM)
&lt;/h2&gt;

&lt;p&gt;One of our biggest challenges was finding an effective way to measure ad delivery speed and capture it in our SLAs. Clients would look at their site performance, notice spikes, and attribute them to our system, while our own performance telemetry showed no problems. Because we couldn't correlate the two charts, we couldn't agree on whether the problem was ours or someone else's.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4orxf2j834sc8uaodyqi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4orxf2j834sc8uaodyqi.jpeg" alt="image" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidakqn7btek4wn09ugi6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidakqn7btek4wn09ugi6.jpeg" alt="image" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address this, we developed a methodology called &lt;a href="https://contracts.onecle.com/ask/doubleclick.svc.2004.02.27.shtml" rel="noopener noreferrer"&gt;Differential Performance Measurement&lt;/a&gt; (DPM). Our goal was to measure Doubleclick’s performance and availability with precision, and to understand how it affected our customers’ pages. We also wanted to be accountable for what we controlled, so we could avoid blame and finger-pointing.&lt;/p&gt;

&lt;p&gt;The methodology added context to the measurements. DPM introduced clarity and comparison, removing absolute performance numbers from the SLAs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe for Differential Performance Measurement (example with an ad):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Take two pages—one without ads and one with a single ad call.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Page A = No ads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Page B = One ad&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Make sure the pages do not contain any other third-party references (CDNs, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure the page sizes (in KB) are the same.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Bake” – Measure response times for both pages and you get the following metrics:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Differential Response&lt;/strong&gt; (DR) will be (Response Time of page B) minus (Response Time of page A)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Differential Response Percentage&lt;/strong&gt; (DRP) = DR / A. (e.g. If Page A is 2 seconds, and Page B is 2.1 seconds, DR is 0.1 second, and DRP is 0.1/2=0.05 or 5%)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach helped eliminate noise caused by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Internet-related issues beyond our control (e.g., fiber cuts).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring agent inconsistencies (raising the need to monitor our monitoring tools).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other third-party dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To visualize the impact of Differential Performance Measurement (DPM), the chart below compares response times for two scenarios&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlp6kwbyot84pvcu9sw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlp6kwbyot84pvcu9sw.jpeg" alt="image" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1:&lt;/strong&gt; The ad-serving company experienced performance issues, which negatively impacted the customer’s site. The vendor breached the SLA threshold between Time 4 and Time 8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt; The website itself encountered performance problems, unrelated to the ad-serving company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting – Transparency and Accountability
&lt;/h2&gt;

&lt;p&gt;After the $1 million penalty, SLA management became a top priority, with visibility extending all the way to the CEO. We reported monthly on compliance and breaches, using tools like DigitalFuel to detect issues in real-time.&lt;/p&gt;

&lt;p&gt;By the end of 2001, we were tracking over 100 Operational Level Agreements (OLAs), and a Culture of Quality had emerged at DoubleClick. Everyone—from engineers to executives—was aligned around business service metrics, and no one wanted to breach an SLA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned and the road ahead
&lt;/h2&gt;

&lt;p&gt;Implementing a comprehensive SLM process at DoubleClick allowed us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manage hundreds of contracts&lt;/strong&gt; with up to five SLOs each.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offer scalable SLAs&lt;/strong&gt; that could adapt to new products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce financial risks&lt;/strong&gt; by avoiding costly penalties.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain our reputation&lt;/strong&gt; by providing accurate and meaningful SLAs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Detect breaches in real-time&lt;/strong&gt;, allowing us to take proactive measures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the biggest advantages was knowing in advance when an SLA was at risk. For example, we could predict that adding four minutes of downtime would breach 12 contracts and result in $X in penalties. This insight helped our Ops team act—pausing releases or preventing any changes that could impact uptime.&lt;/p&gt;

&lt;p&gt;Some people dismiss SLAs, and in many cases, that skepticism is justified. Bad SLAs—those with unrealistic guarantees, no real penalties, or vague measurement criteria—undermine trust. I often see SLAs promising 0% packet loss, but when you ask how it’s measured, you quickly realize it’s meaningless. These kinds of SLAs give the entire concept a bad reputation.&lt;/p&gt;

&lt;p&gt;However, when done right, SLAs are essential. They align customers and vendors, reduce friction, and eliminate blame games. That said, customers need to demand useful SLAs—not just ones that sound good on paper. The goal isn’t to drive vendors out of business but to hold them accountable. If they fail to deliver, they should feel the impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of SLAs
&lt;/h2&gt;

&lt;p&gt;Back in 2001, we knew SLA management was critical, but could we have predicted how integral it would become in today’s cloud-driven world? SLAs have evolved from simple uptime guarantees to complex agreements that cover everything from latency to data residency. XLOs (Experience Level Objectives) are a thing—metrics that focus on the customer’s experience, not just the server’s performance. This shift in focus—from internal metrics to customer outcomes—is the future of performance management.&lt;/p&gt;

&lt;p&gt;Stay tuned for Part 2, where we’ll explore how businesses can align their internal metrics with what truly matters: the customer’s experience&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;p&gt;New to SLAs, SLOs, and SLIs? &lt;a href="https://www.catchpoint.com/blog/mastering-ipm-protecting-revenue-through-sla-monitoring" rel="noopener noreferrer"&gt;Read this post&lt;/a&gt; to learn the fundamentals, best practices, and how they impact service reliability.&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;p&gt;‍&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;In the early days of DoubleClick, back when SaaS was still called an Application Service Provider (ASP), I was tasked with setting up the QoS (Quality of Service) Team. Our primary mission was to establish a monitoring system, but we quickly found ourselves managing Service Level Agreements (SLAs)—a task that became critical after we paid out over &lt;strong&gt;$1 million in penalties&lt;/strong&gt; for SLA violations to a single customer. The reason? Someone had signed a contract promising &lt;strong&gt;100% uptime&lt;/strong&gt;, an impossible commitment.&lt;/p&gt;

&lt;p&gt;This is the story of how we took control of our SLAs, stopped the financial bleeding, and built a culture of quality around service metrics. Whether you’re managing SLAs today or just curious about how they work, this post will provide valuable insights into the challenges we faced, the solutions we implemented, and the lessons we learned along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are SLAs?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi27ag99x6l839alc1i5e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi27ag99x6l839alc1i5e.jpeg" alt="A clipboard with a pen and paper clips" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;SLA (Service Level Agreement)&lt;/strong&gt; is a contractual agreement between a vendor and a customer that outlines the expected level of service. Under this legal umbrella, you’ll find &lt;strong&gt;Service Level Objectives (SLOs)&lt;/strong&gt;, which define specific metrics like uptime, speed, or transactions per second.&lt;/p&gt;

&lt;p&gt;At DoubleClick, we defined SLAs with the following principles in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attainable:&lt;/strong&gt; The goals should be realistic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repeatable:&lt;/strong&gt; The metrics should be consistently measurable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measurable:&lt;/strong&gt; The performance should be quantifiable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Meaningful:&lt;/strong&gt; The metrics should matter to the business.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mutually Acceptable:&lt;/strong&gt; Both parties should agree on the terms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLAs benefit both the customer and the vendor. For customers, they provide &lt;strong&gt;objective grading criteria&lt;/strong&gt; and protection from poor service. For vendors, they set clear expectations and incentivize quality improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ground zero, discovery
&lt;/h2&gt;

&lt;p&gt;When we first tackled the SLA problem, we were in crisis mode. The first step was to compile a list of all contracts, extract the SLAs and SLOs, and document the associated penalties. We stored this information in a database and began educating stakeholders—business leaders, legal teams, and executives—about the importance of SLAs.&lt;/p&gt;

&lt;p&gt;From the beginning, we focused on &lt;strong&gt;end-user experience-based SLAs&lt;/strong&gt;. This meant measuring performance from the user’s perspective, not just from the server’s perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  A universal challenge
&lt;/h2&gt;

&lt;p&gt;Over the years, I’ve seen many companies face similar issues. Not all SRE and Dev teams fully grasp the SLAs their organization has with customers—they often focus heavily on internal SLOs while overlooking how those metrics tie directly to contractual commitments. For instance, after facing significant penalties, companies like &lt;a href="https://www.techtarget.com/searchunifiedcommunications/news/252470248/Slack-waters-down-cloud-SLA-after-82-million-payout" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; revised their SLA terms to better align internal goals with customer promises.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLA Application Performance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gw8kr33l4lp4lp5r73q.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gw8kr33l4lp4lp5r73q.jpeg" alt="A person typing on a computer" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Establishing an SLA is more than just putting a few sentences in a contract. The reason we paid $1 million is that there was no SLA Management System in place. We then built a Service Level Management (SLM) practice that rested on four pillars: Administration, Monitoring, Reporting, and Compliance (AMRC).&lt;/p&gt;

&lt;h2&gt;
  
  
  The SLM process
&lt;/h2&gt;

&lt;p&gt;We sat down with business partners, customers, legal, and finance teams to create a process that would prevent costly mistakes in the future. This process, which we called the SLA lifecycle, was reviewed quarterly to ensure it remained effective and aligned with our business goals.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Risk simulations with data science:&lt;/strong&gt; One of the most critical steps in our SLM process was using our in-house data scientists to run simulations. These simulations analyzed historical data from our monitoring tools to assess the risk of breaching SLAs. The goal was to set realistic SLAs that wouldn’t be breached every day, while still meeting customer expectations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;“What-if” scenarios:&lt;/strong&gt; We also ran multiple “what-if” scenarios to understand the relationship between availability and revenue. These scenarios helped us evaluate the impact of downtime at different hours of the day and days of the week. For example, we could see how a 10-minute outage during peak traffic hours would affect revenue compared to the same outage during off-peak times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The SLA desk:&lt;/strong&gt; To streamline the process, we created an online tool in 2001—essentially an “SLA desk”—that allowed our sales team to request SLA portfolios for customers. These requests were reviewed and approved by our QoS team, ensuring that every SLA was realistic, measurable, and aligned with our capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
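The risk simulation described in step 1 can be sketched as a simple Monte Carlo resampling of historical daily availability. This is an illustrative sketch with made-up numbers, not DoubleClick's actual tooling or data:

```python
import random

def breach_probability(daily_availability: list[float],
                       slo_target: float,
                       days_per_month: int = 30,
                       trials: int = 10_000,
                       seed: int = 42) -> float:
    """Estimate the chance a monthly availability SLO is breached by
    resampling historical daily availability measurements."""
    rng = random.Random(seed)
    breaches = 0
    for _ in range(trials):
        month = [rng.choice(daily_availability) for _ in range(days_per_month)]
        if sum(month) / days_per_month < slo_target:
            breaches += 1
    return breaches / trials

# Hypothetical history: mostly-healthy days with the occasional bad one
history = [0.9999] * 95 + [0.98] * 5
print(f"P(breach 99.9% SLO): {breach_probability(history, 0.999):.1%}")
```

A simulation like this makes the trade-off concrete: a 99.9% promise against this history would be breached in a large share of months, which is exactly the kind of unrealistic target the process was designed to catch before it reached a contract.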

&lt;h2&gt;
  
  
  Aligning external and internal SLAs
&lt;/h2&gt;

&lt;p&gt;One of the biggest challenges we faced was the mismatch between external SLAs (what we promised customers) and internal SLAs (what we measured internally). For example, customers would ask for ad-serving uptime, while our tech team measured server availability.&lt;/p&gt;

&lt;p&gt;To solve this, we aligned our external and internal SLOs and set the internal objectives (the targets) significantly higher than the external commitments. This was a huge victory because it allowed us to rely on one set of metrics to understand our SLA risk position and drive operational excellence. Our tech group (Ops, Engineering, etc.) also became more sensitive to the notion of a business SLA and started to care deeply about not breaching them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring – the key to SLA success
&lt;/h2&gt;

&lt;p&gt;For availability and performance, we relied on three synthetic products. Internally, we ran SiteScope in 17 data centers, and we used two external synthetic products. We wanted as many data points as possible from as many tools as possible; the stakes were simply too high not to invest in multiple tools. The entire SLM project was not cheap to implement and run on an annual basis, but I had learned the hard way the cost of not doing it right.&lt;/p&gt;

&lt;p&gt;For monitoring, it became clear we needed to test as often as possible from as many vantage points as possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you check your SLO endpoints only once an hour, there is a 59-minute blind spot between checks. That gap can lead to false downtime alerts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You also need many data points to ensure statistical significance. Smaller datasets lower precision and power, while larger ones help manage false positives and false negatives.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
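Both points above reduce to simple arithmetic: the check interval bounds how long an outage can go unnoticed, and a normal-approximation sample-size formula shows how many measurements a tight availability estimate needs. A rough sketch, under the assumption of independent checks:

```python
import math

def max_undetected_minutes(check_interval_minutes: float) -> float:
    """Worst case, an outage starts right after a check and hides
    for a full interval before the next one."""
    return check_interval_minutes

def samples_for_margin(availability: float, margin: float, z: float = 1.96) -> int:
    """Rough sample count (normal approximation, z=1.96 for ~95% confidence)
    needed to estimate availability within +/- `margin`."""
    p = availability
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(max_undetected_minutes(60))        # 60
# Pinning down 99.9% availability to +/- 0.05% takes thousands of samples:
print(samples_for_margin(0.999, 0.0005))
```

The numbers argue for exactly what we did: many vantage points, frequent tests, multiple tools.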

&lt;h2&gt;
  
  
  Enter Differential Performance Measurement (DPM)
&lt;/h2&gt;

&lt;p&gt;One of our biggest challenges was finding an effective way to measure ad delivery speed and capture it in our SLAs. Clients would notice spikes in their site performance and attribute them to our system, while our performance telemetry showed no problems. Because we couldn’t correlate the two charts, we couldn’t agree on whether the problem was ours or someone else’s.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4orxf2j834sc8uaodyqi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4orxf2j834sc8uaodyqi.jpeg" alt="image" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidakqn7btek4wn09ugi6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidakqn7btek4wn09ugi6.jpeg" alt="image" width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address this, we developed a methodology called &lt;a href="https://contracts.onecle.com/ask/doubleclick.svc.2004.02.27.shtml" rel="noopener noreferrer"&gt;Differential Performance Measurement&lt;/a&gt; (DPM). Our goal was to measure Doubleclick’s performance and availability with precision, and to understand how it affected our customers’ pages. We also wanted to be accountable for what we controlled, so we could avoid blame and finger-pointing.&lt;/p&gt;

&lt;p&gt;The methodology added context to the measurements. DPM introduced clarity and comparison, removing absolute performance numbers from the SLAs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe for Differential Performance Measurement (example with an ad):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Take two pages—one without ads and one with a single ad call:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Page A = No ads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Page B = One ad&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure the pages do not contain any other third-party references (CDNs, etc.).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make sure the page sizes (in KB) are the same.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Bake” – Measure response times for both pages and you get the following metrics:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Differential Response&lt;/strong&gt; (DR) = (response time of Page B) minus (response time of Page A).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Differential Response Percentage&lt;/strong&gt; (DRP) = DR / (response time of Page A). For example, if Page A takes 2 seconds and Page B takes 2.1 seconds, DR is 0.1 seconds and DRP is 0.1/2 = 0.05, or 5%.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
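The two metrics above are a one-line computation. A minimal sketch (a hypothetical helper, not DoubleClick's actual tooling):

```python
def differential_response(page_a_seconds: float, page_b_seconds: float) -> tuple[float, float]:
    """Return (DR, DRP) for a page without ads (A) and the same page
    with a single ad call (B).

    DR  = response time of B minus response time of A
    DRP = DR divided by the response time of A
    """
    dr = page_b_seconds - page_a_seconds
    drp = dr / page_a_seconds
    return dr, drp

# The worked example from the text: A = 2.0 s, B = 2.1 s
dr, drp = differential_response(2.0, 2.1)
print(f"DR = {dr:.1f}s, DRP = {drp:.0%}")  # DR = 0.1s, DRP = 5%
```

Because both pages ride the same network path from the same vantage point, subtracting A from B cancels most of the noise that plagued absolute measurements.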

&lt;p&gt;This approach helped eliminate noise caused by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Internet-related issues beyond our control (e.g., fiber cuts).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitoring agent inconsistencies (raising the need to monitor our monitoring tools).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Other third-party dependencies.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To visualize the impact of Differential Performance Measurement (DPM), the chart below compares response times for two scenarios:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlp6kwbyot84pvcu9sw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwmlp6kwbyot84pvcu9sw.jpeg" alt="image" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1:&lt;/strong&gt; The ad-serving company experienced performance issues, which negatively impacted the customer’s site. The vendor breached the SLA threshold between Time 4 and Time 8.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2:&lt;/strong&gt; The website itself encountered performance problems, unrelated to the ad-serving company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reporting – Transparency and Accountability
&lt;/h2&gt;

&lt;p&gt;After the $1 million penalty, SLA management became a top priority, with visibility extending all the way to the CEO. We reported monthly on compliance and breaches, using tools like DigitalFuel to detect issues in real time.&lt;/p&gt;

&lt;p&gt;By the end of 2001, we were tracking over 100 Operational Level Agreements (OLAs), and a Culture of Quality had emerged at DoubleClick. Everyone—from engineers to executives—was aligned around business service metrics, and no one wanted to breach an SLA.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons learned and the road ahead
&lt;/h2&gt;

&lt;p&gt;Implementing a comprehensive SLM process at DoubleClick allowed us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manage hundreds of contracts&lt;/strong&gt; with up to five SLOs each.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Offer scalable SLAs&lt;/strong&gt; that could adapt to new products.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce financial risks&lt;/strong&gt; by avoiding costly penalties.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintain our reputation&lt;/strong&gt; by providing accurate and meaningful SLAs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Detect breaches in real time&lt;/strong&gt;, allowing us to take proactive measures.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the biggest advantages was knowing in advance when an SLA was at risk. For example, we could predict that adding four minutes of downtime would breach 12 contracts and result in $X in penalties. This insight helped our Ops team act—pausing releases or preventing any changes that could impact uptime.&lt;/p&gt;
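The "four more minutes of downtime breaches 12 contracts" prediction is, at its core, a check of projected downtime against each contract's remaining allowance. A hypothetical sketch (invented portfolio, not real contract terms):

```python
def contracts_at_risk(contracts: dict[str, float],
                      downtime_so_far_min: float,
                      extra_downtime_min: float) -> list[str]:
    """Return the contracts whose monthly downtime allowance (minutes)
    would be exceeded by `extra_downtime_min` additional minutes."""
    projected = downtime_so_far_min + extra_downtime_min
    return [name for name, allowance in contracts.items() if projected > allowance]

# Allowance = minutes of downtime permitted per 30-day month
portfolio = {"acme": 43.2,    # ~99.9% uptime
             "globex": 21.6,  # ~99.95% uptime
             "initech": 4.3}  # ~99.99% uptime
print(contracts_at_risk(portfolio, downtime_so_far_min=20, extra_downtime_min=4))
# ['globex', 'initech']
```

Multiply the at-risk list by each contract's penalty clause and you get the dollar figure that made Ops pause a release.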

&lt;p&gt;Some people dismiss SLAs, and in many cases, that skepticism is justified. Bad SLAs—those with unrealistic guarantees, no real penalties, or vague measurement criteria—undermine trust. I often see SLAs promising 0% packet loss, but when you ask how it’s measured, you quickly realize it’s meaningless. These kinds of SLAs give the entire concept a bad reputation.&lt;/p&gt;

&lt;p&gt;However, when done right, SLAs are essential. They align customers and vendors, reduce friction, and eliminate blame games. That said, customers need to demand useful SLAs—not just ones that sound good on paper. The goal isn’t to drive vendors out of business but to hold them accountable. If they fail to deliver, they should feel the impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of SLAs
&lt;/h2&gt;

&lt;p&gt;Back in 2001, we knew SLA management was critical, but could we have predicted how integral it would become in today’s cloud-driven world? SLAs have evolved from simple uptime guarantees to complex agreements that cover everything from latency to data residency. XLOs (Experience Level Objectives) have now emerged: metrics that focus on the customer’s experience, not just the server’s performance. This shift in focus—from internal metrics to customer outcomes—is the future of performance management.&lt;/p&gt;

&lt;p&gt;Stay tuned for Part 2, where we’ll explore how businesses can align their internal metrics with what truly matters: the customer’s experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn more
&lt;/h2&gt;

&lt;p&gt;New to SLAs, SLOs, and SLIs? &lt;a href="https://www.catchpoint.com/blog/mastering-ipm-protecting-revenue-through-sla-monitoring" rel="noopener noreferrer"&gt;Read this post&lt;/a&gt; to learn the fundamentals, best practices, and how they impact service reliability.&lt;/p&gt;



</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>sla</category>
    </item>
    <item>
      <title>When AI tools fail: How to map your AI dependencies for proactive visibility</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Mon, 11 Aug 2025 14:32:52 +0000</pubDate>
      <link>https://forem.com/catchpoint/when-ai-tools-fail-how-to-map-your-ai-dependencies-for-proactive-visibility-ia4</link>
      <guid>https://forem.com/catchpoint/when-ai-tools-fail-how-to-map-your-ai-dependencies-for-proactive-visibility-ia4</guid>
      <description>&lt;p&gt;&lt;em&gt;(This post originally appeared on the &lt;a href="https://www.catchpoint.com/blog/when-ai-tools-fail-how-to-map-your-ai-dependencies-for-proactive-visibility" rel="noopener noreferrer"&gt;Catchpoint Blog&lt;/a&gt; )&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI platforms have experienced several service interruptions over the past few months.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdgzv5s3qqhvqzrt5ylp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdgzv5s3qqhvqzrt5ylp.png" alt="A screenshot of a social media post" width="757" height="871"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We’ve all seen the memes fly when ChatGPT, Gemini, or Perplexity go down. They’re funny at first, but then reality hits: if you rely on AI tools for work or business, these outages can grind your day to a halt. And it’s not just a glitch here or there—there’s a clear pattern of AI services failing across different platforms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;February 5-6, 2025:&lt;/strong&gt; &lt;a href="https://www.google.com/appsstatus/dashboard/incidents/8VCWK4rfLd4RjWXDsPdn" rel="noopener noreferrer"&gt;Google’s Gemini&lt;/a&gt; experienced a 23-hour disruption that affected the "Add File" and "Link File" functionalities within Gems. The outage prevented users from attaching files to their AI-driven workflows. Users had no workaround, leading to productivity loss for businesses relying on Gemini’s file-processing capabilities.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;January 23, 2025:&lt;/strong&gt; &lt;a href="https://status.openai.com/incidents/k934n9lt39h8" rel="noopener noreferrer"&gt;ChatGPT&lt;/a&gt; and several OpenAI APIs suffered elevated error rates, with users encountering "bad gateway" errors. Businesses relying on ChatGPT for automation, customer service, and content generation were left scrambling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;January 23, 2025&lt;/strong&gt;: &lt;a href="https://status.perplexity.com/cm68ne5it000crf9ewzcamb8y" rel="noopener noreferrer"&gt;Perplexity’s&lt;/a&gt; API experienced a major outage, causing timeouts and disruptions for applications relying on its AI capabilities.    &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;December 26, 2024:&lt;/strong&gt; A host of &lt;a href="https://status.openai.com/incidents/6bwlxnvdncnm" rel="noopener noreferrer"&gt;OpenAI services&lt;/a&gt; (ChatGPT, Sora video creation, plus agents, realtime speech, batch, and DALL-E APIs) suffered error rates north of 90%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;June 4, 2024:&lt;/strong&gt; On this day, &lt;a href="https://techcrunch.com/2024/06/04/ai-apocalypse-chatgpt-claude-and-perplexity-are-all-down-at-the-same-time/" rel="noopener noreferrer"&gt;multiple AI platforms&lt;/a&gt;, including OpenAI's ChatGPT, Anthropic's Claude, and Perplexity, experienced simultaneous outages. Users worldwide reported disruptions, leading to widespread discussions on social media platforms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ve officially hit a point where our dependence on AI is no longer just a possibility; it’s a reality. When these systems fail, we’re left scrambling. The real question is: &lt;strong&gt;how do you stay ahead of the next failure?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The revenue impact of AI outages
&lt;/h2&gt;

&lt;p&gt;The numbers are big and growing bigger: in 2025, global &lt;a href="https://hyperight.com/countdown-to-2025-why-now-is-the-time-to-invest-in-data-and-ai/#:~:text=The%20Tipping%20Point:%20Why%202025,invest%20in%20data%20and%20AI!" rel="noopener noreferrer"&gt;AI investments are set to exceed $500 billion&lt;/a&gt;. For many companies, AI apps like ChatGPT aren’t optional anymore—they’re mission-critical. &lt;a href="https://www.gartner.com/en/newsroom/press-releases/2023-05-03-gartner-poll-finds-45-percent-of-executives-say-chatgpt-has-prompted-an-increase-in-ai-investment#:~:text=70%25%20of%20Organizations%20Currently%20in%20Exploration%20Mode%20with%20Generative%20AI" rel="noopener noreferrer"&gt;Gartner reports&lt;/a&gt; that 70% of enterprises now use large language models (LLMs) for everyday tasks like automated customer service, marketing personalization, and real-time data crunching.&lt;/p&gt;

&lt;p&gt;When these AI systems go offline, it’s not just a minor inconvenience. In finance, a few hours of AI downtime could mean millions lost due to missed trades or undetected fraud. In eCommerce, chatbots and recommendation engines going dark mean abandoned shopping carts and fewer conversions, which translates to real money left on the table.&lt;/p&gt;

&lt;p&gt;But the damage doesn’t stop at lost revenue. Companies increasingly rely on AI-powered automation to streamline workflows, meaning that outages force employees to revert to manual processes, significantly slowing down productivity. This is particularly evident in customer support, where AI chatbots handle vast volumes of inquiries. If an outage forces companies to fall back on human agents, call center queues expand, increasing response times and leading to diminished customer satisfaction.&lt;/p&gt;

&lt;p&gt;If you’re concerned about the impact of AI outages on your business, now is the time to evaluate your AI dependencies and invest in tools that can help you stay ahead of disruptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The need for visibility: Mapping your AI dependencies  
&lt;/h2&gt;

&lt;p&gt;Even with the best monitoring strategies in place, AI outages present a unique challenge: You may know something is broken, but not necessarily &lt;strong&gt;where or why&lt;/strong&gt;. To accurately pinpoint issues, you need tools that enable you to get actionable insights into your AI dependencies—whether they originate in the application layer or the underlying &lt;a href="https://www.catchpoint.com/glossary/internet-stack" rel="noopener noreferrer"&gt;Internet Stack&lt;/a&gt;.  &lt;/p&gt;

&lt;h2&gt;
  
  
  eCommerce AI dependencies: A case study
&lt;/h2&gt;

&lt;p&gt;Consider an eCommerce company relying on an AI-powered chatbot for customer support. It relies on several key components to deliver a seamless shopping experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Front-end CDN:&lt;/strong&gt; Ensures fast content delivery to users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed Hyperscaler:&lt;/strong&gt; Acts as the origin server for dynamic content.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Search and Seller APIs&lt;/strong&gt;: Retrieve relevant product data for users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chatbot Powered by OpenAI API:&lt;/strong&gt; Handles customer inquiries and provides real-time support.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The chatbot is a critical part of the customer support workflow. When a shopper interacts with the chatbot, their request is forwarded to an external API, which then interacts with the OpenAI API to generate a response. This means the chatbot’s functionality is entirely dependent on the OpenAI API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsnqbnn3r4rr7cga2lou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frsnqbnn3r4rr7cga2lou.png" alt="A white rectangular sign with black text" width="800" height="105"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flow diagram depicting the interaction between a user, an external API, and OpenAI’s API in an eCommerce chatbot system.&lt;/p&gt;

&lt;p&gt;If the OpenAI API experiences an outage, the chatbot fails, leaving customers without support. This not only frustrates users but can also lead to lost sales and damaged customer relationships.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to map AI dependencies and stay ahead of outages
&lt;/h2&gt;

&lt;p&gt;In the eCommerce example above, the chatbot’s dependency on the OpenAI API highlights the importance of mapping AI dependencies. When outages occur, knowing exactly where the failure lies can mean the difference between minutes of downtime and hours of lost revenue. By mapping your AI dependencies, you can quickly identify the root cause of outages, reducing downtime and minimizing revenue loss. Here's how:&lt;/p&gt;

&lt;h3&gt;
  
  
  #1 Visualize your AI dependencies
&lt;/h3&gt;

&lt;p&gt;Start by creating a map of all the services and APIs your AI tools rely on. For example, if your chatbot depends on OpenAI’s API, you need to include it in your dependency map. Tools like &lt;a href="https://www.catchpoint.com/internet-stack-map" rel="noopener noreferrer"&gt;Internet Stack Map&lt;/a&gt; can help you visualize these connections, making it easier to pinpoint where failures occur when an outage happens.&lt;/p&gt;
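Even before reaching for a tool, the map itself can be as simple as an adjacency structure. Here is a minimal sketch using hypothetical service names from the eCommerce example (not a real Internet Stack Map export): given one failed dependency, walk the map in reverse to find everything it takes down.

```python
# Illustrative dependency map: service -> list of services it relies on.
DEPENDENCIES = {
    "storefront": ["cdn", "dns", "search-api"],
    "chatbot": ["external-api"],
    "external-api": ["openai-api"],
}

def impacted_by(failed: str, deps: dict) -> set:
    """Find every service that directly or transitively relies on `failed`."""
    impacted = set()
    changed = True
    while changed:
        changed = False
        for svc, reqs in deps.items():
            if svc not in impacted and (failed in reqs or impacted & set(reqs)):
                impacted.add(svc)
                changed = True
    return impacted

print(sorted(impacted_by("openai-api", DEPENDENCIES)))  # → ['chatbot', 'external-api']
```

With a map like this, an OpenAI outage immediately resolves to "chatbot down, storefront unaffected" instead of a guessing game.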

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcl983ve17mksngw5nbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdcl983ve17mksngw5nbp.png" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internet Stack Map view&lt;/p&gt;

&lt;p&gt;In the example above, the Internet Stack Map view of our eCommerce case study shows all other services are working as expected, except for the OpenAI API (highlighted in red), which impacts chatbot interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  #2 Customize your workflow
&lt;/h3&gt;

&lt;p&gt;Every AI system is unique, so your dependency map should reflect your specific architecture. Identify key components like CDNs, DNS providers, and origin servers, and ensure they’re included in your map. This customization ensures you’re prepared to troubleshoot issues that are specific to your setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  #3 Correlate data for faster insights
&lt;/h3&gt;

&lt;p&gt;Use monitoring tools that combine &lt;a href="https://www.catchpoint.com/guide-to-synthetic-monitoring/rum-vs-synthetic-monitoring" rel="noopener noreferrer"&gt;synthetic testing&lt;/a&gt; with &lt;a href="https://www.catchpoint.com/internet-sonar" rel="noopener noreferrer"&gt;real-time outage data&lt;/a&gt;. By correlating this data, you can quickly determine whether an issue is with your AI provider (e.g., OpenAI) or your own infrastructure. This reduces the time spent diagnosing problems, helps you avoid unnecessary war rooms, and saves you money.&lt;/p&gt;
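The correlation step itself is simple to sketch. Everything below is illustrative&amp;mdash;timestamps are minutes since midnight and the outage window is invented&amp;mdash;but it captures the triage question: did a failed synthetic test fall inside a known provider-outage window, or not?

```python
# Hypothetical data: one reported provider outage (02:00-03:00) and three
# failed synthetic checks, as minutes since midnight.
provider_outages = [(120, 180)]
failed_checks = [125, 150, 400]

def classify(ts: int, outages: list) -> str:
    """Attribute a failure to the provider if it falls inside a known outage window."""
    inside = any(start <= ts <= end for start, end in outages)
    return "provider" if inside else "own-infrastructure"

results = {ts: classify(ts, provider_outages) for ts in failed_checks}
```

The two failures inside the window point at the provider; the third is the one worth a war room.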

&lt;h2&gt;
  
  
  Faster resolution, fewer disruptions
&lt;/h2&gt;

&lt;p&gt;AI outages remind us how vulnerable we are in this interconnected world. When these systems fail, every minute counts—particularly if you’re losing revenue or driving customers away. That’s why Internet Stack Map, recently updated with a &lt;a href="https://www.catchpoint.com/press-releases/catchpoint-helps-it-operations-teams-minimize-outages-with-revolutionary-visibility-across-the-complete-internet-stack-into-the-application-stack" rel="noopener noreferrer"&gt;groundbreaking user interface&lt;/a&gt;, is a game-changer for incident response. It offers immediate clarity about what broke and where, shrinking your Mean Time to Identify (MTTI) and Mean Time to Resolve (MTTR).  &lt;/p&gt;

&lt;p&gt;See how Internet Stack Map can help you stay ahead of disruptions—&lt;a href="https://www.catchpoint.com/learn-more" rel="noopener noreferrer"&gt;schedule a demo today&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aiops</category>
      <category>monitoring</category>
      <category>observability</category>
    </item>
    <item>
      <title>Cloud Monitoring's Blind Spot: The User Perspective</title>
      <dc:creator>Leon Adato</dc:creator>
      <pubDate>Tue, 05 Aug 2025 17:52:17 +0000</pubDate>
      <link>https://forem.com/catchpoint/cloud-monitorings-blind-spot-the-user-perspective-2gnl</link>
      <guid>https://forem.com/catchpoint/cloud-monitorings-blind-spot-the-user-perspective-2gnl</guid>
      <description>&lt;p&gt;The evolution of internet-centric application delivery has worsened IT's visibility gaps into what impacts an end user's experience. This problem is exacerbated when these gaps lead to negative business consequences, such as loss of revenue or lower Net Promoter Scores (NPS). The need to address this worsening visibility gap problem is reinforced by Gartner’s recent publication of its first &lt;a href="https://www.catchpoint.com/asset/2024-gartner-magic-quadrant-for-digital-experience-monitoring" rel="noopener noreferrer"&gt;Magic Quadrant for Digital Experience Monitoring (DEM)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A good way to understand what visibility really should look like is through the perspective of first- versus last-mile monitoring.  &lt;/p&gt;

&lt;h2&gt;
  
  
  First- vs. Last-Mile Monitoring: Where You Monitor Matters
&lt;/h2&gt;

&lt;p&gt;The first mile represents cloud networks and platforms like AWS, Azure, Google Cloud and even “Joe’s network closet.” These environments are stable, well-optimized and critical for hosting applications. Monitoring from the first mile focuses on ensuring that the core infrastructure and code of your applications are performing as expected.&lt;/p&gt;

&lt;p&gt;The last mile, however, is where real users connect to your applications; it’s where experiences occur. This includes backbone networks (e.g., regional ISPs like BT, AT&amp;amp;T and Comcast), last-mile providers (fiber or wireless like Verizon, Sky or T-Mobile) and wireless connections. Monitoring the last mile reveals the real-world challenges users face, such as latency spikes, packet loss and internet service provider (ISP)-specific issues that are invisible from the first mile.&lt;/p&gt;

&lt;p&gt;Think of it like the Domino’s “Paving for Pizza” ad campaign — it’s not just about ensuring the pizza is perfect when it leaves the store (first mile); it’s about fixing the potholes in the roads so the pizza arrives intact at the customer’s door (last mile). The same principle applies to digital experiences: monitoring the first mile isn’t enough if the last mile isn’t delivering. Monitoring from the last mile paints the clearest picture of performance from your users’ perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why First-Mile Monitoring Alone Falls Short
&lt;/h2&gt;

&lt;p&gt;Your applications are most likely hosted in a cloud provider’s data center, often within the same border gateway protocol (BGP) autonomous system (AS) as your monitoring tools. This means that monitoring so close to the source does little more than verify the availability of your infrastructure. In other words, this type of "inside of the house" setup offers limited visibility into real-world issues.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;User perspective lost:&lt;/strong&gt; Internet Performance Monitoring (IPM) monitors health from the user’s perspective, which cloud-only monitoring can’t do.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Observability risks:&lt;/strong&gt; When the first mile goes down — a more frequent occurrence than many realize — your observability strategy goes with it. This isn’t just a theoretical risk; it’s something we’ve seen play out time and time again in real-world outages, such as the &lt;a href="https://www.catchpoint.com/blog/dont-get-caught-in-the-dark-lessons-from-a-lumen-aws-micro-outage" rel="noopener noreferrer"&gt;Lumen and AWS micro-outage&lt;/a&gt; in August 2024. In this incident, critical systems were disrupted, rippling across interconnected ecosystems and catching businesses off guard.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Rethinking Observability: Availability and Reachability
&lt;/h2&gt;

&lt;p&gt;When it comes to delivering a flawless digital experience, observability relies on four key pillars: &lt;strong&gt;availability, reachability, performance, and reliability&lt;/strong&gt;. Each plays a critical role in understanding how your applications are performing and how users experience them.&lt;/p&gt;

&lt;p&gt;I’ll focus on the first two: availability and reachability. Availability is about whether your application is up and running. Reachability, on the other hand, measures whether users can actually connect to your application, factoring in network latency, packet loss and the number of hops between them and your servers.&lt;/p&gt;
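A minimal sketch of the distinction, with illustrative probe records and thresholds (not Catchpoint's actual metrics): an endpoint can answer every probe, and so score perfectly on availability, while still being effectively unreachable from where users sit.

```python
# Each probe records whether the app answered (availability) and the path
# metrics that determine reachability. Values and thresholds are made up.
probes = [
    {"responded": True, "latency_ms": 40,  "packet_loss": 0.0},   # cloud vantage
    {"responded": True, "latency_ms": 700, "packet_loss": 0.08},  # last-mile vantage
]

def availability(results: list) -> float:
    """Fraction of probes the application answered."""
    return sum(p["responded"] for p in results) / len(results)

def reachable(probe: dict, max_latency_ms: int = 500, max_loss: float = 0.02) -> bool:
    """A probe counts as reachable only if the path quality is acceptable."""
    return probe["latency_ms"] <= max_latency_ms and probe["packet_loss"] <= max_loss

# 100% available from both vantages, yet unreachable in practice from the last mile.
```

That gap between "up" and "usable" is exactly what the next sections visualize.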

&lt;p&gt;I’ll illustrate the difference between monitoring from the cloud versus end-user networks, and how what looks perfect in the cloud often falls apart in the wild.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualizing the Difference: Availability Across Network Types
&lt;/h2&gt;

&lt;p&gt;Cloud monitoring data often paints an overly rosy picture. As the chart below shows, monitoring from the cloud (green line) reports near-perfect availability, consistently hovering around 99.99%. But this data tells only part of the story — it reflects the controlled environment of cloud infrastructure, not the real-world experience of users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecwy8wdb1qe3zjyod0eq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fecwy8wdb1qe3zjyod0eq.jpeg" width="800" height="627"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint dashboard showing network availability trends across Backbone, Cloud, Last Mile, and Wireless networks over a one-week period&lt;/p&gt;

&lt;p&gt;Now, compare this to the backbone (the blue line), last mile (the red line) and wireless (the purple line) data. These fluctuations highlight the everyday challenges users face, from regional ISP disruptions to last-mile instability. The takeaway? While cloud monitoring data might make dashboards look good, it doesn’t account for the realities of real-world networks where your users connect. To truly understand availability, you need to monitor across all these network types.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud vs. End-User Network Maps
&lt;/h2&gt;

&lt;p&gt;Here is another example. The top map shows monitoring results from the cloud, while the bottom map shows end-user networks. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffad33gehgsljytg1vd0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffad33gehgsljytg1vd0p.png" alt="A screenshot of a computer screenDescription automatically generated" width="800" height="896"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint dashboard comparing cloud vs. end-user network performance&lt;/p&gt;

&lt;p&gt;The cloud shows all green, indicating near-perfect first-mile performance. The bottom map shows the reality from the end-user perspective; the red and yellow markers represent performance issues that are not visible to cloud-only monitoring.&lt;/p&gt;

&lt;p&gt;This disparity underscores the critical need for monitoring where your users actually connect. While the cloud may look pristine, end-user networks tell a very different story.&lt;/p&gt;

&lt;p&gt;Now why is this?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahtby98zslswaagz9y43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fahtby98zslswaagz9y43.png" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network path visualization showing traffic flows from multiple ISPs to an Amazon destination network, highlighting packet loss and performance metrics&lt;/p&gt;

&lt;p&gt;The image above shows that the path a user’s traffic takes to reach the cloud from an external ISP is more volatile than a path originating inside a cloud provider’s network (shown in the image below). This is due to the numerous BGP autonomous systems and hops that exist between a cloud-hosted application and the user. Each AS represents a different administrative domain. As traffic traverses these domains, it passes through multiple network hops, which can involve diverse routing policies, peering agreements and congestion points. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh2fhf0ns1bt8rpmbcxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxh2fhf0ns1bt8rpmbcxs.png" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network path visualization showing traffic from multiple AWS Cloud locations to an Amazon destination network, showing minimal packet loss&lt;/p&gt;

&lt;p&gt;Cloud-based monitoring lacks insight into these intermediate hops, particularly across transit providers and peering exchanges, resulting in a fragmented view of network performance and true user experience. &lt;/p&gt;

&lt;p&gt;In contrast, backbone monitoring provides a more comprehensive perspective by capturing data closer to the core of the internet, offering visibility into the paths your end user traffic takes and potential bottlenecks along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  Average Response Times: Cloud vs. Backbone ISPs
&lt;/h2&gt;

&lt;p&gt;It’s one thing to know whether your application is up and running, but what about the quality of the connection? The chart below compares response metrics between backbone networks and cloud networks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xd2625c3fvnfzd7lmia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xd2625c3fvnfzd7lmia.png" alt="A screenshot of a computerDescription automatically generated" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint dashboard comparing response times between backbone and cloud networks&lt;/p&gt;

&lt;p&gt;On the left, monitoring from backbone networks reveals significant variability in key metrics like &lt;strong&gt;load time&lt;/strong&gt; and &lt;strong&gt;wait time&lt;/strong&gt;. The spikes represent the challenges users face when traversing real-world networks. Compare that to the cloud on the right, where everything looks stable, smooth and controlled. But most users aren’t connecting from the cloud. Without monitoring from backbone and last-mile networks, you’re only seeing part of the story.&lt;/p&gt;

&lt;p&gt;Here’s another example of how cloud monitoring data might make everything look perfect when the reality is far from it. The chart below, showing monitoring from AWS, reports a near-instant response time of 44.79 milliseconds. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eclyg1707of9786gto3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6eclyg1707of9786gto3.png" alt="A screenshot of a computerDescription automatically generated" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Catchpoint dashboard comparing cloud vs. backbone ISP response times&lt;/p&gt;

&lt;p&gt;But what happens when you shift the perspective to backbone ISPs? In this CenturyLink example, response times skyrocket to 730.67 milliseconds. This kind of variability isn’t an outlier; it’s the reality users face every day when connecting to your application through different networks. And unless you’re monitoring from these networks, you’re missing the full picture of your application’s reachability.&lt;/p&gt;
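To see how misleading a single vantage point is, it helps to recompute the gap. The samples below simply echo the two measured numbers from the charts, padded with invented neighbors; the point is the ratio, not the tooling.

```python
# Illustrative response-time samples per vantage point (ms). The 44.79 and
# 730.67 figures come from the dashboards above; the rest are made up.
from statistics import mean

response_ms = {
    "aws-cloud": [44.79, 46.1, 43.2],
    "backbone-isp": [730.67, 512.4, 645.9],
}

# Ratio of mean backbone latency to mean cloud latency.
slowdown = mean(response_ms["backbone-isp"]) / mean(response_ms["aws-cloud"])
# Monitoring only from the cloud hides a >10x difference in what users experience.
```

Averaging across vantage points, rather than trusting the cloud number alone, is what surfaces that order-of-magnitude disparity.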

&lt;h2&gt;
  
  
  Putting it All Together
&lt;/h2&gt;

&lt;p&gt;The data in these charts tells a clear story. It shows what first-mile monitoring alone cannot: the variability, instability and challenges users face every day on backbone, last-mile and wireless networks. The takeaway? To truly understand how your applications are performing, you need to monitor beyond the cloud. Backbone, last-mile and wireless networks aren’t just part of the picture — they are the picture. The ability to monitor the entire Internet Stack, including those “eyeball” networks where your users actually connect, is what sets &lt;a href="https://www.catchpoint.com/internet-performance-monitoring" rel="noopener noreferrer"&gt;Catchpoint Internet Performance Monitoring (IPM)&lt;/a&gt; apart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falhmgcv65ka2ccc9x1sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falhmgcv65ka2ccc9x1sw.png" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A visual representation of The Internet Stack&lt;/p&gt;

&lt;p&gt;To learn more about how Catchpoint IPM can help you achieve Internet Resilience, &lt;a href="https://www.catchpoint.com/guided-product-tour" rel="noopener noreferrer"&gt;request a demo&lt;/a&gt; or &lt;a href="https://www.catchpoint.com/learn-more" rel="noopener noreferrer"&gt;schedule a chat&lt;/a&gt; with our solution engineers.&lt;/p&gt;


</description>
      <category>monitoring</category>
      <category>observability</category>
      <category>cloud</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
