<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gauri Katara</title>
    <description>The latest articles on Forem by Gauri Katara (@gaurikatara).</description>
    <link>https://forem.com/gaurikatara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3859015%2Fc5477f65-4f59-40af-b410-844dd0f5e354.jpg</url>
      <title>Forem: Gauri Katara</title>
      <link>https://forem.com/gaurikatara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gaurikatara"/>
    <language>en</language>
    <item>
      <title>After Fixing Thread Pool Exhaustion, A Reader Asked: Did You Add a Circuit Breaker? Here's What I Did Next</title>
      <dc:creator>Gauri Katara</dc:creator>
      <pubDate>Wed, 22 Apr 2026 03:38:10 +0000</pubDate>
      <link>https://forem.com/gaurikatara/after-fixing-thread-pool-exhaustion-a-reader-asked-did-you-add-a-circuit-breaker-heres-what-i-563e</link>
      <guid>https://forem.com/gaurikatara/after-fixing-thread-pool-exhaustion-a-reader-asked-did-you-add-a-circuit-breaker-heres-what-i-563e</guid>
      <description>&lt;p&gt;This is a follow-up to my previous article, "Our Spring Boot API Froze Under Load":&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://hashnode.com/edit/cmny3jv0u00bg2dlndzza08zy" rel="noopener noreferrer"&gt;https://hashnode.com/edit/cmny3jv0u00bg2dlndzza08zy&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;A reader asked me a question I couldn't stop thinking about.&lt;/p&gt;

&lt;p&gt;In my last article I wrote about how our Spring Boot API froze under load because of thread pool exhaustion. We fixed it with two things: adding timeouts on external calls and isolating slow dependencies into their own thread pool using &lt;code&gt;@Async&lt;/code&gt;. A reader left a comment that stopped me mid-scroll: "After the fix, did you keep monitoring thread pool metrics as a guardrail, or add any alerts around it? And did you consider a circuit breaker?" We had added monitoring. But the circuit breaker part? Honestly, we had talked about it and not implemented it yet. That comment pushed me to finally do it properly. In this article I want to share what I learned about Resilience4j circuit breakers, why they are the missing layer in many Spring Boot microservices, and how to implement them step by step.&lt;/p&gt;

&lt;p&gt;The problem with just timeouts&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why timeouts alone are not enough&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After our fix, here is what our system looked like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;External service calls had a 5-second timeout&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Slow calls ran on an isolated thread pool&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudWatch alerted us when active threads crossed 70% of max&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was much better than before. But there was still a problem. Imagine the external service we were calling started failing, not slowly, but completely. Every call we made was timing out after 5 seconds and returning an error. With just timeouts, here is what happens: every single request to our API still tries to call that external service, waits 5 seconds, times out, and then returns an error. We are calling a service we already know is down, over and over, wasting threads, wasting time, and making the recovery slower.&lt;/p&gt;

&lt;p&gt;What we needed was something that would say: "This service has been failing repeatedly. Stop calling it for a while. Let it recover. Then try again carefully." That is exactly what a circuit breaker does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a circuit breaker&lt;/strong&gt;&lt;br&gt;
Circuit breaker — the concept in plain English&lt;/p&gt;

&lt;p&gt;The name comes from electrical engineering. A circuit breaker in your home trips when there is too much current: it breaks the circuit to prevent damage. You reset it manually when the problem is fixed. In software it works in much the same way, with three states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;CLOSED: Everything is working normally. Requests go through. The circuit breaker monitors the failure rate in the background.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;OPEN: Too many failures have happened. The circuit breaker trips. Instead of trying the actual call, it immediately returns a fallback response. No waiting, no timeouts, just fast failure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;HALF-OPEN: After a waiting period, the circuit breaker allows a few test requests through. If they succeed, it goes back to CLOSED. If they fail, it goes back to OPEN.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result: when a dependency is down, you fail fast instead of failing slow. Your threads are not stuck waiting. Your system stays responsive. And the failing service gets breathing room to recover.&lt;/p&gt;

&lt;p&gt;Circuit breakers don't prevent failures. They contain them — stopping one failing dependency from bringing down your entire system.&lt;/p&gt;
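
&lt;p&gt;To make the three states concrete, here is a minimal sketch in plain Java. It is not how Resilience4j is implemented: the class name &lt;code&gt;SimpleCircuitBreaker&lt;/code&gt;, the &lt;code&gt;RemoteCall&lt;/code&gt; interface and the "N consecutive failures" rule are all assumptions made for illustration, and the current time is passed in so the behaviour is easy to follow.&lt;/p&gt;

```java
/** Hypothetical, simplified circuit breaker. Illustration only, not Resilience4j's code. */
public class SimpleCircuitBreaker {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    /** Stand-in for the external request being protected (assumed interface). */
    public interface RemoteCall {
        String run() throws Exception;
    }

    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long to stay OPEN before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized State state() {
        return state;
    }

    /** nowMillis is supplied by the caller so the example stays deterministic. */
    public synchronized String call(RemoteCall remote, String fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt >= openMillis) {
                state = State.HALF_OPEN;   // wait is over: let a trial call through
            } else {
                return fallback;           // fail fast, do not touch the dependency
            }
        }
        try {
            String result = remote.run();
            consecutiveFailures = 0;
            state = State.CLOSED;          // a success (including a trial) closes the circuit
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;        // trip, or re-trip after a failed trial
                openedAt = nowMillis;
            }
            return fallback;
        }
    }
}
```

&lt;p&gt;With a threshold of 2 and a 1-second open period, two failing calls trip the breaker, a call half a second later gets the fallback without touching the dependency, and the first call after the wait is the trial that closes the circuit again if it succeeds.&lt;/p&gt;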

&lt;blockquote&gt;
&lt;p&gt;Add Resilience4j&lt;br&gt;
&lt;strong&gt;Step 1 — Add the dependency&lt;/strong&gt;&lt;br&gt;
Resilience4j is a widely used resilience library for Java. It integrates cleanly with Spring Boot and is lightweight compared to alternatives like Hystrix (which is now in maintenance mode). Add this to your pom.xml:&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;io.github.resilience4j&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;resilience4j-spring-boot3&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;2.1.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- Also needed for Actuator metrics --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.springframework.boot&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;spring-boot-starter-aop&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use resilience4j-spring-boot3 for Spring Boot 3.x. For Spring Boot 2.x use resilience4j-spring-boot2.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Configure it&lt;br&gt;
&lt;strong&gt;Step 2 — Configure the circuit breaker&lt;/strong&gt;&lt;br&gt;
Add this to your application.properties (or the equivalent keys in application.yml):&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;#application.properties
#Name your circuit breaker — use the same name in your code
&lt;/span&gt;&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.sliding-window-size&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.failure-rate-threshold&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;50 &lt;/span&gt;

&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.wait-duration-in-open-state&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10s &lt;/span&gt;

&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.permitted-calls-in-half-open-state&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3 &lt;/span&gt;

&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.automatic-transition-from-open-to-half-open-enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true &lt;/span&gt;

&lt;span class="py"&gt;resilience4j.circuitbreaker.instances.externalService.register-health-indicator&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me explain what each setting means in plain English:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;sliding-window-size&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;
&lt;span class="err"&gt;//&lt;/span&gt; &lt;span class="err"&gt;Look&lt;/span&gt; &lt;span class="err"&gt;at&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;last&lt;/span&gt; &lt;span class="err"&gt;10&lt;/span&gt; &lt;span class="err"&gt;calls&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;decide&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;failure&lt;/span&gt; &lt;span class="err"&gt;rate&lt;/span&gt;
&lt;span class="py"&gt;failure-rate-threshold&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;50&lt;/span&gt;
&lt;span class="err"&gt;//&lt;/span&gt; &lt;span class="err"&gt;If&lt;/span&gt; &lt;span class="err"&gt;50%&lt;/span&gt; &lt;span class="err"&gt;or&lt;/span&gt; &lt;span class="err"&gt;more&lt;/span&gt; &lt;span class="err"&gt;of&lt;/span&gt; &lt;span class="err"&gt;those&lt;/span&gt; &lt;span class="err"&gt;10&lt;/span&gt; &lt;span class="err"&gt;calls&lt;/span&gt; &lt;span class="err"&gt;failed&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="err"&gt;trip&lt;/span&gt; &lt;span class="err"&gt;the&lt;/span&gt; &lt;span class="err"&gt;circuit&lt;/span&gt; &lt;span class="err"&gt;(go&lt;/span&gt; &lt;span class="err"&gt;OPEN)&lt;/span&gt;
&lt;span class="py"&gt;wait-duration-in-open-state&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10s&lt;/span&gt;
&lt;span class="err"&gt;//&lt;/span&gt; &lt;span class="err"&gt;Stay&lt;/span&gt; &lt;span class="err"&gt;OPEN&lt;/span&gt; &lt;span class="err"&gt;for&lt;/span&gt; &lt;span class="err"&gt;10&lt;/span&gt; &lt;span class="err"&gt;seconds&lt;/span&gt; &lt;span class="err"&gt;before&lt;/span&gt; &lt;span class="err"&gt;trying&lt;/span&gt; &lt;span class="err"&gt;again&lt;/span&gt; &lt;span class="err"&gt;(HALF-OPEN)&lt;/span&gt;
&lt;span class="py"&gt;permitted-calls-in-half-open-state&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;
&lt;span class="err"&gt;//&lt;/span&gt; &lt;span class="err"&gt;In&lt;/span&gt; &lt;span class="err"&gt;HALF-OPEN,&lt;/span&gt; &lt;span class="err"&gt;allow&lt;/span&gt; &lt;span class="err"&gt;3&lt;/span&gt; &lt;span class="err"&gt;test&lt;/span&gt; &lt;span class="err"&gt;calls&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="err"&gt;if&lt;/span&gt; &lt;span class="err"&gt;they&lt;/span&gt; &lt;span class="err"&gt;pass,&lt;/span&gt; &lt;span class="err"&gt;go&lt;/span&gt; &lt;span class="err"&gt;back&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;CLOSED&lt;/span&gt;
&lt;span class="py"&gt;automatic-transition-from-open-to-half-open-enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="err"&gt;//&lt;/span&gt; &lt;span class="err"&gt;Automatically&lt;/span&gt; &lt;span class="err"&gt;move&lt;/span&gt; &lt;span class="err"&gt;to&lt;/span&gt; &lt;span class="err"&gt;HALF-OPEN&lt;/span&gt; &lt;span class="err"&gt;after&lt;/span&gt; &lt;span class="err"&gt;wait&lt;/span&gt; &lt;span class="err"&gt;time&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="err"&gt;no&lt;/span&gt; &lt;span class="err"&gt;manual&lt;/span&gt; &lt;span class="err"&gt;reset&lt;/span&gt; &lt;span class="err"&gt;needed.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
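
&lt;p&gt;If it helps to see the arithmetic behind &lt;code&gt;sliding-window-size&lt;/code&gt; and &lt;code&gt;failure-rate-threshold&lt;/code&gt;, here is a rough sketch of a count-based window in plain Java. The class &lt;code&gt;FailureRateWindow&lt;/code&gt; is hypothetical and only mirrors the idea; in particular, waiting until the window is full before deciding is an assumption of this sketch, not a statement about Resilience4j's internals.&lt;/p&gt;

```java
/** Hypothetical count-based sliding window over the most recent call outcomes. */
public class FailureRateWindow {

    private final boolean[] failed; // ring buffer: true means that call failed
    private int recorded = 0;       // total number of calls recorded so far
    private int failures = 0;       // failures currently inside the window

    public FailureRateWindow(int size) {
        this.failed = new boolean[size];
    }

    /** Records one call outcome, evicting the oldest outcome once the window is full. */
    public void record(boolean callFailed) {
        int slot = recorded % failed.length;
        if (failed[slot]) {
            failures--;             // the outcome being overwritten was a failure
        }
        failed[slot] = callFailed;
        if (callFailed) {
            failures++;
        }
        recorded++;
    }

    public boolean isFull() {
        return recorded >= failed.length;
    }

    /** Failure rate, in percent, over the calls currently in the window. */
    public double failureRatePercent() {
        int calls = Math.min(recorded, failed.length);
        return calls == 0 ? 0.0 : 100.0 * failures / calls;
    }

    /** Sketch assumption: only decide once the window has filled up. */
    public boolean shouldOpen(double thresholdPercent) {
        return isFull() ? failureRatePercent() >= thresholdPercent : false;
    }
}
```

&lt;p&gt;With a window of 10 and a threshold of 50, five failures among the last ten calls give a rate of exactly 50%, which is enough to trip. Once newer successful calls push those failures out of the window, the rate drops again.&lt;/p&gt;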



&lt;blockquote&gt;
&lt;p&gt;Use it in code&lt;br&gt;
&lt;strong&gt;Step 3 — Add the annotation and fallback&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now wrap the method that calls your external service with the @CircuitBreaker annotation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Service&lt;/span&gt; 
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ExternalDataService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

&lt;span class="c1"&gt;// Circuit breaker name matches what you set in properties&lt;/span&gt;
&lt;span class="nd"&gt;@CircuitBreaker&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"externalService"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallbackMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getDataFallback"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;EnrichmentData&lt;/span&gt; &lt;span class="nf"&gt;getExternalData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This is the call that could fail or be slow&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;externalApiClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fetchEnrichmentData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Fallback — called automatically when circuit is OPEN&lt;/span&gt;
&lt;span class="c1"&gt;// Must have same parameters + a Throwable parameter&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;EnrichmentData&lt;/span&gt; &lt;span class="nf"&gt;getDataFallback&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Throwable&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Return a safe default instead of failing the whole request&lt;/span&gt;
    &lt;span class="c1"&gt;// Options: cached data, empty object, default values&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Circuit breaker open for externalService. "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
             &lt;span class="s"&gt;"Returning fallback for userId: {}. Reason: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;EnrichmentData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;defaultData&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fallback method must have the same return type and the same parameters as the original method, plus one extra exception parameter (here Throwable) at the end. If no matching method exists, Resilience4j cannot resolve the fallback and the call fails with an error instead of returning your default.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Combine with timeout&lt;br&gt;
&lt;strong&gt;Step 4 — Combine circuit breaker with timeout&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

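&lt;p&gt;A circuit breaker works best when combined with a timeout. The timeout prevents threads from waiting too long. The circuit breaker prevents calling a service that is already known to be failing. Resilience4j has a built-in TimeLimiter for this. Note that the @TimeLimiter annotation only works on methods that return a CompletableFuture (or a reactive type), which is why the method signature changes below:&lt;br&gt;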
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;#Add timeout configuration alongside circuit breaker
&lt;/span&gt;&lt;span class="py"&gt;resilience4j.timelimiter.instances.externalService.timeout-duration&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3s&lt;/span&gt;
&lt;span class="py"&gt;resilience4j.timelimiter.instances.externalService.cancel-running-future&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use both annotations together&lt;/span&gt;
&lt;span class="nd"&gt;@CircuitBreaker&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"externalService"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fallbackMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"getDataFallback"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@TimeLimiter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"externalService"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EnrichmentData&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getExternalData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;externalApiClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fetchEnrichmentData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Fallback for combined circuit breaker + time limiter&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;EnrichmentData&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;getDataFallback&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Throwable&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Fallback triggered for userId: {}. Reason: {}"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getMessage&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;completedFuture&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="nc"&gt;EnrichmentData&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;defaultData&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
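
&lt;p&gt;For intuition about what "time limit plus fallback" means, here is a small sketch using only the JDK (Java 9 or later), with no Resilience4j involved. The names &lt;code&gt;TimeoutFallbackDemo&lt;/code&gt;, &lt;code&gt;slowCall&lt;/code&gt; and &lt;code&gt;fetchWithTimeout&lt;/code&gt; are made up for this example, and the sleep stands in for the external service. One difference worth knowing: unlike &lt;code&gt;cancel-running-future=true&lt;/code&gt;, this sketch does not interrupt the slow task; it just stops waiting for it.&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class TimeoutFallbackDemo {

    /** Hypothetical slow dependency: sleeps for delayMillis, then answers. */
    static String slowCall(long delayMillis) {
        try {
            Thread.sleep(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "live data";
    }

    /** Waits at most timeoutMillis for the call, otherwise returns a default value. */
    static String fetchWithTimeout(long delayMillis, long timeoutMillis) {
        return CompletableFuture
                .supplyAsync(() -> slowCall(delayMillis))         // run off the caller thread
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)  // fail with TimeoutException if too slow
                .exceptionally(ex -> "default data")              // fallback on timeout or any failure
                .join();
    }

    public static void main(String[] args) {
        System.out.println(fetchWithTimeout(10, 2000));   // fast dependency
        System.out.println(fetchWithTimeout(1500, 100));  // slow dependency
    }
}
```

&lt;p&gt;The caller gets an answer within roughly the timeout either way. What the circuit breaker adds on top is memory: after enough of these timeouts it stops making the call at all for a while.&lt;/p&gt;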



&lt;blockquote&gt;
&lt;p&gt;Monitor it&lt;br&gt;
&lt;strong&gt;Step 5 — Monitor circuit breaker state with Actuator&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once Resilience4j is set up with register-health-indicator=true, you can see the circuit breaker state through Spring Boot Actuator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;#Enable circuit breaker metrics in Actuator
&lt;/span&gt;&lt;span class="py"&gt;management.endpoints.web.exposure.include&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;health,metrics,circuitbreakers &lt;/span&gt;
&lt;span class="py"&gt;management.health.circuitbreakers.enabled&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="c"&gt;#Check health endpoint
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt; &lt;span class="err"&gt;/actuator/health&lt;/span&gt;
&lt;span class="c"&gt;#Response shows circuit breaker state:
#"externalService": { "status": "CLOSED", "details": {...} }
#Check detailed metrics
&lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt; &lt;span class="err"&gt;/actuator/metrics/resilience4j.circuitbreaker.state&lt;/span&gt;
&lt;span class="err"&gt;GET&lt;/span&gt; &lt;span class="err"&gt;/actuator/metrics/resilience4j.circuitbreaker.failure.rate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add a CloudWatch alarm on the circuit breaker state metric — if it goes OPEN, you want to know immediately. A circuit breaker going OPEN is a signal that something in your system needs attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full picture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The complete resilience stack — how it all fits together&lt;/p&gt;

&lt;p&gt;After everything we implemented, here is the full picture of how an external call is protected in our service:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Request comes in to our API&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Main Tomcat thread handles it normally&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;External call is handed off to an isolated async thread pool (&lt;code&gt;@Async&lt;/code&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TimeLimiter ensures the call times out after 3 seconds maximum&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Circuit breaker monitors the failure rate across the last 10 calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the failure rate reaches 50%, the circuit opens and the fallback returns immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After 10 seconds, 3 test calls go through in HALF-OPEN state&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If the tests pass, the circuit closes and normal operation resumes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CloudWatch monitors active thread count and circuit breaker state&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer handles a different failure scenario. Together they mean one slow or failing external dependency cannot bring down our service.&lt;/p&gt;

&lt;p&gt;This combines the Bulkhead and Circuit Breaker patterns. Both are core resilience patterns in microservices architecture, documented in the Microsoft Azure Architecture patterns and Netflix's engineering blog.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned from this whole journey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest truth is that we should have had a circuit breaker from day one. Timeouts and thread isolation were necessary but they were always incomplete without it. What I found interesting about implementing Resilience4j is how much it forces you to think about failure modes upfront. What should the fallback return? What is an acceptable failure rate? How long should we wait before retrying? These are questions that reveal a lot about how well you understand your own system's dependencies. If I were starting a new microservice project today, these three things would go in before the first endpoint is written:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Timeouts on every external call&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Isolated thread pools for external dependencies&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Circuit breaker with a sensible fallback&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not after the first production incident. Before.&lt;/p&gt;

&lt;p&gt;This article was directly inspired by a comment on my previous post about thread pool exhaustion. If you haven't read that one, it gives context for why we ended up here. If you found this useful or have questions about Resilience4j configuration, drop a comment.&lt;/p&gt;

&lt;p&gt;And if you are using a different resilience pattern in your Spring Boot services, I would genuinely love to hear how you approached it.&lt;/p&gt;

&lt;p&gt;Follow me here for more practical Java backend content — no theory-only posts, just things that come from real production systems.&lt;/p&gt;

</description>
      <category>java</category>
      <category>backenddevelopment</category>
      <category>microservices</category>
      <category>springboot</category>
    </item>
    <item>
      <title>Our Spring Boot API Froze Under Load — Here's Exactly How We Fixed It</title>
      <dc:creator>Gauri Katara</dc:creator>
      <pubDate>Thu, 16 Apr 2026 04:06:09 +0000</pubDate>
      <link>https://forem.com/gaurikatara/our-spring-boot-api-froze-under-load-heres-exactly-how-we-fixed-it-3ecn</link>
      <guid>https://forem.com/gaurikatara/our-spring-boot-api-froze-under-load-heres-exactly-how-we-fixed-it-3ecn</guid>
      <description>&lt;p&gt;This article was originally published on &lt;br&gt;
My Hashnode blog [&lt;a href="https://gaurikatara.hashnode.dev" rel="noopener noreferrer"&gt;https://gaurikatara.hashnode.dev&lt;/a&gt;]&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The API didn't throw an error. It just... stopped responding. It froze.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No exceptions in the logs. No 500 errors. No obvious reason. Just requests piling up, response times climbing from milliseconds to 30 seconds, and then timeouts. It was strange. Traffic had picked up slightly, nothing unusual for that time of day, and suddenly our backend service was barely responding. Users were seeing blank screens. This is the story, from one of my typical workdays, of how we diagnosed and fixed the issue in our Spring Boot microservice, and what I learned from it that I hadn't found clearly explained in any single place online.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I began investigating all the potential causes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Check for network issues&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Examine database metrics&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monitor CPU and memory usage&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Inspect thread dumps&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Review load balancer logs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What I discovered was surprising—it was related to the threads.&lt;/p&gt;

&lt;p&gt;The issue stemmed from thread pool exhaustion in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WHAT IS THREAD POOL EXHAUSTION — and why is it so sneaky?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spring Boot uses an embedded Tomcat server by default. Tomcat handles incoming HTTP requests using a bounded pool of threads. By default, this pool has a maximum of 200 threads. Here is how it works normally: a request comes in, Tomcat assigns it a thread, the thread does its work, returns a response, and the thread goes back to the pool. Fast, clean, repeatable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now here is the problem: what if a thread gets stuck waiting?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Maybe it's calling an external API that is slow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maybe it's waiting on a database query that is taking too long.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Maybe it's doing something blocking that it shouldn't be.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If enough threads get stuck waiting at the same time, the pool runs out. New requests come in but there are no threads available to handle them. So they wait in a queue. The queue fills up. New requests start getting rejected or timing out. Your API is not down. It is not throwing errors. It is just frozen. And the worst part — from the outside it looks exactly like a network issue or a deployment problem. Most developers waste hours looking in the wrong place.&lt;/p&gt;

&lt;p&gt;Thread pool exhaustion does not always look like an error. It often looks like extreme slowness or total unresponsiveness — with perfectly clean logs.&lt;/p&gt;
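
&lt;p&gt;You can reproduce the effect in miniature with a plain JDK thread pool. This is only a model: Tomcat's own executor is a different implementation, and the class &lt;code&gt;PoolExhaustionDemo&lt;/code&gt; and its method name are invented for this sketch. A latch stands in for the slow external service, so every "request" blocks until it is released.&lt;/p&gt;

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.stream.IntStream;

public class PoolExhaustionDemo {

    /**
     * Submits `requests` blocking tasks to a pool with `threads` workers and
     * returns how many of them are left waiting in the queue.
     */
    static int queuedWhenAllThreadsBlock(int threads, int requests) {
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(threads);
        CountDownLatch slowDependency = new CountDownLatch(1); // "responds" only when released

        Runnable stuckRequest = () -> {
            try {
                slowDependency.await(); // thread is parked here, like a blocking HTTP call
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            IntStream.range(0, requests).forEach(i -> pool.execute(stuckRequest));
            // No exception was thrown and nothing was logged, yet work is piling up.
            return pool.getQueue().size();
        } finally {
            slowDependency.countDown(); // release the blocked workers
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("queued: " + queuedWhenAllThreadsBlock(2, 5));
    }
}
```

&lt;p&gt;With 2 threads and 5 blocking requests, 2 requests occupy the workers and the other 3 sit in the queue with no error anywhere. Scale that to 200 threads and a dependency that takes 25 seconds per call, and you get the frozen API described above.&lt;/p&gt;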

&lt;p&gt;&lt;strong&gt;HOW WE DIAGNOSED IT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step 1 — Check the thread pool metrics&lt;/p&gt;

&lt;p&gt;The first thing I did after ruling out obvious causes (no deployment had happened, no database was down) was check our application metrics on AWS CloudWatch. We had Spring Boot Actuator enabled, which exposes metrics including thread pool stats. I looked at the active thread count and it was sitting at exactly 200, the Tomcat default maximum. That was the first real clue. If your team has Actuator set up, you can check this yourself at &lt;code&gt;/actuator/metrics/tomcat.threads.busy&lt;/code&gt;. On recent Spring Boot versions, the Tomcat thread metrics are only published when &lt;code&gt;server.tomcat.mbeanregistry.enabled=true&lt;/code&gt; is set. If that number is at or near your maximum thread count while the API is slow, thread pool exhaustion is almost certainly your problem.&lt;/p&gt;

&lt;p&gt;Step 2 — Take a thread dump&lt;/p&gt;

&lt;p&gt;To confirm, I took a thread dump while the service was under load. A thread dump shows you exactly what every thread in your JVM is doing at that moment. You can trigger one using:&lt;/p&gt;

&lt;p&gt;Using jstack (get the PID first with &lt;code&gt;ps aux | grep java&lt;/code&gt;):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;jstack &amp;lt;PID&amp;gt; &amp;gt; thread_dump.txt&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Or via Spring Boot Actuator, if the threaddump endpoint is exposed (for example from Postman):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GET http://localhost:8080/actuator/threaddump&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When I opened the thread dump, I saw the same pattern repeating across most of the 200 threads. They were all stuck in a WAITING or TIMED_WAITING state, blocked on an HTTP call to an external third-party service we were calling to enrich our response data. That external service had started responding slowly, averaging 25 seconds per call instead of the usual 200ms. Our code was calling it synchronously, blocking the thread the entire time. With enough concurrent requests, every thread in the pool was stuck waiting for that external service. New requests had nowhere to go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE FIX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Step 3 — The fix had two parts&lt;/p&gt;

&lt;p&gt;We fixed it with two changes. The first was immediate — add a timeout to the external HTTP call so threads don't wait forever. The second was the proper fix — make the call non-blocking.&lt;/p&gt;

&lt;p&gt;Part 1 — Add connection and read timeouts immediately&lt;/p&gt;

&lt;p&gt;This was a quick fix we could deploy right away. We configured our RestTemplate with explicit timeouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt; &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RestTemplateConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

&lt;span class="nd"&gt;@Bean&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;RestTemplate&lt;/span&gt; &lt;span class="nf"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;HttpComponentsClientHttpRequestFactory&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;HttpComponentsClientHttpRequestFactory&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Don't wait more than 3 seconds to connect&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setConnectTimeout&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Don't wait more than 5 seconds for a response&lt;/span&gt;
    &lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setReadTimeout&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RestTemplate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;factory&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always set timeouts on any external HTTP call. No timeout means a thread can be stuck waiting forever. This is one of the most common mistakes in Java backend services.&lt;/p&gt;
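
&lt;p&gt;The same rule applies whichever HTTP client you use. As a rough sketch, here is how the two timeouts could look with the JDK's built-in &lt;code&gt;java.net.http.HttpClient&lt;/code&gt; (Java 11+). This is not what we deployed, since we used RestTemplate; the class name &lt;code&gt;TimeoutDefaults&lt;/code&gt; and the URL are placeholders I made up for illustration.&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutDefaults {

    // Fail fast if the connection cannot be established within 3 seconds.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(3))
                .build();
    }

    // Give up if no response has been received within 5 seconds.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Placeholder URL; nothing is sent here, we only print the configured limits.
        System.out.println(newClient().connectTimeout());
        System.out.println(newRequest("http://localhost:8080/enrich").timeout());
    }
}
```

&lt;p&gt;When a request built this way is sent and either limit is exceeded, the client throws an &lt;code&gt;HttpTimeoutException&lt;/code&gt; instead of holding the thread indefinitely, which is the behaviour we wanted from the RestTemplate configuration above.&lt;/p&gt;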

&lt;p&gt;Part 2 — Increase thread pool size as a short-term buffer&lt;/p&gt;

&lt;p&gt;While we worked on the proper async solution, we also increased the Tomcat thread pool size to give us more breathing room under load. This goes in your application.properties:&lt;/p&gt;

&lt;p&gt;Maximum number of threads to handle requests&lt;br&gt;
&lt;code&gt;server.tomcat.threads.max=400&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Minimum threads always kept alive&lt;br&gt;
&lt;code&gt;server.tomcat.threads.min-spare=20&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Max requests that can wait when all threads are busy&lt;br&gt;
&lt;code&gt;server.tomcat.accept-count=100&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Increasing thread count is not a real fix; it just delays the problem. If your threads are blocking, adding more threads means more threads will block. Always fix the root cause.&lt;/p&gt;
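
&lt;p&gt;A back-of-the-envelope calculation shows why a bigger pool only delays the problem. The sketch below assumes each request holds one thread for roughly the duration of the external call, so the throughput ceiling is the thread count divided by the latency. The class and method names are made up for this example; the 200 ms, 25 second and 5 second figures are the ones from this incident.&lt;/p&gt;

```java
final class ThreadBudget {

    // Upper bound on sustained requests per second when every request holds
    // one thread for latencyMillis: throughput = threads / latency.
    static double maxRequestsPerSecond(int threads, long latencyMillis) {
        return threads * 1000.0 / latencyMillis;
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(200, 200));    // healthy dependency
        System.out.println(maxRequestsPerSecond(200, 25_000)); // slow dependency
        System.out.println(maxRequestsPerSecond(400, 25_000)); // slow dependency, doubled pool
        System.out.println(maxRequestsPerSecond(200, 5_000));  // slow dependency, 5 s read timeout
    }
}
```

&lt;p&gt;With these inputs the ceiling is 1000 requests per second when the dependency is healthy, 8 when it takes 25 seconds, and only 16 after doubling the pool to 400 threads. Capping the wait at 5 seconds with the original 200 threads already gives 40, which is why the timeout helped more than the extra threads.&lt;/p&gt;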

&lt;p&gt;Part 3 — The proper fix: make the external call async&lt;/p&gt;

&lt;p&gt;The real solution was to stop blocking a request-handling thread while waiting for the external service. We used Spring's &lt;code&gt;@Async&lt;/code&gt; with a dedicated thread pool for external calls, isolating them from the main request-handling threads:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.concurrent.Executor&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.springframework.context.annotation.Bean&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.springframework.context.annotation.Configuration&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.springframework.scheduling.annotation.EnableAsync&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="nd"&gt;@EnableAsync&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Dedicated pool for external calls, separate from the Tomcat request threads&lt;/span&gt;
    &lt;span class="nd"&gt;@Bean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"externalCallExecutor"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Executor&lt;/span&gt; &lt;span class="nf"&gt;externalCallExecutor&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;ThreadPoolTaskExecutor&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ThreadPoolTaskExecutor&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setCorePoolSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setMaxPoolSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setQueueCapacity&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setThreadNamePrefix&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"external-call-"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;initialize&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



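&lt;p&gt;Now the external call runs on its own isolated thread pool. Even if that external service slows to a crawl, the damage is confined to the &lt;code&gt;external-call-&lt;/code&gt; threads, and our main Tomcat threads keep handling requests normally. This holds as long as the request thread does not block waiting on the result, for example by returning a &lt;code&gt;CompletableFuture&lt;/code&gt; from the controller. The two concerns are separated.&lt;/p&gt;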
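
&lt;p&gt;To make the isolation idea concrete without any Spring wiring, here is a minimal plain-JDK sketch of the same pattern: a dedicated pool for the external call, plus a deadline after which the caller gets a fallback. It is an illustration, not our production code. &lt;code&gt;EnrichmentBulkhead&lt;/code&gt;, &lt;code&gt;EnrichmentCall&lt;/code&gt; and &lt;code&gt;ResultHandler&lt;/code&gt; are hypothetical names, and the pool size of 10 simply mirrors the core pool size above.&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class EnrichmentBulkhead {

    // Hypothetical stand-in for the blocking call to the external service.
    interface EnrichmentCall {
        String fetch() throws Exception;
    }

    // Hypothetical callback that receives the real value or the fallback.
    interface ResultHandler {
        void accept(String value);
    }

    // Dedicated pool: slow external calls can tie up at most these 10 threads.
    private final ExecutorService externalPool = Executors.newFixedThreadPool(10, task -> {
        Thread thread = new Thread(task, "external-call");
        thread.setDaemon(true); // do not keep the JVM alive for stuck calls
        return thread;
    });

    // Returns immediately. onDone later receives the result, or the fallback
    // if the call fails or takes longer than timeoutMillis.
    void enrichAsync(EnrichmentCall call, String fallback, long timeoutMillis, ResultHandler onDone) {
        CompletableFuture
                .supplyAsync(() -> {
                    try {
                        return call.fetch();
                    } catch (Exception e) {
                        throw new CompletionException(e);
                    }
                }, externalPool)
                // Completes with the fallback after the deadline. The pool thread is
                // not interrupted, so the HTTP read timeout is still needed to free it.
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .thenAccept(onDone::accept);
    }

    public static void main(String[] args) throws Exception {
        EnrichmentBulkhead bulkhead = new EnrichmentBulkhead();
        bulkhead.enrichAsync(() -> "enriched", "fallback", 500,
                value -> System.out.println("got " + value));
        Thread.sleep(200); // give the daemon pool time to finish before the JVM exits
    }
}
```

&lt;p&gt;The caller never waits on the external service, so a slow dependency can only exhaust the small dedicated pool. Note that the deadline alone does not release the worker thread, which is why the timeouts from Part 1 and the isolation here work best together.&lt;/p&gt;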

&lt;p&gt;What happened after the fix&lt;/p&gt;

&lt;p&gt;After deploying the timeout fix first, the immediate crisis resolved. Threads were no longer getting stuck indefinitely — after 5 seconds they would time out, return an error or fallback response, and free up for the next request. After the async refactor, the problem disappeared entirely. Active thread count during peak traffic dropped from 200 (maxed out) to consistently under 40. Response times went back to normal. The external service could be slow or even temporarily down, and our API kept working.&lt;/p&gt;
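
&lt;p&gt;Active thread count is also worth watching as a guardrail after the fix. The numbers above are for the Tomcat pool, but the same check can be sketched for any &lt;code&gt;java.util.concurrent.ThreadPoolExecutor&lt;/code&gt;, such as the one behind a dedicated external-call pool. &lt;code&gt;PoolGuardrail&lt;/code&gt; and the 0.8 threshold are assumptions for illustration, and &lt;code&gt;getActiveCount()&lt;/code&gt; is documented as an approximate value.&lt;/p&gt;

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class PoolGuardrail {

    // Fraction of the pool's maximum size that is busy right now (0.0 to 1.0).
    static double saturation(ThreadPoolExecutor pool) {
        return (double) pool.getActiveCount() / pool.getMaximumPoolSize();
    }

    // True when the pool is at or above the alert threshold, e.g. 0.8 for 80%.
    static boolean shouldAlert(ThreadPoolExecutor pool, double threshold) {
        return saturation(pool) >= threshold;
    }

    public static void main(String[] args) {
        var pool = new ThreadPoolExecutor(10, 50, 60, TimeUnit.SECONDS, new LinkedBlockingQueue());
        System.out.println("saturation: " + saturation(pool));
        System.out.println("alert: " + shouldAlert(pool, 0.8));
        pool.shutdown();
    }
}
```

&lt;p&gt;Polling a check like this on a schedule, or exporting the ratio as a metric, would have flagged our incident when the pool first approached its limit rather than after requests started hanging.&lt;/p&gt;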

&lt;p&gt;&lt;strong&gt;What I learned from this&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Always set timeouts on external calls. Every HTTP call your service makes to another service or API must have a connect timeout and a read timeout. No exceptions. Missing timeouts are one of the most common causes of thread pool exhaustion.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Thread pool exhaustion looks like slowness, not errors. If your API is suddenly unresponsive with no exceptions in the logs, check your active thread count before you do anything else.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increase thread pool size only as a temporary measure. It buys you time but does not fix the root cause. The real fix is to stop blocking threads in the first place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Isolate slow dependencies. Any external call that could be slow or unreliable should run on its own thread pool, completely separate from your main request-handling threads. This is called the bulkhead pattern, and it is one of the most important resilience patterns in microservices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Actuator is your best friend in production. If you are running Spring Boot and you do not have Actuator enabled with metrics exposed, you are flying blind. Enable it. It will save you hours the next time something goes wrong.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope this walkthrough helped. Thread pool exhaustion may seem complex, but it is easy to fix once you know the signs. If you've faced this issue or are dealing with it now, share your approach in the comments. If you found this useful, follow me for insights on Java backend engineering, Spring Boot, Kafka, AWS, and real-world production challenges.&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>backenddevelopment</category>
      <category>api</category>
    </item>
  </channel>
</rss>
