<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jayesh Shinde</title>
    <description>The latest articles on Forem by Jayesh Shinde (@jayesh_shinde).</description>
    <link>https://forem.com/jayesh_shinde</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3517237%2Fb526557a-5a75-4338-814a-d6d3277a47b7.jpg</url>
      <title>Forem: Jayesh Shinde</title>
      <link>https://forem.com/jayesh_shinde</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jayesh_shinde"/>
    <language>en</language>
    <item>
      <title>How We Fixed Intermittent ECS Image-Not-Found Errors in AWS CDK</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Fri, 13 Mar 2026 09:00:11 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/how-we-fixed-intermittent-ecs-image-not-found-errors-in-aws-cdk-477f</link>
      <guid>https://forem.com/jayesh_shinde/how-we-fixed-intermittent-ecs-image-not-found-errors-in-aws-cdk-477f</guid>
      <description>&lt;p&gt;At one point, our ECS deployments started failing in a way that felt random.&lt;/p&gt;

&lt;p&gt;Sometimes a deployment would work perfectly. Sometimes the service would try to roll forward and fail because the container image it expected was no longer available. Nothing was wrong with the application code. The problem was in the deployment asset flow.&lt;/p&gt;

&lt;p&gt;We were using AWS CDK to deploy container-based workloads, and like many teams, we were relying on CDK’s default bootstrap ECR repository for Docker image assets. That was convenient at first, but it became a problem once repository retention rules were tightened for cost control.&lt;/p&gt;

&lt;p&gt;In environments with frequent deployments, older intermediate images were being cleaned up faster than our deployment flow could safely tolerate. The result was intermittent ECS deploy failures caused by missing images.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;AWS CDK Docker assets are published during the &lt;strong&gt;asset publishing phase&lt;/strong&gt;, which happens before CloudFormation starts deploying stacks.&lt;/p&gt;

&lt;p&gt;That means two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CDK is not just defining infrastructure, it is also managing where deployable image assets are stored.&lt;/li&gt;
&lt;li&gt;If the default asset repository has aggressive cleanup policies, your deployments can become fragile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is especially painful in non-production environments where deployment frequency is high and image churn is constant.&lt;/p&gt;
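&lt;p&gt;You can inspect the shared repository involved directly. With an unmodified bootstrap, the Docker asset repository follows a predictable naming pattern (the qualifier below is the CDK default; the account and region are placeholders, and a customized bootstrap may use a different qualifier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Default CDK bootstrap Docker-asset repository:
#   cdk-QUALIFIER-container-assets-ACCOUNT-REGION
aws ecr describe-repositories \
  --repository-names cdk-hnb659fds-container-assets-123456789012-us-east-1

# Check whether a lifecycle policy is attached to it
aws ecr get-lifecycle-policy \
  --repository-name cdk-hnb659fds-container-assets-123456789012-us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;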

&lt;h2&gt;
  
  
  The Strategy We Took
&lt;/h2&gt;

&lt;p&gt;We wanted a solution that was simple, low-risk, and did not require redesigning the whole build pipeline.&lt;/p&gt;

&lt;p&gt;So instead of pushing ECS image assets to the shared default CDK ECR repository, we moved to a &lt;strong&gt;dedicated ECR repository per environment/application area&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At a high level, the fix looked like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create a dedicated ECR repository ahead of time&lt;/li&gt;
&lt;li&gt;configure the CDK synthesizer to publish image assets there&lt;/li&gt;
&lt;li&gt;keep lifecycle control on that repository&lt;/li&gt;
&lt;li&gt;deploy the ECR stack first, then the app stacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gave us isolation from the shared bootstrap repository while keeping the rest of the CDK deployment model mostly unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sample: Dedicated ECR Stack
&lt;/h2&gt;

&lt;p&gt;Here is a simplified example of creating a dedicated ECR repository with a lifecycle policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Repository&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib/aws-ecr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;AppEnvProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContainerAssetRepoStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;AppEnvProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AppContainerRepo&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;repositoryName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`myapp-assets-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;removalPolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RemovalPolicy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;RETAIN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;lifecycleRules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;rulePriority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Keep only the latest 100 images&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="na"&gt;maxImageCount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few important details here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;RETAIN&lt;/code&gt; protects the repository if the stack is deleted later.&lt;/li&gt;
&lt;li&gt;lifecycle rules still clean up old images over time.&lt;/li&gt;
&lt;li&gt;the repository name is normalized to lowercase, since ECR repository names must be lowercase.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Sample: Point CDK to the Dedicated Repo
&lt;/h2&gt;

&lt;p&gt;Once the repository exists, the application stack can tell CDK to publish image assets there using &lt;code&gt;DefaultStackSynthesizer&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;App&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DefaultStackSynthesizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-cdk-lib&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;constructs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;ServiceStackProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ServiceStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ServiceStackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;synthesizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DefaultStackSynthesizer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;imageAssetsRepositoryName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`myapp-assets-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// ECS service, task definition, container asset usage, etc.&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ServiceStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ServiceStackDev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dev&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps the existing CDK asset publishing model, but moves the destination away from the shared default bootstrap repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Important Gotcha
&lt;/h2&gt;

&lt;p&gt;A stack dependency is &lt;strong&gt;not enough&lt;/strong&gt; if the same deployment run tries to create the ECR repository and publish assets into it.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because asset publishing happens before CloudFormation stack deployment.&lt;/p&gt;

&lt;p&gt;So if the repository does not already exist, the asset publish step can fail before your “repo stack” is even deployed.&lt;/p&gt;

&lt;p&gt;The safest pattern is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deploy the ECR repository stack first&lt;/li&gt;
&lt;li&gt;run the normal application deployment after that&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That sequencing matters.&lt;/p&gt;
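&lt;p&gt;In CI, that ordering can be made explicit with two separate deploy steps (the stack names here match the illustrative samples above; yours will differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Step 1: make sure the dedicated ECR repository exists
cdk deploy ContainerAssetRepoStackDev --require-approval never

# Step 2: only now does asset publishing have a valid push target
cdk deploy ServiceStackDev --require-approval never
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;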

&lt;h2&gt;
  
  
  Another Important Gotcha: IAM Permissions
&lt;/h2&gt;

&lt;p&gt;Changing the repository target is not enough by itself.&lt;/p&gt;

&lt;p&gt;The identity or role that CDK uses to publish Docker assets must also have permission to push to the new ECR repository.&lt;/p&gt;

&lt;p&gt;That usually means allowing actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ecr:PutImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:InitiateLayerUpload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:UploadLayerPart&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:CompleteLayerUpload&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchCheckLayerAvailability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:BatchGetImage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:GetDownloadUrlForLayer&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you forget this part, the deployment simply moves from “image missing” problems to “access denied” problems.&lt;/p&gt;
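&lt;p&gt;As a rough sketch, the policy attached to the publishing identity might look like this (the repository ARN is illustrative; note that &lt;code&gt;ecr:GetAuthorizationToken&lt;/code&gt; is not resource-scoped and needs &lt;code&gt;"Resource": "*"&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/myapp-assets-*"
    },
    {
      "Effect": "Allow",
      "Action": "ecr:GetAuthorizationToken",
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;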

&lt;h2&gt;
  
  
  Why This Worked Well for Us
&lt;/h2&gt;

&lt;p&gt;We liked this approach because it was a practical middle ground.&lt;/p&gt;

&lt;p&gt;It did not require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rebuilding our CI/CD image strategy from scratch&lt;/li&gt;
&lt;li&gt;changing every ECS service definition&lt;/li&gt;
&lt;li&gt;introducing a more complex app-owned image publishing flow immediately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it did give us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;predictable image retention&lt;/li&gt;
&lt;li&gt;environment-specific isolation&lt;/li&gt;
&lt;li&gt;fewer surprises during ECS deployments&lt;/li&gt;
&lt;li&gt;better control over cost and cleanup behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Use This Pattern
&lt;/h2&gt;

&lt;p&gt;This approach makes sense if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you already use CDK-managed Docker/container assets&lt;/li&gt;
&lt;li&gt;the default bootstrap ECR repository is shared across too many deployments&lt;/li&gt;
&lt;li&gt;retention rules on that shared repository are causing instability&lt;/li&gt;
&lt;li&gt;you want a fast, low-disruption improvement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want a more explicit long-term model, the next step is usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;build image in CI&lt;/li&gt;
&lt;li&gt;push image to a named ECR repository yourself&lt;/li&gt;
&lt;li&gt;reference the image directly in ECS by repo and tag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That gives maximum control, but it also requires more changes.&lt;/p&gt;
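&lt;p&gt;In CDK terms, that explicit model looks roughly like this (a sketch: the repository name and tag are placeholders, and the image must already exist because CI pushed it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Stack } from 'aws-cdk-lib';
import { Repository } from 'aws-cdk-lib/aws-ecr';
import { ContainerImage } from 'aws-cdk-lib/aws-ecs';

// Inside a stack: look up the repository CI pushes to,
// then reference an exact, immutable tag.
declare const stack: Stack;
const repo = Repository.fromRepositoryName(stack, 'AppRepo', 'myapp-service');
const image = ContainerImage.fromEcrRepository(repo, 'build-1234');
// `image` goes straight into a task definition; CDK no longer
// builds or publishes anything for this container.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;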

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;CDK defaults are great for getting started, but they are not always ideal once platform constraints like retention, cost control, and deployment frequency start to matter.&lt;/p&gt;

&lt;p&gt;In our case, moving Docker assets to dedicated ECR repositories was a small change with a big operational impact. It made deployments more predictable without forcing a major rework of the pipeline.&lt;/p&gt;

</description>
      <category>cdk</category>
      <category>aws</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>The Silent Connection Killer: MySQL2 and AWS Lambda's Freeze/Thaw Problem</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Thu, 05 Feb 2026 09:03:13 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/the-silent-connection-killer-mysql2-and-aws-lambdas-freezethaw-problem-1pek</link>
      <guid>https://forem.com/jayesh_shinde/the-silent-connection-killer-mysql2-and-aws-lambdas-freezethaw-problem-1pek</guid>
      <description>&lt;h2&gt;
  
  
  The Mystery Error
&lt;/h2&gt;

&lt;p&gt;You're running a Node.js Lambda with MySQL2, everything works great in testing, but production logs show intermittent failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Connection lost: The server closed the connection.code: "PROTOCOL_CONNECTION_LOST"
fatal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pattern. No warning. Just random failures that make you question your life choices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Real Problem: How Lambda Actually Works
&lt;/h2&gt;

&lt;p&gt;Lambda doesn't spin up a fresh container for every request. AWS keeps containers "warm" for reuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request 1 → Lambda runs → Response
                ↓
           [FREEZE] ← Container paused (not terminated)
                ↓     
         (5-15 min pass)
                ↓
           [THAW] ← Container resumed
                ↓
Request 2 → Lambda runs → 💥 PROTOCOL_CONNECTION_LOST

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;During freeze, your Lambda is literally paused. The JavaScript event loop stops. Timers stop. Everything stops.&lt;br&gt;
But here's the catch: the outside world doesn't stop.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Happens to Your Database Connection
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Lambda creates a MySQL connection pool&lt;/li&gt;
&lt;li&gt;Connections sit idle in the pool&lt;/li&gt;
&lt;li&gt;Lambda freezes (container paused)&lt;/li&gt;
&lt;li&gt;Real world time passes (5-30 minutes)&lt;/li&gt;
&lt;li&gt;Network timeouts occur, NAT gateways clear state, RDS Proxy cleans up&lt;/li&gt;
&lt;li&gt;The TCP socket dies, but your pool doesn't know&lt;/li&gt;
&lt;li&gt;Lambda thaws, tries to use the dead connection&lt;/li&gt;
&lt;li&gt;💥 PROTOCOL_CONNECTION_LOST&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Why idleTimeout Doesn't Help
&lt;/h2&gt;

&lt;p&gt;You might think: "I'll set &lt;code&gt;idleTimeout: 60000&lt;/code&gt; to clean up idle connections!"&lt;br&gt;
Here's why it doesn't work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Timer starts (60s countdown)
    ↓
Lambda FREEZES at 1s elapsed
    ↓
████████████████████████████████
█  15 minutes pass in REAL WORLD  █
█  Timer is PAUSED at 1s          █
████████████████████████████████
    ↓
Lambda THAWS - timer resumes at 1s
    ↓
Connection still in pool (timer thinks 2s passed)
    ↓
Connection is DEAD but pool doesn't know

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The timer doesn't run during freeze.&lt;/strong&gt; Your 60-second timeout is useless against a 15-minute freeze.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Detect and Retry
&lt;/h2&gt;

&lt;p&gt;Since we can't prevent stale connections, we detect them and retry transparently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Enable TCP Keep-Alive
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const pool = mysql.createPool({
  ...config,
  enableKeepAlive: true,
  keepAliveInitialDelay: 10000,
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This helps get clear error codes (ECONNRESET, PROTOCOL_CONNECTION_LOST) instead of hanging indefinitely.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 2: Implement Retry Logic
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;async executeQuery(sql, params) {
  const maxRetries = 1;
  let lastError = null;

  for (let attempt = 0; attempt &amp;lt;= maxRetries; attempt++) {
    let connection = null;

    try {
      connection = await this.getConnectionFromPool();
      const result = await this.executeQueryWithConnection(connection, sql, params);
      connection.release();
      return result;
    } catch (error) {
      // Non-recoverable error - throw immediately
      if (!this.isConnectionLostError(error)) {
        if (connection) connection.release();
        throw error;
      }

      // Connection lost - destroy stale connection
      if (connection) connection.destroy();

      // Retry if attempts left
      lastError = error;
      if (attempt &amp;lt; maxRetries) {
        console.warn("Connection lost, retrying...", { attempt: attempt + 1 });
        continue;
      }
    }
  }

  throw lastError;
}

isConnectionLostError(error) {
  const recoverableCodes = [
    "PROTOCOL_CONNECTION_LOST",  // Server closed connection
    "ECONNRESET",                // TCP reset
    "EPIPE",                     // Broken pipe
    "ETIMEDOUT",                 // Connection timeout
    "ECONNREFUSED",              // Connection refused
  ];
  return recoverableCodes.includes(error?.code);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Points
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;connection.destroy()&lt;/code&gt; removes the stale connection from the pool (don't reuse it!).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;connection.release()&lt;/code&gt; returns a healthy connection to the pool.&lt;/li&gt;
&lt;li&gt;One retry is usually enough: the second attempt gets a fresh connection.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  What About Pool Settings?
&lt;/h2&gt;

&lt;p&gt;Do I need to tune &lt;code&gt;connectionLimit&lt;/code&gt;, &lt;code&gt;maxIdle&lt;/code&gt;, etc.?&lt;br&gt;
Short answer: not really.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Helps with freeze/thaw?&lt;/th&gt;
&lt;th&gt;Why?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;idleTimeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Timer paused during freeze&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;maxIdle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Marginally&lt;/td&gt;
&lt;td&gt;Fewer connections = fewer stale ones, but adds reconnection overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;connectionLimit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Doesn't affect stale connections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Retry logic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles stale connections at runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you're using RDS Proxy, it handles connection pooling at the infrastructure level. Keep your Lambda pool settings simple and let the retry logic do the heavy lifting.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using RDS Proxy?
&lt;/h2&gt;

&lt;p&gt;RDS Proxy's "Idle client connection timeout" (default: 30 minutes) is separate from MySQL's wait_timeout. The proxy manages Lambda→Proxy connections independently.&lt;br&gt;
But even with RDS Proxy, connections can die due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT gateway timeouts (typically 5-15 minutes for idle TCP)&lt;/li&gt;
&lt;li&gt;Network state table cleanup&lt;/li&gt;
&lt;li&gt;Proxy internal connection recycling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The retry logic is still your safety net.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Final Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────┐
│                    Your Lambda                       │
├─────────────────────────────────────────────────────┤
│                                                      │
│   executeQuery()                                     │
│       ↓                                             │
│   for (attempt = 0; attempt &amp;lt;= 1; attempt++)        │
│       ↓                                             │
│   getConnection() → Try query                       │
│       ↓                                             │
│   Success? → return result                          │
│       ↓                                             │
│   Connection lost? → destroy() → retry              │
│       ↓                                             │
│   Other error? → release() → throw                  │
│                                                      │
└─────────────────────────────────────────────────────┘
           ↓
    [RDS Proxy] (optional)
           ↓
    [MySQL/Aurora]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda freezes, connections go stale&lt;/td&gt;
&lt;td&gt;Retry logic detects and recovers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pool doesn't know connections are dead&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;enableKeepAlive&lt;/code&gt; for faster detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;idleTimeout&lt;/code&gt; doesn't work during freeze&lt;/td&gt;
&lt;td&gt;Accept it, rely on retry instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Random &lt;code&gt;PROTOCOL_CONNECTION_LOST&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Transparent retry = users don't notice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: You can't prevent stale connections in a serverless environment. But you can detect them instantly and retry transparently.&lt;/p&gt;
&lt;h2&gt;
  
  
  BUT....
&lt;/h2&gt;


&lt;h2&gt;
  
  
  The Problem With Just Retrying Once
&lt;/h2&gt;

&lt;p&gt;The retry approach above suggested detecting a stale connection error and retrying once. That works, but only if a &lt;strong&gt;single connection&lt;/strong&gt; went stale. After a longer Lambda freeze (10–15+ minutes), the entire pool goes stale. Here's the scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pool has 5 connections → Lambda freezes → all 5 TCP sockets die

Request comes in after thaw:
→ gets conn #1 from _freeConnections → FAILS (stale)
→ retry: gets conn #2 from _freeConnections → FAILS (also stale!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One retry is not enough because &lt;code&gt;_freeConnections&lt;/code&gt; is a queue — the retry just picks the next dead connection in line.&lt;/p&gt;
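&lt;p&gt;The failure mode is easy to reproduce with a toy model (plain objects standing in for mysql2 connections; this mirrors the FIFO behavior only, not the real pool API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Each toy "connection" just knows whether its socket is alive.
function makeConn(alive) {
  return { alive: alive, destroy: function () { this.alive = false; } };
}

// Stand-in for pool._freeConnections after a long freeze: all dead.
const freeConnections = [makeConn(false), makeConn(false), makeConn(false)];

function getConnection() {
  // FIFO shift with no health check, like mysql2's pool.js.
  if (freeConnections.length !== 0) return freeConnections.shift();
  return makeConn(true); // empty pool: a brand-new, healthy connection
}

function queryWithRetries(retries) {
  let attempt = 0;
  while (attempt !== retries + 1) {
    const conn = getConnection();
    if (conn.alive) return 'ok';
    conn.destroy(); // discard the stale connection, then retry
    attempt += 1;
  }
  return 'PROTOCOL_CONNECTION_LOST';
}

// One retry only reaches connection #2, which is just as dead:
console.log(queryWithRetries(1)); // 'PROTOCOL_CONNECTION_LOST'

// Draining the whole free list first forces a fresh connection:
freeConnections.length = 0;
console.log(queryWithRetries(0)); // 'ok'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;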




&lt;h2&gt;
  
  
  How mysql2's Pool Actually Works Internally
&lt;/h2&gt;

&lt;p&gt;Looking at mysql2's &lt;code&gt;pool.js&lt;/code&gt; source, &lt;code&gt;getConnection()&lt;/code&gt; does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified from mysql2 internals&lt;/span&gt;
&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// FIFO — no health check&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// if allConnections.length &amp;lt; connectionLimit → create a NEW connection&lt;/span&gt;
  &lt;span class="c1"&gt;// otherwise → queue the request in _connectionQueue&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;There is zero validation when pulling from &lt;code&gt;_freeConnections&lt;/code&gt;.&lt;/strong&gt; The pool hands you whatever is sitting there, stale or not.&lt;/p&gt;

&lt;p&gt;The inverse is also useful to know — when &lt;code&gt;_freeConnections&lt;/code&gt; is empty AND &lt;code&gt;_allConnections.length &amp;lt; connectionLimit&lt;/code&gt;, mysql2 will automatically create a &lt;strong&gt;brand new&lt;/strong&gt; TCP connection. This is the behavior we want to exploit.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: Drain All Free Connections on a Stale Error
&lt;/h2&gt;

&lt;p&gt;Instead of retrying once and hoping the next connection is healthy, destroy &lt;strong&gt;every connection&lt;/strong&gt; in &lt;code&gt;_freeConnections&lt;/code&gt; the moment you detect a stale error. The pool's own logic then forces a fresh connection on retry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;STALE_ERRORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PROTOCOL_CONNECTION_LOST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ECONNRESET&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EPIPE&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ETIMEDOUT&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ECONNREFUSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;drainFreeConnections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Destroy all idle connections in one sweep&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conn&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// sets conn._pool = null, removes from _allConnections&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// clear the array in-place&lt;/span&gt;
  &lt;span class="c1"&gt;// pool._allConnections.length is now reduced → next getConnection()&lt;/span&gt;
  &lt;span class="c1"&gt;// sees: freeConnections empty + allConnections &amp;lt; connectionLimit&lt;/span&gt;
  &lt;span class="c1"&gt;// → creates a fresh TCP connection automatically&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;executeQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// kill the one that triggered the error&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;STALE_ERRORS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// Nuke all remaining free connections — they're all suspect&lt;/span&gt;
      &lt;span class="nf"&gt;drainFreeConnections&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="c1"&gt;// Retry — pool is now forced to open a fresh connection&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;freshConn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getConnection&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;release&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;retryErr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;freshConn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;destroy&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;retryErr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why drain with &lt;code&gt;shift()&lt;/code&gt; in a loop?
&lt;/h3&gt;

&lt;p&gt;In current mysql2 versions, &lt;code&gt;_freeConnections&lt;/code&gt; is a &lt;strong&gt;Denque&lt;/strong&gt; (a double-ended queue), not a plain array. A Denque is not iterable with &lt;code&gt;for...of&lt;/code&gt;, and its &lt;code&gt;length&lt;/code&gt; is a read-only getter, so assigning &lt;code&gt;length = 0&lt;/code&gt; does nothing (or throws in strict mode). What both Denque and Array support is &lt;code&gt;shift()&lt;/code&gt;, the same FIFO operation mysql2 itself uses when handing out connections (and &lt;code&gt;push()&lt;/code&gt; when releasing them). Shifting before destroying also sidesteps a subtler bug: &lt;code&gt;destroy()&lt;/code&gt; removes the connection from the pool's internal queues, so destroying while iterating would mutate the collection mid-iteration and skip entries.&lt;/p&gt;

&lt;h3&gt;
  
  
  What &lt;code&gt;conn.destroy()&lt;/code&gt; does under the hood
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;destroy()&lt;/code&gt;, mysql2 does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified from connection._removeFromPool()&lt;/span&gt;
&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_allConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;splice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_allConnections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;indexOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// TCP socket is closed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So after &lt;code&gt;drainFreeConnections&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;_freeConnections&lt;/code&gt; → empty ✅&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_allConnections.length&lt;/code&gt; → drops by the number of destroyed connections ✅&lt;/li&gt;
&lt;li&gt;Next &lt;code&gt;getConnection()&lt;/code&gt; → pool sees room under &lt;code&gt;connectionLimit&lt;/code&gt; → creates fresh TCP connection ✅&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison: Single Retry vs. Drain All
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;After 1 stale conn&lt;/th&gt;
&lt;th&gt;After full pool stale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single retry (article)&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;❌ Retry hits another stale conn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drain all + retry (this approach)&lt;/td&gt;
&lt;td&gt;✅ Works&lt;/td&gt;
&lt;td&gt;✅ Pool forced to create fresh conn&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  A Note on &lt;code&gt;_freeConnections&lt;/code&gt; Being a Private API
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;_freeConnections&lt;/code&gt; is not exported in mysql2's TypeScript typings (it's missing from &lt;code&gt;Pool.d.ts&lt;/code&gt;). In TypeScript, you'll need a cast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;_freeConnections&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It has been present and stable across many releases. But since it's not officially part of the public API, it can change without notice, so it's worth re-checking after major version upgrades.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Prisma Doesn't Have This Problem
&lt;/h2&gt;

&lt;p&gt;If you're using Prisma, you may have noticed it doesn't suffer from the freeze/thaw stale connection issue as badly. There's a concrete reason for this — it's not magic, it's the Rust query engine.&lt;/p&gt;

&lt;p&gt;Prisma uses a connection pool built on the &lt;code&gt;mobc&lt;/code&gt; library inside its Rust engine. Before handing a connection to your query, it performs a &lt;strong&gt;time-gated pre-ping&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connection pulled from pool
         ↓
Has more than 15 seconds passed since this connection was last used?
    YES → run SELECT 1
              ↓
         SELECT 1 succeeds? → proceed with query
         SELECT 1 fails?    → discard, open fresh connection
    NO  → skip ping, proceed directly (optimization for rapid queries)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is essentially the same pattern as SQLAlchemy's &lt;code&gt;pool_pre_ping=True&lt;/code&gt;, with a 15-second grace window to avoid pinging on every rapid-fire query.&lt;/p&gt;

&lt;p&gt;After a Lambda freeze of any meaningful duration (seconds to minutes), the timer has expired, so Prisma will ping &lt;strong&gt;before&lt;/strong&gt; your query even runs — and silently replace any dead connection. Your application code never sees the error.&lt;/p&gt;
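&lt;p&gt;If you're on a raw mysql2 pool and want a similar safety net, a minimal sketch of a time-gated pre-ping wrapper might look like this. Note that &lt;code&gt;getValidatedConnection&lt;/code&gt;, the 15-second constant, and the &lt;code&gt;lastUsedAt&lt;/code&gt; bookkeeping are our own inventions, not mysql2 API:&lt;/p&gt;

```javascript
const PING_AFTER_IDLE_MS = 15000; // mirror Prisma's 15-second grace window

// Hand out a pool connection, pinging it first if it has been idle too long.
// lastUsedAt is our own timestamp stamped onto the connection object.
async function getValidatedConnection(pool, attempt) {
  const tries = attempt || 0;
  const conn = await pool.getConnection();
  const idleFor = Date.now() - (conn.lastUsedAt || 0);
  if (idleFor > PING_AFTER_IDLE_MS) {
    try {
      await conn.ping(); // cheap liveness check (mysql2 promise API)
    } catch (err) {
      conn.destroy(); // dead connection: discard it entirely
      if (tries >= 1) throw err; // give up after one replacement attempt
      return getValidatedConnection(pool, tries + 1);
    }
  }
  conn.lastUsedAt = Date.now();
  return conn;
}
```

&lt;p&gt;Call it wherever you would call &lt;code&gt;pool.getConnection()&lt;/code&gt;; back-to-back queries skip the ping entirely, just like Prisma's grace window.&lt;/p&gt;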

&lt;h3&gt;
  
  
  How it stacks up
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;mysql2 (raw pool)&lt;/th&gt;
&lt;th&gt;Prisma (Rust engine)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-ping on checkout&lt;/td&gt;
&lt;td&gt;❌ None&lt;/td&gt;
&lt;td&gt;✅ If &amp;gt;15s idle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Handles full pool going stale&lt;/td&gt;
&lt;td&gt;❌ Needs manual drain logic&lt;/td&gt;
&lt;td&gt;✅ Each connection validated individually&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error surfaces to app code&lt;/td&gt;
&lt;td&gt;✅ Yes — you must handle it&lt;/td&gt;
&lt;td&gt;❌ Transparent — retried internally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overhead&lt;/td&gt;
&lt;td&gt;None (no extra queries)&lt;/td&gt;
&lt;td&gt;One &lt;code&gt;SELECT 1&lt;/code&gt; per connection after idle period&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Even Prisma Isn't Bulletproof
&lt;/h3&gt;

&lt;p&gt;Worth noting: Prisma's pre-ping protects against stale connections, but the 15-second threshold means a freeze shorter than 15 seconds could still theoretically slip through. And connection-level issues outside the pool (e.g. NAT gateway state tables, RDS Proxy recycling) can still cause failures that the pre-ping doesn't catch. Retry logic at the application layer remains a good safety net regardless of ORM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The key insight is to work &lt;em&gt;with&lt;/em&gt; mysql2's internal pool logic rather than against it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;On a stale connection error, don't just retry — &lt;strong&gt;drain &lt;code&gt;_freeConnections&lt;/code&gt; first&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;mysql2 will automatically open fresh connections to fill the gap (it's built into &lt;code&gt;getConnection()&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Your retry then gets a genuinely new TCP connection instead of another dead one from the queue&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want this behavior without managing it yourself, Prisma's Rust engine gives you a time-gated pre-ping out of the box — which is the more principled long-term solution for serverless MySQL workloads.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>lambda</category>
      <category>backenddevelopment</category>
      <category>prisma</category>
    </item>
    <item>
      <title>Building a Clean Event Pipeline in Spring: From Simple Events to Async Listeners to the Outbox Pattern</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 03 Jan 2026 07:43:05 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/building-a-clean-event-pipeline-in-spring-from-simple-events-to-async-listeners-to-the-outbox-3pi7</link>
      <guid>https://forem.com/jayesh_shinde/building-a-clean-event-pipeline-in-spring-from-simple-events-to-async-listeners-to-the-outbox-3pi7</guid>
      <description>&lt;p&gt;Event‑driven architecture sounds simple on paper: &lt;em&gt;“emit an event when something happens.”&lt;/em&gt;&lt;br&gt;&lt;br&gt;
But once you start implementing it inside a real Spring Boot service, you quickly discover the hidden trade‑offs.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk through a real‑world progression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;emitting domain events inside a service
&lt;/li&gt;
&lt;li&gt;handling them with &lt;code&gt;@EventListener&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;realizing enrichment logic slows down the request
&lt;/li&gt;
&lt;li&gt;making listeners async
&lt;/li&gt;
&lt;li&gt;adding a production‑grade executor
&lt;/li&gt;
&lt;li&gt;and finally touching the &lt;strong&gt;gold standard&lt;/strong&gt;: the &lt;strong&gt;Outbox Pattern&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive in.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. The initial requirement: emit an event inside the service&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine a simple use case: when a user is created, we want to emit an event so other parts of the system can react.&lt;/p&gt;

&lt;p&gt;A clean way to do this in Spring is to wrap &lt;code&gt;ApplicationEventPublisher&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="nd"&gt;@AllArgsConstructor&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserEventPublisher&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ApplicationEventPublisher&lt;/span&gt; &lt;span class="n"&gt;applicationEventPublisher&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;applicationEventPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publishEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now inside your service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;userEventPublisher&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;2. Handling the event with &lt;code&gt;@EventListener&lt;/code&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A simple listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Component&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;UserAuditListener&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@EventListener&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreateEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"User created: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works beautifully… until you need to do more than just print.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. The problem: enrichment logic slows down the request&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Let’s say before publishing to Kafka, you want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fetch additional data from DB
&lt;/li&gt;
&lt;li&gt;call another service
&lt;/li&gt;
&lt;li&gt;enrich the event payload
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since &lt;code&gt;@EventListener&lt;/code&gt; is &lt;strong&gt;synchronous by default&lt;/strong&gt;, all this work blocks the original request thread.&lt;/p&gt;

&lt;p&gt;Your API response time suddenly spikes.&lt;/p&gt;

&lt;p&gt;Not good.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Making listeners async with &lt;code&gt;@Async&lt;/code&gt; and &lt;code&gt;@EnableAsync&lt;/code&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Spring makes this easy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@EnableAsync&lt;/span&gt;
&lt;span class="nd"&gt;@SpringBootApplication&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;App&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Async&lt;/span&gt;
&lt;span class="nd"&gt;@EventListener&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;handleUserCreateEvent&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;UserCreatedEvent&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// runs in a background thread&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the main request returns immediately while the listener does its work asynchronously.&lt;/p&gt;

&lt;p&gt;But there’s a catch…&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. The default executor is not production‑grade&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you don’t configure anything, Spring uses &lt;code&gt;SimpleAsyncTaskExecutor&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creates a new thread per task
&lt;/li&gt;
&lt;li&gt;no pooling
&lt;/li&gt;
&lt;li&gt;no backpressure
&lt;/li&gt;
&lt;li&gt;no monitoring
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fine for demos, not for real systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6. Adding a custom executor&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A better approach is to define your own thread pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="nd"&gt;@EnableAsync&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"taskExecutor"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Executor&lt;/span&gt; &lt;span class="nf"&gt;taskExecutor&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newCachedThreadPool&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now all &lt;code&gt;@Async&lt;/code&gt; methods use this executor.&lt;/p&gt;

&lt;p&gt;Note that a cached thread pool is still unbounded, so it solves thread reuse but not backpressure. For real control over pool size and queueing, replace it with a tuned &lt;code&gt;ThreadPoolTaskExecutor&lt;/code&gt;.&lt;/p&gt;
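&lt;p&gt;As a sketch, a tuned &lt;code&gt;ThreadPoolTaskExecutor&lt;/code&gt; might look like this (the pool sizes and thread-name prefix are illustrative, not prescriptive):&lt;/p&gt;

```java
import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean(name = "taskExecutor")
    public Executor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);       // threads kept warm at all times
        executor.setMaxPoolSize(16);       // hard ceiling under load
        executor.setQueueCapacity(100);    // backpressure: tasks queue before extra threads spawn
        executor.setThreadNamePrefix("event-listener-");
        // when both queue and pool are full, run on the caller thread instead of dropping
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }
}
```

&lt;p&gt;One detail worth knowing: threads beyond the core size are only created once the queue is full, which is why &lt;code&gt;queueCapacity&lt;/code&gt; matters as much as the pool sizes.&lt;/p&gt;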




&lt;h2&gt;
  
  
  &lt;strong&gt;7. The gold standard: the Outbox Pattern&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Async listeners solve the latency problem, but they don’t solve the &lt;strong&gt;reliability&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the DB transaction commits
&lt;/li&gt;
&lt;li&gt;but the async listener fails before sending to Kafka?
&lt;/li&gt;
&lt;li&gt;or the service crashes?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You lose the event.&lt;/p&gt;

&lt;p&gt;This is why mature systems use the &lt;strong&gt;Outbox Pattern&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How the Outbox Pattern works (high‑level)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Write the event into an “outbox” table inside the same DB transaction&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the user is created, the outbox record is also created
&lt;/li&gt;
&lt;li&gt;Atomic, consistent, no partial failures&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;A background process reads the outbox table&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a scheduled Spring job
&lt;/li&gt;
&lt;li&gt;a Kafka Connect Debezium connector
&lt;/li&gt;
&lt;li&gt;a lightweight polling thread
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The background process publishes the event to Kafka&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;After successful publish, the outbox record is marked as processed&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
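&lt;p&gt;As a sketch, a minimal outbox table might look like this (column names and types are illustrative):&lt;/p&gt;

```sql
CREATE TABLE outbox_event (
    id           BIGINT AUTO_INCREMENT PRIMARY KEY,
    aggregate_id VARCHAR(64)  NOT NULL,   -- e.g. the user id
    event_type   VARCHAR(128) NOT NULL,   -- e.g. 'UserCreatedEvent'
    payload      JSON         NOT NULL,   -- serialized event body
    created_at   TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP    NULL        -- set by the publisher after a successful send
);
```

&lt;p&gt;The insert into &lt;code&gt;outbox_event&lt;/code&gt; happens in the same transaction as the user insert (step 1); the background process selects rows where &lt;code&gt;processed_at IS NULL&lt;/code&gt;, publishes them, and then stamps &lt;code&gt;processed_at&lt;/code&gt; (steps 2 to 4).&lt;/p&gt;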

&lt;h3&gt;
  
  
  &lt;strong&gt;Why this is the gold standard&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;no lost events
&lt;/li&gt;
&lt;li&gt;no double‑publishing
&lt;/li&gt;
&lt;li&gt;no dependency on async listeners
&lt;/li&gt;
&lt;li&gt;fully decoupled from request latency
&lt;/li&gt;
&lt;li&gt;battle‑tested at Uber, Netflix, Stripe, Shopify
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;8. Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the journey we walked through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with simple Spring events
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;@EventListener&lt;/code&gt; to react to them
&lt;/li&gt;
&lt;li&gt;Realize enrichment logic slows down the request
&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;@Async&lt;/code&gt; + &lt;code&gt;@EnableAsync&lt;/code&gt; to make listeners non‑blocking
&lt;/li&gt;
&lt;li&gt;Add a custom executor for production‑grade async processing
&lt;/li&gt;
&lt;li&gt;Finally, adopt the &lt;strong&gt;Outbox Pattern&lt;/strong&gt; for guaranteed delivery and reliability
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This progression mirrors how real systems evolve as they scale.&lt;/p&gt;

&lt;p&gt;If you’re building event‑driven microservices, the outbox pattern is the foundation you eventually want to reach.&lt;/p&gt;

</description>
      <category>java</category>
      <category>springboot</category>
      <category>kafka</category>
      <category>eventdriven</category>
    </item>
    <item>
      <title>How a Cache Invalidation Bug Nearly Took Down Our System - And What We Changed After</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Fri, 05 Dec 2025 01:57:56 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/how-a-cache-invalidation-bug-nearly-took-down-our-system-and-what-we-changed-after-2dd2</link>
      <guid>https://forem.com/jayesh_shinde/how-a-cache-invalidation-bug-nearly-took-down-our-system-and-what-we-changed-after-2dd2</guid>
      <description>&lt;p&gt;A few weeks ago, we had one of those production incidents that quietly start in the background and explode right when the traffic peaks.&lt;br&gt;
This one involved &lt;strong&gt;Aurora MySQL&lt;/strong&gt;, a &lt;strong&gt;Lambda with a 30-second timeout&lt;/strong&gt;, and a poorly designed &lt;strong&gt;cache invalidation strategy&lt;/strong&gt; that ended up flooding our database.&lt;/p&gt;

&lt;p&gt;Here’s the story, what went wrong, and the changes we made so it never happens again.&lt;/p&gt;


&lt;h2&gt;
  
  
  🎬 The Setup
&lt;/h2&gt;

&lt;p&gt;The night before the incident, we upgraded our &lt;strong&gt;Aurora MySQL engine version&lt;/strong&gt;.&lt;br&gt;
Everything looked good. No alarms. No red flags.&lt;/p&gt;

&lt;p&gt;The next morning around &lt;strong&gt;8 AM&lt;/strong&gt;, our daily job kicked in — the one responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deleting the stale “master data” cache&lt;/li&gt;
&lt;li&gt;refetching fresh master data from the DB&lt;/li&gt;
&lt;li&gt;storing it back in cache&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This master dataset is required for the application to work correctly, so if the cache isn’t warm, the DB gets hammered.&lt;/p&gt;


&lt;h2&gt;
  
  
  💥 The Explosion
&lt;/h2&gt;

&lt;p&gt;Right after the engine upgrade, a specific query in the Lambda suddenly started taking &lt;strong&gt;30+ seconds&lt;/strong&gt;.&lt;br&gt;
But our Lambda had a &lt;strong&gt;30-second timeout&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So what happened?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cacheInvalidate → cacheRebuild flow failed.&lt;/li&gt;
&lt;li&gt;The cache remained empty.&lt;/li&gt;
&lt;li&gt;Every user request resulted in a &lt;strong&gt;cache miss&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;All those requests hit the DB directly.&lt;/li&gt;
&lt;li&gt;Aurora CPU spiked to &lt;strong&gt;99%&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Application responses stalled across the board.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Classic &lt;strong&gt;cache stampede&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We eventually triggered a &lt;strong&gt;failover&lt;/strong&gt;, and luckily the same query ran in &lt;em&gt;~28.7 seconds&lt;/em&gt; on the new writer, just under the Lambda timeout. That bought us a few minutes to stabilize.&lt;/p&gt;

&lt;p&gt;Later that night, we found the real culprit:&lt;br&gt;
➡️ &lt;strong&gt;The query needed a new index&lt;/strong&gt;, and the upgrade changed its execution plan.&lt;/p&gt;

&lt;p&gt;We created the index via a hotfix, and the DB stabilized.&lt;/p&gt;

&lt;p&gt;But the deeper problem was our cache invalidation approach.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧹 Our Original Cache Invalidation: Delete First, Hope Later
&lt;/h2&gt;

&lt;p&gt;Our initial flow was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete the existing cache key&lt;/li&gt;
&lt;li&gt;Fetch fresh data from DB&lt;/li&gt;
&lt;li&gt;Save it back to cache&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If step 2 fails, everything collapses.&lt;/p&gt;

&lt;p&gt;It’s simple… until it isn’t.&lt;br&gt;
In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.&lt;/p&gt;


&lt;h2&gt;
  
  
  🔧 What We Changed (and Recommend)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Never delete the cache before you have fresh data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We inverted the flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch → Validate → Update cache&lt;/li&gt;
&lt;li&gt;Only delete if we &lt;em&gt;already&lt;/em&gt; have fresh data ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This eliminates the “empty cache” window.&lt;/p&gt;
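&lt;p&gt;A minimal sketch of the inverted flow. The &lt;code&gt;fetchFromDB&lt;/code&gt; and &lt;code&gt;cache&lt;/code&gt; helpers are illustrative stand-ins, not our exact code:&lt;/p&gt;

```javascript
// Inverted flow: fetch and validate first, then overwrite the cache in one step.
// `cache` is any object with an async set(); `fetchFromDB` is your data loader.
async function refreshCache(cache, fetchFromDB, key = "Master-Data") {
  const fresh = await fetchFromDB();

  // Validate before writing: never replace good data with an empty result.
  if (!fresh || (Array.isArray(fresh) && fresh.length === 0)) {
    throw new Error("Refresh produced no data; keeping the existing cache");
  }

  // Single overwrite: there is no window where the key is missing.
  await cache.set(key, JSON.stringify(fresh));
  return fresh;
}

module.exports = { refreshCache };
```

&lt;p&gt;Because the old value is only overwritten after validation succeeds, a failed refresh leaves the previous cache entry untouched.&lt;/p&gt;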


&lt;h3&gt;
  
  
  &lt;strong&gt;2. Use “stale rollover” instead of blunt deletion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If the refresh job fails, we now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;rename the key&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Master-Data"&lt;/code&gt; → &lt;code&gt;"Master-Data-Stale"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;keep the old value available&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;add an internal notification so the team can investigate&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that even if the DB is slow or down, the system still has &lt;em&gt;something&lt;/em&gt; to serve.&lt;/p&gt;

&lt;p&gt;It’s not ideal, but it prevents a meltdown.&lt;/p&gt;
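&lt;p&gt;The rollover can be sketched like this, assuming a node-redis v4 style client and a hypothetical &lt;code&gt;notifyTeam&lt;/code&gt; alerting hook:&lt;/p&gt;

```javascript
// Stale rollover: if the refresh fails, rename the key instead of deleting it,
// so readers can still fall back to "Master-Data-Stale".
// `redis` is any client exposing async set()/rename(); `notifyTeam` is a
// placeholder for your alerting integration.
async function refreshWithRollover(redis, fetchFromDB, notifyTeam) {
  try {
    const fresh = await fetchFromDB();
    await redis.set("Master-Data", JSON.stringify(fresh));
  } catch (err) {
    // Keep the old value available under a stale key instead of losing it.
    // (In real Redis, RENAME throws if the source key is missing; guard with
    // EXISTS if that can happen in your setup.)
    await redis.rename("Master-Data", "Master-Data-Stale");
    await notifyTeam(`Master data refresh failed: ${err.message}`);
  }
}

module.exports = { refreshWithRollover };
```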


&lt;h3&gt;
  
  
  &lt;strong&gt;3. API layer now returns stale data when fresh data is unavailable&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The API logic became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try to read &lt;code&gt;"Master-Data"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If not found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attempt to rebuild (only if allowed)&lt;/li&gt;
&lt;li&gt;If rebuild fails → return stale data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This avoids cascading failures.&lt;/p&gt;
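&lt;p&gt;The read path above, as a sketch. The helper names (&lt;code&gt;rebuild&lt;/code&gt;, &lt;code&gt;canRebuild&lt;/code&gt;) are illustrative:&lt;/p&gt;

```javascript
// API read path: fresh key first, then an optional rebuild, then the stale copy.
// `cache` exposes an async get(); `rebuild` repopulates the cache (or throws);
// `canRebuild` gates which callers are allowed to hit the DB.
async function getMasterData(cache, rebuild, canRebuild) {
  const fresh = await cache.get("Master-Data");
  if (fresh) return JSON.parse(fresh);

  if (canRebuild) {
    try {
      return await rebuild();
    } catch {
      // Rebuild failed; fall through to stale data below.
    }
  }

  const stale = await cache.get("Master-Data-Stale");
  if (stale) return JSON.parse(stale);

  throw new Error("No master data available (fresh or stale)");
}

module.exports = { getMasterData };
```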


&lt;h3&gt;
  
  
  &lt;strong&gt;4. Add a Redis distributed lock to prevent cache stampede&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without this, even if stale data existed, multiple API nodes or Lambdas could all try to rebuild simultaneously — hammering the DB again.&lt;/p&gt;

&lt;p&gt;With a Redis lock:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;em&gt;one&lt;/em&gt; request gets the lock and rebuilds&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do &lt;strong&gt;not&lt;/strong&gt; hit DB&lt;/li&gt;
&lt;li&gt;Simply return stale data&lt;/li&gt;
&lt;li&gt;Wait for the winner to repopulate the cache&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This one change alone eliminates most of the stampede risk.&lt;/p&gt;
&lt;h3&gt;
  
  
  Node.js — Acquire Distributed Lock (Redis)
&lt;/h3&gt;

&lt;p&gt;Below is a simple Redis-based lock using &lt;code&gt;SET&lt;/code&gt; with the &lt;code&gt;NX&lt;/code&gt; and &lt;code&gt;PX&lt;/code&gt; options (no external locking library).&lt;br&gt;
You can swap the Redis client for &lt;code&gt;ioredis&lt;/code&gt; or &lt;code&gt;node-redis&lt;/code&gt; depending on your stack.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// redis.js
const { createClient } = require("redis");

const redis = createClient({
  url: process.env.REDIS_URL
});
redis.on("error", function (err) { console.error("Redis client error", err); });
redis.connect(); // node-redis v4 requires an explicit connect

module.exports = redis;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Acquiring and Releasing the Lock
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

const LOCK_KEY = "lock:master-data-refresh";
const LOCK_TTL = 10000; // 10 seconds

async function acquireLock() {
  const lockId = randomUUID();

  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }

  return null; // lock not acquired
}

async function releaseLock(lockId) {
  // Compare-and-delete atomically. A separate GET then DEL can race:
  // our lock may expire and another worker may acquire it between the
  // two calls, and we would delete the other worker's lock.
  const script = `
    if redis.call("GET", KEYS[1]) == ARGV[1] then
      return redis.call("DEL", KEYS[1])
    end
    return 0
  `;
  await redis.eval(script, { keys: [LOCK_KEY], arguments: [lockId] });
}

module.exports = { acquireLock, releaseLock };

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { acquireLock, releaseLock } = require("./lock");

async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;5. Add observability around refresh times&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We now record:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query execution time&lt;/li&gt;
&lt;li&gt;cache refresh duration&lt;/li&gt;
&lt;li&gt;lock acquisition metrics&lt;/li&gt;
&lt;li&gt;alerts when a refresh exceeds a threshold&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to catch slowdowns &lt;em&gt;before&lt;/em&gt; a timeout happens.&lt;/p&gt;
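&lt;p&gt;A small timing wrapper illustrates the idea. Here &lt;code&gt;emitMetric&lt;/code&gt; is a placeholder for your CloudWatch or StatsD client, and the metric names are illustrative:&lt;/p&gt;

```javascript
// Time a cache refresh and flag slow runs before they become timeouts.
// `emitMetric(name, value)` is an injected hook so the wrapper stays testable;
// in production it would forward to CloudWatch/StatsD.
async function timedRefresh(refreshFn, emitMetric, thresholdMs = 20_000) {
  const start = Date.now();
  try {
    return await refreshFn();
  } finally {
    // Emit duration even when the refresh throws, so failures are visible too.
    const durationMs = Date.now() - start;
    emitMetric("cache.refresh.duration_ms", durationMs);
    if (durationMs > thresholdMs) {
      emitMetric("cache.refresh.slow", 1); // alert well before the 30s Lambda timeout
    }
  }
}

module.exports = { timedRefresh };
```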




&lt;h2&gt;
  
  
  📝 Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Engine upgrades can change execution plans&lt;/strong&gt;, sometimes dramatically.&lt;/li&gt;
&lt;li&gt;Always benchmark critical queries after major DB changes.&lt;/li&gt;
&lt;li&gt;Cache invalidation strategies must assume that &lt;strong&gt;refresh can fail&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Serving &lt;strong&gt;stale-but-valid data&lt;/strong&gt; is often better than serving errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed locks&lt;/strong&gt; are essential for preventing cache stampedes.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The incident was stressful, but the learnings were worth it.&lt;br&gt;
Caching problems rarely show up during normal traffic — they appear right when your system is busiest.&lt;/p&gt;

&lt;p&gt;If you have a similar “delete-then-refresh” pattern somewhere in your application… you may want to review it before it reviews you.&lt;/p&gt;

</description>
      <category>node</category>
      <category>redis</category>
      <category>aws</category>
      <category>mysql</category>
    </item>
    <item>
      <title>🧩 From 15 Minutes to Infinite: Scaling STT Jobs with AWS Batch</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 09 Nov 2025 05:25:17 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/how-we-fixed-missing-transcripts-with-aws-batch-e7e</link>
      <guid>https://forem.com/jayesh_shinde/how-we-fixed-missing-transcripts-with-aws-batch-e7e</guid>
      <description>&lt;h3&gt;
  
  
  💡 The Problem
&lt;/h3&gt;

&lt;p&gt;We recently ran into a production issue — our &lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt; service stopped working for a few hours.&lt;br&gt;
The feature was fixed quickly, but the &lt;strong&gt;transcripts for that downtime were missing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Luckily, in &lt;strong&gt;Amazon Connect&lt;/strong&gt;, all call recordings are stored in &lt;strong&gt;S3&lt;/strong&gt;.&lt;br&gt;
So the audio was there, but no transcripts.&lt;/p&gt;

&lt;p&gt;We needed to reprocess all those missed files — &lt;strong&gt;fast&lt;/strong&gt;.&lt;/p&gt;


&lt;h3&gt;
  
  
  🧠 First Attempt: Lambda (and its Limitations)
&lt;/h3&gt;

&lt;p&gt;We quickly built a &lt;strong&gt;Lambda&lt;/strong&gt; function to process unprocessed files from S3.&lt;/p&gt;

&lt;p&gt;It worked fine — until it didn’t.&lt;br&gt;
AWS Lambda has a &lt;strong&gt;15-minute execution limit&lt;/strong&gt;, and processing large audio files can easily exceed that.&lt;/p&gt;

&lt;p&gt;We could have switched to EC2, but that felt like using a hammer for a small screw — no auto-scaling, no graceful shutdown, no built-in retry or job management.&lt;/p&gt;

&lt;p&gt;We needed something that behaved &lt;strong&gt;like a job&lt;/strong&gt;, not a script.&lt;/p&gt;


&lt;h3&gt;
  
  
  🚀 Enter AWS Batch + Fargate
&lt;/h3&gt;

&lt;p&gt;That’s when &lt;strong&gt;AWS Batch&lt;/strong&gt; came to the rescue.&lt;br&gt;
It’s perfect for this kind of workload — long-running, batch-style, event-driven jobs.&lt;/p&gt;

&lt;p&gt;Here’s the setup we used:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Created a Compute Environment&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Backed by &lt;strong&gt;AWS Fargate&lt;/strong&gt; → no EC2 management.&lt;/li&gt;
&lt;li&gt;Scales automatically depending on job load.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Defined a Job Queue&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;All reprocessing jobs are submitted here.&lt;/li&gt;
&lt;li&gt;The queue ensures controlled concurrency and retries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Built a Job Definition&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Packaged our STT processing logic as a &lt;strong&gt;Docker image&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Uploaded it to &lt;strong&gt;Amazon ECR&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Defined required vCPU and memory for each job.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Triggered via Lambda&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A small Lambda fetches a list of unprocessed S3 files.&lt;/li&gt;
&lt;li&gt;For each batch (say 50 files), it &lt;strong&gt;submits a Batch Job&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  ⚙️ The Flow in Action
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lambda →&lt;/strong&gt; Checks for unprocessed audio files in S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda → AWS Batch:&lt;/strong&gt; Submits a job to process them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Batch (Fargate)&lt;/strong&gt; spins up compute, runs the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job →&lt;/strong&gt; Downloads audio → runs STT → uploads transcript → updates metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fargate shuts down automatically&lt;/strong&gt; when the job finishes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No idle servers, no manual cleanup, no stress.&lt;/p&gt;
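&lt;p&gt;The Lambda side of this flow can be sketched as a helper that chunks the unprocessed S3 keys and builds one SubmitJob request per batch. The queue and job-definition names, the batch size, and the &lt;code&gt;S3_KEYS&lt;/code&gt; env var are illustrative; the actual submission would go through &lt;code&gt;@aws-sdk/client-batch&lt;/code&gt;:&lt;/p&gt;

```javascript
// Split unprocessed S3 keys into batches and build one AWS Batch SubmitJob
// parameter object per batch. Names here are illustrative placeholders.
function buildBatchJobs(keys, batchSize = 50) {
  const jobs = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    const chunk = keys.slice(i, i + batchSize);
    jobs.push({
      jobName: `stt-reprocess-${i / batchSize}`,
      jobQueue: "my-queue",
      jobDefinition: "my-job-def",
      containerOverrides: {
        // The container reads this env var to know which files to process.
        environment: [{ name: "S3_KEYS", value: JSON.stringify(chunk) }],
      },
    });
  }
  return jobs;
}

// In the Lambda handler, each params object would then be sent with:
//   await batchClient.send(new SubmitJobCommand(params));

module.exports = { buildBatchJobs };
```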


&lt;h3&gt;
  
  
  🧩 Why This Design Rocks
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;Serverless all the way&lt;/strong&gt; — Lambda + Fargate + S3&lt;br&gt;
✅ &lt;strong&gt;Auto-scaling compute&lt;/strong&gt; — no EC2 to babysit&lt;br&gt;
✅ &lt;strong&gt;Long-running safe zone&lt;/strong&gt; — runs beyond Lambda’s 15-min cap&lt;br&gt;
✅ &lt;strong&gt;Reusable&lt;/strong&gt; — we can reprocess any backlog anytime&lt;br&gt;
✅ &lt;strong&gt;Cost-efficient&lt;/strong&gt; — pay only for what’s used&lt;/p&gt;


&lt;h3&gt;
  
  
  🪄 Bonus Tip
&lt;/h3&gt;

&lt;p&gt;You can even schedule a &lt;strong&gt;“missed transcript” job&lt;/strong&gt; to run daily or weekly,&lt;br&gt;
checking for any files without transcripts and triggering a Batch job automatically.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧩 Understanding AWS Batch Scaling
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;AWS Batch&lt;/strong&gt;, the number of tasks (containers) that run &lt;strong&gt;in parallel&lt;/strong&gt; depends on &lt;strong&gt;three things&lt;/strong&gt; working together:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compute Environment capacity&lt;/strong&gt;&lt;br&gt;
→ e.g., your environment has a maximum of &lt;code&gt;10 vCPUs&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Job Definition requirements&lt;/strong&gt;&lt;br&gt;
→ e.g., each job needs &lt;code&gt;1 vCPU&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How many jobs are in the queue (and their array size, if used).&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  🔹 Case 1: You Submit Multiple Independent Jobs
&lt;/h3&gt;

&lt;p&gt;If you submit &lt;strong&gt;10 jobs&lt;/strong&gt;, each with &lt;code&gt;1 vCPU&lt;/code&gt;, and your environment allows &lt;code&gt;10 vCPUs&lt;/code&gt;,&lt;br&gt;
then AWS Batch can &lt;strong&gt;run all 10 in parallel&lt;/strong&gt; (subject to available Fargate capacity).&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pseudo example&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..10&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;aws batch submit-job &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-name&lt;/span&gt; process-audio-&lt;span class="nv"&gt;$i&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-queue&lt;/span&gt; my-queue &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--job-definition&lt;/span&gt; my-job-def
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each job = 1 vCPU → up to 10 can run simultaneously.&lt;/p&gt;

&lt;p&gt;AWS Batch’s &lt;strong&gt;Job Scheduler&lt;/strong&gt; will automatically pack as many as possible based on available compute.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 Case 2: You Use an Array Job
&lt;/h3&gt;

&lt;p&gt;Instead of manually looping, you can submit &lt;strong&gt;an array job&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws batch submit-job &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-name&lt;/span&gt; process-audios &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-queue&lt;/span&gt; my-queue &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--job-definition&lt;/span&gt; my-job-def &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--array-properties&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;strong&gt;10 child jobs&lt;/strong&gt; under a single parent, each running independently (great for S3 list chunking).&lt;/p&gt;

&lt;p&gt;Same result — 10 parallel containers, each with 1 vCPU.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔹 Case 3: You Submit a Single Job that Needs More vCPUs
&lt;/h3&gt;

&lt;p&gt;If you set in your &lt;strong&gt;job definition&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"vcpus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and your environment has 10 total vCPUs →&lt;br&gt;
then &lt;strong&gt;Batch will reserve 4 vCPUs&lt;/strong&gt; for that job, leaving room for other smaller jobs.&lt;/p&gt;

&lt;p&gt;So the compute environment doesn’t spawn “10 copies automatically” —&lt;br&gt;
it just enforces &lt;strong&gt;a maximum pool of total CPU&lt;/strong&gt; that concurrent jobs can consume.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ TL;DR — How to Scale
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;What to Do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run multiple tasks concurrently&lt;/td&gt;
&lt;td&gt;Submit multiple jobs or an array job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Each job’s CPU need&lt;/td&gt;
&lt;td&gt;Defined in Job Definition (e.g., 1 vCPU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max parallel limit&lt;/td&gt;
&lt;td&gt;Based on compute environment capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control at runtime&lt;/td&gt;
&lt;td&gt;You can pass &lt;code&gt;--array-properties size=N&lt;/code&gt; dynamically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling behavior&lt;/td&gt;
&lt;td&gt;Batch automatically scales Fargate/EC2 capacity up/down&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🏁 Closing Thoughts
&lt;/h3&gt;

&lt;p&gt;This experience reminded me —&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“When your script starts feeling like a job, give it job-like powers.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AWS Batch (especially with Fargate) is often underrated,&lt;br&gt;
but it’s a powerful tool when you need &lt;strong&gt;on-demand, containerized, long-running compute&lt;/strong&gt;&lt;br&gt;
without managing any servers.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>fargate</category>
      <category>lambda</category>
      <category>node</category>
    </item>
    <item>
      <title>Reusing HTTP and SDK clients in AWS Lambda to avoid “too many open files” (FD) errors</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Wed, 15 Oct 2025 05:04:25 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/reusing-http-and-sdk-clients-in-aws-lambda-to-avoid-too-many-open-files-fd-errors-1p24</link>
      <guid>https://forem.com/jayesh_shinde/reusing-http-and-sdk-clients-in-aws-lambda-to-avoid-too-many-open-files-fd-errors-1p24</guid>
      <description>&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We hit sporadic network errors in a high-throughput Lambda that made HTTP calls (Axios) and AWS SDK calls.&lt;/li&gt;
&lt;li&gt;Root cause: creating new HTTP clients/agents per invocation ballooned the number of open sockets (file descriptors).&lt;/li&gt;
&lt;li&gt;Fix: initialize clients and their &lt;code&gt;https.Agent&lt;/code&gt; once at module scope with keep-alive and reuse them across warm invocations. For AWS SDK v2, also set &lt;code&gt;AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The scenario
&lt;/h3&gt;

&lt;p&gt;We had a Lambda that was invoked asynchronously to process a large dataset (thousands of events). Inside the handler, we created an Axios client and AWS SDK client(s) for each invocation. Under sustained concurrency, we started seeing intermittent network failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Symptoms we saw
&lt;/h3&gt;

&lt;p&gt;These popped up in CloudWatch logs while the Lambda was busy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“too many open files” errors:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Error: EMFILE: too many open files, open&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;NodeError: getaddrinfo ENFILE&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Connection instability:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AxiosError: socket hang up&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Error: read ECONNRESET&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Error: connect ECONNRESET&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Occasional timeouts and throttling-like behavior despite healthy downstream services&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These were worse during bursts when many async invocations overlapped.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s really happening (FDs and sockets in Lambda)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Every TCP connection (HTTP/HTTPS) consumes a file descriptor (FD).&lt;/li&gt;
&lt;li&gt;Lambda execution environments have a relatively low per-process FD limit (commonly around 1024).&lt;/li&gt;
&lt;li&gt;If you create a new HTTP client (and thus a new &lt;code&gt;https.Agent&lt;/code&gt;) per invocation, each agent can open many sockets. Under high concurrency, you exhaust FDs, leading to the errors above.&lt;/li&gt;
&lt;li&gt;Lambda reuses the same execution environment for multiple “warm” invocations. Objects created at module scope are kept alive and reused, which is exactly what we want for clients and connection pools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Node’s &lt;code&gt;https.Agent&lt;/code&gt; matters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The agent controls connection pooling and keep-alive.&lt;/li&gt;
&lt;li&gt;Creating a new agent per invocation increases the number of socket pools and the total sockets in use.&lt;/li&gt;
&lt;li&gt;Reusing a single agent keeps the number of open sockets bounded and allows connection reuse across requests, reducing FD pressure and latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The anti-pattern (what we had)
&lt;/h3&gt;

&lt;p&gt;Creating new clients and agents inside the handler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Anti-pattern: runs on every invocation&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// new agent each time&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same issue with AWS SDK if you &lt;code&gt;new&lt;/code&gt; a client per invocation, especially if you also create its own agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix (module-level reuse with keep-alive)
&lt;/h3&gt;

&lt;p&gt;Move client and agent creation to module scope so they’re created once per warm environment and then reused.&lt;/p&gt;

&lt;h4&gt;
  
  
  Axios
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;axios&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpsAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// tune based on expected concurrency per environment&lt;/span&gt;
  &lt;span class="na"&gt;maxFreeSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// socket idle timeout&lt;/span&gt;
  &lt;span class="na"&gt;freeSocketTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;axios&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://api.example.com/data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS SDK v3
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeHttpHandler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/node-http-handler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;S3Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-s3&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpsAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;maxFreeSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;freeSocketTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3Client&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeHttpHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;httpsAgent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;connectionTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;socketTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listBuckets&lt;/span&gt;&lt;span class="p"&gt;({});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  AWS SDK v2
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reuse clients, and enable connection reuse via env var.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;aws-sdk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Also set in Lambda env: AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/span&gt;
&lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AWS_REGION&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;httpOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;https&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;keepAlive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxSockets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;AWS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listBuckets&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results after the change
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;FD-related errors (EMFILE, ENFILE, socket hang ups) disappeared under the same workload.&lt;/li&gt;
&lt;li&gt;Lower p95 latency due to connection reuse.&lt;/li&gt;
&lt;li&gt;Fewer outbound connection spikes visible on NAT Gateway/ENI metrics (for VPC Lambdas).&lt;/li&gt;
&lt;li&gt;More predictable behavior during bursts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bonus mitigations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Concurrency control: use SQS with a sane &lt;code&gt;maxConcurrency&lt;/code&gt;/&lt;code&gt;batchSize&lt;/code&gt;, reserved concurrency, or step-wise throttling to prevent bursts from scaling FD usage across many environments at once.&lt;/li&gt;
&lt;li&gt;Timeouts and retries: set realistic timeouts; add backoff with jitter to avoid synchronized retries.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;context.callbackWaitsForEmptyEventLoop = false&lt;/code&gt;: can help the handler return even if the agent keeps idle sockets open (don’t overuse).&lt;/li&gt;
&lt;li&gt;Consider &lt;code&gt;undici&lt;/code&gt; for HTTP in Node 18+; it provides efficient HTTP/1.1 keep-alive by default.&lt;/li&gt;
&lt;/ul&gt;
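&lt;p&gt;The backoff-with-jitter mitigation above can be sketched as a tiny helper (the function below is illustrative, not part of any SDK):&lt;/p&gt;

```typescript
// Minimal sketch of exponential backoff with "full jitter" (a hypothetical
// helper, not from our service or the AWS SDK). Randomizing within the
// exponential ceiling spreads retries so bursts do not re-synchronize.
function backoffDelayMs(attempt: number, baseMs: number = 100, capMs: number = 30_000): number {
  // exponential growth, capped at capMs, then randomized
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Example: the delay before retry attempt 3 falls anywhere in [0, 800) ms.
const delay = backoffDelayMs(3);
```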

&lt;h3&gt;
  
  
  Quick checklist
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Initialize HTTP clients and SDK clients at module scope.&lt;/li&gt;
&lt;li&gt;Use a shared &lt;code&gt;https.Agent&lt;/code&gt; with &lt;code&gt;keepAlive: true&lt;/code&gt;; set &lt;code&gt;maxSockets&lt;/code&gt;, &lt;code&gt;maxFreeSockets&lt;/code&gt;, and timeouts.&lt;/li&gt;
&lt;li&gt;For AWS SDK v2, set &lt;code&gt;AWS_NODEJS_CONNECTION_REUSE_ENABLED=1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Avoid creating clients/agents inside loops or inside the handler.&lt;/li&gt;
&lt;li&gt;Monitor and tune under realistic concurrency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Closing thoughts
&lt;/h3&gt;

&lt;p&gt;FD exhaustion is easy to miss until traffic scales. In serverless, the simplest lever is to reuse resources across warm invocations. One shared agent + one shared client per execution environment eliminates a whole class of flaky, intermittent network issues.&lt;/p&gt;

</description>
      <category>lambda</category>
      <category>serverless</category>
      <category>node</category>
      <category>aws</category>
    </item>
    <item>
      <title>🧠 How We Upgraded Our WordPress Search with OpenSearch Neural + Cohere for Multilingual Semantic Search</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 11 Oct 2025 04:32:51 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/how-we-upgraded-our-wordpress-search-with-opensearch-neural-cohere-for-multilingual-semantic-4j56</link>
      <guid>https://forem.com/jayesh_shinde/how-we-upgraded-our-wordpress-search-with-opensearch-neural-cohere-for-multilingual-semantic-4j56</guid>
      <description>&lt;p&gt;At our company, we use WordPress as a knowledge base (KB) for internal articles.&lt;br&gt;&lt;br&gt;
We indexed those articles in &lt;strong&gt;OpenSearch&lt;/strong&gt;, but the default keyword search felt… old-school.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searching &lt;strong&gt;“Netflix subscription”&lt;/strong&gt; missed “How to manage ネットフリックス plans”.&lt;/li&gt;
&lt;li&gt;Searching &lt;strong&gt;“AWS cost optimization”&lt;/strong&gt; returned random hits because keywords didn’t align.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we upgraded our search to &lt;strong&gt;semantic search&lt;/strong&gt; using the &lt;strong&gt;OpenSearch ML plugin&lt;/strong&gt; + &lt;strong&gt;Cohere embeddings served via Amazon Bedrock&lt;/strong&gt; — a great combination for &lt;strong&gt;multilingual understanding&lt;/strong&gt; and secure enterprise integration.&lt;/p&gt;


&lt;h2&gt;
  
  
  ⚙️ The Problem
&lt;/h2&gt;

&lt;p&gt;Our initial setup used simple keyword mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even after tweaking analyzers, users searching in Japanese (katakana) or English weren’t getting expected matches.&lt;/p&gt;

&lt;p&gt;For example, “netflix” and “ネットフリックス” should be the same — but OpenSearch treated them as completely different tokens.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 The Plan
&lt;/h2&gt;

&lt;p&gt;We wanted to add &lt;strong&gt;semantic search&lt;/strong&gt; on top of our existing index.&lt;br&gt;
That means converting both documents &lt;strong&gt;and&lt;/strong&gt; queries into &lt;strong&gt;vectors&lt;/strong&gt; using the same embedding model, and comparing them using cosine similarity.&lt;/p&gt;
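&lt;p&gt;As a minimal sketch of that comparison (OpenSearch computes this server-side via &lt;code&gt;script_score&lt;/code&gt;/kNN; this helper is purely illustrative):&lt;/p&gt;

```typescript
// Illustrative cosine similarity between two embedding vectors.
// In production, OpenSearch does this server-side; this only shows the math.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    dot += x * b[i];       // dot product of the two vectors
    normA += x * x;        // squared magnitude of a
    normB += b[i] * b[i];  // squared magnitude of b
  });
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
```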

&lt;p&gt;Our plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an &lt;strong&gt;ML connector&lt;/strong&gt; for the Cohere embedding API&lt;/li&gt;
&lt;li&gt;Create a &lt;strong&gt;model&lt;/strong&gt; for document embeddings (input_type = &lt;code&gt;search_document&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Create another &lt;strong&gt;model&lt;/strong&gt; for query embeddings (input_type = &lt;code&gt;search_query&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Update our &lt;strong&gt;ingestion pipeline&lt;/strong&gt; to generate vectors during indexing&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;script_score&lt;/strong&gt; (or kNN query) to retrieve the best semantic matches&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  🔌 Step 1: Create a Connector (to Cohere)
&lt;/h2&gt;

&lt;p&gt;First, we create an &lt;strong&gt;ML connector&lt;/strong&gt; in OpenSearch to call Cohere’s API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/connectors/_create
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-connector"&lt;/span&gt;,
  &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Amazon Bedrock connector for Cohere embeddings (document)"&lt;/span&gt;,
  &lt;span class="s2"&gt;"protocol"&lt;/span&gt;: &lt;span class="s2"&gt;"aws_sigv4"&lt;/span&gt;,
  &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"region"&lt;/span&gt;: &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;,
    &lt;span class="s2"&gt;"service_name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock"&lt;/span&gt;,
    &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"cohere.embed-multilingual-v3"&lt;/span&gt;,
    &lt;span class="s2"&gt;"input_type"&lt;/span&gt;: &lt;span class="s2"&gt;"search_document"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;protocol: aws_sigv4&lt;/code&gt; makes OpenSearch sign Bedrock API calls with IAM credentials.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;model_id&lt;/code&gt; refers to Cohere’s multilingual embedding model hosted on Bedrock.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;input_type=search_document&lt;/code&gt; produces document-level embeddings.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧩 Step 2: Register the Model
&lt;/h2&gt;

&lt;p&gt;Now, we create a model using the connector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/_register
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-embed"&lt;/span&gt;,
  &lt;span class="s2"&gt;"function_name"&lt;/span&gt;: &lt;span class="s2"&gt;"embedding"&lt;/span&gt;,
  &lt;span class="s2"&gt;"connector_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-connector"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

POST _plugins/_ml/models/bedrock-cohere-doc-embed/_deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model will be used in our ingestion pipeline.&lt;/p&gt;




&lt;p&gt;OpenSearch now knows how to call Cohere to embed documents.&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Step 3: Create the Index with a Vector Field
&lt;/h2&gt;

&lt;p&gt;Next, we create our knowledge base index that includes a vector field for embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT kb-articles
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"settings"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"index"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"knn"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"mappings"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"properties"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"embedding"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"type"&lt;/span&gt;: &lt;span class="s2"&gt;"knn_vector"&lt;/span&gt;,
        &lt;span class="s2"&gt;"dimension"&lt;/span&gt;: 1024,
        &lt;span class="s2"&gt;"method"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"hnsw"&lt;/span&gt;,
          &lt;span class="s2"&gt;"space_type"&lt;/span&gt;: &lt;span class="s2"&gt;"cosinesimil"&lt;/span&gt;,
          &lt;span class="s2"&gt;"engine"&lt;/span&gt;: &lt;span class="s2"&gt;"nmslib"&lt;/span&gt;,
          &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"ef_construction"&lt;/span&gt;: 512,
            &lt;span class="s2"&gt;"m"&lt;/span&gt;: 16
          &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The vector field &lt;code&gt;embedding&lt;/code&gt; will hold our document-level embeddings from Cohere (1024 dimensions).&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Step 4: Add an Ingest Pipeline
&lt;/h2&gt;

&lt;p&gt;We’ll generate document embeddings automatically during ingestion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT _ingest/pipeline/kb-embed-pipeline
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"processors"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"ml_inference"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-doc-embed"&lt;/span&gt;,
        &lt;span class="s2"&gt;"input_map"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"text"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
        &lt;span class="s2"&gt;"output_map"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"embedding"&lt;/span&gt;: &lt;span class="s2"&gt;"embedding"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, when we index a document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST kb-articles/_doc?pipeline&lt;span class="o"&gt;=&lt;/span&gt;kb-embed-pipeline
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"title"&lt;/span&gt;: &lt;span class="s2"&gt;"Netflix subscription help"&lt;/span&gt;,
  &lt;span class="s2"&gt;"content"&lt;/span&gt;: &lt;span class="s2"&gt;"Steps to manage your Netflix account and billing."&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenSearch automatically calls Amazon Bedrock, retrieves Cohere’s embedding, and stores it in the embedding vector field.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Step 5: Handle User Queries with Another Connector
&lt;/h2&gt;

&lt;p&gt;Initially, we used the same connector (with &lt;code&gt;input_type=search_document&lt;/code&gt;) for both documents and queries.&lt;br&gt;
That caused a mismatch — “ネットフリックス” (Katakana) and “Netflix” were still not matching.&lt;/p&gt;

&lt;p&gt;The fix was to create another connector and model specifically for query embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/connectors/_create
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-connector"&lt;/span&gt;,
  &lt;span class="s2"&gt;"description"&lt;/span&gt;: &lt;span class="s2"&gt;"Amazon Bedrock connector for Cohere embeddings (query)"&lt;/span&gt;,
  &lt;span class="s2"&gt;"protocol"&lt;/span&gt;: &lt;span class="s2"&gt;"aws_sigv4"&lt;/span&gt;,
  &lt;span class="s2"&gt;"parameters"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"region"&lt;/span&gt;: &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;,
    &lt;span class="s2"&gt;"service_name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock"&lt;/span&gt;,
    &lt;span class="s2"&gt;"model_id"&lt;/span&gt;: &lt;span class="s2"&gt;"cohere.embed-multilingual-v3"&lt;/span&gt;,
    &lt;span class="s2"&gt;"input_type"&lt;/span&gt;: &lt;span class="s2"&gt;"search_query"&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then register and deploy another model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/_register
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-embed"&lt;/span&gt;,
  &lt;span class="s2"&gt;"function_name"&lt;/span&gt;: &lt;span class="s2"&gt;"embedding"&lt;/span&gt;,
  &lt;span class="s2"&gt;"connector_id"&lt;/span&gt;: &lt;span class="s2"&gt;"bedrock-cohere-query-connector"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

POST _plugins/_ml/models/bedrock-cohere-query-embed/_deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures both documents and queries are embedded in compatible vector spaces.&lt;/p&gt;

&lt;p&gt;Now, when we run a search, we first generate a &lt;strong&gt;query embedding&lt;/strong&gt; via the ML model’s &lt;code&gt;/predict&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST _plugins/_ml/models/bedrock-cohere-query-embed/_predict
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"text"&lt;/span&gt;: &lt;span class="s2"&gt;"ネットフリックス subscription"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔍 Step 6: Semantic Search Query
&lt;/h2&gt;

&lt;p&gt;Finally, we plug the embedding into a &lt;code&gt;script_score&lt;/code&gt; query to rank results by cosine similarity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;POST kb-articles/_search
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"size"&lt;/span&gt;: 5,
  &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"script_score"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"query"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"match_all"&lt;/span&gt;: &lt;span class="o"&gt;{}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;,
      &lt;span class="s2"&gt;"script"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"source"&lt;/span&gt;: &lt;span class="s2"&gt;"cosineSimilarity(params.query_vector, 'embedding') + 1.0"&lt;/span&gt;,
        &lt;span class="s2"&gt;"params"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
          &lt;span class="s2"&gt;"query_vector"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt; embedding array from _predict &lt;span class="k"&gt;*&lt;/span&gt;/]
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;"sort"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
    &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"_score"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="s2"&gt;"order"&lt;/span&gt;: &lt;span class="s2"&gt;"desc"&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now “Netflix” and “ネットフリックス” both match beautifully.&lt;br&gt;
🎯 Cohere’s multilingual embeddings + OpenSearch vector search did the trick.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 Why We Chose Cohere
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong multilingual understanding&lt;/strong&gt; — perfect for English + Japanese content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easy integration&lt;/strong&gt; — connects through the OpenSearch ML connector with SigV4/IAM auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent embedding dimensions&lt;/strong&gt; — works well with &lt;code&gt;knn_vector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast inference&lt;/strong&gt; — good for production-scale pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🪛 Troubleshooting Tips
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Katakana &amp;amp; English not matching&lt;/td&gt;
&lt;td&gt;Used &lt;code&gt;search_document&lt;/code&gt; for query embeddings&lt;/td&gt;
&lt;td&gt;Create a separate connector with &lt;code&gt;input_type=search_query&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Dimension mismatch” errors&lt;/td&gt;
&lt;td&gt;Wrong embedding model or field dimension&lt;/td&gt;
&lt;td&gt;Make sure both model &amp;amp; field use same &lt;code&gt;dimension&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference timeout&lt;/td&gt;
&lt;td&gt;Cohere API rate limits&lt;/td&gt;
&lt;td&gt;Batch or cache embeddings during ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weird scores&lt;/td&gt;
&lt;td&gt;Missing &lt;code&gt;space_type: cosinesimil&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Use cosine similarity in mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  ✅ Summary
&lt;/h2&gt;

&lt;p&gt;We started with basic keyword search and ended up with &lt;strong&gt;multilingual semantic search&lt;/strong&gt; using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSearch ML plugin&lt;/li&gt;
&lt;li&gt;Cohere embedding models&lt;/li&gt;
&lt;li&gt;Cosine similarity on &lt;code&gt;knn_vector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Separate connectors for documents vs. queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ✨ Outcome
&lt;/h2&gt;

&lt;p&gt;After integrating Cohere via Bedrock and separating the input types:&lt;/p&gt;

&lt;p&gt;✅ We can now search using phrases, not just keywords&lt;/p&gt;

&lt;p&gt;🌏 Cross-lingual search works — “Netflix” ≈ “ネットフリックス”&lt;/p&gt;

&lt;p&gt;💬 Semantic matching improved drastically (e.g., “streaming issue” finds “再生 エラー”)&lt;/p&gt;

&lt;p&gt;📈 Search relevance and recall are noticeably better, even for non-English content&lt;/p&gt;

&lt;p&gt;The combination of OpenSearch ML plugin + Cohere embeddings via Bedrock turned our keyword search into a truly semantic multilingual search engine — all running within the AWS ecosystem.&lt;/p&gt;




&lt;p&gt;💡 &lt;em&gt;If you’re building multilingual or brand-sensitive search, don’t skip the &lt;code&gt;input_type&lt;/code&gt; difference — it can make or break your semantic matching.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>opensearch</category>
      <category>semanticsearch</category>
      <category>wordpress</category>
      <category>aws</category>
    </item>
    <item>
      <title>🛠️ Fixing Lost SecurityContext and Correlation IDs in Async Calls with Spring Boot</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 05 Oct 2025 11:57:58 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/fixing-lost-securitycontext-and-correlation-ids-in-async-calls-with-spring-boot-4pc8</link>
      <guid>https://forem.com/jayesh_shinde/fixing-lost-securitycontext-and-correlation-ids-in-async-calls-with-spring-boot-4pc8</guid>
      <description>&lt;p&gt;When we started parallelizing API calls in our Spring Boot service using &lt;code&gt;CompletableFuture&lt;/code&gt; and a custom &lt;code&gt;ExecutorService&lt;/code&gt;, everything looked great… until we checked the logs.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our &lt;strong&gt;JWT &lt;code&gt;SecurityContext&lt;/code&gt;&lt;/strong&gt; wasn’t available in the async threads.
&lt;/li&gt;
&lt;li&gt;Our &lt;strong&gt;MDC correlation IDs&lt;/strong&gt; (used for distributed tracing/log correlation) were missing too.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That meant downstream services didn’t know &lt;em&gt;who&lt;/em&gt; was calling, and our logs lost the ability to tie requests together. Not good.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 The Problem
&lt;/h2&gt;

&lt;p&gt;Spring Security stores authentication in a &lt;code&gt;ThreadLocal&lt;/code&gt; (&lt;code&gt;SecurityContextHolder&lt;/code&gt;).&lt;br&gt;&lt;br&gt;
SLF4J’s MDC (Mapped Diagnostic Context) also uses &lt;code&gt;ThreadLocal&lt;/code&gt; to store correlation IDs.  &lt;/p&gt;

&lt;p&gt;When you hop threads (e.g., via &lt;code&gt;CompletableFuture.supplyAsync&lt;/code&gt;), those &lt;code&gt;ThreadLocal&lt;/code&gt; values don’t magically follow along. So in worker threads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SecurityContextHolder.getContext()&lt;/code&gt; → empty
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MDC.get("correlationId")&lt;/code&gt; → null
&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  ✅ The Solution: Wrap the Executor
&lt;/h2&gt;

&lt;p&gt;We solved this by wrapping our &lt;code&gt;ExecutorService&lt;/code&gt; in a lightweight decorator that &lt;strong&gt;captures the MDC + SecurityContext from the submitting thread&lt;/strong&gt; and restores them inside the worker thread.&lt;/p&gt;

&lt;p&gt;Here’s the implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ContextPropagatingExecutorService&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AbstractExecutorService&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nf"&gt;ContextPropagatingExecutorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;delegate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mdc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCopyOfContextMap&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;SecurityContext&lt;/span&gt; &lt;span class="n"&gt;securityContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getContext&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdc&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContextMap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mdc&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;securityContext&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setContext&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;securityContext&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="no"&gt;MDC&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clear&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                &lt;span class="nc"&gt;SecurityContextHolder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;clearContext&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;};&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Runnable&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;delegate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wrap&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// delegate lifecycle methods...&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⚙️ Wiring It Up
&lt;/h2&gt;

&lt;p&gt;In our Spring Boot config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Configuration&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AsyncConfig&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="nd"&gt;@Bean&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"executorService"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="nf"&gt;executorService&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;ExecutorService&lt;/span&gt; &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Executors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;newCachedThreadPool&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ContextPropagatingExecutorService&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, whenever we do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;AccountDTO&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fromAccount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="nc"&gt;CompletableFuture&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;supplyAsync&lt;/span&gt;&lt;span class="o"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;accountClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getAccountsById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="n"&gt;executorService&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…the async thread has the &lt;strong&gt;same SecurityContext and MDC&lt;/strong&gt; as the request thread.  &lt;/p&gt;
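&lt;p&gt;The same mechanics can be shown without any framework. In this sketch a plain &lt;code&gt;ThreadLocal&lt;/code&gt; stands in for MDC and the &lt;code&gt;SecurityContext&lt;/code&gt;; the class and variable names are illustrative:&lt;/p&gt;

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextPropagationDemo {

    // Stand-in for MDC / SecurityContextHolder: any ThreadLocal-backed context.
    static final ThreadLocal CONTEXT = new ThreadLocal();

    // Capture the submitter's context, restore it on the worker thread,
    // and clear it afterwards so pooled threads do not leak state.
    static Runnable wrap(Runnable task) {
        final Object captured = CONTEXT.get();
        return () -> {
            CONTEXT.set(captured);
            try {
                task.run();
            } finally {
                CONTEXT.remove();
            }
        };
    }

    static String[] runDemo() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CONTEXT.set("correlation-123");

        // Without wrapping: the worker thread sees no context.
        var bare = CompletableFuture
                .supplyAsync(() -> String.valueOf(CONTEXT.get()), pool).join();

        // With wrapping: the captured context travels along.
        var wrapped = new CompletableFuture();
        pool.execute(wrap(() -> wrapped.complete(String.valueOf(CONTEXT.get()))));
        var result = String.valueOf(wrapped.join());

        pool.shutdown();
        return new String[] { bare, result };
    }

    public static void main(String[] args) throws Exception {
        String[] r = runDemo();
        System.out.println(r[0]); // prints "null": context was lost
        System.out.println(r[1]); // prints "correlation-123"
    }
}
```

&lt;p&gt;The real implementation above does exactly this, just with two contexts (MDC map plus &lt;code&gt;SecurityContext&lt;/code&gt;) instead of one.&lt;/p&gt;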




&lt;h2&gt;
  
  
  📊 Before vs After
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SecurityContextHolder.getContext()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Empty in async thread&lt;/td&gt;
&lt;td&gt;Correctly populated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MDC.get("correlationId")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;null&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Same correlation ID as request thread&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs&lt;/td&gt;
&lt;td&gt;Missing trace IDs&lt;/td&gt;
&lt;td&gt;Full traceability across async calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Downstream services&lt;/td&gt;
&lt;td&gt;No JWT propagated&lt;/td&gt;
&lt;td&gt;JWT available for Feign/RestTemplate&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔑 Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ThreadLocals don’t cross thread boundaries&lt;/strong&gt; — you need to propagate them manually.
&lt;/li&gt;
&lt;li&gt;Wrapping your &lt;code&gt;ExecutorService&lt;/code&gt; is a clean, reusable fix.
&lt;/li&gt;
&lt;li&gt;This pattern works not just for MDC + SecurityContext, but for any contextual data you need across async boundaries.
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;If you’re building microservices with Spring Boot and using async execution (&lt;code&gt;CompletableFuture&lt;/code&gt;, &lt;code&gt;@Async&lt;/code&gt;, Kafka listeners, etc.), don’t forget about context propagation. Without it, your logs and security checks will silently break.  &lt;/p&gt;

&lt;p&gt;Wrapping your executor is a small change that pays off big in &lt;strong&gt;observability&lt;/strong&gt; and &lt;strong&gt;security consistency&lt;/strong&gt;.  &lt;/p&gt;




</description>
      <category>springboot</category>
      <category>java</category>
      <category>backenddevelopment</category>
      <category>programming</category>
    </item>
    <item>
      <title>Why My CDK Deploys Started Failing After Org Added Strict SCP Rules</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 27 Sep 2025 07:56:29 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/why-my-cdk-deploys-started-failing-after-org-added-strict-scp-rules-o88</link>
      <guid>https://forem.com/jayesh_shinde/why-my-cdk-deploys-started-failing-after-org-added-strict-scp-rules-o88</guid>
      <description>&lt;p&gt;Recently, I ran into a head‑scratcher while deploying a CDK stack. Everything used to work fine, but once my organization introduced &lt;strong&gt;strict SCP rules based on tags&lt;/strong&gt;, my &lt;code&gt;cdk deploy&lt;/code&gt; started failing with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AccessDenied: action cloudformation:CreateChangeSet is not authorized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, it didn’t make sense. I was &lt;em&gt;already tagging everything&lt;/em&gt; in my CDK code. I even had a loop that pulled tags from &lt;code&gt;props.Tags&lt;/code&gt; and attached them with &lt;code&gt;cdk.Tags.of(this).add(...)&lt;/code&gt;. These tags used to flow down nicely to all resources — including the CloudFormation stack itself.&lt;/p&gt;

&lt;p&gt;So why did it suddenly stop working? 🤔&lt;/p&gt;




&lt;h3&gt;
  
  
  What Changed? SCPs and Request‑Time Enforcement
&lt;/h3&gt;

&lt;p&gt;The key is &lt;strong&gt;where SCP rules get evaluated&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An SCP can restrict not only what resources exist, but also &lt;em&gt;what API calls are allowed&lt;/em&gt;. In my case, the org had a policy like this (dummy example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Deny"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cloudformation:CreateChangeSet"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"StringNotEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"aws:RequestTag/Org"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ABC"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means: &lt;em&gt;if the &lt;code&gt;CreateChangeSet&lt;/code&gt; request doesn’t already include the required tag, the call is blocked right away.&lt;/em&gt;&lt;/p&gt;
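&lt;p&gt;A toy model of that evaluation order (purely illustrative; the real check happens inside AWS Organizations before CloudFormation ever sees the request):&lt;/p&gt;

```typescript
// Toy model: the SCP sees only the tags carried in the API request
// itself. Tags applied to resources after stack creation are invisible
// at this point. (Illustrative only; this is not an AWS API.)
type RequestTags = { [key: string]: string };

function scpAllowsCreateChangeSet(requestTags: RequestTags): boolean {
  // Mirrors the StringNotEquals condition on aws:RequestTag/Org.
  return requestTags["Org"] === "ABC";
}

// Tags added later via cdk.Tags.of(...) are not in the request payload:
console.log(scpAllowsCreateChangeSet({}));             // false: AccessDenied
// Tags passed via StackProps.tags are in the request payload:
console.log(scpAllowsCreateChangeSet({ Org: "ABC" })); // true
```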




&lt;h3&gt;
  
  
  CDK Tagging: Two Different Worlds
&lt;/h3&gt;

&lt;p&gt;This is where CDK behavior matters:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;StackProps.tags&lt;/strong&gt;&lt;br&gt;
When you pass tags into the &lt;code&gt;super(scope, id, props)&lt;/code&gt; constructor, CDK includes those tags in the &lt;code&gt;CreateChangeSet&lt;/code&gt; API call. These show up as &lt;code&gt;RequestTags&lt;/code&gt;. That’s what SCPs check.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;cdk.Tags.of(resource).add(...)&lt;/strong&gt;&lt;br&gt;
This method attaches tags to resources &lt;strong&gt;inside the CloudFormation template&lt;/strong&gt;. They are applied &lt;em&gt;after&lt;/em&gt; the stack is already created.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So my old approach of looping through &lt;code&gt;props.Tags&lt;/code&gt; and calling &lt;code&gt;cdk.Tags.of(this).add(...)&lt;/code&gt; worked fine in the past, but now fails because the SCP never lets the stack get created in the first place. The required tags simply aren’t present yet at request time.&lt;/p&gt;




&lt;h3&gt;
  
  
  Fix: Pass Tags via StackProps
&lt;/h3&gt;

&lt;p&gt;The solution was simple once I understood the difference. Previously my props had a custom &lt;strong&gt;Tags&lt;/strong&gt; property; I replaced it with the lowercase &lt;strong&gt;tags&lt;/strong&gt; property that &lt;code&gt;cdk.StackProps&lt;/code&gt; expects. CDK uses it to initialize the in-memory construct tree (the &lt;code&gt;Stack&lt;/code&gt; object inside your app).&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;cdk deploy&lt;/code&gt;, the CLI uses the CloudFormation SDK to call &lt;code&gt;CreateChangeSet&lt;/code&gt;, the first API call made to AWS. This is where the stack-level tags (&lt;code&gt;props.tags&lt;/code&gt;) are injected into the request payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MyStack&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Stack&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Construct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;StackProps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;props&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;cdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MyStack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MyTaggedStack&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;Org&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ABC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;Owner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;TeamA&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StackProps.tags&lt;/strong&gt; → go straight into the &lt;code&gt;CreateChangeSet&lt;/code&gt; request (SCP passes ✅).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cdk.Tags.of(this).add(...)&lt;/strong&gt; → still ensures all resources get the same tags after creation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Takeaway
&lt;/h3&gt;

&lt;p&gt;If your organization enforces strict &lt;strong&gt;SCP rules on &lt;code&gt;cloudformation:CreateChangeSet&lt;/code&gt;&lt;/strong&gt;, you can’t rely on &lt;code&gt;cdk.Tags.of(...)&lt;/code&gt; alone. Those tags arrive too late. You need to use &lt;strong&gt;&lt;code&gt;StackProps.tags&lt;/code&gt;&lt;/strong&gt; so the tags are present &lt;em&gt;in the request itself&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It’s a subtle but important difference — and once I understood it, the “AccessDenied” error finally made sense.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cdk</category>
      <category>cloudformation</category>
      <category>devops</category>
    </item>
    <item>
      <title>Lessons Learned: CompletableFuture Security and Kafka Topic Grouping in Microservices</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Tue, 23 Sep 2025 08:27:53 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/lessons-learned-completablefuture-security-and-kafka-topic-grouping-in-microservices-5ami</link>
      <guid>https://forem.com/jayesh_shinde/lessons-learned-completablefuture-security-and-kafka-topic-grouping-in-microservices-5ami</guid>
      <description>&lt;h3&gt;
  
  
  Issue 1: CompletableFuture + Missing Security Context = 401 Unauthorized
&lt;/h3&gt;

&lt;p&gt;When we started using &lt;code&gt;CompletableFuture.supplyAsync()&lt;/code&gt; in our Spring Boot microservice, some calls to other services (via Feign client) started failing with &lt;strong&gt;401 Unauthorized&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CompletableFuture&lt;/code&gt; runs in a separate thread.&lt;/li&gt;
&lt;li&gt;By default, the &lt;strong&gt;Spring Security context (and JWT token)&lt;/strong&gt; is not propagated to that new thread.&lt;/li&gt;
&lt;li&gt;As a result, our Feign interceptor couldn’t attach the &lt;code&gt;Authorization&lt;/code&gt; header, and the downstream service rejected the request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; We wrapped our executor with &lt;code&gt;DelegatingSecurityContextExecutorService&lt;/code&gt; so that the &lt;code&gt;SecurityContext&lt;/code&gt; (including the JWT token) travels along with async tasks. After this change, the token was correctly added and the 401s disappeared.&lt;/p&gt;
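&lt;p&gt;A minimal sketch of that wiring (bean name and pool size are illustrative; &lt;code&gt;DelegatingSecurityContextExecutorService&lt;/code&gt; ships with Spring Security):&lt;/p&gt;

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.concurrent.DelegatingSecurityContextExecutorService;

@Configuration
public class AsyncSecurityConfig {

    // Each task submitted here runs with the SecurityContext that was
    // present on the submitting (request) thread, so the Feign
    // interceptor can read the JWT and attach the Authorization header.
    @Bean
    public ExecutorService securityAwareExecutor() {
        ExecutorService base = Executors.newFixedThreadPool(8);
        return new DelegatingSecurityContextExecutorService(base);
    }
}
```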




&lt;h3&gt;
  
  
  Issue 2: Kafka Consumers and Group IDs
&lt;/h3&gt;

&lt;p&gt;We also ran into a Kafka consumer issue. Initially, we used the &lt;strong&gt;same consumer group ID&lt;/strong&gt; for multiple topics. This caused unexpected behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumers in the same group share messages.&lt;/li&gt;
&lt;li&gt;With a single shared group ID, listeners for different topics all joined the same consumer group, so every rebalance reshuffled partition assignments across unrelated listeners, leading to missed or unbalanced consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; We gave each topic its &lt;strong&gt;own unique group ID&lt;/strong&gt;. This way, each topic is consumed independently, and no messages are skipped.&lt;/p&gt;
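&lt;p&gt;In Spring Kafka terms, the fix amounts to giving each listener its own &lt;code&gt;groupId&lt;/code&gt; (topic and group names below are illustrative):&lt;/p&gt;

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class EventListeners {

    // Each topic gets its own consumer group, so the two listeners
    // consume independently and rebalances never cross topics.
    @KafkaListener(topics = "orders", groupId = "orders-consumer")
    public void onOrder(String message) {
        System.out.println("order event: " + message);
    }

    @KafkaListener(topics = "payments", groupId = "payments-consumer")
    public void onPayment(String message) {
        System.out.println("payment event: " + message);
    }
}
```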




&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For async code in Spring Security, always make sure the security context is propagated.&lt;/li&gt;
&lt;li&gt;For Kafka, think carefully about group IDs — same group = load balancing, different groups = independent consumption.&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>springboot</category>
      <category>java</category>
      <category>kafka</category>
      <category>backenddevelopment</category>
    </item>
    <item>
      <title>Debugging Helm Template Errors: Lessons Learned</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sun, 21 Sep 2025 11:53:06 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/debugging-helm-template-errors-lessons-learned-546a</link>
      <guid>https://forem.com/jayesh_shinde/debugging-helm-template-errors-lessons-learned-546a</guid>
      <description>&lt;p&gt;Today while working with &lt;strong&gt;Helm&lt;/strong&gt; to package my Kubernetes deployments, I hit a couple of tricky but interesting issues. Writing them down so that it may help others facing the same problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Keys with &lt;code&gt;-&lt;/code&gt; in values.yaml
&lt;/h2&gt;

&lt;p&gt;In my &lt;code&gt;values.yaml&lt;/code&gt;, I had a key like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;postgresql-pvc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in my template (&lt;code&gt;postgresql.yaml&lt;/code&gt;), I tried to reference it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.postgresql.postgresql-pvc.storage&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This failed with an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ERROR] templates/: parse error: bad character U+002D '-'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;br&gt;
Helm templates treat the &lt;code&gt;-&lt;/code&gt; in &lt;code&gt;postgresql-pvc&lt;/code&gt; as a subtraction operator.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
Wrap the key in quotes and use index notation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;index .Values.postgresql "postgresql-pvc" "storage"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells Helm clearly that it’s a key lookup, not math.&lt;/p&gt;

&lt;p&gt;OR&lt;/p&gt;

&lt;p&gt;Better yet, avoid &lt;code&gt;-&lt;/code&gt; in keys altogether and use camelCase in the &lt;code&gt;values.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;postgresqlPvc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then reference it directly in the Helm template (no &lt;code&gt;index&lt;/code&gt; needed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;index .Values.postgresql.postgresqlPvc.storage"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Numbers in &lt;code&gt;values.yaml&lt;/code&gt; need quotes
&lt;/h2&gt;

&lt;p&gt;The second issue I faced was with environment variables in a Deployment. My &lt;code&gt;values.yaml&lt;/code&gt; had:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;servicePort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the Deployment template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PORT&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.postgresql.servicePort&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cannot unmarshal number into Go struct field EnvVar.spec.template.spec.containers.env.value of type string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;br&gt;
Kubernetes expects environment variable values to always be &lt;strong&gt;strings&lt;/strong&gt;, but Helm rendered it as a number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
Wrap the value in quotes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PORT&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;.Values.postgresql.servicePort&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or Use quote function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PORT&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.postgresql.servicePort | quote&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Helm correctly renders it as a string.&lt;/p&gt;
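&lt;p&gt;For both issues, it is worth rendering the chart locally before deploying: &lt;code&gt;helm template&lt;/code&gt; and &lt;code&gt;helm lint&lt;/code&gt; surface parse and type errors immediately (release and chart names below are illustrative):&lt;/p&gt;

```shell
# Render only the PostgreSQL manifest and inspect the output
helm template my-release ./mychart --show-only templates/postgresql.yaml

# Lint the whole chart; parse errors such as "bad character U+002D" show up here
helm lint ./mychart
```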

&lt;p&gt;&lt;strong&gt;Debugging checklist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Env vars (&lt;code&gt;value:&lt;/code&gt; in &lt;code&gt;env&lt;/code&gt;) → always strings&lt;/li&gt;
&lt;li&gt;Labels and annotations → always strings&lt;/li&gt;
&lt;li&gt;Secrets/ConfigMaps → always strings&lt;/li&gt;
&lt;li&gt;Ports, replicas, resource requests → can stay as numbers&lt;/li&gt;
&lt;li&gt;When in doubt, check the Kubernetes API spec for the field type (string vs number)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Helm is powerful, but also picky:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keys with &lt;code&gt;-&lt;/code&gt; need &lt;strong&gt;quotes&lt;/strong&gt; and &lt;code&gt;index&lt;/code&gt; lookup.&lt;/li&gt;
&lt;li&gt;Numbers used as environment variables should be wrapped in &lt;strong&gt;quotes&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small things, but they save a lot of head-scratching once you know them.&lt;/p&gt;




</description>
      <category>kubernetes</category>
      <category>helm</category>
    </item>
    <item>
      <title>Fixing Kafka KRaft Cluster ID Mismatch on Kubernetes</title>
      <dc:creator>Jayesh Shinde</dc:creator>
      <pubDate>Sat, 20 Sep 2025 05:22:59 +0000</pubDate>
      <link>https://forem.com/jayesh_shinde/fixing-kafka-kraft-cluster-id-mismatch-on-kubernetes-2943</link>
      <guid>https://forem.com/jayesh_shinde/fixing-kafka-kraft-cluster-id-mismatch-on-kubernetes-2943</guid>
      <description>&lt;p&gt;Running Kafka on Kubernetes is usually smooth—but we hit a tricky problem when using &lt;strong&gt;KRaft mode&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For context, &lt;strong&gt;KRaft&lt;/strong&gt; (Kafka Raft Metadata mode) is Kafka’s new way of managing cluster metadata without ZooKeeper. Each cluster has a unique &lt;strong&gt;Cluster ID&lt;/strong&gt; stored in its metadata. All nodes need to agree on this ID to join the cluster.&lt;/p&gt;




&lt;h4&gt;
  
  
  &lt;strong&gt;The Problem&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We noticed that every time a Kafka pod restarted, it would generate a &lt;strong&gt;new Cluster ID&lt;/strong&gt;. Since our pods used &lt;strong&gt;persistent storage&lt;/strong&gt;, the metadata on disk still had the old Cluster ID. This caused Kafka to fail on startup with a clear error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cluster ID mismatch: Expected &amp;lt;old-id&amp;gt;, Found &amp;lt;new-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short, the pod thought it was a new cluster, but the storage said otherwise.&lt;/p&gt;
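&lt;p&gt;You can see the disagreement directly by reading the metadata the broker left on the volume (the pod name and log-directory path vary by image; the ones below are illustrative):&lt;/p&gt;

```shell
# The on-disk cluster ID is stored in meta.properties under the Kafka log dir
kubectl exec kafka-0 -- cat /bitnami/kafka/data/meta.properties

# kafka-storage.sh can mint a fresh ID when formatting a brand-new cluster
kafka-storage.sh random-uuid
```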




&lt;h4&gt;
  
  
  &lt;strong&gt;How We Fixed It&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For testing, we solved it by &lt;strong&gt;manually specifying the Cluster ID in the Kubernetes deployment&lt;/strong&gt;. This ensured that every pod picked up the same ID as the metadata on the persistent volume. After that, the pods started without errors and rejoined the cluster seamlessly. Below is a sample Kubernetes environment variable that sets it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;            - name: KAFKA_KRAFT_CLUSTER_ID
              value: "kraft-local-cluster"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h4&gt;
  
  
  &lt;strong&gt;Lessons Learned&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In KRaft mode, the &lt;strong&gt;Cluster ID must persist&lt;/strong&gt; when using Kubernetes and stateful pods.&lt;/li&gt;
&lt;li&gt;Reusing old volumes can cause mismatches if the Cluster ID changes.&lt;/li&gt;
&lt;li&gt;For production, automate cluster ID propagation or initialize nodes with a pre-set ID to avoid manual fixes.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;KRaft mode is promising, but small details like the Cluster ID can trip you up. Once you know what to look for, fixing it is straightforward—and now our Kafka cluster is more stable than ever.&lt;/p&gt;




</description>
      <category>kafka</category>
      <category>kubernetes</category>
      <category>backenddevelopment</category>
    </item>
  </channel>
</rss>
