<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: gokcedemirdurkut</title>
    <description>The latest articles on Forem by gokcedemirdurkut (@gokcedemirdurkut).</description>
    <link>https://forem.com/gokcedemirdurkut</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3529539%2Fcb6545d7-dba4-4a74-8480-2fe490dcac93.jpeg</url>
      <title>Forem: gokcedemirdurkut</title>
      <link>https://forem.com/gokcedemirdurkut</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gokcedemirdurkut"/>
    <language>en</language>
    <item>
      <title>Why Your Nginx Security Headers Disappear (add_header Inheritance Explained)</title>
      <dc:creator>gokcedemirdurkut</dc:creator>
      <pubDate>Thu, 12 Mar 2026 09:53:58 +0000</pubDate>
      <link>https://forem.com/gokcedemirdurkut/nginx-addheader-inheritance-the-silent-security-header-killer-p0f</link>
      <guid>https://forem.com/gokcedemirdurkut/nginx-addheader-inheritance-the-silent-security-header-killer-p0f</guid>
      <description>&lt;p&gt;Security headers can disappear in Nginx even when they are correctly configured.&lt;/p&gt;

&lt;p&gt;Some endpoints may silently drop important headers like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content-Security-Policy&lt;/li&gt;
&lt;li&gt;Strict-Transport-Security&lt;/li&gt;
&lt;li&gt;X-Frame-Options&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This behavior is subtle and has existed in &lt;a href="https://nginx.org/" rel="noopener noreferrer"&gt;Nginx&lt;/a&gt; for a long time.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Problem Appears
&lt;/h2&gt;

&lt;p&gt;Everything looks correct at first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-I&lt;/span&gt; https://example.com/api/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Response headers include security policies as expected.&lt;/p&gt;

&lt;p&gt;But then another endpoint is checked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-I&lt;/span&gt; https://example.com/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And suddenly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Content-Security-Policy missing
❌ Strict-Transport-Security missing
❌ X-Frame-Options missing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tools like &lt;a href="https://securityheaders.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;securityheaders.com&lt;/strong&gt;&lt;/a&gt; will immediately detect this.&lt;/p&gt;

&lt;p&gt;The confusing part is that the headers are clearly defined in the configuration.&lt;/p&gt;
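
&lt;p&gt;The same check can be scripted across several endpoints. The following is a minimal Python sketch, assuming the three headers listed above are the ones that matter; the names &lt;code&gt;REQUIRED&lt;/code&gt;, &lt;code&gt;missing_security_headers&lt;/code&gt; and &lt;code&gt;check&lt;/code&gt; are illustrative and not part of any existing tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import urllib.request

# Headers this sketch treats as mandatory (an assumption, adjust as needed).
REQUIRED = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Frame-Options",
)


def missing_security_headers(header_names):
    """Return the required headers that are absent from header_names."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    present = {name.lower() for name in header_names}
    return [name for name in REQUIRED if name.lower() not in present]


def check(url):
    """Send a HEAD request (like curl -I) and list the missing headers."""
    # 4xx/5xx responses raise urllib.error.HTTPError; not handled here.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return missing_security_headers(response.headers.keys())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;check&lt;/code&gt; for both &lt;code&gt;https://example.com/api/health&lt;/code&gt; and &lt;code&gt;https://example.com/&lt;/code&gt; would return an empty list for the first URL and the three header names for the second in the scenario described above.&lt;/p&gt;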




&lt;h2&gt;
  
  
  A Common Configuration
&lt;/h2&gt;

&lt;p&gt;A typical setup may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Content-Security-Policy&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Strict-Transport-Security&lt;/span&gt; &lt;span class="s"&gt;"max-age=31536000"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;.html&lt;/span&gt;$ &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt; &lt;span class="s"&gt;"no-store"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The intention is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;disable caching for HTML responses&lt;/li&gt;
&lt;li&gt;apply security headers globally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this configuration does not behave as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Happens
&lt;/h2&gt;

&lt;p&gt;Responses served from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;.html&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;will only include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;Cache-Control: no-store
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All security headers defined in the &lt;code&gt;server&lt;/code&gt; block disappear.&lt;/p&gt;

&lt;p&gt;There are no warnings and no errors in the logs.&lt;br&gt;
The headers are simply missing.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Reason
&lt;/h2&gt;

&lt;p&gt;This happens because of how Nginx handles header inheritance.&lt;/p&gt;

&lt;p&gt;If a &lt;code&gt;location&lt;/code&gt; block defines &lt;strong&gt;any &lt;code&gt;add_header&lt;/code&gt; directive&lt;/strong&gt;, it &lt;strong&gt;does not inherit&lt;/strong&gt; &lt;code&gt;add_header&lt;/code&gt; directives from parent blocks.&lt;/p&gt;

&lt;p&gt;Even adding a single header like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cache-Control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;causes Nginx to drop previously defined headers such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content-Security-Policy&lt;/li&gt;
&lt;li&gt;Strict-Transport-Security&lt;/li&gt;
&lt;li&gt;X-Frame-Options&lt;/li&gt;
&lt;li&gt;Referrer-Policy&lt;/li&gt;
&lt;li&gt;Permissions-Policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is long-standing, documented behavior rather than a bug, which is why Nginx emits no warning when it happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Previous Workarounds
&lt;/h2&gt;

&lt;p&gt;Until recently, the usual workarounds were:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Duplicating headers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Define all security headers again in every &lt;code&gt;location&lt;/code&gt; block.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Using includes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Move the security headers into a shared file and pull it in with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;include&lt;/span&gt; &lt;span class="n"&gt;security_headers.conf&lt;/span&gt;;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then repeat that line in every &lt;code&gt;location&lt;/code&gt; block that defines its own &lt;code&gt;add_header&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Both approaches work but are difficult to maintain and easy to forget.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix in Nginx 1.29.3+
&lt;/h2&gt;

&lt;p&gt;Nginx 1.29.3 introduced a directive that addresses this directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;add_header_inherit&lt;/span&gt; &lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This directive changes how &lt;code&gt;add_header&lt;/code&gt; directives are inherited.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;merge&lt;/code&gt;, a block's own &lt;code&gt;add_header&lt;/code&gt; directives are &lt;strong&gt;combined with&lt;/strong&gt; those inherited from the parent level instead of replacing them.&lt;/p&gt;

&lt;p&gt;The directive is covered in the official announcement for Nginx 1.29.3 and 1.29.4:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.nginx.org/blog/nginx-open-source-1-29-3-and-1-29-4" rel="noopener noreferrer"&gt;https://blog.nginx.org/blog/nginx-open-source-1-29-3-and-1-29-4&lt;/a&gt;&lt;/p&gt;
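
&lt;p&gt;The difference between the default behavior and &lt;code&gt;merge&lt;/code&gt; can be sketched as a toy model in Python. It treats each configuration level as a list of (name, value) pairs; the function name and the &lt;code&gt;merge&lt;/code&gt; flag are illustrative, this is not how Nginx is implemented, and the order in which Nginx emits the combined headers is not modeled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def effective_add_headers(parent, child, merge=False):
    """Return the add_header pairs that apply at the child level."""
    if merge:
        # add_header_inherit merge: keep the parent's headers as well.
        return list(parent) + list(child)
    # Default rule: inherit only if the child defines no add_header.
    return list(child) if child else list(parent)


# Mirrors the server and location levels from the example configuration.
server_level = [
    ("Content-Security-Policy", "..."),
    ("Strict-Transport-Security", "max-age=31536000"),
]
html_location = [("Cache-Control", "no-store")]

default_result = effective_add_headers(server_level, html_location)
merged_result = effective_add_headers(server_level, html_location, merge=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;default_result&lt;/code&gt; contains only the &lt;code&gt;Cache-Control&lt;/code&gt; pair, while &lt;code&gt;merged_result&lt;/code&gt; contains all three headers.&lt;/p&gt;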




&lt;h2&gt;
  
  
  A Safer Configuration
&lt;/h2&gt;

&lt;p&gt;With the new directive, the configuration can be written like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="kn"&gt;add_header_inherit&lt;/span&gt; &lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Content-Security-Policy&lt;/span&gt; &lt;span class="s"&gt;"...security&lt;/span&gt; &lt;span class="s"&gt;policy..."&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Strict-Transport-Security&lt;/span&gt; &lt;span class="s"&gt;"max-age=63072000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="kn"&gt;includeSubDomains"&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;X-Frame-Options&lt;/span&gt; &lt;span class="s"&gt;"SAMEORIGIN"&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;X-Content-Type-Options&lt;/span&gt; &lt;span class="s"&gt;"nosniff"&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="p"&gt;~&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="s"&gt;.html&lt;/span&gt;$ &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt; &lt;span class="s"&gt;"no-store,&lt;/span&gt; &lt;span class="s"&gt;no-cache,&lt;/span&gt; &lt;span class="s"&gt;private,&lt;/span&gt; &lt;span class="s"&gt;max-age=0"&lt;/span&gt; &lt;span class="s"&gt;always&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the behavior is predictable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTML responses are not cached&lt;/li&gt;
&lt;li&gt;Security headers are applied consistently&lt;/li&gt;
&lt;li&gt;Headers do not need to be duplicated&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;When using &lt;code&gt;add_header&lt;/code&gt; inside &lt;code&gt;location&lt;/code&gt; blocks in &lt;strong&gt;Nginx 1.29.3 or later&lt;/strong&gt;, enabling the following directive is recommended:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;add_header_inherit&lt;/span&gt; &lt;span class="s"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents accidental loss of security headers and makes configurations easier to maintain.&lt;/p&gt;

</description>
      <category>backend</category>
      <category>devops</category>
      <category>infosec</category>
      <category>security</category>
    </item>
    <item>
      <title>Preventing Silent ECS Deployment Failures with Circuit Breaker</title>
      <dc:creator>gokcedemirdurkut</dc:creator>
      <pubDate>Thu, 26 Feb 2026 10:57:55 +0000</pubDate>
      <link>https://forem.com/gokcedemirdurkut/preventing-silent-ecs-deployment-failures-with-circuit-breaker-2k5j</link>
      <guid>https://forem.com/gokcedemirdurkut/preventing-silent-ecs-deployment-failures-with-circuit-breaker-2k5j</guid>
      <description>&lt;p&gt;Amazon Elastic Container Service (ECS) provides a built-in feature called the &lt;strong&gt;deployment circuit breaker&lt;/strong&gt;, designed to make service deployments safer and more resilient.&lt;/p&gt;

&lt;p&gt;This feature monitors newly launched tasks during a deployment and marks the deployment as failed if they cannot reach a steady state. With rollback enabled, ECS then reverts to the last completed deployment, so a failed deployment does not leave the service in a degraded or non-functional state.&lt;/p&gt;

&lt;p&gt;Without this safeguard, deployment failures can easily go unnoticed. For example, if new tasks fail to start or never pass health checks, the service may still appear to be running while it is effectively broken. These silent failures can result in data loss, financial impact, or operational issues depending on the workload.&lt;/p&gt;

&lt;p&gt;In this post, I’ll walk through how to enable the ECS deployment circuit breaker using Terraform, how to observe deployment failures via EventBridge, and how to send real-time alerts to Slack.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the ECS Deployment Circuit Breaker Matters
&lt;/h2&gt;

&lt;p&gt;Enabling the deployment circuit breaker provides several important benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic rollback&lt;/strong&gt; – Failed deployments are reverted to the last known healthy service revision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved visibility&lt;/strong&gt; – ECS emits structured events whenever a deployment fails or rolls back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced operational overhead&lt;/strong&gt; – Failures are mitigated automatically without immediate manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these significantly reduce the risk of production incidents caused by faulty deployments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enabling the Circuit Breaker with Terraform
&lt;/h2&gt;

&lt;p&gt;The deployment circuit breaker can be enabled directly in your ECS service definition. In Terraform, this is done using the &lt;code&gt;deployment_circuit_breaker&lt;/code&gt; block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_service"&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"tuve"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;task_definition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_task_definition&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

  &lt;span class="nx"&gt;deployment_circuit_breaker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enable&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;rollback&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# ... (remaining service arguments omitted)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this configuration in place, ECS will automatically stop and roll back a deployment if the new tasks fail to reach a healthy state.&lt;/p&gt;

&lt;p&gt;Once enabled, the AWS Management Console clearly indicates that the &lt;strong&gt;Deployment circuit breaker&lt;/strong&gt; is turned on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbyxeejxs9hkimyzwv9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbyxeejxs9hkimyzwv9h.png" alt=" " width="800" height="148"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Observing Deployment Failures
&lt;/h2&gt;

&lt;p&gt;Automatic rollback is useful, but visibility is just as important.&lt;/p&gt;

&lt;p&gt;When the ECS deployment circuit breaker triggers, ECS emits events to Amazon EventBridge with the following detail type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ECS Deployment State Change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is an example event payload:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ddca6449-b258-46c0-8653-e0e3aEXAMPLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail-type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ECS Deployment State Change"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws.ecs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"account"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"111122223333"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"time"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-05-23T12:31:14Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-central-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:ecs:eu-central-1:111122223333:service/default/servicetest"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"detail"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"eventType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ERROR"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"eventName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SERVICE_DEPLOYMENT_FAILED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deploymentId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs-svc/123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"updatedAt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2020-05-23T11:11:11Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ECS deployment circuit breaker: task failed to start."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Fields to Monitor
&lt;/h3&gt;

&lt;p&gt;Some fields in this event are particularly useful for monitoring and alerting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;eventName&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SERVICE_DEPLOYMENT_FAILED&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SERVICE_DEPLOYMENT_ROLLBACK_COMPLETED&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;reason&lt;/strong&gt; – Explains why the deployment failed&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;resources&lt;/strong&gt; – Identifies the affected ECS service&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;updatedAt&lt;/strong&gt; – Indicates when the failure occurred&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Tracking these fields ensures that deployment issues are visible immediately instead of being discovered hours later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment Rollback in the AWS Console
&lt;/h3&gt;

&lt;p&gt;The AWS Management Console also provides clear visibility into rollback activity. After a failed deployment, the &lt;strong&gt;Deployments&lt;/strong&gt; tab shows the rollback status along with the target service revision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0ahnc0921jb3hv33jv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0ahnc0921jb3hv33jv5.png" alt=" " width="800" height="316"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This view is particularly useful for confirming that the circuit breaker worked as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sending Deployment Alerts to Slack
&lt;/h2&gt;

&lt;p&gt;To ensure deployment failures are noticed immediately, ECS deployment events can be routed to Slack using EventBridge and Lambda.&lt;/p&gt;

&lt;p&gt;The overall flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ECS → EventBridge → Lambda → Slack&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Lambda Handler Example
&lt;/h3&gt;

&lt;p&gt;EventBridge invokes the Lambda function for ECS deployment state changes, and the function sends a notification when a deployment fails or rolls back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;detail_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail-type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;detail_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ECS Deployment State Change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eventName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE_DEPLOYMENT_FAILED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SERVICE_DEPLOYMENT_ROLLBACK_COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
            &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="n"&gt;service_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updatedAt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;send_slack_notification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
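
&lt;p&gt;The handler calls &lt;code&gt;send_slack_notification&lt;/code&gt; without defining it. One possible shape for that helper is sketched below, assuming a Slack incoming webhook whose URL is stored in an environment variable; the variable name &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt;, the &lt;code&gt;build_slack_payload&lt;/code&gt; function and the message wording are all assumptions made for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import os
import urllib.request


def build_slack_payload(service, reason, event_type, timestamp):
    """Build the JSON body for a Slack incoming webhook."""
    # Incoming webhooks accept a JSON object with a "text" field.
    text = (
        f"ECS deployment alert for {service}\n"
        f"Event: {event_type}\n"
        f"Reason: {reason}\n"
        f"Time: {timestamp}"
    )
    return {"text": text}


def send_slack_notification(service, reason, event_type, timestamp):
    """POST the alert to Slack and return the HTTP status code."""
    # SLACK_WEBHOOK_URL is an assumed environment variable name.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    payload = build_slack_payload(service, reason, event_type, timestamp)
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the payload construction in its own function makes the message format testable without any network access. Using only the standard library also avoids packaging extra dependencies with the Lambda function.&lt;/p&gt;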






&lt;h3&gt;
  
  
  EventBridge Rule (Terraform)
&lt;/h3&gt;

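&lt;p&gt;The following EventBridge rule filters ECS deployment events and forwards them to the Lambda function. EventBridge also needs permission to invoke the function (an &lt;code&gt;aws_lambda_permission&lt;/code&gt; resource with the principal &lt;code&gt;events.amazonaws.com&lt;/code&gt;), which is omitted here:&lt;br&gt;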
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_rule"&lt;/span&gt; &lt;span class="s2"&gt;"ecs_deployment"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecs-deployment-events"&lt;/span&gt;

  &lt;span class="nx"&gt;event_pattern&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;jsonencode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="s2"&gt;"source"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"aws.ecs"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;"detail-type"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ECS Deployment State Change"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="s2"&gt;"detail"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"eventName"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"SERVICE_DEPLOYMENT_FAILED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"SERVICE_DEPLOYMENT_ROLLBACK_COMPLETED"&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cloudwatch_event_target"&lt;/span&gt; &lt;span class="s2"&gt;"lambda"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_cloudwatch_event_rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;arn&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_lambda_function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Outcome
&lt;/h2&gt;

&lt;p&gt;After enabling the ECS deployment circuit breaker and adding Slack notifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed deployments automatically roll back&lt;/li&gt;
&lt;li&gt;Silent service failures are eliminated&lt;/li&gt;
&lt;li&gt;Deployment issues become visible in real time&lt;/li&gt;
&lt;li&gt;ECS services are safer by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining automated rollback with real-time alerts, you can significantly reduce operational risk and increase confidence in your ECS deployments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>terraform</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploying and Customizing AWS Parallel Computing Service (PCS) for HPC Workloads</title>
      <dc:creator>gokcedemirdurkut</dc:creator>
      <pubDate>Sun, 26 Oct 2025 09:15:02 +0000</pubDate>
      <link>https://forem.com/gokcedemirdurkut/deploying-and-customizing-aws-parallelcluster-service-pcs-for-hpc-workloads-km1</link>
      <guid>https://forem.com/gokcedemirdurkut/deploying-and-customizing-aws-parallelcluster-service-pcs-for-hpc-workloads-km1</guid>
      <description>&lt;p&gt;I recently worked on a project involving AWS Parallel Computing Service (PCS).&lt;/p&gt;

&lt;p&gt;The main goal was to build an &lt;a href="https://aws.amazon.com/hpc" rel="noopener noreferrer"&gt;HPC&lt;/a&gt; cluster that met our specific requirements: an image with &lt;a href="https://www.python.org/downloads/release/python-3100" rel="noopener noreferrer"&gt;Python 3.10&lt;/a&gt;, the necessary dependencies preinstalled, and a PCS cluster launched from that image.&lt;br&gt;
To achieve this, I built a custom AMI (Amazon Machine Image) using Packer, then launched a PCS cluster based on that image.&lt;br&gt;
Throughout the process, I automated the workflow using Terraform, GitHub Actions, and shell scripts.&lt;br&gt;
After deploying the cluster, I verified that the setup was operational by running several SLURM job commands.&lt;/p&gt;

&lt;p&gt;In this article, I’ll walk you through the entire process, from building the custom AMI to running jobs on PCS.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is AWS PCS?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/hpc/parallelcluster" rel="noopener noreferrer"&gt;AWS Parallel Computing Service&lt;/a&gt; (PCS) is a managed service for running high-performance computing (HPC) workloads on AWS. It’s designed for parallel workloads such as simulations, ML training, and large-scale data analysis.&lt;br&gt;
Using PCS, you can deploy and manage HPC clusters without manually configuring compute nodes, networking, or schedulers. It integrates with services like Amazon S3 and AWS Batch, supporting complex workloads efficiently.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Use AWS PCS?
&lt;/h2&gt;

&lt;p&gt;Traditionally, deploying and managing HPC clusters required deep expertise in cluster configuration, job scheduling, and infrastructure management.&lt;br&gt;
AWS PCS abstracts much of this complexity by offering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fully managed cluster orchestration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with SLURM scheduler&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Elastic scaling based on job demand&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure as Code (IaC) support with &lt;a href="https://developer.hashicorp.com/terraform" rel="noopener noreferrer"&gt;Terraform&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" rel="noopener noreferrer"&gt;CloudFormation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes AWS PCS an excellent choice for researchers, data scientists, and DevOps engineers who want to focus on workloads rather than infrastructure.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Was the PCS Cluster Created?
&lt;/h2&gt;

&lt;p&gt;The PCS cluster was provisioned using Terraform with the &lt;a href="https://registry.terraform.io/providers/hashicorp/awscc/latest" rel="noopener noreferrer"&gt;awscc&lt;/a&gt; provider. The infrastructure is defined as code and includes the cluster, the login and compute node groups, and the job queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The awscc provider was required because, at the time of writing, PCS resources were not yet supported by the standard AWS provider. The awscc provider uses the AWS Cloud Control API to manage newer services like PCS.&lt;/p&gt;

&lt;p&gt;📢 &lt;a href="https://aws.amazon.com/about-aws/whats-new/2025/03/announcing-terraform-parallel-computing-service" rel="noopener noreferrer"&gt;Official Announcement: Terraform Support for AWS Parallel Computing Service (March 2025)&lt;/a&gt;&lt;/p&gt;
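
&lt;p&gt;For reference, here is a minimal sketch of how the awscc provider might be declared next to the standard AWS provider. The variable name &lt;em&gt;aws_region&lt;/em&gt; is an assumption made for illustration, and no version constraints are shown; pin whichever provider versions your project has tested.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: declare both providers so PCS resources (awscc)
# and supporting resources such as security groups (aws) can coexist.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    awscc = {
      source = "hashicorp/awscc"
    }
  }
}

# Hypothetical variable name, not taken from the original project.
variable "aws_region" {
  type        = string
  description = "Region in which the PCS cluster is deployed"
}

provider "aws" {
  region = var.aws_region
}

provider "awscc" {
  region = var.aws_region
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
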
&lt;h2&gt;
  
  
  PCS Cluster Components
&lt;/h2&gt;

&lt;p&gt;A PCS setup typically includes three main components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster:&lt;/strong&gt; Defines general settings, scheduler configuration, and networking.&lt;br&gt;
&lt;strong&gt;Node Groups:&lt;/strong&gt; Specify the login and compute nodes.&lt;br&gt;
&lt;strong&gt;Queue:&lt;/strong&gt; Manages job scheduling and execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PCS Setup with Terraform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Below is a simplified example of how a PCS cluster can be created using Terraform and the awscc provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create PCS Cluster using awscc provider&lt;/span&gt;
&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"awscc_pcs_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"this"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_pcs_cluster&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_name&lt;/span&gt;
  &lt;span class="nx"&gt;size&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_size&lt;/span&gt;

  &lt;span class="nx"&gt;scheduler&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SLURM"&lt;/span&gt;
    &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cluster_slurm_version&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;networking&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;security_group_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;subnet_ids&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Variables like &lt;em&gt;cluster_name&lt;/em&gt;, &lt;em&gt;subnet_ids&lt;/em&gt;, and &lt;em&gt;cluster_slurm_version&lt;/em&gt; can be parameterized to adapt across environments.&lt;/p&gt;
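
&lt;p&gt;A possible set of declarations for the variables referenced in the cluster resource is sketched below. The descriptions and the validation rule are assumptions added for illustration (the rule reflects the SMALL, MEDIUM, and LARGE cluster sizes that PCS documents); check the current PCS documentation for supported sizes and SLURM versions before relying on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: input variables used by awscc_pcs_cluster.this
variable "cluster_name" {
  type        = string
  description = "Name of the PCS cluster"
}

variable "cluster_size" {
  type        = string
  description = "PCS cluster size"

  # Assumed guard: fail at plan time on an unexpected size value.
  validation {
    condition     = contains(["SMALL", "MEDIUM", "LARGE"], var.cluster_size)
    error_message = "cluster_size must be SMALL, MEDIUM, or LARGE."
  }
}

variable "cluster_slurm_version" {
  type        = string
  description = "SLURM version for the PCS scheduler"
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets in which the cluster is placed"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping these values in per-environment &lt;em&gt;.tfvars&lt;/em&gt; files lets the same module serve development and production without code changes.&lt;/p&gt;
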

&lt;p&gt;&lt;strong&gt;Cost Awareness &amp;amp; Usage Control&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Running a PCS cluster can become expensive depending on your configuration.&lt;br&gt;
Each compute node in the cluster is an Amazon EC2 instance, and charges accrue for as long as those instances are running.&lt;/p&gt;

&lt;p&gt;To manage this, we added a Terraform variable that controls whether the PCS cluster should be deployed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Control cluster deployment&lt;/span&gt;
&lt;span class="nx"&gt;enable_pcs_cluster&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# create the cluster&lt;/span&gt;
&lt;span class="c1"&gt;# enable_pcs_cluster = false  # skip deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows you to enable or disable the PCS cluster as needed.&lt;br&gt;
For example, during development or testing, you can set &lt;em&gt;enable_pcs_cluster = false&lt;/em&gt; to avoid unnecessary charges.&lt;/p&gt;
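
&lt;p&gt;One way the toggle could be declared and consumed is sketched below, assuming the cluster resource's &lt;em&gt;count&lt;/em&gt; reads this variable. The default value and the output name &lt;em&gt;pcs_cluster_id&lt;/em&gt; are illustrative assumptions. Because a resource with &lt;em&gt;count&lt;/em&gt; becomes a list, references to it need to tolerate the empty case.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: boolean switch for the PCS cluster
variable "enable_pcs_cluster" {
  type        = bool
  default     = false # assumed default: opt in explicitly to avoid surprise costs
  description = "Whether to create the PCS cluster"
}

# one() returns the single element of the list, or null when the
# cluster is disabled and the list is empty.
output "pcs_cluster_id" {
  value = one(awscc_pcs_cluster.this[*].id)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
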
&lt;h2&gt;
  
  
  Creating Custom AMIs with Packer
&lt;/h2&gt;

&lt;p&gt;Custom AMIs let you pre-install software and dependencies on compute nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create a Packer template based on a base AMI (e.g., Ubuntu or Amazon Linux 2).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install the required software packages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build the AMI and use it in your PCS cluster.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Packer Template:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"builders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"amazon-ebs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `aws_region`}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source_ami"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `source_ami`}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `instance_type`}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ssh_username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `ssh_username`}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ami_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `ami_name_prefix`}}-{{user `distribution`}}-{{user `architecture`}}-{{isotime `2006.01.02-15.04`}}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ami_description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{user `ami_description`}}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provisioners"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inline"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"sudo yum update -y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"sudo yum install -y gcc make python3"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This is just an example. You can add other dependencies as needed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We used SLURM as the job scheduler within PCS to manage HPC workloads efficiently.&lt;br&gt;
It handles queueing, job submission, and node allocation, allowing multiple users or jobs to share compute resources dynamically.&lt;/p&gt;

&lt;p&gt;During our setup, we required Python 3.10, but the default Amazon Linux 2 AMI provided only Python 3.7.&lt;br&gt;
To solve this, we created a custom AMI using Packer, based on the following Ubuntu Marketplace image:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ubuntu Server 22.04 LTS (arm64)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AMI ID: ami-0f45e2f16611e3139&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source: AWS Marketplace&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This custom AMI included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python 3.10 preinstalled&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Common HPC dependencies (gcc, make, python3-pip)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Additional Python libraries required for our workloads&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By using a custom AMI, we ensured compatibility with our codebase and reduced setup time for new compute nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Custom AMIs are especially helpful when your HPC workloads require specific compiler versions, Python environments, or third-party libraries.&lt;/p&gt;
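
&lt;p&gt;Because the Packer template above appends a timestamp to each AMI name, Terraform can look up the most recent build by name prefix instead of hard-coding an AMI ID. The sketch below assumes the AMI is owned by the same account and that &lt;em&gt;ami_name_prefix&lt;/em&gt; matches the Packer user variable of the same name; the data source and output names are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;# Sketch only: find the newest custom AMI produced by Packer
variable "ami_name_prefix" {
  type        = string
  description = "Prefix used for ami_name in the Packer template"
}

data "aws_ami" "pcs_custom" {
  most_recent = true
  owners      = ["self"] # assumes the AMI was built in this account

  filter {
    name   = "name"
    values = ["${var.ami_name_prefix}-*"]
  }

  filter {
    name   = "state"
    values = ["available"]
  }
}

# The resulting ID can then be passed to the compute node groups.
output "pcs_custom_ami_id" {
  value = data.aws_ami.pcs_custom.id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
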
&lt;h2&gt;
  
  
  Testing the SLURM Cluster
&lt;/h2&gt;

&lt;p&gt;Once the PCS cluster was up and running, we validated the environment by executing a few basic SLURM commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check the cluster and nodes&lt;/span&gt;
sinfo

&lt;span class="c"&gt;# Submit a test job&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'#!/bin/bash\necho Hello from SLURM\n'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; test.sh
sbatch test.sh

&lt;span class="c"&gt;# View the job queue&lt;/span&gt;
squeue
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These simple tests confirmed that the SLURM scheduler was active, compute nodes were responding, and jobs were successfully executed across the PCS cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS PCS simplifies HPC cluster management with a managed, scalable, and elastic environment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Terraform (awscc) enables modern Infrastructure as Code deployment.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SLURM provides flexible and familiar scheduling for HPC users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Custom AMIs and bootstrap scripts enable deep customization and reproducibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ideal for research, AI/ML training, and data-intensive simulations.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;#terraform #aws #devops #cloud #hpc #slurm #packer #ami #python&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>devops</category>
      <category>hpc</category>
    </item>
  </channel>
</rss>
