<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Lee Yao</title>
    <description>The latest articles on Forem by Lee Yao (@lee_yao_cfeb14fb9b141b8c5).</description>
    <link>https://forem.com/lee_yao_cfeb14fb9b141b8c5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3917027%2F9dba3dda-178a-479e-a596-411d2f08f71d.jpg</url>
      <title>Forem: Lee Yao</title>
      <link>https://forem.com/lee_yao_cfeb14fb9b141b8c5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lee_yao_cfeb14fb9b141b8c5"/>
    <language>en</language>
    <item>
      <title>Debugging a Multi-Container Airflow Pipeline: Kafka Network Isolation and the YAML Indentation Trap</title>
      <dc:creator>Lee Yao</dc:creator>
      <pubDate>Mon, 11 May 2026 06:09:41 +0000</pubDate>
      <link>https://forem.com/lee_yao_cfeb14fb9b141b8c5/debugging-a-multi-container-airflow-pipeline-kafka-network-isolation-and-the-yaml-indentation-trap-5595</link>
      <guid>https://forem.com/lee_yao_cfeb14fb9b141b8c5/debugging-a-multi-container-airflow-pipeline-kafka-network-isolation-and-the-yaml-indentation-trap-5595</guid>
      <description>&lt;p&gt;After getting Kafka, Spark, and Snowflake all working individually, I thought wiring them together in Airflow would be the easy part. It was not. What followed was an afternoon of containers failing silently, misleading error messages, and one of those bugs where the fix is two characters but finding it takes two hours.&lt;/p&gt;

&lt;p&gt;This post covers the two main issues I hit when building an Airflow DAG to orchestrate a CMS Medicare streaming pipeline. Both problems are worth understanding because they'll come up any time you try to connect services across multiple Docker Compose projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;The pipeline I was building looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CMS API → Kafka Producer → Kafka → Spark Streaming → Snowflake → dbt → dbt test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All orchestrated by a single Airflow DAG with four tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run_cms_producer&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;run_spark_streaming&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;run_dbt_models&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;run_dbt_tests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tricky part: Kafka and Spark live in one Docker Compose project (&lt;code&gt;CMS_project&lt;/code&gt;), and Airflow lives in a completely separate Docker Compose project (&lt;code&gt;DOT_project/airflow&lt;/code&gt;). Two separate projects, two separate Docker networks, and a whole set of problems that come with that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem 1: &lt;code&gt;NoBrokersAvailable&lt;/code&gt; — Kafka Network Isolation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What the error looked like
&lt;/h3&gt;

&lt;p&gt;The first task, &lt;code&gt;run_cms_producer&lt;/code&gt;, kept failing immediately. The error in the Airflow task log:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;kafka.errors.NoBrokersAvailable: NoBrokersAvailable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This error means the Kafka client tried to connect to the broker, got nothing back, and gave up. The fix seems obvious: check the broker address. I did. It looked right. The container names were correct. The ports were open. And yet — nothing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it happened: Docker Compose network isolation
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;docker-compose up&lt;/code&gt; in a directory, Docker creates a private network for that project. By default, it's named after the project directory, lowercased, with a &lt;code&gt;_default&lt;/code&gt; suffix: &lt;code&gt;cms_project_default&lt;/code&gt; for &lt;code&gt;CMS_project&lt;/code&gt;, and &lt;code&gt;airflow_default&lt;/code&gt; for &lt;code&gt;DOT_project/airflow&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Containers in &lt;code&gt;cms_project_default&lt;/code&gt; can talk to each other freely. Containers in &lt;code&gt;airflow_default&lt;/code&gt; can talk to each other freely. But &lt;strong&gt;containers in different projects cannot reach each other by default&lt;/strong&gt; — the networks are completely separate.&lt;/p&gt;
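&lt;p&gt;The naming convention is worth internalizing, because it's exactly the name you type into an &lt;code&gt;external: true&lt;/code&gt; network block. A rough sketch of the rule in plain Python (Compose also strips characters it considers invalid, so treat this as an approximation, not the exact algorithm):&lt;/p&gt;

```python
import os

def default_compose_network(project_dir):
    """Approximate Docker Compose's default network name:
    the project name (the folder name, lowercased) plus '_default'."""
    name = os.path.basename(os.path.normpath(project_dir)).lower()
    return name + "_default"

print(default_compose_network("CMS_project"))          # cms_project_default
print(default_compose_network("DOT_project/airflow"))  # airflow_default
```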

&lt;p&gt;The first fix I tried was adding Airflow's containers to the CMS network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In Airflow's docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;networks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;airflow_default&lt;/span&gt;
  &lt;span class="na"&gt;cms_network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;external&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cms_project_default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This seemed to work — &lt;code&gt;docker inspect&lt;/code&gt; confirmed the Airflow scheduler was in both networks. But the &lt;code&gt;NoBrokersAvailable&lt;/code&gt; error kept showing up.&lt;/p&gt;

&lt;h3&gt;
  
  
  The real problem: how Kafka advertises itself
&lt;/h3&gt;

&lt;p&gt;Here's where it gets less obvious. Kafka doesn't just sit there and accept connections. When a client first connects, Kafka sends back a list of addresses the client should use for all future communication. This list is called the &lt;strong&gt;advertised listeners&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In my &lt;code&gt;CMS_project/docker-compose.yml&lt;/code&gt;, the Kafka configuration looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;KAFKA_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_ADVERTISED_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://kafka:29092,EXTERNAL://localhost:9092&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_LISTENER_SECURITY_PROTOCOL_MAP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_INTER_BROKER_LISTENER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sets up two listeners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;INTERNAL&lt;/code&gt; on port 29092 — intended for containers inside the same Docker Compose project&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;EXTERNAL&lt;/code&gt; on port 9092 — intended for connections from the host machine (your Windows terminal)&lt;/li&gt;
&lt;/ul&gt;
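&lt;p&gt;It helps to think of the listener strings as a small data structure rather than an opaque config value. Here's a minimal, hypothetical parser for the same comma-separated format, just to make concrete what the broker hands back during the handshake:&lt;/p&gt;

```python
def parse_listeners(spec):
    """Parse a Kafka listener string like
    'INTERNAL://kafka:29092,EXTERNAL://localhost:9092'
    into a {listener_name: (host, port)} mapping."""
    result = {}
    for entry in spec.split(","):
        name, address = entry.split("://", 1)
        host, port = address.rsplit(":", 1)
        result[name] = (host, int(port))
    return result

advertised = parse_listeners("INTERNAL://kafka:29092,EXTERNAL://localhost:9092")
print(advertised["EXTERNAL"])  # ('localhost', 9092)
```

&lt;p&gt;Whichever listener a client lands on, the host/port tuple advertised for that listener is what the client uses for every subsequent request. That's why a client can connect successfully and still fail one step later.&lt;/p&gt;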

&lt;p&gt;When the Airflow container tried to connect to &lt;code&gt;kafka:29092&lt;/code&gt;, Kafka initially accepted the connection (because Airflow was now on the same Docker network). But then Kafka sent back its advertised address for that listener: &lt;code&gt;kafka:29092&lt;/code&gt;. When the Airflow container tried to use that address for subsequent requests, it hit the same broker again — which was fine.&lt;/p&gt;

&lt;p&gt;So why was it still failing?&lt;/p&gt;

&lt;p&gt;The answer is subtle. Even though Airflow was on the &lt;code&gt;cms_project_default&lt;/code&gt; network, Kafka was treating it as an &lt;strong&gt;internal&lt;/strong&gt; client. The &lt;code&gt;INTERNAL&lt;/code&gt; listener was designed for containers inside &lt;code&gt;cms_project&lt;/code&gt;, which share a network configuration that resolves &lt;code&gt;kafka&lt;/code&gt; to the right IP. Airflow's container, coming from a different Compose project, did not resolve it the same way, and the connection kept timing out at the API version check: the very first request a Kafka client makes before it can do anything else.&lt;/p&gt;

&lt;h3&gt;
  
  
  The fix: a third dedicated listener
&lt;/h3&gt;

&lt;p&gt;The cleanest solution is to add a third listener specifically for cross-project Docker container connections:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;KAFKA_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://0.0.0.0:29092,EXTERNAL://0.0.0.0:9092,DOCKER://0.0.0.0:39092&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_ADVERTISED_LISTENERS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL://kafka:29092,EXTERNAL://localhost:9092,DOCKER://kafka:39092&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_LISTENER_SECURITY_PROTOCOL_MAP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,DOCKER:PLAINTEXT&lt;/span&gt;
&lt;span class="na"&gt;KAFKA_INTER_BROKER_LISTENER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INTERNAL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And expose the new port:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9092:9092"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;39092:39092"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now there are three separate listeners, each for a different kind of client:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Listener&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Who uses it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INTERNAL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;29092&lt;/td&gt;
&lt;td&gt;Containers inside &lt;code&gt;cms_project&lt;/code&gt; (Spark, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXTERNAL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;9092&lt;/td&gt;
&lt;td&gt;Host machine (your terminal, local Python scripts)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DOCKER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;39092&lt;/td&gt;
&lt;td&gt;Containers from other Docker Compose projects (Airflow)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In the Airflow DAG, the producer task now uses the &lt;code&gt;DOCKER&lt;/code&gt; listener:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;run_producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_cms_producer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KAFKA_BROKER=kafka:39092 python /opt/airflow/cms_producer.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;execution_timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in &lt;code&gt;cms_producer.py&lt;/code&gt;, the broker address reads from the environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;KAFKA_BROKER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KAFKA_BROKER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this change, the connection worked immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to debug Kafka connectivity issues
&lt;/h3&gt;

&lt;p&gt;Before changing any configuration, verify the actual connection from inside the container that's having trouble. Don't guess based on port numbers or network diagrams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Copy a test script into the Airflow container&lt;/span&gt;
docker &lt;span class="nb"&gt;cp &lt;/span&gt;test_kafka.py airflow-airflow-scheduler-1:/tmp/test_kafka.py

&lt;span class="c"&gt;# Run it from inside the container&lt;/span&gt;
docker &lt;span class="nb"&gt;exec &lt;/span&gt;airflow-airflow-scheduler-1 python /tmp/test_kafka.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;test_kafka.py&lt;/code&gt; contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaConsumer&lt;/span&gt;
&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaConsumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka:39092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;consumer_timeout_ms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Connected OK&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells you immediately whether the network path is open, without having to trigger a full DAG run and wait for logs.&lt;/p&gt;
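&lt;p&gt;If you don't want to install the kafka library in a container just to test the path, an even lower-level check is raw TCP reachability. This is a standard-library sketch of my own, not part of the pipeline; note that a successful connect only proves the port is open, not that the advertised listeners are sane:&lt;/p&gt;

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: from inside the Airflow container, check the DOCKER listener.
# print(tcp_reachable("kafka", 39092))
```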




&lt;h2&gt;
  
  
  Problem 2: The YAML Indentation Trap
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What the error looked like
&lt;/h3&gt;

&lt;p&gt;Once the Kafka connection was fixed, &lt;code&gt;run_cms_producer&lt;/code&gt; and &lt;code&gt;run_spark_streaming&lt;/code&gt; started succeeding. But &lt;code&gt;run_dbt_models&lt;/code&gt; failed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime Error
  Could not find profile named 'cms_dbt'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;dbt couldn't find the profile I had added. I checked the file — &lt;code&gt;cms_dbt&lt;/code&gt; was right there in &lt;code&gt;profiles.yml&lt;/code&gt;. Or so I thought.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it happened: YAML nesting is invisible
&lt;/h3&gt;

&lt;p&gt;Here's what the &lt;code&gt;profiles.yml&lt;/code&gt; actually contained:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dbt_dot_flights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake&lt;/span&gt;
      &lt;span class="na"&gt;account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;env_var('SNOWFLAKE_ACCOUNT')&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;
      &lt;span class="c1"&gt;# ... other fields ...&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
  &lt;span class="na"&gt;cms_dbt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                    &lt;span class="c1"&gt;# ← THIS IS THE PROBLEM&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake&lt;/span&gt;
        &lt;span class="na"&gt;account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FZPFTPF-LOB40082&lt;/span&gt;
        &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;cms_dbt&lt;/code&gt; was indented under &lt;code&gt;dbt_dot_flights&lt;/code&gt;. In YAML, indentation defines the structure — so instead of being a separate top-level profile, &lt;code&gt;cms_dbt&lt;/code&gt; became a field &lt;em&gt;inside&lt;/em&gt; the &lt;code&gt;dbt_dot_flights&lt;/code&gt; profile. dbt looked for a top-level key called &lt;code&gt;cms_dbt&lt;/code&gt;, didn't find one, and reported that the profile was missing.&lt;/p&gt;

&lt;p&gt;What the file should have looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;dbt_dot_flights&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;

&lt;span class="na"&gt;cms_dbt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                      &lt;span class="c1"&gt;# ← Top-level, no indentation&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dev&lt;/span&gt;
  &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;snowflake&lt;/span&gt;
      &lt;span class="na"&gt;account&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FZPFTPF-LOB40082&lt;/span&gt;
      &lt;span class="c1"&gt;# ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference is two spaces of indentation, which is completely invisible when you're skimming a file looking for a specific key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why YAML indentation bugs are so hard to catch
&lt;/h3&gt;

&lt;p&gt;Most syntax errors give you a loud, obvious error message. YAML indentation errors often don't — the file is syntactically valid, it just means something different from what you intended. A misindented block doesn't cause a parse error; it causes incorrect data structure, which only surfaces as a logical error later when something tries to use that data.&lt;/p&gt;

&lt;p&gt;In this case, &lt;code&gt;profiles.yml&lt;/code&gt; parsed without any complaint. dbt read it, built the profile registry, and simply didn't find &lt;code&gt;cms_dbt&lt;/code&gt; as a top-level key — because it wasn't one.&lt;/p&gt;

&lt;h3&gt;
  
  
  The quick way to catch YAML structure bugs
&lt;/h3&gt;

&lt;p&gt;Before you commit any YAML file, dump it as parsed Python to see exactly what structure it produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import yaml; import json; print(json.dumps(yaml.safe_load(open('profiles.yml')), indent=2))"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows you the actual data structure the parser sees, not what you think you wrote. A misindented block shows up immediately as a nested key instead of a top-level key.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;profiles.yml&lt;/code&gt; specifically, the top-level keys should be the profile names. If you run the command above and see &lt;code&gt;cms_dbt&lt;/code&gt; nested inside &lt;code&gt;dbt_dot_flights&lt;/code&gt; instead of at the top level, you've found the bug.&lt;/p&gt;
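&lt;p&gt;You can turn that spot-check into a reusable helper for CI or a pre-commit hook. A sketch of the idea; the dict literals below are what &lt;code&gt;yaml.safe_load&lt;/code&gt; produces for the misindented and corrected files, and the profile names are from this post:&lt;/p&gt;

```python
def has_top_level_key(parsed, key):
    """Return True if `key` is a top-level key of a parsed profiles.yml
    (i.e. the dict that yaml.safe_load would give you)."""
    return isinstance(parsed, dict) and key in parsed

# What the misindented file actually parses to: cms_dbt is nested.
bad = {"dbt_dot_flights": {"target": "dev", "cms_dbt": {"target": "dev"}}}
# What the corrected file parses to: two separate top-level profiles.
good = {"dbt_dot_flights": {"target": "dev"}, "cms_dbt": {"target": "dev"}}

print(has_top_level_key(bad, "cms_dbt"))   # False
print(has_top_level_key(good, "cms_dbt"))  # True
```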

&lt;h3&gt;
  
  
  The second dbt failure: a typo in &lt;code&gt;dbt_project.yml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;After fixing the indentation, &lt;code&gt;run_dbt_models&lt;/code&gt; failed again with a different error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No materialization 'table2' was found for adapter snowflake!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one was a straightforward typo. In &lt;code&gt;dbt_project.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cms_dbt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;marts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;+materialized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;table2&lt;/span&gt;    &lt;span class="c1"&gt;# ← Should be 'table'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;table2&lt;/code&gt; is not a valid dbt materialization type. The valid options are &lt;code&gt;view&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, &lt;code&gt;incremental&lt;/code&gt;, and &lt;code&gt;ephemeral&lt;/code&gt;. One character difference, and dbt can't find the materialization.&lt;/p&gt;
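&lt;p&gt;A check like this can run before &lt;code&gt;dbt&lt;/code&gt; ever does. This is not a dbt API, just a hypothetical standalone lint over the parsed &lt;code&gt;models:&lt;/code&gt; block:&lt;/p&gt;

```python
# Built-in dbt materializations for most adapters; a project's custom
# materializations would extend this set.
VALID_MATERIALIZATIONS = {"view", "table", "incremental", "ephemeral"}

def check_materializations(models_config, path=""):
    """Recursively collect invalid '+materialized' values from a parsed
    dbt models config, returning (path, bad_value) pairs."""
    errors = []
    for key, value in models_config.items():
        if key == "+materialized" and value not in VALID_MATERIALIZATIONS:
            errors.append((path or "/", value))
        elif isinstance(value, dict):
            errors.extend(check_materializations(value, path + "/" + key))
    return errors

config = {"cms_dbt": {"marts": {"+materialized": "table2"}}}
print(check_materializations(config))  # [('/cms_dbt/marts', 'table2')]
```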




&lt;h2&gt;
  
  
  The Debugging Workflow That Actually Works
&lt;/h2&gt;

&lt;p&gt;After going through all of this, here's the approach I'd use from the start next time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Test network connectivity before writing any DAG code.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Don't assume two containers can talk to each other just because they're on the same network. Actually test it from inside the source container before you do anything else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &amp;lt;source-container&amp;gt; python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
from kafka import KafkaConsumer
c = KafkaConsumer(bootstrap_servers='&amp;lt;target&amp;gt;:&amp;lt;port&amp;gt;', consumer_timeout_ms=3000)
print('OK')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Validate YAML files immediately after editing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any time you touch a YAML file, run the parser dump to verify the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import yaml; import json; print(json.dumps(yaml.safe_load(open('yourfile.yml')), indent=2))"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two seconds of checking saves an hour of debugging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Read the Airflow task logs, not the DAG-level logs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a task fails, the DAG-level logs just say "failed." The actual error is in the task-specific log file. In the Airflow UI: click the task (the colored square) → click "Logs." That's where the real error message is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: For Kafka specifically, understand the three listener types.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're running Kafka in Docker and connecting from multiple environments (host machine, same-project containers, cross-project containers), you need a separate listener for each. One listener cannot serve all three use cases cleanly.&lt;/p&gt;
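&lt;p&gt;One way to keep the three listener settings from drifting apart is to generate them from a single spec. A hypothetical sketch (the function name and spec shape are my own, not a Kafka API):&lt;/p&gt;

```python
def kafka_listener_env(listeners, security="PLAINTEXT"):
    """Build consistent KAFKA_LISTENERS / KAFKA_ADVERTISED_LISTENERS /
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP values from one spec of the
    form {name: (advertised_host, port)}."""
    return {
        "KAFKA_LISTENERS": ",".join(
            f"{name}://0.0.0.0:{port}" for name, (_, port) in listeners.items()
        ),
        "KAFKA_ADVERTISED_LISTENERS": ",".join(
            f"{name}://{host}:{port}" for name, (host, port) in listeners.items()
        ),
        "KAFKA_LISTENER_SECURITY_PROTOCOL_MAP": ",".join(
            f"{name}:{security}" for name in listeners
        ),
    }

env = kafka_listener_env({
    "INTERNAL": ("kafka", 29092),
    "EXTERNAL": ("localhost", 9092),
    "DOCKER": ("kafka", 39092),
})
print(env["KAFKA_ADVERTISED_LISTENERS"])
# INTERNAL://kafka:29092,EXTERNAL://localhost:9092,DOCKER://kafka:39092
```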




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;NoBrokersAvailable&lt;/code&gt; from Airflow&lt;/td&gt;
&lt;td&gt;Two Docker Compose projects on separate networks; Kafka's &lt;code&gt;INTERNAL&lt;/code&gt; listener not designed for cross-project connections&lt;/td&gt;
&lt;td&gt;Add a third &lt;code&gt;DOCKER&lt;/code&gt; listener on a dedicated port (39092) for cross-project container access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Could not find profile 'cms_dbt'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;YAML indentation error — &lt;code&gt;cms_dbt&lt;/code&gt; was nested inside &lt;code&gt;dbt_dot_flights&lt;/code&gt; instead of being a top-level key&lt;/td&gt;
&lt;td&gt;Fix indentation; validate with &lt;code&gt;yaml.safe_load&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;No materialization 'table2'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Typo in &lt;code&gt;dbt_project.yml&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Change &lt;code&gt;table2&lt;/code&gt; to &lt;code&gt;table&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Kafka network issue is the one worth spending time understanding. Docker Compose network isolation is intentional and useful, but it creates real headaches when you need services from different projects to communicate. Knowing that Kafka has separate listener types — and that you can add custom ones — gives you a clean, explicit solution rather than hacks like &lt;code&gt;network_mode: host&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The YAML bugs are embarrassing in retrospect, but they're also genuinely easy to miss. The fix is always the same: don't trust your eyes on YAML indentation, trust the parser.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>devops</category>
      <category>docker</category>
      <category>networking</category>
    </item>
    <item>
      <title>Why My Spark Container Keeps Exiting — Docker PID 1 and the Daemon Trap</title>
      <dc:creator>Lee Yao</dc:creator>
      <pubDate>Thu, 07 May 2026 04:41:08 +0000</pubDate>
      <link>https://forem.com/lee_yao_cfeb14fb9b141b8c5/why-my-spark-container-keeps-exiting-docker-pid-1-and-the-daemon-trap-dgf</link>
      <guid>https://forem.com/lee_yao_cfeb14fb9b141b8c5/why-my-spark-container-keeps-exiting-docker-pid-1-and-the-daemon-trap-dgf</guid>
      <description>&lt;p&gt;I spent an embarrassing amount of time staring at my terminal, watching Spark containers start and immediately die. Three different attempts, three different failure modes, all in the same afternoon. If you're setting up Spark inside Docker and your container just... vanishes, this post is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I'm building a CMS Medicare streaming pipeline — pulling hospital charge data from the CMS public API, pushing it through Kafka, processing it with Spark Structured Streaming, and landing the results in Snowflake. The whole stack runs in Docker Compose. Kafka and ZooKeeper came up without a hitch. Spark did not.&lt;/p&gt;

&lt;p&gt;Here's what my &lt;code&gt;docker-compose.yml&lt;/code&gt; looked like at the start:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;zookeeper&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/cp-zookeeper:7.4.0&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ZOOKEEPER_CLIENT_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2181&lt;/span&gt;

  &lt;span class="na"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;confluentinc/cp-kafka:7.4.0&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;zookeeper&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9092:9092"&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_ZOOKEEPER_CONNECT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zookeeper:2181&lt;/span&gt;
      &lt;span class="na"&gt;KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bitnami/spark:3.5&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;kafka&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SPARK_MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;

  &lt;span class="na"&gt;spark-worker&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bitnami/spark:3.5&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;SPARK_MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
      &lt;span class="na"&gt;SPARK_MASTER_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark://spark:7077&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looked reasonable enough. It wasn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 1 — The Image That No Longer Exists
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error response from daemon: failed to resolve reference
"docker.io/bitnami/spark:3.5": not found
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;bitnami/spark:3.5&lt;/code&gt; had vanished from Docker Hub. I tried &lt;code&gt;3.5.3&lt;/code&gt;. Gone. Tried &lt;code&gt;bitnami/spark:3&lt;/code&gt;. Also gone. The entire Bitnami Spark image line had been removed with no notice.&lt;/p&gt;

&lt;p&gt;This is the first thing worth remembering before we even get to the real problem: &lt;strong&gt;third-party images on Docker Hub can disappear at any time.&lt;/strong&gt; There is no deprecation warning, no migration guide. For anything that needs to be reproducible, you either pin to a verified digest or mirror the image in a private registry.&lt;/p&gt;

&lt;p&gt;I switched to the Apache official image: &lt;code&gt;apache/spark:3.5.1-python3&lt;/code&gt;. That one pulled fine.&lt;/p&gt;
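&lt;p&gt;For the "pin to a digest" advice above, the compose change is small. A hedged sketch — the &lt;code&gt;sha256&lt;/code&gt; value here is a placeholder, not a real digest; look up the actual one with &lt;code&gt;docker images --digests&lt;/code&gt; after pulling:&lt;/p&gt;

```yaml
spark:
  # Pinning by digest survives tag deletion as long as the blob itself is
  # still hosted; mirroring the image into a private registry is safer still.
  # The digest below is a placeholder -- substitute your real one.
  image: apache/spark@sha256:0000000000000000000000000000000000000000000000000000000000000000
```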




&lt;h2&gt;
  
  
  Attempt 2 — Wrong Environment Variables
&lt;/h2&gt;

&lt;p&gt;I updated the image name but kept the same environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/spark:3.5.1-python3&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;SPARK_MODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker-compose up -d&lt;/code&gt; reported all containers as "Started." But &lt;code&gt;docker ps&lt;/code&gt; only showed two running — Kafka and ZooKeeper. The Spark containers had already exited.&lt;/p&gt;

&lt;p&gt;The problem: &lt;strong&gt;&lt;code&gt;SPARK_MODE&lt;/code&gt; is a Bitnami-specific environment variable.&lt;/strong&gt; The Apache official image has never heard of it.&lt;/p&gt;

&lt;p&gt;Bitnami's image ships with a custom entrypoint script that reads &lt;code&gt;SPARK_MODE&lt;/code&gt; and decides whether to launch a master or worker. It's a convenience layer Bitnami built on top of vanilla Spark. The Apache official image has none of this. Its default entrypoint (&lt;code&gt;/opt/entrypoint.sh&lt;/code&gt;) simply executes whatever command you pass in. If you don't pass a meaningful command, it finishes and exits.&lt;/p&gt;

&lt;p&gt;The lesson: switching between images from different publishers is not just swapping the &lt;code&gt;image:&lt;/code&gt; field. Different publishers package the same software with different entrypoints, different environment variables, and different directory layouts. Before you can use an image correctly, you need to understand how &lt;em&gt;that specific image&lt;/em&gt; expects to be started.&lt;/p&gt;




&lt;h2&gt;
  
  
  Attempt 3 — The Real Trap: &lt;code&gt;start-master.sh&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Spark comes bundled with &lt;code&gt;start-master.sh&lt;/code&gt;. That seemed like the right tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/spark:3.5.1-python3&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/opt/spark/sbin/start-master.sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same result. "Started." No Spark container.&lt;/p&gt;

&lt;p&gt;The container was starting. Spark Master was launching. And then everything was shutting down within a fraction of a second. To understand why, you need to know one foundational Docker rule.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Rule: Docker Containers Live and Die with PID 1
&lt;/h2&gt;

&lt;p&gt;Every container has a main process — specified by &lt;code&gt;CMD&lt;/code&gt;, &lt;code&gt;ENTRYPOINT&lt;/code&gt;, or &lt;code&gt;command&lt;/code&gt; in your Compose file. Inside the container, this process gets &lt;strong&gt;PID 1&lt;/strong&gt;. When PID 1 exits, the container exits. No exceptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PID 1 is running  →  container is running
PID 1 exits       →  container exits immediately
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now look at what &lt;code&gt;start-master.sh&lt;/code&gt; actually does internally (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nb"&gt;nohup &lt;/span&gt;java &lt;span class="nt"&gt;-cp&lt;/span&gt; &lt;span class="nv"&gt;$SPARK_CLASSPATH&lt;/span&gt; org.apache.spark.deploy.master.Master &amp;amp;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Master started."&lt;/span&gt;
&lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See that &lt;code&gt;&amp;amp;&lt;/code&gt;? It puts the Spark Master process into the &lt;strong&gt;background&lt;/strong&gt;. The shell script (PID 1) spawns a child Java process, prints a message, and calls &lt;code&gt;exit 0&lt;/code&gt;. The moment it does that, Docker kills the container and everything inside it — including the Spark Master that just started.&lt;/p&gt;

&lt;p&gt;Here's the exact timeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t=0.0s  Container starts; PID 1 = start-master.sh (bash)
t=0.1s  Bash forks a Java process (Spark Master) into the background
t=0.2s  Bash script reaches exit 0 → PID 1 terminates
t=0.2s  Docker detects PID 1 exit → tears down the container
t=0.2s  The background Java process is killed along with it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Spark Master was alive for about 0.2 seconds.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;start-master.sh&lt;/code&gt; was written for bare-metal servers and VMs, where you start a background daemon and the OS keeps it alive after the startup script exits. Docker doesn't work that way. Docker is watching PID 1 and only PID 1.&lt;/p&gt;
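&lt;p&gt;You can reproduce the daemon pattern outside Docker to see why it's a problem inside it. A minimal sketch (plain &lt;code&gt;sh&lt;/code&gt;, no Spark involved): the wrapper backgrounds its "daemon" and returns immediately, leaving the child orphaned — exactly the behavior that, as PID 1, would take the whole container down with it.&lt;/p&gt;

```shell
# Mimic start-master.sh: background a long-running "daemon", then exit.
# The wrapper returns in well under a second even though the child
# runs for 3 seconds. On a VM the orphan survives; as Docker's PID 1,
# this early exit would tear down the container and the child with it.
start=$(date +%s)
sh -c 'sleep 3 >/dev/null 2>&1 & echo "wrapper exiting; daemon pid: $!"'
elapsed=$(( $(date +%s) - start ))
echo "wrapper returned after ${elapsed}s"
```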




&lt;h2&gt;
  
  
  Why Kafka and ZooKeeper Didn't Have This Problem
&lt;/h2&gt;

&lt;p&gt;Confluent's images use &lt;code&gt;exec&lt;/code&gt; in their entrypoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;kafka-server-start /etc/kafka/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In bash, &lt;code&gt;exec&lt;/code&gt; &lt;strong&gt;replaces the current process&lt;/strong&gt; with the specified command. The shell doesn't fork a child — it &lt;em&gt;becomes&lt;/em&gt; Kafka. Kafka inherits PID 1, runs in the foreground, and blocks indefinitely.&lt;/p&gt;
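&lt;p&gt;You can watch the difference without Docker. A quick sketch: a plain command invocation forks a child with a new PID, while &lt;code&gt;exec&lt;/code&gt; keeps the PID because the shell is replaced rather than forked:&lt;/p&gt;

```shell
# A plain invocation forks: the inner sh gets a new PID.
# (The trailing ':' stops the shell from tail-exec'ing the last command.)
forked=$(sh -c 'echo $$; sh -c "echo \$\$"; :')
# exec replaces the process: the PID is unchanged.
execed=$(sh -c 'echo $$; exec sh -c "echo \$\$"')

fork_parent=$(echo "$forked" | sed -n 1p)
fork_child=$(echo "$forked" | sed -n 2p)
exec_before=$(echo "$execed" | sed -n 1p)
exec_after=$(echo "$execed" | sed -n 2p)

echo "fork: $fork_parent -> $fork_child (different)"
echo "exec: $exec_before -> $exec_after (same)"
```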

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Image&lt;/th&gt;
&lt;th&gt;What PID 1 Does&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cp-kafka&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;exec kafka-server-start&lt;/code&gt; (foreground, blocking)&lt;/td&gt;
&lt;td&gt;✅ Container stays alive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cp-zookeeper&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;exec zookeeper-server-start&lt;/code&gt; (foreground, blocking)&lt;/td&gt;
&lt;td&gt;✅ Container stays alive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;apache/spark&lt;/code&gt; + &lt;code&gt;start-master.sh&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Forks Java to background with &lt;code&gt;&amp;amp;&lt;/code&gt;, script exits&lt;/td&gt;
&lt;td&gt;❌ Container exits immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The entire difference: &lt;code&gt;&amp;amp;&lt;/code&gt; versus &lt;code&gt;exec&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Ways to Fix It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fix A: &lt;code&gt;tail -f /dev/null&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spark&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apache/spark:3.5.1-python3&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tail"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-f"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/dev/null"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./spark-apps:/opt/spark-apps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tail -f /dev/null&lt;/code&gt; watches a file that never gets new content. PID 1 blocks forever. Submit jobs via &lt;code&gt;docker exec&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;my-spark-container &lt;span class="se"&gt;\&lt;/span&gt;
  /opt/spark/bin/spark-submit &lt;span class="se"&gt;\&lt;/span&gt;
  /opt/spark-apps/my_job.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; local development, one-off job submission.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix B: Run the Spark Master Class Directly
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="s"&gt;bash -c "&lt;/span&gt;
  &lt;span class="s"&gt;/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master&lt;/span&gt;
  &lt;span class="s"&gt;--host spark --port 7077 --webui-port 8080&lt;/span&gt;
  &lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skips the wrapper script entirely. The Master process runs in the foreground as PID 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; when you actually need a running Master/Worker cluster.&lt;/p&gt;
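&lt;p&gt;The worker side follows the same pattern. A hedged sketch, reusing the service names from the compose file earlier in this post:&lt;/p&gt;

```yaml
spark-worker:
  image: apache/spark:3.5.1-python3
  depends_on: [spark]
  # The Worker class takes the master URL as a positional argument;
  # running it via spark-class keeps it in the foreground as PID 1.
  command: >
    bash -c "
    /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker
    spark://spark:7077 --webui-port 8081
    "
```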

&lt;h3&gt;
  
  
  Fix C: Custom Entrypoint Script
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# custom-entrypoint.sh&lt;/span&gt;
/opt/spark/sbin/start-master.sh   &lt;span class="c"&gt;# starts daemon in background&lt;/span&gt;
&lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /opt/spark/logs/&lt;span class="k"&gt;*&lt;/span&gt;         &lt;span class="c"&gt;# blocks + streams logs to stdout&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./custom-entrypoint.sh:/opt/custom-entrypoint.sh&lt;/span&gt;
&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash /opt/custom-entrypoint.sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Master auto-starts, container stays alive, and you get log output via &lt;code&gt;docker logs&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; when you want Spark to auto-start and want logs accessible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix D: Use a Docker-Friendly Image
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;jupyter/pyspark-notebook&lt;/code&gt; handles all of this correctly out of the box. Its entrypoint is built around &lt;code&gt;exec&lt;/code&gt; from the start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; quick prototyping. Tradeoff: you depend on a third party to keep the image available.&lt;/p&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Docker containers exit when PID 1 exits. Always.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;start-master.sh&lt;/code&gt; backgrounds Spark with &lt;code&gt;&amp;amp;&lt;/code&gt; and exits — which kills the container.&lt;/li&gt;
&lt;li&gt;Confluent's images use &lt;code&gt;exec&lt;/code&gt;, making the service itself PID 1 and keeping the container alive.&lt;/li&gt;
&lt;li&gt;The fix: ensure PID 1 is a foreground process that never returns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Three patterns to spot in any startup script:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;command &amp;amp;&lt;/code&gt; — background execution, PID 1 exits shortly after → container dies&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;exec command&lt;/code&gt; — replaces PID 1, container lives as long as the process does → container survives&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nohup command &amp;amp;&lt;/code&gt; — classic daemon pattern, same problem as &lt;code&gt;&amp;amp;&lt;/code&gt; in Docker → container dies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker containers are not VMs. On a VM, daemonizing a process and exiting the startup script is completely normal. In Docker, the startup script &lt;em&gt;is&lt;/em&gt; the container. Once you internalize that, most "why does my container keep exiting" questions answer themselves.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>spark</category>
      <category>dataengineering</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
