<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: afnan amin</title>
    <description>The latest articles on Forem by afnan amin (@afnan_amin_67fbd716f146a6).</description>
    <link>https://forem.com/afnan_amin_67fbd716f146a6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1997336%2F73cd1c20-7b32-44e4-8afd-8413f6e22ac9.jpg</url>
      <title>Forem: afnan amin</title>
      <link>https://forem.com/afnan_amin_67fbd716f146a6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/afnan_amin_67fbd716f146a6"/>
    <language>en</language>
    <item>
      <title>Journey with Apache Flink on AWS EKS</title>
      <dc:creator>afnan amin</dc:creator>
      <pubDate>Tue, 16 Dec 2025 10:00:43 +0000</pubDate>
      <link>https://forem.com/afnan_amin_67fbd716f146a6/journey-with-apache-flink-on-aws-eks-39n6</link>
      <guid>https://forem.com/afnan_amin_67fbd716f146a6/journey-with-apache-flink-on-aws-eks-39n6</guid>
      <description>&lt;p&gt;When building streaming pipelines, one of the biggest challenges isn’t just processing data , it’s keeping jobs running reliably while data continuously flows. Apache Flink addresses this challenge effectively.&lt;/p&gt;

&lt;h2&gt;Discovering Flink&lt;/h2&gt;

&lt;p&gt;Apache Flink is a distributed stream-processing framework built for stateful, real-time data processing. Unlike batch-first frameworks, Flink treats streams as the core abstraction, with batch as just a special case.&lt;/p&gt;

&lt;p&gt;Flink stands out due to its handling of state, time, and fault tolerance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once processing guarantees&lt;/li&gt;
&lt;li&gt;Native state management for long-running jobs&lt;/li&gt;
&lt;li&gt;Event-time processing with watermarks&lt;/li&gt;
&lt;li&gt;Checkpointing that makes failures recoverable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made Flink feel less like a tool and more like a platform for always-on systems.&lt;/p&gt;

&lt;h2&gt;Why We Chose Flink&lt;/h2&gt;

&lt;p&gt;Streaming pipelines evolve over time: logic changes, schemas expand, and performance tuning becomes necessary. Flink’s checkpoint-based execution model allowed us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restart jobs without breaking downstream systems&lt;/li&gt;
&lt;li&gt;Roll out logic or configuration changes safely&lt;/li&gt;
&lt;li&gt;Treat streaming jobs as living systems, not one-off deployments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its combination of stateful processing, exactly-once semantics, and safe evolution made Flink the clear choice for our production pipelines.&lt;/p&gt;

&lt;h2&gt;Running Flink on AWS EKS&lt;/h2&gt;

&lt;p&gt;Once we decided on Flink, we had to figure out where and how to run it reliably. AWS EKS provided:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A managed Kubernetes control plane&lt;/li&gt;
&lt;li&gt;Native integration with AWS services&lt;/li&gt;
&lt;li&gt;Consistent environments across dev, test, and prod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make Flink truly Kubernetes-native, we adopted the Flink Kubernetes Operator.&lt;/p&gt;

&lt;p&gt;Installing it via Helm added a &lt;code&gt;FlinkDeployment&lt;/code&gt; CRD to the cluster. From that point, our deployments became fully declarative: we define the desired state in YAML, and the operator continuously reconciles it.&lt;/p&gt;
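&lt;p&gt;A minimal sketch of such a spec, assuming a session cluster (the image tag, Flink version, and resource sizes here are illustrative, not our production values):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flink-session-v2
  namespace: flink
spec:
  image: flink:1.17          # illustrative image tag
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the spec has no &lt;code&gt;job&lt;/code&gt; section, the operator brings it up as a session cluster that separate job resources can attach to.&lt;/p&gt;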

&lt;p&gt;The operator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Launches the JobManager pod (hosting the Flink UI)&lt;/li&gt;
&lt;li&gt;Scales TaskManagers as needed&lt;/li&gt;
&lt;li&gt;Configures networking, volumes, and service accounts&lt;/li&gt;
&lt;li&gt;Manages job restarts, upgrades, and recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This drastically reduced operational overhead and made Flink cloud-native and production-ready.&lt;/p&gt;

&lt;h2&gt;Deploying Flink Jobs&lt;/h2&gt;

&lt;p&gt;In practice, we separate cluster management from job management:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FlinkDeployment&lt;/strong&gt; defines the session cluster (image, resources, Flink config, EFS mounts)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FlinkSessionJob&lt;/strong&gt; defines the actual streaming job (entrypoint, arguments, parallelism, upgrade mode)&lt;/p&gt;

&lt;p&gt;Most of our deployments happen via Terraform: we render YAML templates and apply them with the &lt;code&gt;kubernetes_manifest&lt;/code&gt; resource.&lt;/p&gt;
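&lt;p&gt;As a rough sketch of that Terraform wiring (the template path and variables are hypothetical, not our actual module layout):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Render a YAML template and hand the result to the Kubernetes API.
resource "kubernetes_manifest" "flink_session_job" {
  manifest = yamldecode(templatefile(
    "${path.module}/templates/flink-session-job.yaml.tpl",  # hypothetical path
    {
      job_name    = "flink-job-gg-pending-orders"
      parallelism = 1
    }
  ))
}
&lt;/code&gt;&lt;/pre&gt;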

&lt;p&gt;Here’s a simplified example of a FlinkSessionJob:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: flink-job-gg-pending-orders
  namespace: flink
spec:
  deploymentName: flink-session-v2
  job:
    entryClass: "org.apache.flink.client.python.PythonDriver"
    args:
      - "-py"
      - "/opt/flink/lib/src/s3_to_iceberg.py"
    parallelism: 1
    upgradeMode: stateless
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;💡 Tip: To rerun a job from the last checkpoint, switch &lt;code&gt;upgradeMode&lt;/code&gt; to &lt;code&gt;last-state&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;Operating Jobs Safely&lt;/h2&gt;

&lt;p&gt;This is where Flink’s strength shines. For initial runs, we typically start jobs stateless. But for restarts, backfills, or upgrades, we rely on checkpoint-based recovery using upgradeMode: last-state.&lt;/p&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jobs resume from the latest successful checkpoint&lt;/li&gt;
&lt;li&gt;Downstream systems remain stable&lt;/li&gt;
&lt;li&gt;Minimal gaps or duplicates, even for CDC sources&lt;/li&gt;
&lt;/ul&gt;
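&lt;p&gt;A sketch of the relevant configuration, split across the two resources (the interval and checkpoint bucket are illustrative placeholders, not our real values):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# On the session cluster (FlinkDeployment) -- illustrative values
flinkConfiguration:
  execution.checkpointing.interval: "60s"
  state.backend: rocksdb
  state.checkpoints.dir: s3://example-bucket/flink/checkpoints  # hypothetical bucket

# On the job (FlinkSessionJob)
job:
  upgradeMode: last-state   # resume from the latest checkpoint on restart/upgrade
&lt;/code&gt;&lt;/pre&gt;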

&lt;h2&gt;Evolving Jobs Over Time&lt;/h2&gt;

&lt;p&gt;We handle changes carefully:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration Changes (parallelism, resources, checkpoint tuning):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update the job or cluster spec&lt;/li&gt;
&lt;li&gt;Apply via Terraform or kubectl&lt;/li&gt;
&lt;li&gt;Operator restores state automatically&lt;/li&gt;
&lt;/ul&gt;
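&lt;p&gt;For example, a parallelism bump is just a one-line spec edit followed by an apply (values illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;spec:
  job:
    parallelism: 2            # was 1
    upgradeMode: last-state   # operator stops the job with state, then restarts it
&lt;/code&gt;&lt;/pre&gt;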

&lt;p&gt;&lt;strong&gt;Business Logic Changes (Python, SQL, JARs):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Push updated code to S3&lt;/li&gt;
&lt;li&gt;Sync to EFS via AWS DataSync&lt;/li&gt;
&lt;li&gt;Verify files in Flink containers&lt;/li&gt;
&lt;li&gt;Perform a rolling, stateful upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This process allows us to iterate safely, even on critical streaming pipelines.&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;From choosing Flink for its stateful streaming model, to running it on AWS EKS, and finally operating jobs safely with the Flink Kubernetes Operator, this journey has shaped how we build and maintain streaming pipelines.&lt;/p&gt;

&lt;p&gt;Flink is more than a processing engine; it’s a platform for evolving, long-running data pipelines. And running it Kubernetes-native on AWS lets us balance operational safety, scalability, and flexibility.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>data</category>
    </item>
  </channel>
</rss>
