<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: SOM4N</title>
    <description>The latest articles on Forem by SOM4N (@somnathseeni).</description>
    <link>https://forem.com/somnathseeni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F864906%2Fc57d8da3-a95b-4f66-9df1-f4be12f5a281.png</url>
      <title>Forem: SOM4N</title>
      <link>https://forem.com/somnathseeni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/somnathseeni"/>
    <language>en</language>
    <item>
      <title>AWS S3 Simplified: Automate Operations Without CLI on Remote Server</title>
      <dc:creator>SOM4N</dc:creator>
      <pubDate>Wed, 01 Jan 2025 19:17:58 +0000</pubDate>
      <link>https://forem.com/somnathseeni/aws-s3-simplified-automate-operations-without-cli-on-remote-server-44i7</link>
      <guid>https://forem.com/somnathseeni/aws-s3-simplified-automate-operations-without-cli-on-remote-server-44i7</guid>
      <description>&lt;h2&gt;
  
  
  Creating a Helper Script for AWS S3 Operations on Remote Servers Without AWS CLI
&lt;/h2&gt;

&lt;p&gt;As cloud computing becomes the backbone of modern infrastructure, efficient access to AWS services like S3 is essential. But imagine you are working on a remote UNIX server where the AWS CLI is not installed, and you need to publish files to an S3 bucket. This post walks through building a helper script that solves this problem, using IAM for secure access and retrieving AWS credentials automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You are working on a remote UNIX server that will be used to do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Publish files to an AWS S3 bucket.&lt;/li&gt;
&lt;li&gt;Read and write objects in S3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The server does not have the AWS CLI installed, and managing credentials by hand is error-prone and inefficient. You need a more robust solution that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Obtain AWS credentials securely.&lt;/li&gt;
&lt;li&gt;Automate file uploads and downloads.&lt;/li&gt;
&lt;li&gt;Eliminate the dependency on the AWS CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Solution Overview
&lt;/h2&gt;

&lt;p&gt;The solution includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An IAM user with the proper S3 permissions.&lt;/li&gt;
&lt;li&gt;A helper script that retrieves the Access Key ID and Secret Access Key from AWS.&lt;/li&gt;
&lt;li&gt;S3 operations performed with these credentials.&lt;/li&gt;
&lt;li&gt;Key rotation automated every 30 days.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step-by-Step Implementation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. IAM Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create an IAM user or role with the necessary permissions to access your S3 bucket. Below is an example of an IAM policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace &lt;code&gt;your-bucket-name&lt;/code&gt; with the name of your S3 bucket.&lt;br&gt;
Attach this policy to your IAM user or role.&lt;/p&gt;
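&lt;p&gt;The deploy step below assumes a CloudFormation template, &lt;code&gt;template.yaml&lt;/code&gt;, that creates the IAM user and its access key and exports both as stack outputs. The original setup does not show this template, so the following is only an illustrative sketch (resource names are assumptions):&lt;/p&gt;

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Illustrative sketch - IAM user and access key for S3 access

Resources:
  S3AccessUser:
    Type: AWS::IAM::User
    Properties:
      Policies:
        - PolicyName: S3ReadWrite
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - s3:PutObject
                  - s3:GetObject
                Resource: arn:aws:s3:::your-bucket-name/*

  S3AccessKey:
    Type: AWS::IAM::AccessKey
    Properties:
      UserName: !Ref S3AccessUser

Outputs:
  S3AccessKeyId:
    Value: !Ref S3AccessKey        # Ref on an AccessKey returns the key ID
    Export:
      Name: S3AccessKeyId
  S3SecretAccessKey:
    # Exposing the secret in stack outputs is for demonstration only;
    # prefer AWS Secrets Manager in real deployments.
    Value: !GetAtt S3AccessKey.SecretAccessKey
    Export:
      Name: S3SecretAccessKey
```

&lt;p&gt;The export names here match the &lt;code&gt;describe-stacks&lt;/code&gt; queries used below.&lt;/p&gt;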

&lt;p&gt;Deploy the Template:&lt;br&gt;
Use the AWS Management Console or the AWS CLI to deploy the CloudFormation stack that creates the IAM user and its access key. Because the stack creates IAM resources, the CLI deploy needs the &lt;code&gt;--capabilities CAPABILITY_IAM&lt;/code&gt; flag. For example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws cloudformation deploy --template-file template.yaml --stack-name S3AccessStack --capabilities CAPABILITY_IAM&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Retrieve the Credentials:&lt;br&gt;
After the stack is created, you can retrieve the exported outputs:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws cloudformation describe-stacks --stack-name S3AccessStack \&lt;br&gt;
--query "Stacks[0].Outputs[?ExportName=='S3AccessKeyId'].OutputValue" --output text&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Similarly, retrieve the Secret Access Key:&lt;br&gt;
&lt;code&gt;aws cloudformation describe-stacks --stack-name S3AccessStack \&lt;br&gt;
--query "Stacks[0].Outputs[?ExportName=='S3SecretAccessKey'].OutputValue" --output text&lt;/code&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Writing the Helper Script&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The script achieves the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieves AWS credentials from a secure source (e.g., AWS Secrets Manager or a pre-configured file).&lt;/li&gt;
&lt;li&gt;Automates S3 operations like file upload.&lt;/li&gt;
&lt;li&gt;Rotates keys every 30 days to enhance security.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash

# File containing AWS credentials
CREDENTIALS_FILE="/path/to/credentials_file"
S3_BUCKET="your-bucket-name"

# Function to load credentials from file
load_credentials() {
  if [ ! -f "$CREDENTIALS_FILE" ]; then
    echo "Credentials file not found: $CREDENTIALS_FILE"
    exit 1
  fi

  ACCESS_KEY_ID=$(grep 'AccessKeyId' $CREDENTIALS_FILE | awk -F '=' '{print $2}')
  SECRET_ACCESS_KEY=$(grep 'SecretAccessKey' $CREDENTIALS_FILE | awk -F '=' '{print $2}')
}

# Function to update credentials
update_credentials() {
  echo "Updating credentials..."
  ACCESS_KEY_ID=$(aws cloudformation describe-stacks --stack-name S3AccessStack \
    --query "Stacks[0].Outputs[?ExportName=='S3AccessKeyId'].OutputValue" --output text)

  SECRET_ACCESS_KEY=$(aws cloudformation describe-stacks --stack-name S3AccessStack \
    --query "Stacks[0].Outputs[?ExportName=='S3SecretAccessKey'].OutputValue" --output text)

  echo -e "AccessKeyId=$ACCESS_KEY_ID\nSecretAccessKey=$SECRET_ACCESS_KEY" &amp;gt; $CREDENTIALS_FILE
  echo "Credentials updated successfully."
}

# Function to upload file to S3
upload_to_s3() {
  local file=$1
  if [ ! -f "$file" ]; then
    echo "File does not exist: $file"
    exit 1
  fi

  # Using curl to perform PUT operation
  curl -X PUT -T "$file" \
    -H "Host: $S3_BUCKET.s3.amazonaws.com" \
    -H "Date: $(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    -H "Authorization: AWS $ACCESS_KEY_ID:$SECRET_ACCESS_KEY" \
    "https://s3.amazonaws.com/$S3_BUCKET/$(basename $file)"

  echo "File uploaded successfully: $file"
}

# Main execution
if [ "$1" == "update-credentials" ]; then
  update_credentials
  exit 0
fi

if [ -z "$1" ]; then
  echo "Usage: $0 &amp;lt;file-to-upload&amp;gt; | update-credentials"
  exit 1
fi

load_credentials
upload_to_s3 "$1"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Save this script as &lt;code&gt;aws_helper.sh&lt;/code&gt; and grant it execute permission with &lt;code&gt;chmod +x aws_helper.sh&lt;/code&gt;.&lt;br&gt;
Run &lt;code&gt;./aws_helper.sh update-credentials&lt;/code&gt; every 30 days to rotate the keys and update the credentials file.&lt;/p&gt;
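<p>A note on the upload step: S3 authenticates requests with AWS Signature Version 4, so a production-ready <code>upload_to_s3</code> must send a computed signature in the <code>Authorization</code> header rather than the raw secret key. A minimal sketch of the SigV4 signing-key derivation, using only the Python standard library (the key and date values are made up for illustration):</p>

```python
import hashlib
import hmac


def derive_signing_key(secret_key, date, region, service):
    """Derive the SigV4 signing key via the documented HMAC-SHA256 chain:
    kSecret, then kDate, kRegion, kService, and finally kSigning."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()


# Illustrative values only -- never hard-code real credentials.
signing_key = derive_signing_key("EXAMPLESECRETKEY", "20250101", "us-east-1", "s3")
print(signing_key.hex())  # 64 hex characters (a 32-byte key)
```

<p>The signing key is then used to HMAC the SigV4 "string to sign" built from the canonical request; in practice, an AWS SDK or a presigned URL is far simpler than hand-rolling this in shell.</p>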
&lt;h2&gt;
  
  
  How This Script Helps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Eliminates the AWS CLI dependency:&lt;/strong&gt; the script uses &lt;code&gt;curl&lt;/code&gt; for S3 operations, so it works in environments where the AWS CLI is not installed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improves security:&lt;/strong&gt; automates key rotation and manages credentials in one place.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automation:&lt;/strong&gt; enables seamless, unattended S3 operations, reducing manual errors.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Customizable:&lt;/strong&gt; can be extended with additional S3 operations, such as deleting or listing files.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Extending the Script
&lt;/h2&gt;

&lt;p&gt;For larger-scale automation, consider integrating this script with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;AWS SDKs:&lt;/strong&gt; for more complex logic.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS CloudFormation:&lt;/strong&gt; to manage infrastructure as code.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AWS Secrets Manager:&lt;/strong&gt; to securely manage credentials.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to the AWS documentation for creating and managing your AWS resources programmatically:&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/releasehistory-aws-cfn-bootstrap.html" rel="noopener noreferrer"&gt;
      docs.aws.amazon.com
    &lt;/a&gt;
&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This helper script provides a lightweight and efficient way to perform AWS S3 operations on remote servers without the AWS CLI. By leveraging IAM, automating credential retrieval, and rotating keys, it improves security and reliability. Try it out and adapt it to your specific needs!&lt;/p&gt;

</description>
      <category>aws</category>
      <category>tutorial</category>
      <category>python</category>
      <category>learning</category>
    </item>
    <item>
      <title>Ensuring Deployment Accuracy with air sandbox diff in AbInitio</title>
      <dc:creator>SOM4N</dc:creator>
      <pubDate>Sun, 03 Nov 2024 18:02:43 +0000</pubDate>
      <link>https://forem.com/somnathseeni/ensuring-deployment-accuracy-with-air-sandbox-diff-in-abinitio-pp2</link>
      <guid>https://forem.com/somnathseeni/ensuring-deployment-accuracy-with-air-sandbox-diff-in-abinitio-pp2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recently, I came across a problem that made me realize how necessary it is to properly validate deployments, and the role the &lt;code&gt;air sandbox diff&lt;/code&gt; command plays in ensuring that all changes were deployed correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: An Overlooked Production Update
&lt;/h2&gt;

&lt;p&gt;A recent deployment required changes to an Ab Initio graph at 32 distinct points, all driven by a newly introduced business requirement; if any one of them had not been applied, the graph's behavior could be affected. &lt;br&gt;
Despite an otherwise tight development and testing cycle, one critical change was missed in the deployment. Manually hunting for a missed change in a graph of this size is laborious and error-prone, so it was obvious that an automated method was needed to identify the disparity.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using air sandbox diff for Rapid Detection
&lt;/h2&gt;

&lt;p&gt;I immediately turned to the &lt;code&gt;air sandbox diff&lt;/code&gt; command, a highly powerful Ab Initio sandbox comparison tool. It compares sandboxes and reports differences in graphs, scripts, configuration, metadata, and any other files. Here is how it quickly gave us the answer to an issue like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolating Differences:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;air sandbox diff&lt;/code&gt; compared the deployed production sandbox against the development sandbox and instantly highlighted the missed change, sparing us the painful process of poring over all 32 modification points; no detail was missed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed Analysis:&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;-verbose&lt;/code&gt; option revealed the discrepancies in enough detail to immediately expose the location of the change that was missed in production, so the problem could be resolved rather than guessed at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using air sandbox diff&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbmumgzpy0xujm8dnvh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbmumgzpy0xujm8dnvh7.png" alt="Image description" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;-verbose&lt;/code&gt; mode:&lt;/strong&gt; This provides a more detailed view, making it easier to catch anomalies, especially where significant changes have been made.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;air sandbox diff -verbose /path/to/dev_sandbox /path/to/prod_sandbox&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repeated comparisons:&lt;/strong&gt; Include &lt;code&gt;air sandbox diff&lt;/code&gt; in your list of post-deployment validation steps; use the &lt;code&gt;-summarize&lt;/code&gt; option to get a quick overview of changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;air sandbox diff -summarize /path/to/dev_sandbox /path/to/prod_sandbox&lt;br&gt;
&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ignore and exclude:&lt;/strong&gt;
To skip specific file types, such as logs, use the &lt;code&gt;-ignore&lt;/code&gt; option; to exclude files matching a pattern, such as temporary files, use &lt;code&gt;-exclude&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;air sandbox diff -ignore "*.log" /path/to/dev_sandbox /path/to/prod_sandbox
air sandbox diff -exclude "temp_*" /path/to/dev_sandbox /path/to/prod_sandbox
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Rapid Fix:&lt;/strong&gt;&lt;br&gt;
By identifying the missed change, we could apply a fix directly in production, ensuring data-flow continuity and sidestepping potential downstream impacts.&lt;/p&gt;

&lt;p&gt;This was a reminder that post-deployment validation is essential. Although &lt;code&gt;air sandbox diff&lt;/code&gt; caught the missed update effectively, we learned to add a further verification layer.&lt;/p&gt;

&lt;p&gt;We added one new step to our post-deployment checklist: the admin team now sends a summary of all changes made, as part of their post-deployment notes, to the developers and the support team for validation. &lt;/p&gt;

&lt;p&gt;Below is a skeleton script that copies the production PSET to the development server and compares the two:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/ksh

# Check if sufficient arguments are provided
if [[ $# -ne 4 ]]; then
  echo "Usage: $0 &amp;lt;dev-pset-path&amp;gt; &amp;lt;prod-server&amp;gt; &amp;lt;prod-pset-path&amp;gt; &amp;lt;local-temp-dir&amp;gt;"
  exit 1
fi

# Assign input arguments to variables
DEV_PSET=$1
PROD_SERVER=$2
PROD_PSET=$3
LOCAL_TEMP_DIR=$4

# Set local path for the temporary copy of the PROD PSET file
LOCAL_PROD_PSET="$LOCAL_TEMP_DIR/prod_pset.pset"

# Make sure the temporary directory exists
mkdir -p "$LOCAL_TEMP_DIR"

# Copy the PSET file from the PROD server to the DEV server's local directory
echo "Fetching PSET from PROD server."
scp "$PROD_SERVER:$PROD_PSET" "$LOCAL_PROD_PSET"
if [[ $? -ne 0 ]]; then
  echo "Failed to fetch PSET from PROD server."
  exit 1
fi

# Compare the PSET files (DEV vs PROD)
echo "Comparing PSET files."
diff_output=$(air sandbox diff "$DEV_PSET" "$LOCAL_PROD_PSET")

# Print the differences
if [[ -z "$diff_output" ]]; then
  echo "No differences found between DEV and PROD PSET files."
else
  echo "Differences found:"
  echo "$diff_output"
fi

# Optionally, clean up the local copy of the PROD PSET
rm -f "$LOCAL_PROD_PSET"

exit 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
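<p>For experimenting with the comparison idea outside an Ab Initio environment, Python's <code>difflib</code> can stand in for <code>air sandbox diff</code> on plain PSET files. This is only a local-testing sketch, not the Ab Initio tool:</p>

```python
import difflib
import os
import tempfile
from pathlib import Path


def pset_diff(dev_path, prod_path):
    """Return unified-diff lines between two PSET files (empty list if identical)."""
    dev_lines = Path(dev_path).read_text().splitlines(keepends=True)
    prod_lines = Path(prod_path).read_text().splitlines(keepends=True)
    return list(difflib.unified_diff(dev_lines, prod_lines,
                                     fromfile=dev_path, tofile=prod_path))


# Demo with two throwaway files standing in for the DEV and PROD PSETs.
tmp = tempfile.mkdtemp()
dev = os.path.join(tmp, "dev.pset")
prod = os.path.join(tmp, "prod.pset")
Path(dev).write_text("param1 = A\nparam2 = B\n")
Path(prod).write_text("param1 = A\nparam2 = C\n")

for line in pset_diff(dev, prod):
    print(line, end="")
```

<p>The demo prints a unified diff highlighting the changed parameter, mirroring the "Differences found" branch of the skeleton script above.</p>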



</description>
      <category>abinitio</category>
      <category>etl</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>Ab Initio Automation: How We Reduced 80% of Incidents Due to Connection Failures</title>
      <dc:creator>SOM4N</dc:creator>
      <pubDate>Mon, 21 Oct 2024 18:26:26 +0000</pubDate>
      <link>https://forem.com/somnathseeni/ab-initio-automation-how-we-reduced-80-of-incidents-due-to-connection-failures-20i6</link>
      <guid>https://forem.com/somnathseeni/ab-initio-automation-how-we-reduced-80-of-incidents-due-to-connection-failures-20i6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In my current role on a data integration team, we encountered frequent job failures caused by connection timeouts while processing data across different servers, databases, teams, and Amazon S3 buckets using Ab Initio. These issues not only disrupted our workflows but also required manual interventions, reducing the efficiency of the overall process. &lt;/p&gt;

&lt;p&gt;In this blog, I’ll explain how I implemented an automated retry mechanism that resolved these issues, reduced manual interventions, and stabilized our processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Connection Timeouts and Job Failures
&lt;/h2&gt;

&lt;p&gt;Our daily tasks involved extracting data from various databases, performing transformations, and loading it back into different servers. This workflow required seamless communication across multiple systems, yet we consistently faced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection timeouts:&lt;/strong&gt; Due to network issues, some jobs failed to complete within the allotted time, causing interruptions in data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial loads:&lt;/strong&gt; When a job failed midway due to a connection issue, it would leave data partially loaded into tables, requiring the entire process to be restarted manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual interventions:&lt;/strong&gt; Every time a job failed, the team had to re-trigger it manually and make sure it restarted from the beginning, or resolve any partial-load problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: Automation with Retry Scripts
&lt;/h2&gt;

&lt;p&gt;To address the frequent connection issues, I proposed the use of a retry script that automatically retries failed jobs a specified number of times until they successfully complete. This approach helped us avoid manual interventions, reducing downtime and improving the stability of the team’s workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sandbox=$1
pset=$2
MAX_RETRIES=$3

while [ $attempt -lt $MAX_RETRIES ]; do
  echo "Running Ab Initio job..."
  air sandbox run $sandbox/$pset

  if [ $? -eq 0 ]; then
    echo "Job completed successfully"
    exit 0
  else
    attempt=$((attempt + 1))
    echo "Job failed, attempt $attempt of $MAX_RETRIES"

    if [ $attempt -lt $MAX_RETRIES ]; then
      sleep $RETRY_DELAY
    fi
  fi
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script retries the job up to MAX_RETRIES times.&lt;/li&gt;
&lt;li&gt;If the job fails, it waits for RETRY_DELAY seconds before retrying.&lt;/li&gt;
&lt;li&gt;Upon success, the script exits. If all retries fail, the script stops and reports failure.&lt;/li&gt;
&lt;/ul&gt;
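<p>The same retry pattern can be expressed as a reusable Python helper. The sketch below uses a stub in place of the <code>air sandbox run</code> call, so the stub's name and behavior are illustrative only:</p>

```python
import time


def retry(job, max_retries, retry_delay=0):
    """Run job() until it returns 0 (success) or max_retries attempts are used.
    Returns True on success, False if every attempt failed."""
    for attempt in range(1, max_retries + 1):
        if job() == 0:
            print(f"Job completed successfully on attempt {attempt}")
            return True
        print(f"Job failed, attempt {attempt} of {max_retries}")
        if attempt != max_retries:
            time.sleep(retry_delay)
    return False


# Stub standing in for "air sandbox run": fails twice, then succeeds.
calls = {"n": 0}

def flaky_job():
    calls["n"] += 1
    return 0 if calls["n"] >= 3 else 1

print(retry(flaky_job, max_retries=5))  # True after the third attempt
```

<p>Swapping the stub for a <code>subprocess</code> call that runs the real job turns this into the shell script above.</p>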

&lt;h2&gt;
  
  
  Key Benefits of the Retry Mechanism
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;80% Reduction in On-Call Incidents: The automated retry mechanism drastically reduced the number of on-call incidents related to job failures caused by connection issues. The team no longer had to manually re-trigger jobs or deal with partial loads.&lt;/li&gt;
&lt;li&gt;Process Stability: By automatically retrying jobs, our workflow became much more stable. The script handled intermittent connection problems seamlessly, allowing jobs to resume without intervention.&lt;/li&gt;
&lt;li&gt;Improved Efficiency: With the retry logic and recovery mechanism, we avoided the inefficiency of reloading entire files from the beginning. The script resumed jobs from the failure point, improving overall performance.&lt;/li&gt;
&lt;li&gt;Automation: Automation reduced the manual burden on the team, freeing up valuable time that could be spent on more strategic tasks. The need for urgent intervention at all hours was virtually eliminated.&lt;/li&gt;
&lt;li&gt;Scalable Solution: This retry approach is not only effective for Ab Initio jobs but can also be applied to other ETL or data processing scenarios that suffer from connection-related failures&lt;/li&gt;
&lt;li&gt;This solution can be applied to any ETL or data processing scenario where connection issues may arise, and it showcases how automation can drastically improve process reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're facing similar issues in your ETL pipelines or workflows, consider implementing retry scripts tailored to your environment to overcome job failures caused by transient connection issues. Let me know how, in your experience, these issues could have been handled better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Version of the Automation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import subprocess
import sys

# Constants
RETRY_DELAY = 30

def run_job(sandbox_path, pset_name):
    # Construct the command
    command = f"air sandbox run {sandbox_path}/{pset_name}"

    try:
        # Run the command using subprocess
        result = subprocess.run(command, shell=True, check=True)
        return result.returncode
    except subprocess.CalledProcessError as e:
        return e.returncode

def retry_job(sandbox_path, pset_name, max_retries):
    attempt = 0

    while attempt &amp;lt; max_retries:
        print(f"Running Ab Initio job... Attempt {attempt + 1} of {max_retries}")

        # Run the job
        return_code = run_job(sandbox_path, pset_name)

        if return_code == 0:
            print("Job completed successfully")
            return True
        else:
            attempt += 1
            print(f"Job failed, attempt {attempt} of {max_retries}")

            if attempt &amp;lt; max_retries:
                print(f"Retrying job after {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)
            else:
                print("Max retries reached. Job failed.")
                return False

if __name__ == "__main__":
    # Accept parameters from the command line
    sandbox_path = sys.argv[1]
    pset_name = sys.argv[2]
    max_retries = int(sys.argv[3])

    # Start the retry process
    retry_job(sandbox_path, pset_name, max_retries)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling the Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python retrypset.py sandbox_path pset_name max_retries

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>abinitio</category>
      <category>etl</category>
      <category>automation</category>
      <category>programming</category>
    </item>
    <item>
      <title>AbInitio Automation: How We Reduced 80% of Incidents -Connection Failures</title>
      <dc:creator>SOM4N</dc:creator>
      <pubDate>Mon, 21 Oct 2024 18:26:26 +0000</pubDate>
      <link>https://forem.com/somnathseeni/ab-initio-automation-how-we-reduced-80-of-incidents-due-to-connection-failures-144e</link>
      <guid>https://forem.com/somnathseeni/ab-initio-automation-how-we-reduced-80-of-incidents-due-to-connection-failures-144e</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In my current role on a data integration team, we encountered frequent job failures caused by connection timeouts while processing data across different servers, databases, teams, and Amazon S3 buckets using Ab Initio. These issues not only disrupted our workflows but also required manual interventions by support teams, reducing the efficiency of the overall process. &lt;/p&gt;

&lt;p&gt;In this blog, I’ll explain how I implemented a wrapper script that resolved these issues, reduced manual interventions, and stabilized our processes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Connection Timeouts and Job Failures
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection timeouts:&lt;/strong&gt; Due to network issues, some jobs failed to complete within the allotted time, causing interruptions in data processing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial loads:&lt;/strong&gt; When a job failed midway due to a connection issue, it would leave data partially loaded into tables, requiring the entire process to be restarted manually.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual interventions:&lt;/strong&gt; Every time a job failed, the team had to re-trigger it manually and make sure it restarted from the beginning, or resolve any partial-load problems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Solution: Automation with a Generic Wrapper Script
&lt;/h2&gt;

&lt;p&gt;To address the frequent connection issues, I proposed a wrapper script that detects when a job fails with a connection issue and automatically reruns it a specified (parameterized) number of times until it completes successfully. This approach helped us avoid manual interventions, reducing downtime and improving the stability of the team’s workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
sandbox=$1
pset=$2
MAX_RETRIES=$3

while [ $attempt -lt $MAX_RETRIES ]; do
  echo "Running Ab Initio job..."
  air sandbox run $sandbox/$pset

  if [ $? -eq 0 ]; then
    echo "Job completed successfully"
    exit 0
  else
    attempt=$((attempt + 1))
    echo "Job failed, attempt $attempt of $MAX_RETRIES"

    if [ $attempt -lt $MAX_RETRIES ]; then
      sleep $RETRY_DELAY
    fi
  fi
done

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The script retries the job up to MAX_RETRIES times.&lt;/li&gt;
&lt;li&gt;If the job fails, it waits for RETRY_DELAY seconds before retrying.&lt;/li&gt;
&lt;li&gt;Upon success, the script exits. If all retries fail, the script stops and reports failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Benefits of the Retry Mechanism
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;80% Reduction in On-Call Incidents: The automated retry mechanism drastically reduced the number of on-call incidents related to job failures caused by connection issues. The team no longer had to manually re-trigger jobs or deal with partial loads.&lt;/li&gt;
&lt;li&gt;Process Stability: By automatically retrying jobs, our workflow became much more stable. The script handled intermittent connection problems seamlessly, allowing jobs to resume without intervention.&lt;/li&gt;
&lt;li&gt;Improved Efficiency: With the retry logic and recovery mechanism, we avoided the inefficiency of reloading entire files from the beginning. The script resumed jobs from the failure point, improving overall performance.&lt;/li&gt;
&lt;li&gt;Automation: Automation reduced the manual burden on the team, freeing up valuable time that could be spent on more strategic tasks. The need for urgent intervention at all hours was virtually eliminated.&lt;/li&gt;
&lt;li&gt;Scalable Solution: This retry approach is not only effective for Ab Initio jobs but can also be applied to other ETL or data processing scenarios that suffer from connection-related failures&lt;/li&gt;
&lt;li&gt;This solution can be applied to any ETL or data processing scenario where connection issues may arise, and it showcases how automation can drastically improve process reliability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're facing similar issues in your ETL pipelines or workflows, consider implementing retry scripts tailored to your environment to overcome job failures caused by transient connection issues. Let me know how, in your experience, these issues could have been handled better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Version of the Automation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time
import subprocess
import sys

# Constants
RETRY_DELAY = 30

def run_job(sandbox_path, pset_name):
    # Construct the command
    command = f"air sandbox run {sandbox_path}/{pset_name}"

    try:
        # Run the command using subprocess
        result = subprocess.run(command, shell=True, check=True)
        return result.returncode
    except subprocess.CalledProcessError as e:
        return e.returncode

def retry_job(sandbox_path, pset_name, max_retries):
    attempt = 0

    while attempt &amp;lt; max_retries:
        print(f"Running Ab Initio job... Attempt {attempt + 1} of {max_retries}")

        # Run the job
        return_code = run_job(sandbox_path, pset_name)

        if return_code == 0:
            print("Job completed successfully")
            return True
        else:
            attempt += 1
            print(f"Job failed, attempt {attempt} of {max_retries}")

            if attempt &amp;lt; max_retries:
                print(f"Retrying job after {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)
            else:
                print("Max retries reached. Job failed.")
                return False

if __name__ == "__main__":
    # Accept parameters from the command line
    sandbox_path = sys.argv[1]
    pset_name = sys.argv[2]
    max_retries = int(sys.argv[3])

    # Start the retry process
    retry_job(sandbox_path, pset_name, max_retries)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calling the Python script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python retrypset.py sandbox_path pset_name max_retries

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>abinitio</category>
      <category>etl</category>
      <category>automation</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
