<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: MeteorOps</title>
    <description>The latest articles on Forem by MeteorOps (@meteorops).</description>
    <link>https://forem.com/meteorops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7873%2Fad0d7cdb-d642-474a-9592-c8d08685ca13.png</url>
      <title>Forem: MeteorOps</title>
      <link>https://forem.com/meteorops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/meteorops"/>
    <language>en</language>
    <item>
      <title>How to Handle Alert Fatigue</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Fri, 28 Mar 2025 13:42:51 +0000</pubDate>
      <link>https://forem.com/meteorops/how-to-handle-alert-fatigue-fao</link>
      <guid>https://forem.com/meteorops/how-to-handle-alert-fatigue-fao</guid>
      <description>&lt;p&gt;A very important aspect for many developers and DevOps is handling alerts. An even more important and often overlooked aspect is alert fatigue. Alert fatigue is caused by a high volume of alerts of which many are false positives, related alerts, or duplicates. Alert handlers become so used to alerts that they disregard and miss important ones. It takes only one or two missed alerts to bring a system to a halt. However, alert fatigue is nothing but a symptom of the underlying issues that need to be addressed.‍&lt;/p&gt;

&lt;h2&gt;
  
  
  What We'll Cover: Alert Fatigue and Alertmanager
&lt;/h2&gt;

&lt;p&gt;In this article, we discuss the factors that can lead to alert fatigue, how to identify them, and how to handle them. In addition, we introduce &lt;a href="https://www.meteorops.com/blog/how-to-handle-alert-fatigue" rel="noopener noreferrer"&gt;Prometheus Alertmanager&lt;/a&gt; and show how to use it to handle the very scenarios that lead to alert fatigue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before diving in, there are a few prerequisites if you want to follow along with the article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://artifacthub.io/packages/helm/prometheus-community/alertmanager" rel="noopener noreferrer"&gt;Prometheus installed&lt;/a&gt; - installed via the Helm chart&lt;/li&gt;
&lt;li&gt;Alertmanager installed - Alertmanager is installed by default with the Prometheus Helm chart&lt;/li&gt;
&lt;li&gt;Working knowledge of YAML&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Too Many Alerts
&lt;/h2&gt;

&lt;p&gt;Developers and DevOps are busy people. When a significant part of their time is spent looking into or muting alerts that don't matter, we are looking at alert fatigue. To handle this situation, we need to ask why there are so many alerts and why so many of them are false positives, duplicates, or simply of no value.&lt;/p&gt;
&lt;h3&gt;
  
  
  Easily Triggered Alerts
&lt;/h3&gt;

&lt;p&gt;There's a tendency to sometimes overdo alert configuration, mainly as a precaution. For example, we would like to know if a service is using a lot of memory or if its CPU is spiking so that we can react in time before it develops into a service disruption. In an attempt to have a preemptive, all-encompassing view of the system, we sometimes configure too many alerts with unjustified thresholds. To justify an alert, we need to ask whether the CPU or memory spike is really a concern. It's not uncommon for services to work a little harder at times. If the historical data shows spikes that resolve themselves and don't correspond to incidents, the alert should not be configured at all. Besides, if we are concerned with service disruption due to resource shortage, we should consider automatic scaling, not more alerts.&lt;/p&gt;
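
&lt;p&gt;One way to make an alert harder to trigger on transient spikes is to require the condition to hold for a sustained period before the alert fires. As a sketch, a Prometheus alerting rule can use the &lt;code&gt;for&lt;/code&gt; field; the metric, threshold, and duration below are illustrative, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative alerting rule - names and thresholds are examples
groups:
  - name: resource-alerts
    rules:
      - alert: HighCpuSustained
        # average per-pod CPU usage above 90% of a core
        expr: avg by (pod) (rate(container_cpu_usage_seconds_total[5m])) &amp;gt; 0.9
        # short-lived spikes that resolve within 15 minutes never fire
        for: 15m
        labels:
          severity: warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;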
&lt;h3&gt;
  
  
  Alert Configuration Cleanup
&lt;/h3&gt;

&lt;p&gt;Not adding more alerts is one half of prevention as the best medicine; dealing with the ones that are already set is the other. Existing alerts should be examined through historical and operational lenses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have alerts been triggered many times before?&lt;/li&gt;
&lt;li&gt;Have they self-resolved?&lt;/li&gt;
&lt;li&gt;Have they really indicated some system or service disruption?&lt;/li&gt;
&lt;li&gt;Are they perhaps related to other more significant alerts?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions will help us discover alerts that can be removed to prevent alert fatigue.&lt;/p&gt;
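
&lt;p&gt;Some of these questions can be answered with data rather than memory. Prometheus exposes an &lt;code&gt;ALERTS&lt;/code&gt; time series for every evaluated alert, so a query like the following (the 30-day window is just an example) shows how much time each alert has spent firing; alerts that fire constantly yet never correspond to incidents are prime cleanup candidates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Samples in which each alert was in the firing state over the last 30 days -
# a rough proxy for how noisy the alert is
sort_desc(count_over_time(ALERTS{alertstate="firing"}[30d]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;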

&lt;p&gt;However, for mature and perhaps more complicated systems with many services and moving parts, going over hundreds of configured alerts and determining whether they are important or redundant can prove quite difficult and time-consuming. There's always the question of ROI when it comes to cleaning up alerts. If we spend a sprint cleaning up alerts, it might benefit us down the line by reducing alert fatigue, but we just missed a sprint in which we could have delivered features and bug fixes. The trade-off is always there. We'd still argue that alert cleanup should take place, even if in small increments. While we are cleaning them up bit by bit, we can introduce an alert manager whose purpose is to provide additional protection against alert fatigue.&lt;/p&gt;
&lt;h2&gt;
  
  
  Alert Manager to Handle and Prevent Alert Fatigue
&lt;/h2&gt;

&lt;p&gt;An alert manager like Prometheus Alertmanager provides a robust way to manage alerts. In the context of handling alert fatigue, its most significant capability is the ability to group and inhibit alerts.&lt;/p&gt;
&lt;h3&gt;
  
  
  Correlating and grouping alerts
&lt;/h3&gt;

&lt;p&gt;Let's look at an example of correlating and grouping alerts. Imagine a scenario in which a datastore such as a database, search engine, or queue manager is reporting high CPU and memory consumption. Services that depend on this datastore might have difficulty communicating with it, and they too might trigger alerts indicating that they cannot reach the datastore. With Prometheus Alertmanager, it is possible to configure that if there's an alert for the datastore's resource consumption, all related alerts are grouped together and sent to the same receiver. This way we can see both the underlying cause and which services are affected.&lt;/p&gt;

&lt;p&gt;Given that alerts are properly labeled and configured to include the service name, team, cluster, region, and any other attribute of significance, we can configure a route as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alertmanager.yaml:
global: {}
receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: 'http://example.com/webhook'
  - name: 'data-dev-receiver'
    webhook_configs:
      - url: 'http://example.com/webhook-data-dev'
route:
  group_interval: 5m
  group_wait: 10s
  receiver: default-receiver
  repeat_interval: 3h

  routes:
  - matchers:
      - team="data-dev"
    group_by: ['cluster', 'database']
    receiver: 'data-dev-receiver'
    continue: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the routes directive, we are saying that any alert meant for the data-dev team should be grouped under the data-dev-receiver by cluster and database. To simulate, we can run these commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -H 'Content-Type: application/json' \
     -d '[{
           "labels": {
             "alertname": "DatabaseUnreachable",
             "team": "data-dev",
             "service": "aggregator",
             "database": "analytics",
             "cluster": "prod",
             "severity": "critical"
           }
         }]' \
     http://localhost:9093/api/v2/alerts

curl -H 'Content-Type: application/json' \
     -d '[{
           "labels": {
             "alertname": "DatabaseResourceConsumptionHigh",
             "team": "data-dev",
             "database": "analytics",
             "cluster": "prod",
             "severity": "critical"
           }
         }]' \
     http://localhost:9093/api/v2/alerts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is what it would look like in Alertmanager:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy9m3yu9679jszlz0chk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foy9m3yu9679jszlz0chk.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inhibiting and Suppressing Alerts
&lt;/h3&gt;

&lt;p&gt;In other cases we want to suppress related alerts rather than group them with the underlying alert. For example, a database has the following alerts configured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consuming too many resources&lt;/li&gt;
&lt;li&gt;Unreachable&lt;/li&gt;
&lt;li&gt;Connections about to max out&lt;/li&gt;
&lt;li&gt;File descriptors limit is about to be reached&lt;/li&gt;
&lt;li&gt;Queries take too long to execute&lt;/li&gt;
&lt;li&gt;Many failed queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any of these alerts can be followed by the others. If an alert triggers for resource consumption, then an unreachable alert might also trigger. If queries take too long to execute, an alert for resource consumption might trigger as well.&lt;/p&gt;

&lt;p&gt;So any alert might be followed by other alerts, creating a cascade of incoming alerts when all that is needed is just the one. Instead of having them all trigger one after another, Prometheus Alertmanager can be configured to suppress a subset of alerts while a specific alert is active. To handle this situation, we can create the following inhibit rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inhibit_rules:

  - source_matchers:
      - alertname="DatabaseResourceConsumptionHigh"
    target_matchers:
      - alertname="DatabaseUnreachable"
    equal: ['database']

  - source_matchers:
      - alertname="DatabaseResourceConsumptionHigh"
    target_matchers:
      - alertname="DatabaseSlowQueries"
    equal: ['database']

  - source_matchers:
      - alertname="DatabaseResourceConsumptionHigh"
    target_matchers:
      - alertname="DatabaseFailedQueries"
    equal: ['database']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We're saying that if there's an active DatabaseResourceConsumptionHigh alert for a specific database, any DatabaseUnreachable, DatabaseSlowQueries, or DatabaseFailedQueries alert for the same database will be inhibited. Inhibited means that the alert will not send a notification but is still visible on demand. The idea is that if there's an alert on resource consumption, we already know something is going on with the database. We don't need to be paged for all the other issues, as they are probably related, yet we still have them at hand for investigation.&lt;/p&gt;

&lt;p&gt;This is what it would look like in Alertmanager:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inhibited alerts don't appear when Inhibited is not selected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd0oi5qqi8aw2ejbe0fj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsd0oi5qqi8aw2ejbe0fj.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inhibited alerts appear when Inhibited is selected&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F166cnvp6oogx7gijmlmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F166cnvp6oogx7gijmlmi.png" alt="Image description" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Personally Contributing to Prevent Alert Fatigue
&lt;/h2&gt;

&lt;p&gt;The ability to suppress alerts using Prometheus Alertmanager can also be used by the individual contributor to help prevent alert fatigue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Taking Ownership of Alerts
&lt;/h3&gt;

&lt;p&gt;An engineer who is making changes to the system and knows that alerts might trigger should use Alertmanager's ability to silence and inhibit alerts. It goes without saying how disruptive it is to receive a high number of alerts in the middle of a workday, especially when these alerts are false positives that could have been silenced in advance.&lt;/p&gt;
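
&lt;p&gt;As a sketch of what this looks like in practice, silences can be created ahead of a planned change with &lt;code&gt;amtool&lt;/code&gt;, the Alertmanager CLI. The matcher value, duration, and URL below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Silence all alerts labeled with the affected database for two hours
amtool silence add database="analytics" \
    --alertmanager.url=http://localhost:9093 \
    --duration=2h \
    --author="engineer@example.com" \
    --comment="Planned schema migration"

# Review active silences, and expire the silence once the change is done
amtool silence query --alertmanager.url=http://localhost:9093
amtool silence expire &amp;lt;silence-id&amp;gt; --alertmanager.url=http://localhost:9093
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;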

&lt;p&gt;If the alerts cannot be silenced because they are required as indicators during operational changes, the engineer making the changes should assume the on-call (PD or DOD) shift for their duration. It might sound trivial, but from personal experience we can testify that this is a common problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling Alert Fatigue is a Continuous Process
&lt;/h2&gt;

&lt;p&gt;Using an alert manager and personally assuming responsibility over alerts are the two determining factors in handling alert fatigue. They will greatly reduce the number of alerts and pages, mitigating and preventing alert fatigue. However, these are but tools and methods to achieve the goal; they have to be used and revisited constantly.&lt;/p&gt;

&lt;p&gt;It's important to remember that the effort to manage alerts and avoid alert fatigue is an endless process. As systems evolve, their monitoring stacks evolve as well. Alert-handling rules that are currently configured might have been set up with oversights and should now be revisited. Think of it as monitoring for your monitoring: we have alerts and alerting rules in place, but we always have to ask whether they are properly set and whether conditions have changed enough to merit revising them.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>alertmanager</category>
      <category>prometheus</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Why &amp; When to use GitOps: Use-Cases &amp; Principles</title>
      <dc:creator>Michael Zion</dc:creator>
      <pubDate>Tue, 11 Mar 2025 21:43:41 +0000</pubDate>
      <link>https://forem.com/meteorops/why-when-to-use-gitops-use-cases-principles-4e64</link>
      <guid>https://forem.com/meteorops/why-when-to-use-gitops-use-cases-principles-4e64</guid>
      <description>&lt;p&gt;&lt;strong&gt;Optimizing Kubernetes Management with GitOps and CD Tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We’ve come a long way from the days of deploying releases to individual machines, whether manually or by an automated process. &lt;/p&gt;

&lt;p&gt;The de-facto standard today is to deploy and manage the application stack using an orchestrator, usually &lt;a href="https://www.meteorops.com/glossary/kubernetes" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, &lt;a href="https://www.meteorops.com/blog/practical-tips-for-kubernetes-upgrades-for-startups" rel="noopener noreferrer"&gt;Kubernetes carries its own challenges&lt;/a&gt;. Although you can manage your stack in one centralized place, you still have to manage deployments, updates, rollbacks, and the overall infra.&lt;/p&gt;

&lt;p&gt;When Kubernetes was in its early years, services were deployed either manually or through CI/CD using &lt;code&gt;kubectl&lt;/code&gt; or Helm.&lt;/p&gt;

&lt;p&gt;There was nothing to describe the desired state and no easy way to keep track of changes or revert them. To solve these issues, &lt;a href="https://www.meteorops.com/glossary/gitops" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt; was added to the mix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Single Source of Truth with Increased Visibility and Operability
&lt;/h2&gt;

&lt;p&gt;The idea behind GitOps is a git repo that describes the desired state of the infrastructure and the application stack that runs on it. &lt;/p&gt;

&lt;p&gt;Combined with CI/CD, GitOps allows for fast and efficient infra provisioning and application lifecycle management. &lt;/p&gt;

&lt;p&gt;In the context of GitOps and Kubernetes, any change to the source code triggers a change in the cluster or application stack. &lt;/p&gt;

&lt;p&gt;Changes are tracked and releases are tagged with a version and can be rolled back as fast as they can be deployed. &lt;/p&gt;

&lt;p&gt;Best of all, no one is running manual commands, which can be very error-prone. In essence, GitOps centralises all operations related to deployments and infra management while allowing various teams to both collaborate and work independently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continuous Delivery Tools to Enhance GitOps
&lt;/h2&gt;

&lt;p&gt;However, GitOps and CI/CD for Kubernetes alone still fall short of fully managing the infra and application stack.&lt;/p&gt;

&lt;p&gt;Managing deployments and infra becomes easier with GitOps thanks to declarative continuous delivery tools for Kubernetes such as &lt;strong&gt;&lt;a href="https://www.meteorops.com/glossary/argo-cd" rel="noopener noreferrer"&gt;Argo CD&lt;/a&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;a href="https://www.meteorops.com/glossary/flux-cd" rel="noopener noreferrer"&gt;Flux CD&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These tools act as a wrapper around your codebase, translating everything declarative into actual cluster state. Your codebase holds all of your Kubernetes manifests, Helm charts, and configuration files. &lt;strong&gt;Argo CD&lt;/strong&gt; then takes them, evaluates them, and applies them to the cluster.&lt;/p&gt;
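
&lt;p&gt;For illustration, a minimal Argo CD &lt;code&gt;Application&lt;/code&gt; that points the cluster at a Git repo might look like this; the repo URL, path, and namespaces are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git  # placeholder repo
    targetRevision: main
    path: k8s/my-service   # directory holding manifests or a Helm chart
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD runs in
    namespace: my-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;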

&lt;p&gt;Such tools also offer an intuitive and convenient UI to visualize and manage everything that is going on in the cluster.&lt;/p&gt;

&lt;p&gt;It’s good to be able to revert a deployment or fix a configuration issue through the UI and on the spot rather than going through the process of committing code and waiting for it to sync. This is especially true in case of an emergency or when developing and debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Benefits of Combining GitOps and CD Tools
&lt;/h2&gt;

&lt;p&gt;When put into the context of the everyday work of developers and &lt;a href="https://www.meteorops.com/glossary/devops" rel="noopener noreferrer"&gt;DevOps&lt;/a&gt;, the real benefits of GitOps and CD tools come to light. What they offer is an intuitive, secure, easy-to-use framework for management and collaboration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ease of Use and Visibility into the State of the Application and Cluster
&lt;/h3&gt;

&lt;p&gt;The truth is that setting up a GitOps codebase with CD tools, although not too difficult, can be a complex task. However, once in place, it grants ease of use with increased security.&lt;/p&gt;

&lt;p&gt;Reading Kubernetes manifests to understand the complex relationship of services and components is not easy. It’s even harder to debug and troubleshoot issues with no centralized control dashboard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s easier with CD tools because they visualize everything that the code represents.&lt;/li&gt;
&lt;li&gt;It’s easier to discern the components of a service or the various CRDs that have been added to the cluster when they appear in front of you with all their dependencies and associations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CD tools have robust control, logging, and debugging features that allow developers and DevOps to quickly assess the status of the cluster and respond accordingly. There’s no need to run &lt;code&gt;kubectl&lt;/code&gt; commands or tail pod logs; everything is intuitively visualized in the CD tool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secure Operations
&lt;/h3&gt;

&lt;p&gt;Having everything visualised and controllable from outside the cluster does away with the need to access the cluster itself.&lt;/p&gt;

&lt;p&gt;‍&lt;a href="https://www.meteorops.com/blog/from-gatekeepers-to-service-providers-reimagining-devops-relationship-with-developers#the-relationship-between-devops-and-developers" rel="noopener noreferrer"&gt;Developers and DevOps&lt;/a&gt; can be given roles through RBAC to determine which resources they have access to and what they can do in the CD tool itself rather than in the cluster. &lt;/p&gt;

&lt;p&gt;Although access is ultimately given to components of the cluster and application, it’s still better than having to manage service accounts and having dozens of developers running &lt;code&gt;kubectl exec&lt;/code&gt; into pods.&lt;/p&gt;

&lt;p&gt;For additional security, CD tools sit inside the cluster and communicate with the native Kubernetes API, so their work takes place inside the cluster rather than outside it. There’s no need to open the cluster to external CI/CD pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Recovery, Consistency, and Guardrails
&lt;/h3&gt;

&lt;p&gt;Recovery is built into CD tools: they continuously monitor the state of the cluster and compare it with what the code says.&lt;/p&gt;

&lt;p&gt;In case of a mismatch, the CD tool attempts to reconcile between the two. If it fails, it will retry several times before sending an alert and even pausing sync. &lt;/p&gt;

&lt;p&gt;Retries are especially important in complex systems with vast networks of dependencies between services and components. &lt;/p&gt;

&lt;p&gt;Even if the code dictates service dependencies and the order of deployment, dependencies could still fail. &lt;/p&gt;

&lt;p&gt;By retrying, CD tools give all components of the system more time and more chances to come online before failing the deployment and pausing progress.&lt;/p&gt;

&lt;p&gt;Since there are no imperative commands running without a state, GitOps prevents orphaned resources. If you remove a manifest file or parts of it, the CD tool will sync, without any CI/CD pipeline, and remove the corresponding resources and components.&lt;/p&gt;
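
&lt;p&gt;In Argo CD, for example, this removal of resources whose manifests were deleted is called pruning, and it is enabled together with self-healing on the Application’s sync policy (shown here as a fragment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec:
  syncPolicy:
    automated:
      prune: true      # delete cluster resources removed from Git
      selfHeal: true   # revert manual drift back to the state in Git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;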

&lt;h3&gt;
  
  
  Management of Multiple Environments and Applications
&lt;/h3&gt;

&lt;p&gt;With inherent security, ease of use, and built-in recovery, GitOps and CD tools are easily expanded to support the configuration and management of multi-env, multi-app stacks. It becomes much easier to separate environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.meteorops.com/blog/one-click-environment-the-ultimate-devops-goal#why-it-improves-the-system%E2%80%99s-recoverability" rel="noopener noreferrer"&gt;Different environments for development and production&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Regional stacks for compliance or disaster recovery.&lt;/li&gt;
&lt;li&gt;Separate applications for business units in the organization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the initial setup is most of what it takes to get GitOps and CD tools going, it’s easy to set up multiple stacks and manage them both separately, since they are different stacks, and side by side in one place, since they are centrally managed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Application Needs GitOps and CD Tools
&lt;/h2&gt;

&lt;p&gt;Keeping the application stack and cluster in order and in the desired state is the main function of GitOps and CD tools. There’s no guessing how the cluster should look, what should be in it, or who or what put it there. Everything is conveniently visualized, with a vast array of actions that can be performed right from within the UI.&lt;/p&gt;

&lt;p&gt;GitOps and CD tools give developers and DevOps the ability to work independently in a secure and safe way: secure in terms of access that is restricted to the CD tool itself, and safe in terms of the ability to quickly revert changes and the self-healing properties of CD tools.&lt;/p&gt;

&lt;p&gt;It’s not the easiest task to set up GitOps with CD tools, but the agility, convenience, productivity, and security that they offer allow for well-managed operations. Well-managed operations then allow for faster and safer development pipelines to further advance your application and product.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>argocd</category>
    </item>
    <item>
      <title>From gatekeepers to service providers: reimagining DevOps relationship with developers</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Fri, 28 Feb 2025 19:27:44 +0000</pubDate>
      <link>https://forem.com/meteorops/from-gatekeepers-to-service-providers-reimagining-devops-relationship-with-developers-2k51</link>
      <guid>https://forem.com/meteorops/from-gatekeepers-to-service-providers-reimagining-devops-relationship-with-developers-2k51</guid>
      <description>&lt;p&gt;Most of us, if not all of us, are service providers. It doesn’t matter what position we are in. Each person in an organisation renders service onto the organisation, even the CEO and board of directors. Perhaps the only ones who are not service providers are the investors but perhaps they too render their service onto someone or something else. We begin with the notion of service providers because it is a crucial factor in the relationship between DevOps and developers. &lt;/p&gt;




&lt;h2&gt;
  
  
  The Relationship between DevOps and Developers
&lt;/h2&gt;

&lt;p&gt;The relationship between &lt;a href="https://www.meteorops.com/glossary/devops" rel="noopener noreferrer"&gt;DevOps&lt;/a&gt; and developers is as delicate and complicated as it is crucial for the whole organisation. Delicate because any friction causes setbacks in development. Complicated because it spans across many domains of knowledge and various requirements. And crucial for the whole organisation because if it’s not optimal, developers don’t deliver quality releases on time.&lt;/p&gt;

&lt;p&gt;Luckily, the core of the relationship between DevOps and developers is technological, so handling challenges and adaptations is technological in nature. There are ways, purely technological ones, to reduce the pressure DevOps face on the one hand, and on the other hand remove obstacles from the developers’ way and let them become independent in their work.&lt;/p&gt;




&lt;h2&gt;
  
  
  DevOps Are Service Providers but on a Different Scale
&lt;/h2&gt;

&lt;p&gt;DevOps is short for developer operations, and it’s easy to see from this terminology that DevOps are the exclusive service providers for developers and should gear themselves accordingly. However, acting as service providers doesn’t come easily to DevOps, given the scale and complexity of what’s required of them in that role. As a result, a lot of friction and dissonance exists between DevOps and developers in many organisations. To resolve it, we need to understand what type of service providers DevOps are, what the scale of their work is, and how they can improve communication and operations to make their work, and the developers’ work, faster, easier, and more efficient.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Role of DevOps as Service Providers
&lt;/h2&gt;

&lt;p&gt;Most service providers usually provide their services within well-defined scopes. Developers develop, analysts analyze, QA verify, and so on. It’s true that these can also have their own internal customers like team members that require assistance. However, in such cases, the service or assistance is still within the scope and on a relatively small scale.&lt;/p&gt;

&lt;p&gt;DevOps is a different story. Not only do DevOps have their own work cut out for them, they also support a lot of internal customers, and do so on a much larger scale and within a wider scope. DevOps assist R&amp;amp;D people (and others, such as analysts and sales engineers) in a wide variety of contexts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assisting in setting up local environments.&lt;/li&gt;
&lt;li&gt;Granting roles and permissions to internal and external systems.&lt;/li&gt;
&lt;li&gt;Troubleshooting issues with databases, microservices, and deployments.&lt;/li&gt;
&lt;li&gt;Provisioning resources on demand.&lt;/li&gt;
&lt;li&gt;Consulting on scale and security.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem here is that DevOps are often unprepared and unequipped for this, neither in terms of understanding who the customer is, what they need, and how to give it to them with resources and restrictions in mind, nor in terms of how to do it at scale.&lt;/p&gt;

&lt;p&gt;DevOps usually come into the job with a different mindset. A DevOps engineer probably sees themselves responsible for the development and maintenance of mainly the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Infra - &lt;a href="https://www.meteorops.com/glossary/cloud" rel="noopener noreferrer"&gt;Cloud&lt;/a&gt;, different environments.&lt;/li&gt;
&lt;li&gt;Monitoring, logging, and alerting systems.&lt;/li&gt;
&lt;li&gt;CI/CD infra.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most DevOps will agree that it’s okay to add to this list ongoing support for developers and other stakeholders over a wide variety of contexts, systems, frameworks, and platforms.&lt;/p&gt;

&lt;p&gt;However, it’s this very same wide variety of domains that DevOps support that prevents them from providing good service. When DevOps are unable to provide this service, it creates a lot of friction and dissonance between DevOps and developers.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dissonance Between DevOps and Developers
&lt;/h2&gt;

&lt;p&gt;There’s no shortage of dissonance and conflict between DevOps and developers. Let’s look at some real-life examples.&lt;/p&gt;

&lt;p&gt;What DevOps might say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers don’t do the bare minimum to solve issues themselves before turning to DevOps.&lt;/li&gt;
&lt;li&gt;Developers' requirements from DevOps are always the path of least resistance.&lt;/li&gt;
&lt;li&gt;Developers don’t develop with security and scale in mind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, developers may want to counter-argue:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DevOps are not responsive enough, neither in terms of time to resolution nor deliverables.&lt;/li&gt;
&lt;li&gt;DevOps impede development and velocity through requirements and bureaucracy.&lt;/li&gt;
&lt;li&gt;DevOps don’t develop with developers in mind to assist them and facilitate their work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s easy to see the dissonance when these complaints are put side by side and one after another. Putting this conflict into structure reveals the disconnect and distance between what one side needs and what the other side is capable of providing. &lt;/p&gt;

&lt;p&gt;Since DevOps have their own duties to fulfill, extensive support on top of them adds pressure. This pressure can make DevOps compromise on the service they give because they are short on time and resources, so they expect developers to be self-sufficient and efficient when asking for support. What DevOps see as inefficiency in developers stems from the fact that developers are the most pressured entity in the organisation, because they develop the product. When developers come across issues that prevent them from working, they too are short on time and resources and need someone to assist them in a timely manner.&lt;/p&gt;

&lt;p&gt;This is reality nowadays in many organisations. Both developers and DevOps are short on time and resources, and when the former approaches the latter, the latter needs to halt their work and assist, or development will suffer setbacks. The goal is to change this reality and realign DevOps and developers towards better collaboration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Realigning DevOps with Developers
&lt;/h2&gt;

&lt;p&gt;Realigning DevOps with developers is not an easy task. The dissonance is not rooted in disagreement or differences of opinion; it persists because neither side has a structure for bridging it. It’s not enough to tell DevOps that they are service providers and should do what they can to support developers in their day-to-day operations. It’s not enough to tell developers to approach DevOps like a customer requiring a service. &lt;/p&gt;

&lt;p&gt;What’s lacking here is not verbal agreement but a well-defined, well-implemented framework of methodologies to help the two sides communicate and collaborate clearly and efficiently. &lt;/p&gt;

&lt;p&gt;Moreover, the DevOps mindset must incorporate the idea that to provide services on a large scale, you need tools and you need to know what the customer needs, how and when. For DevOps, to provide services is to alleviate the pressure coming from developer needs and improve delivery. Let’s explore some ideas for improvement. &lt;/p&gt;




&lt;h2&gt;
  
  
  Communication Tools
&lt;/h2&gt;

&lt;p&gt;Most companies use some form of ChatOps such as Slack or Teams. A dedicated channel for requests from developers is the first step. However, if there’s no structured way to submit requests for support or resources, it can become unmanageable and unwieldy really quickly. Many requests can come in at once and each of them might be related to something different. &lt;br&gt;
To tackle this issue, it’s possible to install a request bot or request form in the dedicated channel. The request bot or form allows developers to submit requests in an orderly manner. It also allows DevOps to manage requests by queue and with more info and context to begin with. The form or bot should gather the following from the requester:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The nature of the problem - is it a request for support, a general question, or a request for resources?&lt;/li&gt;
&lt;li&gt;What’s the environment in question - is it local, dev, or production?&lt;/li&gt;
&lt;li&gt;The request itself - does the developer need to set up a new service, are they having issues with a service, do they need more permissions for internal and external systems?&lt;/li&gt;
&lt;li&gt;If it’s a service - what is the name of the service and its dependencies (storage, docker repos, git repos, databases)?&lt;/li&gt;
&lt;li&gt;If it’s a request for resources - quantitative measures such as CPU and memory and justification for adding resources.&lt;/li&gt;
&lt;li&gt;If it’s more permissions - justification for permissions.&lt;/li&gt;
&lt;li&gt;What steps were taken to try and troubleshoot - where applicable.&lt;/li&gt;
&lt;li&gt;Any additional information that might be relevant - logs, metrics, documentation.&lt;/li&gt;
&lt;/ul&gt;
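&lt;p&gt;As a sketch, the fields above can be captured in a small, validated schema that the bot fills in before queuing a request. The field names and rules below are illustrative assumptions, not a real Slack or Teams API:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Illustrative schema for a support-request bot; the field names and
# valid values are assumptions for demonstration, not a standard.
@dataclass
class SupportRequest:
    kind: str                 # "support", "question", "resources", or "permissions"
    environment: str          # "local", "dev", or "production"
    description: str          # the request itself
    service: str = ""         # service name, if applicable
    dependencies: list = field(default_factory=list)  # storage, repos, databases
    justification: str = ""   # required for resource/permission requests
    troubleshooting: str = "" # steps already taken, where applicable
    extra: str = ""           # logs, metrics, documentation links

VALID_KINDS = {"support", "question", "resources", "permissions"}
VALID_ENVS = {"local", "dev", "production"}

def validate(req: SupportRequest) -> list:
    """Return a list of problems; an empty list means the request can be queued."""
    problems = []
    if req.kind not in VALID_KINDS:
        problems.append("unknown request kind: " + req.kind)
    if req.environment not in VALID_ENVS:
        problems.append("unknown environment: " + req.environment)
    if req.kind in {"resources", "permissions"} and not req.justification:
        problems.append("resource/permission requests need a justification")
    return problems
```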




&lt;p&gt;With the above in mind, now it’s time to see what can be automated or made self-serve to facilitate developer work by way of delegating and enabling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Facilitating Developer Work
&lt;/h2&gt;

&lt;p&gt;Most customers would prefer to do things themselves, especially developers, who work in fast-paced environments and within tight time constraints. We’ve started with communication tools and now want to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gather info and analyze.&lt;/li&gt;
&lt;li&gt;Discover areas that are constant pain points for developers.&lt;/li&gt;
&lt;li&gt;Automate or delegate where possible.&lt;/li&gt;
&lt;li&gt;Rinse and repeat.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way DevOps can automate repetitive tasks such as granting permissions or provisioning resources. DevOps must also think about protecting their own customers and not be too permissive. Clear boundaries need to be defined when permissions or resources are requested because, as we said, DevOps are usually adamant when it comes to security and scale.&lt;/p&gt;
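&lt;p&gt;Automation with boundaries can be as simple as a routing rule: auto-approve small, low-risk requests and escalate everything else to a human. A minimal sketch, with thresholds that are purely illustrative:&lt;/p&gt;

```python
# Sketch of a self-serve guardrail: small dev-environment requests are
# auto-approved, anything risky is routed to a human. The thresholds and
# policy here are illustrative assumptions, not a standard.
AUTO_APPROVE_CPU_LIMIT = 2      # cores
AUTO_APPROVE_MEMORY_LIMIT = 4   # GiB

def route_resource_request(environment: str, cpu: int, memory_gib: int) -> str:
    if environment == "production":
        return "needs-human-approval"   # never auto-touch production
    if cpu > AUTO_APPROVE_CPU_LIMIT or memory_gib > AUTO_APPROVE_MEMORY_LIMIT:
        return "needs-human-approval"   # large bumps deserve a look
    return "auto-approved"
```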

&lt;p&gt;Beyond turning repetitive tasks to self-serve, DevOps should also strive towards making developers’ work as easy as possible. For a developer, an easier way to work could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work - do everything on their own without needing anyone’s help.&lt;/li&gt;
&lt;li&gt;Develop - disposable dev environments that are quick to set up.&lt;/li&gt;
&lt;li&gt;CI/CD - easy to configure, easy to deploy, easy to revert.&lt;/li&gt;
&lt;li&gt;Panic time - clean, well-scoped logs and metrics that are easy to search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point we are almost three quarters of the way in. Now all that’s left is to make sure that customers are well-aware of what’s at their disposal. DevOps can and should plan for building a body of knowledge to assist and educate developers on how to make the best of what’s offered to them by DevOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summarize, Refine, Document, Educate
&lt;/h2&gt;

&lt;p&gt;Once proper communication and procedures are in place to facilitate developer work, the body of knowledge should be assembled. Everything that doesn’t fall under automated requests and better troubleshooting and debugging tools should go into the body of knowledge. &lt;/p&gt;

&lt;p&gt;The body of knowledge consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Documentation&lt;/li&gt;
&lt;li&gt;How-Tos&lt;/li&gt;
&lt;li&gt;FAQs&lt;/li&gt;
&lt;li&gt;Workshops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Composing and maintaining a body of knowledge is probably one of the most challenging things DevOps can do. It’s easy for DevOps to configure automations, but most don’t know how to write in a clear, concise manner. In addition, most customers don’t really bother to read the docs, and if they do, they just skim through. Even if developers do make use of documentation and workshops, knowledge changes fast and the body of knowledge needs to be adjusted and updated accordingly. To tackle this challenge, DevOps can encourage developers to take an active part in maintaining the body of knowledge. An engaged customer is most likely a self-sufficient one. Perhaps the most important vehicle for imparting knowledge is workshops. Not only is it easier to learn and understand by doing rather than reading, it also brings DevOps and developers closer together and strengthens their relationship. &lt;/p&gt;




&lt;h2&gt;
  
  
  DevOps as a Service: Maintaining Relationships at Scale
&lt;/h2&gt;

&lt;p&gt;The first step towards improving the collaboration between DevOps and developers is understanding its scale. The scale is massive enough to put strain on the relationship, and like most scale issues, systems and procedures can be put in place and iteratively revised and improved to handle it.&lt;/p&gt;

&lt;p&gt;DevOps must acknowledge that they are service providers at scale. Being service providers at scale, they must understand their resources, capabilities, and limitations in assisting developers, and put a system or procedure in place to handle it for them. DevOps must empower developers and delegate more responsibilities to them, therefore relieving pressure off of themselves while giving developers more agency.&lt;/p&gt;

&lt;p&gt;Only when DevOps follow this set of principles will they finally emerge as what they were always meant to be: service providers at scale.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>automation</category>
      <category>developer</category>
      <category>operations</category>
    </item>
    <item>
      <title>Proper mindset for handling data and databases: between scaling and failing</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Fri, 28 Feb 2025 17:19:11 +0000</pubDate>
      <link>https://forem.com/meteorops/proper-mindset-for-handling-data-and-databases-between-scaling-and-failing-che</link>
      <guid>https://forem.com/meteorops/proper-mindset-for-handling-data-and-databases-between-scaling-and-failing-che</guid>
      <description>&lt;p&gt;Startups and software companies put a lot of effort into what languages to use, what tech stacks to employ, and what cloud to deploy the app to but fail to put the same focus on their data and databases&lt;/p&gt;

&lt;p&gt;Data is the core of the app and product from which everything is derived and upon which everything depends. Startups and software houses should put more thought into how they plan to gather, store, analyse, and use their data. Doing so could mark the difference between success and failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Avoid Common Data Architecture Regrets
&lt;/h2&gt;

&lt;p&gt;Data is the raw material from which you mold your application. Anything else is just tools. If you treat your raw ingredients with proper care, the final recipe is bound to succeed.&lt;/p&gt;

&lt;p&gt;The tried and true methods of optimizing data are abundant. It’s well known that databases can be tweaked to better perform through various means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; - speed up lookups on frequently queried columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt; - keep data atomic and free of redundancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query optimization&lt;/strong&gt; - select only what you need, when you need it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt; - divide large tables into smaller ones.&lt;/li&gt;
&lt;/ul&gt;
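&lt;p&gt;To make the first item concrete, here is a minimal, self-contained sketch using SQLite (only because it ships with Python; the principle applies to any relational database) that shows the query planner switching from a full table scan to an index lookup:&lt;/p&gt;

```python
import sqlite3

# Demonstrate the planner switching from a full table scan to an index
# lookup. SQLite is used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, user TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 ((i, "user-" + str(i % 100)) for i in range(10_000)))

def plan(query):
    """Return the query plan as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(str(r) for r in rows)

q = "SELECT COUNT(*) FROM events WHERE user = 'user-42'"
print("SCAN" in plan(q))             # True: full table scan without an index
conn.execute("CREATE INDEX idx_events_user ON events (user)")
print("idx_events_user" in plan(q))  # True: the planner now uses the index
```

&lt;p&gt;The same experiment with &lt;code&gt;EXPLAIN&lt;/code&gt; works on PostgreSQL and MySQL; run it on your slowest real queries.&lt;/p&gt;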

&lt;p&gt;So if it’s all tried and true and has been established, why do we highlight it as an overlooked part of many applications? That’s because there’s a long way to go from theory to practice. The list above only represents what can be done to optimize performance, not how and when to do so. Moreover, in a fast-paced environment of software development and especially in startups, proper planning for data is sometimes pushed aside in favor of rapid growth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Plan Ahead for Your Data
&lt;/h2&gt;

&lt;p&gt;We’ve mentioned several ways to optimize database performance, but what you should really focus on is planning for the data: what sort of data will it be, what format will it take, and what manipulations will it undergo? Perhaps even more fundamental than the type and usage of the data is the database itself and how it fits your application. &lt;/p&gt;




&lt;h3&gt;
  
  
  Get to Know the Database
&lt;/h3&gt;

&lt;p&gt;There are two main types of databases these days: relational (SQL) and document-based (NoSQL). &lt;/p&gt;

&lt;h4&gt;
  
  
  NoSQL:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If your application needs to handle standalone yet flexible documents.&lt;/li&gt;
&lt;li&gt;If you predict large amounts of data that might be distributed and sharded.&lt;/li&gt;
&lt;li&gt;If you expect a lot of unstructured data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  SQL:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If your application requires rigid, well-defined schemas and relations.&lt;/li&gt;
&lt;li&gt;If your application requires consistency across the entire dataset.&lt;/li&gt;
&lt;li&gt;If you intend to ingest columnar data using big data tools.&lt;/li&gt;
&lt;/ul&gt;
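&lt;p&gt;As a purely illustrative rule of thumb, the two checklists can be folded into a small helper that counts which traits your workload matches. This is not an authoritative algorithm, just a way to frame the discussion:&lt;/p&gt;

```python
# Hypothetical rule-of-thumb helper mirroring the checklists above;
# the traits and equal weighting are illustrative assumptions.
def suggest_database(flexible_documents, sharded_scale, unstructured,
                     rigid_schema, strong_consistency, columnar_analytics):
    nosql_score = sum([flexible_documents, sharded_scale, unstructured])
    sql_score = sum([rigid_schema, strong_consistency, columnar_analytics])
    if nosql_score > sql_score:
        return "NoSQL"
    if sql_score > nosql_score:
        return "SQL"
    return "either - profile your workload first"
```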

&lt;p&gt;Once you’ve chosen the database to work with, ask yourself again what your use case is. Read about what others have experienced working with MySQL, MariaDB, PostgreSQL, or MongoDB, to name a few. Find the setbacks that others faced and see whether at any point in the future you might face something similar.&lt;/p&gt;




&lt;h3&gt;
  
  
  Get to Know the Data and its Characteristics
&lt;/h3&gt;

&lt;p&gt;The way you design your data now will impact you in the future. It’s a hard task, but force yourself to think of what other functionality you have in store and plan to implement. See if the current data scheme and models allow easy integration of such functionality.&lt;/p&gt;

&lt;p&gt;Functionality implies data moving around and being updated constantly. Consider the behavior of your data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you do a lot of writes but fewer reads, opt for throughput.&lt;/li&gt;
&lt;li&gt;If you do many reads but fewer writes, opt for I/O and use caching.&lt;/li&gt;
&lt;/ul&gt;
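&lt;p&gt;For the read-heavy case, a read-through cache keeps repeated lookups away from the database. The sketch below uses &lt;code&gt;functools.lru_cache&lt;/code&gt; purely as a stand-in for a real cache like Redis or Memcached:&lt;/p&gt;

```python
import functools

# Read-heavy workloads: a tiny read-through cache in front of an
# expensive lookup. The counter stands in for database round-trips.
CALLS = {"db": 0}

@functools.lru_cache(maxsize=1024)
def get_user(user_id):
    CALLS["db"] += 1  # pretend this line is a database round-trip
    return {"id": user_id, "name": "user-" + str(user_id)}

get_user(7)
get_user(7)           # served from the cache, no second DB hit
print(CALLS["db"])    # 1
```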




&lt;h2&gt;
  
  
  Load and Stress the Data
&lt;/h2&gt;

&lt;p&gt;Data-related performance issues mostly hit you when you least expect them. The smooth functionality you are used to is not attributable to your choice of database or data scheme. It’s mostly attributable to a lack of load and stress on the database. Generating that load and stress yourself, before production does, is what you should strive for.&lt;/p&gt;

&lt;p&gt;Again, a hard task lies ahead: accept that tens of thousands of requests per minute can easily become millions. It’s easy to list ways to optimize data manipulation and retrieval, but no one gains experience and knowledge without trying.&lt;/p&gt;

&lt;p&gt;If you want to know if your data is well-structured and well-retrieved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try to write and read more than you imagine would ever be needed.&lt;/li&gt;
&lt;li&gt;Only when it stops working, look under the hood to find and fix the problem.&lt;/li&gt;
&lt;/ul&gt;
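&lt;p&gt;A stress test doesn’t have to start big. The sketch below pushes a burst of writes through SQLite and measures throughput; a real load test would hammer your actual database concurrently, but the shape is the same:&lt;/p&gt;

```python
import sqlite3
import time

# A minimal write stress loop: push far more rows than you expect in
# production and measure throughput. SQLite is used only because it
# ships with Python.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")

start = time.perf_counter()
with conn:  # one transaction; committing per row would be far slower
    conn.executemany("INSERT INTO kv VALUES (?, ?)",
                     ((i, "v" + str(i)) for i in range(50_000)))
writes_per_sec = 50_000 / (time.perf_counter() - start)

rows = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
print(rows)                # 50000
print(writes_per_sec > 0)  # True; compare this number against your targets
```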

&lt;p&gt;Like we said earlier, anything else is just tools. The data is the heart and core of the application and should be created like one: consistent, resilient, scalable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Protect Your Data
&lt;/h2&gt;

&lt;p&gt;With the efforts of choosing a database, data models, and optimizing their usage behind you, you should think about protecting your data. &lt;/p&gt;

&lt;p&gt;Protecting your data from bad actors goes without saying. It’s the internal actors that you need to shield the data from. Internal actors can be services and humans, and since humans make mistakes, so do the services they build.&lt;/p&gt;

&lt;p&gt;Consider the following means of mitigating accidental service disruption or, worse yet, data loss or corruption:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Back up the data and plan for deploying from a snapshot.&lt;/li&gt;
&lt;li&gt;Limit access from the get-go.&lt;/li&gt;
&lt;li&gt;Serve reads from replicas; no human ever writes directly to the primary.&lt;/li&gt;
&lt;li&gt;If a service is the owner or main user of a table or database, other services request data through internal APIs.&lt;/li&gt;
&lt;li&gt;Monitor the database CPU and memory and plan ahead in case you need to scale.&lt;/li&gt;
&lt;li&gt;Find long-running queries, kill them, and trace their source to detect misbehaving services.&lt;/li&gt;
&lt;/ul&gt;
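&lt;p&gt;The last item can be scripted. Assuming PostgreSQL, the standard &lt;code&gt;pg_stat_activity&lt;/code&gt; view and &lt;code&gt;pg_terminate_backend()&lt;/code&gt; are enough to find and kill long-running queries; the helper below only builds the SQL, and you would adapt it for other engines:&lt;/p&gt;

```python
# Sketch for PostgreSQL: find queries running longer than a threshold
# and build the statement that terminates one. pg_stat_activity and
# pg_terminate_backend are standard PostgreSQL; other engines differ.
def long_query_sql(minutes):
    return (
        "SELECT pid, now() - query_start AS duration, query "
        "FROM pg_stat_activity "
        "WHERE state = 'active' "
        "AND now() - query_start > interval '" + str(minutes) + " minutes'"
    )

def kill_query_sql(pid):
    return "SELECT pg_terminate_backend(" + str(pid) + ")"
```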




&lt;h2&gt;
  
  
  Keep The Data Clean
&lt;/h2&gt;

&lt;p&gt;It’s not enough to protect your data. You also have to keep it nice and tidy. A lot of data accumulates through the product lifecycle and more often than not becomes stale. &lt;/p&gt;

&lt;p&gt;Modern hard drives are fast, reliable, and reach terabytes in volume, but that doesn’t mean you should fill them with data.&lt;/p&gt;

&lt;p&gt;Too much data puts strain on the disk and memory, not to mention that more data means longer queries, even with indexing.&lt;/p&gt;

&lt;p&gt;Consider the following as ways to keep your data clean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Don’t do soft deletes.&lt;/li&gt;
&lt;li&gt;Scan and find least retrieved data and archive it.&lt;/li&gt;
&lt;li&gt;If you’re using PostgreSQL, run &lt;code&gt;VACUUM&lt;/code&gt;. If you use MySQL, run &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt;. Do the same for any other database you use.&lt;/li&gt;
&lt;li&gt;Be wary of making changes – don’t add tables that duplicate data.&lt;/li&gt;
&lt;/ul&gt;
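&lt;p&gt;A minimal illustration of hard deletes plus space reclamation, again using SQLite because it ships with Python (&lt;code&gt;VACUUM&lt;/code&gt; is the PostgreSQL command too):&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# Hard-delete stale rows and reclaim the disk space with VACUUM.
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, line TEXT)")
with conn:
    conn.executemany("INSERT INTO logs (line) VALUES (?)",
                     (("x" * 500,) for _ in range(10_000)))
before = os.path.getsize(path)

with conn:
    conn.execute("DELETE FROM logs WHERE id % 2 = 0")  # no soft-delete flag
conn.execute("VACUUM")  # give the space back to the OS
after = os.path.getsize(path)
print(before > after)   # True: the file shrank
```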




&lt;h2&gt;
  
  
  Keep Your Data in Mind
&lt;/h2&gt;

&lt;p&gt;Out of all the aspects and methods we discussed, there’s one conclusion to be drawn:&lt;br&gt;&lt;br&gt;
Data is the most important, most overlooked aspect of software development.&lt;/p&gt;

&lt;p&gt;To keep your data in mind means to consider all the pros and cons of choosing a database.&lt;br&gt;&lt;br&gt;
To keep your data in mind means you always check how data retrieval affects performance.&lt;br&gt;&lt;br&gt;
To keep your data in mind is to follow these principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the right database for the workload.&lt;/li&gt;
&lt;li&gt;Create indices and optimize queries.&lt;/li&gt;
&lt;li&gt;Optimize I/O through hardware adjustments and caching.&lt;/li&gt;
&lt;li&gt;Get rid of unnecessary data – no soft deletes.&lt;/li&gt;
&lt;li&gt;Check and check again that high volume doesn’t create bottlenecks.&lt;/li&gt;
&lt;li&gt;Back up your data and limit access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Strive to apply the principles listed above because no matter what you do with your app or product, it’s almost always related to data.  &lt;/p&gt;

&lt;p&gt;Always keep in mind that data is the foundation. When it’s well-maintained, the whole system benefits.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>architecture</category>
      <category>productivity</category>
      <category>database</category>
    </item>
    <item>
      <title>Practical Tips for Kubernetes Upgrades for Startups</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Mon, 10 Feb 2025 18:58:35 +0000</pubDate>
      <link>https://forem.com/meteorops/practical-tips-for-kubernetes-upgrades-for-startups-22df</link>
      <guid>https://forem.com/meteorops/practical-tips-for-kubernetes-upgrades-for-startups-22df</guid>
      <description>&lt;p&gt;&lt;em&gt;Upgrade Kubernetes with confidence: A step-by-step guide to ensure seamless updates, maintain stability, and avoid breaking changes.&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The all-too-popular Kubernetes upgrade-storm
&lt;/h1&gt;

&lt;p&gt;There comes a day when you get a notification that the Kubernetes version you are running is reaching its end of life. Best case, you open a ticket, knowing full well that it will either be pushed down the list of priorities or forgotten completely. After all, you have other priorities such as releases and bug fixing. You are a fast-moving startup that needs to bring in new business in order to grow. Upgrading Kubernetes is the least of your concerns right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  You will have to upgrade eventually
&lt;/h2&gt;

&lt;p&gt;But the day finally comes when you need to upgrade and one of the following could be the trigger:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your Kubernetes version actually finally reached its end of life.&lt;/li&gt;
&lt;li&gt;R&amp;amp;D management finally decided it was time to upgrade.&lt;/li&gt;
&lt;li&gt;You need to upgrade regardless of end of life because some critical components in your cluster need upgrading for a bug fix or a feature that you need.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You look up the Helm chart or operator running in your cluster and realize that you cannot upgrade to the newer versions because they are incompatible with your current Kubernetes version. So you need to upgrade the cluster in order to upgrade the Helm charts. And to top it all off, it has been decided to upgrade Kubernetes all the way to the latest version, and you find yourself needing to upgrade four versions forward.&lt;/p&gt;

&lt;p&gt;Every Kubernetes upgrade has the potential to introduce breaking changes. In most cases it’s deprecated APIs or APIs that moved from one API group to another. This will affect anything in your cluster that relies on these Kubernetes APIs. Now take this and do it four times. You need to carefully plan how to approach and execute the upgrade:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scope and price the process in terms of effort and time to completion.&lt;/li&gt;
&lt;li&gt;Create an upgrade plan and iterate over it by testing on lower environments.&lt;/li&gt;
&lt;li&gt;Set a maintenance window.&lt;/li&gt;
&lt;li&gt;Declare code freeze.&lt;/li&gt;
&lt;li&gt;Upgrade and verify.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a challenging process that will exhaust you. It is labor-intensive and very error-prone. If you upgrade a library in some microservice you can test it locally; the scope is almost always isolated to that specific microservice, and in any case the blast radius is relatively small. But when you upgrade a Kubernetes cluster you are upgrading the entire system, and anything going wrong could have serious consequences.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to approach the upgrade
&lt;/h2&gt;

&lt;p&gt;Before discussing ways to approach an upcoming upgrade we need to address the elephant in the room. Once you need to upgrade, you’re probably short on time, short on resources, and need to upgrade several versions forward while making sure that the app stack itself and everything else that runs on your Kubernetes cluster remains functional. That is not the way to go.&lt;/p&gt;

&lt;p&gt;What we derive from this situation is the first principle of how to approach the upgrade:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upgrade small, upgrade continuous.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;Once we realize and implement this principle, we can move on to what you need to do in order to successfully upgrade your Kubernetes cluster.&lt;/p&gt;




&lt;h2&gt;
  
  
  Upgrade small, upgrade continuous
&lt;/h2&gt;

&lt;p&gt;Remember the day you got the notification that your Kubernetes cluster version reached its end of life? Well, this is the day you waste no time and put the task in the sprint.&lt;/p&gt;

&lt;p&gt;Opponents of this approach might say that a startup cannot afford to jump on every upgrade because there’s more pressing business to conduct. But a startup also cannot afford system instability. The longer the wait, the more unstable the system might become and the harder the upgrade will be, especially if more than one version is in question. So upgrade small, upgrade continuous. This goes for the components in the cluster as much as for the cluster itself.&lt;/p&gt;

&lt;p&gt;Keeping your Helm charts, operators, and controllers up to date will almost always guarantee that you will not have to upgrade them when upgrading your Kubernetes cluster. There is no doubt that a startup should think first about how to bring in money. If the Kubernetes cluster upgrade competes with a feature that will bring in new business, the feature will almost always win.&lt;/p&gt;

&lt;p&gt;However, by insisting on upgrading small and keeping up to date, you give yourself breathing room for when a feature or a bug fix is truly critical. You can allow yourself to skip an upgrade and focus on business because the end of life of your Kubernetes version is farther down the road.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test and verify on lower environments
&lt;/h2&gt;

&lt;p&gt;Another principle worth discussing is testing and verifying the upgrade on lower environments. The term lower environments usually means dev and stg, but in many cases those environments represent the app stack and less so the infrastructure.&lt;/p&gt;

&lt;p&gt;This means that the lower environments you test on have to be identical to the production cluster you are about to upgrade. In identical, non-critical environments you can allow yourself to make mistakes, which are the best way to learn.&lt;/p&gt;

&lt;p&gt;Upgrading Kubernetes is a difficult task. Having the ability to try it out without fear of service disruption is liberating and will let you experiment more, better preparing you for upgrade day. &lt;/p&gt;

&lt;p&gt;When you eventually test on lower environments, don’t just upgrade and settle for a working cluster. Remember that lower environments are meant to represent the architecture and app stack of higher environments. Consider the following when upgrading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor the environments through metrics and logs to check for anything suspicious or out of the ordinary:

&lt;ul&gt;
&lt;li&gt;A critical component fails to be scheduled - pods in CrashLoopBackOff, pods failing liveness and readiness probes, nodes not scaling when the cluster is loaded, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kube-proxy&lt;/code&gt; is not alive and well and services cannot talk to each other.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kube-dns&lt;/code&gt; is not alive and well and services fail to resolve host names.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Run your test suites against the app stack. Verifying that the app stack functions as it should in an upgraded environment will give you extra confidence that the upgrade is going well:

&lt;ul&gt;
&lt;li&gt;Run e2e tests.&lt;/li&gt;
&lt;li&gt;Run integration tests.&lt;/li&gt;
&lt;li&gt;Run load tests to verify that deployments scale accordingly.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
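&lt;p&gt;Checks like these can be automated. Below is a sketch that scans the parsed output of &lt;code&gt;kubectl get pods --all-namespaces -o json&lt;/code&gt; for pods stuck in CrashLoopBackOff; in real use you would load the JSON with &lt;code&gt;json.load&lt;/code&gt;, and the sample data here is illustrative:&lt;/p&gt;

```python
# Scan a kubectl pod list (already parsed into a dict) for containers
# stuck in CrashLoopBackOff. The nesting below matches kubectl's JSON
# output: items[].status.containerStatuses[].state.waiting.reason
def crashlooping_pods(pod_list):
    bad = []
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting", {})
            if waiting.get("reason") == "CrashLoopBackOff":
                bad.append(pod["metadata"]["name"])
    return bad

# Illustrative sample data, standing in for real kubectl output.
sample = {"items": [
    {"metadata": {"name": "kube-dns-abc"},
     "status": {"containerStatuses": [
         {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"metadata": {"name": "healthy-pod"},
     "status": {"containerStatuses": [{"state": {"running": {}}}]}},
]}
print(crashlooping_pods(sample))  # ['kube-dns-abc']
```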

&lt;p&gt;There’s a caveat in this approach, though: you have to provision and maintain these environments, which means more resources allocated to the upgrade process even when no upgrade is in the pipeline. But there’s no better approach. It’s “measure twice, cut once” and “invest money, not time” rolled into one.&lt;/p&gt;

&lt;p&gt;Try out the upgrade first on preallocated environments, then execute it for real. The stability you achieve will contribute to the overall health of the system and the organization itself. No amount of money can compensate for dev teams overworked by system instability. Instability can also drive churn, pushing away clients and even prospects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Now it’s time to upgrade
&lt;/h2&gt;

&lt;p&gt;Let’s assume that we are in an ideal world where you have your lower environments ready and well-maintained and you have allocated time and resources for continuous upgrades. How do you prepare for an upgrade? There are several things you need to do.&lt;/p&gt;

&lt;p&gt;First of all, you have to thoroughly read the release notes. That doesn’t mean scrolling through them but reading them line by line. It’s a time-consuming task, but it follows the principle of “measure twice, cut once”. A lot of what you read won’t be relevant; a lot of it will be invaluable. Dedicate time and patience to this task. Tutorials and guides are obviously welcome, but remember that not all environments are alike. &lt;/p&gt;

&lt;p&gt;Now that you have a sense of what’s heading your way in terms of the effort put into research, you could use an automated process to give you a head start. You can find exactly that in &lt;a href="https://github.com/kubepug/kubepug" rel="noopener noreferrer"&gt;kubepug which is an open-source Kubernetes pre-upgrade checker&lt;/a&gt;. What you can and should do with kubepug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run kubepug against your current Kubernetes cluster version to get the following:

&lt;ul&gt;
&lt;li&gt;A list of deprecated APIs.&lt;/li&gt;
&lt;li&gt;Any objects affected by API changes.&lt;/li&gt;
&lt;li&gt;What APIs should be used instead of deprecated ones.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;If all goes well, and we hope it will, run kubepug once more after the upgrade, since it can also verify the versions currently in use.&lt;/li&gt;

&lt;li&gt;Trust kubepug, but verify that everything it drew your attention to was indeed upgraded or replaced and that it’s consistent with the release notes.&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Once you gather the information and mode of operation from release notes and guides, go look at your Kubernetes cluster and find everything that both exists in your cluster and is referenced in the information you gathered. The match between the two is the basis for your upgrade plan.&lt;/p&gt;
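&lt;p&gt;That matching step can be a few lines of code. The sketch below intersects deprecations gathered from the release notes or kubepug output (the two listed are real Kubernetes 1.22 removals) with the API versions found in your manifests; the in-cluster list is illustrative:&lt;/p&gt;

```python
# Intersect known deprecations with what actually runs in the cluster.
# The two entries are real Kubernetes 1.22 removals; the manifest list
# below is an illustrative stand-in for a scan of your cluster.
DEPRECATED = {
    "extensions/v1beta1/Ingress": "networking.k8s.io/v1/Ingress",
    "rbac.authorization.k8s.io/v1beta1/ClusterRole":
        "rbac.authorization.k8s.io/v1/ClusterRole",
}

def upgrade_plan(manifests):
    """manifests: list of (apiVersion, kind) tuples found in the cluster."""
    plan = []
    for api_version, kind in manifests:
        key = api_version + "/" + kind
        if key in DEPRECATED:
            plan.append((key, DEPRECATED[key]))
    return plan

in_cluster = [("extensions/v1beta1", "Ingress"), ("apps/v1", "Deployment")]
print(upgrade_plan(in_cluster))
# [('extensions/v1beta1/Ingress', 'networking.k8s.io/v1/Ingress')]
```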




&lt;h2&gt;
  
  
  Automate and summarize
&lt;/h2&gt;

&lt;p&gt;We’ve mentioned a few times that the upgrade process, especially everything that precedes it, is arduous and time-consuming. This is where you automate the discovery and summary process.&lt;/p&gt;

&lt;p&gt;Use LLMs to summarize and highlight the information you gathered, and other tools to scan, analyze, and report changes between versions. Another aspect of the upgrade process is to compile, document, and implement the upgrade procedure itself. Laying down the foundations of upgrading small and continuously is perhaps more important than any single upgrade.&lt;/p&gt;

&lt;p&gt;It’s true that the goal is the eventual Kubernetes cluster upgrade, but how it’s carried out will determine the measure of peace of mind that you will have when approaching this important task.&lt;/p&gt;




&lt;h2&gt;
  
  
  Yours is a startup and should start well
&lt;/h2&gt;

&lt;p&gt;Kubernetes is one of the best things to have happened to the tech industry. By using it, your startup avoids the pain of provisioning its own orchestrator. Consider that the time you save by using Kubernetes rather than maintaining your own solution is time worth investing back into Kubernetes itself.&lt;/p&gt;

&lt;p&gt;And to do that you need to consider that for Kubernetes to continue serving you it needs to be up to date and well-maintained. Then and only then will it guarantee the highest level of stability. And a stable infra for a startup is priceless as it allows you to grow and rarely holds you back. &lt;/p&gt;

&lt;p&gt;So for all your successful upgrades to come, adopt the mindset that we are trying to convey. &lt;/p&gt;

&lt;p&gt;Give the Kubernetes upgrade its place in the development pipeline. Like we highlighted, an upgraded, well-maintained cluster is an invaluable resource. Small to medium effort every now and then is better than an out-of-the-blue urgent upgrade.&lt;/p&gt;

&lt;p&gt;Be ahead of the upgrade. Don’t wait for it to come to you. Seek it proactively: open a ticket with a due date and set a calendar reminder. Allow yourself time to educate and prepare yourself. Yes, there are automated tools like kubepug, but you need to know how to use them and rely on them only to the extent that they don’t have the final say.&lt;/p&gt;

&lt;p&gt;Test on lower environments and verify and validate by looking at metrics and logs. Validate further by making sure that the app stack functions as it should.&lt;/p&gt;

&lt;p&gt;These principles don’t guarantee smooth upgrades, as the unexpected is almost always bound to happen. However, they do set you up for successful upgrades and will instill greater confidence for the current upgrade as well as future ones. You’re a startup, and adopting these principles and this mindset will prove itself not only when upgrading your cluster, but in anything you set out for your startup to become and achieve.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloudcomputing</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Deploy a Kubernetes App &amp; AWS Resources using Crossplane on Kubernetes: Part 2</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Thu, 31 Oct 2024 12:19:56 +0000</pubDate>
      <link>https://forem.com/meteorops/deploy-a-kubernetes-app-aws-resources-using-crossplane-on-kubernetes-part-2-2m51</link>
      <guid>https://forem.com/meteorops/deploy-a-kubernetes-app-aws-resources-using-crossplane-on-kubernetes-part-2-2m51</guid>
      <description>&lt;h2&gt;
  
  
  To properly enjoy this article
&lt;/h2&gt;

&lt;p&gt;This tutorial assumes you already followed the steps in part 1: &lt;a href="https://www.meteorops.com/blog/deploy-aws-resources-using-crossplane-on-kubernetes" rel="noopener noreferrer"&gt;&lt;em&gt;Deploy AWS Resources using Crossplane on Kubernetes&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, this is the GitHub repository we’ll be using:&lt;br&gt;
&lt;a href="https://github.com/MeteorOps/crossplane-aws-provider-bootstrap" rel="noopener noreferrer"&gt;https://github.com/MeteorOps/crossplane-aws-provider-bootstrap&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  What’s this article about?
&lt;/h2&gt;

&lt;p&gt;In this article, we’ll cover a use-case that can benefit from Crossplane: full environment deployment.&lt;/p&gt;

&lt;p&gt;This is a step-by-step guide with an example and a Git repository, so by the end of it, you should be able to deploy a sample env.&lt;/p&gt;

&lt;p&gt;You can technically walk through the entire thing by copy-pasting, and everything should work. But diving into the explanations for an extra 5–10 minutes will leave you with longer-term value.&lt;/p&gt;

&lt;p&gt;Hope you enjoy!&lt;/p&gt;


&lt;h2&gt;
  
  
  What to expect from this article?
&lt;/h2&gt;

&lt;p&gt;By the end of it, you’ll understand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How Crossplane can be used for full environment deployment&lt;/li&gt;
&lt;li&gt;How to deploy a sample app with AWS resources&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  What not to expect?
&lt;/h3&gt;

&lt;p&gt;This article guides you through a simple application deployment, and not a full set of apps.&lt;/p&gt;

&lt;p&gt;It also doesn’t go into using Crossplane in conjunction with Helm, but does cover important principles regarding it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Crossplane for the Full Environment Use-Case?
&lt;/h2&gt;
&lt;h3&gt;
  
  
  When you want to deploy a full environment, it usually involves 3 layers:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure&lt;/strong&gt;: Resources the application needs to run well&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application&lt;/strong&gt;: The programs built by the company to serve users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: The data the application uses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But you already know that.&lt;/p&gt;
&lt;h3&gt;
  
  
  The thing is, a tradition developed, and Crossplane sort of broke this tradition.
&lt;/h3&gt;

&lt;p&gt;The tradition was this process: &lt;em&gt;Build infrastructure, Deploy application on top.&lt;/em&gt;&lt;br&gt;
How did Crossplane break this tradition?&lt;br&gt;
The application deployment can now provision infrastructure required by the application.&lt;/p&gt;
&lt;h3&gt;
  
  
  Pull-Request Environments are also easier
&lt;/h3&gt;

&lt;p&gt;By creating a namespace with all of the apps and the AWS resources they require using Crossplane, the use-case of &lt;a href="https://www.meteorops.com/blog/the-cto-devops-handbook-simple-principles-and-examples#bonus-an-example-setup-for-a-cto-approaching-production" rel="noopener noreferrer"&gt;creating a full environment per Pull-Request&lt;/a&gt; as part of the CI becomes much easier.&lt;/p&gt;

&lt;p&gt;That's a nice benefit of such a setup for companies using the feature-branch or Gitflow approaches.&lt;/p&gt;
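&lt;p&gt;As a rough sketch of that idea (the namespace naming is illustrative, not taken from the repository), a CI job could derive a namespace from the Pull-Request number and deploy the claim into it:&lt;/p&gt;

```shell
# Derive an isolated namespace per Pull-Request
# (PR_NUMBER would normally come from the CI system)
PR_NUMBER=42
NAMESPACE="pr-${PR_NUMBER}"
echo "${NAMESPACE}"   # prints: pr-42

# In CI you would then create the namespace and deploy into it, e.g.:
#   kubectl create namespace "${NAMESPACE}"
#   kubectl apply -n "${NAMESPACE}" -f composite-app-example.yaml
# Deleting the namespace later removes the claims, and Crossplane
# can then clean up their AWS resources along with them.
```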


&lt;h2&gt;
  
  
  A Traditional Full Env Example
&lt;/h2&gt;

&lt;p&gt;To provision and &lt;a href="https://www.meteorops.com/blog/one-click-environment-the-ultimate-devops-goal" rel="noopener noreferrer"&gt;deploy a full environment&lt;/a&gt; in the past, the process would generally look something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provision VPC+EKS+... using Terraform&lt;/li&gt;
&lt;li&gt;Use Terraform to bootstrap the cluster with a CD tool (e.g., ArgoCD)&lt;/li&gt;
&lt;li&gt;ArgoCD looks at a repo that deploys all apps from there&lt;/li&gt;
&lt;li&gt;An application needs a new S3 Bucket, so the developer writes Terraform code for it&lt;/li&gt;
&lt;li&gt;The application gets removed after a while (but the bucket stays)&lt;/li&gt;
&lt;li&gt;Someone needs to remember that the bucket was owned by that app and remove it from Terraform&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  A Crossplane Full Env Example
&lt;/h2&gt;

&lt;p&gt;To provision and deploy a full environment with Crossplane the process is similar (we still need a Kubernetes Cluster to start with for the initial environment):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provision VPC+EKS+... using Terraform&lt;/li&gt;
&lt;li&gt;Deploy Crossplane’s prerequisites to the cluster with Terraform&lt;/li&gt;
&lt;li&gt;Add Crossplane resources to application Helm Charts (so they get their required infra upon deployment)&lt;/li&gt;
&lt;li&gt;Create a Crossplane manifest to deploy the Helm Charts + Some shared infra required by all apps&lt;/li&gt;
&lt;li&gt;When an application is removed, its AWS resources are gone with it&lt;/li&gt;
&lt;li&gt;When an entire environment is terminated, its AWS resources are gone with it&lt;/li&gt;
&lt;/ol&gt;
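&lt;p&gt;To illustrate step 3, here's a minimal sketch of a Crossplane-managed S3 bucket as it might appear in an application's Helm chart templates (the resource names here are made up, and the exact &lt;code&gt;apiVersion&lt;/code&gt; depends on which AWS provider you installed):&lt;/p&gt;

```shell
# Write an illustrative managed-resource manifest, as an app's Helm chart might render it
# (names are hypothetical; the apiVersion depends on the installed AWS provider)
cat > /tmp/app-bucket.yaml <<'EOF'
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: my-app-bucket
spec:
  forProvider:
    region: us-east-1
EOF

# Because the manifest ships inside the chart, uninstalling the release deletes the
# Bucket object, and Crossplane removes the actual AWS bucket with it
grep 'kind: Bucket' /tmp/app-bucket.yaml   # prints: kind: Bucket
```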


&lt;h2&gt;
  
  
  Crossplane in Helm vs. Helm in Crossplane
&lt;/h2&gt;

&lt;p&gt;When using Crossplane alongside Helm, the question arises:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should Helm apply the Crossplane code? Or, should Crossplane apply the Helm Charts?&lt;br&gt;
I'm glad you asked - the answer is both; it depends on the situation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Reasons for Crossplane in Helm:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Create or modify app-specific resources when that app is deployed&lt;/li&gt;
&lt;li&gt;Delete app-specific resources when that app is deleted&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Reasons for Helm in Crossplane:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Manage dependencies between resources and applications using Crossplane&lt;/li&gt;
&lt;li&gt;Create shared resources that are not owned by a single application&lt;/li&gt;
&lt;/ol&gt;


&lt;h2&gt;
  
  
  The Step-by-Step Guide
&lt;/h2&gt;

&lt;p&gt;We’ll deploy the simple application alongside an S3 bucket using a Crossplane Composite Application.&lt;/p&gt;
&lt;h3&gt;
  
  
  Before proceeding
&lt;/h3&gt;

&lt;p&gt;Make sure you follow the steps in the 1st article (it takes about 3 minutes to just copy-paste the code snippets into your terminal and run the entire thing).&lt;/p&gt;


&lt;h3&gt;
  
  
  Deploy the Crossplane Kubernetes Provider
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prepare the AWS Credentials for the Application to be able to use AWS&lt;/strong&gt;&lt;br&gt;
Run the following one-liner to create the Secret containing the AWS credentials in the format required by the Application (the application will simply run &lt;code&gt;aws s3 ls&lt;/code&gt; to show the bucket):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret generic aws-creds &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;aws_access_key_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; aws_access_key_id creds | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="s1"&gt;' = '&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;aws_secret_access_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; aws_secret_access_key creds | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="s1"&gt;' = '&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure it was created as expected by fetching the secret:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hxj0r5ipwxqsvxpbotb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hxj0r5ipwxqsvxpbotb.png" alt="Image description" width="742" height="55"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploy the Crossplane Kubernetes Provider resources using the k8s-provider-bootstrap.yaml file&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; k8s-provider-bootstrap.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure the provider was created and is ready before proceeding to the next steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get providers provider-kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkwy376ar5myo1tqaxmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkwy376ar5myo1tqaxmz.png" alt="Image description" width="800" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deploy the Crossplane Kubernetes Provider Configuration using the &lt;code&gt;k8s-provider-conf.yaml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; k8s-provider-conf.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is done separately, as it needs to happen after the Provider resources have been created.&lt;/p&gt;

&lt;p&gt;This is where we tell the Crossplane Kubernetes Provider in which Kubernetes cluster it should operate when it’s creating resources.&lt;/p&gt;




&lt;h3&gt;
  
  
  Create a deployable unit of an App &amp;amp; AWS Resources using Crossplane
&lt;/h3&gt;

&lt;p&gt;Here we do 3 things with 3 files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;composite-app-xrd&lt;/code&gt; file:&lt;/strong&gt;&lt;br&gt;
Contains the CompositeResourceDefinition (XRD) for the K8sApplication by using the Composition of a K8s Deployment and S3 Bucket (described below)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;composite-app-composition&lt;/code&gt; file:&lt;/strong&gt;&lt;br&gt;
Contains the Composition definition which creates both the Kubernetes Deployment and the S3 Bucket&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The &lt;code&gt;composite-app-example&lt;/code&gt; file:&lt;/strong&gt;&lt;br&gt;
Calls the CompositeResource defined by &lt;code&gt;composite-app-xrd&lt;/code&gt; file&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Crossplane Resources Files Breakdown &amp;amp; Creation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;composite-app-xrd.yaml&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;~ K8sApplication CompositeResourceDefinition&lt;/strong&gt;&lt;br&gt;
This defines a composite resource for a Kubernetes application, with &lt;code&gt;bucketName&lt;/code&gt; and &lt;code&gt;bucketRegion&lt;/code&gt; fields in the spec. Users can claim this resource as K8sApplication.&lt;br&gt;&lt;br&gt;
The K8sApplication CompositeResource (XRD) accepts the &lt;code&gt;bucketName&lt;/code&gt; &amp;amp; &lt;code&gt;bucketRegion&lt;/code&gt; fields and uses them to create an S3 Bucket, and to create a K8s Deployment of a mock “service” that simply runs &lt;code&gt;aws s3 ls&lt;/code&gt; to see the bucket.&lt;/p&gt;
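&lt;p&gt;For a sense of the shape involved, a claim for such an XRD might look roughly like this (the &lt;code&gt;apiVersion&lt;/code&gt; group and field layout are illustrative; the real definitions live in the repository's &lt;code&gt;composite-app-*&lt;/code&gt; files):&lt;/p&gt;

```shell
# Illustrative K8sApplication claim (hypothetical group/version;
# see the repository for the actual files)
cat > /tmp/claim.yaml <<'EOF'
apiVersion: meteorops.example.org/v1alpha1
kind: K8sApplication
metadata:
  name: my-app
spec:
  bucketName: my-app-bucket
  bucketRegion: us-east-1
EOF

# You would deploy it with: kubectl apply -f /tmp/claim.yaml
grep 'bucketName' /tmp/claim.yaml   # prints the bucketName line
```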

&lt;p&gt;&lt;strong&gt;~ Deploy the CompositeResourceDefinition (XRD)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; composite-app-xrd.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;composite-app-composition.yaml&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Defines a Composition of resources that can be created by a CompositeResource.&lt;/p&gt;

&lt;p&gt;This is where we define the Composition that creates the combo of a Kubernetes Deployment running the mock “service” (&lt;code&gt;aws s3 ls&lt;/code&gt;) and the S3 bucket. The CompositeResource simply calls this resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;~ Deploy the Composition&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; composite-app-composition.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;composite-app-example.yaml&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Deploys the actual K8sApplication CompositeResource, and passes in the region in which the bucket should be created and the name of the bucket (both are also passed to the Kubernetes Deployment as environment variables that help it access the same bucket).&lt;/p&gt;

&lt;p&gt;As mentioned above, the CompositeResource calls the Composition which creates the resources using the Crossplane providers.&lt;/p&gt;

&lt;p&gt;Deploy the app by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; composite-app-example.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Look at your pretty Application
&lt;/h3&gt;

&lt;p&gt;Fetch the K8sApplication resource you’ve just created by running the below command obsessively (or just add &lt;code&gt;--watch&lt;/code&gt;) until it’s marked as &lt;code&gt;Healthy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get K8sApplication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazrulo1bvz0xoss2uezx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazrulo1bvz0xoss2uezx.png" alt="Image description" width="687" height="98"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Print the logs of the application and see it fetching the AWS S3 Bucket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;awscli
&lt;span class="c"&gt;# 2024-10-17 16:00:31 my-app-bucket-nqzhx-xzjcq&lt;/span&gt;
&lt;span class="c"&gt;# 2024-10-17 16:00:50 my-app-bucket-nqzhx-xzjcq&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Cleanup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete &lt;span class="nt"&gt;-f&lt;/span&gt; composite-app-example.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Recap
&lt;/h3&gt;

&lt;p&gt;To briefly recap what you did here:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prepared Crossplane for deploying a mix of Kubernetes and AWS resources&lt;/li&gt;
&lt;li&gt;Defined the manifests required to deploy an app built of a Deployment and an S3 Bucket&lt;/li&gt;
&lt;li&gt;Sharpened your grasp on some Crossplane concepts&lt;/li&gt;
&lt;li&gt;Discussed some use-cases for which it’s useful&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hope you enjoyed this article, and if you are interested in another article about something related (or unrelated), please convince Michael it’s a good idea at &lt;a href="mailto:michael@meteorops.com"&gt;michael@meteorops.com&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; In actual environments or production, it’s essential to fine-tune the permissions in the different manifests. Instead of using access keys and secret keys directly, consider implementing IAM Roles for Service Accounts (IRSA) to manage permissions more securely.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>crossplane</category>
    </item>
    <item>
      <title>How to build a DevOps team</title>
      <dc:creator>Michael Zion</dc:creator>
      <pubDate>Tue, 01 Oct 2024 22:27:55 +0000</pubDate>
      <link>https://forem.com/meteorops/how-to-build-a-devops-team-99</link>
      <guid>https://forem.com/meteorops/how-to-build-a-devops-team-99</guid>
      <description>&lt;p&gt;If you're considering how to build a DevOps team the best way possible, this one's for you.&lt;br&gt;&lt;br&gt;
This blog post is the result of my advice for people building DevOps teams: CTOs, VPs of R&amp;amp;D, Team Leaders, and more.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
I'll give you the bottom line. &lt;br&gt;
These are the steps for building your DevOps team:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Polish your DevOps Philosophy&lt;/li&gt;
&lt;li&gt;Understand the DevOps Lead responsibilities: Product, Project, People, Service, Architecture&lt;/li&gt;
&lt;li&gt;Define your Team's Mission Statement&lt;/li&gt;
&lt;li&gt;Set useful DevOps Goals &amp;amp; Practices&lt;/li&gt;
&lt;li&gt;Set Guiding DevOps Principles&lt;/li&gt;
&lt;li&gt;Define the DevOps team's roadmap &amp;amp; strategy&lt;/li&gt;
&lt;li&gt;Set a definition of done&lt;/li&gt;
&lt;li&gt;Avoid common DevOps team pitfalls&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.meteorops.com/blog/calculating-your-companys-required-devops-capacity" rel="noopener noreferrer"&gt;Calculate your team's required capacity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Hire the right DevOps Engineers&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also apply these to existing DevOps teams and level up your organization. &lt;br&gt;
Let's get started!&lt;/p&gt;




&lt;h2&gt;
  
  
  Polish your DevOps Philosophy
&lt;/h2&gt;

&lt;p&gt;You must first understand why organizations need DevOps, and how it can work in practice.  &lt;/p&gt;

&lt;p&gt;DevOps is an enabler role. &lt;/p&gt;

&lt;p&gt;It's meant to &lt;strong&gt;enable the developers to build, improve, and take ownership over the system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let's break it down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; = Create something new. Could be a new microservice, a new database, or a new monitoring dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve&lt;/strong&gt; = Introduce a change into something that exists. Could be fixing a bug, changing a database schema, or changing an alert threshold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take ownership&lt;/strong&gt; = Take charge of problems that arise with what you built and improved. This means when something needs improvement, the owner is the one who does it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above are what the developers should do.&lt;/p&gt;

&lt;p&gt;Now, &lt;strong&gt;DevOps should enable it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Note, to enable &lt;strong&gt;is not&lt;/strong&gt; to give full permissions and let them have fun.&lt;/p&gt;

&lt;p&gt;To enable is to give developers what they need to build, improve, and own the system.&lt;/p&gt;

&lt;p&gt;But, do it in a way that focuses the developers' energy in the right direction, and in a safe way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understand the DevOps Lead Responsibilities
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96xldgwk618fcimt5vy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F96xldgwk618fcimt5vy3.png" alt="Image description" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You have 5 hats as a DevOps Lead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Product (Platform)&lt;/strong&gt;&lt;br&gt;
Your clients are your company's developers.&lt;br&gt;
Provide them with tools, knowledge, and automations.&lt;br&gt;
&lt;strong&gt;Tools&lt;/strong&gt; = Polished automations.&lt;br&gt;
&lt;strong&gt;Automations&lt;/strong&gt; = Automated knowledge.&lt;br&gt;
&lt;strong&gt;Knowledge&lt;/strong&gt; = Hard-earned information &amp;amp; insights.&lt;br&gt;
Understand their requirements, and use Tools, Automations, and Knowledge to fulfill them.&lt;br&gt;
Understand why they want what they want: Is it because of an underlying issue with the system? If yes, solve it before building anything.&lt;br&gt;
If a tool or automation will save time, consider building or implementing it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Service&lt;/strong&gt;&lt;br&gt;
Don't let your developers wait until you do something "the right way".&lt;br&gt;
Sometimes they need immediate help to complete something.&lt;br&gt;
Help them first, and invest time later in automation and tooling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Project&lt;/strong&gt;&lt;br&gt;
Managing your DevOps team requires managing the work it does.&lt;br&gt;
Turn the team's philosophy into a mission.&lt;br&gt;
Turn the team's mission into goals.&lt;br&gt;
Turn the team's goals into a roadmap.&lt;br&gt;
Turn the team's roadmap into tasks.&lt;br&gt;
Prioritize the tasks.&lt;br&gt;
Set simple roles and responsibilities.&lt;br&gt;
Hold your team members accountable to progression.&lt;br&gt;
Make it easy to inform relevant team members on updates.&lt;br&gt;
Make it easy to consult with a teammate on a subject of their expertise.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;People&lt;/strong&gt;&lt;br&gt;
Each DevOps team member has types of work and tasks they enjoy more.&lt;br&gt;
One enjoys sharing knowledge, another enjoys being of service, and some enjoy building tools.&lt;br&gt;
Different teammates are also interested in different subjects: Monitoring, Infrastructure, CI/CD, etc.&lt;br&gt;
A team member that's working on what they enjoy is more fulfilled and productive.&lt;br&gt;
Strive to overlap each teammate's goals with the company's goals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
You have 2 architectures to worry about.&lt;br&gt;
1) The company's product architecture (built by the developers).&lt;br&gt;
2) The DevOps platform's architecture (built by the DevOps team).&lt;br&gt;
Help the developers understand and control the application's effect on the infrastructure.&lt;br&gt;
Enable the developers to make informed decisions by providing context.&lt;br&gt;
Finally, build the platform to support the developers’ requirements.&lt;br&gt;
At every step, only limit decisions that damage the company.&lt;br&gt;
Examples: High cost, Stability impairment, Restricts observability, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Define Your DevOps Team's Mission Statement
&lt;/h2&gt;

&lt;p&gt;Here's one for you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Enable the developers to build, improve, and own the system".&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's pretty minimal, so it's going to help your team stay focused.&lt;/p&gt;

&lt;p&gt;A healthy sign it catches on: When the team debates a decision regarding a task, they ask &lt;em&gt;"Does it help us enable the developers to build, improve, or own the system?"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;That's when you win.&lt;/p&gt;




&lt;h2&gt;
  
  
  Set Useful DevOps Goals &amp;amp; Practices
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkeqv0n8ruw33on4ubi6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkeqv0n8ruw33on4ubi6.png" alt="Image description" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your company's success will be determined by its speed and its product's quality.&lt;/p&gt;

&lt;p&gt;To make it happen, you should know programming is science, not maths - I'll explain.&lt;/p&gt;

&lt;p&gt;People used to think programming would be a mathematical discipline.&lt;br&gt;
They thought programmers would write functions and mathematically prove them.&lt;/p&gt;

&lt;p&gt;Not what happened - Programming is a scientific discipline.&lt;/p&gt;

&lt;p&gt;You write code, you test it, and you assume it's good - until one test fails.&lt;/p&gt;

&lt;p&gt;In essence, you're experimenting.&lt;/p&gt;

&lt;p&gt;You might ask: &lt;em&gt;"wtf? how's this related to setting useful goals?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The answer is that the first thing you want to enable is running experiments easily.&lt;/p&gt;

&lt;h3&gt;
  
  
  So here are some useful goals:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Developers can easily test their code in a consistent manner&lt;/li&gt;
&lt;li&gt;Production and testing environments are identical (Production will benefit from the quality of the tests)&lt;/li&gt;
&lt;li&gt;Developers can easily collaborate&lt;/li&gt;
&lt;li&gt;Developers can understand the state of the system and the impact of changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Let's translate those goals into smaller goals or practices that will achieve the goals:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developers can &lt;a href="https://www.meteorops.com/blog/one-click-environment-the-ultimate-devops-goal" rel="noopener noreferrer"&gt;create a testing/production environment with "One-Click"&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;There's a continuous integration process that enables the developers to collaborate by agreeing on the current up-to-date version of the system&lt;/li&gt;
&lt;li&gt;Auto-create dashboards and alerts for new services&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Set Guiding DevOps Principles
&lt;/h2&gt;

&lt;p&gt;Some useful DevOps principles that help save your team time, improve the speed of delivery, and keep the system healthy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;PoC before doing things "the right way"&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Make it work, then make it better&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The entire system should be fully recoverable from Git&lt;/li&gt;
&lt;li&gt;Use tools with a big community and well-documented interface&lt;/li&gt;
&lt;li&gt;Equip key-developers with DevOps knowledge to be the first point-of-contact for their team members (super users)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Define the DevOps Team's Roadmap &amp;amp; Strategy
&lt;/h2&gt;

&lt;p&gt;Roadmap = Goals * Strategy.&lt;br&gt;&lt;br&gt;
Once you set the goals (as mentioned above), you can start prioritizing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;♯1 - The DevOps Categories&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enabling developers requires a DevOps Engineer to handle and enable the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Provision infrastructure&lt;/li&gt;
&lt;li&gt;Deploy workloads&lt;/li&gt;
&lt;li&gt;Monitor the system&lt;/li&gt;
&lt;li&gt;Recover from issues&lt;/li&gt;
&lt;li&gt;Scale up and down&lt;/li&gt;
&lt;li&gt;Track &amp;amp; test changes (Codebase Management)&lt;/li&gt;
&lt;li&gt;Secure the system&lt;/li&gt;
&lt;li&gt;Store &amp;amp; retrieve data&lt;/li&gt;
&lt;li&gt;Configure the system&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;♯2 - Examine each goal through each category&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every DevOps goal you set should be examined through the lens of each category.&lt;/p&gt;

&lt;p&gt;The reason is that together the categories cover each aspect of the building, improving, and owning of a software operation.&lt;/p&gt;

&lt;p&gt;Let's do an example:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Goal: Create a "One-Click Environment Automation"&lt;/li&gt;
&lt;li&gt;Categories to address:

&lt;ul&gt;
&lt;li&gt;How should its infrastructure be provisioned?&lt;/li&gt;
&lt;li&gt;How should its workloads be deployed?&lt;/li&gt;
&lt;li&gt;How should its metrics and logs be sent, stored, and queried?&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;♯3 - Strategic Principles&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reach at least 50% capacity for working on the DevOps goals as soon as you responsibly can -
Support the developers and teach them how to self-support to achieve that&lt;/li&gt;
&lt;li&gt;Easy to modify &amp;gt; Perfect -
When you do something that isn't perfect due to a lack of time, do it in such a way that modifying and improving it later is easy&lt;/li&gt;
&lt;li&gt;Prerequisites first:
Codebase Management -&amp;gt; Infrastructure -&amp;gt; Deployment -&amp;gt; Configuration -&amp;gt; Data Management&lt;/li&gt;
&lt;li&gt;Build at least moderate foundations quickly and reinforce them later, but never settle for weak foundations:
If you don't use any boilerplates, or you are not proficient in something early on, and it's a foundation (like infrastructure), don't give up and build weak foundations, but also don't over-invest in building strong ones if it's too time-consuming.
Instead, moderately invest in the foundation, and revisit it later.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Set a Definition of Done
&lt;/h2&gt;


&lt;p&gt;Ask the following questions for every single component in your system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/em&gt;: Are there metrics, logs, traces, and alerts set up in an actionable way?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/em&gt;: Is there a mechanism to keep it alive during incidents?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Resiliency&lt;/strong&gt;&lt;/em&gt;: Can it recover from an error quickly?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Recovery&lt;/strong&gt;&lt;/em&gt;: Can it be fully restored to a previous state?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Testability&lt;/strong&gt;&lt;/em&gt;: Is it possible to test changes to it?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Deliverability&lt;/strong&gt;&lt;/em&gt;: Is there a process to release changes to it?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Persistency&lt;/strong&gt;&lt;/em&gt;: Will its data persist if the system is hindered?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Integrability&lt;/strong&gt;&lt;/em&gt;: Does it have a consistent and predictable interface allowing integration with it?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/em&gt;: Is it accessible only by the parts of the system that absolutely need it?&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/em&gt;: Are its dependencies fully tracked and managed?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Common DevOps Team Pitfalls
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;&lt;/em&gt;: DevOps work blocks developers' work&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Indicator&lt;/strong&gt;&lt;/em&gt;: Developers need to wait for DevOps team changes to complete before continuing their work&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Cause&lt;/strong&gt;&lt;/em&gt;: The DevOps team doesn't utilize its own practices (the shoemaker's children go barefoot)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;&lt;/em&gt;: Only support developers and maintain the system&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Indicator&lt;/strong&gt;&lt;/em&gt;: No progress on any DevOps goal or task&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Cause&lt;/strong&gt;&lt;/em&gt;: Either there are no clear DevOps goals, or recurring developer requests haven't been automated&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pitfall&lt;/strong&gt;&lt;/em&gt;: Adopting 'Best' practices instead of 'Suitable' practices&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Indicator&lt;/strong&gt;&lt;/em&gt;: Introducing methodologies and adhering to principles that go against the company's goals&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Cause&lt;/strong&gt;&lt;/em&gt;: Prioritizing methodology over company goals, usually because of a disconnect from the company goals or due to lack of DevOps experience&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Calculate Your DevOps Team's Required Capacity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1tsndj0izl6nvyzlsre.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1tsndj0izl6nvyzlsre.png" alt="Image description" width="800" height="584"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Required DevOps Capacity = &lt;a href="https://www.meteorops.com/blog/calculating-your-companys-required-devops-capacity" rel="noopener noreferrer"&gt;(Scale * Complexity) / Leverage&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Leverage = Level of DevOps Engineers * Company Resources * Team Focus&lt;/li&gt;
&lt;li&gt;Scale = Number of instances of each component * Number of people in the organization&lt;/li&gt;
&lt;li&gt;Complexity = Number of components * Number of teams&lt;/li&gt;
&lt;/ul&gt;
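&lt;p&gt;As a worked example of the formula above, with made-up numbers (all values are illustrative, not from the article):&lt;/p&gt;

```python
# Illustrative numbers only -- plug in your own estimates.
scale = 20 * 50            # 20 instances per component * 50 people in the organization
complexity = 30 * 5        # 30 components * 5 teams
leverage = 3 * 2 * 2       # engineer level * company resources * team focus (unitless scores)

required_capacity = (scale * complexity) / leverage
print(required_capacity)   # relative score, useful for comparing scenarios, not headcount
```

The absolute number is meaningless on its own; it becomes useful when you recompute it after a change (say, doubling team focus) and compare the two scores.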




&lt;h2&gt;
  
  
  Hire the Right DevOps Engineers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsk1skmru0gl08c1sy7a6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsk1skmru0gl08c1sy7a6.png" alt="devops engineers feedback" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The types of DevOps Engineers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Barrels vs. Ammo&lt;/strong&gt;&lt;br&gt;
~ Ammo = People who can complete tasks but won't initiate them.&lt;br&gt;
~ Barrel = People who identify the tasks that are needed next but rely on others to complete them.&lt;br&gt;
Interviewer tip: During the technical interview, ask the candidate about past DevOps accomplishments, how the tasks were created, and who did them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Aspiration-Oriented vs. Prevention-Oriented&lt;/strong&gt;&lt;br&gt;
~ Aspiration-Oriented = Has goals; positive feedback encourages and focuses them, while negative feedback discourages them and kicks them off track.&lt;br&gt;
~ Prevention-Oriented = Avoids problems; positive feedback makes them relax and reduce effort, while negative feedback focuses them and keeps them on track.&lt;br&gt;
Interviewer tip: See if there's a common theme in the candidate's past projects. People focused on security tasks and deep attention to specific details are more likely to be prevention-oriented, while people who initiated many projects spanning multiple (DevOps) categories are more likely to be aspiration-oriented.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://www.meteorops.com/blog/devops-agency-vs-devops-consultancy-vs-devops-services-company" rel="noopener noreferrer"&gt;Working with service providers&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set clear desired results&lt;/strong&gt;&lt;br&gt;
And let the DevOps service provider assist in exploring the goal and provide perspective from other companies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expect transparency on progress&lt;/strong&gt;&lt;br&gt;
And judge it against the latest plan.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expect clear planning&lt;/strong&gt;&lt;br&gt;
And make sure it has clear goals, takes into consideration the risks, and has a strategy that adheres to your DevOps principles.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;We covered a lot in this one-pager, and still left much out.&lt;/p&gt;

&lt;p&gt;If you take away one thing from it, let it be this: a simple DevOps team mission statement is the most significant thing you need.&lt;/p&gt;

&lt;p&gt;It sounds over-simplistic and abstract, but without it there is no guiding principle for what DevOps in the organization should look like.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed, and send me an email at &lt;a href="mailto:michael@meteorops.com"&gt;michael@meteorops.com&lt;/a&gt; if there's anything else you'd like to see in here.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>teamwork</category>
      <category>infrastructureascode</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Deploy AWS Resources using Crossplane on Kubernetes</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Wed, 18 Sep 2024 23:50:11 +0000</pubDate>
      <link>https://forem.com/meteorops/deploy-aws-resources-using-crossplane-on-kubernetes-39i1</link>
      <guid>https://forem.com/meteorops/deploy-aws-resources-using-crossplane-on-kubernetes-39i1</guid>
<description>&lt;p&gt;In this article we discuss Crossplane, an Infrastructure as Code (IaC) tool that runs on Kubernetes: why you should use it, and how to configure the AWS provider to start creating resources. We will walk through a full step-by-step example so you can create your first resource with Crossplane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who is this article for?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DevOps engineers interested in learning another IaC tool&lt;/li&gt;
&lt;li&gt;Developers who want to take on more Ops responsibility and provision their own infrastructure&lt;/li&gt;
&lt;li&gt;Engineering managers who are looking to implement an IaC tool in their company/startup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why am I writing this?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve had discussions with engineers who had trouble getting started with Crossplane. It can be a little less straightforward than a well-established tool like Terraform: some documentation isn’t precise for different use cases and providers, and even ChatGPT’s code doesn’t always work. So here I am, making your life easier with a step-by-step guide where you install and configure everything and deploy your first AWS resource using Crossplane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why should you even use Crossplane then?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are certain use cases where Crossplane provides very powerful capabilities because it can create both applications and cloud resources. These can be used for ephemeral environments, for example, or to let a SaaS company provide full environments that tenants can create on their own. Such environments are created by simply applying a Kubernetes manifest, which is much simpler than running traditional IaC plan and apply commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How this article works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We prepared a &lt;a href="https://github.com/MeteorOps/crossplane-aws-provider-bootstrap" rel="noopener noreferrer"&gt;repository with resources to deploy everything&lt;/a&gt; needed.&lt;/li&gt;
&lt;li&gt;We explain what each resource does.&lt;/li&gt;
&lt;li&gt;We walk you through how to deploy Crossplane.&lt;/li&gt;
&lt;li&gt;We deploy an S3 bucket to make sure everything works.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the repository and step into it&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/MeteorOps/crossplane-aws-provider-bootstrap.git
cd crossplane-aws-provider-bootstrap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;2. Make sure you have the required CLIs:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;Install the AWS CLI &amp;amp; Authenticate it with your AWS Account&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pwittrock.github.io/docs/tasks/tools/install-kubectl/" rel="noopener noreferrer"&gt;Install the Kubectl CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://helm.sh/docs/intro/install/" rel="noopener noreferrer"&gt;Install the Helm CLI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;An existing Kubernetes cluster (&lt;a href="https://kind.sigs.k8s.io/docs/user/quick-start/" rel="noopener noreferrer"&gt;we’ll be using &lt;em&gt;kind&lt;/em&gt;&lt;/a&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Optional: Start a local kind cluster&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install kind

open /Applications/Docker.app  # kind needs Docker running before it can create a cluster

kind create cluster

kubectl cluster-info --context kind-kind
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Repository Overview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Link to the Github Repository: &lt;a href="https://github.com/MeteorOps/crossplane-aws-provider-bootstrap.git" rel="noopener noreferrer"&gt;https://github.com/MeteorOps/crossplane-aws-provider-bootstrap.git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;creds&lt;/code&gt; file:
AWS credentials - should be filled with your own AWS credentials&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;crossplane-provider-conf&lt;/code&gt; file:
Uses the creds file to create a Crossplane ProviderConfig (separated into a different file because it takes time for this resource to become ready)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;crossplane-provider-bootstrap&lt;/code&gt; file:
Creates the Crossplane AWS Provider, which enables creating AWS resources using Crossplane, and its dependencies: ServiceAccount, DeploymentRuntimeConfig, Provider, ClusterRole &amp;amp; ClusterRoleBindings&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bucket-definitions&lt;/code&gt; &amp;amp; &lt;code&gt;bucket-crd&lt;/code&gt; files:
The Kubernetes Crossplane manifests that create a CompositeResourceDefinition and a Composition resource, which together define how to create an S3 bucket (like a Terraform module would).
Note: The ‘Composition’ resource relies on the ‘CompositeResourceDefinition’.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bucket-example&lt;/code&gt; file:
The Kubernetes Crossplane manifest we’ll apply at the end to create an S3 bucket using Crossplane&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Deploy Crossplane
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Fill the creds file with your AWS access keys&lt;/strong&gt;&lt;br&gt;
Get the access keys of an AWS IAM user (not an SSO user, as that requires a session token to work) and fill them into the creds file&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NOTE&lt;/strong&gt;: for production usage, please create a Crossplane IAM user and use its access keys, or preferably use something like IRSA&lt;/li&gt;
&lt;/ul&gt;
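&lt;p&gt;The creds file follows the standard AWS credentials file format; a sketch with placeholder values (never commit real keys):&lt;/p&gt;

```ini
[default]
aws_access_key_id = <YOUR_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_SECRET_ACCESS_KEY>
```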

&lt;p&gt;&lt;strong&gt;2. Deploy the Crossplane Helm Chart&lt;/strong&gt;&lt;br&gt;
Add the Helm repository from which the Crossplane Helm Charts will be fetched&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add crossplane-stable https://charts.crossplane.io/stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Deploy Crossplane on your Kubernetes cluster in a new namespace named &lt;code&gt;crossplane-system&lt;/code&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install crossplane crossplane-stable/crossplane --namespace crossplane-system --create-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;3. Examine your Crossplane Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command to get Crossplane's pods:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods -n crossplane-system&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should then see two pods: &lt;code&gt;crossplane&lt;/code&gt; and &lt;code&gt;crossplane-rbac-manager&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxqutjua7awsf5f05pe5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxqutjua7awsf5f05pe5.png" alt="Image description" width="512" height="54"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Provide Crossplane AWS access by creating a Kubernetes Secret&lt;/strong&gt;&lt;br&gt;
Insert your AWS credentials into the creds file and run the following from the same folder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret generic aws-credentials -n crossplane-system --from-file=creds=./creds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Make sure the secret was created as expected:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get secret aws-credentials -n crossplane-system&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should see the aws-credentials secret:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0fmmxtlfwbfnuhbedw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg0fmmxtlfwbfnuhbedw1.png" alt="Image description" width="284" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Deploy the Crossplane AWS Provider&lt;/strong&gt;&lt;br&gt;
Creating a Crossplane AWS Provider requires creating several resources: ServiceAccount, DeploymentRuntimeConfig, Provider, ClusterRole &amp;amp; ClusterRoleBindings, and ProviderConfig.&lt;br&gt;
We divided the resource creation into two phases:&lt;/p&gt;

&lt;p&gt;1. &lt;code&gt;crossplane-provider-bootstrap.yaml&lt;/code&gt;:&lt;br&gt;
ServiceAccount, DeploymentRuntimeConfig, Provider, ClusterRole &amp;amp; ClusterRoleBindings&lt;/p&gt;

&lt;p&gt;2. &lt;code&gt;crossplane-provider-conf.yaml&lt;/code&gt;:&lt;br&gt;
ProviderConfig&lt;/p&gt;

&lt;p&gt;The reason for dividing it into two phases is that creating the ProviderConfig fails if we attempt it before the first set of Provider resources and dependencies is ready.&lt;/p&gt;
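&lt;p&gt;For reference, a ProviderConfig that points at the secret we created typically looks roughly like this. This is a sketch assuming the Upbound AWS provider family; the &lt;code&gt;apiVersion&lt;/code&gt; may differ between provider releases:&lt;/p&gt;

```yaml
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-credentials
      key: creds
```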

&lt;p&gt;&lt;strong&gt;Create the Provider Kubernetes resources using the bootstrap YAML file:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f crossplane-provider-bootstrap.yaml&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Validate the Provider's creation &amp;amp; wait for it to become ready:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command to see the AWS Provider resource:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl get provider&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0bkbbbz0lj0fdz3rp84.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0bkbbbz0lj0fdz3rp84.png" alt="Image description" width="284" height="35"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might take 1-2 minutes to become Healthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the ProviderConfig resource &amp;amp; Validate its creation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f crossplane-provider-conf.yaml &amp;amp;&amp;amp; kubectl get providerconfig&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f7xtqpzju4148lkpk93.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0f7xtqpzju4148lkpk93.png" alt="Image description" width="135" height="38"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create an S3 Bucket using Crossplane
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create the CompositeResourceDefinition to define an S3 bucket:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f bucket-definitions.yaml&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve1toh0l0zuhuodmh73q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve1toh0l0zuhuodmh73q.png" alt="Image description" width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the Composition to define an S3 bucket:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Run the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl apply -f bucket-crd.yaml&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yzsqfsbk6uutpziqte4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yzsqfsbk6uutpziqte4.png" alt="Image description" width="800" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create the S3 Bucket Crossplane resource in Kubernetes:&lt;/strong&gt; &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f bucket-example.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When we installed the AWS Provider, it also installed some of the provider's Crossplane CRDs.&lt;br&gt;
One of those CRDs is &lt;code&gt;Bucket&lt;/code&gt;.&lt;/p&gt;
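&lt;p&gt;For comparison, creating a bucket directly through that CRD (without the repository's Composition) looks roughly like this sketch; the bucket name and region are placeholders:&lt;/p&gt;

```yaml
apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: my-example-bucket   # placeholder; S3 bucket names must be globally unique
spec:
  forProvider:
    region: us-east-1
  providerConfigRef:
    name: default            # the ProviderConfig we created earlier
```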

&lt;p&gt;Now we can check whether the bucket was created by running &lt;code&gt;kubectl get bucket&lt;/code&gt; against our Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyltq64j5y2gy8pmhp33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyltq64j5y2gy8pmhp33.png" alt="Image description" width="346" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check if the S3 Bucket was created in AWS:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;List your AWS S3 buckets and search for the newly created one:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;aws s3 ls&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5beu0dr209l46wnopggm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5beu0dr209l46wnopggm.png" alt="Image description" width="315" height="24"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Teardown &amp;amp; Cleanup
&lt;/h2&gt;

&lt;p&gt;We’ll start by deleting the S3 bucket Crossplane resource in Kubernetes, which will delete the underlying resource in AWS.&lt;br&gt;
Then, if we used &lt;code&gt;kind&lt;/code&gt; to spin up a local Kubernetes cluster, we’ll delete the cluster to keep our workstation nice and clean.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delete the S3 Bucket resource:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete -f bucket-example.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you used kind, delete the cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind delete cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Useful Debugging Commands
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check the AWS Provider's installation and health status
kubectl get provider

# Core Crossplane logs
kubectl logs -n crossplane-system deploy/crossplane -c crossplane

# AWS Provider logs
kubectl logs -n crossplane-system -l pkg.crossplane.io/provider=provider-aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>crossplane</category>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deploy Apache Airflow on AWS Elastic Kubernetes Service (EKS)</title>
      <dc:creator>Saksham Awasthi</dc:creator>
      <pubDate>Fri, 23 Aug 2024 11:44:55 +0000</pubDate>
      <link>https://forem.com/meteorops/deploy-apache-airflow-on-aws-elastic-kubernetes-service-eks-5406</link>
      <guid>https://forem.com/meteorops/deploy-apache-airflow-on-aws-elastic-kubernetes-service-eks-5406</guid>
      <description>&lt;p&gt;It’s not trivial to run your data pipelines smoothly.&lt;/p&gt;

&lt;p&gt;Apache Airflow is an excellent option, as it has many features and integrations, but it isn't perfect: making its infrastructure scalable requires a lot of heavy lifting.&lt;/p&gt;

&lt;p&gt;That’s where deploying Apache Airflow on Kubernetes comes in. It enables you to orchestrate multiple DAGs in parallel on multiple types of machines, leverage Kubernetes to autoscale nodes, monitor the pipelines, and distribute the processing. &lt;/p&gt;

&lt;p&gt;This guide will help you prepare your EKS environment, deploy Airflow, and integrate it with essential add-ons. You will also get a few suggestions for making your Airflow deployment production-grade. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you deploy Apache Airflow, ensure you have all the prerequisites: eksctl, kubectl, helm, and an EKS Cluster.&lt;/p&gt;

&lt;p&gt;We’ll be using eksctl to create the EKS Cluster, but feel free to skip it if you already have one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up the AWS &amp;amp; eksctl CLIs
&lt;/h3&gt;

&lt;p&gt;1. Install the AWS CLI: (skip to step 2 if you already have the CLI installed)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please refer to the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" rel="noopener noreferrer"&gt;full AWS installation guide&lt;/a&gt; for other operating systems and architectures.&lt;/p&gt;

&lt;p&gt;Once installed, configure the AWS CLI on your local machine. Refer to this &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html" rel="noopener noreferrer"&gt;AWS guide about configuring the CLI locally.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. Install the eksctl CLI: (skip to step 3 if you already have eksctl installed)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --location "https://github.com/weaveworks/eksctl/releases/download/0.104.0/eksctl_Linux_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also refer to the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html" rel="noopener noreferrer"&gt;eksctl installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create the AWS EKS (Elastic Kubernetes Service) Cluster
&lt;/h3&gt;

&lt;p&gt;Create an EKS cluster: (skip to the next section if you already have a cluster)&lt;br&gt;
You can create an EKS cluster directly from the AWS Management Console or using the &lt;code&gt;eksctl create cluster&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;Run the below command to create an EKS cluster in a public subnet in the Oregon region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eksctl create cluster --name airflow-cluster --region us-west-2 --nodegroup-name standard-workers --node-type t3.medium --nodes 3 --nodes-min 1 --nodes-max 4 --managed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find a detailed guide on &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html" rel="noopener noreferrer"&gt;setting up an EKS Cluster&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect to the EKS Cluster from your local machine
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Install kubectl on your local machine using the following commands:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
kubectl version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to the &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html" rel="noopener noreferrer"&gt;AWS kubectl &amp;amp; eksctl configuration guide&lt;/a&gt; for other operating systems and architectures.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After setting up your cluster, connect to it from your local machine. The command below updates your kubeconfig file:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws eks --region us-west-2 update-kubeconfig --name airflow-cluster

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Set Up Helm Locally
&lt;/h3&gt;

&lt;p&gt;Run the below command to install Helm on your local machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Refer to the &lt;a href="https://helm.sh/docs/intro/install/" rel="noopener noreferrer"&gt;Helm installation guide&lt;/a&gt; for other operating systems and architectures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Support Dynamic Volume Provisioning for Persistent Storage using EBS
&lt;/h3&gt;

&lt;p&gt;For an elastic, scalable service, dynamic volume provisioning is preferred, so persistent storage must be configured and registered. &lt;/p&gt;

&lt;p&gt;Follow this guide to set up &lt;a href="https://medium.com/@ivan.katliarchuk/k8s-dynamic-provisioning-of-persistent-volumes-on-aws-449902f9c69e" rel="noopener noreferrer"&gt;Amazon EBS CSI Driver Add-On and Dynamic Volume&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Airflow?
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is an open-source platform for scheduling and orchestrating data pipelines and workflows. In simple terms, Apache Airflow is an ETL/ELT orchestration tool. &lt;/p&gt;

&lt;p&gt;You can &lt;strong&gt;create, schedule, and monitor&lt;/strong&gt; complex workflows in Apache Airflow. &lt;/p&gt;

&lt;p&gt;You can connect multiple data sources with Airflow and send pipeline success or failure alerts via Slack or email. In Airflow, you define workflows in Python, represented as &lt;strong&gt;Directed Acyclic Graphs (DAGs)&lt;/strong&gt;. Airflow can be deployed anywhere, and after deploying it, you can access the Airflow UI and set up workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use cases of Airflow:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data ETL Automation&lt;/strong&gt;: Streamline the extraction, transformation, and loading of data from various sources into storage systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Processing&lt;/strong&gt;: Schedule and oversee tasks like data cleansing, aggregation, and enrichment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Migration&lt;/strong&gt;: Manage data transfer between different systems or cloud platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training&lt;/strong&gt;: Automate the training of machine learning models on large datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reporting&lt;/strong&gt;: Generate and distribute reports and analytics dashboards automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Automation&lt;/strong&gt;: Coordinate complex processes with multiple dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT Data&lt;/strong&gt;: Analyze and process data from IoT devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Monitoring&lt;/strong&gt;: Track workflow progress and receive alerts for issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benefits of using Airflow in Kubernetes
&lt;/h3&gt;

&lt;p&gt;Deploying Apache Airflow on a Kubernetes cluster offers several advantages over deploying it on a virtual machine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Kubernetes allows you to scale your Airflow deployment horizontally by adding more pods to handle increased workloads automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt;: Enables running different tasks of the same pipeline on various cluster nodes by deploying each task as an isolated pod.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Kubernetes native capabilities, like auto-scaling, self-healing, and rolling updates, reduce manual intervention, improving operational efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Portability&lt;/strong&gt;: Deploying on Kubernetes makes your Airflow setup more portable across different environments, whether on-premise or cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration&lt;/strong&gt;: Kubernetes integrates seamlessly with various tools for monitoring, logging, and security, enhancing the overall management of your Airflow deployment.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Airflow Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszvletxbttveuut5lz39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszvletxbttveuut5lz39.png" alt="Image description" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The core Airflow components are the Executor, Scheduler, Web Server, and metadata database; the Worker and Triggerer are also involved.&lt;/li&gt;
&lt;li&gt;As shown in the diagram above, the Data Engineer writes Airflow DAGs. A DAG is a collection of tasks, defined in a Python file, that specifies the dependencies between the tasks and the order in which they execute.&lt;/li&gt;
&lt;li&gt;The Scheduler picks up these DAGs and orchestrates the runs of the tasks they define.&lt;/li&gt;
&lt;li&gt;In the above diagram, the Scheduler runs tasks using Kubernetes Executor and creates a separate pod for every task, which provides isolation. &lt;/li&gt;
&lt;li&gt;Airflow also stores pipeline metadata in an external database. The main configuration file used by the Web server, Scheduler, and workers is airflow.cfg. &lt;/li&gt;
&lt;li&gt;The Data Engineer can view the entire flow through the Airflow UI. Users can also check the logs, monitor the pipelines, and set alerts.&lt;/li&gt;
&lt;/ol&gt;
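&lt;p&gt;Conceptually, the Scheduler derives a valid execution order from the dependencies declared in a DAG, which amounts to a topological sort. The toy sketch below illustrates the idea; it is not Airflow's actual scheduling code, and the task names are made up:&lt;/p&gt;

```python
# Conceptual sketch of how a scheduler derives a run order from DAG
# dependencies (a topological sort). Illustrative only.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (which must run first).
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```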

&lt;h2&gt;
  
  
  Airflow Deployment Options
&lt;/h2&gt;

&lt;p&gt;When deploying Apache Airflow, there are multiple approaches to consider, each with unique advantages and challenges. Here are the common options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Amazon Managed Workflows for Apache Airflow (MWAA)&lt;/strong&gt;:&lt;br&gt;
Configure the service through the AWS Management Console, where you can define your environment, set up the necessary permissions, and integrate with other AWS services.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google Cloud Composer&lt;/strong&gt;:&lt;br&gt;
Create an environment using the Google Cloud Console and integrate with Google Cloud services like BigQuery and Google Cloud Storage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Azure Data Factory with Airflow Integration&lt;/strong&gt;:&lt;br&gt;
Configure Airflow through the Azure Portal and integrate with other Azure services for workflow automation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-hosted on AWS EC2&lt;/strong&gt;:&lt;br&gt;
Launch and configure EC2 instances, install Airflow, set up the environment, configure the database, and run the scheduler yourself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Running on Kubernetes (e.g., AWS EKS)&lt;/strong&gt;:&lt;br&gt;
Create a Kubernetes cluster, deploy Airflow using Helm charts or custom YAML manifests, and let Kubernetes manage container orchestration and scaling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of these options, this article focuses on deploying Airflow on Amazon EKS, which we walk through in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploy Airflow on AWS EKS
&lt;/h2&gt;

&lt;p&gt;Let us install Apache Airflow in the EKS cluster using the official Helm chart.&lt;/p&gt;

&lt;p&gt;1.Create a new namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create namespace airflow 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj1r6z2ny1kgw01mmjge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmj1r6z2ny1kgw01mmjge.png" alt="Image description" width="800" height="96"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Add the Helm chart repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add apache-airflow https://airflow.apache.org 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.Update your Helm repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.Deploy Airflow using the remote Helm Chart&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm install airflow apache-airflow/airflow --namespace airflow   --debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output includes the Airflow webserver and default Postgres connection credentials. Copy and save them; you will need them to log in later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal8mx385waj0x59bsifv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fal8mx385waj0x59bsifv.png" alt="Image description" width="800" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5.Examine the deployments by getting the Pods&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubectl get pods -n airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dnjgow0cray5hvuwsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4dnjgow0cray5hvuwsh.png" alt="Image description" width="800" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Airflow instance is set up in EKS. All the Airflow pods should be running.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s prepare Airflow to run our first DAG
&lt;/h3&gt;

&lt;p&gt;At this point, Airflow is deployed with the default configuration. Let's see how to fetch the chart's default values to our local machine, modify them, and roll out a new release.&lt;/p&gt;

&lt;p&gt;1.Save the configuration values from the helm chart by running the below command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm show values apache-airflow/airflow &amp;gt; values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhblxu90opcjplkhonim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhblxu90opcjplkhonim.png" alt="Image description" width="800" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This command generates a file named &lt;code&gt;values.yaml&lt;/code&gt; in your current directory, which you can modify and save as needed.&lt;/p&gt;

&lt;p&gt;2.Check the release version of the helm chart by running the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm ls -n airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu53r6dyskahz4bsz36y5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu53r6dyskahz4bsz36y5.png" alt="Image description" width="800" height="48"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Let us add the ingress configuration to access the Airflow instance over the internet.&lt;/p&gt;

&lt;p&gt;We first need to deploy an ingress controller in the EKS cluster. The commands below install the NGINX ingress controller from its Helm repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install nginx-ingress ingress-nginx/ingress-nginx --namespace airflow-ingress --create-namespace --set controller.replicaCount=2
kubectl get pods -n airflow-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fags2uy91fsd3gkm5jher.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fags2uy91fsd3gkm5jher.png" alt="Image description" width="800" height="79"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note - All the pods should be running.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get service nginx-ingress-controller --namespace airflow-ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for the external IP in the output of the get service command.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9ehvyw889ztv1puh7vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9ehvyw889ztv1puh7vo.png" alt="Image description" width="800" height="33"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After installing the ingress controller, add the required configuration in the values.yaml file and save the file. There is a section dedicated to the ingress configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Ingress configuration
ingress:
  enabled: true
  web:
    enabled: true
    annotations: {}
    path: "/"
    pathType: "ImplementationSpecific"
    host: &amp;lt;your domain URL&amp;gt;
    ingressClassName: "nginx"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After changing the values in the values.yaml file, we run the helm upgrade command to deploy the changes and create a new release revision.&lt;/p&gt;

&lt;p&gt;By default, the Helm Chart deploys its own Postgres instance, but using a managed Postgres instance is recommended instead.&lt;/p&gt;

&lt;p&gt;You can modify the Helm Chart’s &lt;code&gt;values.yaml&lt;/code&gt; file to add the configuration for the managed database and volumes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metadataConnection:
             user: postgres
             pass: postgres
             protocol: postgresql
             host: &amp;lt;YOUR_POSTGRES_HOST&amp;gt;
             port: 5432
             db: postgres
             sslmode: disable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
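&lt;p&gt;Under the hood, Airflow combines these &lt;code&gt;metadataConnection&lt;/code&gt; fields into a single SQLAlchemy-style connection URI. The sketch below shows the general shape of that URI; the host value is a made-up stand-in for your managed Postgres endpoint:&lt;/p&gt;

```python
# Build the connection URI that the metadataConnection fields above
# map to. The host is a hypothetical placeholder.
conn = {
    "protocol": "postgresql",
    "user": "postgres",
    "password": "postgres",
    "host": "mydb.example.com",  # stand-in for YOUR_POSTGRES_HOST
    "port": 5432,
    "db": "postgres",
}

uri = (
    f"{conn['protocol']}://{conn['user']}:{conn['password']}"
    f"@{conn['host']}:{conn['port']}/{conn['db']}"
)
print(uri)  # postgresql://postgres:postgres@mydb.example.com:5432/postgres
```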



&lt;p&gt;Run the helm upgrade command to apply the changes made above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install airflow apache-airflow/airflow -n airflow -f values.yaml --debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful1wb8nlf1t8c3g3475n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ful1wb8nlf1t8c3g3475n.png" alt="Image description" width="800" height="73"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check the release version after the above command is run successfully. You should observe that the revision has changed to 2.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7imf5dmeyv37a8g7hg9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm7imf5dmeyv37a8g7hg9.png" alt="Image description" width="800" height="53"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Accessing Airflow UI
&lt;/h2&gt;

&lt;p&gt;We will use port-forwarding to access the Airflow UI in this tutorial. Run the below command and access “localhost:8080” on the browser.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl port-forward svc/airflow-webserver 8080:8080 -n airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc17rbs3haoarv7sywtkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc17rbs3haoarv7sywtkr.png" alt="Image description" width="800" height="62"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the default webserver credentials you saved earlier when installing the Airflow Helm chart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkrcabwek6dyvhdrn28w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkrcabwek6dyvhdrn28w.png" alt="Image description" width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, Airflow is set up and is accessible. Hurray 😀&lt;/p&gt;

&lt;p&gt;You can also access the UI over your domain, which is added in the ingress configuration in the above section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create your first Airflow DAG (in Git)
&lt;/h2&gt;

&lt;p&gt;No DAGs have been added to our Airflow deployment yet. Let us see how we can add them.&lt;/p&gt;

&lt;p&gt;To set up a private GitHub repository for your DAGs, create a new repository from the &lt;a href="https://github.com/" rel="noopener noreferrer"&gt;GitHub web UI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7d53i6bapja07at6tzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy7d53i6bapja07at6tzf.png" alt="Image description" width="800" height="841"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also install the git command line interface on your local machine and run commands to initialize an empty git repo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Adding DAG configs to the git repo
&lt;/h3&gt;

&lt;p&gt;Once the git repo is initialized, create a DAG file like “sample_dag.py” and push it to the remote branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add .
git commit -m 'Adding first DAG'
git remote add origin &amp;lt;your repository URL&amp;gt;
git push -u origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integrate Airflow with a private Git repo
&lt;/h3&gt;

&lt;p&gt;To integrate Airflow with a private Git repository, you will need credentials: either a username/password or an SSH key.&lt;/p&gt;

&lt;p&gt;We will use the SSH key to connect to the git repo. Skip the first step below if the SSH Key already exists in your Github account.&lt;/p&gt;

&lt;p&gt;1.Generate an SSH key on your local machine and add it to your GitHub account (skip this step if one already exists).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssh-keygen -t ed25519 -C "&amp;lt;mailaddress&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.Create a generic secret, containing your SSH private key, in the namespace where Airflow is deployed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret generic airflow-ssh-git-secret --from-file=gitSshKey=&amp;lt;path to SSH private key&amp;gt; -n airflow
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.Update the Git configuration in the values.yaml file and run the helm upgrade command as in the section above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gitSync:
    enabled: true
    repo: &amp;lt;your private Git repository URL&amp;gt;
    branch: &amp;lt;Branch-name&amp;gt;
    rev: HEAD
    depth: 1
    maxFailures: 0
    subPath: ""
sshKeySecret: airflow-ssh-git-secret

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Below is a “sample_dag.py” that demonstrates a simple workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2024, 8, 8),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG('hello_world', default_args=default_args, schedule_interval=timedelta(days=1))
t1 = BashOperator(
    task_id='say_hello',
    bash_command='echo "Hello World from Airflow!"',
    dag=dag,
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the repository is synced, you can see the DAGs in the Airflow UI. Airflow detects new DAGs automatically, and you can also refresh the DAGs list manually by clicking the "Refresh" button on the DAGs page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwheq6dzexduwsjx43ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwheq6dzexduwsjx43ux.png" alt="Image description" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7c75qgudzbf795rg1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7c75qgudzbf795rg1r.png" alt="Image description" width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaemitmzej1jybx3ziin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaemitmzej1jybx3ziin.png" alt="Image description" width="800" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UI has many options/settings to experiment with, such as code, graphs, audit logs, etc.&lt;/p&gt;

&lt;p&gt;You can also check the EKS cluster’s activity and DAG dashboard from the Activity tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lbw1vsrz0mec24u64sa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lbw1vsrz0mec24u64sa.png" alt="Image description" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57k4l1523ep39kgbkdcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57k4l1523ep39kgbkdcn.png" alt="Image description" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the Airflow job
&lt;/h3&gt;

&lt;p&gt;DAGs can run on a schedule or be triggered manually from the Airflow UI; there is a run button on the rightmost side of the DAG table.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs7ariyi88a173iqmafb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs7ariyi88a173iqmafb.png" alt="Image description" width="800" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A DAG can also be triggered from within its own detail view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nhvmb4krkbsw1nz89t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88nhvmb4krkbsw1nz89t.png" alt="Image description" width="800" height="72"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Make your Airflow on Kubernetes Production-Grade
&lt;/h2&gt;

&lt;p&gt;Apache Airflow is a powerful tool for orchestrating workflows, but making it production-ready requires careful attention to several key areas. &lt;/p&gt;

&lt;p&gt;Below, we explore strategies to enhance security, performance, monitoring, and ensure high availability in your Airflow deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Role-Based Access Control (RBAC)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Enable RBAC in Airflow to ensure only authorized users can access specific features and data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Limits access to critical areas and reduces the risk of unauthorized changes or data breaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to the &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-fab/stable/auth-manager/access-control.html" rel="noopener noreferrer"&gt;Access Control guide&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Secrets Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Integrate with external secret management tools like AWS Secrets Manager, HashiCorp Vault, or Kubernetes secrets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Securely store sensitive information like API keys and database passwords, keeping them out of your codebase.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to this AWS documentation about &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/integrating_csi_driver.html" rel="noopener noreferrer"&gt;Secrets management in EKS&lt;/a&gt;, as well as this guide to use &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_modules/tests/system/providers/cncf/kubernetes/example_kubernetes.html" rel="noopener noreferrer"&gt;Kubernetes secrets in Airflow DAG&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  c. Network Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Use network policies and security groups to restrict Airflow's web interface and API access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Minimizes exposure to potential attacks by limiting network access to trusted sources only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to this guide to implement &lt;a href="https://aws.github.io/aws-eks-best-practices/security/docs/network/" rel="noopener noreferrer"&gt;Network Security in EKS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Optimized Resource Allocation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Right-size your Kubernetes pods and nodes based on the workload demand. Use Kubernetes Horizontal Pod Autoscaler (HPA) to scale Airflow resources dynamically and cluster autoscaler to scale nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Ensures efficient use of resources, reduces costs, and prevents bottlenecks during peak loads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Airflow uses &lt;a href="https://www.restack.io/docs/airflow-kubernetes-executor" rel="noopener noreferrer"&gt;Executors to autoscale pods&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Refer to this generic guide on &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/horizontal-pod-autoscaler.html" rel="noopener noreferrer"&gt;Implementing HPA &lt;/a&gt;and &lt;a href="https://aws.github.io/aws-eks-best-practices/cluster-autoscaling/" rel="noopener noreferrer"&gt;Cluster Autoscaler in EKS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;HPA will autoscale the different Airflow components, while the Cluster Autoscaler will make sure there are nodes to satisfy those requirements.&lt;/p&gt;
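&lt;p&gt;The scaling rule HPA applies is simple: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. A quick sketch of that formula:&lt;/p&gt;

```python
# HPA's core rule (from the Kubernetes docs):
# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    return math.ceil(current_replicas * current_metric / target_metric)

# 3 worker pods averaging 90% CPU against a 60% target scale up to 5.
print(desired_replicas(3, 90, 60))  # 5
# 5 pods averaging 30% against the same target scale down to 3.
print(desired_replicas(5, 30, 60))  # 3
```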

&lt;h3&gt;
  
  
  b. Task Parallelism
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Configure Airflow to handle parallel task execution by optimizing the number of worker pods and setting appropriate concurrency limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Accelerates workflow execution by running multiple tasks simultaneously, improving overall performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out this guide for &lt;a href="https://www.astronomer.io/docs/learn/airflow-scaling-workers#:~:text=Airflow%20has%20many%20parameters%20that,wide%20variety%20of%20use%20cases." rel="noopener noreferrer"&gt;Implementing parallelism in Airflow&lt;/a&gt;.&lt;/p&gt;
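&lt;p&gt;The effect of a concurrency limit can be illustrated in a few lines of plain Python: however many tasks are queued, at most &lt;code&gt;max_workers&lt;/code&gt; run at the same time, which is analogous to Airflow's parallelism and concurrency settings. This is an illustration, not Airflow code:&lt;/p&gt;

```python
# Toy demonstration of a concurrency cap: 20 queued tasks, at most
# 4 running at once -- analogous to Airflow's parallelism settings.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

peak = 0      # highest number of tasks observed running at once
running = 0
lock = threading.Lock()

def task(_):
    global peak, running
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)  # simulate work
    with lock:
        running -= 1

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(task, range(20)))

print(peak)  # never exceeds 4
```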

&lt;h3&gt;
  
  
  c. Use of ARM Instances
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Consider running workloads on ARM-based instances like AWS Graviton for cost efficiency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: ARM instances often provide a better cost-to-performance ratio, especially for compute-intensive tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A quick guide to Creating &lt;a href="https://eksctl.io/usage/arm-support/" rel="noopener noreferrer"&gt;an EKS cluster with ARM instances&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  d. Use of HTTPS for ingress host
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Consider having HTTPS for the Airflow URL using TLS/SSL certificates with the Ingress controller in Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: HTTPS encrypts data to enhance the security of information being transferred. This is especially crucial when handling sensitive data, as encryption helps protect it from unauthorized access during transmission.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to this guide to &lt;a href="https://cert-manager.io/docs/tutorials/acme/nginx-ingress/" rel="noopener noreferrer"&gt;Install NGINX ingress and configure TLS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Metrics Collection and Alerting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Integrate Airflow with Prometheus to collect metrics on task performance, resource usage, and system health. Tools like Grafana or Prometheus Alertmanager can raise alerts based on critical metrics and log events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Provides visibility into Airflow’s performance, so you can identify and address issues proactively, respond quickly to potential problems, reduce downtime, and keep workflows reliable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to the “&lt;a href="https://www.redhat.com/en/blog/monitoring-apache-airflow-using-prometheus#:~:text=The%20statsd_exporter%20aggregates%20the%20metrics,viewed%20in%20the%20Grafana%20dashboard." rel="noopener noreferrer"&gt;How to set up Prometheus and Grafana with Airflow&lt;/a&gt;” guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Logs Collection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Set up centralized logging with tools like Elasticsearch, Logstash, Kibana (ELK stack or EFK stack), or Grafana Loki.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Simplifies troubleshooting by consolidating logs from all Airflow components into a single, searchable interface.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to the guide on how to &lt;a href="https://medium.com/@keyshelltechs/setup-efk-stack-on-amazon-eks-cluster-80049bfd74d1" rel="noopener noreferrer"&gt;Setup Elastic, Fluentd, and Kibana on EKS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  High Availability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  a. Redundant Components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Deploy multiple replicas of Airflow’s web server, scheduler, and worker nodes to ensure redundancy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Increases resilience by preventing single points of failure, ensuring that workflows continue even if one component goes down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To deploy multiple pods in Apache Airflow using a Helm chart, follow these steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Set Replicas for the Scheduler:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In your values.yaml file, set scheduler.replicas to the desired number of replicas. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scheduler:
  replicas: 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Set Replicas for the Web Server:&lt;/strong&gt;&lt;br&gt;
Similarly, set webserver.replicas to deploy multiple web server pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;webserver:
  replicas: 2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Deploy the Helm Chart:&lt;/strong&gt;&lt;br&gt;
Apply the Helm chart with the updated values.yaml file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install airflow apache-airflow/airflow -f values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration ensures that multiple scheduler and web server pods are deployed, contributing to the high availability of your Airflow setup.&lt;br&gt;
&lt;a href="https://github.com/apache/airflow/blob/main/chart/values.yaml" rel="noopener noreferrer"&gt;Airflow helm chart’s value.yaml file&lt;/a&gt; can be found here.&lt;/p&gt;

&lt;h3&gt;
  
  
  b. Database High Availability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Use a highly available database solution like Amazon RDS with Multi-AZ deployment for Airflow’s metadata database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Ensures continuous operation and data integrity even during a database failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to &lt;a href="https://aws.amazon.com/blogs/aws/amazon-rds-multi-az-db-cluster/" rel="noopener noreferrer"&gt;Amazon RDS with the Multi-AZ deployment guide&lt;/a&gt;.&lt;/p&gt;
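&lt;p&gt;As a sketch, pointing the chart at an external RDS instance typically means disabling the bundled PostgreSQL and filling in the metadata connection in values.yaml; the endpoint and credentials below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;postgresql:
  enabled: false

data:
  metadataConnection:
    user: airflow
    pass: change-me            # placeholder; prefer a secret reference
    protocol: postgresql
    host: mydb.xxxxxxxx.us-east-1.rds.amazonaws.com  # placeholder RDS endpoint
    port: 5432
    db: airflow

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;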

&lt;h3&gt;
  
  
  c. Backup and Disaster Recovery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: Regularly back up Airflow’s database and configuration files. Implement a disaster recovery plan that includes rapid failover procedures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;: Protects against data loss and enables quick recovery in case of catastrophic failures.
Read this document to set up &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ManagingAutomatedBackups.html" rel="noopener noreferrer"&gt;automated backups in Amazon RDS&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Refer to this AWS page to learn about “&lt;a href="https://aws.amazon.com/blogs/containers/backup-and-restore-your-amazon-eks-cluster-resources-using-velero/" rel="noopener noreferrer"&gt;Backup and Restore of EKS&lt;/a&gt;.”&lt;/p&gt;
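&lt;p&gt;Alongside RDS snapshots, a scheduled logical dump of the metadata database is a simple safety net; this is only a sketch, with a placeholder endpoint and database name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_dump -h mydb.xxxxxxxx.us-east-1.rds.amazonaws.com -U airflow -d airflow \
  | gzip &gt; airflow-metadata-$(date +%F).sql.gz

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;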

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Setting up Apache Airflow on Amazon EKS is a powerful way to manage your workflows at scale, but it requires careful planning and configuration to ensure it’s production-ready. Following this guide, you've deployed Airflow on EKS, created a simple DAG, connected Airflow with a private Git repository, and learned about different ways to implement security, performance, high availability, monitoring, and logging. With these optimizations, your Airflow deployment is now more efficient, cost-effective, and ready to handle the demands of real-world data orchestration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. What is Apache Airflow?&lt;/strong&gt;&lt;br&gt;
Apache Airflow is an open-source tool that helps in orchestrating and managing workflows through Directed Acyclic Graphs (DAGs). It automates complex processes like ETL (Extract, Transform, Load) jobs, machine learning pipelines, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Why deploy Airflow on Amazon EKS?&lt;/strong&gt;&lt;br&gt;
Deploying Airflow on Amazon EKS offers scalability, flexibility, and robust workflow management. EKS simplifies &lt;a href="https://www.meteorops.com/technologies/kubernetes" rel="noopener noreferrer"&gt;Kubernetes management&lt;/a&gt;, allowing you to focus on scaling and securing your Airflow environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What are the prerequisites for deploying Airflow on EKS?&lt;/strong&gt;&lt;br&gt;
You need an AWS account, an EKS cluster, kubectl configured on your local environment, a dynamic storage class using EBS volumes, and Helm for package management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. How do I monitor Airflow on EKS?&lt;/strong&gt;&lt;br&gt;
You can integrate Prometheus and Grafana for monitoring. Using Loki for log aggregation can also help in centralized log management and troubleshooting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. What Kubernetes add-ons are recommended for a production-grade Airflow setup?&lt;/strong&gt;&lt;br&gt;
Essential add-ons include External Secret Operator for secure secrets management, Prometheus and Grafana for monitoring, and possibly Loki for logging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Can Airflow be integrated with external databases like RDS?&lt;/strong&gt;&lt;br&gt;
Yes, it’s common to configure Airflow to use an external PostgreSQL database hosted on Amazon RDS for production environments, providing reliability and scalability for your metadata storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. How can I access the Airflow UI on EKS?&lt;/strong&gt;&lt;br&gt;
You can access the Airflow UI by setting up a LoadBalancer service or using an Ingress Controller with a DNS pointing to your load balancer for easy access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. How do I manage DAGs in a production environment?&lt;/strong&gt;&lt;br&gt;
For production, it’s advisable to store your DAGs in a private Git repository and integrate Airflow with this repo using GitSync to pull the latest DAG configurations automatically.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Terraform Starter Boilerplate for GCP using Terragrunt</title>
      <dc:creator>Arthur Azrieli</dc:creator>
      <pubDate>Fri, 26 Apr 2024 11:16:24 +0000</pubDate>
      <link>https://forem.com/meteorops/terraform-starter-boilerplate-for-gcp-using-terragrunt-5efg</link>
      <guid>https://forem.com/meteorops/terraform-starter-boilerplate-for-gcp-using-terragrunt-5efg</guid>
      <description>&lt;h2&gt;
  
  
  The Boilerplate Github Repositories (stars are welcome ⭐)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What you deploy: &lt;a href="https://github.com/MeteorOps/terragrunt-gcp-projects"&gt;https://github.com/MeteorOps/terragrunt-gcp-projects&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Modules you can use: &lt;a href="https://github.com/MeteorOps/terraform-gcp-modules"&gt;https://github.com/MeteorOps/terraform-gcp-modules&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Terraform mistakes that made me build this boilerplate
&lt;/h2&gt;

&lt;p&gt;I built this boilerplate for a reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I saw what companies regret with how they implemented Terraform:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing all of the Terraform code in one &lt;code&gt;main.tf&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Copy-pasting resources manually&lt;/li&gt;
&lt;li&gt;Copy-pasting configuration throughout the codebase&lt;/li&gt;
&lt;li&gt;No state separation and environment awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is why they regretted the above:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One &lt;code&gt;terraform apply&lt;/code&gt; could ruin an entire environment&lt;/li&gt;
&lt;li&gt;Resource modifications required changes in multiple locations&lt;/li&gt;
&lt;li&gt;Configuration modifications required changes in multiple locations&lt;/li&gt;
&lt;li&gt;Accidentally deploying the wrong resources to the wrong environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so, I built this boilerplate for our clients (and you) to minimize regrets.&lt;/p&gt;

&lt;p&gt;The focus of this boilerplate is managing GCP resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is this boilerplate for you?
&lt;/h2&gt;

&lt;p&gt;If you're a CTO, a DevOps lead embarking on a new project on GCP, or simply in search of a template to organize your Terraform repositories, this project is for you. Finding a well-structured example for deploying GCP resources can be challenging. My own search for such a template was unfruitful, leading me to develop a solution that I've decided to share openly and discuss in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What should you expect from this guide?
&lt;/h2&gt;

&lt;p&gt;By the conclusion of this guide, you'll have a thorough understanding of how to establish Terraform repositories using a best-practice folder structure for provisioning GCP resources. You'll also be equipped to execute a straightforward demo, witnessing an end-to-end workflow in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  What shouldn't you expect from this guide?
&lt;/h2&gt;

&lt;p&gt;An exhaustive library of modules for every resource in GCP. We kept the boilerplate minimal so that you can adapt it to your needs.&lt;br&gt;
You can fairly easily plug in existing modules you created or found elsewhere.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;To begin, clone the essential repositories:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Primary Repository&lt;/strong&gt;&lt;br&gt;
Clone the terragrunt-gcp-projects repository to get started.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:MeteorOps/terragrunt-gcp-projects.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Modules Repository (Optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the modules used, clone the terraform-gcp-modules repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone git@github.com:MeteorOps/terraform-gcp-modules.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Repository Structure Explained
&lt;/h2&gt;

&lt;p&gt;The code organization follows a logical hierarchy to facilitate multiple projects, regions, or environments. This structure gives you a number of benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hierarchical configuration:&lt;/strong&gt; The configuration at each level cascades through the folders under it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State separation:&lt;/strong&gt; The terraform state is saved per folder in a different path in a bucket, limiting the impact radius of changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic-level of deployment:&lt;/strong&gt; The deeper into the folder you go, the more specific resources you affect with one deployment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;project
└ _global
└ region
   └ _global
   └ environment
      └ resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating and using root (project) level variables
&lt;/h3&gt;

&lt;p&gt;When dealing with multiple GCP projects or regions, passing common variables to modules can become repetitive. To avoid duplicating variables across each terragrunt.hcl file, leverage root terragrunt.hcl inputs to inherit variables seamlessly across regions and environments.&lt;/p&gt;
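&lt;p&gt;As a minimal sketch (values are illustrative), the root terragrunt.hcl declares the shared inputs, and each module’s terragrunt.hcl inherits them through an include block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# root terragrunt.hcl
inputs = {
  project_id = "my-gcp-project"   # illustrative value
  region     = "us-central1"
}

# module-level terragrunt.hcl
include {
  path = find_in_parent_folders()
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;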

&lt;h2&gt;
  
  
  Deploy Using Terragrunt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Install Terraform version 0.12.6 or newer and Terragrunt version v0.25.1 or newer.&lt;br&gt;
Fill in your GCP Project ID in &lt;code&gt;my-project/project.hcl&lt;/code&gt;.&lt;br&gt;
Make sure the gcloud CLI is installed and you are authenticated; otherwise, run &lt;code&gt;gcloud auth login&lt;/code&gt;.&lt;/p&gt;
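&lt;p&gt;You can verify the prerequisites from your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform version
terragrunt --version
gcloud auth list  # run gcloud auth login if no account is active

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;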
&lt;h3&gt;
  
  
  Module Deployment
&lt;/h3&gt;

&lt;p&gt;To deploy a single module:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cd&lt;/code&gt; into the module's folder (e.g. &lt;code&gt;cd my-project/us-central1/rnd-1/vpc&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;terragrunt plan&lt;/code&gt; to see the changes you are about to apply.&lt;/li&gt;
&lt;li&gt;If the plan looks good, run &lt;code&gt;terragrunt apply&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Environment Deployment
&lt;/h3&gt;

&lt;p&gt;To deploy all modules within an environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cd&lt;/code&gt; into the environment folder (e.g. &lt;code&gt;cd my-project/us-central1/rnd-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;terragrunt run-all plan&lt;/code&gt; to see all the changes you are about to apply.&lt;/li&gt;
&lt;li&gt;If the plan looks good, run &lt;code&gt;terragrunt run-all apply&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Testing Deployed Infrastructure
&lt;/h3&gt;

&lt;p&gt;Post-deployment, modules will output relevant information. For instance, the IP of a deployed application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Outputs:

ip = "35.240.219.84"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A minute or two after the deployment finishes, you should be able to test the &lt;code&gt;ip&lt;/code&gt; output in your browser or with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl 35.240.219.84

# Output: Let MeteorOps know if this boilerplate needs any improvement!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Clean-Up Process
&lt;/h3&gt;

&lt;p&gt;To remove all deployed modules within an environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cd&lt;/code&gt; into the environment folder (e.g. &lt;code&gt;cd my-project/us-central1/rnd-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;terragrunt run-all plan -destroy&lt;/code&gt; to see all the destroy changes you're about to apply.&lt;/li&gt;
&lt;li&gt;If the plan looks good, run &lt;code&gt;terragrunt run-all destroy&lt;/code&gt;.
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This guide walks you through leveraging best practices for setting up and managing Terraform repositories for GCP with Terragrunt. These methodologies are designed to be straightforward, efficient, and easily adaptable to future projects or company needs.&lt;/p&gt;

&lt;h4&gt;
  
  
  P.S.
&lt;/h4&gt;

&lt;p&gt;Feel free to &lt;a href="https://meteorops.beehiiv.com/subscribe"&gt;subscribe to our Newsletter&lt;/a&gt; and learn about other insights and resources we release 👈🏼&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>googlecloud</category>
      <category>terragrunt</category>
    </item>
    <item>
      <title>Don't waste 6 months on hiring DevOps, take the 10 Hours DevOps Pill</title>
      <dc:creator>Michael Zion</dc:creator>
      <pubDate>Thu, 04 Jan 2024 14:55:06 +0000</pubDate>
      <link>https://forem.com/meteorops/dont-waste-6-months-on-hiring-devops-take-the-10-hours-devops-pill-40f9</link>
      <guid>https://forem.com/meteorops/dont-waste-6-months-on-hiring-devops-take-the-10-hours-devops-pill-40f9</guid>
      <description>&lt;p&gt;A practical playbook for founders and engineering managers to stop wasting time trying to get started with DevOps the right way&lt;/p&gt;




&lt;h2&gt;
  
  
  Is this article for you?
&lt;/h2&gt;

&lt;p&gt;Maybe you're a CTO.&lt;br&gt;
Maybe you're an engineering manager.&lt;br&gt;
If you're thinking about hiring your first DevOps Engineer - this article is for you.&lt;br&gt;
It will save you 6 months of wasted mental capacity.&lt;/p&gt;

&lt;p&gt;How? By dedicating 10 hours to planning.&lt;br&gt;
I'll share with you here exactly what to do during those 10 hours.  &lt;/p&gt;

&lt;h2&gt;
  
  
  You might know this story TOO well
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ~ "I need to focus on the product"
&lt;/h3&gt;

&lt;p&gt;You are building your company’s product.&lt;br&gt;
You start thinking about the development process and building or improving production.&lt;br&gt;
Then, you read about DevOps somewhere and think to yourself: “This seems like what I need”.&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "DevOps is expensive"
&lt;/h3&gt;

&lt;p&gt;You start looking for a DevOps Engineer, but you find out the standard DevOps salary - “Oh, that’s pretty expensive for our budget”.&lt;br&gt;
So you decide to wait. “Hiring a developer is more important right now” you tell yourself.  &lt;/p&gt;

&lt;p&gt;A month goes by, and you and your team are grinding away, building a kickass product.&lt;br&gt;
But alas! The missing development automation hurts collaboration, the cloud infrastructure is a collection of manual changes - and you’re not sure if the system is built well.&lt;/p&gt;

&lt;p&gt;So, you make a decision, and you post your first “DevOps Position”, congrats!&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "I'll interview some candidates"
&lt;/h3&gt;

&lt;p&gt;Candidates start streaming in, and you start interviewing.&lt;br&gt;
After interviewing the 3rd candidate, you think to yourself “this is not it”.&lt;br&gt;
Every candidate is missing a different piece of the required knowledge.&lt;br&gt;
One doesn’t know AWS, the other isn’t familiar with the tools you use, and the last one only worked with on-prem before.&lt;/p&gt;

&lt;p&gt;And so, you re-evaluate.&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "I'll do it myself"
&lt;/h3&gt;

&lt;p&gt;The realization is this: You can’t hire on knowledge for this position, you have to hire on skill and character.&lt;br&gt;
BUT, you feel like the knowledge is crucial, at least at this point.&lt;br&gt;
So you make another decision - I’ll invest some more time in this “DevOps” thing - I’ll learn.&lt;br&gt;
You watch DevOps tutorials on Youtube.&lt;br&gt;
You even subscribe to various DevOps Newsletters, something you didn’t do before.&lt;br&gt;
Another month goes by.&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "What do other startups do?"
&lt;/h3&gt;

&lt;p&gt;After learning and implementing DevOps bits and pieces all over the place, it doesn’t feel right.&lt;br&gt;
“Am I re-inventing the wheel and complicating the future?” you might ask yourself.&lt;br&gt;
You take a step back, zoom-out, and ask other startup founders what they did at this stage.&lt;/p&gt;

&lt;p&gt;That’s when you find out you’re not alone - “Oh! So it’s tough for other founders as well!”&lt;br&gt;
But, each founder tells you something different.&lt;br&gt;
One founder tells you to hire a freelancer - but that seems like just another hiring model and doesn’t solve the problem.&lt;br&gt;
Another founder tells you to just hire a full-time DevOps Engineer - but your resources are limited, and you’d rather hire more developers to build the product.&lt;br&gt;
Until one founder tells you to hire an agency: “Instead of a freelancer, get a company that’s built to get your DevOps right”.&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "Maybe an agency?"
&lt;/h3&gt;

&lt;p&gt;You’re convinced, and you start looking for agencies.&lt;br&gt;
You get recommendations, you search online, and you start booking calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1st call with a DevOps agency&lt;/strong&gt;&lt;br&gt;
“Hi there! Yes we’re the solution for you! Let’s get started!”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2nd call with a DevOps agency&lt;/strong&gt;&lt;br&gt;
“Yes! We’ll provide you with a strong DevOps Engineer! Let’s get started!”&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "I'll do it myself"
&lt;/h3&gt;

&lt;p&gt;You choose not to go on the 3rd call and you pause the process.&lt;/p&gt;

&lt;p&gt;Why? Because you realize: “These guys didn’t remove any uncertainty for me - how are they different from a freelancer?”&lt;br&gt;
So you make another decision.&lt;br&gt;
You’ll continue to do DevOps yourself, and maybe delegate some tasks to other team members.&lt;/p&gt;

&lt;p&gt;That’s when you find out it was a trap! You got yourself into a loop!&lt;br&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  ~ "Oh no! it's a loop!"
&lt;/h3&gt;

&lt;p&gt;After just one month you realize it’s not sustainable, and you repeat the same process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open DevOps position&lt;/li&gt;
&lt;li&gt;Realize each candidate is missing something else&lt;/li&gt;
&lt;li&gt;Fear for the budget&lt;/li&gt;
&lt;li&gt;Stop the hiring&lt;/li&gt;
&lt;li&gt;Complete DevOps tasks yourself&lt;/li&gt;
&lt;li&gt;Consider a freelancer&lt;/li&gt;
&lt;li&gt;Drop the idea because it's similar to hiring&lt;/li&gt;
&lt;li&gt;Consider an agency&lt;/li&gt;
&lt;li&gt;Interview a bunch of agencies&lt;/li&gt;
&lt;li&gt;No extra certainty or confidence from an agency&lt;/li&gt;
&lt;li&gt;Do DevOps work yourself&lt;/li&gt;
&lt;li&gt;Repeat&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This loop is sometimes 2 months long, and sometimes 12 months long.&lt;/p&gt;

&lt;p&gt;Either way, it slows down your company!&lt;br&gt;
It occupies your thoughts.&lt;br&gt;
It takes valuable time from you.&lt;br&gt;
It speeds up global warming.&lt;/p&gt;

&lt;p&gt;Maybe the global warming part was an exaggeration.&lt;/p&gt;

&lt;p&gt;It’s amazing how many founders went through this exact same process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break the loop
&lt;/h2&gt;

&lt;p&gt;I’m here to tell you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;You can break the DevOps hiring loop&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You can promote DevOps efforts quickly AND accurately&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You can save 6-12 months of stumbling around&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  You need just 10 hours with a DevOps Expert
&lt;/h2&gt;

&lt;p&gt;You need just 10 hours with a DevOps expert.&lt;br&gt;
What you need most is perspective from an experienced individual.&lt;br&gt;
You don’t yet have the DevOps perspective or experience? No problem.&lt;br&gt;
Get a DevOps Expert to consult you on an hourly basis.&lt;br&gt;
I call it "The 10 Hours DevOps Pill" 💊&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements for the DevOps expert
&lt;/h2&gt;

&lt;p&gt;Not just anyone will fit as your trusted DevOps expert.&lt;/p&gt;

&lt;p&gt;Use someone who meets these criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Helped 10+ companies with DevOps -
Has a wide perspective&lt;/li&gt;
&lt;li&gt;Consulted CTOs and VPs of R&amp;amp;D before -
Understands the weight of culture in adopting DevOps&lt;/li&gt;
&lt;li&gt;Consulted DevOps Group and Team Leaders before -
Understands how to manage the work&lt;/li&gt;
&lt;li&gt;Did lots of DevOps hands-on work before -
Understands how to do the work&lt;/li&gt;
&lt;li&gt;You understand them -
They know where you stand and have simple explanations. (Run away from the buzzwords throwers)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Spoiler: The contents of the 10 hours
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;2h - Share everything about your startup&lt;/li&gt;
&lt;li&gt;2h - Ask what’s possible + What others did&lt;/li&gt;
&lt;li&gt;2h - Set DevOps goals&lt;/li&gt;
&lt;li&gt;2h - Choose a strategy&lt;/li&gt;
&lt;li&gt;2h - Create a DevOps roadmap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s dive into the playbook for each meeting!&lt;/p&gt;

&lt;h2&gt;
  
  
  1st + 2nd Consultation Hours: Share, share, share
&lt;/h2&gt;

&lt;p&gt;Share the state of your company with the DevOps expert.&lt;br&gt;
Share your infrastructure, your code, your work process, everything!&lt;br&gt;
Don’t be shy, and don’t justify - simply explain your startup’s goals and your reasoning.&lt;br&gt;
(Sign an NDA first, of course)&lt;/p&gt;

&lt;h2&gt;
  
  
  3rd + 4th Consultation Hours: Understand what’s possible + Gain perspective
&lt;/h2&gt;

&lt;p&gt;Take notes or record this call!&lt;/p&gt;

&lt;p&gt;Ask the DevOps expert: &lt;strong&gt;What’s possible?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How fast can changes reach production?&lt;/li&gt;
&lt;li&gt;How fast can you create new environments?&lt;/li&gt;
&lt;li&gt;How easily can your developers collaborate?&lt;/li&gt;
&lt;li&gt;How confidently can you change the system?&lt;/li&gt;
&lt;li&gt;What uptime can the system have?&lt;/li&gt;
&lt;li&gt;How aware can your team be of the system’s state?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ask the DevOps expert: &lt;strong&gt;What did other startups do at this stage?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What goals did they have?&lt;/li&gt;
&lt;li&gt;What DevOps efforts did they promote?&lt;/li&gt;
&lt;li&gt;Why did they decide on those DevOps efforts?&lt;/li&gt;
&lt;li&gt;Which startups performed best? Why?&lt;/li&gt;
&lt;li&gt;Are there industry-unique considerations?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  5th + 6th Consultation Hours: Set DevOps goals
&lt;/h2&gt;

&lt;p&gt;You now know what’s possible, and what other companies do.&lt;br&gt;
Time to set short-term, mid-term, and long-term goals.&lt;br&gt;
If you want to dive deep into useful DevOps principles, check out &lt;a href="https://www.meteorops.com/blog/the-cto-devops-handbook-simple-principles-and-examples"&gt;the CTO DevOps Handbook here 👈🏼&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  7th + 8th Consultation Hours: Strategize
&lt;/h2&gt;

&lt;p&gt;Based on your goals, strengths, and restrictions, choose a strategy.&lt;br&gt;
Discuss the following options with your DevOps expert.&lt;/p&gt;

&lt;h3&gt;
  
  
  Possible Strengths
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Developers with DevOps background&lt;/li&gt;
&lt;li&gt;High budget&lt;/li&gt;
&lt;li&gt;Available time to manage the DevOps efforts&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Possible Restrictions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;No budget&lt;/li&gt;
&lt;li&gt;Low budget&lt;/li&gt;
&lt;li&gt;DevOps needs are temporary for now&lt;/li&gt;
&lt;li&gt;No time to manage the DevOps efforts&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;DevOps Engineer&lt;/th&gt;
&lt;th&gt;Freelancer&lt;/th&gt;
&lt;th&gt;Agency&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Budget&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Freelancer / Agency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Management Effort Required&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Agency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hourly Flexibility&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Agency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of bridging knowledge gaps&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;Agency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of building trust&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium-High&lt;/td&gt;
&lt;td&gt;Low-Medium&lt;/td&gt;
&lt;td&gt;DevOps Engineer&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The available strategies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Hire a DevOps Engineer, Boost with the DevOps expert&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a DevOps roadmap with the DevOps Expert before hiring&lt;/li&gt;
&lt;li&gt;Get the DevOps expert to interview the candidates&lt;/li&gt;
&lt;li&gt;Get advice from the DevOps expert on what to examine during your interview with the candidates&lt;/li&gt;
&lt;li&gt;Prioritize the DevOps efforts with the DevOps Expert&lt;/li&gt;
&lt;li&gt;Plan the onboarding of the DevOps Engineer with the DevOps expert&lt;/li&gt;
&lt;li&gt;Hire the DevOps Engineer&lt;/li&gt;
&lt;li&gt;Get advice on managing DevOps projects from the DevOps expert&lt;/li&gt;
&lt;li&gt;Discuss the DevOps tasks implementation with the DevOps Engineer and DevOps expert&lt;/li&gt;
&lt;li&gt;Occasionally or routinely consult with the DevOps expert on priorities, management, and engineering&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hire a DevOps freelancer, Boost with the DevOps expert&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Exactly as the above strategy&lt;/li&gt;
&lt;li&gt;Take advantage of the hourly flexibility when needed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Hire a DevOps agency, Boost with the DevOps expert&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a DevOps roadmap with the DevOps Expert before choosing an agency&lt;/li&gt;
&lt;li&gt;Share your requirements with the agency&lt;/li&gt;
&lt;li&gt;Filter out any agency that doesn’t provide you with a plan BEFORE work starts&lt;/li&gt;
&lt;li&gt;Get the plan reviewed by your DevOps expert: Does it make sense? Is the planned result desired?&lt;/li&gt;
&lt;li&gt;Make sure the DevOps agency can provide support after the work is done&lt;/li&gt;
&lt;li&gt;Assess the level of the DevOps agency’s DevOps Engineers with the help of the DevOps expert&lt;/li&gt;
&lt;li&gt;Make sure the DevOps agency is flexible with its hourly capacity&lt;/li&gt;
&lt;li&gt;Occasionally or routinely consult with the DevOps expert on priorities, management, and engineering&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Advantages and disadvantages of the strategies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Good when&lt;/th&gt;
&lt;th&gt;Bad when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full-time + Expert&lt;/td&gt;
&lt;td&gt;- Your budget is high - You need flexible DevOps capacity - Your team has DevOps background&lt;/td&gt;
&lt;td&gt;- Your budget is low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Freelancer + Expert&lt;/td&gt;
&lt;td&gt;- Your budget is low - You need flexible DevOps capacity - You can't risk losing DevOps capacity short-mid-term&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agency + Expert&lt;/td&gt;
&lt;td&gt;- Your budget is low - You need flexible DevOps capacity - You want an accurate plan before work starts - You frequently need to support new tools and technologies&lt;/td&gt;
&lt;td&gt;- The agency doesn’t have flexible plans - The agency’s team isn’t experienced - The agency doesn’t create clarity before work starts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  9th + 10th Consultation Hours: Create a DevOps roadmap + Execute your strategy
&lt;/h2&gt;

&lt;p&gt;You’re ready to create a DevOps roadmap and start executing your strategy!&lt;br&gt;
Creating a DevOps roadmap is the first step of each of the strategies described above.&lt;/p&gt;

&lt;h3&gt;
  
  
  You had to do a bunch of things first:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;You needed a DevOps expert to help you break the vicious DevOps loop - &lt;strong&gt;Done&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You needed to set DevOps goals rooted in a deep DevOps understanding and perspective - &lt;strong&gt;Done&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You needed to get familiar with different strategies and choose the best one for you - &lt;strong&gt;Done!&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  You’re left with 2 things to do:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1 - Create the DevOps roadmap based on the goals and strategy you’ve set&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2 - Start executing the strategy:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Share the DevOps goals with the team&lt;/li&gt;
&lt;li&gt;Schedule the interviews with the Engineers/Freelancers/Agencies&lt;/li&gt;
&lt;li&gt;Use the DevOps expert to stay on track&lt;/li&gt;
&lt;li&gt;Set the milestones for the DevOps goals&lt;/li&gt;
&lt;li&gt;Create the tasks&lt;/li&gt;
&lt;li&gt;Complete the tasks&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prepare before starting the process
&lt;/h2&gt;

&lt;p&gt;You want to do two things before jumping into the 10 hours:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Before the first 2 hours&lt;/strong&gt;:
Perfect your &lt;a href="https://www.meteorops.com/blog/the-cto-devops-handbook-simple-principles-and-examples#devops-dictionary"&gt;DevOps terminology&lt;/a&gt; (will support rich discussions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before the last 2 hours&lt;/strong&gt;:
List existing efforts that might slow down the DevOps roadmap (will help with prioritization)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Ask me anything
&lt;/h2&gt;

&lt;p&gt;If you’re thinking about taking this “10-hours DevOps Pill”, feel free to &lt;a href="https://www.meteorops.com/10-hours-devops-pill"&gt;reach out here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Do it yourself
&lt;/h2&gt;

&lt;p&gt;You now know the general process for breaking the DevOps loop.&lt;br&gt;
You don’t NEED a DevOps expert.&lt;br&gt;
Feel free to subscribe to my Newsletter and &lt;a href="https://meteorops.beehiiv.com/subscribe"&gt;learn how to do it yourself here 👈🏼&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cto</category>
      <category>webdev</category>
      <category>startup</category>
    </item>
    <item>
      <title>The CTO DevOps Handbook: Simple Principles and Examples</title>
      <dc:creator>Michael Zion</dc:creator>
      <pubDate>Mon, 25 Dec 2023 21:28:13 +0000</pubDate>
      <link>https://forem.com/meteorops/the-cto-devops-handbook-simple-principles-and-examples-2jcb</link>
      <guid>https://forem.com/meteorops/the-cto-devops-handbook-simple-principles-and-examples-2jcb</guid>
      <description>&lt;p&gt;Nail the DevOps part as your company's CTO&lt;/p&gt;




&lt;p&gt;The goal of this handbook is to give you clarity on DevOps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand what’s DevOps (in simple words)&lt;/li&gt;
&lt;li&gt;Know what’s possible with DevOps (in simple goals)&lt;/li&gt;
&lt;li&gt;Get simple “when-to-do-what” DevOps guidelines&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I added a bonus at the bottom of the article.&lt;br&gt;
It's a production-ready setup example you could take inspiration from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who this article is for
&lt;/h2&gt;

&lt;p&gt;You might be a founder who wishes to get started with DevOps the right way.&lt;/p&gt;

&lt;p&gt;You might be the CTO of a 1,000-employee company who wishes to get simple principles.&lt;/p&gt;

&lt;p&gt;Or, maybe you’re a Software Engineer, and you want to understand if your company’s DevOps approach is good.&lt;/p&gt;

&lt;p&gt;If you’re looking for a simple DevOps playbook, this is it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understand the desired result
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two things your company needs to be able to do
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Serve its product to customers&lt;/li&gt;
&lt;li&gt;Build and improve the product&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Abilities you need to build, improve, and serve software
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Run experiments and test changes&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  DevOps has a simple meaning
&lt;/h2&gt;

&lt;p&gt;Developers and Operators have shared responsibility for building and improving the system.&lt;/p&gt;

&lt;p&gt;In practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Developers are responsible for “Operating” the system&lt;/li&gt;
&lt;li&gt;DevOps Engineers are responsible for enabling developers to “Operate” AND for doing some of it themselves&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Operate&lt;/strong&gt; = provision, monitor, secure, configure, deploy, scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choose a balance: Enabler, Doer, or Automator
&lt;/h2&gt;

&lt;p&gt;The DevOps role will end up as a balance between:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enabler: Provides the tools and knowledge to fulfill the DevOps goals&lt;/li&gt;
&lt;li&gt;Doer: Does the tasks that fulfill the DevOps goals&lt;/li&gt;
&lt;li&gt;Automator: Automates any repeating operation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Know what things you should enable, do, or automate
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Provision infrastructure&lt;/li&gt;
&lt;li&gt;Secure the system&lt;/li&gt;
&lt;li&gt;Deploy workloads&lt;/li&gt;
&lt;li&gt;Monitor the system&lt;/li&gt;
&lt;li&gt;Recover from issues&lt;/li&gt;
&lt;li&gt;Scale up or down&lt;/li&gt;
&lt;li&gt;Track &amp;amp; test changes&lt;/li&gt;
&lt;li&gt;Automate processes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choose the right tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Has state management = Saves time automating state-aware processes (e.g., Terraform)&lt;/li&gt;
&lt;li&gt;Has a big community &amp;amp; good docs = Saves time dealing with common issues (e.g., Kubernetes)&lt;/li&gt;
&lt;li&gt;Has multiple interface types: API, CLI, UI = Saves time integrating with the existing system (e.g., Vault)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also read about choosing tools &lt;a href="https://www.meteorops.com/blog/one-click-environment-the-ultimate-devops-goal#tools-choosing-principles"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set useful goals
&lt;/h2&gt;

&lt;p&gt;There are DevOps goals that, once adopted, will focus your efforts in the right direction:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://www.meteorops.com/blog/one-click-environment-the-ultimate-devops-goal"&gt;One-Click Environments&lt;/a&gt;&lt;/strong&gt;: makes e2e tests easy and quick&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic Commits&lt;/strong&gt;: provides confidence that a tested change will work in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate the Shared &amp;amp; Env-Specific Parts&lt;/strong&gt;: enables e2e tests as the company scales up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to learn about more useful DevOps goals, feel free to book a free consultation &lt;a href="https://www.meteorops.com/personal-devops-consultation"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enablers: Choose the Tools-to-Knowledge Balance
&lt;/h2&gt;

&lt;p&gt;Developers can either have the knowledge or the tools to do something.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More knowledge-reliance&lt;/strong&gt;: if you want the developers to contribute to the DevOps efforts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More tools-reliance&lt;/strong&gt;: if you want to abstract the operations from the developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the balance between the two is not intentional, it’s accidental.&lt;/p&gt;

&lt;h2&gt;
  
  
  Doers: Have a good reason to do it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Is it a one-time task?&lt;/li&gt;
&lt;li&gt;Does it teach you how the developers work?&lt;/li&gt;
&lt;li&gt;Are you directly accountable for the results of the task?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you answered “no” to the above questions, enable or automate it instead.&lt;/p&gt;

&lt;p&gt;Doing more = Learning the system's use-cases&lt;/p&gt;

&lt;p&gt;Doing too much = Not scalable, too much knowledge-reliance&lt;/p&gt;

&lt;h2&gt;
  
  
  Automators: Have a good reason to automate it
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Did it happen before?&lt;/li&gt;
&lt;li&gt;Is it likely to happen again?&lt;/li&gt;
&lt;li&gt;Will automating it take less time than doing it?&lt;/li&gt;
&lt;li&gt;Will automating it teach you an important company process?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you answered “yes” to 2 out of the 4 questions - automate it!&lt;/p&gt;
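&lt;p&gt;The “yes to 2 out of 4” rule above can be sketched as a tiny helper. This is a hypothetical illustration of the checklist (the function name and parameters are mine, only the four questions and the threshold come from the article):&lt;/p&gt;

```python
def should_automate(happened_before: bool,
                    likely_to_happen_again: bool,
                    automating_faster_than_doing: bool,
                    teaches_company_process: bool) -> bool:
    """Return True when at least 2 of the 4 questions are answered 'yes'."""
    answers = [happened_before, likely_to_happen_again,
               automating_faster_than_doing, teaches_company_process]
    return sum(answers) >= 2
```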

&lt;p&gt;More automation = Less reliance on knowledge to operate the system.&lt;/p&gt;

&lt;p&gt;Too much automation = No system awareness.&lt;/p&gt;

&lt;p&gt;P.S. - you can also enable developers to automate it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create available DevOps Capacity
&lt;/h2&gt;

&lt;p&gt;The DevOps needs of a company have spikes.&lt;/p&gt;

&lt;p&gt;One month you need 2 DevOps Engineers, and half of that the next month.&lt;/p&gt;

&lt;p&gt;Switchovers between big efforts and small tasks are common.&lt;/p&gt;

&lt;p&gt;This is true, especially for new companies.&lt;/p&gt;

&lt;p&gt;Break the assumption: “DevOps tasks must be done by a DevOps Engineer”.&lt;/p&gt;

&lt;h2&gt;
  
  
  There are 3 types of DevOps capacity
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Non-Flexible&lt;/strong&gt;: A full-time DevOps Engineer on the team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-Flexible&lt;/strong&gt;: Key developers that can contribute to the DevOps goals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fully-Flexible&lt;/strong&gt;: A flexible DevOps Services company or freelancer&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can read more about calculating the DevOps capacity your company needs &lt;a href="https://www.meteorops.com/blog/calculating-your-companys-required-devops-capacity"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to focus on what: Common Dilemmas
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You work alone, and the system is simple&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On simplifying the development - Dockerize your apps, Create a post-commit pipeline that runs tests&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You need to be able to create new environments quickly (for development, or for clients)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On implementing “One-Click Environments”: Using IaC (e.g., Terraform) + Deployment tool (Depends on the platform)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You want to e2e test every code modification, but there are many code modifications&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On splitting the “One-Click Env” into a “base” with shared resources, and an “env” with env-specific resources&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You want to unify &amp;amp; standardize how you deploy, monitor, scale, configure, and secure your workloads&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On implementing an orchestrator such as Kubernetes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You have many moving parts and wish to be certain a tested change will work&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On implementing GitOps and considering a Monorepo (the sooner the better)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When&lt;/strong&gt;: You want the DevOps efforts to be done by the dev team&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Focus&lt;/strong&gt;: On using “actual” IaC tools (Pulumi Typescript/Python), Full “how to operate” (see above) documentation&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Never&lt;/strong&gt;: Invest lots of time in new tech without a strong reason&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Always&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have your code in Git&lt;/li&gt;
&lt;li&gt;Monitor the basic stuff: CPU, Memory, Disk, Network, App Logs, Cloud Costs&lt;/li&gt;
&lt;li&gt;Architect for high-availability&lt;/li&gt;
&lt;li&gt;Test before you deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  BONUS: An example setup for a CTO approaching Production
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zw1wj7ub4thd1o0h717.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zw1wj7ub4thd1o0h717.jpg" alt="Image description" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2 AWS Accounts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One for development and staging&lt;/li&gt;
&lt;li&gt;Another for production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Monorepo in Github
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Docker-Compose for local development&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2 Infrastructure-as-Code projects: 'base' &amp;amp; 'apps'
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;base = shared resources (e.g., VPC, RDS, ECS Cluster, EKS Cluster)&lt;/li&gt;
&lt;li&gt;apps = env-specific resources (e.g., Lambda Functions, ECS Services, Kubernetes Namespaces)&lt;/li&gt;
&lt;li&gt;config file per environment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Github Actions Workflow: Development workflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Checkout branch and locally develop + test changes&lt;/li&gt;
&lt;li&gt;Create a Pull Request: Deploys a Pull-Request ‘apps’ environment on the ‘development’ environment ‘base’&lt;/li&gt;
&lt;li&gt;On merge to main: Deploys from the ‘main’ branch an ‘apps’ environment onto the ‘development’ environment ‘base’&lt;/li&gt;
&lt;li&gt;Manual: Deploy from the ‘main’ branch onto the ‘staging’ / ‘production’ environment ‘base’&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setup Notes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid mentioning an environment's name in the code for conditional resource deployment&lt;/li&gt;
&lt;li&gt;Use each environment’s config file to declare if a resource should be created&lt;/li&gt;
&lt;li&gt;Could be implemented using Terraform, Terragrunt, Pulumi, CDK, and other IaC tools&lt;/li&gt;
&lt;li&gt;Production should have 2 instances of every workload for high-availability&lt;/li&gt;
&lt;/ul&gt;
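&lt;p&gt;The “declare resources in the environment’s config file” note can be sketched like this. The file name and keys are illustrative assumptions; in Terraform the same idea would be a count/for_each driven by per-environment variable files:&lt;/p&gt;

```python
import json

# Hypothetical per-environment config, normally stored as e.g. envs/staging.json.
STAGING_CONFIG = json.loads("""
{
  "environment": "staging",
  "resources": {"rds": true, "eks": true, "lambda_workers": false}
}
""")

def resources_to_create(config: dict) -> list[str]:
    """Read the config's declarations instead of checking the environment's
    name, so adding a new environment needs a new config file, not a code change."""
    return sorted(name for name, wanted in config["resources"].items() if wanted)
```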

&lt;p&gt;If you’d like to see this setup in your startup, &lt;a href="https://www.meteorops.com/devops-setup-for-production-free-consultation"&gt;click here to book a call 👈🏼&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;P.S. - I'll be updating this page occasionally, so you might want to visit again&lt;/p&gt;




&lt;h2&gt;
  
  
  Another Bonus: DevOps Dictionary for Human Beings
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Environment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A working instance of the entire system&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CI (Continuous Integration)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enable developers to collaborate by agreeing on a single source-of-truth (master/main)&lt;/td&gt;
&lt;td&gt;Jenkins, Github Actions, GitlabCI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CD (Continuous Delivery)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Create an artifact that’s ready for production (tested, tagged)&lt;/td&gt;
&lt;td&gt;JFrog Artifactory, Nexus, AWS ECR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CD (Continuous Deployment)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every available deliverable (artifact) gets deployed automatically&lt;/td&gt;
&lt;td&gt;ArgoCD, Jenkins, AWS CodeDeploy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring / Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Collect metrics/traces/logs from apps and infrastructure, analyze and display them, and set up alerts&lt;/td&gt;
&lt;td&gt;Prometheus, Jaeger, Elasticsearch, Fluentd, OpenTelemetry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The resources on which the workloads run, in which the data is stored, and through which the network flows&lt;/td&gt;
&lt;td&gt;Servers, Databases, Network Routers &amp;amp; Switches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Infrastructure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as the above, but specifically in the cloud&lt;/td&gt;
&lt;td&gt;AWS EC2, AWS RDS, GCP Compute Engine, Azure Virtual Machines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Computing &amp;amp; Data services served from remote locations for you to build your system&lt;/td&gt;
&lt;td&gt;AWS, Azure, GCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containerization &amp;amp; Virtualization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Technologies utilizing Kernel &amp;amp; OS features to create virtual machines, or isolate processes (AKA run containers)&lt;/td&gt;
&lt;td&gt;Docker, vSphere, KVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secrets Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Storing and retrieving sensitive configurations (e.g., tokens, passwords)&lt;/td&gt;
&lt;td&gt;Hashicorp Vault, AWS Secrets Manager, SealedSecrets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Configuration Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Usually refers to preparing servers for workloads (e.g., creating directories &amp;amp; files, starting processes)&lt;/td&gt;
&lt;td&gt;Ansible, Chef, Puppet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Version Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Saving the code in a versioned way (Git)&lt;/td&gt;
&lt;td&gt;Github, Gitlab&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GitOps&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Making sure the system is the same as it’s described in Git&lt;/td&gt;
&lt;td&gt;Flux, ArgoCD, Jenkins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monorepo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All of the company’s code is in one Git Repository&lt;/td&gt;
&lt;td&gt;NX, Turborepo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Polyrepo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple Git repositories for different components&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IaC (Infrastructure-as-Code)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Creating Cloud infrastructure with &lt;strong&gt;idempotent&lt;/strong&gt; code and state management&lt;/td&gt;
&lt;td&gt;Terraform, Pulumi, CDK, Crossplane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execute, serve, or install the artifacts&lt;/td&gt;
&lt;td&gt;ArgoCD, Jenkins, AWS CodeDeploy, Scripts (Bash, Python, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestrator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dynamically allocating workloads to a pool of nodes&lt;/td&gt;
&lt;td&gt;Kubernetes, Nomad, AWS ECS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication &amp;amp; Authorization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Making sure each person, workload, or resource, has access only to what’s necessary (other workloads and resources)&lt;/td&gt;
&lt;td&gt;AWS IAM, OpenID, OpenVPN, Twingate, Istio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Discovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Exposing available workloads using DNS&lt;/td&gt;
&lt;td&gt;Consul, CoreDNS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Get more practical advice
&lt;/h2&gt;

&lt;p&gt;I post small nuggets of practical advice on the "MeteorOps Newsletter".&lt;br&gt;
You can &lt;a href="https://meteorops.beehiiv.com/subscribe"&gt;subscribe here 👈🏼&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cto</category>
      <category>softwareengineering</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
