<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kostis Kapelonis</title>
    <description>The latest articles on Forem by Kostis Kapelonis (@kostiscodefresh).</description>
    <link>https://forem.com/kostiscodefresh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F437938%2F7b372728-0ca1-4d45-961a-b170001a220d.png</url>
      <title>Forem: Kostis Kapelonis</title>
      <link>https://forem.com/kostiscodefresh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kostiscodefresh"/>
    <language>en</language>
    <item>
      <title>How to Preview and Diff Your Argo CD Deployments</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Tue, 17 Jan 2023 14:36:25 +0000</pubDate>
      <link>https://forem.com/codefreshio/how-to-preview-and-diff-your-argo-cd-deployments-2knj</link>
      <guid>https://forem.com/codefreshio/how-to-preview-and-diff-your-argo-cd-deployments-2knj</guid>
      <description>&lt;p&gt;Adopting Kubernetes has introduced several new complications on how to verify and validate all the manifests that describe your application. There are several tools out there for checking the syntax of manifests, scanning them for security issues, enforcing policies etc.&lt;/p&gt;

&lt;p&gt;But at the most basic level, one of the major challenges is to actually understand what each change means for your application (and optionally approve/reject the pull request that contains that change).&lt;/p&gt;

&lt;p&gt;This challenge was already present even outside &lt;a href="https://codefresh.io/learn/gitops/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt;, but it has become even more important for teams that use GitOps tooling (such as Argo CD) for their Kubernetes deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Any major Git platform has built-in support for showing diffs between the proposed change and the current code when a Pull Request is created. In theory, the presented diff should be enough for a human to understand what the changes contain and how they will affect the target environment.&lt;/p&gt;

&lt;p&gt;In practice, however, several teams have adopted a templating tool (such as Kustomize or Helm) that is responsible for rendering the actual Kubernetes manifests for a target cluster.&lt;/p&gt;

&lt;p&gt;As a quick example let’s say that you need to review a Pull Request with the following changes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w4y2zhva7ge4fne1w3i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5w4y2zhva7ge4fne1w3i.png" alt="No context diff" width="774" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This seems simple enough. You assume that this change will increase the number of replicas to 20 (let’s say it is Black Friday and you want to increase the capacity of your web store ASAP). You merge the pull request and … nothing happens.&lt;/p&gt;

&lt;p&gt;What you didn’t know is that there is a &lt;a href="https://github.com/kostis-codefresh/argocd-preview-diff/blob/no-context-pr/envs/prod-us/replicas.yml" rel="noopener noreferrer"&gt;downstream Kustomize overlay&lt;/a&gt; that also defines replicas on its own. So the proposed change has no effect at all. The problem is that the pull request contains only a fragment of a Kustomize source manifest and doesn’t show a diff of the end result (the full rendered manifest).&lt;/p&gt;
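
&lt;p&gt;To make the failure mode concrete, here is a minimal sketch of such a layout. The file names and values below are hypothetical, not taken from the linked repository: the base sets 20 replicas, but the overlay patch silently wins.&lt;/p&gt;

```yaml
# base/deployment.yaml (hypothetical) - the file changed in the pull request
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webstore
spec:
  replicas: 20   # the change you reviewed and approved

# envs/prod-us/replicas.yml (hypothetical) - a downstream overlay patch.
# Kustomize applies this patch last, so the rendered manifest
# still says "replicas: 3" no matter what the base contains.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webstore
spec:
  replicas: 3
```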

&lt;p&gt;The problem is even more apparent when your organization is using Helm. Let’s say that you need to approve a pull request with the following changes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo2uqghygdas8k8vpx2e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo2uqghygdas8k8vpx2e.png" alt="No context Helm diff" width="800" height="593"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a human, it is very difficult to understand what exactly is happening here. You need to run the templates in your head and decide whether this change is correct or not. Wouldn’t it be nice if the diff showed the actual manifest that is created from this chart?&lt;/p&gt;

&lt;p&gt;Essentially the diff functionality found in your Git system is not enough when it comes to complex Kubernetes applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the built-in Diff functionality in the Argo CD GUI
&lt;/h2&gt;

&lt;p&gt;One of the main benefits of using the Argo CD UI during a deployment is the built-in diff feature. When a resource is “out-of-sync” (i.e. it differs from what is in Git), Argo CD will mark it with a special color/icon. In the following example, somebody has changed the service resource of an application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7rpsdpjot0kwsq1drbc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7rpsdpjot0kwsq1drbc.png" alt="Out of Sync resource" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can then click on the service and see the diff:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznhv3qq6ur4vkf11ur4h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fznhv3qq6ur4vkf11ur4h.jpg" alt="Argo CD UI diff" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The big advantage here is that Argo CD has already integrated support for Kustomize and Helm. The diff you will see is on the final rendered manifests, which is exactly what you want, as you can preview changes in their full context.&lt;/p&gt;

&lt;p&gt;Unfortunately, this method also has several disadvantages.&lt;/p&gt;

&lt;p&gt;The first one is that Argo CD shows diffs only for applications where the &lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/auto_sync/" rel="noopener noreferrer"&gt;auto-sync&lt;/a&gt; (and self-heal) behaviors are disabled. This means that you lose the main benefit of GitOps. The proper way to follow GitOps is to have auto-sync (and self-heal) enabled, as this guarantees the basic premise that the cluster and the Git repository contain the same thing, i.e. that the desired state and the actual state have not diverged.&lt;/p&gt;
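
&lt;p&gt;For reference, enabling both behaviors is a matter of the &lt;code&gt;syncPolicy&lt;/code&gt; block in the Application resource. A minimal sketch, where the repository URL, path and application name are placeholders:&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app            # placeholder
spec:
  project: default
  source:
    repoURL: https://github.com/example/manifests.git   # placeholder
    path: envs/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```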

&lt;p&gt;But the second problem regarding Continuous Delivery is that the diff on the manifests is shown only after the changes are already committed and pushed. And this is too late for any serious review. Ideally you want to review changes as early as possible. A Pull Request allows you to add comments, discuss the changes with your team, and also reject the change altogether without affecting a production system.&lt;/p&gt;

&lt;p&gt;Using the built-in diff functionality in the Argo CD GUI is great for validating a change and doing a last sanity check just before production. But it should not be the main review milestone for a manifest change. And ideally &lt;a href="https://www.cncf.io/blog/2020/12/17/solving-configuration-drift-using-gitops-with-argo-cd/" rel="noopener noreferrer"&gt;you should set up all your applications to sync automatically&lt;/a&gt;, in which case this diff view is not even available in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the local Diff feature of the Argo CD CLI
&lt;/h2&gt;

&lt;p&gt;We have seen that the built-in diff in the Argo CD UI appears very late in the delivery process. Can we use the same diff approach earlier in the life of a change?&lt;/p&gt;

&lt;p&gt;It turns out that the Argo CD CLI also comes with &lt;a href="https://argo-cd.readthedocs.io/en/stable/user-guide/commands/argocd_app_diff/" rel="noopener noreferrer"&gt;a diff command&lt;/a&gt;. This command takes a &lt;code&gt;--local&lt;/code&gt; parameter that allows you to compare what is deployed in the cluster against ANY local files, which don’t have to be pushed (or even committed at all). It will also automatically run your favorite templating tool, as defined in the Argo CD application.&lt;/p&gt;

&lt;p&gt;Here is how it looks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl52ezeycrmjv89g0p2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsl52ezeycrmjv89g0p2x.png" alt="Argo CD Local diff" width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach is very promising as you could in theory use it inside a CI system with the following process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a pull request with the suggested changes&lt;/li&gt;
&lt;li&gt;Have your CI system checkout the pull request&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;argocd app diff --local&lt;/code&gt; inside a CI pipeline against the cluster where the pull request is destined (this again uses the built-in support for Kustomize/Helm within Argo CD)&lt;/li&gt;
&lt;li&gt;Present the diff to the user in order to make decisions about the pull request&lt;/li&gt;
&lt;/ol&gt;
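
&lt;p&gt;Such a pipeline step could look roughly like this. This is only an illustration (it needs a CI job with an authenticated Argo CD CLI to actually run), and the server address, application name and path are placeholders:&lt;/p&gt;

```shell
# Log in to the Argo CD API server - this is exactly the credential
# requirement called out as a drawback below.
argocd login argocd.example.com --sso

# Compare the checked-out pull request files against the live application.
argocd app diff my-app --local ./envs/prod-us
```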

&lt;p&gt;This sounds great in theory, but in practice it has several shortcomings.&lt;/p&gt;

&lt;p&gt;The most obvious one is that you need to provide your CI pipeline with credentials that access the cluster where Argo CD is installed. This forfeits one of the main benefits of GitOps: the pull mechanism, where the credentials stay within the cluster.&lt;/p&gt;

&lt;p&gt;An even bigger concern however is what happens when you have multiple clusters. Which cluster should you pick to compare against? What if the chosen cluster has CRDs or other resources that are custom to it?&lt;/p&gt;

&lt;p&gt;This process can also become very complex with remote or secure cluster instances. For example, if you have an Argo CD cluster in Asia and your CI system is running in the US, connectivity between the two might be very slow or even impossible.&lt;/p&gt;

&lt;p&gt;In summary, &lt;code&gt;argocd app diff --local&lt;/code&gt; is great for local experimentation and quick ad hoc checks, but for a production deployment process there is a better way to achieve the same result (spoiler: it doesn’t involve the Argo CD CLI at all, nor does it need cluster access).&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-rendering manifests in a second Git repository
&lt;/h2&gt;

&lt;p&gt;Let’s take a step back. We have been looking for ways to show an enhanced diff as part of a pull request, having set aside the existing diff provided by the Git platform (since, as we have seen, it doesn’t work on the final manifests).&lt;/p&gt;

&lt;p&gt;There is a way however to enhance the built-in diff and make it work on the final manifests.&lt;/p&gt;

&lt;p&gt;The solution is to use two Git repositories for each application/cluster. &lt;a href="https://github.com/kostis-codefresh/crossplane-iam-pod-role" rel="noopener noreferrer"&gt;One Git repository&lt;/a&gt; has the manifests in their unprocessed form (e.g. as Kustomize overlays), as before. There is now &lt;a href="https://github.com/kostis-codefresh/rendered-manifests" rel="noopener noreferrer"&gt;a second Git repository&lt;/a&gt; that has the final rendered manifests. And Argo CD is pointed at the latter.&lt;/p&gt;

&lt;p&gt;Here is how it would look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjigxqtjxtzpw5j2s6p2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjigxqtjxtzpw5j2s6p2.png" alt="Prerender manifests in second Git repository" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This process should be familiar to you if you have ever used a preprocessor or code generator. Essentially an automated process (which can be the CI system or something else) does the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A human creates a pull request on “Source” Git repo with the suggested change&lt;/li&gt;
&lt;li&gt;A “copy” process takes the contents of the pull request and applies the respective templating tool (e.g. Helm/Kustomize) to create the final rendered manifests&lt;/li&gt;
&lt;li&gt;A second pull request is opened automatically on the “Rendered” Git repo with the contents of the manifests&lt;/li&gt;
&lt;li&gt;A human sees the diff of the second pull request, and this time the diff is between rendered manifests, not snippets/segments.&lt;/li&gt;
&lt;li&gt;If the Pull Request is approved, it is merged in both Git repositories. Thus the second repository always contains rendered manifests&lt;/li&gt;
&lt;li&gt;Argo CD monitors the second repository and applies the changes (the integrated support for Helm and Kustomize within Argo CD itself is not used at all; Argo CD is only syncing raw manifests)&lt;/li&gt;
&lt;/ol&gt;
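
&lt;p&gt;The “copy” step in the middle of this process could be sketched as follows. This is illustrative only: the repository name, paths, pull request number and the choice of Kustomize are placeholders, and error handling is omitted.&lt;/p&gt;

```shell
# Runs in CI after the source pull request has been checked out.
# Render the environment into a single manifest file...
kustomize build envs/prod-us > /tmp/manifests.yaml

# ...and open a matching pull request on the "rendered" repository.
git clone git@example.com:acme/rendered-manifests.git
cd rendered-manifests
git checkout -b render-source-pr
cp /tmp/manifests.yaml envs/prod-us/manifests.yaml
git add . && git commit -m "Rendered manifests for the source PR"
git push origin render-source-pr  # then open the PR via your Git provider's API
```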

&lt;p&gt;This is a perfectly valid process and I have seen it used with success in several companies.&lt;/p&gt;

&lt;p&gt;The big advantage of course is that &lt;a href="https://github.com/kostis-codefresh/rendered-manifests/pull/1/files" rel="noopener noreferrer"&gt;the diff you get in the Git provider&lt;/a&gt; provides you with the full information about what will change in the application AFTER all manifests are processed:&lt;/p&gt;

&lt;p&gt;Here is the same example with the Helm chart, but this time we are using a second Git repository that has the rendered manifest stored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv3sp5waehxi9f3j3kq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv3sp5waehxi9f3j3kq2.png" alt="Helm diff with full context" width="800" height="719"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, I am personally against this process, as it complicates things a lot and increases the number of moving parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doubles the number of repositories for any given application (or branches, if multiple branches are used)&lt;/li&gt;
&lt;li&gt;It introduces another point of failure: the copy process that converts source YAML to final manifests&lt;/li&gt;
&lt;li&gt;It completely bypasses the effort put into Argo CD to process manifests on its own.&lt;/li&gt;
&lt;li&gt;It might be confusing for people who now have 2 Git repositories to work with, and it opens the possibility of mistakes on both ends (either committing something to the “source” repo that never makes it to the “rendered” repo, or vice versa).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In general I think this is an overkill solution for a problem that can be solved more elegantly, as we will see later in the article. Still, if you follow this approach and it works for you, make sure you have safeguards and monitoring in place (especially for the automated copy/commit process).&lt;/p&gt;

&lt;h2&gt;
  
  
  Intermission: Preview Terraform plans
&lt;/h2&gt;

&lt;p&gt;You might think that previewing the full manifests for a pull request is a new problem that Kubernetes introduced. It isn’t. There have been several tools before Kubernetes that had to deal with the exact same issue and it would make sense to look at what they do.&lt;/p&gt;

&lt;p&gt;The most obvious candidate to examine is Terraform. If you are not familiar with it, Terraform is a declarative tool that allows you to define your infrastructure in an &lt;a href="https://developer.hashicorp.com/terraform/language/syntax/configuration" rel="noopener noreferrer"&gt;HCL file&lt;/a&gt; and then &lt;a href="https://developer.hashicorp.com/terraform/cli/commands/apply" rel="noopener noreferrer"&gt;“apply”&lt;/a&gt; your changes.&lt;/p&gt;

&lt;p&gt;Terraform users have a very similar problem. A pull request that contains Terraform changes (especially in big projects) is not immediately clear for a human to understand (unless you are an expert at running Terraform in your head).&lt;/p&gt;

&lt;p&gt;To solve this issue, Terraform has a &lt;a href="https://developer.hashicorp.com/terraform/cli/commands/plan" rel="noopener noreferrer"&gt;“plan”&lt;/a&gt; command which reads the changes, decides what it would do, and prints a nice summary of all the proposed changes without actually applying anything.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gs44ca86w28aoqbl55c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gs44ca86w28aoqbl55c.png" alt="Terraform plan" width="800" height="701"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The plan functionality is crucial to Terraform teams as it removes the guesswork about what Terraform will do when the changes are applied.&lt;/p&gt;

&lt;p&gt;With this summary at hand, the next step is obvious. We can simply attach the output of the plan command to the pull request. Humans can now look at both the diff of the HCL files and the plan summary, and decide whether the change is valid or not.&lt;/p&gt;

&lt;p&gt;This workflow is so common that an open-source project, Atlantis (&lt;a href="https://www.runatlantis.io/" rel="noopener noreferrer"&gt;https://www.runatlantis.io/&lt;/a&gt;), does exactly that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You make your changes on the terraform files&lt;/li&gt;
&lt;li&gt;You create a pull request&lt;/li&gt;
&lt;li&gt;Atlantis runs the “plan” command and attaches the result in the PR&lt;/li&gt;
&lt;li&gt;You can then approve or comment on the PR as a human&lt;/li&gt;
&lt;li&gt;Atlantis then runs the “apply” command to actually modify your infrastructure&lt;/li&gt;
&lt;li&gt;Atlantis locks the workspace until the PR is merged, preventing a second PR from overriding the changes before the first PR is merged.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This workflow is very effective for end users but has several security drawbacks, which are similar to those of the &lt;code&gt;argocd app diff&lt;/code&gt; approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need bidirectional communication between your Git provider and the server that runs Atlantis&lt;/li&gt;
&lt;li&gt;The server that runs Atlantis will run terraform on its own and thus &lt;a href="https://www.runatlantis.io/docs/provider-credentials.html" rel="noopener noreferrer"&gt;it needs all credentials that terraform has&lt;/a&gt; in your organization.&lt;/li&gt;
&lt;li&gt;The server that runs Atlantis also needs to have access to your remote terraform state. Essentially Atlantis has the keys to your kingdom.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Another downside of Atlantis is that it is purely Pull Request based, whereas proprietary Terraform CD tools on the market feature a dashboard where every project/workspace can be browsed (similar to the Argo CD dashboard where you can see your Applications and whether or not they are synced).&lt;/p&gt;

&lt;p&gt;In theory we could create a similar system around &lt;code&gt;argocd app diff --local&lt;/code&gt;, but given the security implications, there is a much better approach, helped by the GitOps principles.&lt;/p&gt;

&lt;p&gt;Still, the basic idea of attaching a diff to a pull request for greater context offers advantages for the end user that cannot be overstated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Render Kubernetes manifests on the fly
&lt;/h2&gt;

&lt;p&gt;One of the most important principles of GitOps is that at any given time the cluster state is the same as what is described in Git. We really like how Atlantis works, but there is a way to make the workflow more secure and more robust by taking advantage of the GitOps guarantee that Argo CD provides.&lt;/p&gt;

&lt;p&gt;The Terraform CLI that runs on the Atlantis instance needs credentials to your infrastructure because it must both read &lt;a href="https://developer.hashicorp.com/terraform/language/state" rel="noopener noreferrer"&gt;the Terraform state&lt;/a&gt; and create the actual infrastructure once changes are “applied”.&lt;/p&gt;

&lt;p&gt;With Argo CD we can completely bypass this limitation because we already have the desired cluster state right there: it is stored in the target branch of the Pull Request!&lt;/p&gt;

&lt;p&gt;This means that we don’t need any credentials, either to the cluster or to Argo CD. We can simply run a diff between the files of the Pull Request and the branch it targets. The extra addition here is that we also run a preprocessing step for the templating solution (Helm or Kustomize) in order to get the full manifests.&lt;/p&gt;

&lt;p&gt;So the full process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Somebody opens a Pull request to the manifest repo&lt;/li&gt;
&lt;li&gt;We check out the code of the pull request and run Kustomize, Helm or other templating tool in order to have the final rendered manifests of what is changed&lt;/li&gt;
&lt;li&gt;We check out the code of the branch that is targeted by the Pull Request (e.g. main) and find the same environment and again run the same templating tool to get the final manifest&lt;/li&gt;
&lt;li&gt;We run a diff between the final manifests from the two previous steps&lt;/li&gt;
&lt;li&gt;We show the diff to the human operator who will decide whether the pull request will be merged or not&lt;/li&gt;
&lt;/ol&gt;
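
&lt;p&gt;The steps above can be condensed into a small script. Since rendering requires your own Kustomize/Helm layout, the sketch below stands in for steps 1-3 with two pre-rendered files and focuses on the diff itself; all file names and values are placeholders:&lt;/p&gt;

```shell
# Stand-ins for the output of "kustomize build" (or "helm template")
# on the target branch and on the PR branch, respectively.
printf 'kind: Deployment\nreplicas: 10\nimage: shop:1.0\n' > rendered-main.yaml
printf 'kind: Deployment\nreplicas: 20\nimage: shop:1.0\n' > rendered-pr.yaml

# Step 4: diff the two fully rendered manifests.
# "diff" exits non-zero when the files differ, so mask the exit code for CI.
diff -u rendered-main.yaml rendered-pr.yaml > manifest.diff || true

# Step 5: show the result to the reviewer (or attach it to the PR).
cat manifest.diff
```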

&lt;p&gt;The beauty of this approach is that &lt;a href="https://www.runatlantis.io/docs/security.html" rel="noopener noreferrer"&gt;unlike Atlantis&lt;/a&gt;, we never access the Argo CD cluster. All information is coming from Git (an advantage of using GitOps). This means that your Argo CD cluster could be in China with a very slow connection (or even an isolated connection) and your CI server doesn’t need to know anything about it. In fact the location and security access of your Argo CD server is now irrelevant as we don’t interact with it in any way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41s82mp2mlxxjzpq1btd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41s82mp2mlxxjzpq1btd.png" alt="Render manifests on the fly" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice in the diagram above that unlike Atlantis, our CI server has a direct connection only to the Git repository. The Argo CD instance still cares only about the Git repository it monitors.&lt;/p&gt;

&lt;p&gt;This makes our approach much more secure as the credentials of the cluster still stay within the cluster itself and we only interact with the Git repository.&lt;/p&gt;

&lt;p&gt;One thing to note here is that unlike Atlantis or the &lt;code&gt;argocd app diff&lt;/code&gt; command, we are not comparing the desired state in Git with the actual state (acquired from the cloud provider’s API or the Kubernetes API); we are comparing two versions of the desired state stored in different branches of a Git repository. While this is a good enough approximation, it is not 100% equivalent to the &lt;code&gt;argocd app diff&lt;/code&gt; approach.&lt;/p&gt;

&lt;p&gt;A corner case is Helm Capabilities: built-in variables populated by querying the Kubernetes cluster for its API version and available resources. Some Helm templates use this information to render the correct resource versions, appropriate for the specific cluster’s version and available CRDs. This information has to be supplied manually to the &lt;code&gt;helm template&lt;/code&gt; command to achieve parity with &lt;code&gt;argocd app diff&lt;/code&gt;.&lt;/p&gt;
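
&lt;p&gt;Helm exposes flags for exactly this purpose, so the offline rendering step can approximate the target cluster. The release name, chart path, version and API group below are example values:&lt;/p&gt;

```shell
# Populate .Capabilities during offline rendering:
# --kube-version sets .Capabilities.KubeVersion,
# --api-versions adds entries to .Capabilities.APIVersions.
helm template my-release ./chart \
  --kube-version 1.25.0 \
  --api-versions monitoring.coreos.com/v1
```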

&lt;h2&gt;
  
  
  Attaching the full manifest diff to a pull request
&lt;/h2&gt;

&lt;p&gt;The icing on the cake is that we also attach the full manifest diff to the Pull Request (as Atlantis does). This is how it would look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyat11oofk6m7364vvlt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyat11oofk6m7364vvlt1.png" alt="Environment diff" width="800" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now need to explain to users that the automatic diff of the pull request is no longer what they should look at, because it tells only part of the story (see the problem description at the beginning of this article). Instead, they should look at our attached diff to get the full picture of what has changed and make decisions accordingly.&lt;/p&gt;

&lt;p&gt;The attached diff is especially important for people who use Helm, as you can see a diff between plain YAML files instead of trying to run Go templates in your head.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enforcing changes during environment promotion
&lt;/h2&gt;

&lt;p&gt;If you follow this diff approach, where the full manifest changes are shown in the Pull Request, it will be much easier for you and your team to collaborate on GitOps changes, as everybody will have the full context of each incoming change.&lt;/p&gt;

&lt;p&gt;A secondary benefit of this diff approach, however, is knowing what is NOT changed.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://codefresh.io/blog/how-to-model-your-gitops-environments-and-promote-releases-between-them/" rel="noopener noreferrer"&gt;my previous article about promotions between GitOps environments using folders&lt;/a&gt;, a lot of people asked how you can guarantee that extracting a common setting from downstream Kustomize overlays and promoting it to your base overlay can be safely executed in a single step.&lt;/p&gt;

&lt;p&gt;I was really puzzled by this question, until I realized that most people asking it were looking at the simple diff of the pull request and thus lacked the full context of the change.&lt;/p&gt;

&lt;p&gt;Let’s take an example. You have two environments, qa and staging, with the following settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;UI_THEME&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;
&lt;span class="py"&gt;CACHE_SIZE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2048kb&lt;/span&gt;
&lt;span class="py"&gt;SORTING&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ascending&lt;/span&gt;
&lt;span class="py"&gt;N_BUCKETS&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;42&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You want to add a new setting called &lt;code&gt;PAGE_LIMIT=25&lt;/code&gt; and promote it gradually, first to qa and then to staging.&lt;/p&gt;

&lt;p&gt;You modify/commit the &lt;a href="https://github.com/kostis-codefresh/manifest-refactoring/blob/main/envs/qa/application.properties" rel="noopener noreferrer"&gt;qa environment&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;UI_THEME&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;light&lt;/span&gt;
&lt;span class="py"&gt;CACHE_SIZE&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2048kb&lt;/span&gt;
&lt;span class="py"&gt;PAGE_LIMIT&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;25&lt;/span&gt;
&lt;span class="py"&gt;SORTING&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;ascending&lt;/span&gt;
&lt;span class="py"&gt;N_BUCKETS&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;42  &lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deployment goes well and you make &lt;a href="https://github.com/kostis-codefresh/manifest-refactoring/blob/main/envs/staging/application.properties" rel="noopener noreferrer"&gt;the same change to the staging environment&lt;/a&gt;. It works fine there as well.&lt;/p&gt;

&lt;p&gt;Now you decide that this new setting should be the same across both environments and you decide to move it to &lt;a href="https://github.com/kostis-codefresh/manifest-refactoring/blob/refactoring-of-settings/variants/non-prod/application.properties" rel="noopener noreferrer"&gt;the parent overlay&lt;/a&gt; (which is common to all non-prod environments).&lt;/p&gt;

&lt;p&gt;So the actions you take are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Delete the setting from the QA environment&lt;/li&gt;
&lt;li&gt;Delete the setting from the Staging environment&lt;/li&gt;
&lt;li&gt;Add the setting into the parent overlay that both environments depend on&lt;/li&gt;
&lt;li&gt;Commit/Push all the above in a single step&lt;/li&gt;
&lt;/ol&gt;
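
&lt;p&gt;Expressed as commands, the four actions look like this. This is a sketch: the directory layout mirrors the example repository, but the file contents below are stand-ins created just for the illustration.&lt;/p&gt;

```shell
# Recreate a minimal version of the layout (stand-in content).
mkdir -p envs/qa envs/staging variants/non-prod
printf 'UI_THEME=light\nPAGE_LIMIT=25\nSORTING=ascending\n' > envs/qa/application.properties
printf 'UI_THEME=light\nPAGE_LIMIT=25\nSORTING=ascending\n' > envs/staging/application.properties
: > variants/non-prod/application.properties

# Steps 1 and 2: delete the setting from both environments.
sed -i '/^PAGE_LIMIT=/d' envs/qa/application.properties
sed -i '/^PAGE_LIMIT=/d' envs/staging/application.properties

# Step 3: add it to the parent overlay that both environments depend on.
echo 'PAGE_LIMIT=25' >> variants/non-prod/application.properties

# Step 4: commit and push everything in a single step, e.g.
# git add -A && git commit -m "Move PAGE_LIMIT to the common overlay" && git push
```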

&lt;p&gt;A lot of people were concerned about this process and asked how you can be sure that the whole refactoring will work without affecting the existing environments.&lt;/p&gt;

&lt;p&gt;We can finally answer this question by simply looking at &lt;a href="https://github.com/kostis-codefresh/manifest-refactoring/pull/1" rel="noopener noreferrer"&gt;the enhanced diff of the above commit&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8bnsidskfjxi1lhwwmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8bnsidskfjxi1lhwwmt.png" alt="No change in the final manifests" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s right. All diffs are completely empty. Even though there are changes in the individual Kustomize files, the end result (the rendered manifests) is &lt;strong&gt;EXACTLY&lt;/strong&gt; the same.&lt;/p&gt;

&lt;p&gt;This means that if you approve this pull request, Argo CD will do absolutely nothing, and you can be certain that all environments will be unaffected by this refactoring.&lt;/p&gt;

&lt;p&gt;Of course, the basic diff of the pull request is not that smart and only shows &lt;a href="https://github.com/kostis-codefresh/manifest-refactoring/pull/1/files" rel="noopener noreferrer"&gt;the textual changes in the individual files&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6d4wg2vng2apnlnfz96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6d4wg2vng2apnlnfz96.png" alt="Plain diff shows changes" width="800" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So in this scenario we have the extreme case where the built-in diff of the pull request lacks the full context of what is going on, because it doesn’t understand the final rendered manifests.&lt;/p&gt;
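&lt;p&gt;The same check can be scripted locally: render the manifests before and after the change and diff the results. The sketch below is self-contained (stand-in functions emulate the rendered output); against the real repository you would call &lt;code&gt;kustomize build&lt;/code&gt; on each revision instead:&lt;/p&gt;

```shell
# A refactoring is safe when the *rendered* manifests are identical, even
# though the source files differ. In the real repo you would render each
# revision with `kustomize build envs/qa`; here two stand-in functions
# emulate the rendered output before and after the refactoring.
render_before() {
cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-settings
data:
  my.new.setting: "true"
EOF
}

# After the refactoring the setting comes from the parent overlay,
# but the rendered result is byte-for-byte the same.
render_after() {
  render_before
}

render_before > before.yaml
render_after > after.yaml

if diff before.yaml after.yaml; then
  echo "no-op refactoring: safe to merge"
else
  echo "rendered manifests changed: review carefully"
fi
```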

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Previewing changes before applying them is a pillar of modern software automation, and in the case of Kubernetes applications it is not always a straightforward process because of the templating of the manifests.&lt;/p&gt;

&lt;p&gt;In this article we have seen several ways of previewing the changes in Argo CD applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic diff of the Git platform (not recommended)&lt;/li&gt;
&lt;li&gt;Native diff of the Argo CD UI&lt;/li&gt;
&lt;li&gt;Diff local files with the Argo CD CLI&lt;/li&gt;
&lt;li&gt;Pre-rendering manifests in a second Git repository&lt;/li&gt;
&lt;li&gt;Rendering manifests on the fly for each Pull request (recommended)&lt;/li&gt;
&lt;/ol&gt;
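&lt;p&gt;As an example of method 3, the Argo CD CLI can compare the live state of an application against local, uncommitted manifests (the application name and path below are illustrative):&lt;/p&gt;

```shell
# Diff the live application state against locally rendered manifests
argocd app diff my-app --local ./envs/qa
```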

&lt;p&gt;We hope that this process is helpful for you and your team, especially when it is combined with static analysis, syntax validation, security scans, and other sanity checks that run against your Kubernetes manifests.&lt;/p&gt;

&lt;p&gt;Happy diffing!&lt;/p&gt;

</description>
      <category>devmeme</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>How to Install and Upgrade Argo CD</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Tue, 17 Jan 2023 12:28:49 +0000</pubDate>
      <link>https://forem.com/codefreshio/how-to-install-and-upgrade-argo-cd-5hj0</link>
      <guid>https://forem.com/codefreshio/how-to-install-and-upgrade-argo-cd-5hj0</guid>
      <description>&lt;p&gt;We have already covered several aspects of Argo CD in this blog such as &lt;a href="https://codefresh.io/blog/argo-cd-best-practices/"&gt;best practices&lt;/a&gt;, &lt;a href="https://codefresh.io/blog/scaling-argo-cd-securely-in-2022/"&gt;cluster topologies&lt;/a&gt; and even &lt;a href="https://codefresh.io/blog/argo-cd-application-dependencies/"&gt;application ordering&lt;/a&gt;, but it is always good to get back to basics and talk about installation and more importantly about maintenance.&lt;/p&gt;

&lt;p&gt;Chances are that one of your first Argo CD installations happened with kubectl as explained in the &lt;a href="https://argo-cd.readthedocs.io/en/stable/getting_started/"&gt;getting started guide&lt;/a&gt;. While this form of installation is great for quick experimentation and for trying Argo CD, there are many more installation methods that are recommended for production deployments.&lt;/p&gt;
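&lt;p&gt;For reference, that quick-start installation boils down to two commands (great for evaluation, not recommended for production):&lt;/p&gt;

```shell
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```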

&lt;h2&gt;
  
  
  Manual installation and manual upgrade
&lt;/h2&gt;

&lt;p&gt;The most obvious installation method is using the official manifests or the respective Helm chart. Note that the Helm chart for Argo CD is community-maintained rather than official, and it sometimes lags behind the regular Argo CD releases.&lt;/p&gt;

&lt;p&gt;While the initial installation of Argo CD using just the manifests is quick and straightforward it also suffers from several shortcomings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration changes (e.g. SSO and notifications) must also be applied manually, without any option for rollbacks or auditing&lt;/li&gt;
&lt;li&gt;No disaster recovery option if your Argo CD instance goes down&lt;/li&gt;
&lt;li&gt;Extra effort is needed for modifications to the base installation (e.g. for Argo CD plugins)&lt;/li&gt;
&lt;li&gt;Managing multiple Argo CD instances manually becomes very cumbersome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the biggest challenge is actually upgrading Argo CD in a safe way. New Argo CD versions come with their &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/overview/"&gt;own notes and incompatibilities&lt;/a&gt; and trying to manually upgrade your instance without any backup plan is a recipe for disaster.&lt;/p&gt;

&lt;p&gt;In summary, you should only employ manual installation via manifests for quick prototypes and demo installations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a hosted Argo CD instance
&lt;/h2&gt;

&lt;p&gt;If you are searching for the easiest way to use Argo CD while still having a production installation, look no further than a hosted Argo CD instance. At Codefresh, we already announced &lt;a href="https://codefresh.io/blog/hosted-gitops-platform-powered-by-argocd/"&gt;our hosted Argo CD offering&lt;/a&gt; earlier this year. This Argo CD instance is completely managed by Codefresh personnel. The only thing you need to do is connect your cluster for deploying your applications.&lt;/p&gt;

&lt;p&gt;The main advantage of this method is that all maintenance effort is handled by Codefresh and not you. All version updates, security fixes and other upgrades are automatically handled behind the scenes by Codefresh and you can focus on deploying applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qAcGT-WO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7pghtar279ycnb0h8ybk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qAcGT-WO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7pghtar279ycnb0h8ybk.png" alt="Runtime Components" width="880" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hosted version of Argo CD is available to everyone who &lt;a href="https://codefresh.io/codefresh-signup/"&gt;signs up with Codefresh&lt;/a&gt;, including free accounts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Argo CD to manage Argo CD
&lt;/h2&gt;

&lt;p&gt;Using a hosted instance of Argo CD can be great for many organizations, but it may not be a fit if you need to deploy behind a firewall or need more customization. Ideally, you would like to customize your Argo CD installation, set up different settings, configure your own plugins, pin down specific Argo CD versions, etc.&lt;/p&gt;

&lt;p&gt;Hosting your own Argo CD instance is popular, but instead of doing it manually you should use a management platform on top of it. And the most obvious choice &lt;a href="https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#manage-argo-cd-using-argo-cd"&gt;would be managing Argo CD with itself&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;This use case is perfectly valid and a lot of organizations use self-managed Argo CD. The advantages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using GitOps to handle not just applications but also the Argo CD installation&lt;/li&gt;
&lt;li&gt;Full audit via Git&lt;/li&gt;
&lt;li&gt;Easy rollbacks&lt;/li&gt;
&lt;li&gt;Automatic drift detection for any manual changes&lt;/li&gt;
&lt;li&gt;Complete changelog of all configuration changes (e.g. notifications, SSO)&lt;/li&gt;
&lt;li&gt;Easy disaster recovery&lt;/li&gt;
&lt;/ul&gt;
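&lt;p&gt;As a sketch of the pattern, a self-managing Application can point at the Git folder that holds the Argo CD manifests themselves (the repository URL and path below are hypothetical):&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argocd
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/argocd-config.git  # hypothetical repo
    targetRevision: main
    path: argocd    # folder containing the Argo CD manifests/overlays
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```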

&lt;p&gt;This is a great way to handle a production instance of Argo CD. Depending on the size of your organization, however, it still suffers from some important challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You still have to perform manual upgrades and make sure that each new version of Argo CD “sits” cleanly on top of the previous one&lt;/li&gt;
&lt;li&gt;Handling a large number of Argo CD installations and keeping them all in sync (pun intended) is still a big challenge.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Using Argo CD Autopilot
&lt;/h2&gt;

&lt;p&gt;The use case for using Argo CD to manage Argo CD is very popular as a concept but there are no best practices yet on how to get started and how to bootstrap the whole environment.&lt;/p&gt;

&lt;p&gt;We use the same approach internally and we fully open-sourced our solution at &lt;a href="https://argocd-autopilot.readthedocs.io/en/stable/"&gt;https://argocd-autopilot.readthedocs.io/en/stable/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Argo CD Autopilot provides a CLI for installing and managing Argo CD that does the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connects to your Git provider&lt;/li&gt;
&lt;li&gt;Bootstraps Git repositories for handling both applications and itself&lt;/li&gt;
&lt;li&gt;Sets up Applications and ApplicationSets for auto-upgrading itself and other managed apps&lt;/li&gt;
&lt;li&gt;Provides a best practice Git repo structure for both internal and external applications&lt;/li&gt;
&lt;li&gt;Comes with a CLI that allows you to manage and maintain the installation&lt;/li&gt;
&lt;li&gt;Introduces the concepts of deployment environments/projects&lt;/li&gt;
&lt;/ol&gt;
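&lt;p&gt;Bootstrapping with Autopilot follows roughly this shape (the repository URL is a placeholder; consult the Autopilot documentation for the current flags):&lt;/p&gt;

```shell
# Autopilot needs a Git token to create/manage its installation repository
export GIT_TOKEN=<your-git-token>
argocd-autopilot repo bootstrap --repo https://github.com/example-org/argocd-installation
```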

&lt;p&gt;Argo CD Autopilot is under active development. You are welcome to participate in &lt;a href="https://github.com/argoproj-labs/argocd-autopilot"&gt;Github&lt;/a&gt; as well as the &lt;code&gt;#argo-cd-autopilot&lt;/code&gt; channel in the &lt;a href="https://slack.cncf.io/"&gt;CNCF slack&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a control plane
&lt;/h2&gt;

&lt;p&gt;Handling one or two Argo CD instances is pretty straightforward if you choose any of the above installation methods. Several organizations, however, have a large number of Argo CD instances that need to be kept in sync or rolled out gradually as new versions come out.&lt;/p&gt;

&lt;p&gt;Argo CD can natively support a management instance that handles multiple deployment clusters. So in theory you could have a single Argo CD instance for all your environments. We have already talked about this pattern in our article about &lt;a href="https://codefresh.io/blog/scaling-argo-cd-securely-in-2022/"&gt;scaling Argo&lt;/a&gt;. In the end, having a single instance is a single point of failure and also comes with its own issues for security and redundancy.&lt;/p&gt;

&lt;p&gt;On the other hand having an Argo CD instance for each deployment cluster is also excessive and can lead to a cumbersome setup where maintaining Argo CD instances becomes tedious and unmanageable, especially across virtual private clouds and firewalls.&lt;/p&gt;

&lt;p&gt;Ideally you would like a single management interface that can handle all possible combinations (Argo CD management cluster and deployment clusters) allowing you to craft your perfect topology.&lt;/p&gt;

&lt;p&gt;This management interface exists in the form of the Codefresh GitOps control plane!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oizblFU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5jdspimwo3bpycpiodkf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oizblFU7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5jdspimwo3bpycpiodkf.png" alt="Control plane" width="794" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Codefresh platform gives you a unified interface for handling all Argo CD instances no matter where they are located. All possible configurations are supported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Argo CD management clusters&lt;/li&gt;
&lt;li&gt;Argo CD deployment clusters&lt;/li&gt;
&lt;li&gt;Hosted Argo CD installations&lt;/li&gt;
&lt;li&gt;Deployment clusters managed by the hosted instance&lt;/li&gt;
&lt;li&gt;Argo CD instances that deploy on the same cluster they are installed on&lt;/li&gt;
&lt;li&gt;Argo CD instances that are deployed behind a firewall or on-premise environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Via the control plane interface you can then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect Argo CD instances&lt;/li&gt;
&lt;li&gt;Connect target deployment clusters&lt;/li&gt;
&lt;li&gt;See the status of each Argo CD instance and each connected cluster&lt;/li&gt;
&lt;li&gt;Upgrade Argo CD instances to a new version in a controlled manner&lt;/li&gt;
&lt;li&gt;Keep track of versions/security alerts, and easily upgrade&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unified control plane is the perfect tool for organizations that need a management interface on top of all their Argo CD instances, without the hassle of manual upgrades.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this blog post we have seen most Argo CD installation methods, from the simplest (just the manifests) to the most powerful (the unified control plane). Depending on the size and complexity of your organization, you should choose a management method that allows you to focus on what matters most: deploying applications, instead of managing the instances themselves.&lt;/p&gt;

&lt;p&gt;For more information on the hosted instance and the control plane &lt;a href="https://codefresh.io/codefresh-signup/"&gt;sign-up with Codefresh&lt;/a&gt;. See also our &lt;a href="https://codefresh.io/blog/argo-cd-best-practices/"&gt;best practices article&lt;/a&gt; and &lt;a href="https://codefresh.io/blog/getting-started-with-gitops-and-argo-cd/"&gt;getting started guides&lt;/a&gt;. And of course, don’t forget to get &lt;a href="https://learning.codefresh.io/"&gt;GitOps with Argo CD Certified&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>devops</category>
      <category>gitops</category>
    </item>
    <item>
      <title>How to Model Your Gitops Environments and Promote Releases between Them</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Wed, 23 Mar 2022 11:49:56 +0000</pubDate>
      <link>https://forem.com/codefreshio/how-to-model-your-gitops-environments-and-promote-releases-between-them-1p6i</link>
      <guid>https://forem.com/codefreshio/how-to-model-your-gitops-environments-and-promote-releases-between-them-1p6i</guid>
      <description>&lt;p&gt;Two of the most important questions that people ask themselves on day 2 after adopting GitOps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How should I represent different environments on Git?&lt;/li&gt;
&lt;li&gt;How should I handle promoting releases between environments?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In &lt;a href="https://codefresh.io/about-gitops/branches-gitops-environments/"&gt;the previous article of the series&lt;/a&gt;, I focused on what NOT to do and explained why using Git branches for different environments is a bad idea. I also hinted that the “environment-per-folder” approach is a better idea. That article proved hugely popular, and several people wanted to see all the details of the suggested folder-based structure for environments.&lt;/p&gt;

&lt;p&gt;In this article, I am going to explain how to model your GitOps environments using different folders on the same Git branch, and as an added bonus, how to handle environment promotion (both simple and complex) with simple file copy operations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1Aoaz7K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xd34ou4vc1lwvnm71vc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Z1Aoaz7K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xd34ou4vc1lwvnm71vc.png" alt="GitOps promotion" width="851" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully this article will help with the endless stream of &lt;a href="https://github.com/argoproj/argocd-example-apps/issues/57"&gt;questions&lt;/a&gt; and &lt;a href="https://github.com/argoproj/argo-cd/discussions/5667"&gt;discussions&lt;/a&gt; on this hot topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learn your application first
&lt;/h2&gt;

&lt;p&gt;Before creating your folder structure you need to do some research first and understand the “settings” of your application. Even though several people talk about application configuration in a generic manner, in reality not all configuration settings are equally important.&lt;/p&gt;

&lt;p&gt;In the context of a Kubernetes application, we have the following categories of “environment configuration”:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;application version&lt;/strong&gt; in the form of the container tag used. This is probably the most important setting in a Kubernetes manifest (as far as environment promotions are concerned). Depending on your use case, you might get away with simply changing the version of the container image. However, a change in the source code often requires a matching change in the deployment configuration as well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes specific settings&lt;/strong&gt; for your application. This includes the replicas of the application and other Kubernetes related information such as resource limits, health checks, persistent volumes, affinity rules, etc.&lt;/li&gt;
&lt;li&gt;Mostly &lt;strong&gt;static business settings&lt;/strong&gt;. This is the set of settings that are unrelated to Kubernetes but have to do with the business of your application. It might be external URLs, internal queue sizes, UI defaults, authentication profiles, etc. By “mostly static,” I mean settings that are defined once for each environment and then never change afterwards. For example, you always want your production environment to use production.paypal.com and your non-production environments to use staging.paypal.com. This is a setting that you never want to promote between environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-static business settings&lt;/strong&gt;. This is the same thing as the previous point, but it includes settings that you DO want to promote between environments. This could be a global VAT setting, your recommendation engine parameters, the available bitrate encodings, and any other setting that is specific to your business.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is imperative that you understand what all the different settings are and, more importantly, which of them belong to category 4 as these are the ones that you also want to promote along with your application version.&lt;/p&gt;

&lt;p&gt;This way you can cover all possible promotion scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your application moves from version 1.34 to 1.35 in QA. This is a simple source code change. Therefore you only need to change the container image property in your QA environment.&lt;/li&gt;
&lt;li&gt;Your application moves from version 3.23 to 3.24 in Staging. This is not a simple source code change. You need to update the container image property and also bring the new setting “recommender.batch_size” from QA to staging.&lt;/li&gt;
&lt;/ol&gt;
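&lt;p&gt;Scenario 2 above is nothing more than file copy operations. The sketch below is self-contained (it creates dummy files first); in the real repository you would copy between the existing environment folders and commit the result:&lt;/p&gt;

```shell
# Promotion between environment folders is a plain file copy followed by a
# commit. Directory names mirror the example repo (envs/qa, envs/staging);
# the files are created here so the sketch is self-contained.
mkdir -p envs/qa envs/staging
printf 'image: docker.io/example/simple-env-app:1.35\n' > envs/qa/version.yml
printf 'recommender.batch_size: "100"\n' > envs/qa/settings.yml

# Scenario 2: promote the new version AND the non-static business settings
cp envs/qa/version.yml envs/staging/version.yml
cp envs/qa/settings.yml envs/staging/settings.yml
# then: git add envs/staging && git commit -m "Promote 1.35 to staging"

cmp envs/qa/version.yml envs/staging/version.yml && echo "staging matches QA"
```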

&lt;p&gt;I see too many teams that don’t understand the distinction between different configuration parameters and have a single configuration file (or mechanism) with values from different areas (i.e. both runtime and application business settings).&lt;/p&gt;

&lt;p&gt;Once you have the list of your settings and which area they belong to, you are ready to create your environment structure and optimize the file copy operations for the settings that change a lot and need to be moved between environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example with 5 GitOps environments and variations between them
&lt;/h2&gt;

&lt;p&gt;Let’s see an actual example. I thought about doing the classic QA/Staging/Production trilogy, but this is rather boring so let’s dive into a more realistic example. &lt;/p&gt;

&lt;p&gt;We are going to model the environment situation mentioned in the first article of the series. The company that we will examine has 5 distinct environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load Testing&lt;/li&gt;
&lt;li&gt;Integration Testing&lt;/li&gt;
&lt;li&gt;QA&lt;/li&gt;
&lt;li&gt;Staging&lt;/li&gt;
&lt;li&gt;Production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then let’s assume that the last 2 environments are also deployed to EU, US, and Asia while the first 2 also have GPU and Non-GPU variations. This means that the company has a total of 11 environments. &lt;/p&gt;

&lt;p&gt;You can find the suggested folder structure at &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion"&gt;https://github.com/kostis-codefresh/gitops-environment-promotion&lt;/a&gt;. All environments are different folders in the same branch. There are NO branches for the different environments. If you want to know what is deployed in an environment, you simply look at the respective folder under envs/ in the main branch of the repo.&lt;/p&gt;

&lt;p&gt;Before we explain the structure, here are some disclaimers: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 1:&lt;/strong&gt; Writing this article took a long time because I wasn’t sure if I should cover &lt;a href="https://kustomize.io/"&gt;Kustomize&lt;/a&gt; or &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt; or plain manifests. I chose Kustomize as it makes things a bit easier (and I also mention Helm at the end of the article). Note however that the Kustomize templates in the example repo are simply for illustration purposes. The present article is NOT a Kustomize tutorial. In a real application, you might have &lt;a href="https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/configmapgenerator/"&gt;Configmap generators&lt;/a&gt;, &lt;a href="https://kubectl.docs.kubernetes.io/references/kustomize/kustomization/patchesstrategicmerge/"&gt;custom patches&lt;/a&gt; and adopt a completely different “component” structure than the one I am showing here. If you are not familiar with Kustomize, spend some time understanding its capabilities first and then come back to this article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer 2:&lt;/strong&gt; The &lt;a href="https://github.com/kostis-codefresh/gitops-promotion-source-code"&gt;application I use&lt;/a&gt; for the promotions is completely dummy, and its configuration misses several best practices mainly for brevity and simplicity reasons. For example, some deployments are missing &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/"&gt;health checks&lt;/a&gt;, and all of them are missing &lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"&gt;resource limits&lt;/a&gt;. Again, this article is NOT about how to create Kubernetes deployments. You should already know how proper deployment manifests look. If you want to learn more about production-grade best practices, then see my other article at &lt;a href="https://codefresh.io/kubernetes-tutorial/kubernetes-antipatterns-1/"&gt;https://codefresh.io/kubernetes-tutorial/kubernetes-antipatterns-1/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the disclaimers out of the way, here is the repository structure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NmdZpVhJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kluawshadkmduowh8wa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NmdZpVhJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7kluawshadkmduowh8wa.jpg" alt="GitOps folder structure" width="566" height="627"&gt;&lt;/a&gt; &lt;/p&gt;
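&lt;p&gt;In plain text, the layout looks roughly like this (the exact folder names, especially for the GPU variants, are illustrative; see the repository for the real ones):&lt;/p&gt;

```
├── base
├── variants
│   ├── prod
│   ├── non-prod
│   ├── eu
│   ├── us
│   ├── asia
│   ├── gpu
│   └── non-gpu
└── envs
    ├── load-gpu
    ├── load-non-gpu
    ├── integration-gpu
    ├── integration-non-gpu
    ├── qa
    ├── staging-eu
    ├── staging-us
    ├── staging-asia
    ├── prod-eu
    ├── prod-us
    └── prod-asia
```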

&lt;p&gt;The base directory holds configuration which is common to all environments. It is not expected to change often. If you want to make changes to multiple environments at the same time, it is best to use the “variants” folder.&lt;/p&gt;

&lt;p&gt;The variants folder (a.k.a. mixins, a.k.a. components) holds common characteristics between environments. It is up to you to define what exactly is “common” between your environments, after researching your application as discussed in the previous section.&lt;/p&gt;

&lt;p&gt;In the example application, we have variants for all prod and non-prod environments and also the regions. Here is an example of the &lt;a href="https://raw.githubusercontent.com/kostis-codefresh/gitops-environment-promotion/main/variants/prod/prod.yml"&gt;prod variant&lt;/a&gt; that applies to ALL production environments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webserver-simple&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ENV_TYPE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PAYPAL_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production.paypal.com"&lt;/span&gt;   
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_USER&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_username"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod_password"&lt;/span&gt;                     
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, we make sure that all production environments are using the production DB credentials, the production payment gateway, and a liveness probe (this is a contrived example, please see disclaimer 2 at the start of this section). These settings belong to the set of configuration that we don’t expect to promote between environments, but we assume that they will be static across the application lifecycle.&lt;/p&gt;

&lt;p&gt;With the base and variants ready, we can now define every final environment with a combination of those properties.&lt;/p&gt;

&lt;p&gt;Here is an example of the &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion/tree/main/envs/staging-asia"&gt;staging ASIA environment&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;

&lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
&lt;span class="na"&gt;namePrefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging-asia-&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../base&lt;/span&gt;

&lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../variants/non-prod&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../variants/asia&lt;/span&gt;

&lt;span class="na"&gt;patchesStrategicMerge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deployment.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;version.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;replicas.yml&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;settings.yml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First we define some common properties. We inherit all configuration from the base, from the non-prod variant, and from the variant for all environments in Asia.&lt;/p&gt;

&lt;p&gt;The key point here is the patches that we apply. The &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion/blob/main/envs/staging-asia/version.yml"&gt;version.yml&lt;/a&gt; and &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion/blob/main/envs/staging-asia/replicas.yml"&gt;replicas.yml&lt;/a&gt; are self-explanatory. Each defines a single property (the image or the replica count) and nothing else.&lt;/p&gt;

&lt;p&gt;The version.yml file (which is the most important thing to promote between environments) defines only the image of the application and nothing else.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webserver-simple&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.io/kostiscodefresh/simple-env-app:2.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The associated settings for each release that we &lt;strong&gt;DO&lt;/strong&gt; expect to promote between environments are defined in &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion/blob/main/envs/staging-asia/settings.yml"&gt;settings.yml&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;webserver-simple&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UI_THEME&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dark"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CACHE_SIZE&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024kb"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PAGE_LIMIT&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;25"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SORTING&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ascending"&lt;/span&gt;    
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;N_BUCKETS&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;42"&lt;/span&gt;         
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feel free to look at the &lt;a href="https://github.com/kostis-codefresh/gitops-environment-promotion"&gt;whole repository&lt;/a&gt; to understand the way all kustomizations are formed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performing the initial deployment via GitOps
&lt;/h2&gt;

&lt;p&gt;To deploy an application to its associated environment, just point your GitOps controller to the respective “env” folder and kustomize will create the complete hierarchy of settings and values.&lt;/p&gt;
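&lt;p&gt;As a sketch of what “pointing your GitOps controller” at an env folder looks like in practice, here is a minimal Argo CD Application. The repository URL, application name, and namespaces below are hypothetical placeholders:&lt;/p&gt;

```yaml
# Sketch of an Argo CD Application targeting one environment folder.
# repoURL, names, and namespaces are hypothetical placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: simple-app-staging-asia
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-environment-promotion.git
    targetRevision: main
    path: envs/staging-asia   # the environment folder for this deployment
  destination:
    server: https://kubernetes.default.svc
    namespace: staging-asia
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

&lt;p&gt;Argo CD detects the kustomization.yml inside the folder and renders the full hierarchy of bases and variants automatically.&lt;/p&gt;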

&lt;p&gt;Here is the &lt;a href="https://github.com/kostis-codefresh/gitops-promotion-source-code"&gt;example application&lt;/a&gt; as it runs in Staging/Asia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xh2Td7RF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jthmsbi5qg1hsfhf46wq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xh2Td7RF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jthmsbi5qg1hsfhf46wq.jpg" alt="GitOps application example" width="537" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also use Kustomize on the command line to preview what is going to be deployed for each environment. Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kustomize build envs/staging-asia
kustomize build envs/qa
kustomize build envs/integration-gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can of course pipe the output to kubectl to deploy each environment, but in the context of GitOps, you should always let your GitOps controller deploy your environments and avoid manual kubectl operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing the configuration of two environments
&lt;/h2&gt;

&lt;p&gt;A very common need for a software team is to understand what is different between two environments. I have seen several teams operate under the misconception that branches are the only way to easily find differences between environments.&lt;/p&gt;

&lt;p&gt;This could not be further from the truth. You can easily use mature file-diffing utilities to find what is different between environments just by comparing files and folders.&lt;/p&gt;

&lt;p&gt;The simplest way is to diff only the settings that are critical to the app.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vimdiff envs/integration-gpu/settings.yml envs/integration-non-gpu/settings.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wLakkrDX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8xsimyy6qbq5920lqkf6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wLakkrDX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8xsimyy6qbq5920lqkf6.jpg" alt="GitOps settings diff" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And with the help of kustomize, you can compare any number of whole environments for the full picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kustomize build envs/qa/&amp;gt; /tmp/qa.yml
kustomize build envs/staging-us/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/staging-us.yml
kustomize build envs/prod-us/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/prod-us.yml
vimdiff /tmp/staging-us.yml /tmp/qa.yml /tmp/prod-us.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5wSVXrSB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mjwjvvgiu2i9vgokr0n3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5wSVXrSB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mjwjvvgiu2i9vgokr0n3.jpg" alt="GitOps environment diff" width="880" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I personally don’t see any disadvantage to this method compared to performing a “git diff” between environment branches.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to perform promotions between GitOps environments
&lt;/h2&gt;

&lt;p&gt;Now that the file structure is clear, we can finally answer the age-old question: “How do I promote releases with GitOps?”&lt;/p&gt;

&lt;p&gt;Let’s see some promotion scenarios. If you have been paying attention to the file structure, you should already understand how all promotions resolve to simple file copy operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote application version from QA to staging environment in the US:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/qa/version.yml envs/staging-us/version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote application version from integration testing (GPU) to load testing (GPU) and then to QA. This is a two-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/integration-gpu/version.yml  envs/load-gpu/version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;li&gt;cp envs/load-gpu/version.yml  envs/qa/version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote an application from prod-eu to prod-us along with the extra configuration. Here we also copy our setting file(s).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/prod-eu/version.yml  envs/prod-us/version.yml&lt;/li&gt;
&lt;li&gt;cp envs/prod-eu/settings.yml  envs/prod-us/settings.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Make sure that QA has the same replica count as staging-asia&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/staging-asia/replicas.yml envs/qa/replicas.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Backport all settings from QA to integration testing (non-gpu variant)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/qa/settings.yml envs/integration-non-gpu/settings.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Make a global change to all non-prod environments at once (but see also next section for some discussion on this operation)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make your change in variants/non-prod/non-prod.yml &lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Add a new configuration file to all US environments (both production and staging).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the new manifest in the variants/us folder&lt;/li&gt;
&lt;li&gt;Modify the variants/us/kustomization.yml file to include the new manifest&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;
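&lt;p&gt;Step 2 of the last scenario could look like this (a sketch only; “us-config.yml” is a hypothetical name for the new manifest, and the variant is assumed to be a Kustomize component as in the example repository):&lt;/p&gt;

```yaml
# Sketch of variants/us/kustomization.yml after adding the new manifest.
# "us-config.yml" is a hypothetical file name used for illustration.
apiVersion: kustomize.config.k8s.io/v1alpha1
kind: Component
resources:
- us-config.yml   # the new manifest, now applied to all US environments
```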

&lt;p&gt;In general, all promotions are just copy operations. Unlike the environment-per-branch approach, you are now free to promote anything from any environment to any other environment without any fear of taking the wrong changes. Especially when it comes to back-porting configuration, environment-per-folder really shines as you can simply move configuration both “upwards” and “backwards” even between unrelated environments.&lt;/p&gt;

&lt;p&gt;Note that I am using cp operations just for illustration purposes. In a real application, this operation would be performed automatically by your CI system or other orchestration tool. And depending on the environment, you might want to create a Pull Request first instead of directly editing the folder in the main branch.&lt;/p&gt;
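&lt;p&gt;To make the automation idea concrete, here is a minimal sketch of a promotion step as a shell function. The folder layout is created on the fly purely for illustration; a real pipeline would operate on a checked-out Git repository and finish with a commit/push or a Pull Request:&lt;/p&gt;

```shell
# Minimal sketch of an automated promotion: copy version.yml between
# environment folders. The layout below is created only for illustration.
mkdir -p /tmp/promo-demo/envs/qa /tmp/promo-demo/envs/staging-us
printf 'image: example/app:2.0\n' > /tmp/promo-demo/envs/qa/version.yml
printf 'image: example/app:1.0\n' > /tmp/promo-demo/envs/staging-us/version.yml

promote() {
  # $1 = source environment, $2 = target environment
  cp "/tmp/promo-demo/envs/$1/version.yml" "/tmp/promo-demo/envs/$2/version.yml"
  # A real pipeline would now run git add/commit/push (or open a PR)
  echo "Promoted $1 -> $2"
}

promote qa staging-us
```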

&lt;h2&gt;
  
  
  Making changes to multiple environments at once
&lt;/h2&gt;

&lt;p&gt;Several people have asked in the comments of the &lt;a href="https://codefresh.io/about-gitops/branches-gitops-environments/"&gt;first article&lt;/a&gt; about the use-case of changing multiple environments at once and how to achieve and/or prevent this scenario.&lt;/p&gt;

&lt;p&gt;First of all, we need to define what exactly we mean by “multiple” environments. We can distinguish the following two cases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Changing multiple environments at once that are on the same “level.” As an example, you want to make a change that affects prod-us, prod-eu, and prod-asia at the same time.&lt;/li&gt;
&lt;li&gt;Changing multiple environments at once that are &lt;strong&gt;NOT&lt;/strong&gt; on the same level. As an example, you want to make a change to “integration” and “staging-eu” at the same time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first case is a valid scenario, and we will cover this below. However, I consider the second scenario an anti-pattern. The whole point of having different environments is to be able to release things in a gradual way and promote a change from one environment to the next. So if you find yourself deploying the same change in environments of different importance, ask yourself if this is really needed and why.&lt;/p&gt;

&lt;p&gt;For the valid scenario of deploying a single change to multiple “similar” environments, there are two strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If you are absolutely certain that the change is “safe” and you want it to reach all environments at once, you can make that change in the appropriate variant (or the respective folders). For example, if you commit/push a change in the variants/non-prod folder, then all non-production environments will get this change at the same time. I am personally &lt;strong&gt;against&lt;/strong&gt; this approach, because several changes look “safe” in theory but can be problematic in practice.&lt;/li&gt;
&lt;li&gt;The preferable approach is to apply the change to each individual folder and then move it to the “parent” variant when it is live on all environments.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s take an example. We want to make a change that affects all EU environments (e.g. a GDPR feature). The naive way would be to commit/push the configuration change directly to the variants/eu folder. This would indeed affect all EU environments (prod-eu and staging-eu). However, this is a bit risky, because if the deployment fails, you have just brought down a production environment.&lt;/p&gt;

&lt;p&gt;The suggested approach is the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make the change to envs/staging-eu first&lt;/li&gt;
&lt;li&gt;Then make the same change to envs/prod-eu&lt;/li&gt;
&lt;li&gt;Finally, delete the change from both environments and add it in variants/eu (in a single commit/push action).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mWt2fn1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aidbyxfyiw4e5zq80wh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mWt2fn1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aidbyxfyiw4e5zq80wh6.png" alt="Gradual GitOps promotion" width="880" height="585"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might recognize this pattern from gradual &lt;a href="https://databaserefactoring.com/"&gt;database refactorings&lt;/a&gt;. The final commit is “transitional” in the sense that it doesn’t really affect any environments in any way. Kustomize will create the exact same definition in both cases. Your GitOps controller shouldn’t find any differences at all.&lt;/p&gt;

&lt;p&gt;The advantage of this approach is, of course, that it is easy to roll back/revert the change as you move it through environments. The disadvantage is the increased effort (and number of commits) needed to promote the change to all environments, but I believe the added safety outweighs the extra effort.&lt;/p&gt;

&lt;p&gt;If you adopt this approach, it means that you &lt;strong&gt;never&lt;/strong&gt; apply new changes to the base folder directly. If you want a change to happen to all environments, you first apply the change to individual environments and/or variants and then backport it to the base folder while simultaneously removing it from all downstream folders.&lt;/p&gt;
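&lt;p&gt;The “backport” step can be sketched with plain file operations (the gdpr.yml file is a hypothetical example, and in a real repository the move and the deletions would land in a single commit):&lt;/p&gt;

```shell
# Sketch of the transitional commit: the same change currently lives in
# both EU environment folders and is moved up into the shared variant.
mkdir -p /tmp/variant-demo/envs/staging-eu /tmp/variant-demo/envs/prod-eu \
         /tmp/variant-demo/variants/eu
printf 'gdpr: enabled\n' > /tmp/variant-demo/envs/staging-eu/gdpr.yml
printf 'gdpr: enabled\n' > /tmp/variant-demo/envs/prod-eu/gdpr.yml

# Move the change into the variant and delete the per-environment copies.
# Kustomize would render identical output before and after this step.
mv /tmp/variant-demo/envs/staging-eu/gdpr.yml /tmp/variant-demo/variants/eu/gdpr.yml
rm /tmp/variant-demo/envs/prod-eu/gdpr.yml
```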

&lt;h2&gt;
  
  
  The advantages of the “environment-per-folder” approach
&lt;/h2&gt;

&lt;p&gt;Now that we have analyzed all the inner workings of the “environment-per-folder” approach, it is time to explain why it is better than the “environment-per-branch” approach. If you have been paying attention to the previous sections, you should already understand how the “environment-per-folder” approach directly avoids all the problems analyzed in the &lt;a href="https://codefresh.io/about-gitops/branches-gitops-environments/"&gt;previous article&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The most glaring issues with environment branches are the significance of commit order and the danger of bringing unwanted changes along when you merge from one environment to another. With the folder approach, these problems are completely eliminated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The order of commits on the repo is now irrelevant. When you copy a file from one folder to the next, you don’t care about its commit history, just its content.&lt;/li&gt;
&lt;li&gt;By only copying files around, you only take exactly what you need and nothing else. When you copy envs/qa/version.yml to envs/staging-asia/version.yml you can be certain that you only promote the container image and nothing else. If somebody else has changed the replicas in the QA environment in the meantime, it doesn’t affect your promotion action.&lt;/li&gt;
&lt;li&gt;You don’t need to use git cherry-picks or any other advanced git method to promote releases. You only copy files around and have access to the mature ecosystem of utilities for file processing.&lt;/li&gt;
&lt;li&gt;You are free to take any change from any environment to either an upstream or downstream environment without any constraints about the correct “order” of environments. If, for example, you want to backport your settings from production US to staging US, you can do a simple copy of envs/prod-us/settings.yml to envs/staging-us/settings.yml without the fear of inadvertently taking unrelated hotfixes that were supposed to exist only in production.&lt;/li&gt;
&lt;li&gt;You can easily use file diff operations to understand what is different between environments in all directions (both from source and target environments and vice versa).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I consider these advantages very important for any non-trivial application, and I bet that several “failed deployments” in big organizations could be directly or indirectly attributed to the problematic environment-per-branch model.&lt;/p&gt;

&lt;p&gt;The second problem mentioned in the previous article was the presence of configuration drift when you merge a branch to the next environment. The reason for this is that when you do a “git merge,” git only notifies you about the changes it will bring, and it doesn’t say anything about what changes are already in the target branch.&lt;/p&gt;

&lt;p&gt;Again this problem is completely eliminated with folders. As we said already, file diff operations have no concept of “direction.” You can copy any setting from any environment either upwards or downwards, and if you do a diff operation on the files, you will see all changes between environments regardless of their upstream/downstream position.&lt;/p&gt;

&lt;p&gt;The last point about environment branches was that the number of branches grows linearly with the number of environments. With 5 environments, you need to juggle changes between 5 branches, and with 20 environments, you need to have 20 branches. Moving a release correctly between a large number of branches is a cumbersome process, and in the case of production environments, it is a recipe for disaster.&lt;/p&gt;

&lt;p&gt;With the folder approach, the number of branches is not only static, it is exactly one. If you have 5 environments, you manage them all with your “main” branch, and if you need more environments, you only add extra folders. If you have 20 environments, you still need a single Git branch. Getting a centralized view of what is deployed where is trivial when you have a single branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Helm with GitOps environments
&lt;/h2&gt;

&lt;p&gt;If you don’t use Kustomize but prefer Helm instead, it is also possible to create a hierarchy of folders with “common” stuff for all environments, specific features/mixins/components, and final folders specific to each environment.&lt;/p&gt;

&lt;p&gt;Here is how the folder structure would look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chart/
  [...chart files here..]
common/
  values-common.yml
variants/
  prod/
     values-prod.yml
  non-prod/
    Values-non-prod.yml
  [...other variants…]
 envs/
     prod-eu/
           values-env-default.yaml
           values-replicas.yaml
           values-version.yaml
           values-settings.yaml
   [..other environments…]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again you need to spend some time to examine your application properties and decide how to split them into different value files for optimal promotion speed.&lt;/p&gt;

&lt;p&gt;Other than this, most of the processes are the same when it comes to environment promotion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote application version from QA to staging environment in the US:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/qa/values-version.yml envs/staging-us/values-version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote application version from integration testing (GPU) to load testing (GPU) and then to QA. This is a two-step process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/integration-gpu/values-version.yml  envs/load-gpu/values-version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;li&gt;cp envs/load-gpu/values-version.yml  envs/qa/values-version.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Promote an application from prod-eu to prod-us along with the extra configuration. Here we also copy our setting file(s).&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;cp envs/prod-eu/values-version.yml  envs/prod-us/values-version.yml&lt;/li&gt;
&lt;li&gt;cp envs/prod-eu/values-settings.yml  envs/prod-us/values-settings.yml&lt;/li&gt;
&lt;li&gt;commit/push changes&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is also critical to understand how Helm (or your GitOps agent which handles Helm) works with multiple value files and the order in which they override each other.&lt;/p&gt;
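&lt;p&gt;In short, Helm merges value files key by key, and files passed later on the command line override files passed earlier. A small sketch with hypothetical keys:&lt;/p&gt;

```yaml
# common/values-common.yml (sketch, hypothetical keys)
replicaCount: 2
cacheSize: 512kb
---
# envs/prod-eu/values-settings.yml (sketch) - passed later on the
# command line, so its keys win over the common file
cacheSize: 1024kb
```

&lt;p&gt;With both files passed in that order, the effective values are replicaCount: 2 (from the common file) and cacheSize: 1024kb (overridden by the environment file).&lt;/p&gt;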

&lt;p&gt;If you want to preview one of your environments, instead of “kustomize build” you can use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm template chart/ &lt;span class="nt"&gt;--values&lt;/span&gt; common/values-common.yaml &lt;span class="nt"&gt;--values&lt;/span&gt; variants/prod/values-prod.yaml –values envs/prod-eu/values-env-default.yml –values envs/prod-eu/values-replicas.yml –values envs/prod-eu/values-version.yml –values envs/prod-eu/values-settings.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that Helm is a bit more cumbersome than Kustomize if you have a large number of variants or files in each environment folder.&lt;/p&gt;

&lt;h2&gt;
  
  
  The “environment-per-git-repo” approach
&lt;/h2&gt;

&lt;p&gt;When I talk with big organizations about the folder approach, one of the first objections I see is that people (especially security teams) don’t like to see a single branch in a single Git repository that contains both prod and non-prod environments.&lt;/p&gt;

&lt;p&gt;This is an understandable objection and arguably can be the single weak point of the folder approach against the “environment-per-branch” paradigm. After all, it is much easier to secure individual branches in a Git repository instead of folders in a single branch.&lt;/p&gt;

&lt;p&gt;This problem can be easily solved with automation, validation checks, or even manual approvals if you think it is critical for your organization. I want to stress again that I only use “cp” in the file operations for promoting releases just for illustration purposes. It doesn't mean that an actual human should run cp manually in an interactive terminal when a promotion happens.&lt;/p&gt;

&lt;p&gt;Ideally you should have an automated system that copies files around and commits/pushes them. This can be your Continuous Integration (CI) system or other platform that deals with your software lifecycle. And if you still have humans that make the changes themselves, they should never commit to “main” directly. They should open a Pull Request instead. Then you should have a proper workflow that checks that Pull Request before merging.&lt;/p&gt;

&lt;p&gt;I realize, however, that some organizations are particularly sensitive to security issues and prefer a bulletproof approach to Git protection. For these organizations, you can employ two Git repositories: one holds the base configuration, all prod variants, and all prod environments (and everything else related to production), while the second holds all the non-production configuration.&lt;/p&gt;

&lt;p&gt;This approach makes promotions a bit harder, as you now need to check out two Git repositories before doing any promotion. On the other hand, it allows your security team to place extra security constraints on the “production” Git repository, and you still have a static number of Git repositories (exactly two) regardless of the number of environments you deploy to.&lt;/p&gt;

&lt;p&gt;I personally consider this approach overkill and, at least to me, it shows a lack of trust in developers and operators. The discussion on whether or not people should have direct access to production environments is a complex one and probably deserves a blog post on its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embrace folders and forget branches
&lt;/h2&gt;

&lt;p&gt;We hope that this blog post addressed all the questions that arose from the “don’t use branches for environments” article and that you now have a good understanding of the benefits of the folder approach and why you should use it.&lt;/p&gt;

&lt;p&gt;If you have other interesting use cases or have extra questions on the subject of organizing your GitOps environments, please ask in the comments section.&lt;/p&gt;

&lt;p&gt;Happy GitOps deployments!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Stop Using Branches for Deploying to Different GitOps Environments</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Fri, 17 Dec 2021 13:20:56 +0000</pubDate>
      <link>https://forem.com/codefreshio/stop-using-branches-for-deploying-to-different-gitops-environments-5bmj</link>
      <guid>https://forem.com/codefreshio/stop-using-branches-for-deploying-to-different-gitops-environments-5bmj</guid>
      <description>&lt;p&gt;In our big guide &lt;a href="https://codefresh.io/about-gitops/pains-gitops-1-0/" rel="noopener noreferrer"&gt;for GitOps problems&lt;/a&gt;, we briefly explained (see points 3 and 4) how the current crop of GitOps tools don’t really cover the case of promotion between different environments or how even to model multi-cluster setups. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hz1b2h79cvw9qk5ngpv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hz1b2h79cvw9qk5ngpv.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The question of “How do I promote a release to the next environment?” &lt;a href="https://github.com/argoproj/argocd-example-apps/issues/57" rel="noopener noreferrer"&gt;is becoming increasingly popular&lt;/a&gt; among organizations that want to adopt GitOps. And even though there are several possible answers, in this particular article I want to focus on what you should NOT do.&lt;/p&gt;

&lt;p&gt;You should NOT use Git branches for modeling different environments. If the Git repository holding your configuration (manifests/templates in the case of Kubernetes) has branches named “staging”, “QA”, “Production” and so on, then you have fallen into a trap. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lq4pdn51s90e92evaq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4lq4pdn51s90e92evaq4.png" alt="Branch for environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me repeat that. Using Git branches for modeling different environments is an anti-pattern. Don’t do it!&lt;/p&gt;

&lt;p&gt;We will explore the following points on why this practice is an anti-pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using different Git branches for deployment environments is a relic of the past.&lt;/li&gt;
&lt;li&gt;Pull requests and merges between different branches are problematic.&lt;/li&gt;
&lt;li&gt;People are tempted to include environment specific code and create configuration drift.&lt;/li&gt;
&lt;li&gt;As soon as you have a large number of environments, maintenance of all environments gets quickly out of hand.&lt;/li&gt;
&lt;li&gt;The branch-per-environment model goes against the existing Kubernetes ecosystem.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Using branches for different environments should only be applied to legacy applications
&lt;/h2&gt;

&lt;p&gt;When I ask people why they chose to use Git branches for modeling different environments, almost always the answer is a variation of “we’ve always done it that way,” “it feels natural,” “this is what our developers know,” and so on.&lt;/p&gt;

&lt;p&gt;And that is true. Most people are familiar with using branches for different environments. This practice was heavily popularized by the &lt;a href="https://nvie.com/posts/a-successful-git-branching-model/" rel="noopener noreferrer"&gt;venerable Git-Flow model&lt;/a&gt;. But since the introduction of this model, things have changed a lot. Even the original author has placed a huge warning at the top advising people &lt;strong&gt;against&lt;/strong&gt; adopting this model without understanding the repercussions.&lt;/p&gt;

&lt;p&gt;The fact is that the Git-flow model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is focused on application source code and not environment configuration (let alone Kubernetes manifests).&lt;/li&gt;
&lt;li&gt;Is best used when you need to support multiple versions of your application in production. This happens, but is not usually the case.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I am not going to talk too much about Git-flow here and its disadvantages because the present article is about GitOps environments and not application source code, but in summary, you should follow &lt;a href="https://trunkbaseddevelopment.com/" rel="noopener noreferrer"&gt;trunk-based development&lt;/a&gt; and use &lt;a href="https://trunkbaseddevelopment.com/feature-flags/" rel="noopener noreferrer"&gt;feature-flags&lt;/a&gt; if you need to support different features for different environments.&lt;/p&gt;

&lt;p&gt;In the context of GitOps, the application source code and your configuration should also be in different Git repositories (one repository with just application code and one repository with Kubernetes manifests/templates). This means that your choice of branching for the application source code should not affect how branches are used in the environment repository that defines your environments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76yb7to3r790jvn7nl52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F76yb7to3r790jvn7nl52.png" alt="Different Git repos"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you adopt GitOps for your next project, you should start with a clean slate. Application developers can choose whatever branching strategy they want for the application source code (and even use Git-flow), but the configuration Git repository (the one with all the Kubernetes manifests/templates) should NOT follow the branch-per-environment model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Promotion is never a simple Git merge
&lt;/h2&gt;

&lt;p&gt;Now that we know the history of using a branch-per-environment approach for deployments, we can talk about the actual disadvantages.&lt;/p&gt;

&lt;p&gt;The main advantage of this approach is the argument that “Promotion is a simple git merge.” In theory, if you want to promote a release from QA to staging, you simply merge your QA branch into the staging branch. And when you are ready for production, you again merge the staging branch into the production branch, and you can be certain that all changes from staging have reached production. &lt;/p&gt;

&lt;p&gt;Do you want to see what is different between production and staging? Just do a standard &lt;a href="https://git-scm.com/docs/git-diff" rel="noopener noreferrer"&gt;git diff&lt;/a&gt; between the two branches. Do you want to backport a configuration change from staging to QA? Again, a simple Git merge from the staging branch to the QA branch will do the trick.&lt;/p&gt;

&lt;p&gt;And if you want to place extra restrictions on promotions, you can use Pull Requests. So even though anybody could merge from QA to staging, if you want to merge something into the production branch, you can use a Pull Request and demand manual approval from all critical stakeholders.&lt;/p&gt;

&lt;p&gt;This all sounds great in theory, and some trivial scenarios can actually work like this. But in practice, this is never the case. Promoting a release via a Git merge can suffer from merge conflicts, unwanted changes, and even the wrong order of changes.&lt;/p&gt;

&lt;p&gt;As a simple example, let’s take this Kubernetes deployment that is currently sitting in the staging branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:2.2&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your QA team has informed you that version 2.3 (which is in the QA branch) looks good, and it is ready to be moved to staging. You merge the QA branch into the staging branch, promoting the application, and you think that everything is fine.&lt;/p&gt;

&lt;p&gt;What you didn’t know is that somebody also changed the number of replicas in the QA branch to 2 because of some resource limitations. With your Git merge, you not only deployed 2.3 to staging, but you also scaled the replicas to 2 (instead of 15), and that is probably something that you don’t want.&lt;/p&gt;
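&lt;p&gt;The scenario above can be reproduced in a throwaway repository. The following sketch is purely illustrative (the branch names and file contents are hypothetical, and it assumes a local git installation); it shows how the unrelated replica change rides along with the image promotion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# staging starts with version 2.2 and 15 replicas
git checkout -q -b staging
printf 'image: my-app:2.2\nreplicas: 15\n' | tee deployment.yaml
git add deployment.yaml
git commit -qm 'staging: my-app 2.2, 15 replicas'

# On the qa branch, someone promotes 2.3 but also scales down to 2 replicas
git checkout -q -b qa
printf 'image: my-app:2.3\nreplicas: 2\n' | tee deployment.yaml
git commit -qam 'qa: promote my-app 2.3, scale to 2'

# The "simple" promotion: merge qa into staging
git checkout -q staging
git merge -q qa

# The replica change came along with the image promotion
grep replicas deployment.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The merge completes without any conflict, so nothing prompts you to review the replica change before it lands in staging.&lt;/p&gt;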

&lt;p&gt;You might argue that it would be easy to look at the replica count before merging, but remember that in a real scenario you have a large number of applications, each with many manifests that are almost always templated (via Helm or Kustomize). So understanding which changes you want to bring and which to leave behind is not a trivial task.&lt;/p&gt;

&lt;p&gt;And even if you do find changes that should not be promoted, you need to manually choose the “good” parts using &lt;a href="https://git-scm.com/docs/git-cherry-pick" rel="noopener noreferrer"&gt;git cherry-pick&lt;/a&gt; or other non-standard methods which are a far cry from the original “simple” Git merge.&lt;/p&gt;

&lt;p&gt;But even if you are aware of all the changes that can be promoted, there are several cases where the order of promotion is not the same as the order of committing. As an example, suppose the following four changes happen to the QA environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" rel="noopener noreferrer"&gt;ingress of the application&lt;/a&gt; is updated with an extra hostname.&lt;/li&gt;
&lt;li&gt;Release 2.5 is promoted to the QA branch and all QA people start testing.&lt;/li&gt;
&lt;li&gt;A problem is found with 2.5 and a Kubernetes configmap is fixed.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/" rel="noopener noreferrer"&gt;Resource limits&lt;/a&gt; are fine-tuned and committed to QA.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is then decided that the ingress setting and the resource limits should move to the next environment (staging). But the QA team has not finished testing with the 2.5 release. &lt;/p&gt;

&lt;p&gt;If you blindly merge the QA branch to the staging branch, you will get all 4 changes at once, including the promotion of 2.5.&lt;/p&gt;

&lt;p&gt;To resolve this, again you need to use git cherry-pick or other manual methods.&lt;/p&gt;

&lt;p&gt;There are even more complicated cases where the commits have dependencies between them, so even cherry-pick will not work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixt5bdlgy02wrrux4pk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmixt5bdlgy02wrrux4pk.png" alt="Git dependencies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, release 1.24 must be promoted to production. The problem is that one of the commits (the hotfix) contains a multitude of changes, some of which depend on another commit (the ingress config change) that itself cannot be moved to production (as it applies only to staging). So even with cherry-picks, it is impossible to bring only the required changes from staging to production.&lt;/p&gt;

&lt;p&gt;The end result is that promotion is never a simple Git merge. Most organizations also have many applications deployed on many clusters, each composed of a large number of manifests. Manually choosing commits is a losing battle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration drift can be easily created by environment-specific changes
&lt;/h2&gt;

&lt;p&gt;In theory, configuration drift should not be an issue with Git merges. If you make a change in staging and then merge that branch to production, then all your changes should transfer to the new environment.&lt;/p&gt;

&lt;p&gt;In practice, however, things are different, because most organizations only merge in one direction, and team members are easily tempted to change upstream environments and never back-port the changes to downstream environments.&lt;/p&gt;

&lt;p&gt;In the classic example with 3 environments for QA, Staging, and Production, Git merges flow in only one direction. People merge the QA branch into staging and the staging branch into production. This means that changes only flow upwards.&lt;/p&gt;

&lt;p&gt;QA -&amp;gt; Staging -&amp;gt; Production.&lt;/p&gt;

&lt;p&gt;The classic scenario is that a quick configuration change is needed in production (a hotfix), and somebody applies the fix there. In the case of Kubernetes, this hotfix can be anything such as a change in an existing manifest or even a brand new manifest.&lt;/p&gt;

&lt;p&gt;Now production has a completely different configuration than staging. The next time a release is promoted from staging to production, Git will only notify you about what you will bring from staging. The ad hoc change on production will never appear anywhere in the Pull Request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnnywcgta49ahrbcwox8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnnywcgta49ahrbcwox8.png" alt="One direction only"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that all subsequent deployments can fail, as production now has an undocumented change that will never be detected by any subsequent promotions.&lt;/p&gt;

&lt;p&gt;In theory, you could backport such changes and periodically merge all commits from production back to staging (and from staging to QA). In practice, this never happens, for the reasons outlined in the previous point.&lt;/p&gt;
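&lt;p&gt;Git can at least surface this kind of drift: the commits that exist on the production branch but were never back-ported are exactly what a revision range query reveals. The following sketch is purely illustrative (the branch names and file contents are hypothetical, and it assumes a local git installation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

git checkout -q -b staging
printf 'replicas: 15\n' | tee app.yaml
git add app.yaml
git commit -qm 'initial configuration'

# An ad hoc hotfix is committed directly to the production branch
git checkout -q -b production
printf 'replicas: 15\ntimeout: 30\n' | tee app.yaml
git commit -qam 'hotfix: add timeout'

# Commits on production that were never back-ported to staging
git log --oneline staging..production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In practice, nobody runs such a check on a regular basis, which is exactly why the drift goes unnoticed until a promotion fails.&lt;/p&gt;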

&lt;p&gt;You can imagine that a large number of environments (and not just 3) further increases the problem. &lt;/p&gt;

&lt;p&gt;In summary, promoting releases via Git merges does not solve configuration drift. In fact, it makes drift more likely, as teams are tempted to make ad hoc changes that never flow back through the sequence of environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing different Git branches for a large number of environments is a losing battle
&lt;/h2&gt;

&lt;p&gt;In all the previous examples, I only used 3 environments (qa -&amp;gt; staging -&amp;gt; production) to illustrate the disadvantages of branch-based environment promotion.&lt;/p&gt;

&lt;p&gt;Depending on the size of your organization, you will have many more environments. If you factor in other dimensions such as geographical location, the number of environments can quickly skyrocket.&lt;/p&gt;

&lt;p&gt;For example, let’s take a company that has 5 environments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load Testing &lt;/li&gt;
&lt;li&gt;Integration testing &lt;/li&gt;
&lt;li&gt;QA&lt;/li&gt;
&lt;li&gt;Staging&lt;/li&gt;
&lt;li&gt;Production&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then let’s assume that the last 3 environments are also deployed to EU, US, and Asia, while the first 2 also have GPU and non-GPU variations. This means that the company has a total of 13 environments (3 environments in 3 regions each, plus 2 environments in 2 variations each). And this is for a single application.&lt;/p&gt;

&lt;p&gt;If you follow a branch-based approach for your environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to maintain 13 long-lived Git branches at all times.&lt;/li&gt;
&lt;li&gt;You need 19 pull requests to promote a single change across all environments.&lt;/li&gt;
&lt;li&gt;You have a two-dimensional promotion matrix with 5 steps upwards and 2-3 steps outwards.&lt;/li&gt;
&lt;li&gt;The possibility of wrong merges, configuration drift, and ad hoc changes is now non-trivial across all environment combinations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the context of this example organization, all previous issues are now more prevalent. &lt;/p&gt;

&lt;h2&gt;
  
  
  The branch-per-environment model goes against Helm/Kustomize
&lt;/h2&gt;

&lt;p&gt;Two of the most popular Kubernetes tools for describing applications are Helm and Kustomize. Let’s see how these two tools recommend modeling different environments.&lt;/p&gt;

&lt;p&gt;For Helm, you need to create a generic chart that itself accepts parameters in the form of a values.yaml file. If you want to have different environments, &lt;a href="https://codefresh.io/helm-tutorial/helm-deployment-environments/" rel="noopener noreferrer"&gt;you need multiple values files&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z0yl4gtd03ukjrghdxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3z0yl4gtd03ukjrghdxt.png" alt="Helm environments"&gt;&lt;/a&gt;&lt;/p&gt;
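&lt;p&gt;A minimal sketch of such a layout (the chart and file names are hypothetical): a single generic chart with one values file per environment, rendered by pointing Helm at the right file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# my-chart/
#   Chart.yaml
#   templates/
#   values-qa.yaml
#   values-staging.yaml
#   values-prod.yaml

# Render the staging variant of the application
helm template my-release ./my-chart -f ./my-chart/values-staging.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Notice that selecting an environment is a matter of choosing a file, not a branch.&lt;/p&gt;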

&lt;p&gt;For Kustomize, you need to create a “base” configuration, and then each environment is modeled as &lt;a href="https://codefresh.io/about-gitops/applied-gitops-with-kustomize/" rel="noopener noreferrer"&gt;an overlay that has its own folder&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xtgokjcisalztqsy7t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xtgokjcisalztqsy7t3.png" alt="Kustomize Overlays"&gt;&lt;/a&gt;&lt;/p&gt;
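&lt;p&gt;A minimal sketch of an overlay (the folder and file names are hypothetical). Each environment folder contains a kustomization.yaml that points back at the shared base and applies environment-specific patches:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # staging-specific patch file, e.g. a different replica count
  - path: replica-count.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Again, the environment is selected by building a folder (kustomize build overlays/staging), not by checking out a branch.&lt;/p&gt;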

&lt;p&gt;In both cases, different environments are modeled with different folders/files. Helm and Kustomize know nothing about Git branches or Git merges or Pull Requests. They use just plain files.&lt;/p&gt;

&lt;p&gt;Let me repeat that: both Helm and Kustomize use plain files for different environments, and not Git branches. This should be a good hint on how to model different Kubernetes configurations using either of these tools.&lt;/p&gt;

&lt;p&gt;If you introduce Git branches in the mix, you not only introduce extra complexity, but you also go against your own tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The recommended way to promote releases in GitOps environments
&lt;/h2&gt;

&lt;p&gt;Modeling different Kubernetes environments and promoting a release between them is a very common issue for all teams that adopt GitOps. Even though a very popular method is to use Git branches for each environment and assume that a promotion is a “simple” Git merge, we have seen in this article that this is an anti-pattern.&lt;/p&gt;

&lt;p&gt;In the next article, we will see a better approach to model your different environments and promote releases between your Kubernetes clusters. The last point of this article (regarding Helm/Kustomize) should already give you a hint on how this approach works.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>gitops</category>
      <category>kubernetes</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Using GitOps for Infrastructure and Applications With Crossplane and Argo CD</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Mon, 13 Dec 2021 13:43:05 +0000</pubDate>
      <link>https://forem.com/codefreshio/using-gitops-for-infrastructure-and-applications-with-crossplane-and-argo-cd-1mgd</link>
      <guid>https://forem.com/codefreshio/using-gitops-for-infrastructure-and-applications-with-crossplane-and-argo-cd-1mgd</guid>
      <description>&lt;p&gt;If you have been following the Codefresh blog for a while, you might have noticed a common pattern in all the articles that talk about Kubernetes deployments. Almost all of them start with a Kubernetes cluster that is already there, and then the article explains how to deploy an application on top.&lt;/p&gt;

&lt;p&gt;The reason for this simplification comes mainly from brevity and simplicity. We want to focus on the deployment part of the application and not its infrastructure just to make the article easier to follow. This is the obvious reason.&lt;/p&gt;

&lt;p&gt;The hidden reason is that until recently infrastructure deployments were handled in a different manner than application deployments. Especially in large enterprise companies, the skillsets of the people that deal with infrastructure and applications can vary a lot, as the tools of the trade are completely different. &lt;/p&gt;

&lt;p&gt;For example, a lot of people that deal with infrastructure &lt;a href="https://codefresh.io/docs/docs/yaml-examples/examples/terraform/"&gt;prefer to use Terraform templates&lt;/a&gt;, but employ Kustomize/Helm or other similar tools for application development. While this is a very valid solution, it doesn’t have to be this way. &lt;/p&gt;

&lt;p&gt;Now with GitOps you can have a uniform way of dealing with infrastructure and applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitOps and Terraform
&lt;/h2&gt;

&lt;p&gt;If you are not familiar with GitOps, head over to &lt;a href="https://opengitops.dev/"&gt;https://opengitops.dev/&lt;/a&gt;, the official page of the GitOps working group. The principles of GitOps are the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The system is described in a declarative manner. (In practice, this means Kubernetes manifests.)&lt;/li&gt;
&lt;li&gt;The definition of the system is versioned and audited. (In practice, it is stored in Git.)&lt;/li&gt;
&lt;li&gt;A software agent automatically pulls the Git state and matches the platform state. (In practice, this means &lt;a href="https://fluxcd.io/"&gt;Flux&lt;/a&gt;/&lt;a href="https://argo-cd.readthedocs.io/en/stable/"&gt;ArgoCD&lt;/a&gt;.)&lt;/li&gt;
&lt;li&gt;The state is continuously reconciled. This means that any changes happening in Git should also be reflected in the system, as well as the opposite scenario.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you have already worked with Terraform, you should understand why it is difficult to apply GitOps to the Terraform CLI. Terraform satisfies only the first requirement (a declarative format); it doesn’t satisfy the other three requirements around Git storage, automatic reconciliation, and two-way sync. As soon as Terraform finishes its job, it doesn’t interface with the system in any way. If you manually delete a Virtual Machine that was created by Terraform, vanilla Terraform doesn’t know about it. And regarding state, Terraform stores its own state, which is completely different from the definition files in Git.&lt;/p&gt;

&lt;p&gt;So at first glance, getting GitOps to work with infrastructure is a complicated task. You realize that you need to add something on top of vanilla Terraform in order to satisfy the GitOps requirements.&lt;/p&gt;

&lt;p&gt;But there is a new kid on the block, and that is &lt;a href="https://crossplane.io/"&gt;Crossplane&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  GitOps and Crossplane
&lt;/h2&gt;

&lt;p&gt;Crossplane has similar capabilities to Terraform (creating infrastructure) but with the following major differences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Crossplane itself is a Kubernetes application. (The infrastructure it creates can be anything.)&lt;/li&gt;
&lt;li&gt;Crossplane definitions are Kubernetes manifests.&lt;/li&gt;
&lt;li&gt;You can either use manifests that describe resources in the common cloud providers or create your own resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a simple example, here is an EC2 instance described in Crossplane:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ec2.aws.crossplane.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Instance&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-instance&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;forProvider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
    &lt;span class="na"&gt;imageId&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ami-0dc2d3e4c0f9ebd18&lt;/span&gt;
    &lt;span class="na"&gt;securityGroupRefs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-cluster-sg&lt;/span&gt;
    &lt;span class="na"&gt;subnetIdRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sample-subnet1&lt;/span&gt;  
  &lt;span class="na"&gt;providerConfigRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find more examples for &lt;a href="https://github.com/crossplane/provider-aws/tree/master/examples"&gt;Amazon&lt;/a&gt;, &lt;a href="https://github.com/crossplane/provider-gcp/tree/master/examples"&gt;Google&lt;/a&gt;, and &lt;a href="https://github.com/crossplane/provider-azure/tree/master/examples"&gt;Azure&lt;/a&gt;. Crossplane supports several other providers, and of course you can &lt;a href="https://github.com/crossplane/provider-template"&gt;add your own&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important point here is that the file above is a standard Kubernetes manifest. You can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apply it with kubectl.&lt;/li&gt;
&lt;li&gt;Verify it with manifest verification tools (&lt;a href="https://www.kubeval.com/"&gt;kubeval&lt;/a&gt; or &lt;a href="https://github.com/stackrox/kube-linter"&gt;kube-linter&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Template it with Helm/Kustomize.&lt;/li&gt;
&lt;li&gt;Use any of the tools in the Kubernetes ecosystem to read/manage/store it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And since it is a standard manifest, you can of course store it in Git and manage it with ArgoCD for a full GitOps workflow. The importance of this capability cannot be overstated.&lt;/p&gt;

&lt;p&gt;If you combine ArgoCD and Crossplane, you have a full solution for following the GitOps principles with infrastructure and not just applications. Imagine if your ArgoCD dashboard contained this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cU97kSbR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s4xtnny6c5lw4cdyrfur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cU97kSbR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s4xtnny6c5lw4cdyrfur.png" alt="ArgoCD and Crossplane" width="880" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Isn’t that cool?&lt;/p&gt;

&lt;h2&gt;
  
  
  How ArgoCD and Crossplane work together
&lt;/h2&gt;

&lt;p&gt;Crossplane offers an easy way to model your infrastructure as Kubernetes manifests. This is great on its own, but if you &lt;a href="https://argo-cd.readthedocs.io/en/stable/getting_started/"&gt;put ArgoCD in the mix&lt;/a&gt;, you essentially gain all the advantages of GitOps for your infrastructure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You know exactly what infrastructure you have simply by looking at the Git repository.&lt;/li&gt;
&lt;li&gt;You know exactly what was changed and when by looking at Git history.&lt;/li&gt;
&lt;li&gt;The infrastructure state IS the Git state (unlike Terraform, which keeps its own state as the single source of truth).&lt;/li&gt;
&lt;li&gt;You don’t need any external credentials any more.&lt;/li&gt;
&lt;li&gt;You can easily roll back to a previous version of your infrastructure with a git reset/revert.&lt;/li&gt;
&lt;li&gt;You completely avoid the dreaded configuration drift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is very critical. Terraform only knows what is in your infrastructure during deployment. If you make any manual changes afterwards (e.g. delete some infra), Terraform knows absolutely nothing about them. You need to rerun Terraform and pray that the correct action is taken. There are many Terraform horror stories about incomplete/invalid states. Here is one of my &lt;a href="https://www.youtube.com/watch?v=ix0Tw8uinWs"&gt;favorite tfstate stories from Spotify&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;ArgoCD will instantly detect any manual changes in your infrastructure, present you with a diff, and even allow you to completely discard them if you want.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Js7o0JZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qs4m8i78akely3vy2j5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Js7o0JZT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qs4m8i78akely3vy2j5z.png" alt="Argo Everywhere" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Essentially, ArgoCD is agnostic as to what exactly is described by the Kubernetes manifests it manages. They can be plain Kubernetes applications, virtual machines, container registries, load balancers, object storage, firewall rules, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Create infrastructure and deploy to it using GitOps
&lt;/h2&gt;

&lt;p&gt;Now that we have seen that you can use ArgoCD and Crossplane together to manage infrastructure, we are ready to treat applications and the platform they need in the same way.&lt;/p&gt;

&lt;p&gt;This means that we can do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from nothing.&lt;/li&gt;
&lt;li&gt;Use Crossplane to create a Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Commit the Crossplane manifest and manage it with ArgoCD.&lt;/li&gt;
&lt;li&gt;Use a standard manifest (e.g. deployment) to deploy an application to the cluster that was just created.&lt;/li&gt;
&lt;li&gt;Commit that manifest to Git too. ArgoCD will manage it just like the Crossplane one, even though one describes an application and the other infrastructure.&lt;/li&gt;
&lt;/ol&gt;
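&lt;p&gt;Step 3 is where the unification happens: the Crossplane manifests become just another ArgoCD application. A minimal Application sketch (the repository URL, path, and namespaces are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/infra-manifests
    targetRevision: main
    path: crossplane
  destination:
    server: https://kubernetes.default.svc
    namespace: crossplane-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With selfHeal enabled, ArgoCD also reverts manual changes to the infrastructure manifests automatically.&lt;/p&gt;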

&lt;p&gt;Here is the whole workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y9CxfvWf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7nnuu0othvd5oh46et9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y9CxfvWf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x7nnuu0othvd5oh46et9.png" alt="GitOps Everywhere" width="880" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Kubernetes cluster that Crossplane is running on is only used to bootstrap Crossplane. It doesn’t run any production workloads itself. For simple demos, this can be any cluster (even a local one running on your workstation).&lt;/p&gt;

&lt;p&gt;The end result is that now you have a unified way to handle both infrastructure and applications right from Argo CD.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VoH35Eh---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tzc4rxnuf7ypp8pjx2we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VoH35Eh---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tzc4rxnuf7ypp8pjx2we.png" alt="Application and Infrastructure management" width="880" height="462"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process of changing either of them is exactly the same. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you want to change infrastructure, you commit a change and ArgoCD takes care of it (with the help of Crossplane).&lt;/li&gt;
&lt;li&gt;If you want to change your application, you commit a change and ArgoCD takes care of it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They also both gain all the benefits of GitOps. If for example somebody changes the number of replicas in the deployment or tampers with the cluster nodes, ArgoCD will detect the manual changes automatically and give you the ability to discard them completely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why GitOps is the way forward
&lt;/h2&gt;

&lt;p&gt;The whole point of DevOps is to make everything self-service and promote collaboration between developers and operators. Adopting Crossplane in a GitOps setting means that now both parties have a common language that they can communicate with. &lt;/p&gt;

&lt;p&gt;There is no longer any need for separate workflows that are confusing to either side. Adopting a common workflow for both infrastructure and applications is the embodiment of the DevOps spirit.&lt;/p&gt;

&lt;p&gt;If you want to learn more about &lt;a href="https://codefresh.io/gitops/"&gt;GitOps&lt;/a&gt; and how Codefresh has embraced it, check out the &lt;a href="https://codefresh.io/codefresh-argo-platform/"&gt;Codefresh DevOps platform for Argo&lt;/a&gt;. For more information about Crossplane, see &lt;a href="https://crossplane.io/"&gt;the official site&lt;/a&gt; and the hosted solution by &lt;a href="https://www.upbound.io/"&gt;Upbound&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>gitops</category>
    </item>
    <item>
      <title>How to Handle Secrets Like a Pro Using Gitops</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Fri, 03 Sep 2021 11:24:03 +0000</pubDate>
      <link>https://forem.com/codefreshio/how-to-handle-secrets-like-a-pro-using-gitops-1cbp</link>
      <guid>https://forem.com/codefreshio/how-to-handle-secrets-like-a-pro-using-gitops-1cbp</guid>
      <description>&lt;p&gt;One of the foundations of &lt;a href="https://codefresh.io/gitops/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt; is the usage of Git as the source of truth for the whole system. While most people are familiar with the practice of storing the application source code in version control, GitOps dictates that you should also store all the other parts of your application, such as configuration, kubernetes manifests, db scripts, cluster definitions, etc.&lt;/p&gt;

&lt;p&gt;But what about secrets? How can you use secrets with GitOps? This has been one of &lt;a href="https://codefresh.io/about-gitops/pains-gitops-1-0/" rel="noopener noreferrer"&gt;the most popular questions&lt;/a&gt; from teams that adopt GitOps.&lt;/p&gt;

&lt;p&gt;The truth is that there is no single accepted practice for managing secrets with GitOps. If you already have a solid solution in place, &lt;a href="https://codefresh.io/codefresh-news/hashicorp-vault-integration/" rel="noopener noreferrer"&gt;such as HashiCorp Vault&lt;/a&gt;, then it makes sense to keep using it, even though strictly speaking it goes against GitOps.&lt;/p&gt;

&lt;p&gt;If you are starting from scratch on a new project, there are ways to store secrets in Git so that you can manage them using GitOps principles as well. It goes without saying that you should never ever commit raw secrets in Git. &lt;/p&gt;

&lt;p&gt;All solutions that handle secrets using Git are storing them in an encrypted form. This means that you can get the best of both worlds. Secrets can be managed with GitOps, and they can also be placed in a secure manner in any Git repository (even public repositories).&lt;/p&gt;

&lt;h2&gt;
  
  
  How Kubernetes secrets work
&lt;/h2&gt;

&lt;p&gt;In the context of this article we will talk about two kinds of secrets: the &lt;a href="https://kubernetes.io/docs/concepts/configuration/secret/" rel="noopener noreferrer"&gt;built-in Kubernetes secrets&lt;/a&gt; (present in every Kubernetes cluster) and the sealed secrets introduced by &lt;a href="https://engineering.bitnami.com/articles/sealed-secrets.html" rel="noopener noreferrer"&gt;the Bitnami Sealed Secrets controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Before we talk about sealed secrets, let’s talk about normal/plain secrets first. Kubernetes includes a native secret resource that you can use in your application. By default, these secrets are not encrypted in any way, and the base64 encoding used should never be mistaken for a security feature. While there are ways to encrypt Kubernetes secrets &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/" rel="noopener noreferrer"&gt;within the cluster itself&lt;/a&gt;, we are more interested in encrypting them outside the cluster, so that we can store them externally in Git as required by one of the founding principles of &lt;a href="https://codefresh.io/gitops/" rel="noopener noreferrer"&gt;GitOps&lt;/a&gt; (everything stored in Git).&lt;/p&gt;
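&lt;p&gt;To see why base64 offers no protection, note that decoding it requires no key at all (a quick illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# base64 is reversible encoding, not encryption
printf 's3cr3t' | base64
# czNjcjN0
printf 'czNjcjN0' | base64 -d
# s3cr3t
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Anyone with read access to a secret manifest can recover its values this way, which is exactly why plain secrets must never be committed to Git.&lt;/p&gt;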

&lt;p&gt;Using Kubernetes application secrets is straightforward. You can use the same mechanisms as &lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/" rel="noopener noreferrer"&gt;configmaps&lt;/a&gt;, namely mounting them as files on your application or passing them as environment variables.&lt;/p&gt;
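&lt;p&gt;As a minimal sketch (assuming a secret named &lt;code&gt;db-creds&lt;/code&gt; with a &lt;code&gt;password&lt;/code&gt; key; both names are illustrative), the two consumption styles look like this in a pod spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: demo
spec:
  containers:
    - name: app
      image: example/my-app:1.0
      env:
        - name: DB_PASSWORD            # secret exposed as an environment variable
          valueFrom:
            secretKeyRef:
              name: db-creds
              key: password
      volumeMounts:
        - name: creds                  # secret exposed as files under /secrets/mysql
          mountPath: /secrets/mysql
          readOnly: true
  volumes:
    - name: creds
      secret:
        secretName: db-creds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;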

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhful2v76e8wk3pg47gvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhful2v76e8wk3pg47gvx.png" alt="Plain Kubernetes secrets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sealed secrets are just an extension built on top of Kubernetes native secrets. This means that after the encryption/decryption takes place, all secrets function as plain Kubernetes secrets, and this is how your application should access them. If you don’t like how plain Kubernetes secrets work, then you need to find an alternative security mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  An example application with secrets
&lt;/h2&gt;

&lt;p&gt;As a running example, we will use a simple application found at &lt;a href="https://github.com/codefresh-contrib/gitops-secrets-sample-app" rel="noopener noreferrer"&gt;https://github.com/codefresh-contrib/gitops-secrets-sample-app&lt;/a&gt;. This is a web application that reads several dummy secrets and displays them (without actually using them in any way). &lt;/p&gt;

&lt;p&gt;We have chosen to make the application read the secrets as files (from /secrets/) instead of using environment variables. Here &lt;a href="https://github.com/codefresh-contrib/gitops-secrets-sample-app/blob/main/settings.ini" rel="noopener noreferrer"&gt;are the paths that are used&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[security]&lt;/span&gt;
&lt;span class="c"&gt;# Path to key pair
&lt;/span&gt;&lt;span class="py"&gt;private_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/secrets/sign/key.private&lt;/span&gt;
&lt;span class="py"&gt;public_key&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/secrets/sign/key.pub&lt;/span&gt;

&lt;span class="nn"&gt;[paypal]&lt;/span&gt;
&lt;span class="py"&gt;paypal_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;https://development.paypal.example.com&lt;/span&gt;
&lt;span class="py"&gt;paypal_cert&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/secrets/ssl/paypal.crt&lt;/span&gt;

&lt;span class="nn"&gt;[mysql]&lt;/span&gt;
&lt;span class="py"&gt;db_con&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/secrets/mysql/connection&lt;/span&gt;
&lt;span class="py"&gt;db_user&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/secrets/mysql/username&lt;/span&gt;
&lt;span class="py"&gt;db_password&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/secrets/mysql/password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is important to note that the application is very simple. It only reads secrets from these paths. It doesn’t know anything about Kubernetes, secret resources, volume mounts, or anything else. You could run it on a Docker container (outside of Kubernetes), and if the correct paths have secret files, it would just work.&lt;/p&gt;

&lt;p&gt;For illustration purposes, the application loads different kinds of secrets such as username/password, public/private key, and certificate. We will handle all of them in a similar way.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bitnami sealed secrets controller
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://github.com/bitnami-labs/sealed-secrets" rel="noopener noreferrer"&gt;Bitnami Sealed Secrets controller&lt;/a&gt; is a Kubernetes controller that you install on your cluster and that performs a single task: it converts sealed secrets (which can be committed to Git) into plain secrets (which can be used by your application).&lt;/p&gt;

&lt;p&gt;Installing the controller is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm repo update
helm install sealed-secrets-controller sealed-secrets/sealed-secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once installed, the controller generates its own key pair:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The private key is used for secret decryption. It should never leave the cluster, and you should never share it with anyone.&lt;/li&gt;
&lt;li&gt;The public key is used for secret encryption. It can (and will) be used outside the cluster, so it is safe to share.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the controller in place, you install and use your application in the standard way. You don’t need to change your application code or modify your Kubernetes manifests. If your application can use vanilla Kubernetes secrets, then it can work with sealed secrets as well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s8rs210mzotzd6z52i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7s8rs210mzotzd6z52i5.png" alt="Decoding process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It should be clear that the controller does not come in direct contact with your application. It converts sealed secrets to Kubernetes secrets, and afterwards it is up to your application how to use them. The application does not even know that its secrets were originally encrypted in Git.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encrypting your secrets
&lt;/h2&gt;

&lt;p&gt;We have seen how the controller decrypts secrets. But how do we encrypt them in the first place? The controller ships with a companion executable, kubeseal, created exactly for this purpose.&lt;/p&gt;

&lt;p&gt;It is a single binary, so you can install it by copying it to a directory of your choice (ideally one on your PATH):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.16.0/kubeseal-linux-amd64 -O kubeseal
sudo install -m 755 kubeseal /usr/local/bin/kubeseal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubeseal performs the opposite operation from the controller: it takes an existing Kubernetes secret and encrypts it. Kubeseal requests from the cluster the public key that was created during installation and uses it to encrypt all secrets.&lt;/p&gt;

&lt;p&gt;This means that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubeseal needs access to the cluster in order to encrypt secrets. (It expects a kubeconfig like kubectl.)&lt;/li&gt;
&lt;li&gt;Encrypted secrets can only be used in the cluster that was used for the encryption process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is very important, as it means that all secrets are cluster-specific. The namespace of the application is also used by default, so secrets are both cluster- and namespace-specific.&lt;/p&gt;

&lt;p&gt;If you want to use the same secret for different clusters, you need to encrypt it for each cluster individually.&lt;/p&gt;
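&lt;p&gt;One common pattern (a sketch; &lt;code&gt;prod&lt;/code&gt; and &lt;code&gt;staging&lt;/code&gt; are hypothetical kubeconfig contexts) is to fetch each cluster’s public certificate once and then encrypt the same plain secret separately for each cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fetch the public certificate of each cluster (safe to store anywhere)
kubeseal --context prod --fetch-cert &amp;gt; prod-cert.pem
kubeseal --context staging --fetch-cert &amp;gt; staging-cert.pem

# Encrypt the same plain secret once per target cluster
kubeseal --cert prod-cert.pem &amp;lt; .db-creds.yml &amp;gt; db-creds-prod.json
kubeseal --cert staging-cert.pem &amp;lt; .db-creds.yml &amp;gt; db-creds-staging.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the certificates fetched, the encryption step itself no longer needs cluster access, which is convenient for CI pipelines.&lt;/p&gt;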

&lt;p&gt;To use kubeseal, take any existing secret in YAML or JSON format and encrypt it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubeseal -n my-namespace &amp;lt; .db-creds.yml &amp;gt; db-creds.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a SealedSecret, a custom Kubernetes resource specific to the controller. This file is safe to commit to Git or store in any other external system.&lt;br&gt;
You can then apply the sealed secret to the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f db-creds.json -n my-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The secret is now part of the cluster and will be decrypted by the controller when an application needs it.&lt;/p&gt;

&lt;p&gt;Here is the full diagram of encryption/decryption:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljqcxmcpeos24li37apa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljqcxmcpeos24li37apa.png" alt="Full process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full process is the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You create a plain Kubernetes secret locally. You should never commit this anywhere.&lt;/li&gt;
&lt;li&gt;You use kubeseal to encrypt the secret into a SealedSecret.&lt;/li&gt;
&lt;li&gt;You delete the original secret from your workstation and apply the sealed secret to the cluster.&lt;/li&gt;
&lt;li&gt;You can optionally commit the sealed secret to Git.&lt;/li&gt;
&lt;li&gt;You deploy your application, which expects normal Kubernetes secrets. (The application needs no modifications of any kind.)&lt;/li&gt;
&lt;li&gt;The controller decrypts the sealed secrets and passes them to your application as plain secrets.&lt;/li&gt;
&lt;li&gt;The application works as usual.&lt;/li&gt;
&lt;/ol&gt;
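&lt;p&gt;The first four steps can be sketched on the command line (names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Create a plain Kubernetes secret locally (never commit this file)
kubectl create secret generic db-creds -n my-namespace \
  --from-literal=password=s3cr3t \
  --dry-run=client -o yaml &amp;gt; .db-creds.yml

# 2. Encrypt it into a SealedSecret
kubeseal -n my-namespace &amp;lt; .db-creds.yml &amp;gt; db-creds.json

# 3. Delete the plain secret and apply the sealed one to the cluster
rm .db-creds.yml
kubectl apply -f db-creds.json -n my-namespace

# 4. db-creds.json is now safe to commit to Git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;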

&lt;h2&gt;
  
  
  Using sealed secrets with Codefresh GitOps
&lt;/h2&gt;

&lt;p&gt;By using the Sealed Secrets controller, we can finally store all our secrets in Git (in encrypted form) right alongside the application configuration.&lt;/p&gt;

&lt;p&gt;In the example repository, the folder &lt;a href="https://github.com/codefresh-contrib/gitops-secrets-sample-app/tree/main/safe-to-commit" rel="noopener noreferrer"&gt;https://github.com/codefresh-contrib/gitops-secrets-sample-app/tree/main/safe-to-commit&lt;/a&gt; contains all the manifests of the application, including its secrets.&lt;/p&gt;

&lt;p&gt;You can simply point &lt;a href="https://codefresh.io/gitops/" rel="noopener noreferrer"&gt;the Codefresh GitOps UI&lt;/a&gt; to this folder and &lt;a href="https://codefresh.io/docs/docs/integrations/argo-cd/#creating-argocd-applications" rel="noopener noreferrer"&gt;deploy an application&lt;/a&gt; in a single step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7i1kx45wyj0t4x2pp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7i1kx45wyj0t4x2pp2.png" alt="Deploy GitOps app"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the deployment is finished, you can see the components of the application in the GitOps dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hu6zoz50wu1hk38kbpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0hu6zoz50wu1hk38kbpb.png" alt="GitOps dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you launch the application, you will see it is correctly reading all secrets:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxtlrsg6aowhemew3q7a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsxtlrsg6aowhemew3q7a.png" alt="Application secrets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From here on, the application follows all GitOps principles. If you make any change in Git (including changing the secrets), the cluster will be updated. And if you change something on the cluster (even secrets), Codefresh GitOps will detect the change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adopting the Sealed Secrets controller in production
&lt;/h2&gt;

&lt;p&gt;We have explained the basics of Sealed Secrets in this article, but if you want to use the controller in production you need to read the &lt;a href="https://github.com/bitnami-labs/sealed-secrets" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and take into account aspects such as secret rotation and key handling, both complex processes that should not be taken lightly. &lt;/p&gt;

&lt;p&gt;For more details on using the controller with Codefresh GitOps, please see &lt;a href="https://codefresh.io/docs/docs/yaml-examples/examples/gitops-secrets/" rel="noopener noreferrer"&gt;our example page&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;New to Codefresh? &lt;a href="https://codefresh.io/codefresh-signup/?utm_source=Blog&amp;amp;utm_medium=Post&amp;amp;utm_campaign=sealedsecrets" rel="noopener noreferrer"&gt;Create your free account today&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>kubernetes</category>
      <category>gitops</category>
      <category>devops</category>
    </item>
    <item>
      <title>Unlimited Preview Environments with Kubernetes Namespaces</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Fri, 09 Jul 2021 11:41:47 +0000</pubDate>
      <link>https://forem.com/codefreshio/unlimited-preview-environments-with-kubernetes-namespaces-1cn5</link>
      <guid>https://forem.com/codefreshio/unlimited-preview-environments-with-kubernetes-namespaces-1cn5</guid>
      <description>&lt;p&gt;In our &lt;a href="https://codefresh.io/kubernetes-tutorial/kubernetes-antipatterns-2/" rel="noopener noreferrer"&gt;big series of Kubernetes anti-patterns&lt;/a&gt;, we briefly explained that static test environments are no longer needed if you are using Kubernetes. They are expensive, hard to maintain, and hard to clean up. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwgu60py7ypsga7sdyjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwwgu60py7ypsga7sdyjk.png" alt="Static environments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead, we suggested the adoption of temporary environments that are created on demand when a pull request is opened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkhmchfbd9x9mjmfunlq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkhmchfbd9x9mjmfunlq.png" alt="Dynamic environments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we will see in practice how to achieve unlimited temporary environments using Kubernetes namespaces.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a naming strategy
&lt;/h2&gt;

&lt;p&gt;Since the preview environments will be created and destroyed in a dynamic manner, you need to select a strategy for their names. While several solutions exist for naming, the two most common variations are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using the name of the branch as a context URL. This means &lt;code&gt;example.com/feature1, example.com/feature2, example.com/feature3&lt;/code&gt;, and so on&lt;/li&gt;
&lt;li&gt;Using the name of the branch as a host subdomain. This means &lt;code&gt;feature1.example.com, feature2.example.com, feature3.example.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The context-URL-based strategy is very easy to set up since it doesn’t need any special DNS settings (or TLS certs/wildcards). On the other hand, not all applications are designed to run under a different root context.&lt;br&gt;
If you are certain that your application will not have issues with the context URL, then this strategy is the easiest to start with.&lt;/p&gt;

&lt;p&gt;The host-based naming strategy is much more robust, but it requires some configuration in your DNS provider to catch all subdomains and send them to the cluster that will hold all your preview namespaces.&lt;/p&gt;

&lt;p&gt;In both cases, we also use an underlying Kubernetes namespace with the same name as the branch. We take advantage of the fact that Git branches have unique names, making sure that there are no clashes between environment names or Kubernetes namespaces.&lt;/p&gt;

&lt;p&gt;It is also very common for teams to create branch names that represent specific issues (e.g. with JIRA). This makes it very easy to understand what developers are implementing for each feature environment.&lt;/p&gt;

&lt;p&gt;For example, if a developer starts working on “issue 45 for billing”, then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Git branch is created with name &lt;code&gt;issue-45-billing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A temporary environment is deployed at Kubernetes namespace &lt;code&gt;issue-45-billing&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The environment is exposed at &lt;code&gt;example.com/issue-45-billing&lt;/code&gt; or at &lt;code&gt;issue-45-billing.example.com&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
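&lt;p&gt;Note that Git branch names are not always valid namespace names (Kubernetes namespaces must be lowercase DNS-1123 labels), so pipelines typically normalize them first; Codefresh, for example, exposes a normalized form of the branch name as a pipeline variable. A minimal sketch of such a normalization (the helper name is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Turn a Git branch name into a valid Kubernetes namespace:
# lowercase letters, digits and dashes, at most 63 characters
normalize_branch() {
  echo "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]/-/g' -e 's/^-*//' -e 's/-*$//' \
    | cut -c1-63
}

normalize_branch "Issue-45-Billing"    # issue-45-billing
normalize_branch "feature/login_page"  # feature-login-page
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;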
&lt;h2&gt;
  
  
  Using a Kubernetes Ingress for traffic management
&lt;/h2&gt;

&lt;p&gt;You can create a preview environment in a Kubernetes namespace using any of the available deployment mechanisms (e.g. Helm or Kustomize). In order to distinguish traffic between different pods, you also need to install a Kubernetes Ingress. An &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" rel="noopener noreferrer"&gt;Ingress&lt;/a&gt; is a special Kubernetes resource responsible for routing requests inside the cluster.&lt;/p&gt;

&lt;p&gt;There are several implementations available, and for our example, we will use the &lt;a href="https://www.getambassador.io/" rel="noopener noreferrer"&gt;Ambassador gateway&lt;/a&gt;. We will use &lt;a href="https://www.getambassador.io/products/edge-stack/" rel="noopener noreferrer"&gt;Ambassador Edge Stack 1.13.8&lt;/a&gt;, but the open-source &lt;a href="https://github.com/emissary-ingress/emissary" rel="noopener noreferrer"&gt;Emissary Ingress&lt;/a&gt; should work as well. Both host-based and context-based naming strategies are supported natively by the Ingress specification.&lt;/p&gt;

&lt;p&gt;You can see how we set up our Ingress in the &lt;a href="https://github.com/codefresh-contrib/unlimited-test-environments-manifests/blob/main/simple-java-app/templates/ingress.yaml" rel="noopener noreferrer"&gt;example application&lt;/a&gt; for the context-based naming strategy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;extensions/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple-java-app-ing"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.ingress.class&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;

&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.ingress.path&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-service&lt;/span&gt;
              &lt;span class="na"&gt;servicePort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The most important property is the “path” property, which tells the Ingress which URL context to honor when a request comes into the cluster (e.g. example.com/feature1, example.com/feature2, and so on).&lt;/p&gt;

&lt;p&gt;We use Helm for making this path property configurable. This means we can pass a Helm value for each deployment that represents the URL of that preview environment.&lt;/p&gt;
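&lt;p&gt;For a branch named &lt;code&gt;feature1&lt;/code&gt;, the resulting deployment command would look roughly like this (a sketch based on the chart and values used in the example repository):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm upgrade --install my-spring-app ./simple-java-app \
  --namespace feature1 --create-namespace \
  --set image_tag=feature1 \
  --set ingress_path=/feature1/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each preview environment therefore differs only in its namespace and in the values passed to the chart.&lt;/p&gt;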

&lt;p&gt;Apart from the configurable Ingress, our example application is a vanilla Kubernetes application. You can see the full &lt;a href="https://github.com/codefresh-contrib/unlimited-test-environments-manifests/tree/main/simple-java-app/templates" rel="noopener noreferrer"&gt;Helm chart in GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an environment for a pull request
&lt;/h2&gt;

&lt;p&gt;With the application manifests in place and an Ingress installed in the cluster, we are now ready to set up the workflow for the temporary environments.&lt;/p&gt;

&lt;p&gt;First we need a pipeline that creates a temporary environment when a pull request is opened (or synced/updated).&lt;/p&gt;

&lt;p&gt;Codefresh comes with a rich set of triggers that allow you to define exactly which events will launch the pipeline. Here is the trigger dialog:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdoqa1y9a1r9s7me1p9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdoqa1y9a1r9s7me1p9b.png" alt="Pull Request events"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are only interested in the initial event of opening a pull request along with the “sync” event. In GitHub terms, a pull request is synced when somebody pushes another commit to an already open pull request. We want to update the environment in this case as well.&lt;/p&gt;

&lt;p&gt;As for the pipeline itself, it is very simple with just 4 steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prepare"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy"&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;main_clone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cloning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;repository"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git-clone"&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codefresh-contrib/unlimited-test-environments-source-code"&lt;/span&gt;
    &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${{CF_REVISION}}"&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prepare"&lt;/span&gt;
  &lt;span class="na"&gt;build_app_image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Building Docker Image&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prepare&lt;/span&gt;
    &lt;span class="na"&gt;image_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kostiscodefresh/spring-actuator-sample-app&lt;/span&gt;
    &lt;span class="na"&gt;working_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./&lt;/span&gt;
    &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${{CF_BRANCH}}'&lt;/span&gt;
    &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;
  &lt;span class="na"&gt;clone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cloning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;repository"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git-clone"&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codefresh-contrib/unlimited-test-environments-manifests"&lt;/span&gt;
    &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy"&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploying Helm Chart&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;working_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./unlimited-test-environments-manifests&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
      &lt;span class="na"&gt;chart_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-java-app&lt;/span&gt;
      &lt;span class="na"&gt;release_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-spring-app&lt;/span&gt;
      &lt;span class="na"&gt;helm_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.2.4&lt;/span&gt;
      &lt;span class="na"&gt;kube_context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myawscluster&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{CF_BRANCH_TAG_NORMALIZED}}&lt;/span&gt;
      &lt;span class="na"&gt;cmd_ps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--create-namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--wait&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5m'&lt;/span&gt;
      &lt;span class="na"&gt;custom_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_tag=${{CF_BRANCH_TAG_NORMALIZED}}'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replicaCount=3'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ingress_path=/${{CF_BRANCH_TAG_NORMALIZED}}/'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four steps are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/steps/git-clone/" rel="noopener noreferrer"&gt;clone step&lt;/a&gt; for checking out the source of the application&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/steps/build/" rel="noopener noreferrer"&gt;build step&lt;/a&gt; to create a container image and also push it to Docker Hub&lt;/li&gt;
&lt;li&gt;Another clone step for checking out the Helm chart&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://codefresh.io/docs/docs/new-helm/using-helm-in-codefresh-pipeline/" rel="noopener noreferrer"&gt;Helm deploy step&lt;/a&gt; to deploy the application to a new namespace&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key point here is the &lt;code&gt;CF_BRANCH_TAG_NORMALIZED&lt;/code&gt; variable. This &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/variables/" rel="noopener noreferrer"&gt;variable&lt;/a&gt; is provided by Codefresh and holds the name of the Git branch that triggered the pipeline, normalized so that it is valid as a Docker tag (and, conveniently, as a Kubernetes namespace name).&lt;/p&gt;

&lt;p&gt;We use this variable in the deploy step, both in the &lt;code&gt;namespace&lt;/code&gt; property and in the &lt;code&gt;ingress_path&lt;/code&gt; property.&lt;/p&gt;
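
&lt;p&gt;As a rough illustration of the normalization (the exact rules are defined by Codefresh; see the variables documentation), the branch name is lowercased and characters that are invalid in Docker tags or DNS names are replaced:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CF_BRANCH                = My-Feature
CF_BRANCH_TAG_NORMALIZED = my-feature   # illustrative; exact rules are in the Codefresh variables docs
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
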

&lt;p&gt;As an example, if I create a pull request for a GitHub branch named “demo” and run this pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A namespace called &lt;code&gt;demo&lt;/code&gt; will be created on the cluster&lt;/li&gt;
&lt;li&gt;Helm will deploy a version of the application on that namespace&lt;/li&gt;
&lt;li&gt;The Ingress of the cluster will be instructed to redirect all traffic at &lt;code&gt;/demo/&lt;/code&gt; to this deployment&lt;/li&gt;
&lt;/ol&gt;
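
&lt;p&gt;For the &lt;code&gt;demo&lt;/code&gt; branch, the deploy step above is roughly equivalent to the following Helm invocation (a sketch only; the actual command line is assembled internally by the Codefresh Helm step):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm install my-spring-app simple-java-app \
  --kube-context myawscluster \
  --namespace demo --create-namespace \
  --wait --timeout 5m \
  --set image_tag=demo \
  --set replicaCount=3 \
  --set ingress_path=/demo/
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
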

&lt;p&gt;Here is the resulting deployment in the browser:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tzp92hkn1xkvojgbe6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8tzp92hkn1xkvojgbe6.png" alt="Example deployment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that’s it! Now each time a new pull request is opened, a new deployment will take place in the respective namespace.&lt;/p&gt;

&lt;p&gt;Because our Git trigger also catches the PR sync event, we can push additional commits to a branch with an open pull request, and a new deployment with all our changes will take place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic comments on the pull request with the environment URL
&lt;/h2&gt;

&lt;p&gt;Even if you have a naming convention for preview environments that is easy to remember, it is still very helpful for the whole team to have a written record of when each temporary environment was created.&lt;/p&gt;

&lt;p&gt;One of the most common patterns is adding the environment URL as a comment in the same pull request that created it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluuz9jnsh5jg0hljfurz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fluuz9jnsh5jg0hljfurz.png" alt="Pull request comment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the example above, I am working on feature 2345, on a branch called &lt;code&gt;pr-2345&lt;/code&gt;. After I created the pull request, the environment was deployed to my Kubernetes cluster, and a comment on the pull request shows the exact URL.&lt;/p&gt;

&lt;p&gt;This way, anybody who is responsible for reviewing the pull request has, in a single place, both the file changes and the temporary environment for checking how the application looks after the changes.&lt;/p&gt;

&lt;p&gt;To achieve this pattern, you can use the Codefresh plugin for &lt;a href="https://codefresh.io/steps/step/kostis-codefresh%2Fgithub-pr-comment" rel="noopener noreferrer"&gt;adding comments to pull requests&lt;/a&gt;. &lt;br&gt;
You can add the following snippet in your Codefresh pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt; &lt;span class="na"&gt;add_pr_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Adding comment on PR&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kostis-codefresh/github-pr-comment&lt;/span&gt;
    &lt;span class="na"&gt;fail_fast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PR_COMMENT_TEXT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CI]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Staging&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;https://kostis.sales-dev.codefresh.io/${{CF_BRANCH_TAG_NORMALIZED}}/"&lt;/span&gt;
      &lt;span class="na"&gt;GIT_PROVIDER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;github-1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this pipeline step, we add a comment on the pull request. For the comment itself, we again use the predefined &lt;code&gt;CF_BRANCH_TAG_NORMALIZED&lt;/code&gt; variable, which provides the normalized name of the branch behind the pull request.&lt;/p&gt;

&lt;p&gt;The plugin knows which pull request to comment on by automatically reading the pull request information from the trigger of the pipeline, so there is no need to specify it explicitly.&lt;/p&gt;
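
&lt;p&gt;If you prefer not to use the plugin, the same effect can be achieved with a plain freestyle step that calls the GitHub API directly. The sketch below assumes a &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; pipeline variable and uses placeholder values for the repository owner and name, which you would replace with your own:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  add_pr_comment_manually:
    title: Adding comment on PR
    stage: deploy
    image: curlimages/curl
    fail_fast: false
    commands:
      # POST to the GitHub issue-comments endpoint of the triggering PR
      - &gt;-
        curl -H "Authorization: token ${{GITHUB_TOKEN}}"
        -d '{"body":"[CI] Staging environment is at https://kostis.sales-dev.codefresh.io/${{CF_BRANCH_TAG_NORMALIZED}}/"}'
        "https://api.github.com/repos/my-org/my-repo/issues/${{CF_PULL_REQUEST_NUMBER}}/comments"
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
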

&lt;h2&gt;
  
  
  Quality checks and smoke tests
&lt;/h2&gt;

&lt;p&gt;Creating automatic preview environments for each pull request is a capability that is also offered by several other products in the Kubernetes ecosystem. The real power of Codefresh is the flexibility to add any verification steps before or after the creation of the environment.&lt;/p&gt;

&lt;p&gt;For example, it would be wise to run unit and integration tests before an environment is deployed. After all, if unit tests fail, does it really make sense to create a temporary environment? The developer should instead fix the unit tests and then try to deploy again.&lt;/p&gt;

&lt;p&gt;On the other hand, you may want to use the temporary environment for integration tests and possibly security scans. This way, when a pull request is created, the reviewer has all the needed information at hand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The code that was changed&lt;/li&gt;
&lt;li&gt;How the application looks&lt;/li&gt;
&lt;li&gt;Whether the new code introduces security issues&lt;/li&gt;
&lt;li&gt;Whether the new code passes unit and integration tests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Making all this information available in a single place results in a much faster review process.&lt;/p&gt;

&lt;p&gt;It is also possible to add extra steps in the pipeline that check things &lt;strong&gt;after&lt;/strong&gt; the environment is created. A very common example is running a set of smoke tests on the newly created temporary environment. This gives you even higher confidence about the correctness of the changes.&lt;/p&gt;

&lt;p&gt;You can also include other steps after the deployment such as posting a message to a Slack channel, sending an email, updating a dashboard, and so on.&lt;/p&gt;
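
&lt;p&gt;As an example of such a post-deployment step, a Slack notification can be sent from a simple freestyle step that calls an incoming webhook. This is only a sketch: &lt;code&gt;SLACK_WEBHOOK_URL&lt;/code&gt; is a placeholder pipeline variable that you would define yourself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  notify_slack:
    title: Notify Slack
    stage: deploy
    image: curlimages/curl
    fail_fast: false
    commands:
      # Post a short message to a Slack incoming webhook
      - &gt;-
        curl -X POST -H 'Content-type: application/json'
        -d '{"text":"Preview environment for ${{CF_BRANCH}} is ready"}'
        ${{SLACK_WEBHOOK_URL}}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
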

&lt;p&gt;Here is our final pipeline for creating a preview environment when a pull request is opened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faikzq02ryqhna9tdu3qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faikzq02ryqhna9tdu3qu.png" alt="Preview pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This pipeline has the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/steps/git-clone/" rel="noopener noreferrer"&gt;clone step&lt;/a&gt; to fetch the source code of the application&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/steps/freestyle/" rel="noopener noreferrer"&gt;freestyle step&lt;/a&gt; that runs Maven for compilation and unit tests&lt;/li&gt;
&lt;li&gt;A &lt;a href="https://codefresh.io/docs/docs/codefresh-yaml/steps/build/" rel="noopener noreferrer"&gt;build step&lt;/a&gt; to create the docker image of the application&lt;/li&gt;
&lt;li&gt;A step that scans the source code for security issues with &lt;a href="https://snyk.io/" rel="noopener noreferrer"&gt;Snyk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A step that scans the container image &lt;a href="https://codefresh.io/docs/docs/testing/security-scanning/" rel="noopener noreferrer"&gt;for security issues&lt;/a&gt; with &lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;trivy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A step that runs &lt;a href="https://codefresh.io/docs/docs/testing/integration-tests/" rel="noopener noreferrer"&gt;integration tests&lt;/a&gt; by launching the app in a service container&lt;/li&gt;
&lt;li&gt;A step for &lt;a href="https://codefresh.io/docs/docs/testing/security-scanning/" rel="noopener noreferrer"&gt;Sonar analysis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A step that clones &lt;a href="https://github.com/codefresh-contrib/unlimited-test-environments-manifests" rel="noopener noreferrer"&gt;a second Git repository&lt;/a&gt; that has the &lt;a href="https://codefresh.io/docs/docs/new-helm/using-helm-in-codefresh-pipeline/" rel="noopener noreferrer"&gt;Helm chart of the app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A step that deploys the application to a new namespace using the Helm chart&lt;/li&gt;
&lt;li&gt;A step that &lt;a href="https://codefresh.io/steps/step/kostis-codefresh%2Fgithub-pr-comment" rel="noopener noreferrer"&gt;adds a comment&lt;/a&gt; on the pull request with the URL of the temporary environment&lt;/li&gt;
&lt;li&gt;A step that runs smoke tests against the temporary test environment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the whole pipeline definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;stages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prepare"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy"&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;main_clone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cloning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;repository"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git-clone"&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codefresh-contrib/unlimited-test-environments-source-code"&lt;/span&gt;
    &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${{CF_REVISION}}"&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prepare"&lt;/span&gt;

  &lt;span class="na"&gt;run_unit_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Compile/Unit test&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prepare&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maven:3.5.2-jdk-8-alpine'&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mvn -Dmaven.repo.local=/codefresh/volume/m2_repository package&lt;/span&gt;   
  &lt;span class="na"&gt;build_app_image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Building Docker Image&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prepare&lt;/span&gt;
    &lt;span class="na"&gt;image_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kostiscodefresh/spring-actuator-sample-app&lt;/span&gt;
    &lt;span class="na"&gt;working_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./&lt;/span&gt;
    &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${{CF_BRANCH}}'&lt;/span&gt;
    &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;
  &lt;span class="na"&gt;scan_code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Source security scan&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;snyk/snyk-cli:maven-3.6.3_java11'&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;snyk monitor&lt;/span&gt;       
  &lt;span class="na"&gt;scan_image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container security scan&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aquasec/trivy'&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;trivy image docker.io/kostiscodefresh/spring-actuator-sample-app:${{CF_BRANCH}}&lt;/span&gt;
  &lt;span class="na"&gt;run_integration_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Integration tests&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven:3.5.2-jdk-8-alpine&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mvn -Dmaven.repo.local=/codefresh/volume/m2_repository verify -Dserver.host=http://my-spring-app -Dsonar.organization=kostis-codefresh-github&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;composition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;my-spring-app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${{build_app_image}}'&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
      &lt;span class="na"&gt;readiness&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;byrnedo/alpine-curl&lt;/span&gt;
        &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;curl&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;http://my-spring-app:8080/"&lt;/span&gt;
  &lt;span class="na"&gt;sonar_scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sonar Scan&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;verify&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;maven:3.8.1-jdk-11-slim'&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mvn -Dmaven.repo.local=/codefresh/volume/m2_repository sonar:sonar -Dsonar.login=${{SONAR_TOKEN}} -Dsonar.host.url=https://sonarcloud.io -Dsonar.organization=kostis-codefresh-github&lt;/span&gt;
  &lt;span class="na"&gt;clone&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cloning&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;repository"&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git-clone"&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codefresh-contrib/unlimited-test-environments-manifests"&lt;/span&gt;
    &lt;span class="na"&gt;revision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deploy"&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploying Helm Chart&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;working_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./unlimited-test-environments-manifests&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
      &lt;span class="na"&gt;chart_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-java-app&lt;/span&gt;
      &lt;span class="na"&gt;release_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-spring-app&lt;/span&gt;
      &lt;span class="na"&gt;helm_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.2.4&lt;/span&gt;
      &lt;span class="na"&gt;kube_context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myawscluster&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{CF_BRANCH_TAG_NORMALIZED}}&lt;/span&gt;
      &lt;span class="na"&gt;cmd_ps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;--create-namespace&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--wait&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--timeout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5m'&lt;/span&gt;
      &lt;span class="na"&gt;custom_values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;image_tag=${{CF_BRANCH_TAG_NORMALIZED}}'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replicaCount=3'&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ingress_path=/${{CF_BRANCH_TAG_NORMALIZED}}/'&lt;/span&gt;
  &lt;span class="na"&gt;add_pr_comment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Adding comment on PR&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kostis-codefresh/github-pr-comment&lt;/span&gt;
    &lt;span class="na"&gt;fail_fast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;PR_COMMENT_TEXT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[CI]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Staging&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;https://kostis.sales-dev.codefresh.io/${{CF_BRANCH_TAG_NORMALIZED}}/"&lt;/span&gt;
      &lt;span class="na"&gt;GIT_PROVIDER_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;github-1'&lt;/span&gt;
  &lt;span class="na"&gt;run_smoke_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Smoke tests&lt;/span&gt;
    &lt;span class="na"&gt;stage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven:3.5.2-jdk-8-alpine&lt;/span&gt;
    &lt;span class="na"&gt;working_directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;${{main_clone}}"&lt;/span&gt;
    &lt;span class="na"&gt;fail_fast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;mvn -Dmaven.repo.local=/codefresh/volume/m2_repository verify -Dserver.host=https://kostis.sales-dev.codefresh.io/${{CF_BRANCH_TAG_NORMALIZED}}/  -Dserver.port=443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when a preview environment is created, you have the guarantee that it passed the checks defined by your team (quality and security), leaving only the actual business logic as a review item.&lt;/p&gt;

&lt;p&gt;This makes the process of reviewing pull requests as fast as possible, since all the common checks are fully automated and reviewers can focus solely on how the application works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Destroying a preview environment
&lt;/h2&gt;

&lt;p&gt;Creating a preview environment can be a costly operation in a big team. Reducing cloud costs is one of the biggest challenges when it comes to Kubernetes and cloud infrastructure.&lt;/p&gt;

&lt;p&gt;You need a way to clean up preview environments when they are no longer used. Some teams run an automatic job (e.g. via cron) to destroy stale preview environments, but the most cost-effective option is to delete a preview environment immediately after the respective pull request is closed.&lt;/p&gt;
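
&lt;p&gt;To illustrate the cron-based alternative, a periodic job could remove any preview environment older than a day. This sketch assumes that preview namespaces were labelled &lt;code&gt;env=preview&lt;/code&gt; at creation time (the pipeline above would need to add that label) and that GNU &lt;code&gt;date&lt;/code&gt; is available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/bin/sh
# Nightly cleanup of stale preview environments.
# Assumes preview namespaces carry the label env=preview.
cutoff=$(date -d '24 hours ago' +%s)
for ns in $(kubectl get ns -l env=preview -o name | cut -d/ -f2); do
  created=$(date -d "$(kubectl get ns "$ns" -o jsonpath='{.metadata.creationTimestamp}')" +%s)
  if [ "$created" -lt "$cutoff" ]; then
    helm delete my-spring-app --namespace "$ns"
    kubectl delete namespace "$ns"
  fi
done
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
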

&lt;p&gt;We can set up a trigger for this event using the Git dialog of Codefresh:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e5tgfmn36qja8z4m2yt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2e5tgfmn36qja8z4m2yt.png" alt="PR close events"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this pipeline, we capture pull request closed events. It does not matter whether the pull request was merged or not: since it is closed, we assume that the respective preview environment is no longer needed.&lt;/p&gt;

&lt;p&gt;The pipeline that deletes the environment is trivial; it has only one step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz17xluwrl40amgij8mc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffz17xluwrl40amgij8mc.png" alt="Pr close pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the full definition of the delete pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.0"&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;delete_app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete app&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;helm&lt;/span&gt;
    &lt;span class="na"&gt;arguments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth&lt;/span&gt;
      &lt;span class="na"&gt;helm_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.2.4&lt;/span&gt;
      &lt;span class="na"&gt;kube_context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myawscluster&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{CF_BRANCH_TAG_NORMALIZED}}&lt;/span&gt;
      &lt;span class="na"&gt;commands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;helm delete my-spring-app --namespace ${{CF_BRANCH_TAG_NORMALIZED}}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kubectl delete namespace ${{CF_BRANCH_TAG_NORMALIZED}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the pipeline, we uninstall the Helm application and also delete the respective namespace with the same name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adopting the mindset of preview environments
&lt;/h2&gt;

&lt;p&gt;We hope you enjoyed this tutorial for preview environments. Adopting Kubernetes is both a technical challenge and a paradigm shift, as several traditional practices are no longer needed. Stop using predefined test environments and switch to dynamic preview environments today!&lt;/p&gt;

&lt;p&gt;For more details, see our &lt;a href="https://codefresh.io/docs/docs/ci-cd-guides/preview-environments/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for preview environments.&lt;/p&gt;

&lt;p&gt;Note that preview environments can affect your billing if they are not properly configured and managed. If you are running an open source project with public infrastructure you need to take precautions to prevent abuse of this mechanism. &lt;/p&gt;

&lt;p&gt;New to Codefresh? &lt;a href="https://codefresh.io/codefresh-signup/?utm_source=Blog&amp;amp;utm_medium=Post&amp;amp;utm_campaign=previewenvs" rel="noopener noreferrer"&gt;Create your free account today&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>docker</category>
    </item>
    <item>
      <title>Enterprise CI/CD Best Practices – Part 3</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Mon, 14 Jun 2021 10:30:17 +0000</pubDate>
      <link>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-3-fk9</link>
      <guid>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-3-fk9</guid>
      <description>&lt;p&gt;This is the third and last part in our “Enterprise CI/CD best practices” series. See also &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-1-2n5m"&gt;part 1&lt;/a&gt; and &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-2-3o24"&gt;part 2&lt;/a&gt; for the previous best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 16 – Database Updates have their own Lifecycle
&lt;/h2&gt;

&lt;p&gt;As more and more companies adopt continuous delivery we see an alarming trend of treating databases as an external entity that exists outside of the delivery process. This could not be further from the truth.&lt;/p&gt;

&lt;p&gt;Databases (and other supporting systems such as message queues, caches, service discovery solutions, etc.) should be handled like any other software project. This means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Their configuration and contents should be stored in version control&lt;/li&gt;
&lt;li&gt;All associated scripts, maintenance actions, and upgrade/downgrade instructions should also be in version control&lt;/li&gt;
&lt;li&gt;Configuration changes should be approved like any other software change (passing from automated analysis, pull request review, security scanning, unit testing, etc.)&lt;/li&gt;
&lt;li&gt;Dedicated pipelines should be responsible for installing/upgrading/rolling back each new version of the database&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last point is especially important. There are a lot of programming frameworks (e.g., Rails migrations, Liquibase for Java, ORM migrations) that allow the application itself to handle DB migrations. Usually, the first time the application starts up, it can also upgrade the associated database to the correct schema. While convenient, this practice makes rollbacks very difficult and is best avoided.&lt;/p&gt;

&lt;p&gt;Database migration should be handled like an isolated software upgrade. You should have automated pipelines that deal only with the database, and the application pipelines should not touch the database in any way. This will give you the maximum flexibility to handle database upgrades and rollbacks by controlling exactly when and how a database upgrade takes place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 17 – Database Updates are Automated
&lt;/h2&gt;

&lt;p&gt;Several organizations have stellar pipelines for the application code, but pay very little attention to automation for database updates. Handling databases should be given the same importance (if not more) as the application itself.&lt;/p&gt;

&lt;p&gt;This means that you should automate databases in the same way as application code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store database changesets in source control&lt;/li&gt;
&lt;li&gt;Create pipelines that automatically update your database when a new changeset is created&lt;/li&gt;
&lt;li&gt;Have dynamic temporary environments for databases where changesets are reviewed before being merged to mainline&lt;/li&gt;
&lt;li&gt;Have code reviews and other quality checks on database changesets&lt;/li&gt;
&lt;li&gt;Have a strategy for doing rollbacks after a failed database upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also helps if you automate the transformation of production data to test data that can be used in your test environments for your application code. In most cases, it is inefficient (or even impossible due to security constraints) to keep a copy of all production data in test environments. It is better to have a small subset of data that is anonymized/simplified so that it can be handled more efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 18 – Perform Gradual Database Upgrades
&lt;/h2&gt;

&lt;p&gt;Application rollbacks are well understood, and we are now at the point where we have dedicated tools that perform rollbacks after a failed application deployment. And with progressive delivery techniques such as canaries and blue/green deployments, we can minimize the downtime even further.&lt;/p&gt;

&lt;p&gt;Progressive delivery techniques do not work on databases (because of the inherent state), but we can plan the database upgrades and adopt &lt;a href="https://martinfowler.com/articles/evodb.html"&gt;evolutionary database design principles&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By following an evolutionary design you can make all your database changesets backwards and forwards compatible, allowing you to roll back application and database changes at any time without any ill effects.&lt;/p&gt;

&lt;p&gt;As an example, if you want to rename a column, instead of simply creating a changeset that renames the column and performing a single database upgrade, you follow a schedule of gradual updates as below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Database changeset that only adds a new column with the new name (and copies existing data from the old column). The application code is still writing/reading from the old column&lt;/li&gt;
&lt;li&gt;Application upgrade where the application code now writes to both columns but reads from the new column&lt;/li&gt;
&lt;li&gt;Application upgrade where the application code writes/reads only to the new column&lt;/li&gt;
&lt;li&gt;Database upgrade that removes the old column&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The process needs a well-disciplined team as it makes each database change span over several deployments. But the advantages of this process cannot be overstated. At any stage in this process, you can go back to the previous version without losing data and without the need for downtime.&lt;/p&gt;
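The four steps above can be sketched with SQLite as a stand-in database; the `customers` table and column names are purely illustrative (and note that `DROP COLUMN` needs SQLite 3.35+):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Existing schema: a table with the old column name.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, surname TEXT)")
cur.execute("INSERT INTO customers (surname) VALUES ('Smith'), ('Jones')")

# Step 1 (database changeset): add the new column and backfill it from
# the old one. The application still reads/writes 'surname' at this point.
cur.execute("ALTER TABLE customers ADD COLUMN last_name TEXT")
cur.execute("UPDATE customers SET last_name = surname")

# Steps 2-3 happen on the application side: first write to both columns
# while reading from the new one, then read/write only 'last_name'.

# Step 4 (database changeset, only after all application instances use
# 'last_name'): drop the old column. Until this runs, either app version
# can be deployed or rolled back without data loss.
if sqlite3.sqlite_version_info >= (3, 35, 0):
    cur.execute("ALTER TABLE customers DROP COLUMN surname")

rows = cur.execute("SELECT last_name FROM customers ORDER BY id").fetchall()
print(rows)  # data survived the rename: [('Smith',), ('Jones',)]
```

Each step is an independently deployable changeset, which is what makes the rollback at any intermediate point safe.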

&lt;p&gt;For the full list of techniques see the &lt;a href="https://databaserefactoring.com/"&gt;database refactoring website&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 19 – All deployments must happen via the CD platform only (and never from workstations)
&lt;/h2&gt;

&lt;p&gt;Continuing the theme of immutable artifacts and deployments that send to production exactly what was built, we must also make sure that the pipelines themselves are the single path to production.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FRwut5zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1v9y4k6ky6jqf3nkk8fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FRwut5zv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1v9y4k6ky6jqf3nkk8fe.png" alt="Single way to deploy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main way to use CI/CD pipelines as intended is to make sure that the CI/CD platform is the &lt;strong&gt;only&lt;/strong&gt; application that can deploy to production. This practice guarantees that production environments are running what they are expected to be running (i.e., the last artifact that was deployed).&lt;/p&gt;

&lt;p&gt;Unfortunately, several organizations either allow developers to deploy directly from their workstations, or even to “inject” their artifacts in a pipeline at various stages.&lt;/p&gt;

&lt;p&gt;This is a very dangerous practice as it breaks the traceability and monitoring offered by a proper CI/CD platform. It allows developers to deploy to production features that might not be committed in source control in the first place. A lot of failed deployments stem from a missing file that was present on a developer workstation and not in source control.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rmHU_ajC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qm05nmk1eukrv5w74r4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rmHU_ajC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qm05nmk1eukrv5w74r4d.png" alt="Multiple ways to deploy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary, there is a single critical path for deployments, and this path is strictly handled by the CI/CD platform. Deploying production code from developer workstations should be prohibited at the network/access/hardware level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 20 – Use Progressive Deployment Patterns
&lt;/h2&gt;

&lt;p&gt;We already talked about database deployments in best practice 18 and how each database upgrade should be backwards and forwards compatible. This pattern goes hand-in-hand with progressive delivery patterns on the application side.&lt;/p&gt;

&lt;p&gt;Traditional deployments follow an all-or-nothing approach where all application instances move forward to the next version of the software. This is a very simple deployment approach but makes rollbacks a challenging process.&lt;/p&gt;

&lt;p&gt;You should instead look at:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://martinfowler.com/bliki/BlueGreenDeployment.html"&gt;Blue/Green deployments&lt;/a&gt; that deploy a whole new set of instances of the new version, but still keep the old one for easy rollbacks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://martinfowler.com/bliki/CanaryRelease.html"&gt;Canary releases&lt;/a&gt; where only a subset of the application instances move to the new version. Most users are still routed to the previous version&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you couple these techniques with gradual database deployments, you can minimize the amount of downtime involved when a new deployment happens. Rollbacks also become a trivial process as in both cases you simply change your load balancer/service mesh to the previous configuration and all users are routed back to the original version of the application.&lt;/p&gt;
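To see why rollback becomes trivial, note that a canary is essentially a weighted routing decision. This toy router is a simulation of what the load balancer/service mesh does, not a real API:

```python
import random

def make_router(canary_weight: float):
    """Route each request to 'v2' with probability canary_weight,
    otherwise to the stable 'v1'. In a real setup this weight lives in
    the load balancer / service mesh configuration."""
    def route(rand=random.random):
        return "v2" if rand() < canary_weight else "v1"
    return route

# Canary: 10% of traffic goes to the new version.
route = make_router(0.10)
sample = [route() for _ in range(10_000)]
print(sample.count("v2") / len(sample))  # roughly 0.10

# Rollback is just a configuration change: set the weight back to zero
# and every user is routed to the original version again.
route = make_router(0.0)
assert all(route() == "v1" for _ in range(1_000))
```

The same mental model applies to blue/green: the "weight" flips between 0% and 100% in a single switch.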

&lt;p&gt;Make sure to also look at involving your metrics (see best practices 21 and 22) in the deployment process for fully automated rollbacks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 21 – Metrics and logs can detect a bad deployment
&lt;/h2&gt;

&lt;p&gt;Having a pipeline that deploys your application (even when you use progressive delivery) is not enough if you want to know the real result of the deployment. Deployments that look “successful” at first glance, but soon prove to introduce regressions, are a very common occurrence in large software projects.&lt;/p&gt;

&lt;p&gt;A lot of development teams simply perform a visual check/smoke test after a deployment has finished and call it a day if everything “looks” good. But this practice is not enough and can quickly lead to the introduction of subtle bugs or performance issues.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mf4QGAqw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/or6337bmbdgfj4228ohk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mf4QGAqw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/or6337bmbdgfj4228ohk.png" alt="Without metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The correct approach is the adoption of application (and infrastructure) metrics. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed logs for application events&lt;/li&gt;
&lt;li&gt;Metrics that count and monitor key features of the application&lt;/li&gt;
&lt;li&gt;Tracing information that can provide an in-depth understanding of what a single request is doing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once these metrics are in place, the effects of a deployment should be judged according to a before/after comparison of these metrics. This means that metrics should not simply be a debugging mechanism (post-incident), but should instead act as an early warning measure against failed deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GDbWxTpC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7korybc2m6olc4xb5szl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GDbWxTpC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7korybc2m6olc4xb5szl.png" alt="With metrics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choosing what events to monitor and where to place logs is a complex process. For large applications, it is best to follow a gradual redefinition of key metrics according to past deployments. The suggested workflow is the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Place logs and metrics on events that you guess will show a failed deployment&lt;/li&gt;
&lt;li&gt;Perform several deployments and see if your metrics can detect the failed ones&lt;/li&gt;
&lt;li&gt;If you see a failed deployment that wasn’t detected in your metrics, it means that they are not enough. Fine-tune your metrics accordingly so that the next time a deployment fails in the same manner, you catch it right away&lt;/li&gt;
&lt;/ol&gt;
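The before/after comparison can be sketched as a simple gate; the metric names and threshold ratios below are illustrative, not prescriptive:

```python
def deployment_regressed(before: dict, after: dict,
                         max_error_ratio: float = 1.5,
                         max_latency_ratio: float = 1.3) -> bool:
    """Compare key metrics captured before and after a deployment.
    Flag the deployment when the error rate or p95 latency grows
    beyond the allowed ratio relative to the pre-deployment baseline."""
    if after["error_rate"] > before["error_rate"] * max_error_ratio:
        return True
    if after["p95_latency_ms"] > before["p95_latency_ms"] * max_latency_ratio:
        return True
    return False

before = {"error_rate": 0.010, "p95_latency_ms": 120}
good   = {"error_rate": 0.011, "p95_latency_ms": 125}  # normal noise
bad    = {"error_rate": 0.050, "p95_latency_ms": 400}  # clear regression

print(deployment_regressed(before, good))  # False
print(deployment_regressed(before, bad))   # True
```

In practice the "before" snapshot comes from your monitoring system, and the thresholds are tuned per application using the feedback loop described above.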

&lt;p&gt;Too many times, development teams focus on “vanity” metrics, i.e., metrics that look good on paper but say nothing about a failed deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 22 – Automatic Rollbacks are in place
&lt;/h2&gt;

&lt;p&gt;This is a continuation of the previous best practice. If you already have good metrics in place (that can verify the success of a deployment) you can take them to the next level by having automated rollbacks that depend on them.&lt;/p&gt;

&lt;p&gt;A lot of organizations have great metrics in place, but only manually use them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A developer looks at some key metrics before deployment&lt;/li&gt;
&lt;li&gt;Deployment is triggered&lt;/li&gt;
&lt;li&gt;The developer looks at the metrics in an ad-hoc manner to see what happened with the deployment&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While this technique is very popular, it is far from effective. Depending on the complexity of the application, the time spent watching metrics can be 1-2 hours so that the effects of the deployment have time to become visible.&lt;/p&gt;

&lt;p&gt;It is not uncommon for deployments to be marked as “failed” after 6-24 hours, either because nobody paid attention to the correct metrics or because people simply disregarded warnings and errors, thinking they were not a result of the deployment.&lt;/p&gt;

&lt;p&gt;Several organizations are also forced to deploy only during working hours, because only then are there enough human eyes to look at metrics.&lt;/p&gt;

&lt;p&gt;Metrics should become part of the deployment process. The deployment pipeline should automatically consult metrics after a deployment happens and compare them against a known threshold or their previous state. And then in a fully automated manner, the deployment should either be marked as finished or even rolled back.&lt;/p&gt;
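A rough sketch of such an automated gate; the `deploy`, `rollback`, and `fetch_error_rate` hooks are placeholders for real platform calls (e.g., a Helm upgrade/rollback and a monitoring query), and the threshold is illustrative:

```python
import time

def deploy_with_auto_rollback(deploy, rollback, fetch_error_rate,
                              threshold: float = 0.02,
                              observation_secs: int = 0) -> str:
    """Pipeline step sketch: deploy, observe metrics for a while, then
    either mark the release as finished or roll it back automatically,
    with no human in the loop."""
    deploy()
    time.sleep(observation_secs)  # give the metrics time to settle
    if fetch_error_rate() > threshold:
        rollback()
        return "rolled-back"
    return "finished"

# Simulated run: the new release pushes the error rate above the threshold,
# so the pipeline rolls it back on its own.
log = []
status = deploy_with_auto_rollback(
    deploy=lambda: log.append("deploy v2"),
    rollback=lambda: log.append("rollback to v1"),
    fetch_error_rate=lambda: 0.07,
)
print(status, log)  # rolled-back ['deploy v2', 'rollback to v1']
```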

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3qDOZH9X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sft5m7thoznydoupmjjy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3qDOZH9X--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sft5m7thoznydoupmjjy.png" alt="Automated rollbacks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the holy grail of deployments as it completely removes the human factor from the equation and is a step towards Continuous Deployment (instead of Continuous Delivery). With this approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can perform deployments at any point in time knowing that metrics will be examined with the same attention even if the time is 3 am&lt;/li&gt;
&lt;li&gt;You can catch early regressions with pinpoint accuracy&lt;/li&gt;
&lt;li&gt;Rollbacks (usually a stressful action) are now handled by the deployment platform, giving non-technical people easier access to the deployment process&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is that a developer can deploy at 5 pm on Friday and immediately go home. Either the change will be approved (and it will still be there on Monday) or it will be rolled back automatically without any ill effects (and without any downtime if you also follow best practice 20 for progressive delivery).&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 23 – Staging Matches Production
&lt;/h2&gt;

&lt;p&gt;We explained in best practice 12 that you should employ dynamic environments for testing individual features for developers. This gives you the confidence that each feature is correct on its own before you deploy it in production.&lt;/p&gt;

&lt;p&gt;It is also customary to have a single staging environment (a.k.a. pre-production) that acts as the last gateway before production. This particular environment should be as close to production as possible so that any configuration errors and mismatches can be quickly discovered before pushing the application deployment to the real production environment.&lt;/p&gt;

&lt;p&gt;Unfortunately, most organizations treat the staging environment in a different way than the production one. Having a staging environment that is separate from production is a cumbersome practice as it means that you have to manually maintain it and make sure that it also gets any updates that reach production (not only in application terms but also any configuration changes).&lt;/p&gt;

&lt;p&gt;Two more effective ways of using a staging environment are the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a staging environment on-demand each time you deploy by cloning the production environment&lt;/li&gt;
&lt;li&gt;Use as staging a special part of production (sometimes called shadow production)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first approach is great for small/medium applications and involves cloning the production environment right before a deployment happens in a similar (but possibly smaller) configuration. This means that you can also get a subset of the database and a lower number of replicas/instances that serve traffic. The important point here is that this staging environment only exists during a release. You create it just before a release and destroy it once a release has been marked as “successful”.&lt;/p&gt;

&lt;p&gt;The main benefit of course is that cloning your production right before deployment guarantees that you have the same configuration between staging and production. Also, there is nothing to maintain or keep up-to-date because you always discard the staging environment once the deployment has finished.&lt;/p&gt;

&lt;p&gt;This approach however is not realistic for large applications with many microservices or large external resources (e.g., databases and message queues). In those cases, it is much easier to use a part of production as staging. The important point here is that the segment of production that you use does NOT get any user traffic, so in case of a failed deployment, your users will not be affected. The advantage again is that, since this is part of production, you have the same guarantee that the configuration is the most recent one and that what you are testing will behave in the same way as “real” production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying these Best Practices to Your Organization
&lt;/h2&gt;

&lt;p&gt;We hope that now you have some ideas on how to improve your CI/CD process. Remember however that it is better to take gradual steps and not try to change everything at once.&lt;/p&gt;

&lt;p&gt;Consult the first section of this guide where we talked about priorities. Focus first on the best practices that are marked as “critical” and as soon as you have conquered them move to those with “high” importance.&lt;/p&gt;

&lt;p&gt;We believe that if you adopt the majority of practices that we have described in this guide, your development teams will be able to focus on shipping features instead of dealing with failed deployments and missing configuration issues.&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/photos/jHZ70nRk7Ns"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cloud</category>
      <category>devops</category>
      <category>ci</category>
      <category>cd</category>
    </item>
    <item>
      <title>Enterprise CI/CD Best Practices – Part 2</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Mon, 14 Jun 2021 10:29:54 +0000</pubDate>
      <link>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-2-3o24</link>
      <guid>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-2-3o24</guid>
<description>&lt;p&gt;This is the second part in our “Enterprise CI/CD best practices” series. See also &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-1-2n5m"&gt;part 1&lt;/a&gt; for the previous part and &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-3-fk9"&gt;part 3&lt;/a&gt; for the next part.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 8 – Automate All your Tests
&lt;/h2&gt;

&lt;p&gt;The main goal of unit/integration/functional tests is to increase the confidence in each new release that gets deployed. In theory, a comprehensive amount of tests will guarantee that there are no regressions on each new feature that gets published.&lt;/p&gt;

&lt;p&gt;To achieve this goal, tests should be fully automated and managed by the CI/CD platform. Tests should be run not only before each deployment but also after a pull request is created. The only way to achieve this level of automation is for the test suite to be runnable in a single step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NgLIv1Rt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mw92lx2mk02hv31jscnb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NgLIv1Rt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mw92lx2mk02hv31jscnb.png" alt="Automated tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, several companies are still creating tests the old-fashioned way, where an army of test engineers is tasked with the manual execution of various test suites. This blocks all new releases as the testing velocity essentially becomes the deployment velocity.&lt;/p&gt;

&lt;p&gt;Test engineers should only write new tests. They should never execute tests themselves as this makes the feedback loop of new features vastly longer. Tests are always executed automatically by the CI/CD platform in various workflows and pipelines.&lt;/p&gt;

&lt;p&gt;It is ok if a small number of tests are manually run by people as a way to smoke test a release. But this should only happen for a handful of tests. All other main test suites should be fully automated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 9 – Make Your Tests Fast
&lt;/h2&gt;

&lt;p&gt;A corollary of the previous section is the quick execution of tests. If test suites are to be integrated into delivery pipelines, they should be really fast. Ideally, the test time should not exceed the packaging/compilation time, which means that tests should finish within five minutes, and in no more than 15.&lt;/p&gt;

&lt;p&gt;The quick test execution gives confidence to developers that the feature they just committed has no regressions and can be safely promoted to the next workflow stage. A running time of two hours is disastrous for developers as they cannot possibly wait for that amount of time after committing a feature.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oN0908Vo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqxij7c8jdjd1k7g4qin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oN0908Vo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eqxij7c8jdjd1k7g4qin.png" alt="Fast tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the testing period is that long, developers just move on to their next task and switch mental context. Once the test results do arrive, it is much more difficult to fix issues on a feature that you are not actively working on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gUFv8Pcy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ym8ixeq8cn6srrc06pby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gUFv8Pcy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ym8ixeq8cn6srrc06pby.png" alt="Slow tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, the majority of time spent waiting for tests stems from ineffective test practices and lack of optimizations. The usual culprit behind a slow test is code that “sleeps” or “waits” for an event to happen, making the test run longer than it should. All these sleep statements should be removed, and the test should follow an event-driven approach (i.e., responding to events instead of waiting for things to happen).&lt;/p&gt;

&lt;p&gt;Test data creation is another area where tests spend most of their time. Test data creation code should be centralized and re-used. If a test has a long setup phase, maybe it is testing too many things or needs some mocking of unrelated services.&lt;/p&gt;
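A minimal sketch of a centralized test-data factory; the entity, field names, and defaults are all hypothetical:

```python
import itertools

_ids = itertools.count(1)  # unique ids so parallel tests never collide

def make_user(**overrides):
    """Return a valid user dict with sensible defaults. Tests override
    only the fields they actually care about, so setup stays short and
    the defaults live in exactly one place."""
    user = {
        "id": next(_ids),
        "name": "Test User",
        "email": "user@example.com",
        "active": True,
    }
    user.update(overrides)
    return user

# A test that only cares about inactive users needs one line of setup:
inactive = make_user(active=False)
print(inactive["active"], inactive["name"])  # False Test User
```

When the user schema changes, only the factory is updated instead of dozens of hand-rolled setup blocks scattered across the suite.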

&lt;p&gt;In summary, test suites should be fast (5-10 minutes) and huge tests that need hours should be refactored and redesigned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 10 – Each test auto-cleans its side effects
&lt;/h2&gt;

&lt;p&gt;Generally speaking, you can split your tests into two more categories (apart from unit/integration, or slow and fast), and this has to do with their side effects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tests that have no side effects. They read only information from external sources, never modify anything and can be run as many times as you want (or even in parallel) without any complications.&lt;/li&gt;
&lt;li&gt;Tests that have side effects. These are the tests that write stuff to your database, commit data to external systems, perform write operations on your dependencies, and so on.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first category (read-only tests) is easy to handle since these tests need no special maintenance. But the second category (read/write tests) is more complex to maintain as you need to make sure that you clean up their actions as soon as the tests finish. There are two approaches to this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Let all the tests run and then clean up the actions of all of them at the end of the test suite&lt;/li&gt;
&lt;li&gt;Have each test clean up after itself as soon as it runs (the recommended approach)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--in1bdV7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7eshrts78b8goq8iqv43.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--in1bdV7F--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7eshrts78b8goq8iqv43.png" alt="Cleanup all tests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having each test clean up its side effects is a better approach because it means that you can run all your tests in parallel, or run any individual test as many times as you wish (i.e., run a single test from your suite and then run it again a second or third time).&lt;/p&gt;
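A small sketch of a self-cleaning read/write test, using an in-memory SQLite database as a stand-in for the external system; the table and test names are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")

def test_insert_order():
    """A read/write test that removes its own rows, so it can run
    repeatedly (or alongside other tests) against the same database."""
    cur = db.execute("INSERT INTO orders (item) VALUES ('widget')")
    row_id = cur.lastrowid
    try:
        (count,) = db.execute(
            "SELECT COUNT(*) FROM orders WHERE id=?", (row_id,)
        ).fetchone()
        assert count == 1
    finally:
        # cleanup always runs, even if the assertion above fails
        db.execute("DELETE FROM orders WHERE id=?", (row_id,))

test_insert_order()
test_insert_order()  # safe to run again: the table is empty between runs
print(db.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0
```

Test frameworks formalize this pattern with setup/teardown hooks or fixtures, but the principle is the same: each test leaves the world exactly as it found it.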

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ESEcUsQQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lg309ockarbc46vo5j75.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ESEcUsQQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lg309ockarbc46vo5j75.png" alt="Cleanup every test"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Being able to execute tests in parallel is a prerequisite for using dynamic test environments as we will see later in this guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 11 – Use Multiple Test Suites
&lt;/h2&gt;

&lt;p&gt;Testing is not something that happens only in a single step inside a CI/CD pipeline. Testing is a continuous process that touches all phases of a pipeline.&lt;/p&gt;

&lt;p&gt;This means that multiple test types should exist in any well-designed application. Some of the most common examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Really quick unit tests that look at major regressions and finish very fast&lt;/li&gt;
&lt;li&gt;Longer integrations tests that look for more complex scenarios (such as transactions or security)&lt;/li&gt;
&lt;li&gt;Stress and load testing&lt;/li&gt;
&lt;li&gt;Contract testing for API changes of external services used&lt;/li&gt;
&lt;li&gt;Smoke tests that can be run on production to verify a release&lt;/li&gt;
&lt;li&gt;UI tests that test the user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is just a sample of different test types. Each company might have several more categories. The idea behind these categories is that developers and operators can pick and choose different testing types for the specific pipeline they create.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uxsRbmhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgdvpyrfw6qy5uxmpv63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uxsRbmhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mgdvpyrfw6qy5uxmpv63.png" alt="Many test suites"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As an example, a pipeline for pull requests might not include stress and load testing phases because they are only needed before a production release. Creating a pull request will only run the fast unit tests and maybe the contract testing suite.&lt;/p&gt;

&lt;p&gt;Then after the Pull Request is approved, the rest of the tests (such as smoke tests in production) will run to verify the expected behavior.&lt;/p&gt;

&lt;p&gt;Some test suites might be so slow that running them on demand for every Pull Request is impractical. Running stress and load tests is usually something that happens right before a release (perhaps grouping multiple pull requests) or on a schedule (a.k.a. nightly builds).&lt;br&gt;
The exact workflow is not important, as each organization has different processes. What is important is the capability to isolate each testing suite and to select one or more of them for each phase of the software lifecycle.&lt;/p&gt;
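&lt;p&gt;To make this concrete, here is a minimal sketch (in Python, with hypothetical suite and phase names) of how tests can be tagged by suite and only the relevant subset executed for a given pipeline phase. Dedicated tools such as pytest markers provide the same capability out of the box.&lt;/p&gt;

```python
# Minimal sketch: tag each test with a suite name and let each pipeline
# phase pick only the suites it needs. Suite/phase names are illustrative.
REGISTRY = []

def suite(name):
    """Decorator that tags a test function with a suite name."""
    def wrap(fn):
        REGISTRY.append((name, fn))
        return fn
    return wrap

@suite("unit")
def test_totals_add_up():
    assert 1 + 1 == 2

@suite("load")
def test_sustained_traffic():
    pass  # placeholder for a slow stress scenario

# Each phase of the software lifecycle enables a different set of suites.
PHASES = {
    "pull_request": {"unit", "contract"},
    "pre_release": {"unit", "integration", "load"},
    "production": {"smoke"},
}

def run_phase(phase):
    """Run every registered test whose suite is enabled for this phase."""
    selected = [fn for name, fn in REGISTRY if name in PHASES[phase]]
    for fn in selected:
        fn()
    return [fn.__name__ for fn in selected]
```

&lt;p&gt;With this structure, a Pull Request build runs only the fast suites, while a pre-release build adds the slow ones, without any test knowing which pipeline invoked it.&lt;/p&gt;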

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UkTKFK52--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/40zsc44g11dmo9qrjp3r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UkTKFK52--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/40zsc44g11dmo9qrjp3r.png" alt="Single test suite"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having a single test suite for everything is cumbersome and will force developers to skip tests locally. Ideally, as a developer, I should be able to select any possible number of test suites to run against my feature branch allowing me to be flexible on how I test my feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 12 – Create Test Environments On-demand
&lt;/h2&gt;

&lt;p&gt;The traditional way of testing an application right before it goes into production is with a staging environment. Having only one staging environment is a big disadvantage, because it means that developers must either test all their features at once or enter a queue and “book” the staging environment for their feature alone.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VHYSbEpE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpno4jdoikmqslernpr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VHYSbEpE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fpno4jdoikmqslernpr8.png" alt="Static environments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This forces a lot of organizations to create a fleet of test environments (e.g., QA1, QA2, QA3) so that multiple developers can test their features in parallel. This technique is still not ideal because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A maximum of N developers can test their features in parallel (the same as the number of environments)&lt;/li&gt;
&lt;li&gt;Testing environments consume resources all the time (even when nobody is using them)&lt;/li&gt;
&lt;li&gt;The static character of the environments means that they have to be cleaned up and updated as well. This adds extra maintenance effort for the team responsible for test environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a cloud-based architecture, it is now much easier to create test environments on demand. Instead of having a predefined number of static environments, you should modify your pipeline workflow so that each time a developer creates a Pull Request, a dedicated test environment is also created with the contents of that particular Pull Request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bpzb6Irs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fftw3oistzesn3hz00v5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bpzb6Irs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fftw3oistzesn3hz00v5.png" alt="Preview environments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The advantages of dynamic test environments cannot be overstated:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each developer can test in isolation without any conflicts with what other developers are doing&lt;/li&gt;
&lt;li&gt;You pay for the resources of test environments only while you use them&lt;/li&gt;
&lt;li&gt;Since the test environments are discarded at the end there is nothing to maintain or clean up&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Dynamic test environments shine for teams that have an irregular development schedule (e.g., too many features in flight at the end of a sprint).&lt;/p&gt;
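&lt;p&gt;As an illustration, the lifecycle of such an environment can be sketched as follows. The namespace convention and the kubectl/helm invocations are assumptions for this sketch, not a prescribed tool chain; a pipeline step would execute the creation commands when the Pull Request opens and the teardown command when it merges or closes.&lt;/p&gt;

```python
# Illustrative sketch of a per-Pull-Request preview environment on
# Kubernetes. The naming scheme and chart path are hypothetical.
def preview_namespace(pr_number):
    """Derive a unique, disposable namespace name from the PR number."""
    return f"preview-pr-{pr_number}"

def create_env_commands(pr_number, chart="./chart"):
    """Commands a pipeline step would run when the PR opens."""
    ns = preview_namespace(pr_number)
    return [
        ["kubectl", "create", "namespace", ns],
        ["helm", "upgrade", "--install", ns, chart, "--namespace", ns],
    ]

def destroy_env_commands(pr_number):
    """Command a pipeline step would run when the PR is merged or closed."""
    return [["kubectl", "delete", "namespace", preview_namespace(pr_number)]]
```

&lt;p&gt;Because the namespace name is derived from the Pull Request number, any number of developers can have isolated environments in parallel, and deleting the namespace discards everything with no manual cleanup.&lt;/p&gt;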

&lt;h2&gt;
  
  
  Best Practice 13 – Run Test Suites Concurrently
&lt;/h2&gt;

&lt;p&gt;This is a corollary of the previous best practice. If your development process has dynamic test environments, it means that different test suites can run at any point in time, against any number of those environments, even simultaneously.&lt;/p&gt;

&lt;p&gt;If your tests have special dependencies (e.g., they must be launched in a specific order, or they expect specific data before they can function), then having a dynamic number of test environments will put even more strain on the pre-run and post-run steps that your tests require.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlfOoKZp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lbxp447tifekr503odaz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZlfOoKZp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lbxp447tifekr503odaz.png" alt="Test clashes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution is to embrace best practice 10 and have each test prepare its own state and clean up after itself. Tests that are read-only (i.e., don’t have any side effects) can run in parallel by definition.&lt;/p&gt;

&lt;p&gt;Tests that write/read information need to be self-sufficient. For example, if a test writes an entity in a database and then reads it back, you should not use a hardcoded primary key because that would mean that if two test suites with this test run at the same time, the second one will fail because of database constraints.&lt;/p&gt;
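&lt;p&gt;The following sketch (using an in-memory SQLite database for illustration) shows the difference: two suites inserting the same hardcoded key collide on the primary key constraint, while suites that generate unique keys can run at the same time.&lt;/p&gt;

```python
import sqlite3
import uuid

# Demonstrates why hardcoded primary keys break concurrent test runs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT)")

def insert_customer_hardcoded(name):
    """Anti-pattern: every test run fights over the same primary key."""
    conn.execute("INSERT INTO customers VALUES ('customer-1', ?)", (name,))

def insert_customer_unique(name):
    """Each run generates its own key, so parallel suites never collide."""
    key = f"customer-{uuid.uuid4()}"
    conn.execute("INSERT INTO customers VALUES (?, ?)", (key, name))
    return key

# Two "parallel" suites with unique keys: both succeed.
insert_customer_unique("suite-a")
insert_customer_unique("suite-b")

# Two suites with the hardcoded key: the second one violates the constraint.
insert_customer_hardcoded("suite-a")
try:
    insert_customer_hardcoded("suite-b")
    clash = False
except sqlite3.IntegrityError:
    clash = True  # exactly the failure you would see in a shared database
```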

&lt;p&gt;While most developers think of test parallelism only as a way to speed up test execution, in practice it is also a way to have correct tests without any uncontrolled side effects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 14 – Security Scanning is part of the process
&lt;/h2&gt;

&lt;p&gt;A lot of organizations still follow the traditional waterfall model of software development. And in most cases, the security analysis comes at the end. The software is produced and then a security scan (or even penetration test) is performed on the source code. The results are published and developers scramble to fix all the issues.&lt;/p&gt;

&lt;p&gt;Putting security scanning at the end of a release is a lost cause. Some major architectural decisions affect how vulnerabilities are detected, and knowing them in advance is a must not only for developers but also for all project stakeholders.&lt;/p&gt;

&lt;p&gt;Security is an ongoing process. An application should be checked for vulnerabilities at the same time as it is developed. This means that security scanning should be part of the pre-merge process (i.e., as one of the checks of a Pull Request). Solving security issues in a finished software package is much harder than solving them while it is still in development.&lt;/p&gt;

&lt;p&gt;Security scans should also have the appropriate depth. You need to check at the very least:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Your application source code&lt;/li&gt;
&lt;li&gt;The container or underlying runtime that the application runs on&lt;/li&gt;
&lt;li&gt;The computing node and the Operating System that will host the application&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A lot of companies focus only on two (or even one) of these areas and forget that security works exactly like a chain: the weakest link determines the overall security.&lt;/p&gt;

&lt;p&gt;If you also want to be proactive with security, it is best to enforce it at the Pull Request level. Instead of simply scanning your source code and then reporting its vulnerabilities, it is better to prevent merges from happening in the first place if a certain security threshold is not met.&lt;/p&gt;
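&lt;p&gt;A minimal sketch of such a gate follows. The severity names and the blocking policy are assumptions for illustration, since real scanners report much richer data, but the principle is the same: the pipeline step fails, and the merge is blocked, when the gate does not pass.&lt;/p&gt;

```python
# Minimal sketch of a Pull Request security gate with an agreed
# severity threshold. Severity names and policy are illustrative.
def gate_passes(findings, blocked=("high", "critical")):
    """Return True only when no finding has a blocked severity."""
    return not any(f["severity"] in blocked for f in findings)

# Hypothetical scan output: one acceptable finding, one blocking one.
scan_report = [
    {"id": "CVE-0000-0001", "severity": "low"},
    {"id": "CVE-0000-0002", "severity": "critical"},
]

if not gate_passes(scan_report):
    print("Security gate failed: blocking the merge")
```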

&lt;h2&gt;
  
  
  Best Practice 15 – Quality Scanning/Code reviews are part of the process
&lt;/h2&gt;

&lt;p&gt;Similar to security scans, code scans should be part of the day-to-day developer operations. This includes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Static analysis of code for company-approved style/formatting&lt;/li&gt;
&lt;li&gt;Static analysis of code for security problems and hidden bugs&lt;/li&gt;
&lt;li&gt;Runtime analysis of code for errors and other issues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While there are existing tools that handle the analysis part, not all organizations execute those tools in an automated way. A very common pattern we see is enthusiastic software teams vowing to use these tools (e.g., Sonarqube) for the next software project, only to forget about them after some time or to completely ignore the warnings and errors presented in the analysis reports.&lt;/p&gt;

&lt;p&gt;In the same manner as security scans, code quality scanning should be part of the Pull Request process. Instead of simply reporting the final results to developers, you should enforce good quality practices by preventing merges if a certain number of warnings is present.&lt;/p&gt;

&lt;p&gt;Continued in &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-3-fk9"&gt;part 3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/photos/jHZ70nRk7Ns"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>ci</category>
      <category>cd</category>
    </item>
    <item>
      <title>Enterprise CI/CD Best Practices – Part 1</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Mon, 14 Jun 2021 10:29:22 +0000</pubDate>
      <link>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-1-2n5m</link>
      <guid>https://forem.com/codefreshio/enterprise-ci-cd-best-practices-part-1-2n5m</guid>
      <description>&lt;p&gt;If you are trying to learn your way around Continuous Integration/Delivery/Deployment, you might notice that there are mostly two categories of resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;High-level overviews of &lt;a href="https://codefresh.io/continuous-deployment/heck-continuous-integration-ci-delivery-cd-deployment-cdp/" rel="noopener noreferrer"&gt;what CI/CD is and why you need it&lt;/a&gt;. These are great for when you are getting started but do not cover anything about day two operations or how to optimize an existing process.&lt;/li&gt;
&lt;li&gt;Detailed tutorials that cover only a specific aspect of CI/CD (e.g., just unit testing or just deployment) using specific programming languages and tools.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We believe that there is a gap between those two extremes. We are missing a proper guide that sits between those two categories by talking about best practices, but not in an abstract way. If you always wanted to read a guide about CI/CD that explains not just the “why” but also the “how” of applying best practices, then this guide is for you.&lt;/p&gt;

&lt;p&gt;We will describe all the basic foundations of effective CI/CD workflows, but instead of talking only in generic terms, we will explain all the technicalities behind each best practice and more importantly, how it can affect you if you don’t adopt it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Priorities
&lt;/h2&gt;

&lt;p&gt;Several companies try to jump on the DevOps bandwagon without having mastered the basics first. You will soon realize that several problems which appear during the CI/CD process are usually pre-existing process problems that only became visible when that company tried to follow best practices in CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;The table below summarizes the requirements discussed in the rest of the guide. We also split the requirements according to priority:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical&lt;/strong&gt; requirements are essential to have before adopting DevOps or picking a solution for CI/CD. You should address them first; if you don’t, they will block the process later down the road.&lt;br&gt;
Requirements with &lt;strong&gt;High&lt;/strong&gt; priority are still important to address, but you can fix them while you are adopting a CI/CD platform.&lt;br&gt;
Requirements with &lt;strong&gt;Medium&lt;/strong&gt; priority can be addressed in the long run. Even though they will improve your deployment process, you can work around them until you find a proper solution.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Number&lt;/th&gt;
&lt;th&gt;Best practice&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Importance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;All project assets are in source control&lt;/td&gt;
&lt;td&gt;Artifacts&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;A single artifact is produced for all environments&lt;/td&gt;
&lt;td&gt;Artifacts&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Artifacts move within pipelines (and not source revisions)&lt;/td&gt;
&lt;td&gt;Artifacts&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Development happens with short-lived branches (one per feature)&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Builds can be performed in a single step&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Builds are fast (less than 5 minutes)&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Store your dependencies&lt;/td&gt;
&lt;td&gt;Build&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Tests are automated&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Tests are fast&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Tests auto clean their side effects&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Multiple test suites exist&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Test environments on demand&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Running test suites concurrently&lt;/td&gt;
&lt;td&gt;Testing&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Security scanning is part of the process&lt;/td&gt;
&lt;td&gt;Quality and Audit&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Quality scanning/Code reviews are part of the process&lt;/td&gt;
&lt;td&gt;Quality and Audit&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;Database updates have their lifecycle&lt;/td&gt;
&lt;td&gt;Databases&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;Database updates are automated&lt;/td&gt;
&lt;td&gt;Databases&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;Database updates are forward and backward compatible&lt;/td&gt;
&lt;td&gt;Databases&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;Deployments happen via a single path (CI/CD server)&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Deployments happen gradually in stages&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;Metrics and logs can detect a bad deployment&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Automatic rollbacks are in place&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;Staging matches production&lt;/td&gt;
&lt;td&gt;Deployments&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Best Practice 1 – Place Everything Under Source Control
&lt;/h2&gt;

&lt;p&gt;Artifact management is perhaps the most important characteristic of a pipeline. At its most basic level, a pipeline creates binary/package artifacts from source code and deploys them to the appropriate infrastructure that powers the application that is being deployed.&lt;/p&gt;

&lt;p&gt;The single most important rule to follow regarding assets and source code is the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ihhlrw3qi8764gz739c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ihhlrw3qi8764gz739c.png" alt="Everything in Git"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All files that constitute an application should be managed using source control.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, even though this rule seems pretty basic, there are a lot of organizations out there that fail to follow it. Traditionally, developers use version control systems only for the source code of an application but leave out other supporting files such as installation scripts, configuration values, or test data.&lt;/p&gt;

&lt;p&gt;Everything that takes part in the application lifecycle should be checked into source control. This includes but is not limited to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Source code&lt;/li&gt;
&lt;li&gt;Build scripts&lt;/li&gt;
&lt;li&gt;Pipeline definition&lt;/li&gt;
&lt;li&gt;Configuration values&lt;/li&gt;
&lt;li&gt;Tests and test data&lt;/li&gt;
&lt;li&gt;Database schemas&lt;/li&gt;
&lt;li&gt;Database update scripts&lt;/li&gt;
&lt;li&gt;Infrastructure definition scripts&lt;/li&gt;
&lt;li&gt;Cleanup/installation/purging scripts&lt;/li&gt;
&lt;li&gt;Associated documentation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The end goal is that anybody can check out everything that relates to an application and can recreate it locally or in any other alternative environment.&lt;/p&gt;

&lt;p&gt;A common anti-pattern we see is deployments happening with a special script that is available only on a specific machine or on the workstation of a specific team member, or even an attachment in a wiki page, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faie8bbgehy26fvp4hyes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faie8bbgehy26fvp4hyes.png" alt="Resources all over the place"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Version control also means that all these resources are audited and have a detailed history of all changes. If you want to see how the application looked 6 months ago, you can easily use the facilities of your version control system to obtain that information.&lt;/p&gt;

&lt;p&gt;Note that even though all these resources should be version controlled, they don’t have to live in the same repository. Whether you use multiple repositories or a single one is a decision that needs careful consideration and has no definitive answer. The important part, however, is to make sure that everything is indeed version controlled.&lt;/p&gt;

&lt;p&gt;Even though GitOps is the emerging practice of using Git operations for promotions and deployments, you don’t need to follow GitOps specifically to follow this best practice. Having historical and auditing information for your project assets is always a good thing, regardless of the actual software paradigm that you follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 2 – Create a Single package/binary/container for All Environments
&lt;/h2&gt;

&lt;p&gt;One of the main functionalities of a CI/CD pipeline is to verify that a new feature is fit for deployment to production. This happens gradually as every step in a pipeline is essentially performing additional checks for that feature.&lt;/p&gt;

&lt;p&gt;For this paradigm to work, however, you need to make sure that what is tested and promoted within a pipeline is also the same thing that gets deployed. In practice, this means that a feature/release should be packaged once and be deployed to all successive environments in the same manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0dui53e5ck24zon31ig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0dui53e5ck24zon31ig.png" alt="Same artifact"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, a lot of organizations fall into the common trap of creating different artifacts for dev/staging/prod environments because they haven’t yet mastered a common infrastructure for configuration. This implies that they deploy a slightly different version of what was tested during the pipeline. Configuration discrepancies and last-minute changes are some of the biggest culprits when it comes to failed deployments, and having a different package per environment exacerbates this problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbypvdawbnndosrkntx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbypvdawbnndosrkntx5.png" alt="Different artifacts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of creating multiple versions per environment, the accepted practice is to have a single artifact that only changes configuration between different environments. With the appearance of containers and the ability to create a self-sufficient package of an application in the form of Docker images, there is no excuse for not following this practice.&lt;/p&gt;

&lt;p&gt;Regarding configuration there are two approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The binary artifact/container has all configurations embedded inside it and switches the active one according to the running environment (easy to start with, but not very flexible; we don’t recommend this approach)&lt;/li&gt;
&lt;li&gt;The container has no configuration at all. It fetches needed configuration during runtime on demand using a discovery mechanism such as a key/value database, a filesystem volume, a service discovery mechanism, etc. (the recommended approach)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a guarantee that the exact binary/package deployed in production is also the one that was tested in the pipeline.&lt;/p&gt;
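&lt;p&gt;A minimal sketch of the second approach, using environment variables as a stand-in for whatever discovery mechanism you choose (the setting names and defaults here are hypothetical):&lt;/p&gt;

```python
import os

# The artifact ships with no embedded configuration and resolves its
# settings at runtime from the environment it runs in. A key/value
# store, mounted volume, or service discovery would work the same way.
DEFAULTS = {"DATABASE_URL": "sqlite:///local.db", "FEATURE_FLAGS": ""}

def load_setting(name):
    """Resolve a setting from the running environment, else a safe default."""
    return os.environ.get(name, DEFAULTS[name])
```

&lt;p&gt;Because the artifact never embeds environment-specific values, the image promoted from staging to production is byte-for-byte identical; only the configuration it reads at startup differs.&lt;/p&gt;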

&lt;h2&gt;
  
  
  Best Practice 3 – Artifacts, not Git Commits, should travel within a Pipeline
&lt;/h2&gt;

&lt;p&gt;A corollary to the previous point (the same artifact/package should be deployed in all environments) is the fact that a deployment artifact &lt;strong&gt;should be built only once&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The whole concept around containers (and VM images in the past) is to have &lt;strong&gt;immutable&lt;/strong&gt; artifacts. An application is built only once with the latest feature or features that will soon be released.&lt;/p&gt;

&lt;p&gt;Once that artifact is built, it should move from each pipeline step to the next as an unchanged entity. Containers are the perfect vehicle for this immutability as they allow you to create an image only once (at the beginning of the pipeline) and promote it towards production with each successive pipeline step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0vqsbhf3vw58cr6qkd8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0vqsbhf3vw58cr6qkd8.png" alt="Promote artifact"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, the common anti-pattern seen here is companies promoting commits instead of container images. A source code commit travels through the pipeline stages, and each step rebuilds the artifact by checking out the source code again and again.&lt;/p&gt;

&lt;p&gt;This is a bad practice for two main reasons. First of all, it makes the pipeline very slow as packaging and compiling software is a very lengthy process and repeating it at each step is a waste of time and resources.&lt;/p&gt;

&lt;p&gt;Secondly, it breaks the previous rule. Recompiling a code commit at every pipeline step leaves the window open for producing a different artifact each time. You lose the guarantee that what is deployed in production is the same thing that was tested in the pipeline.&lt;/p&gt;
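&lt;p&gt;As a sketch, promoting an immutable container image can be as simple as re-tagging the existing image for the next environment instead of rebuilding it. The &lt;code&gt;docker tag&lt;/code&gt; and &lt;code&gt;docker push&lt;/code&gt; commands are real; the registry and naming scheme here are illustrative.&lt;/p&gt;

```python
# Build the commands that promote one immutable image to the next
# environment by re-tagging it, never rebuilding from source.
def promotion_commands(image, digest, from_env, to_env):
    """Return the docker commands that move an existing image forward."""
    src = f"{image}:{from_env}-{digest}"
    dst = f"{image}:{to_env}-{digest}"
    return [
        ["docker", "tag", src, dst],
        ["docker", "push", dst],
    ]
```

&lt;p&gt;Each pipeline step only adds a new tag to the same image, so the bytes that reach production are exactly the bytes that passed every earlier quality gate.&lt;/p&gt;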

&lt;h2&gt;
  
  
  Best Practice 4 – Use short-lived Branches for each feature
&lt;/h2&gt;

&lt;p&gt;A sound pipeline has several quality gates (such as unit tests or security scans) that test the quality of a feature and its applicability to production deployments. In a development environment with a high velocity (and a big development team), not all features are expected to reach production right away. Some features may even clash with each other at their initial deployment version.&lt;/p&gt;

&lt;p&gt;To allow for fine-grained quality gating between features, a pipeline should have the power to veto individual features and be able to select only a subset of them for production deployment. The easiest way to obtain this guarantee is to follow the feature-per-branch methodology, where short-lived features (i.e., ones that fit within a single development sprint) correspond to individual source control branches.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8tv1zpa814vaqtv3fwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8tv1zpa814vaqtv3fwg.png" alt="Short lived branches"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes the pipeline design very simple as everything revolves around individual features. Running test suites against a code branch tests only the new feature. Security scanning of a branch reveals problems with a new feature.&lt;/p&gt;

&lt;p&gt;Project stakeholders are then able to deploy and rollback individual features or block complete branches from even being merged into the mainline code.&lt;/p&gt;

&lt;p&gt;Unfortunately, there are still companies that have long-lived feature branches that collect multiple and unrelated features in a single batch. This not only makes merging a pain but also becomes problematic in case a single feature is found to have issues (as it is difficult to revert it individually).&lt;/p&gt;

&lt;p&gt;The evolution of short-lived branches is to follow &lt;a href="https://trunkbaseddevelopment.com/" rel="noopener noreferrer"&gt;trunk-based development&lt;/a&gt; and feature toggles. This can be your endgame but only if you have mastered short-lived branches first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 5 – A basic build should take a single step
&lt;/h2&gt;

&lt;p&gt;CI/CD pipelines are all about automation. It is very easy to automate something that was already easy to run in the first place.&lt;/p&gt;

&lt;p&gt;Ideally, a simple build of a project should be a single command. That command usually calls the build system or a script (e.g., bash, PowerShell) that is responsible for taking the source code, running some basic tests, and packaging the final artifact/container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focwptaneg1951yjlemam.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focwptaneg1951yjlemam.png" alt="Simple build"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is OK if more advanced checks (such as load testing) need additional steps. The basic build, however (the one that results in a deployable artifact), should only involve a single command. A new developer should be able to check out a brand new copy of the source code, execute this single command, and immediately get a deployable artifact.&lt;/p&gt;

&lt;p&gt;The same approach holds for deployments (a deployment should also happen with a single command).&lt;br&gt;
Then, whenever you need to create a pipeline, you can simply insert that single step at any point in it.&lt;/p&gt;

&lt;p&gt;Unfortunately, there are still companies that suffer from many manual steps to get a basic build running. Downloading extra files, changing properties, and in general having big checklists that need to be followed are steps that should be automated within that very same script.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqha5t1au71oozi8y0q9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqha5t1au71oozi8y0q9.png" alt="Many build steps"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If a new hire in your development team needs more than 15 minutes for the basic build (after checking out the code in their workstation) then you almost certainly suffer from this problem.&lt;/p&gt;

&lt;p&gt;A well-built CI/CD pipeline just repeats what is already possible on the local workstation. The basic build and deploy process should already be well-oiled before being moved into a CI/CD platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 6 – Basic Builds are Fast (5 – 10 minutes)
&lt;/h2&gt;

&lt;p&gt;Having a fast build is a big advantage for both developers and operators/sysadmins.&lt;/p&gt;

&lt;p&gt;Developers are happy when the feedback loop between a commit and its side effects is as short as possible. It is very easy to fix a bug in code that you just committed while it is still fresh in your mind. Having to wait an hour before developers can detect failed builds is a very frustrating experience.&lt;/p&gt;

&lt;p&gt;Builds should be fast both on the CI platform and on the local workstation. At any given point in time, multiple features are trying to enter the code mainline. The CI server can easily be overwhelmed if building them takes a long time.&lt;/p&gt;

&lt;p&gt;Operators also gain huge benefits from fast builds. Pushing hotfixes to production or rolling back to previous releases is always a stressful experience. The shorter this experience is, the better. Rollbacks that take 30 minutes are much more difficult to work with than those that take three minutes.&lt;/p&gt;

&lt;p&gt;In summary, a basic build should be really fast, ideally less than five minutes. If it takes more than 10 minutes, your team should investigate the causes and shorten that time. Modern build systems have great caching mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Library dependencies should be fetched from an internal proxy repository instead of the internet&lt;/li&gt;
&lt;li&gt;Avoid code generators unless they are strictly needed&lt;/li&gt;
&lt;li&gt;Split your unit tests (fast) from your integration tests (slow) and run only the unit tests for the basic build&lt;/li&gt;
&lt;li&gt;Fine-tune your container images to take full advantage of the Docker layer caching&lt;/li&gt;
&lt;/ul&gt;
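&lt;p&gt;As an example of such caching, the sketch below shows a hypothetical GitHub Actions fragment that reuses a local Maven repository between builds; the paths and cache key are assumptions, so adapt them to your own build tool:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: reuse downloaded library dependencies between builds
# so the basic build does not fetch them from the internet every time.
- uses: actions/cache@v4
  with:
    path: ~/.m2/repository                    # local Maven cache
    key: maven-${{ hashFiles('**/pom.xml') }} # changes when deps change
    restore-keys: |
      maven-
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;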

&lt;p&gt;Faster builds are also one of the benefits you should explore if you are moving to microservices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practice 7 – Store/Cache Your Dependencies
&lt;/h2&gt;

&lt;p&gt;It’s all over the news. The &lt;a href="https://www.theregister.com/2016/03/23/npm_left_pad_chaos/" rel="noopener noreferrer"&gt;left-pad incident&lt;/a&gt;. The &lt;a href="https://medium.com/@alex.birsan/dependency-confusion-4a5d60fec610" rel="noopener noreferrer"&gt;dependency confusion hack&lt;/a&gt;. While both incidents have great security implications, the truth is that storing your dependencies is also a very important tenet that is fundamental to the stability of your builds.&lt;/p&gt;

&lt;p&gt;Every sizable piece of code uses external dependencies in the form of libraries or associated tools. Your code should of course always be stored in Git. But all external libraries should also be stored by you in some sort of artifact repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegyaauig4d4wgkbqedi9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegyaauig4d4wgkbqedi9.png" alt="Store your own dependencies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Spend some time to collect your dependencies and understand where they are coming from. Apart from code libraries, other not-so-obvious moving parts needed by a complete build are your base Docker images and any command-line utilities that your builds require.&lt;/p&gt;

&lt;p&gt;The best way to test your build for stability is to completely cut off internet access in your build servers (essentially simulating an air-gapped environment). Try to kick off a pipeline build where all your internal services (git, databases, artifact storage, container registry) are available, but nothing else from the public internet is accessible, and see what happens.&lt;/p&gt;
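&lt;p&gt;If your builds run on Kubernetes, one way to approximate this air-gapped test is a NetworkPolicy that denies all egress from the build pods except to your internal network. This is only a sketch; the namespace, pod labels, and CIDR range are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: block internet egress for build pods so that any build
# step that silently reaches out to the public internet fails fast.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: builds-no-internet
  namespace: ci                    # hypothetical build namespace
spec:
  podSelector:
    matchLabels:
      role: build-agent            # hypothetical label on build pods
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8       # example internal range only
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;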

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hvrkyvp4nsfvpzgjrum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hvrkyvp4nsfvpzgjrum.png" alt="Straight from the internet"&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If your build complains about a missing dependency, the same failure will happen during a real incident if that particular external resource is also down.&lt;/p&gt;

&lt;p&gt;Continued in &lt;a href="https://dev.to/codefreshio/enterprise-ci-cd-best-practices-part-2-3o24"&gt;part 2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Cover photo by &lt;a href="https://unsplash.com/photos/jHZ70nRk7Ns" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloudnative</category>
      <category>ci</category>
      <category>cd</category>
    </item>
    <item>
      <title>Argo Rollouts, the Kubernetes Progressive Delivery Controller, Reaches 1.0 Milestone</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Wed, 26 May 2021 14:43:13 +0000</pubDate>
      <link>https://forem.com/codefreshio/argo-rollouts-the-kubernetes-progressive-delivery-controller-reaches-1-0-milestone-2e97</link>
      <guid>https://forem.com/codefreshio/argo-rollouts-the-kubernetes-progressive-delivery-controller-reaches-1-0-milestone-2e97</guid>
      <description>&lt;p&gt;&lt;a href="https://argoproj.github.io/argo-rollouts/"&gt;Argo Rollouts&lt;/a&gt;, part of the &lt;a href="https://argoproj.github.io/"&gt;Argo project&lt;/a&gt;, recently released their 1.0 version. You can see the changelog and more details &lt;a href="https://github.com/argoproj/argo-rollouts/releases/tag/v1.0.0"&gt;on the Github release page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are not familiar with Argo Rollouts, it is &lt;a href="https://kubernetes.io/docs/concepts/architecture/controller/"&gt;a Kubernetes Controller&lt;/a&gt; that deploys applications on your cluster. It replaces the default rolling-update strategy of Kubernetes with more &lt;a href="https://argoproj.github.io/argo-rollouts/concepts/"&gt;advanced deployment methods&lt;/a&gt; such as blue/green and canary deployments.&lt;/p&gt;

&lt;p&gt;In addition, it supports integration with several metrics providers to automatically promote/roll back your deployment according to live metrics.&lt;/p&gt;

&lt;p&gt;We have already covered some example scenarios in previous blog posts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://codefresh.io/continuous-deployment/minimize-failed-deployments-argo-rollouts-smoke-tests/"&gt;Blue/Green deployments with Smoke tests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codefresh.io/continuous-deployment/recover-automatically-from-failed-deployments/"&gt;Canary deployments with Prometheus&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We also have &lt;a href="https://codefresh.io/docs/docs/ci-cd-guides/progressive-delivery/"&gt;a dedicated documentation page&lt;/a&gt; with more scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  New graphical user interface
&lt;/h2&gt;

&lt;p&gt;The most user-visible feature in the 1.0 release is the introduction of a &lt;a href="https://argoproj.github.io/argo-rollouts/dashboard/"&gt;dedicated graphical user interface&lt;/a&gt;. Previously, you could monitor the status of a rollout from the command line or see its health status in the ArgoCD dashboard.&lt;/p&gt;

&lt;p&gt;In this release, Argo Rollouts includes its own user interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QXcl_KRO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2bfe6va3vql1cfaa0hs9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QXcl_KRO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2bfe6va3vql1cfaa0hs9.png" alt="Argo Rollouts UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GUI includes all the information you need about a rollout:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current deployment strategy&lt;/li&gt;
&lt;li&gt;Total number of steps and current deployment step&lt;/li&gt;
&lt;li&gt;Current status of the rollout (paused, progressing, degraded, etc.)&lt;/li&gt;
&lt;li&gt;Number of replicas for the previous and current deployment&lt;/li&gt;
&lt;li&gt;Container images for the previous and current deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that unlike ArgoCD, where the interface is part of the controller, in Argo Rollouts the interface is launched from the CLI and runs on your machine.&lt;/p&gt;

&lt;p&gt;You can still use the CLI to monitor deployments as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u_eQYn8g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0o2hc4erkut2xyqn6bm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u_eQYn8g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0o2hc4erkut2xyqn6bm8.png" alt="Argo Rollouts CLI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similar information is offered in both cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use existing Kubernetes Deployment objects
&lt;/h2&gt;

&lt;p&gt;Argo Rollouts works by monitoring changes in a Kubernetes custom resource aptly named Rollout.&lt;/p&gt;

&lt;p&gt;You can see the full details on the &lt;a href="https://argoproj.github.io/argo-rollouts/features/specification/"&gt;Rollout Specification page&lt;/a&gt;. The Rollout resource is compatible with &lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/"&gt;the standard Deployment resource&lt;/a&gt; but includes extra fields that define the progressive delivery options.&lt;/p&gt;

&lt;p&gt;This meant that, until recently, if you wanted to use Argo Rollouts you had to convert your existing Deployment objects to Rollouts. The process was not very difficult, but it made it challenging for Argo Rollouts to cooperate with other Kubernetes tools that only understand Deployments.&lt;/p&gt;

&lt;p&gt;With the 1.0 release, Argo Rollouts also supports an alternative format. You can now keep all Rollout-specific information in the custom resource and simply reference an existing Deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;               &lt;span class="c1"&gt;# Create a rollout resource&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-ref-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;workloadRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                                 &lt;span class="c1"&gt;# Reference an existing Deployment using workloadRef field&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-ref-deployment&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10s&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-canary&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-ref-deployment&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;                                  &lt;span class="c1"&gt;# Scale down existing deployment&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-ref-deployment&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollout-ref-deployment&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rollouts-demo&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj/rollouts-demo:blue&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Always&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means that you can still use all your favorite tools with Argo Rollouts, even those that do not understand CRDs.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://argoproj.github.io/argo-rollouts/migrating/#reference-deployment-from-rollout"&gt;documentation page&lt;/a&gt; for more details.&lt;/p&gt;

&lt;h2&gt;
  
  
  Official container image with the CLI
&lt;/h2&gt;

&lt;p&gt;Codefresh pipelines are based on containers. Each step in a Codefresh pipeline is a Docker image that is either created by you or pulled from an existing registry (private or public).&lt;/p&gt;

&lt;p&gt;Previously, we had to maintain our own container image for Argo Rollouts in order to integrate with container-based pipelines.&lt;/p&gt;

&lt;p&gt;This is no longer necessary because Argo Rollouts released &lt;a href="https://argoproj.github.io/argo-rollouts/installation/#using-the-cli-with-docker"&gt;an official container image&lt;/a&gt; with the CLI.&lt;/p&gt;

&lt;p&gt;You can find all image releases at &lt;a href="https://quay.io/repository/argoproj/kubectl-argo-rollouts"&gt;https://quay.io/repository/argoproj/kubectl-argo-rollouts&lt;/a&gt;&lt;/p&gt;
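&lt;p&gt;As a sketch, a Codefresh pipeline step could now use this image directly. The step name, rollout name, and namespace below are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: fragment of the "steps:" section of a codefresh.yml
# that uses the official Argo Rollouts CLI image.
promote_rollout:
  title: Promote canary
  image: quay.io/argoproj/kubectl-argo-rollouts:v1.0.0
  commands:
    # the image ships the CLI as its main binary
    - kubectl-argo-rollouts promote my-rollout -n my-namespace
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;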

&lt;h2&gt;
  
  
  Ambassador support for traffic splitting
&lt;/h2&gt;

&lt;p&gt;In the case of canaries, Argo Rollouts supports several networking solutions for splitting live traffic between the previous and current versions in a gradual way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iLrPj1-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvvc8z5rswj3pp31j0ob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iLrPj1-C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mvvc8z5rswj3pp31j0ob.png" alt="Canary split"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the 1.0 release, Ambassador Edge Stack is now an official provider for traffic routing. This means that you can use Ambassador for &lt;a href="https://www.getambassador.io/docs/argo/latest/quick-start/"&gt;canary releases&lt;/a&gt; in addition to the other already supported solutions.&lt;/p&gt;
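&lt;p&gt;In practice, the canary strategy references one or more existing Ambassador Mapping objects. The sketch below assumes a Mapping named &lt;code&gt;my-service-mapping&lt;/code&gt; already exists:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch: fragment of a Rollout spec that uses Ambassador
# to split live traffic between the stable and canary versions.
strategy:
  canary:
    trafficRouting:
      ambassador:
        mappings:
          - my-service-mapping   # hypothetical existing Mapping
    steps:
      - setWeight: 20
      - pause: {duration: 30s}
      - setWeight: 50
      - pause: {}                # wait for manual promotion
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;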

&lt;p&gt;This feature is also a testament to the fact that Argo Rollouts does not need a full service mesh to gradually shift traffic. &lt;/p&gt;

&lt;p&gt;Ambassador Edge Stack, of course, has several other features (ingress services, SSO, external authentication, etc.) that are useful to have on a Kubernetes cluster in their own right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start your progressive delivery journey now
&lt;/h2&gt;

&lt;p&gt;The 1.0 release contains several fixes and features. Some highlights are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-rollouts/pull/1001"&gt;Dedicated status CLI command&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/argoproj/argo-rollouts/pull/1056"&gt;support scaleDownDelaySeconds in canary&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/argoproj/argo-rollouts/pull/1074"&gt;New RolloutCompleted condition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/argoproj/argo-rollouts/pull/901"&gt;metric fields can be parameterized&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/argoproj/argo-rollouts/pull/889"&gt;ARM builds&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get started now by following the &lt;a href="https://argoproj.github.io/argo-rollouts/installation/"&gt;installation instructions&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloudskills</category>
      <category>kubernetes</category>
      <category>argo</category>
    </item>
    <item>
      <title>Troubleshooting Kubernetes Clusters as a Developer with Komodor</title>
      <dc:creator>Kostis Kapelonis</dc:creator>
      <pubDate>Thu, 13 May 2021 16:48:43 +0000</pubDate>
      <link>https://forem.com/codefreshio/troubleshooting-kubernetes-clusters-as-a-developer-with-komodor-16fo</link>
      <guid>https://forem.com/codefreshio/troubleshooting-kubernetes-clusters-as-a-developer-with-komodor-16fo</guid>
      <description>&lt;p&gt;The container ecosystem is moving very fast and new tools designed specifically for Kubernetes clusters are introduced at a very fast pace. Even though several times a new tool is simply implementing a well-known mechanism (already present in the VM world) with a focus on containers, every once in a while we see tools that are designed from scratch rather than adapting a preexisting idea. One such tool is &lt;a href="https://komodor.com/"&gt;Komodor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is hard to describe Komodor in a single sentence because it is unlike any other existing tool. Komodor is a Kubernetes troubleshooting tool specifically designed for developers. It is a smart dashboard that combines a live view of your cluster with several integrations for other tools (such as metric providers) that you already have installed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bP0k23__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jy8l0qde2bzpkbmjq2xh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bP0k23__--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jy8l0qde2bzpkbmjq2xh.png" alt="Komodor dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The goals of Komodor are centered around an easy understanding of what is running on your cluster, what has recently changed, how it was changed, and by whom. And by change, we mean all kinds of changes such as new deployments, feature flags, ad-hoc kubectl commands, health checks, and everything else that affects your cluster.&lt;/p&gt;

&lt;p&gt;If, during an incident (or even under normal circumstances), you have struggled with questions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When was application X last deployed?&lt;/li&gt;
&lt;li&gt;What manifests were changed yesterday at 5:00 pm?&lt;/li&gt;
&lt;li&gt;How do I find the CI pipeline responsible for application Y?&lt;/li&gt;
&lt;li&gt;Who changed the configuration of application Z, a dependency of your own app?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then Komodor is here to help you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Giving developers the information they really need
&lt;/h2&gt;

&lt;p&gt;Before diving into the features of Komodor, it is important to understand the mindset behind it, as it is essentially a tool that doesn’t belong to any existing tool category.&lt;/p&gt;

&lt;p&gt;When something goes wrong in your cluster, your first impulse is to check your metrics solution. Having metrics enabled for your cluster is a great practice, but it is not always enough to troubleshoot issues (especially if it is 3:00 am and you have just woken up to look at a cluster that you are not familiar with).&lt;/p&gt;

&lt;p&gt;Current metric solutions share some common characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most of them are pre-existing tools created for bare metal and VMs that adapted their functionality to containers with mixed success&lt;/li&gt;
&lt;li&gt;They are great at telling you what changed, but not why or by whom&lt;/li&gt;
&lt;li&gt;Most of the time they offer a lot of information that is not relevant to every case, and it is up to a human to decide which information is important and which is not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The problem is further exacerbated by the fact that most existing metric solutions are targeted at operators:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4cTkMeAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bcyuxryjnywhxzsm2jxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4cTkMeAv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bcyuxryjnywhxzsm2jxk.png" alt="Without Komodor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The process of troubleshooting a Kubernetes cluster as a developer is particularly difficult with existing tools. As a developer, you don’t care about Persistent Volumes, DNS errors, or expired certificates. These problems are not normally solved by developers anyway.&lt;/p&gt;

&lt;p&gt;Developers however do care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed deployments&lt;/li&gt;
&lt;li&gt;Errors because a feature flag was enabled&lt;/li&gt;
&lt;li&gt;Communication errors that have been caused not by the deployment itself but by a dependency of the application that is being deployed&lt;/li&gt;
&lt;li&gt;Rogue manifest changes that are not happening via CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Komodor bridges this gap by giving you a tool specifically for Kubernetes clusters and specifically for all things important to developers. You can also use Komodor if you are an operator and link it to your existing tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yUSD5LDf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8eadqdzj13ryorkjaoll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yUSD5LDf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8eadqdzj13ryorkjaoll.png" alt="With Komodor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Helping developers and operators collaborate and share a common language for solving issues is at the heart of DevOps.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Komodor works
&lt;/h2&gt;

&lt;p&gt;Komodor is offered as a hybrid application. The web UI runs in the cloud and is managed by the Komodor team. You need to install &lt;a href="https://github.com/komodorio/helm-charts/tree/master/charts/k8s-watcher"&gt;the Komodor agent&lt;/a&gt; in your cluster as a Helm chart; the agent then pushes information to the cloud UI. This means that communication from your cluster is outgoing only. There is no need to open any ports in your firewall or modify your allowlist with specific port ranges.&lt;/p&gt;
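&lt;p&gt;A typical installation only needs an API key and a cluster name. The values below are a hypothetical sketch; consult the chart README for the exact field names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical values for the k8s-watcher Helm chart; the field
# names are assumptions, so check the chart documentation before use.
apiKey: "&lt;your-komodor-api-key&gt;"
watcher:
  clusterName: production-cluster  # how the cluster appears in the UI
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;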

&lt;p&gt;Once the agent is installed successfully, your dashboard is automatically populated. From that point on, Komodor monitors several events in your cluster and the web UI is updated in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XlWrQkDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgog33trvt1qmdl6jabb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XlWrQkDx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cgog33trvt1qmdl6jabb.png" alt="Example Komodor dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, Komodor supports Deployments, DaemonSets, and StatefulSets. The left sidebar has a list of all your namespaces along with some extra information, as we will see later on.&lt;/p&gt;

&lt;p&gt;Support for custom CRDs is something that may come in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using the Komodor timeline
&lt;/h2&gt;

&lt;p&gt;At the heart of Komodor is the event timeline. If you click on any service/deployment in your dashboard, you will see the complete timeline for that service (from the time the Komodor agent was installed, of course).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LJUrnd2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b4ydjryr4jpvhnzrekkg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LJUrnd2i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b4ydjryr4jpvhnzrekkg.png" alt="Event timeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The timeline is one of the major advantages of Komodor because it captures all events that affect your application. Out of the box, Komodor monitors your application for restarts, replica changes, manifest changes, health status changes, and so on. By adding extra integrations you can also include timeline events from external providers (e.g. alerts from Grafana, New Relic, or Datadog).&lt;/p&gt;

&lt;p&gt;The beauty of this mechanism is that Komodor can understand ALL events that happened in your application, regardless of their source. This is in contrast with other Kubernetes tools that only know about their own events and don’t have the full picture of what was changed and when.&lt;/p&gt;

&lt;p&gt;As an example, for a single application the following might happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The deployment image was upgraded by your CI/CD system&lt;/li&gt;
&lt;li&gt;The pod autoscaler changed the number of replicas to better handle traffic&lt;/li&gt;
&lt;li&gt;An alert was created by New Relic&lt;/li&gt;
&lt;li&gt;A system administrator manually changed a manifest via kubectl&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Komodor will gather all of these events and present them in the same timeline in chronological order. This gives you great insight into what is happening with a single service without having to jump between different applications and dashboards.&lt;/p&gt;

&lt;p&gt;The last case is especially important, as it means that Komodor will catch even ad-hoc changes performed manually in the cluster (e.g. with kubectl edit).&lt;/p&gt;

&lt;p&gt;In the example below I have manually changed the replicas with kubectl, without deploying a new version via CI/CD. Komodor not only identifies the change and marks it as “change replicas” in the timeline, but also provides a detailed diff of what changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rLRQ8-S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mya8zvtl95my5psghi28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rLRQ8-S3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mya8zvtl95my5psghi28.png" alt="Manifest diff"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Detecting manual configuration changes is very important, as they can be responsible for failed deployments (the well-known phenomenon of configuration drift). Often, after a failed deployment, developers waste time looking for issues in their code, while in reality the problem was that somebody changed the cluster or the application in an unaudited manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TwxVhYwt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k1961s3bft6jpk7deu16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TwxVhYwt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k1961s3bft6jpk7deu16.png" alt="Kubernetes drift"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Komodor it is now possible to see both planned and unplanned changes in the same timeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correlating events between related services
&lt;/h2&gt;

&lt;p&gt;Looking at each service on its own offers great insight into what your application is doing at all times. However, the true power of Komodor becomes more apparent when you select multiple services either from the same namespace or from different namespaces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q_T7niT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvkth366fu44v3h72a90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q_T7niT0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bvkth366fu44v3h72a90.png" alt="Related services"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Komodor will then do a smart merge of all events and present you with a unified timeline containing the events of all selected services (still in chronological order).&lt;/p&gt;

&lt;p&gt;The importance of this view cannot be overstated. The health of a service might be affected by other dependent services, so having this unified overview is a timesaver for detecting issues not with a service itself but with its dependencies.&lt;/p&gt;

&lt;p&gt;If you have ever been on call you might recognize the following scenario:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You are paged at 3:00 am to fix an issue with application A, an application that you are already familiar with&lt;/li&gt;
&lt;li&gt;You check your dashboards and realize that no deployment has happened for application A in the last week. You then start looking for configuration changes, and your Git history says that none have been made lately.&lt;/li&gt;
&lt;li&gt;Getting the hint that another dependent service might be the issue, you start looking at other dependent projects B, C, D etc., which is a time-consuming process&lt;/li&gt;
&lt;li&gt;Since you are not familiar with the dependent projects, you need to ask other people (and possibly wake them up) in order to see if their configuration was changed or not. You also spend extra time checking the CI/CD pipelines for those services.&lt;/li&gt;
&lt;li&gt;This process can continue in more depth (i.e. checking the transitive dependencies of your dependencies)&lt;/li&gt;
&lt;li&gt;In the end you find out that earlier that day a sysadmin from an unrelated team made a manual change (not recorded by CI/CD) in a dependent service that resulted in your application failing a few hours later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Having Komodor in place can cut this process down from several hours to a few minutes. Now you can simply gather all dependent services in a single timeline and see exactly what happened to all of them, including manual changes.&lt;/p&gt;

&lt;p&gt;You can make this process even easier if you help Komodor understand what “related services” means for you. While you can always manually select additional services that you consider related, Komodor has native integrations with Kiali/Istio and Datadog that automatically look at network traffic and discover the dependencies between your services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Annotating your applications with extra metadata
&lt;/h2&gt;

&lt;p&gt;In the previous section we talked about how Komodor detects manifest changes automatically and can even present you with a diff of what changed. &lt;/p&gt;

&lt;p&gt;But what about application changes? In most cases (especially if you are following GitOps) the application source code is in a separate Git repository that has nothing to do with the repository that holds your manifests. &lt;/p&gt;

&lt;p&gt;Komodor has native integration with GitHub in the form of extra annotations. By annotating your deployments with extra information about the Git repositories that comprise your application, Komodor can communicate with GitHub and present you with a diff of both application code AND manifests. This is a huge advantage when it comes to troubleshooting, as you can follow both infrastructure and code changes all at once.&lt;/p&gt;

&lt;p&gt;Here is an example of the annotations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/kostis-codefresh/gitops-app-source-code&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/app.ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refs/heads/dummy&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/deploy.job.jenkins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://codefresh.io&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/infra&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kostis-codefresh/gitops-kubernetes-configuration&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/infra.ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;refs/heads/master&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/pipelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/kostis-codefresh/gitops-pipelines&lt;/span&gt;
    &lt;span class="s"&gt;app.komodor.com/pipelines.ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HEAD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here is the extra information in the UI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pG8yNwXh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xs7wt2babqnf84lq3su4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pG8yNwXh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xs7wt2babqnf84lq3su4.png" alt="Github diff"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that the GitHub information is just one example of the annotations you can add to your service. Komodor supports other types of annotations, such as links to your metrics, your alert provider, your CI pipeline, your playbooks and so on.&lt;/p&gt;
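&lt;p&gt;For illustration only, here is what such annotations could look like. The exact keys below are hypothetical and simply follow the pattern of the earlier example; consult the Komodor documentation for the actual supported keys:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metadata:
  annotations:
    # Hypothetical keys shown for illustration -- check the Komodor docs for the real ones
    app.komodor.com/metrics: https://grafana.example.com/d/my-service
    app.komodor.com/playbook: https://wiki.example.com/runbooks/my-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;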

&lt;p&gt;This means that with Komodor you can easily explore a cluster that you are not familiar with. And instead of hunting down information in Slack channels or company wikis (or even worse, having to wake up people in the middle of an incident), you have all the information at hand right within Komodor.&lt;/p&gt;

&lt;p&gt;If you think you have seen this type of annotation before, you are not mistaken. They are also used &lt;a href="https://ambassadorlabs.github.io/k8s-for-humans/"&gt;by Ambassador&lt;/a&gt; for creating a developer portal. Right now Komodor introduces its own annotations, but in the future it would be great if we had a common standard for this type of information.&lt;/p&gt;
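&lt;p&gt;For comparison, here is what the Ambassador convention looks like. The &lt;code&gt;a8r.io&lt;/code&gt; keys come from the k8s-for-humans page linked above, while the values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;metadata:
  annotations:
    # a8r.io keys as documented by the k8s-for-humans project; values are examples
    a8r.io/owner: sre-team
    a8r.io/repository: https://github.com/example/my-service
    a8r.io/runbook: https://wiki.example.com/runbooks/my-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;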

&lt;h2&gt;
  
  
  Recovering from incidents with Komodor
&lt;/h2&gt;

&lt;p&gt;I hope that you now have a good idea of what Komodor offers. It is not a replacement for your metrics or alerts. It complements them both by offering a unified dashboard for your cluster with the information that you need specifically for Kubernetes applications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An overview of all services and their health&lt;/li&gt;
&lt;li&gt;A comprehensive timeline of all events that affected your service (even ad hoc changes)&lt;/li&gt;
&lt;li&gt;A way to group related services and merge their timeline together&lt;/li&gt;
&lt;li&gt;Information on both infrastructure and application code changes in the same dashboard&lt;/li&gt;
&lt;li&gt;A way to add extra annotations to your services for handy links to CI, playbooks, metrics etc.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Komodor will change the way you manage incidents. Depending on the amount of effort you invest in it (particularly in external integrations), you can significantly cut down the time wasted looking for information during an incident.&lt;/p&gt;

&lt;p&gt;To start exploring Komodor and change the way you troubleshoot your Kubernetes applications, visit &lt;a href="https://komodor.com/"&gt;https://komodor.com/&lt;/a&gt; &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloudnative</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
