Forem: Canarian

Why Your Cloud is Broken

Ant(on) Weiss — Wed, 21 Apr 2021 11:07:21 +0000

The Promise of Desired State Configuration

The last decade of IT was marked by the gradual proliferation of Desired State Configuration Management practices. Pioneered by Mark Burgess’ CFEngine back in 1993 and further developed by such tools as Chef, Puppet and lately Kubernetes - this once revolutionary approach allowed IT administrators to automate the management of an ever-growing fleet of servers - physical and virtual.
All Desired State Configuration systems are based on another now widely present practice of IT management called Infrastructure-As-Code (IaC). The idea is that we (the system administrators) describe the Desired State of our system in code (usually some kind of a DSL - Domain Specific Language) and the configuration system makes sure that the actual state of the system reflects the desired state. This automated process of bringing the system to the desired state is called convergence or reconciliation and is performed by the configuration system controllers and agents. The controllers publish and verify state transitions, while the agents take care of the actual state application.
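
To make the convergence idea a bit more tangible, here is a minimal sketch of a reconciliation loop in Python. It is not how CFEngine, Puppet or Kubernetes actually implement it: `ServiceState`, `diff`, `apply` and `reconcile` are hypothetical names, and the "system" is just a dictionary held in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceState:
    """State of one (hypothetical) managed service."""
    replicas: int
    version: str


def diff(desired: dict[str, ServiceState],
         actual: dict[str, ServiceState]) -> list[tuple[str, str]]:
    """Controller side: compare desired and actual state, list the actions needed."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name)
        if have is None:
            actions.append(("create", name))
        elif have != want:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(action: tuple[str, str],
          desired: dict[str, ServiceState],
          actual: dict[str, ServiceState]) -> None:
    """Agent side: apply a single action to the actual state."""
    verb, name = action
    if verb == "delete":
        del actual[name]
    else:  # "create" and "update" both copy the desired state over
        actual[name] = desired[name]


def reconcile(desired: dict[str, ServiceState],
              actual: dict[str, ServiceState],
              max_rounds: int = 10) -> int:
    """Repeat diff/apply until nothing is left to do; return the rounds used."""
    for round_no in range(max_rounds):
        actions = diff(desired, actual)
        if not actions:
            return round_no
        for action in actions:
            apply(action, desired, actual)
    raise RuntimeError("actual state did not converge to the desired state")
```

Note that the loop can only ever be as good as the `desired` dictionary it is handed. It has no opinion on whether that state is the right one.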

This is of course a very simplified, nutshell-contained description, but I believe it’s sufficient to understand the basic problem inherent in this pattern.

The Actual State

And the problem is - all these systems are based on the assumption that the administrators of a system know what the desired state of the system is.
But this couldn’t be further from the truth. In reality - any even moderately complex cloud environment has loads of components that the administrators understand only vaguely. These components are often misconfigured or under-optimized. The configuration blind spots only get discovered when something in the system crashes.

As an example: here are 3 incidents that occurred over the course of one week at a company we recently started consulting for:

An auto-scaling event in a message provider causes a message queue to explode
Code is deployed pointing at a dummy instance of a downstream service. Stays undiscovered for 4 days.
A database that was never configured for HA (or auto-scaling) runs hot for a week without any alerts until it finally goes up in flames.

An attentive reader will notice that all of the mentioned systems were in a desired state defined by the system administrators. It was the administrators who configured the auto-scaling, pointed code at a dummy service or decided that HA configuration for the DB was not currently needed. Or maybe they didn’t consciously decide all of these and just went with default configurations (which are never meant for production, are they?). Why? Well - because they never had the time to specify what the desired configuration of the system is. Because the complexity and variety of system components we have to manage today is too much of a cognitive load for a human or even a team of humans to handle.

The Adaptive State

Putting aside the complex modular stacks our IT is composed of - the only desired state of the system we can truly define is this:

“Our System Works and Serves Its Customers”.

It seems like a naive approach at first. But isn’t this the top-level business objective of any information system?

And that’s exactly why Desired State Configuration doesn’t cut it anymore. Because it focuses on defining the state of infrastructure instead of functional goals. What we really need in the age of complex flexible information systems (aka Cloud Native IT) is Adaptive Configuration - i.e. smart controllers and agents that can configure (and continuously optimize) the components of the system according to its business objectives as defined by us, aligned with industry best practices, and supported by continuously collected machine data.

We’re already seeing the first glimpses of this approach and (quite unsurprisingly) - the first conflicts resulting from these newer smarter techniques clashing with the Desired State Configuration patterns that are still present.

I’m going to outline the approaches and the conflicts we’re seeing in follow-up posts. And of course I’m very curious to hear about your experience with the Desired State Configuration approach and where you see its pros and cons manifested. Looking forward to your feedback!

Stay tuned, stay adaptive, stay well!

On Resilience, Phase Transitions and Semantic Change Management in Information Systems.

Ant(on) Weiss — Mon, 22 Mar 2021 15:32:34 +0000

Change as the Vehicle of Value Creation

The value of a modern information system is defined in large part by the speed with which it can change and adapt.

“Adapt to what?” - you may ask.

Well, a lot of stuff - the wild and unpredictable fluctuations of market forces, changes in consumer demand and institutional regulations and the sheer cost of underlying infrastructure.

Change is the vehicle of value creation. But somewhat paradoxically - change is also the main source of value disruption - because of the instability it brings to a system.

The last 20 years of IT evolution were all about enabling higher rates of change while also eliminating its disruptive impact.

Enabling Change

Continuous integration (CI) practices were created in order to identify and address the potential disruptions as early as possible in the change lifecycle.

They allowed moving the bottleneck from the release point to the actual integration stages where it is easier and cheaper to resolve the arising issues.

The feedback loops got shorter and the situation improved. But pretty soon - this too became insufficient. The evolution of online services demanded increasingly tight uptime restrictions. The ability to roll out changes to production in a safe and continuous manner brought on the push for Continuous Delivery (CD).

If only it were that easy...

Facing Instability

Organizations taking the leap towards continuous updates of production environments are facing numerous challenges. As information systems become more complex, it gets increasingly hard (up to the point of impossible) to predict the impact of any individual change on a system’s performance. Especially so when numerous changes are happening simultaneously on various layers of the technological stack: some initiated by the system’s operators or users, others stemming from integrations with third party systems, and still others arising as artefacts of the system’s evolution.

If the impact of change is unpredictable, then how do we preserve the stability of value delivery? The obvious strategy is to limit the amount of change, disallow parallel updates and evaluate the impact of each change individually until stability and value generation is ensured - only then can further changes be introduced. This is a great approach for slowly-changing, highly observable systems. Not the dynamic cloud-native applications we’re running and using today.

In current business reality there is no other option but to move fast and break things.
But still, we don't want our customers to suffer and potentially leave us for a competitor because the service is broken. So we try to compensate for compromised stability with significant investments in monitoring and observability. We also do our best to hire qualified operations personnel for on-call rotation - so they can quickly fix the problems that arise. Hence the proliferation of monitoring software services and the severe shortage of Ops and SRE professionals we’ve witnessed over the last decade.

It’s pretty obvious of course that just exposing more metrics and logs and putting more humans on call for resolving incidents is a model that doesn’t scale. The rate of burnout in our industry is already higher than in healthcare!

Have we really built all these smart machines only to waste our lives on watching them behave?!

Of course not!!!

Instead - a new model of smart, continuous incident prediction and remediation is needed.
And if we look around - we’re already starting to see the first glimpses of platforms, tools and most importantly - humans embracing these ideas.

The 5 Steps of Resilience

This is where I want to stop for a moment to talk about resilience. And specifically - resilience engineering. It’s a huge topic - much wider than a paragraph in a blog post could cover. So I’ll just mention that the resilience of a system is pre-defined by its adaptive capacity: its ability to bend and morph in response to unexpected environmental events while still preserving the basic required functionality. Resilience engineering, in turn, is concerned with:

building systems capable of resilience and
practices of resilience in system operation.

Resilience basically entails the system’s ability to identify potential destabilization factors as fast as possible, analyze the problem and its impact, enumerate possible ways of tackling the problem, iterate on all the possible solutions until a working solution is found and finally apply the solution. All of these with minimum adverse impact on the expected functionality of the system.

Therefore - in order to enable resilience we need our system to continuously cycle through 5 main stages of interaction with problems that we want it to withstand:

Identification of the Problem
Analysis of the Problem
Identification of Possible Solutions
Validation of Possible Solutions
Application of the Most Viable* Solution

*Note: the viability of a solution is defined by organizational policy with regards to costs, time, quality and additional considerations.
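
The five stages above can be sketched as a single pass of a loop. This is only an illustration under assumptions of mine: the `Solution` fields, the `max_cost` policy knob and the callables passed in are all hypothetical, and a real system would run each stage against live telemetry rather than plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Solution:
    """A candidate fix. All fields are illustrative assumptions."""
    name: str
    cost: float                      # whatever the organizational policy measures
    validate: Callable[[], bool]     # dry run: True if the fix is expected to work
    apply: Callable[[], None]        # performs the fix


def resilience_cycle(
    identify: Callable[[], Optional[str]],
    analyze: Callable[[str], str],
    propose: Callable[[str, str], Sequence[Solution]],
    max_cost: float,
) -> Optional[str]:
    """Run one pass through the five stages; return the applied solution's name."""
    problem = identify()                                   # 1. identify the problem
    if problem is None:
        return None                                        # nothing to do this cycle
    impact = analyze(problem)                              # 2. analyze the problem
    candidates = propose(problem, impact)                  # 3. identify possible solutions
    validated = [s for s in candidates if s.validate()]    # 4. validate them
    viable = [s for s in validated if s.cost <= max_cost]  # policy decides what is viable
    if not viable:
        raise LookupError(f"no viable solution for {problem!r}; escalate to a human")
    best = min(viable, key=lambda s: s.cost)
    best.apply()                                           # 5. apply the most viable one
    return best.name
```

Here "most viable" is reduced to "cheapest validated solution within budget", which is just one possible policy. The escalation path matters too: when no candidate passes, the cycle hands the problem back to a human instead of guessing.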

Know Thy Enemy

In this post I’d like to focus on the first 2 stages. Without identification and analysis - no corrective action can occur. Moreover - the better we get at the first 2 capabilities - the better our ability to adapt will become. As Sun Tzu put it:

Know thy enemy and know yourself; in a hundred battles, you will never be defeated.
When you are ignorant of the enemy but know yourself, your chances of winning or losing are equal.
If ignorant both of your enemy and of yourself, you are sure to be defeated in every battle.

In the light of our discussion - identification and analysis deal with knowing our enemy.

Identification and Phase Transitions

In thermodynamics and statistical physics there’s a notion of phase transitions. A phase is a condition of a system in which its behaviour is qualitatively different from its previous condition. A classic example is solid matter melting into liquid and further evaporating into gas. Only to condense into liquid again, of course. A somewhat related phenomenon in the study of fluid dynamics is turbulence. Described by Richard Feynman as “the most important unsolved problem in classical physics”, turbulence is the onset of instability and chaotic patterns in a previously smooth, or laminar, flow. So again - a transition to a different phase where the same system starts to behave differently, even though it consists of the same set of components. A transition is never immediate - it’s caused by gradually accumulating levels of energy (kinetic or thermal). The energy keeps building up until the system reaches a critical point. This is the point at which even a tiny addition of energy can throw the system into the new, unstable behaviour.
During the transition - small potential instabilities start to unfold - islands of chaos in the stable ocean of predictability.
Quite in the same way our information systems don’t become unstable all at once. The constant influx of changes (that can be seen as energy) generates the chaotic islands of tech debt, security loopholes, circular dependencies and unfortunate misconfigurations until the system reaches its critical point and collapses into instability.

Identifying Phase Transitions in IT

And this brings us back to the identification stage.
Identifying a problem after it occurs is too often much too late. Going from chaos back to stability is nerve-racking and very costly. If we want to build a better problem identification capability - we need to measure and identify the phase transition processes in information systems. This is entirely possible in the well-monitored, observable systems of today. In such a system we can measure the current level of instability and adapt the incoming rate of change to the “viscosity” of the infrastructure. Or, inversely - make the system’s interactions with the outside world more “viscous”, slowing incoming changes down to the point where we can promise greater certainty. Sometimes that means fully blocking the changes that could bring the whole system down, and gradually opening the gateway again once we’re further from the critical point.
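
The idea of adapting the incoming rate of change to the measured level of instability can be sketched as a simple admission gate. Everything here is an assumption made for illustration: the instability score normalized to the range 0 to 1, the `critical` threshold of 0.8 and the linear taper are hypothetical choices, not measurements from any real system.

```python
def admitted_change_rate(instability: float,
                         max_rate: float,
                         critical: float = 0.8) -> float:
    """Return how many changes per hour to admit at the given instability score.

    instability: hypothetical score in [0, 1], where 0 is fully stable
    max_rate:    changes per hour admitted when the system is fully stable
    critical:    score at which the gate closes completely
    """
    if not 0.0 <= instability <= 1.0:
        raise ValueError("instability must be between 0 and 1")
    if not 0.0 < critical <= 1.0:
        raise ValueError("critical must be in (0, 1]")
    if instability >= critical:
        return 0.0  # at or past the critical point: block all changes
    # Linear taper: the full rate at 0, shrinking to zero at the critical point.
    return max_rate * (1.0 - instability / critical)
```

With `max_rate=20`, a score of 0.4 admits half of the normal flow, and anything at 0.8 or above closes the gate until the score drops again. The hard part, of course, is computing a trustworthy instability score in the first place.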

Data analysis and machine learning are of course key to such advanced infrastructure management patterns. But this approach goes beyond the basic anomaly detection that most existing "AIOps" solutions offer. It involves arming our systems with capabilities of continuous self-introspection and self-remediation. It is also about predicting what the next phase of a system may be as the effect of a change that we plan to apply, taking into account the amount and - even more importantly - the kind of change.

Semantic Change Management (or Not All Changes Were Created the Same)

Semantic change management is the other missing piece of the resilience puzzle. In most current software delivery studies we are usually focused on quantitative analysis: deployment rate, lead time, exception count, etc. But practice shows that overwhelmingly the question “what was deployed?” is much more important in problem analysis than “how many deploys were made?”. It’s the type of change and not the rate of change that makes or (more often) breaks a system.

The exact typology of changes varies with the type of system. But for almost all information systems one can broadly separate all changes into code, infrastructure, and configuration changes. This division can be made more granular by separating frontend from middleware from backend, by separating cross-system configuration from isolated component config, and so on. Each change type has its own properties that define its potential impact on the system under change.
Software delivery systems that we’re creating now need to allow for the codification and analysis of these organization-wide change semantics. This is the prerequisite for the identification of the phase transition states described in the previous section. This semantic typology will allow for a granular definition of deployment strategies (such as, for example, continuous canary validation techniques) applied to each and every change. It will also allow for analyzing whether the type and size of a change is something the system has the adaptive capacity to absorb in its current state.
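
As a rough sketch of what codifying change semantics might look like: the three change kinds come straight from the text, while the risk weights, the `size` measure, the doubling for cross-system changes and the strategy names are hypothetical placeholders that every organization would have to define for itself.

```python
from dataclasses import dataclass
from enum import Enum


class ChangeKind(Enum):
    CODE = "code"
    INFRASTRUCTURE = "infrastructure"
    CONFIGURATION = "configuration"


@dataclass(frozen=True)
class Change:
    kind: ChangeKind
    component: str      # e.g. "frontend", "middleware", "backend"
    cross_system: bool  # True if it touches configuration shared across components
    size: int           # hypothetical size measure, e.g. lines or resources touched


# Hypothetical per-kind weights: how risky one unit of change is assumed to be.
RISK_WEIGHT = {
    ChangeKind.CODE: 1.0,
    ChangeKind.CONFIGURATION: 1.5,
    ChangeKind.INFRASTRUCTURE: 2.0,
}


def risk(change: Change) -> float:
    """Score a change by its kind and size; cross-system changes count double."""
    score = RISK_WEIGHT[change.kind] * change.size
    return score * 2 if change.cross_system else score


def strategy(change: Change, adaptive_capacity: float) -> str:
    """Pick a deployment strategy from the change's risk and the system's capacity."""
    r = risk(change)
    if r > adaptive_capacity:
        return "hold"      # more than the system can absorb in its current state
    if r > adaptive_capacity / 2:
        return "canary"    # absorbable, but validate on a small slice first
    return "rolling"       # low risk: a regular rolling update
```

The point is not the particular numbers but that the decision is driven by what is being changed, not by how many deploys were made.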

To Summarize

This article is an attempt to outline two of the most important missing pieces of continuous change management in modern and future cloud- and edge-native IT systems:

Phase Transition Analysis
Semantic Change Management

These capabilities (enabled by data analysis and machine learning) are seen as the prerequisites for making a system semi-autonomously capable of resilience (or as Mark Burgess, whose ideas have influenced me tremendously, would put it - immunity).
The mechanisms for enabling these capabilities are being created as we speak. Once they are operational and well-trusted - the vision of continuous deployment, or of “Liquid Software” as defined by Sadogursky, Landman and Simon will finally start to become reality. And the face of what we now call DevOps and of our industry as a whole will change beyond recognition.

This is the future we want to build.

Thanks to Mark Burgess and Leonid Mirsky for reviewing and providing valuable comments!

Resilience Engineering and Life

Ant(on) Weiss — Thu, 17 Dec 2020 10:40:00 +0000

31 years ago my life boarded an airplane and crashed into a rock. A Soviet teenager, brought up in the breathtakingly beautiful city of Leningrad, I found myself standing in the midst of Geula - a Jerusalem neighbourhood populated mainly by Ultra Orthodox Haredi Jews.
It is a grey, chilly morning in January 1990. I am surrounded by ugly dirty buildings and bearded men wearing weird black overcoats and fedora hats. Staring around in shock and disbelief, desperately wishing this was all a dream. Wishing I was back in Russia surrounded by my friends, rock music and perestroika.

But alas, there was no going back. In the years that followed Israel never ceased to surprise me with more and more things that nothing in my previous life had prepared me for. It was a bumpy ride with street fights, drugs and even imprisonment. And one could say that it’s a kind of a miracle - that here I am today - a well-respected Israeli citizen blessed with a family and a successful business, building a new company.

Me in 1991. Not yet ready to adapt.

Like thousands of my fellow immigrants I adapted, I overcame the unexpected challenges of the new reality and found my place. But there were also others. Those who failed to acclimate and fell victim to addiction, delinquency, depression or suicide.

So what is it - that thing that helps some of us to adapt and succeed while others crumble?
In scientific talk it is called resilience or adaptive capacity. I was in no way better than those immigrants who failed. Instead there were certain choices and actions I took when faced with difficulties that allowed me to regain my social status, to learn the new skills and understandings needed to succeed in my newfound home.
As John Allspaw says - resilience is not something a system has, resilience is something a system does. It’s not a property but rather an activity, something we actively pursue and develop. Resilience is the ability of a system, be it a human being, an organization or a software component to withstand the unforeseen adversities, to adapt to the changes they require and to spring back, to recover, to continue providing the previously expected capabilities.

So why are we now talking about resilience at IT conferences? Why is the topic of resilience becoming so top-of-mind for many of the most profound visionaries of our industry? Well, it’s because the information systems we are building are becoming increasingly more complex and unpredictable, interconnected and chaotic, while we become increasingly dependent on them for carrying out our expected capabilities as a society, as a civilization.

How many production incidents did you have in the last year?

How many of those were expected to occur?

What was the total cost of those incidents?

How long did it take you to go back to normal?

As an industry we’ve come to an understanding that in complex distributed systems failure is a feature, not a bug. And what really matters is the system's resilience - its ability to withstand the failure, to bounce back and recover.

And we’re also becoming painfully aware of the human factor in the resilience of the information systems. A program does what it is programmed to do, but in most cases - not what the programmer intended. As Stafford Beer put it - the purpose of a system is what it does.
And right now it’s only us humans who can stand in to fill that gap between the intended and the actual purpose. Paraphrasing Conway's law one could say that the resilience of an information system is defined by the resilience of the organization that builds it. And let me say, 2020 was a great testbed for organizational resilience, with test results still being calculated.

Now - as engineers the first question we should ask is: can this be engineered? Can we intentionally build our organizations and consequently our systems to do more resilience and less brittleness?
The answer is - maybe! We now somewhat understand the principles and the algorithms of how resilience works. But the paradox lies in the fact that resilience is about being prepared for the unexpected, about being ready for the unknown. Therefore it can only be tested in real time - when the unexpected event happens. Much like many other activities in software delivery - resilience is a continuous quest based on never-ending learning and adaptation.

The quest has begun! There’s already a great deal of knowledge to learn from. But there’s still a long road ahead - and it is now up to us to walk that road, to build the resilient systems of tomorrow. And then maybe, just maybe we will all be ready for whatever unexpected crap comes our way.

My picks for AllDayDevOps

Ant(on) Weiss — Mon, 09 Nov 2020 16:45:55 +0000

AllDayDevOps 2020 is happening 24 hours from now!
It's packed with content and choosing the talks to attend isn't a trivial task.
That's why I decided to compile my own recommendations list.

So here they are - 14 talks to attend at AllDayDevOps!

1.
SITE RELIABILITY ENGINEERING
Managing Systems in an Age of Dynamic Complexity

Laura Nolan, Slack

I like the title of this talk. And I believe folks at Slack know a thing or two about how we work and collaborate.

2.
CULTURAL TRANSFORMATION
Bullet-Proof Coding : Adaptive Collaboration for Resilience

Anton Weiss, Otomato Software

Regretfully my own talk occupies the same time slot as Laura's presentation. So you have to choose :)

3.
SITE RELIABILITY ENGINEERING
Site Reliability Engineering: Anti-patterns in Everyday Life and What They Teach Us

Jennifer Petoff, Google

It's time we hear from Google about how not to SRE.

4.
MODERN INFRASTRUCTURE
The Past, Present, and Future of Cloud Native API Gateways

Daniel Bryant, Ambassador Labs

I love Daniel's presentation skills and API gateways/service meshes/smart proxies are all such exciting technologies!

5.
CULTURAL TRANSFORMATION

Using DevOps Principles to Measure Value Flow

Helen Beal, DevOps Institute

Helen is a great speaker and DevOps without measurements is like a fish without water.

6.
KEYNOTES
Ask Me Anything Keynote: DevSecOps

John Willis, Red Hat and Shannon Lietz, Intuit

John is one of the godfathers of the DevOps movement and a great speaker. Ask him anything about DevSecOps - do that!

7.
SITE RELIABILITY ENGINEERING

The Unmonitored Failure Domain: Mental Health

Jaime Woo, Incident Labs

If you don't monitor your team's mental health - little else matters.

8.
KEYNOTES
Ask Me Anything Keynote: Chaos Engineering

Casey Rosenthal, Verica / Nora Jones, Jeli

Chaos engineering is one of the more compelling topics in modern IT. Nora and Casey are both definitely the right people to ask anything about that.

9.
MODERN INFRASTRUCTURE
Service Mesh Past, Present, and Future with Envoy Proxy and WebAssembly

Idit Levine, Solo.io

Idit is my super-talented compatriot and the work solo.io have been doing with WebAssembly and Envoy is pushing the industry forward!

10.
SITE RELIABILITY ENGINEERING
Fast & Simple: Observing Code & Infra Deployments At Honeycomb

Liz Fong-Jones, honeycomb.io

Honeycomb is one of the more interesting stars in the observability sky.

11.
CULTURAL TRANSFORMATION
Doing DevOps With Deming

Ken Muse, Wintellect

How does one do DevOps without Deming?!?!

12.
MODERN INFRASTRUCTURE
Solving the Service Mesh Adopter’s Dilemma

Lee Calcote, Layer5

Lee is a great presenter and knows service meshes like nobody else does. Layer5 ftw!

13.
SITE RELIABILITY ENGINEERING
0 to SRE: Lessons from a First-Year SRE

Reginald Davis, Elasticsearch

Getting started with SRE is definitely harder than it sounds.

14.
CULTURAL TRANSFORMATION
Gatekeeping and the DevOps Revolution: We Haven't Always Known Everything

Kat Cosgrove, JFrog

Gatekeepers - do we really need them? Kat is a great speaker and she'll provide the answers.

Those are my picks - and what are yours?

Let's stop fooling ourselves. What we call CI/CD is actually only CI.

Ant(on) Weiss — Tue, 20 Oct 2020 16:08:09 +0000

Cover Photo by Yan from Pexels

Yes - this post started as a tweet. One that went semi-viral. It struck a real, naked, buzzing nerve. A nerve that most of us prefer not to touch.

Some of us pretend it's not a real pain, others are just too busy fixing production issues. Still others - and that would probably be the majority of the industry - are yet to discover how much of a distress it is when they meet with it face to face.

But the truth is out there. Mere mortals can't have Continuous Delivery and even less so - Continuous Deployment. CD has become the privilege of that group that the folks at DORA quite non-accidentally call "elite performers". And the gap between the elite and everybody else is only growing wider.

Fooling Ourselves

Most organizations we work with say: "of course we have CI/CD pipelines!"
But when one digs deeper - there's usually some CI - and no CD in sight. Or, as @itaysk noted "it's not even CI, but continuous build..."

When asked what stops them from safely and regularly deploying every change into production environments - everybody seems to have their own reasons. Organizational, cultural, historical, technical, contractual... Some go as far into denial as saying: "Oh, we don't need continuous delivery. In fact most companies out there don't really need it." But the underlying reason is of course the lack of confidence. Nobody wants to be the culprit for a system outage. According to a number of industry surveys the average cost of one hour of downtime is around 75,000 USD. There's a lot at stake!
So instead we choose to move slower, to add controlled handoffs and build home-grown guardrails. To hire more Ops engineers and call them SRE to feel more secure. Rarely discussing the price of establishing and maintaining all of these over time.

But why can't we have CD?

Continuous Delivery is a sociotechnical practice. And as many Twitter commenters correctly noted - the barriers on the way to having it are two-fold. As with anything in DevOps it starts with culture and shared understanding that continuously delivering in small increments makes everything better. Engineers who've experienced true CD can't really fathom any other way of delivering software. As @giltayar puts it "CD ... is a total game changer. It changes how you perceive software development and delivering features... I did CD and EVERYTHING about how I developed changed. It was magical."

The Social Dilemma

But we humans are scared of change. The new mode of delivery challenges our perceptions: of ownership, of reliability, of hierarchy. If your SRE team is responsible for production site uptime - then what's their incentive for enabling the constant flow of change that continuously threatens the very thing they are responsible for? If you have folks whose job it is to control what gets released when - what will they do when this control is made obsolete? The existing organizational barriers make the blame game easier - thus providing us with a false sense of confidence. Because the tools we currently have can't promise true confidence - and this brings us to...

The Technical Dilemma

The socio-cultural obstacles are truly the hardest to remove. But as Archimedes used to say: "Give me a lever long enough and a fulcrum on which to place it, and I shall move the world." Technology, while meaningless on its own, can become a great enabler for societal innovation.
Trouble is - the tools for continuous delivery/deployment are still lacking. And this is especially true for the new brave cloud/edge-native world we see rapidly unfolding before our eyes.

But Aren't CI Tools Enough?

This is where some readers might say: "Why are you saying there are no tools for CD? We already have Jenkins/CircleCI/GitHub Actions... Why can't we use those? And then there's Spinnaker, isn't there?"

That, of course, is a grave mistake. Yes - any CI server or even generic workflow automation tool can theoretically orchestrate your deployments - the mechanics of deployment are trivial. But deploying like this is the same thing as the proverbial "throwing changes over the wall" practice that brought on the DevOps revolution.
Because CI tools ignore the semantics of change. The only kind of feedback they provide is deterministic - verifying pre-defined functionality under pre-defined conditions. The production environment, on the other hand, has inherent uncertainty that leads it to behave in often unpredictable ways. Therefore - in modern complex systems no change is verified until it reaches production. As they say - until the wheels hit the road.

And that is exactly why most orgs out there can't have CD. Because blindly pushing into production is scary, stressful and in the end falls on the shoulders of the undermanned SRE team.
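
To illustrate the difference between a pre-defined CI check and feedback taken from production, here is a minimal sketch of a canary verdict that compares the error rate of a new version against the version it replaces. The function name, the thresholds and the three verdicts are assumptions made for this example. This is not the API of Argo Rollouts, Flagger or any other tool.

```python
def canary_verdict(baseline_errors: int,
                   baseline_requests: int,
                   canary_errors: int,
                   canary_requests: int,
                   max_ratio: float = 1.5,
                   min_requests: int = 100,
                   floor: float = 0.001) -> str:
    """Decide what to do with a canary based on live traffic from both versions.

    max_ratio:    how much worse than the baseline the canary may be
    min_requests: minimum canary traffic needed before judging it
    floor:        error rate that is always tolerated, so that a spotless
                  baseline does not make a single canary error fatal
    """
    if baseline_requests <= 0:
        raise ValueError("the baseline must have served some traffic")
    if canary_requests < min_requests:
        return "wait"  # not enough evidence yet
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

For example, `canary_verdict(10, 10000, 50, 1000)` compares a 0.1% baseline error rate with a 5% canary error rate and returns `"rollback"`. A real analysis would look at more than one metric and treat the numbers statistically, but even this toy version reacts to what production actually does rather than to what a test suite expected.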

Cloud Native CD is Possible

It's not all bad, of course. Some teams we talk to succeed in establishing true cloud native CD by investing multiple man-months in home-grown solutions. This is costly and most orgs can't afford it, but those who can are very proud of their achievement - until the platform changes under their feet and they need to reinvent the home-grown solution.

Some very interesting OSS projects have emerged in the last couple of years in an attempt to tackle the problem. ArgoCD with Argo Rollouts, Flux and Flagger, Shipper and Keptn are all definitely worth looking at.

Still, no single comprehensive, reliable, usable platform exists that can help us deploy to production continuously with confidence and without complex, unsustainable in-house hackery.

That's why we at Canarian decided to step up to the challenge.

We're building a platform that will allow you to deploy continuously with confidence, full observability and automated recovery.

In the next post I'll describe the feature set that we see as the minimal viable proposition for such a platform and how we're building it.

Sounds interesting? Send us an email, sign up for our beta version on the site or just follow this blog.

We'll keep you continuously updated ;)

Keep delivering!