Serverless Chats

Episode #9: Chaos Engineering in Serverless with Gunnar Grosch

About Gunnar Grosch

Gunnar is Cloud Evangelist at Opsio based in Sweden. He has almost 20 years of experience in the IT industry, having worked as a front and backend developer, operations engineer within cloud infrastructure, technical trainer as well as several different management roles.

Outside of his professional work he is also deeply involved in the community by organizing AWS User Groups and Serverless Meetups in Sweden. Gunnar is also organizer of ServerlessDays Stockholm and AWS Community Day Nordics.

Gunnar's favorite subjects are serverless and chaos engineering. He regularly and passionately speaks at events on these and other topics.

Transcript

Jeremy: Hi, everyone. I'm Jeremy Daly and you're listening to Serverless Chats. This week, I'm chatting with Gunnar Grosch. Hi, Gunnar. Thanks for joining me.

Gunnar: Hi, Jeremy. Thank you very much for having me.

Jeremy: So you are a Cloud Evangelist and co-founder at Opsio. So why don't you tell the listeners a bit about yourself and maybe what Opsio does?

Gunnar: Yeah, sure. Well, I have quite a long history within IT. I've been working almost 20 years in IT, ranging from development through operations, management, and so on. About a year and a half ago, we started a new company called Opsio. We are a cloud consulting firm, helping customers use cloud services in any way possible, and also with operations.

Jeremy: Great. All right, so I saw you speak at ServerlessDays Milan, and you did this awesome presentation on Chaos Engineering and serverless. So that's what I want to talk to you about today. So maybe we can start with just kind of a quick overview of what exactly chaos engineering is.

Gunnar: Yes, definitely. So chaos engineering is quite a new field within IT. Well, the background is that we know that sooner or later, almost all complex systems will fail. So it's not a question of if, it's rather a question of when. So we need to build more resilient systems and to do that, we need to have experience in failure. So chaos engineering is about creating experiments that are designed to reveal the weaknesses in a system. So what we actually do is inject failure intentionally, in a controlled fashion, to gain confidence that our systems can deal with these failures. So chaos engineering is not about breaking things. I think that's really important. We do break things, but that's not the goal. The goal is to build a more resilient system.

Jeremy: Right. Yeah, and so that's one of the things where I think maybe there’s a misunderstanding too: you're doing very controlled experiments, as you said, and it's not just about fixing problems in the system, it's also about planning for when something goes down. It's not just finding bugs or weaknesses, it's also like, what happens if DynamoDB for some reason goes down, or some backend database, like, how do you plan for resiliency around that, right?

Gunnar: Yeah, that's correct, because resiliency isn't only about having systems that don't fail at all. We know that failure happens, so we need to have a way of maintaining an acceptable level of operations or service, so that when things fail, the service is still good enough for the end users or the customers. So we do the experiments to be able to find out both how the system behaves and also how the organization, the operations teams, for example, behaves when failures occur.

Jeremy: Well and about the monitoring systems too, right? I mean, we put monitoring systems in place, and then something breaks and we don't get an alarm. Right? So, this is one of the ways that you could test for that as well.

Gunnar: Yeah, that's a quite common use case for chaos engineering, actually, to be able to do experiments to test your monitoring, make sure that you get the alerts that you need. No one wants to be the guy that has to wake up early in the morning when something breaks. But you have to rely on that function to be there so that PagerDuty or whatever you use actually calls the guy.

Jeremy: And you said this is a relatively new field. You know, when you say new, it's like a couple of years old. So how did this get started?

Gunnar: Well, it actually started back in 2010 at Netflix, so I guess it's around nine years now. And they created a tool that was called Chaos Monkey. And the tool was created in response to their move from traditional physical infrastructure into AWS. So they needed a way to make sure that their large distributed system could adapt to failure, so that they could tolerate a failure. So they used Chaos Monkey to more or less shut down EC2 instances and see how the system behaved. So, that was in 2010. Then I guess the first chaos engineer was hired at Netflix in 2014, so about five years ago, and in 2017, the team at Netflix published a book on Chaos Engineering. I think it's out on O'Reilly Media, and it's like the book on Chaos Engineering today, used by most people who practice chaos engineering.

Jeremy: So we can get into some more of the details about running the experiments, and I want to get into all of that, but I'm kind of curious, because this is something where, especially with teams now, and maybe as we get into Serverless too, you've got developers that are closer to the stack, and there may be fewer ops people or more of this mix. So is this like a technical field? Is it the devs that do it, the ops people, like who's sort of responsible for doing this chaos engineering stuff?

Gunnar: Yeah, that's a good question. I would say that it's a question that's being debated in the field as well. Where does chaos engineering belong? And I think it's different in different organizations. Some organizations have specific teams that only work with chaos engineering, like Netflix, and like people at Amazon as well. But in other organizations, chaos engineering is spread out across the organization. So it might be operations that works with chaos engineering. But it also might be a DevOps team or just developers as well. But to do the experiments, you need to involve more or less the entire organization. So you use people from many teams.

Jeremy: Right. Cause you're gonna run, in some cases, you run this earlier on where you’re in testing or dev, you might run some of these experiments, but then you're gonna end up most likely, if you really want to test the resiliency of your system, you're gonna run this in production somewhere. And that means if something breaks, you know, your tech support team or customer service or whatever, they might start getting flooded with calls. So it's probably good to kind of notify everybody, “Hey, we're running an experiment”, right?

Gunnar: Exactly. It might involve people from all teams within the organization. I guess if a major customer is affected, someone at sales might get a call as well.

Jeremy: Yeah, right, you don't want to be that support rep or you don’t want to be the account manager, probably, unless you figure that out. All right, so let's get into actually talking about the experiments themselves. So let's maybe take a step back and ask the question, why would we run experiments in the first place?

Gunnar: Yeah, well, since the purpose is to find out if the system is resilient to failure, we look at, if our system has a problem, are our customers getting the experience they should? Is the system behaving well enough for our customers to get that experience? Or another thing might be that we have downtime, we have issues that are costing us money. And as you mentioned, is our monitoring working as it should? So we have quite a lot of things that might motivate us to actually do the chaos experiments. And doing it well builds confidence. When we do the experiments, we build confidence, and we know how everything within the system and the organization behaves in the face of failure.

Jeremy: Right and we probably already kind of talked about this a little bit, but this idea of an organization being able to handle an outage, right? Because if all of a sudden something goes down and you're like, well we just expected everything to be running. And then all of a sudden something goes down. And either there's cascading failures or all kinds of things like that. If you run these experiments and say, okay, it's like a fire drill, what do we do if X fails? You know, how do we recover from that or what kind of resiliency do we need to build in. So that's a big part of it as well, right?

Gunnar: Yeah, that's correct. A lot of organizations today run what's called Game Days, where you do exactly these types of fire drills, where you inject some sort of failure into either the system or the organization, and like a game, you actually do it and see how would you behave. How do you perform within the organization. And this continues not only until you have resolved the actual issue, you need to make sure that you know how do you report everything. How do you follow up on the failure and how do you solve the issues that you found within your system or your organization?

Jeremy: Yeah. And you mentioned this too, I think in your talk, where it's not just about what happens when a system goes down, but what happens when the system slows down? Like does that affect how many orders you get per hour, for example? I think you mentioned Amazon, that when their latency goes up by 100 milliseconds, they lose X amount of dollars per hour or something like that.

Gunnar: Yeah, that's correct. I know Adrian Hornsby, an evangelist at AWS, has a slide in one of his presentations where he shows some numbers on this. And I believe that the example of Amazon is that they have a 1% drop in sales for 100 milliseconds of extra load time. And for a company of Amazon’s size, that's quite a lot.

Jeremy: That's a lot of money.

Gunnar: Yeah, another example I know is that Google has a number that says that 500 milliseconds of extra load time costs them 20% fewer searches on google.com.

Jeremy: Which means 20% fewer ads load, which means 20% less money from the ads. Yes, so I think that's just to me is one of the more fascinating things of this as well is just this idea of injecting latency, and we can talk about that. But you know, all these different things you can do to affect the customer experience and see what effect it literally has on your bottom line. That's just a really cool, it's a really cool thing that I think maybe most companies aren't thinking about right now.

Gunnar: No, exactly. And that's why business metrics are so important as well. We tend to focus on technical metrics in our field. But when it comes to chaos engineering, the business metrics are probably more important. What CPU load there is, or how much memory we're using, isn't all that important. Our system should be able to handle that, but how do the business metrics get affected when we have issues?

Jeremy: Right. And that's why you want to run these experiments sort of as close to production as possible, or in production, because that's where you're going to see actual effects like that, where you actually see what happens when customers are impacted.

Gunnar: Exactly.

Jeremy: All right, so if we're gonna run some of these experiments, you had a bunch of steps laid out, right? So why don’t we start at step one, what's the first thing we do?

Gunnar: Well, the first step is that we need to define what we call the steady state, and steady state is the normal behavior of our system over time. So for the metrics that we have, we need to know what the steady state of those metrics is. And if it's a business metric, that might be, how many searches do we have per hour, per day, or per year? Or how many purchases are there on Amazon.com? But of course, we have the technical metrics, or system metrics, as well: how many clicks, or how many function invocations are there per hour, for example? Or what is the CPU load? So we find these metrics and we define them so that we can use them when we run our experiment, to be able to benchmark what happens when we do the actual experiment.
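
As a rough illustration of the steady state described here, a baseline can be computed from historical samples of a metric and later used to judge readings taken during an experiment. This is only a sketch: the function names, the three-standard-deviation tolerance band, and the orders-per-hour figures are all assumptions made up for the example, not something from the conversation.

```python
from statistics import mean, stdev


def steady_state(samples, tolerance_sigmas=3.0):
    """Summarise historical samples of one metric as a baseline band.

    The band is mean +/- tolerance_sigmas standard deviations, which is
    an assumed definition of "normal" chosen for this illustration.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return {
        "mean": mu,
        "low": mu - tolerance_sigmas * sigma,
        "high": mu + tolerance_sigmas * sigma,
    }


def within_steady_state(baseline, observed):
    """Return True if an observed reading falls inside the baseline band."""
    return baseline["low"] <= observed <= baseline["high"]


# Hypothetical business metric: orders per hour at the same hour on past days.
orders_per_hour = [118, 124, 121, 130, 127, 119, 125]
baseline = steady_state(orders_per_hour)

print(within_steady_state(baseline, 122))  # a typical reading
print(within_steady_state(baseline, 80))   # a sharp drop during an experiment
```

In practice a separate baseline would likely be kept per time window, since normal load on a Monday morning differs from a Wednesday afternoon.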

Jeremy: All right, so then the next step, so once we know our steady state. And essentially, as you said, and this again, this is based off of a number of different factors, too. So it's also like, how much load, like, you know, what's our steady state at 8 am on a Monday versus what's our steady state at 2 pm on a Wednesday, things like that. So you should have those metrics kind of laid out. And as you said, those business metrics are extremely important as well. So once we have the steady state, then we move on to step two.

Gunnar: Yeah, and the second step is then that we form what we call a hypothesis. So we form a hypothesis around how the system will handle a specific event. By using what we call what ifs, we try to find a way to form this hypothesis. And what ifs might be, what if DynamoDB isn't responding? What if the load balancer breaks? What if latency increases by X amount? And by using these what ifs we're then able to form our hypothesis. And the hypothesis might be that “if latency increases by 100 milliseconds, our front end will still behave correctly for the customers or end users.” Or “if DynamoDB isn't responding, our system will have a graceful degradation, and we will still be able to have service to the end users.” So that's the hypothesis that we're then going to try to prove or disprove.

Jeremy: And so is there, because again, I know it's a relatively new field, and maybe there's more about this in the book that Netflix did. But is there like a common list of hypotheses, or is it something that is just sort of going to be unique to each system?
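
One way to make a hypothesis like this checkable is to write it down as data: the "what if", the metric it concerns, and how much of the steady state must survive for the hypothesis to hold. The sketch below assumes a hypothetical `Hypothesis` class and a 95% threshold; neither comes from the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A 'what if' paired with a measurable expectation (illustrative only)."""

    what_if: str          # the failure being injected
    metric: str           # the metric used to judge the outcome
    minimum_ratio: float  # fraction of the steady-state value that must remain

    def holds(self, steady_state_value, observed_value):
        """True if the observed metric stayed at or above the allowed floor."""
        return observed_value >= self.minimum_ratio * steady_state_value


# Hypothetical example: checkouts should stay within 5% of normal.
latency_hypothesis = Hypothesis(
    what_if="latency to the backend increases by 100 ms",
    metric="successful_checkouts_per_hour",
    minimum_ratio=0.95,
)

print(latency_hypothesis.holds(steady_state_value=200, observed_value=195))
print(latency_hypothesis.holds(steady_state_value=200, observed_value=150))
```

Writing the threshold down before the experiment runs keeps the later "prove or disprove" step from turning into a judgment call made after the fact.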

Gunnar: They are often quite unique, but many of the chaos experiments that we use, they originate in the eight fallacies of distributed computing. You know, that the network is reliable, that latency is zero, that bandwidth is infinite, and so on, so on. So if we base experiments on these, well, we will get a baseline of experiments that we can inject into most organizations or most systems. But then outside of that, well, it's probably based upon what type of system, what services you're running.

Jeremy: And so should we come up with a list of experiments, when we’re coming up with hypotheses, should we come up with something that's going to test everything, or should we try to be, are there more specific things we should focus on?

Gunnar: Well, we usually start with the critical services, so we start with Tier 0 services, perhaps, the most critical functionality. And so we find those systems and perform the experiments there, and then we can expand based on that. So the systems that will probably have the most effect are the most important ones to get started with.

Jeremy: So you would want to start with what happens if the database goes down? What happens if this service goes down or something like that as opposed to let's start injecting latency, to see how that affects customer behavior.

Gunnar: Yeah, most likely. If you run an e-commerce site, then I suppose the final steps of the purchase process are quite important. You might want to start there.

Jeremy: And I would think too, if you came up with a hypothesis and you said, “Alright, if the DynamoDB table goes down, everything's gonna blow up and the site’s not gonna work anymore”, that before you would run that experiment, you'd probably, if you identify that there's a problem, you probably would fix that before you run the experiment, right?

Gunnar: Yeah, that's correct. And it's quite common that just while you're trying to form your hypothesis or create your experiment, you think of things that you haven't thought about before, and depending on the people in the room, you might start talking with people that you normally don't talk to, and you find things that probably won't work. So then you should fix those issues first and then start over and form a new hypothesis.

Jeremy: Right, and then once you kind of work your way through that, then I mean, there's gonna certainly be things in your system that you're not gonna know are gonna have an effect until you actually run the experiment. But like you said, I just think it's one of the things where you don't want to say, “Okay, I have a feeling that if we take DynamoDB down that our entire system will crash, let's try it,” because you're pretty sure it's going to happen. So, alright, so perfect. So now we have our hypothesis and now we need to plan our experiments, right?

Gunnar: That's correct. So we have the hypothesis, and based on that, we then create the experiment, and we create a plan for it in detail, exactly how to run this experiment. Who does what? And when do we do it? But key here is that you should always start with the smallest possible experiment so that you contain the blast radius. And blast radius containment is really important to be able to, first, of course, test the experiment, but also to not create a bigger impact than what's needed to be able to see if that part works or not.

Jeremy: So when you say contain the blast radius, what would be an example of that, like just only kind of adjusting or breaking a single function? Or how would you define that?

Gunnar: The blast radius might be different sorts, but just doing it on the single function is one way of doing it. And you're trying to make sure that you don't affect the entire system that way. But it might also be that you're doing it on the small set of users. So instead of injecting the failure or exposing all users to the failure, you can do it on a small test group, for example.
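
Limiting an experiment to a small set of users can be sketched as a deterministic bucketing function: each user is hashed into a stable bucket, and only buckets below a chosen percentage are exposed to the failure. The function name, the experiment label, and the 5% figure below are assumptions for illustration.

```python
import hashlib


def in_experiment(user_id: str, experiment: str, percent: float) -> bool:
    """Decide whether a user falls inside the blast radius.

    The user is hashed together with the experiment name so the same user
    always gets the same answer for a given experiment, while different
    experiments pick different cohorts.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # e.g. 5% -> buckets 0..499


# Hypothetical usage: expose roughly 5% of users to injected latency.
exposed = [
    uid for uid in (f"user-{n}" for n in range(10_000))
    if in_experiment(uid, "checkout-latency", 5)
]
print(len(exposed))  # close to 500, the exact count depends on the hash
```

Raising `percent` later is then one way of scaling the experiment up without changing which users were already included.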

Jeremy: Yeah, and that actually makes a ton of sense, too, because rather than saying what happens when DynamoDB goes down, you could say what happens when this function can’t access DynamoDB? And then that's, like you said, a much smaller blast radius. All right, so you've got the experiment defined. You planned it to contain this blast radius. How do you test, though, that the experiment worked or didn't work?

Gunnar: Well, yeah, it's important that you have metrics that can measure the effect of the actual experiment. So if you're testing a specific part of the system, you need to know how did the system behave when you injected that type of failure? So once again, it's part of monitoring and having observability into what you're actually doing in the system.

Jeremy: And when you run these experiments, I can imagine, especially with serverless, it gets a little more difficult, right? Because you can't actually shut down DynamoDB, right? So you have to find different ways to kind of get around that.

Gunnar: Yeah, that's true. When we're creating experiments for serverless, well, we have a higher level of abstraction, so the failure modes aren't the ones that we're used to in chaos engineering. As I said, it started out with Chaos Monkey that shut down EC2 instances. Well, we don't have any EC2 instances anymore. There isn't an off button for DynamoDB. So we need to find new ways of injecting this failure.

Jeremy: So now we've got, you know this is a sort of a complex step. But now you've got all these things in place. This is probably where you want to notify the organization. Like, ‘Hey, we're gonna break something now’ or ‘We might break something now.’

Gunnar: That’s really important because, just performing experiments without notifying anyone might lead to consequences that you don't really want. And having success stories around chaos engineering is really important. And success might be that something breaks. It doesn't have to be that the system is resilient. It's a success story, even if you find something that breaks. But the organization needs to be ready to handle whatever happens when you inject a failure. And an important part as well is, when you start your experiment or when you're designing your experiment, make sure to have a way of stopping it. You need to have that big red button to stop it.
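
The "big red button" can be as simple as a flag that every fault injection checks before doing anything. The sketch below assumes a hypothetical `CHAOS_ENABLED` environment variable; the variable name and helper functions are made up for this example.

```python
import os


def chaos_enabled(env=os.environ) -> bool:
    """Read the kill switch. Anything other than 'true' means stop."""
    return env.get("CHAOS_ENABLED", "false").lower() == "true"


def maybe_inject(fault, env=os.environ) -> bool:
    """Run fault() only while the kill switch is on.

    Returns True if the fault was injected, False if it was skipped.
    """
    if not chaos_enabled(env):
        return False
    fault()
    return True


# Hypothetical usage with an explicit environment mapping.
injected = []
maybe_inject(lambda: injected.append("latency"), env={"CHAOS_ENABLED": "true"})
maybe_inject(lambda: injected.append("latency"), env={"CHAOS_ENABLED": "false"})
print(injected)  # ['latency']
```

Defaulting to "off" when the variable is missing means a misconfiguration stops the experiment rather than starting one.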

Jeremy: Yeah, that makes a lot of sense, because I can imagine that would be ‘Hey, we might break some stuff. Oh, by the way, we broke it. But it's gonna take an hour before we can fix it.’ That's probably not a good thing.

Gunnar: Your only chaos experiment.

Jeremy: It might be the only one. Well, but you're right, because you mentioned this idea of building confidence with an organization. And I can certainly see that if I'm, you know, the shipping department or something like that, and we rely on the system being up all the time, and you're gonna tell me ‘Oh, by the way, we're going to inject some failure into the system or we're gonna try some experiments. It might take down the shipping system.’ And then all of your workers that are in the factory or in the warehouse might stand around for an hour because no new orders are coming in because we broke something. You know, that's the kind of thing where they might be not super excited about something like that happening. But if you could do it on a smaller scale and you can say, ‘Well, you know, for the billing department, it's okay if we didn't bill people for an hour because we were testing some things. But what we found was, if this happens, we can fix this.’ And that basically means we built in the resiliency for it or some sort of backup plan for it, and now we can reduce that type of outage. If that happens, we reduce the downtime or we reduce the failures or whatever. You know, that's reduced down to 15 minutes, as opposed to maybe a three hour fix or something like that. And so if you're the shipping department, you say ‘Okay, see, this is a benefit. I take that hour, let them test it. And then I know that if there is some critical failure in the future, all of my people aren't standing around in the warehouse for a day because we were able to figure out what the problem was before it happened.’

Gunnar: Yeah, that's absolutely correct.

Jeremy: Alright, cool. So now, speaking of, sort of the measurement side of things. So, step four is we measure it and we learn something, right?

Gunnar: So now we've run our experiment. We've performed the injection of failure that we wanted to do. So now we need to use the metrics that we have in place to prove or disprove the hypothesis, and depending on what type of hypothesis we have, this might be a technical or system metric. Or it might be some business metric. So that we see, did the frontend break when we shut down DynamoDB, for example? So we need to see, is the system resilient to the injected failure? Remember what I said early on? That resiliency isn't that nothing ever breaks. It's about the system being able to give good enough service to the end users.

Jeremy: Right, and that's actually a really good point too, because that's another thing with distributed systems. Things break all the time. Messages don't get delivered or you can't access something. And Netflix does this very cool, graceful degradation, right? Where they just don't show recommended movies, or whatever their thing is, if that section isn't working right. So this is that, “don't show an error message if you don't have to.” If you can get away with kind of cutting out a piece of it, you know, then have a graceful way of letting your system fail.

Gunnar: Yeah, exactly. And just having that graceful degradation, I think the Netflix example is really good there. Having UIs that don't block users. You don't get an error message. It just keeps going. But you don't see the part that's not working.
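
The kind of graceful degradation discussed here, where a failing section is left out instead of surfacing an error, might look roughly like the following. The page structure, the section names, and the always-failing `fetch_recommendations` function are invented for illustration and do not describe how Netflix actually builds its pages.

```python
def fetch_recommendations(user_id):
    """Stand-in for a downstream call that is currently failing."""
    raise ConnectionError("recommendation service unavailable")


def build_home_page(user_id, recommender=fetch_recommendations):
    """Build the page, omitting the recommendations section if it fails."""
    page = {"user": user_id, "sections": ["continue_watching"]}
    try:
        recommendations = recommender(user_id)
    except (ConnectionError, TimeoutError):
        # Degrade gracefully: no error shown to the user, the section is
        # simply left out of the page.
        return page
    page["sections"].append("recommendations")
    page["recommendations"] = recommendations
    return page


print(build_home_page("user-1"))
print(build_home_page("user-1", recommender=lambda uid: ["Movie A", "Movie B"]))
```

A chaos experiment that breaks the recommender would then confirm that the first case really is what users see.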

Jeremy: Yeah, and then with serverless, too. It's one of those things where if for some reason your database was down and you couldn't process payments or something right away, you can certainly buffer them in an SQS queue or in Kinesis or something like that. So you could still be taking orders even if you can't necessarily bill, or you can't charge somebody's credit card. But what's better? Being able to keep taking orders with maybe a few people who enter incorrect credit card information that you have to notify them later and say, “Hey, this credit card didn't work” and maybe get them to give you a new one. Or to basically say to everybody, “Sorry, you can't order anything right now."

Gunnar: Yeah, that's part of what you need to do when you run your experiment. You need to learn from the outcome and find ways of doing it in a better way, usually. And that might be a question you have to ask. Should we shut down the entire part? Or should we leave good enough service for some users?

Jeremy: Right. And that's probably where communicating with the rest of the company and sharing all of that success, or I guess, non-success, depending on what it is, with the rest of the company and saying, “Okay, if this broke your system, what would you want to do?” Like, “how would you want to handle a failure in your system?” I think that's a good discussion to have with your entire organization.

Gunnar: Yeah, you involve the product owners, or however your organization looks, and let it be a business decision in the end, perhaps.

Jeremy: So now we've got these small experiments running, right, cause we're still containing this blast radius and trying to be smart about not killing our entire system. The next step, though, is to turn it up to 11 and see what happens, right?

Gunnar: That's correct. So, if we've done the experiment on a small set of users and it was a success, we didn't have any failures, the next step is that we're then able to expand on the system and scale it up, and perhaps have a bigger set of users or more countries or more functions. However you contained the experiment early on, now you're able to scale it up, and with that increased scope, you'd usually see some new effects. Injecting latency into one single function might not reveal any issues, but if you inject into multiple functions at once, you might get an effect that you didn't see with a small scope test.

Jeremy: Right. Yeah, and then because again you have those cascading failures and things like that as well. So one function goes down or one function can't connect, that's one thing. But if a whole series of them can't, or multiple services that maybe share some common infrastructure, if that goes down, that's really interesting. And this is also where I think the sort of economic tests come in, the injecting latency: what happens if we slow down? Like on 5% of your users, you may see something there, but if you want to get a larger sample size, that would certainly be where you would want to run the larger test around that.

Gunnar: Yeah, exactly. The business metrics might not, you might not see the effect in the business metrics until you scale the experiment up a bit.

Jeremy: Alright. Awesome. So why don't we now kind of shift this a little bit more towards serverless and just kind of talk about what the challenges are? I mean, we mentioned that you can't turn off DynamoDB, but maybe we can just talk about what are the common things in serverless we might test for like, what are those common weaknesses that you typically see in a serverless application.

Gunnar: Yes, definitely. Well, things we usually see, and this isn’t something that we see only when doing chaos engineering. It's something that we might see in architectures every now and then. It might be that we're missing error handling. So when we have functions receiving errors from downstream services, for example, we aren't handling them in a correct fashion. These are things that we might find when doing chaos engineering experiments in serverless. Timeout values, that's quite common, that we don't have the proper timeouts on our services. Perhaps it's not always an issue, but when we inject latency, we might see that intermediate services have failures when we have the incorrect timeout values. So services at the edge might be affected because there are timeout issues further down. And fallback, missing fallback, rather. That's something that we see every now and then, that a downstream service of some sort, DynamoDB, for example, isn't available, and we don't have a proper fallback for that because we're relying on DynamoDB to be there all the time because that's how cloud works, right?

Jeremy: Well, but that's because every engineer pretty much designs for the happy path, and you just assume everything will work. Like you said, those eight fallacies of distributed computing. So that makes a ton of sense.

Gunnar: And then we have the bigger one with the missing regional failover. Well, quite often, we design systems that aren't distributed over more than one region. So we have them in one single region, and regions rarely fail, but it might happen. And it doesn't have to be that an entire region goes out, either. It might be that network issues closer to the user prevent them from reaching a specific region. So you don't always have enough regional spread if you only use CloudFront, for example. You might need to spread out your Lambda functions as well over multiple regions.

Jeremy: You know, that's something too where, I see this all the time. People build a serverless application, or parts of their application serverless, and they just assume us-east-1 will always be there. You know, I mean, there are hurricanes, there are floods. There are earthquakes. There are all kinds of things that could affect it, and not just one of the availability zones, cause, obviously, availability zones within each region are actually spread out quite a bit. But you could have something that took out an entire region, and then what happens there? But also, the other piece of it, too, is that if that goes down, how are users affected, even if you do have another region? Like what if we have to now route all US traffic to the EU? Right? Or how does the EU behave, or what's the latency like when they're accessing services in us-east-1? Because, obviously you have that speed of light problem, right? It can only get back and forth so fast. So, yeah, that's interesting, too. And also, just if one region goes down, how do you replicate the data? I mean, there's just a lot in that if you're building a really large distributed application. So anything else, any other sort of common things that we see?

Gunnar: One common thing that I see when we're performing these types of experiments is that, like we said before, we don't have graceful degradation, so the systems show error messages to the end users, or parts just don't work but are still there. So we don't have UIs that are non-blocking. And that's a perfect use case for chaos engineering, to be able to find those and, well, then fix them.

Jeremy: Actually, I think that's a good point, one of the timeout things on the UI side. If we just set our API gateways to time out after 30 seconds, which is the maximum that they can time out at, 29 and a half or whatever it is, then the problem is that if our front end is waiting for 30 seconds for that API to respond, that's not gonna be a great experience for the users on the UI side. So shortening those timeouts so that it fails faster would be helpful, because then you can respond. If you expect that to respond in two seconds or less, which you should. I mean, it should be a couple hundred milliseconds, but if you expect it to respond that quickly and it doesn't, then you should probably degrade at that point, right?
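
The fail-fast idea here, giving up on a slow call after a short deadline and degrading instead of waiting out a long gateway timeout, can be sketched with a thread pool and a per-call timeout. The helper name, the half-second backend delay, and the fallback value are assumptions for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool used to run backend calls off the calling thread.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Return fn()'s result, or fallback if it takes longer than timeout_s.

    The worker thread is not killed on timeout; the caller simply stops
    waiting for it and moves on with the degraded response.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the call has not started yet
        return fallback


def slow_backend():
    """Stand-in for a downstream call suffering injected latency."""
    time.sleep(0.5)
    return "fresh data"


print(call_with_timeout(slow_backend, timeout_s=0.05, fallback="cached data"))
print(call_with_timeout(lambda: "fresh data", timeout_s=1.0, fallback="cached data"))
```

Injecting latency and watching which branch users end up on is one way to find out whether the chosen timeout is actually short enough.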

Gunnar: Yes. Correct. I know serverless hero Yan Cui wrote an article about exactly this, and how to use chaos engineering to be able to find or fine-tune the timeout values. So I think that's an excellent use case.

Jeremy: Awesome. All right, so let's talk about serverless chaos engineering. So how do we actually do these experiments? Or what are these experiments, is probably a better question?

Gunnar: Yeah, well, one thing that we have talked about already is latency injection, and I think that's probably the easiest one to get started with. Yan Cui, once again, he used latency injection to be able to properly configure timeout values for functions and downstream services. So, he created an article around this with examples of how to do it. And Adrian Hornsby from AWS has created a Lambda Layer for this as well. So you're able to add a Lambda Layer to your functions and then inject latency and see how the function and the system overall behaves then. So latency injection is quite an easy one to get started with. And then we have status code injection or error code injection. So instead of having the proper 200 status code as a response from your API gateway, you might get a 404 or a 500 error. Something like that. So you can inject those and see how your system behaves when there is an error message, because that's not something that you might normally get. But by injecting them you're able to test the system behavior that way.
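
As a rough illustration of latency and status code injection, here is a minimal home-grown wrapper. It is a sketch under stated assumptions, not the Lambda Layer Gunnar mentions: the `inject_chaos` decorator and the config keys (`enabled`, `rate`, `delay_ms`, `status_code`) are hypothetical names invented for this example.

```python
import functools
import random
import time


def inject_chaos(config):
    """Wrap a Lambda-style handler with optional latency and status code injection.

    config keys (hypothetical, for this sketch only):
      enabled      master switch
      rate         probability in [0, 1] that a call is affected
      delay_ms     extra latency to add, in milliseconds
      status_code  if set, return this status instead of calling the handler
    """
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if config.get("enabled") and random.random() < config.get("rate", 1.0):
                # Latency injection: pause before doing anything else.
                time.sleep(config.get("delay_ms", 0) / 1000.0)
                # Status code injection: short-circuit with an error response.
                if config.get("status_code") is not None:
                    return {"statusCode": config["status_code"], "body": "injected failure"}
            return handler(event, context)
        return wrapper
    return decorator


chaos_config = {"enabled": False, "rate": 1.0, "delay_ms": 20, "status_code": 500}


@inject_chaos(chaos_config)
def handler(event, context):
    return {"statusCode": 200, "body": "ok"}
```

With `enabled` set to false the wrapper passes every call straight through, so it can stay in the deployed code between experiments.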

Jeremy: And you could do like concurrency, manipulate the concurrency. Probably?

Gunnar: Yes, definitely. And to be able to test that properly, you need to be on quite a large scale, usually, to see exactly how it works. But you're also able to configure DynamoDB for the reads and writes. Since we don't have control of the underlying infrastructure, we need to make stuff up to be able to do these experiments in another fashion than perhaps traditionally with chaos engineering. And that's the fun part as well. Since most of the services are API driven, it's quite easy to make changes back and forth. The stop button is a configuration change in many cases. So you can just use the CLI to do a change back and forth when you're doing your experiment.
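
The "stop button is a configuration change" idea can be sketched as a context manager that applies a concurrency cap and always rolls it back. This is a minimal sketch: `throttled_function` is a hypothetical helper, the client is assumed to be a boto3 Lambda client (for example `boto3.client("lambda")`), and the function is assumed to have no reserved concurrency configured before the experiment.

```python
from contextlib import contextmanager


@contextmanager
def throttled_function(lambda_client, function_name, reserved_concurrency):
    """Temporarily cap a function's reserved concurrency, then remove the cap."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved_concurrency,
    )
    try:
        yield
    finally:
        # Stop button: remove the cap again, even if the experiment raised.
        # Assumption: there was no reserved concurrency set beforehand.
        lambda_client.delete_function_concurrency(FunctionName=function_name)


# Usage (needs AWS credentials, so it is shown as a comment only):
#   import boto3
#   with throttled_function(boto3.client("lambda"), "orders-api", 1):
#       ...drive load and observe how callers handle throttling...
```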

Jeremy: Right, and then what about configuration errors? Like if we changed IAM policies, for example, are there other things you could run around that?

Gunnar: Yeah, by doing the configuration changes or permission changes, you might replicate that some service isn't available or isn't responding in the correct fashion. But you might also create a configuration error in a way that might happen every now and then. Because with the way that serverless works, we have so many ways to configure everything. You have configuration on each and every function, so that is something that happens every now and then. It might be on deployment that some configuration change happens or something is missing. And so we're able to do that in a controlled fashion by using chaos experiments.

Jeremy: So could we run a region failure test with serverless? Like how would we do that?

Gunnar: Yeah, that's a bit harder to do in a controlled fashion. But we can do it by configuration changes, by doing permission changes. So we're able to, more or less, lock ourselves out of a specific region. But it's hard to get the exact same effect as if the actual region is down. Because you're probably getting errors that are different from the way it would behave if the entire region is down.
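
One way to approximate "locking ourselves out of a region" through permissions is a temporary deny policy on a role, keyed on the `aws:RequestedRegion` condition key. The sketch below assumes a boto3 IAM client is passed in; the helper names and the policy name `chaos-region-blackout` are hypothetical. As Gunnar notes, callers will see access-denied errors rather than the failures a real regional outage would produce.

```python
import json


def region_blackout_policy(region):
    """Build an IAM policy document that denies every request made to one region."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ChaosRegionBlackout",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"StringEquals": {"aws:RequestedRegion": region}},
            }
        ],
    }


def start_blackout(iam_client, role_name, region, policy_name="chaos-region-blackout"):
    """Attach the deny policy inline to the role used by the system under test."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(region_blackout_policy(region)),
    )


def stop_blackout(iam_client, role_name, policy_name="chaos-region-blackout"):
    """Stop button: remove the inline deny policy again."""
    iam_client.delete_role_policy(RoleName=role_name, PolicyName=policy_name)
```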

Jeremy: I wonder if you could do that with Route 53? You know, because you can do like some of the health checks and things like that. I wonder if you could inject a failure there and then have it route traffic that way?

Gunnar: Yup. Well, if you have those health checks or you have that type of routing in place in Route 53, you might do it that way. That's right, to more or less inject failure into the health checks.
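
Injecting failure into a health check could look roughly like the sketch below, which flips a Route 53 health check's `Inverted` flag so that a healthy endpoint is reported as unhealthy. It assumes failover routing is already in place, that a boto3 Route 53 client is passed in, and that the health check was not inverted to begin with; `failed_health_check` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def failed_health_check(route53_client, health_check_id):
    """Report a healthy endpoint as unhealthy for the duration of the block."""
    # Inverting the check makes Route 53 treat a healthy result as unhealthy.
    route53_client.update_health_check(HealthCheckId=health_check_id, Inverted=True)
    try:
        yield
    finally:
        # Stop button: restore normal evaluation.
        # Assumption: the check was not inverted before the experiment.
        route53_client.update_health_check(HealthCheckId=health_check_id, Inverted=False)
```

While the block is active, traffic should shift to the secondary region, which is the behavior the experiment is meant to observe.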

Jeremy: Alright, so maybe then let's talk about how we actually then run the test. We talked about some of them, but there’s tooling, or there’s some tooling you mentioned. Adrian Hornsby has his lambda layer, but are there other tools that we could use if we wanted to start doing this?

Gunnar: Yes, there are. Looking at chaos engineering as a whole, one of the bigger options today is a tool called Gremlin, and they have a SaaS offering for chaos experiments. And one part of that system is application fault injection. And that way you can inject failure into Lambda functions using Gremlin. And so that's one option. Another option is the open source Chaos Toolkit, which has drivers for the main cloud providers. And you're then able to inject failure into serverless services as well. And then one which I think is a fun one is Thundra, the observability tool. They have added an option to inject failure into the serverless applications that you're observing through there too, so then you're getting both the observability of the chaos, and you're doing the chaos through the same tool.

Jeremy: And so, with the open source one. Is that something that you can build your own experiments? Or are there plugins or something?

Gunnar: Exactly. It's built around, I think they're called extensions, or plugins, that you're able to build yourself. There are a bunch of them out there, but you can easily build your own as well and extend all that. So I would say that the most common way of getting started with it in the serverless space is to more or less build your own. Since you're able to do many of the things through the CLI or the API, you can easily do simple scripts that can perform the chaos and have an easy way of stopping them as well.

Jeremy: But it sounds like if you're building your own stuff and even if you're using any of these other ones, maybe not so much with like Thundra, for example. But you do have to kind of build these tests into your code, right? Like you have to kind of write your code to say “alright, I can fail this by flipping a switch somewhere,” but you actually would have to modify your underlying code. It's not just something that maybe you could just wrap, or is it something you could just wrap?

Gunnar: If we look at the Lambda Layer that Adrian has built, that is actually a wrapper that you're wrapping around your functions. So you still need to have it in your code. But if you have it disabled, it should be fine to be there all the time, and then you're just enabling it when you're doing your experiments.

Jeremy: Because you don't want to change your code if you're running an experiment versus not running an experiment. That kind of defeats the purpose.

Gunnar: Exactly, yeah, because when you're running your experiments, you want it to be as it is in production. And if you have to do deployments every time, as you said, it defeats the purpose.

Jeremy: Alright, so let's say you're a company and you're super interested in doing this and obviously cause it sounds ridiculously fun to go in and maybe break things, but not break things, but so what would you, how would you flip these on and off? Would you do that with environment variables, or how would you start and stop these tests?

Gunnar: I would say that depends on what tooling you're using. If you're using, like, Gremlin or Chaos Toolkit, they have their on and off switch in the systems, and Thundra as well. But if you're using, for example, the Lambda Layer from Adrian, then you're doing it through Parameter Store. So you're having a variable there, or a parameter, where you're able to set it enabled or disabled in quite an easy way, just by updating it using the AWS CLI. And so that way, or if you're building your own, you might do it through environment variables on the actual function as well.
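
A Parameter Store switch of the kind Gunnar describes might be sketched like this. It is not the configuration format of any particular tool: the parameter name `/chaos/config`, the JSON shape with an `enabled` key, and both helper functions are hypothetical, and the client is assumed to be a boto3 SSM client (for example `boto3.client("ssm")`).

```python
import json

PARAM_NAME = "/chaos/config"  # hypothetical parameter name


def load_chaos_config(ssm_client, name=PARAM_NAME):
    """Read the experiment configuration (stored as JSON) from Parameter Store."""
    response = ssm_client.get_parameter(Name=name)
    return json.loads(response["Parameter"]["Value"])


def set_chaos_enabled(ssm_client, enabled, name=PARAM_NAME):
    """Flip the on/off switch without redeploying any function code."""
    config = load_chaos_config(ssm_client, name)
    config["enabled"] = bool(enabled)
    ssm_client.put_parameter(
        Name=name,
        Value=json.dumps(config),
        Type="String",
        Overwrite=True,
    )
    return config
```

A function would call `load_chaos_config` at invocation time (ideally with some caching) and skip all injection when `enabled` is false, so starting and stopping an experiment never requires a deployment.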

Jeremy: Awesome. Okay. All right. So I guess, last question here is chaos engineering, I mean, obviously, to me, it sounds like a no brainer. Do you have any advice for people thinking about doing chaos engineering?

Gunnar: Well, what I would say is that, of course, they should do it. I think it is beneficial for organizations to do it. Even if you don't have a large global scale distributed system, you can still do chaos engineering. But an easy way to get started is by looking at the references that are out there. Chaos engineering is a hot topic today, and there are tons of talks around chaos engineering all the time at different conferences. So look at them on YouTube and read up on it. And just make sure that you start small and you do your small experiments first, and then you're able to scale up.

Jeremy: Awesome. All right, well, thank you, Gunnar, so much for being here. If the listeners want to find out more about you, how do they do that?

Gunnar: Well, I guess the easiest way is to look me up on Twitter @GunnarGrosch, which is hard to spell, but I think you'll find me.

Jeremy: All right. I will put that in the show notes and also those other tools that you mentioned and stuff, I'll get those in there as well. All right. Thanks again, man.

Gunnar: Thanks for having me.
