Continuous Verification - It's More Than Breaking Things On Purpose
Speaker: Adam D'Abbracci
Summary
Adam D'Abbracci discusses the concept of continuous verification, formerly known as chaos engineering, and its importance in ensuring the reliability of services and systems. They share their personal experiences and provide insights into testing methodologies, automation tools, and monitoring practices. The presentation emphasizes the need to anticipate and respond to failures effectively, ultimately enhancing system resilience.
Transcription
Today, we're going to be talking about continuous verification. Some of you might know it as chaos engineering, which is what it was previously called before everyone realized that's a terrible name and nobody's going to sign on to doing it. So the whole industry decided we're going to rename it as continuous verification, and I'll tell you what it is. But first, a little about me.
I've been doing this for a long time. I've worked at Disney, I've worked at The New York Times. I've built my own things, I've broken many of those things, and therefore I know how things fail in production because it's usually me causing those problems. I prefer Vim as my text editor — don't kill me — and I can usually quit it. But oftentimes I just have to force quit the entire program because at some point, you get frustrated and you're just like, "Nah, I'm good with this."
This is a true story: my worst production incident? I put `rm -rf .` on a cron job that ran as root when I thought it was going to run as my user. There were four WordPress websites being hosted by cPanel, and every three hours the entire server got wiped. I had to call the offshore tech support, and they ended up, thankfully, reimaging it from an image from a week before I had made the change. Otherwise, that could have been bad.
Okay, so basically, I presented this to my friends, and they said, "You need an agenda slide." So here's the agenda slide. There are three things: I'm going to tell you about the problem we're facing, I'm going to provide a solution — not *the* solution but *a* solution — and then I'm going to talk to you about applying it and how you can actually do this in your organization, you know, tomorrow or whenever you go back to work.
So, first: the problem. "Everything fails all the time." Now, this is Werner Vogels, Amazon's CTO. Amazon is the company that a lot of you, and probably most people at this conference who aren't using GCP, are expecting to keep your servers up. And that company's CTO is telling you that everything is going to fail. But the reality is, everything does fail all of the time.
For example, Amazon US East 1 — if anyone is familiar — is where everybody hosts their stuff, including Amazon. So when US East 1 went down for seven hours, almost all of Amazon was down. To the point where airlines were being impacted. Disney+? Why would Disney+ be impacted by a single regional outage in AWS? Who knows? Well, it was. Netflix? I didn't even know Netflix still ran in AWS, but apparently, they have some dependency. And of course, the contributing factor was an intentional change, as it usually is.
Facebook. I'm sure a lot of you remember this from last summer. Facebook went down for six hours. Again, intentional change. It was DNS. We're all so surprised. It was so bad that they couldn't even get into their server rooms to reboot the servers manually, because their door scans relied on the Active Directory servers that also used facebook.com as their DNS. So those were inaccessible too. Again, intentional change. Always an intentional change, right? It's never just something breaking.
As Site Reliability Engineers, our primary directive is to improve the reliability of our services. We are there to help the application teams make sure their applications run all the time. We do a lot of things to achieve that. We prepare for things degrading: we build caching, we build CDNs, we distribute our content everywhere. Oftentimes, we're running multiple levels of networking just to make sure that if a problem happens in one place, we can route around it. We also prepare for failure. We build redundancy into every layer, and we build ways to accommodate the things we can't predict, like auto scaling: if we can't predict the traffic we're going to get, we want the system to automatically spin up servers. My favorite is multi-region disaster recovery instances. If we were civil engineers instead of software engineers, that would be the equivalent of building a bridge and then building the same bridge right next to it, where nobody uses the second bridge unless the first one falls over. And that's what we do. We will literally run two full copies of our apps just to make sure.
And of course, the best-laid plans of mice and men: if something does go wrong, we prepare to respond. We capture logging, we capture monitoring, and we hook all these systems up so that if something goes wrong, someone gets paged and they should know right away what happened.
And when we think about all the things that we do as site reliability engineers versus the application developers, the reality is the apps are regularly tested. We have unit testing, code scanning, vulnerability scanning, all these things we put into our CI pipelines to make sure the application is regularly verified. But all this other stuff that we as site reliability engineers build, we really rarely test, right? We usually only test it when it's needed. There's a production outage, we need to restore from a backup, let's hope the backup system works. Let's hope our disaster recovery plans work. When was the last time that failover was tested? Oh, it was the last time the site went down. Of course.
So, I want to do a quick poll. And I know (hey, Dylan, I know) that asking the audience to participate is a gimmick in speeches. But I do want you to raise your hand if you or someone on your team is responsible for one of these things with the question mark next to it. One? Okay, a couple of people. Yep. Okay, now keep your hand up if you've tested those things in the last 12 months. I've got one out of all those people. Okay, keep it up, keep it up. Was that test intentional? Okay, and the last question: would you bet a hundred thousand dollars right now, if I went and broke something, that your disaster recovery plan would work? 100 grand right now? All right, we got one. That's pretty good. Normally, I expect zero.
So, yeah. So the real question is, like, how do we know all these other things that we're constantly building and constantly working on and constantly expecting to protect us against failure, how do we know they're actually going to work when they're needed? And that is where continuous verification comes in.
Nora Jones and Casey Rosenthal are like the godparents of this concept of chaos engineering. They have a whole formal definition of what CV is, but what it comes down to is this: it's unit testing for the stuff that site reliability engineers build. It's unit testing for our stuff. It's testing all of our code with the same level of regularity and precision that application developers test applications. It's not a crazy idea. It's just that most people hear the term "chaos engineering" and assume we're going to break stuff on purpose, but that's not how it is. And I'm here to give you a better way to do this type of testing.
Just to give you a little background, chaos engineering started at Netflix with something called Chaos Monkey. And Chaos Monkey is true chaos engineering: it would literally just turn stuff off in the middle of the day with no warning. The expectation was that the services and systems would respond to that failure. As Netflix got better at what they were doing, they introduced the Simian Army, which would do the same thing across their entire infrastructure, doing way more than just turning off servers. It would interrupt network connectivity, take down databases, all sorts of stuff.
Then, basically between 2011-2012 and 2020-ish, the industry grew very slowly. Very few companies actually adopted this approach, even though Netflix was doing it, and every company thinks they're a Netflix or a Google, so you'd expect them to say, "Okay, we gotta do this too." They didn't. It's not like Kubernetes, where everybody said, "Oh, Google's doing Kubernetes, so we have to do that too." But in 2020, AWS added chaos testing to their Well-Architected Framework, which is what a lot of companies follow when they're designing applications. So it is something the industry is slowly coming around to. And I'm guessing the name "chaos engineering" didn't really help that.
With continuous verification, we can answer things like: Do my backups work? Can my application survive hardware failure? Have any recent changes affected our ability to scale? Things like this, where you kind of assume they're going to work and when they actually happen, you hope they're going to work. But maybe we can verify those ahead of time.
The way we're going to do this is we're going to treat it like a scientific hypothesis, which some of you may remember from your high school science days. A scientific hypothesis is very specific and testable. And that is what we're going to be working on today. We're gonna do three things: We're going to define our test cases. We're going to think about ways to run these experiments. And when you go back to your companies, you'll have some ideas on how to actually execute them. Then we'll talk about how we're going to measure them. So that you know what you're doing is actually working.
We're going to start with defining. Your first instinct is, "I just want my application to stay up." But the reality is, we really need to define how your application should respond. Not just it should stay up, but what should it do in order to stay up? You should be asking yourself, "How should it work? How do you expect it to work in these different conditions?"
For those of you who aren't application developers or don't write code regularly with unit tests, here's what a unit test looks like: You take a function, give it some inputs, and expect certain outputs. For instance, if you have a function that adds two numbers, you'll give it one and one, expect two. Give it two and one, expect three. What happens if you pass a null value? Who knows, we'll find out.
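If it helps to see that concretely, here is a minimal pytest-style sketch of the add-two-numbers example just described; the `add` function and test names are purely illustrative:

```python
# A minimal sketch of the unit test shape described above; the add() function
# and the test names are illustrative, not from any real codebase.
import pytest


def add(a, b):
    return a + b


def test_add():
    assert add(1, 1) == 2  # give it one and one, expect two
    assert add(2, 1) == 3  # give it two and one, expect three


def test_add_with_null_input():
    # "What happens if you pass a null value?" In Python, adding None raises TypeError.
    with pytest.raises(TypeError):
        add(None, 1)
```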
There are two benefits to doing this in application code. One, you verify that your functions work the way you expect them to. You validate your business logic within the function. But the second reason is regression testing. If you're constantly changing things and deploying those changes, how do you know one of those changes didn't break something else? Unit tests in application code help identify when something has deteriorated.
With continuous verification, which we'll just call CV from here on out, we're doing the same thing but with your infrastructure and scaling. A CV test case looks the same as other unit tests. You have a target, which in this case is infrastructure. You have conditions that impact that target, and then specific behaviors that you expect to happen when those conditions impact those targets. We're going to go through this kind of slowly so you understand how to identify these things. Then, you can apply it to your actual infrastructure when you get home.
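As a rough illustration of that target / condition / expected-behavior shape, a test case can be written down as plainly as this; the names and values here are a sketch, not any particular tool's schema:

```python
# A sketch of writing a CV test case down in the target/condition/behavior shape.
# The field values are illustrative examples only.
from dataclasses import dataclass


@dataclass
class VerificationCase:
    target: str             # the piece of infrastructure under test
    condition: str          # the failure or load condition you inject
    expected_behavior: str  # what should happen when the condition hits the target


case = VerificationCase(
    target="web auto-scaling group",
    condition="one instance is terminated",
    expected_behavior="a replacement instance is in service within 10 minutes",
)
```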
We're going to start by focusing on the target. Ask yourself: Where can things go wrong in your application?
Now, I'm sure that some of you have known troublemakers. There's probably that one service that always falls over. Nobody knows how it works, and every time it falls over, you just reboot it and it works. So you're just going to kind of do that forever. But for a lot of people, they probably have never experienced significant enough incidents or regular enough incidents to know, "This is a problematic resource."
So what we're going to do is, we're going to draw a very simple... oh my God, these animations... a very simple architecture diagram. And I'm not talking down to like, "We're running this image on this server with these networking conditions." It's really just the logical network diagram or a flow diagram, right? It's like, what things talk to other things? You have this auto-scaling group that could be a thousand servers. It doesn't matter; it's a bunch of servers. You have this Redis cache. Who cares if it's Redis? It's a cache; it's something that every server relies on. So you just kind of want to lay this out, so you see all the risk points, all of the things that could fail in your system, and how they talk to each other.
Then, you're going to go through and you're going to identify those things. You'll identify each individual component. You'd say, "Oh, I got a database here with a read and write replica, got the auto-scaling group," and then you got these dependencies between them. So you say, "Oh, my servers depend on this cache. My servers depend on these databases." And then, of course, there's always these spooky services that nobody knows what they do, or where they are, or who owns them. But they are somehow critically important to your application, and you just don't want to touch those. But you at least want to identify them because they are a risk to your system.
So now, you have a bunch of targets, basically a bunch of individual components and a bunch of dependency pairings that you just identified. So now, we need to think: What could go wrong with those things that we just identified? This is where it gets really wild, and you can go pretty deep into it, but we're going to try to take it from the surface level and start easy.
To do that, we're actually going to use what's commonly known as a risk assessment matrix. This is used in almost every industry, not just software. The idea is, we are mapping the likelihood versus the impact of those conditions. And we'll go through, and I'll show you some examples. As we go through, you'll see that some of them are very high risk and some of them are very low risk. This prioritization exercise helps you identify what things you should work on first, what things are going to have the most value to your organization, which is how you get people to sign on to this.
Some examples of what these conditions might be: the things we've all heard of, things that break, things that go wrong, etc. My favorite one is when a key engineer leaves the company and their credentials are all deactivated on Friday at 5 pm, and nobody knows why Active Directory decided to do that, but it causes a really hellish weekend for whoever's on call. And if none of these things are scaring you enough, go read the VOID incident report, and you'll stay up for weeks, because it's really scary what kinds of things can go wrong even in the biggest companies.
Some examples of this, looking at the simple dependency graph I just drew you: cert expiration, right? That happens to all of us. Certs expire; they just naturally expire, and somebody's job somewhere is to renew that cert. If they don't, it expires and lots of problems happen. But then something like resource exhaustion can happen at almost every layer. If it happens on your caching level, which every service uses, that's a big problem: that's a red, high-risk problem. But the same resource exhaustion on a single server might not be a problem; if you have 50 servers and one dies, life goes on, so that's green. So you go through and, using this process, you identify which risks are the riskiest for you, which things are of the most value for you to prevent, and which things you really don't want to happen.
If you're doing this with a product manager or someone who's not technical, you'll probably end up with a lot of really high-risk things, because they'll hear, "Oh, resource exhaustion? Scary! Red, right? Credential expiration? Scary, red." But resource exhaustion is not always red. Network latency? Scary, but not red.
But if you have too many of these higher-risk things, give them numbers. Then you can actually order them. You'll have the most valuable, highest risk conditions going down to the green ones, which you probably don't need to do. If it's a three, it doesn't matter. You'll want to start with those nines. You will have nines and probably tens. Now, you have a list of conditions and the targets those conditions can occur to.
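If you want to see the prioritization step mechanically, here's a small sketch that scores each condition as likelihood times impact and sorts the highest-risk ones to the top; the conditions and the 1-to-3 scales are just examples:

```python
# A sketch of the prioritization step: score likelihood x impact, sort descending.
# The conditions and the 1-3 scales are illustrative.
risks = [
    {"condition": "cert expiration on the load balancer", "likelihood": 3, "impact": 3},
    {"condition": "resource exhaustion on the shared cache", "likelihood": 2, "impact": 3},
    {"condition": "resource exhaustion on one of 50 web servers", "likelihood": 3, "impact": 1},
]

for risk in risks:
    risk["score"] = risk["likelihood"] * risk["impact"]

# Work from the nines down; the threes can wait.
for risk in sorted(risks, key=lambda r: r["score"], reverse=True):
    print(f"{risk['score']:>2}  {risk['condition']}")
```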
We're going to look at what you expect to happen when that condition occurs on that target. We're going to use the scientific hypothesis we all know and love: if-then. Essentially, it's like a Mad Libs. For each target you found and each condition you identified, start with the one that's a nine, the highest risk thing. What do you expect to happen, and what's the desired behavior?
Some examples:
- "Will my service self-heal on hardware failure?" So, "If an instance fails in my auto-scaling group, I expect a new instance to be added."
- "Will my service scale?" So, "If I get unexpected load, I expect new instances to be added."
- "Does my backup plan work?"
- "Does my disaster recovery plan work?"
- "Will my failovers actually work?"
These don't have to be chaos. They don't have to be failures. For example, if latency is high, you might want to ensure your paging system is connected. It's about ensuring all the tools you've integrated as an SRE work under the conditions they're intended for.
Now we have test cases. Start with the highest-risk condition and work down, and start with just one. How do you create the conditions you expect to trigger the behavior defined in the previous step?
From easiest to highest investment:
- **Manually** - The internet now calls this "Click Ops", which is going into the console and clicking things. Examples from Amazon include rebooting a server or throttling your Lambda concurrency.
- **Native Tools** - Amazon's Systems Manager lets you execute commands on EC2 instances without SSHing. It's Amazon running commands on your instance using their agent. Their Fault Injection Simulator allows for powerful control and test design.
- **Open Source** - As you progress in implementing continuous verification, one-off attacks might not suffice. There are open-source tools like Chaos Toolkit, which can target every known provider.
It's all written in Python. It's a great tool, and it has a lot of integrations, so it can send notifications and things like that. It's awesome. Litmus used to be free; now, well, I think they still have an open-source version, but yeah, they're good too. And Chaos Monkey, which we talked about earlier.
Now, Toxiproxy is from Shopify, and it's good for simulating network failure. You set up a Toxiproxy instance, route all your traffic through it, and then you can degrade traffic, cause latency, cause packet loss, all sorts of things, through that proxy. It's really easy to use. And then K6 is an awesome load testing tool, if you ever want a good load testing tool. Fire Drill over here is the tool I'm working on.
Fire Drill basically takes all of these and more, and runs them automatically for you in your account, and automatically collects results and normalizes those results. A lot of the glue code that you would, probably in your mind, think, "I'm gonna have to write so much glue code for this," a lot of that stuff can be gotten rid of with Fire Drill. And I'll talk to you a little more about that at the end.
So if you're using Kubernetes (I didn't realize there were going to be so many Kubernetes companies here), there are also Kubernetes-specific tools. And these are probably the best I've ever seen, because if you're running Kubernetes, you have access to everything: you have access to the underlying operating system, you can control the servers. So you can obviously cause a lot of problems if you give one of these tools administrative access to your clusters. I haven't used the Chaos Controller from Datadog yet; that's brand new, but apparently it's kick-ass. So I would definitely give these a try if you're using Kubernetes. Oh, too far.
The beauty of open source, again, is it's free. It's community maintained; you can expect it to work. The initial effort is a bit more. You have to run the tools, get used to the commands, figure out how to give it credentials, and so on. Some of them are command-line tools; some are configuration-driven. It's a bit of a pain in the ass, but automation's possible, and you get a lot more power.
These are really powerful tools, and because you're deploying them as an administrator, you can do really cool things with them, like routing all your traffic through that Toxiproxy, which you couldn't do if you were just using Amazon's tools. They don't want you proxying traffic in their network.
So I keep saying, automation is possible. What I mean by that is anything that's a command-line tool can be automated with a script. If you want to get these things running and start automating, write a script. You can run that script manually or as a build step in your pipeline. For instance, if you're using GitHub Actions, which I know a lot of people are excited about, you can do that. Then, what's cool about writing your own script is you can send logs, send events, send metrics to your monitoring and alert if something goes wrong. Essentially, all the tasks you'd need to manually monitor can be integrated into your script. It's quick and dirty and may not be scalable for larger applications, but for a single application, it's an excellent solution.
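As a concrete illustration of that quick-and-dirty approach, here's a minimal sketch of such a script for the "an instance fails, a new one comes up" case. It assumes an AWS setup with boto3 credentials already configured; the auto-scaling group name, the 10-minute window, and the metric namespace are hypothetical:

```python
# A sketch of a one-off verification script: terminate one instance in an
# auto-scaling group, wait for the group to self-heal, and report the result
# as a CloudWatch metric. ASG name, timeout, and namespace are hypothetical.
import time

import boto3

ASG_NAME = "my-app-asg"  # hypothetical target

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def in_service_instances():
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    healthy = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]
    return group["DesiredCapacity"], healthy


def run_test():
    desired, healthy = in_service_instances()

    # Condition: terminate one instance in the group.
    victim = healthy[0]["InstanceId"]
    ec2.terminate_instances(InstanceIds=[victim])

    # Behavior: the group should return to desired capacity within the window.
    deadline = time.time() + 10 * 60
    passed = False
    while time.time() < deadline:
        desired, healthy = in_service_instances()
        still_there = victim in [i["InstanceId"] for i in healthy]
        if len(healthy) >= desired and not still_there:
            passed = True
            break
        time.sleep(30)

    # Report the result to monitoring so a dashboard or alert can pick it up.
    cloudwatch.put_metric_data(
        Namespace="ContinuousVerification",  # hypothetical namespace
        MetricData=[{"MetricName": "SelfHealTestPassed", "Value": 1.0 if passed else 0.0}],
    )
    return passed


if __name__ == "__main__":
    raise SystemExit(0 if run_test() else 1)
```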
A lot of people don't know that Lambdas can now run Docker images. Any script you write on your local machine can run as a Docker image, and you can host that on a Lambda.
Then you can trigger that Lambda either through your pipeline or through a scheduled cron job. Boom! You've got yourself an automated continuous verification test without any tooling whatsoever. And obviously, if your environment is different, you can probably find a way to do this, or, of course, just run it manually. It's a great start.
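For the Lambda route, the handler itself can stay tiny. Here's a hedged sketch; the event shape is hypothetical, and it assumes the Lambda's execution role is allowed to reboot the target instance. An EventBridge schedule (the cron job) or a pipeline step can then invoke it:

```python
# Minimal sketch of a container-image Lambda handler that injects one condition.
# The event shape is hypothetical; the execution role must allow ec2:RebootInstances.
import boto3

ec2 = boto3.client("ec2")


def handler(event, context):
    # e.g. event = {"instance_id": "i-0123456789abcdef0"}
    instance_id = event["instance_id"]
    ec2.reboot_instances(InstanceIds=[instance_id])
    return {"status": "condition injected", "target": instance_id}
```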
Now, at some point in your journey, or maybe right away, you're going to realize you need something more powerful than a bash script that runs in a Docker container. So there are SaaS products that do this. The three major players in the industry are Harness, which just bought Litmus; Verica, which is Kubernetes-specific; and Gremlin, which is kind of the top-of-the-line tool right now.
The beauty of SaaS products: obviously they cost money, sometimes a lot of money, and there's a bit of effort involved. There are salespeople you've got to deal with (sorry to any salespeople hearing this), and it's a high-touch thing. But once you get set up, it's super high automation and super powerful. These things will automate out of the box, they'll connect with your existing tools, they'll have integrations; you click some buttons and boom, you've got yourself a fully fledged platform to do this type of testing. But the cost is prohibitive for a lot of people.
There are a bunch of different ways to run these. If you have a specific configuration, like, I want to test Redis for example, there's probably open source tools out there that just test the thing you want to test. So do some Googling before you pick anything.
You know what you're going to test, you know how you're going to test it, but now we need to make sure that the test actually works and that whatever you are testing passed whatever conditions you expected to pass. So you're going to ask three questions: Did it work? Did it work correctly? Which are two different questions almost all the time. And did anything else go wrong?
Really, the beauty of this type of testing is not just verifying that your stuff works, but that nothing else breaks when these things happen because you'll find very quickly that dependencies you didn't realize were dependencies are actually critical dependencies and when you break them, a lot of things go wrong that you would have never predicted in a million years. But this is a good way to figure that stuff out.
So, we're going to turn to our existing monitoring tools to do this. When I talk to people about this, they ask, "Well, how do you know that nothing else is breaking?" And my answer is, "How do you know now that nothing is breaking?" The answer is you have monitoring. If you don't have monitoring, then forget all of this; you have a lot of work to do before you get to this point. I assume you have some kind of application monitoring. And if you don't, choose the native tools. If you're running in Amazon or GCP (and I assume Azure has something similar), there are native tools: native logging, native monitoring, you can set up alerts, you can set up dashboards, and if you log as JSON in your application, you can actually produce your own metrics from your logs. So there's no excuse not to be monitoring your application at this point. There are tools to do it.
But there are so many metrics, and that's really where I think people get a little confused. You look at the list of metrics in something like the Amazon Metrics Explorer and there are just hundreds of thousands of them. So we're going to go through and, for each test case that we defined in the beginning, figure out which metrics we need to actually test that thing. For example, before we just said that if an instance fails, a new instance should come up. But the reality is, if one instance fails, one instance should come up; if two fail, two should come up. You can write multiple test cases against the same exact condition by just changing that number. Same thing with error rate: at what point do we expect our system to start reacting? An error rate of 5% could be anything, but at an error rate of 10%, something's wrong, so you might expect something to happen at that point. Along the same lines, once you get more advanced...
You're going to want to define: What would you consider a failure? For instance, if it takes more than 30 seconds for your instance to come up, that could be a huge problem for your application. Or, what if your latency on your application - on your home page, your API - spikes while these things are happening? You might not want that, and you might be expecting that your system is going to respond fast enough that that latency doesn't go up. So, you want to say that in this test case, if that happens, it failed. We don't consider it a pass even if the application stayed up and healthy. If it goes over this SLA that you have, fail and don't release whatever it was you were testing.
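To make those pass/fail checks automatic, the test script can pull the metrics for the test window straight from the monitoring system. Here's a hedged sketch using CloudWatch; the namespace, metric name, load balancer dimension, and threshold are illustrative, so swap in whatever your monitoring actually exposes:

```python
# A sketch of pulling one metric for the test window so the expectation
# ("at this error rate or latency, the test fails") can be checked automatically.
# Namespace, metric name, LoadBalancer dimension, and threshold are illustrative.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)  # the window the test ran in

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Sum"],
)

points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
errors_per_minute = [p["Sum"] for p in points]

if any(count > 100 for count in errors_per_minute):  # illustrative threshold
    print("FAIL: 5xx count breached the threshold during the test window")
else:
    print("PASS: error rate stayed within the agreed limit")
```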
Now, a little side path here: things might go wrong. Actually, I don't mean might; they will go wrong. You're intentionally breaking things. Things are going to go wrong, especially the first few times, and in ways that you couldn't have predicted. To head off the inevitable management questions, we want to define both what we consider a test failure, which we talked about, and also: what do we do if something goes terribly wrong? At what point do we stop the test and let the system recover on its own? And at what point do we consider this an incident? Because the reality is, the first few times you do this, something is going to break so badly that it goes down. I always recommend doing it in lower environments first for that reason. But if something breaks and goes down, you learn something from that. You might not want to declare an incident in Dev, but if you're doing this in production, there's definitely going to be a level where you say, "I need to call someone about this." And of course, if you already have defined SLOs, use those. You've already done all the work.
Now, one last thing to mention about this: You'll notice that the database deadlocks go 1, 1, 2 versus being different levels. That's because there are things in production where, if they happen even once, that is a problem. Like, let's say you have credit card processing. You can't allow more than one failed credit card process in prod. You might get sued for that. There are legal ramifications for a lot of these things if you're dealing with PCI or PII. So, you might want to have some really stringent rules if that's your case.
After you've run the test, how do you actually capture those results? There are a couple of ways to do it: if you already have your monitoring tools set up, you can just go into your monitoring tool, find the window your test occurred in, and capture all these metrics from there. But you basically want to capture four things (there's a small sketch of recording them after this list):
- Did the condition occur? That's very important. Let's say your condition is an error rate of 10%. If your error rate never got up above 7% during your test, then the condition didn't occur. So, nothing should have happened.
- You want to make sure that the behavior occurred.
- You want to make sure that the test conditions were met, like the time box or the latency or whatever.
- And then, finally, this is optional, just capture all the metrics that happened during the time for future reference: latency, error rate, CPU usage, stuff like that.
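One lightweight way to keep those four things comparable run over run is to record them as a small structured result per test; the field names and values here are just illustrative:

```python
# A sketch of recording the four items above for one test run; the field names
# and values are illustrative, not a required schema.
result = {
    "test_case": "ASG self-heal on instance failure",
    "condition_occurred": True,   # the instance really was terminated
    "behavior_occurred": True,    # a replacement instance came into service
    "conditions_met": {
        "replacement_within_10_minutes": True,
        "p99_latency_under_sla": True,
    },
    "metrics_snapshot": {"error_rate_pct": 0.4, "p99_latency_ms": 310, "cpu_pct": 62},
    # Include the exact time window in any dashboard links you capture.
    "dashboard_link": "https://monitoring.example.com/d/app?from=1700000000&to=1700000900",
}
```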
I always recommend, if you're copying links to your monitoring, that you include the correct time window in the links. People will click it, they'll say, "I don't see anything", and you'll be like, "Did you click the link I just sent you?" The reality is they probably just have the wrong time window.
So, you've defined your test cases, established how you're going to monitor this app, and how you're going to make sure the test doesn't blow up. Now it's time to get management buy-in, which is everybody's favorite part of being a technologist. I'm going to tell you right now that getting people on board with this type of testing is going to be your biggest obstacle.
There are very few technical limitations. It's very easy to do this type of testing, but the people need to be on board. When you go to your manager, or their manager or whatever, and you say, "I'm going to break things in prod on purpose," they're going to say, "Absolutely not." So, you need to convince them that it's a good idea because they won't get it right away.
So I have a couple of suggestions. First of all, as we talked about, accurate and reliable monitoring is the most important thing to getting buy-in. You need to be able to point to your Datadog or Dynatrace or whatever and say, "Last time this incident happened, we picked it up but it was too late. And here on our dashboard, I can show you where this metric went crazy." So you can show whoever is concerned, "We know if something breaks. We're anticipating that something might break, and we're ready for it." So having that monitoring is so important and being able to demonstrate it is so important.
The second thing is: Think like a showman. And I mean like the Greatest Showman, not just like a general showman. We need to put on a show. We want people to be wowed. So I always recommend the first demo that you do to whoever's making the decision should be something with a little bit of a wow factor. Find an incident that took down your entire website and see if you can validate that behavior. Something where even someone with zero technical knowledge would see what you're doing and understand the value in the testing. So, think, make it a show. Come up with a demo, come up with a script, practice the script, maybe get your teammates in on that, and then when you present it, just really give it your all. Really sell it like you're a salesman trying to sell a product, even though it's something that you built.
And then finally, as mentioned, don't use the word "chaos" when even talking to other people in Slack. Just don't say it. The new term is "reliability management". So if anyone's like, "Oh, what are you talking about?" say "reliability management", not "chaos engineering", because it'll scare people away. And as you're approaching this, go from zero to one first. Go for the low-hanging fruit. Don't put a lot of investment into this. Just try one thing, get it working, and if all you're doing is writing a bunch of one-off scripts for a really long time, that is fine. It's better than nothing. And what you'll find is, while you're doing this with your own application, other people will see what you're doing and see the value in what you're doing. And the fact that you've already gotten approval to do this means you've done the hard part, right? Nobody wants to get management approval for anything. So you'll find people coming to you and saying, "How do I get in on that?" And that is the beauty of this little section here. You'll be able to not only get other use cases, you'll be able to demonstrate the value of this type of testing in other applications.
So let's assume that you have gotten management approval, you've got your first test case, you're ready to go. You know what you're doing. That's you, by the way, celebrating your management approval. So what do you do in preparation for your first continuous verification test? I almost said chaos test, but we're not calling it chaos anymore.
First thing: set up an SOS channel. If you use Google Meet or Zoom or whatever, get the URL ahead of time. If you use Slack or Teams, set up the channel ahead of time. And in every single one of your communications about this test, include those links. What you tell people is, "If anything happens to your app during this testing window, come in here, and we'll stop the test. Come into this channel, come into this Slack room, whatever it is, tell us, and we'll stop the test." And that is mostly to reassure people that you understand the risk of what you're doing. You're not just breaking things and hoping it works. You actually know there might be an impact from this, and you're preparing for that.
The second thing is notifying stakeholders. So these are the four main groups of stakeholders. You probably don't need Customer Support and Executives if you're not doing this in prod. But notify the stakeholders. I mean, notify them two months in advance, one month in advance, the day of, and after the test. This number three is so important. People always forget this after the test is complete because what will happen is you'll run the test and then four hours later somebody says, "Hey, my app is down. Is this your fault?" And you know it was not your fault. That test was done three hours ago, but they're still gonna blame you. So get ready for that.
So just warn them when it's done. When you're actually running the test, the first many times you run these tests, stare at your dashboards, reload your app. If it's a mobile app, reload the app on your phone. Watch your app. And it's not just for you, even though it's very reassuring to know that your app is still working. It's the fact that there's going to be somebody expecting you to do this. If you're breaking things on purpose, they're going to expect you to know when it breaks, if it breaks. It shouldn't break. So just constantly checking it. It'll make everybody happier. And eventually, you'll get to the point where you don't need to do that anymore.
And finally, every failure, especially in this type of testing, is a learning experience. And I mean that seriously. Because if you can break something on purpose by creating some kind of condition that might actually happen in your infrastructure, and it causes a problem, be glad that that problem occurred while you were sitting there watching it and not at one in the morning on Sunday, on Christmas, or whatever. Because that's usually when the worst things happen. So if you can do this and you can prove that it happened, even if you broke something, you proved that it can break. You can fix it, and you can fix it in a controlled manner, not just like, "Well, now we got a hotfix prod." Right? You can actually do it the right way. So it actually is a learning experience.
And so that's kind of it for the main content. I want to quickly talk about the tool I'm building, which is Fire Drill. I was hoping to have a demo ready, but I got distracted and didn't get it deployed, so I don't have a demo for you. But if you want in on the beta, I'll explain it to you and everything: just send me an email at hello@firedrill.sh. If you go to the website, I have a video of it working against a WordPress site, so you can see how it actually works. The idea is basically, instead of you writing all that glue code, you just pay me, and I'll write all the glue code, essentially. That's how it works.
But that's it for me. I'm happy to answer any questions. If you want to talk about continuous verification, chaos testing, anything like that, feel free to reach out. I'll hang around for a while to answer questions.
Thank you so much.