We've seen a tremendous transition in the architecture of our systems over the years, from basic, linear systems to increasingly sophisticated, non-linear systems. We've moved away from monolithic programs, where a single person could comprehend the entire operation of a system, and toward a distributed world dominated by a microservices design.
This change has made system behavior harder to predict, and it is now difficult to pinpoint where problems arise across a network of hundreds of microservices. In a monolithic system, you could debug a fault by asking someone who understood the application end-to-end. In the age of microservices, building a complete mental model of service interactions is far from simple.
I've been emphasizing the importance of Chaos Engineering for the past few years. I had the opportunity to give a talk on Chaos Engineering with Chaos Mesh at KubeCon + CloudNativeCon Europe 2023. I explained how Chaos Mesh is a powerful tool that lets developers experiment with many failure scenarios, ultimately increasing confidence in their systems' capacity to endure real-world conditions.
The need for Chaos Engineering
Testing these complex, distributed systems is challenging. Traditional testing tools like JMeter struggle to test these applications effectively once they are running in production, which highlights the need for a different mechanism: chaos engineering.
Chaos engineering isn't a new concept. It has been around for over 13 years, with Netflix's Chaos Monkey being one of the earliest implementations. As systems have matured, so have the tools, leading to the emergence of cloud-native chaos engineering.
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production. It involves injecting failures into the system to understand its behavior under different conditions and catch potential issues before they occur in production.
How does Chaos Engineering work?
Chaos engineering operates on principles that start with defining a steady state representing how our application should behave. We then hypothesize that the steady state will continue, despite introducing real-world variables such as latency or node failures. By injecting these failures, we can observe how our application behaves and see if our hypothesis is disproven.
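The steady-state hypothesis described above can be sketched in a few lines of Python. This is a minimal illustration, not part of Chaos Mesh: the `SteadyState` class, the `hypothesis_holds` function, and the threshold values are all hypothetical, and the latency samples are made up rather than measured.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds describing 'normal' behaviour (illustrative values only)."""
    max_p99_latency_ms: float
    max_error_rate: float


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer math
    return ordered[max(rank, 1) - 1]


def hypothesis_holds(steady, latencies_ms, errors, total):
    """Return True if the observed metrics stay within the steady state."""
    within_latency = p99(latencies_ms) <= steady.max_p99_latency_ms
    within_errors = (errors / total) <= steady.max_error_rate
    return within_latency and within_errors


if __name__ == "__main__":
    steady = SteadyState(max_p99_latency_ms=500, max_error_rate=0.01)
    baseline = [100] * 100             # made-up samples before injection
    during = [100] * 90 + [900] * 10   # made-up samples while latency is injected
    print("baseline ok:", hypothesis_holds(steady, baseline, errors=0, total=100))
    print("during chaos ok:", hypothesis_holds(steady, during, errors=0, total=100))
```

In a real experiment the samples would come from your monitoring system, and a failed check is the useful outcome: it shows where the hypothesis was disproven.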
Doing chaos engineering in production might seem counterintuitive. After all, turning things off in production isn't a good idea, right? This is where minimizing the blast radius comes into play: we carefully select which nodes or pods to run chaos experiments on so that customers are not negatively affected.
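One way to picture limiting the blast radius is to cap how many pods an experiment may touch. The helper below is a hypothetical sketch (the function name, pod names, and limits are assumptions for illustration); Chaos Mesh expresses a similar idea through the selector and `mode` settings of an experiment, so in practice you would configure it there rather than write it yourself.

```python
import random


def pick_targets(pods, percent, max_targets, seed=None):
    """Pick a random subset of pods: `percent` of them, capped at `max_targets`.

    At least one pod is chosen when any pods exist, so the experiment
    still exercises something on small deployments.
    """
    rng = random.Random(seed)
    count = max(1, len(pods) * percent // 100)
    count = min(count, max_targets, len(pods))
    return rng.sample(pods, count)


if __name__ == "__main__":
    pods = [f"web-{i}" for i in range(20)]  # hypothetical pod names
    print(pick_targets(pods, percent=10, max_targets=3, seed=1))
```

Keeping both a percentage and a hard cap means the experiment scales with the deployment but never grows beyond what the team has agreed to risk.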
Communication is key when conducting chaos experiments. Everyone on the team should be aware of the what, why, and when of the chaos experiments. And as we keep adding new features to our applications, we need to redo these experiments to ensure the application's resilience.
Introduction to Chaos Mesh as a Chaos Engineering Tool
The architecture of Chaos Mesh is straightforward. As a user, you first install Chaos Mesh on your Civo Kubernetes cluster. Installation is done via Helm, and once complete, it sets up a Controller Manager and a Chaos Daemon as a DaemonSet across all nodes.
From here, users create a custom resource with a specific chaos experiment type. When you install Chaos Mesh, it comes with multiple Custom Resource Definitions (CRDs), each representing a different experiment. So, if you want to run a stress experiment (StressChaos), you create a Kubernetes object of that experiment type and specify the details of the chaos you want to inflict.
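As a rough sketch of what such an object could look like, the Python below builds a StressChaos manifest as a dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The field names follow the Chaos Mesh `v1alpha1` API as I understand it, so check them against the CRDs installed on your cluster. The experiment name, namespace, and `app` label are hypothetical.

```python
import json


def stress_chaos_manifest(name, namespace, app_label,
                          workers=1, load=50, duration="60s"):
    """Build a StressChaos object that puts CPU pressure on one matching pod.

    Field names are assumed from the Chaos Mesh v1alpha1 API; verify them
    against the CRD version you have installed.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "StressChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "mode": "one",  # affect a single pod out of those selected
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
            "stressors": {"cpu": {"workers": workers, "load": load}},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    manifest = stress_chaos_manifest("cpu-stress-demo", "default", "web")
    print(json.dumps(manifest, indent=2))
```

Saving the output to a file and applying it with `kubectl apply -f` would create the experiment. Generating the manifest from code is only one option; writing the YAML by hand or using the dashboard works just as well.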
Once you've created the object, the Kubernetes API server notifies the Chaos Controller Manager, which recognizes the object as one of its own and assigns it to a Chaos Daemon. The Chaos Daemon then carries out the chaos experiment, deciding which node and pod to affect based on your specifications.
After the chaos experiment concludes, the results are sent back to the dashboard, providing a visual representation of the experiment's process and effects.
Find more information on how to install Chaos Mesh with Civo Marketplace here.
What’s new with Chaos Mesh?
Over time, Chaos Mesh has seen substantial updates, with the most recent being version 2.5. This update introduced several new features that build upon those from versions 2.0 and 2.4.
The latest features include multi-cluster chaos experiments, which allow you to install Chaos Mesh on one cluster and connect a remote cluster to it. This means you can create an experiment on one cluster (Cluster 1) and target it to run on another cluster (Cluster 2).
Version 2.5 also introduced HTTP chaos with TLS support, allowing you to bypass TLS using self-signed certificates. Additionally, the new workflow UI is now enabled by default, providing an improved user interface for managing your chaos workflows.
As we move forward in 2023, Chaos Mesh continues to evolve, offering more advanced features and flexibility for chaos engineering in Kubernetes environments. By integrating this tool into your testing processes, you can ensure your systems are more robust and capable of handling unexpected disruptions.
Embracing Chaos Engineering and Chaos Mesh
To effectively manage and improve our software infrastructure, we must embrace chaos, not fear it. Chaos engineering has grown from its infancy into an integral part of a well-architected software framework. As we design and develop our applications, chaos engineering principles should be at the forefront of our considerations, embedded within the very fabric of our architecture.
Learn more about Chaos Engineering
For those new to chaos engineering, start with learning the basic concepts and theories. Several insightful books on chaos engineering are available to guide you. But don't stop there. Engage with the maintainers and practitioners of chaos engineering tools like Chaos Mesh. Their insights into real-world use cases can prove invaluable when implementing chaos experiments in your own applications.
As with any complex system, effective communication is key. When running chaos experiments in production environments, it's crucial to convey what, when, and where you're implementing chaos.
If you’re interested in this topic, you can find more from my previous talks on Chaos Engineering here:
- The evolution of Chaos Engineering and Litmus Chaos | Civo Meetup
- Cloud Native Chaos Engineering with Litmus Chaos | KubeCon NA 2022
- Chaos Engineering | Chaos Carnival 2022
The growing Chaos Mesh community
Chaos Mesh continues to grow and evolve, with new features being added regularly. Stay informed by joining community groups such as the CNCF working group and the Chaos Engineering Working Group, both of which offer valuable resources, including best practice guidelines.
Your participation matters. Open-source projects like Chaos Mesh thrive on contributions from users like you. Whether it's through documentation, development of new experiments, or providing feedback on missing features, your input is invaluable. In fact, getting involved could set you on a path to becoming a maintainer of the project, driving it forward to new heights.
The Chaos Mesh project hosts a community call each month and has a dedicated channel on the CNCF Slack where you can ask questions and engage with the development community.
Conclusion
As we continue to advance into 2023, Chaos Mesh offers a powerful and flexible toolset for anyone looking to incorporate chaos engineering into their software development lifecycle. By understanding, embracing, and properly communicating chaos, we can build more resilient systems and contribute to the continued growth of this exciting field.