Chaos and Reliability Engineering techniques are quickly gaining traction as essential disciplines for building reliable applications. Many organizations have embraced Chaos Engineering over the last few years.
Creating reliable software is a fundamental necessity for modern cloud applications and architectures. As we move to the cloud or rearchitect our systems to be cloud native, those systems become distributed by design, and the potential for unplanned failures and unexpected outages increases significantly. Moving to DevOps has further complicated reliability testing.
The importance of testing for reliability
Testing disciplines such as QA emerged in response to things that break consistently enough to warrant a particular testing methodology.
For example, unit tests verify that the code we write does what it's supposed to do. Integration tests verify that the code plays nicely with the rest of the codebase. Sometimes we also have system tests that attempt to verify that the entire system conforms to design specifications.
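To make those distinctions concrete, here is a minimal unit-test sketch in Python. The calculate_order_total function is a hypothetical stand-in for real application code; an integration test would exercise the same logic against its real collaborators, and a system test against the deployed system as a whole.

```python
import unittest


def calculate_order_total(prices, tax_rate=0.0):
    """Hypothetical application code: sum item prices and apply tax."""
    subtotal = sum(prices)
    return round(subtotal * (1 + tax_rate), 2)


class CalculateOrderTotalUnitTest(unittest.TestCase):
    """Unit test: verifies a single function in isolation."""

    def test_applies_tax_to_subtotal(self):
        self.assertEqual(calculate_order_total([10.00, 5.00], tax_rate=0.10), 16.50)

    def test_empty_order_costs_nothing(self):
        self.assertEqual(calculate_order_total([]), 0.0)


if __name__ == "__main__":
    unittest.main()
```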
Traditionally, development teams would pass their code to be tested in order to verify that it worked as expected or to find issues that needed to be fixed.
At this point, the code would be tossed over the proverbial wall to an operations team whose job it was to make that code run in a production environment. Operations bore the responsibility for getting stuff running, and because of the uniqueness of each organization's environment, individual operations teams would come up with their own strategies and plans.
DevOps merged the development and operations teams together and made them share responsibility for production readiness and deployment. Agile and DevOps software processes have increased our development and deployment velocity by orders of magnitude so we can get products and features to customers faster.
But the faster code is created and checked into master, the more tests QA has to write and the more frequently they have to write them. With faster velocity, the chances that an occasional error will slip past grow higher. To keep up, testing has been automated as much as possible.
Additionally, we have moved to microservices and other distributed, cloud-based architectures. These distributed systems have emergent behaviors, responding to production conditions by scaling up and down to deliver a seamless experience under growing customer demand. In other words, these systems never follow the same path to arrive at the customer experience. Emergent behaviors also mean emergent failures. Distributed systems will fail, but it's unlikely that they will fail the same way twice.
Our previous understanding of tests does not account for the unique and constantly changing production environments of today. The Ops side of DevOps does its best to make things work, but its mandate frequently covers only getting the code into production and then hoping for the best, rolling back changes, or making hotfixes when failures occur. Ops teams automate some testing, but they don't typically run tests that would uncover system failures arising from turbulent conditions in production.
Traditional QA is not enough anymore
Traditional quality assurance covers only the application layer of our software stack. No amount of traditional QA or other conventional testing will verify whether our application, its various services, or the entire system will respond reliably under every condition, whether "working as designed" or under extreme loads and unusual circumstances.
A failure at any layer of the software stack or the application itself can disrupt the customer experience. Traditional QA testing methods will not catch these potential problem conditions before they actually happen.
Furthermore, most traditional QA activities have been absorbed into other teams. Many tests are now automated in CI/CD pipelines and watched over by an SRE or DevOps team, and finding and fixing problems has become the responsibility of service owners. Adding to that is the undeniable fact that it is impossible to build testing and staging environments that accurately mimic production.
How does Chaos Engineering help testing evolve?
Because Chaos Engineering can test the quality of code at runtime, and lends itself to both automated and manual forms of testing, the discipline has emerged as a powerful tool in the new Quality Assurance toolbox.
Earlier we explained how distributed systems are constantly changing, which means they'll never break the same way twice, but that they will break. Chaos Engineering helps businesses mitigate this risk by allowing engineers to simulate how their systems will respond to failures in a safe and controlled environment.
We use chaos experiments on canary instances to simulate conditions we know have the potential to cause problems, like network latency. Does the new service hold up under light stress? Medium? Heavy? We push the new instances hard. In production. We gradually build up and even test past the point where we expect things to work. And we learn things. What we learn often creates opportunities to refine our work further in the next build.
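As an illustration, here is a minimal sketch of such a latency experiment, assuming it runs on a Linux canary host with permission to use tc/netem and that the canary exposes a health endpoint. The interface name, delay steps, and URL are placeholders; in practice a dedicated tool such as Gremlin, or a service mesh's fault-injection features, would manage the injection and the abort conditions.

```python
import subprocess
import urllib.request

CANARY_HEALTH_URL = "http://canary.example.internal/health"  # hypothetical endpoint
DEVICE = "eth0"                                              # assumed network interface


def set_latency(delay_ms):
    """Inject artificial egress latency with Linux tc/netem (requires root on the canary)."""
    subprocess.run(["tc", "qdisc", "replace", "dev", DEVICE, "root",
                    "netem", "delay", f"{delay_ms}ms"], check=True)


def clear_latency():
    """Remove the netem qdisc, restoring normal network behavior."""
    subprocess.run(["tc", "qdisc", "del", "dev", DEVICE, "root"], check=False)


def canary_is_healthy(timeout_s=2.0):
    """Return True if the canary still answers its health check in time."""
    try:
        with urllib.request.urlopen(CANARY_HEALTH_URL, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        return False


if __name__ == "__main__":
    try:
        # Ramp from light to heavy latency, checking the impact at each step.
        for delay_ms in (50, 100, 250, 500, 1000):
            set_latency(delay_ms)
            healthy = canary_is_healthy()
            print(f"delay={delay_ms}ms healthy={healthy}")
            if not healthy:
                break  # abort condition: stop escalating once the canary degrades
    finally:
        clear_latency()  # always roll the experiment back
```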
Chaos Engineering is the only way to find systemic issues in today's complex reality, whether or not we use canary deployments. How will our REST API-driven inventory service behave when network latency increases by two milliseconds? What happens when a large number of delayed requests all hit the microservice concurrently? How do we know? We test it.
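To show what "we test it" can look like for that last question, here is a hedged sketch that fires a burst of concurrent requests at a hypothetical inventory endpoint and reports how many succeed and how slow the slowest success was. The URL, timeout, and burst size are placeholders, not recommendations.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

INVENTORY_URL = "http://inventory.example.internal/items"  # hypothetical service endpoint
CONCURRENT_REQUESTS = 200                                  # size of the burst under test


def timed_request(_):
    """Issue one request and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(INVENTORY_URL, timeout=5.0) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False
    return ok, time.monotonic() - start


if __name__ == "__main__":
    # Release the whole burst at once and observe how the service holds up.
    with ThreadPoolExecutor(max_workers=CONCURRENT_REQUESTS) as pool:
        results = list(pool.map(timed_request, range(CONCURRENT_REQUESTS)))

    successes = [latency for ok, latency in results if ok]
    print(f"succeeded: {len(successes)}/{CONCURRENT_REQUESTS}")
    if successes:
        print(f"slowest successful request: {max(successes):.2f}s")
```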