Application Layer

Overview

Why application-level fault injection is useful

Operators think in requests

Most metrics, dashboards, and alerts that we consume are in terms of requests. RPS, error rate, and latency all implicitly use a request as a unit of work. Requests are not a concept available at the infrastructure-level. At that level, all we see are streams of packets with IP addresses and ports. By moving up to the application-level, we can use all of the request-level metadata in constructing an attack.

Operators Think in Requests

Since requests can include identifiers like customer ID, device ID, country, etc, those facets may be used in constructing an attack. When you have that ability, it is much easier to create a small, well-defined blast radius in your attack. That, in turn, allows for much faster feedback loops and lets you discover latent problems more quickly.

Fault injection without system access

Injecting infrastructure failures requires running a process and accessing other system-level resources. In serverless environments such as AWS Lambda, Google Cloud Functions, and Azure Functions, this access is impossible. In these cases, it is necessary to include the fault-injection mechanism within the application itself. ALFI runs in the JVM as a library, so once you have integrated it into your application, you may use it in any environment.

Examples

  • Simulate an outage in production by creating an attack on your customer ID only. Then you can look for signs of problems when logged in as yourself, while no other users are even aware an attack is occurring.
  • Simulate a problem with a specific endpoint. Partial failure in distributed systems is quite common - some endpoints may be unavailable while others are working perfectly. In order to simulate such a scenario, you can create an attack targeted to some endpoints only and then determine how your system reacts.
  • Always-on failure testing. If you limit an attack to a set of devices you control, then you can run tests against those devices on a regular basis and evaluate how the user experience works when the system is degraded.