Transport Layer Security (TLS), and its preceding protocol, Secure Sockets Layer (SSL), are essential components of the modern Internet. By encrypting network communications, TLS protects both users and organizations from publicly exposing their in-transit data to third parties. This is especially true for the web, where TLS is used to secure HTTP traffic (HTTPS) between backend servers and customers' browsers.
TLS is such a critical part of the modern web that browsers and search engines will penalize unencrypted websites. Unsecured pages are displayed with warnings and given reduced SEO rankings. This caused a surge in websites using TLS, growing HTTPS traffic on the desktop from just 45% of websites to 98%.
While TLS adoption has gotten easier through initiatives like Let's Encrypt, it's not without challenges. For one, a TLS certificate is only valid for a certain period of time (called the validity period). Security teams need to request new certificates and roll them out over existing ones before the old ones expire. If a certificate's expiration date lapses, customers will see an alarming warning when trying to access your website or service.
Additionally, organizations often have multiple certificates in rotation for different services. Security teams need to track which certificates are in use, where they're deployed, and when they're due for renewal, creating logistical overhead. The risk of a certificate expiring and bringing down a critical service increases as the size and complexity of the service grows. For example, in 2020, Spotify had a nearly hour-long outage when an expired certificate brought down one of their main endpoints.
To add to this challenge, certificate renewal is an infrequent maintenance task. Renewals only happen once every few months to every few years (depending on the certificate's validity period). Some certificate providers support fully automated renewals, which makes it even less likely that teams will catch renewal problems until they happen. This creates several risks, such as:
- Automated renewal notifications falling through the cracks or getting ignored.
- Security team personnel changing and losing track of ownership over certificate rotations.
- Expiration dates changing when certificates renew.
Since certificates are time-sensitive, and different certificates can expire at different times, we need a way to continuously check for expiring certificates across multiple services. But how do we test whether a certificate is expiring? Fortunately, we can use Chaos Engineering to help.
Using Chaos Engineering to detect expired certificates
With Chaos Engineering, we can simulate the conditions that would cause a certificate to expire. First, let's look at how certificates are validated.
When a device connects to an encrypted website, it downloads the website's certificate and checks the expiration date against its own internal system time. If the expiration date and time fall after the current date and time, then the certificate is valid. However, if we change the device's time to after the expiration date, the device will think the certificate is expired. In other words, by "time travelling" into the future, we can accurately detect when a certificate is going to expire simply by connecting to a website.
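The comparison described above can be sketched in a few lines of bash. This is only an illustration of the logic, not how any particular TLS client implements it: the function name `looks_expired` and its epoch-second arguments are assumptions made for this example, and only the expiration side of the validity period is checked.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) if a certificate whose validity ends
# at NOT_AFTER would look expired to a client whose clock reads NOW + OFFSET.
# NOT_AFTER and NOW are seconds since the Unix epoch; OFFSET is in seconds.
looks_expired() {
  local not_after=$1 now=$2 offset=${3:-0}
  (( now + offset > not_after ))
}

# Example values (made up): a certificate that expires 10 days after "now".
now=1610000000
not_after=$(( now + 10 * 86400 ))

if looks_expired "$not_after" "$now" 0; then
  echo "expired with the real clock"
fi
if looks_expired "$not_after" "$now" $(( 31 * 86400 )); then
  echo "expired after a 31-day shift"
fi
```

With these example values, only the second check reports an expiration, which is the same effect a Time Travel experiment has on a real client.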
With Gremlin, we can use Chaos Engineering to test TLS security using a Time Travel experiment. Time Travel changes the system clock on a host, letting us shift seconds, minutes, hours, days, or even years into the future. We can move our systems forward, send a request out to a website, and if we receive an expiration error, we know how much time we have before our certificate expires.
The benefits of this approach are that:
- It tests the entire SSL chain including intermediate certificates, root certificates, and certificate authorities (CA).
- It detects certificates that we might have overlooked or forgotten about (do you know when your Gremlin certificates expire?).
- It tests for other time-sensitive failure modes, like Daylight Saving Time compatibility and time synchronization errors.
To run this chaos experiment, we need two things: a website to test, and a separate host with Gremlin installed. We'll use the second host as a stand-in for a user's device. We also need a tool to access the website. For this experiment, we'll use curl.
To demonstrate how curl responds to healthy and expired certificates, let's run a curl request against a working website:
curl -I https://gremlin-demo-lab-host/

HTTP/2 200
cache-control: max-age=604800
content-length: 543958
content-type: text/plain
date: Mon, 18 Jan 2021 23:06:18 GMT
...

Now let's try sending a request to a website with an expired certificate:
curl -I https://expired.badssl.com/

curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html
Now that we know what to look for, let's design our experiment. Our hypothesis is that if we set our system clock forward (e.g. by one day) and send a curl request, we'll see a successful response. But if curl returns an error, then we know the certificate will expire within the next day.
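The pass/fail check in this hypothesis could be scripted roughly as follows. This is a sketch under stated assumptions: the function names `classify_curl_result` and `check_cert` are made up for this example. Exit code 60 is what curl reported for the expired certificate shown earlier, but curl uses that code for certificate verification failures in general, so the sketch also looks at the error text before calling a certificate expired.

```shell
#!/bin/bash
# Hypothetical helper: map a curl exit code and its error text to a verdict.
classify_curl_result() {
  local rc=$1 msg=$2
  if (( rc == 0 )); then
    echo "ok"
  elif (( rc == 60 )) && [[ $msg == *"certificate has expired"* ]]; then
    echo "expired"
  elif (( rc == 60 )); then
    # Some other verification problem, such as an untrusted issuer.
    echo "certificate-error"
  else
    # DNS failure, connection refused, timeout, and so on.
    echo "other-error"
  fi
}

# Hypothetical wrapper: request only the headers, discard them, keep the
# error text (-sS hides the progress meter but still prints errors).
check_cert() {
  local url=$1 err rc
  err=$(curl -sS -I -o /dev/null "$url" 2>&1)
  rc=$?
  classify_curl_result "$rc" "$err"
}
```

Calling `check_cert https://gremlin-demo-lab-host/` while the clock is shifted would then print a single verdict, which makes it easier to tell a certificate problem apart from an unrelated network failure.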
We'll log into the Gremlin web app, create a new attack, and select our test host, which is shown here as "gremlin-demo-lab-host":
Next, we'll expand the State category and select Time Travel. We'll keep the length of the experiment set to 60 seconds, block NTP (Network Time Protocol) communication so that our host doesn't automatically update to the correct time, and set the offset to 2,678,400 seconds (31 days, or roughly one month from now).
Tip: To calculate the offset, use a date/time conversion tool such as the ones provided by timeanddate.com.
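For whole-day offsets, the arithmetic is also easy to do in the shell. The helper name `days_to_seconds` below is an assumption made for this example; it simply multiplies a number of days by 86,400 seconds.

```shell
#!/bin/bash
# Hypothetical helper: convert a whole number of days into seconds.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 1    # prints 86400   (one day)
days_to_seconds 7    # prints 604800  (one week)
days_to_seconds 31   # prints 2678400 (31 days)
```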
Now let's run the attack. While the attack is running, let's re-run curl:
curl -I https://gremlin-demo-lab-host/

curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html
curl returned an error, meaning that our certificate is going to expire within the next month. We'll click the Halt button in the top-right corner of the Gremlin web app to halt the experiment, which automatically reverts the system clock to the correct time. We'll record our observations in Gremlin, then work on replacing our certificate. Using Time Travel allowed us to catch this before it became a problem for our customers, while doing so in a safe and controlled way.
Scaling up and automating your certificate checks
For large teams, manually testing each and every certificate isn't scalable. Imagine if we were managing dozens of certificates across different hosts and services. We also need a way to test further than one day out, otherwise we could have multiple certificates expiring within a short time frame.
To address both problems, we'll do two things: we'll use a Scenario to gradually increase the magnitude (that is, the time offset) of our Time Travel experiment, then automate our experiment using the Gremlin REST API.
Using Scenarios to gradually increase the magnitude of a Time Travel experiment
With a Scenario, we can run multiple Time Travel attacks back-to-back and increase the interval each time. This lets us test over multiple time periods during a single experiment.
To create a Scenario, we'll click on our previously run Time Travel attack to open the Attack Details page. From here, we'll click Create Scenario. We'll call our Scenario "SSL/TLS certificate expiration" and enter a description. The first attack in the Scenario is our original Time Travel attack; we'll set its offset to 86,400 seconds (one day) so that each stage reaches further into the future than the last.
Next, we'll click "Add a recent attack", re-select our previous Time Travel attack, and choose our test host. We'll change the offset for the second attack to 604,800 (one week). We'll repeat this step to create a third attack, then change its offset to 2,678,400 (one month).
While the Scenario is running, we'll run curl in a continuous loop. In the following script, curl makes a request, and if the request is successful, it waits 10 seconds before repeating it. If curl fails, it exits the loop and prints the failure to the console. This script also prints the current system time before each check, so we can see which stage of the Scenario was active when curl failed.
#!/bin/bash
while :; do
  # Print the (possibly shifted) system time before each check
  date
  # Leave the loop as soon as curl reports a failure
  if ! curl -s https://gremlin-demo-lab-host/ > /dev/null; then
    break
  fi
  sleep 10
done
echo "Failed to connect."
Now let's run the Scenario and start our script. Once the Scenario hits stage 2, curl returns an error and the loop exits. This tells us our TLS certificate will expire between one day and one week from now.
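The reasoning from failed stage to expiration window can be written down as a small lookup. This sketch assumes the three stage offsets are one day, one week, and 31 days, in that order; the array `STAGE_OFFSETS` and the function `expiry_window` are names invented for this example.

```shell
#!/bin/bash
# Offsets (in seconds) for each Scenario stage, assumed to be in increasing
# order: one day, one week, 31 days.
STAGE_OFFSETS=(86400 604800 2678400)

# Hypothetical helper: given the 1-based stage at which curl first failed,
# print the window (in seconds from now) in which the certificate expires.
expiry_window() {
  local stage=$1
  local upper=${STAGE_OFFSETS[stage - 1]}
  local lower=0
  if (( stage > 1 )); then
    # Earlier stages passed, so the certificate outlives the previous offset.
    lower=${STAGE_OFFSETS[stage - 2]}
  fi
  echo "expires between ${lower}s and ${upper}s from now"
}

expiry_window 2   # prints: expires between 86400s and 604800s from now
```

A failure at stage 2 therefore brackets the expiration between the stage 1 and stage 2 offsets, matching the one-day-to-one-week conclusion above.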
If you have a Gremlin account, you can also run this experiment as a pre-configured Recommended Scenario. Open the Recommended Scenario in the Gremlin web app, click "Add targets and run" to select the hosts you want to run the attack on, then run the Scenario.
Using the Gremlin REST API to automate a Scenario
The Gremlin REST API provides a RESTful interface for performing actions in Gremlin, such as starting attacks and Scenarios. By using the REST API in our test script, we can automatically initiate our Scenario.
First, let's reopen our executed Time Travel Scenario in the web app. Click Rerun, then in the bottom-right corner of the page, click Gremlin API Examples. This generates a full curl request that we can use to initiate the attack:
curl -i -X POST 'https://api.gremlin.com/v1/scenarios/<your Scenario ID>/runs?teamId=<your team ID>' -H 'Content-Type: application/json;charset=utf-8' -H 'Authorization: Bearer <your bearer token>' -d '{}'
Next, we'll copy this command to our curl script and add it just before the loop. We can add a second API command after the end of the loop to halt the experiment if curl fails. This lets us safely roll back after detecting an expired certificate without having to open the Gremlin web app and halt the experiment ourselves. Make sure to replace <your Scenario ID>, <your team ID>, and <your bearer token> with your own values:
#!/bin/bash

# Start the Scenario. RUN captures the API response, which the halt
# request below uses as the run identifier.
RUN=$(curl -X POST 'https://api.gremlin.com/v1/scenarios/<your Scenario ID>/runs?teamId=<your team ID>' \
  -H 'Content-Type: application/json;charset=utf-8' \
  -H 'Authorization: Bearer <your bearer token>' \
  -d '{}')

# Test your website(s)
while :; do
  date
  if ! curl -s https://gremlin-demo-lab-host/ > /dev/null; then
    break
  fi
  sleep 10
done

# Halt the Scenario ("$RUN" is quoted so the URL stays a single argument)
curl -X POST 'https://api.gremlin.com/v1/scenarios/halt/<your Scenario ID>/runs/'"$RUN"'?teamId=<your team ID>' \
  -H 'Content-Type: application/json;charset=utf-8' \
  -H 'Authorization: Bearer <your bearer token>' \
  -d '{}'
Now we have a fully scripted chaos experiment that we can schedule using a service like cron, add to our CI/CD pipeline, or run as part of our client-side testing suite.
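When several services need to be covered in one run, the single curl check inside the loop could be replaced with a pass over a list of endpoints. The following is a minimal sketch, not part of the Gremlin tooling: `failing_urls` and `curl_check` are hypothetical names, and the second hostname in the usage comment is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: run CHECK against each URL and print the ones that fail.
# CHECK is the name of any command or function that takes a URL and returns
# non-zero on failure. Returns non-zero if at least one URL failed.
failing_urls() {
  local check=$1 url failed=0
  shift
  for url in "$@"; do
    if ! "$check" "$url"; then
      echo "$url"
      failed=1
    fi
  done
  return "$failed"
}

# The same check the loop uses: a silent curl request.
curl_check() {
  curl -s "$1" > /dev/null
}

# Example usage (the second hostname is a placeholder):
# failing_urls curl_check https://gremlin-demo-lab-host/ https://another-service.example/
```

Passing the check in by name keeps the loop independent of curl, so the same function can be exercised with a stub check before pointing it at real endpoints.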
Conclusion
Staying ahead of expiring certificates is vital for keeping your websites and services accessible and secure. Time Travel lets you quickly and safely test your certificates in any environment, whether your websites are hosted on AWS, GCP, Azure, or on-premises.