For this episode, your hosts, Jason Yee and Julie Gunderson, are sitting down for a year in review! With the new year just around the corner, let's take a glance back at a year of chaos... engineering, that is. The rest of the chaos we will leave out of the conversation. Julie and Jason talk about their favorite outages of the year. From Fastly to texts from Julie's mom, we've definitely got a heck of a year to consider!
Show Notes
In this episode, we cover:
- 00:00:00 - Introduction
- 00:00:30 - Fastly Outage
- 00:04:05 - Salesforce Outage
- 00:07:25 - Hypothesizing
- 00:10:00 - Julie Joins the Team!
- 00:14:05 - Looking Forward/Outro
Transcript
Jason: There's a bunch of cruft that they'll cut from the beginning, and plenty of stupid things to cold-open with, so.
Julie: I mean, I probably should have not said that I look forward to more incidents.
[audio break 00:00:12]
Jason: Hey, Julie. So, it's been quite a year, and we're going to do a year-end review episode here. As with everything, this feels like a year of a lot of incidents and outages. So, I'm curious, what is your favorite outage of the year?
Julie: Well, Jason, it has been fun. There's been so many outages, it's really hard to pick a favorite. I will say that one that sticks out as my favorite, I guess you could say, was the Fastly outage, basically because of a lot of the headlines that we saw such as, "Fastly slows down and stops the internet." You know, "What is Fastly and why did it cause an outage?" And then I think that people started realizing that there's a lot more that goes into operating the internet. So, I think from just a consumer side, that was kind of a fun one. I'm sure that the increases in Google searches for Fastly were quite large in the next couple of days following that.
Jason: That's an interesting thing, right? Because I think for a lot of us in the industry, like, you know what Fastly is, I know what Fastly is; I've been friends with folks over there for quite a while and they've got a great service, but for everybody else out there in the general public, suddenly, this company they'd never heard of, that, you know, handles, like, 25% of the world's internet traffic, is suddenly on the front page news and they didn't realize how much of the internet runs through this service. And I feel that way about a lot of the incidents that we're seeing lately, right? We're recording this in December, and a week ago, Amazon had a rather large outage, affecting us-east-1, which it seems like it's always us-east-1. But that took down a bunch of stuff and similarly, there are people, like, you know, my dad, who's just like, "I buy things from Amazon. How did this crash, like, the internet?"
Julie: I will tell you that my mom generally calls me (and I hate to throw her under the bus) anytime there is an outage. So, Hulu had some issues earlier this year and I got texts from my mom actually asking me if I could call any of my friends over at Hulu and, like, help her get her Hulu working. She does this similarly for Facebook. So, when that Facebook outage happened, I always (almost) know about an outage first because of my mother. She is my alerting mechanism.
Jason: I didn't realize Hulu had an outage, and now it makes me think we've had J. Paul Reed and some other folks from Netflix on the show. We definitely need to have an engineer from Hulu come on the show. So, if you're out there listening and you work for Hulu, and you'd like to be on the show and dish all the dirt on Hulu... actually, don't do that, but we'd love to talk with you about reliability and what you're doing over there at Hulu. So, reach out to us at podcast@gremlin.com.
Julie: I'm sure my mother would appreciate their email address and phone number just in case--
Jason: [laugh].
Julie: --for the future. [laugh].
Jason: If you do reach out to us, we will connect you with Julie's mother to help solve her streaming issues. You had mentioned one thing though. You said the phrase about throwing your mother under the bus, and that reminds me of one of my favorite outages from this year, which I don't know if you remember, it's all about throwing people under the bus, or one person in particular, and that's the Salesforce outage. Do you remember that?
Julie: Oh. Yes, I do. So, I was not here at the time of the Salesforce outage, but I do remember the impact that that had on multiple organizations. And then--
Jason: Yes--
Julie: --the retro.
Jason: --the Salesforce outage was one where, similarly, Salesforce affects so much, and it is a major name. And so people like my dad or your mom probably knew like, "Oh, Salesforce. That's a big thing." The retro on it, I think, was what really stood out. I think, you know, most people understand, like, "Oh, you're having DNS issues." Like, obviously it's always DNS, right? That's the meme: It's always DNS that causes your issues.
In this case it was, but the retro on this that they publicly published was basically, "We had an engineer that went to update DNS, and this engineer decided to push things out using an EBF process, an Emergency Break Fix process." So, they sort of circumvented a lot of the slow rollout processes because they just wanted to get this change made and get it done without all the hassle. And turns out that they misconfigured it and it took everything down. And so the entire incident retro was basically throwing this one engineer under the bus. Not good.
Julie: No, it wasn't. And I think that it's interesting because especially when I was over at PagerDuty, right, we talked a lot about blamelessness. That was very not blameless. It doesn't teach you to embrace failure, it doesn't show that we really just want to take that and learn better ways of doing things, or how we can make our systems more resilient. But going back to the Fastly outage, I mean, the NPR headline was, "Tuesday's Internet Outage was Caused by One Customer Changing a Setting, Fastly says." So again, we could have better ways of communicating.
Jason: Definitely don't throw your engineers under the bus, but even more so, don't throw your customers under the bus. I think for both of these, we have to realize, like, for the engineer at Salesforce, like, the blameless lesson learned here is, what safeguards are you going to put in place? Or what safeguards were there? Like, obviously, this engineer thought, like, "The regular process is a hassle; we don't need to do that. What's the quickest, most expedient way to resolve the issue or get this job done?" And so they took that.
And similarly with the customer at Fastly, they're just like, "How can I get my systems working the way I want them to? Let's roll out this configuration." It's really up to all of us, and particularly within our companies, to think about how are people using our products. How are they working on our systems? And, what are the guardrails that we need to put in place? Because people are going to try to make the best decisions that they can, and that obviously means getting the job done as quickly as possible and then moving on to the next thing.
Julie: Well, and I think you're really onto something there, too, because I think it's also about figuring out those unique ways that our customers can break our products, things that we didn't think through. And I mean, that goes back to what we do here at Gremlin, right? Then that goes back to Chaos Engineering. Let's think through a hypothesis. Let's see, you know, what if ABC Company, somebody there does something. How can we test for that?
And I think that shouldn't get lost in the whole aspect of now we've got this postmortem. But how do we recreate that? How do we make sure that these things don't happen again? And then how do we get creative with trying to figure out, well, how can we break our stuff?
Jason: I definitely love that. And that's something that we've done internally at Gremlin this year: we've really started to build up a better practice around running Chaos Engineering internally on our own systems. We've done that for a long time, but a lot of times it was just specific teams, and so earlier this year, the advocacy team was partnering up with the various engineering teams and running Chaos Engineering experiments. And it was interesting to learn and think through some of those ideas of, as we're doing this work, we're going to be trying to do things expediently with the least amount of hassle, but what if we decide to do something that's outside of the documented process, but for which there are no technical guardrails? So, some of the things that we ended up doing were testing dependencies, right, things that again, are outside of the normal process.
Like, we use LaunchDarkly for feature flagging. What happens if we decide to circumvent that, just push things straight to production? What happens if we decide to just block LaunchDarkly altogether? And we found some actual critical issues and we were able to resolve those without impacting our customers.
Julie: That's the key element: Practice, play, think through the what ifs. And I love the what ifs part. You know, going back to my past, I have to tell you that the IT team used to always give me all of the new tech because if something was going to break for some reason (they used to call me the "AllSpark," to be honest with everybody out there), for some reason, if something was going to break, with me it would break in the most unique possible way, so before anything got rolled out to the entire company, I was the one that got to test it.
Jason: That's amazing. So, what you're saying is on my next project, I need to give that to you first?
Julie: Oh, a hundred percent. Really, it was remarkable how things would break. I mean, I had keyboards that would randomly type letters. I definitely took down some internal things, but I'm just saying that you should leverage those people within your organization, as well. The thing was, it was never a, "Julie is awful; things break because of Julie." It was, "You know what? Leverage Julie to learn about what we're using." And it was kind of fun. I mean, granted, this was years ago, and that name has stuck, and sometimes they still definitely make fun of me for it, but really, they just used me to break things in unique ways. Because I did.
Jason: That's actually a really good segue to some of the stuff that we've been doing because you joined Gremlin, now, a few months back (more than a few months, but late summer), and a lot of what we were doing early on was just, we had these processes that, internally for myself and other folks who'd been around for a while, it was just we knew what to do because we'd done it so much. And it was that nice thing of we're going to do this thing, but let's just have Julie do it. Also, we're not going to tell you anything; we're just going to point you at the docs. It became really evident as you went through that of, like, "Hey, this doc is missing this thing. It doesn't make sense."
And you really helped us improve some of those documentation points, or some of the flows that we had, you would execute, and it's like, "Why are we doing it this way?" And a lot of times, it was like, "Oh, that's a legacy thing. We do it because... oh, right, that thing we did it because of doesn't exist anymore. Like, we're doing it completely backwards because of some sort of legacy thing that doesn't exist. Let's update that." And you were able to help us do that, which was fantastic.
Julie: Oh, yeah. And it was really great on my end, too, because I always felt like I could ask the questions. And that is a cultural trait that is really important in an organization, to make sure that folks can ask questions and feel comfortable doing so. I've definitely seen it the other way, and when folks don't know the right way to do something or they're afraid to ask those questions, that's also where you see the issues with the systems because they're like, "Okay, I'm just going to do this." And even going back to my days of being a recruiter (which is when I started in tech, but don't worry, everybody, I was super cool; I was not a bad recruiter), that was something that I always looked for in the interview process. When I'd ask somebody how to do something, would they say, "I don't know, I would ask," or, "I would do this," or would they just fumble their way through it? I think that it's important that organizations really adopt that culture of, again, failure, blamelessness, it's okay to ask questions.
Jason: Absolutely. I think sort of the flip side of that, or the corollary of that, is something that Alex Hidalgo brought up. So, one of our very first episodes of 2021 on this podcast, we had Alex Hidalgo, who's now at Nobl9, and he brought up a thing from his time at Google called Hyrum's Law. And Hyrum's Law is, this guy Hyrum, who worked at Google, basically said, "If you've got an API, that API will be used in every way possible. If you don't actually technically prevent it, somebody is going to use your API in a way it wasn't designed for. And because it allows that, it becomes totally, like, a plausible or a valid use case for this."
And so as we think about this, and thinking about blamelessness, using the end run around the process to deploy this DNS change, like, that's a valid process now because you didn't put anything in place to validate against it, and to guarantee that people weren't using it in ways that were not intended.
Julie: I think that that makes a lot of sense. Because I know I've definitely used things in ways that were not intended, which people can go back and look at my quest for Diet Cherry 7 Up during the pandemic, when I used tools in ways they weren't intended, but I would like to say that Diet Cherry 7 Up is back, from those tools. Thank you PagerDuty and some APIs that were open to me to be able to leverage in interesting ways.
Jason: If you needed an alert for Diet Cherry 7 Up, PagerDuty, I guess it's a good enough tool for that.
Julie: Well, the fact is, I [laugh] was able to get very creative. I mean, what are terms of service, Jason?
Jason: I don't know. Does anybody actually read those?
Julie: Yeah. I would call them "light guardrails."
Jason: [laugh]. So Julie, we're getting towards the end of the year. I'm curious, what are you looking forward to in 2022?
Julie: Well, aside from, ideally, the end to the pandemic, I would say that one of the things that I'm looking forward to in 2022, from joining Gremlin, I had a really great opportunity to work on certifications here, and I'm really excited because in 2022 we'll be launching some more certifications and I'm excited for what we're going to do with that and getting creative around that. But I'm also really interested to just see how everybody evolves or learns from this year and the outages that we had. I always love fun outages, so I'm kind of curious what's going to happen over the holiday season to see if we see anything new or interesting. But Jason, what about you? What are you looking forward to?
Jason: You know, I, similarly, am looking forward to the end of the pandemic. I don't know if there's really going to be an end, but I think we're starting to see a return to some normalcy. And so, we've already participated in some great events, went to KubeCon a couple months ago, went to Amazon re:Invent a few weeks ago, and both of those were fantastic just to see people getting out there, and learning, and building things again. So, I'm super excited for this next year. I think we're going to start seeing a lot more events back in person, and a lot of people really eager to get together to learn and build things together. So, that's what I'm excited about. Hopefully, less incidents, but as systems get more complex, I'm not sure that that's going to happen. So, at least if we don't have less incidents, more learning from incidents is really what I'm hoping for.
Julie: I like how I'm looking forward to more incidents and you're looking forward to less. To be fair, from my perspective, every incident that we have is an opportunity to talk about something new and to teach folks things, and just sometimes it's fun going down the rabbit holes to find out, well, what was the cause of this? And what was the outcome? So, when I say more incidents, I don't mean that I don't want to be able to watch The Queen's Gambit on Netflix, okay, J. Paul? Just throwing that out there.
Jason: Well, thanks, Julie, for being on. And for all of our listeners, whether you're seeing more incidents or less incidents, Julie and I both hope that you're learning from the incidents that you have, that you're working to become more reliable and building more reliable systems, and hopefully testing them out with some chaos engineering. If you'd like to hear more from the Break Things on Purpose podcast, we've got a bunch of episodes that we've published this year, so if you haven't heard some of them, go back into our catalog. You can see all of the episodes at gremlin.com/podcast. And we look forward to seeing you in our next podcast.
Jason: For links to all the information mentioned, visit our website at gremlin.com/podcast. If you liked this episode, subscribe to the Break Things on Purpose podcast on Spotify, Apple Podcasts, or your favorite podcast platform. Our theme song is called Battle of Pogs by Komiku and is available on loyaltyfreakmusic.com.