Anurag Gupta on Day 2 Operations, DevOps, and Automated Remediation – InfoQ.com

Posted: April 6, 2021 at 1:44 am



Introductions [00:21]

Daniel Bryant: Hello, and welcome to The InfoQ podcast. I'm Daniel Bryant, news manager here at InfoQ and director of DevRel at Ambassador Labs. In this edition of the podcast I had the pleasure of sitting down with Anurag Gupta, founder and CEO of shoreline.io.

In this podcast, I wanted to explore the topic of Day 2 operations. If Day 0 is all about preparing and designing software, and Day 1 is focused on building and deploying that software, Day 2 is squarely focused on when users interact with the product and the software. Here is where we quickly learn if our requirements and our assumptions were correct. The inevitable firefighting and fixes that emerge in this day used to be the domain of the sysadmin operating on infrastructure, but increasingly we are seeing developers take ownership here by also being on call and interacting with platforms.

I've recently been following Anurag's work in this space and was keen to draw on his experience, both from running teams at AWS and from his new venture.

Daniel Bryant: I wanted to understand more about observability and automation in relation to Day 2 operations, and also explore the relationship that DevOps and site reliability engineering have to this space. Before we start today's podcast, I wanted to share with you details of our upcoming QCon Plus virtual event, taking place this May 17th to 28th.

Daniel Bryant: QCon Plus focuses on emerging software trends and practices from the world's most innovative software professionals. All 16 tracks are curated by domain experts with the goal of helping you focus on the topics that matter right now in software development. Tracks include architecting a modern financial institution, observability and understandability of production, and accelerating APIs and edge computing. And of course there's more. You'll learn new ideas and insights from over 80 software practitioners at innovator and early adopter companies.

Daniel Bryant: The event runs over two weeks for a few hours per day, and you can experience technical talks, real-time interactive sessions, async learning, and optional workshops to help you learn about emerging trends and validate your upcoming software roadmap. If you're a senior software engineer, architect, or team lead, and you want to take your technical learning and personal development to a whole new level this year, join us at QCon Plus this May 17th through 28th. Visit us at http://www.qcon.plus for more information. Welcome to the InfoQ podcast, Anurag.

Anurag Gupta: Hi. It's great to be here. Thank you so much, Daniel.

Daniel Bryant: Could you briefly introduce yourself for the listeners please?

Anurag Gupta: So my name is Anurag Gupta. Currently, I'm the founder of shoreline.io, which is a DevOps company that uses automated remediation to improve availability and reduce labor for Day 2 ops. We'll get into Day 2 ops over time. Before this, I used to run the relational database and analytic services for AWS, that's services like RDS, Aurora, EMR, Redshift, Glue, stuff like that. AWS is where I grew to appreciate the importance of operations as companies move from building products to delivering services. At AWS, we used to talk about utility computing, where you could just plug something into the wall, so to speak, it's metered by the minute, and you are delivering compute, storage, and databases in the same way that your utility company delivers electricity and gas. And what you learn as you start delivering services is that nobody cares about features if your service isn't up. And the more reliable you make your services, the more people depend upon them to be reliable. My mother-in-law lives in India, so she's got a big generator in her backyard. I don't have anything like that. I expect PG&E to provide me power 24/7.

Daniel Bryant: So you mentioned already, Anurag, Day 2 operations, which I think is super interesting. We talk a lot about standing up cloud, standing up Kubernetes, some of the Day 0 and Day 1 work. Could you set the scene for us and talk about the problem space of Day 2 operations, please?

Anurag Gupta: Yeah. That's a great question to start with. The challenge with the DevOps space is that there's just this bewildering landscape of tools and technologies. Personally, I like to break things down in terms of Day 0, Day 1, and Day 2, thinking about it from the perspective of the customer. So for me, Day 0 is about all the tooling that supports the development of the service, the stuff that is used by the developer: Jira, Git, review boards, all that kind of stuff. That's the work you do before your customer sees anything. Day 1 is about the deployment and configuration as you deploy out to production. Actually, this is a source of a lot of downtime for many companies, because things can go wrong when they're perturbed, and that's what happens whenever you deploy something, right? And there are a bunch of great tools in this space. But honestly, I think of this as mostly a process problem.

Anurag Gupta: I've had people who used to work with me at AWS go to other companies and get a tenfold decrease in outages due to deployments in just a year, just by improving processes. And at AWS, we had lots of deployments going on all the time, but some services would go a year, maybe even more than that, without a deployment-related outage. That's because there are a lot of best practices here: canary workloads, cell-based architectures to minimize blast radius, deployment bar raisers for reviews, finding the key metrics for your service and the ones that depend on you. And most importantly, in my view, automated rollbacks. I know Clare Liguori did a podcast with you earlier on, and I really appreciated that conversation.

Anurag Gupta: So what's Day 2? Day 2 is about keeping the lights on for your running service. And that's a lot more challenging. You can schedule deployments, but you can't schedule when things break. You're on 24/7. A lot of Day 2 issues are about probability: as your fleet grows, you're going to see more issues, and your ops team is not going to grow at the same pace as your fleet. You need to have context for the entirety of your environment, not just the portion of it that you're deploying, and your environment is getting more complex. Blades to VMs was a tenfold increase in the number of resources you were managing; containers are another tenfold, plus you've got microservices, multi-cloud, hybrid cloud. There's an intense and growing amount of work involved in keeping services healthy. My personal belief, and the reason I started Shoreline, is that the only way people are going to keep up is to automate fixing things rather than the manual, operator-in-the-middle approaches that are going on today. Process improvements can only go so far.

Daniel Bryant: You touched on operators there, Anurag. Who typically within an organization is responsible or maybe even accountable for looking after these Day 2 problems?

Anurag Gupta: At AWS, all developers did on-call shifts for their services and all service owners were part of the escalation chain. I used to get paged a lot, sometimes four nights in a row for different services. Personally, I feel like this creates a healthy culture, because you actually feel ownership and you feel pain when your customers feel pain; that's important. And it's kind of the same thing as developers having a stake in QA and pipelines, right? For larger services, we had follow-the-sun ops teams in places like Dublin or Sydney. Those teams tended to be more operations-centric, since they had the same ops load as the local teams did but tended to be a lot smaller, right? So when not on call, they'd be doing more analysis, building ops dashboards, tooling, things like that. And they tended to come out of sysadmin backgrounds rather than CS, like the folks I had in Seattle and the Bay Area. But the key thing is we all shared the load.

Daniel Bryant: How does this relate to job titles? We see a lot of DevOps engineers, and a lot of site reliability engineering coming out of the likes of Google, Amazon, and so forth. Is there a danger that some folks, some organizations, simply rebrand sysadmins or platform teams as DevOps, as SRE?

Anurag Gupta: Yeah. Some do, but that's really a mistake, in my view. For me, DevOps is about ensuring developers take responsibility for operations, just as they do for quality. And SRE is about ensuring that the site is reliable, not sysadmin work, which is kind of more about squashing one ticket after the next that comes in. In SRE, you're trying to build better telemetry, alarming, automation. Across the org you're trying to keep raising the bar on the alarm threshold, right? It used to be that maybe I alarmed at 96%; maybe now it should be 97%. Or you're trying to improve downtime minutes. And at scale, this gets to be a really interesting problem, because you start collecting data and can start applying statistical methodology. I do think too many people use SRE as a title to let developers just throw code over the wall and avoid the responsibility for uptime or reliability, and that's a mistake. I much prefer the model where everybody takes on-call, but you do have specialists, just like you had QA people even though you have developers writing unit tests.

Daniel Bryant: Moving more into what Day 2 ops looks like now, there's definitely an increasing trend towards everything as code. Do you think this is helping or hindering, particularly in relation to greenfield and brownfield projects? Obviously greenfield gets a lot of folks excited, but the reality is a lot of stuff is brownfield, right?

Anurag Gupta: I think there's a mistake in thinking that everything needs to turn into code. Now, coding is a very fungible skill, and you can take a developer and have them do operations or have them do unit tests or QA. When the team is small, fungibility is important, but over time as things grow, you want specialization. Having developers write unit tests and put them into pipelines didn't remove the need for QA people doing integration tests, scale tests, usability tests, right? I think the more important principle here is the SRE principle to automate away the toil. At scale, you can't afford to have people fixing issues. It's too expensive, it takes too long, and it introduces human error. The biggest outages we've had at AWS were exacerbated by human error. So let me give you an example of how I think things should work.

Anurag Gupta: At AWS, my teams used to ticket per individual database that had an issue, not on a fleet-wide basis, not on a rack or a bunch of nodes. And that's because you, as a customer, don't care how my fleet of RDS instances is doing; you care about your individual database. It's just like if I called up my local utility company and said, "Hey, the power's out at my house." You wouldn't want them to come back to you and say, "Hey, did you know that we have six nines of power availability in the state of California?" I don't care. If my database is down, my app is certainly down, maybe my entire company is down. Now, doing instance-by-instance ticketing is pretty challenging if you're growing 50 or a hundred percent year over year and starting to manage millions of databases. The solution for us was that each week we'd do a Pareto analysis of the prior week's tickets and find one, at least one, that we'd extinguish forever.
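
To make that weekly Pareto analysis concrete, here is a minimal Python sketch of ranking a week's tickets by a root-cause signature so the team can pick at least one recurring issue to extinguish. The ticket fields and causes are hypothetical, not AWS tooling.

```python
from collections import Counter

def top_recurring_issues(tickets, n=5):
    """Rank last week's tickets by root-cause signature (hypothetical field names)."""
    counts = Counter(t["root_cause"] for t in tickets)
    return counts.most_common(n)

# Example: the most frequent signature is the one to automate away forever.
tickets = [
    {"id": 1, "root_cause": "disk_full"},
    {"id": 2, "root_cause": "disk_full"},
    {"id": 3, "root_cause": "stuck_replication"},
]
print(top_recurring_issues(tickets))  # [('disk_full', 2), ('stuck_replication', 1)]
```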

Anurag Gupta: Now, as a service, that meant that we could keep our heads above water. As a customer, your availability goes up, because the issue response is seconds, not the hour that's involved when a human is dealing with it. And that's a virtuous cycle, because we see issues that others don't see because of our scale, and we're motivated to fix things other people wouldn't bother with because we're dealing with it ticket by ticket. And the customer is getting the same proportional improvement in availability as our overall fleet volume goes up. Now, the problem with that was that we were doing this with a Java control plane, and it was a big code base, millions of lines of fairly brittle code. It's difficult to build, it's difficult to maintain. It's much better if you can build it as small, independent bits of isolated functionality that your sysadmins are capable of writing, because they're the ones with the biggest expertise in operations.

Anurag Gupta: And that's actually what I'm trying to work on now: making it as easy to automate something and fix it once and for all as it would be to fix it once. I also think it's a mistake to think about this as greenfield versus brownfield. If you can operate a system at the Linux command prompt, you should be able to automate it, and that surely has to be the goal. It shouldn't require some particular platform, some particular YAML file, some particular tool. Those things do make things easier, no mistake, right? But you've got to operate the environments you've got; nothing's pristine. And a year from now, there's going to be something better.
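
As an illustration of capturing a command-prompt fix as a reusable remediation, here is a minimal Python sketch that compresses old logs when disk usage crosses a threshold. The threshold, log path, and service layout are assumptions for illustration, not Shoreline's implementation.

```python
import shutil
import subprocess

DISK_USAGE_LIMIT = 0.90          # assumed alarm threshold
LOG_DIR = "/var/log/myservice"   # hypothetical log directory

def disk_usage_fraction(path="/"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def remediate_full_disk():
    """The same fix an operator would type by hand, captured once as automation."""
    if disk_usage_fraction() > DISK_USAGE_LIMIT:
        # Compress logs older than a day instead of paging a human.
        subprocess.run(
            ["find", LOG_DIR, "-name", "*.log", "-mtime", "+1",
             "-exec", "gzip", "{}", ";"],
            check=True,
        )

if __name__ == "__main__":
    remediate_full_disk()
```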

Daniel Bryant: No, that totally makes sense. I saw the subtle YAML dig at Kubernetes there, right? Great stuff.

Anurag Gupta: Hey, I love Kubernetes. I can't say I love YAML.

Daniel Bryant: Yeah, totally. I love Kubernetes; I'm the same. I've got a love-hate relationship with YAML, I think, right? On that note, you talked several times there about automation, and I'm thinking back to when I was doing more ops, there was a lot more bash scripting. Now it's YAML, now it's HashiCorp's HCL configs, all these different things. Do sysadmin teams, do engineers, have to learn new fundamental skills because of this move towards everything as code?

Anurag Gupta: I'm not sure. I don't think so. I think the YAML and things like that are more what the software developers do as part of their Day 1 deployment work. The Day 2 work, I think, should still stay in the realm of scripting and so forth. Now the question is, how do we make that the same kind of GitOps thing that you do for Day 1 or Day 0? But you know, your goal as a service operator on call is to get the system back up, right, and collect telemetry for later analysis; it requires a triage mindset. And that's, I think, the shift in thinking as you go from sysadmin, which is about productivity and squashing tickets, to getting the entire system up. So in this world, you're thinking about the principle of doing least harm: minimizing change, limiting blast radius, identifying the subsystem at fault, and getting the right next person to look at it.

Anurag Gupta: You should think about it the way you think about an ER doctor, as opposed to your GP. If an ER doctor sees someone come in with a heart attack, they shouldn't be talking about, "Hey, you know, your BMI is high or your cholesterol is high"; the goal is to get your vital signs back to normal. And so that fast response and a small set of well-understood changes is what you're looking at here. That's when you're on call. When you're off call, it really helps to build stats skills, because you want to figure out whether the distribution of a metric has changed, or whether it's changed in an anomalous way. Are your resources getting consumed more heavily? Is the survival function changing? What's the biggest return on investment from the next automation or the next cost-reduction exercise?
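
For that off-call analysis, here is a minimal sketch of testing whether a metric's distribution has shifted week over week, using a two-sample Kolmogorov-Smirnov test. The synthetic latency data and significance threshold are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
last_week = rng.normal(loc=200, scale=20, size=5000)   # e.g., request latency in ms
this_week = rng.normal(loc=230, scale=25, size=5000)   # the same metric this week

# Two-sample KS test: has the latency distribution changed?
statistic, p_value = stats.ks_2samp(last_week, this_week)
if p_value < 0.01:
    print(f"Distribution shift detected (KS={statistic:.3f}, p={p_value:.3g})")
```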

Anurag Gupta: I don't really think that software development skills are necessary for sysadmins, but I think the discipline associated with software delivery techniques is really important. And what I found when running services is that my very specialized data plane developers, for example people doing query optimizers and stuff like that, had a harder time with on-call than my control plane or QA folks, because those folks had a better holistic sense of the entire system, despite the data plane developers being very skilled software engineers.

Daniel Bryant: You mentioned GitOps a moment ago, which I think is super interesting. I see it mentioned all over the interwebs, and it constantly comes up in podcasts. Love the way it works, folks. I think GitOps is fantastic, but I don't want to bias you. I'd love to get your thoughts on what GitOps is, what benefits it has, and perhaps what weaknesses it has too.

Anurag Gupta: I'm a big fan. I think it's the right mindset to automate everything and remove manual labor, both from a scale perspective and from an error-creation perspective. And for Day 2, it just makes sense to me that the artifacts one uses to monitor, alarm on, and repair issues go through the same review process, the same pipeline process, version control, and deployment as everything else. Now, honestly, there just aren't good patterns for GitOps for Day 2 ops right now. It's still kind of ad hoc. There are a bunch of tools, but they're very isolated from the software development environment. You have to take them and make changes across the entirety of your environment, not in a version-controlled way for this version versus that version, in dev versus test, stuff like that. And that, I think, is a source of error. Even if you just think about something as straightforward as a wiki, you're going to have a wiki for the current version or the last version, not really associated with the one you happen to be dealing with right now.

Daniel Bryant: I did want to look at the connection with, say, more modern technologies now, like containers and Kubernetes. I've heard it mentioned a few times that not all of us are necessarily running the latest tech. Totally makes sense; there's always going to be this long tail of adoption of these modern platforms. So what do you think operations teams and sysadmins are doing if they're not working with a GitOps-friendly platform like Kubernetes?

Anurag Gupta: So realistically, I think for most companies, maybe about 10% of their cloud workloads are on Kubernetes, and maybe 10 to 20% of their overall IT workloads are in the cloud, and that's across the broader expanse of companies, right? Now that's changing, right? But it takes time. So honestly, there's no silver bullet with Kubernetes. It does make it much faster to move resources around and gets you a bunch of orchestration capabilities that you can configure through config files, which is great. But you can build GitOps models on VMs, and a lot of companies do. And I do run into people who think that as they move to Kubernetes, or for that matter serverless, they're going to get rid of all their ops issues. I don't know anybody in production who believes that. You can see even with Google, they've launched GKE Autopilot as a way to deal with Day 2 operations, because they recognize that there are lots of things around node management, security, and improving usage, all sorts of things that remain despite running containers.

Daniel Bryant: Moving perhaps away from the platform area a little bit now, and looking towards understanding what's going on: for me, observability is really important. So I'd love to get your take on how important things like metrics, logs, and tracing are, baked into the applications and into the platforms, in relation to Day 2 ops.

Anurag Gupta: Super critical. It's absolutely necessary. I mean, you need the observability tools to know when there's an issue. There have been times in my past where it turned out that there was an issue, let's say with S3 or DynamoDB, and it turned out my observability tool was dependent on them, just like my service was. And you feel totally helpless, right? Your system's unhealthy, you don't know why, and all you can do is get your entire team together and start SSHing onto boxes so you can figure out what's going on. At the same time, I can't say I ever got excited when someone said, "Here's one more dashboard to go look at." Right? I did get excited when they said, "Hey, this issue we ran into last week that woke you up, it's fixed forever." That's a whole different sort of thing, right?

Anurag Gupta: It's fundamentally useful to my customer, not just to me. And so that's what helps operator load, and it's what helps customer availability. Now, the other interesting thing is that as remediations become automated, real-time metrics, real-time observability, and real-time visibility into logs become super important. It doesn't really matter if an observability tool takes 10 minutes to raise an alarm if it takes two hours on average to react to it and repair it. But if you're fixing an issue in seconds, then you need to detect it in seconds as well. That's the difference between a glitch, where you have to refresh your browser, and my having to go on an apology tour, because customers really don't tolerate long outages.

Daniel Bryant: No, absolutely. In these modern days, we're spoiled with good UX and good experiences in general, so yeah, you're right. If your customers are telling you it's going wrong, it's a big problem.

Anurag Gupta: Yeah. I actually used to monitor Twitter to figure out if one of my customers was seeing an issue like, "Hey, is Redshift down? I can't start a new data warehouse." It's kind of shameful that that was the technique. But a lot of people actually do that.

Daniel Bryant: If folks don't understand their systems, or they can't observe them in their current state, what's your recommendation in relation to Day 2 ops? What are the first actions they should take to embrace the ideas you're talking about?

Anurag Gupta: It's a tough situation if you don't have the right metrics or logs. New things arise, and you may not be monitoring the right metrics; you may not be collecting the right information. As I mentioned before, sometimes your telemetry system fails. And really, all you can do in those circumstances is gather a lot of people and start SSHing into your boxes and debug your way through.

Anurag Gupta: I remember one time, this large-scale event where the cause was completely invisible to us. So we ended up getting 20 people and saying, "Here are 20 boxes for each of you to go into as a sample set; go figure out what's different between the bad ones versus the good ones." Now, it turned out in that case to be a bad BIOS. There's no way I'd be monitoring or logging something like a BIOS version. But those things can break too, right? And that's one reason Shoreline gives you the ability to do fleet-wide debugging by interactively running Linux commands across your fleet, because you can do that in real time and get a fleet-wide view into what the right response to something is, from your logs, from your Linux systems. But failing that, you have to put people on it.
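
A minimal sketch of that fleet-wide debugging pattern: fan a Linux command out over SSH and collect the results per host. The host list and the command are placeholders, and this is not Shoreline's implementation, just the general idea.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

HOSTS = ["db-001", "db-002", "db-003"]      # hypothetical sample of the fleet
COMMAND = "dmidecode -s bios-version"       # e.g., compare BIOS versions across hosts

def run_on_host(host, command=COMMAND, timeout=10):
    """Run one command on one host over SSH and return its output."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True, text=True, timeout=timeout,
    )
    return host, (result.stdout.strip() or result.stderr.strip())

with ThreadPoolExecutor(max_workers=20) as pool:
    for host, output in pool.map(run_on_host, HOSTS):
        print(f"{host}: {output}")
```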

Daniel Bryant: Could you walk us through a typical Day 2 life cycle of an incident being identified, how someone might go in and try and figure out what's going on and then apply a fix? I'm sort of thinking, looking back to my days of doing this, there was a series of pain points along the way. Could I SSH in? Did I have permissions? Could I understand what's going on? I'd love to get your take on those things.

Anurag Gupta: There are all those things. I mean, people do want to limit SSH, naturally; it's a security nightmare. But you do want to build tools that reduce the need for that, by being able to execute an action so it's the system doing it rather than a human being. The simplistic model of operations is that there's some incident management tool that causes you to get paged. You go in and register that you are working on it. You start entering some information. You bring in other people, or you look at some wiki to figure out what to do. And either it's some commonplace issue and you can fix it quickly, or it's something you haven't seen before, which you need to start debugging or bring in someone more experienced to help, off of a Slack channel. And I really do recommend that people use multiple channels for communication, because all of their channels can also fail, because they might be dependent on common resources.

Anurag Gupta: Like we used to use not only our internal system, but also IRC, because it's an old system and it's very unlikely to break due to a dependency on modern tech, right? And so some of the challenges with this process that other companies are working on: issues don't necessarily go to the right person, you want the ticket to be enriched with contextual information, wikis and runbooks go stale, and you want to do postmortems to improve the process next time. There are companies that do all of that stuff. My problem with that is that those are all process improvements that leave a human in the loop, and that doesn't reduce the pain for my operators, it doesn't improve the availability for my customer, and it doesn't scale, right? So my belief is that the only time operators should be paged is the exception case, where you don't know what to do through an automated remediation, or in some cases where you've walked through a series of paths and they failed to correct the issue.

Anurag Gupta: You eventually always need to ground out with a page to somebody. Now, most of the time, I think remediations can be automated, because everybody's got wikis and runbooks that say what to do. And for at least the issues that are commonplace, you're much better off automatically doing something, just like you're much better off using a deployment tool rather than having humans sit there and FTP data onto the boxes. We shouldn't wake people up to click a button; that's just wrong in so many ways.

Daniel Bryant: Now, sort of following on from that, how can remediation be automated? Now we do hear a lot of buzz on InfoQ and other places around AIOps. And there's sometimes a bit of eye-rolling associated with that kind of thing.

Anurag Gupta: I roll my eyes to a degree. Here's my take. When I look at AIOps reality versus messaging, what I see a lot of is tooling that deduplicates tickets or correlates cause and effect, so you know which tickets you need to go work on. That's super useful, there's no doubt. But it doesn't actually reduce the burden of fixing something. And so my take is that operators know how to fix an individual box. They're experienced, they know what to do. What they need is the capability to orchestrate the automation of that fix across a large fleet. They need the ability to tell the system: every second, go read thousands of metrics, compare them to hundreds of alarms, and take an action if it's needed. Why am I having to do that? Machines are good at that kind of work. Humans are not.

Anurag Gupta: And the other thing that I believe is that operations is, at its core, a distributed systems problem. It's harder at scale because things are more likely to break. What you need is orchestration that allows individual nodes to automate the observe-alarm-act loop. You want to make it easy to define those metrics, alarms, and actions using the Linux tools, CLIs, and scripting that you already know how to use. But what you want from your orchestration system, just as we get with Kubernetes, is the ability to distribute that content in a consistent way and run it everywhere. And you need clean models that deal with failure, limit the blast radius of changes, and run locally, because even your operations endpoints can fail. And so if you think about Kubernetes, it does this for restarts, but restarts don't fix everything.
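
A minimal, node-local sketch of that observe-alarm-act loop, wrapping ordinary Linux tooling; the metric, threshold, and restart action are illustrative assumptions.

```python
import subprocess
import time

def observe():
    """Read a metric with ordinary Linux tooling: here, the 1-minute load average."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def alarm(load, threshold=8.0):   # assumed threshold
    return load > threshold

def act():
    """A pre-approved, bounded fix: restart one misbehaving service."""
    subprocess.run(["systemctl", "restart", "myservice"], check=False)  # hypothetical unit

while True:
    if alarm(observe()):
        act()
    time.sleep(1)                 # check every second, as described above
```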

Anurag Gupta: And so what I would love to see from AIOps is automatic setting of alarm thresholds, so it can figure out that I should raise my bar now. You want it to automatically identify the portion of a fleet seeing a problem by correlating the tags: oh, almost everything is related to this AZ, or this version of software, or stuff like that. That's just correlation and stats; it can be done. And then you want proactive identification of issues. Because I'm starting to see a few read errors on the disk, I should probably replace it, because after I've seen two, it's very likely that disk is going to fail. You want to do things before they fail, not after they fail. You're not going to do that manually, right? Because you're barely treading water as it is. I personally don't want the system to do anything that I haven't told it to do, in terms of that form of AIOps of truly automating what's going on. I would tell it what to do and let it figure out where to do it, when to do it, and why to do it.
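
On automatic alarm thresholds, here is a minimal sketch of one possible policy: suggest a threshold from recent history as a high quantile plus some headroom. The quantile, headroom, and synthetic data are assumptions, not a statement about any particular AIOps product.

```python
import numpy as np

def suggest_threshold(history, quantile=0.999, headroom=1.1):
    """Propose an alarm threshold from recent history (assumed policy)."""
    return float(np.quantile(history, quantile)) * headroom

rng = np.random.default_rng(1)
cpu_history = rng.gamma(shape=4.0, scale=10.0, size=100_000)  # synthetic CPU% samples
print(f"Suggested alarm threshold: {suggest_threshold(cpu_history):.1f}")
```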

Daniel Bryant: I love that. All right, good. I think that actually leads on to one of my final questions, around bringing the human into the loop when it's valuable. Like you say, computers are really good at automating things, but we as humans are really good at dealing with complex systems, or filling in the unknowns, I guess. Now, how do you recommend teams support some of this automation? Should they prepare and practice for failure? Should they do things like game days? How best can they learn the skills to support the ideas you've talked about?

Anurag Gupta: That's a good point, because the more things work all the time, the less prepared people are to deal with it when they don't work. We see that even with aircraft pilots, for example, where the system will automatically take off, travel, and land. So when things go awry, they're not really used to handling it, right? We've seen that in some of the crashes recently. It's really interesting to look at failure modalities in other environments and bring them back to software systems.

Anurag Gupta: But in summary, I would say that it's really, really important to run game days. There just aren't a lot of live failures for people to get properly trained on, and particularly you will have new people coming in, and the experience level is going to vary. So one of the things I used to do for new services is that we'd spend a month or two creating events, ensuring the alarms fire, that people actually were responding to the ticket, that they had their phone next to them, that they escalated appropriately, that they set up the calls, that they knew how to operate their tools, all of that kind of stuff.

Anurag Gupta: So that's kind of baseline stuff. Now, you have to keep doing this as well, otherwise you lose the muscle memory as new people join and they've only seen quiet on-calls. And one way that I used to do that is that every time we launched a new region, we'd do a bunch of game days. That's a good time to emulate availability events, by crashing processes, or restarting nodes, or changing your network settings so you've lost access to underlying systems, stuff like that. And you wouldn't tell people what was going on underneath the covers, but you'd let them discover it through the process, and you could evaluate your fault tolerance as well as your people's ability to find the right issue.
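
A minimal sketch of the kind of fault injection a game day might script: crash a process, restart a node, or cut access to a dependency on a randomly chosen host. The hostnames, service unit, and blocked CIDR are placeholders, and something like this should only ever target an explicitly scoped, non-production fleet.

```python
import random
import subprocess

GAME_DAY_HOSTS = ["app-101", "app-102", "app-103"]   # hypothetical, explicitly scoped fleet

def inject_fault(host):
    """Pick one failure mode and apply it without telling the on-call what happened."""
    fault = random.choice([
        ["sudo", "systemctl", "kill", "-s", "SIGKILL", "myservice"],              # crash a process
        ["sudo", "systemctl", "reboot"],                                          # restart the node
        ["sudo", "iptables", "-A", "OUTPUT", "-d", "10.0.42.0/24", "-j", "DROP"], # cut a dependency
    ])
    subprocess.run(["ssh", host] + fault, check=False)

inject_fault(random.choice(GAME_DAY_HOSTS))
```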

Daniel Bryant: This has been an awesome tour of the landscape, Anurag. If listeners would like to reach out to you, where can they find you online, Twitter, LinkedIn?

Anurag Gupta: Twitter or LinkedIn. I'm awgupta, A-W-G-U-P-T-A, on both platforms. And I'd love to talk with anybody about this sort of stuff, or distributed systems. I've been a data nerd for a long time, so data platforms would be fun too.

Daniel Bryant: Awesome, awesome. Many thanks to you for your time today, Anurag.

Anurag Gupta: Thank you so much, Daniel.
