Writing Sample: Why DevOps Needs to Embrace Tracing
Conway's Law
Programmer Melvin Conway wrote in 1967 that, in his experience, organizations which design
systems are constrained to produce designs that are copies of the communication structures of
those organizations. Usually, Conway's Law is brought up in a negative sense, as something to be
avoided, but when your systems and your organization mirror each other in some way, could there
be value in that?
For example, take an ant colony. Ants manage to coordinate a number of different activities,
whether it's hiding or gathering leaves; these particular kinds of ants do some really amazing
things, but they do it with essentially no central coordination. At the same time, they have a rigid
structure, but that structure is distributed, and I think we as DevOps engineers want to enable
exactly this sort of loosely coupled work across teams.
That means common tools, common practices, and best practices, and it's probably your job to help
deploy those tools and establish those processes. I've heard teams talk about creating a paved road,
that is, making an easy way for teams to adopt these kinds of tools and processes, not necessarily
constraining them, but offering the best options. So, Conway's Law can be a good thing, right? If we
want to optimize for velocity, for fast-moving teams, then we want those teams to be able to make
decisions independently of one another. To do that, they need to build strong abstractions between
not only their teams but also their services. They need to be loosely coupled.
I think we should be embracing Conway's Law. We should be shipping our org charts, and then we
should be thinking about how we can apply the same sorts of structural constraints not only to the
software, but to our organizations themselves.
Think about some of my favorite tools, some of which we use for coordination and orchestration,
for building and for deployment. These are all ways that we, as DevOps engineers or SREs, can
enable teams to do this, and maybe as a service owner you're using some of these tools. But here's
the thing about all these tools, and about this structure: your users probably don't care. Your user
had an error occur, but how do we map that back to the software that's actually behind it? Is it part
of the client? Is it part of the API gateway, or is it further back in one of the other backends? You
care about this structure because you and other people in your organization need to be able to
figure out why that error occurred in order to correct it.
DevOps and Distributed Systems
I think in a lot of ways DevOps and distributed architectures go hand in hand, whether that's
microservices, serverless, or some other service-oriented architecture. Those architectures all
enable a kind of distributed ownership; they enable teams to really own their own services, end to
end. At the same time, there are a lot of great tools out there that automate distributed
infrastructure management, whether that's Kubernetes or various serverless technologies, and
these tools all amplify your work as a DevOps engineer.
But what do we lose here? Sure, we've created these really strong abstractions, we've built loosely
coupled teams, but what about end-to-end visibility? What about understanding what our users are
seeing? Is that something that we can solve with metrics? What about not just in our software, but
in our organization? How do we communicate across teams? How do we find the right team to
communicate with in the middle of an incident?
How do we recover the visibility that we lost, that end-to-end understanding of our systems? With
distributed tracing. In the end, we want distributed tracing to be about end-to-end observability,
but we have to start somewhere, and there's still value to be had in a single service.
I want to start with your users: understand what they care about, how that maps into things like
service level indicators, and how we use symptoms to decide when to take action. Once we've
decided that we need to take action, how do we connect cause and effect? Whether that's
performing root cause analysis when something's on fire, or when we're trying to be a little more
proactive and plan some performance optimization work.
Distributed Tracing 101
Here is a quick review for those of you who don't know a lot about distributed tracing, to bring you
up to speed. Tracing is a diagnostic that reveals a couple of things. First, how a set of services
coordinate to handle an end user request, not by end user request. I mean, all the way from mobile
or browser to backends to databases. Traces will also include some other important metadata,
which might take the form of logs or tags or other annotations.
The point of all of this is to provide a request-centric view. Inasmuch as the architectures we're
choosing enable teams and services to work independently, tracing is a bit of a foil to that: it allows
us to take the opposite perspective and understand things from where our users are sitting.
To clarify, there are a bunch of other kinds of tracing: process tracing, kernel tracing, and browser
tracing. Those are all great things, but most of them focus on an individual piece of your
application. With distributed tracing, we're looking at the bigger picture, along with tags that tell us
something about the context we're running in, like the host or the communication protocol.
A trace represents an end-to-end request; that's a single trace. A trace consists of maybe one, but
maybe thousands, of individual spans.
A span is the work done, or some of the work done, by an individual service as part of that request.
In the end, it's really just a timed event plus some metadata: it has a duration associated with it and
some additional information that tells us about what happened.
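To make that concrete, here's a rough sketch of what creating a span might look like using the OpenTelemetry Python API; the service, operation, and attribute names are placeholders I've made up for illustration, not anything from the text above.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def do_charge():
    pass  # stand-in for the actual work being timed

# A span is just a timed event plus metadata: it records when the work
# started and ended, and carries tags describing what happened.
with tracer.start_as_current_span("charge-customer") as span:
    span.set_attribute("http.method", "POST")       # protocol metadata
    span.set_attribute("customer.id", "acme-corp")  # request metadata
    do_charge()
```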
If you want a little more background on this, look up Dapper. That was a system built at Google by
one of my co-founders to do this kind of tracing, and they wrote a nice technical report about how
they set it up and their experience running it. It's pretty interesting to see what happened at scale.
How many spans are enough, and which are the right ones? Where do I start? Do I need all of these
traces? There's potentially a lot of data here, you might be concerned about overhead, and if we're
going to pick and choose among these traces, which are the ones that matter?
Service Levels and TLAs
As a service owner, who are your users? I hope you know that. What do they expect from you? How
are they going to measure your success? If you don't hit that, if you don't meet those expectations,
what will they do? What will they do if you fail?
One way we can formalize the answers to those questions is through service levels.
A service level indicator (SLI) is what you measure, and that could be something like latency. It
could also be uptime or correctness. There's a lot of different things that you might measure.
From that, you might define a service level objective (SLO). Not every SLI has an SLO, but if the SLI
is really important, hopefully you have a target, a goal that you're trying to meet. For example, that
might be keeping p99 latency below one second, achieving 99.9% uptime, or maintaining 100%
correctness.
Finally, if we want to go all the way, we might then define a service level agreement (SLA). This is
what happens if you miss the objective, if you fail to meet it. An agreement is really about the
consequences, so this is really getting toward what your users will expect when you fail to meet it.
That can take the form of a refund or a cancellation. Not all SLOs will have SLAs, but SLAs can be
important because they help you understand how to drive investment toward SLOs and SLIs.
High Percentiles
Picture a really simple distributed system: say a document search system that fans out across a
bunch of shards. To keep things simple, focus on the services at the bottom of the stack. Their
average latency is one millisecond, but their 99th-percentile latency is one second. That means one
percent of requests are a thousand times slower than the average request.
Now, if an end-user request hits only one of those bottom services, then of course one percent of
those end-user requests will also take a second or more. The thing that happens here, which is
maybe a little subtle but really important to understand, is that as the complexity of handling those
requests increases, and more of these service instances at the bottom of the stack have to
participate in each request, this 99th percentile will have a bigger and bigger impact.
Say that it required one hundred of these bottom-of-the-stack services to participate in answering
one of these search queries. If that's the case, then 63 percent of end-user requests are going to take
more than a second. Even though only one percent of requests at the bottom of the stack are slow,
more than half of the requests that users make will take more than one second.
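The arithmetic behind that 63 percent is just the chance that at least one of the fanned-out calls lands in the slow tail. Here's a minimal sketch, assuming for simplicity that the shards behave independently:

```python
# Probability that an end-user request is slow when it fans out to N
# backend calls, each with a 1% chance of landing in the slow tail.
def p_slow_request(n_calls, p_slow_call=0.01):
    return 1.0 - (1.0 - p_slow_call) ** n_calls

print(p_slow_request(1))    # ~0.01 -> 1% of user requests are slow
print(p_slow_request(100))  # ~0.63 -> roughly 63% take more than a second
```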
As our systems increase in complexity, these rare behaviors get magnified through that complexity.
So, one of the really important reasons to measure 99th-percentile latency, not just at the top of the
stack but throughout, is to understand how it's going to be magnified, because it's going to have a
big impact as we go up the stack.
Best Practices for Performance SLIs
Some of the best practices around choosing performance SLIs include setting the right scope, which
is really important. It might be easy to think, “I own a service, so I'm just going to set an SLI for that
service.”
Latency is important for my users, so I'm going to measure that. But think for a second about the
different kinds of things your service does. For example, say you have two endpoints, one that reads
and one that writes. Chances are the read one is going to be a lot faster than the write one. Now, if
you try to set a single SLI to cover both of those, either you're going to set it so wide that you won't
really see or understand performance changes in the faster read operation, or you're going to set it
so tight that you're constantly missing your SLO, not because anything's really wrong, but just
because writes take longer. In that case, it might be better to set two SLIs, one for your read
operation and one for your write operation, so that you can measure performance in a way that's
meaningful for both of them.
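As a sketch of that scoping advice, computing the SLI per operation rather than service-wide might look something like this; the operation names, latencies, and targets are all made up for illustration:

```python
from collections import defaultdict

# (operation, latency in ms) pairs -- illustrative numbers only.
requests = [
    ("read", 8), ("read", 12), ("read", 15), ("read", 95),
    ("write", 180), ("write", 240), ("write", 310), ("write", 900),
]

# Separate targets for each operation, rather than one service-wide SLO.
slo_p99_ms = {"read": 100, "write": 1000}

by_operation = defaultdict(list)
for operation, latency_ms in requests:
    by_operation[operation].append(latency_ms)

for operation, latencies in by_operation.items():
    ordered = sorted(latencies)
    p99 = ordered[max(0, round(0.99 * len(ordered)) - 1)]  # nearest-rank percentile
    print(f"{operation}: p99 = {p99} ms, within SLO: {p99 <= slo_p99_ms[operation]}")
```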
Once you've chosen the right scope, it's about choosing the right kinds of performance measures,
and the place I would start is high-percentile latency and error rate. Those are things that many
users will care about, and they're great indicators of the overall health of what you're doing.
Beyond that, you might also measure something like throughput, the number of requests that
you're satisfying in a given time period. It's a little bit harder to set an SLO around this, and it's
independent of the behavior of your users, but it might be something that you still want to measure,
just to look at trends and see if it's going up or down quickly.
The fourth thing on this list is saturation: what percentage of your resources are you utilizing to
serve traffic right now? This is really important for getting ahead of things, for understanding when
you might need to increase the size of a database or scale out to a new set of shards. I qualified this
list as performance SLIs. Yes, uptime and correctness are also really important, but because we
typically achieve those through different means, we end up measuring them in different ways.
Symptoms Versus Causes
Another way of thinking about what we measure is to make an analogy to medicine. When we think
about our health, and whether or not we are healthy, one of the first things we do is measure
symptoms. Symptoms are things that are easy to observe, but at the same time indicate that
something is not the way it should be.
The easiest example of this is measuring body temperature and pulse, because those are super easy
to measure, noninvasive, and they're actually great indicators of other things happening. From a
software point of view, these symptoms will usually be pretty related to your SLOs. These are things
that your users care about, and luckily there aren't that many of them, so, measuring them and
observing them is not particularly hard.
The second piece of this is to try to understand the causes of those problems when something has
gone wrong; causes are things that you can take action to address in some way. To continue the
analogy, these are the diagnoses. And, as in medicine, there are typically too many of these to count,
so we don't start by looking for individual diseases or illnesses. We start with body temperature.
When we're talking about software, we shouldn't immediately start by asking whether a particular
host is overloaded; we should start by asking whether we're serving traffic in a way that meets our
users' needs. One of the things about choosing symptoms that are easy to observe is that they
should also be robust to changes in your software. We want the symptoms to stay stable over time,
so that we don't have to adapt them every time we roll out a new version of our software. The
causes, on the other hand, are going to change all the time. So another important distinction here is
how robust, or how stable, these things are, and where they fit in this model.
Know Your Users
Your users might be your application's end users, or they might be other services within your
application, and what those users expect will tell you what you need to measure. It'll tell you what
symptoms matter, and then, by extension, what you should be alerting on. Usually, these are pretty
simple things, like: does your service work or not? But it's important to understand that.
I worked at a company where we were doing what I think of now as low-frequency trading. It was
really replacing a pen-and-paper process for financial trading, and we would talk about uptime and
latency and things like that, but it turned out that this set of users really didn't care about uptime.
We were replacing a process that was so slow and so manual that we could have our service down
for twenty minutes in the middle of a trading day and no one would blink an eye. That was just
business as usual for them. What did matter to them was correctness: making the right trades. For
us, that meant we invested a lot in testing and in being able to reproduce interesting conditions
within the market. We didn't worry quite as much about availability, because that was something
we didn't really need to have an SLA around, and therefore didn't need to invest in.
Change Drives Outages
Say we're measuring the right things and something's gone wrong. When something goes wrong,
the easiest way to think about it is in terms of change. Change drives outages, and there are lots of
different kinds of changes. One set of changes is internal to your service: you just rolled something
new out, whether that's a new version of the code or a new configuration. This is pretty common,
it's definitely important, and I think it's something that everyone understands and expects.
Then there are a bunch of external kinds of changes. That might be your end users, or it might be
other services within your application: those users are doing something different, maybe good,
maybe bad, but different from your point of view. Another set of external changes might be the
infrastructure itself. You might be competing for resources with other services, or even with
yourself.
The last set of changes is upstream changes. That might be any of these same things, new versions,
configuration changes, or resource starvation, happening to some of your dependencies. Ultimately,
in any of these cases, your goal is to explain the variation in performance, that is, the performance
of your service, in terms of one of these changes, because that'll tell you what you need to do next.
Users
As operators, it can sometimes be a little hard to empathize with our users, but they are critical to
our business. They are our business. Users can do lots of really interesting things that can alter your
business. Maybe they thought of a new way of using your service that you hadn't considered and
hadn't load tested for. When they do, that has an impact on your service. Maybe you just had a viral
tweet, and a lot of people are coming to use your service that you hadn't expected. That's great;
these are all things that you want. You want your users to be engaged, and now the question is just
understanding how that affects your service's performance.
Here is an example from LightStep; we use LightStep to monitor LightStep itself. This is a piece of a
bigger trace, narrowed down to one individual service, a service we call liveview. Liveview does a
couple of different things: not only showing live views, but also loading historical data, so let that be
a lesson to you when naming services. What we observed here is that liveview was responsible for
a slow page load. In effect, liveview has an SLO with other parts of our service: it needs to serve
these queries within a reasonable period of time to support rendering the page. And 3.65 seconds is
too long; it's outside of that.
Now we have an example to look at, to understand why things went wrong. If we look within the
trace, it turns out the time is in processing: there are 227 spans below this one that contributed to
loading and processing these histograms. We can see that we were able to load the data pretty fast,
it came from storage in 194 milliseconds, and the rest of the time turned out to be in processing the
data itself. It didn't turn out to be the kind of thing that makes for a great demo.
I'm sure you have all done this sort of manual grepping through logs, looking at a bunch of different
metrics, trying to understand the size of the data. It turned out, in all of that, that we were able to
isolate the problem to a particular project ID, which corresponds to one of our customers. That
customer was just behaving a little bit differently than a lot of our other customers, and they were
seeing much poorer performance than those other customers. The reason was that they simply had
much bigger traces. They had a lot of traces that fit within this query, and there was just a lot of
processing going on.
This is actually something that happened, and it really drove us to think about how we could better
support this use case. How can we automate this process of grepping through these logs and
tracking down all these metrics? We started building these different hypotheses and then trying to
map those hypotheses back to the data that we had.
Shared Resources
The second kind of change that can happen here has, at some level, nothing to do with your services
themselves. It's just that isolation isn't perfect. Threads still run on CPUs, containers still run on
hosts. Maybe we haven't provisioned them exactly the way that we want to. Databases still
ultimately have to provide shared access. These are all reasons that requests can be slow that have
nothing to do with that request itself.
So, to take another example from LightStep: this service, named bento, is responsible for managing
some metadata within LightStep. Bento's SLO says that it needs to serve this metadata quite fast,
because it's used in serving other critical traffic, including on our ingress path. Therefore, 1.29
seconds is way out of SLO for us. Looking at this portion of the trace means understanding,
essentially, when requests come into bento and when they go back out. Looking at it, we'd see that
of that 1.29 seconds, 1.1 seconds, almost the whole thing, was spent getting data from a SQL
database. So maybe we need a new index here, or maybe there's some other writer misbehaving on
this table and we can figure out how to isolate the readers and writers a little bit better, or maybe
we need to build a cache.
The goal here is to try to understand what makes these requests different, right? Most of them are
not taking this long. This is an outlier, but it's still important to understand: like the one percent at
the bottom of the stack, it'll have a much bigger impact higher up. To understand what's different,
you can look not just at a single trace, but at a collection of traces, and try to understand what
defines that set and what distinguishes different parts of it. One way to visualize this is to look not
at a single trace but at a histogram, a latency histogram describing the latency of a set of these
traces.
There were a bunch of requests within bento that took on the order of a millisecond, but there are
some that are quite a bit slower, taking as much as a second to process. They are rare, but there are
a bunch of them out there within a sample of 100,000 spans. Even the 95th percentile is somewhere
on the order of a couple hundred milliseconds, still two orders of magnitude slower than the fastest
requests.
We want to understand, within this population, what makes the slow ones slow and what makes
them different from the fast ones. We can do an automated analysis, where we look at different
characteristics of these spans and of the traces that they're part of.
The answer here, in the end, is not the database. This is metadata; it's not stuff that changes that
often, so we really shouldn't have another fast writer that's contending, and the idea that we need
an index is actually a little bit far-fetched. What we found was that this was contention for CPU on
the host that bento was running on. We have a tag that describes the host, and the number 0.63 is
an indication of how strong the correlation is between the occurrence of this tag and the traces
being in this slow subset.
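I don't want to claim this is exactly how the product computes that score, but as a rough sketch of the idea, a correlation between "trace carries this host tag" and "trace is in the slow subset" could be computed something like this, with made-up data:

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of 0/1 values."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# One entry per trace: (carries the host tag, is in the slow subset). Made-up data.
traces = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
has_tag = [tag for tag, _ in traces]
is_slow = [slow for _, slow in traces]

print(round(correlation(has_tag, is_slow), 2))  # 0.5 -> moderately associated
```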
What we were able to do in this case, by looking at a bunch of traces in an automated fashion, was
find a hypothesis and then try to validate it, and it wasn't what we might have thought of. Even as
experienced engineers, our intuition can often be wrong about what's going on. In this case, we
went to another tool and verified that this particular host was in fact overloaded, partly because of
some other work that bento was doing and probably because of some other things that were
happening on the host, and that explained what was going on.
To go one step beyond p99 and other high-percentile latencies, the shape of the distribution is also
important. You can see there are a bunch of different behaviors present within bento; each of these
different bumps within the histogram could represent a different problem, a different opportunity
for optimization.
Instrumentation
What's the right granularity of spans? Should we have one span for each user request as it passes
through a service? Two, or a hundred? And even within those spans, what are the right tags? There
are some simple guidelines we can take away from these kinds of examples.
For ingress operations, that is, where we're serving our users: understand the peer, where is this
request coming from? Record important request info; that might be something about the customer,
the segment they're in, or the geographic location they're coming from. Then, the response code is
also really important, so we can tell whether this was a successful response or not.
For egress operations, that is, calls to our upstream dependencies, it's a similar set of things: not
only important parts of our request, but important parts of the response. Not just the response
code, was this a success or a failure, but how big was the response? Was this a query that returned
no results, or one million?
Then, within a particular service, look at large components or libraries. Often they'll have
domain-specific tags for this kind of thing; for example, we were doing some significant processing
of these histograms within the liveview service, and that warranted its own set of spans. More
generally, add spans in any other place where failure is likely and where performance might be
unpredictable. The idea is that these spans are the signals that'll help us figure out what's going on
later, and obviously having too many of them is going to increase the cost of doing distributed
tracing.
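To make those guidelines concrete, here's a hedged sketch of ingress and egress instrumentation using the OpenTelemetry Python API. The handler shapes, attribute names, and values are illustrative, not a standard schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("liveview")  # hypothetical service name

def handle_ingress(request):
    # Ingress: the span for serving one of our users.
    with tracer.start_as_current_span("GET /histograms") as span:
        span.set_attribute("peer.address", request["client_ip"])    # where it came from
        span.set_attribute("customer.id", request["customer_id"])   # important request info
        response = load_histograms(request)
        span.set_attribute("http.status_code", response["status"])  # success or failure
        return response

def load_histograms(request):
    # Egress: the span for calling an upstream dependency (e.g. the trace store).
    with tracer.start_as_current_span("storage.query") as span:
        span.set_attribute("db.statement", "SELECT ...")   # what we asked for
        rows = []  # stand-in for the real query results
        span.set_attribute("db.rows_returned", len(rows))  # how big the response was
        return {"status": 200, "rows": rows}
```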
So, focusing on the cases where failure is likely is important, as is having the right sorts of tags
there to do the kinds of correlation analysis I described for bento. The other question you might
have is: why not just add more dashboards? Why can't we just measure more of these things, for
each of the different SLIs we care about, for each of our different upstream services, for each of the
different large components we have internally?
Your goal is to explain that variation in performance. Doing so is not just a matter of enumerating
all of these things, because there are going to be hundreds of thousands of them, and as the
complexity of your system increases, there will be more and more of them. What's really important
is to be able to quickly identify the metrics that matter. That's where I think tracing is important,
because it ties these different tags and performance characteristics back to your SLIs.
Tracing enables you to see how your users' actions are impacting your SLIs, how resource
contention might be affecting them, and how the services you depend on might also be affecting
them. All of this has been about the more reactive situations.
The other thing that we might want to do is step back a little bit, and think about how we can look
at the bigger picture, try to see a little bit farther away, and see if we can be a little bit more
proactive in how we're approaching our work, and how we can do that with distributed tracing.
Being Proactive with Distributed Tracing
The first step is to establish a ground truth for what's happening in production. Once we do that,
step two is to make it better. I guess step three is profit. But really, I want to focus on step one. We
can do testing, and testing is important. We can do load testing. We can look at our pre-production
environments. But understanding what's happening in production, what our real users are seeing,
is what should really be driving any proactive work.
That could be thinking about how you're planning performance optimizations. Say you've got some
time set aside for the next couple of weeks to improve performance. Maybe that's a companywide
initiative, or it could be that you're thinking about bringing on a new SaaS provider, and you're
trying to evaluate them. Really understanding performance in production is important for both of
these situations.
Amdahl's Law
If we're trying to improve end user performance, we need to look at this from a holistic point of
view, which is where Amdahl’s Law comes in. Amdahl's Law is usually brought up in the context of
parallel computing, but really what Amdahl's Law's about is how optimization in parts of a bigger
task can impact the performance of that bigger task.
For example, think of time passing from left to right, and break a task down into two pieces: on the
left, service A does some work, and then on the right, service B does some work, and B takes much
longer. This might seem a little bit obvious, but if we improve service A, it's not really going to
matter that much. On the other hand, if we can improve service B by the same proportion, that'll
have a really big impact. So, given a choice between speeding up A and B, a 50% improvement in B
is going to have a bigger impact than a 50% improvement in A.
The other thing Amdahl's Law tells us is that no improvement in A will ever improve overall
performance by more than 15%, since A only accounts for about 15% of the total time in this
example. So, if we're looking for a big win, even if we could wave a magic wand and make service
A's work instantaneous, that's still only going to be a 15% improvement. This is all kind of obvious
once you have the data, but the thing that I see happen again and again is that we have an
organization-wide initiative to improve performance, service A goes off and says, "hey, we're going
to work on this, this is really important," and likewise service B says, "we're going to work on this,
this is really important." But service A should probably be spending their time on something else,
because they're not going to have a big impact on the company's overall performance.
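Here's the arithmetic behind that, as a small sketch. I'm assuming, as in the example, that service A accounts for roughly 15% of the end-to-end time, and interpreting a "50% improvement" as a 2x speedup:

```python
# Amdahl's Law: the overall speedup when a `fraction` of the end-to-end
# time is sped up by `local_speedup`x.
def overall_speedup(fraction, local_speedup):
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# A 2x speedup of service A (15% of the time) vs. service B (85%):
print(overall_speedup(0.15, 2.0))   # ~1.08x -> about 8% faster end to end
print(overall_speedup(0.85, 2.0))   # ~1.74x -> about 43% faster end to end

# Even an infinite speedup of A is capped by its 15% share:
print(overall_speedup(0.15, 1e12))  # ~1.18x -> at most ~15% less total time
```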
You can see this in a trace: there are basically two pieces to charging a customer. First, we do some
sort of fraud check to make sure they're on the up and up, and second, we actually go and process
the transaction through an external payment gateway. What Amdahl's Law tells us here is that
improving database performance isn't going to change things much. If we're going to spend time on
this, we need to think about how we're going to improve the performance of what the gateway is
doing. That's what's going to have an impact on our users.
A Recipe for Meaningful Performance Improvements
First, choose the thing that you're trying to improve. I said latency, but it's got to be a lot more
specific than that. Maybe you're talking about 50th-percentile latency, roughly the median case:
great, choose that. Maybe you're just worried about p99 latency. Maybe it's something else entirely,
but be super precise about what you're trying to improve.
Second, collect a bunch of traces within that population. Being specific is critical: if we're going to
try to improve p99 latency, we need to look at the one percent of traces that are the slowest. Then
we want to get enough traces within that sample to have a meaningful set we can draw some
conclusions from. Finally, we find the biggest contributors to the aggregate critical path.
So, we take the traces in that sample and essentially put them end to end. In my simple example
with the two pieces, service A is usually shorter and service B is usually longer. We take these
requests, lay them end to end, look at the whole thing, and then ask: how much faster could it be if
we improved A? How much faster could it be if we improved B?
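As a sketch of what that aggregation might look like, here's a minimal version; the trace format is invented purely for illustration:

```python
from collections import Counter

# Each trace in the slow population: critical-path milliseconds per service.
slow_traces = [
    {"service-a": 40, "service-b": 960},
    {"service-a": 55, "service-b": 1400},
    {"service-a": 30, "service-b": 870},
]

# Sum each service's contribution to the aggregate critical path.
totals = Counter()
for trace in slow_traces:
    totals.update(trace)

grand_total = sum(totals.values())
for service, ms in totals.most_common():
    print(f"{service}: {ms} ms ({100 * ms / grand_total:.0f}% of the aggregate critical path)")
```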
SaaS as a Service?
Another example is software as a service (SaaS), and thinking about SaaS as another service. SaaS is
a great option for a lot of things. We use SaaS for a lot of things at LightStep. It's a way for us to
focus on the places where we, as an engineering team, can provide the most value, and make the
operational concerns someone else's problem.
Storage, analysis, load balancing, there are a lot of great places where we can push that work on
someone else. The other place where you might want to consider this is if you want to offload some
particularly complex or tricky functionality, especially where there are things like compliance
involved. So, pushing off payments, things around fraud and abuse detection, or authentication,
stuff that's really tricky to get right.
But even with all of that, you can and should treat SaaS like any other service, and that means
starting just as you would if you were bringing up a new service within your application: thinking
about performance during design, prototyping, and evaluation. The same thing is true for SaaS.
Reading SLAs
I'm not a lawyer, but I do read SLAs, and when I'm designing a system they don't really tell me that
much. A typical SLA says that downtime of less than a minute doesn't really count. I know not every
provider writes them that way, but a minute is a long time, and if this is going to be on an
interactive request path, it's got to be a lot faster than that. So these SLAs really don't give you the
visibility that you need, and when you're doing this kind of evaluation, they often focus only on
uptime and correctness. You need to understand latency too, so we did this as part of evaluating
SaaS options at LightStep.
Part of what LightStep does, in addition to storing traces, is store some time series data, and we
were evaluating a couple of different options. One of them was to run Cassandra ourselves. This is
low-frequency time series data, so it doesn't demand super-high ingress. We did some
benchmarking of Cassandra and looked at latency across the distribution; for this read operation,
the p50 latency was about two milliseconds. We wanted to compare this against Spanner, and it was
interesting to see how the performance characteristics differed: the median latency on Spanner was
over three times higher, and the long-tail latency was much higher still. That was interesting for us
to see, and it made us think about how this would play out with this one percent of requests at the
bottom of the stack, and how that would bubble back up higher in the stack.
Measure Your SaaS... and Keep Measuring It
The takeaway in this is that SaaS is just like anything else. Measure it like any other service you
would have. If it's upstream, to you it matters, and you should be using best practices for things like
provisioning, understanding what you need from it. You want to instrument, even if you can't
instrument the SaaS itself, instrumenting the client libraries, the request you're making to it, and
monitoring it just like you would any of your other downstream dependencies.
DevOps and Distributed Systems Go Hand-in-Hand
I think DevOps and distributed systems are a great match. The software architecture behind things
like microservices and serverless is a great way to enable distributed ownership. There are a lot of
great tools in this space that help amplify the work that DevOps does. At the same time, distributed
tracing is a great way to regain that end-to-end visibility. I talked a lot about single services and the
visibility you can get just from mapping a single input to an output.
Distributed tracing is a way to automate organizational navigation. It tells you who you need to talk
to when something is going wrong. It helps you connect those inputs to outputs in a way that
matters; by understanding how your upstream dependencies are affecting your users.
Most importantly, know what your users care about, whether that's latency, errors, uptime, or
whatever else it may be. Know it, measure it, and alert on it, and then be able to connect variation in
those service level indicators back to root causes with tracing: understanding whether it's user
behavior that's changed, upstream services, SaaS performance, or some other contention. That's
true whether you're responding to an incident or trying to get ahead of things and improve baseline
performance.
We think about tracing not just as looking at individual traces. Tracing is really about using the data
found in traces to draw conclusions and validate hypotheses. Often, it's about understanding what's
normal and what's not, which is critical to understanding what's going on. Sometimes those
patterns will only emerge from meaningful samples: an individual trace might be a one-off, but by
looking at a group of them, we can take the right corrective action.