Writing Sample: Why DevOps Needs to Embrace Tracing
Conway's Law
Programmer Melvin Conway wrote in 1967 that, in his experience, organizations which design
systems are constrained to produce designs that are copies of the communication structures of
those organizations. Usually, Conway's Law is brought up in a negative sense, as something to be
avoided, but when your systems and your organization mirror each other in some way, could there
be value in that?
For example, take an ant colony. Ants manage to coordinate a number of different activities,
whether it's hiding or gathering leaves; these particular kinds of ants do some really amazing
things, but they do it with essentially no central coordination. At the same time, they have a rigid
structure, but that structure is distributed, and I think we as DevOps engineers want to enable
exactly this sort of loosely coupled work across teams.
That means common tools, common practices, and best practices, and it's probably your job to help
deploy those tools and establish those processes. I've heard teams talk about creating a paved road,
that is, making an easy way for teams to adopt these kinds of tools and processes, not necessarily
constraining them, but offering the best options. So, Conway's Law can be a good thing, right? If we
want to optimize for velocity, for fast-moving teams, then we want those teams to be able to make
decisions independently of one another. To do that, they need to build strong abstractions between
not only their teams but also their services. They need to be loosely coupled.
I think we should be embracing Conway's Law. We should be shipping our org charts, and then we
should be thinking about how we can apply the same sorts of structural constraints not only to the
software, but to our organizations themselves.
Think about some of my favorite tools, some of which we use for coordination and orchestration,
for building and for deployment. These are all ways that we, as DevOps engineers or SREs, can
enable teams to do this, and maybe as a service owner you're using some of these tools. But here's
the thing about all these tools, and about this structure: your users probably don't care. Your user
had an error occur, but how do we map that back to the software that's actually behind it? Is it part
of the client? Is it part of the API gateway, or is it further back in one of the other backends? You
care about this structure because you and other people in your organization need to be able to
figure out why that error occurred in order to correct it.
DevOps and Distributed Systems
I think in a lot of ways DevOps and distributed architectures go hand in hand, whether that's
microservices, serverless, or some other service-oriented architecture. Those architectures all
enable a kind of distributed ownership; they enable teams to really own their own services, end to
end. At the same time, there are a lot of great tools out there that automate distributed
infrastructure management, whether that's Kubernetes or various serverless technologies, and
these tools all amplify your work as a DevOps engineer.
But what do we lose here? Sure, we've created these really strong abstractions, we've built loosely
coupled teams, but what about end-to-end visibility? What about understanding what our users are
seeing? Is that something that we can solve with metrics? What about not just in our software, but
in our organization? How do we communicate across teams? How do we find the right team to
communicate with in the middle of an incident?
How do we recover the visibility that we lost, that end-to-end understanding of our systems? With
distributed tracing. In the end, we want distributed tracing to be about end-to-end observability,
but we have to start somewhere, and there's still value to be had in a single service.
I want to start with your users: understand what they care about, how that maps into things like
service level indicators, and how we use symptoms to decide when to take action. Once we've
decided that we need to take action, how do we connect cause and effect? Whether that's
performing root cause analysis when something's on fire, or when we're trying to be a little more
proactive and plan some performance optimization work.
Distributed Tracing 101
Here is a quick review for those of you who don't know a lot about distributed tracing, to bring you
up to speed. Tracing is a diagnostic that reveals a couple of things. First, how a set of services
coordinate to handle an end user request, not by end user request. I mean, all the way from mobile
or browser to backends to databases. Traces will also include some other important metadata,
which might take the form of logs or tags or other annotations.
The point of all of this is to provide a request-centric view. Inasmuch as the architectures we're
choosing enable teams and services to work independently, tracing is a bit of a foil to that: it allows
us to take the opposite perspective and understand things from where our users are sitting.
To clarify, there are a bunch of other kinds of tracing: process tracing, kernel tracing, and browser
tracing. Those are all great things, but most of them focus on an individual piece of your
application. With distributed tracing, we're looking at the bigger picture, along with tags that tell us
something about the context we're running in, like the host or the communication protocol.
A trace represents an end-to-end request; that's a single trace. A trace consists of maybe one, but
maybe thousands, of individual spans.
A span is the work done, or some of the work done, by an individual service as part of that request.
In the end, it's really just a timed event plus some metadata: it has a duration associated with it and
some additional information that tells us about what happened.
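To make that concrete, here's a rough sketch of what creating a span might look like using the OpenTelemetry Python API; the service, operation, and attribute names are placeholders I've made up for illustration, not anything from the text above.

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def do_charge():
    pass  # stand-in for the actual work being timed

# A span is just a timed event plus metadata: it records when the work
# started and ended, and carries tags describing what happened.
with tracer.start_as_current_span("charge-customer") as span:
    span.set_attribute("http.method", "POST")       # protocol metadata
    span.set_attribute("customer.id", "acme-corp")  # request metadata
    do_charge()
```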
If you want a little more background on this, look up Dapper. That was a system built at Google by
one of my co-founders to do this kind of tracing, and they wrote a nice technical report about how
they set it up and their experience running it. It's pretty interesting to see what happened at scale.
How many spans are enough, and which are the right ones? Where do I start? Do I need all of these
traces? There's potentially a lot of data here, you might be concerned about overhead, and if we're
going to pick and choose among these traces, which are the ones that matter?
Service Levels and TLAs
As a service owner, who are your users? I hope you know that. What do they expect from you? How
are they going to measure your success? If you don't hit that, if you don't meet those expectations,
what will they do? What will they do if you fail?
One way we can formalize the answers to those questions is through service levels.
A service level indicator (SLI) is what you measure, and that could be something like latency. It
could also be uptime or correctness. There's a lot of different things that you might measure.
From that, you might define a service level objective (SLO). Not every SLI has an SLO, but if the SLI
is really important, hopefully you have a target, a goal that you're trying to meet. For example, that
might be keeping p99 latency below one second, achieving 99.9% uptime, or maintaining 100%
correctness.
Finally, if we want to go all the way, we might then define a service level agreement (SLA). This is
what happens if you miss the objective, if you fail to meet it. An agreement is really about the
consequences, so this is really getting toward what your users will expect when you fail to meet it.
That can take the form of a refund or a cancellation. Not all SLOs will have SLAs, but SLAs can be
important because they help you understand how to drive investment toward SLOs and SLIs.
High Percentiles
Picture a really simple distributed system: say a document search system that fans out across a
bunch of shards. To keep things simple, focus on the services at the bottom of the stack. Their
average latency is one millisecond, but their 99th-percentile latency is one second. That means one
percent of requests are a thousand times slower than the average request.
Now, if an end-user request hits only one of those bottom services, then of course one percent of
those end-user requests will also take a second or more. The thing that happens here, which is
maybe a little subtle but really important to understand, is that as the complexity of handling those
requests increases, and more of these service instances at the bottom of the stack have to
participate in each request, this 99th percentile will have a bigger and bigger impact.
Say that it required one hundred of these bottom-of-the-stack services to participate in answering
one of these search queries. If that's the case, then 63 percent of end-user requests are going to take
more than a second. Even though only one percent of requests at the bottom of the stack are slow,
more than half of the requests that users make will take more than one second.
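The arithmetic behind that 63 percent is just the chance that at least one of the fanned-out calls lands in the slow tail. Here's a minimal sketch, assuming for simplicity that the shards behave independently:

```python
# Probability that an end-user request is slow when it fans out to N
# backend calls, each with a 1% chance of landing in the slow tail.
def p_slow_request(n_calls, p_slow_call=0.01):
    return 1.0 - (1.0 - p_slow_call) ** n_calls

print(p_slow_request(1))    # ~0.01 -> 1% of user requests are slow
print(p_slow_request(100))  # ~0.63 -> roughly 63% take more than a second
```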
As our systems increase in complexity, these rare behaviors get magnified through that complexity.
So, one of the really important reasons to measure 99th-percentile latency, not just at the top of the
stack but throughout, is to understand how it's going to be magnified, because it's going to have a
big impact as we go up the stack.
Best Practices for Performance SLIs
Some of the best practices around choosing performance SLIs include setting the right scope, which
is really important. It might be easy to think, “I own a service, so I'm just going to set an SLI for that
service.”
Latency is important for my users, so I'm going to measure that. But think for a second about the
different kinds of things your service does. For example, say you have two endpoints, one that reads
and one that writes. Chances are the read one is going to be a lot faster than the write one. Now, if
you try to set a single SLI to cover both of those, either you're going to set it so wide that you won't
really see or understand performance changes in the faster read operation, or you're going to set it
so tight that you're constantly missing your SLO, not because anything's really wrong, but just
because writes take longer. In that case, it might be better to set two SLIs, one for your read
operation and one for your write operation, so that you can measure performance in a way that's
meaningful for both of them.
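As a sketch of that scoping advice, computing the SLI per operation rather than service-wide might look something like this; the operation names, latencies, and targets are all made up for illustration:

```python
from collections import defaultdict

# (operation, latency in ms) pairs -- illustrative numbers only.
requests = [
    ("read", 8), ("read", 12), ("read", 15), ("read", 95),
    ("write", 180), ("write", 240), ("write", 310), ("write", 900),
]

# Separate targets for each operation, rather than one service-wide SLO.
slo_p99_ms = {"read": 100, "write": 1000}

by_operation = defaultdict(list)
for operation, latency_ms in requests:
    by_operation[operation].append(latency_ms)

for operation, latencies in by_operation.items():
    ordered = sorted(latencies)
    p99 = ordered[max(0, round(0.99 * len(ordered)) - 1)]  # nearest-rank percentile
    print(f"{operation}: p99 = {p99} ms, within SLO: {p99 <= slo_p99_ms[operation]}")
```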
Once you've chosen the right scope, it's about choosing the right kinds of performance measures,
and the place I would start is high-percentile latency and error rate. Those are things that many
users will care about, and they're great indicators of the overall health of what you're doing.
Beyond that, you might also measure something like throughput, the number of requests that
you're satisfying in a given time period. It's a little bit harder to set an SLO around this, and it's
independent of the behavior of your users, but it might be something that you still want to measure,
just to look at trends and see if it's going up or down quickly.
The fourth thing on this list is saturation: what percentage of your resources are you utilizing to
serve traffic right now? This is really important for getting ahead of things, for understanding when
you might need to increase the size of a database or scale out to a new set of shards. I qualified this
list as performance SLIs. Yes, uptime and correctness are also really important, but because we
typically achieve those through different means, we end up measuring them in different ways.
Symptoms Versus Causes
Another way of thinking about what we measure is to make an analogy to medicine. When we think
about our health, and whether or not we are healthy, one of the first things we do is measure
symptoms. Symptoms are things that are easy to observe, but at the same time indicate that
something is not the way it should be.
The easiest example of this is measuring body temperature and pulse, because those are super easy
to measure, noninvasive, and they're actually great indicators of other things happening. From a
software point of view, these symptoms will usually be pretty related to your SLOs. These are things
that your users care about, and luckily there aren't that many of them, so, measuring them and
observing them is not particularly hard.
The second piece of this is to try to understand the causes of those problems when something has
gone wrong; causes are things that you can take action to address in some way. To continue the
analogy, these are the diagnoses. And, as in medicine, there are typically too many of these to count,
so we don't start by looking for individual diseases or illnesses. We start with body temperature.
When we're talking about software, we shouldn't immediately start by asking whether a particular
host is overloaded; we should start by asking whether we're serving traffic in a way that meets our
users' needs. One of the things about choosing symptoms that are easy to observe is that they
should also be robust to changes in your software. We want the symptoms to stay stable over time,
so that we don't have to adapt them every time we roll out a new version of our software. The
causes, on the other hand, are going to change all the time. So another important distinction here is
how robust, or how stable, these things are, and where they fit in this model.
Know Your Users
Your users might be your application's end users, or they might be other services within your
application, and what those users expect will tell you what you need to measure. It'll tell you what
symptoms matter, and then, by extension, what you should be alerting on. Usually, these are pretty
simple things, like: does your service work or not? But it's important to understand that.
I worked at a company where we were doing what I think of now as low-frequency trading. It was
really replacing a pen-and-paper process for financial trading, and we would talk about uptime and
latency and things like that, but it turned out that this set of users really didn't care about uptime.
We were replacing a process that was so slow and so manual that we could have our service down
for twenty minutes in the middle of a trading day and no one would blink an eye. That was just
business as usual for them. What did matter to them was correctness: making the right trades. For
us, that meant we invested a lot in testing and in being able to reproduce interesting conditions
within the market. We didn't worry quite as much about availability, because that was something
we didn't really need to have an SLA around, and therefore didn't need to invest in.
Change Drives Outages
Say we're measuring the right things and something's gone wrong. When something goes wrong,
the easiest way to think about it is in terms of change. Change drives outages, and there are lots of
different kinds of changes. One set of changes is internal to your service: you just rolled something
new out, whether that's a new version of the code or a new configuration. This is pretty common,
it's definitely important, and I think it's something that everyone understands and expects.
Then there are a bunch of external kinds of changes. That might be your end users, or it might be
other services within your application: those users are doing something different, maybe good,
maybe bad, but different from your point of view. Another set of external changes might be the
infrastructure itself. You might be competing for resources with other services, or even with
yourself.
The last set of changes is upstream changes. That might be any of these same things, new versions,
configuration changes, or resource starvation, happening to some of your dependencies. Ultimately,
in any of these cases, your goal is to explain the variation in performance, that is, the performance
of your service, in terms of one of these changes, because that'll tell you what you need to do next.
Users
As operators, it can sometimes be a little hard to empathize with our users, but they are critical to
our business. They are our business. Users can do lots of really interesting things that can alter your
business. Maybe they thought of a new way of using your service that you hadn't considered and
hadn't load tested for. When they do, that has an impact on your service. Maybe you just had a viral
tweet, and a lot of people are coming to use your service that you hadn't expected. That's great;
these are all things that you want. You want your users to be engaged, and now the question is just
understanding how that affects your service's performance.
Here is an example from LightStep; we use LightStep to monitor LightStep itself. This is a piece of a
bigger trace, narrowed down to one individual service, a service we call liveview. Liveview does a
couple of different things: not only showing live views, but also loading historical data, so let that be
a lesson to you when naming services. What we observed here is that liveview was responsible for
a slow page load. In effect, liveview has an SLO with other parts of our service: it needs to serve
these queries within a reasonable period of time to support rendering the page. And 3.65 seconds is
too long; it's outside of that.
Now we have an example to look at, to understand why things went wrong. If we look within the
trace, it turns out the time is in processing: there are 227 spans below this one that contributed to
loading and processing these histograms. We can see that we were able to load the data pretty fast,
it came from storage in 194 milliseconds, and the rest of the time turned out to be in processing the
data itself. It didn't turn out to be the kind of thing that makes for a great demo.
I'm sure you have all done this sort of manual grepping through logs, looking at a bunch of different
metrics, trying to understand the size of the data. It turned out, in all of that, that we were able to
isolate the problem to a particular project ID, which corresponds to one of our customers. That
customer was just behaving a little bit differently than a lot of our other customers, and they were
seeing much poorer performance than those other customers. The reason was that they simply had
much bigger traces. They had a lot of traces that fit within this query, and there was just a lot of
processing going on.
This is actually something that happened, and it really drove us to think about how we could better
support this use case. How can we automate this process of grepping through these logs and
tracking down all these metrics? We started building these different hypotheses and then trying to
map those hypotheses back to the data that we had.
Shared Resources
The second kind of change that can happen here has, at some level, nothing to do with your services
themselves. It's just that isolation isn't perfect. Threads still run on CPUs, containers still run on
hosts. Maybe we haven't provisioned them exactly the way that we want to. Databases still
ultimately have to provide shared access. These are all reasons that requests can be slow that have
nothing to do with that request itself.
So, to take another example from LightStep: this service, named bento, is responsible for managing
some metadata within LightStep. Bento's SLO says that it needs to serve this metadata quite fast,
because it's used in serving other critical traffic, including on our ingress path. Therefore, 1.29
seconds is way out of SLO for us. Looking at this portion of the trace means understanding,
essentially, when requests come into bento and when they go back out. Looking at it, we'd see that
of that 1.29 seconds, 1.1 seconds, almost the whole thing, was spent getting data from a SQL
database. So maybe we need a new index here, or maybe there's some other writer misbehaving on
this table and we can figure out how to isolate the readers and writers a little bit better, or maybe
we need to build a cache.
The goal here is to try to understand what makes these requests different, right? Most of them are
not taking this long. This is an outlier, but it's still important to understand: like the one percent at
the bottom of the stack, it'll have a much bigger impact higher up. To understand what's different,
you can look not just at a single trace, but at a collection of traces, and try to understand what
defines that set and what distinguishes different parts of it. One way to visualize this is to look not
at a single trace but at a histogram, a latency histogram describing the latency of a set of these
traces.
There were a bunch of requests within bento that took on the order of a millisecond, but there are
some that are quite a bit slower, taking as much as a second to process. They are rare, but there are
a bunch of them out there within a sample of 100,000 spans. Even the 95th percentile is somewhere
on the order of a couple hundred milliseconds, still two orders of magnitude slower than the fastest
requests.
We want to understand, within this population, what makes the slow ones slow and what makes
them different from the fast ones. We can do an automated analysis, where we look at different
characteristics of these spans and of the traces that they're part of.
The answer here, in the end, is not the database. This is metadata; it's not stuff that changes that
often, so we really shouldn't have another fast writer that's contending, and the idea that we need
an index is actually a little bit far-fetched. What we found was that this was contention for CPU on
the host that bento was running on. We have a tag that describes the host, and the number 0.63 is
an indication of how strong the correlation is between the occurrence of this tag and the traces
being in this slow subset.
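I don't want to claim this is exactly how the product computes that score, but as a rough sketch of the idea, a correlation between "trace carries this host tag" and "trace is in the slow subset" could be computed something like this, with made-up data:

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences of 0/1 values."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (pstdev(xs) * pstdev(ys))

# One entry per trace: (carries the host tag, is in the slow subset). Made-up data.
traces = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
has_tag = [tag for tag, _ in traces]
is_slow = [slow for _, slow in traces]

print(round(correlation(has_tag, is_slow), 2))  # 0.5 -> moderately associated
```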
What we were able to do in this case, by looking at a bunch of traces in an automated fashion, was
find a hypothesis and then try to validate it, and it wasn't what we might have thought of. Even as
experienced engineers, our intuition can often be wrong about what's going on. In this case, we
went to another tool and verified that this particular host was in fact overloaded, partly because of
some other work that bento was doing and probably because of some other things that were
happening on the host, and that explained what was going on.
To go one step beyond p99 and other high-percentile latencies, the shape of the distribution is also
important. You can see there are a bunch of different behaviors present within bento; each of these
different bumps within the histogram could represent a different problem, a different opportunity
for optimization.
Instrumentation
What's the right granularity of spans? Should we have one span for each user request as it passes
through a service? Two, or a hundred? And even within those spans, what are the right tags? There
are some simple guidelines we can take away from these kinds of examples.
For ingress operations, that is, where we're serving our users: understand the peer, where is this
request coming from? Record important request info; that might be something about the customer,
the segment they're in, or the geographic location they're coming from. Then, the response code is
also really important, so we can tell whether this was a successful response or not.
For egress operations, that is, calls to our upstream dependencies, it's a similar set of things: not
only important parts of our request, but important parts of the response. Not just the response
code, was this a success or a failure, but how big was the response? Was this a query that returned
no results, or one million?
Then, within a particular service, look at large components or libraries. Often they'll have
domain-specific tags for this kind of thing; for example, we were doing some significant processing
of these histograms within the liveview service, and that warranted its own set of spans. More
generally, add spans in any other place where failure is likely and where performance might be
unpredictable. The idea is that these spans are the signals that'll help us figure out what's going on
later, and obviously having too many of them is going to increase the cost of doing distributed
tracing.
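To make those guidelines concrete, here's a hedged sketch of ingress and egress instrumentation using the OpenTelemetry Python API. The handler shapes, attribute names, and values are illustrative, not a standard schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("liveview")  # hypothetical service name

def handle_ingress(request):
    # Ingress: the span for serving one of our users.
    with tracer.start_as_current_span("GET /histograms") as span:
        span.set_attribute("peer.address", request["client_ip"])    # where it came from
        span.set_attribute("customer.id", request["customer_id"])   # important request info
        response = load_histograms(request)
        span.set_attribute("http.status_code", response["status"])  # success or failure
        return response

def load_histograms(request):
    # Egress: the span for calling an upstream dependency (e.g. the trace store).
    with tracer.start_as_current_span("storage.query") as span:
        span.set_attribute("db.statement", "SELECT ...")   # what we asked for
        rows = []  # stand-in for the real query results
        span.set_attribute("db.rows_returned", len(rows))  # how big the response was
        return {"status": 200, "rows": rows}
```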
So, focusing on the cases where failure is likely is important, as is having the right sorts of tags
there to do the kinds of correlation analysis I described for bento. The other question you might
have is: why not just add more dashboards? Why can't we just measure more of these things, for
each of the different SLIs we care about, for each of our different upstream services, for each of the
different large components we have internally?
Your goal is to explain that variation in performance. Doing so is not just a matter of enumerating
all of these things, because there are going to be hundreds of thousands of them, and as the
complexity of your system increases, there will be more and more of them. What's really important
is to be able to quickly identify the metrics that matter. That's where I think tracing is important,
because it ties these different tags and performance characteristics back to your SLIs.
Tracing enables you to see how your users' actions are impacting your SLIs, how resource
contention might be affecting them, and how the services you depend on might also be affecting
them. All of this has been about the more reactive situations.
The other thing that we might want to do is step back a little bit, and think about how we can look
at the bigger picture, try to see a little bit farther away, and see if we can be a little bit more
proactive in how we're approaching our work, and how we can do that with distributed tracing.
Being Proactive with Distributed Tracing
The first step is to establish a ground truth for what's happening in production. Once we do that,
step two is to make it better. I guess step three is profit. But really, I want to focus on step one. We
can do testing, and testing is important. We can do load testing. We can look at our pre-production
environments. But understanding what's happening in production, what our real users are seeing,
is what should really be driving any proactive work.
That could be thinking about how you're planning performance optimizations. Say you've got some
time set aside for the next couple of weeks to improve performance. Maybe that's a companywide
initiative, or it could be that you're thinking about bringing on a new SaaS provider, and you're
trying to evaluate them. Really understanding performance in production is important for both of
these situations.
Amdahl's Law
If we're trying to improve end user performance, we need to look at this from a holistic point of
view, which is where Amdahl’s Law comes in. Amdahl's Law is usually brought up in the context of
parallel computing, but really what Amdahl's Law's about is how optimization in parts of a bigger
task can impact the performance of that bigger task.
For example, think of time passing from left to right, and break a task down into two pieces: on the
left, service A does some work, and then on the right, service B does some work, and B takes much
longer. This might seem a little bit obvious, but if we improve service A, it's not really going to
matter that much. On the other hand, if we can improve service B by the same proportion, that'll
have a really big impact. So, given a choice between speeding up A and B, a 50% improvement in B
is going to have a bigger impact than a 50% improvement in A.
The other thing Amdahl's Law tells us is that no improvement in A will ever improve overall
performance by more than 15%, since A only accounts for about 15% of the total time in this
example. So, if we're looking for a big win, even if we could wave a magic wand and make service
A's work instantaneous, that's still only going to be a 15% improvement. This is all kind of obvious
once you have the data, but the thing that I see happen again and again is that we have an
organization-wide initiative to improve performance, service A goes off and says, "hey, we're going
to work on this, this is really important," and likewise service B says, "we're going to work on this,
this is really important." But service A should probably be spending their time on something else,
because they're not going to have a big impact on the company's overall performance.
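Here's the arithmetic behind that, as a small sketch. I'm assuming, as in the example, that service A accounts for roughly 15% of the end-to-end time, and interpreting a "50% improvement" as a 2x speedup:

```python
# Amdahl's Law: the overall speedup when a `fraction` of the end-to-end
# time is sped up by `local_speedup`x.
def overall_speedup(fraction, local_speedup):
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# A 2x speedup of service A (15% of the time) vs. service B (85%):
print(overall_speedup(0.15, 2.0))   # ~1.08x -> about 8% faster end to end
print(overall_speedup(0.85, 2.0))   # ~1.74x -> about 43% faster end to end

# Even an infinite speedup of A is capped by its 15% share:
print(overall_speedup(0.15, 1e12))  # ~1.18x -> at most ~15% less total time
```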
You can see this in a trace: there are basically two pieces to charging a customer. First, we do some
sort of fraud check to make sure they're on the up and up, and second, we actually go and process
the transaction through an external payment gateway. What Amdahl's Law tells us here is that
improving database performance isn't going to change things much. If we're going to spend time on
this, we need to think about how we're going to improve the performance of what the gateway is
doing. That's what's going to have an impact on our users.
A Recipe for Meaningful Performance Improvements
First, choose the thing that you're trying to improve. I said latency, but it's got to be a lot more
specific than that. Maybe you're talking about 50th-percentile latency, roughly the median case:
great, choose that. Maybe you're just worried about p99 latency. Maybe it's something else entirely,
but be super precise about what you're trying to improve.
Second, collect a bunch of traces within that population. Being specific is critical: if we're going to
try to improve p99 latency, we need to look at the one percent of traces that are the slowest. Then
we want to get enough traces within that sample to have a meaningful set we can draw some
conclusions from. Finally, we find the biggest contributors to the aggregate critical path.
So, we take the traces in that sample and essentially put them end to end. In my simple example
with the two pieces, service A is usually shorter and service B is usually longer. We take these
requests, lay them end to end, look at the whole thing, and then ask: how much faster could it be if
we improved A? How much faster could it be if we improved B?
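As a sketch of what that aggregation might look like, here's a minimal version; the trace format is invented purely for illustration:

```python
from collections import Counter

# Each trace in the slow population: critical-path milliseconds per service.
slow_traces = [
    {"service-a": 40, "service-b": 960},
    {"service-a": 55, "service-b": 1400},
    {"service-a": 30, "service-b": 870},
]

# Sum each service's contribution to the aggregate critical path.
totals = Counter()
for trace in slow_traces:
    totals.update(trace)

grand_total = sum(totals.values())
for service, ms in totals.most_common():
    print(f"{service}: {ms} ms ({100 * ms / grand_total:.0f}% of the aggregate critical path)")
```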
SaaS as a Service?
Another example is software as a service (SaaS), and thinking about SaaS as another service. SaaS is
a great option for a lot of things. We use SaaS for a lot of things at LightStep. It's a way for us to
focus on the places where we, as an engineering team, can provide the most value, and make the
operational concerns someone else's problem.
Storage, analysis, load balancing, there are a lot of great places where we can push that work on
someone else. The other place where you might want to consider this is if you want to offload some
particularly complex or tricky functionality, especially where there are things like compliance
involved. So, pushing off payments, things around fraud and abuse detection, or authentication,
stuff that's really tricky to get right.
But even with all of that, you can and should treat SaaS like any other service, and that means
starting just as you would if you were bringing up a new service within your application: thinking
about performance during design, prototyping, and evaluation. The same thing is true for SaaS.
Reading SLAs
I'm not a lawyer, but I do read SLAs, and when I'm designing a system they don't really tell me that
much. A typical SLA says that downtime of less than a minute doesn't really count. I know not every
provider writes them that way, but a minute is a long time, and if this is going to be on an
interactive request path, it's got to be a lot faster than that. So these SLAs really don't give you the
visibility that you need, and when you're doing this kind of evaluation, they often focus only on
uptime and correctness. You need to understand latency too, so we did this as part of evaluating
SaaS options at LightStep.
Part of what LightStep does, in addition to storing traces, is store some time series data, and we
were evaluating a couple of different options. One of them was to run Cassandra ourselves. This is
low-frequency time series data, so it doesn't demand super-high ingress. We did some
benchmarking of Cassandra and looked at latency across the distribution; for this read operation,
the p50 latency was about two milliseconds. We wanted to compare this against Spanner, and it was
interesting to see how the performance characteristics differed: the median latency on Spanner was
over three times higher, and the long-tail latency was much higher still. That was interesting for us
to see, and it made us think about how this would play out with this one percent of requests at the
bottom of the stack, and how that would bubble back up higher in the stack.
Measure Your SaaS... and Keep Measuring It
The takeaway in this is that SaaS is just like anything else. Measure it like any other service you
would have. If it's upstream, to you it matters, and you should be using best practices for things like
provisioning, understanding what you need from it. You want to instrument, even if you can't
instrument the SaaS itself, instrumenting the client libraries, the request you're making to it, and
monitoring it just like you would any of your other downstream dependencies.
DevOps and Distributed Systems Go Hand-in-Hand
I think DevOps and distributed systems are a great match. The software architecture behind things
like microservices and serverless is a great way to enable distributed ownership. There are a lot of
great tools in this space that help amplify the work that DevOps does. At the same time, distributed
tracing is a great way to regain that end-to-end visibility. I talked a lot about single services and the
visibility you can get just from mapping a single input to an output.
Distributed tracing is a way to automate organizational navigation. It tells you who you need to talk
to when something is going wrong. It helps you connect those inputs to outputs in a way that
matters; by understanding how your upstream dependencies are affecting your users.
Most importantly, know what your users care about, whether that's latency, errors, uptime, or
whatever else it may be. Know it, measure it, and alert on it, and then be able to connect variation in
those service level indicators back to root causes with tracing: understanding whether it's user
behavior that's changed, upstream services, SaaS performance, or some other contention. That's
true whether you're responding to an incident or trying to get ahead of things and improve baseline
performance.
We think about tracing not just as looking at individual traces. Tracing is really about using the data
found in traces to draw conclusions and validate hypotheses. Often, it's about understanding what's
normal and what's not, which is critical to understanding what's going on. Sometimes those
patterns will only emerge from meaningful samples: an individual trace might be a one-off, but by
looking at a group of them, we can take the right corrective action.