Kubernetes is a popular platform for running containers, providing a flexible environment to manage them at enormous scale. It’s also complicated, with a lot of moving parts that you’ll need to keep an eye on. Observability is key to understanding the inner workings of K8s platforms to ensure that they are meeting the demands of your workloads.
Observability vs Monitoring
Observability and monitoring are two terms that often get used interchangeably. There is no hard line between the two, although 'observability' has become the more fashionable term in recent years. Historically, monitoring referred to the process of collecting metric data from different components, defining thresholds for those metrics, and generating alerts when they were crossed. Observability expands on that, bringing in logs and traces, and a focus on making the data explorable by people and tools. Rather than just having an alert tell us something is broken, we want to be able to work with the data to find out what happened, what the impact was, and why it happened. We want to see what happens when we adjust a knob, and to compare the different types of traffic our app is handling to help troubleshoot an issue.
Monitoring tools tried to do it all - instrument, collect, store, analyze, and visualize data - but only for their particular domain. They typically didn't play nicely with others: importing or exporting data meant writing custom ETL pipelines, if it was supported at all. Getting a clear picture across your infrastructure meant using many different tools, each operating independently with its own unique set of data. These limitations became even more apparent with container technology.
The Cloud Native Computing Foundation (CNCF) was founded in 2015 by industry leaders to advance container technology and collaborate with the community of developers and end users on open source projects. Two projects brought into the CNCF were Prometheus and OpenTelemetry, which have played a huge role in the development of the practice of observability.
Prometheus describes itself as an "open source systems monitoring and alerting toolkit". It has extremely wide adoption and a rich feature set for collecting and storing metric data, with a robust query language for easy data extraction and a set of APIs for instrumenting code and pulling/pushing data.
The OpenTelemetry project aims to provide vendor- and tool-agnostic SDKs, APIs, and tools for instrumenting, ingesting, transforming, and transporting metrics, logs, and traces to observability backends. It is a merger of the now-sunset OpenCensus and OpenTracing projects.
Both of these projects are developed with containerized environments in mind, with deep support for deploying and using the tools easily in that space. While Prometheus only handles metric data, its feature set, API, and data model have made it the de facto standard for many organizations. OpenTelemetry, while not providing a backend to store data, typically works behind the scenes in an observability platform, doing the hard work of getting data from complex systems to the observability backends. Many open and closed source observability systems leverage OpenTelemetry in their own instrumentation and collection tooling.
Projects like Prometheus and OpenTelemetry make it easy to instrument and collect data from many sources. Almost too easy! Now the challenge is how to manage and use all this data effectively without being overwhelmed.
Common observability problem #1 - Incomprehensible graphs
Common observability problem #2 - Long lists of metrics without understanding their meaning and purpose
Common observability problem #3 - High costs
One major issue is metric cardinality.
In the Prometheus data model, each collected data point is made up of these components:
- Metric: The name of the quantity being collected
- Resource: The resource the metric is being collected for
- Label and value: A characteristic of the measurement. Each combination of label and value is referred to as a dimension.
For example, suppose the metric being collected is "requests". For "pod" resources, we measure it by different labels such as "code", "type", "mode" and so on. The "code" label can hold one of the standard HTTP status codes such as "200", "404", "429" and so on.
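To make that concrete, here is a hypothetical sample in the Prometheus exposition format (the pod name and label values are invented for illustration):

```
# HELP requests Total requests handled
# TYPE requests counter
requests{pod="web-7d4f9", code="200", type="read"}  1027
requests{pod="web-7d4f9", code="404", type="read"}    12
requests{pod="web-7d4f9", code="500", type="write"}    3
```

Each line is a distinct time series: the same metric name, but a different combination of label values.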
Each unique combination of metric, resource, labels, and values is a unique time series that must be collected and stored. It's common for a single metric to have dozens of dimensions, and hundreds or even thousands aren't unheard of. The total cardinality of a metric is then the product of:
- Number of resources reporting the metric
- The number of dimensions of the metric
For example, if collecting a metric with 50 dimensions from 1000 pods, the total cardinality is 50,000 time series. If that metric is collected at a 1-minute interval for a week, I'm collecting 50,000 series * 10,080 samples/week = 504 million data points. There's a cost to instrument, collect, and store those 504 million data points, and that's just for a single metric. Prometheus local storage takes around 2 to 3 bytes per data point, so this works out to roughly 1 to 1.5 GB per week of retained data.
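The arithmetic is simple enough to sketch; the dimension count, pod count, and bytes-per-sample figures below are the illustrative numbers from the estimate above, not measurements:

```python
# Back-of-the-envelope cardinality and storage estimate.
# All input figures are illustrative assumptions.
dimensions = 50
pods = 1000
series = dimensions * pods                # unique time series: 50,000

samples_per_week = 7 * 24 * 60            # one sample per minute
data_points = series * samples_per_week   # 504,000,000 per week

bytes_low, bytes_high = 2, 3              # assumed bytes per stored sample
storage_mb_low = data_points * bytes_low / 1e6
storage_mb_high = data_points * bytes_high / 1e6

print(series)                                       # 50000
print(data_points)                                  # 504000000
print(f"{storage_mb_low:.0f}-{storage_mb_high:.0f} MB")  # 1008-1512 MB
```

And that is per metric; the total grows linearly with every additional metric you scrape.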
That cardinality has a huge benefit during the troubleshooting process. Inspecting the value of a metric by different labels can provide huge insight into an issue - provided I understand what the different labels and values mean. But as the scale of the environment expands, the storage requirements alone grow quickly. Similar problems arise when the data is used for analysis and visualization - it takes time to query the data, return the result, and display it.
The first instinct when starting out with these tools is to collect everything available from every endpoint by default. Many common containerized applications provide an endpoint that can be scraped for metrics, and collectors deployed in the cluster will automatically identify these endpoints and collect all available metrics. Enabling this endpoint, if it isn't already on by default, is typically a matter of updating a configuration parameter that results in a few annotations on the resource.
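For illustration, many collectors recognize a convention-based set of annotations along these lines (the annotation keys are a widespread convention rather than a Kubernetes standard, and the port and path values here are assumptions about the app):

```yaml
# Convention-based scrape annotations; support depends on your collector's config.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    prometheus.io/scrape: "true"    # opt this pod in to scraping
    prometheus.io/port: "9090"      # port serving the metrics endpoint
    prometheus.io/path: "/metrics"  # path of the metrics endpoint
spec:
  containers:
  - name: app
    image: example/app:latest
```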
The downside of this approach is that there is no control over the volume of metrics being collected. Many components will produce hundreds of metrics for each pod, with cardinality in the dozens. In anything but the smallest of environments, this quickly becomes untenable.
A more curated approach based on your needs is required at scale.
K8s Platform Observability
As K8s platform owners, our goals for observability of our K8s platform are:
- Good visibility of core functions
- Ability to drill deeper as needed
- Responsive UI - No long pauses and timeouts
The core functions the platform provides to our tenants are:
- Container lifecycle management - Creating pods, scaling, etc
- Providing CPU, memory, disk, and network resources to pods
- Accessibility to the network
Finally, the key components that provide those functions are:
- Nodes - Host pods, provide CPU, memory, disk, and network resources
- Control plane components
- kubelet, kube-apiserver, kube-scheduler, kube-controller-manager, coredns, etcd
If running your own Kubernetes cluster, the next steps in building out an observability platform would be to:
- Install the Prometheus components in your cluster
- Update the configuration of each control plane component to enable its metrics endpoint
- Configure the Prometheus collector to identify and scrape each endpoint. Comb through the list of metrics and filter out the ones you don't want if there are too many.
We can use the concept of Golden Signals to identify which metrics to focus on. For the kube-apiserver, for example, we'd start with metrics covering traffic, errors, and latency.
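As a sketch (not necessarily the exact list the author had in mind), golden-signal queries against the API server's well-known metrics might look like this; the 5-minute windows are arbitrary choices:

```promql
# Traffic: request rate by verb
sum by (verb) (rate(apiserver_request_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# Latency: 99th percentile request duration
histogram_quantile(0.99,
  sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))
```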
This seems like a lot of work. Can’t someone else do it?
Most certainly! Unless you are doing it the hard way, you are probably using a Kubernetes environment from a managed service provider such as Google Cloud's GKE, or a PaaS offering like OpenShift. These offerings will typically provide a curated collection of metrics for you, although you may need to flip a few switches to get them. Cloud service providers also typically offer a managed Prometheus service to save you the time and effort of managing Prometheus yourself.
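On GKE, for instance, flipping those switches can be as simple as the following (a sketch assuming an existing cluster named my-cluster in us-central1; check the current gcloud reference for your version):

```shell
# Enable curated control plane metrics on an existing GKE cluster
gcloud container clusters update my-cluster \
  --location=us-central1 \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER

# Enable Google's Managed Service for Prometheus
gcloud container clusters update my-cluster \
  --location=us-central1 \
  --enable-managed-prometheus
```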
Once configured, this will collect a curated list of metrics from the kube-apiserver, kube-scheduler, and kube-controller-manager, along with some node-level metrics. We can also enable support for Google's Managed Service for Prometheus offering, which allows us to collect metrics from our workloads using Prometheus-style scraping and use them in Cloud Operations.
Once that’s enabled, you’ll get additional options for viewing detailed health and status metrics for your clusters. You’ll also be able to work with new metrics in Cloud Operations and build your own dashboards and alerts.
A quick note about dashboards. The default ones provided by your service provider are great places to start, but they typically focus on individual clusters. You'll want to build your own as your Kubernetes environment grows. Creating good dashboards is a delicate balancing act. Here are a few tips:
- Consider the number of data points a dashboard displays. Each data point needs to be queried from the backend and rendered, and too many will make the dashboard sluggish and hard to use. Use aggregation functions as much as possible.
- Focus on your Golden Signals at first and default to a high level, aggregate view of your environment. Create separate, more detailed dashboards for each component.
- Use grouping and aggregate functions to provide drill-down capability to individual clusters, nodes, components, namespaces, etc.
- Not every metric needs to go on a dashboard somewhere. In fact, most won't! Your base set of dashboards just needs to point you in the right direction when diagnosing a problem. Don't expect to have a dashboard pinpointing the root cause every time; be comfortable creating dashboards ad hoc when troubleshooting.
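As one example of the aggregate-first approach, a cluster overview panel might roll node CPU usage up to one series per node rather than plotting every core and mode (the metric name is the standard node_exporter one; the label names depend on your setup):

```promql
# One series per node instead of one per CPU core/mode
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```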
Once you've established a solid basecamp around the golden signals for those key components, it's time to explore further by learning the different metrics, labels, and values available to you. These can help with drilling deeper into complex issues to pinpoint where the source of the problem lies.
For example, for the metric apiserver_request_total, we can use the different labels to pinpoint which verb/resource combination is the source of all those 500 errors.
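A sketch of that drill-down (the metric and labels are real kube-apiserver ones; the top-10 cutoff and 5-minute window are arbitrary):

```promql
# Which verb/resource combinations are producing the 500s?
topk(10,
  sum by (verb, resource) (
    rate(apiserver_request_total{code="500"}[5m])))
```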
What about logs and traces?
For brevity's sake in an already long blog post, I focused on the metrics aspect of observability. For general observability of a system, metrics will be the most important. However, logs and traces provide an entirely different level of knowledge about a system, especially when getting down to code-level issues.
The ecosystem that has grown around Kubernetes has provided all the necessary components for building a complete monitoring and observability platform. The open nature of the ecosystem allows DevOps teams to choose when to build their own and when to invest in off the shelf solutions and services.