Monitoring Kubernetes clusters at scale with Thanos
In one of my previous blogs, I showed how you can set up a monitoring and alerting system on your Kubernetes cluster using Prometheus, with a great visualization layer on top of it using Grafana. This is a great system on its own; however, it does have some issues, like:
- What if Prometheus goes down? And what if that happens at a critical moment? This can be handled by deploying multiple instances of Prometheus, but there would still be a need for a global view across them
- What if you need to analyse old monitoring data? Prometheus keeps data for only 15 days by default, so there is a need for a separate database and query path to keep the historical data
- What if your architecture scales to hundreds of clusters/environments? In that case, there would be many Prometheus deployments and, again, a need for a global view.
The first solution that comes to mind is to create a monitoring cluster which ingests data from multiple clusters and/or multiple Prometheus deployments on a single cluster. This works for a small, static number of clusters, but as the architecture scales, the ingestion of metrics from all these clusters would eventually overwhelm the monitoring cluster.
There are a lot of issues when it comes to scaling and creating a highly available system for monitoring. Thanos was created to solve all of them.
Thanos, a CNCF Sandbox project, is a set of components that can be composed into a highly available metric system with unlimited storage capacity. It can be added seamlessly on top of existing Prometheus deployments. Also, it provides a dashboard very similar to Prometheus, so the learning curve is gentle.
So in this blog, we will try to understand the different components of Thanos. Thanos provides a global query view, data backup, and historical data access as its core features. All three features can be run independently of each other, which lets you adopt a subset of Thanos for immediate benefit or testing, and grow into the rest gradually in more complex environments.
Scalability
Starting with how Thanos helps with growing clusters and Prometheus deployments, let’s take a look at the following diagram.
So here we can see that instead of ingesting the metrics from the monitored clusters (leaf clusters) into a common monitoring cluster and overwhelming it, a sidecar is installed on each leaf cluster which serves the metrics through a StoreAPI, and the monitoring cluster queries them. The StoreAPI is a generic gRPC API implemented by almost all Thanos components, allowing them to fetch metrics from various systems and backends.
The Sidecar, as the name suggests, is deployed together with Prometheus. Apart from exposing Prometheus metrics as a common Thanos StoreAPI, it performs other tasks like watching for configuration and Prometheus rules (alerting or recording) and notifying Prometheus to reload them dynamically.
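To make this concrete, here is a minimal sketch of how a sidecar might sit next to Prometheus in the same pod. The image versions, ports and volume names are placeholders for your own setup, not a prescribed layout.

```yaml
# Sketch: Prometheus pod with a Thanos sidecar attached (illustrative values).
containers:
  - name: prometheus
    image: prom/prometheus
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.min-block-duration=2h   # keep blocks small so the
      - --storage.tsdb.max-block-duration=2h   # sidecar can upload them promptly
      - --web.enable-lifecycle                 # allow reloads to be triggered
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.23.1
    args:
      - sidecar
      - --prometheus.url=http://localhost:9090   # Prometheus in the same pod
      - --tsdb.path=/prometheus                  # shared TSDB directory
      - --grpc-address=0.0.0.0:10901             # StoreAPI is served here
      - --http-address=0.0.0.0:10902
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
```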
The Querier component (also called “Query”) running on the monitoring cluster is essentially a vanilla PromQL Prometheus engine that fetches the data from any service that implements Thanos StoreAPI. This means that Querier exposes the Prometheus HTTP v1 API to query the data in a common PromQL language. This allows compatibility with Grafana or other consumers of Prometheus’ API.
To optimize further, each sidecar also tells the Querier which metrics it has, so that the Querier runs the query only on the relevant clusters.
Now, there is no need to ingest metrics into the monitoring cluster, which solves the scalability issue.
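As a rough sketch, the Querier on the monitoring cluster only needs to know where the StoreAPI endpoints live; the DNS names below are hypothetical and would be replaced by your own service discovery.

```yaml
# Query component on the monitoring cluster; sidecar endpoints discovered
# via DNS SRV records here, but static addresses work too.
containers:
  - name: thanos-query
    image: quay.io/thanos/thanos:v0.23.1
    args:
      - query
      - --http-address=0.0.0.0:9090   # Prometheus-compatible HTTP API; point Grafana here
      - --grpc-address=0.0.0.0:10901
      - --store=dnssrv+_grpc._tcp.thanos-sidecar.leaf-cluster-1.example.com
      - --store=dnssrv+_grpc._tcp.thanos-sidecar.leaf-cluster-2.example.com
```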
High Availability
The most well-known way of achieving high availability is replication, and that is how we achieve HA in this setup as well. Multiple deployments of the Query component ensure a fallback mechanism in case one of them fails. Any external component like Grafana can then round-robin between the Thanos Query instances.
Now, talking about high availability on the leaf clusters, multiple Prometheus instances can be deployed on each of them. In such a case, a Query component also runs on the leaf cluster, which effectively moderates the growth in StoreAPI endpoints and query calls by creating a two-staged call over the sidecars. Such a setup is shown in the image below.
Now, we have seen how we can achieve high availability on both the monitoring as well as the leaf cluster.
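One detail worth noting for HA Prometheus pairs: if each replica carries a distinguishing external label (the label name `replica` below is just a convention I am assuming), the Querier can deduplicate their overlapping series at query time.

```yaml
# In each Prometheus replica's prometheus.yml (only the replica value differs):
global:
  external_labels:
    cluster: leaf-1
    replica: "0"

# On the Query component, name the label that marks a replica so duplicate
# series from the HA pair are collapsed into one:
args:
  - query
  - --http-address=0.0.0.0:9090
  - --query.replica-label=replica
  - --store=dnssrv+_grpc._tcp.thanos-sidecar.leaf-1.svc
```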
Managing storage
The name Thanos comes from a short form of the Greek personal name Athanasios, literally meaning 'immortal', which makes sense here: Thanos lets you store your data indefinitely and query it, hence making your monitoring records immortal.
Sidecars push time-series database (TSDB) blocks into object storage. Object stores can be really flexible and cheap, so they are a great choice for storing long-term historical data. On the monitoring cluster runs a Store gateway component, which reads the data from the object store and serves it back over the StoreAPI.
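For illustration, the object storage configuration is a small YAML file shared by every component that touches the bucket; the S3 details below are placeholders.

```yaml
# bucket.yaml — consumed via --objstore.config-file by the sidecar,
# Store gateway and Compactor. Bucket name and credentials are placeholders.
type: S3
config:
  bucket: thanos-metrics
  endpoint: s3.us-east-1.amazonaws.com
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>

# Store gateway on the monitoring cluster, serving the bucket over the StoreAPI:
args:
  - store
  - --data-dir=/var/thanos/store            # local cache for index headers
  - --objstore.config-file=/etc/thanos/bucket.yaml
  - --grpc-address=0.0.0.0:10901
```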
Now we have our storage in place where we can collect the monitoring data, but at the pace Prometheus scrapes metrics (commonly every 15 seconds) we will fill it really soon, and that is the smaller problem. The major problem is that querying this data over long ranges can be really heavy, and just the transfers can have really high latency. To handle this, Thanos has a Compactor component. The Compactor compacts the blocks in object storage and downsamples them into 5-minute and 1-hour resolutions alongside the raw, Gorilla-compressed samples, so that long-range queries stay fast without losing the accuracy of aggregate results.
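A sketch of what that might look like; the retention values are arbitrary examples, and note that only one Compactor should run against a given bucket.

```yaml
# Compactor: compacts blocks and creates 5m/1h downsampled series in the bucket.
args:
  - compact
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/bucket.yaml
  - --retention.resolution-raw=30d   # keep raw samples for 30 days
  - --retention.resolution-5m=180d   # keep 5-minute downsampled data for 180 days
  - --retention.resolution-1h=0d     # keep 1-hour downsampled data forever (0 = unlimited)
  - --wait                           # run continuously rather than as a one-off job
```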
This model is good to go when there is a limited number of clusters, but as clusters grow and get added and deleted on the fly, it becomes complex because you have to reconfigure it each time. So we need some way in which, instead of pulling metrics into the storage, the leaf clusters could push them. However, this brings us back to the same issue where we started: scaling ingestion. The solution for this lies in hash rings.
The hash ring is formed by instances of what we will call the Receive component, which acts as local memory storing the metrics arriving via remote-write requests. Incoming writes are then load-balanced amongst the instances. Note that the data for a single time series always goes to the same replicas to maintain coherence.
Now, there can be a case where a replica fails, and so the incoming data destined for that replica would no longer be accepted. To overcome that, Receive can be configured with a replication factor: for example, each write is sent to 3 replicas, and if it is not accepted by at least 2 of them, it is not considered a successful transmission. This is just an example of the configuration and can be tuned to your preference.
As shown in the image above, the newly added Receive component takes data from the leaf clusters. To make this data long-term and durable, the Receive instances also write it up to the object storage.
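Roughly, each leaf Prometheus is pointed at the Receive hash ring via remote write, and the Receive instances replicate the data and upload blocks to the bucket; the endpoint and hash ring file names below are assumptions for illustration.

```yaml
# On each leaf cluster's prometheus.yml — push samples to the Receive endpoint:
remote_write:
  - url: http://thanos-receive.monitoring.example.com:19291/api/v1/receive

# Receive component on the monitoring cluster. With a replication factor of 3,
# a write must be acknowledged by a quorum of replicas to count as successful.
args:
  - receive
  - --remote-write.address=0.0.0.0:19291
  - --grpc-address=0.0.0.0:10901
  - --tsdb.path=/var/thanos/receive
  - --objstore.config-file=/etc/thanos/bucket.yaml     # ship blocks to long-term storage
  - --receive.replication-factor=3
  - --receive.hashrings-file=/etc/thanos/hashrings.json
  - --label=receive_replica="$(POD_NAME)"              # distinguish replicas
```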
Recording and Alerting rules
Like Prometheus, custom rules can be added for recording and alerting in Thanos as well, through what is called the Rule component. Typically, the Rule component runs on the monitoring cluster, which is, in my opinion, the right place for recording rules. When it comes to alerting rules, they should stay close to the data they alert on, so they are better placed on the leaf clusters.
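A minimal sketch of a Rule deployment, assuming a Querier and an Alertmanager reachable at the (hypothetical) addresses below:

```yaml
# Rule component: evaluates standard Prometheus rule files against the global
# query view and uploads the resulting blocks to object storage.
args:
  - rule
  - --data-dir=/var/thanos/rule
  - --rule-file=/etc/thanos/rules/*.yaml
  - --eval-interval=30s
  - --query=thanos-query.monitoring.svc:9090
  - --alertmanagers.url=http://alertmanager.monitoring.svc:9093
  - --objstore.config-file=/etc/thanos/bucket.yaml
  - --label=rule_replica="0"
```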
So what changed with Thanos
Now that we have discussed all the components of Thanos, let’s see what changed in the architecture.
The image above shows the different components of Prometheus. So if we look at them one by one,
- The Scrape Engine is effectively taken over by the Sidecars and Receive arrangement
- The Compactor of Prometheus is enhanced to a Compactor and Store system
- The Query Engine is now the Query component
- And the Rule Engine is now the Rule component
And now the whole setup looks something like this
Parting note
I hope my explanation helps you understand the working of the Thanos system more effectively and helps you scale your existing Prometheus monitoring system more efficiently. To read more on deploying and using Thanos, visit the official site.