In our previous article we explored how to deploy Grafana and Prometheus on AKS. However, while Prometheus is an excellent way to monitor your AKS cluster, it lacks the functionality to retain metrics long term and query historical data.
Prometheus is a metrics collector: it can scrape metrics from almost any source, and it can be used to generate graphs. It is not, however, a data storage solution, and it isn't focused on keeping data for historical purposes. Prometheus ships with its own storage layer and supports both on-disk storage and remote locations, but while there are plenty of options to configure, the data management capabilities are limited. This is where Thanos can help.
There is also a configuration overhead when we want to scale our Prometheus deployment and make it highly available. This generally involves a federated set-up and a shared storage solution which, in Kubernetes, is usually accomplished with a shared PersistentVolume.
Thanos is an open-source Prometheus setup with long-term storage capabilities. It is not a new implementation of Prometheus, but a set of pre-built components designed for production environments where long-term storage is needed.
Storing metrics for long-term use requires storing them in a way that is optimized for that use: retention periods may be unlimited, with ever-growing storage requirements. This raises two issues. First, unlimited data requires storage that scales accordingly. Second, the more data we keep, the slower querying becomes; downsampling and compaction techniques are used to reduce the size of the data and improve query times.
Thanos provides all of this and more, out of the box, in a single binary. You don't need to run every feature either: you can deploy just a subset of the Thanos components in your cluster.
Thanos currently requires Prometheus v2.2.1+ and, optionally, an object store if you want to keep your data in a remote location. Several object storage clients are supported; we will be using Azure Blob Storage.
Thanos is composed of several components:
- Thanos Sidecar: runs alongside the Prometheus server to gather the metrics that are stored on disk. It's composed of a StoreAPI and a Shipper; the Shipper is responsible for sending the metrics to the object storage.
- Thanos Store Gateway: responsible for querying the object storage and exposing a StoreAPI that is queried by the other components.
- Query Layer: provides all the components required to query the data, including the Web UI and the API.
- Compactor: reads from object storage and compacts the data that hasn't been compacted yet. It's completely independent of the other components.
- Ruler: provides the rule API used to evaluate recording and alerting rules, sending alerts on to the Prometheus Alertmanager.
The overall architecture can be described as follows (via: Thanos Quick Tutorial).
Thanos uses a mix of HTTP and gRPC requests. HTTP requests are mostly used to query Prometheus, whilst gRPC requests are mostly used within Thanos' Store API.
We'll be deploying our metrics server to an Azure Kubernetes Service (AKS) cluster. If you don’t have a running AKS cluster, take a look at the quickstart to Deploy an Azure Kubernetes Service cluster using the Azure CLI.
We'll also need an Azure Storage account; you can create one using the Azure Portal or the Azure CLI. You will also need the storage account access key, which can be retrieved using the Azure CLI.
Create a storage account.
az storage account create --name <name> --resource-group <resource-group>
Create a storage container called metrics.
az storage container create --name metrics --account-name <name>
Retrieve the storage account access key for later use.
az storage account keys list --account-name <name> --resource-group <resource-group> -o tsv --query "[0].value"
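If you'd like to keep the key handy for the later steps, one option (assuming a bash-like shell; the variable name is just for illustration) is to capture the output of that same command in a shell variable:
STORAGE_ACCOUNT_KEY=$(az storage account keys list --account-name <name> --resource-group <resource-group> -o tsv --query "[0].value")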
Thanos is designed to scale and extend vanilla Prometheus. Start by creating the Prometheus configuration that you will use to deploy the kube-prometheus-stack.
First, create a file called prometheus.yaml.
# prometheus.yaml
grafana:
  adminPassword: admin
prometheus:
  thanosService:
    enabled: true
  thanosServiceMonitor:
    enabled: true
    interval: 5s
  prometheusSpec:
    thanos:
      objectStorageConfig:
        key: thanos.yaml
        name: thanos-objstore-config
prometheusOperator:
  thanosImage:
    repository: quay.io/thanos/thanos
    version: v0.23.0
    tag: v0.23.0
kubelet:
  serviceMonitor:
    https: false
In this file, we're configuring the Prometheus Operator to use the Thanos image, and we're setting a password for the Grafana admin user (this is optional; the default password is prom-operator). We're also enabling the Thanos service and its ServiceMonitor (for Thanos' own metrics).
The thanos key holds the object storage configuration: the remote location that Thanos will upload the metrics to. In our case, this is the Azure Storage account that we created earlier. The full Thanos configuration for Azure object storage can be found here, but we'll only need a handful of options.
Create a thanos.yaml file locally.
# thanos.yaml
type: AZURE
config:
  storage_account: '<storage-account-name>'
  storage_account_key: '<storage-account-key>'
  container: 'metrics'
Replace <storage-account-name> with your storage account name and <storage-account-key> with the storage account access key you retrieved earlier.
Make sure you are authenticated to your Azure Kubernetes Service cluster (e.g. via az aks get-credentials) before running the kubectl commands below.
Create a new namespace called monitoring.
kubectl create ns monitoring
Make sure you are in the same directory as thanos.yaml, then create a secret called thanos-objstore-config.
kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=thanos.yaml
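To double-check that the secret contains what you expect, you can decode it back out (the backslash in the jsonpath escapes the dot in the file name):
kubectl -n monitoring get secret thanos-objstore-config -o jsonpath='{.data.thanos\.yaml}' | base64 -d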
Add the helm repo for the Prometheus Community Helm Charts.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Run the helm installation for the Prometheus Operator.
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack -n monitoring --values prometheus.yaml
After a while, you'll see that the Prometheus Operator is installing the Prometheus and Thanos components. Check out the result by listing the pods.
kubectl --namespace monitoring get pods
You'll see that there are three containers inside the prometheus-prometheus-kube-prometheus-prometheus-0 pod; one of them is thanos-sidecar. By default it will back up all the blocks that Prometheus generates to our Azure Storage account every two hours.
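You can watch the sidecar at work by tailing its logs; once the first block is cut, you should see upload activity referencing our object storage:
kubectl -n monitoring logs prometheus-prometheus-kube-prometheus-prometheus-0 -c thanos-sidecar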
Installing the sidecar is the first step towards a complete Thanos deployment. For a complete solution, you'll also want to install the Querier, the Store, and the Compactor. Each of these components is installed as a separate deployment with its own network configuration.
In the following sections you will create multiple Kubernetes manifests. After you create each file, apply it to your cluster using the kubectl apply command (e.g. kubectl apply -f FILENAME.yaml).
The Querier is the layer that allows us to query all Prometheus instances at once. It needs a Deployment that points at all the sidecars, and its own Service so that it can be discovered and used.
Create the querier Deployment (don't forget to run kubectl apply -f querier-deployment.yaml).
# querier-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
  labels:
    app: thanos-query
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.23.0
          args:
            - 'query'
            - '--log.level=debug'
            - '--query.replica-label=prometheus_replica'
            - '--store=prometheus-kube-prometheus-thanos-discovery.monitoring.svc:10901'
          resources:
            requests:
              cpu: '100m'
              memory: '64Mi'
            limits:
              cpu: '250m'
              memory: '256Mi'
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: cluster
              containerPort: 10900
Note that the image name and version must match the image name and version we deployed with the Helm chart above. When we enabled the thanosService option, we created a discovery Service that allows us to query all the Prometheus instances in the cluster from a single point; this is the Service we point the Querier's --store flag at.
We also need a Service to expose the Querier, along with its metrics, so we can monitor the component itself using a ServiceMonitor. These can be split across multiple files, or delimited with --- in the same file as we have done below.
Create the Service and ServiceMonitor.
# querier-service-servicemonitor.yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-query
  labels:
    app: thanos-query
    release: prometheus-operator
    jobLabel: thanos
  namespace: monitoring
spec:
  selector:
    app: thanos-query
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http-query
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-thanos-query
  namespace: monitoring
spec:
  jobLabel: thanos
  selector:
    matchLabels:
      app: thanos-query
  namespaceSelector:
    matchNames:
      - 'monitoring'
  endpoints:
    - port: http-query
      path: /metrics
      interval: 5s
This tells Prometheus to scrape metrics from the /metrics path on port 10902 of the Querier application (exposed through the Service on port 9090). Lastly, the ServiceMonitor lets the Prometheus Operator pick up our Querier as a scrape target.
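Apply the manifest as before:
kubectl apply -f querier-service-servicemonitor.yaml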
Upon running the above, you’ll see that the Querier is being installed.
Take a look at the logs for that deployment.
kubectl -n monitoring logs deploy/thanos-query
You should see output as follows:
level=debug ts=2022-01-01T19:56:05.927270012Z caller=main.go:65 msg="maxprocs: Updating GOMAXPROCS=[1]: using minimum allowed GOMAXPROCS"
ts=2022-01-01T19:56:05.928679219Z caller=log.go:168 level=debug msg="Lookback delta is zero, setting to default value" value=5m0s
level=info ts=2022-01-01T19:56:05.932400335Z caller=options.go:27 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-01-01T19:56:05.93349084Z caller=query.go:618 msg="starting query node"
level=debug ts=2022-01-01T19:56:05.933743641Z caller=endpointset.go:320 component=endpointset msg="starting to update API endpoints" cachedEndpoints=0
level=debug ts=2022-01-01T19:56:05.933774441Z caller=endpointset.go:323 component=endpointset msg="checked requested endpoints" activeEndpoints=0 cachedEndpoints=0
level=info ts=2022-01-01T19:56:05.933864742Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2022-01-01T19:56:05.933887542Z caller=http.go:63 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2022-01-01T19:56:05.934207943Z caller=log.go:168 service=http/server component=query level=info msg="TLS is disabled." http2=false
level=info ts=2022-01-01T19:56:05.934254744Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2022-01-01T19:56:05.934281344Z caller=grpc.go:127 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
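Optionally, you can also port-forward the Querier's Service (port 9090, as defined above) and open localhost:9090 to browse the Thanos Query UI and confirm the sidecar appears under Stores:
kubectl -n monitoring port-forward svc/thanos-query 9090:9090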
The Store runs alongside the Querier to bring the data from our object storage into queries. It is composed of a StatefulSet and the object storage configuration, which we previously created as a secret.
Create a StatefulSet.
# store-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    app: thanos-store
spec:
  serviceName: 'thanos-store'
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          image: quay.io/thanos/thanos:v0.23.0
          args:
            - 'store'
            - '--log.level=debug'
            - '--data-dir=/var/thanos/store'
            - '--objstore.config-file=/config/thanos.yaml'
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
            - name: cluster
              containerPort: 10900
          volumeMounts:
            - name: config
              mountPath: /config/
              readOnly: true
            - name: data
              mountPath: /var/thanos/store
      volumes:
        - name: data
          emptyDir: {}
        - name: config
          secret:
            secretName: thanos-objstore-config
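Apply it to the cluster:
kubectl apply -f store-statefulset.yaml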
Create a ServiceMonitor.
# store-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    release: prom-op
spec:
  jobLabel: thanos
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
  selector:
    matchLabels:
      app: thanos-store
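Apply this file too, then check the Store's logs to confirm it started and can reach the object storage (kubectl logs accepts the StatefulSet name directly, as with the deploy/ form we used earlier):
kubectl apply -f store-servicemonitor.yaml
kubectl -n monitoring logs statefulset/thanos-store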
The Compactor is the service that compacts and downsamples historical data. It's recommended when you have a lot of incoming data, in order to reduce storage requirements. Like the Store, it is composed of a StatefulSet and takes the same object storage configuration; it also gets a Service and ServiceMonitor of its own.
Create the StatefulSet.
# compactor-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-compactor
  namespace: monitoring
  labels:
    app: thanos-compactor
spec:
  serviceName: 'thanos-compactor'
  replicas: 1
  selector:
    matchLabels:
      app: thanos-compactor
  template:
    metadata:
      labels:
        app: thanos-compactor
    spec:
      containers:
        - name: thanos-compactor
          image: quay.io/thanos/thanos:v0.23.0
          args:
            - 'compact'
            - '--log.level=debug'
            - '--data-dir=/var/thanos/store'
            - '--objstore.config-file=/config/thanos.yaml'
            - '--wait'
          ports:
            - name: http
              containerPort: 10902
          volumeMounts:
            - name: config
              mountPath: /config/
              readOnly: true
            - name: data
              mountPath: /var/thanos/store
      volumes:
        - name: data
          emptyDir: {}
        - name: config
          secret:
            secretName: thanos-objstore-config
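As before, apply it:
kubectl apply -f compactor-statefulset.yaml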
Create the Service and the ServiceMonitor.
# compactor-service-servicemonitor.yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-compactor
  labels:
    app: thanos-compactor
  namespace: monitoring
spec:
  selector:
    app: thanos-compactor
  ports:
    - port: 10902
      name: http
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: thanos-compactor
  namespace: monitoring
  labels:
    release: prom-op
spec:
  jobLabel: thanos
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
  selector:
    matchLabels:
      app: thanos-compactor
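Apply the file, then confirm the Compactor is running and waiting for new blocks (the --wait flag keeps it running between compaction passes):
kubectl apply -f compactor-service-servicemonitor.yaml
kubectl -n monitoring logs statefulset/thanos-compactor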
To query Thanos using the stack we installed, you need to port-forward the Grafana service to your computer.
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
Open your browser at localhost:3000 and log in with username admin and password admin.
Open the "Configurations" menu on the left-hand side, click "Data sources", then click "Add data source".
Select "Prometheus" under "Time series databases". Name it "Prometheus Thanos", set the "HTTP URL" to http://thanos-query:9090, and click "Save & test".
Congratulations! You have successfully deployed Thanos to your Azure Kubernetes Service cluster, with data stored in Azure Blob Storage.
You can now click on "Explore" on the left-hand side and select "Prometheus Thanos" from the drop-down menu to submit PromQL queries.
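For a quick sanity check, try a simple query such as the one below; it just counts healthy scrape targets per job, so any working setup will return data (the query is only illustrative, any PromQL will do):
sum(up) by (job)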
Additionally, you can select the "Prometheus Thanos" data source when using dashboards such as Kubernetes API Server (Dashboards > Browse > General / Kubernetes / API Server).
Don't forget to delete your cluster and storage account with the az group delete command to avoid any ongoing charges.