Configure Prometheus
Updating Denied Packets Rules
This is an example of how to modify the sample rule created by the sample manifest. The process for updating rules is the same as for user-created rules (documented below).
- Save the current alert rule:
kubectl -n tigera-prometheus get prometheusrule -o yaml > calico-prometheus-alert-rule-dp.yaml
- Make the necessary edits to the alerting rules, then apply the updated manifest:
kubectl apply -f calico-prometheus-alert-rule-dp.yaml
Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
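If you prefer to edit the rule in place instead of round-tripping through a file, kubectl edit also works on PrometheusRule resources; for example, using the rule name from the sample manifest below:
kubectl -n tigera-prometheus edit prometheusrule calico-prometheus-dp-rate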
As an example, the range used in the query in this manifest is 10 seconds.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[10s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets from {{$labels.srcIP}} at a high rate, matching policy {{$labels.policy}}.'
To update this alerting rule so that, say, the query is executed with a range of 20 seconds, modify the manifest as follows:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[20s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets from {{$labels.srcIP}} at a high rate, matching policy {{$labels.policy}}.'
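Note that rate() needs at least two samples inside its range to produce a value, so the range should span at least a couple of scrape intervals (the scrape interval in the ServiceMonitor shown later on this page is 5 seconds). If you want a smoother signal, you can widen the range rather than adjust the threshold, for example:
expr: rate(calico_denied_packets[1m]) > 50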
Creating a New Alerting Rule
Creating a new alerting rule is straightforward once you have decided what you want the rule to look for. See alerting rules and queries for more information.
New Alerting Rule for Monitoring Calico Node
To add the new alerting rule to our Prometheus instance, define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.
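For reference, the ruleSelector in the Prometheus manifest that these labels must match would look something like the following sketch. Only the relevant fields are shown, and the resource name calico-node-prometheus is an assumption inferred from the prometheus label; your deployment may differ.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: calico-node-prometheus
  namespace: tigera-prometheus
spec:
  ruleSelector:
    matchLabels:
      role: tigera-prometheus-rules
      prometheus: calico-node-prometheus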
As an example, to fire an alert when a calico-node instance has been down for more than 5 minutes, save the following to a file, say calico-node-down-alert.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-calico-node-down
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodeInstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'Instance {{$labels.instance}} Pod: {{$labels.pod}} is down'
        description: '{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes'
Then apply this manifest in Kubernetes:
kubectl apply -f calico-node-down-alert.yaml
Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
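Note that up == 0 matches any scrape target that is down, not just calico-node. If your Prometheus instance scrapes other targets as well, you may want to restrict the expression to the calico-node-metrics job (the job label that appears in the query results later on this page), for example:
expr: up{job="calico-node-metrics"} == 0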
New Alerting Rule for Monitoring BGP Peers
Let’s look at an example of adding a new alerting rule to our Prometheus instance for monitoring BGP peering health. Define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.
As an example, to fire an alert when the number of peering connections with a status other than “Established” is increasing at a non-zero rate in the cluster (over the last 5 minutes), save the following to a file, say tigera-peer-status-not-established.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: calico-node-prometheus
    role: tigera-prometheus-rules
  name: tigera-prometheus-peer-status-not-established
  namespace: tigera-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodePeerStatusNotEstablished
      annotations:
        description: '{{$labels.instance}} has at least one peer connection that is no longer up.'
        summary: Instance {{$labels.instance}} has a peer connection that is no longer up
      expr: rate(bgp_peers{status!~"Established"}[5m]) > 0
      labels:
        severity: critical
Then apply this manifest in Kubernetes:
kubectl apply -f tigera-peer-status-not-established.yaml
Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
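The rate() expression above only fires while the count of non-Established peers is increasing. If you would rather alert whenever any peer remains in a state other than Established, a simpler expression using the same bgp_peers metric and status label might look like this (the 5m hold time is an illustrative choice, not a value defined elsewhere on this page):
expr: bgp_peers{status!~"Established"} > 0
for: 5m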
Additional Alerting Rules
The alerting rule installed by the Calico Enterprise install manifest is a simple one that fires an alert when the rate of packets denied by a policy on a node, from a particular source IP, exceeds a certain packets-per-second threshold. The Prometheus query used for this (ignoring the threshold comparison) is:
rate(calico_denied_packets[10s])
This query will return results along the lines of:
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 0.2
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 1
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.6
We can modify this query to show the rate of packets denied by each policy on every node:
(sum by (instance,policy) (rate(calico_denied_packets[10s])))
This query aggregates the results from all the different source IPs while preserving the policy and instance labels. Note that the instance label represents the calico-node's IP address and PrometheusReporterPort. This query will return results like so:
{instance="10.240.0.84:9081",policy="profile/k8s_ns.test/0/deny"} 2
{instance="10.240.0.81:9081",policy="profile/k8s_ns.test/0/deny"} 2.8
To include the pod name in these results, add the label pod to the labels listed in the by expression like so:
(sum by (instance,pod,policy) (rate(calico_denied_packets[10s])))
which will return the following results:
{instance="10.240.0.84:9081",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny"} 2
{instance="10.240.0.81:9081",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny"} 2.8
An interesting use case is detecting when a rogue pod is using a tool such as nmap to scan a subnet for open ports. To detect this, we need to execute a query that aggregates across all policies on all instances while preserving the source IP address. This can be done with the following query:
(sum by (srcIP) (rate(calico_denied_packets[10s])))
which will return results broken down by source IP address:
{srcIP="192.168.167.159"} 1.0000000000000002
{srcIP="192.168.167.129"} 1.2000000000000002
{srcIP="192.168.252.175"} 1.4000000000000001
{srcIP="192.168.167.175"} 0.4
{srcIP="192.168.252.157"} 1.0000000000000002
To use these queries as alerting rules, follow the instructions in the Creating a New Alerting Rule section and create a PrometheusRule manifest with the appropriate query.
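As a sketch of what that might look like for the port-scanning query above, the following PrometheusRule fires when any single source IP exceeds an example threshold of 20 denied packets per second. The rule name, alert name, threshold, and annotation text are all illustrative choices, not values defined elsewhere on this page.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-denied-packets-by-source   # illustrative name
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsFromSourceIP               # illustrative alert name
      expr: (sum by (srcIP) (rate(calico_denied_packets[10s]))) > 20   # threshold is an example value
      labels:
        severity: critical
      annotations:
        summary: 'Source IP {{$labels.srcIP}} is being denied at a high rate'
        description: 'Packets from {{$labels.srcIP}} are being denied by multiple policies; this may indicate a port scan.'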
Updating the scrape interval
You may wish to modify the scrape interval (time between Prometheus polling each node for new denied packet information). Increasing the interval reduces load on Prometheus and the amount of storage required, but decreases the detail of the collected metrics.
The scrape interval of endpoints (calico-node in our case) is defined as part of the ServiceMonitor manifest. To change the interval:
- Save the current ServiceMonitor manifest:
kubectl -n tigera-prometheus get servicemonitor calico-node-monitor -o yaml > calico-node-monitor.yaml
- Update the interval field under endpoints to the desired setting, then apply the updated manifest:
kubectl apply -f calico-node-monitor.yaml
Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
As an example of what to update, the interval in this ServiceMonitor manifest is 5 seconds (5s).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 5s
To update the Calico Enterprise Prometheus scrape interval to 10 seconds, modify the manifest as follows:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 10s
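One way to confirm that the new interval has taken effect is to query Prometheus' own prometheus_target_interval_length_seconds metric, which reports the intervals actually observed between scrapes (the quantile chosen here is just one option):
prometheus_target_interval_length_seconds{quantile="0.99"}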
Troubleshooting Config Updates
Check the config-reloader logs to see whether they have detected any recent activity.
- For Prometheus, run:
kubectl -n tigera-prometheus logs prometheus-<your-prometheus-name> prometheus-config-reloader
- For Alertmanager, run:
kubectl -n tigera-prometheus logs alertmanager-<your-alertmanager-name> config-reloader
The config-reloaders watch each pod's file system for updated configuration from ConfigMaps or Secrets and perform the steps necessary to reload the configuration.
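If the logs show that the configuration was reloaded but the rule still does not appear, you can double-check what Prometheus actually loaded by port-forwarding to the Prometheus pod and querying its rules API. Run the port-forward in one terminal and the query in another; 9090 is the default Prometheus web port, and the pod name placeholder is the same one used above.
kubectl -n tigera-prometheus port-forward prometheus-<your-prometheus-instance-name> 9090:9090
curl -s http://localhost:9090/api/v1/rules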