
Configure Prometheus

Updating Denied Packets Rules

This example shows how to modify the sample rule created by the sample manifest. The process for updating rules is the same as for user-created rules (documented below).

  • Save the current alert rule:

    kubectl -n tigera-prometheus get prometheusrule -o yaml > calico-prometheus-alert-rule-dp.yaml
  • Make the necessary edits to the alerting rules, then apply the updated manifest.

    kubectl apply -f calico-prometheus-alert-rule-dp.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
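
If you want to confirm that a reload actually happened, you can follow the config reloader logs (see also Troubleshooting Config Updates below), replacing the placeholder with your Prometheus instance name:

kubectl -n tigera-prometheus logs -f prometheus-<your-prometheus-instance-name> prometheus-config-reloader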

As an example, the range used by the rate query in this manifest is 10 seconds.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[10s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets at a fast rate {{$labels.sourceIp}} by policy {{$labels.policy}}.'

To update this alerting rule to, say, execute the query with a range of 20 seconds, modify the manifest to this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[20s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets at a fast rate {{$labels.sourceIp}} by policy {{$labels.policy}}.'

Creating a New Alerting Rule

Creating a new alerting rule is straightforward once you figure out what you want your rule to look for. Check alerting rules and Queries for more information.
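
Before turning a query into a rule, it can help to try it against the Prometheus HTTP API. The sketch below assumes the operator-created prometheus-operated Service is reachable in the tigera-prometheus namespace; adjust the Service name if your install differs:

kubectl -n tigera-prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(calico_denied_packets[10s])'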

New Alerting Rule for Monitoring Calico Node

To add the new alerting rule to our Prometheus instance, define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.
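
For reference, the rule-selection part of the Prometheus manifest looks roughly like the excerpt below (illustrative only; your actual manifest may differ):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: calico-node-prometheus
  namespace: tigera-prometheus
spec:
  ruleSelector:
    matchLabels:
      role: tigera-prometheus-rules
      prometheus: calico-node-prometheus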

As an example, to fire an alert when a calico-node instance has been down for more than 5 minutes, save the following to a file, say calico-node-down-alert.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-calico-node-down
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodeInstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'Instance {{$labels.instance}} Pod: {{$labels.pod}} is down'
        description: '{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes'

Then create or apply this manifest in Kubernetes.

kubectl apply -f calico-node-down-alert.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
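
To confirm that the rule object was created, retrieve it by the name used in the manifest:

kubectl -n tigera-prometheus get prometheusrule calico-prometheus-calico-node-down -o yaml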

New Alerting Rule for Monitoring BGP Peers

Let’s look at an example of adding a new alerting rule to our Prometheus instance for monitoring BGP peering health. Define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.

As an example, to fire an alert when the number of peering connections with a status other than “Established” is increasing at a non-zero rate in the cluster (over the last 5 minutes), save the following to a file, say tigera-peer-status-not-established.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: calico-node-prometheus
    role: tigera-prometheus-rules
  name: tigera-prometheus-peer-status-not-established
  namespace: tigera-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodePeerStatusNotEstablished
      annotations:
        description: '{{$labels.instance}} has at least one peer connection that is no longer up.'
        summary: Instance {{$labels.instance}} has peer connection that is no longer up
      expr: rate(bgp_peers{status!~"Established"}[5m]) > 0
      labels:
        severity: critical

Then create or apply this manifest in Kubernetes.

kubectl apply -f tigera-peer-status-not-established.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
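
To sanity-check the underlying metric, you can run the instant form of the rule's selector in the Prometheus expression browser (or via the HTTP API, as shown earlier), which returns the bgp_peers series whose status is not Established:

bgp_peers{status!~"Established"}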

Additional Alerting Rules

The alerting rule installed by the Calico Cloud install manifest is a simple one that fires an alert when the rate of packets denied by a policy on a node from a particular source IP exceeds a certain packets-per-second threshold. The Prometheus query used for this (ignoring the threshold value) is:

rate(calico_denied_packets[10s])

and this query will return results along the lines of:

{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"}    0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 0.2
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 1
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.6

We can modify this query to find the rate of packets denied by each policy on every node.

(sum by (instance,policy) (rate(calico_denied_packets[10s])))

This query aggregates the results from all the different source IPs and preserves the policy and instance labels. Note that the instance label represents the calico-node's IP address and PrometheusReporterPort. The query returns results like so:

{instance="10.240.0.84:9081",policy="profile/k8s_ns.test/0/deny"}   2
{instance="10.240.0.81:9081",policy="profile/k8s_ns.test/0/deny"} 2.8

To include the pod name in these results, add the pod label to the labels listed in the by expression, like so:

(sum by (instance,pod,policy) (rate(calico_denied_packets[10s])))

which will return the following results:

{instance="10.240.0.84:9081",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny"}   2
{instance="10.240.0.81:9081",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny"} 2.8

An interesting use case is detecting when a rogue pod is using tools such as nmap to scan a subnet for open ports. To detect this, we have to execute a query that aggregates across all policies on all instances while preserving the source IP address. This can be done using this query:

(sum by (srcIP) (rate(calico_denied_packets[10s])))

which will return results broken down by source IP address:

{srcIP="192.168.167.159"}   1.0000000000000002
{srcIP="192.168.167.129"} 1.2000000000000002
{srcIP="192.168.252.175"} 1.4000000000000001
{srcIP="192.168.167.175"} 0.4
{srcIP="192.168.252.157"} 1.0000000000000002

To use these queries as alerting rules, follow the instructions in the Creating a New Alerting Rule section and create a PrometheusRule with the appropriate query.
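
For instance, a rule built from the per-source-IP query might look like the sketch below; the rule name, threshold, and severity here are illustrative, not part of the sample manifests:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-denied-packets-by-source
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsFromSingleSource
      expr: (sum by (srcIP) (rate(calico_denied_packets[10s]))) > 50
      labels:
        severity: warning
      annotations:
        summary: 'Source {{$labels.srcIP}} is being denied at a high rate'
        description: 'Packets from {{$labels.srcIP}} are being denied across all policies and nodes at a high rate, which may indicate port or subnet scanning.'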

Updating the scrape interval

You may wish to modify the scrape interval (the time between Prometheus polls of each node for new denied-packet information). Increasing the interval reduces load on Prometheus and the amount of storage required, but decreases the detail of the collected metrics. Keep in mind that rate() needs at least two samples inside its range, so the scrape interval should be at most half the range used in your alerting rules (for example, the 10s range above needs an interval of 5s or less).

The scrape interval of endpoints (calico-node in our case) is defined as part of the ServiceMonitor manifest. To change the interval:

  • Save the current ServiceMonitor manifest:

    kubectl -n tigera-prometheus get servicemonitor calico-node-monitor -o yaml > calico-node-monitor.yaml
  • Update the interval field under endpoints to the desired setting and apply the updated manifest.

    kubectl apply -f calico-node-monitor.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).

As an example of what to update, the interval in this ServiceMonitor manifest is 5 seconds (5s).

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 5s

To update Calico Cloud Prometheus' scrape interval to 10 seconds, modify the manifest to this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 10s
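
After applying the change, you can confirm the interval Prometheus is actually using by inspecting its running configuration over the HTTP API (again assuming a port-forward to the operator-created prometheus-operated Service; adjust the Service name if your install differs) and looking for the scrape_interval in the scrape config generated from the calico-node-monitor ServiceMonitor:

kubectl -n tigera-prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/status/config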

Troubleshooting Config Updates

Check the config reloader logs to see whether they have detected any recent activity.

  • For Prometheus, run:

    kubectl -n tigera-prometheus logs prometheus-<your-prometheus-name> prometheus-config-reloader
  • For Alertmanager, run:

    kubectl -n tigera-prometheus logs alertmanager-<your-alertmanager-name> config-reloader

The config reloaders watch each pod's filesystem for updated config from ConfigMaps or Secrets and perform the steps necessary to reload the configuration.
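
If the configuration reloaded but an alert still does not appear, you can also check which rules Prometheus has actually loaded through its HTTP API (again assuming a port-forward to the Prometheus Service as above):

curl -s http://localhost:9090/api/v1/rules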