
Configure Prometheus

Updating Denied Packets Rules

This example shows how to modify the sample rule created by the sample manifest. The process for updating rules is the same as for user-created rules (documented below).

  • Save the current alert rule:

    kubectl -n tigera-prometheus get prometheusrule -o yaml > calico-prometheus-alert-rule-dp.yaml
  • Make the necessary edits to the alerting rules, then apply the updated manifest.

    kubectl apply -f calico-prometheus-alert-rule-dp.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
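
If you want to confirm that a reload actually happened, you can follow the config reloader logs (see also Troubleshooting Config Updates below), replacing the placeholder with your Prometheus instance name:

kubectl -n tigera-prometheus logs -f prometheus-<your-prometheus-instance-name> prometheus-config-reloader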

As an example, the range used by the rate query in this manifest is 10 seconds.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[10s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets at a fast rate {{$labels.sourceIp}} by policy {{$labels.policy}}.'

To update this alerting rule to, say, execute the query with a range of 20 seconds, modify the manifest to this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-dp-rate
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsRate
      expr: rate(calico_denied_packets[20s]) > 50
      labels:
        severity: critical
      annotations:
        summary: 'Instance {{$labels.instance}} - Large rate of packets denied'
        description: '{{$labels.instance}} with calico-node pod {{$labels.pod}} has been denying packets at a fast rate {{$labels.sourceIp}} by policy {{$labels.policy}}.'

Creating a New Alerting Rule

Creating a new alerting rule is straightforward once you figure out what you want your rule to look for. Check alerting rules and Queries for more information.
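
Before turning a query into a rule, it can help to try it against the Prometheus HTTP API. The sketch below assumes the operator-created prometheus-operated Service is reachable in the tigera-prometheus namespace; adjust the Service name if your install differs:

kubectl -n tigera-prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=rate(calico_denied_packets[10s])'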

New Alerting Rule for Monitoring Calico Node

To add the new alerting rule to our Prometheus instance, define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.
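
For reference, the rule-selection part of the Prometheus manifest looks roughly like the excerpt below (illustrative only; your actual manifest may differ):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: calico-node-prometheus
  namespace: tigera-prometheus
spec:
  ruleSelector:
    matchLabels:
      role: tigera-prometheus-rules
      prometheus: calico-node-prometheus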

As an example, to fire an alert when a calico-node instance has been down for more than 5 minutes, save the following to a file, say calico-node-down-alert.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-calico-node-down
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodeInstanceDown
      expr: up == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: 'Instance {{$labels.instance}} Pod: {{$labels.pod}} is down'
        description: '{{$labels.instance}} of job {{$labels.job}} has been down for more than 5 minutes'

Then create or apply this manifest in Kubernetes.

kubectl apply -f calico-node-down-alert.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
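
To confirm that the rule object was created, retrieve it by the name used in the manifest:

kubectl -n tigera-prometheus get prometheusrule calico-prometheus-calico-node-down -o yaml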

New Alerting Rule for Monitoring BGP Peers

Let’s look at an example of adding a new alerting rule to our Prometheus instance for monitoring BGP peering health. Define a PrometheusRule manifest in the tigera-prometheus namespace with the labels role: tigera-prometheus-rules and prometheus: calico-node-prometheus. The labels should match the labels defined by the ruleSelector field of the Prometheus manifest.

As an example, to fire an alert when the number of peering connections with a status other than “Established” is increasing at a non-zero rate in the cluster (over the last 5 minutes), save the following to a file, say tigera-peer-status-not-established.yaml.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: calico-node-prometheus
    role: tigera-prometheus-rules
  name: tigera-prometheus-peer-status-not-established
  namespace: tigera-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: CalicoNodePeerStatusNotEstablished
      annotations:
        description: '{{$labels.instance}} has at least one peer connection that is no longer up.'
        summary: Instance {{$labels.instance}} has peer connection that is no longer up
      expr: rate(bgp_peers{status!~"Established"}[5m]) > 0
      labels:
        severity: critical

Then create or apply this manifest in Kubernetes.

kubectl apply -f tigera-peer-status-not-established.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).
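
To sanity-check the underlying metric, you can run the instant form of the rule's selector in the Prometheus expression browser (or via the HTTP API, as shown earlier), which returns the bgp_peers series whose status is not Established:

bgp_peers{status!~"Established"}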

Additional Alerting Rules

The alerting rule installed by the Calico Cloud install manifest is a simple one that fires an alert when the rate of packets denied by a policy on a node from a particular source IP exceeds a certain packets-per-second threshold. The Prometheus query used for this (ignoring the threshold value) is:

rate(calico_denied_packets[10s])

and this query will return results along the lines of:

{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"}    0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 0.2
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.175"} 1
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.129"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.4
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.175"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.81:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.252.157"} 0.6
{endpoint="calico-metrics-port",instance="10.240.0.84:9081",job="calico-node-metrics",namespace="kube-system",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny",service="calico-node-metrics",srcIP="192.168.167.159"} 0.6

We can modify this query to find the rate of packets denied by each policy on every node.

(sum by (instance,policy) (rate(calico_denied_packets[10s])))

This query aggregates the results from all the different source IPs and preserves the policy and instance labels. Note that the instance label represents the calico-node's IP address and PrometheusReporterPort. The query returns results like so:

{instance="10.240.0.84:9081",policy="profile/k8s_ns.test/0/deny"}   2
{instance="10.240.0.81:9081",policy="profile/k8s_ns.test/0/deny"} 2.8

To include the pod name in these results, add the pod label to the labels listed in the by expression, like so:

(sum by (instance,pod,policy) (rate(calico_denied_packets[10s])))

which will return the following results:

{instance="10.240.0.84:9081",pod="calico-node-97m3g",policy="profile/k8s_ns.test/0/deny"}   2
{instance="10.240.0.81:9081",pod="calico-node-hn0kl",policy="profile/k8s_ns.test/0/deny"} 2.8

An interesting use case is detecting when a rogue pod is using tools such as nmap to scan a subnet for open ports. To detect this, we have to execute a query that aggregates across all policies on all instances while preserving the source IP address. This can be done using this query:

(sum by (srcIP) (rate(calico_denied_packets[10s])))

which will return results broken down by source IP address:

{srcIP="192.168.167.159"}   1.0000000000000002
{srcIP="192.168.167.129"} 1.2000000000000002
{srcIP="192.168.252.175"} 1.4000000000000001
{srcIP="192.168.167.175"} 0.4
{srcIP="192.168.252.157"} 1.0000000000000002

To use these queries as alerting rules, follow the instructions in the Creating a New Alerting Rule section and create a PrometheusRule with the appropriate query.
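
For instance, a rule built from the per-source-IP query might look like the sketch below; the rule name, threshold, and severity here are illustrative, not part of the sample manifests:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: calico-prometheus-denied-packets-by-source
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
  - name: calico.rules
    rules:
    - alert: DeniedPacketsFromSingleSource
      expr: (sum by (srcIP) (rate(calico_denied_packets[10s]))) > 50
      labels:
        severity: warning
      annotations:
        summary: 'Source {{$labels.srcIP}} is being denied at a high rate'
        description: 'Packets from {{$labels.srcIP}} are being denied across all policies and nodes at a high rate, which may indicate port or subnet scanning.'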

Updating the scrape interval

You may wish to modify the scrape interval (the time between Prometheus polls of each node for new denied-packet information). Increasing the interval reduces load on Prometheus and the amount of storage required, but decreases the detail of the collected metrics. Keep in mind that rate() needs at least two samples inside its range, so the scrape interval should be at most half the range used in your alerting rules (for example, the 10s range above needs an interval of 5s or less).

The scrape interval of endpoints (calico-node in our case) is defined as part of the ServiceMonitor manifest. To change the interval:

  • Save the current ServiceMonitor manifest:

    kubectl -n tigera-prometheus get servicemonitor calico-node-monitor -o yaml > calico-node-monitor.yaml
  • Update the interval field under endpoints to the desired setting and apply the updated manifest.

    kubectl apply -f calico-node-monitor.yaml

Your changes should be applied in a few seconds by the prometheus-config-reloader container inside the prometheus pod launched by the prometheus-operator (usually named prometheus-<your-prometheus-instance-name>).

As an example of what to update, the interval in this ServiceMonitor manifest is 5 seconds (5s).

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 5s

To update Calico Cloud Prometheus' scrape interval to 10 seconds, modify the manifest to this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: calico-node-monitor
  namespace: tigera-prometheus
  labels:
    team: network-operators
spec:
  selector:
    matchLabels:
      k8s-app: calico-node
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: calico-metrics-port
    interval: 10s
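
After applying the change, you can confirm the interval Prometheus is actually using by inspecting its running configuration over the HTTP API (again assuming a port-forward to the operator-created prometheus-operated Service; adjust the Service name if your install differs) and looking for the scrape_interval in the scrape config generated from the calico-node-monitor ServiceMonitor:

kubectl -n tigera-prometheus port-forward svc/prometheus-operated 9090:9090 &
curl -s http://localhost:9090/api/v1/status/config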

Troubleshooting Config Updates

Check the config reloader logs to see whether they have detected any recent activity.

  • For Prometheus, run:

    kubectl -n tigera-prometheus logs prometheus-<your-prometheus-name> prometheus-config-reloader
  • For Alertmanager, run:

    kubectl -n tigera-prometheus logs alertmanager-<your-alertmanager-name> config-reloader

The config reloaders watch each pod's filesystem for updated config from ConfigMaps or Secrets and perform the steps necessary to reload the configuration.
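
If the configuration reloaded but an alert still does not appear, you can also check which rules Prometheus has actually loaded through its HTTP API (again assuming a port-forward to the Prometheus Service as above):

curl -s http://localhost:9090/api/v1/rules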