Version: 3.18 (latest)

Policy metrics

Calico Enterprise adds the ability to monitor the effects of policies configured in your cluster. By defining a set of simple rules and thresholds, you can monitor traffic metrics and receive alerts when they exceed the configured thresholds.

                                      +------------+
                                      |            |
                                      |    TSEE    |
                                      |   Manager  |
                                      |            |
                                      |            |
                                      |            |
                                      +------------+
                                            ^
                                            |
                                            |
                                            |
 +-----------------+                        |
 | Host            |                        |
 | +-----------------+                +------------+     +------------+
 | | Host            |------------->--|            |     |            |--->--
 | | +-----------------+   policy     | Prometheus |     | Prometheus |        alert
 +-| | Host            |----------->--|   Server   |-->--|   Alert    |--->--
   | |   +----------+  |   metrics    |            |     |  Manager   |      mechanisms
   +-|   |  Felix   |-------------->--|            |     |            |--->--
     |   +----------+  |              +------------+     +------------+
     +-----------------+                    ^                   ^
                                            |                   |
                             Collect and store metrics.     Web UI for accessing alert
                             WebUI for accessing and        states.
                             querying metrics.              Configure fan out
                             Configure alerting rules.      notifications to different
                                                            alert receivers.

Policy inspection and reporting are accomplished using four key pieces:

A Calico Enterprise-specific Felix binary, running inside the calico-node container, monitors the host for denied/allowed packets and collects metrics.
Prometheus Server(s), deployed as part of the Calico Enterprise manifest, scrape every configured calico-node target. Alerting rules that query denied packet metrics are configured in Prometheus and, when triggered, fire alerts to the Prometheus Alertmanager.
Prometheus Alertmanager (or simply Alertmanager), deployed as part of the Calico Enterprise manifest, receives alerts from Prometheus and forwards them to alerting mechanisms such as PagerDuty or OpsGenie.
Calico Enterprise Manager, also deployed as part of the Calico Enterprise manifest, processes the metrics using pre-defined Prometheus queries and provides dashboards and associated workflows.

Metrics are generated at a node only when packets are directed at an endpoint that is being actively profiled by a policy. Once generated, they stay alive for 60 seconds.

Once Prometheus scrapes a node and collects policy metrics, they remain available in Prometheus until they are considered stale, i.e., until Prometheus has not seen any updates to the metric for some time. This time is configurable. Refer to Configuring Prometheus configuration for more information.

Because metrics expire as just described, a GET on a metrics query URL can legitimately return no information. This is expected if no policy on that node has processed any packets in the last 60 seconds.
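The empty-result case can be handled explicitly by a client. The following is a minimal Python sketch, assuming the query is sent to the standard Prometheus HTTP API instant-query endpoint (/api/v1/query). The address in PROMETHEUS_URL and the helper names instant_query_url and extract_samples are hypothetical and not part of Calico Enterprise.

```python
import json
from urllib.parse import urlencode

# Hypothetical address; substitute the Prometheus service address of your cluster.
PROMETHEUS_URL = "http://prometheus.example:9090"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_samples(response_body: str) -> list:
    """Return (labels, value) pairs from an instant-query JSON response.

    An empty list means Prometheus currently holds no matching series,
    which is expected when no policy has processed packets recently.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    samples = []
    for series in data["result"]:
        _timestamp, raw_value = series["value"]  # value is [timestamp, "string"]
        samples.append((series["metric"], float(raw_value)))
    return samples
```

A caller would typically treat an empty list from extract_samples as "no recent policy traffic" rather than as an error.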

Metrics generated by each Calico Enterprise node are:

calico_denied_packets - Total number of packets denied by Calico Enterprise policies.
calico_denied_bytes - Total number of bytes denied by Calico Enterprise policies.
cnx_policy_rule_packets - Sum of allowed/denied packets over rules processed by Calico Enterprise policies.
cnx_policy_rule_bytes - Sum of allowed/denied bytes over rules processed by Calico Enterprise policies.
cnx_policy_rule_connections - Sum of connections over rules processed by Calico Enterprise policies.

The metrics calico_denied_packets and calico_denied_bytes have the labels policy and srcIP. Using these two labels, you can identify the policy that denied packets as well as the source IP address of the packets it denied. In Prometheus terminology, calico_denied_packets is the metric name, and policy and srcIP are labels. Each of these metrics is reported once per {policy, srcIP} combination.

Example queries:

Total number of bytes originating from the IP address "10.245.13.133" that were denied by the k8s_ns.ns-0 profile.

calico_denied_bytes{policy="profile|k8s_ns.ns-0|0|deny", srcIP="10.245.13.133"}

Total number of packets originating from the IP address "10.245.13.149" that were denied by the k8s_ns.ns-0 profile.

calico_denied_packets{policy="profile|k8s_ns.ns-0|0|deny", srcIP="10.245.13.149"}
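When such selectors are generated by a script rather than typed by hand, a small helper avoids quoting and brace mistakes. The following Python sketch is illustrative only; the function name selector is hypothetical, and it handles exact-match (=) label matchers only.

```python
def selector(metric: str, **labels: str) -> str:
    """Render a PromQL instant-vector selector with exact-match label matchers."""

    def escape(value: str) -> str:
        # PromQL double-quoted strings use backslash escapes.
        return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    matchers = ", ".join(f'{name}="{escape(value)}"' for name, value in labels.items())
    # With no labels, the bare metric name is already a valid selector.
    return f"{metric}{{{matchers}}}" if matchers else metric


# Reproduces the first example query above.
query = selector(
    "calico_denied_bytes",
    policy="profile|k8s_ns.ns-0|0|deny",
    srcIP="10.245.13.133",
)
print(query)
```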

The metrics cnx_policy_rule_packets, cnx_policy_rule_bytes and cnx_policy_rule_connections have the labels: tier, policy, namespace, rule_index, action, traffic_direction, rule_direction.

Using these metrics, you can identify allowed and denied byte rates and packet rates, both inbound and outbound, indexed by both policy and rule. The Calico Enterprise Manager dashboard makes heavy use of these metrics. Staged policy names are prefixed with "staged:".

Example queries:

Query counts for rules: Packet rates for a specific rule by traffic_direction

sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",rule_direction="ingress",rule_index="rule-5",tier="tier-0"}[30s])) without (instance)

Query counts for rules: Packet rates for each rule in a policy by traffic_direction

sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",tier="tier-0"}[30s])) without (instance)

Query counts for a single policy by traffic_direction and action

sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",tier="tier-0"}[30s])) without (instance,rule_index,rule_direction)

Query counts for all policies across all tiers by traffic_direction and action

sum(irate(cnx_policy_rule_packets[30s])) without (instance,rule_index,rule_direction)
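The queries above differ mainly in which labels they remove with without (...): every label that is dropped is summed over, and every label that is kept still splits the result. The Python sketch below mimics that aggregation on a handful of made-up per-series rates. The label values and numbers are illustrative assumptions, not real output, and sum_without is a hypothetical helper.

```python
from collections import defaultdict


def sum_without(samples, dropped):
    """Mimic PromQL `sum(...) without (<dropped>)` on (labels, value) pairs."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series that agree on every remaining label fall into the same group.
        key = frozenset(
            (name, val)
            for name, val in labels.items()
            if name not in dropped and name != "__name__"
        )
        totals[key] += value
    return dict(totals)


# Made-up per-second rates, standing in for the output of irate(...).
rates = [
    ({"instance": "node-a", "action": "allow", "traffic_direction": "inbound",
      "rule_index": "0", "rule_direction": "ingress"}, 2.0),
    ({"instance": "node-b", "action": "allow", "traffic_direction": "inbound",
      "rule_index": "0", "rule_direction": "ingress"}, 3.0),
    ({"instance": "node-a", "action": "deny", "traffic_direction": "inbound",
      "rule_index": "1", "rule_direction": "ingress"}, 0.5),
]

# Dropping only `instance` keeps one series per rule.
per_rule = sum_without(rates, {"instance"})
# Dropping rule labels as well leaves one series per action and direction.
per_action = sum_without(rates, {"instance", "rule_index", "rule_direction"})
```

In this sketch per_rule holds two groups (one per rule) and per_action also holds two (one per action), with the two allow series from different nodes merged in both cases.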

See the Felix configuration reference for the settings that control the reporting of these metrics. Calico Enterprise manifests normally set PrometheusReporterEnabled=true and PrometheusReporterPort=9081, so these metrics are available on each compute node at http://<node-IP>:9081/metrics.
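To check what a single node is currently reporting, the endpoint can be read directly. The Python sketch below assumes the default port 9081 mentioned above, an IPv4 node address, and the Prometheus text exposition format. The helper names are hypothetical, and the name filter uses only the five metric names listed earlier.

```python
from urllib.request import urlopen

# Metric names listed earlier in this document.
POLICY_METRICS = {
    "calico_denied_packets",
    "calico_denied_bytes",
    "cnx_policy_rule_packets",
    "cnx_policy_rule_bytes",
    "cnx_policy_rule_connections",
}


def node_metrics_url(node_ip: str, port: int = 9081) -> str:
    """Build the metrics URL for one node (IPv4 address assumed)."""
    return f"http://{node_ip}:{port}/metrics"


def policy_metric_lines(exposition_text: str) -> list:
    """Keep only the sample lines that belong to the policy metrics."""
    kept = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in POLICY_METRICS:
            kept.append(line)
    return kept


def fetch_policy_metrics(node_ip: str, port: int = 9081, timeout: float = 5.0) -> list:
    """Fetch a node's metrics page and return its policy metric lines."""
    with urlopen(node_metrics_url(node_ip, port), timeout=timeout) as response:
        return policy_metric_lines(response.read().decode("utf-8"))
```

An empty list from fetch_policy_metrics is not necessarily a fault: it is the expected result when no policy on that node has processed packets in the last 60 seconds.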