Policy metrics
Calico Cloud adds the ability to monitor the effects of policies configured in your cluster. By defining a set of simple rules and thresholds, you can monitor traffic metrics and receive alerts when they exceed the configured thresholds.
                                        +------------+
                                        |            |
                                        |    TSEE    |
                                        |  Manager   |
                                        |            |
                                        +------------+
                                              ^
                                              |
                                              |
+-----------------+                           |
| Host            |                           |
| +-----------------+                   +------------+     +------------+
| | Host            |---------------->--|            |     |            |--->--
| | +-----------------+     policy      | Prometheus |     | Prometheus | alert
+-| | Host            |-------------->--|   Server   |-->--|   Alert    |--->--
  | | +----------+    |     metrics     |            |     |  Manager   | mechanisms
  +-| | Felix    |------------------->--|            |     |            |--->--
    | +----------+    |                 +------------+     +------------+
    +-----------------+                       ^                  ^
                                              |                  |
                            Collect and store metrics.     Web UI for accessing alert
                            WebUI for accessing and        states.
                            querying metrics.              Configure fan out
                            Configure alerting rules.      notifications to different
                                                           alert receivers.
Policy inspection and reporting is accomplished using four key pieces:
- A Calico Cloud-specific Felix binary running inside the calico-node container monitors the host for denied/allowed packets and collects metrics.
- Prometheus Server(s), deployed as part of the Calico Cloud manifest, scrape every configured calico-node target. Alerting rules querying denied packet metrics are configured in Prometheus and, when triggered, fire alerts to the Prometheus Alertmanager.
- Prometheus Alertmanager (or simply Alertmanager), deployed as part of the Calico Cloud manifest, receives alerts from Prometheus and forwards them to various alerting mechanisms such as PagerDuty or OpsGenie.
- Calico Cloud Manager, also deployed as part of the Calico Cloud manifest, processes the metrics using pre-defined Prometheus queries and provides dashboards and associated workflows.
Metrics are only generated at a node when there are packets directed at an endpoint that is being actively profiled by a policy. Once generated, they stay alive for 60 seconds.
Once Prometheus scrapes a node and collects policy metrics, they remain available in Prometheus until they are considered stale, i.e., Prometheus has not seen any updates to the metric for some time. This time is configurable. Refer to Configuring Prometheus configuration for more information.
Because metrics expire as just described, it is entirely possible for a GET on a metrics query URL to return no information. This is expected if no packets have been processed by a policy on that node in the last 60 seconds.
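Client code that polls Prometheus should therefore treat an empty result as a normal outcome rather than an error. The sketch below assumes the standard Prometheus HTTP API response shape for instant queries (`/api/v1/query`); the function name and the sample payloads are illustrative and not part of Calico Cloud.

```python
import json


def extract_samples(response_body: str) -> list:
    """Return (labels, value) pairs from a Prometheus instant-query response.

    An empty list means the query succeeded but no series matched, which is
    the expected outcome once a policy metric has expired.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    samples = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        samples.append((series["metric"], float(value)))
    return samples


# Hypothetical payloads for illustration only.
EMPTY_RESPONSE = '{"status": "success", "data": {"resultType": "vector", "result": []}}'
ACTIVE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{
            "metric": {
                "__name__": "calico_denied_packets",
                "policy": "profile|k8s_ns.ns-0|0|deny",
                "srcIP": "10.245.13.133",
            },
            "value": [1700000000.0, "7"],
        }],
    },
})

if __name__ == "__main__":
    for body in (EMPTY_RESPONSE, ACTIVE_RESPONSE):
        samples = extract_samples(body)
        if not samples:
            print("no recent policy traffic (metric absent or expired)")
        for labels, value in samples:
            print(labels["policy"], labels["srcIP"], value)
```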
Metrics generated by each Calico Cloud node are:
- calico_denied_packets: Total number of packets denied by Calico Cloud policies.
- calico_denied_bytes: Total number of bytes denied by Calico Cloud policies.
- cnx_policy_rule_packets: Sum of allowed/denied packets over rules processed by Calico Cloud policies.
- cnx_policy_rule_bytes: Sum of allowed/denied bytes over rules processed by Calico Cloud policies.
- cnx_policy_rule_connections: Sum of connections over rules processed by Calico Cloud policies.
The metrics calico_denied_packets and calico_denied_bytes have the labels policy and srcIP.
Using these two metrics, one can identify the policy that denied packets as well as the source IP address of the packets that were denied by this policy. Using Prometheus terminology, calico_denied_packets is the metric name and policy and srcIP are labels. Each one of these metrics will be available as a combination of {policy, srcIP}.
Example queries:
- Total number of bytes denied by the k8s_ns.ns-0 profile, originating from the IP address "10.245.13.133".
calico_denied_bytes{policy="profile|k8s_ns.ns-0|0|deny", srcIP="10.245.13.133"}
- Total number of packets denied by the k8s_ns.ns-0 profile, originating from the IP address "10.245.13.149".
calico_denied_packets{policy="profile|k8s_ns.ns-0|0|deny", srcIP="10.245.13.149"}
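When such selectors are generated from scripts, it helps to build them in one place so that braces and quotes stay balanced. The helper below is a minimal sketch; `denied_selector` is a hypothetical name, and the escaping covers only backslashes and double quotes in label values.

```python
def denied_selector(metric: str, policy: str, src_ip: str) -> str:
    """Build a PromQL selector on the policy and srcIP labels."""

    def esc(value: str) -> str:
        # Escape backslashes first, then double quotes, as PromQL strings require.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    return f'{metric}{{policy="{esc(policy)}", srcIP="{esc(src_ip)}"}}'


if __name__ == "__main__":
    print(denied_selector(
        "calico_denied_bytes", "profile|k8s_ns.ns-0|0|deny", "10.245.13.133"
    ))
```

Called with the values from the first example query, the helper reproduces that query string.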
The metrics cnx_policy_rule_packets, cnx_policy_rule_bytes and cnx_policy_rule_connections have the labels: tier, policy, namespace, rule_index, action, traffic_direction, rule_direction.
Using these metrics, one can identify allowed and denied byte rates and packet rates, both inbound and outbound, indexed by both policy and rule. The Calico Cloud Manager dashboard makes heavy use of these metrics. Staged policy names are prefixed with "staged:".
Example queries:
- Query counts for rules: Packet rates for a specific rule by traffic_direction
sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",rule_direction="ingress",rule_index="rule-5",tier="tier-0"}[30s])) without (instance)
- Query counts for rules: Packet rates for each rule in a policy by traffic_direction
sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",tier="tier-0"}[30s])) without (instance)
- Query counts for a single policy by traffic_direction and action
sum(irate(cnx_policy_rule_packets{namespace="namespace-2",policy="policy-0",tier="tier-0"}[30s])) without (instance,rule_index,rule_direction)
- Query counts for all policies across all tiers by traffic_direction and action
sum(irate(cnx_policy_rule_packets[30s])) without (instance,rule_index,rule_direction)
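The `sum(...) without (...)` form in these queries drops the listed labels and adds up every series that then shares the same remaining labels. The following sketch imitates that grouping in plain Python to show why the last two queries end up keyed by policy, action and traffic_direction. The sample label values and rates are made up for illustration, and this is a model of the aggregation, not how Prometheus implements it.

```python
from collections import defaultdict


def sum_without(samples, dropped):
    """Sum sample values after removing the `dropped` labels from each series."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The grouping key is whatever labels remain, in a stable order.
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in dropped))
        totals[key] += value
    return dict(totals)


# Hypothetical per-rule packet rates reported by two nodes.
SAMPLES = [
    ({"instance": "node-a", "policy": "policy-0", "rule_index": "0",
      "rule_direction": "ingress", "action": "allow",
      "traffic_direction": "inbound"}, 2.0),
    ({"instance": "node-b", "policy": "policy-0", "rule_index": "1",
      "rule_direction": "ingress", "action": "allow",
      "traffic_direction": "inbound"}, 3.0),
    ({"instance": "node-a", "policy": "policy-0", "rule_index": "2",
      "rule_direction": "ingress", "action": "deny",
      "traffic_direction": "inbound"}, 1.0),
]

if __name__ == "__main__":
    grouped = sum_without(SAMPLES, {"instance", "rule_index", "rule_direction"})
    for key, total in grouped.items():
        print(dict(key), total)
```

Dropping only `instance`, as in the first two queries, would instead keep one total per rule.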
See the Felix configuration reference for the settings that control the reporting of these metrics. Calico Cloud manifests normally set PrometheusReporterEnabled=true and PrometheusReporterPort=9081, so these metrics are available on each compute node at http://<node-IP>:9081/metrics.
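To spot-check a node without going through Prometheus, the endpoint can be read directly. The sketch below assumes the endpoint serves the standard Prometheus text exposition format; `fetch_metrics` and `parse_metrics` are hypothetical helper names, the sample text is invented, and the parser is deliberately simplified (it does not unescape label values).

```python
import re
import urllib.request

# name{labels} value [timestamp]; the label block and timestamp are optional.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(node_ip: str, port: int = 9081) -> str:
    """Fetch the raw metrics text from a node (port 9081 per the usual manifests)."""
    with urllib.request.urlopen(f"http://{node_ip}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical sample in the text exposition format; real output will differ.
SAMPLE = """\
# TYPE calico_denied_packets gauge
calico_denied_packets{policy="profile|k8s_ns.ns-0|0|deny",srcIP="10.245.13.133"} 5
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

An empty list from `parse_metrics` on live output is consistent with the 60-second expiry described earlier: no policy has processed packets on that node recently.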