Fluentd metrics

Big picture

Use the Prometheus monitoring and alerting tool for Fluentd metrics to ensure continuous network visibility.

Value

Platform engineering teams rely on logs for visibility into their networks. If log collection or storage is disrupted, network visibility suffers. Prometheus can monitor log collection and storage metrics so platform engineering teams are alerted to problems before they escalate.

Concepts

| Component | Description |
| --- | --- |
| Prometheus | Monitoring tool that scrapes metrics from instrumented jobs and displays time series data in a visualizer (such as Grafana). For Calico Cloud, the "jobs" that Prometheus harvests metrics from are the Fluentd components. |
| Fluentd | Sends Calico Cloud logs to Elasticsearch for storage. |

How to

Create Prometheus alerts for Fluentd

The following example creates a Prometheus rule that monitors an important Fluentd metric and alerts when it crosses a defined threshold:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
    - name: tigera-log-collection.rules
      rules:
        - alert: FluentdPodConsistentlyLowBufferSpace
          expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 75
          labels:
            severity: Warning
          annotations:
            summary: "Fluentd pod {{$labels.pod}}'s buffer space is consistently below 75 percent capacity."
            description: "Fluentd pod {{$labels.pod}} has very low buffer space. There may be connection issues between Elasticsearch and Fluentd, or there may be too many logs to write out; check the logs for the Fluentd pod."
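If brief dips below the threshold cause noisy alerts, the rule can also carry a for clause so the condition must hold for a sustained period before the alert fires. The following is a minimal sketch of the same rule with an assumed 10-minute hold; the duration is illustrative and should be tuned to your environment:

        - alert: FluentdPodConsistentlyLowBufferSpace
          expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 75
          # Require the condition to hold for 10 minutes before firing (assumed value).
          for: 10m
          labels:
            severity: Warning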

The alert created in the example is described as follows:

| Alert | Severity | Requires | Issue/reason |
| --- | --- | --- | --- |
| FluentdPodConsistentlyLowBufferSpace | Non-critical, warning | Immediate investigation to ensure logs are being gathered correctly. | A Fluentd pod's available buffer space has averaged less than 75% of capacity over the last 5 minutes. |

This could mean Fluentd is having trouble communicating with the Elasticsearch cluster, the Elasticsearch cluster is down, or there are simply too many logs to process.
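To help distinguish a connectivity problem from simple log volume, a companion rule can watch the Fluentd output buffer queue. The sketch below makes some assumptions: the metric name fluentd_output_status_buffer_queue_length comes from the standard fluent-plugin-prometheus output monitor, and the resource name, alert name, and threshold of 10 queued chunks are illustrative; verify the metric and choose a threshold based on what your cluster actually exposes.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-queue-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
    - name: tigera-log-collection-queue.rules
      rules:
        - alert: FluentdPodHighBufferQueueLength
          # Assumed metric from fluent-plugin-prometheus; the threshold of 10 queued
          # chunks averaged over 5 minutes is illustrative, not a recommended value.
          expr: avg_over_time(fluentd_output_status_buffer_queue_length[5m]) > 10
          labels:
            severity: Warning
          annotations:
            summary: "Fluentd pod {{$labels.pod}}'s output buffer queue length has been consistently high."
            description: "Chunks are queueing faster than Fluentd pod {{$labels.pod}} can flush them to Elasticsearch. Check connectivity to Elasticsearch and the pod's log volume."

A sustained high buffer queue alongside low available buffer space points more strongly at a flush or connectivity problem, whereas a low-buffer alert with a normal queue is more likely caused by raw log volume.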