Fluentd metrics
Big picture
Use the Prometheus monitoring and alerting tool for Fluentd metrics to ensure continuous network visibility.
Value
Platform engineering teams rely on logs for visibility into their networks. If log collection or storage is disrupted, network visibility suffers. Prometheus can monitor log collection and storage metrics so platform engineering teams are alerted to problems before they escalate.
Concepts
Component | Description |
---|---|
Prometheus | Monitoring tool that scrapes metrics from instrumented jobs and displays time series data in a visualizer (such as Grafana). For Calico Cloud, the “job” that Prometheus harvests metrics from is the Fluentd component (see the example query after this table). |
Fluentd | Sends Calico Cloud logs to Elasticsearch for storage. |
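For example, once Prometheus is scraping the Fluentd job, you can confirm that data is arriving by running the query below in the Prometheus expression browser. This is a minimal sketch that uses the same buffer metric as the alert rule later in this article; the exact labels on the returned series depend on your deployment.

```promql
# Rolling 5-minute average of available Fluentd output buffer space
avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m])
```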
How to
Create Prometheus alerts for Fluentd
The following example creates a Prometheus rule that monitors an important Fluentd metric and alerts when it crosses a defined threshold:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tigera-prometheus-log-collection-monitoring
  namespace: tigera-prometheus
  labels:
    role: tigera-prometheus-rules
    prometheus: calico-node-prometheus
spec:
  groups:
    - name: tigera-log-collection.rules
      rules:
        - alert: FluentdPodConsistentlyLowBufferSpace
          expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 75
          labels:
            severity: Warning
          annotations:
            summary: "Fluentd pod {{$labels.pod}}'s buffer space is consistently below 75 percent capacity."
            description: "Fluentd pod {{$labels.pod}} has very low buffer space. There may be connection issues between Elasticsearch and Fluentd, or there may be too many logs to write out. Check the logs for the Fluentd pod."
```
The alert created in the example is described in the following table:
Alert | Severity | Requires | Issue/reason |
---|---|---|---|
FluentdPodConsistentlyLowBufferSpace | Non-critical, warning | Immediate investigation to ensure logs are being gathered correctly. | A Fluentd pod’s available buffer size has averaged less than 75% over the last 5 minutes. This could mean Fluentd is having trouble communicating with the Elasticsearch cluster, the Elasticsearch cluster is down, or there are simply too many logs to process. |
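You can extend the rules list in the same manifest with additional alerts based on the same metric. The following sketch escalates to a higher severity when average available buffer space drops below 50 percent; the alert name, threshold, and severity label are illustrative placeholders, so tune them for your environment. Append it under the rules: list in the manifest above, then re-apply the manifest with kubectl apply.

```yaml
# Hypothetical follow-on alert: escalate when available buffer space stays below 50 percent.
# The alert name, threshold, and severity are placeholders; adjust them for your environment.
- alert: FluentdPodCriticallyLowBufferSpace
  expr: avg_over_time(fluentd_output_status_buffer_available_space_ratio[5m]) < 50
  labels:
    severity: Critical
  annotations:
    summary: "Fluentd pod {{$labels.pod}}'s buffer space is consistently below 50 percent capacity."
    description: "Fluentd pod {{$labels.pod}} is running very low on buffer space. Check connectivity between Fluentd and Elasticsearch, and check the Fluentd pod's logs."
```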