Tigera operator troubleshooting checklist
If you have issues getting your cluster up and running, use this checklist.
- Check installation start errors
- Check Calico Cloud installation
- Check logs for fatal errors
- Check custom resources
- Check pod capacity
- Check Manager UI dashboard for traffic
Check installation start errors
Are you seeing any of these issues at the start of installation?
[ERROR] Detected plugin ls: No such file or directory, it is currently not supported
The cluster you are using to install Calico Cloud does not have a CNI plugin installed, or the CNI plugin is incompatible. If your cluster has functional pod networking and you see this message, it is likely that kubelet has been configured to use kubenet networking, which is not compatible with Calico Cloud. You can use a different cluster, or re-create your cluster with compatible networking.
Calico Cloud cannot be connected to a cluster with FIPS mode enabled
At this time, FIPS mode is not supported in Calico Cloud. Disable FIPS mode in the cluster and install again.
Install script is taking a long time
If you are migrating a large cluster from a previous manifest-based Calico install, the script can take some time; this is normal.
However, it could also mean that your cluster has an incompatibility. Go to the next step, Check Calico Cloud installation.
Check Calico Cloud installation
Installing Calico Cloud on your Kubernetes cluster is managed by the Tigera operator. The Tigera operator is deployed as a ReplicaSet in the tigera-operator namespace, and records status in a custom resource named tigerastatus. The operator gets its configuration from several custom resources (CRs); the central one is the Installation CR.
Check tigerastatus using the following command.
kubectl get tigerastatus
Sample output
NAME AVAILABLE PROGRESSING DEGRADED SINCE
apiserver True False False 10m
calico True False False 11m
cloud-core True False False 11m
compliance True False False 9m39s
intrusion-detection True False False 9m49s
log-collector True False False 9m29s
management-cluster-connection True False False 9m54s
monitor True False False 10m
runtime-security True False False 10m
If all components show Available = True, Calico Cloud is properly installed.
The runtime-security component appears in this output only if the container threat detection feature is enabled.
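Optionally, you can wait for all components to report Available with a single command. This is a convenience sketch that assumes the Available condition shown in the output above; adjust the timeout for your cluster.
kubectl wait --for=condition=Available tigerastatus --all --timeout=5m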
Issue: Calico Cloud is not installed
If Calico Cloud is not installed, you'll get the following error. Install Calico Cloud on the cluster using the curl command that you got from Support.
kubectl get tigerastatus
error: the server doesn't have a resource type "tigerastatus"
Issue: Calico Cloud components are missing or are degraded
If any of the Calico Cloud components show Available = False or Degraded = True, run the following command and contact Support with the output.
kubectl get tigerastatus -o yaml
If you are using the AWS or Azure CNI plugin, a degraded state is likely because you do not have enough pod capacity on your nodes. To fix this, see Check pod capacity.
Sample output
In the following example, the calico component shows Available = False and Degraded = True because the typha deployment has not reached the expected number of replicas. To understand the details of Calico Cloud components, see Deep dive into custom resources.
apiVersion: v1
items:
- apiVersion: operator.tigera.io/v1
kind: TigeraStatus
metadata:
creationTimestamp: '2020-12-30T17:13:30Z'
generation: 1
managedFields:
- apiVersion: operator.tigera.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec: {}
f:status:
.: {}
f:conditions: {}
manager: operator
operation: Update
time: '2020-12-30T17:16:20Z'
name: calico
resourceVersion: '8166'
selfLink: /apis/operator.tigera.io/v1/tigerastatuses/calico
uid: 39a8a2d0-2074-418c-b52d-0baa0a48f4a1
spec: {}
status:
conditions:
- lastTransitionTime: '2020-12-30T17:13:30Z'
status: 'False'
type: Available
- lastTransitionTime: '2020-12-30T17:13:30Z'
message: DaemonSet "calico-system/calico-node" is not yet scheduled on any nodes
reason: Not all pods are ready
status: 'True'
type: Progressing
- lastTransitionTime: '2020-12-30T17:13:30Z'
message: 'failed to wait for operator typha deployment to be ready: waiting
for typha to have 4 replicas, currently at 3'
reason: error migrating resources to calico-system
status: 'True'
type: Degraded
kind: List
metadata:
resourceVersion: ''
selfLink: ''
Check logs for fatal errors
Check that the Tigera operator is running and that logs do not have any fatal errors.
kubectl get pods -n tigera-operator
NAME READY STATUS RESTARTS AGE
tigera-operator-8687585b66-68gmr 1/1 Running 0 139m
kubectl logs -n tigera-operator tigera-operator-8687585b66-68gmr
2020/12/30 17:38:54 [INFO] Version: 90975f4
2020/12/30 17:38:54 [INFO] Go Version: go1.14.4
2020/12/30 17:38:54 [INFO] Go OS/Arch: linux/amd64
{"level":"info","ts":1609349935.2848425,"logger":"setup","msg":"Checking type of cluster","provider":""}
{"level":"info","ts":1609349935.2868738,"logger":"setup","msg":"Checking if TSEE controllers are required","required":true}
<...>
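If the operator log is long, you can filter it for error-level messages. This is only a rough sketch that matches common error keywords; review the full log if anything looks suspicious.
kubectl logs -n tigera-operator deployment/tigera-operator | grep -iE 'error|fatal'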
Check custom resources
Verify that you have the installation custom resource, and that the values are appropriate for your environment.
kubectl get installation.operator.tigera.io default -o yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"operator.tigera.io/v1","kind":"Installation","metadata":{"annotations":{},"name":"default"},"spec":{"imagePullSecrets":[{"name":"tigera-pull-secret"}],"variant":"TigeraSecureEnterprise"}}
creationTimestamp: '2021-01-20T19:50:23Z'
generation: 2
managedFields:
- apiVersion: operator.tigera.io/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:kubectl.kubernetes.io/last-applied-configuration: {}
f:spec:
.: {}
f:imagePullSecrets: {}
f:variant: {}
manager: kubectl
operation: Update
time: '2021-01-20T19:50:23Z'
- apiVersion: operator.tigera.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:calicoNetwork:
.: {}
f:bgp: {}
f:hostPorts: {}
f:ipPools: {}
f:mtu: {}
f:multiInterfaceMode: {}
f:nodeAddressAutodetectionV4:
.: {}
f:firstFound: {}
f:cni:
.: {}
f:ipam:
.: {}
f:type: {}
f:type: {}
f:componentResources: {}
f:flexVolumePath: {}
f:nodeUpdateStrategy:
.: {}
f:rollingUpdate:
.: {}
f:maxUnavailable: {}
f:type: {}
f:status:
.: {}
f:variant: {}
manager: operator
operation: Update
time: '2021-01-20T19:55:10Z'
name: default
resourceVersion: '5195'
selfLink: /apis/operator.tigera.io/v1/installations/default
uid: 016c3f0b-39f0-48a0-9da8-a59a81ed9128
spec:
calicoNetwork:
bgp: Enabled
hostPorts: Enabled
ipPools:
- blockSize: 26
cidr: 10.42.0.0/16
encapsulation: IPIP
natOutgoing: Enabled
nodeSelector: all()
mtu: 0
multiInterfaceMode: None
nodeAddressAutodetectionV4:
firstFound: true
cni:
ipam:
type: Calico
type: Calico
componentResources:
- componentName: Node
resourceRequirements:
requests:
cpu: 250m
flexVolumePath: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
imagePullSecrets:
- name: tigera-pull-secret
nodeUpdateStrategy:
rollingUpdate:
maxUnavailable: 1
type: RollingUpdate
variant: TigeraSecureEnterprise
status:
variant: TigeraSecureEnterprise
Verify that you have the following custom resources. In a default installation, these custom resources contain no configuration information.
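Each of the checks below queries one custom resource type. If you prefer, you can first list all of the resource types in the operator.tigera.io API group in one pass; this is an optional sketch, and the exact set of types depends on the features enabled in your cluster.
kubectl api-resources --api-group=operator.tigera.io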
Check API server
kubectl get apiserver.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 85m
Check cloud core
kubectl get cloudcore.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 88m
Check compliance
kubectl get compliance.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 90m
Check intrusion detection
kubectl get intrusiondetection.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 93m
Check log collector
kubectl get logcollector.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 96m
Check management cluster
kubectl get ManagementClusterConnection.operator.tigera.io tigera-secure -o yaml
apiVersion: operator.tigera.io/v1
kind: ManagementClusterConnection
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"operator.tigera.io/v1","kind":"ManagementClusterConnection","metadata":{"annotations":{},"name":"tigera-secure"},"spec":{"managementClusterAddr":"<Your cluster prefix>.tigera.io:9000"}}
creationTimestamp: '2021-01-20T19:55:40Z'
generation: 1
managedFields:
- apiVersion: operator.tigera.io/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:kubectl.kubernetes.io/last-applied-configuration: {}
f:spec:
.: {}
f:managementClusterAddr: {}
manager: kubectl
operation: Update
time: '2021-01-20T19:55:40Z'
name: tigera-secure
resourceVersion: '5425'
selfLink: /apis/operator.tigera.io/v1/managementclusterconnections/tigera-secure
uid: b7a2093e-a4b6-4e76-b291-15f45bfa11cf
spec:
managementClusterAddr: <Your cluster prefix>.tigera.io:9000
Check monitor
kubectl get monitor.operator.tigera.io tigera-secure
NAME AGE
tigera-secure 98m
Check runtime security
kubectl get runtimesecurity.operator.tigera.io default
NAME AGE
default 99m
The runtime-security custom resource will be available only if the container threat detection feature is enabled.
For more information on operator custom resources, see the Installation API reference.
Deep dive into custom resources
Run the following command to see if you have the required custom resources:
kubectl get tigerastatus
|   | NAME                          | AVAILABLE | PROGRESSING | DEGRADED | SINCE |
|---|-------------------------------|-----------|-------------|----------|-------|
| 1 | apiserver                     | TRUE      | FALSE       | FALSE    | 10m   |
| 2 | calico                        | TRUE      | FALSE       | FALSE    | 11m   |
| 3 | cloud-core                    | TRUE      | FALSE       | FALSE    | 11m   |
| 4 | compliance                    | TRUE      | FALSE       | FALSE    | 9m39s |
| 5 | intrusion-detection           | TRUE      | FALSE       | FALSE    | 9m49s |
| 6 | log-collector                 | TRUE      | FALSE       | FALSE    | 9m29s |
| 7 | management-cluster-connection | TRUE      | FALSE       | FALSE    | 9m54s |
| 8 | monitor                       | TRUE      | FALSE       | FALSE    | 11m   |
| 9 | runtime-security              | TRUE      | FALSE       | FALSE    | 10m   |
1 - apiserver
apiserver is a required component that provides an aggregated API server. It is required for operations such as applying the Tigera license. If tigerastatus reports it as unavailable or degraded, check the pods and logs in the tigera-system namespace. For example:
kubectl get pods -n tigera-system
NAME READY STATUS RESTARTS AGE
tigera-apiserver-5c75bc8d4b-sbn6g 2/2 Running 0 45m
2 - calico
calico is the core networking component. If it is not available or degraded, check the pods and their logs in the calico-system namespace. There should be a calico-node pod running on each of your nodes, at least one calico-typha pod (the number scales with the number of nodes in your cluster), and a calico-kube-controllers pod. For example:
kubectl get pods -n calico-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5c77d4d559-hfl5d 1/1 Running 0 44m
calico-node-6s2c9 1/1 Running 0 40m
calico-node-8nf28 1/1 Running 0 41m
calico-node-djlrg 1/1 Running 0 40m
calico-node-ms8nv 1/1 Running 0 40m
calico-node-t7pck 1/1 Running 0 40m
calico-typha-bdb494458-76gcx 1/1 Running 0 41m
calico-typha-bdb494458-847tr 1/1 Running 0 41m
calico-typha-bdb494458-k8lhj 1/1 Running 0 40m
calico-typha-bdb494458-vjbjz 1/1 Running 0 40m
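To confirm that a calico-node pod is scheduled on every node, you can compare the node count with the calico-node pod count. This is a rough sketch that assumes the default k8s-app=calico-node label on the calico-node pods. Count the nodes:
kubectl get nodes --no-headers | wc -l
then count the calico-node pods; the two numbers should match:
kubectl get pods -n calico-system -l k8s-app=calico-node --no-headers | wc -l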
3 - cloud-core
cloud-core is responsible for predefined and custom roles for users. Check the pods and logs in the calico-cloud namespace with the label selector k8s-app=cc-core-operator.
$ kubectl get pods -n calico-cloud -l k8s-app=cc-core-operator
NAME READY STATUS RESTARTS AGE
cc-core-operator-126dcd494a-9kj7g 1/1 Running 0 80m
4 - compliance
compliance is responsible for the compliance features. Check the pods and logs in the tigera-compliance namespace.
$ kubectl get pods -n tigera-compliance
NAME READY STATUS RESTARTS AGE
compliance-benchmarker-bqvps 1/1 Running 0 65m
compliance-benchmarker-h58hr 1/1 Running 0 65m
compliance-benchmarker-kdtwp 1/1 Running 0 65m
compliance-benchmarker-mzm2z 1/1 Running 0 65m
compliance-benchmarker-s5mmf 1/1 Running 0 65m
compliance-controller-77785646df-ws2cj 1/1 Running 0 65m
compliance-snapshotter-6bcbdc65b-66k9v 1/1 Running 0 65m
5 - intrusion-detection
intrusion-detection is responsible for the intrusion detection features. Check the pods and logs in the tigera-intrusion-detection namespace.
$ kubectl get pods -n tigera-intrusion-detection
NAME READY STATUS RESTARTS AGE
intrusion-detection-controller-669bf45c75-grvz9 1/1 Running 0 66m
intrusion-detection-es-job-installer-xm22v 1/1 Running 0 66m
6 - log-collector
log-collector collects flow and other logs and forwards them to Calico Cloud. Check the pods and logs in the tigera-fluentd namespace. You should have one fluentd pod running on each of your nodes.
kubectl get pods -n tigera-fluentd
NAME READY STATUS RESTARTS AGE
fluentd-node-5mzh6 1/1 Running 0 70m
fluentd-node-7vmxw 1/1 Running 0 70m
fluentd-node-bbc4p 1/1 Running 0 70m
fluentd-node-chfz4 1/1 Running 0 70m
fluentd-node-d6f56 1/1 Running 0 70m
7 - management-cluster-connection
The management-cluster-connection is required for your managed clusters to connect to the Calico Cloud backend. If it is not available or degraded, check the pods and logs in the tigera-guardian namespace.
kubectl get pods -n tigera-guardian
NAME READY STATUS RESTARTS AGE
tigera-guardian-7d5d94d5cc-49rg8 1/1 Running 0 48m
To verify that the guardian component has network connectivity to the management cluster:
Find the URL to the management cluster:
kubectl get managementclusterconnection tigera-secure -o=jsonpath='{.spec.managementClusterAddr}'
<your prefix>.tigera.io:9000
Then, from a worker node, verify network connectivity to the management cluster:
openssl s_client -connect <your prefix>.tigera.io:9000
CONNECTED(00000003)
depth=0 CN = tigera-voltron
verify error:num=18:self signed certificate
verify return:1
depth=0 CN = tigera-voltron
verify return:1
---
Certificate chain
0 s:CN = tigera-voltron
i:CN = tigera-voltron
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIC5DCCAcygAwIBAgIBATANBgkqhkiG9w0BAQsFADAZMRcwFQYDVQQDEw50aWdl
cmEtdm9sdHJvbjAeFw0yMDEyMjExOTA1MzhaFw0yNTEyMjAxOTA1MzhaMBkxFzAV
<...>
8 - monitor
monitor is responsible for configuring Prometheus and associated custom resources. Check the pods and logs in the tigera-prometheus namespace.
$ kubectl get pods -n tigera-prometheus
NAME READY STATUS RESTARTS AGE
alertmanager-calico-node-alertmanager-0 2/2 Running 0 125m
alertmanager-calico-node-alertmanager-1 2/2 Running 0 125m
alertmanager-calico-node-alertmanager-2 2/2 Running 0 125m
calico-prometheus-operator-77bf897c9b-7f88x 1/1 Running 0 125m
prometheus-calico-node-prometheus-0 3/3 Running 1 125m
9 - runtime-security
runtime-security is responsible for the container threat detection feature. Check the pods and logs in the calico-cloud namespace with the label selector k8s-app=tigera-runtime-security-operator.
$ kubectl get pods -n calico-cloud -l k8s-app=tigera-runtime-security-operator
NAME READY STATUS RESTARTS AGE
tigera-runtime-security-operator-127b606afc-ap25k 1/1 Running 0 80m
Check additional custom resources
Check for the presence of other custom resources created by the Tigera operator: FelixConfiguration, IPPool, Tigera License, and Prometheus for component metrics.
FelixConfiguration contains settings that are not configured as environment variables on the calico-node container.
kubectl get Felixconfiguration default
NAME CREATED AT
default 2021-01-20T19:49:35Z
The operator creates a default IPPool for your pod networking if one does not already exist; in this case, the CIDR is taken from the Installation CR.
kubectl get IPPool
NAME CREATED AT
default-ipv4-ippool 2021-01-20T19:49:35Z
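To confirm that the pool's CIDR matches the one in the Installation CR, you can print it directly. A minimal sketch, assuming the default pool name shown above:
kubectl get ippool default-ipv4-ippool -o jsonpath='{.spec.cidr}'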
A Tigera license is applied by the installation script.
kubectl get LicenseKeys.crd.projectcalico.org
NAME AGE
default 120m
The installation script deploys a Prometheus operator and associated custom resources. If you already have a Prometheus operator running in your cluster, contact Tigera support.
kubectl get pods -n tigera-prometheus
NAME READY STATUS RESTARTS AGE
alertmanager-calico-node-alertmanager-0 2/2 Running 0 125m
alertmanager-calico-node-alertmanager-1 2/2 Running 0 125m
alertmanager-calico-node-alertmanager-2 2/2 Running 0 125m
calico-prometheus-operator-77bf897c9b-7f88x 1/1 Running 0 125m
prometheus-calico-node-prometheus-0 3/3 Running 1 125m
Check pod capacity
If your cluster does not have enough pod capacity, it will not be able to deploy the required pods. There is no specific error associated with this condition.
The high-level components Calico Cloud needs to run are:
- Per node: 1 fluentd, 1 compliance benchmarker
- Cluster-wide, in addition to the per-node pods: 3 alertmanager (from a StatefulSet), 1 prometheus, 1 prometheus operator, 1 kube-controllers, 2 compliance (snapshotter and controller), 1 guardian, 1 ids controller, 1 apiserver
Some clusters have limited pod-networked pod capacity.
- For the AWS CNI plugin, the number of pods that can be networked is based on the size of the instance.
- For AKS with the Azure CNI, the number of pods that can be networked is set at cluster deployment time or when new node pools are created (default of 30).
- For GKE, there is a hard limit of 110 pods per node.
- For Calico CNI, the pod limit is based on the available IPs in the IPPool, and there is no specific per node limit.
Verify that you have the following pod-networked pod capacity:
- On each node in your cluster, capacity for at least 2 pods.
- Across the cluster, capacity for at least 11 pods in addition to the per-node capacity.
To check the capacity of individual nodes on AWS or AKS, query the node status and look at Capacity.Pods (which is the total pod capacity for the node). To get the number of pod-networked pods for a node, count the pods on the node that are not using host networking.
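For example, the following sketch lists each node's total pod capacity and then counts the pod-networked pods on a single node; <node-name> is a placeholder for one of your nodes. List the total pod capacity per node:
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.capacity.pods
then count the pods on a node that are not using host networking (the command prints each pod's hostNetwork field and counts the entries that are not true):
kubectl get pods -A --field-selector spec.nodeName=<node-name> -o jsonpath='{range .items[*]}{.spec.hostNetwork}{"\n"}{end}' | grep -cv true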
Check pod security policy violation
If your cluster is using Kubernetes version 1.24 or earlier, a pod security policy (PSP) violation may be blocking pods on the cluster.
Search for the term PodSecurityPolicy in the status message of failed cluster deployments. If a PSP is present, install open source Calico in the cluster before you connect to Calico Cloud.
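To check whether any pod security policies exist in the cluster, you can list them. A minimal sketch; this resource type exists only on Kubernetes 1.24 and earlier.
kubectl get podsecuritypolicies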
Check Manager UI dashboard for traffic
Manager UI main dashboard is missing traffic graphs
When you log in to Manager UI, the first item in the left nav is the main Dashboard. This dashboard is a bird's-eye view of your managed cluster activity for policy and networking. For the graphs to display traffic, you must have the Tigera Prometheus operator. If installation skipped it because an existing Prometheus operator was detected in your cluster, the following message is displayed during installation:
Prometheus Operator detected in the cluster. Skipping Tigera Prometheus Operator
To install an appropriate Prometheus operator, contact Support.
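To confirm whether an existing Prometheus operator caused the Tigera Prometheus operator to be skipped, you can look for Prometheus operator workloads outside the tigera-prometheus namespace. This is only a rough sketch; deployment names vary between Prometheus operator distributions.
kubectl get deployments -A | grep -i prometheus-operator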