Configure egress gateways, AWS
Big picture
Control the source IP address seen by external services/appliances by routing the traffic from certain pods through egress gateways. Use native VPC subnet IP addresses for the egress gateways so that the IPs are valid in the AWS fabric.
Value
Controlling the source IP seen when traffic leaves the cluster allows groups of pods to be identified by external firewalls, appliances, and services (even as the groups are scaled up/down or pods restarted). Calico Enterprise controls the source IP by directing traffic through one or more "egress gateway" pods, which change the source IP of the traffic to their own IP. The egress gateways used can be chosen at the pod or namespace scope, allowing for flexibility in how the cluster is seen from outside.
In AWS, egress gateway source IP addresses are chosen from an IP pool backed by a VPC subnet using Calico Enterprise IPAM. Calico Enterprise IPAM allows the IP addresses to be precisely controlled, which allows for static configuration of external appliances. Using an IP pool backed by a VPC subnet allows Calico Enterprise to configure the AWS fabric to route traffic to and from the egress gateway using its own IP address.
Concepts
CIDR notation
This article assumes that you are familiar with network masks and CIDR notation.
- CIDR notation is defined in RFC4632.
- The Wikipedia article on CIDR notation provides a good reference.
AWS-backed IP pools
Calico Enterprise supports IP pools that are backed by the AWS fabric. Workloads that use an IP address from an AWS-backed pool can communicate on the AWS network using their own IP address and AWS will route their traffic to/from their host without changing the IP address.
Pods that use an IP address from an AWS-backed pool may also be assigned an AWS Elastic IP via a pod annotation. Elastic IPs used in this way have the normal AWS semantics: when accessing resources inside the AWS network, the workload's private IP (from the IP pool) is used. When accessing resources outside the AWS network, AWS translates the workload's IP to the Elastic IP. Elastic IPs also allow for incoming requests from outside the AWS fabric, direct to the workload.
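For illustration, here is a minimal sketch of a pod that requests an AWS-backed IP and one of a list of Elastic IPs (the pod name, image, and addresses are hypothetical; the annotation value is a JSON-formatted list):

apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    # Hypothetical Elastic IPs; one of them is assigned to this pod.
    cni.projectcalico.org/awsElasticIPs: '["44.192.0.10", "44.192.0.11"]'
spec:
  containers:
    - name: app
      image: alpine
      command: ['/bin/sleep', 'infinity']
      resources:
        requests:
          projectcalico.org/aws-secondary-ipv4: 1
        limits:
          projectcalico.org/aws-secondary-ipv4: 1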
In overview, the AWS-backed IP Pools feature works as follows:
- An IP pool is created with its `awsSubnetID` field set to the ID of a VPC subnet. This "AWS-backed" IP pool's CIDR must be contained within the VPC subnet's CIDR.

  caution: You must ensure that the CIDR(s) used for AWS-backed IP pools are reserved in the AWS fabric, for example by creating a dedicated VPC subnet for Calico Enterprise. If the CIDR is not reserved, both Calico Enterprise and AWS may try to assign the same IP address, resulting in a conflict.
- Since they are a limited resource, Calico Enterprise IPAM does not use AWS-backed pools by default. To request an AWS-backed IP address, a pod must have a resource request:

  spec:
    containers:
      - ...
        resources:
          requests:
            projectcalico.org/aws-secondary-ipv4: 1
          limits:
            projectcalico.org/aws-secondary-ipv4: 1

  Calico Enterprise manages the `projectcalico.org/aws-secondary-ipv4` capacity on the Kubernetes Node resource, ensuring that Kubernetes will not try to schedule too many AWS-backed workloads to the same node. Only AWS-backed pods are limited in this way; there is no limit on the number of non-AWS-backed pods.
- When the CNI plugin spots such a resource request, it will choose an IP address from an AWS-backed pool. Only pools with VPC subnets in the availability zone of the host are considered.
- When Felix, Calico Enterprise's per-host agent, spots a local workload with an AWS-backed address, it tries to ensure that the IP address of the workload is assigned to the host in the AWS fabric. If need be, it will create a new secondary ENI device and attach it to the host to house the IP address. Felix supports two modes for assigning secondary ENIs: ENI-per-workload mode (added in v3.13) and Secondary-IP-per-workload mode. These modes are described below.
- If the pod has one or more AWS Elastic IPs listed in the `cni.projectcalico.org/awsElasticIPs` pod annotation, Felix will try to ensure that one of the Elastic IPs is assigned to the pod's private IP address in the AWS fabric. (Specifying multiple Elastic IPs is useful for multi-pod deployments; it ensures that each pod in the deployment gets one of the IPs.)
Egress gateway
An egress gateway acts as a transit pod for the outbound application traffic that is configured to use it. As traffic leaving the cluster passes through the egress gateway, its source IP is changed to that of the egress gateway pod, and the traffic is then forwarded on.
Source IP
When an outbound application flow leaves the cluster, its IP packets will have a source IP. This begins as the pod IP of the pod that originated the flow, then:
- If no egress gateway is configured and the pod IP came from an IP pool with `natOutgoing: true`, the node hosting the pod will change the source IP to its own as the traffic leaves the host. This allows the pod to communicate with external services even though the external network is unaware of the pod's IP.
- If the pod is configured with an egress gateway, the traffic is first forwarded to the egress gateway, which changes the source IP to its own and then sends the traffic on. To function correctly, egress gateways should have IPs from an IP pool with `natOutgoing: false`, meaning their host forwards the packet onto the network without changing the source IP again. Since the egress gateway's IP is visible to the underlying network fabric, the fabric must be configured to know about the egress gateway's IP and to send response traffic back to the same host.
AWS VPCs and subnets
An AWS VPC is a virtual network that is, by default, logically isolated from other VPCs. Each VPC has one or more (often large) CIDR blocks associated with it (for example, 10.0.0.0/16). In general, VPC CIDRs may overlap, but only if the VPCs remain isolated. AWS allows VPCs to be peered with each other through VPC peerings. VPCs can only be peered if none of their associated CIDRs overlap.
Each VPC has one or more VPC subnets associated with it; each subnet owns a non-overlapping part of one of the VPC's CIDR blocks. Each subnet is associated with a particular availability zone. Instances in one availability zone can only use IP addresses from subnets in that zone. Unfortunately, this adds some complexity to managing egress gateway IP addresses: much of the configuration must be repeated per-AZ.
AWS VPC and DirectConnect peerings
AWS VPC Peerings allow multiple VPCs to be connected together. Similarly, DirectConnect allows external datacenters to be connected to an AWS VPC. Peered VPCs and datacenters communicate using private IPs as if they were all on one large private network.
By using AWS-backed IP pools, egress gateways can be assigned private IPs, allowing them to communicate without NAT within the same VPC, with peered VPCs, and with peered datacenters.
Secondary Elastic Network Interfaces (ENIs)
Elastic network interfaces are network interfaces that can be added and removed from an instance dynamically. Each ENI has a primary IP address from the VPC subnet that it belongs to, and it may also have one or more secondary IP addresses, chosen from the same subnet. While the primary IP address is fixed and cannot be changed, secondary IP addresses can be added and removed at runtime.
To arrange for AWS to route traffic to and from egress gateways, Calico Enterprise adds secondary Elastic Network Interfaces (ENIs) to the host. Calico Enterprise supports two modes for provisioning the secondary ENIs. The table below describes the trade-offs between ENI-per-workload and Secondary-IP-per-workload modes:
| ENI-per-workload (since v3.13) | Secondary-IP-per-workload |
| --- | --- |
| One secondary ENI is attached for each AWS-backed workload. | Secondary ENIs are shared; multiple workloads per ENI. |
| Supports one AWS-backed workload per secondary ENI. | Supports 2-49 AWS-backed workloads per secondary ENI (depending on instance type). |
| ENI primary IP is set to the workload's IP. | ENI primary IP is chosen from dedicated "host secondary" IP pools. |
| Makes best use of AWS IP space; no need to reserve IPs for hosts. | Requires "host secondary" IPs to be reserved; these cannot be used for workloads. |
| ENI deleted when the workload is deleted. | ENI retained (ready for the next workload to be scheduled). |
| Slower to handle churn/workload mobility (creating an ENI is slower than assigning an IP). | Faster at handling churn/workload mobility. |
The number of ENIs that an instance can support and the number of secondary IPs that each ENI can support depend on the instance type, according to this table. Note: the table lists the total number of network interfaces and IP addresses, but the first interface on the host (the primary interface) and, in Secondary-IP-per-workload mode, the first IP of each interface (its primary IP) cannot be used for egress gateways.
The primary interface cannot be used for egress gateways because it belongs to the VPC subnet that is in use for Kubernetes hosts; this means that a planned egress gateway IP could be used by AWS as the primary IP of an instance (for example, when scaling up the cluster).
Before you begin
Required
- Calico CNI
- Open port UDP 4790 on the host
Not Supported
- Amazon VPC CNI

Calico Enterprise CNI and IPAM are required. The ability to control the egress gateway's IP is a feature of Calico Enterprise CNI and IPAM. The AWS VPC CNI does not support that feature, so it is incompatible with egress gateways.
How to
- Configure IP autodetection
- Ensure Kubernetes VPC has free CIDR range
- Create dedicated VPC subnets
- Configure AWS IAM roles
- Configure IP reservations for each VPC subnet
- Enable egress gateway support
- Enable AWS-backed IP pools
- Configure IP pools backed by VPC subnets
- Deploy a group of egress gateways
- Configure iptables backend for egress gateways
- Configure namespaces and pods to use egress gateways
- Optionally enable ECMP load balancing
- Verify the feature operation
- Control the use of egress gateways
- Policy enforcement for flows via an egress gateway
- Upgrade egress gateways
Configure IP autodetection
Since this feature adds additional network interfaces to nodes, it is important to configure Calico Enterprise to autodetect the correct primary interface to use for normal pod-to-pod traffic. Otherwise, Calico Enterprise may autodetect a newly-added secondary ENI as the main interface, causing an outage.
For EKS clusters, the default IP autodetection method is `can-reach=8.8.8.8`, which will choose the interface with a route to 8.8.8.8; this is typically the interface with a default route, which will be the correct (primary) ENI. (Calico Enterprise ensures that the secondary ENIs do not have default routes in the main routing table.)
For other AWS clusters, Calico Enterprise may default to `firstFound`, which is not suitable.
To examine the autodetection method, check the operator's installation resource:
$ kubectl get installations.operator.tigera.io -o yaml default
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  ...
  name: default
  ...
spec:
  calicoNetwork:
    ...
    nodeAddressAutodetectionV4:
      firstFound: true
    ...
If `nodeAddressAutodetectionV4` is set to `firstFound: true` or is not specified, then you must change it to another method by editing the resource. The NodeAddressAutodetection options `canReach` and `cidrs` are suitable; see the Installation reference. If using the `cidrs` option, set the CIDRs list to include only the CIDRs from which your primary ENI IPs are chosen (do not include the dedicated VPC subnets chosen below).
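For example, a sketch of switching to the `cidrs` method (the CIDR is an assumption; substitute the CIDR(s) that your nodes' primary ENI IPs are chosen from; the explicit null clears any `firstFound` setting):

kubectl patch installations.operator.tigera.io default --type='merge' -p \
  '{"spec":{"calicoNetwork":{"nodeAddressAutodetectionV4":{"firstFound":null,"cidrs":["10.0.0.0/16"]}}}}'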
Ensure Kubernetes VPC has free CIDR range
For egress gateways to be useful in AWS, we want to assign them IP addresses from a VPC subnet that is in the same AZ as their host.
To avoid clashes between AWS IP allocations and Calico Enterprise IP allocations, it is important that the range of IP addresses assigned to Calico Enterprise IP pools is not used by AWS for automatic allocations. In this guide we assume that you have created a dedicated VPC subnet per Availability Zone (AZ) that is reserved for Calico Enterprise and configured not to be used as the default subnet for the AZ.
If you are creating your cluster and VPC from scratch, plan to subdivide the VPC CIDR into (at least) two VPC subnets per AZ. One VPC subnet for the Kubernetes (and any other) hosts and one VPC subnet for egress gateways. (The next section explains the sizing requirements for the egress gateway subnets.)
If you are adding this feature to an existing cluster, you may find that the existing VPC subnets already cover the entire VPC CIDR, making it impossible to create a new subnet. If that is the case, you can make more room by adding a second CIDR to the VPC that is large enough for the new subnets. For information on adding a secondary CIDR range to a VPC, see this guide.
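For reference, a sketch of associating a secondary CIDR using the AWS CLI (the VPC ID and CIDR are hypothetical):

# Add a second CIDR block to the VPC to make room for the new subnets.
aws ec2 associate-vpc-cidr-block --vpc-id vpc-0123456789abcdef0 --cidr-block 100.64.0.0/16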
Create dedicated VPC subnets
Calico Enterprise requires a dedicated VPC subnet in each AWS availability zone in which you wish to deploy egress gateways. The subnet must be dedicated to Calico Enterprise so that AWS will not use IP addresses from the subnet for other purposes (as this could clash with an egress gateway's IP). When creating the subnet, you should configure it not to be used for instances.
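For example, a sketch of creating one such subnet with the AWS CLI (the VPC ID, CIDR, and availability zone are hypothetical; repeat per AZ):

# Create a dedicated subnet for egress gateways in one availability zone.
aws ec2 create-subnet --vpc-id vpc-0123456789abcdef0 \
  --cidr-block 100.64.0.0/22 --availability-zone us-west-1a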
Some IP addresses from the dedicated subnet are reserved for AWS and Calico Enterprise internal use:
- The first four IP addresses in the subnet cannot be used. These are reserved by AWS for internal use.
- Similarly, the last IP in the subnet (the broadcast address) cannot be used.
- In Secondary-IP-per-workload mode, Calico Enterprise requires one IP address from the subnet per secondary ENI that it provisions (for use as the primary IP address of the ENI). In ENI-per-workload mode, this is not required.
Example for ENI-per-workload mode:
- You anticipate having up to 30 instances running in each availability zone (AZ).
- You intend to use `t3.large` instances; these are limited to 3 ENIs per host.
- So, each host can accept 2 secondary ENIs, each of which can handle one egress gateway.
- With 2 ENIs per node and 30 nodes, the part of the cluster in this AZ could handle up to 30 * 2 = 60 egress gateways.
- AWS reserves 5 IPs from the VPC subnet for internal use; no "host secondary" IPs need to be reserved in this mode.
- Since VPC subnets are allocated by CIDR, a /25 subnet containing 128 IP addresses would comfortably fit the 5 reserved IPs as well as the 60 possible gateways (with headroom for more nodes to be added later).
Example for Secondary-IP-per-workload mode:
- You anticipate having up to 30 instances running in each availability zone (AZ).
- You intend to use `t3.large` instances; these are limited to 3 ENIs per host (one of which is the primary), and each ENI can handle 12 IP addresses (one of which is the primary).
- So, each host can accept 2 secondary ENIs, and each secondary ENI can handle 11 egress gateway pods.
- Each in-use secondary ENI requires one IP from the VPC subnet (up to 60 in this case), and AWS requires 5 IPs to be reserved, so that's up to 65 IPs reserved in total.
- With 2 ENIs and 11 IPs per ENI, the part of the cluster in this AZ could handle up to 30 * 2 * 11 = 660 egress gateways.
- Since VPC subnets are allocated by CIDR, a /22 subnet containing 1024 IP addresses would comfortably fit the 65 reserved IPs as well as the 660 possible gateways.
Calico Enterprise allocates ENIs on demand, so each instance will only claim one of those reserved IP addresses when the first egress gateway is assigned to that node. It will only claim its second IP when the first ENI becomes full and an extra egress gateway is provisioned.
Configure AWS IAM roles
To provision the required AWS resources, each calico-node pod in your cluster requires the following IAM permissions to be granted. The permissions can be granted to the node IAM role itself, or by using the AWS IAM roles for service accounts feature to grant the permissions to the `calico-node` service account.
- DescribeInstances
- DescribeInstanceTypes
- DescribeNetworkInterfaces
- DescribeSubnets
- DescribeTags
- CreateTags
- AssignPrivateIpAddresses
- UnassignPrivateIpAddresses
- AttachNetworkInterface
- CreateNetworkInterface
- DeleteNetworkInterface
- DetachNetworkInterface
- ModifyNetworkInterfaceAttribute
The above permissions are similar to those used by the AWS VPC CNI (since both CNIs need to provision the same kinds of resources). In addition, to support Elastic IPs, each calico-node also requires the following permissions:
- DescribeAddresses
- AssociateAddress
- DisassociateAddress
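Putting the two lists together, a minimal sketch of an IAM policy document granting all of the above (scoping every action to all resources, as shown, is an assumption; you may wish to restrict it further):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeSubnets",
        "ec2:DescribeTags",
        "ec2:CreateTags",
        "ec2:AssignPrivateIpAddresses",
        "ec2:UnassignPrivateIpAddresses",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DetachNetworkInterface",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:DescribeAddresses",
        "ec2:AssociateAddress",
        "ec2:DisassociateAddress"
      ],
      "Resource": "*"
    }
  ]
}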
Configure AWS Security Group rules
To allow client traffic to reach the egress gateway pod's host, the security group's inbound rules must be updated with a rule that allows all packets from within the security group.
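For example, a sketch using the AWS CLI (the security group ID is hypothetical):

# Allow all traffic between members of the same security group.
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol all --source-group sg-0123456789abcdef0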
Configure IP reservations for each VPC subnet
Since the first four IP addresses and the last IP address in a VPC subnet cannot be used, it is important to prevent Calico Enterprise from trying to use them. For each VPC subnet that you plan to use, ensure that you have an entry in an IP reservation for its first four IP addresses and its final IP address.
For example, if your chosen VPC subnets are 100.64.0.0/22 and 100.64.4.0/22, you could create the following IPReservation resource, which covers both VPC subnets (if you're not familiar with CIDR notation, replacing the /22 of the original subnet with /30 is a shorthand for "the first four IP addresses"):
apiVersion: projectcalico.org/v3
kind: IPReservation
metadata:
  name: aws-ip-reservations
spec:
  reservedCIDRs:
    - 100.64.0.0/30
    - 100.64.3.255
    - 100.64.4.0/30
    - 100.64.7.255
Enable egress gateway support
In the default FelixConfiguration, set the `egressIPSupport` field to `EnabledPerNamespace` or `EnabledPerNamespaceOrPerPod`, according to the level of support that you need in your cluster. For support on a per-namespace basis only:
kubectl patch felixconfiguration default --type='merge' -p \
'{"spec":{"egressIPSupport":"EnabledPerNamespace"}}'
Or for support both per-namespace and per-pod:
kubectl patch felixconfiguration default --type='merge' -p \
'{"spec":{"egressIPSupport":"EnabledPerNamespaceOrPerPod"}}'
- `egressIPSupport` must be the same on all cluster nodes, so you should set it only in the default FelixConfiguration resource.
- The operator automatically enables the required policy sync API in the FelixConfiguration.
Enable AWS-backed IP pools
To enable ENI-per-workload mode, in the default FelixConfiguration, set the `awsSecondaryIPSupport` field to `EnabledENIPerWorkload`:

kubectl patch felixconfiguration default --type='merge' -p \
  '{"spec":{"awsSecondaryIPSupport":"EnabledENIPerWorkload"}}'
To enable Secondary-IP-per-workload mode, set the field to `Enabled` (the name `Enabled` predates the addition of the ENI-per-workload mode):

kubectl patch felixconfiguration default --type='merge' -p \
  '{"spec":{"awsSecondaryIPSupport":"Enabled"}}'
You can verify that the setting took effect by examining the Kubernetes Node resources:

kubectl describe node <nodename>

should show the new `projectcalico.org/aws-secondary-ipv4` capacity (in the Allocated Resources section).
Changing modes
You can change between the two modes by:
- Ensuring that the number of egress gateways on every node is within the limits of the particular mode. For example, when switching to ENI-per-workload mode, the number of egress gateways must be less than or equal to the number of secondary ENIs that your instances can handle.
- Editing the setting (using the patch commands above, for example).
Changing the mode will cause disruption as ENIs must be removed and re-added.
Configure IP pools backed by VPC subnets
In ENI-per-workload mode, IP pools are (only) used to subdivide the VPC subnets into small pools used for particular groups of egress gateways. These IP pools must have:

- `awsSubnetID` set to the ID of the relevant VPC subnet. This activates the AWS-backed IP feature for these pools.
- `allowedUses` set to `["Workload"]` to tell Calico Enterprise IPAM to use those pools for the egress gateway workloads.
- `vxlanMode` and `ipipMode` set to `Never` to disable encapsulation for the egress gateway pods. (`Never` is the default if these fields are not specified.)
- `blockSize` set to 32. This aligns Calico Enterprise IPAM with the behaviour of the AWS fabric.
- `disableBGPExport` set to `true`. This prevents routing conflicts if your cluster is using IPIP or BGP networking.
It's also recommended to:

- Set `nodeSelector` to `"!all()"`. This prevents Calico Enterprise IPAM from using the pool automatically; it will only be used for workloads that explicitly name it in the `cni.projectcalico.org/ipv4pools` annotation.
Continuing the example above, with VPC subnets:

- 100.64.0.0/22 in, say, availability zone west-1, with ID subnet-000000000000000001
- 100.64.4.0/22 in, say, availability zone west-2, with ID subnet-000000000000000002

And, assuming that there are two clusters of egress gateways, "red" and "blue" (which in turn serve namespaces "red" and "blue"), one way to structure the IP pools is to have one IP pool for each group of egress gateways in each subnet. Then, if a particular egress gateway from the egress gateway cluster is scheduled to one AZ or the other, it will take an IP from the appropriate pool.
For the "west-1" availability zone:
- IP pool "egress-red-west-1", CIDR 100.64.0.4/30 (the first non-reserved /30 CIDR in the VPC subnet). These addresses will be used for "red" egress gateways in the "west-1" AZ.
- IP pool "egress-blue-west-1", CIDR 100.64.0.8/30 (the next 4 IPs from the "west-1" subnet). These addresses will be used for "blue" egress gateways in the "west-1" AZ.
For the "west-2" availability zone:
- IP pool "egress-red-west-2", CIDR 100.64.4.4/30 (the first non-reserved /30 CIDR in the VPC subnet). These addresses will be used for "red" egress gateways in the "west-2" AZ.
- IP pool "egress-blue-west-2", CIDR 100.64.4.8/30 (the next 4 IPs from the "west-2" subnet). These addresses will be used for "blue" egress gateways in the "west-2" AZ.
Converting this to IPPool resources:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-red-west-1
spec:
  cidr: 100.64.0.4/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000001
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-blue-west-1
spec:
  cidr: 100.64.0.8/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000001
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-red-west-2
spec:
  cidr: 100.64.4.4/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000002
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-blue-west-2
spec:
  cidr: 100.64.4.8/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000002
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
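If you saved the manifests above to a file (the filename here is hypothetical), they can be applied in the usual way:

kubectl apply -f egress-pools-eni.yaml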
In Secondary-IP-per-workload mode, IP pools are used to subdivide the VPC subnets as follows:

- One medium-sized IP pool per subnet, reserved for Calico Enterprise to use for the primary IP addresses of its secondary ENIs. These pools must have:
  - `awsSubnetID` set to the ID of the relevant VPC subnet. This activates the AWS-backed IP feature for these pools.
  - `allowedUses` set to `["HostSecondaryInterface"]` to reserve them for this purpose.
  - `blockSize` set to 32. This aligns Calico Enterprise IPAM with the behaviour of the AWS fabric.
  - `vxlanMode` and `ipipMode` set to `Never`. (`Never` is the default if these fields are not specified.)
  - `disableBGPExport` set to `true`. This prevents routing conflicts if your cluster is using IPIP or BGP networking.
- Small pools used for particular groups of egress gateways. These must have:
  - `awsSubnetID` set to the ID of the relevant VPC subnet. This activates the AWS-backed IP feature for these pools.
  - `allowedUses` set to `["Workload"]` to tell Calico Enterprise IPAM to use those pools for the egress gateway workloads.
  - `vxlanMode` and `ipipMode` set to `Never` to disable encapsulation for the egress gateway pods. (`Never` is the default if these fields are not specified.)
  - `blockSize` set to 32. This aligns Calico Enterprise IPAM with the behaviour of the AWS fabric.
  - `disableBGPExport` set to `true`. This prevents routing conflicts if your cluster is using IPIP or BGP networking.
It's also recommended to:

- Set `nodeSelector` to `"!all()"`. This prevents Calico Enterprise IPAM from using the pool automatically; it will only be used for workloads that explicitly name it in the `cni.projectcalico.org/ipv4pools` annotation.
Continuing the example above, with VPC subnets:

- 100.64.0.0/22 in, say, availability zone west-1, with ID subnet-000000000000000001
- 100.64.4.0/22 in, say, availability zone west-2, with ID subnet-000000000000000002

And, assuming that there are two clusters of egress gateways, "red" and "blue" (which in turn serve namespaces "red" and "blue"), one way to structure the IP pools is to have a "hosts" IP pool in each VPC subnet and one IP pool for each group of egress gateways in each subnet. Then, if a particular egress gateway from the egress gateway cluster is scheduled to one AZ or the other, it will take an IP from the appropriate pool.
For the "west-1" availability zone:
- IP pool "hosts-west-1", CIDR 100.64.0.0/25 (the first 128 addresses in the "west-1" VPC subnet). We'll reserve these addresses for hosts to use. 100.64.0.0/25 covers the addresses from 100.64.0.0 to 100.64.0.127 (but addresses 100.64.0.0 to 100.64.0.3 were reserved above).
- IP pool "egress-red-west-1", CIDR 100.64.0.128/30 (the next 4 IPs from the "west-1" subnet). These addresses will be used for "red" egress gateways in the "west-1" AZ.
- IP pool "egress-blue-west-1", CIDR 100.64.0.132/30 (the next 4 IPs from the "west-1" subnet). These addresses will be used for "blue" egress gateways in the "west-1" AZ.
For the "west-2" availability zone:
- IP pool "hosts-west-2", CIDR 100.64.4.0/25 (the first 128 addresses in the "west-2" VPC subnet). 100.64.4.0/25 covers the addresses from 100.64.4.0 to 100.64.4.127 (but addresses 100.64.4.0 to 100.64.4.3 were reserved above).
- IP pool "egress-red-west-2", CIDR 100.64.4.128/30 (the next 4 IPs from the "west-2" subnet). These addresses will be used for "red" egress gateways in the "west-2" AZ.
- IP pool "egress-blue-west-2", CIDR 100.64.4.132/30 (the next 4 IPs from the "west-2" subnet). These addresses will be used for "blue" egress gateways in the "west-2" AZ.
Converting this to IPPool resources:
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: hosts-west-1
spec:
  cidr: 100.64.0.0/25
  allowedUses: ['HostSecondaryInterface']
  awsSubnetID: subnet-000000000000000001
  blockSize: 32
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-red-west-1
spec:
  cidr: 100.64.0.128/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000001
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-blue-west-1
spec:
  cidr: 100.64.0.132/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000001
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: hosts-west-2
spec:
  cidr: 100.64.4.0/25
  allowedUses: ['HostSecondaryInterface']
  awsSubnetID: subnet-000000000000000002
  blockSize: 32
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-red-west-2
spec:
  cidr: 100.64.4.128/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000002
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
---
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-blue-west-2
spec:
  cidr: 100.64.4.132/30
  allowedUses: ['Workload']
  awsSubnetID: subnet-000000000000000002
  blockSize: 32
  nodeSelector: '!all()'
  disableBGPExport: true
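Once applied, you can list the pools to confirm that they were created; this assumes the Calico Enterprise API server is serving the projectcalico.org/v3 API, as in a standard install:

kubectl get ippools.projectcalico.org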
Deploy a group of egress gateways
Use an egress gateway custom resource to deploy a group of egress gateways.
Using the example of the "red" egress gateway cluster, we use several features of Kubernetes and Calico Enterprise in tandem to get a cluster of egress gateways that spans both availability zones and uses AWS-backed IP addresses:
kubectl apply -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: EgressGateway
metadata:
  name: "egress-gateway-red"
  namespace: "calico-egress"
spec:
  logSeverity: "Info"
  replicas: 2
  ipPools:
    - name: "egress-red-west-1"
    - name: "egress-red-west-2"
  # Uncomment this block to add ICMP, HTTP probes
  # egressGatewayFailureDetection:
  #   healthTimeoutDataStoreSeconds: 30
  #   icmpProbe:
  #     ips:
  #       - <IP to probe>
  #       - <IP to probe>
  #     timeoutSeconds: 15
  #     intervalSeconds: 5
  #   httpProbe:
  #     urls:
  #       - <URL to probe>
  #       - <URL to probe>
  #     timeoutSeconds: 30
  #     intervalSeconds: 10
  aws:
    nativeIP: Enabled
  template:
    metadata:
      labels:
        egress-code: red
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      terminationGracePeriodSeconds: 0
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: "DoNotSchedule"
          labelSelector:
            matchLabels:
              egress-code: red
EOF
- `replicas: 2` tells Kubernetes to schedule two egress gateways in the "red" cluster.
- `ipPools` tells Calico Enterprise IPAM to use one of the "red" IP pools:

  ipPools:
    - name: "egress-red-west-1"
    - name: "egress-red-west-2"

  Depending on which AZ the pod is scheduled in, Calico Enterprise IPAM will automatically ignore IP pools that are backed by AWS subnets that are not in the local AZ. External services and appliances can recognise "red" traffic because it will all come from the CIDRs of the "red" IP pools.
- When `nativeIP` is enabled, the IP pools must be AWS-backed. It also tells Kubernetes to only schedule the gateway to a node with available AWS IP capacity:

  aws:
    nativeIP: Enabled

- The following topology spread constraint ensures that Kubernetes spreads the egress gateways evenly between AZs (assuming that your nodes are labeled with the expected well-known label `topology.kubernetes.io/zone`):

  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          egress-code: red

- The labels are arbitrary. You can choose whatever names and values are convenient for your cluster's Namespaces and Pods to refer to in their egress selectors. If labels are not specified, a default label `projectcalico.org/egw: name` will be added by the Tigera Operator.
- `icmpProbe` may be used to specify the probe IPs, and the ICMP probe interval and timeout in seconds. If `ips` is set, the egress gateway pod will probe each IP periodically using an ICMP ping; if all pings fail, the egress gateway will report non-ready via its health port. `intervalSeconds` controls the interval between probes. `timeoutSeconds` controls the timeout before reporting non-ready if no probes succeed.

  icmpProbe:
    ips:
      - <IP to probe>
      - <IP to probe>
    timeoutSeconds: 20
    intervalSeconds: 10

- `httpProbe` may be used to specify the probe URLs, and the HTTP probe interval and timeout in seconds. If `urls` is set, the egress gateway pod will probe each external service periodically; if all probes fail, the egress gateway will report non-ready via its health port. `intervalSeconds` controls the interval between probes. `timeoutSeconds` controls the timeout before reporting non-ready if all probes are failing.

  httpProbe:
    urls:
      - <URL to probe>
      - <URL to probe>
    timeoutSeconds: 30
    intervalSeconds: 10

- Please refer to the operator reference docs for details about the egress gateway resource type.
- It is advisable to have more than one egress gateway per group, so that the egress IP function continues if one of the gateways crashes or needs to be restarted. When there are multiple gateways in a group, outbound traffic from the applications using that group is load-balanced across the available gateways. The number of `replicas` specified must be less than or equal to the number of free IP addresses in the IP pool.
- An IP pool can be specified either by its name (e.g. `- name: egress-ippool-1`) or by its CIDR (e.g. `- cidr: 10.10.10.0/31`).

The health port 8080 is used by:

- The Kubernetes `readinessProbe` to expose the status of the egress gateway pod (and any ICMP/HTTP probes).
- Remote pods to check if the egress gateway is "ready". Only "ready" egress gateways will be used for remote client traffic. This traffic is automatically allowed by Calico Enterprise, and no policy is required to allow it. Calico Enterprise only sends probes to egress gateway pods that have a named "health" port. This ensures that, during an upgrade, health probes are only sent to upgraded egress gateways.
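Once the resource is applied, you can check that the gateways came up, spread across zones, and took AWS-backed IPs; the namespace and label below follow the example above:

kubectl get pods -n calico-egress -l egress-code=red -o wide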
Configure iptables backend for egress gateways
The Tigera Operator configures egress gateways to use the same iptables backend as `calico-node`. To modify the iptables backend for egress gateways, you must change the `iptablesBackend` field in the Felix configuration.
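For example, a sketch of such a change (the value NFT is illustrative; see the FelixConfiguration reference for the supported values):

kubectl patch felixconfiguration default --type='merge' -p \
  '{"spec":{"iptablesBackend":"NFT"}}'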
Configure namespaces and pods to use egress gateways
You can configure namespaces and pods to use an egress gateway by:
- annotating the namespace or pod
- applying an egress gateway policy to the namespace or pod.
Using an egress gateway policy is more complicated, but it allows advanced use cases.
Configure a namespace or pod to use an egress gateway (annotation method)
In a Calico Enterprise deployment, the Kubernetes namespace and pod resources honor annotations that tell that namespace or pod to use particular egress gateways. These annotations are selectors, and their meaning is "the set of pods, anywhere in the cluster, that match those selectors".
So, to configure all the pods in a namespace to use the egress gateways that are labelled with `egress-code: red`, you would annotate that namespace like this:
kubectl annotate ns <namespace> egress.projectcalico.org/selector="egress-code == 'red'"
By default, that selector can only match egress gateways in the same namespace. To select gateways in a different namespace, specify a `namespaceSelector` annotation as well, like this:
kubectl annotate ns <namespace> egress.projectcalico.org/namespaceSelector="projectcalico.org/name == 'default'"
Egress gateway annotations have the same syntax and range of expressions as the selector fields in Calico Enterprise network policy.
To configure a specific Kubernetes Pod to use egress gateways, specify the same annotations when creating the pod. For example:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.projectcalico.org/selector: egress-code == 'red'
    egress.projectcalico.org/namespaceSelector: projectcalico.org/name == 'default'
  name: my-client
  namespace: my-namespace
spec:
  ...
EOF
Configure a namespace or pod to use an egress gateway (egress gateway policy method)
Creating an egress gateway policy gives you more control over how your egress gateways work. For example, you can:
- Send egress gateway traffic to multiple egress gateways, depending on the destination.
- Skip egress gateways for traffic that is bound for local endpoints that aren't in the cluster.
The following is an example of an egress gateway policy:
apiVersion: projectcalico.org/v3
kind: EgressGatewayPolicy
metadata:
  name: "egw-policy1"
spec:
  rules:
    - destination:
        cidr: 10.0.0.0/8
      description: "Local: no gateway"
    - destination:
        cidr: 11.0.0.0/8
      description: "Gateway to on prem"
      gateway:
        namespaceSelector: "projectcalico.org/name == 'default'"
        selector: "egress-code == 'blue'"
        maxNextHops: 2
    - description: "Gateway to internet"
      gateway:
        namespaceSelector: "projectcalico.org/name == 'default'"
        selector: "egress-code == 'red'"
      gatewayPreference: PreferNodeLocal
- If the `destination` field is not specified, it takes the default value of 0.0.0.0/0.
- If the `gateway` field is not specified, then egress traffic is routed locally, and not through an egress gateway. This is helpful for reaching local endpoints that are not part of a cluster.
- The `namespaceSelector` and `selector` fields are required when the `gateway` field is specified.
- The `maxNextHops` field specifies the maximum number of egress gateway replicas from the selected deployment that a pod depends on. For more information, see Optimize egress networking for workloads with long-lived TCP connections.
- `gatewayPreference` specifies hints for the gateway selection process. The default, `None`, selects the default selection process. If set to `PreferNodeLocal`, egress gateways local to the client's node are used if available; if there are no local egress gateways, Calico Enterprise uses the other egress gateways. In this example, for the default route, egress gateways local to the client's node are used if present; if not, all egress gateways matching the selector are used.
CIDRs specified in rules in an egress gateway policy are matched in longest-prefix-match (LPM) fashion.
Calico Enterprise rejects egress gateway policies that do any of the following:
- The policy has no rule that specifies a gateway or a destination.
- The policy has a rule with empty `selector` or `namespaceSelector` fields.
- The policy has two or more rules with the same destination.
To configure all the pods in a namespace to use an egress gateway policy named `egw-policy1`, you could annotate the namespace like this:
kubectl annotate ns <namespace> egress.projectcalico.org/egressGatewayPolicy="egw-policy1"
To configure a specific Kubernetes pod to use the same policy, specify the same annotations when creating the pod. For example:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.projectcalico.org/egressGatewayPolicy: "egw-policy1"
  name: my-client
  namespace: my-namespace
spec:
  ...
EOF
You must create the egress gateway policy before you apply it to a namespace or pod. If you attempt to apply an egress gateway policy that has not been created, Calico Enterprise will block all traffic from the namespace or pod.
Add AWS Elastic IPs to the egress gateway deployment
To add AWS Elastic IPs to the egress gateway pods, follow these steps:
- Ensure that your worker nodes are either in a private subnet, or they are using Elastic IPs for their public IP.

  warning: If your worker nodes are using "standard" VPC public IPs, adding the Elastic IP to the node triggers the node to lose its VPC public IP. This is because, in AWS networking, a node is not allowed to have both a VPC public IP and an Elastic IP. However, we have found that this is only enforced at boot time.

- Ensure that your VPC has an Internet Gateway and a (default) route to the Internet Gateway from the AWS subnets used for egress gateways. (This is a standard requirement for Elastic IPs in AWS.)
- Create one or more Elastic IPs for the deployment. This can be done through the AWS Console or using the AWS command line interface.
- Add the Elastic IPs to the egress gateway resource:

  aws:
    nativeIP: Enabled
    elasticIPs: ["37.1.2.3", "43.2.5.6"]
Once the update has rolled out, Calico Enterprise will try to add one of the requested Elastic IPs to each pod in the deployment.
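You can confirm the resulting associations from the AWS side; for example, using the illustrative addresses above:

aws ec2 describe-addresses --public-ips 37.1.2.3 43.2.5.6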
Optionally enable ECMP load balancing
If you are provisioning multiple egress gateways for a given client pod, and you want traffic from that client to load balance across the available gateways, set the `fib_multipath_hash_policy` sysctl to allow that:
sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1
You will need this on each node with clients that you want to load balance across multiple egress gateways.
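To persist the setting across reboots, you can drop it into a sysctl configuration file; a sketch (the filename is arbitrary):

# /etc/sysctl.d/99-egress-ecmp.conf
# Hash ECMP routes on the L4 5-tuple so that flows spread across gateways.
net.ipv4.fib_multipath_hash_policy = 1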
Verify the feature operation
To verify the feature operation, cause the application pod to initiate a connection to a server outside the cluster, and observe -- for example using tcpdump -- the source IP of the connection packet as it reaches the server.
In order for such a connection to complete, the server must know how to route back to the egress gateway's IP.
By way of a concrete example, you could use netcat to run a test server outside the cluster (outside AWS if you're using Elastic IPs); for example:
docker run --net=host --privileged subfuzion/netcat -v -l -k -p 8089
Then provision an egress IP Pool, and egress gateways, as above.
Then deploy a pod, with egress annotations as above, and with any image that includes netcat, for example:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-netcat-pod
  namespace: my-namespace
spec:
  containers:
    - name: alpine
      image: alpine
      command: ["/bin/sleep"]
      args: ["infinity"]
EOF
Now you can use `kubectl exec` to initiate an outbound connection from that pod:

kubectl exec <pod name> -n <pod namespace> -- nc <server IP> 8089 </dev/null

where `<server IP>` should be the IP address of the netcat server.
Then, if you check the logs or output of the netcat server, you should see:

Connection from <source IP> <source port> received

with `<source IP>` being one of the IPs of the egress IP pool that you provisioned.
Control the use of egress gateways
If a cluster ascribes special meaning to traffic flowing through egress gateways, it will be important to control when cluster users can configure their pods and namespaces to use them, so that ordinary pods cannot impersonate the pods that carry that special meaning.
If namespaces in a cluster can only be provisioned by cluster admins, one option is to enable egress gateway function only on a per-namespace basis. Then only cluster admins will be able to configure any egress gateway usage.
Otherwise -- if namespace provisioning is open to users in general, or if it's desirable for egress gateway function to be enabled both per-namespace and per-pod -- a Kubernetes admission controller will be needed. This is a task for each deployment to implement for itself, but possible approaches include the following.
- Decide whether a given Namespace or Pod is permitted to use egress annotations at all, based on other details of the Namespace or Pod definition.
- Evaluate egress annotation selectors to determine the egress gateways that they map to, and decide whether that usage is acceptable.
- Impose the cluster's own bespoke scheme for a Namespace or Pod to identify the egress gateways that it wants to use, less general than Calico Enterprise's egress annotations. The admission controller would then police those bespoke annotations (which the cluster's users could place on Namespace or Pod resources) and either reject the operation in hand, or allow it through after adding the corresponding Calico Enterprise egress annotations.
Policy enforcement for flows via an egress gateway
For an outbound connection from a client pod, via an egress gateway, to a destination outside the cluster, there is more than one possible enforcement point for policy:
The path of the traffic through policy is as follows:
- Packet leaves the client pod and passes through its egress policy.
- The packet is encapsulated by the client pod's host and sent to the egress gateway's host.
- The encapsulated packet is sent from that host to the egress gateway pod.
- The egress gateway pod de-encapsulates the packet and sends the packet out again with its own address.
- The packet leaves the egress gateway pod through its egress policy.
To ensure correct operation, as of v3.15 the encapsulated traffic between the host and the egress gateway is auto-allowed by Calico Enterprise, and other ingress traffic is blocked. That means that there are effectively two places where policy can be applied:
- on egress from the client pod
- on egress from the egress gateway pod (see limitations below).
The policy applied at (1) is the most powerful since it implicitly sees the original source of the traffic (by virtue of being attached to that original source). It also sees the external destination of the traffic.
Since an egress gateway will never originate its own traffic, one option is to rely on policy applied at (1) and to allow all traffic at (2) (either by applying no policy or by applying an "allow all").
Alternatively, for maximum "defense in depth", applying policy at both (1) and (2) provides extra protection should the policy at (1) be disabled or bypassed by an attacker. Policy at (2) has the following limitations:
- Domain-based policy is not supported at egress from egress gateways. It will either fail to match the expected traffic, or it will work intermittently if the egress gateway happens to be scheduled to the same node as its clients. This is because any DNS lookup happens at the client pod; by the time the traffic reaches (2), the DNS information is lost and only the IP addresses of the traffic are available.
- The traffic source will appear to be the egress gateway pod; the original source information is lost in the address translation that occurs inside the egress gateway pod.
That means that policies at (2) will usually take the form of rules that match only on destination port and IP address, either directly in the rule (via a CIDR match) or via a (non-domain based) NetworkSet. Matching on source has little utility since the IP will always be the egress gateway and the port of translated traffic is not always preserved.
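For illustration, a hedged sketch of such a policy applied at (2), using the "red" gateways from the example above (the tier, external CIDR, and port are assumptions for this sketch):

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  # Calico Enterprise policy names are prefixed with their tier; "default" is assumed here.
  name: default.egress-gw-red-allowlist
  namespace: calico-egress
spec:
  tier: default
  selector: egress-code == 'red'
  types:
    - Egress
  egress:
    # Match only on destination; the source will always be the gateway itself.
    - action: Allow
      protocol: TCP
      destination:
        nets:
          - 203.0.113.0/24 # hypothetical external service CIDR
        ports:
          - 443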
Since v3.15.0, Calico Enterprise also sends health probes to the egress gateway pods from the nodes where their clients are located. In iptables mode, this traffic is auto-allowed at egress from the host and ingress to the egress gateway. In eBPF mode, the probe traffic can be blocked by policy, so you must ensure that this traffic is allowed; this should be fixed in an upcoming patch release.
Upgrade egress gateways
From v3.16, egress gateway deployments are managed by the Tigera Operator.
- When upgrading from a pre-v3.16 release, no automatic upgrade will occur. To upgrade a pre-v3.16 egress gateway deployment, create an equivalent EgressGateway resource with the same namespace and the same name as mentioned above; the operator will then take over management of the old Deployment resource, replacing it with the upgraded version.
- Use `kubectl apply` to create the egress gateway resource. The Tigera Operator will read the newly created resource and wait for the other Calico Enterprise components to be upgraded. Once the other Calico Enterprise components are upgraded, the Tigera Operator will upgrade the existing egress gateway deployment with the new image.
By default, upgrading egress gateways will sever any connections that are flowing through them. To minimise impact, the egress gateway feature supports some advanced options that give feedback to affected pods. For more details see the egress gateway maintenance guide.
Additional resources
Please see also:
- The `egressIP...` and `aws...` fields of the FelixConfiguration resource.
- Troubleshooting egress gateways.
- Additional configuration for egress gateway maintenance