Configure egress gateways, Azure

Big picture

Control the source IP address seen by external services/appliances by routing the traffic from certain pods through egress gateways. Use native VNet subnet IP addresses for the egress gateways so that the IPs are valid in the Azure fabric.

Value

Controlling the source IP seen when traffic leaves the cluster allows groups of pods to be identified by external firewalls, appliances and services (even as the groups are scaled up/down or pods restarted). Calico Cloud controls the source IP by directing traffic through one or more "egress gateway" pods, which change the source IP of the traffic to their own IP. The egress gateways used can be chosen at the pod or namespace scope allowing for flexibility in how the cluster is seen from outside.

In Azure, egress gateway source IP addresses are chosen from a user-defined IP pool using Calico Cloud IPAM. Because egress gateway pods draw their source IPs from dedicated IP pools, external appliances can be statically configured with those known address ranges.

Concepts

Egress gateway

An egress gateway acts as a transit pod for the outbound application traffic that is configured to use it. As traffic leaving the cluster passes through the egress gateway, its source IP is changed to that of the egress gateway pod, and the traffic is then forwarded on.

Source IP

When an outbound application flow leaves the cluster, its IP packets will have a source IP. This begins as the pod IP of the pod that originated the flow, then:

  • If no egress gateway is configured and the pod IP came from an IP pool with natOutgoing: true, the node hosting the pod will change the source IP to its own as the traffic leaves the host. This allows the pod to communicate with external services even though the external network is unaware of the pod's IP.

  • If the pod is configured with an egress gateway, the traffic is first forwarded to the egress gateway, which changes the source IP to its own and then sends the traffic on. To function correctly, egress gateways should have IPs from an IP pool with natOutgoing: false, meaning their host forwards the packet onto the network without changing the source IP again. Since the egress gateway's IP is visible to the underlying network fabric, the fabric must be configured to know about the egress gateway's IP and to send response traffic back to the same host.
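
For reference, natOutgoing is a field on the Calico Cloud IPPool resource. As a minimal illustrative sketch (the pool name and CIDR are examples, not values used elsewhere in this guide), a pod IP pool whose traffic is SNATted by the host on egress looks like this:

kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: example-pod-pool
spec:
  cidr: 192.168.0.0/16
  natOutgoing: true
EOF

Egress gateway pools, by contrast, leave natOutgoing at its default of false.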

Azure VNets and subnets

An Azure VNet is a virtual network that is, by default, logically isolated from other VNets. Each VNet has one or more (often large) CIDR blocks associated with it (for example 10.0.0.0/16). In general, VNet CIDRs may overlap, but only if the VNets remain isolated. Azure allows VNets to be peered with each other through VNet Peerings. VNets can only be peered if none of their associated CIDRs overlap.

Each VNet has one or more VNet subnets associated with it; each subnet owns a non-overlapping part of one of the VNet's CIDR blocks. VNet subnets span all availability zones in a region, so egress gateway resources can be distributed across availability zones without repeating the configuration per AZ.

Azure VNet and ExpressRoute peerings

Azure VNet Peerings allow multiple VNets to be connected together. Similarly, ExpressRoute allows external datacenters to be connected to an Azure VNet. Peered VNets and datacenters communicate using private IPs as if they were all on one large private network.

By advertising routes to the Azure fabric using Azure Route Server, egress gateways can be assigned private IPs, allowing them to communicate without NAT within the same VNet, with peered VNets, and with peered datacenters.

Azure Route Server

Azure Route Server is a managed networking service that allows a network virtual appliance, such as Calico Cloud, to dynamically configure the Azure fabric by exchanging routing information over the Border Gateway Protocol (BGP). Calico Cloud can establish BGP sessions with an Azure Route Server in a VNet to advertise the IPs of egress gateways to that VNet. The learned routes are then propagated to other VNets through VNet peering, or to external datacenters through ExpressRoute, allowing communication with the egress gateways.

Before you begin

Required

  • Calico CNI
  • Open port UDP 4790 on the host

Not Supported

  • Azure VNet CNI

Calico Cloud CNI and IPAM are required. The ability to control the egress gateway’s IP is a feature of Calico Cloud CNI and IPAM. Azure VNet CNI does not support that feature, so it is incompatible with egress gateways.
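
To confirm which CNI plugin a cluster is using, you can inspect the cni field of the default Installation resource; for a compatible cluster this prints Calico:

kubectl get installation default -o jsonpath='{.spec.cni.type}'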

How to

Choose route reflectors

It is possible to establish BGP connections between all Calico Cloud nodes and Azure Route Servers, but to avoid hitting the Azure Route Server peer limit, it is better to select some nodes as route reflectors and set up BGP connections between those and the Azure Route Server. The number of route reflectors depends on the cluster size, but it is recommended to have at least 3 at all times; an Azure Route Server supports up to 8 peers in a VNet.

Create Azure Route Server

Deploy Azure Route Server in the VNet (hub or spoke VNet) that routes egress addresses. Then, add the selected route reflectors as peers to the Azure Route Server.
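
If you create these with the Azure CLI, a sketch along the following lines can be used (the resource group, names, subnet ID, and peer IP are placeholders; the peer ASN shown is Calico Cloud's default of 64512, so adjust it if you have changed yours):

az network routeserver create \
  --resource-group my-rg \
  --name my-route-server \
  --hosted-subnet <resource ID of the RouteServerSubnet> \
  --public-ip-address my-route-server-ip

az network routeserver peering create \
  --resource-group my-rg \
  --routeserver my-route-server \
  --name route-reflector-1 \
  --peer-ip <route reflector node IP> \
  --peer-asn 64512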

note
  • The BGP connections between Calico route reflectors and Azure Route Servers are critical for the functionality of egress gateways. It is important to maintain route reflectors with care, and to make sure there are always enough healthy route reflectors.
  • If possible, assign a static address to the route reflectors so that the same address is kept across reboots.
  • In AKS, it is recommended to run applications in user node pools and leave system node pools for running critical system pods. The nodes in system node pools are perfect Route Reflector candidates.

Disable the default BGP node-to-node mesh

The default Calico Cloud node-to-node BGP mesh may be turned off to enable other BGP topologies. To do this, modify the default BGP configuration resource.

Run the following command to disable the BGP full-mesh:

kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false
EOF
note

Disabling the node-to-node mesh will break pod networking until/unless you configure replacement BGP peerings using BGPPeer resources. You may configure the BGPPeer resources before disabling the node-to-node mesh to avoid pod networking breakage.

Enable BGP

Usually, in a default installation of Calico Cloud in a public cloud, BGP is disabled because it is not needed. However, running egress gateways in Azure relies on BGP connections between Calico Cloud and Azure Route Servers. To enable BGP, set the bgp field to Enabled in the default installation:

kubectl patch installation default --type='merge' -p '{"spec": {"calicoNetwork": {"bgp": "Enabled"}}}'

and then restart the calico-node DaemonSet:

kubectl rollout restart ds calico-node -n calico-system

Provision an egress IP pool

Provision a small IP Pool with the range of source IPs that you want to use for a particular application when it connects to an external service. For example:

kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: egress-ip-red-pool
spec:
  cidr: 10.10.10.0/30
  blockSize: 31
  nodeSelector: "!all()"
  vxlanMode: Never
EOF

Where:

  • blockSize must be specified when the prefix length of the whole cidr is greater than the default blockSize of 26.

  • nodeSelector: "!all()" is recommended so that this egress IP pool is not accidentally used for cluster pods in general. Specifying this nodeSelector means that the IP pool is only used for pods that explicitly identify it in their cni.projectcalico.org/ipv4pools annotation.

  • Set vxlanMode to Never since in Azure we rely on Azure fabric to route traffic to egress gateway pods.

(Optional) Limit number of route advertisements

It is only necessary to advertise the egress IP pool, created in the previous section, to the Azure Route Servers. By default, Calico Cloud advertises all other IP pools as well, which does not affect functionality. However, Azure Route Servers have a limit of 1000 advertisements per peer. This limit should not be an issue in a small or medium cluster, but could be for a large cluster. To avoid hitting the limit, you can create BGP filters so that only egress IP pools are advertised. As an example, for the IP pool created in the previous section, configure a BGP filter by running the following command:

kubectl apply -f - <<EOF
kind: BGPFilter
apiVersion: projectcalico.org/v3
metadata:
  name: export-egress-ips
spec:
  exportV4:
  - action: Reject
    matchOperator: NotIn
    cidr: 10.10.10.0/30
EOF

Even with the BGP filter, Calico Cloud still advertises pod IP blocks. To prevent Calico Cloud from advertising pod IP blocks, set disableBGPExport to true in the default IP pools.
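
For example, assuming the default pod pool has the usual name default-ipv4-ippool (check with kubectl get ippools), the following patch stops its blocks from being exported over BGP:

kubectl patch ippool default-ipv4-ippool --type='merge' -p '{"spec":{"disableBGPExport":true}}'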

Configure route reflectors

Calico Cloud nodes can be configured to act as route reflectors. To do this, each node that you want to act as a route reflector must have a cluster ID - typically an unused IPv4 address.

To configure a node to be a route reflector with cluster ID 244.0.0.1, run the following command:

kubectl annotate node <node name> projectcalico.org/RouteReflectorClusterID=244.0.0.1

Typically, you will want to label this node to indicate that it is a route reflector, allowing it to be easily selected by a BGPPeer resource. You can do this with kubectl. For example:

kubectl label node <node name> route-reflector=true

Now it is easy to configure route reflector nodes to peer with each other and other non-route-reflector nodes using label selectors. For example:

kind: BGPPeer
apiVersion: projectcalico.org/v3
metadata:
  name: peer-with-route-reflectors
spec:
  nodeSelector: all()
  peerSelector: route-reflector == 'true'

Finally, add the IP addresses of the Azure Route Servers (usually there are two) as peers of the route reflectors. For example, if the Calico Cloud cluster is in a subnet whose network address is 10.224.0.0 (so 10.224.0.1 is, by definition, the subnet's gateway), and the Azure Route Servers are deployed with ASN 65515 and IP addresses 10.225.0.4 and 10.225.0.5, create two BGPPeer resources:

apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: azure-route-server-a
spec:
  peerIP: 10.225.0.4
  reachableBy: 10.224.0.1
  asNumber: 65515
  keepOriginalNextHop: true
  nodeSelector: route-reflector == 'true'
  filters:
  - export-egress-ips
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: azure-route-server-b
spec:
  peerIP: 10.225.0.5
  reachableBy: 10.224.0.1
  asNumber: 65515
  keepOriginalNextHop: true
  nodeSelector: route-reflector == 'true'
  filters:
  - export-egress-ips
note
  • Adding routeReflectorClusterID to a node spec will remove it from the node-to-node mesh immediately, tearing down the existing BGP sessions. Adding the BGP peering will bring up new BGP sessions. This causes a short (about 2 seconds) disruption to the dataplane traffic of workloads running on the affected nodes. To avoid this, make sure no workloads are running on those nodes, by provisioning new nodes or by running kubectl drain on the node (which may itself cause a disruption as workloads are drained).
  • It is important to set keepOriginalNextHop: true, since route reflectors advertise routes on behalf of other nodes. Routes advertised to Azure Route Servers must carry the original next hop; otherwise the return packets would be sent to the route reflectors and dropped.
  • It is mandatory to set the reachableBy field to the gateway of the subnet the Calico Cloud cluster is running in when peering with Azure Route Servers, to prevent BGP connection flapping. See BGP peer for more information.
  • Including filters, which applies the BGP filter configured in the previous section, is optional.
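
To check that the BGP sessions to the Azure Route Servers come up, one option is to create a CalicoNodeStatus resource for a route reflector node and read back its BGP session state; a sketch, with the node name as a placeholder:

kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: CalicoNodeStatus
metadata:
  name: rr-node-status
spec:
  node: <route reflector node name>
  classes:
  - BGP
  updatePeriodSeconds: 10
EOF

kubectl get caliconodestatus rr-node-status -o yaml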

Enable egress gateway support

In the default FelixConfiguration, set the egressIPSupport field to EnabledPerNamespace or EnabledPerNamespaceOrPerPod, according to the level of support that you need in your cluster. For support on a per-namespace basis only:

kubectl patch felixconfiguration default --type='merge' -p \
'{"spec":{"egressIPSupport":"EnabledPerNamespace"}}'

Or for support both per-namespace and per-pod:

kubectl patch felixconfiguration default --type='merge' -p \
'{"spec":{"egressIPSupport":"EnabledPerNamespaceOrPerPod"}}'
note
  • egressIPSupport must be the same on all cluster nodes, so you should set it only in the default FelixConfiguration resource.
  • The operator automatically enables the required policy sync API in the FelixConfiguration.

Deploy a group of egress gateways

Use a Kubernetes Deployment to deploy a group of egress gateways.

Using the example of the "red" egress gateway cluster, we use several features of Kubernetes and Calico Cloud in tandem to get a cluster of egress gateways that uses user-defined IP addresses:

kubectl apply -f - <<EOF
apiVersion: operator.tigera.io/v1
kind: EgressGateway
metadata:
  name: egress-gateway-red
  namespace: my-namespace
spec:
  logSeverity: "Info"
  replicas: 2
  ipPools:
  - name: egress-ip-red-pool
  # Uncomment this block to add ICMP, HTTP probes
  # egressGatewayFailureDetection:
  #   healthTimeoutDataStoreSeconds: 30
  #   icmpProbe:
  #     ips:
  #     - <IP to probe>
  #     - <IP to probe>
  #     timeoutSeconds: 15
  #     intervalSeconds: 5
  #   httpProbe:
  #     urls:
  #     - <URL to probe>
  #     - <URL to probe>
  #     timeoutSeconds: 30
  #     intervalSeconds: 10
  template:
    metadata:
      labels:
        egress-code: red
    spec:
      terminationGracePeriodSeconds: 0
      nodeSelector:
        kubernetes.io/os: linux
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            egress-code: red
EOF
  • replicas: 2 tells Kubernetes to schedule two egress gateways in the "red" cluster.

  • The ipPools field tells Calico Cloud IPAM to assign the egress gateways' IP addresses from the "egress-ip-red-pool" IP pool.

External services and appliances can recognise "red" traffic because it will all come from the CIDRs of the "red" IP pool.

  • The following topology spread constraint ensures that Kubernetes spreads the egress gateways evenly between AZs (assuming that your nodes are labeled with the expected well-known label topology.kubernetes.io/zone):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      egress-code: red
  • The labels are arbitrary. You can choose whatever names and values are convenient for your cluster's Namespaces and Pods to refer to in their egress selectors. If labels are not specified, a default label projectcalico.org/egw:name will be added by the Tigera Operator.

  • icmpProbe may be used to specify probe IPs and the ICMP probe interval and timeout in seconds. If ips is set, the egress gateway pod probes each IP periodically using an ICMP ping. If all pings fail, the egress gateway reports non-ready via its health port. intervalSeconds controls the interval between probes; timeoutSeconds controls the timeout before reporting non-ready if no probes succeed.

icmpProbe:
  ips:
  - <IP to probe>
  - <IP to probe>
  timeoutSeconds: 20
  intervalSeconds: 10
  • httpProbe may be used to specify probe URLs and the HTTP probe interval and timeout in seconds. If urls is set, the egress gateway pod probes each external service periodically. If all probes fail, the egress gateway reports non-ready via its health port. intervalSeconds controls the interval between probes; timeoutSeconds controls the timeout before reporting non-ready if all probes are failing.

httpProbe:
  urls:
  - <URL to probe>
  - <URL to probe>
  timeoutSeconds: 30
  intervalSeconds: 10
note
  • It is advisable to have more than one egress gateway per group, so that the egress IP function continues if one of the gateways crashes or needs to be restarted. When there are multiple gateways in a group, outbound traffic from the applications using that group is load-balanced across the available gateways. The number of replicas specified must be less than or equal to the number of free IP addresses in the IP Pool.

  • An IP pool can be specified either by its name (e.g. - name: egress-ip-red-pool) or by its CIDR (e.g. - cidr: 10.10.10.0/30).

The health port 8080 is used by:

  • The Kubernetes readinessProbe to expose the status of the egress gateway pod (and any ICMP/HTTP probes).

  • Remote pods to check if the egress gateway is "ready". Only "ready" egress gateways will be used for remote client traffic. This traffic is automatically allowed by Calico Cloud and no policy is required to allow it. Calico Cloud only sends probes to egress gateway pods that have a named "health" port. This ensures that during an upgrade, health probes are only sent to upgraded egress gateways.

Configure iptables backend for egress gateways

The Tigera Operator configures egress gateways to use the same iptables backend as calico-node. To modify the iptables backend for egress gateways, you must change the iptablesBackend field in the Felix configuration.
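
For example, to switch Felix (and therefore the egress gateways) to the NFT backend, a merge patch like the following can be used; valid iptablesBackend values are Legacy, NFT, and Auto:

kubectl patch felixconfiguration default --type='merge' -p '{"spec":{"iptablesBackend":"NFT"}}'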

Configure namespaces and pods to use egress gateways

You can configure namespaces and pods to use an egress gateway by:

  • annotating the namespace or pod
  • applying an egress gateway policy to the namespace or pod.

Using an egress gateway policy is more complicated, but it allows advanced use cases.

Configure a namespace or pod to use an egress gateway (annotation method)

In a Calico Cloud deployment, the Kubernetes namespace and pod resources honor annotations that tell that namespace or pod to use particular egress gateways. These annotations are selectors, and their meaning is "the set of pods, anywhere in the cluster, that match those selectors".

So, to configure all the pods in a namespace to use the egress gateways that are labelled with egress-code: red, you would annotate that namespace like this:

kubectl annotate ns <namespace> egress.projectcalico.org/selector="egress-code == 'red'"

By default, that selector can only match egress gateways in the same namespace. To select gateways in a different namespace, specify a namespaceSelector annotation as well, like this:

kubectl annotate ns <namespace> egress.projectcalico.org/namespaceSelector="projectcalico.org/name == 'default'"

Egress gateway annotations have the same syntax and range of expressions as the selector fields in Calico Cloud network policy.

To configure a specific Kubernetes Pod to use egress gateways, specify the same annotations when creating the pod. For example:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.projectcalico.org/selector: egress-code == 'red'
    egress.projectcalico.org/namespaceSelector: projectcalico.org/name == 'default'
  name: my-client
  namespace: my-namespace
spec:
  ...
EOF

Configure a namespace or pod to use an egress gateway (egress gateway policy method)

Creating an egress gateway policy gives you more control over how your egress gateways work. For example, you can:

  • Send egress gateway traffic to multiple egress gateways, depending on the destination.
  • Skip egress gateways for traffic that is bound for local endpoints that aren't in the cluster.

The following is an example of an egress gateway policy:

apiVersion: projectcalico.org/v3
kind: EgressGatewayPolicy
metadata:
  name: "egw-policy1"
spec:
  rules:
  - destination:
      cidr: 10.0.0.0/8
    description: "Local: no gateway"
  - destination:
      cidr: 11.0.0.0/8
    description: "Gateway to on prem"
    gateway:
      namespaceSelector: "projectcalico.org/name == 'default'"
      selector: "egress-code == 'blue'"
      maxNextHops: 2
  - description: "Gateway to internet"
    gateway:
      namespaceSelector: "projectcalico.org/name == 'default'"
      selector: "egress-code == 'red'"
    gatewayPreference: PreferNodeLocal
  1. If the destination field is not specified, it takes the default value of 0.0.0.0/0.

  2. If the gateway field is not specified, then egress traffic is routed locally, and not through an egress gateway. This is helpful for reaching local endpoints that are not part of the cluster.

  3. The namespaceSelector field is required when the gateway field is specified.

  4. The selector field is required when the gateway field is specified.

  5. The maxNextHops field specifies the maximum number of egress gateway replicas from the selected deployment that a pod depends on. For more information, see Optimize egress networking for workloads with long-lived TCP connections.

  6. gatewayPreference specifies hints to the gateway selection process. The default, None, selects the default selection process. If set to PreferNodeLocal, egress gateways local to the client's node are used if available; if there are none, Calico Cloud uses the other egress gateways matching the selector. In this example, for the default route, egress gateways local to the client's node are used if present. If not, all egress gateways matching the selector are used.

note

CIDRs specified in rules in an egress gateway policy are matched in Longest Prefix Match (LPM) fashion.

Calico Cloud rejects egress gateway policies that do any of the following:

  • have no rule that specifies a gateway or a destination
  • have a rule with empty selector or namespaceSelector fields
  • have two or more rules with the same destination

To configure all the pods in a namespace to use an egress gateway policy named egw-policy1, you could annotate the namespace like this:

kubectl annotate ns <namespace> egress.projectcalico.org/egressGatewayPolicy="egw-policy1"

To configure a specific Kubernetes pod to use the same policy, specify the same annotations when creating the pod. For example:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  annotations:
    egress.projectcalico.org/egressGatewayPolicy: "egw-policy1"
  name: my-client
  namespace: my-namespace
spec:
  ...
EOF
caution

You must create the egress gateway policy before you apply it to a namespace or pod. If you attempt to apply an egress gateway policy that has not been created, Calico Cloud will block all traffic from the namespace or pod.

(Optional) Enable ECMP load balancing

If you are provisioning multiple egress gateways for a given client pod, and you want traffic from that client to load balance across the available gateways, set the fib_multipath_hash_policy sysctl to allow that:

sudo sysctl -w net.ipv4.fib_multipath_hash_policy=1

You will need this on each node with clients that you want to load balance across multiple egress gateways.
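
A sysctl set with sysctl -w does not survive a reboot. Assuming your nodes apply /etc/sysctl.d at boot (true of most systemd-based distributions), one way to persist the setting is a drop-in file:

echo 'net.ipv4.fib_multipath_hash_policy = 1' | sudo tee /etc/sysctl.d/99-egress-ecmp.conf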

Verify the feature operation

To verify the feature operation, cause the application pod to initiate a connection to a server outside the cluster, and observe -- for example using tcpdump -- the source IP of the connection packet as it reaches the server.

note

In order for such a connection to complete, the server must know how to route back to the egress gateway's IP.

By way of a concrete example, you could use netcat to run a test server outside the cluster; for example:

docker run --net=host --privileged subfuzion/netcat -v -l -k -p 8089

Then provision an egress IP Pool, and egress gateways, as above.

Then deploy a pod, with egress annotations as above, and with any image that includes netcat, for example:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: my-netcat-pod
  namespace: <namespace>
spec:
  containers:
  - name: alpine
    image: alpine
    command: ["/bin/sleep"]
    args: ["infinity"]
EOF

Now you can use kubectl exec to initiate an outbound connection from that pod:

kubectl exec <pod name> -n <namespace> -- nc <server IP> 8089 </dev/null

where <server IP> should be the IP address of the netcat server.

Then, if you check the logs or output of the netcat server, you should see:

Connection from <source IP> <source port> received

with <source IP> being one of the IPs of the egress IP pool that you provisioned.

Control the use of egress gateways

If a cluster ascribes special meaning to traffic flowing through egress gateways, it will be important to control when cluster users can configure their pods and namespaces to use them, so that pods without that special meaning cannot impersonate it.

If namespaces in a cluster can only be provisioned by cluster admins, one option is to enable egress gateway function only on a per-namespace basis. Then only cluster admins will be able to configure any egress gateway usage.

Otherwise -- if namespace provisioning is open to users in general, or if it's desirable for egress gateway function to be enabled both per-namespace and per-pod -- a Kubernetes admission controller will be needed. This is a task for each deployment to implement for itself, but possible approaches include the following.

  1. Decide whether a given Namespace or Pod is permitted to use egress annotations at all, based on other details of the Namespace or Pod definition.

  2. Evaluate egress annotation selectors to determine the egress gateways that they map to, and decide whether that usage is acceptable.

  3. Impose the cluster's own bespoke scheme for a Namespace or Pod to identify the egress gateways that it wants to use, less general than Calico Cloud's egress annotations. The admission controller would then police those bespoke annotations (which that cluster's users could place on Namespace or Pod resources) and either reject the operation in hand, or allow it through after adding the corresponding Calico Cloud egress annotations.

Policy enforcement for flows via an egress gateway

For an outbound connection from a client pod, via an egress gateway, to a destination outside the cluster, there is more than one possible enforcement point for policy. The path of the traffic through policy is as follows:

  1. Packet leaves the client pod and passes through its egress policy.
  2. The packet is encapsulated by the client pod's host and sent to the egress gateway.
  3. The encapsulated packet is sent from the host to the egress gateway pod.
  4. The egress gateway pod de-encapsulates the packet and sends the packet out again with its own address.
  5. The packet leaves the egress gateway pod through its egress policy.

To ensure correct operation, the encapsulated traffic between the host and the egress gateway is auto-allowed by Calico Cloud (as of v3.15), and other ingress traffic is blocked. That means that there are effectively two places where policy can be applied:

  1. on egress from the client pod
  2. on egress from the egress gateway pod (see limitations below).

The policy applied at (1) is the most powerful since it implicitly sees the original source of the traffic (by virtue of being attached to that original source). It also sees the external destination of the traffic.

Since an egress gateway will never originate its own traffic, one option is to rely on policy applied at (1) and to allow all traffic at (2) (either by applying no policy or by applying an "allow all").

Alternatively, for maximum "defense in depth", applying policy at both (1) and (2) provides extra protection should the policy at (1) be disabled or bypassed by an attacker. Policy at (2) has the following limitations:

  • Domain-based policy is not supported at egress from egress gateways. It will either fail to match the expected traffic, or it will work intermittently if the egress gateway happens to be scheduled to the same node as its clients. This is because any DNS lookup happens at the client pod; by the time the traffic reaches (2), the DNS information is lost and only the IP addresses of the traffic are available.

  • The traffic source will appear to be the egress gateway pod; the original source information is lost in the address translation that occurs inside the egress gateway pod.

That means that policies at (2) will usually take the form of rules that match only on destination port and IP address, either directly in the rule (via a CIDR match) or via a (non-domain based) NetworkSet. Matching on source has little utility since the IP will always be the egress gateway and the port of translated traffic is not always preserved.
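
As an illustrative sketch of such a policy (the CIDR and port are examples, not values from this guide), a Calico Cloud NetworkPolicy in the default tier, applied to the "red" egress gateways, might allow only traffic to a known external service:

kubectl apply -f - <<EOF
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: default.egress-gateway-red-egress
  namespace: my-namespace
spec:
  tier: default
  selector: egress-code == 'red'
  types:
  - Egress
  egress:
  - action: Allow
    protocol: TCP
    destination:
      nets:
      - 203.0.113.0/24
      ports:
      - 443
EOF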

note

Since v3.15.0, Calico Cloud also sends health probes to the egress gateway pods from the nodes where their clients are located. In iptables mode, this traffic is auto-allowed at egress from the host and ingress to the egress gateway. In eBPF mode, the probe traffic can be blocked by policy, so you must ensure that this traffic is allowed.
