Calico Cloud documentation

Policy best practices

Big picture

Policy best practices for run-time security start with Calico Cloud’s robust network security policy, but other Calico Cloud resources play equally important roles in security, scalability, and performance.

Learn Calico Cloud policy best practices and resources that support a zero trust network model:

Prepare for policy authoring
Policy best practices for day-one zero trust
Policy design for efficiency and performance
Policy life cycle tools

Prepare for policy authoring

Determine who can write policy

Any team familiar with deploying microservices in Kubernetes can easily master writing network policies. The challenge in many organizations is deciding who will be given permission to write policy across teams. Organizations take different approaches, and Calico Cloud policy tools have the flexibility and guardrails to accommodate them.

Let’s review two common approaches.

Microservices teams write policy

In this model, network policy is treated as code, built into and tested during the development process, just like any other critical part of a microservice’s code. The team responsible for developing a microservice has a good understanding of other microservices they consume and depend on, and which microservices consume their microservice. With a defined, standardized approach to policy and label schemas, there is no reason that the teams cannot implement network policies for their microservice as part of the development of the microservice. With visibility in Service Graph, teams can even do basic troubleshooting.

Dev/Ops writes policy, microservice team focuses on internals

An equally valid approach is to have development teams focus purely on the internals of the microservices they are responsible for, and leave responsibility for operating the microservices with Dev/Ops teams. A Dev/Ops team needs the same understanding as the microservices team above. However, network security may come much later in the organization’s processes, or even as an afterthought on a system already in production. This can be more challenging because getting network policies wrong can have significant production impacts. But using Calico Cloud tools, this approach is still achievable.

When you get clarity on who can write policies, you can move to creating tiers. Calico Cloud tiers, along with standard Kubernetes RBAC, provide the infrastructure to meet security concerns across teams.

Understand the depth of Calico Cloud network policy

Because Calico Cloud policy goes well beyond the features in Kubernetes policy, we recommend that you have a basic understanding of network policy and global network policy and how they provide workload access controls. And even though you may not implement the following policies, it is helpful to know the depth of defense that is available in Calico Cloud.

Create policy tiers

Tiers are a hierarchical construct used to group policies and enforce higher precedence policies that cannot be circumvented by other teams. As part of your microsegmentation strategy, tiers let you apply identity-based protection to workloads and hosts.

Before creating policies, we recommend that you create your tier structure. This often requires internal debates and discussions. As noted previously, Calico Cloud policy workflow has the guardrails you need to allow diverse teams to participate in policy writing.

To understand how tiered policy works and best practices, see Get started with tiered policies.
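
As a minimal sketch of a tier structure, the manifests below define two tiers. The tier names (security, application) match the tiers used by the sample policies later on this page; the order values are assumptions chosen only for illustration. Tiers with a lower order are evaluated first.

```yaml
# Hypothetical tier layout; the order values are illustrative.
apiVersion: projectcalico.org/v3
kind: Tier
metadata:
  name: security
spec:
  order: 100
---
apiVersion: projectcalico.org/v3
kind: Tier
metadata:
  name: application
spec:
  order: 200
```

Pairing each tier with Kubernetes RBAC lets one team own the higher-precedence policies while other teams write policy in lower tiers.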

Create label standards

Creating a label standard is an often-overlooked step. But if you skip it, it will cost you in troubleshooting down the road, especially because visibility and troubleshooting are already a challenge in a Kubernetes deployment.

Why are label standards important?

Network policies in Kubernetes depend on labels and selectors (not IP addresses and IP ranges) to determine which workloads can talk to each other. As pods dynamically scale up and down, network policy is enforced based on the labels and selectors that you define. So workloads and host endpoints need unique, identifiable labels. If you create duplicate label names, or labels are not intuitive, troubleshooting network policy issues and authoring network policies becomes more difficult.

Recommendations:

Follow the Kubernetes guidelines for labels. If the Kubernetes guidelines do not cover your use cases, we recommend this blog from Tigera Support: Label standard and best practices for Kubernetes security.
Develop a comprehensive set of labels that meets the deployment, reporting, and security requirements of different stakeholders in your organization.
Standardize the way you label your pods and write your network policies using a consistent schema or design pattern.
Labels should be defined to achieve a specific and explicit purpose.
Use intuitive language in your label definitions that enables quick and simple identification of labeled Kubernetes objects.
Use label key prefixes and suffixes to identify attributes required for asset classification.
Ensure the right labels are applied to Kubernetes objects by implementing label governance checks in your CI/CD pipeline or at runtime.
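
The recommendations above can be sketched as a label schema on a pod template. The app.kubernetes.io/* keys are the Kubernetes recommended labels; the sec.example.com/ prefix, the label values, and the workload, namespace, and image names are hypothetical and stand in for whatever classification your organization defines.

```yaml
# Illustrative label schema; names, values, and the sec.example.com/ prefix
# are assumptions, not a Calico Cloud requirement.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: storefront
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout
  template:
    metadata:
      labels:
        # What the workload is and which application it belongs to
        app.kubernetes.io/name: checkout
        app.kubernetes.io/part-of: storefront
        # Prefixed keys that classify the asset for security policy
        sec.example.com/zone: restricted
        sec.example.com/data-class: pci
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0.0
```

Because each key has one explicit purpose, a policy selector such as sec.example.com/zone == "restricted" reads the same way to every policy author.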

Create network sets

Network sets and global network sets are grouping mechanisms for arbitrary sets of IPs/subnets/CIDRs or domains. They are key resources for efficient policy design. The key use cases for network sets are:

Use/reuse in policy to support scaling

You reference network sets in policies using selectors (rather than updating individual policies with CIDRs or domains).

Visibility to traffic to/from a cluster

For apps that integrate with third-party APIs and SaaS services, you get enhanced visibility to this traffic in Service Graph.

Global deny lists

Create a “deny-list” of CIDRs for bad actors or embargoed countries in policy.

Recommendation: Create network sets and labels before writing policy.

For a network set tutorial and best practices, see Get started with network sets.
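
To illustrate the global deny-list use case, here is a minimal sketch of a global network set and a global network policy that references it by label. The resource names, the deny-list label, the order value, and the CIDRs (documentation ranges) are all assumptions.

```yaml
# Hypothetical deny-list; the CIDRs are reserved documentation ranges.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: embargoed-ranges
  labels:
    deny-list: embargo
spec:
  nets:
    - 192.0.2.0/24
    - 198.51.100.0/24
---
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: security.deny-embargoed
spec:
  tier: security
  order: 10
  selector: all()
  types:
    - Egress
  egress:
    # The rule matches the network set by label, not by CIDR.
    - action: Deny
      destination:
        selector: deny-list == "embargo"
    # Hand all other traffic to the next tier instead of dropping it
    # at the end of this tier.
    - action: Pass
```

Adding or removing a CIDR then means editing only the network set; the policy itself does not change.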

Policy best practices for day-one zero trust

Create a global default deny policy

A global default deny network policy provides an enhanced security posture: pods without policy (or with incorrect policy) are not allowed traffic until appropriate network policy is defined. We recommend creating a global default deny, regardless of whether you use Calico Cloud or Kubernetes network policy.

But be sure to understand the best practices for creating a default deny policy to avoid breaking your cluster.

Here are sample default deny policies.
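
As a hedged illustration of the shape such a policy can take, the sketch below denies all traffic for pods in application namespaces while still allowing DNS lookups. The policy name, the list of excluded namespaces, and the kube-dns label are assumptions; verify them against your cluster before applying anything like this.

```yaml
# A sketch only: the excluded namespaces and the DNS label are assumptions.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default.default-deny
spec:
  tier: default
  # Select pods in every namespace except the listed system namespaces.
  # Host endpoints have no namespace, so they are not selected.
  namespaceSelector: has(projectcalico.org/name) && projectcalico.org/name not in {"kube-system", "calico-system", "tigera-system"}
  types:
    - Ingress
    - Egress
  egress:
    # Keep DNS working; everything else is denied because no rule allows it.
    - action: Allow
      protocol: UDP
      destination:
        selector: k8s-app == "kube-dns"
        ports:
          - 53
```

Excluding system namespaces is one common way to avoid cutting off control plane components when the policy is first applied.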

Define both ingress and egress network policy rules for every pod in the cluster

Although defining network policy for traffic external to clusters (north-south) is certainly important, it is equally important to defend against attacks on east-west traffic. Simply put, every connection from/to every pod in every cluster should be protected. Having both does not guarantee protection against other attacks and vulnerabilities, but without both, one innocuous workload can lead to exposure of your most critical workloads.

For examples, see basic ingress and egress policies.
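
For orientation, a minimal sketch of a policy that defines both directions for one workload might look like the following. The namespace, labels, ports, and order value are hypothetical.

```yaml
# Hypothetical workload: a backend that accepts traffic from a frontend
# and talks to a database.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.backend
  namespace: shop
spec:
  tier: application
  order: 20
  selector: app == "backend"
  types:
    - Ingress
    - Egress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: app == "frontend"
      destination:
        ports:
          - 8080
  egress:
    - action: Allow
      protocol: TCP
      destination:
        selector: app == "database"
        ports:
          - 5432
```

This sketch assumes that DNS egress is allowed elsewhere, for example in a global policy; otherwise the backend would also need a rule for it.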

Policy design for efficiency and performance

Teams can write policies that work, but ultimately you want policies that also scale, and do not negatively impact performance.

If you follow a few simple guidelines, you’ll be well on your way to writing efficient policy.

Use global network policy only when all rules apply globally

Do

Use global network policy for cluster-wide scope when all rules apply to multiple namespaces or host endpoints. For example, use a global network policy to create a deny-list of CIDRs for embargoed countries, or for global default deny everywhere, even for new namespaces.

Why? Although at the level of packet processing there is no difference between network policy and global network policy, for CPU usage, one global network policy is faster than a large number of network policies.

Avoid

Using a global network policy as a way to combine diverse, namespaced endpoints with different connectivity requirements. Although creating such a policy can work, appears efficient and is easier to view than several separate network policies, it is inefficient and should be avoided.

Why? Putting a lot of anything in policy (rules, CIDRs, ports) that is manipulated by selectors is inefficient. iptables/eBPF performance depends on minimizing rule executions and updates. When a selector is encountered in a policy rule, it is converted into one iptables rule that matches on an IP set. Then, different code keeps the IP sets up to date; this is more efficient than updating iptables rules. Also, because iptables rules execute sequentially in order, having many rules results in longer network latencies for the first packet in a flow (approximately 0.25-0.5 µs per rule). Finally, having more rules slows down programming of the data plane, making policy updates take longer.

Example: Inefficient global network policy

The following policy is a global network policy for a microservice that limits all egress communication external to the cluster in the security tier. Does this policy work? Yes. And logically, it seems to cleanly implement application controls.

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: security.allow-egress-from-pods
spec:
  tier: security
  order: 1
  selector: all()
  egress:
   - action: Deny
     source:
       namespaceSelector: projectcalico.org/namespace starts with "tigera"
     destination:
       selector: threatfeed == "feodo"
   - action: Allow
     protocol: TCP
     source:
       namespaceSelector: projectcalico.org/name == "sso"
     destination:
       ports:
         - '443'
         - '80'
       domains:
         - '*.googleapis.com'
   - action: Allow
     protocol: TCP
     source:
       selector: psql == "external"
     destination:
       ports:
         - '5432'
       domains:
         - '*.postgres.database.azure.com'
   - action: Allow
     protocol: TCP
     source: {}
     destination:
       ports:
         - '443'
         - '80'
       domains:
         - '*.logic.azure.com'
   - action: Allow
     protocol: TCP
     source: {}
     destination:
       ports:
         - '443'
         - '80'
       domains:
         - '*.azurewebsites.windows.net'
   - action: Allow
     protocol: TCP
     source:
       selector: 'app in { "call-archives-api" }||app in { "finwise" }'
     destination:
       domains:
         - '*.documents.azure.com'
   - action: Allow
     protocol: TCP
     source:
       namespaceSelector: projectcalico.org/name == "warehouse"
     destination:
       ports:
         - '1433'
       domains:
         - '*.database.windows.net'
   - action: Allow
     protocol: TCP
     source: {}
     destination:
       nets:
         - 65.132.216.26/32
         - 10.10.10.1/32
       ports:
         - '80'
         - '443'
   - action: Allow
     protocol: TCP
     source:
       selector: app == "api-caller"
     destination:
       ports:
         - '80'
         - '443'
       domains:
         - api.example.com
   - action: Allow
     source:
       selector: component == "tunnel"
   - action: Allow
     destination:
       selector: all()
       namespaceSelector: all()
   - action: Deny
  types:
   - Egress

Why this policy is inefficient

First, the policy does not follow the guidance for global network policy: that all rules apply to all of the selected endpoints. The policy works, so the main issue is inefficiency.

The main selector all() (line 8) means the policy will be rendered on every endpoint (workload and host endpoints). The selectors in each rule (for example, lines 12 and 14) control which traffic is matched by that rule. So, even if a host doesn’t have any workloads that match selector: app == "api-caller", the iptables/eBPF rule that implements it is still rendered on every host. If this sample policy applied to 100 pods, that’s a 10-100x increase in the number of rules (depending on how many local endpoints match each rule). In short, it adds:

Memory and CPU to keep track of all the extra rules
Complexity to handle changes to endpoint labels, and to re-render all the policies too.
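
One way to avoid this, sketched under assumptions, is to move a workload-specific rule into its own policy whose main selector matches only that workload. The example below does this for the app == "api-caller" rule; the policy name and the namespace are hypothetical, and the remaining rules for this workload (for example, in-cluster traffic) are omitted.

```yaml
# Hypothetical split-out of one rule; rendered only where
# app == "api-caller" workloads run.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: security.allow-egress-api-caller
  namespace: api-caller-ns
spec:
  tier: security
  order: 1
  selector: app == "api-caller"
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        ports:
          - '80'
          - '443'
        domains:
          - api.example.com
```

Hosts without a matching workload no longer carry this rule, and a label change on an unrelated pod no longer forces it to be re-rendered.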

Avoid policies that may select unwanted endpoints

The following policy is for an application in a single namespace, app1-ns. There are two microservices, both labeled appropriately:

microservice 1 has app: app1, svc: svc1
microservice 2 has app: app1, svc: svc2

The following policy works correctly and does not incur a huge performance hit. But it could select additional endpoints that were not intended.

apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.app1
  namespace: app1-ns
spec:
  tier: application
  order: 10
  selector: app == "app1"
  types:
    - Ingress
  ingress:
    - action: Allow
      source:
        selector: trusted-ip == "load-balancer"
      destination:
        selector: svc == "svc1"
        ports:
          - 10001
      protocol: TCP
    - action: Allow
      source:
        selector: svc == "svc1"
      destination:
        selector: svc == "svc2"
        ports:
          - 10002
      protocol: TCP

The policy assumes that the main policy selector (app == "app1") is combined (ANDed) with every endpoint selector in its rules. That is true only for one side of a rule, depending on the policy type:

Ingress - combines the policy selector and the destination endpoint selector
Egress - combines the policy selector and the source endpoint selector

The selector on the other side of the rule (for example, the source selector svc == "svc1" in the second ingress rule) is not combined with the policy selector. So if the assumptions behind the labels are not understood by other policy authors, or the labels are not correctly assigned, that endpoint selector may select additional endpoints that were not intended. For ingress policy, this can open up the endpoint to more IP addresses than necessary. This unintended consequence would be exacerbated if the author used a global network policy.
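
A hedged sketch of one way to tighten the policy is to repeat the application label in the rule's source selector, so the rule no longer relies on how other teams use the svc label. Everything except the second rule's source selector is copied from the policy above.

```yaml
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: application.app1
  namespace: app1-ns
spec:
  tier: application
  order: 10
  selector: app == "app1"
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: trusted-ip == "load-balancer"
      destination:
        selector: svc == "svc1"
        ports:
          - 10001
    - action: Allow
      protocol: TCP
      source:
        # Both labels are required, so an endpoint from another application
        # that happens to carry svc == "svc1" is not matched.
        selector: app == "app1" && svc == "svc1"
      destination:
        selector: svc == "svc2"
        ports:
          - 10002
```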

Put multiple relevant policy rules together in the same policy

As discussed previously, it is better to create separate policies for different endpoint connectivity rules than a single global network policy. However, you may interpret this to mean that the best practice is to make unique policies that do not aggregate any rules. That is not the case. Why? When Calico Cloud calculates and enforces policy, it updates iptables/eBPF and reads policy changes and pod/workload endpoints from the datastore. The more policies in memory, the more work it takes to determine which policies match a particular endpoint. If you group more rules into one policy, there are fewer policies to match against.

Understand effective use of label selectors

Label selectors abstract network policy from the network. Misuse of selectors can slow things down. As discussed previously, the more selectors you create, the harder Calico Cloud works to find matches.

The following policy shows an inefficient use of selectors. Using selector: all() renders the policy on all nodes for all workloads. If there are 10,000 workloads, but only 10 match label==foo, that is very inefficient at the data plane level.

selector: all()
ingress:
  - source:
      selector: label == 'bar'
    destination:
      selector: label == 'foo'

The best practice policy below allows the same traffic, but is more efficient and scalable. Why? Because the policy will be rendered only on nodes with workloads that match the selector label==foo.

selector: label == 'foo'
ingress:
  - source:
      selector: label == 'bar'

Another common mistake is using selector: all() when you don’t need to. Whenever there's a source/destination selector in a rule, it is rendered as an IP set in the data plane, and because all() means all workloads, that will be a large IP set.

source:
  selector: all()

Put domains and CIDRs in network sets rather than policy

Network sets allow you to specify CIDRs and/or domains. As noted in Network set best practices, we do not recommend putting large CIDRs or domains directly in policy. Although nothing stops you from doing this in policy, using network sets is more efficient and supports scaling.
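
As a minimal sketch, the PostgreSQL rule from the inefficient example earlier on this page could move its domain into a global network set and match it by label. The resource names, the external-service label, and the order value are assumptions; the sketch also assumes the allowedEgressDomains field of the network set resource, and it omits any other egress rules these workloads need.

```yaml
# Hypothetical network set holding the domain that was inline in policy.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkSet
metadata:
  name: azure-postgres
  labels:
    external-service: azure-postgres
spec:
  allowedEgressDomains:
    - '*.postgres.database.azure.com'
---
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: security.allow-egress-psql
spec:
  tier: security
  order: 20
  selector: psql == "external"
  types:
    - Egress
  egress:
    - action: Allow
      protocol: TCP
      destination:
        # Matches the network set by label instead of listing domains here.
        selector: external-service == "azure-postgres"
        ports:
          - 5432
```

Other policies that need the same destination can reuse the selector, and the domain list is maintained in one place.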

Policy life cycle tools

Preview, stage, deploy

A big obstacle to adopting Kubernetes is not having confidence that you can effectively prevent, detect, and mitigate across diverse teams. The following policy life cycle tools in the web console (Policies tab) can help.

Policy recommendations

Get a policy recommendation for unprotected workloads. Speeds up learning, while supporting zero trust.

Policy impact preview

Preview the impacts of policy changes before you apply them to avoid unintentionally exposing or blocking other network traffic.

Policy staging and audit modes

Stage network policy so you can monitor traffic impact of both Kubernetes and Calico Cloud policy as if it were actually enforced, but without changing traffic flow. This minimizes misconfiguration and potential network disruption.

For details, see Policy life cycle tools.
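
Staging can also be expressed as a manifest. The sketch below assumes the StagedNetworkPolicy resource kind is available in your cluster and reuses the app1 labels from the earlier example; the policy name is hypothetical.

```yaml
# A staged policy is evaluated and reported on, but not enforced.
apiVersion: projectcalico.org/v3
kind: StagedNetworkPolicy
metadata:
  name: application.app1-staged
  namespace: app1-ns
spec:
  tier: application
  order: 10
  selector: app == "app1"
  types:
    - Ingress
  ingress:
    - action: Allow
      protocol: TCP
      source:
        selector: svc == "svc1"
      destination:
        selector: svc == "svc2"
        ports:
          - 10002
```

After the observed traffic impact matches expectations, the same spec can be applied as an enforced policy.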

Do not trust anything

Zero trust means that you do not trust anyone or anything. Calico Cloud handles authentication on a per-request basis. Every action is either authorized or restricted, and by default everything is restricted. To apply zero trust to policy and reduce your attack surface and risk, we recommend the following:

Ensure that all expected and allowed network flows are explicitly allowed; any connection not explicitly allowed is denied
Create a quarantine policy that denies all traffic that you can quickly apply to workloads when you detect suspicious activity or threats
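
A quarantine policy along these lines could be sketched as follows. The policy name, the quarantine label, and the order value are assumptions; the idea is that the policy exists ahead of time and takes effect as soon as a workload is given the label.

```yaml
# Hypothetical quarantine policy: denies all traffic to and from any
# workload labeled quarantine=true.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: security.quarantine
spec:
  tier: security
  order: 1
  selector: quarantine == "true"
  types:
    - Ingress
    - Egress
  ingress:
    - action: Deny
  egress:
    - action: Deny
```

With this in place, labeling a suspicious pod (for example, with kubectl label pod followed by the pod name and quarantine=true) isolates it without editing any policy.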
