A few weeks ago I came across a post on LinkedIn where someone said that provisioning infrastructure on Amazon Web Services (AWS) was 10, 100, or even 1000 times faster compared to provisioning equivalent infrastructure on Microsoft Azure. That person had a list of multiple services for which this was true.
When I read this post I asked myself: Is this really true?
I have worked a lot with both AWS and Azure for many years, and I know there is some truth to this statement.
I remember provisioning an Azure API Management resource in the Developer tier a few years ago, and it could take somewhere around 40 minutes (I have no documented proof of this, so take that number with a grain of salt). Later Azure released the Consumption tier of API Management, which was much quicker, and since then I have not used anything other than the Consumption tier.
An API Management resource belongs to the category of resources with a long lifecycle. Perhaps it is OK that it takes a long time to provision it?
What about resources with a shorter lifecycle? Or the type of resource you would provision multiple instances of? In this category we find AWS Lambda functions, Azure Function Apps, AWS S3 buckets, Azure storage accounts, and managed Kubernetes clusters (EKS and AKS).
Some resources have negligible provisioning times, for instance an AWS S3 bucket. Comparing provisioning times of a few seconds is not of interest to me.
Comparing how long it takes to set up a basic (but working) Kubernetes cluster on each platform is interesting. This resource takes some time on both platforms, because it is a non-trivial resource type that requires other moving pieces (Kubernetes nodes, i.e. virtual machines).
In this blog post I go through how I configure a basic Kubernetes cluster on each platform, and compare how long it takes from `terraform apply` to a finished cluster. Does it take 10x, 100x, or 1000x longer on Azure compared to AWS? Or will the answer be something else?
## Ground rules
Let’s start with an obvious statement:
It is not possible to provision identical clusters on both AWS and Azure.
The overall goal is to provision a working cluster with three nodes. A working cluster means that I have a base cluster I could deploy applications on. I am not planning to actually deploy applications, since this would involve additional steps beyond what I want to include in my comparison.
The cluster should be provisioned to a custom virtual network, so I will include the configuration of the virtual networks in the comparison. Any other necessary resources to get a working cluster will also be included (e.g. IAM roles for AWS).
The provisioning will take place using Terraform.
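The configurations below reference `var.location` (Azure) and `var.aws_region` (AWS) without showing their declarations. For completeness, a minimal variable and provider setup could look like the following sketch (the exact shape of this boilerplate is an assumption on my part and does not affect the timing; whether the two clouds live in one root module or two does not matter either, as long as each `terraform apply` is measured separately):

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "azurerm" {
  # The features block is required by the azurerm provider,
  # even when it is empty.
  features {}
}

provider "aws" {
  region = var.aws_region
}

variable "location" {
  description = "Azure region for the AKS cluster"
  type        = string
}

variable "aws_region" {
  description = "AWS region for the EKS cluster"
  type        = string
}
```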
## Configuring resources
In the following two sections I will go through the configuration I used for Azure (AKS) and AWS (EKS), respectively.
### Azure Kubernetes Service (AKS)
AKS is the managed Kubernetes service on Azure. Azure manages the Kubernetes control plane for you, and you concentrate on deploying your workloads.
As with any other resource on Azure you must place your AKS cluster in a resource group. AKS will additionally create a separate managed resource group where it will place resources such as node-pools and load balancers.
We create the resource group for the AKS cluster first:
resource "azurerm_resource_group" "default" {
name = "rg-aks-${var.location}"
location = var.location
}
We can create a virtual network with a single subnet because subnets on Azure are not tied to a specific availability zone:
resource "azurerm_virtual_network" "default" {
name = "vnet-aks-${var.location}"
address_space = ["10.0.0.0/16"]
location = var.location
resource_group_name = azurerm_resource_group.default.name
}
resource "azurerm_subnet" "aks" {
name = "snet-aks"
resource_group_name = azurerm_resource_group.default.name
virtual_network_name = azurerm_virtual_network.default.name
address_prefixes = ["10.0.1.0/24"]
}
We use `10.0.0.0/16` as the virtual network CIDR block, and give the subnet `10.0.1.0/24`. These details are not important for the comparison, but we need to make sure this CIDR block does not overlap with the `service_cidr` we configure for the AKS cluster.
Next let’s add the Kubernetes cluster resource itself:
resource "azurerm_kubernetes_cluster" "default" {
name = "aks-cluster-${var.location}"
resource_group_name = azurerm_resource_group.default.name
location = var.location
dns_prefix = "aks"
private_cluster_enabled = false
default_node_pool {
name = "system"
node_count = 1
vm_size = "Standard_DS2_v2"
vnet_subnet_id = azurerm_subnet.aks.id
}
network_profile {
network_plugin = "azure"
service_cidr = "10.128.0.0/16"
dns_service_ip = "10.128.0.10"
}
identity {
type = "SystemAssigned"
}
}
Note that we use a `SystemAssigned` identity for the AKS cluster. This means we do not need to configure a separate identity resource.
The AKS cluster must have a default node pool. We want to create a specific node pool for our workloads, but we still need this default node pool. We set the node count to `1` for the default node pool to limit its impact on the test.
The final piece of the puzzle is the node pool for our workloads:
resource "azurerm_kubernetes_cluster_node_pool" "workload" {
name = "workload"
kubernetes_cluster_id = azurerm_kubernetes_cluster.default.id
vm_size = "Standard_DS2_v2"
node_count = 3
vnet_subnet_id = azurerm_subnet.aks.id
node_public_ip_enabled = true
}
I’ve made the AKS cluster and the nodes publicly available to avoid potential internet-reachability issues during the provisioning.
Both the default node pool and our workload node pool use the `Standard_DS2_v2` VM size. Technically it does not make much of a difference what size we use, except that we might hit a quota for how many VMs of a given size we can run. To make the comparison fair we will use a similarly sized VM type on AWS.
### AWS Elastic Kubernetes Service (EKS)
AWS EKS is the managed Kubernetes offering on AWS. AWS manages the control plane, and you focus on managing your workloads. Similar to AKS, there are features available that help you work with Kubernetes without having to care about cluster management. But to be able to compare with what we did on Azure we will not use these features here (e.g. EKS Auto Mode).
Subnets on AWS are zonal, meaning we need to create more than one subnet (remember that on Azure one subnet was sufficient). We will also need to configure some additional network details that we could skip on Azure because they were taken care of for us. These details will not really impact the provisioning time, but they will definitely influence the time it takes to write the configuration.
To start off we configure the virtual network, or Virtual Private Cloud (VPC):
resource "aws_vpc" "default" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "vpc-eks-${var.aws_region}"
}
}
We set the same CIDR range as we did on Azure (`10.0.0.0/16`). Again, the CIDR is not of importance for the comparison. It is important that we enable both DNS hostnames and DNS support for the VPC, otherwise the node group will not be able to join the cluster.
Next we add an internet gateway, and a route table with a route sending `0.0.0.0/0` to the internet gateway:
resource "aws_internet_gateway" "default" {
vpc_id = aws_vpc.default.id
tags = {
Name = "igw-eks-${var.aws_region}"
}
}
resource "aws_route_table" "default" {
vpc_id = aws_vpc.default.id
tags = {
Name = "rt-eks-${var.aws_region}"
}
}
resource "aws_route" "igw" {
route_table_id = aws_route_table.default.id
destination_cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.default.id
}
The internet gateway and the route to it are necessary for the node group to start up and join the cluster; the nodes need outbound internet connectivity for this.
EKS requires at least two subnets, but we will create three subnets (again, this will not really impact the provisioning time because these resources are created very fast). One of the subnets is configured like this (the other two are configured similarly but with different CIDR blocks):
data "aws_availability_zones" "available" {}
resource "aws_subnet" "eks01" {
vpc_id = aws_vpc.default.id
cidr_block = "10.0.1.0/24"
map_public_ip_on_launch = true
availability_zone = data.aws_availability_zones.available.names[0]
tags = {
Name = "subnet-eks-${data.aws_availability_zones.available.names[0]}"
}
}
resource "aws_route_table_association" "eks01" {
subnet_id = aws_subnet.eks01.id
route_table_id = aws_route_table.default.id
}
Note that I set `map_public_ip_on_launch` to `true`. This is because I allow the nodes to be publicly accessible, similar to what we did on Azure. In a production scenario you would put the nodes in private subnets, but then you need extra work to reach applications on the cluster, which would skew the provisioning-speed comparison. If I were to place the nodes in private subnets I would also need to provision one or more NAT gateways in public subnets (and provisioning NAT gateways would definitely influence the result).
The EKS cluster requires an IAM role, and the node group requires another IAM role. These roles are given the bare minimum permissions required to work.
The IAM role for the cluster is configured like this:
resource "aws_iam_role" "cluster" {
name = "eks-cluster"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"sts:AssumeRole",
"sts:TagSession"
]
Effect = "Allow"
Principal = {
Service = "eks.amazonaws.com"
}
},
]
})
}
resource "aws_iam_role_policy_attachment" "amazon_eks_cluster_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.cluster.name
}
And the IAM role for the node group is configured like this:
resource "aws_iam_role" "nodes" {
name = "eks-nodes"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
},
]
})
}
resource "aws_iam_role_policy_attachment" "amazon_eks_worker_node_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.example.name
}
resource "aws_iam_role_policy_attachment" "amazon_eks_cni_policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.example.name
}
resource "aws_iam_role_policy_attachment" "amazon_ec2_container_registry_read_only" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.example.name
}
Next we configure the EKS cluster and node group (the EKS equivalent of an AKS node pool):
resource "aws_eks_cluster" "default" {
name = "eks-cluster-${var.aws_region}"
access_config {
authentication_mode = "API"
}
role_arn = aws_iam_role.cluster.arn
vpc_config {
subnet_ids = [
aws_subnet.eks01.id,
aws_subnet.eks02.id,
aws_subnet.eks03.id,
]
}
}
Finally, we configure the node group for our workloads:
resource "aws_eks_node_group" "workload" {
cluster_name = aws_eks_cluster.default.name
node_group_name = "user"
node_role_arn = aws_iam_role.nodes.arn
subnet_ids = [
aws_subnet.eks01.id,
aws_subnet.eks02.id,
aws_subnet.eks03.id,
]
instance_types = [
"m5.large"
]
scaling_config {
desired_size = 3
max_size = 3
min_size = 3
}
update_config {
max_unavailable = 2
}
}
The node group will use the `m5.large` instance type, which is similar to the `Standard_DS2_v2` size we used on Azure. The desired size of the node group is set to three, also matching what we did on Azure.
## Results
I provisioned each cluster to three different regions in the world. These regions were selected so that the corresponding Azure and AWS regions were fairly close to each other.
I selected the following regions:
- US (Virginia)
  - Azure: East US
  - AWS: us-east-1
- Europe (Sweden)
  - Azure: Sweden Central
  - AWS: eu-north-1
- Asia (Singapore)
  - Azure: Southeast Asia
  - AWS: ap-southeast-1
I used the `time` command in zsh to measure how long `terraform apply` takes, e.g.:
```shell
$ time terraform apply -auto-approve
```
The result from this command was reported back in this format:
```shell
$ time terraform apply -auto-approve
...
terraform apply -auto-approve  3.22s user 1.24s system 1% cpu 6:05.50 total
```
From this output I read the total time reported at the end of the line (i.e. `6:05` or `6m05s` in the example above). I did not include milliseconds in the reported results below.
The results I obtained are shown in the table below. The fastest time is marked with a ⭐️, and the slowest time is marked with a 🐢.
|       | US (Virginia) | Europe (Sweden) | Asia (Singapore) |
|-------|---------------|-----------------|------------------|
| Azure | 08m10s        | 06m18s ⭐️       | 08m29s           |
| AWS   | 11m36s 🐢     | 11m26s          | 10m23s           |
## Discussion and conclusions
The results show that, in every region I tested, it was faster to provision an AKS cluster than an EKS cluster.
Is this a fair comparison?
As I mentioned in the beginning, it is not possible to configure identical infrastructure on both Azure and AWS. There will be slight differences. In both cases we ended up with a base cluster that we can deploy stuff on, without adding any bells and whistles.
There are a few sources of error we could keep in mind:
- Provisioning speed will vary throughout the day, and it will also differ between regions. My tests were run during one and the same evening (from Sweden). The AWS us-east-1 region is arguably one of the busier regions, which might explain why it was the slowest region to provision to.
- Would it be faster to use a different tool than Terraform? Possibly, but most likely not by a lot. Terraform issues the required API calls to achieve the goal. We can only assume that the HashiCorp teams together with the AWS and Azure teams have done a good job on these providers.
- We had to configure more resources on AWS compared to Azure. These included additional networking components (route tables, routes, internet gateway, more than one subnet) and IAM resources (roles and attaching policies to these roles). However, given that these resources are required for a working cluster we need to include them.
- Could we have improved the performance using `terraform apply -parallelism=n` with an `n` greater than `10` (the default value)? No: these configurations do not include enough independent resources for parallelism to start to matter.
A notable result is the Swedish Azure region. I am based in Sweden, so that might be part of the reason why it was much faster than the other Azure regions. It is also possible that the Swedish region is less busy than the other regions I included. I actually ran the test in the Swedish region twice and reported the worse result in the table above; the other run was around 15 seconds faster.
How important is provisioning speed?
I don’t think a difference of a few minutes (or seconds) is of any importance at all!
However, if the difference can be measured in 10 minutes, 20 minutes, or even longer, then it is important. Maybe it is even important enough to pick one cloud provider over the other.
Provisioning speed might not be important from an end-result perspective, but you will definitely start to see annoyed developers when they have to wait for long times for no apparent reason.
If you have automated the provisioning of development environments at the start of the day, and destroy them when the day is over, then provisioning speed is something you need to take into account. You don’t want to trigger the automation at 08:00 when the day starts and have the environment ready closer to 09:00.
All in all, the difference in the time it takes to provision a managed Kubernetes cluster on Azure versus AWS is not that big (2-5 minutes depending on the region). I don’t remember who wrote the LinkedIn post, but if I see it again I will ask for specifics.
This was a comparison for managed Kubernetes services. I might do more of these for other types of services in the future.