
Nomad on Azure (Part 1): A first attempt to provision servers


This is the first part in a series of blog posts where I will provision HashiCorp Nomad on Microsoft Azure.

The idea of this blog series is to begin from the start and work towards having a reliable Nomad cluster on Azure. How many parts this blog series will contain remains to be seen.

A fair warning: I am not an expert on HashiCorp Nomad. This blog series is developed along with me learning about Nomad.

The full source code discussed in this blog series is available on this GitHub repository:

mattias-fjellstrom/nomad-on-azure

Accompanying git repository for my blog series on “Nomad on Azure”


Specifically for this blog post, see the part01 directory of the repository.

What is HashiCorp Nomad?

You have most likely heard about Kubernetes. Nomad is similar to Kubernetes. There are things that Kubernetes can do that Nomad can’t, and vice versa. The general idea of the two platforms is the same: running and orchestrating application workloads.

Nomad comes as a single binary. You can run Nomad on your local machine for testing purposes. Nomad runs on Windows, Mac, and Linux.

In a production scenario or in any other shared environment you should run Nomad on dedicated machines. You could run Nomad on a Kubernetes cluster, but that does not make any sense. In this blog series we will run Nomad on Azure virtual machines.

Nomad runs either as a server or as a client. The same Nomad binary is used in both cases; the only difference is how you configure it through configuration files, which are written in the HashiCorp Configuration Language (HCL). Nomad servers are responsible for orchestrating and placing workloads on clients. Nomad clients are responsible for running the workloads.

A Nomad cluster should consist of at least three and at most seven servers. The cluster can include any number of clients.

A Kubernetes cluster can orchestrate containers based on container images (e.g. Docker). Nomad provides more options. You can run containers similar to Kubernetes, but you can also run scripts, Java JAR files, and more. You must install task drivers on your Nomad clients to support running different types of artifacts.
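
To give a flavor of what a Nomad workload looks like, here is a minimal job specification sketch that runs an nginx container using the Docker task driver. This job is not part of the example repository; the job, group, and task names and the image are placeholders:

job "hello" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    network {
      port "http" {
        to = 80 # forward the allocated host port to port 80 in the container
      }
    }

    task "nginx" {
      driver = "docker" # requires the Docker task driver on the Nomad client

      config {
        image = "nginx:alpine"
        ports = ["http"]
      }
    }
  }
}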

In this and the following blog posts in this series we will encounter many of these details. To get the full picture of what Nomad is, visit the documentation.

Configure all resources using Terraform

Terraform is the natural weapon of choice for defining the Azure infrastructure that we need to get a Nomad cluster up and running.

Begin by configuring the Azure provider, since this is the main provider we will use in this example:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.3"
    }
  }
}

provider "azurerm" {
  subscription_id = var.azure_subscription_id

  features {
    virtual_machine_scale_set {
      force_delete = true
    }
  }
}

Here we have set the Azure subscription ID using a variable. If you provision the example code for this blog post you will need to set the value of this variable (e.g. using a terraform.tfvars file).

In the features block of the provider configuration we say that virtual machine scale sets should be forcefully deleted (a hint that we will be using virtual machine scale sets; see how in the following sections).

This is most likely not strictly required, but I have always wanted to use the features block for something and this felt like a good time for that!
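
For completeness, the variable declarations and a terraform.tfvars file could look something like the following. The variable names match the ones used in this post; the values and the default region are placeholders and may differ from the repository:

variable "azure_subscription_id" {
  type        = string
  description = "The Azure subscription to deploy into"
}

variable "azure_location" {
  type        = string
  description = "The Azure region where all resources will be created"
  default     = "swedencentral" # assumption, pick any region you like
}

And in terraform.tfvars:

azure_subscription_id = "00000000-0000-0000-0000-000000000000"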

Create an Azure resource group

If you are familiar with Azure you know that all resources on Azure go into resource groups. This is one of the best features of Azure, if you ask me. Create a resource group where all Nomad resources will go:

resource "azurerm_resource_group" "default" {
  name     = "rg-nomad-on-azure"
  location = var.azure_location

  tags = {
    project = "nomad"
  }
}

Create an Azure virtual network and subnet
#

The next step on our journey is to provision a virtual network where we can run our Nomad cluster. In our first attempt we will keep things simple. We define a virtual network along with a single subnet:

resource "azurerm_virtual_network" "default" {
  name                = "vnet-nomad"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
  address_space = [
    "10.0.0.0/16",
  ]
}

resource "azurerm_subnet" "nomad" {
  name                 = "snet-nomad"
  resource_group_name  = azurerm_resource_group.default.name
  virtual_network_name = azurerm_virtual_network.default.name
  address_prefixes = [
    "10.0.10.0/24",
  ]
}

If you come from an AWS background you might be confused about why we use a single subnet. A subnet on Azure spans the whole region; it is not limited to a single availability zone like an AWS subnet is.

Create an Azure virtual machine scale-set for Nomad servers
#

We can run Nomad servers as individual virtual machines, but a better way is to use a virtual machine scale set (VMSS). With a VMSS we configure a template for what our virtual machine instances should look like, and we specify how many copies we need. A VMSS also comes with additional benefits, like auto-scaling.

We can start configuring our VMSS for Nomad:

resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
  name                = local.nomad.vmss_name
  resource_group_name = azurerm_resource_group.default.name
  location            = azurerm_resource_group.default.location

  identity { ... }

  platform_fault_domain_count = 1
  single_placement_group      = false
  zone_balance                = false
  zones                       = ["1", "2", "3"]

  instances = var.server_count

  sku_name = var.virtual_machine_sku_name

  user_data_base64 = "..."

  network_interface { ... }

  os_disk { ... }

  os_profile { ... }

  source_image_reference { ... }

  tags = {
    nomad = "server"
  }
}

Most of the details are left out of the example above; you can see everything in the GitHub repository. There are a few arguments we should take a closer look at:

  • In the identity block we will configure an Azure user-assigned managed identity. This is a managed identity that the virtual machine can use to perform actions on Azure. Each VM will need to read tags to use the cloud auto-join function. More on this soon.
  • The number of instances of this VMSS is configured using the server_count variable. This number must also be set in the Nomad configuration files on each server. Remember that this value should be at least 3 and at most 7. The default value here is 3. More on this soon.
  • The sku_name is set using the virtual_machine_sku_name variable. This Nomad cluster will not run heavy workloads, so a cheap SKU could be used here. Use the default value or configure it in your terraform.tfvars file. A sketch of both variable declarations follows this list.
  • The user_data_base64 argument is very important. Through this we will configure what happens on boot with this instance. This includes installing and configuring Nomad. More on this soon.
  • In the os_profile block we configure the admin username to be azureuser and we configure an SSH key that we can use to connect to the instance. No admin password will be set for the instances.
  • In the source_image_reference block we tell Azure what base image we want to use. In this example an Ubuntu 24.04 LTS image is used.
  • Finally, the VMSS tags will be propagated to all instances.
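
For reference, the two variables mentioned in the list could be declared as follows. The validation block and the default values are my own additions and might differ from what is in the repository:

variable "server_count" {
  type        = number
  description = "Number of Nomad servers in the cluster"
  default     = 3

  validation {
    condition     = var.server_count >= 3 && var.server_count <= 7
    error_message = "A Nomad cluster should have between three and seven servers."
  }
}

variable "virtual_machine_sku_name" {
  type        = string
  description = "VM size for the Nomad servers"
  default     = "Standard_B2s" # assumption, any small SKU will do for this cluster
}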

Create the identity

Create a user assigned identity and give the identity the Reader role on the resource group:

resource "azurerm_user_assigned_identity" "nomad" {
  name                = "nomad"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name
}

resource "azurerm_role_assignment" "nomad" {
  scope                = azurerm_resource_group.default.id
  principal_id         = azurerm_user_assigned_identity.nomad.principal_id
  role_definition_name = "Reader"
}

Add the identity to the VMSS by configuring the identity block:

resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
  # ... other arguments omitted
  
  identity {
    type         = "UserAssigned"
    identity_ids = [azurerm_user_assigned_identity.nomad.id]
  }
}

Create the admin SSH key
#

For this initial setup we allow SSH connections from anywhere to our Nomad servers. If we keep the ambition up we will eventually add HashiCorp Boundary and Vault to handle privileged access management with no permanent SSH keys. For now, we move on without this added complexity.

To create an SSH key for the admin user we will utilize the TLS provider and the local provider for Terraform. Add these to your configuration:

terraform {
  required_providers {
    # ... other providers omitted

    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }

    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.1"
    }
  }
}

Configure a new TLS private key resource, write the private key to a local file, and use the public key to create a corresponding SSH public key resource on Azure that we then configure the VMSS with:

resource "tls_private_key" "servers" {
  algorithm = "RSA"
  rsa_bits  = 4096
}

resource "local_file" "pricate_key" {
  content         = tls_private_key.servers.private_key_pem
  filename        = "${path.module}/ssh_keys/nomad-servers.pem"
  file_permission = "0400"
}

resource "azurerm_ssh_public_key" "servers" {
  name                = "nomad-servers"
  resource_group_name = azurerm_resource_group.default.name
  location            = azurerm_resource_group.default.location
  public_key          = tls_private_key.servers.public_key_openssh
}

Add the key to the VMSS in the linux_configuration block inside of the os_profile block:

resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
  # ... other arguments omitted
  
  os_profile {
    linux_configuration {
      # ...

      admin_ssh_key {
        username   = "azureuser"
        public_key = azurerm_ssh_public_key.servers.public_key
      }
    }
  }
}

The sample code opens port 22 on each VM to the world so that we can connect via SSH. Perhaps you do not want to do this; if so, edit the network security group (NSG) resource. This is one of the things we will address in a future blog post in this series.
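
As an example, a more restrictive NSG rule could look something like this (the resource name and source address are placeholders, and the NSG in the repository may be structured differently):

resource "azurerm_network_security_group" "nomad" {
  name                = "nsg-nomad-servers"
  location            = azurerm_resource_group.default.location
  resource_group_name = azurerm_resource_group.default.name

  # only allow SSH from a single known IP address instead of from anywhere
  security_rule {
    name                       = "allow-ssh"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "22"
    source_address_prefix      = "203.0.113.10/32" # replace with your own public IP
    destination_address_prefix = "*"
  }
}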

Install and configure Nomad

To install and configure Nomad on our virtual machines we will use cloud-init. There are at least two options for how you can configure cloud-init:

  1. Create a cloud-init configuration file directly in YAML.
  2. Use the cloud-init provider for Terraform.

To avoid having to learn the exact format of a cloud-init configuration file you can go for the second option and use Terraform.

Add the cloud-init provider to the Terraform configuration:

terraform {
  required_providers {
    # ... other providers

    cloudinit = {
      source = "hashicorp/cloudinit"
      version = "~> 2.3"
    }
  }
}

Installing and configuring Nomad servers require the following:

  • Downloading the Nomad binary. Instead of downloading the binary you could use a package manager, which would also configure some of the following things for you. However, installing the binary yourself gives you more flexibility in how you configure everything.
  • Creating a Linux user for Nomad.
  • Creating the Systemd service file.
  • Creating the Nomad configuration file and directories.

The full details of these steps can be seen in the GitHub repository. In the following discussion we concentrate on the Nomad configuration file.

You could create one or many configuration files, and they will be merged together to form a single configuration. The configuration we are creating is relatively small, so a single file is perfectly fine.

The configuration file will be placed at /etc/nomad.d/nomad.hcl and the content will be generated from the following:

data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

datacenter = "dc1"

tls {
    http = false
    rpc  = false
}

ports {
    http = 4646
    rpc  = 4647
    serf = 4648
}

server {
    enabled          = true
    bootstrap_expect = ${var.server_count}

    server_join {
      retry_join = [
        "provider=azure tag_name=nomad tag_value=server subscription_id=${data.azurerm_client_config.current.subscription_id}"
      ]
    }
}

The important parts are in the server stanza:

  • This file is for configuring servers, so we set enabled = true.
  • We tell Nomad that we expect a certain number of servers in bootstrap_expect = ${var.server_count}. This number is the same as the instance count we set for the VMSS.
  • In the server_join stanza we use the cloud auto-join feature. Here we configure what Nomad should look for when forming a cluster as a string consisting of a few pieces:
    • provider=azure is required because there are many supported providers and Nomad can’t figure out which one to use by itself.
    • tag_name and tag_value indicate which tag the resources Nomad looks for must carry. Remember how we tagged our instances with nomad=server?
    • The subscription_id is the Azure subscription ID fetched from the azurerm_client_config data source. This is so that the API calls in the background will go to the correct subscription.

The cloud auto-join feature will use the user-assigned identity we added to the VMSS. This is why we don’t need to provide any more details or any credentials.

All the cloud-init parts are configured in a cloudinit_config data source. Add the rendered cloud-init configuration to the VMSS:

resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
  # ... other arguments omitted
  
  user_data_base64 = data.cloudinit_config.nomad.rendered
}
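
For reference, a minimal sketch of what the cloudinit_config data source could look like is shown below. The template file name and the variables passed to it are assumptions; see the repository for the actual setup:

data "cloudinit_config" "nomad" {
  gzip          = false
  base64_encode = true # user_data_base64 expects base64-encoded data

  part {
    content_type = "text/cloud-config"
    content = templatefile("${path.module}/cloud-init/nomad-server.yaml.tftpl", {
      server_count    = var.server_count
      subscription_id = data.azurerm_client_config.current.subscription_id
    })
  }
}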

Provision the Nomad cluster on Azure

Run terraform init to download the provider binaries and initialize the state file, terraform plan to see what Terraform will do when this configuration is applied, and finally terraform apply to provision the resources:
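
$ terraform init
$ terraform plan
$ terraform apply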

Provisioning the cluster takes a couple of minutes.

When the provisioning is complete we can check our Azure environment to see that a lot of resources have been created:

Resources on Azure after provisioning with Terraform

Grab the public IP of one of the virtual machine instances and connect to it using the SSH key stored under the ssh_keys directory.

$ ssh -i ssh_keys/nomad-servers.pem azureuser@<public ip>

Once connected to the instance, run nomad server members to see if our cluster has been formed:

azureuser@nomadsrvQG6F2N:~$ nomad server members

Name                   Address    Port  Status  Leader  Raft Version  Build   Datacenter  Region
nomadsrvQG6F2N.global  10.0.10.5  4648  alive   false   3             1.10.2  dc1         global

Error determining leaders: 1 error occurred:
	* Region "global": Unexpected response code: 500 (No cluster leader)

We have not formed a Nomad cluster! An error message informs us that there is no cluster leader, and there is only a single server in the list.

It turns out that the tag that cloud auto-join is looking for must be set on the network interfaces (NICs) of the virtual machines, not on the virtual machines themselves.

This is unfortunate, because in the VMSS resource we can’t configure tags for the NICs. The tags from the VMSS itself are not inherited by its subresources.

We will address this in a future blog post in this series, but for now we create a simple script that uses the Azure CLI to add the required tags to the NICs:

#!/bin/bash

# tag all NICs in the resource group with nomad=server so that cloud auto-join can find them
resourceGroup="rg-nomad-on-azure"

az network nic list \
    --resource-group "$resourceGroup" \
    --query "[].{name:name}" -o tsv | \
while read -r nicName; do
    az network nic update \
        --name "$nicName" \
        --resource-group "$resourceGroup" \
        --set tags.nomad="server"
done

Run this script to apply the tags.

After applying the tags to the NICs we can see that we have successfully formed a cluster:

azureuser@nomadsrvQG6F2N:~$ nomad server members

Name                   Address    Port  Status  Leader  Raft Version  Build   Datacenter  Region
nomadsrv2ZI6C9.global  10.0.10.6  4648  alive   true    3             1.10.2  dc1         global
nomadsrvJLAL8D.global  10.0.10.4  4648  alive   false   3             1.10.2  dc1         global
nomadsrvQG6F2N.global  10.0.10.5  4648  alive   false   3             1.10.2  dc1         global

We have three servers. One of the servers is designated as the leader.

Summary of Part 1

In this blog post we learned the basics of what Nomad is and we went through how to provision an initial cluster version on Azure as a VMSS.

We discovered a problem with the cloud auto-join functionality on Azure. The tags that cloud auto-join is looking for must be on the NICs attached to the virtual machines. There is no easy way to set these tags on the NICs when working with a VMSS. As a workaround we used a script that runs the Azure CLI to set the required tags.

This cluster can be improved, and we will take the next step in improving our cluster setup in the next blog post in this series. Before we configure Nomad any further we will take a detour to configure Consul, and we will use HashiCorp Consul to help us bootstrap a Nomad cluster instead of relying on tags. Stay tuned!

Mattias Fjellström
Cloud architect · Author · HashiCorp Ambassador · Microsoft MVP