This is the first part in a series of blog posts where I will provision HashiCorp Nomad on Microsoft Azure.
The idea of this blog series is to begin from the start and work towards having a reliable Nomad cluster on Azure. How many parts this blog series will contain remains to be seen.
The full source code discussed in this blog series is available on this GitHub repository:
Accompanying git repository for my blog series on “Nomad on Azure”
Specifically for this blog post, see the part01 directory of the repository.
What is HashiCorp Nomad?#
You have most likely heard about Kubernetes. Nomad is similar to Kubernetes. There are things that Kubernetes can do that Nomad can’t, and vice versa. The general idea of the two platforms is the same: running and orchestrating application workloads.
Nomad comes as a single binary. You can run Nomad on your local machine for testing purposes. Nomad runs on Windows, Mac, and Linux.
In a production scenario or in any other shared environment you should run Nomad on dedicated machines. You could run Nomad on a Kubernetes cluster, but that does not make any sense. In this blog series we will run Nomad on Azure virtual machines.
Nomad can run either as servers or clients. The same Nomad binary is used in both cases, the only difference is how you configure Nomad through configuration files. The configuration files are written in the HashiCorp Configuration Language (HCL). Nomad servers are responsible for orchestrating and placing workloads on clients. Nomad clients are responsible for running the workloads.
A Nomad cluster should consist of at least three servers, and at most seven servers. The cluster can include any number of clients.
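To make the server/client distinction concrete, here is a minimal illustrative sketch (not the exact configuration we will build later in this post) of how the same binary is told to act as a server or as a client:

# server.hcl (illustrative)
server {
  enabled          = true
  bootstrap_expect = 3
}

# client.hcl (illustrative)
client {
  enabled = true
}

Both are started with the same nomad agent command; only the configuration files differ.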
A Kubernetes cluster can orchestrate containers based on container images (e.g. Docker). Nomad provides more options. You can run containers similar to Kubernetes, but you can also run scripts, Java JAR files, and more. You must install task drivers on your Nomad clients to support running different types of artifacts.
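To give a feel for what a workload definition looks like, here is a minimal illustrative Nomad job specification using the Docker task driver (the job name and image are made up for the example):

job "hello-world" {
  datacenters = ["dc1"]

  group "web" {
    count = 1

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.27"
      }
    }
  }
}

Swapping the driver (and its config block) to, for instance, exec or java is how you would run a script or a JAR file instead.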
In this and the following blog posts in this series we will encounter many of these details. To get the full picture of what Nomad is, visit the documentation.
Configure all resources using Terraform#
Terraform is the natural weapon of choice for defining the Azure infrastructure that we need to get a Nomad cluster up and running.
Begin by configuring the Azure provider, since this is the provider we will primarily use in this example:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 4.3"
    }
  }
}

provider "azurerm" {
  subscription_id = var.azure_subscription_id

  features {
    virtual_machine_scale_set {
      force_delete = true
    }
  }
}
Here we have set the Azure subscription ID using a variable. If you provision the example code for this blog post you will need to set the value of this variable (e.g. using a `terraform.tfvars` file).
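As a sketch of what that could look like (the variable name matches the provider configuration above, the value is a placeholder):

# variables.tf
variable "azure_subscription_id" {
  type        = string
  description = "The Azure subscription to deploy into."
}

# terraform.tfvars
azure_subscription_id = "00000000-0000-0000-0000-000000000000"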
In the `features` block of the provider configuration we say that virtual machine scale sets should be forcefully deleted (this is a hint that we will be using virtual machine scale sets, see how in the following sections).
This is most likely not strictly required, but I have always wanted to use the `features` block for something and this felt like a good time for that!
Create an Azure resource group#
If you are familiar with Azure you know that all resources on Azure go into resource groups. This is one of the best features of Azure, if you ask me. Create a resource group where all Nomad resources will go:
resource "azurerm_resource_group" "default" {
name = "rg-nomad-on-azure"
location = var.azure_location
tags = {
projet = "nomad"
}
}
Create an Azure virtual network and subnet#
The next step on our journey is to provision a virtual network where we can run our Nomad cluster. In our first attempt we will keep things simple. We define a virtual network along with a single subnet:
resource "azurerm_virtual_network" "default" {
name = "vnet-nomad"
location = azurerm_resource_group.default.location
resource_group_name = azurerm_resource_group.default.name
address_space = [
"10.0.0.0/16",
]
}
resource "azurerm_subnet" "nomad" {
name = "snet-nomad"
resource_group_name = azurerm_resource_group.default.name
virtual_network_name = azurerm_virtual_network.default.name
address_prefixes = [
"10.0.10.0/24",
]
}
If you come from an AWS background you might be confused about why we use a single subnet. A subnet on Azure spans the whole region; it is not limited to a single availability zone like an AWS subnet is.
Create an Azure virtual machine scale-set for Nomad servers#
We can run Nomad servers as individual virtual machines, but a better way is to use a virtual machine scale-set (VMSS). With a VMSS we configure a template for what our virtual machine instances should look like, and we specify how many copies we need. A VMSS also comes with additional benefits, like auto-scaling.
We can start configuring our VMSS for Nomad:
resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
name = local.nomad.vmss_name
resource_group_name = azurerm_resource_group.default.name
location = azurerm_resource_group.default.location
identity { ... }
platform_fault_domain_count = 1
single_placement_group = false
zone_balance = false
zones = ["1", "2", "3"]
instances = var.server_count
sku_name = var.virtual_machine_sku_name
user_data_base64 = "..."
network_interface { ...}
os_disk { ... }
os_profile { ... }
source_image_reference { ... }
tags = {
nomad = "server"
}
}
Most of the details are left out of the example above; you can see everything in the GitHub repository. There are a few arguments we should take a closer look at:
- In the `identity` block we will configure an Azure user-assigned managed identity. This is a managed identity that the virtual machine can use to perform actions on Azure. Each VM will need to read tags to use the cloud auto-join function. More on this soon.
- The number of instances of this VMSS is configured using the `server_count` variable (a sketch of this variable and the SKU variable follows this list). This number must also be set in the Nomad configuration files on each server. Remember that this value should be at least 3 and at most 7. The default value here is 3. More on this soon.
- The `sku_name` is set using the `virtual_machine_sku_name` variable. This Nomad cluster will not run heavy workloads, so a cheap SKU could be used here. Use the default value or configure it in your `terraform.tfvars` file.
- The `user_data_base64` argument is very important. Through this we configure what happens on boot for each instance. This includes installing and configuring Nomad. More on this soon.
- In the `os_profile` block we configure the admin username to be `azureuser` and we configure an SSH key that we can use to connect to the instance. No admin password will be set for the instances.
- In the `source_image_reference` block we tell Azure what base image we want to use. In this example an Ubuntu 24.04 LTS image is used.
- Finally, the VMSS `tags` will be propagated to all instances.
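As a reference, here is a sketch of what the two variables could look like (the default SKU below is an assumption; the actual definitions are in the repository):

variable "server_count" {
  type        = number
  description = "Number of Nomad servers (between 3 and 7)."
  default     = 3
}

variable "virtual_machine_sku_name" {
  type        = string
  description = "SKU to use for the Nomad server instances."
  default     = "Standard_B2s" # assumed cheap SKU, check the repository for the real default
}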
Create the identity#
Create a user-assigned identity and give the identity the `Reader` role on the resource group:
resource "azurerm_user_assigned_identity" "nomad" {
name = "nomad"
location = azurerm_resource_group.default.location
resource_group_name = azurerm_resource_group.default.name
}
resource "azurerm_role_assignment" "nomad" {
scope = azurerm_resource_group.default.id
principal_id = azurerm_user_assigned_identity.nomad.principal_id
role_definition_name = "Reader"
}
Add the identity to the VMSS by configuring the `identity` block:
resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
# ... other arguments omitted
identity {
type = "UserAssigned"
identity_ids = [azurerm_user_assigned_identity.nomad.id]
}
}
Create the admin SSH key#
For this initial setup we allow SSH connections from anywhere to our Nomad servers. If we keep the ambition up we will eventually add HashiCorp Boundary and Vault to handle privileged access management with no permanent SSH keys. For now, we move on without this added complexity.
To create an SSH key for the admin user we will utilize the TLS provider and the local provider for Terraform. Add these to your configuration:
terraform {
  required_providers {
    # ... other providers omitted

    local = {
      source  = "hashicorp/local"
      version = "~> 2.5"
    }

    tls = {
      source  = "hashicorp/tls"
      version = "~> 4.1"
    }
  }
}
Configure a new TLS private key resource, write the private key to a local file, and use the public key to create a corresponding SSH public key resource on Azure that we then configure the VMSS with:
resource "tls_private_key" "servers" {
algorithm = "RSA"
rsa_bits = 4096
}
resource "local_file" "pricate_key" {
content = tls_private_key.servers.private_key_pem
filename = "${path.module}/ssh_keys/nomad-servers.pem"
file_permission = "0400"
}
resource "azurerm_ssh_public_key" "servers" {
name = "nomad-servers"
resource_group_name = azurerm_resource_group.default.name
location = azurerm_resource_group.default.location
public_key = tls_private_key.servers.public_key_openssh
}
Add the key to the VMSS in the `linux_configuration` block inside of the `os_profile` block:
resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
# ... other arguments omitted
os_profile {
linux_configuration {
# ...
admin_ssh_key {
username = "azureuser"
public_key = azurerm_ssh_public_key.servers.public_key
}
}
}
}
Install and configure Nomad#
To install and configure Nomad on our virtual machines we will use cloud-init. There are at least two options for how you can configure cloud-init:
- Create a cloud-init configuration file directly in YAML.
- Use the cloud-init provider for Terraform.
To avoid having to learn the exact format of a cloud-init configuration file you can go for the second option and use Terraform.
Add the cloud-init provider to the Terraform configuration:
terraform {
  required_providers {
    # ... other providers

    cloudinit = {
      source  = "hashicorp/cloudinit"
      version = "~> 2.3"
    }
  }
}
Installing and configuring Nomad servers requires the following (a rough sketch of the first steps follows the list):
- Downloading the Nomad binary. Instead of downloading the binary you could use a package manager, which would also configure some of the following things for you. However, installing the binary yourself gives you more flexibility in how you configure everything.
- Creating a Linux user for Nomad.
- Creating the Systemd service file.
- Creating the Nomad configuration file and directories.
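Here is a rough sketch of what the first steps could look like in a shell script run by cloud-init (the Nomad version matches the build we will see later in this post, everything else here is an assumption; the real script is in the repository):

#!/bin/bash
set -euo pipefail

# Download the Nomad binary and place it on the PATH
NOMAD_VERSION="1.10.2"
apt-get update && apt-get install -y unzip
curl -fsSL -o /tmp/nomad.zip \
  "https://releases.hashicorp.com/nomad/${NOMAD_VERSION}/nomad_${NOMAD_VERSION}_linux_amd64.zip"
unzip -o /tmp/nomad.zip -d /usr/local/bin

# Create a dedicated Linux user and the directories Nomad needs
useradd --system --home /etc/nomad.d --shell /bin/false nomad
mkdir -p /etc/nomad.d /opt/nomad/data
chown -R nomad:nomad /etc/nomad.d /opt/nomad
chmod 700 /etc/nomad.d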
The full details of these steps can be seen in the GitHub repository. In the following discussion we concentrate on the Nomad configuration file.
You can create one or many configuration files; Nomad merges them together into a single configuration. The configuration we are creating will be relatively small, so a single file is perfectly fine.
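Nomad's -config flag can be given a directory, in which case it loads and merges every configuration file in it:

# Load and merge all configuration files in /etc/nomad.d
nomad agent -config=/etc/nomad.d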
The configuration file will be placed at `/etc/nomad.d/nomad.hcl` and the content will be generated from the following:
data_dir = "/opt/nomad/data"
bind_addr = "0.0.0.0"
datacenter = "dc1"
tls {
http = false
rpc = false
}
ports {
http = 4646
rpc = 4647
serf = 4648
}
server {
enabled = true
bootstrap_expect = ${var.server_count}
server_join {
retry_join = [
"provider=azure tag_name=nomad tag_value=server subscription_id=${data.azurerm_client_config.current.subscription_id}"
]
}
}
The important parts are in the `server` stanza:

- This file is for configuring servers, so we set `enabled = true`.
- We tell Nomad how many servers we expect with `bootstrap_expect = ${var.server_count}`. This number is the same as the instance count we set for the VMSS.
- In the `server_join` stanza we use the cloud auto-join feature. Here we configure what Nomad should look for when forming a cluster, as a string consisting of a few pieces:
  - `provider=azure` is required because many different providers are supported and Nomad can't figure out which one to use by itself.
  - `tag_name` and `tag_value` indicate what the resources Nomad looks for are tagged with. Remember how we tagged our instances with `nomad=server`?
  - The `subscription_id` is the Azure subscription ID fetched from the `azurerm_client_config` data source. This is so that the API calls in the background go to the correct subscription.
The cloud auto-join feature will use the user-assigned identity we added to the VMSS. This is why we don’t need to provide any more details or any credentials.
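For the curious: under the hood this boils down to the standard Azure managed identity flow, where a process on the VM asks the Azure Instance Metadata Service for a token without any stored credentials. Roughly like this (illustration only, Nomad handles this for us):

# Request an access token for the Azure management API from the
# Instance Metadata Service (only works from inside the VM).
curl -s -H "Metadata: true" \
  "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.azure.com%2F"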
All the cloud-init parts are configured in a `cloudinit_config` data source. Add the rendered cloud-init configuration to the VMSS:
resource "azurerm_orchestrated_virtual_machine_scale_set" "nomad" {
# ... other arguments omitted
user_data_base64 = data.cloudinit_config.nomad.rendered
}
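For completeness, here is a sketch of what the `cloudinit_config` data source could look like. The part layout, template file name, and the variables passed to templatefile are assumptions here; the real configuration is in the repository:

data "cloudinit_config" "nomad" {
  gzip          = false
  base64_encode = true # user_data_base64 expects base64-encoded content

  part {
    content_type = "text/x-shellscript"
    content = templatefile("${path.module}/templates/nomad-server.sh.tpl", {
      server_count    = var.server_count
      subscription_id = data.azurerm_client_config.current.subscription_id
    })
  }
}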
Provision the Nomad cluster on Azure#
Go through `terraform init` to download the provider binaries and initialize the state file, run `terraform plan` to see what Terraform will do when this configuration is applied, and finally `terraform apply` to provision the resources.
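In short:

terraform init
terraform plan
terraform apply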
Provisioning the cluster takes a couple of minutes.
When the provisioning is complete we can check our Azure environment and see that a lot of resources have been created.
Grab the public IP of one of the virtual machine instances and connect to it using the SSH key stored under the ssh_keys directory.
$ ssh -i ssh_keys/nomad-servers.pem azureuser@<public ip>
Once connected to the instance, run `nomad server members` to see if our cluster has been formed:
azureuser@nomadsrvQG6F2N:~$ nomad server members
Name Address Port Status Leader Raft Version Build Datacenter Region
nomadsrvQG6F2N.global 10.0.10.5 4648 alive false 3 1.10.2 dc1 global
Error determining leaders: 1 error occurred:
* Region "global": Unexpected response code: 500 (No cluster leader)
We have not formed a Nomad cluster! An error message informs us that there is no cluster leader, and there is only a single server in the list.
It turns out that the tag that cloud auto-join is looking for must be set on the network interface cards (NICs) of the virtual machines, not on the virtual machines themselves.
This is unfortunate because in the VMSS resource we can't configure tags for the NICs. The tags from the VMSS itself are not inherited by its subresources.
We will address this in a future blog post in this series, but for now we create a simple script that uses the Azure CLI to add the required tags to the NICs:
#!/bin/bash

resourceGroup="rg-nomad-on-azure"

az network nic list \
  --resource-group "$resourceGroup" \
  --query "[].{name:name}" -o tsv | \
while read -r nicName; do
  az network nic update \
    --name "$nicName" \
    --resource-group "$resourceGroup" \
    --set tags.nomad="server"
done
Run this script to apply the tags.
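To verify the result you can list the NICs together with their nomad tag (assuming the same resource group name as above):

az network nic list \
  --resource-group "rg-nomad-on-azure" \
  --query "[].{name:name, nomad:tags.nomad}" -o table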
After applying the tags to the NICs we can see that we have successfully formed a cluster:
azureuser@nomadsrvQG6F2N:~$ nomad server members
Name Address Port Status Leader Raft Version Build Datacenter Region
nomadsrv2ZI6C9.global 10.0.10.6 4648 alive true 3 1.10.2 dc1 global
nomadsrvJLAL8D.global 10.0.10.4 4648 alive false 3 1.10.2 dc1 global
nomadsrvQG6F2N.global 10.0.10.5 4648 alive false 3 1.10.2 dc1 global
We have three servers. One of the servers is designated as the leader.
Summary of Part 1#
In this blog post we learned the basics of what Nomad is and we went through how to provision an initial version of a Nomad cluster on Azure using a VMSS.
We discovered a problem with the cloud auto-join functionality on Azure. The tags that cloud auto-join is looking for must be on the NICs attached to the virtual machines. There is no easy way to set these tags on the NICs when working with a VMSS. As a workaround we used a script that runs the Azure CLI to set the required tags.
This cluster can be improved, and we will take the next step in improving our cluster setup in the next blog post in this series. Before we configure Nomad any further we will take a detour to configure Consul, and we will use HashiCorp Consul to help us bootstrap a Nomad cluster instead of relying on tags. Stay tuned!