This post will take you on a brief journey of HashiCorp Boundary through the eyes of Terraform. I have spent some time over the past few weeks learning about Boundary in preparation for a HashiCorp User Group session.

This post is not intended to explain every nook and cranny of HashiCorp Boundary, but it will touch on many of the concepts from the Boundary domain model. I will jump straight into applying Boundary for a specific use-case of providing just-in-time access to resources in AWS for on-call engineers during an incident. This is vaguely similar to an older post in the official HashiCorp blog.

All source code from this blog post is available on my GitHub.

What is Boundary?

The simple description of Boundary is:

It’s a tool for privileged access management (PAM)

This description captures the essence of Boundary. To really understand what it means you need to dig a bit deeper. What is the best way to dig deeper? Of course it is to actually get hands-on and do something with Boundary! That is exactly what this post is all about.

Boundary consists of two major components: the control plane and workers.

The control plane handles things such as authentication, roles, users, metadata for hosts and targets, and more. The Boundary workers are what allow us to eventually connect to our public and private resources.

There are other solutions that do the same thing as Boundary. Most of them are geared toward a specific provider. HashiCorp Boundary is agnostic as to what platform you are targeting. Once you learn the basics of setting up Boundary to access resources in provider X, you know how to do the same thing for provider Y. Resources in this case could be cloud resources (e.g. in AWS, Azure, or GCP) or basically anything else that is reachable via an IP address. You could use Boundary to access private resources running in your basement. Note that the resources do not have to be publicly exposed for this to work; more on this later.

If you wish, you can deploy the Boundary control plane yourself and host the complete Boundary solution on your own. If you don't have any specific requirements for where the control plane should be hosted, you can use the managed Boundary experience on the HashiCorp Cloud Platform (HCP).

In this post I will use HCP Boundary, which means my control plane endpoint is exposed to the outside world. This is the current status quo of HCP Boundary, but it is not a far-fetched idea that sometime soon you will be able to integrate HCP Boundary with a HashiCorp Virtual Network (HVN) in HCP and peer it with your own virtual networks. HVN integration is already available (in fact, a requirement) for HCP Vault.

Create a HashiCorp Cloud Platform Boundary cluster

We are ready to start building our infrastructure. To interact with HCP through Terraform we use the hcp provider:

terraform {
  required_providers {
    hcp = {
      source  = "hashicorp/hcp"
      version = "0.87.1"
    }
  }
}

To authenticate to HCP you should follow the steps available here. In essence it entails creating a service principal in HCP with corresponding client credentials. If Terraform has access to the client ID and client secret through the corresponding environment variables HCP_CLIENT_ID and HCP_CLIENT_SECRET, then the following configuration of the hcp provider is sufficient:

provider "hcp" {}

In fact, you can leave it out completely. I prefer to add it even if it is empty.
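
If you would rather not rely on environment variables, the hcp provider also accepts the credentials as provider arguments. A minimal sketch (var.hcp_client_id and var.hcp_client_secret are hypothetical variables you would declare yourself; keep the secret out of version control):

provider "hcp" {
  # hypothetical variables holding the service principal credentials
  client_id     = var.hcp_client_id
  client_secret = var.hcp_client_secret
}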

Next we can create our HCP Boundary cluster:

resource "hcp_boundary_cluster" "this" {
  cluster_id = "aws-cluster"
  username   = var.boundary_admin_username
  password   = var.boundary_admin_password
  tier       = "Plus"

  maintenance_window_config {
    day          = "MONDAY"
    start        = 2
    end          = 8
    upgrade_type = "SCHEDULED"
  }
}

One thing I want to mention straight away is that for a given project in HCP you can only have a single HCP Boundary cluster. If you try to create a second one it will fail. If you do need several clusters you can create additional projects in HCP and create one Boundary cluster in each.

You must provide an initial username and password for the first administrator account. You can’t bootstrap a new HCP Boundary cluster with OIDC or LDAP auth methods. This might seem a bit unfortunate, but it is understandable. Another unfortunate thing is that, as you will notice later on, the Terraform provider for Boundary currently supports only the password auth method. Anyway, I’ve set the username and password as variables to avoid putting my secrets directly into the code:

variable "boundary_admin_username" {
  description = "Boundary admin username"
  type        = string
}

variable "boundary_admin_password" {
  description = "Boundary admin user password"
  type        = string
  sensitive   = true
}

I’ve also added a maintenance_window_config block to my HCP Boundary cluster where I specify that I want cluster updates to happen on Mondays early in the morning. During my experimentation there have been a few cluster upgrades, and my experience is that it works well without any issues.

Ideally we would be able to configure our Terraform provider for Boundary using the attributes available from this hcp_boundary_cluster resource. Unfortunately I don’t think it is possible to both create an HCP Boundary cluster and create resources within it through the Boundary provider in one and the same configuration. I kept running into an issue where the Boundary provider would look for my cluster at 127.0.0.1 for no good reason.

Let me know if you find a way to both deploy a cluster and use it in the same configuration; for the rest of this post I will assume you split the two into separate Terraform configurations. One could argue that it makes more sense to keep the Boundary cluster in a separate configuration anyway, since the lifecycle of the cluster is significantly different from that of the resources within it.

Create a HashiCorp Cloud Platform Vault cluster

I want to use Vault to provide SSH certificates that I can use to access EC2 machines running in AWS. So I will need an HCP Vault cluster to work with as well. I add this to the same configuration:

resource "hcp_hvn" "this" {
  hvn_id         = "hvn-aws-${var.aws_region}"
  cidr_block     = var.hvn_cidr_block
  cloud_provider = "aws"
  region         = var.aws_region
}

resource "hcp_vault_cluster" "this" {
  cluster_id      = "vault-aws-${var.aws_region}"
  hvn_id          = hcp_hvn.this.hvn_id
  tier            = "dev"
  public_endpoint = true
}

resource "hcp_vault_cluster_admin_token" "this" {
  cluster_id = hcp_vault_cluster.this.cluster_id
}

You must place the HCP Vault cluster into an HVN, so I create an hcp_hvn resource as well. An HVN is an abstraction on top of an AWS VPC or an Azure VNet. I will not be establishing a peering connection between my AWS VPC and the HCP HVN, so theoretically I could have used azure as the cloud provider for my HVN, but for consistency I use aws.

Finally I also create a Vault admin token using the hcp_vault_cluster_admin_token resource. This token will be used by Terraform to create resources in Vault. Note that the token is valid for six hours; if you need it after that you will have to re-run this configuration to generate a new one.
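
The second Terraform configuration (described in the rest of this post) needs the Boundary cluster URL, the Vault cluster URL, and the Vault admin token. One way to hand these values over is to expose them as outputs from this first configuration; a sketch using the attributes of the resources defined above (the repository may structure this differently):

output "boundary_cluster_url" {
  value = hcp_boundary_cluster.this.cluster_url
}

output "vault_cluster_url" {
  value = hcp_vault_cluster.this.vault_public_endpoint_url
}

output "vault_admin_token" {
  value     = hcp_vault_cluster_admin_token.this.token
  sensitive = true
}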

Create an AWS Virtual Private Cloud

I create a very simple Virtual Private Cloud (VPC) in AWS where I will later create an EC2 instance. The VPC consists of a single public subnet and an internet gateway. The details of the VPC are available in the GitHub repository.
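
A minimal sketch of such a VPC could look like the following (CIDR ranges and names are placeholders; the repository version differs in the details):

resource "aws_vpc" "this" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.this.id
  cidr_block              = "10.0.1.0/24"
  map_public_ip_on_launch = true
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.this.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}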

Configure Boundary

In the accompanying GitHub repository you will notice that the resources described above form their own Terraform configuration in the 01-hcp-aws directory. Everything described in the rest of this blog post is part of the other Terraform configuration in the 02-boundary-aws directory. This configuration contains many resources; I will focus on some of them in the subsections below.

Scopes

You can organize resources in Boundary into different scopes.

Scopes are like directories in a filesystem. The top-most scope is called the global scope. The scope ID of the global scope is always global, which is handy to remember. The first level below the global scope consists of organizations. Sub-scopes of organizations are called projects. There are no further levels of scopes below that. How you organize your resources is up to you, but it makes sense to group them according to similarity, for instance keeping your AWS resources in a single project or keeping resources from the same cloud region in the same project.

For the purpose of this post I am adding an organization scope called cloud and a project scope called aws-resources.

Before we do that we need to add the Boundary provider to our configuration:

terraform {
  required_providers {
    boundary = {
      source  = "hashicorp/boundary"
      version = "1.1.14"
    }
  }
}

provider "boundary" {
  addr                   = var.boundary_cluster_url
  auth_method_login_name = var.boundary_admin_username
  auth_method_password   = var.boundary_admin_password
}

To configure the Boundary provider I am using the HCP Boundary cluster URL as a variable, together with the username and password that I used to set up Boundary. You could of course create a dedicated user for Terraform if you wish, but note that this has to be done in some fashion other than this Terraform configuration (scripts, the SDK, or manually). Actually, you could use a different Terraform configuration to create another user to then use with this configuration. But I will skip that step!

Next I create my two scopes:

resource "boundary_scope" "org" {
  name                     = "cloud"
  description              = "Organization for blog post demo"
  scope_id                 = "global"
  auto_create_admin_role   = true
  auto_create_default_role = true
}

resource "boundary_scope" "project" {
  name                     = "aws-resources"
  description              = "Project for target AWS resources"
  scope_id                 = boundary_scope.org.id
  auto_create_admin_role   = true
  auto_create_default_role = true
}

Notice how I don’t specify if the scope should be an organization or a project. This is simply understood from the scope where I place my new scope. Inside of the global scope there are organizations, and inside of organizations there are projects. You can’t create a project directly under the global scope or organizations inside of projects.

For my organization I say that its scope_id is global. This means that the new scope should be placed inside of the global scope. I could have used a data source to get a nice reference to the global scope, but it is unnecessary because the global scope ID is always global (remember?). There is also no need to actually create the global scope in the first place; it is automatically created for you with your new HCP Boundary cluster. I can’t say for sure whether this is true for a self-managed Boundary cluster. In that case you might need to create the global scope yourself. I’ll try it out and let you know!

For both of my scopes I have set auto_create_admin_role and auto_create_default_role to true. If you set them to false you will notice that by default you don’t have many permissions in the new scopes. So I recommend setting these values to true for now; we can come back to how to handle the situation where they are set to false later.

Auth methods

Auth methods in Boundary are similar to auth methods in Vault. They are used to authenticate to Boundary. When you create a new HCP Boundary cluster the password auth method is enabled at the global scope automatically. This allows you to sign in with the admin account that you set up.

You can add additional auth methods at different scopes in your Boundary cluster. Currently there are three auth methods available: password, OIDC, and LDAP. Configuring an OIDC or LDAP auth method is a bit too complicated for an introductory post on Boundary, so I will leave it for a future post. In this post I will use the password auth method available at the global scope.

I will need to reference this auth method at a few places, so it makes sense to use a data source for it:

data "boundary_auth_method" "password" {
  scope_id = "global"
  name     = "password"
}

The name of the default password auth method is simply password, so it is easy to set up the data source.

Users, accounts, and groups

There are two types of principals in Boundary: users and groups. In fact there are managed groups as well, but we will ignore those for now.

Users are the entities that will be accessing the targets through Boundary. Users can be assigned to groups to simplify permission management.

Users can have one or more accounts. An account is the credentials connected to a given auth method. A user can have an account in an OIDC auth method, and an account in a password auth method, but ultimately the accounts belong to the same user.

I will create three users, each with its own account in the global password auth method. The first two users are my on-call engineers. For simplicity I only show one of them; the other is fairly similar except for the name (john):

resource "boundary_account_password" "jane" {
  name           = "jane"
  login_name     = "jane"
  description    = "Jane account for the password auth method"
  password       = random_password.jane.result
  auth_method_id = data.boundary_auth_method.password.id
}

resource "boundary_user" "jane" {
  name        = "Jane Doe"
  description = "Jane the on-call engineer"
  scope_id    = "global"
  account_ids = [
    boundary_account_password.jane.id,
  ]
}

I first create the account in the password auth method using the boundary_account_password resource type; next I create the user with the boundary_user resource type and reference the account in the account_ids argument.
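
The account password comes from a random_password resource that I have not shown here (a similar resource exists for each account); a minimal sketch using the hashicorp/random provider:

resource "random_password" "jane" {
  # generate a 24-character password for the account
  length  = 24
  special = true
}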

My two on-call engineers are added to a group:

resource "boundary_group" "oncall" {
  name        = "on-call-engineers"
  description = "On-call engineers"
  scope_id    = "global"
  member_ids = [
    boundary_user.john.id,
    boundary_user.jane.id,
  ]
}

Later on in this post I will be setting up a Lambda function to administer permissions for this on-call group in Boundary. The Lambda function requires its own user:

resource "boundary_account_password" "lambda" {
  name           = "lambda"
  login_name     = "lambda"
  description    = "Account for AWS Lambda in the password auth method"
  password       = random_password.lambda.result
  auth_method_id = data.boundary_auth_method.password.id
}

resource "boundary_user" "lambda" {
  name        = "AWS Lambda"
  description = "User for AWS Lambda"
  scope_id    = "global"
  account_ids = [
    boundary_account_password.lambda.id,
  ]
}

Roles

Permissions in Boundary are known as grants. You combine grants into roles. Roles can be assigned to principals: users and groups.

I will create three custom roles. The first is a reader role which will allow the on-call engineers to list any type of resource in Boundary but not perform any specific actions on them:

resource "boundary_role" "reader" {
  name        = "reader"
  description = "Custom reader role for on-call engineers"
  scope_id    = boundary_scope.org.id
  grant_scope_ids = [
    boundary_scope.org.id,
    boundary_scope.project.id,
  ]
  grant_strings = [
    "ids=*;type=*;actions=list,no-op",
  ]
  principal_ids = [
    boundary_group.oncall.id,
  ]
}

There are a few important pieces to a role. First of all you create the role in a scope, but you must also define in which scopes the role is in effect through the grant_scope_ids argument. You specify who the role is assigned to in the principal_ids argument. Finally, you specify the grants the role consists of in the grant_strings argument. A grant string follows a specific syntax where you can specify resource IDs, resource types, actions that can be performed, and more. In this case the grant string says that for any resource type (type=*) and any resource ID (ids=*) the actions list and no-op are allowed. This combination is a special way to allow listing of resources without allowing any other action: without the no-op action, resources that the user has no other grants on would not show up in list results.

Next I create a special role named alert that will be assigned to the on-call engineers when an alert is triggered:

resource "boundary_role" "alert" {
  name        = "alert"
  description = "Custom role for on-call engineers during alerts"
  scope_id    = "global"
  grant_scope_ids = [
    boundary_scope.org.id,
    boundary_scope.project.id,
  ]
  grant_strings = [
    "ids=*;type=*;actions=read,list",
    "ids=*;type=target;actions=authorize-session"
  ]
}

The main difference here is that I have not assigned any principals to this role. This will be handled as needed by an AWS Lambda function.

This role has two grant strings. The first allows reading and listing any resource (ids=*;type=*;actions=read,list) and the second one allows authorize-session on target type of resources. This means the on-call engineers will be able to connect to the resource through Boundary.

The third custom role is for our Lambda function user:

resource "boundary_role" "lambda" {
  name        = "lambda"
  description = "Custom role for AWS Lambda to administer the on-call role assignments"
  scope_id    = "global"
  grant_strings = [
    "ids=*;type=*;actions=list,no-op",
    "ids=${boundary_role.alert.id};type=role;actions=read,list,add-principals,remove-principals",
  ]
  grant_scope_ids = [
    "global",
    boundary_scope.org.id,
    boundary_scope.project.id,
  ]
  principal_ids = [
    boundary_user.lambda.id,
  ]
}

The interesting part here is the second grant string:

"ids=${boundary_role.alert.id};type=role;actions=read,list,add-principals,remove-principals"

The first part ids=${boundary_role.alert.id} restricts the permission to the specific resource ID of the alert role defined above, the second part type=role restricts it to role resources, and the last part actions=read,list,add-principals,remove-principals restricts which actions can be performed on the role.

I have added read because Lambda must be able to read the role to see which principals are currently assigned to it. I added list too, but in fact I think I could remove it. For now I will let it be! Finally I have added add-principals and remove-principals because this is the main part of what the Lambda function will do: it will add a group to this role, and later remove the group from the role again.

Configuring a target

I will configure an EC2 instance in AWS as the target resource for this demo. To successfully set this target up there are a few things that need to be configured. This section and the following subsections will go through most of these parts one by one.

Configuring Vault

We want to use Vault for credential injection, specifically the injection of SSH certificates in the sessions that we establish through Boundary. To work with Vault through Terraform we add the Vault provider and configure it:

terraform {
  required_providers {
    vault = {
      source  = "hashicorp/vault"
      version = "4.2.0"
    }
  }
}

provider "vault" {
  address   = var.vault_cluster_url
  namespace = "admin"
  token     = var.vault_admin_token
}

The Vault cluster URL and token we obtain from the Terraform configuration we applied earlier.

For Boundary to work with Vault it requires a Vault token of its own. This token must be configured in a specific way, and it must be given the access it needs. I create two separate policies for Boundary’s Vault token:

resource "vault_policy" "boundary" {
  name   = "boundary-controller"
  policy = file("./policy/boundary-controller-policy.hcl")
}

resource "vault_policy" "ssh" {
  name   = "ssh"
  policy = file("./policy/ssh-policy.hcl")
}

The specifics of these policies can be seen in the GitHub repository; in essence they allow Boundary to manage its own token and to use the SSH secrets engine that we will set up. Together with these policies we can create the Vault token for Boundary:

resource "vault_token" "boundary" {
  display_name = "boundary"
  policies = [
    vault_policy.boundary.name,
    vault_policy.ssh.name,
  ]
  no_default_policy = true
  no_parent         = true
  renewable         = true
  ttl               = "24h"
  period            = "1h"
}

Importantly, it must be an orphan token (no_parent = true), and it must be periodic and renewable. If the token had a parent it would stop working whenever that parent token was revoked, and Boundary would stop working along with it.
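
For reference, the boundary-controller policy gives Boundary the token self-management and lease capabilities it needs, and the ssh policy allows it to use the signing endpoint of the SSH secrets engine. A rough sketch of what the two policy files might contain, based on HashiCorp's documentation (the exact files are in the repository):

# boundary-controller-policy.hcl (sketch)
path "auth/token/lookup-self" {
  capabilities = ["read"]
}

path "auth/token/renew-self" {
  capabilities = ["update"]
}

path "auth/token/revoke-self" {
  capabilities = ["update"]
}

path "sys/leases/renew" {
  capabilities = ["update"]
}

path "sys/leases/revoke" {
  capabilities = ["update"]
}

path "sys/capabilities-self" {
  capabilities = ["update"]
}

# ssh-policy.hcl (sketch)
path "ssh-client-signer/sign/boundary-client" {
  capabilities = ["create", "update"]
}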

To configure the SSH secrets engine we set up the following resources:

resource "vault_mount" "ssh" {
  path = "ssh-client-signer"
  type = "ssh"
}

resource "vault_ssh_secret_backend_role" "boundary_client" {
  name                    = "boundary-client"
  backend                 = vault_mount.ssh.path
  key_type                = "ca"
  default_user            = "ubuntu"
  allowed_users           = "*"
  allowed_extensions      = "*"
  allow_host_certificates = true
  allow_user_certificates = true

  default_extensions = {
    permit-pty = ""
  }
}

resource "vault_ssh_secret_backend_ca" "boundary" {
  backend              = vault_mount.ssh.path
  generate_signing_key = true
}

The details of this configuration are beyond the scope of this post, but one important detail is that default_user is set to ubuntu in the vault_ssh_secret_backend_role resource. This default user must match the EC2 AMI you are using for the target resource in AWS. If you were to use an Amazon Linux AMI you would need to change the default user to ec2-user.

Configuring Boundary credential store and credential library

With Vault set up we can configure the integration with Boundary. Boundary has two concepts for credentials: credential store and credential library. A credential store comes in two types: Vault or static. The difference is where the credentials are provided from. The Vault credential store type is self-explanatory, but the static type means the credentials are provided from Boundary itself. In this case we will use the Vault type of credential store:

resource "boundary_credential_store_vault" "ec2" {
  name      = "boudary-vault-credential-store-ec2"
  scope_id  = boundary_scope.project.id
  address   = var.vault_cluster_url
  token     = vault_token.boundary.client_token
  namespace = "admin"
}

We provide the address for the Vault cluster together with a token and what namespace we should use. Next we will configure a credential library. A credential library provides a set of credentials with given permissions. Credentials from a specific credential library always have the same permissions. If you must provide credentials with different permissions you must use several credential libraries. The credential library in this case will provide SSH certificates. There is a convenience resource for this in the Boundary provider for Terraform:

resource "boundary_credential_library_vault_ssh_certificate" "ec2" {
  name                = "ssh-certificates"
  credential_store_id = boundary_credential_store_vault.ec2.id
  path                = "ssh-client-signer/sign/boundary-client"
  username            = "ubuntu"
  key_type            = "ecdsa"
  key_bits            = 521
  extensions = {
    permit-pty = ""
  }
}

We see that we can override some of the settings defined in the Vault secrets engine from here; for instance, we could provide a different username than the default defined in the secrets engine. Note that we place this credential library in the credential store we defined above.

Configure Boundary host catalogs, host sets, and hosts

A host catalog is a collection of host sets and hosts. There are two types of host catalogs: static and dynamic. In a static host catalog we add specific hosts that we know by their addresses. In a dynamic host catalog (available for AWS and Azure) we specify a set of tags to look for, and any instance with these tags is added to the host catalog (for AWS, currently only EC2 instances are supported). In this case we add a static host catalog:

resource "boundary_host_catalog_static" "ec2" {
  name        = "AWS Static Host Catalog"
  description = "EC2 instances"
  scope_id    = boundary_scope.project.id
}

A host is an address of a resource we want to access. Hosts can be grouped into host sets, where a given host set contains hosts that are functionally identical. A typical host set would be all the instances in an autoscaling group. I add a host and place it in a host set:

resource "boundary_host_static" "ec2" {
  name            = "aws-ec2-static-host"
  address         = aws_instance.ec2.public_ip
  host_catalog_id = boundary_host_catalog_static.ec2.id
}

resource "boundary_host_set_static" "ec2" {
  name            = "aws-ec2-static-host-set"
  host_catalog_id = boundary_host_catalog_static.ec2.id
  host_ids = [
    boundary_host_static.ec2.id,
  ]
}

In my host I reference the public IP of an AWS EC2 instance (see below). Note how I do not specify any ports that I want to connect to, or any credentials required to connect to the address.

Configuring a target

A target is the combination of a host with details of what port we want to connect to and any credentials required to connect to the host. I define a target for my EC2 instance:

resource "boundary_target" "ec2" {
  type                     = "ssh"
  name                     = "aws-ec2"
  scope_id                 = boundary_scope.project.id
  session_connection_limit = -1
  default_port             = 22
  host_source_ids = [
    boundary_host_set_static.ec2.id,
  ]
  injected_application_credential_source_ids = [
    boundary_credential_library_vault_ssh_certificate.ec2.id
  ]
}

There are two types of targets: SSH and generic TCP. The features differ between the two types. One major difference is that the SSH target type supports credential injection from Vault. Credential injection is the process where credentials from Vault are injected into the session without the user ever having the possibility to see them. For generic TCP targets there is support for credential brokering, where Vault (or Boundary) provides the credentials to the user. The user can see the credentials in this case.

For my boundary_target resource I specify that I want to connect to port 22, I specify what hosts are part of the target in the host_source_ids argument and I specify that I want credentials injected from my credential library in the injected_application_credential_source_ids argument.
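
For comparison, a generic TCP target uses brokered rather than injected credentials. A hypothetical sketch (not part of this demo) of what a database target could look like, referencing an imaginary credential library for database credentials:

resource "boundary_target" "database" {
  type         = "tcp"
  name         = "aws-postgres"
  scope_id     = boundary_scope.project.id
  default_port = 5432
  host_source_ids = [
    boundary_host_set_static.ec2.id,
  ]
  # brokered credentials are shown to the user instead of being injected
  brokered_credential_source_ids = [
    boundary_credential_library_vault.database.id, # hypothetical credential library
  ]
}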

Configuring the EC2 instance to complete our target

So far we have configured everything required in Vault and Boundary for our target. We still have not talked about the EC2 instance itself.

To be able to connect to the EC2 instance using the SSH certificates issued by Vault, the EC2 instance must trust the Vault CA. We can achieve this by obtaining the public key from Vault and adding it to our EC2 instance. In a production scenario this is something you would bake into the AMI using Packer, but for this demo it is sufficient to add it during setup with Terraform. I will use the http data source to ask Vault for this public key:

data "http" "public_key" {
  method = "GET"
  url    = "${var.vault_cluster_url}/v1/ssh-client-signer/public_key"

  retry {
    attempts     = 10
    min_delay_ms = 1000
    max_delay_ms = 5000
  }

  request_headers = {
    "X-Vault-Namespace" = "admin"
  }

  depends_on = [
    vault_mount.ssh,
    vault_ssh_secret_backend_ca.boundary,
    vault_ssh_secret_backend_role.boundary_client,
  ]
}

The response from this request is then used in a cloudinit_config data source:

data "cloudinit_config" "ec2" {
  gzip          = false
  base64_encode = true

  part {
    content_type = "text/x-shellscript"
    content      = <<-EOF
      #!/bin/bash
      echo "${data.http.public_key.response_body}" >> /etc/ssh/trusted-user-ca-keys.pem
      echo 'TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem' | sudo tee -a /etc/ssh/sshd_config
      sudo systemctl restart sshd.service
    EOF
  }

  # ...
}

In essence what is going on is that I store the public key in /etc/ssh/trusted-user-ca-keys.pem, then I add the following line to /etc/ssh/sshd_config:

TrustedUserCAKeys /etc/ssh/trusted-user-ca-keys.pem

Finally I restart the sshd service.

When you go through the Terraform configuration in the accompanying GitHub repository you will see that I do something else in the cloudinit_config data source as well:

data "cloudinit_config" "ec2" {
  gzip          = false
  base64_encode = true

  # ...

  part {
    content_type = "text/x-shellscript"
    content      = <<-EOF
      #!/bin/bash
      sudo apt update
      sudo apt install -y stress rand

      # random sleep between 0 and 300 seconds
      sleep $(rand -M 300)

      # stress all cpus for three minutes
      stress --cpu $(nproc) --timeout 180
    EOF
  }
}

I install stress and rand, then I use rand to sleep somewhere between 0 and 300 seconds, and finally I start a stress test using stress and let it run for three minutes. The reason for this is that the stress test will trigger an alarm, which in turn will trigger a Lambda function that provides access in Boundary for our on-call engineers. This is simply to simulate an issue with the EC2 instance.

The EC2 instance itself is simple to configure:

resource "aws_instance" "ec2" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = "t3.micro"
  vpc_security_group_ids = [aws_security_group.ec2.id]
  subnet_id              = var.aws_subnet_id

  user_data_base64            = data.cloudinit_config.ec2.rendered
  associate_public_ip_address = true
  monitoring                  = true

  tags = {
    Name = "Boundary Target"
  }
}

Two important arguments are user_data_base64, where I provide the rendered script from my cloudinit_config data source, and monitoring = true, which enables detailed monitoring for this EC2 instance (metrics are reported every minute instead of every five minutes). The latter is required because otherwise the alarm would not be triggered as fast as I would like.
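
The data.aws_ami.ubuntu data source referenced above is not shown in the post; a sketch of how it might look (the filter assumes an Ubuntu 22.04 amd64 image, and 099720109477 is Canonical's AWS account ID):

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}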

Configuring the Lambda function

When the EC2 instance defined in the previous section is under heavy load I want an alarm to go off and trigger a Lambda function. This Lambda function will update the alert Boundary role and add the on-call group to it. The alarm that triggers the Lambda function looks like this:

resource "aws_cloudwatch_metric_alarm" "trigger" {
  alarm_name      = "ec2-cpu-alarm"
  actions_enabled = true

  namespace   = "AWS/EC2"
  metric_name = "CPUUtilization"
  dimensions = {
    "InstanceId" = aws_instance.ec2.id
  }

  statistic           = "Average"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 50
  evaluation_periods  = 1
  datapoints_to_alarm = 1
  period              = 60
  treat_missing_data  = "notBreaching"

  # trigger the Boundary lambda function for all state changes
  ok_actions                = [aws_lambda_function.boundary.arn]
  alarm_actions             = [aws_lambda_function.boundary.arn]
  insufficient_data_actions = [aws_lambda_function.boundary.arn]
}

The alarm is aggressive: it is triggered if the average CPU utilization is at or above 50% over a one-minute period. This is so that it is easy to trigger for demo purposes. The Lambda function is triggered for all state changes of the alarm.

There is a Boundary SDK for Go available, so it makes sense to write the Lambda function in Go. I will not show the complete source code in this blog post; instead, refer to the GitHub repository for all the details.

Back in December 2023 AWS added support for triggering Lambda functions directly from a CloudWatch alarm. No longer do we need a middleman in the form of AWS SNS. Naturally I want to use this trigger for my Lambda function. I did notice that there was no support for this event type in the official AWS SDKs for Lambda, but it was easy to add my own code for it. The entry point for the function is, as always, the main() function:

func main() {
	lambda.Start(HandleAlert)
}

My own logic starts in the HandleAlert function:

func HandleAlert(ctx context.Context, event CloudWatchAlarmEvent) error {
    // ignore this event type for now
	if event.AlarmData.State.Value == "INSUFFICIENT_DATA" {
		return nil
	}

    // create an authenticated boundary client
	client, err := CreateBoundaryClient()
	if err != nil {
		return errors.New("could not create Boundary client")
	}

    // handle the case where the alarm went from OK to ALARM
	if event.AlarmData.State.Value == "ALARM" {
		err = AssignOnCallRole(client)
		if err != nil {
			return errors.New("could not assign role to user")
		}
	}

    // handle the case where the alarm went from ALARM to OK
	if event.AlarmData.State.Value == "OK" {
		err = RevokeOnCallRole(client)
		if err != nil {
			return errors.New("could not revoke role from user")
		}
	}

	return nil
}

This function contains some simple checks to see what the current state of the alarm is, and takes appropriate actions based on it. An important step is to get an authenticated client, and this is handled in the CreateBoundaryClient() function:

func CreateBoundaryClient() (*api.Client, error) {
	client, err := api.NewClient(&api.Config{Addr: os.Getenv("BOUNDARY_ADDR")})
	if err != nil {
		return nil, err
	}

	credentials := map[string]interface{}{
		"login_name": os.Getenv("BOUNDARY_USERNAME"),
		"password":   os.Getenv("BOUNDARY_PASSWORD"),
	}

	authMethodClient := authmethods.NewClient(client)
	authenticationResult, err := authMethodClient.Authenticate(context.Background(), os.Getenv("BOUNDARY_AUTH_METHOD_ID"), "login", credentials)
	if err != nil {
		return nil, err
	}

	client.SetToken(fmt.Sprint(authenticationResult.Attributes["token"]))

	return client, nil
}

A few actions in Boundary can be performed unauthenticated. This is required because otherwise you would not be able to, for instance, reset your password. Authenticating is one of these actions: you first create an unauthenticated client, authenticate to Boundary, and then add the returned Boundary token to the client to make it an authenticated client. This is exactly what this function does. I have stored the important Boundary settings in environment variables that my Lambda function can read (i.e. BOUNDARY_ADDR, BOUNDARY_USERNAME, BOUNDARY_PASSWORD, and BOUNDARY_AUTH_METHOD_ID).

In this post we’ll only look at the case where the alarm transitions to the ALARM state; this is handled in the AssignOnCallRole function:

func AssignOnCallRole(c *api.Client) error {
    // get a role client
	rolesClient := roles.NewClient(c)
	role, err := rolesClient.Read(context.Background(), os.Getenv("BOUNDARY_ON_CALL_ROLE_ID"))
	if err != nil {
		return err
	}

    // see if the principal is already assigned to the role
	for _, principalId := range role.Item.PrincipalIds {
		if principalId == os.Getenv("BOUNDARY_ON_CALL_GROUP_ID") {
			return nil
		}
	}

    // add the principal to the role
	_, err = rolesClient.AddPrincipals(context.Background(), os.Getenv("BOUNDARY_ON_CALL_ROLE_ID"), role.Item.Version, []string{os.Getenv("BOUNDARY_ON_CALL_GROUP_ID")})
	if err != nil {
		log.Println(err)
		return err
	}

	return nil
}

First we read the details of the role and go through the assigned principals. If the principal is already assigned to this role then there is no need to perform any more work. If not, then we add the principal to the role using the AddPrincipals(...) method.

Something that tripped me up a bit when writing this code was that the role has a version, and you need to provide the version number when you make updates to the role. This is another reason why we must read the role before we make any changes to it. Also, the API is not idempotent: if you try to add a principal that is already assigned to the role, you will get an error.

I can build my function code using:

$ GOOS=linux GOARCH=arm64 go build -tags lambda.norpc -o bootstrap main.go

In fact, in the Terraform configuration I have added a null_resource that performs this build:

resource "null_resource" "build" {
  provisioner "local-exec" {
    command = "cd src && GOOS=linux GOARCH=arm64 go build -tags lambda.norpc -o bootstrap main.go"
  }
}
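
One possible refinement (my own suggestion, not necessarily what the repository does) is to add a triggers map so that the build re-runs whenever the Go source changes:

resource "null_resource" "build" {
  # re-run the build whenever the Go source file changes
  triggers = {
    source_hash = filemd5("src/main.go")
  }

  provisioner "local-exec" {
    command = "cd src && GOOS=linux GOARCH=arm64 go build -tags lambda.norpc -o bootstrap main.go"
  }
}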

Note that this is for simplicity of the demo; in a production setting you would have a dedicated pipeline to build this Lambda function. The result of the build is zipped:

data "archive_file" "boundary" {
  type        = "zip"
  source_file = "src/bootstrap"
  output_path = "lambda.zip"

  depends_on = [
    null_resource.build
  ]
}

Finally the zipped file is used when creating the Lambda function:

resource "aws_lambda_function" "boundary" {
  function_name    = "on-call-role-administrator"
  role             = aws_iam_role.boundary.arn
  handler          = "bootstrap"
  runtime          = "provided.al2023"
  filename         = data.archive_file.boundary.output_path
  source_code_hash = data.archive_file.boundary.output_base64sha256
  architectures    = ["arm64"]
  environment {
    variables = {
      BOUNDARY_ADDR             = var.boundary_cluster_url
      BOUNDARY_USERNAME         = boundary_account_password.lambda.login_name
      BOUNDARY_PASSWORD         = boundary_account_password.lambda.password
      BOUNDARY_AUTH_METHOD_ID   = data.boundary_auth_method.password.id
      BOUNDARY_ON_CALL_ROLE_ID  = boundary_role.alert.id
      BOUNDARY_ON_CALL_GROUP_ID = boundary_group.oncall.id
    }
  }
}

All the important Boundary information is passed to the function as environment variables.
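
The aws_iam_role.boundary referenced by the function is not shown in the post; a minimal sketch of what it might contain, with the Lambda trust policy and the AWS-managed basic execution policy for CloudWatch Logs:

data "aws_iam_policy_document" "lambda_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "boundary" {
  name               = "on-call-role-administrator"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

resource "aws_iam_role_policy_attachment" "lambda_logs" {
  role       = aws_iam_role.boundary.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}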

The final piece of the Lambda function is the CloudWatch alarm trigger:

resource "aws_lambda_permission" "cloudwatch_trigger" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.boundary.function_name
  principal     = "lambda.alarms.cloudwatch.amazonaws.com"
}

Boundary workers

So far we have been working with AWS and the Boundary control plane. I mentioned that the second major piece of Boundary was workers. Where are our workers?

When you create an HCP Boundary cluster it comes with two managed workers. You can use them to access your targets as long as the targets are publicly available. If you have private resources you must use self-managed workers as well. However, to keep things simple in this post my targets are publicly available, so I can use the managed workers without any further configuration! It might feel a little bit like cheating, and I agree. In a future post I will talk more about self-managed workers.

Demo time

Once you have set up both Terraform configurations the demo starts.

Our on-call engineer Jane opens up the Boundary client and signs in:

[Image: signing in to the Boundary Desktop client]

Initially there is no ongoing incident, so Jane is met by this view:

[Image: no targets are available in the Boundary client]

Over in AWS it seems like something is happening with our EC2 instance shortly after it was created:

[Image: the CloudWatch alarm in the ALARM state]

An alarm has been triggered! The on-call engineers are notified. When Jane refreshes her view in the Boundary client she now sees this:

[Image: the EC2 target is now available in the Boundary client]

The target is available. Jane clicks Connect to access the target and begin investigating the issue. She is signed in without having to provide any credentials:

[Image: an SSH session on the Ubuntu EC2 instance]

Summary

This post illustrated the steps necessary to set up HCP Boundary together with HCP Vault for a use-case where we have on-call engineers who are provided just-in-time access to resources in AWS.

Many of the concepts in the Boundary domain model were covered: users, groups, accounts, host catalogs, host sets, hosts, targets, credential stores, credential libraries, auth methods, scopes, and roles. A few important concepts we simply glossed over for now. I hope this post was useful in showing how many of these important concepts fit together, and how easily they can be configured using Terraform.

In future posts I will focus on other use-cases where there is a need for self-managed workers.

As a reminder, the source code for everything I showed in this post is available in my GitHub repository https://github.com/mattias-fjellstrom/hashicorp-boundary-demo/.