Disclaimer: I wrote the title before writing the content. In the end it did not turn out to be a guide, but rather something closer to a rant with some example pitfalls along the way. You be the judge!
I’m currently reading The Pragmatic Programmer by David Thomas and Andrew Hunt. Actually, I am re-reading it. I was recently going through all my books to see what I could sell or give away, and I stumbled upon this gem.
The Pragmatic Programmer covers many topics related to software development and has a number of tips along the way. In the section named How to program by coincidence we are given the advice:
Don’t Program by Coincidence
What does program by coincidence mean? Three main categories of coincidences are listed:
- Accidents of implementation
- Accidents of context
- Implicit assumptions
Accidents of implementation covers code that seems to work, but only because we have written it to fit the current status quo. Imagine a library that has an off-by-one error in one of the functions you are using, and you compensate for that by adding or subtracting one. Now your code works, but only for as long as the library stays broken.
Accidents of context covers situations where the code you are writing works in a particular context, most often your local developer laptop. Does it work in another context?
Implicit assumptions are assumptions you make without documenting or checking them. This usually leads to errors when those assumptions are no longer true.
What should you do instead of programming by coincidence? You should program deliberately. You should be aware of what you are doing when you write code. If your code works, understand why it works. If you are using a given tool, framework, or procedure, understand how it works. At no point in time should luck play a part in the code you are writing and how it behaves. Document your assumptions. Test your assumptions. Test your code. The list of what to do is long!
If you write a function to solve a specific problem, and all of a sudden it seems to work but you don’t quite understand why, what do you do? The idea here is that you should not accept this situation. You should strive to understand why it is working. Perhaps you will discover that by pure chance the function works for your specific input, but would not work in the general case.
I’m not a software developer in the sense that I write application code all day long, but what I do write a lot of is infrastructure-as-code. Is this concept applicable to infrastructure-as-code as well? Of course it is. It is still code.
In this post I would like to discuss implicit actions and abstractions when working with infrastructure-as-code. This is all related to programming deliberately. But let’s start by discussing ClickOps.
Why ClickOps leaves gaps in your understanding
The term ClickOps jokingly refers to performing the job of a DevOps engineer by clicking around in a graphical interface. Do you need to vertically scale your database in production? Just select the database size you want and then click apply!
ClickOps is a great way to get familiar with a platform, to quickly learn its capabilities, and to run proof-of-concepts. However, there is an inherent danger in ClickOps and there are many reasons to avoid it in production.
The danger I want to focus on here is present in most platforms, to one extent or another. I have experienced it firsthand in both AWS and Azure. I am referring to behind-the-scenes actions performed automatically for you by the platform when you are doing ClickOps. You create a resource with the click of a button, and the platform fulfills your wishes. However, it goes the extra mile and creates the five other resources that are required to actually make your desired resource work as intended.
I see two problems with this behavior. The first is the potential cost in dollars and cents, and the second is a lack of understanding of what is going on. We are experiencing something that smells like programming by coincidence.
If you have a working FinOps program, the cost problem might not have time to become an issue. But for the sake of argument let’s assume you don’t have any FinOps in place. I dare say that many organizations don’t. In AWS, costs could easily become a problem. This is not because AWS is more expensive than Azure (I am just using AWS and Azure as examples here), but because in AWS you are not forced to use resource groups to group your resources. This means you can create a resource in one of your accounts, in any one of the available regions, and then quickly lose track of it. If you are lucky you might notice it at the end of the month when the bill arrives. In Azure this is easier to spot because the resources will most likely be part of your resource group, so you will notice that they are there, and if you delete your resource group all the resources go away together.
Costs aside, what about your learning? If the platform goes this extra mile for you behind the scenes, chances are you do not realize exactly what it takes to get your resource to work. When you want to set up said resource using a code-first approach you will be back to square one. Just creating that one resource does not work, because we have fallen into the accidents of context trap.
My point here is that in a sense ClickOps involves luck. You are lucky that the platform helps you out, but this is at the cost of your understanding (and some of your dollars).
With a bit of extra luck it stops there. Since I gave AWS a hard time above, I should end this section by giving Azure an equally hard time. ClickOps in Azure has a tendency to make it all too simple to expose your resources publicly. Create an Azure App Service using ClickOps and it is almost guaranteed to be available for all to see, and attack.
Abstractions in code
AWS CDK and similar imperative infrastructure-as-code tools are good at providing abstractions. In a way, this does in code what ClickOps does for you in the GUI. You can rely on others to create an abstraction for whatever you want to create in your cloud provider, and trust that all the pieces will be set up correctly for you.
An example of an abstraction could be a website. You want to create a website, and you don’t care so much what exactly goes into setting it up. Your friendly platform team in your organization has conveniently created the perfect Website abstraction for you. You can create a Website resource, and you don’t have to care about configuring the WAF, CDN, database, compute runtime, etc., that go into this website.
There is a big difference between this and ClickOps, and that is that you can inspect what the code does for you. In the case of AWS CDK you will also end up with CloudFormation stacks that you can inspect. A stack in CloudFormation is similar in a way to a resource group in Azure. The abstraction in this case packages something up into a convenient container, but you can still see inside the container. With ClickOps you might not even be aware that there was a container to begin with.
Terraform also provides abstractions. First of all there are modules that work as abstractions, for instance the AWS VPC module that creates a Virtual Private Cloud (VPC) configured for you. If you have ever set up a VPC using CloudFormation you know that a lot of resources go into that, so an abstraction is a good thing. But in Terraform you also have providers that can provide abstractions. For instance, the Azure provider has a resource named azurerm_linux_function_app to create, well, a Linux Function App. In the next section I will discuss why this is such a great abstraction.
When there are no abstractions
A prime example of a lack of abstraction is setting up an Azure Function App using Bicep. Function Apps are the function-as-a-service offering from Azure. The idea is to upload a piece of code (a function) and have Azure execute this function based on triggers that you define. It is similar to AWS Lambda, with some extra bells and whistles. However, creating an instance of a Function App using Bicep leaves you wondering what on earth you are doing.
To start off, you need an instance of the Microsoft.Web/serverfarms resource. Because, of course, you need to run your functions on some server. This is true even if we are in the serverless realm here. There is always a server, even for AWS Lambda, but I would prefer that part to be abstracted away for me here.
An example of what this resource could look like in Bicep:
resource server 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: serverFarmName
  location: location
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}
Next we need to create an instance of the Microsoft.Web/sites resource. Our Function App is apparently a website of some sort. Of course this could be true: maybe our function is triggered by HTTP requests and maybe it returns HTML content. Then it would be a website. But it does not have to be. In fact, I would argue that it is extremely rare to return HTML from a function. An example of what this resource could look like:
resource functionApp 'Microsoft.Web/sites@2021-03-01' = {
  name: functionAppName
  location: location
  kind: 'functionapp'
  properties: {
    serverFarmId: server.id
    siteConfig: {
      appSettings: [...]
    }
  }
}
Note that I have left out a lot of configuration that you have to get right, in order for this to become a working Function App. This is not a trivial task.
What could Azure do to resolve this issue? I think there is a desperate need for a resource similar to this:
resource functionApp 'Microsoft.Function/apps@2023-12-07' = {
  name: functionAppName
  location: location
  properties: {
    ...
  }
}
The number of properties I need to configure should be greatly reduced, and most of them should have sensible defaults.
The resource provider is Microsoft.Function, which clearly illustrates what types of resources we are dealing with. The resource type is apps. Together, Microsoft.Function/apps clearly communicates that we are dealing with a Function App.
I am a big fan of the Bicep language, so it is unfortunate that the underlying resource APIs sometimes provide such a poor developer experience. I hope the Bicep team can provide an abstraction layer to solve issues like this.
Note that this was just one example, there are more.
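Until an abstraction like that exists at the platform level, a team can approximate one with a Bicep module. The sketch below is only an illustration: it assumes a hypothetical module file ./modules/function-app.bicep that wraps the serverfarms and sites resources shown above, and the file path, deployment name, and parameter names are all made up for the example.

```bicep
// main.bicep
// Hypothetical: ./modules/function-app.bicep is assumed to exist and to
// declare the parameters functionAppName and location.
param functionAppName string
param location string = resourceGroup().location

// Consumers only see two parameters; the serverfarms and sites details
// live inside the module file.
module functionApp './modules/function-app.bicep' = {
  name: 'function-app'
  params: {
    functionAppName: functionAppName
    location: location
  }
}
```

Unlike an implicit action, a module like this is ordinary Bicep that anyone can open and read, so the abstraction stays inspectable.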
Implicit actions in infrastructure-as-code
I want to highlight another issue that is present in Azure Bicep, or to be fair, in the underlying resource APIs.
When you create certain resources in Azure, implicit resources are created along with them. This is ClickOps behavior in infrastructure-as-code. One could argue that this is an abstraction that helps me, but the issue is that there is no way for me to inspect what goes on without reading about it in documentation somewhere. That is why this is ClickOps in infrastructure-as-code, and not an abstraction.
I’ll go through an example. In Azure there is a service named Storage Accounts. This is in fact a collection of a few storage services, of which the most commonly used is Blob Storage. The other services are File Storage, Queue Storage, and Table Storage. I can create a Storage Account like this:
resource storageAccount 'Microsoft.Storage/storageAccounts@2022-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
}
When I deploy this template with Bicep I will get a Storage Account together with the implicitly created Blob Storage, File Storage, Queue Storage, and Table Storage. These are not like other resources: I can’t see them listed as resources anywhere. They are part of the Storage Account.
Do I need to care about these implicit resources? Perhaps not. But let’s say I also want to create a blob container in my Storage Account where I can store my blobs. A first naive approach is to try something like this:
resource container 'Microsoft.Storage/containers@2022-05-01' = {
  name: containerName
  parent: storageAccount
}
This does not work. There is no resource type named Microsoft.Storage/containers. There is a resource type named Microsoft.Storage/storageAccounts/blobServices/containers, however. Now all of a sudden I do need to know that there is something called blobServices in the mix.
With this new knowledge your next attempt at creating the container resource might look something like this:
resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2022-05-01' = {
  name: containerName
  parent: storageAccount
}
This will not work either. To understand why, you need to know that the resource type hierarchy defines which resource can be a parent to which other resource. In this case we have the following hierarchy:
storageAccounts
└── blobServices
    └── containers
Only blobServices can be a parent to containers. But we do not have an instance of blobServices, so what do we do? There are two alternatives:
We could explicitly create it:
resource storageAccount 'Microsoft.Storage/storageAccounts@2022-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
  // adding blobServices as a nested resource
  resource blobServices 'blobServices' = {
    name: 'default'
  }
}
resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2022-05-01' = {
  name: containerName
  parent: storageAccount::blobServices
}
The other alternative is to refer to the blobServices resource by name, without declaring it explicitly:
resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2022-05-01' = {
  name: '${storageAccount.name}/default/${containerName}'
}
What the two examples have in common is that I used the name default for the blobServices resource. This is another quirk of how Azure works under the hood, but let’s ignore that for now. Just be aware that the name must be default; any other name will fail.
In the second example I must make sure that the number of segments in the name is the same as the number of segments in the resource type. In this case the resource type contains three segments: storageAccounts, blobServices, and containers. So the name must contain three segments as well: ${storageAccount.name}, default, and ${containerName}.
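A third option, not covered above, is to point at the implicitly created blobServices resource with the existing keyword and use that as the parent. The following is a minimal sketch, assuming the same parameter names as in the earlier examples; the symbolic name blobServices is just a name chosen for illustration. It is worth confirming with a what-if or a test deployment that the resources are deployed in the expected order before relying on it.

```bicep
param storageAccountName string
param containerName string
param location string = resourceGroup().location

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'StorageV2'
}

// "existing" declares a reference only; nothing new is deployed for this resource.
// The name must still be 'default'.
resource blobServices 'Microsoft.Storage/storageAccounts/blobServices@2022-05-01' existing = {
  name: 'default'
  parent: storageAccount
}

// The container can now use a parent of the correct type.
resource container 'Microsoft.Storage/storageAccounts/blobServices/containers@2022-05-01' = {
  name: containerName
  parent: blobServices
}
```

This keeps the hierarchy visible in the code without pretending to create the blob service yourself, but you still have to know that blobServices and the name default exist in the first place.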
If we ignore the issue with there being implicit resources to begin with, just understanding how to work with them is problematic to say the least. You need to know way too many details to create a blob container using Bicep. You can argue that it is easy to learn these details, but the question remains: why do we need to know these details?
Anyway, back to implicit resources. I don’t like this behavior. There is ClickOps magic happening through infrastructure-as-code, and it makes understanding how the code works much more complicated. We need to struggle a bit to understand what is going on in order to program deliberately. This ties us back to The Pragmatic Programmer.
Programming deliberately requires us to understand the tools and platforms that we use. Otherwise we will fall into implicit assumptions, accidents of context, and accidents of implementation. However, some tools and platforms make this job easier than others.
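One small way to act on this in Bicep is to write assumptions down where the tooling can check them, using parameter decorators. This is a minimal sketch, not a recommendation: the parameter skuName and the list of allowed SKUs are arbitrary choices for illustration.

```bicep
// The length limits below are checked by the tooling. The character rule
// (lowercase letters and digits) is only documented, not enforced here.
@description('Name of the Storage Account. Assumption: 3-24 characters, lowercase letters and digits only.')
@minLength(3)
@maxLength(24)
param storageAccountName string

@description('Assumption: the Storage Account lives in the same region as its resource group.')
param location string = resourceGroup().location

// Illustrative list: any other value is rejected instead of silently accepted.
@description('Assumption: only these redundancy options are acceptable.')
@allowed([
  'Standard_LRS'
  'Standard_GRS'
])
param skuName string = 'Standard_LRS'

resource storageAccount 'Microsoft.Storage/storageAccounts@2022-05-01' = {
  name: storageAccountName
  location: location
  sku: {
    name: skuName
  }
  kind: 'StorageV2'
}
```

An assumption that is written down like this fails loudly when it stops being true, instead of lingering as an implicit assumption.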
Summary
Programming deliberately is important for any code that you write, not least infrastructure-as-code. I am a big fan of writing declarative infrastructure-as-code because I prefer the transparency it provides. However, as we have seen, the tools are not always as transparent as one would like. As I mentioned before, I am a big fan of Azure Bicep, but the Azure platform has a few quirks that can be hard to learn and understand. This in turn leads you towards the territory of programming by coincidence.
Abstractions are great, as long as you are provided a way to see through the abstraction. Good abstractions help you towards programming deliberately.
Implicit actions, or implicit abstractions, are inherently evil, I would argue. They hide knowledge, which leads to misunderstanding, wrong assumptions, and most likely wrong or overly complicated code.