Terraform Confessions: Mistakes I've Made
TL;DR: This article dives into common but not-so-obvious Terraform pitfalls. We’ll explore why splitting your state too much can backfire and why a monorepo might be your best friend. We’ll uncover the hidden dangers of mutable state keys. Finally, we challenge the conventional wisdom on naming conventions and custom modules, guiding you on when (and when not) to use them.
# Introduction
My first contact with Terraform was back in 2018. After years of mastering Ansible, Chef, and Salt, it looked like just another Infrastructure as Code (IaC) tool. I couldn’t imagine that Terraform would soon become my go-to tool for a long, long time.
Terraform arrived with a new way of describing our infrastructure: HCL. Some might call it “disturbing” the infrastructure (and let’s be honest, sometimes it feels that way), but it was a breath of fresh air. With all its pros and cons, HCL, in my opinion, knocked YAML from its top spot. Even though HCL isn’t a full-blown programming language, Terraform borrows some of its best practices, like functions, locals (as variables), and variables (as constants). The code can be packaged, formatted, linted, and tested. It’s like we’re real software engineers now!
For many DevOps, SRE, and System Engineer folks who were fluent in IaC, Bash, and Python, Terraform was a revelation. But since these fine people don’t write code for a living, some software development practices aren’t always obvious. On the other hand, developers are often unaware that Terraform is stateful and, more importantly, why it’s stateful.
In this article, I’ll share some common problems I’ve seen in Terraform codebases over and over again and suggest how to mitigate them. I’m intentionally trying to avoid the most popular best practices you can find all over the web and focus on those that are not so obvious.
# Navigating the Terraform Minefield
# The Dangers of a Divided State
# The Problem
I still remember the massive shift from monoliths to microservices. It was the hottest thing in the IT community, a star at every tech conference. Everyone shared their success stories of moving to microservices and how they saved money, time, and resources. It felt like it became a “cool kids” thing, and if you were still working on a monolith, you were supposed to be ashamed of yourself. Even if not everyone understood what microservices really were, everyone followed the trend. Often, this resulted in a “distributed monolith” or a sidecar for the main app.
This brought a new challenge for the infrastructure folks. We could no longer configure one load balancer, a single database, and share one S3 bucket. We had to split our infrastructure into multiple, independently manageable parts. Each microservice needed to be isolated from its neighbors, so shared resources were a no-go. If one microservice gobbled up all the resources of a shared database, the whole system would go down. Since the main metric for infrastructure is stability, all infrastructure engineers learned this rule the hard way.
Everyone started to split their Terraform configurations into multiple pieces. Some teams used Terragrunt, but most just created multiple Git repositories, one per infrastructure layer, each with its own independent CI pipeline. Network became layer 1, compute layer 2, and storage layer 3. Oh, wait, I forgot to add the AWS Organization, so let’s make that layer 0.
So now we have four repositories, but they can’t be completely isolated. The compute layer still needs a VPC ID, and the IAM Role for accessing RDS in the storage repository depends on the EKS cluster’s OIDC endpoint from the compute layer. The worst solution would be to hardcode these artifacts into variables. A slightly less ugly approach is to use data lookups like terraform_remote_state.
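For illustration, here is what such a cross-layer lookup might look like with the `terraform_remote_state` data source (the bucket, key, and output names are assumptions, not taken from any real setup):

```hcl
# Read the outputs exported by the network layer's state.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-states" # illustrative bucket name
    key    = "network/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Consume the VPC ID in the compute layer.
resource "aws_security_group" "eks" {
  name   = "eks"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

This only works if the network layer explicitly declares `vpc_id` as an output, which creates exactly the implicit cross-repository contract described above.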
The problem is that since Terraform is a console tool and not a long-running daemon, it can’t detect configuration drift automatically. So, if for any reason your VPC resources were updated, you would need to run terraform apply on all the dependent layers. This is not only time-consuming but also error-prone, especially if you haven’t updated a state in a while.
The real “fun” begins when you need to add new core resources to your infrastructure. Now you have to create four pull requests instead of one. Something that could have been done in 10 minutes now takes hours, or even days, waiting for reviews and pipelines.
# The Solution
So what do I suggest to solve this problem? A monorepo. Now, don’t shoot the messenger! Hear me out. I’m not suggesting a single state for all your resources. Instead, we organize the code into multiple directories within a single repository. The layers and their dependencies remain the same. We still have the problem of updating child states, but we can solve that with some clever CI/CD rules. At the very least, adding a new base resource will be much easier. Tracking and reverting changes also becomes a breeze.
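As a sketch, the layers that previously lived in four repositories could sit side by side in one (directory names are illustrative):

```
infrastructure/
├── 0-organization/   # AWS Organization, accounts
├── 1-network/        # VPC, subnets, routing
├── 2-compute/        # EKS, node groups
└── 3-storage/        # RDS, S3, IAM roles for data access
```

Each directory still has its own state and its own pipeline; only the code moves under one roof.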
In a monorepo, we can enforce consistent standards for code quality, state management, and versioning. Updating the default Terraform version or a provider becomes effortless. Need to know where module X is used? Just run grep. Want to fix formatting across the codebase? Run terraform fmt -recursive from the repository root. I could keep listing the benefits all day.
Yes, the word monorepo has earned a mixed reputation. But it’s worth separating emotions from technical terms. Like any architectural choice, it comes with both pros and cons. There are no universally right or wrong decisions (with perhaps a few humorous exceptions). Every solution can be the best choice in a specific context, and for Terraform projects, I strongly believe a monorepo is often the better fit.
# Preventing State “Split-Brain”: The Case for Immutable Keys
# The Problem
This one is pretty obvious to many seasoned Terraform users, but it’s a trap for newcomers. If you set a dynamic key for your state, the next time you run Terraform, it will try to create all the resources again. Let’s hope it won’t break your infrastructure and just returns an “already exists” error. But it might create a “split-brain” scenario where you have two states for the same infrastructure. If you’ve never tried to merge two states together, trust me, it’s not a fun experience.
Many people build a human-readable hierarchy in the key name. I guess the idea is to easily find the state’s owner if you only have the key name. The backend configuration might look something like this:
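A sketch of such a “human-readable” backend configuration (all names here are made up for illustration):

```hcl
terraform {
  backend "s3" {
    bucket = "acme-terraform-states"
    # Every segment of this key can change: teams get renamed,
    # projects change hands, environments get restructured.
    key    = "platform-team/billing-service/production/terraform.tfstate"
    region = "eu-west-1"
  }
}
```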
However, not many consider that a team or project could be renamed in the future. Or a project might change ownership from one team to another. Even the organization’s name can change after a merger or acquisition. If this happens, you’ll find yourself in a situation where you either have to migrate your state or live with the old names. Maybe it’s fine if the changes only affect a few projects, but I wouldn’t want to be the one migrating hundreds of Terraform states.
# The Solution
To prevent this, choose an immutable key. It should be somehow connected to the project but never change. For example, GitHub and GitLab have a project ID that is unique for each project. It’s generated on project creation and stays the same even if the project is moved or renamed. And I assume it won’t be reused even after the project is deleted.
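With a project ID as the key, the backend might look like this (the ID and bucket name are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket = "acme-terraform-states"
    # 47583921 is a (hypothetical) GitLab project ID: it survives
    # renames and transfers between groups.
    key    = "47583921/terraform.tfstate"
    region = "eu-west-1"
  }
}
```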
Even a randomly generated UUID would be better than a mutable key.
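For example (the UUID below is obviously just an illustration):

```hcl
terraform {
  backend "s3" {
    bucket = "acme-terraform-states"
    # Generated once (e.g. with `uuidgen`) and never changed afterwards.
    key    = "6f1c2a9e-8d4b-4c8e-9f3a-2b7d5e1c0a4f/terraform.tfstate"
    region = "eu-west-1"
  }
}
```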
Sure, it may be less obvious who owns the state. However, in my opinion, that cost is far lower than manually importing 20 resources into the state while your pipeline is blocked from deploying a high-priority hotfix.
# The Tyranny of Naming Conventions
# The Problem
In the old days, we had a few bare-metal servers running multiple applications 🦣. If your birth year starts with 19, you probably remember this. You were lucky if the staging environment had its own dedicated server. One common risk was SSH-ing into a server to drop a few databases, only to realize you were connected to the wrong environment.
To reduce the chance of incidents, we started adding environment suffixes to files, directories, and resources. Another common practice was to include the file type in the name. For example, we added -db to database files or -api to API server hostnames. By the end of the day, we had names like postgres-dump-prod.sql and nginx_access_staging.log.
Old habits die hard. Today, I still see this approach reflected in Terraform resource naming.
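A sketch of the anti-pattern (resource and name are invented; required arguments are elided):

```hcl
# Anti-pattern: team, engine, environment, region, and resource type
# are all baked into the name, duplicating information AWS and
# Terraform already track for you.
resource "aws_db_instance" "billing_postgres_prod_eu_west_1_db" {
  identifier     = "billing-postgres-prod-eu-west-1-db"
  engine         = "postgres"
  instance_class = "db.t3.medium"
  # ...
}
```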
People add metadata to resource names, forgetting that this is no longer a Linux file. The RDS resource you see in AWS is a software entity accessed through the AWS API. AWS will never return a random resource: to retrieve a resource, you must specify the correct service, region, resource name, and API endpoint.
Terraform adds another layer of safety. Every resource reference already includes its type. There is no way to accidentally delete a database instead of an S3 bucket.
Some teams try to standardize naming by baking conventions directly into modules.
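Inside such a module, the convention is typically hardcoded into the resource itself (a sketch with assumed variable names):

```hcl
# Inside the module: every consumer is forced into this exact pattern.
resource "aws_db_instance" "this" {
  identifier = "${var.team}-${var.project}-${var.environment}-db"
  # ...
}
```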
This works until module users start shortening names. When “engineering-excellence” deploys the “confluent-cloud-adapter” database into the “production-europe” environment, you hit a “name too long” error. Shortening everything to “eng-exc”, “confcl-adapter”, and “prd” defeats the purpose of having a clear naming convention.
# The Solution
Move all metadata into resource tags (sometimes called labels). Keep resource names clean and minimal. Avoid hardcoding naming conventions inside modules whenever possible. Modules should remain flexible and opinion-free.
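A minimal sketch of this approach (names and tag keys are illustrative; required arguments are elided):

```hcl
resource "aws_db_instance" "billing" {
  identifier     = "billing"
  engine         = "postgres"
  instance_class = "db.t3.medium"
  # ...

  # All the metadata lives in tags, where it is queryable
  # and can change without renaming the resource.
  tags = {
    Team        = "platform"
    Project     = "billing"
    Environment = "production"
  }
}
```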
If you need naming conventions, put them in a separate module.
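One possible shape for this, assuming a hypothetical local `naming` module that exposes a `db_name` output:

```hcl
# The convention lives in one place and is opt-in.
module "naming" {
  source      = "./modules/naming" # hypothetical local module
  team        = "platform"
  project     = "billing"
  environment = "production"
}

resource "aws_db_instance" "billing" {
  identifier = module.naming.db_name
  # ...
}
```

Resource modules stay convention-free, and teams that want the convention simply wire the naming module’s outputs into their resources.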
Think of modules as functions: you pass parameters in, get results out, and then use those results to build other resources.
# To Module or Not to Module?
# The Problem
Another controversial one!
“What? Are you saying I shouldn’t create custom modules? But that’s a Terraform best practice!”
Yes… and no.
Custom modules are great for encapsulating and reusing code. They can reduce duplication and improve consistency. But they also introduce complexity and long-term maintenance costs. Creating a module for a resource that is used only once is a classic example of over-engineering.
Another common mistake: people forget that the moment you create a custom module, you own it. You are now responsible for updating, maintaining, testing, versioning, and documenting it. In an ideal world, every new version is clearly communicated to users. Every edge case and feature flag is tested. In reality, many teams end up with a custom module that has no release strategy, no changelog, and no clear ownership.
The irony is that modules are often created to avoid dependency on open-source modules — only to end up with an internal module that is harder to maintain and less reliable.
# The Solution
Before creating a custom module, ask yourself a few honest questions:
- Is there an existing open-source module that already solves this problem?
- Will this code be reused in at least two different places?
- Do we already have a clear release and versioning strategy for Terraform modules?
If the answer to all of these questions is “yes”, then go ahead and create the module. Otherwise, you are probably better off keeping the code directly in your main configuration.
Remember: the goal is to make your life easier, not to follow a “best practice” blindly.
I dare you, double dare you: avoid creating yet another RDS or EKS Terraform module. There are hundreds of engineers around the world contributing to the official terraform-aws-modules collection. Those modules have been tested millions of times, are well documented, and cover most real-world use cases.
I highly doubt your requirements are truly unique. Need customization? Great. Lock the version and wrap the official module with a thin custom layer. Your future self will thank you.
# Conclusion
Navigating the world of Terraform is a journey of continuous learning. The patterns we’ve discussed today, from state management and data sharing to naming and modularity, are not just abstract rules but hard-won lessons from the trenches. By questioning “best practices” and understanding the “why” behind them, you can build infrastructure that is not only functional but also scalable, maintainable, and resilient.