Terraliths Are the Natural Shape of Infrastructure
Search "Terraform Terralith" and you'll find the same advice repeated everywhere: break it up. But the Terralith isn't the anti-pattern. The acceptance of broken tooling is.
The Natural Shape of Infrastructure
Infrastructure has a shape. It's not the shape we draw on architecture diagrams with neat boxes and clean separation. It's a dense web of dependencies where everything connects to everything else.
Consider a typical production environment: 2,847 resources connected by nearly 4,000 dependency edges, with individual resources touching dozens of others. This isn't poor design. This is what infrastructure looks like when you build something real.
The Terralith captures this reality accurately. One module, one state file, one honest representation of how things actually connect. When engineers start with Terraform, they instinctively create Terraliths because that's the natural representation of their infrastructure.
Observation
Teams don't accidentally create Terraliths. They create them because infrastructure is inherently interconnected. The monolith is the natural shape. The split is the artificial construct.
The False Promise of State Splitting
The standard advice for Terralith "problems" is state splitting. Network stack here, compute stack there, data layer somewhere else. The promise: smaller blast radius, faster plans, parallel execution.
The reality is different. Consider what happens when you split a Terralith into three stacks:
Before: one Terralith. One state file, one graph, every dependency visible to Terraform.
After: three stacks. Same resources, same edges, but the edges that cross stack boundaries are now invisible to Terraform.
You haven't reduced complexity. You've redistributed it. The resources still exist. The dependencies still exist. But now Terraform can't see them.
Cross-stack dependencies: That RDS instance still needs the VPC from the network stack. Now you're passing outputs through data sources or, worse, hardcoding values.
Deployment orchestration: Terraform can't tell you that the network stack needs to deploy before the compute stack. You discover this in production when things fail.
Hidden drift: Stack A changes a security group rule. Stack B depends on that rule but doesn't know it changed. The drift surfaces weeks later during an incident.
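The cross-stack plumbing behind that first failure mode is usually wired up with Terraform's `terraform_remote_state` data source. A sketch of what it looks like in practice (the bucket, key, and output names here are invented for illustration):

```hcl
# compute stack: import the network stack's outputs by hand
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-tf-state"           # illustrative names
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_instance" "main" {
  # the dependency still exists; Terraform just can't plan across it
  db_subnet_group_name = data.terraform_remote_state.network.outputs.db_subnet_group
  # ...
}
```

If the network stack renames or removes the `db_subnet_group` output, nothing here fails until the next plan reads the stale state. Terraform cannot warn you, because the edge no longer exists in any graph it can see.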
We've trained an entire generation of engineers to treat these problems as normal. They're not normal. They're symptoms of accepting broken tooling.
Why Terraliths Actually Fail
Terraliths don't fail because they're too large. They fail because Terraform stores state wrong. Consider the actual failure modes:
Global lock contention: Engineer A modifying one IAM role blocks Engineer B from updating an unrelated S3 bucket. These operations share no dependencies, but the global lock doesn't care.
Full state refresh: Changing one resource triggers a refresh of all 2,847 resources. Terraform has the dependency graph; it knows only 12 resources need refreshing. It refreshes everything anyway.
Flat file storage: State is a graph stored as a JSON blob. Every operation deserializes the entire blob, operates on it in memory, and serializes it back. This is O(n) for operations that should be O(1).
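To make the flat-file cost concrete, here is a toy sketch of that access pattern. The real Terraform state schema is richer than this; the simplified shape below only illustrates the deserialize-everything, serialize-everything cycle:

```python
import json

def update_resource(state_blob: str, addr: str, new_attrs: dict) -> str:
    """Update one resource the flat-file way: parse everything,
    touch one entry, write everything back. O(n) in state size."""
    state = json.loads(state_blob)           # deserialize ALL resources
    for res in state["resources"]:
        if res["addr"] == addr:
            res["attributes"].update(new_attrs)
    return json.dumps(state)                 # serialize ALL resources back

# Even a one-attribute change pays for the whole blob:
blob = json.dumps({"resources": [
    {"addr": f"aws_s3_bucket.b{i}", "attributes": {"acl": "private"}}
    for i in range(2847)
]})
new_blob = update_resource(blob, "aws_s3_bucket.b0", {"acl": "public-read"})
```

One ACL change, and all 2,847 resources make the round trip through memory.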
The Terralith didn't fail. The storage layer failed. The locking strategy failed. The refresh strategy failed. But instead of fixing these failures, we tell teams to split their infrastructure into unnatural pieces.
Technical Reality
A Terralith with 10,000 resources and proper subgraph operations would outperform 10 split stacks with 1,000 resources each. The problem was never the size. It was always the implementation.
The Cost of Accepting Broken Tools
Every team that splits their Terralith pays the same costs:
Orchestration layers: Terragrunt, Makefiles, shell scripts, CI/CD glue. Entire codebases dedicated to working around the fact that Terraform can't handle infrastructure at its natural scale.
Mental overhead: Engineers must maintain two models: how infrastructure actually connects, and how they've split it to appease the tooling. Every change requires translating between these models.
Operational risk: Dependencies Terraform can't see are dependencies that can break without warning. Every cross-stack reference is a potential production incident.
We've normalized this dysfunction so completely that we teach it as "best practice." We write books about optimal state splitting strategies. We build entire consultancies around helping teams split their Terraliths "correctly."
But there is no correct way to split a Terralith. There are only different ways to distribute the pain.
What Actually Works
The solution isn't to split the Terralith. It's to fix the storage.
State is a graph. Store it as a graph. In a database. With row-level locking. With MVCC. With proper indexes. With subgraph queries. This isn't revolutionary. It's how we store every other graph-structured dataset.
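As a sketch of what "state as a graph in a database" could mean, here is a toy schema with a recursive subgraph query. SQLite is used only because it is handy for a self-contained example; the table layout and resource names are invented, not Stategraph's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE resources (addr TEXT PRIMARY KEY, attrs TEXT);
    CREATE TABLE edges (src TEXT, dst TEXT);   -- src depends on dst
""")
con.executemany("INSERT INTO resources VALUES (?, ?)", [
    ("aws_vpc.main", "{}"), ("aws_subnet.a", "{}"),
    ("aws_instance.web", "{}"), ("aws_s3_bucket.logs", "{}"),
])
con.executemany("INSERT INTO edges VALUES (?, ?)", [
    ("aws_subnet.a", "aws_vpc.main"),
    ("aws_instance.web", "aws_subnet.a"),
])

# Subgraph query: everything a VPC change can touch, in one statement.
rows = con.execute("""
    WITH RECURSIVE cone(addr) AS (
        SELECT :root
        UNION
        SELECT e.src FROM edges e JOIN cone c ON e.dst = c.addr
    )
    SELECT addr FROM cone
""", {"root": "aws_vpc.main"}).fetchall()
affected = sorted(a for (a,) in rows)
print(affected)  # the bucket, which never touches the VPC, stays out
```

Once the graph lives in tables like these, "which resources does this change affect" is a query, not a full-state walk, and the database's own locking can scope writes to the rows the query returns.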
When state is stored correctly:
Subgraph locking: Modifying IAM roles doesn't block S3 operations. The lock scope matches the change scope.
Partial refresh: Changing one resource refreshes only its dependency cone. The work scales with the size of the cone, not with the size of the entire graph.
True parallelism: Multiple engineers can modify disjoint subgraphs simultaneously. The database handles the coordination.
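All three properties reduce to one primitive: compute the subgraph a change can touch, and lock only that. A minimal sketch, with invented resource names:

```python
from collections import defaultdict, deque

def cone(edges: list[tuple[str, str]], changed: str) -> set[str]:
    """Everything a change to `changed` can affect: the node itself
    plus every resource that transitively depends on it."""
    dependents = defaultdict(list)
    for src, dst in edges:            # src depends on dst
        dependents[dst].append(src)   # so dst's change propagates to src
    seen, queue = {changed}, deque([changed])
    while queue:
        for nxt in dependents[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

edges = [
    ("aws_instance.web", "aws_subnet.a"),      # web instance needs subnet
    ("aws_subnet.a", "aws_vpc.main"),          # subnet needs VPC
    ("aws_s3_bucket.logs", "aws_iam_role.w"),  # disjoint island
]

iam_change = cone(edges, "aws_iam_role.w")     # one engineer edits IAM
net_change = cone(edges, "aws_subnet.a")       # another edits the network

# Disjoint lock scopes: both plans can run at once.
print(iam_change.isdisjoint(net_change))       # True
```

The IAM edit locks two resources, the network edit locks two others, and neither waits on the other. That is the whole argument against the global lock.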
Same Terralith. Same 2,847 resources. Two orders of magnitude faster. No splits. No orchestration. No lies about isolation.
This is what Stategraph does. Not because we're clever, but because we're doing the obvious thing: storing graph data in a graph-capable database.
The Real Anti-Pattern
The anti-pattern was never the Terralith. The anti-pattern is accepting that our tools can dictate our architecture.
When we split state to work around tooling limitations, we're not doing engineering. We're doing tool appeasement. We're letting implementation details of a state storage system dictate how we organize our infrastructure.
This is backwards. Tools should conform to the natural shape of the problem domain, not the other way around.
Infrastructure wants to be together because infrastructure is connected. The Terralith is correct. It always was. The tooling was wrong. It's time we fixed the tooling instead of our infrastructure.
Every team that splits their Terralith is admitting defeat. They're saying the tool's limitations matter more than their infrastructure's reality. We refuse to accept that.