
Terraliths Are the Natural Shape of Infrastructure

Josh Pollara · September 22nd, 2025
TL;DR
$ cat terraliths-natural-shape.tldr
• Infrastructure is a graph. Terraliths represent this correctly.
• State splitting: 1 problem → N problems + coordination overhead.
• Anti-pattern = accepting broken tools, not using Terraliths.

Search "Terraform Terralith" and you'll find the same advice repeated everywhere: break it up. But the Terralith isn't the anti-pattern. The acceptance of broken tooling is.

The Natural Shape of Infrastructure

Infrastructure has a shape. It's not the shape we draw on architecture diagrams with neat boxes and clean separation. It's a dense web of dependencies where everything connects to everything else.

Consider a typical production environment:

$ terraform graph | grep -c " -> "
3847
$ terraform graph | grep -c "aws_vpc"
186
$ terraform graph | grep -c "aws_iam_role"
412

Nearly 4,000 dependency edges. Resources that touch dozens of other resources. This isn't poor design. This is what infrastructure looks like when you build something real.

The Terralith captures this reality accurately. One module, one state file, one honest representation of how things actually connect. When engineers start with Terraform, they instinctively create Terraliths because that's the natural representation of their infrastructure.

Observation

Teams don't accidentally create Terraliths. They create them because infrastructure is inherently interconnected. The monolith is the natural shape. The split is the artificial construct.

The False Promise of State Splitting

The standard advice for Terralith "problems" is state splitting. Network stack here, compute stack there, data layer somewhere else. The promise: smaller blast radius, faster plans, parallel execution.

The reality is different. Consider what happens when you split a Terralith into three stacks:

Before: Terralith

• 2,847 resources
• 1 state file
• 1 lock
• Native dependencies
• One 4-minute plan
• Terraform manages the coordination

After: Split State

• 2,847 resources
• 3+ state files
• 3+ locks
• Manual dependencies
• 3 × 2-minute plans
• You manage the coordination

You haven't reduced complexity. You've redistributed it. The resources still exist. The dependencies still exist. But now Terraform can't see them.

Cross-stack dependencies: That RDS instance still needs the VPC from the network stack. Now you're passing outputs through data sources or, worse, hardcoding values.
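Concretely, the workaround looks something like the sketch below: the compute stack reads whatever outputs the network stack chose to publish through a terraform_remote_state data source (the bucket, key, and resource names here are hypothetical).

$ cat compute/network.tf
# The compute stack can no longer reference the network stack's subnets
# directly; it has to read them from the other stack's remote state.
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_db_subnet_group" "main" {
  name       = "main"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}

And it only works if the network stack remembers to declare a matching private_subnet_ids output; Terraform won't tell you when it doesn't.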

Deployment orchestration: Terraform can't tell you that the network stack needs to deploy before the compute stack. You discover this in production when things fail.
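In practice that ordering gets encoded by hand, in a wrapper script or CI pipeline along these lines (the stack directories are hypothetical):

$ cat deploy.sh
#!/usr/bin/env bash
set -euo pipefail
# The ordering Terraform used to infer from the graph, now maintained by hand.
# Run it in the wrong order and the failure shows up at apply time.
terraform -chdir=network apply -auto-approve
terraform -chdir=compute apply -auto-approve
terraform -chdir=data    apply -auto-approve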

Hidden drift: Stack A changes a security group rule. Stack B depends on that rule but doesn't know it changed. The drift surfaces weeks later during an incident.

We've trained an entire generation of engineers to treat these problems as normal. They're not normal. They're symptoms of accepting broken tooling.

Why Terraliths Actually Fail

Terraliths don't fail because they're too large. They fail because Terraform stores state wrong. Consider the actual failure modes:

Global lock contention: Engineer A modifying one IAM role blocks Engineer B from updating an unrelated S3 bucket. These operations share no dependencies, but the global lock doesn't care.

Full state refresh: Changing one resource triggers a refresh of all 2,847 resources. Terraform has the dependency graph. It knows only 12 resources need refreshing. It refreshes everything anyway.

Flat file storage: State is a graph stored as a JSON blob. Every operation deserializes the entire blob, operates on it in memory, and serializes it back. This is O(n) for operations that should be O(1).

$ time terraform plan -target=aws_iam_role.single_role
→ Acquiring state lock... (blocking all 2847 resources)
→ Reading entire state... (42MB JSON)
→ Refreshing... (2847 resources, but only need 1)
→ Plan: 1 to change
real 4m 31s

The Terralith didn't fail. The storage layer failed. The locking strategy failed. The refresh strategy failed. But instead of fixing these failures, we tell teams to split their infrastructure into unnatural pieces.

Technical Reality

A Terralith with 10,000 resources and proper subgraph operations would outperform 10 split stacks with 1,000 resources each. The problem was never the size. It was always the implementation.

The Cost of Accepting Broken Tools

Every team that splits their Terralith pays the same costs:

Orchestration layers: Terragrunt, Makefiles, shell scripts, CI/CD glue. Entire codebases dedicated to working around the fact that Terraform can't handle infrastructure at its natural scale.

Mental overhead: Engineers must maintain two models: how infrastructure actually connects, and how they've split it to appease the tooling. Every change requires translating between these models.

Operational risk: Dependencies Terraform can't see are dependencies that can break without warning. Every cross-stack reference is a potential production incident.

We've normalized this dysfunction so completely that we teach it as "best practice." We write books about optimal state splitting strategies. We build entire consultancies around helping teams split their Terraliths "correctly."

But there is no correct way to split a Terralith. There are only different ways to distribute the pain.

What Actually Works

The solution isn't to split the Terralith. It's to fix the storage.

State is a graph. Store it as a graph. In a database. With row-level locking. With MVCC. With proper indexes. With subgraph queries. This isn't revolutionary. It's how we store every other graph-structured dataset.
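As a minimal, Postgres-flavored sketch of the idea (not Stategraph's actual schema), resources become rows and dependencies become edges:

-- One row per resource, one row per dependency edge.
CREATE TABLE resources (
  id      bigserial PRIMARY KEY,
  address text UNIQUE NOT NULL,   -- e.g. "aws_iam_role.single_role"
  attrs   jsonb NOT NULL          -- the resource's recorded attributes
);

CREATE TABLE edges (
  from_id bigint NOT NULL REFERENCES resources (id),  -- the dependent resource
  to_id   bigint NOT NULL REFERENCES resources (id),  -- what it depends on
  PRIMARY KEY (from_id, to_id)
);

-- Index so the graph can be walked in either direction.
CREATE INDEX edges_to_idx ON edges (to_id);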

When state is stored correctly:

Subgraph locking: Modifying IAM roles doesn't block S3 operations. The lock scope matches the change scope.

Partial refresh: Changing one resource refreshes only its dependency cone. O(log n) instead of O(n).

True parallelism: Multiple engineers can modify disjoint subgraphs simultaneously. The database handles the coordination.
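Against the hypothetical tables sketched above, "refresh only what's affected" is a recursive query and "lock only what you touch" is a row lock on its result. Again, a sketch rather than Stategraph's implementation:

-- Walk everything aws_iam_role.single_role depends on, directly or
-- transitively, and lock only those rows until the change commits.
BEGIN;
WITH RECURSIVE cone AS (
  SELECT id FROM resources WHERE address = 'aws_iam_role.single_role'
  UNION
  SELECT e.to_id
  FROM edges e
  JOIN cone c ON c.id = e.from_id
)
SELECT r.*
FROM resources r
JOIN cone c ON c.id = r.id
FOR UPDATE OF r;
-- plan, refresh, and write back only these rows, then:
COMMIT;

Every other row in the state stays unlocked, so unrelated changes proceed in parallel.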

$ time stategraph plan -target=aws_iam_role.single_role
→ Computing affected subgraph... (1 resource, 3 deps)
→ Acquiring subgraph lock... (4 of 2847 resources)
→ Refreshing subgraph... (4 resources only)
→ Plan: 1 to change
real 0m 2.1s

Same Terralith. Same 2,847 resources. Two orders of magnitude faster. No splits. No orchestration. No lies about isolation.

This is what Stategraph does. Not because we're clever, but because we're doing the obvious thing: storing graph data in a graph-capable database.

The Real Anti-Pattern

The anti-pattern was never the Terralith. The anti-pattern is accepting that our tools can dictate our architecture.

When we split state to work around tooling limitations, we're not doing engineering. We're doing tool appeasement. We're letting implementation details of a state storage system dictate how we organize our infrastructure.

This is backwards. Tools should conform to the natural shape of the problem domain, not the other way around.

Infrastructure wants to be together because infrastructure is connected. The Terralith is correct. It always was. The tooling was wrong. It's time we fixed the tooling instead of our infrastructure.

Every team that splits their Terralith is admitting defeat. They're saying the tool's limitations matter more than their infrastructure's reality. We refuse to accept that.

Stop coordinating. Start shipping.

Resource-level locking. Graph-based state. SQL queries on your infra.
Teams work in parallel. No more lock contention.


// Zero spam. Just progress updates as we build Stategraph.