The Terralith is correct. State fragmentation is the problem.
Masterpoint's article on Terraliths recommends splitting monolithic state into multiple files. But this treats the symptom, not the disease. When Terraform is slow, splitting state doesn't make it fast. It distributes the slowness across more coordination points. When teams block each other, multiple state files don't remove the contention; they convert it into coordination problems Terraform can't see. The Terralith isn't the anti-pattern. Accepting that your storage layer dictates your architecture is.
The pain is real
Masterpoint isn't wrong about the problems. They cite clients experiencing plan times exceeding 30 minutes, API timeouts, and rate-limit failures. These are real production issues causing real pain. Lock contention that blocks entire teams from working concurrently is broken. Blast radius on monolithic state is a genuine risk. When you're bleeding, you reach for a bandage.
But here's the offensive part: these are storage problems masquerading as architecture problems. When terraform plan takes thirty minutes, the root cause isn't that you have 2,847 resources in one state file. The root cause is that Terraform reads the entire state file, refreshes all 2,847 resources, and holds a global lock while doing it. Your infrastructure has a natural shape. The tool can't handle it. And somehow we've accepted that you should change your infrastructure to appease the tool.
This is backwards. State splitting doesn't eliminate these bottlenecks. It redistributes them across artificial boundaries while adding coordination overhead Terraform can't see or manage. You're fragmenting your infrastructure to work around broken storage primitives.
Fragmentation redistributes pain
Masterpoint recommends breaking monolithic structures into service-bounded root modules, using Terraform Workspaces for environment separation, and adopting wrapper tools like Atmos, Terramate, or Terragrunt. These are practical workarounds for teams in production who need relief today. But this is accepting defeat. We're reorganizing infrastructure to match tool limitations instead of fixing the tool to match infrastructure reality.
State splitting creates new problems.
When Stack A creates a VPC and Stack B needs the VPC ID, Terraform can't track that dependency. You use data sources, hardcoded values, or parameter passing through wrapper tools. The dependency exists in your infrastructure, but it's invisible to Terraform. When Stack A changes, Stack B doesn't know. Drift accumulates silently.
You now need to know which stack deploys first. Your CI pipeline encodes dependency ordering that should live in the infrastructure graph. When you add a new dependency, you update both Terraform configuration and deployment orchestration. The single source of truth is now two sources that must stay synchronized.
Splitting state doesn't eliminate lock contention. It confines it to smaller boundaries. If three engineers work on the same stack, they still block each other. You've just decided that blocking within service boundaries is acceptable. The fundamental problem (global locks preventing concurrent work on unrelated resources) remains unsolved.
You haven't reduced complexity. You've redistributed it. The coordination overhead that was implicit in Terraform's dependency graph is now explicit in your deployment tooling, your data source references, and your team's mental model of what depends on what.
Infrastructure has a natural shape
Infrastructure is a graph. VPCs contain subnets. Subnets contain instances. Instances reference security groups. Security groups reference other security groups. Load balancers reference instances. DNS references load balancers. Everything connects to everything else.
Teams don't accidentally create Terraliths. They create them because infrastructure is inherently interconnected. The Terralith reflects infrastructure's natural shape. Splitting it requires drawing arbitrary boundaries through a dense web of dependencies.
Where do you draw the line? By AWS service? By application tier? By team ownership? Every choice creates cross-boundary dependencies that Terraform can no longer see. Every boundary requires manual coordination.
Tools should conform to the problem domain, not the other way around. When your tool can't handle the natural structure of your data, the solution isn't to restructure your data. The solution is to fix the tool.
The real problem is storage
Terraform stores state as a JSON file guarded by a global lock. Plans read the entire state file even when you're modifying one resource: Terraform loads all 2,847 resources because file-based storage offers no way to read less. Refreshes query everything because the state file doesn't know what you're about to change. The lock is global because the file is the unit of atomicity. Dependencies are opaque because you can't query the graph without parsing 40MB of JSON.
These aren't limitations of monolithic architecture. These are limitations of using a file as a database. The performance problems and lock contention that Masterpoint's clients experience aren't caused by having 2,847 resources in one logical unit. They're caused by storing those resources in a format that can't support selective reads, partial refreshes, or granular locking.
What actually works
State is a graph. Represent it as a graph. When you store state as nodes and edges in a relational database instead of a JSON blob in a file, the problems Masterpoint identifies disappear without fragmentation.
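To make that concrete, here is a minimal sketch of what such a schema could look like in PostgreSQL. The table and column names are illustrative assumptions for this article, not any tool's actual schema:

    -- Each resource is a row; each dependency is an edge row.
    CREATE TABLE resources (
        address    text  PRIMARY KEY,         -- e.g. 'aws_vpc.main'
        attributes jsonb NOT NULL,            -- recorded resource attributes
        version    bigint NOT NULL DEFAULT 1  -- bumped on every write
    );

    CREATE TABLE dependencies (
        dependent  text NOT NULL REFERENCES resources (address),
        depends_on text NOT NULL REFERENCES resources (address),
        PRIMARY KEY (dependent, depends_on)
    );

Once state has this shape, "which resources does this change touch?" becomes a query, not a parse of a 40MB JSON blob.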
Resource-level locking replaces global locks. When you modify twelve resources, you lock those twelve and their dependents. Other engineers working on unrelated resources proceed in parallel. Lock granularity matches the dependency structure.
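In PostgreSQL terms that is ordinary row-level locking. A sketch against the illustrative schema above:

    -- Lock only the rows this apply will touch. Sessions locking
    -- disjoint rows proceed in parallel; a session that needs an
    -- already-locked row waits instead of failing the whole run.
    BEGIN;
    SELECT address
    FROM resources
    WHERE address IN ('aws_subnet.app', 'aws_instance.web')
    FOR UPDATE;
    -- ... apply the changes, bump versions ...
    COMMIT;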
Subgraph isolation replaces full-state operations. Plans compute the affected subgraph based on configuration changes, then read and refresh only those resources. A change to application configuration doesn't refresh networking infrastructure.
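Finding the affected subgraph is a question the database can answer directly. A sketch using a recursive CTE over the illustrative dependencies table:

    -- Everything that transitively depends on the changed resources.
    WITH RECURSIVE affected AS (
        SELECT address FROM resources
        WHERE address IN ('aws_vpc.main')   -- the planned changes
        UNION
        SELECT d.dependent
        FROM dependencies d
        JOIN affected a ON d.depends_on = a.address
    )
    SELECT address FROM affected;

Only the rows this query returns need to be read, refreshed, or locked; everything else stays untouched.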
MVCC semantics allow concurrent reads. Plans operate on consistent snapshots of the subgraph, while applies acquire write locks on modified resources. Multiple engineers can plan simultaneously without blocking.
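PostgreSQL gives you this behavior out of the box: reads take no row locks, and each transaction sees a consistent snapshot. A sketch:

    -- Two engineers can run this plan simultaneously: it reads a
    -- consistent snapshot of the subgraph and blocks nobody.
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT address, attributes
    FROM resources
    WHERE address IN ('aws_vpc.main', 'aws_subnet.app');
    COMMIT;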
This isn't theoretical. Stategraph implements this model with PostgreSQL. State lives as versioned rows with explicit foreign keys representing dependencies. Each resource has an optimistic lock version. Applies acquire row-level locks on the affected subgraph. Concurrent operations on disjoint subgraphs succeed. The database enforces consistency guarantees that file systems can't provide.
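An optimistic lock version is a compare-and-swap on a version column. A sketch of the general pattern, again against the illustrative schema rather than Stategraph's actual one:

    -- The write succeeds only if the row is unchanged since the
    -- plan read it at version 7.
    UPDATE resources
    SET attributes = '{"cidr_block": "10.0.0.0/16"}'::jsonb,
        version    = version + 1
    WHERE address = 'aws_vpc.main'
      AND version = 7;
    -- Zero rows updated means state moved underneath the plan:
    -- fail fast and re-plan instead of clobbering a concurrent write.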
Same 2,847 resources. Same infrastructure. Different storage. Plans that took thirty minutes take two seconds. Teams that blocked each other work in parallel. No fragmentation. No coordination overhead. No manual dependency tracking.
Stop accepting broken primitives
The Terraform ecosystem has normalized dysfunction. We treat thirty-minute plans as inevitable. We accept that concurrent work requires splitting state. We build wrapper tools like Terragrunt to work around limitations that shouldn't exist. We teach state fragmentation as a best practice. Masterpoint's article is well-intentioned, and their recommendations work within the constraints of today's tooling. That's the problem. We shouldn't have these constraints.
Infrastructure wants to be modular at the code level and monolithic at the state level. Your Terraform modules should have clear boundaries. Your state storage should reflect the actual dependency graph without artificial partitioning. The fact that we're even having this conversation, that we're debating how to split state files to make tools faster, is an indictment of the tooling, not the architecture.
State fragmentation isn't a solution. It's capitulation. It's saying "the tool's limitations matter more than our infrastructure's natural structure, so we'll contort our infrastructure to appease the tool." This is exactly backwards.
The Terralith was always correct. It's time we built storage that matches it.