Why We're Building Stategraph: Terraform State as a Distributed Systems Problem
The Terraform ecosystem has spent a decade working around a fundamental architectural mismatch: we're using filesystem semantics to solve a distributed systems problem. The result is predictable and painful.
When we started building infrastructure automation at scale, we discovered that Terraform's state management exhibits all the classic symptoms of impedance mismatch between data representation and access patterns. Teams implement increasingly elaborate workarounds: state file splitting, wrapper orchestration, external locking mechanisms. These aren't solutions; they're evidence that we're solving the wrong problem.
Stategraph addresses this by treating state for what it actually is: a directed acyclic graph of resources with partial update semantics, not a monolithic document.
The Pathology of File-Based State
Terraform state, at its core, is a coordination problem. Multiple actors (engineers, CI systems, drift detection) need to read and modify overlapping subsets of infrastructure state concurrently. This is a well-studied problem in distributed systems, with established solutions around fine-grained locking, multi-version concurrency control, and transaction isolation.
Instead, Terraform implements the simplest possible solution: a global mutex on a JSON file.
Observation
The probability of lock contention in a shared state file increases super-linearly with both team size and resource count. At 100 resources and 5 engineers, you're coordinating 500 potential interaction points through a single mutex.
Consider the actual data access patterns in a typical Terraform operation:
Current model: Read 100%, Lock 100%, Modify 0.5%
Actual requirement: Read 3%, Lock 3%, Modify 3%
This mismatch between granularity of operation and granularity of locking is the root cause of every Terraform scaling problem. It violates the fundamental principle of isolation in concurrent systems: non-overlapping operations should not block each other.
The standard response, splitting state files, doesn't solve the problem. It redistributes it. Now you have N coordination problems instead of one, plus the additional complexity of managing cross-state dependencies. You've traded false contention for distributed transaction coordination, which is arguably worse.
State as a Graph: The Natural Representation
Infrastructure state is inherently a directed graph. Resources have dependencies, which form edges. Changes propagate along these edges. Terraform already knows this: the internal representation is a graph, and the planner performs graph traversal. But at the storage layer, we flatten this rich structure into a blob.
This is akin to storing a B-tree in a CSV file. You can do it, but you're destroying the very properties that make the data structure useful.
When state is properly normalized into a graph database, several properties emerge naturally:
Subgraph isolation: Operations on disjoint subgraphs are inherently parallelizable. If Team A is modifying RDS instances and Team B is updating CloudFront distributions, there's no shared state to coordinate.
Precise locking: We can implement row-level locking on resources and edge-level locking on dependencies. Lock acquisition follows the dependency graph, preventing deadlocks through consistent ordering.
Incremental refresh: Given a change set, we can compute the minimal refresh set by traversing the dependency graph. Most changes affect a small cone of resources, not the entire state space.
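The subgraph isolation property can be sketched in a few lines. This is an illustration under stated assumptions, not Stategraph's implementation: the resource addresses, the `closure` and `can_run_in_parallel` helpers, and the rule that an operation's subgraph is its targets plus everything they depend on are all hypothetical choices made for the example.

```python
from collections import deque

# Hypothetical dependency edges: resource -> resources it depends on.
DEPENDS_ON = {
    "aws_db_instance.orders": ["aws_db_subnet_group.main"],
    "aws_db_subnet_group.main": [],
    "aws_cloudfront_distribution.site": ["aws_s3_bucket.assets"],
    "aws_s3_bucket.assets": [],
}

def closure(targets, depends_on):
    """Return the targets plus everything reachable along dependency edges."""
    seen = set(targets)
    queue = deque(targets)
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def can_run_in_parallel(targets_a, targets_b, depends_on):
    """Two operations conflict only if their subgraphs share a resource."""
    sub_a = closure(targets_a, depends_on)
    sub_b = closure(targets_b, depends_on)
    return sub_a.isdisjoint(sub_b)

team_a = ["aws_db_instance.orders"]            # Team A: RDS changes
team_b = ["aws_cloudfront_distribution.site"]  # Team B: CloudFront changes
parallel_ok = can_run_in_parallel(team_a, team_b, DEPENDS_ON)
```

In this toy graph the two teams' subgraphs share no resource, so neither needs to wait for the other. An operation that targets `aws_db_subnet_group.main` would overlap with Team A's subgraph and would have to be serialized against it.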
Concurrency Control Through Proper Abstractions
The distributed systems community solved these problems decades ago. Multi-version concurrency control (MVCC) allows readers to proceed without blocking writers. Write-ahead logging provides durability without sacrificing performance. Transaction isolation levels let operators choose their consistency guarantees.
Stategraph implements these patterns at the Terraform state layer:
Traditional: Global Lock
$ terraform apply
Acquiring global lock… waiting
Stategraph: Subgraph Isolation
$ stategraph apply
Locking subgraph (3 resources)… ready
Each operation acquires locks only on its subgraph. The lock manager uses the dependency graph to ensure consistent ordering, preventing deadlocks. Readers use MVCC to access consistent snapshots without blocking writers.
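To make the snapshot idea concrete, here is a toy multi-version store. It is a single-threaded sketch meant only to show why a reader pinned to a snapshot is unaffected by later writes; it is not how PostgreSQL or Stategraph implements MVCC, and the `VersionedStore` class and its methods are hypothetical names.

```python
class VersionedStore:
    """Toy multi-version store: writers append versions, readers pin a snapshot."""

    def __init__(self):
        self._versions = {}    # key -> list of (commit_id, value), oldest first
        self._last_commit = 0

    def commit(self, changes):
        """Record a dict of changes under one new commit id."""
        self._last_commit += 1
        for key, value in changes.items():
            self._versions.setdefault(key, []).append((self._last_commit, value))
        return self._last_commit

    def snapshot(self):
        """A snapshot is the id of the latest commit at the time of the call."""
        return self._last_commit

    def read(self, key, snapshot_id):
        """Return the newest value committed at or before snapshot_id."""
        for commit_id, value in reversed(self._versions.get(key, [])):
            if commit_id <= snapshot_id:
                return value
        return None

store = VersionedStore()
store.commit({"aws_security_group.web": {"ingress_ports": [443]}})
snap = store.snapshot()  # a reader (for example, a plan) pins its view here
store.commit({"aws_security_group.web": {"ingress_ports": [443, 8443]}})

old_view = store.read("aws_security_group.web", snap)
new_view = store.read("aws_security_group.web", store.snapshot())
```

The reader holding `snap` keeps seeing the first version even after the second commit lands, so it never has to wait for the writer, and the writer never has to wait for it.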
Implementation Detail
Lock acquisition follows a single total order derived from the resource dependency graph: resources are locked in topological order, with ties broken by resource ID. Because every transaction acquires its locks in that same order, no cycle of waiting transactions can form, which guarantees deadlock freedom without requiring global coordination.
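One way to derive such an ordering is Kahn's algorithm with a min-heap, so that resources with no ordering constraint between them are taken in resource ID order. The sketch below is an assumption about how this could be done, not a description of Stategraph's lock manager; `lock_order`, `acquisition_sequence`, and the sample resources are hypothetical.

```python
import heapq

def lock_order(depends_on):
    """Deterministic topological order: dependencies first, ties by resource ID."""
    nodes = set(depends_on)
    for deps in depends_on.values():
        nodes.update(deps)
    remaining = {n: set(depends_on.get(n, [])) for n in nodes}
    dependents = {n: [] for n in nodes}
    for node, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(node)

    ready = [n for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)  # smallest resource ID first among ready nodes
    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for child in dependents[node]:
            remaining[child].discard(node)
            if not remaining[child]:
                heapq.heappush(ready, child)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order

def acquisition_sequence(lock_set, rank):
    """Sort one transaction's lock set by the shared global rank."""
    return sorted(lock_set, key=rank.__getitem__)

DEPENDS_ON = {
    "aws_instance.web": ["aws_security_group.web", "aws_subnet.a"],
    "aws_security_group.web": ["aws_vpc.main"],
    "aws_subnet.a": ["aws_vpc.main"],
    "aws_vpc.main": [],
}
order = lock_order(DEPENDS_ON)
rank = {resource: i for i, resource in enumerate(order)}

txn_1 = acquisition_sequence({"aws_instance.web", "aws_vpc.main"}, rank)
txn_2 = acquisition_sequence({"aws_subnet.a", "aws_instance.web"}, rank)
```

Both transactions need `aws_instance.web`, and both reach it last. Neither can hold it while waiting for a lock that ranks earlier, which is the property that rules out a wait cycle.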
The result is a dramatic improvement in concurrent throughput:
[Figure: Transactions A, B, and C running concurrently]
Three teams, three transactions, zero contention. This isn't possible with file-based state, regardless of how you split it.
The Refresh Problem
Terraform refresh is O(n) in the number of resources, regardless of change scope. Change one security group rule and you still walk the entire state. That's an algorithmic bottleneck, not just an implementation detail.
File-based state: changing 1 resource, refreshing all 30
Graph state: changing 1 resource, refreshing only 3
With a graph representation, refresh work can be scoped to the affected subgraph instead of the entire state. Most changes touch only a small fraction of resources, not everything.
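A scoped refresh can be sketched as a traversal over reverse dependency edges. Which resources belong in the refresh set is a design choice; this example assumes it is the changed resources plus everything that transitively depends on them. The `refresh_set` helper and the six-resource graph are hypothetical.

```python
from collections import defaultdict, deque

def refresh_set(changed, depends_on):
    """Changed resources plus everything that transitively depends on them."""
    dependents = defaultdict(list)  # reverse edges: dependency -> dependents
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(resource)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in dependents[node]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

DEPENDS_ON = {
    "aws_vpc.main": [],
    "aws_security_group.web": ["aws_vpc.main"],
    "aws_instance.web": ["aws_security_group.web"],
    "aws_lb.public": ["aws_security_group.web"],
    "aws_s3_bucket.logs": [],
    "aws_db_instance.orders": ["aws_vpc.main"],
}
to_refresh = refresh_set(["aws_security_group.web"], DEPENDS_ON)
```

Changing the security group here yields a refresh set of three resources out of six: the group itself, the instance, and the load balancer. The VPC, the bucket, and the database are never visited, so the work scales with the size of the affected cone instead of the size of the state.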
Why We Built This
At Terrateam, we've watched hundreds of teams struggle with the same fundamental problems. They start with a single state file, hit scaling limits, split their state, discover coordination complexity, build orchestration layers, and eventually resign themselves to living with the pain.
This is a solvable problem. The computer science is well-understood. The implementation is straightforward once you acknowledge that state management is a distributed systems problem, not a file storage problem.
Stategraph isn't revolutionary. It's the application of established distributed systems principles to a problem that's been mischaracterized since its inception. We're not inventing new algorithms; we're applying the right ones.
Design Principle
The storage layer should match the access patterns. Terraform state exhibits graph traversal patterns, partial update patterns, and concurrent access patterns. The storage layer should be a graph database with ACID transactions and fine-grained locking. Anything else is impedance mismatch.
The infrastructure industry has accepted file-based state as an immutable constraint for too long. It's not. It's a choice, and it's the wrong one for systems at scale.
Technical Implementation
Stategraph is implemented as a PostgreSQL schema with a backend that speaks the Terraform/OpenTofu remote backend protocol. We chose PostgreSQL for its robust MVCC, proven scalability, and operational familiarity. The schema normalizes state into three primary relations:
resources: one row per resource, with type, provider, and attribute columns.
dependencies: edge table representing the resource dependency graph.
transactions: append-only log of all state mutations with full attribution.
The backend extends Terraform's protocol with graph-aware operations. Lock acquisition and state queries operate directly on the database representation of the graph, enabling precision and concurrency that file-based backends can't provide.
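The three relations might look roughly like the following. This is a sketch only: every column name is a guess made for illustration, and SQLite (through Python's standard library) stands in for PostgreSQL so the example is self-contained. It says nothing about Stategraph's real schema.

```python
import json
import sqlite3

# Hypothetical columns; the source only names the three relations.
SCHEMA = """
CREATE TABLE resources (
    address    TEXT PRIMARY KEY,   -- e.g. aws_subnet.a
    type       TEXT NOT NULL,
    provider   TEXT NOT NULL,
    attributes TEXT NOT NULL       -- JSON document
);
CREATE TABLE dependencies (
    resource   TEXT NOT NULL REFERENCES resources(address),
    depends_on TEXT NOT NULL REFERENCES resources(address),
    PRIMARY KEY (resource, depends_on)
);
CREATE TABLE transactions (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    actor    TEXT NOT NULL,
    resource TEXT NOT NULL,
    action   TEXT NOT NULL,
    recorded TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

with conn:  # one transaction: resource rows, the edge, and the log entries
    conn.executemany(
        "INSERT INTO resources VALUES (?, ?, ?, ?)",
        [
            ("aws_vpc.main", "aws_vpc", "aws", json.dumps({"cidr_block": "10.0.0.0/16"})),
            ("aws_subnet.a", "aws_subnet", "aws", json.dumps({"cidr_block": "10.0.1.0/24"})),
        ],
    )
    conn.execute(
        "INSERT INTO dependencies VALUES (?, ?)", ("aws_subnet.a", "aws_vpc.main")
    )
    conn.executemany(
        "INSERT INTO transactions (actor, resource, action) VALUES (?, ?, ?)",
        [("ci", "aws_vpc.main", "create"), ("ci", "aws_subnet.a", "create")],
    )

# State becomes queryable: which resources depend on the VPC?
dependents = [
    row[0]
    for row in conn.execute(
        "SELECT resource FROM dependencies WHERE depends_on = ?", ("aws_vpc.main",)
    )
]
log_entries = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

Writing the resource rows, the edge, and the log entries in one transaction keeps the graph and its audit trail consistent with each other, which is the kind of guarantee a JSON blob cannot offer at sub-file granularity.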
This isn't a wrapper or an orchestrator. It's a replacement for the storage layer that preserves Terraform's execution model while fixing its coordination problems.
Adoption Path
Stategraph reads existing tfstate files and constructs the graph representation automatically. No changes to Terraform configurations are required. The backend protocol is unchanged. From Terraform's perspective, Stategraph is just another backend, like S3 or GCS.
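The import step can be pictured as a walk over the state file's resource list. The sketch below assumes the version 4 state layout, in which each resource instance carries a `dependencies` list of resource addresses. It ignores modules, `count`/`for_each` instance keys, and data sources, and the `build_graph` helper and the hand-written sample state are hypothetical.

```python
def build_graph(tfstate):
    """Turn a parsed tfstate document into (nodes, edges)."""
    nodes = {}
    edges = set()  # (resource, dependency) pairs
    for res in tfstate.get("resources", []):
        if res.get("mode") != "managed":
            continue  # data sources are skipped in this sketch
        address = f'{res["type"]}.{res["name"]}'
        nodes[address] = {"type": res["type"], "provider": res["provider"]}
        for instance in res.get("instances", []):
            for dep in instance.get("dependencies", []):
                edges.add((address, dep))
    return nodes, edges

# A trimmed, hand-written document in the assumed shape of a v4 state file.
TFSTATE = {
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_vpc",
            "name": "main",
            "provider": 'provider["registry.terraform.io/hashicorp/aws"]',
            "instances": [{"attributes": {"cidr_block": "10.0.0.0/16"}}],
        },
        {
            "mode": "managed",
            "type": "aws_subnet",
            "name": "a",
            "provider": 'provider["registry.terraform.io/hashicorp/aws"]',
            "instances": [
                {
                    "attributes": {"cidr_block": "10.0.1.0/24"},
                    "dependencies": ["aws_vpc.main"],
                }
            ],
        },
    ],
}

nodes, edges = build_graph(TFSTATE)
```

The resulting `nodes` and `edges` map directly onto a resource table and a dependency edge table, which is why no change to the Terraform configuration itself is needed for the conversion.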
But from an operational perspective, everything changes. Lock contention disappears. Refresh times drop by orders of magnitude. Teams stop blocking each other. State becomes queryable, auditable, and comprehensible.
We're not asking teams to rewrite their infrastructure. We're asking them to store it properly.
The question isn't whether Terraform state should be a graph. It already is. The question is whether we'll continue pretending it's a file.
Technical Preview
Stategraph is in active development. We're working with design partners to validate the approach at scale.