
Terragrunt was a band-aid. Stategraph fixes the wound.

TL;DR
$ cat stategraph-vs-terragrunt.tldr
• Terragrunt works around Terraform's state file limitations
• It solved real problems but introduced orchestration complexity
• Stategraph fixes the primitive with graph-based state and row-level locking
• External orchestration becomes unnecessary when dependencies are native

Terragrunt is dependency management duct tape for an underlying primitive that never scaled.

Terragrunt wasn't invented because people love wrapper tools. It exists because Terraform's state model forces teams to bend their infrastructure around a single, global, serialized state file.

That constraint spawned everything Terragrunt exists to manage. Giant repos split into "micro-stacks." Folder conventions to simulate graph boundaries. Home-grown orchestration rules. Wrappers to enforce order and dependency. Bespoke locking hacks. Glue everywhere to keep teams from stepping on each other.

This isn't a critique of Terragrunt. Terragrunt revealed something important. Terraform's core abstraction doesn't scale. Once you see that clearly, you can fix it properly.

The problem Terragrunt was created to solve

Terraform uses a single state file per root module. One file. One lock. Everything in that root shares the same blast radius, the same lock contention, the same refresh cycle.
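
In backend terms, that model is one shared state object and one coarse lock. A minimal sketch with the standard S3 backend (bucket and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/terraform.tfstate"   # one state object for the entire root module
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"          # one lock shared by every plan and apply
  }
}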

This works for small deployments. As infrastructure grows, the problems compound.

# What happens when your state file grows
Lock contention - Only one person can run at a time
Slow plans - Every plan refreshes all resources
Blast radius - One bad change can affect everything
No parallelism - Cross-module changes run serially
Team conflicts - Everyone working in the same state

Gruntwork saw this in 2016. Their customers needed to manage infrastructure across multiple teams and environments. Terraform's "one state = one root" model was the bottleneck. So they created Terragrunt.

The solution was elegant. Split infrastructure into many small states, each with its own backend, and orchestrate them together. One directory per component. One state file per directory. Dependency declarations to wire them together.

live/
├── prod/
│   ├── vpc/terragrunt.hcl      # State 1
│   ├── mysql/terragrunt.hcl    # State 2
│   └── app/terragrunt.hcl      # State 3
└── staging/
    └── ...                     # Same structure
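
Each leaf in that tree is a small, self-contained unit. A sketch of what the mysql unit might contain, with illustrative module paths and names:

# prod/mysql/terragrunt.hcl (sketch)
include "root" {
  path = find_in_parent_folders()     # inherit backend and provider config from the root
}
terraform {
  source = "../../modules/mysql"      # hypothetical module location
}
dependency "vpc" {
  config_path = "../vpc"              # wire this unit to the vpc unit's outputs
}
inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}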

This was a major improvement. Instead of one massive state file with a global lock, many small states with independent locks. Teams could work in parallel on different components. Blast radius was contained. The folder structure became the graph.

Terragrunt's value

Terragrunt is good at what it was designed for. It gives teams structure when Terraform refuses to. It enforces conventions and creates local graph boundaries through directory layout.

Define backend and provider config once, inherit everywhere. Declare cross-module dependencies, get outputs automatically. Run many Terraform processes in parallel, respecting dependency order. Retry transient failures. Run hooks before and after commands.
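
"Define once, inherit everywhere" usually lives in a root-level terragrunt.hcl along these lines (a sketch; bucket, region, and table names are placeholders):

# terragrunt.hcl at the repo root (sketch)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"  # one state per directory
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1"
}
EOF
}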

These aren't trivial features. Teams have built entire platform workflows around Terragrunt. It works.

The core limitation

Terragrunt works around a fundamental limitation. It's a wrapper playing traffic cop around a state file that has no concept of partial, parallel, or isolated execution. It has to emulate what Terraform never exposed.

Consider what Terragrunt actually does when you run terragrunt run-all apply. It traverses the directory tree to find all terragrunt.hcl files. Parses dependency blocks to build a DAG. Executes Terraform processes in topological order. Streams output from multiple concurrent processes. Injects outputs from parent states into child inputs.

This is sophisticated orchestration. But it's all external to Terraform. The underlying engine doesn't know about any of it. Terraform sees each directory as a completely independent root module. The graph exists only in Terragrunt's head.
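
That DAG isn't hidden; Terragrunt can print it on request. In older releases that's the graph-dependencies command, which emits DOT output (newer releases have reorganized the CLI, so the exact command name may differ):

$ cd live/prod
$ terragrunt graph-dependencies   # prints the unit dependency graph in DOT format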

Stategraph is a different category

Stategraph is not "Terragrunt but better." It's not solving the same problem in the same layer.

Stategraph replaces the primitive that forced Terragrunt to exist.

Instead of a single flat state file, you get state modeled as a graph. Row-level locking. Subgraph execution. Parallel plan and apply. APIs for querying, diffing, and visualizing the real dependency graph.

All the complexity Terragrunt managed externally becomes intrinsic to the system.

# Terragrunt approach
[Dir 1] → [State 1] → [Lock 1]
[Dir 2] → [State 2] → [Lock 2]
[Dir 3] → [State 3] → [Lock 3]
[Terragrunt orchestrates processes externally]
# Stategraph approach
[Single root module]
[Graph database with row-level locks]
[Parallel execution of independent subgraphs]

You don't need folders to create graph boundaries. You don't need wrappers to serialize execution. You don't need conventions as a stand-in for a missing abstraction. The system understands dependency, concurrency, and isolation.

The paradigm shift

Terragrunt taught us that splitting into many states was the answer. Stategraph asks a different question. What if the engine could handle one state that scales? Then you wouldn't need to split at all.

Terragrunt saw what Terraform missed

Terragrunt revealed something Terraform never acknowledged. Teams need safety, scalability, parallelism, and clear boundaries.

Terragrunt's answer was pragmatic: if Terraform won't give us these things natively, we'll build them on top.

And it worked. For years. Thousands of organizations ran on that pattern.

But it was always a workaround. The underlying primitive, a flat JSON file with a global lock, remained unchanged. Every feature Terragrunt added was compensating for that limitation.

# The diagnosis
Symptom: Terraform doesn't scale
Band-aid: Terragrunt (split into many states, orchestrate externally)
Disease: The state file model itself
Cure: Stategraph (fix the primitive)

Stategraph implements what Terragrunt emulated. Natively. In the backend.

Concrete examples

Let's make this concrete. Here's what the same infrastructure looks like in both models.

A simple dependency chain

You have three components: VPC, database, and application. The app depends on the database. The database depends on the VPC.

Terragrunt

prod/
├── vpc/
│   └── terragrunt.hcl      # dependencies: none
├── mysql/
│   └── terragrunt.hcl      # dependency "vpc"
└── app/
    └── terragrunt.hcl      # dependency "vpc"
                            # dependency "mysql"

$ terragrunt run-all apply
# 3 folders, 3 state files
# 3 terraform processes
# manual orchestration

Stategraph

infra/
├── main.tf         # module "vpc" {...}
│                   # module "mysql" {...}
│                   # module "app" {...}
└── backend.tf      # stategraph backend

$ stategraph apply
# 1 root, 1 state graph
# parallel subgraph execution
# automatic dependency ordering

Cross-stack references

Your application needs the database endpoint and the VPC ID.

Terragrunt

dependency "vpc" {
config_path = "../vpc"
}
dependency "mysql" {
config_path = "../mysql"
}
inputs = {
vpc_id = dependency.vpc.outputs.vpc_id
db_endpoint = dependency.mysql.outputs.endpoint
}
# Terragrunt reads other states
# Injects values as inputs
# Serializes applies
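
One consequence of wiring units together this way: if an upstream state doesn't exist yet, reading its outputs fails, so units typically declare mock outputs just to let plan run. A sketch with placeholder values:

dependency "mysql" {
  config_path = "../mysql"
  # placeholder values so `terragrunt plan` works before ../mysql has ever been applied
  mock_outputs = {
    endpoint = "mock-endpoint.example.com"
  }
}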

Stategraph

module "app" {
source = "./modules/app"
vpc_id = module.vpc.vpc_id
db_endpoint = module.mysql.endpoint
}
# Native Terraform references
# Dependency edges are explicit
# Planner computes minimal workset
# Independent changes run in parallel

Large enterprise deployment

You have 100 modules across multiple environments and teams.

Terragrunt

  • Hundreds of directories
  • Scripts around Terragrunt to manage CI
  • Parallelism throttled to avoid lock contention
  • 15x slowdown in some versions due to O(n²) config evaluation
  • Memory usage that balloons with module count

Stategraph

  • One graph
  • Row-level locking
  • Parallel apply of safe subgraphs
  • CI decoupled from repository layout
  • Query state with SQL
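
As a rough illustration of that last point, querying state could look something like this; the table and column names below are hypothetical placeholders, not Stategraph's actual schema:

-- hypothetical schema, for illustration only
SELECT address, type
FROM resources
WHERE type = 'aws_security_group';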

What this means in practice

Performance at scale

Terragrunt orchestrates 100 modules by running 100 separate Terraform processes. Each one initializes providers, reads state, refreshes resources. With Stategraph, it's one process, one state query, one graph traversal. The overhead of managing many processes disappears.

With Stategraph, parallelism is fine-grained and internal. Independent resources apply concurrently. The system knows the graph and can execute safe subgraphs in parallel without spawning separate processes.

CI/CD simplification

Terragrunt CI pipelines manage hundreds of individual plan/apply cycles. Determine which modules changed. Handle errors module-by-module. Aggregate outputs for review. With Stategraph, it's one plan, one apply. The pipeline logic simplifies dramatically.
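
As a rough sketch of the difference (the change detection here is simplified, and the stategraph plan subcommand is assumed by analogy with stategraph apply above):

# Terragrunt-style pipeline: find changed units, plan each one in its own process
changed_dirs=$(git diff --name-only origin/main...HEAD | grep 'terragrunt\.hcl$' | xargs -n1 dirname | sort -u)
for dir in $changed_dirs; do
  (cd "$dir" && terragrunt plan)
done

# Stategraph-style pipeline: one plan over the whole graph
stategraph plan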

This doesn't mean everything becomes easy. A unified plan for 100 modules produces a lot of output. But the complexity shifts from orchestration to review. That's a much more tractable problem.

Stop building wrappers

Terragrunt will continue to exist. Teams will still use it. It will still add value for plenty of workflows. Makefiles still exist even after Bazel and Nix.

But the shape of infrastructure tooling is changing.

We're moving past "wrap Terraform and hope for the best" into "fix the underlying state model so orchestration becomes a solved problem."

# The evolution of Terraform at scale
2015: One state file per root module
2016: Terragrunt splits into many states
2018: CI/CD pipelines to orchestrate Terragrunt
2020: Wrappers around wrappers
2025: Fix the primitive itself

Stategraph is not a wrapper. It's a new primitive.

Once you fix the primitive, entire classes of external tooling disappear.

Terragrunt covered Terraform's state problem. Stategraph fixes it.