
Building a Terraform control plane: From faster state to event-driven reconciliation

Product Engineering Vision
TL;DR
$ cat terraform-control-plane.tldr
• Terraform's state is the bottleneck. Fix it, and you unlock everything else.
• Three phases: faster state, control plane, event-driven reconciliation
• Kubernetes proved this progression works. Infrastructure is catching up.

Terraform's state system is the bottleneck holding back everything we want to build. Fix state, and you unlock continuous reconciliation. Fix reconciliation, and you can react to events instead of polling. Kubernetes proved this progression works a decade ago—infrastructure tooling is finally catching up.

Phase 1: The foundation is state

Terraform's state file is the bottleneck. It's a single, flat JSON file with a global lock. By default, every plan refreshes every resource. Applies against the same state run one at a time, because each one holds the lock until it finishes, so only one person can make changes at a time. This isn't a minor inconvenience. It's the architectural constraint that makes everything else impossible.

You can't build continuous reconciliation on top of a flat file that locks globally. You can't parallelize operations when everything shares the same lock. You can't query relationships when state is opaque JSON. The state system has to change first.

First comes inventory management with queryable state and resource relationships. Then faster plan and apply with intelligent parallelization across the graph. This is the foundation—graph-based state that enables everything that comes after.
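As a rough sketch of what parallelizing across the graph could look like, the snippet below groups a made-up resource graph into waves using Python's standard `graphlib`. The resource addresses and the `parallel_waves` helper are illustrative assumptions, not Stategraph's API.

```python
from graphlib import TopologicalSorter

# Hypothetical resource graph: each key depends on the resources in its set.
DEPENDS_ON = {
    "aws_vpc.main": set(),
    "aws_subnet.a": {"aws_vpc.main"},
    "aws_subnet.b": {"aws_vpc.main"},
    "aws_security_group.web": {"aws_vpc.main"},
    "aws_instance.web": {"aws_subnet.a", "aws_security_group.web"},
}

def parallel_waves(depends_on):
    """Group resources into waves; everything in one wave can run concurrently."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all dependencies already finished
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock the next one
    return waves

waves = parallel_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The two subnets and the security group only depend on the VPC, so they land in the same wave and could be applied at the same time.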

Phase 2: The control plane

Once you have queryable state, you can build a control plane. Continuous reconciliation becomes possible because you can query the graph to understand what needs to reconcile. Drift detection works because you can compare desired state to actual state efficiently. Auto-remediation works because you have the primitives to orchestrate changes.
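To make the drift comparison concrete, here is a minimal sketch that diffs desired state against actual state. It assumes both are plain dictionaries keyed by resource address. The `detect_drift` helper, the `__exists__` marker, and the sample attributes are hypothetical.

```python
def detect_drift(desired, actual):
    """Return {address: {attribute: (desired, actual)}} for everything that differs."""
    drift = {}
    for address, want in desired.items():
        have = actual.get(address)
        if have is None:
            # The resource is declared but was not found in the actual state.
            drift[address] = {"__exists__": (True, False)}
            continue
        diff = {
            key: (value, have.get(key))
            for key, value in want.items()
            if have.get(key) != value
        }
        if diff:
            drift[address] = diff
    return drift

# Illustrative data: someone added a tag by hand, and the bucket is missing.
desired = {
    "aws_security_group.web": {"ingress_port": 443, "tags": {"team": "web"}},
    "aws_s3_bucket.logs": {"versioning": True},
}
actual = {
    "aws_security_group.web": {"ingress_port": 443, "tags": {"team": "web", "debug": "1"}},
}

drift = detect_drift(desired, actual)
```

A real implementation would also have to handle computed attributes and resources that exist but are not declared. This sketch only checks declared attributes.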

This is where Kubernetes patterns really apply. Kubernetes controllers continuously reconcile cluster state with the desired configuration: they watch for drift, retry until they succeed, and do it at massive scale. Infrastructure needs the same thing: state that reconciles automatically, operations that retry instead of failing, drift that gets fixed instead of reported.
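The retry-until-converged pattern can be sketched in a few lines. Everything here is an assumption made for illustration: the `reconcile` signature, the `TransientError` type, the exponential backoff, and the fake `observe` and `apply` callables that stand in for real provider calls.

```python
import time

class TransientError(Exception):
    """A failure worth retrying, such as throttling or a timeout."""

def reconcile(address, desired, observe, apply,
              max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Drive one resource toward `desired`; return how many passes it took."""
    for attempt in range(1, max_attempts + 1):
        try:
            if observe(address) == desired:
                return attempt  # converged, nothing left to do
            apply(address, desired)
        except TransientError:
            # Back off exponentially, then let the next pass try again.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{address} did not converge after {max_attempts} attempts")

# Fake cloud for the demo: the first apply is throttled, the second succeeds.
cloud, failures, delays = {}, [TransientError("throttled")], []

def observe(address):
    return cloud.get(address)

def apply(address, desired):
    if failures:
        raise failures.pop()
    cloud[address] = desired

passes = reconcile("aws_security_group.web", {"ingress_port": 443},
                   observe, apply, sleep=delays.append)
```

The loop never trusts a successful apply on its own. It observes again on the next pass and only returns once actual state matches desired state.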

[Figure: Comparison showing how Kubernetes patterns map to the Stategraph infrastructure control plane]

After the foundation is in place, we ship the Terraform control plane with auto-reconciliation, drift detection that actually fixes things instead of just reporting them in Slack channels nobody monitors, and YAML interfaces for control plane configuration. This is Crossplane-style reconciliation, but for Terraform—staying in the ecosystem while getting operational resilience.

Stay in the ecosystem

The goal is to let teams stay within mature ecosystems like Terraform and OpenTofu, tools that work and have massive provider ecosystems. What changes is the underlying machinery, so teams get auto-reconciliation, proper drift handling, and queryable state without throwing away patterns that already work.

Phase 3: Event-driven reconciliation

Here's what comes after the control plane. EventBridge integration for AWS, Event Grid for Azure, Pub/Sub for GCP, all feeding into Stategraph so it discovers changes happening around Terraform through real events instead of polling. A system that listens, dispatches, senses, and reconciles based on what actually changed instead of running the entire graph every time because one security group tag got modified.
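One way to picture the listen-and-dispatch step is a work queue that coalesces events per resource, so a burst of changes to one security group triggers one reconcile for that resource and nothing else. The event shape below is a hypothetical normalized form, not the actual EventBridge, Event Grid, or Pub/Sub payload, and `ReconcileQueue` is an illustrative name.

```python
from collections import OrderedDict

class ReconcileQueue:
    """Coalesces events so each resource is reconciled once per burst."""

    def __init__(self):
        self._pending = OrderedDict()  # address -> latest event for it

    def publish(self, event):
        # A newer event for the same address replaces the older one.
        self._pending[event["address"]] = event

    def drain(self, reconcile):
        """Reconcile only the resources that actually received events."""
        handled = []
        while self._pending:
            address, event = self._pending.popitem(last=False)  # oldest first
            reconcile(address, event)
            handled.append(address)
        return handled

queue = ReconcileQueue()
for event in [
    {"source": "eventbridge", "address": "aws_security_group.web", "change": "tag added"},
    {"source": "eventbridge", "address": "aws_security_group.web", "change": "tag removed"},
    {"source": "pubsub", "address": "google_storage_bucket.logs", "change": "acl changed"},
]:
    queue.publish(event)

seen = []
handled = queue.drain(lambda address, event: seen.append((address, event["change"])))
```

Three events arrive, but only two resources are reconciled, and the rest of the graph is never touched.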

This is the Kubernetes model fully realized for infrastructure. Declarative configuration goes into the state graph, controllers watch for changes through event streams, and reconciliation loops drive infrastructure toward desired state without human intervention. Events propagate through the system in real time, so controllers can react immediately instead of waiting for the next poll cycle.

The entire system becomes queryable. You can ask "which resources depend on this security group" or "what changed in production in the last hour" without parsing state files or scraping logs. This is what Kubernetes gives you for containers—we're building it for infrastructure.
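Both of those questions become ordinary queries once state lives in a store you can query. The sketch below uses an in-memory SQLite database purely as a stand-in. The `resources` and `edges` tables, their columns, and the sample rows are assumptions for illustration and say nothing about Stategraph's actual schema.

```python
import sqlite3

NOW = 1_700_000_000  # fixed "current time" in epoch seconds for the demo

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resources (address TEXT PRIMARY KEY, env TEXT, changed_at INTEGER);
CREATE TABLE edges (resource TEXT, depends_on TEXT);
""")
conn.executemany("INSERT INTO resources VALUES (?, ?, ?)", [
    ("aws_security_group.web", "production", NOW - 120),
    ("aws_instance.web", "production", NOW - 7200),
    ("aws_lb.web", "production", NOW - 600),
    ("aws_s3_bucket.logs", "staging", NOW - 60),
])
conn.executemany("INSERT INTO edges VALUES (?, ?)", [
    ("aws_instance.web", "aws_security_group.web"),
    ("aws_lb.web", "aws_instance.web"),
])

def dependents_of(address):
    """Which resources depend on this one, directly or transitively?"""
    rows = conn.execute("""
        WITH RECURSIVE dependents(address) AS (
            SELECT resource FROM edges WHERE depends_on = ?
            UNION
            SELECT e.resource FROM edges e
            JOIN dependents d ON e.depends_on = d.address
        )
        SELECT address FROM dependents ORDER BY address
    """, (address,))
    return [row[0] for row in rows]

def changed_since(env, since):
    """What changed in this environment at or after the given timestamp?"""
    rows = conn.execute(
        "SELECT address FROM resources WHERE env = ? AND changed_at >= ? ORDER BY address",
        (env, since),
    )
    return [row[0] for row in rows]

depends_on_sg = dependents_of("aws_security_group.web")
recent = changed_since("production", NOW - 3600)
```

The recursive query walks the dependency edges in the database, so no state file has to be parsed to answer either question.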

[Figure: Event-driven reconciliation architecture showing cloud events feeding into the Stategraph control plane]

Declarative meets operational

Imagine you could write something with the robustness of a custom Kubernetes controller by declaring a Terraform module instead, where the reconciliation behavior, the retry logic, the drift detection, and the event handling all came from the platform. Infrastructure controllers as declarative as Terraform modules but as operationally robust as hand-written Go controllers watching etcd.

This ties into clickops too (the thing everyone does but nobody admits). Developers make manual changes in cloud consoles all the time. Those changes fire events into EventBridge, Event Grid, or Pub/Sub. Stategraph sees the event and reconciles state by either auto-fixing the drift or opening a PR with the detected change. The control loop stays intact instead of silently diverging until someone runs a plan three weeks later and wonders why production doesn't match the code.
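The choice between auto-fixing and opening a PR has to come from some policy. As a purely hypothetical example, the sketch below reverts manual changes that only touch low-risk attributes and routes everything else to a PR. The event shape, the `SAFE_TO_REVERT` set, and the function name are all assumptions.

```python
# Hypothetical policy: attributes that are safe to revert without review.
SAFE_TO_REVERT = frozenset({"tags", "description"})

def respond_to_manual_change(event, safe_to_revert=SAFE_TO_REVERT):
    """Decide how to close the loop on a change made outside Terraform."""
    changed = set(event["changed_attributes"])
    if changed and changed <= safe_to_revert:
        # Low-risk drift: put the resource back to its declared state.
        return {"action": "auto_fix", "address": event["address"]}
    # Anything else gets surfaced for a human to accept or reject.
    return {
        "action": "open_pr",
        "address": event["address"],
        "attributes": sorted(changed),
    }

tag_edit = respond_to_manual_change(
    {"address": "aws_security_group.web", "changed_attributes": ["tags"]}
)
rule_edit = respond_to_manual_change(
    {"address": "aws_security_group.web", "changed_attributes": ["ingress", "tags"]}
)
```

Either way the change is acted on when the event arrives, rather than discovered during a plan weeks later.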

Why existing solutions miss the mark

System Initiative went the other way. Rather than staying in the existing ecosystem, it tried to reinvent the entire model of infrastructure automation around a continuously evaluated, real-time state engine instead of incremental plans, but the industry never fully converged on that approach.

Crossplane and AWS Controllers for Kubernetes (ACK) got the control plane pattern right—continuous reconciliation is exactly what infrastructure needs. But they're missing two pieces: queryable state at the foundation, and event-driven architecture at the top.

Without queryable state, you can't efficiently determine what needs to reconcile. Crossplane treats Terraform/AWS/Azure resources as opaque—it can't understand dependencies, can't parallelize intelligently, can't answer questions about relationships. This makes reconciliation slow and expensive.

Without event-driven architecture, you're stuck polling. Unlike Kubernetes controllers that watch the API server for events, Crossplane and ACK have to constantly poll cloud APIs for changes. In large environments this leads to rate limiting and exceeded API quotas as you try to manage thousands of resources across multiple accounts.

The three-phase model

You need all three phases. Queryable state makes reconciliation efficient. The control plane makes it continuous. Event-driven architecture makes it reactive. Skip one, and you're either slow, unreliable, or both.

Building what the industry needs

This isn't a weekend project or a feature you bolt onto existing tooling. It's a fundamental rethinking of how infrastructure gets deployed and managed. This requires deep expertise in distributed systems and infrastructure tooling, careful API design that doesn't leak implementation details into user-facing interfaces, obsessive attention to operational details like what happens when the database connection fails mid-transaction, and the willingness to solve hard problems that don't have obvious solutions.

Two horizons of complexity

The first horizon covers inventory and faster plan/apply and is concrete, achievable, and valuable on its own. The second horizon focuses on the full control plane with auto-reconciliation and event-driven reactive infrastructure and is ambitious, technically complex, and exactly what the industry needs even if most people don't realize it yet.

Building this pattern around Terraform is possible: the primitives exist, the use cases are clear, and the benefits are proven. Nobody has shipped it at scale because it requires rethinking the entire storage layer and execution model.

Stategraph makes it possible through pieces that work together. The queryable state graph gives you the foundation, because you can't reconcile what you can't query. The control plane gives you the execution engine, with retry logic, dependency management, and parallel execution. The event integrations through EventBridge, Event Grid, and Pub/Sub give you reactivity, so you respond to changes instead of polling for them. The reconciliation loops give you operational resilience, where transient failures don't require manual intervention.

Put it together and you get infrastructure that actually behaves like a distributed system instead of a pile of bash scripts tied together with CI/CD duct tape.

Follow along as we build this

If you're interested in this future, we're looking for design partners who want to help shape what gets built. We're building Stategraph in the open, sharing progress, technical decisions, and the engineering challenges along the way.