Running Open Policy Agent (OPA) in production: what enterprises underestimate

Open Policy Agent (OPA) has become one of those pieces of infrastructure that ends up everywhere before anyone notices. It starts as an admission controller on one Kubernetes cluster, then it's enforcing authorization in a couple of services, then it's gating deployments in CI. By the time a platform team steps back to look, OPA is sitting on the critical path of more decisions than almost anything else they run.

That ubiquity is a testament to how good the project is. It's also exactly why running OPA in production is a different discipline from adopting it. This article is about the operational reality teams consistently underestimate, and how to de-risk it without slowing down.

Why OPA ends up critical

OPA decouples policy from application code. Instead of authorization logic scattered across a dozen services in five languages, you express it once in Rego and ask OPA a question: is this allowed? That single idea is why it spreads:

Kubernetes admission control, via Gatekeeper or the raw OPA admission webhook, deciding which workloads are even allowed to run.
Microservice authorization as a sidecar or library, deciding what each request can do.
CI/CD and supply chain gating: Terraform plans, container images, and deployment manifests checked against policy.
Data filtering and API gateways, shaping what each caller is permitted to see.

The common thread is that when OPA says "no," something stops. And when OPA is wrong, unavailable, or slow, the blast radius is the union of everything it guards. That reach is a strong reason to adopt it, and an even stronger reason to run it deliberately.

What teams underestimate

1. Rego is a language, and policy is code

Rego is powerful, but it is genuinely a language with its own evaluation model, and most teams pick it up incrementally rather than deliberately. The result is policy that works but is hard to read, hard to test, and hard to reason about under change.

The failure mode isn't usually a syntax error. It's a policy that's subtly more permissive than the author believed: a misplaced default, an extra rule body that adds an unintended OR branch (rule bodies that share a name are ORed together), a partial set that's missing a case. These don't fail loudly. They quietly allow something that should have been denied, and you find out during an incident or an audit, not in code review.

Policy deserves the same engineering rigour as the application code it guards. That means tests with opa test, coverage you actually look at, review by someone who reads Rego fluently, and a style the next engineer can follow.
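As a minimal sketch of what "coverage you actually look at" can mean in CI, the Python below reads a coverage report and lists the policy files that fall under a chosen floor. It assumes the report came from opa test --coverage --format=json and that it has a top-level "files" map whose entries carry a "coverage" percentage. The 80 percent floor and the function name are illustrative choices, not anything OPA prescribes.

```python
import json

# Assumed report shape (from `opa test --coverage --format=json`):
# {"files": {"<path>": {"coverage": <percent>, ...}}, "coverage": <percent>}
MIN_COVERAGE = 80.0  # hypothetical team floor, in percent


def files_below_threshold(report_text, minimum=MIN_COVERAGE):
    """Return (path, percent) pairs for files under the floor, worst first."""
    report = json.loads(report_text)
    low = []
    for path, stats in report.get("files", {}).items():
        # A missing "coverage" key is treated as 0 so it cannot hide a gap.
        percent = float(stats.get("coverage", 0.0))
        if percent < minimum:
            low.append((path, percent))
    return sorted(low, key=lambda item: item[1])


# Illustrative report with made-up file names.
example_report = json.dumps(
    {
        "files": {
            "authz/admin.rego": {"coverage": 100},
            "authz/billing.rego": {"coverage": 40.5},
        },
        "coverage": 70.25,
    }
)
for path, percent in files_below_threshold(example_report):
    print(f"{path}: {percent:.1f}% covered")
```

Failing the build on a per-file floor, rather than on the overall number, keeps one well-tested package from masking an untested one.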

2. The Rego v0 to v1 transition

OPA's move to Rego v1 (the if and contains keywords, stricter parsing, the deprecation of older idioms), which became the default syntax in OPA 1.0, is the kind of migration that's trivial for ten lines of policy and genuinely involved for a mature policy library spread across many repositories and bundles.

Teams that wrote a lot of Rego early often have a meaningful body of v0-style policy. The upgrade is doable, but it's not a find-and-replace. It interacts with how your bundles are built and distributed, and it has to be validated against real decision inputs, not just "it parses now."
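One way to validate against real decision inputs is to replay recorded decisions through the migrated policy and diff the answers. The sketch below assumes each recorded entry is a JSON object with "input" and "result" fields, as OPA decision logs provide. The evaluate callable is hypothetical and stands in for however you query the candidate policy; the sample entries are made up.

```python
import json


def replay(recorded_lines, evaluate):
    """Compare recorded decisions with what the candidate policy returns.

    recorded_lines: iterable of JSON strings, one decision per line, each
        assumed to carry "input" and, when the decision was defined, "result".
    evaluate: hypothetical callable that takes an input document and returns
        the candidate policy's result, or None when it is undefined.
    """
    total = 0
    mismatches = []
    for line in recorded_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        entry = json.loads(line)
        total += 1
        expected = entry.get("result")  # None models an undefined decision
        actual = evaluate(entry.get("input"))
        if actual != expected:
            mismatches.append(
                {"input": entry.get("input"), "expected": expected, "actual": actual}
            )
    return total, mismatches


# Made-up recorded decisions from the v0 policy.
log = [
    '{"input": {"user": "alice", "action": "read"}, "result": true}',
    '{"input": {"user": "bob", "action": "delete"}, "result": false}',
]


def candidate(doc):
    # Stand-in for the migrated policy. It wrongly allows deletes, which is
    # the kind of regression the replay is meant to surface.
    return doc["action"] in {"read", "delete"}


total, diffs = replay(log, candidate)
print(f"{len(diffs)} of {total} decisions changed")
for diff in diffs:
    print(diff)
```

A replay like this only covers the inputs you happened to record, so it complements unit tests rather than replacing them.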

3. CVEs land on the whole dependency tree

OPA itself has a good security track record, but "running OPA" in practice means an opa binary or a Go library, often Gatekeeper on top, the Go toolchain it was built with, base container images, and the bundle-distribution machinery around it. Vulnerability exposure is the whole tree, not just the headline project.

For something on the authorization path, "we'll upgrade when we get to it" is a weaker position than most teams are comfortable defending to an auditor. You want a defined process for triaging advisories across that tree and shipping a patched build quickly, without waiting for an internal version bump to fight its way up the backlog.
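To make "a defined process" concrete, here is a small sketch that turns advisories into patch deadlines for the components you actually run. Everything in it is hypothetical: the SLA days per severity, the inventory names, and the advisory identifiers are placeholders, not real advisories or a recommended policy.

```python
from datetime import date, timedelta

# Hypothetical SLA: days allowed from advisory publication to a patched build.
PATCH_SLA_DAYS = {"critical": 2, "high": 7, "medium": 30, "low": 90}

# Hypothetical inventory of what "running OPA" means in one deployment.
INVENTORY = {"opa", "gatekeeper", "go-toolchain", "base-image"}


def patch_deadlines(advisories, today):
    """Return (id, component, due date, overdue) rows, earliest deadline first."""
    rows = []
    for adv in advisories:
        if adv["component"] not in INVENTORY:
            continue  # not part of this deployment's tree
        due = adv["published"] + timedelta(days=PATCH_SLA_DAYS[adv["severity"]])
        rows.append((adv["id"], adv["component"], due, due < today))
    return sorted(rows, key=lambda row: row[2])


# Placeholder advisories; the identifiers are invented for illustration.
advisories = [
    {"id": "EXAMPLE-0001", "component": "go-toolchain", "severity": "high",
     "published": date(2024, 1, 1)},
    {"id": "EXAMPLE-0002", "component": "base-image", "severity": "critical",
     "published": date(2024, 1, 5)},
    {"id": "EXAMPLE-0003", "component": "unrelated-lib", "severity": "critical",
     "published": date(2024, 1, 5)},
]

for row in patch_deadlines(advisories, today=date(2024, 1, 8)):
    print(row)
```

The point of the sketch is that the inventory covers the toolchain and base image, not just the OPA version, and that a deadline exists before the advisory arrives.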

4. Performance and operability are real work

OPA is fast, but production behaviour depends on choices most teams make implicitly: how policy and data are bundled and refreshed, how much data you push into OPA versus query externally, whether you use partial evaluation, how much decision-log volume you generate and where those logs go, and how much latency OPA is allowed to add to every guarded request.

None of this is exotic, but it is the difference between OPA being invisible and OPA being the thing that adds tail latency to every call or quietly falls behind on bundle updates.
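A latency budget becomes real once the caller enforces it. The sketch below queries OPA's Data API with a timeout and denies when OPA is slow, unreachable, or returns no result. Failing closed is a design choice assumed here, not the only valid one. The policy path, the 50 ms budget, and the function names are hypothetical.

```python
import json
import urllib.request

# Hypothetical endpoint: OPA's default port plus an illustrative policy path.
OPA_URL = "http://localhost:8181/v1/data/httpapi/authz/allow"
BUDGET_SECONDS = 0.05  # hypothetical latency budget per decision


def http_post(url, body, timeout):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The timeout bounds each blocking socket operation, so it approximates
    # the budget rather than guaranteeing a hard wall-clock limit.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def is_allowed(input_doc, post=http_post, url=OPA_URL, budget=BUDGET_SECONDS):
    """Ask OPA for a boolean decision; deny on timeout, error, or undefined."""
    try:
        reply = post(url, {"input": input_doc}, budget)
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses; malformed
        # JSON raises ValueError. All of them are treated as a deny.
        return False
    # OPA omits "result" when the rule is undefined, so only a literal
    # JSON true counts as an allow.
    return isinstance(reply, dict) and reply.get("result") is True
```

Passing the transport in as the post parameter keeps the deny-on-failure logic testable without a running OPA. For example, is_allowed({"user": "alice"}, post=fake) can be exercised with a fake that raises TimeoutError.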

5. Bus factor on the policy itself

Rego expertise tends to concentrate in one or two engineers who "own policy." When they move on, the organisation inherits an authorization layer on its critical path that no current employee fully understands. It's the same knowledge-concentration risk that makes internal forks dangerous, except here it's the rules deciding who can do what.

How to de-risk it

You don't need to slow down adoption to run OPA responsibly. You need a few things to be deliberate rather than incidental:

Treat policy as a tested, reviewed codebase: opa test in CI, coverage you actually look at, and review by someone fluent in Rego.
Have a CVE response process for the whole tree (OPA, Gatekeeper, the Go toolchain, and base images), with a defined path to a patched build under a known SLA.
Plan the Rego v1 migration as a project with validation against real decision inputs, not an opportunistic refactor.
Make bundles, data, and decision logs an explicit operational design, with a latency budget for guarded requests.
Reduce bus factor by documenting policy intent and having more than one person who can confidently change it.

Where commercial support fits

For many teams, OPA sits in an awkward spot: too critical to run casually, too specialised to staff a dedicated policy team around. That gap is what commercial support is meant to fill.

This is the kind of work DepKeep does for the open source on enterprises' critical path. Applied to OPA, that means:

Security patching under SLA across OPA, Gatekeeper, and the surrounding dependency tree, so an advisory doesn't sit in the backlog while it's on your authorization path.
Long-term support for the version you're on, so you can modernise on your timeline rather than upstream's.
Rego and policy expertise on call: review, testing strategy, performance work, and a second set of eyes that reads Rego fluently.
Migration support for transitions like Rego v0 to v1, validated against your real decision inputs.

We don't claim to speak for the OPA project. It's a healthy, well-run community, and the right first stop for many questions is the OPA docs and Slack. What we add is the responsive, accountable support, patching, and expertise that the volunteer model isn't designed to provide, for teams that need it.

OPA earns its place on the critical path because it's good. Keeping it there safely, patched and performant and understood by more than one person, is ordinary operational discipline. The teams that get burned are usually the ones who adopted OPA deliberately and then ran it incidentally. If OPA is making authorization decisions in your stack, it's worth closing that gap before an incident or an audit does it for you.