An evaluation of how to switch mocks in end-to-end tests
This is an operations essay framed through E2E testing.
The concrete problem: how do you switch external-service mocks in end-to-end tests without undermining trust in what your system is actually doing?
“Trust” here means:
- telemetry reflects the execution mode that actually ran
- humans and tests traverse the same critical paths
- failures are explainable without reconstructing hidden switches
- operational behavior matches what engineers believe is deployed
When I say “system shape”, I mean: deployment topology, dependency graph, auth and routing paths, and runtime switches that can change behavior.
Constraints
- The application has server-side logic (e.g. Next.js).
- At least one external service cannot be exercised in tests in a way that is both:
  - safe (no irreversible side effects, no unacceptable cost), and
  - deterministic (predictable outcomes and timing).
- E2E tests must be repeatable.
- Preview environments are used by both humans and automated tests.
That last constraint is the pressure that makes “clean” solutions operationally expensive.
What “true black-box E2E” means here
By “true black-box E2E” I mean exercising the deployed system end-to-end through its public interfaces against real dependencies, without substituting the dependency behavior.
If a dependency cannot be called safely or deterministically, then true black-box E2E is not achievable for that dependency path.
At that point, substitution is not optional.
Divergence is unavoidable — placement is the decision
Once substitution is required, every approach introduces divergence somewhere:
- different builds
- different deployments or instances
- different dependency endpoints
- or different runtime behavior inside the same instance
The decision is where divergence lives, who controls it, and how visible it is in operations.
Switching mocks is a control-plane problem
At the code level, switching to a mock always boils down to:
if (mockEnabled) {
  return mockClient
}
return realClient
The question is how mockEnabled is determined and who can influence it:
- CI / test runner
- deployment configuration
- engineers operating the system
- end users (accidentally or intentionally) if controls are weak
Two control planes
1. Instance-scoped (environment-based)
process.env.EXTERNAL_SERVICE_MODE === 'mock'
Characteristics:
- evaluated at startup
- fixed for the lifetime of the instance
- behavior is uniform within that instance
- typically implies separate previews or deployments for humans vs tests
- in practice often increases operational surface area (more envs, URLs, and CI coordination), even though you can design around some of that
This is classic configuration: one instance, one mode.
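A minimal sketch of what that looks like in code; the client factories and module layout are illustrative, not prescribed:

// externalServiceClient.ts (factory names are assumptions for this sketch)
import { createRealClient, createMockClient, type ExternalServiceClient } from './externalService'

// Evaluated once, at module load: one instance, one mode.
const mockEnabled = process.env.EXTERNAL_SERVICE_MODE === 'mock'

// Every request this instance serves uses the same client for its lifetime.
export const externalServiceClient: ExternalServiceClient =
  mockEnabled ? createMockClient() : createRealClient()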
2. Request-scoped (feature-flag based)
request.headers['x-external-service-mode'] === 'mock'
Characteristics:
- evaluated per request
- the same instance can serve real and mocked traffic
- requires guardrails
- usually reduces the number of deployments and URLs you need to reason about
This is essentially feature flagging where the audience is “tests”.
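A matching sketch of the request-scoped variant, again with hypothetical factories; note that the bare header check here is deliberately naive:

import { createRealClient, createMockClient, type ExternalServiceClient } from './externalService'

// Evaluated per request: the same instance serves real and mocked traffic.
export function clientForRequest(request: Request): ExternalServiceClient {
  // Deliberately naive: anyone who can send the header can flip the mode.
  // The guardrails discussed below replace this bare equality check.
  const mockEnabled = request.headers.get('x-external-service-mode') === 'mock'
  return mockEnabled ? createMockClient() : createRealClient()
}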
Feature flags have a reputation for long-term complexity. That reputation is earned. So let's consider what constrains that complexity, and what costs remain even when you constrain it.
Why request-scoped mocking is commonly discouraged
- Observability ambiguity
Real and mocked executions share logs, traces, and metrics unless mode is explicitly tagged and queryable.
- Shared-state and caching hazards
Any cache above the boundary can be polluted if “mode” is not part of the key or caches are not isolated.
- Accidental activation
Humans can trigger mock behavior if controls are weak, or if the preview URL is shared widely.
- On-call debugging cost
Intra-instance behavioral variance is often the dominant pain: two requests hitting the same URL can run materially different dependency behavior. This raises the bar for incident response unless the mode is loud in telemetry and tooling.
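The first two hazards shrink when mode is an explicit, queryable dimension. A sketch of what that might look like, assuming a structured logger and a shared key-value cache (both hypothetical):

type Mode = 'real' | 'mock'

interface Logger {
  info(message: string, fields: Record<string, unknown>): void
}

// Tag every log line with the mode that actually ran, so mocked traffic
// is filterable instead of invisibly interleaved with real traffic.
function logWithMode(logger: Logger, mode: Mode, message: string,
                     fields: Record<string, unknown> = {}): void {
  logger.info(message, { ...fields, externalServiceMode: mode })
}

// Make mode part of the cache key so a mocked response can never be
// served to real traffic (or vice versa) through a shared cache.
function cacheKey(mode: Mode, resource: string, id: string): string {
  return `${mode}:${resource}:${id}`
}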
Every mocking strategy has costs
All mocking strategies have failure modes. They differ in degree, visibility, and blast radius.
- Separate deployments fragment telemetry and paging context.
- Test-only builds hide the runtime envelope you ship.
- Endpoint swaps still introduce divergence between test and human experiences.
- Humans debugging one instance while tests exercise another creates credibility gaps.
Request-scoped mocking does add one distinct property that some other approaches avoid:
- intra-instance variance (behavior can differ request-to-request)
The trade is that it may reduce variance across instances (fewer “this preview vs that preview” situations) at the cost of increasing variance within an instance.
When request-scoped mocking becomes viable
Request-scoped mocking is defensible when:
- Production is hard-blocked.
- The switch affects only external dependency boundaries, not business logic.
- Mock mode is explicit in logs, traces, and metrics.
- Caches are mode-aware or isolated.
- Activation is guarded (signed headers, restricted issuers, expiring tokens).
- The organization treats this as a scoped mechanism, not a general-purpose switchboard.
That last bullet is a social constraint, not a technical one. You can encode some of it in code review and tooling, but it remains governance.
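For the “activation is guarded” bullet, one possible shape is an HMAC-signed, expiring header value. The format and helper names here are assumptions, not a spec:

import { createHmac, timingSafeEqual } from 'node:crypto'

// Header value format (an assumption for this sketch): '<expiryMillis>.<hexHmac>'.
// Only holders of the secret (e.g. CI) can mint one, and it expires.
export function mintMockModeToken(secret: string, ttlMs = 60 * 60 * 1000): string {
  const expiresAt = String(Date.now() + ttlMs)
  const signature = createHmac('sha256', secret).update(expiresAt).digest('hex')
  return `${expiresAt}.${signature}`
}

export function mockModeAuthorized(headerValue: string | null, secret: string): boolean {
  // The production hard-block belongs in deployment configuration,
  // in front of this check, not inside it.
  if (!headerValue) return false
  const [expiresAt, signature] = headerValue.split('.')
  if (!expiresAt || !signature) return false
  if (!/^\d+$/.test(expiresAt) || Date.now() > Number(expiresAt)) return false
  const expected = createHmac('sha256', secret).update(expiresAt).digest('hex')
  const a = Buffer.from(signature)
  const b = Buffer.from(expected)
  // Constant-time comparison; the length check is needed because
  // timingSafeEqual throws on mismatched lengths.
  return a.length === b.length && timingSafeEqual(a, b)
}

Minting stays with CI, or whoever holds the secret, which keeps the earlier “who can influence it” list honest.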
Why it can be a net win in preview environments
Given the constraint that humans and tests share previews, request-scoped mocking offers concrete advantages:
- Tests traverse realistic execution paths
Same URLs, same middleware, same auth, same routing, same telemetry pipeline.
- Test traffic shows up in real observability
This validates logging, tracing, alerting, and invariants under real deployment conditions and real request shapes. It does not claim representative load.
- Operational simplicity
Fewer parallel deployments and fewer “which preview is this?” debugging loops.
- Security signals surface early
If a human can accidentally trigger mock mode, that can expose missing controls around preview access and auth assumptions. Whether you treat that as a win depends on your risk tolerance.
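On the test side, this can be as small as the runner attaching the guarded header to all traffic. A sketch using Playwright's extraHTTPHeaders option, reusing the hypothetical minting helper from the sketch above; the flow and selectors are placeholders:

import { test, expect } from '@playwright/test'
import { mintMockModeToken } from './mockModeToken' // hypothetical shared helper

// Every request from this browser context carries the guarded header,
// so the test walks the same URLs, middleware, and auth as a human would.
test.use({
  extraHTTPHeaders: {
    'x-external-service-mode': mintMockModeToken(process.env.MOCK_MODE_SECRET!),
  },
})

test('flow that crosses the mocked dependency boundary', async ({ page }) => {
  await page.goto('/some-flow') // assumes baseURL points at the preview deployment
  await expect(page.getByRole('heading', { name: 'Done' })).toBeVisible()
})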
About long-term complexity
The long-term risk is not request-scoped switching per se. The risk is unbounded expansion:
- more flags
- wider scope beyond dependency boundaries
- implicit behavior rather than queryable mode
- unclear ownership of switches
Request-scoped divergence can be either clearer or more confusing:
- clearer if mode is loud and queryable in telemetry
- more confusing if it’s interleaved without strong tagging and tooling
Environment-based approaches can hide divergence by splitting it across instances. Request-scoped approaches can hide divergence by interleaving it. Both fail without discipline.
Conclusion
Request-scoped mocking in previews is feature flagging for tests.
It is a deliberate trade:
- less instance purity
- fewer deployments and fewer URLs
- higher burden on observability and incident response
- governance becomes a first-class constraint
If production is excluded, the boundary is strict, and mode is explicit, the approach is defensible.
The decision is where divergence lives, who controls it, and how visible it is in operations.