Feature flagging e2e mocks

An evaluation of how to switch mocks in end-to-end tests

This is an operations essay framed through E2E testing.

The concrete problem: how do you switch external-service mocks in end-to-end tests without undermining trust in what your system is actually doing?

“Trust” here means:

When I say “system shape”, I mean: deployment topology, dependency graph, auth and routing paths, and runtime switches that can change behavior.

Constraints

That last constraint is the pressure that makes “clean” solutions operationally expensive.

What “true black-box E2E” means here

By “true black-box E2E” I mean exercising the deployed system end-to-end through its public interfaces against real dependencies, without substituting the dependency behavior.

If a dependency cannot be called safely or deterministically, then true black-box E2E is not achievable for that dependency path.

At that point, substitution is not optional.

Divergence is unavoidable — placement is the decision

Once substitution is required, every approach introduces divergence somewhere:

The decision is where divergence lives, who controls it, and how visible it is in operations.

Switching mocks is a control-plane problem

At the code level, switching to a mock always boils down to:

function selectClient(mockEnabled) {
  // Every switching strategy reduces to this one branch.
  if (mockEnabled) {
    return mockClient
  }
  return realClient
}

The question is how mockEnabled is determined and who can influence it:

Two control planes

1. Instance-scoped (environment-based)

process.env.EXTERNAL_SERVICE_MODE === 'mock'

Characteristics:

This is classic configuration: one instance, one mode.
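
A minimal sketch of the instance-scoped variant, assuming a Node.js process. Only the `EXTERNAL_SERVICE_MODE` variable comes from the expression above; `resolveInstanceMode`, `createClient`, the two placeholder clients, and the default of `'real'` when the variable is unset are assumptions made for illustration. The mode is resolved once and unknown values are rejected, so a typo cannot silently select a mode.

```javascript
// Hypothetical placeholder clients; a real one would call the external service.
const realClient = { name: 'real' }
const mockClient = { name: 'mock' }

const ALLOWED_MODES = ['real', 'mock']

// Resolve the mode from an environment object. Defaults to 'real' when the
// variable is unset (an assumption of this sketch) and throws on any
// unrecognised value.
function resolveInstanceMode(env) {
  const mode = env.EXTERNAL_SERVICE_MODE ?? 'real'
  if (!ALLOWED_MODES.includes(mode)) {
    throw new Error(`Unknown EXTERNAL_SERVICE_MODE: ${mode}`)
  }
  return mode
}

// One instance, one mode: the client is chosen once, not per request.
function createClient(env) {
  return resolveInstanceMode(env) === 'mock' ? mockClient : realClient
}

// At startup: const client = createClient(process.env)
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the decision easy to test without mutating global state.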

2. Request-scoped (feature-flag based)

request.headers['x-external-service-mode'] === 'mock'

Characteristics:

This is essentially feature flagging where the audience is “tests”.
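
A minimal sketch of the request-scoped variant, assuming a Node-style request object whose header names are lower-cased. Only the header name comes from the expression above; `resolveRequestMode`, `clientFor`, the `isProduction` argument, and the placeholder clients are hypothetical. The sketch ignores the header entirely in production, so the switch exists only where tests run.

```javascript
const MODE_HEADER = 'x-external-service-mode'

// Hypothetical placeholder clients; a real one would call the external service.
const realClient = { name: 'real' }
const mockClient = { name: 'mock' }

// Decide the mode for a single request. In production the header is ignored,
// so no caller can switch a production request to the mock. Any value other
// than the exact string 'mock' falls back to 'real'.
function resolveRequestMode(request, isProduction) {
  if (isProduction) return 'real'
  return request.headers[MODE_HEADER] === 'mock' ? 'mock' : 'real'
}

// Pick the client per request rather than per instance.
function clientFor(request, isProduction) {
  return resolveRequestMode(request, isProduction) === 'mock' ? mockClient : realClient
}
```

For example, `clientFor({ headers: { 'x-external-service-mode': 'mock' } }, false)` returns the mock client, while the same request with `isProduction` set to `true` returns the real one.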

Feature flags have a reputation for long-term complexity. That reputation is earned. So let's consider what constrains that complexity, and what costs remain even when you constrain it.

Why request-scoped mocking is commonly discouraged

  1. Observability ambiguity
    Real and mocked executions share logs, traces, and metrics unless mode is explicitly tagged and queryable.
  2. Shared-state and caching hazards
    Any cache above the boundary can be polluted if “mode” is not part of the key or caches are not isolated.
  3. Accidental activation
    Humans can trigger mock behavior if controls are weak, or if the preview URL is shared widely.
  4. On-call debugging cost
    Intra-instance behavioral variance is often the dominant pain: two requests hitting the same URL can run materially different dependency behavior. This raises the bar for incident response unless the mode is loud in telemetry and tooling.
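
Two of the hazards above, cache pollution and ambiguous telemetry, can be narrowed by carrying the mode explicitly. The following is a sketch only: `cacheKey`, `logEvent`, and the field name `externalServiceMode` are hypothetical, and a plain `Map` stands in for whatever cache sits above the boundary.

```javascript
// Make the mode part of every cache key so a mocked response can never be
// served to a real request, or the reverse.
function cacheKey(mode, resource) {
  return `${mode}:${resource}`
}

// Build a structured log line that always carries the mode. The mode field is
// written after the caller's fields so they cannot overwrite it.
function logEvent(mode, message, fields = {}) {
  return JSON.stringify({ ...fields, message, externalServiceMode: mode })
}

const cache = new Map()
cache.set(cacheKey('mock', 'user/42'), { id: 42, source: 'mock' })

// A real-mode lookup for the same resource misses the mocked entry.
const hit = cache.get(cacheKey('real', 'user/42')) // undefined
```

Keying by mode addresses only caches that the application controls; caches outside it, such as a CDN, would need their own handling.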

Every mocking strategy has costs

All mocking strategies have failure modes. They differ in degree, visibility, and blast radius.

Request-scoped mocking does add one distinct property that some other approaches avoid:

The trade is that it may reduce variance across instances (fewer “this preview vs that preview” situations) at the cost of increasing variance within an instance.

When request-scoped mocking becomes viable

Request-scoped mocking is defensible when:

That last bullet is a social constraint, not a technical one. You can encode some of it in code review and tooling, but it remains governance.

Why it can be a net win in preview environments

Given the constraint that humans and tests share previews, request-scoped mocking offers concrete advantages:

About long-term complexity

The long-term risk is not request-scoped switching per se. The risk is unbounded expansion:

Request-scoped divergence can be either clearer or more confusing:

Environment-based approaches can hide divergence by splitting it across instances. Request-scoped approaches can hide divergence by interleaving it. Both fail without discipline.

Conclusion

Request-scoped mocking in previews is feature flagging for tests.

It is a deliberate trade:

If production is excluded, the boundary is strict, and mode is explicit, the approach is defensible.

The decision is where divergence lives, who controls it, and how visible it is in operations.

Feature flagging e2e mocks • Hannes Diercks