An evaluation of how to switch mocks in end-to-end tests
This is an operations essay framed through E2E testing.
The concrete problem: how do you switch external-service mocks in end-to-end tests without undermining trust in what your system is actually doing?
“Trust” here means:
- telemetry reflects the execution mode that actually ran
- humans and tests traverse the same critical paths
- failures are explainable without reconstructing hidden switches
- operational behavior matches what engineers believe is deployed
When I say “system shape”, I mean: deployment topology, dependency graph, auth and routing paths, and runtime switches that can change behavior.
Constraints
- The application has server-side logic (e.g. Next.js).
- At least one external service cannot be exercised in tests in a way that is both:
  - safe (no irreversible side effects, no unacceptable cost), and
  - deterministic (predictable outcomes and timing).
- E2E tests must be repeatable.
- Preview environments are used by both humans and automated tests.
That last constraint is the pressure that makes “clean” solutions operationally expensive.
What “true black-box E2E” means here
By “true black-box E2E” I mean exercising the deployed system end-to-end through its public interfaces against real dependencies, without substituting the dependency behavior.
If a dependency cannot be called safely or deterministically, then true black-box E2E is not achievable for that dependency path.
At that point, substitution is not optional.
Divergence is unavoidable — placement is the decision
Once substitution is required, every approach introduces divergence somewhere:
- different builds
- different deployments or instances
- different dependency endpoints
- or different runtime behavior inside the same instance
The decision is where divergence lives, who controls it, and how visible it is in operations.
Switching mocks is a control-plane problem
At the code level, switching to a mock always boils down to:
if (mockEnabled) {
  return mockClient
}
return realClient
The question is how mockEnabled is determined and who can influence it:
- CI / test runner
- deployment configuration
- engineers operating the system
- end users (accidentally or intentionally) if controls are weak
Two control planes
1. Instance-scoped (environment-based)
process.env.EXTERNAL_SERVICE_MODE === 'mock'
Characteristics:
- evaluated at startup
- fixed for the lifetime of the instance
- behavior is uniform within that instance
- typically implies separate previews or deployments for humans vs tests
- in practice often increases operational surface area (more envs, URLs, and CI coordination), even though you can design around some of that
This is classic configuration: one instance, one mode.
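A minimal sketch of what that looks like in code; the client factories and module layout are illustrative, not prescribed:

// externalServiceClient.ts (factory names are assumptions for this sketch)
import { createRealClient, createMockClient, type ExternalServiceClient } from './externalService'

// Evaluated once, at module load: one instance, one mode.
const mockEnabled = process.env.EXTERNAL_SERVICE_MODE === 'mock'

// Every request this instance serves uses the same client for its lifetime.
export const externalServiceClient: ExternalServiceClient =
  mockEnabled ? createMockClient() : createRealClient()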
2. Request-scoped (feature-flag based)
request.headers['x-external-service-mode'] === 'mock'
Characteristics:
- evaluated per request
- the same instance can serve real and mocked traffic
- requires guardrails
- usually reduces the number of deployments and URLs you need to reason about
This is essentially feature flagging where the audience is “tests”.
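A matching sketch of the request-scoped variant, again with hypothetical factories; note that the bare header check here is deliberately naive:

import { createRealClient, createMockClient, type ExternalServiceClient } from './externalService'

// Evaluated per request: the same instance serves real and mocked traffic.
export function clientForRequest(request: Request): ExternalServiceClient {
  // Deliberately naive: anyone who can send the header can flip the mode.
  // The guardrails discussed below replace this bare equality check.
  const mockEnabled = request.headers.get('x-external-service-mode') === 'mock'
  return mockEnabled ? createMockClient() : createRealClient()
}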
Feature flags have a reputation for long-term complexity. That reputation is earned. So let's consider what constrains that complexity, and what costs remain even when you constrain it.
Why request-scoped mocking is commonly discouraged
- Observability ambiguity
Real and mocked executions share logs, traces, and metrics unless mode is explicitly tagged and queryable.
- Shared-state and caching hazards
Any cache above the boundary can be polluted if “mode” is not part of the key or caches are not isolated.
- Accidental activation
Humans can trigger mock behavior if controls are weak, or if the preview URL is shared widely.
- On-call debugging cost
Intra-instance behavioral variance is often the dominant pain: two requests hitting the same URL can run materially different dependency behavior. This raises the bar for incident response unless the mode is loud in telemetry and tooling.
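The first two hazards shrink when mode is an explicit, queryable dimension. A sketch of what that might look like, assuming a structured logger and a shared key-value cache (both hypothetical):

type Mode = 'real' | 'mock'

interface Logger {
  info(message: string, fields: Record<string, unknown>): void
}

// Tag every log line with the mode that actually ran, so mocked traffic
// is filterable instead of invisibly interleaved with real traffic.
function logWithMode(logger: Logger, mode: Mode, message: string,
                     fields: Record<string, unknown> = {}): void {
  logger.info(message, { ...fields, externalServiceMode: mode })
}

// Make mode part of the cache key so a mocked response can never be
// served to real traffic (or vice versa) through a shared cache.
function cacheKey(mode: Mode, resource: string, id: string): string {
  return `${mode}:${resource}:${id}`
}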
Every mocking strategy has costs
All mocking strategies have failure modes. They differ in degree, visibility, and blast radius.
- Separate deployments fragment telemetry and paging context.
- Test-only builds hide the runtime envelope you ship.
- Endpoint swaps still introduce divergence between test and human experiences.
- Humans debugging one instance while tests exercise another creates credibility gaps.
Request-scoped mocking does add one distinct property that some other approaches avoid:
- intra-instance variance (behavior can differ request-to-request)
The trade is that it may reduce variance across instances (fewer “this preview vs that preview” situations) at the cost of increasing variance within an instance.
When request-scoped mocking becomes viable
Request-scoped mocking is defensible when:
- Production is hard-blocked.
- The switch affects only external dependency boundaries, not business logic.
- Mock mode is explicit in logs, traces, and metrics.
- Caches are mode-aware or isolated.
- Activation is guarded (signed headers, restricted issuers, expiring tokens).
- The organization treats this as a scoped mechanism, not a general-purpose switchboard.
That last bullet is a social constraint, not a technical one. You can encode some of it in code review and tooling, but it remains governance.
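For the “activation is guarded” bullet, one possible shape is an HMAC-signed, expiring header value. The format and helper names here are assumptions, not a spec:

import { createHmac, timingSafeEqual } from 'node:crypto'

// Header value format (an assumption for this sketch): '<expiryMillis>.<hexHmac>'.
// Only holders of the secret (e.g. CI) can mint one, and it expires.
export function mintMockModeToken(secret: string, ttlMs = 60 * 60 * 1000): string {
  const expiresAt = String(Date.now() + ttlMs)
  const signature = createHmac('sha256', secret).update(expiresAt).digest('hex')
  return `${expiresAt}.${signature}`
}

export function mockModeAuthorized(headerValue: string | null, secret: string): boolean {
  // The production hard-block belongs in deployment configuration,
  // in front of this check, not inside it.
  if (!headerValue) return false
  const [expiresAt, signature] = headerValue.split('.')
  if (!expiresAt || !signature) return false
  if (!/^\d+$/.test(expiresAt) || Date.now() > Number(expiresAt)) return false
  const expected = createHmac('sha256', secret).update(expiresAt).digest('hex')
  const a = Buffer.from(signature)
  const b = Buffer.from(expected)
  // Constant-time comparison; the length check is needed because
  // timingSafeEqual throws on mismatched lengths.
  return a.length === b.length && timingSafeEqual(a, b)
}

Minting stays with CI, or whoever holds the secret, which keeps the earlier “who can influence it” list honest.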
Why it can be a net win in preview environments
Given the constraint that humans and tests share previews, request-scoped mocking offers concrete advantages:
- Tests traverse realistic execution paths
Same URLs, same middleware, same auth, same routing, same telemetry pipeline.
- Test traffic shows up in real observability
This validates logging, tracing, alerting, and invariants under real deployment conditions and real request shapes. It does not claim representative load.
- Operational simplicity
Fewer parallel deployments and fewer “which preview is this?” debugging loops.
- Security signals surface early
If a human can accidentally trigger mock mode, that can expose missing controls around preview access and auth assumptions. Whether you treat that as a win depends on your risk tolerance.
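On the test side, this can be as small as the runner attaching the guarded header to all traffic. A sketch using Playwright's extraHTTPHeaders option, reusing the hypothetical minting helper from the sketch above; the flow and selectors are placeholders:

import { test, expect } from '@playwright/test'
import { mintMockModeToken } from './mockModeToken' // hypothetical shared helper

// Every request from this browser context carries the guarded header,
// so the test walks the same URLs, middleware, and auth as a human would.
test.use({
  extraHTTPHeaders: {
    'x-external-service-mode': mintMockModeToken(process.env.MOCK_MODE_SECRET!),
  },
})

test('flow that crosses the mocked dependency boundary', async ({ page }) => {
  await page.goto('/some-flow') // assumes baseURL points at the preview deployment
  await expect(page.getByRole('heading', { name: 'Done' })).toBeVisible()
})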
About long-term complexity
The long-term risk is not request-scoped switching per se. The risk is unbounded expansion:
- more flags
- wider scope beyond dependency boundaries
- implicit behavior rather than queryable mode
- unclear ownership of switches
Request-scoped divergence can be either clearer or more confusing:
- clearer if mode is loud and queryable in telemetry
- more confusing if it’s interleaved without strong tagging and tooling
Environment-based approaches can hide divergence by splitting it across instances. Request-scoped approaches can hide divergence by interleaving it. Both fail without discipline.
Conclusion
Request-scoped mocking in previews is feature flagging for tests.
It is a deliberate trade:
- less instance purity
- fewer deployments and fewer URLs
- higher burden on observability and incident response
- governance becomes a first-class constraint
If production is excluded, the boundary is strict, and mode is explicit, the approach is defensible.
The decision is where divergence lives, who controls it, and how visible it is in operations.