The Right Null Hypothesis For Indirect Observation

Evaluation often begins with an implicit generosity. A system answers correctly, and the result is treated as evidence that the system understood the target. In many cases that may be true. The problem is that a single correct response is also compatible with a weaker explanation: the system may have responded to surface cues in the representation rather than to the underlying phenomenon the representation was meant to reveal.

When the target cannot be observed directly, that weaker explanation should remain alive until the evaluation rules it out. A prompt, label, document, or benchmark item is a channel through which the target is observed. If the channel is never varied, the evaluator has not yet tested whether behavior survives a change in representation.

A stricter default

The cautious default is to assume that observed behavior may be representation-sensitive. This is not pessimism. It is measurement discipline. It prevents an evaluation from overclaiming when the evidence only shows behavior under one form.

The alternative claim is stronger: behavior remains stable across meaning-preserving representations of the same latent target. That claim requires evidence. It requires holding the relevant meaning fixed, changing the representation in valid ways, and observing whether the system preserves the expected behavior. Without that structure, single-representation correctness does not establish invariance.

This default is useful because it makes the burden of proof explicit. It asks evaluators to show that the model is not merely following the representation channel. If the evidence is not available, the right conclusion is not failure by default. It is a narrower claim: the behavior has been observed under a specific representation, and stronger claims require additional variation.

Why non-observation matters

Indirect observation also creates cases where the right move is to suspend the inference. If no usable representation is produced, or if a variation does not preserve the relevant meaning, the evaluation has not obtained invariance evidence. Treating that absence as a model failure or a model success can both be misleading.

The same logic appears outside AI. If a sensor gives no reading, that may be evidence about the sensor channel rather than evidence that the underlying condition is absent. If a respondent gives no answer, that may reflect response availability rather than the absence of a belief. In AI evaluation, if the representation channel is defective, underspecified, or invalid, the result should not be forced into a clean success-or-failure story.

This is why valid variation matters. The variation must preserve what is being measured. Random noise, arbitrary perturbation, or meaning-changing edits do not test invariance. They test something else.

What this means for AI assurance

For deployed AI systems, the practical implication is straightforward. A model that performs well on one form of a task should not automatically be credited with understanding, policy comprehension, intent tracking, or conceptual mastery. It has demonstrated behavior under that form. To support stronger claims, the evaluation needs to show that the behavior remains correct when the same case is expressed through other valid forms.

This standard is especially relevant for enterprise assistants, copilots, and workflow automation, where users naturally vary wording and context. If the system is stable only when the prompt resembles the test item, the organization should know that before deployment. If it remains stable across controlled variation, the evidence is more credible.

The habit to build is simple: ask what would change your confidence that the behavior tracks the target rather than the representation. Under the Latent Invariance Principle, one answer is valid variation. Preserve the meaning, change the form, and see whether the behavior holds.

At Invarra, that stricter default shapes how we read evaluation evidence. We start from the assumption that representation sensitivity is possible, then ask what evidence would justify a stronger claim.