Semantic Brittleness Should Be Attributable

It is easy to say that a semantic system is brittle. It is much harder, and much more useful, to say where the brittleness enters. Does it come from the meaning being measured, the language used to express it, the format, the transformation, the validation process, the response-to-outcome mapping, or a real boundary condition in the specification? Without structure, those possibilities collapse into a vague complaint.

For organizations evaluating AI systems, that vagueness is expensive. If instability cannot be attributed, remediation becomes guesswork. Teams may change prompts when the real problem is retrieval context, adjust policy text when the problem is outcome mapping, or blame the model when the semantic specification itself is underdetermined.

Separating the measurement stack

Canonical Semantic Realization (CSR) separates the semantic unit, the realization, and the observed outcome. The semantic unit is what is being measured. The realization is how it is expressed. The observed outcome is what the system did. That separation allows the evaluation to preserve more than a final score. It preserves the path by which the score was produced.

When each outcome is tied to a semantic unit, representation channel, transformation family, validation status, provenance, and outcome mapping, disagreement becomes structured. The evaluation can ask whether instability is concentrated in particular semantic cases, languages, formats, pressure frames, or mapping decisions. It can also identify cases where the specification itself may be too ambiguous to support a confident claim.
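As a rough illustration of what structured disagreement can look like, the sketch below tags each outcome with the factors listed above and reports, for any one factor, how often outcomes deviate from the most common outcome for their semantic unit. The record fields, the function name deviation_rate_by, and the sample labels are all hypothetical, chosen for this example. They are not a published CSR schema.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OutcomeRecord:
    # Field names are illustrative assumptions, not an official schema.
    semantic_unit: str   # what is being measured
    channel: str         # representation channel, e.g. a language or format
    transformation: str  # transformation family applied to the realization
    validated: bool      # whether the realization passed validation
    provenance: str      # where the realization came from
    outcome: str         # label assigned by the outcome mapping


def deviation_rate_by(records, factor):
    """Share of validated records, per value of `factor`, whose outcome
    differs from the most common outcome for the same semantic unit."""
    valid = [r for r in records if r.validated]

    # Find the most common outcome for each semantic unit.
    outcomes_by_unit = defaultdict(list)
    for r in valid:
        outcomes_by_unit[r.semantic_unit].append(r.outcome)
    modal = {
        unit: Counter(outcomes).most_common(1)[0][0]
        for unit, outcomes in outcomes_by_unit.items()
    }

    # Count deviations from that outcome, grouped by the chosen factor.
    totals, deviations = Counter(), Counter()
    for r in valid:
        key = getattr(r, factor)
        totals[key] += 1
        if r.outcome != modal[r.semantic_unit]:
            deviations[key] += 1
    return {key: deviations[key] / totals[key] for key in totals}


# Hypothetical sample data: two semantic units, three realizations each.
records = [
    OutcomeRecord("case-01", "en", "paraphrase", True, "hand-written", "refuse"),
    OutcomeRecord("case-01", "en", "pressure_frame", True, "generated", "comply"),
    OutcomeRecord("case-01", "de", "paraphrase", True, "generated", "refuse"),
    OutcomeRecord("case-02", "en", "paraphrase", True, "hand-written", "escalate"),
    OutcomeRecord("case-02", "en", "pressure_frame", True, "generated", "escalate"),
    OutcomeRecord("case-02", "de", "paraphrase", True, "generated", "escalate"),
]

print(deviation_rate_by(records, "transformation"))
print(deviation_rate_by(records, "channel"))
```

In this toy data the only deviation sits in the pressure_frame family, so grouping by transformation points to it directly, while grouping by channel spreads the same deviation across all English records. Comparing several groupings side by side is one simple way to see which factor the instability tracks most closely. It does not prove causation.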

This does not require pretending that every source of variance can be perfectly isolated. It requires preserving enough structure to investigate the variance honestly.

The invariance gap

The invariance gap asks a practical question: when the meaning stays fixed, how much does behavior change across valid expressions of that meaning? A nonzero gap is not automatically a failure. It is a diagnostic. It says that behavior depends on realization details despite fixed canonical semantics.
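The text above does not fix a formula for the gap, so the following is only one possible operationalization, assumed here for illustration: for a single semantic unit, take the outcome labels observed across its valid realizations and measure the share that disagree with the most common label. The function name invariance_gap and the labels are hypothetical.

```python
from collections import Counter


def invariance_gap(outcomes):
    """Assumed definition: 1 minus the share of realizations that agree
    with the most common outcome for one semantic unit.

    0.0 means every valid realization produced the same outcome.
    """
    if not outcomes:
        raise ValueError("need at least one observed outcome")
    modal_count = Counter(outcomes).most_common(1)[0][1]
    return 1 - modal_count / len(outcomes)


# Four valid realizations of the same meaning, one of which flips behavior.
print(invariance_gap(["refuse", "refuse", "refuse", "comply"]))  # 0.25

# Fully invariant behavior across realizations.
print(invariance_gap(["refuse", "refuse", "refuse", "refuse"]))  # 0.0
```

A measure like this works on outcome labels, not raw responses, so it depends on the outcome mapping that produced those labels. Other definitions, such as pairwise disagreement or a severity-weighted distance between outcomes, would be equally reasonable choices.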

That dependence may be acceptable in some contexts and unacceptable in others. A model may vary wording while preserving the same behavioral stance. That is usually harmless. A model may change from refusal to compliance, from escalation to no escalation, or from an accurate answer to a fabricated one when the same case is reframed. That is a different kind of instability.

The value of attribution is that it turns a broad statement into a useful finding. Instead of saying the system is unstable, the evaluation can say that instability appears under pressure frames, in a particular transformation family, near a policy boundary, or when a specific outcome mapping is applied.

Why accountability needs attribution

High-stakes semantic systems need more than aggregate performance. They need explanations of where behavior is reliable, where it is sensitive, and which measurement factors account for the difference. Without that structure, a good average can hide brittle behavior in the cases that matter most.

Attribution also improves remediation. If the issue is a transformation family, the test set can be expanded. If the issue is the expected-behavior specification, the policy can be clarified. If the issue is outcome mapping, the scorer can be corrected. If the issue is a model boundary, the deployment decision can reflect that limitation.

The promise of CSR is not perfect certainty. It is structured observability. It gives evaluators a better path from "something is brittle" to "we know where to look." At Invarra, that path is central to making AI behavior evidence useful for real deployment decisions.