AI Systems Log Outputs. Courts Will Ask for Decisions.

When an automated decision is challenged, a receipt of the output is not the same as a defensible trail that can reconstruct why the decision was made.

June 28, 2026 · Quantum Nexus Ventures FZCO

AI governance
audit trail
EU AI Act
accountability
RegTech

Every organisation deploying AI in a consequential domain — lending, legal, insurance, hiring — is building an audit log. Most of those logs will be useless when a decision is challenged. Not because the logs are incomplete. Because they are logging the wrong thing.

Traditional audit infrastructure is designed around events: an action was taken, by a specific principal, at a specific time, producing a specific change in system state. The invariant is reconstructibility. If you replay the event log, you can reconstitute what happened. That is what makes an audit trail legally meaningful. Not the fact that it exists, but that it can prove something.

AI inference is not an event in this sense. An AI decision is a function. It takes inputs and produces an output. The output is not the decision — it is the result of the decision function applied to a specific set of inputs under specific conditions. Log only the output, and you have recorded the answer without recording the question, the model that answered it, the context it was given, or the parameters under which it operated. You have a receipt, not a trail.

The four inputs that determine the decision

A decision produced by a language model is determined by, at minimum, four independent variables.

The user input: the document, query, or prompt submitted at inference time. Most organisations log this. It is the easiest.

The system prompt: the instructions that frame the model's behaviour, constrain its outputs, and define its role in the pipeline. System prompts change. They are often updated silently, without versioning, as teams iterate on model behaviour. If you cannot produce the exact system prompt in use at the time of a specific inference, you cannot reconstruct what the model was instructed to do.

The model version: not the model name. The version. "GPT-4" is not a model version. If a provider updates a model behind a stable API endpoint, which happens, your log entry of "GPT-4" is consistent with two different models producing two different outputs for the same input. The decision is not reconstructable.

The retrieval context: if the system uses retrieval-augmented generation, the retrieved documents are part of the decision function. A retrieval index changes over time. Documents are added, deprecated, re-chunked, re-embedded. The retrieval result for the same query today is not the same as the retrieval result six months ago. Unless the retrieved context is logged at the time of inference, the decision cannot be reproduced.

Any one of these, unlogged or unversioned, makes the decision irrecoverable.
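
One way to see why all four matter is to fold them into a single fingerprint. What follows is a minimal Python sketch under stated assumptions, not a prescribed implementation: the function names, the sample strings, and the model identifier are all hypothetical. Changing any one input yields a different fingerprint, which is the kind of change an output-only log cannot see.

```python
import hashlib
import json


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def decision_fingerprint(user_input, system_prompt, model_version, retrieved_docs):
    """Combine the four decision inputs into one reproducible digest."""
    payload = {
        "user_input": sha256_hex(user_input),
        "system_prompt": sha256_hex(system_prompt),
        "model_version": model_version,
        # List order is kept: the order of retrieved context can affect output.
        "retrieved_docs": [sha256_hex(doc) for doc in retrieved_docs],
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical)


# Hypothetical values, for illustration only.
inputs = {
    "user_input": "Applicant requests a 12-month credit line.",
    "system_prompt": "You are a credit assessor.",
    "model_version": "example-model-2025-01-15",
    "retrieved_docs": ["Lending policy text"],
}

original = decision_fingerprint(**inputs)
after_prompt_edit = decision_fingerprint(
    **{**inputs, "system_prompt": "You are a cautious credit assessor."}
)

print(original == decision_fingerprint(**inputs))  # True: same inputs, same digest
print(original == after_prompt_edit)               # False: a prompt edit changes it
```

The fingerprint only detects that something changed. Saying what changed still requires each component to be logged separately.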

Why output logs fail at the moment they are tested

The failure mode is invisible in normal operation. An output log works fine for monitoring, analytics, and debugging. It fails precisely when you need it most: when a decision is disputed.

Consider the scenario. A consequential automated decision is challenged twelve months after it was made. The model has since been updated. The retrieval index has been refreshed. The system prompt has been revised twice. The challenger asks, legitimately: what did your system know when it made this decision, and why did it decide this way?

The output log answers: it decided this. It does not answer: on what basis.

This is the difference between logging and accountability. A log tells you what happened. An accountability record tells you why, in a form that can be verified, reproduced, and defended. The regulatory frameworks that matter are beginning to enforce this distinction whether organisations are ready for it or not.

What the regulatory frameworks actually require

EU AI Act Article 12 requires high-risk AI systems to "technically allow for the automatic recording of events (logs) over the lifetime of the system," with logging capabilities scoped to ensure a level of traceability "appropriate to the intended purpose of the system" (Regulation (EU) 2024/1689, Article 12(1)–(2)).

Under GDPR Article 22, individuals have the right not to be subject to a decision based solely on automated processing, including profiling, that produces legal or similarly significant effects. Where such processing is permitted, Article 22(3) entitles them to safeguards including human intervention, the right to express their point of view, and the right to contest the decision. A "right to explanation" as such is not in Article 22's text; it derives from the non-binding Recital 71 and from the transparency duties in Articles 13–15, which require "meaningful information about the logic involved."

The PRA's model risk management framework (Supervisory Statement SS1/23, "Model risk management principles for banks") expects firms to maintain documentation detailed enough for an independent party to understand and reproduce a model's results, supporting auditability and effective model governance.

Sources: EU Artificial Intelligence Act, Article 12 (Record-keeping), Regulation (EU) 2024/1689 · GDPR Article 22 (full text), gdpr-info.eu · Bank of England (PRA), SS1/23, "Model risk management principles for banks"

None of these frameworks specifies what logging means technically. A court or regulator interpreting them is likely to apply a purposive standard: does the log allow the decision to be reconstructed and explained? An output log does not meet this standard. An output-plus-context-plus-model-version log does.

The practical consequence is that organisations building AI on top of general-purpose APIs with no version commitment and no retrieval logging are accumulating liability that is invisible until a decision is challenged, at which point the inputs needed for reconstruction no longer exist.

What a decision-reconstructable log actually looks like

The minimum viable accountability record for an AI decision contains:
A hash of the user input at the time of inference.
The exact system prompt in use, versioned and hashed.
A model identifier that commits to specific weights: a frozen deployment ID or a provider-level snapshot reference, not a name.
The retrieved context, if applicable: document identifiers, content hashes, and retrieval timestamp.
The inference parameters: temperature, sampling configuration, anything that affects output stochasticity.
The output, hashed and timestamped.
The downstream decision, if the output feeds a rule-based or human-reviewed step.
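
As a rough illustration, the record above could be expressed as a small immutable structure and written out as one JSON line per decision. This is a sketch, not a reference schema: every field name, the prompt version label, and the model identifier are hypothetical. It also assumes that the raw input, prompt, documents, and output are retained in a separate store keyed by the hashes recorded here, since a hash on its own proves integrity but cannot be replayed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def utc_now() -> str:
    """Return the current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    content_sha256: str


@dataclass(frozen=True)
class DecisionRecord:
    user_input_sha256: str
    system_prompt_version: str
    system_prompt_sha256: str
    model_id: str  # frozen deployment ID or snapshot reference, not a name
    retrieved_docs: Tuple[RetrievedDoc, ...]
    retrieved_at: Optional[str]  # None when no retrieval step is used
    inference_params: dict
    output_sha256: str
    output_at: str
    downstream_decision: Optional[str] = None


# Hypothetical values, for illustration only.
record = DecisionRecord(
    user_input_sha256=sha256_hex("Applicant requests a 12-month credit line."),
    system_prompt_version="credit-assessor/v7",
    system_prompt_sha256=sha256_hex("You are a credit assessor."),
    model_id="example-model-2025-01-15",
    retrieved_docs=(RetrievedDoc("policy-003", sha256_hex("Lending policy text")),),
    retrieved_at=utc_now(),
    inference_params={"temperature": 0.0, "top_p": 1.0},
    output_sha256=sha256_hex("Recommend: refer to human reviewer."),
    output_at=utc_now(),
    downstream_decision="referred_to_reviewer",
)

# One JSON line per decision; sorted keys keep entries easy to diff.
line = json.dumps(asdict(record), sort_keys=True)
print(line)
```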

This is not a large volume of data. The retrieval context is the largest component for RAG systems, but it is bounded by the context window. The rest is metadata. The cost of storing it is negligible relative to the cost of not being able to produce it.
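
To give a feel for the storage point, here is a hedged Python sketch of capturing the retrieval context at inference time. It assumes a content-addressed store (a plain dictionary stands in for it here) so that a chunk retrieved by many decisions is kept once, while each log entry carries only identifiers, hashes, and a timestamp. The function name, document identifiers, and chunk texts are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_retrieval(chunks, store):
    """Record what was retrieved: ids and content hashes, plus a timestamp.

    chunks maps a document identifier to the text as it was retrieved.
    store maps a content hash to the text; it stands in for durable storage.
    """
    entries = []
    for doc_id, text in chunks.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        store.setdefault(digest, text)  # each distinct chunk is kept once
        entries.append({"doc_id": doc_id, "sha256": digest})
    return {
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        "documents": entries,
    }


# Two hypothetical decisions whose retrieval results overlap on one chunk.
store = {}
first = snapshot_retrieval(
    {"policy-003#2": "Clause A text", "policy-007#1": "Clause B text"}, store
)
second = snapshot_retrieval(
    {"policy-003#2": "Clause A text", "policy-009#4": "Clause C text"}, store
)

print(len(store))  # 3 distinct chunks kept, although 4 were retrieved in total
print(len(json.dumps(first).encode("utf-8")), "bytes of metadata in the first entry")
```

If a document is later re-chunked or re-embedded, the old hash still resolves to the text the model actually saw, provided the store is never overwritten.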

The test

If you cannot reproduce an AI decision from a log entry (run the same inputs through the same model version, with the same system prompt, retrieval context, and inference parameters, and arrive at the same output), then your log is not an audit trail. It is an output history.
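
The test can be sketched as a replay check. The Python below is illustrative only: `stub_model` is a hypothetical stand-in for a real inference call and is deterministic by construction, and the entry fields are assumed names. Real model endpoints may not be bit-for-bit deterministic even with fixed parameters, in which case an exact hash comparison would need to be relaxed to whatever equivalence the organisation can defend.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stub_model(*, model_id, system_prompt, user_input, retrieved_context, params):
    """Hypothetical stand-in for inference: a pure function of all its inputs."""
    material = "|".join(
        [model_id, system_prompt, user_input, *retrieved_context,
         repr(sorted(params.items()))]
    )
    return "decision-" + sha256_hex(material)[:12]


def replay_matches(entry, run_model) -> bool:
    """Re-run the logged inputs and compare against the logged output hash."""
    output = run_model(
        model_id=entry["model_id"],
        system_prompt=entry["system_prompt"],
        user_input=entry["user_input"],
        retrieved_context=entry["retrieved_context"],
        params=entry["inference_params"],
    )
    return sha256_hex(output) == entry["output_sha256"]


# A hypothetical log entry, with raw content resolved from storage for replay.
entry = {
    "model_id": "example-model-2025-01-15",
    "system_prompt": "You are a credit assessor.",
    "user_input": "Applicant requests a 12-month credit line.",
    "retrieved_context": ["Lending policy text"],
    "inference_params": {"temperature": 0.0},
}
original_output = stub_model(
    model_id=entry["model_id"],
    system_prompt=entry["system_prompt"],
    user_input=entry["user_input"],
    retrieved_context=entry["retrieved_context"],
    params=entry["inference_params"],
)
entry["output_sha256"] = sha256_hex(original_output)

# The same entry, but with the prompt that is in production a year later.
drifted = {**entry, "system_prompt": "You are a cautious credit assessor."}

print(replay_matches(entry, stub_model))    # True: the decision reproduces
print(replay_matches(drifted, stub_model))  # False: the prompt has changed
```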

Output histories are useful. They are not defensible.

The organisations that will be in the strongest position when automated decisions are challenged are not necessarily the ones with the best models. They are the ones that treated logging as a decision architecture problem from the start, before any decision was disputed.

The time to build that architecture is before the dispute, not after.

This is an opinion / thought-leadership piece. It is not legal or financial advice.
