Decision Logging: How AI Agents Create Audit Trails That Hold Up

There's a scenario that comes up early in almost every conversation we have with enterprise IT teams evaluating AI agents for approval workflows. It goes something like this: an AI agent approves a purchase order that turns out to be fraudulent, or incorrectly flags a legitimate vendor payment, and someone — an internal auditor, a finance director, a regulator — asks: why did the system make that decision?

If you can answer that question clearly, you have a recoverable situation. You can show the policy rules the agent was applying, the data it evaluated, and the logic path that led to the outcome. You can identify where the policy was ambiguous or where the input data was misleading. You can fix it. If you cannot answer that question — if the agent made its decision in a way that isn't logged with sufficient fidelity — you have a governance problem that may be worse than the original decision itself.

Decision logging is the mechanism that determines which of those two situations you're in. This post describes how we think about it and what we actually build.

What a Decision Log Needs to Contain

A naive implementation of decision logging is just recording the outcome: "PO #48821, approved, 2025-08-14 10:42." That's a transaction log, not a decision log. It tells you what happened, not why. It's useless for audit purposes because an auditor's question is always about the reasoning, not just the result.

A useful decision log records the complete evaluation context at the moment of the decision:

  • The input data the agent received — the full structured payload, not a summary
  • The specific rules that were evaluated, in evaluation order
  • The outcome of each individual rule (matched / not matched / not applicable)
  • The decision path — which rules were determinative
  • The final decision and the timestamp
  • The rule set version that was active at the time
  • The identity of the agent instance that made the decision

Rule set version, the second-to-last item on that list, is more important than it might seem. Enterprise policy documents change. A procurement approval threshold that was ¥500,000 in January might be ¥300,000 by June after a finance policy update. If an auditor is reviewing a decision made in March, they need to know which version of the approval policy was in effect at that time. Without versioned rule sets and version references in the decision log, historical audit queries are unreliable.
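
As a rough illustration, the fields listed above could be captured in a record shaped like the following. This is a minimal sketch, not our production schema: the class names, field names, rule identifiers, and the payload contents are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass

VALID_OUTCOMES = {"matched", "not_matched", "not_applicable"}


@dataclass(frozen=True)
class RuleEvaluation:
    rule_id: str          # hypothetical identifier for a policy rule
    outcome: str          # one of VALID_OUTCOMES
    determinative: bool   # True if this rule drove the final decision

    def __post_init__(self):
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown rule outcome: {self.outcome}")


@dataclass(frozen=True)
class DecisionRecord:
    input_payload: dict      # the full structured payload, not a summary
    evaluations: tuple       # RuleEvaluation entries, in evaluation order
    decision: str
    decided_at: str          # ISO 8601 timestamp in UTC
    rule_set_version: str    # version active at decision time
    agent_instance: str

    def to_json(self) -> str:
        # Sorted keys give a stable serialization for storage and comparison.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)


# Illustrative values only; the payload fields and rule ids are invented.
record = DecisionRecord(
    input_payload={"po_number": "48821", "amount_jpy": 420000},
    evaluations=(
        RuleEvaluation("amount_under_threshold", "matched", True),
        RuleEvaluation("vendor_on_watchlist", "not_matched", False),
    ),
    decision="approved",
    decided_at="2025-08-14T01:42:00+00:00",
    rule_set_version="2.1",
    agent_instance="dodo-procurement-01",
)
```

Freezing the dataclasses is a small design choice in the same spirit as the log itself: once a record is built, nothing in the process should be able to adjust it before it is written.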

Immutability and Tamper Evidence

Writing decision records to a database that the same system can later update is not an audit trail; it's an editable record. An audit trail has to be constructed so that modifying a past entry is either impossible or clearly detectable.

We use append-only log storage for decision records. Once a decision entry is written, no process in the dodoAI runtime has write access to that entry. The agent runtime can write new records; it cannot modify existing ones. This is enforced at the database level through row-level permissions, not just at the application layer.
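
To make the append-only idea concrete, here is a small self-contained sketch using SQLite from the Python standard library. It is a stand-in, not a description of our storage: SQLite has no user permissions, so the sketch uses triggers that reject UPDATE and DELETE instead of the row-level permissions described above. The table and function names are assumptions. Note that a trigger can be dropped by anyone with schema access, which is exactly why permission-based enforcement is the stronger control.

```python
import sqlite3


def open_append_only_log(path: str = ":memory:") -> sqlite3.Connection:
    """Create a decision log table whose rows cannot be updated or deleted."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS decision_log (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            recorded_at TEXT NOT NULL,
            record_json TEXT NOT NULL
        );
        CREATE TRIGGER IF NOT EXISTS decision_log_no_update
        BEFORE UPDATE ON decision_log
        BEGIN
            SELECT RAISE(ABORT, 'decision_log is append-only');
        END;
        CREATE TRIGGER IF NOT EXISTS decision_log_no_delete
        BEFORE DELETE ON decision_log
        BEGIN
            SELECT RAISE(ABORT, 'decision_log is append-only');
        END;
        """
    )
    return conn


def append_decision(conn: sqlite3.Connection, recorded_at: str, record_json: str) -> int:
    """Insert a new record and return its row id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO decision_log (recorded_at, record_json) VALUES (?, ?)",
            (recorded_at, record_json),
        )
    return cur.lastrowid


def is_blocked(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement is rejected by the database."""
    try:
        with conn:
            conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False
```

With this in place, inserts succeed while a statement such as "UPDATE decision_log SET record_json = '{}'" is rejected by the database itself rather than by application code.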

For environments where tamper evidence is a hard requirement, we offer a hash-chain approach: each log entry includes a cryptographic hash of the previous entry's content plus the current entry's content. This means any modification to a historical record breaks the chain and is detectable by recalculating hashes forward from any known-good checkpoint. The approach is similar to how append-only audit logs work in financial systems that need to satisfy external auditor requirements.
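
The hash-chain idea can be sketched in a few lines. This is an illustration under stated assumptions rather than our implementation: it uses SHA-256, serializes entries as canonical JSON, and chains each entry to the previous entry's hash (which in turn covers the previous content) rather than re-hashing the previous content directly. The function names and the all-zero genesis value are invented for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first entry


def entry_hash(prev_hash: str, content: dict) -> str:
    """Hash the previous entry's hash together with this entry's content."""
    canonical = json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_entry(chain: list, content: dict) -> dict:
    """Append a new entry linked to the current tail of the chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    entry = {
        "content": content,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, content),
    }
    chain.append(entry)
    return entry


def first_broken_index(chain: list, start: int = 0):
    """Recalculate hashes forward from a checkpoint.

    `start` is the index of the first entry to verify; the entry before it
    is treated as the known-good checkpoint. Returns the index of the first
    entry that fails verification, or None if the chain is intact.
    """
    prev_hash = chain[start - 1]["hash"] if start > 0 else GENESIS_HASH
    for i in range(start, len(chain)):
        entry = chain[i]
        if entry["prev_hash"] != prev_hash:
            return i
        if entry["hash"] != entry_hash(prev_hash, entry["content"]):
            return i
        prev_hash = entry["hash"]
    return None
```

If someone edits the content of an earlier entry without recomputing every later hash, first_broken_index reports the position of the altered entry. If they do recompute the later hashes, the tail hash no longer matches any externally stored checkpoint, which is why checkpoints should be kept somewhere the log writers cannot reach.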

We're not claiming this is equivalent to a blockchain or a formal cryptographic audit system — it isn't. It's a practical tamper-evidence mechanism that satisfies the documentation requirements we've encountered in Japanese enterprise environments, where the concern is usually internal modification by IT staff rather than sophisticated external attacks.

Structured Reasoning vs. Log Verbosity

There's a tension in decision logging between completeness and usability. A decision log that records every intermediate computation is complete but unreadable. A finance manager who needs to understand why a PO was flagged doesn't want to read a JSON document with 400 fields — they want a clear explanation of which policy rule was triggered and why.

We separate the technical decision record from the human-readable explanation. The technical record goes into the append-only log in full fidelity — every field, every rule evaluation, the complete input payload. Separately, the agent generates a structured explanation in plain language: "This purchase order was flagged because the vendor tax registration number does not match the vendor master record. The discrepancy was found during vendor identity verification step 3 of the procurement policy (version 2.1, effective 2025-04-01)."

The plain language explanation is generated from the technical record — it's a rendering, not a separate artifact. This matters because the explanation must always be derivable from the logged data. If the explanation and the technical record can diverge, the explanation becomes unreliable as an audit document.
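
One way to keep the explanation derivable is to make the renderer a pure function of the logged record, with no access to any other state. The sketch below assumes a hypothetical record layout (the keys subject, finding, step, policy_name, and policy_effective_from are invented for the example) and reproduces the explanation quoted above.

```python
def render_explanation(record: dict) -> str:
    """Render a plain-language explanation using only fields in the logged record."""
    determinative = [e for e in record["evaluations"] if e["determinative"]]
    if not determinative:
        return (
            f"This {record['subject']} was {record['decision']}; "
            "no single rule was determinative."
        )
    rule = determinative[0]
    return (
        f"This {record['subject']} was {record['decision']} because {rule['finding']}. "
        f"The discrepancy was found during {rule['step']} of the {record['policy_name']} "
        f"(version {record['rule_set_version']}, effective {record['policy_effective_from']})."
    )


# Hypothetical logged record; field names are assumptions for this sketch.
logged_record = {
    "subject": "purchase order",
    "decision": "flagged",
    "policy_name": "procurement policy",
    "rule_set_version": "2.1",
    "policy_effective_from": "2025-04-01",
    "evaluations": [
        {
            "rule_id": "vendor_identity_step_3",
            "step": "vendor identity verification step 3",
            "outcome": "matched",
            "determinative": True,
            "finding": (
                "the vendor tax registration number does not match "
                "the vendor master record"
            ),
        }
    ],
}

explanation = render_explanation(logged_record)
```

Because the function reads nothing except the record, re-rendering the same log entry a year later yields the same text, and any disagreement between a stored explanation and a fresh rendering is itself a signal worth investigating.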

Time and Timezone Handling

This is a detail that seems minor but causes real problems. Enterprise systems, especially in Japan, often operate across multiple time references: server time in JST, ERP time in JST, some older systems that store UTC without labeling it, and occasionally systems installed by overseas IT departments that record in UTC+9 without the offset marker.

We store all decision log timestamps in UTC with explicit timezone notation (ISO 8601 with offset: 2025-08-14T01:42:00+00:00). The display layer converts to JST for local users. When we ingest timestamps from ERP events — which might arrive in various formats — we normalize to UTC at ingestion time and log both the original timestamp and the normalized UTC timestamp so there's no ambiguity about what the source system reported versus what we recorded.
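
A minimal sketch of that normalization step, using only the Python standard library. The function name, the returned keys, and the rule that an unlabeled timestamp is interpreted as JST are assumptions for the example; in practice the assumed zone has to be configured per source system, and the fact that an assumption was made should be logged alongside the result.

```python
from datetime import datetime, timedelta, timezone

# JST has a fixed +09:00 offset with no daylight saving time.
JST = timezone(timedelta(hours=9), "JST")


def normalize_timestamp(raw: str, assumed_tz: timezone = JST) -> dict:
    """Normalize a source timestamp to UTC, keeping the original value.

    If the source timestamp carries no offset, it is interpreted in
    `assumed_tz`, and that assumption is recorded in the result.
    """
    parsed = datetime.fromisoformat(raw)
    offset_assumed = parsed.tzinfo is None
    if offset_assumed:
        parsed = parsed.replace(tzinfo=assumed_tz)
    normalized = parsed.astimezone(timezone.utc)
    return {
        "source_timestamp": raw,                    # exactly what the source reported
        "normalized_utc": normalized.isoformat(),   # what we record
        "offset_assumed": offset_assumed,
    }


unlabeled = normalize_timestamp("2025-08-14T10:42:00")
labeled = normalize_timestamp("2025-08-14T10:42:00+09:00")
```

Both calls normalize to 2025-08-14T01:42:00+00:00, but only the first is marked as resting on an assumed offset, which is the distinction an auditor needs when reconciling the two systems.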

This has saved us from audit confusion on at least one occasion where a decision appeared to have happened "before" the request it was responding to, purely because the request timestamp from the ERP was in JST without offset notation and our log was in UTC.

What Auditors Actually Ask For

We've had decision logs reviewed in the context of internal audit exercises at a few of the organizations we work with. The questions that come up repeatedly are:

What policy was in effect when this decision was made? This is why versioning matters. The answer needs to be: "Policy version 2.1, which was active from 2025-04-01 to 2025-09-30." If the log just says "current policy," that answer is useless after the policy changes.

Who or what had authority to make this decision? Enterprise governance frameworks require that authority be traceable. For agent decisions, the answer is: "Agent instance dodo-procurement-01, acting under configuration approved by [named IT administrator] on [date], executing policy authorized by [named finance policy owner] in policy document [reference]." The chain of authorization has to be explicit.

What data did the system use? If an agent approved a vendor payment and the vendor turns out to be fraudulent, auditors will ask what vendor verification data the agent saw. The answer has to come from the log, not from reconstructed memory of what the system probably saw.

Was any human involved? For high-value decisions above a threshold, many organizations require human-in-the-loop confirmation. The log needs to clearly show whether a human reviewed and confirmed, or whether the decision was fully autonomous. If a decision was supposed to involve human confirmation but didn't — because of a configuration error, say — that gap should be visible in the log.
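
Two of these questions translate directly into queries over logged data. The sketch below is illustrative only: the record fields, the ¥5,000,000 confirmation threshold, and policy versions 2.0 and 2.2 are invented for the example (only version 2.1 and its dates come from the answer quoted above). The version stored in the decision record remains the authoritative answer; the date lookup is useful as a cross-check against the policy history.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class PolicyVersion:
    version: str
    effective_from: date
    effective_to: Optional[date]  # None means the version is still active


def version_in_effect(history: list, on: date) -> Optional[PolicyVersion]:
    """Return the policy version whose effective range contains `on`."""
    for v in history:
        if v.effective_from <= on and (v.effective_to is None or on <= v.effective_to):
            return v
    return None


def find_confirmation_gaps(records: list, threshold_jpy: int) -> list:
    """Return ids of decisions above the threshold with no logged human confirmation."""
    gaps = []
    for r in records:
        requires_human = r["amount_jpy"] > threshold_jpy
        if requires_human and r.get("human_confirmation") is None:
            gaps.append(r["decision_id"])
    return gaps


HISTORY = [
    PolicyVersion("2.0", date(2024, 10, 1), date(2025, 3, 31)),  # hypothetical
    PolicyVersion("2.1", date(2025, 4, 1), date(2025, 9, 30)),
    PolicyVersion("2.2", date(2025, 10, 1), None),               # hypothetical
]

RECORDS = [
    {"decision_id": "d-001", "amount_jpy": 420_000, "human_confirmation": None},
    {
        "decision_id": "d-002",
        "amount_jpy": 8_000_000,
        "human_confirmation": {"reviewer": "reviewer-a", "confirmed_at": "2025-08-14T02:10:00+00:00"},
    },
    {"decision_id": "d-003", "amount_jpy": 9_500_000, "human_confirmation": None},
]
```

Here version_in_effect(HISTORY, date(2025, 8, 14)) resolves to version 2.1, and find_confirmation_gaps(RECORDS, 5_000_000) surfaces d-003 as a decision that should have had a human reviewer but has none on record.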

When the Log Isn't Enough

We want to be clear about one limitation: a decision log, no matter how complete, doesn't validate that the underlying policy was well-designed. If the policy says "approve all purchase orders under ¥1,000,000" and someone submits a series of ¥999,000 orders that collectively amount to a bypass of a higher approval threshold, the agent will have complied with the policy correctly — and the log will show that compliance accurately. The log is evidence of what the agent did. It doesn't guarantee the policy was sensible.

That's a policy design problem, not a logging problem. The appropriate safeguard is anomaly detection layered above the decision log — flagging statistical patterns that suggest policy circumvention — rather than trying to make the logging system smarter. These are different concerns and should be addressed separately.
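
As a simple illustration of what such a layer might look for, the sketch below scans decision records for order splitting: several orders from the same requester to the same vendor, each under the approval threshold, that together reach it within a short window. The field names, the 7-day window, and the grouping key are assumptions; a real detector would need tuning against actual purchasing patterns to avoid flagging legitimate repeat orders.

```python
from collections import defaultdict
from datetime import date, timedelta


def find_split_order_patterns(orders: list, threshold_jpy: int, window_days: int = 7) -> list:
    """Flag requester/vendor pairs whose sub-threshold orders sum to the threshold
    or more within a sliding window of `window_days`."""
    by_key = defaultdict(list)
    for o in orders:
        if o["amount_jpy"] < threshold_jpy:  # only orders that avoided the threshold
            by_key[(o["requester"], o["vendor"])].append(o)

    window = timedelta(days=window_days)
    flagged = []
    for (requester, vendor), group in by_key.items():
        group.sort(key=lambda o: o["date"])
        start = 0
        total = 0
        for end, order in enumerate(group):
            total += order["amount_jpy"]
            # Drop orders that fall outside the window ending at this order.
            while order["date"] - group[start]["date"] > window:
                total -= group[start]["amount_jpy"]
                start += 1
            if end > start and total >= threshold_jpy:
                flagged.append({
                    "requester": requester,
                    "vendor": vendor,
                    "total_jpy": total,
                    "order_count": end - start + 1,
                })
                break  # one flag per requester/vendor pair is enough
    return flagged


# Invented sample data for the sketch.
ORDERS = [
    {"requester": "u-17", "vendor": "v-204", "amount_jpy": 999_000, "date": date(2025, 8, 11)},
    {"requester": "u-17", "vendor": "v-204", "amount_jpy": 999_000, "date": date(2025, 8, 12)},
    {"requester": "u-17", "vendor": "v-204", "amount_jpy": 999_000, "date": date(2025, 8, 13)},
    {"requester": "u-30", "vendor": "v-118", "amount_jpy": 500_000, "date": date(2025, 8, 12)},
]

flagged = find_split_order_patterns(ORDERS, threshold_jpy=1_000_000)
```

The detector reads from the decision log but lives outside it, which keeps the log a plain record of what happened and leaves pattern interpretation to a separate component.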

Decision logging gives you a complete, auditable record of what your AI agents decided and why. It's a necessary foundation for any AI system operating in a regulated or high-accountability environment. But it's a foundation, not a complete governance solution. The organizations we work with who get the most value from the logging infrastructure are the ones who treat it as one layer in a broader accountability stack — not as the stack itself.
