What is an AI risk assessment template?

An AI risk assessment template is a structured framework for evaluating the risk profile of an AI system before deployment. It scores the system across risk dimensions, produces a composite risk score, and recommends controls or oversight levels proportionate to that score. An effective template is evidence-based and produces specific, actionable outputs, not a generic risk category.

What dimensions should an AI risk assessment cover?

An AI risk assessment should cover data sensitivity, decision reversibility, audience vulnerability, refusal coverage, HITL maturity, model transparency, regulatory exposure, third-party dependency, drift and degradation risk, scale and reach, adversarial exposure, and incident history. Each dimension should be scored independently on a defined scale before the composite is calculated.

How is an AI composite risk score calculated?

A composite AI risk score is the mean of scores across all assessed dimensions, each rated on the same scale (typically 1–5, with 5 being highest risk). The composite determines the HITL placement tier: Tier 1 for low composite scores, Tier 3 for high. Dimensions with no evidence are excluded from the mean; they are documented as gaps and trigger mandatory controls regardless of the composite.

What is the difference between an AI risk assessment and an AI audit?

An AI risk assessment is a prospective evaluation: it estimates the risk of deploying a system and determines what controls are required. An AI audit is retrospective: it verifies that deployed controls are working as documented. Both are necessary; they serve different governance functions. The risk assessment is done before deployment; the audit is done after, on a defined cadence.

What HITL placement does a high AI risk score imply?

A composite risk score above 3.5 implies Tier 3 HITL placement: sequential human approval for every consequential output. In regulated contexts, a score above 4.5 implies dual sign-off: two qualified reviewers must approve independently. These thresholds are not universal; each organisation should calibrate them against its risk appetite.

Scoring AI Risk Without a Spreadsheet

Most AI risk assessment templates are spreadsheets. They are filled in once, at deployment time, by the team that built the system. The scores are optimistic. The composite is acceptable. The spreadsheet is filed. Six months later, the system has changed, the data distribution has shifted, and the risk profile looks nothing like the original assessment. The spreadsheet says the system is safe. The governance team has no way to know whether that is still true.

Key takeaways

Twelve dimensions: A complete AI risk assessment covers twelve dimensions: data sensitivity, decision reversibility, audience vulnerability, refusal coverage, HITL maturity, model transparency, regulatory exposure, third-party dependency, drift risk, scale and reach, adversarial exposure, and incident history.
Composite to HITL tier: The composite risk score maps directly to a HITL placement tier — low composite scores require automated monitoring only, high composite scores require sequential human approval for every consequential output.
Scores must be evidence-based: Risk scores are not estimates. Each score level requires specific evidence: a logged incident is evidence for the incident history dimension; the absence of a model registry is evidence for drift risk.
Re-score on change: A risk assessment is valid for the system as it was assessed. Any material change to training data, use case, or deployment environment requires a reassessment of the affected dimensions.
The spreadsheet is not the problem: The problem is using a spreadsheet as a one-time exercise. The scoring model can live in any tool; what matters is that scores are updated as the system changes and that the composite drives governance decisions.
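
One way to make the "scores must be evidence-based" takeaway concrete is to refuse to record a score that has no evidence attached. The sketch below is illustrative only: the `DimensionScore` class and its field names are hypothetical, not part of any published template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """A single dimension score that cannot exist without evidence.

    Hypothetical structure: the class and field names are assumptions.
    """
    dimension: str
    score: int                 # 1-5 scale
    evidence: tuple[str, ...]  # e.g. incident ticket IDs, registry links

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError(f"{self.dimension}: score must be between 1 and 5")
        # Reject empty or whitespace-only evidence entries.
        if not any(item.strip() for item in self.evidence):
            raise ValueError(f"{self.dimension}: a score needs at least one piece of evidence")


# A logged incident backs the incident history score.
incident = DimensionScore("incident_history", 4, ("logged incident, 2 affected users",))
```

Constructing a score with an empty `evidence` tuple raises `ValueError`, so an unsupported score has to be recorded as a gap rather than a number.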

Why Static Risk Assessments Fail

A risk assessment that is done once, at deployment, captures the risk profile of the system on a single day. AI systems are not static. The data they process changes, the use cases they are applied to expand, the audiences they affect grow, and the regulatory environment around them evolves. A static assessment becomes misleading almost immediately after it is completed.

The deeper problem is that static assessments are often completed by the team with the strongest incentive to produce an acceptable score. When the deployment team scores the system they built, the assessment is not independent. In practice, the dimensions most consistently underscored are the ones that would require additional controls: HITL maturity, refusal coverage, and adversarial exposure.

The Twelve Risk Dimensions

Figure. The twelve dimensions provide independent signals. A system can score low on data sensitivity but high on adversarial exposure; both inform the composite.

Data sensitivity scores how sensitive the data the system processes is. A system processing public data scores 1. A system processing health data, biometric data, or legally privileged information scores 5. This is the most legible dimension; it is also the most commonly gamed, because teams define data sensitivity relative to their own tolerance rather than the data subject's exposure.

Decision reversibility scores how easily the decisions triggered by this system's outputs can be undone. A system that produces internal summaries scores 1 — the reader can discard the summary. A system that triggers financial transactions or legal filings scores 5. This dimension interacts strongly with audience vulnerability: an irreversible decision affecting a vulnerable individual is the highest-risk combination on the worksheet.

Refusal coverage scores how well-defined the system's refusal conditions are. A system with no documented refusal conditions scores 1, regardless of how well the system performs in practice. A system with documented, tested, and automatically enforced refusal conditions scores 4 or 5. This dimension is a direct measure of governance maturity, not model capability.

HITL maturity scores the robustness of the human oversight mechanism. A score of 1 means no human review of outputs. A score of 5 means review is logged, reviewer qualifications are verified, SLAs are tracked, and the review cannot be bypassed under time pressure. This dimension is the most operationally significant; it is also the most commonly overstated.

Adversarial exposure scores how exposed the system is to adversarial or malicious inputs. A fully internal system that processes only internal data scores 1. A system directly accessible to members of the public with a financial or reputational motivation to manipulate it scores 5. Prompt injection is the primary adversarial risk for language models; most organisations have not assessed it systematically.

Computing the Composite and Mapping to HITL Tier

The composite risk score is the mean of the dimension scores that have been assessed. Dimensions that have not been assessed are excluded from the mean, documented as gaps, and trigger mandatory controls regardless of the composite: an unassessed dimension is a risk, not a neutral score.
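
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the worksheet's implementation: the snake_case dimension keys and the `composite_score` function are assumptions, and every score passed in is assumed to be oriented so that 5 means highest risk.

```python
from typing import Optional

# Dimension names follow the article; the snake_case keys are an assumption.
DIMENSIONS = (
    "data_sensitivity", "decision_reversibility", "audience_vulnerability",
    "refusal_coverage", "hitl_maturity", "model_transparency",
    "regulatory_exposure", "third_party_dependency", "drift_risk",
    "scale_and_reach", "adversarial_exposure", "incident_history",
)


def composite_score(scores: dict[str, Optional[int]]) -> tuple[float, list[str]]:
    """Return (mean of assessed scores, list of unassessed dimensions).

    Assumes each score is already oriented so that 5 is highest risk.
    """
    assessed: list[int] = []
    gaps: list[str] = []
    for name in DIMENSIONS:
        value = scores.get(name)  # a missing key or None means "not assessed"
        if value is None:
            gaps.append(name)     # excluded from the mean, reported as a gap
        elif 1 <= value <= 5:
            assessed.append(value)
        else:
            raise ValueError(f"{name}: score {value} is outside the 1-5 scale")
    if not assessed:
        raise ValueError("no dimension has been assessed")
    return sum(assessed) / len(assessed), gaps


composite, gaps = composite_score(
    {"data_sensitivity": 5, "decision_reversibility": 3, "adversarial_exposure": 1}
)
print(composite, len(gaps))  # 3.0 9
```

Returning the gaps alongside the mean keeps the two outputs together, so a low composite built on three assessed dimensions cannot be read without also seeing the nine that are missing.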

A composite score of 1.5 or below maps to Tier 1 oversight: automated monitoring, no individual output gate. A score above 1.5 and up to 2.5 maps to Tier 2 sample review: a random sample of outputs is reviewed weekly. A score above 2.5 and up to 3.5 maps to Tier 2 gate review: all outputs above a risk threshold are gated before action. A score above 3.5 maps to Tier 3: sequential human approval for every consequential output. In regulated contexts, a score above 4.5 implies dual sign-off: two qualified reviewers must approve independently.
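
The thresholds in this paragraph translate into a simple lookup. The sketch below assumes each band includes its upper bound (so 2.5 is sample review and 3.5 is gate review); the `hitl_tier` function name, the `regulated` flag, and the returned labels are hypothetical.

```python
def hitl_tier(composite: float, regulated: bool = False) -> str:
    """Map a composite risk score (1-5) to an oversight tier.

    Thresholds are the ones quoted in the text; treating each band's upper
    bound as inclusive is an assumption. Organisations should calibrate
    these values against their own risk appetite.
    """
    if not 1.0 <= composite <= 5.0:
        raise ValueError("composite must be between 1 and 5")
    if composite <= 1.5:
        return "Tier 1: automated monitoring"
    if composite <= 2.5:
        return "Tier 2: sample review"
    if composite <= 3.5:
        return "Tier 2: gate review"
    if regulated and composite > 4.5:
        return "Tier 3: dual sign-off"
    return "Tier 3: sequential approval"


print(hitl_tier(2.0))                  # Tier 2: sample review
print(hitl_tier(4.6, regulated=True))  # Tier 3: dual sign-off
```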

Direct answer

The composite score determines the minimum oversight level. Individual dimension scores can raise it. A system scoring 2.0 on the composite but 5 on decision reversibility should be treated as Tier 3 for the specific outputs that trigger irreversible decisions, even if those outputs are a minority.

This is the critical insight that spreadsheet-based assessments miss. Risk is not uniform across a system's outputs. A composite score applies to the system's typical output profile. High-scoring individual dimensions flag the exceptional outputs that require elevated oversight even when the composite is moderate.
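
A rough sketch of that per-output escalation follows. It assumes outputs can be tagged with the dimensions they touch and that a dimension score of 5 is the trigger for Tier 3; both the tagging scheme and the `trigger` default are assumptions made for illustration, as are the function names.

```python
def elevated_dimensions(scores: dict[str, int], trigger: int = 5) -> list[str]:
    """Dimensions whose individual score reaches the escalation trigger.

    The trigger value of 5 mirrors the example in the text; it is an
    assumption, not a published threshold.
    """
    return sorted(name for name, score in scores.items() if score >= trigger)


def effective_tier(composite_tier: int, output_dimensions: set[str],
                   elevated: list[str]) -> int:
    """Tier for one output: the composite tier is the floor, and any
    elevated dimension the output touches raises it to Tier 3."""
    if any(name in elevated for name in output_dimensions):
        return 3
    return composite_tier


scores = {"decision_reversibility": 5, "data_sensitivity": 2, "scale_and_reach": 1}
elevated = elevated_dimensions(scores)

# A system at Tier 2 overall: outputs that trigger irreversible decisions
# are handled at Tier 3, everything else stays at Tier 2.
print(effective_tier(2, {"decision_reversibility"}, elevated))  # 3
print(effective_tier(2, {"data_sensitivity"}, elevated))        # 2
```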

When to Reassess

A risk assessment is valid for the system as it was assessed. Reassessment is required when any of the following changes: the training data, the deployment environment, the audience (from internal to external, or from general public to vulnerable population), the use case (from advisory to decision-triggering), the regulatory exposure, or the incident history (a new incident resets the incident history dimension to a higher score).
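
The triggers listed above can be turned into a lookup that says which dimensions to re-score after a change. The change types come from the text, but the mapping from each change to specific dimensions is an illustrative assumption, as are the `AFFECTED_DIMENSIONS` and `dimensions_to_rescore` names; a real mapping would be set by the governance team.

```python
# Hypothetical mapping from a change type to the dimensions it most plausibly
# affects. The change types come from the text; the mapping is an assumption.
AFFECTED_DIMENSIONS: dict[str, set[str]] = {
    "training_data": {"data_sensitivity", "drift_risk"},
    "deployment_environment": {"adversarial_exposure", "third_party_dependency"},
    "audience": {"audience_vulnerability", "scale_and_reach"},
    "use_case": {"decision_reversibility", "refusal_coverage", "hitl_maturity"},
    "regulatory_exposure": {"regulatory_exposure"},
    "incident_history": {"incident_history"},
}


def dimensions_to_rescore(changes: list[str]) -> list[str]:
    """Return the sorted set of dimensions made stale by the given changes."""
    stale: set[str] = set()
    for change in changes:
        if change not in AFFECTED_DIMENSIONS:
            raise ValueError(f"unknown change type: {change}")
        stale |= AFFECTED_DIMENSIONS[change]
    return sorted(stale)


# The system moved from internal to external users and logged a new incident.
print(dimensions_to_rescore(["audience", "incident_history"]))
# ['audience_vulnerability', 'incident_history', 'scale_and_reach']
```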

In practice, organisations that run quarterly reassessments catch drift before it becomes incident-level. The reassessment cadence is the operational expression of the NIST AI RMF Manage function's requirement that risk management actions be sustained over time, not applied once at deployment.

NIST AI RMF 1.0, MANAGE 1.4: "Negative residual risks (defined as the sum of all unmitigated risks) to both downstream acquirers of AI systems and end users are documented." A risk assessment that is not updated does not document residual risk; it documents historical risk. Only a maintained assessment satisfies this outcome.

For teams running the assessment directly, the AI Risk Worksheet covers all twelve dimensions with per-level descriptors and produces a composite score and HITL placement recommendation automatically. It supports multiple systems in the same session and exports to CSV for integration with enterprise risk registers.
