
AI Guardrails: How to Build Safety Boundaries That Actually Work

AI guardrails come in four types: input, output, behavioural, and policy. Where each fits in the request lifecycle, how they fail, and how to test them.

By Mudassir Khan, CEO of Cube A Cloud
Reading time: 9 min
Figure. Editorial cover: AI Guardrails, As Contracts. Four types, where each fits in the request lifecycle, and how they fail.

AI guardrails are the runtime expression of a system's refusal contract. They are what actually catches the input that should not be answered, the output that should not be returned, the tool call that should not be made, and the request that the system has no authority to commit. The phrase is now used loosely enough that two teams discussing it can mean different things; this guide names the four guardrail types, shows where each one fits in the request lifecycle, and walks through the failure modes a real evaluation will expose.

Key takeaways

  • **A boundary is a contract before it is code:** AI guardrails are the runtime enforcement of a refusal contract that was written down before any model was wired in.
  • **Four types, four jobs:** Input guardrails screen what enters, output guardrails screen what leaves, behavioural guardrails constrain how the model operates between input and output, and policy guardrails express the rules of the principal.
  • **Place matters as much as type:** A behavioural rule applied at the input stage and an input filter applied at the output stage both produce silent failures that look fine in logs.
  • **They will fail:** The OWASP [Top 10 for Large Language Model Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications/) tracks the dominant failure classes; indirect prompt injection alone breaks naive setups year after year.
  • **A refusal is the success state, not the failure:** A guardrail that prevents a wrong commitment by refusing and logging is operating correctly; a guardrail that produces a silent allow is broken even when nothing visible has gone wrong.

A working definition

A guardrail is a deterministic check, applied at a named point in the request lifecycle, that either lets the request continue, modifies it, or refuses it with an entry in the audit trail. Three clauses inside that sentence carry the load. First, deterministic: a guardrail whose behaviour drifts with the model is not a guardrail; it is a hope. Second, named point: the guardrail's place in the lifecycle is documented and the same on every request. Third, refuses with an entry in the audit trail: a silent block is not a guardrail outcome, because nothing downstream can replay or appeal it.
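The definition can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than a reference implementation: the `Decision` record, the `AUDIT_TRAIL` list, and the length rule are hypothetical names chosen to show the three outcomes (continue, modify, refuse) and the audit entry that accompanies them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    MODIFY = "modify"
    REFUSE = "refuse"


@dataclass(frozen=True)
class Decision:
    action: Action
    layer: str                     # the named point in the lifecycle, e.g. "input"
    rule: str                      # identifier of the rule that decided
    reason: str
    payload: Optional[str] = None  # the request text that continues, if any


# Hypothetical in-memory audit trail; a real system would use durable storage.
AUDIT_TRAIL: list[Decision] = []


def max_length_guardrail(text: str, limit: int = 200) -> Decision:
    """Deterministic check: the same input always yields the same decision."""
    if len(text) > 4 * limit:
        decision = Decision(Action.REFUSE, "input", "max_length", "input far exceeds limit")
    elif len(text) > limit:
        decision = Decision(Action.MODIFY, "input", "max_length", "input truncated", text[:limit])
    else:
        decision = Decision(Action.ALLOW, "input", "max_length", "within limit", text)
    # Every outcome, including a refusal, leaves an entry that can be replayed.
    AUDIT_TRAIL.append(decision)
    return decision
```

The rule itself is trivial on purpose. The point of the sketch is the shape: no model call inside the check, a fixed layer name, and no path that returns without writing to the trail.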

NIST's Generative AI Profile (NIST AI 600-1) treats this layer as one of the measurement and management techniques for generative systems. Vendor toolkits, including NVIDIA NeMo Guardrails, implement the layer as configurable policy that sits in front of and behind the model. The toolkits differ in how they do this; the four types described below apply regardless of which one is used.

The four guardrail types

A serious deployment runs all four. A deployment with only one or two of the four typically catches the obvious failures and silently allows the subtle ones.

Figure. Four guardrail types: input (topic, injection, PII, authority), output (schema, factuality, leakage), behavioural (system prompt, tool permissions, iteration limits), and policy (authority gates, refusal conditions, regulation)

Input guardrails. Applied to what enters the system before any model is invoked. The work covers content classification (is this request inside the permitted topic set), prompt-injection detection (is this input attempting to override the system instructions), PII detection (does this input contain data that should not be sent to a third-party model), and authority checks (is the requester allowed to ask this question). Input guardrails catch the largest volume of obvious failures and the smallest volume of subtle ones.
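A rough sketch of the four input checks, run before any model call. The topic set, the regular expressions, and the role names are all assumptions made for illustration; production systems typically use trained classifiers rather than keyword and pattern matching, which is exactly why this layer misses the subtle cases.

```python
import re
from typing import NamedTuple, Optional


class Refusal(NamedTuple):
    rule: str
    reason: str


# All patterns, topics, and role names below are illustrative assumptions.
PERMITTED_TOPICS = {"invoice", "refund", "shipping"}
INJECTION_PATTERN = re.compile(
    r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE
)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
ALLOWED_ROLES = {"customer", "agent"}


def check_input(text: str, role: str) -> Optional[Refusal]:
    """Return a Refusal if any input rule fires, or None if the request may proceed."""
    if role not in ALLOWED_ROLES:
        return Refusal("authority", f"role {role!r} may not ask this")
    if INJECTION_PATTERN.search(text):
        return Refusal("prompt_injection", "input attempts to override instructions")
    if EMAIL_PATTERN.search(text):
        return Refusal("pii", "input contains an email address")
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words & PERMITTED_TOPICS:
        return Refusal("topic", "request is outside the permitted topic set")
    return None
```

A call such as `check_input("Where is my refund?", "customer")` passes, while the same text from an unlisted role is refused under the `authority` rule before any other check runs.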

Output guardrails. Applied to what the model produces before it leaves the system. The work covers structured-output validation (does the output match the schema), factuality checks against retrieval-augmented sources, leakage detection (does the output contain data that should not be returned), and tone or content classification on the generated text. Output guardrails are how a system catches its own hallucinations and prevents the data exfiltration that an input guardrail missed.
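As a sketch of the output side, the function below validates a candidate against a schema, scans it for leaked secrets, and applies a crude stand-in for a factuality check. The field names, the key-shaped pattern, and the "must cite a source" rule are hypothetical; a real factuality check would compare the answer against the retrieved passages.

```python
import json
import re

# Hypothetical schema: required field names mapped to expected Python types.
REQUIRED_FIELDS = {"answer": str, "sources": list}
# Illustrative leakage pattern: strings shaped like an API key.
SECRET_PATTERN = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")


def check_output(raw: str) -> list[str]:
    """Return the violated rules; an empty list means the output may leave."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema:not_json"]
    if not isinstance(data, dict):
        return ["schema:not_object"]
    violations = []
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            violations.append(f"schema:{field}")
    if SECRET_PATTERN.search(raw):
        violations.append("leakage:secret")
    # Crude stand-in for a factuality check: an answer must cite a source.
    if isinstance(data.get("sources"), list) and not data["sources"]:
        violations.append("factuality:no_sources")
    return violations
```

Returning every violated rule, rather than stopping at the first, gives the audit trail a fuller picture of why a candidate was held back.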

Behavioural guardrails. Applied during the model's operation, not at the boundary. The work covers system-prompt structure, tool-call permissions, max-iteration limits on agentic loops, and any policy that constrains how the model gets from input to output. A model that has permission to use a tool is one behavioural guardrail away from a model that uses the tool incorrectly.
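Two of these constraints, tool-call permissions and iteration limits, can be sketched as a wrapper around the agent loop. The `next_step` callable is a hypothetical stand-in for the model, and the tool names and the limit of three steps are assumptions chosen for the example.

```python
from typing import Callable, Optional

# Hypothetical tool allowlist and loop budget for one deployment.
ALLOWED_TOOLS = {"search_orders", "get_invoice"}
MAX_ITERATIONS = 3


class BehaviouralRefusal(Exception):
    """Raised when the loop is stopped by a behavioural rule."""


def run_agent(next_step: Callable[[int], Optional[str]]) -> list[str]:
    """Run a tool-calling loop under a tool allowlist and an iteration cap.

    `next_step` stands in for the model: given the step index it returns the
    name of the tool it wants to call, or None when it is finished.
    """
    calls: list[str] = []
    for step in range(MAX_ITERATIONS):
        tool = next_step(step)
        if tool is None:
            return calls
        if tool not in ALLOWED_TOOLS:
            raise BehaviouralRefusal(f"tool_permission: {tool!r} is not allowed")
        calls.append(tool)
    raise BehaviouralRefusal(f"max_iterations: no result after {MAX_ITERATIONS} steps")
```

Both limits live in the loop, outside the model, so a prompt that persuades the model to request a forbidden tool still cannot make the call happen.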

Policy guardrails. Applied to the request as a commitment, not as text. The work covers regulatory and contractual rules, authority gates ("this commitment requires a named approver"), and refusal conditions that the system must enforce regardless of what the model produces. Policy guardrails are the runtime expression of the refusal-first philosophy described in why AI refusal matters, and they are the layer most often skipped in chatbot deployments that later need to operate inside regulated workflows.
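An authority gate of the kind quoted above might look like the following sketch. The `Commitment` record, the permitted kinds, and the 500 threshold are hypothetical; the point is that the gate inspects the structured commitment and never the model's wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commitment:
    kind: str                # e.g. "refund"
    amount: float
    approver: Optional[str]  # named human approver, if any


# Hypothetical rules of the principal.
PERMITTED_KINDS = {"refund", "credit_note"}
APPROVAL_THRESHOLD = 500.0


def policy_gate(c: Commitment) -> tuple[bool, str]:
    """Evaluate the commitment itself, regardless of how the model phrased it."""
    if c.kind not in PERMITTED_KINDS:
        return False, "policy:kind_not_permitted"
    if c.amount > APPROVAL_THRESHOLD and not c.approver:
        return False, "policy:named_approver_required"
    return True, "policy:ok"
```

Under these assumed rules, a 900 refund with no approver is refused with `policy:named_approver_required`, and the same refund with a named approver passes.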

Where each guardrail fits in the lifecycle

The four types are not interchangeable, and they are not independent. A request walks through them in a known order, and where each one runs is part of the boundary's design.

Figure. A typical request lifecycle: request, input guardrail, model with behavioural guardrails, output guardrail, policy gate. A failure at any layer branches to a refusal logged with reason, layer, and rule.

A request arrives. Input guardrails decide whether the model should ever see it. If the request passes, the model executes inside the behavioural guardrails, which constrain its reasoning loop and its tool calls. The model produces a candidate output, which the output guardrails check. The candidate output, if it would result in a commitment, then runs through the policy guardrails, which evaluate the commitment against the principal's rules. Only a request that passes all four layers becomes a committed outcome, and every request that fails any layer is logged as a refusal with the reason, the layer, and the rule.
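The walk described above can be sketched as a small pipeline. This is one possible arrangement under stated assumptions: each check is a hypothetical callable that returns `None` to pass or a `(rule, reason)` pair to refuse, `REFUSAL_LOG` stands in for the audit trail, and the `model` callable is assumed to already run inside its behavioural limits.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Refusal:
    layer: str
    rule: str
    reason: str


# A check returns None to pass, or (rule, reason) to refuse.
Check = Callable[[str], Optional[tuple[str, str]]]
REFUSAL_LOG: list[Refusal] = []  # hypothetical stand-in for the audit trail


def run_request(request: str, model: Callable[[str], str],
                input_check: Check, output_check: Check, policy_check: Check):
    """Return the committed output, or the logged Refusal from the first failing layer."""
    def gate(layer: str, check: Check, value: str) -> Optional[Refusal]:
        result = check(value)
        if result is None:
            return None
        refusal = Refusal(layer, *result)  # layer, rule, reason
        REFUSAL_LOG.append(refusal)
        return refusal

    if (r := gate("input", input_check, request)):
        return r  # the model never sees the request
    # Behavioural guardrails are assumed to constrain this call (tools, iterations).
    candidate = model(request)
    if (r := gate("output", output_check, candidate)):
        return r
    if (r := gate("policy", policy_check, candidate)):
        return r
    return candidate
```

Because every refusal passes through `gate`, there is no code path that blocks a request without recording the layer, the rule, and the reason.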

A note on the order. Some teams place policy guardrails earlier in the lifecycle, before the model is even invoked, on the grounds that the cheapest refusal is the one that never spent a token. That works when the policy depends only on the request, not on the candidate output. When the policy depends on what the model is about to commit (a typical case in credit, clinical, and procurement workflows), the policy guardrail has to run after the model has produced a candidate. Both orderings are defensible; what is not defensible is leaving the policy layer out.

Why they fail in practice

Two failure modes dominate. The first is the boundary that depends on the model to enforce itself. Asking a model to refuse to violate its own system prompt is a hope, not a control; the OWASP list places prompt injection at the top of LLM application risks for this reason. The second is the boundary that produces a silent block. A guardrail that returns "I cannot help with that" without an audit-trail entry has produced an unappealable outcome and made the system harder to debug.

Both failure modes are design errors rather than implementation bugs. The remedy is structural: every guardrail is a deterministic check outside the model, every refusal carries the same evidence as a commitment, and the boundary is verified by the pre-deployment test phase that we cover in our companion guide on AI red teaming.

How to test them

A boundary that has not been tested is a hope. Testing covers two questions: does the boundary fire when it should, and does the boundary stay quiet when it should stay quiet. The first is the easier test and the one teams default to. The second is the test that catches the false-positive class that makes a system unusable.

A practical evaluation runs a labelled set of requests across both classes, in production-shaped traffic, and reports the rate at which the boundary correctly fires, the rate at which it incorrectly fires, and the rate at which it fails to fire when it should. The set is refreshed on every model retrain. The OWASP and NIST documents above name the test categories that any serious set has to cover.
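The three rates can be computed from a labelled set in a few lines. This is a minimal sketch; the `(should_fire, did_fire)` pair format and the metric names are assumptions for the example, not a standard reporting schema.

```python
def evaluate(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Each item is (should_fire, did_fire) for one labelled request."""
    should = [did for expected, did in results if expected]
    should_not = [did for expected, did in results if not expected]
    if not should or not should_not:
        raise ValueError("the labelled set must cover both classes")
    correct_fire = sum(should) / len(should)
    return {
        # fired on requests that should have been refused
        "correct_fire_rate": correct_fire,
        # stayed quiet on requests that should have been refused
        "missed_fire_rate": 1.0 - correct_fire,
        # fired on requests that should have passed (false positives)
        "incorrect_fire_rate": sum(should_not) / len(should_not),
    }
```

The function refuses to report on a set that contains only one class, since a suite made up solely of attacks cannot say anything about false positives.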

A practical closing

A team that has been told to "add guardrails" to an existing feature usually finds, on inspection, that the missing layer is policy. Input filters and output checks tend to be in place; the gate that evaluates the candidate output as a commitment is the one that has been skipped. The contract that gate enforces is documented on our published refusal conditions page; once that contract is written down, the boundary becomes a piece of code rather than a meeting topic.

Frequently asked

Questions that surface often.

What are AI guardrails?

A guardrail is a deterministic check, applied at a named point in the request lifecycle, that either lets the request continue, modifies it, or refuses it with an audit-trail entry. The four types are input, output, behavioural, and policy. A serious deployment runs all four. The layer is the runtime enforcement of a refusal contract that was written down before any model was integrated, and a refusal is the success state, not the failure.

What is the difference between guardrails and content moderation?

Content moderation is one input-class and one output-class guardrail. Guardrails as a class extend beyond moderation to cover authority checks, prompt-injection detection, structured-output validation, tool-call permission, and the policy gates that decide whether a candidate output is a permissible commitment. A system with content moderation alone has one corner of the boundary covered and the other three open.

Do AI guardrails prevent hallucinations?

Output-class guardrails reduce hallucinations by validating model output against retrieval sources, schemas, and authoritative references. They do not prevent the model from producing a wrong answer; they prevent that wrong answer from leaving the system as a commitment. The combination of an output guardrail that catches the hallucination and a policy guardrail that refuses the commitment is what keeps the failure inside the audit trail rather than in front of the user.

How do guardrails relate to NIST AI RMF and ISO/IEC 42001?

Both standards treat this layer as a control. NIST AI RMF places the design and verification of these controls inside its Manage and Measure functions; ISO/IEC 42001 requires the controls to be documented, owned, and reviewed as part of the management system. Neither standard prescribes specific guardrail implementations, but both expect the organisation to document which controls exist, where they fire, and how they are tested.

Can a model be its own guardrail?

Not reliably. A model can express a refusal in language, but a model that refuses based only on its own system prompt is one prompt-injection attempt away from doing the thing the system prompt forbade. A boundary that depends on the model to enforce itself is a hope; structured adversarial testing is what reveals the difference between a real boundary and a hopeful one.

Writer

Mudassir Khan

CEO of Cube A Cloud

Writes on decision systems, AI governance, and the operational mechanics of bounded AI in regulated environments.
