AI LLM Security Testing: How to Scope, Test, and Implement Guardrails

This year I had the opportunity to perform security testing on an LLM agent, and at first, I wasn’t sure where to begin. I spent hours researching how the system works and how it should be approached from a security perspective. When you’re under time pressure, you naturally look for the shortest path to understand the system and get meaningful results.
This article summarizes that path.

Understand the Purpose and Scope of the LLM Agent

Start the same way you would with any security review, Asking questions and making a mind map of what exactly the system does:

What is the agent used for?
What tasks does it automate?
What decisions can it make?
What external systems does it interact with?

You cannot test an agent effectively unless you know what it is supposed to do and what it is allowed to touch.

Identify the Model and Its Capabilities

If you are testing this for your enterprise, determine:

Which model is being used
Whether it uses an API, fine-tuned model, or hosted model
What data sources it has access to
What tools or plugins it can call
What the blast radius is if it behaves unexpectedly
What APIs does the agent call?
Does it use MCP or another tool-calling protocol?
Does it execute code, run workflows, or perform actions based on the “user” role?
What system permissions does it inherit?
Are tool calls sandboxed?
How is memory stored and retrieved?
Does the agent read external, user-controlled content (emails, documents, webpages)?

Know the Model’s Strengths and Weaknesses

Several publicly available resources benchmark LLMs against common security evaluations. While these won’t exist for every model, they are useful for orientation.

But, what is Benchmark?

A benchmark is a standardized reference test used to measure, compare, and evaluate performance against a defined baseline under controlled conditions.

Many well know companies provide these benchmarks in their own website against their LLM Models, Ex:

Anthropic: https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf
OpenAI: https://openai.com/safety/evaluations-hub/#jailbreak-evaluations
Third party websites benchmarking known Models: https://promptfoo.dev/models/reports/claude-4-sonnet

The above are just some examples and its always a good idea to read these reports if your company is using such models in the product. These aren’t exhaustive, but they provide a directional sense of how a model handles certain classes of attacks. These benchmarks helps answer the below questions:

Under what conditions does it break?
How often does the model break?
How badly does it break?

Now, this is just the picture from the very top, We need to zoom in better, we want to test our own environment, our own usecase with existing test scenarios – just like a web application security testing or any traditional security testing is done! we need to setup a lab to run against the defined AI LLM benchmarks!

Testing!

During my own research and testing, I came across promptfoo free version, I tested my LLM model usecase with the help of promptfoo – It has easy UI and is user friendly enough to grasp in the first go!

Give this a try locally – https://www.promptfoo.dev/docs/intro/

Promptfoo provides you a free way to automatically test your LLM agent against multiple attack types in LLM. In the next article, I will be talking about how you can perform your own scan.

Note: Results are usually dependent on configuration and coverage; benchmarks may miss zero-day attacks or environment-specific vulnerabilities.

Promptfoo is just example, but there are several other examples:

https://github.com/rapticore/llm-security-benchmark

https://meta-llama.github.io/PurpleLlama/CyberSecEval/docs/user_guide/getting_started

https://github.com/JailbreakBench/jailbreakbench

You can use any tool which you find comfortable, make sure it is suitable for the kind of your usecase.

Evaluating your results:

Once you have the report of what kind of attacks are passed, you can evaluate what risks your LLM Model can pose.

Guardrails?

Guardrails are constraints and control mechanisms designed to keep a system operating within acceptable, safe, and intended boundaries—even when users or inputs try to push it outside those boundaries.

They do not make a system perfect.
They limit damage when failure happens. Guardrails often decides what would happen when LLM breaks.

So far I have seen 3 kinds of guardrails that we can usually put in the system:

Input Guardrails (Before the Model) – These analyze or restrict user input. Examples:
- Prompt filtering
- Keyword detection
- Intent classification
- Context sanitization
- Prompt injection detection
- Notes / Limitations:: Obfuscation can bypass this kind of guardrails. Adding many layers of input checks can introduce latency, especially if each layer calls external classifiers or LLM-based filters. However, a well-optimized filter pipeline (like regex + lightweight ML classifier) usually adds minimal delay.
System Prompt Guardrails (During the Model) – These are instruction-level constraints. Examples:
- Safety system prompts
- Role constraints
- Tool-use rules
- Refusal instructions
- Purpose:
  - Define behavior boundaries
  - Shape refusal style
- Notes / Limitations: They are soft constraints, meaning the model may partially ignore them in some scenarios. In practice, they usually work efficiently without delaying output. However, system prompts can become bulkier, and excessively large prompts may be vulnerable to system prompt injection attacks.
Output Guardrails (After the Model): These inspect and constrain model outputs. Examples:
- Toxicity classifiers
- Regex / pattern checks
- Security content filters
- PII detection
- Policy-based blockers
- Purpose
  - Catch failures
  - Block harmful responses
  - Trigger fallbacks
- Notes/Limitation: They provide some real protections live. Output guardrails are essential but not sufficient.They complement input guardrails, system prompt guardrails, and tool-level constraints. Their effectiveness is a trade-off between safety, latency, and usability. Obfuscation is a risk here as well.

Based on your environment, you can decide which guardrail is suitable to your environment. AWS has guardrails in-built with bedrock, so if you are using AWS environment, you can utilize their inhouse tool, same goes with Azure, they provide defender.

December 13, 2025

Jo

Miscellaneous

Anonhack

AI LLM Security Testing: How to Scope, Test, and Implement Guardrails

Understand the Purpose and Scope of the LLM Agent

Identify the Model and Its Capabilities

Know the Model’s Strengths and Weaknesses

Testing!

Guardrails?

Like this:

Leave a ReplyCancel reply

AI LLM Security Testing: How to Scope, Test, and Implement Guardrails

Understand the Purpose and Scope of the LLM Agent

Identify the Model and Its Capabilities

Know the Model’s Strengths and Weaknesses

Testing!

Guardrails?

Share this:

Like this:

Leave a ReplyCancel reply