Agentic workflows for red teaming: doing more with less

I’ve been using agentic AI workflows in my security testing for a while now, and at this point I can say it’s the single biggest force multiplier I’ve encountered in offensive security. I’m not talking about asking ChatGPT to write you a Python script. I’m talking about orchestrated, multi-agent workflows with defined roles, custom skills, guardrails, and memory. Doing in a week what used to take a small team several weeks.

What I mean by agentic workflows

An agentic workflow is when you have one or more AI agents operating semi-autonomously toward a goal, with the ability to use tools, make decisions, and chain tasks together. Think of it less like a chatbot and more like a junior pentester who can follow instructions, use your tools, take notes, and flag things for your review. Except this junior pentester doesn’t sleep, doesn’t forget what it found three hours ago, and can run multiple threads of investigation simultaneously.

The key components of my setup (a code sketch follows the list):

  • Role definition: each agent has a specific role. One does reconnaissance. One runs targeted tests. One verifies findings and eliminates false positives. One writes up PoCs. You wouldn’t have one person on a team do everything, and the same principle applies here.
  • Custom skills and plugins: agents have access to specific toolsets, scripts, and testing methodologies packaged as skills. These are reusable and can be shared across engagements or injected as needed. If I have a skill for testing JWT implementations, I plug it in. If I need something for API enumeration, same thing.
  • MCP servers: custom Model Context Protocol servers let me give agents structured access to specific tools and data sources without exposing everything. The agent gets the interface it needs and nothing more.
  • Memory: this is where it gets interesting. Agent memory isn’t just for continuity in a conversation. I use it to enforce guardrails, store notes about the target environment, track what’s been tested, and persist context across sessions. If I tell the agent “do not write to any production database” in memory, that instruction carries forward through every task it performs.
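
To make the role-plus-memory idea concrete, here’s a minimal sketch of the pattern. Everything in it is illustrative: the file name, the AgentRole fields, and the prompt format are placeholders rather than any framework’s actual API. The point is that guardrails live in persistent storage and get injected into every task prompt, so they survive across tasks and sessions.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

MEMORY_FILE = Path("engagement_memory.json")  # hypothetical persistent store

@dataclass
class AgentRole:
    name: str
    objective: str
    tools: list[str] = field(default_factory=list)  # scoped allowlist; nothing else is exposed

def standing_instructions() -> list[str]:
    """Guardrails persisted in memory, injected into every task prompt."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text()).get("guardrails", [])
    return []

def remember_guardrail(rule: str) -> None:
    """Persist a rule so it carries forward through every future task and session."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory.setdefault("guardrails", []).append(rule)
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def build_task_prompt(role: AgentRole, task: str) -> str:
    """Every prompt carries the standing rules, regardless of session."""
    rules = "\n".join(f"- {r}" for r in standing_instructions())
    return f"Role: {role.objective}\nStanding rules:\n{rules}\nTask: {task}"

recon = AgentRole("recon", "Reconnaissance only: enumerate, never exploit.",
                  tools=["subfinder", "httpx"])
remember_guardrail("Do not write to any production database.")
print(build_task_prompt(recon, "Enumerate subdomains of example.com"))
```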

Guardrails and operational security

If you’re going to hand an AI agent a set of security testing tools and point it at infrastructure, you need to be deliberate about what it can and can’t do. This isn’t optional, it’s the difference between a useful workflow and an incident.

Read-only in sensitive environments. In production or client environments, agents are restricted to read-only operations. Enumeration, scanning, analysis, yes. Exploitation, modification, active testing, only in designated environments or with explicit approval. Memory-based guardrails enforce this, and the agent’s available tools are scoped accordingly.
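
A sketch of what that gating can look like at the tool boundary. The tool names and their classifications are hypothetical; the idea is that every invocation is classed as read or write, and write-class operations are refused outside designated environments unless explicitly approved.

```python
from enum import Enum

class OpClass(Enum):
    READ = "read"    # enumeration, scanning, analysis
    WRITE = "write"  # exploitation, modification, active testing

# hypothetical classification of every tool an agent can invoke
TOOL_CLASSES = {
    "nmap_scan": OpClass.READ,
    "http_fuzz": OpClass.WRITE,
    "sqlmap_exploit": OpClass.WRITE,
}

DESIGNATED_WRITE_ENVS = {"lab"}  # where active testing is allowed

def authorize(tool: str, environment: str, approved: frozenset = frozenset()) -> None:
    """Refuse write-class operations outside designated environments,
    unless the tool has explicit per-engagement approval."""
    op = TOOL_CLASSES.get(tool)
    if op is None:
        raise PermissionError(f"{tool} is not in this agent's scoped toolset")
    if op is OpClass.WRITE and environment not in DESIGNATED_WRITE_ENVS and tool not in approved:
        raise PermissionError(f"{tool} is write-class; {environment} is read-only")

authorize("nmap_scan", "production")         # read-only enumeration: allowed
authorize("sqlmap_exploit", "lab")           # active testing in the lab: allowed
# authorize("sqlmap_exploit", "production")  # raises PermissionError
```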

Data handling. This is a big one. Depending on the engagement, the data you’re working with may be sensitive (credentials, PII, client infrastructure details). I use providers with strict data logging requirements, or more accurately, providers that don’t log your data. Amazon Bedrock is one option where your prompts and responses aren’t stored or used for training. For more sensitive work, locally hosted models eliminate the data residency question entirely. Know your provider’s data policy before you feed it anything from a client engagement.

Scoping agent capabilities. An agent should only have access to the tools it needs for its specific role. The recon agent doesn’t need exploitation tools. The verification agent doesn’t need the ability to modify anything. This is just principle of least privilege applied to AI tooling, and it’s easy to get lazy about it.
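
In code, least privilege mostly means the out-of-scope tools are never registered for that role in the first place. A sketch with made-up tool and role names:

```python
# the full toolbox lives in one place...
TOOLBOX = {
    "subdomain_enum": lambda target: f"enumerating {target}",
    "service_fingerprint": lambda host: f"fingerprinting {host}",
    "exploit_runner": lambda cve, host: f"exploiting {cve} on {host}",
    "report_writer": lambda finding: f"writing up {finding}",
}

# ...but each role only ever sees its slice
ROLE_TOOLS = {
    "recon": ["subdomain_enum", "service_fingerprint"],
    "verifier": ["service_fingerprint"],   # re-tests findings; no exploitation tools
    "exploit_dev": ["exploit_runner"],     # lab environments only
    "reporting": ["report_writer"],        # can't touch the target at all
}

def tools_for(role: str) -> dict:
    """Out-of-scope tools are never registered, so the agent can't even ask for them."""
    return {name: TOOLBOX[name] for name in ROLE_TOOLS[role]}

print(sorted(tools_for("verifier")))  # ['service_fingerprint']
```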

The real value: triage and verification

Here’s where agents earn their keep. In any engagement of meaningful scope, you end up with a pile of findings. Some are real, some are false positives, some look real but have no actual impact, and some look minor but are actually critical when you dig in. Sorting through all of that is where a significant chunk of engagement time goes, and it’s where agentic workflows save the most time.

I set up dedicated agents for verification. The workflow looks something like this (an orchestration sketch in code follows the list):

  1. Discovery agent runs enumeration and scanning, produces a raw list of potential findings.
  2. Verification agent takes each finding and attempts to confirm it. Re-running the test, trying variations, checking for edge cases. Its job is to eliminate false positives.
  3. PoC agent takes confirmed findings and builds proof-of-concept demonstrations. This does two things: it provides evidence for the report, and it forces a deeper level of validation. If you can’t build a working PoC, maybe it’s not as exploitable as the scanner said.
  4. Impact analysis: once a finding is confirmed and has a PoC, you push further. The initial discovery is rarely the full story. An SSRF that looks low-severity might be a path to cloud metadata and credential theft. An info disclosure might chain with something else. This is where human judgment still matters the most, but agents can do the legwork of testing the adjacent attack surface.
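
Sketched as code, the pipeline is a loop over agents with narrow jobs. The agent interface here, a single run(task) method, is a stand-in for whatever framework you use; the control flow is the part that matters.

```python
def run_pipeline(target, discovery, verifier, poc_builder, impact_analyst):
    """Assumes each agent exposes a single run(task) method; everything else
    about the interface is a placeholder for whatever framework you use."""
    findings = discovery.run(f"Enumerate and scan {target}")          # step 1: raw findings
    report = []
    for finding in findings:
        if not verifier.run(f"Confirm or kill: {finding}"):           # step 2: false positives die here
            continue
        poc = poc_builder.run(f"Build a working PoC for: {finding}")  # step 3: evidence + deeper validation
        if poc is None:
            continue  # no working PoC: demote to manual review rather than report it
        notes = impact_analyst.run(f"Test adjacent surface: {finding}")  # step 4: legwork for impact
        report.append({"finding": finding, "poc": poc, "impact_notes": notes})
    return report  # pre-validated findings with PoCs attached, ready for human judgment
```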

The result is that instead of spending days manually validating scanner output, I’m reviewing pre-validated findings with PoCs already attached. The agents handle the volume, I handle the judgment calls.

Enumeration at scale

The other major advantage is speed of enumeration. Traditional pen testing or bug bounty recon involves running a bunch of tools, waiting for output, parsing results, running more tools based on those results, and so on. It’s a lot of hurry up and wait, and a lot of context switching.

With an agentic workflow, the agent handles the tool chaining. It runs subdomain enumeration, takes the results, resolves them, checks for live hosts, fingerprints services, checks for known vulnerabilities, and flags interesting targets, all without me babysitting each step. I can review the results when they’re ready, or I can have multiple agents working different parts of the target simultaneously.
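
As a concrete example, here’s the kind of chain an agent executes unattended, sketched with the ProjectDiscovery tools (subfinder, dnsx, httpx). It assumes those binaries are on PATH and skips the error handling a real skill would need.

```python
import subprocess

def run(cmd: list[str], stdin: str = "") -> list[str]:
    """Run a CLI tool, return its non-empty output lines."""
    out = subprocess.run(cmd, input=stdin, capture_output=True, text=True)
    return [line for line in out.stdout.splitlines() if line.strip()]

def enumerate_target(domain: str) -> list[str]:
    # chain: subdomain discovery -> DNS resolution -> live-host probing and fingerprinting
    subs = run(["subfinder", "-d", domain, "-silent"])
    resolved = run(["dnsx", "-silent"], stdin="\n".join(subs))
    live = run(["httpx", "-silent", "-title", "-tech-detect"], stdin="\n".join(resolved))
    return live  # flagged targets go back to the agent to decide what to chase next
```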

This isn’t about replacing the methodology. The methodology is the same. It’s about removing the manual overhead of executing it. The agent follows the same steps I would, it just doesn’t need coffee breaks between them.

Rapid tooling and the one-man red team

This is the part that changed my day-to-day the most. I can build custom tools now at a pace that would have been impossible before.

Need a custom C2 beacon for a specific engagement? I can describe the requirements (comms protocol, evasion characteristics, payload behavior) and have a working implant in an afternoon. Not a template from a framework, a purpose-built tool for the specific environment and EDR I’m going up against. Then I can run it against the target EDR, analyze the detection response, make variants, and iterate until I have something that works. That cycle of build, test, analyze, modify used to take days of manual development. Now it’s measured in hours.

The same applies to custom scripts and utilities. Anything from a specialized credential harvester for a particular application stack to a data exfiltration tool that mimics legitimate traffic patterns for the target’s environment. The agent understands the context (what I’m trying to achieve, what constraints I’m working under, what the target looks like) and produces code that fits. I’m still reviewing everything, still making judgment calls on approach, but the implementation time collapses.

I can ingest existing skills and combine them into new workflows. Take a reconnaissance skill, chain it with an API testing skill, pipe the output into a reporting skill. Or build something entirely new, a custom MCP server that wraps a tool I wrote, exposed as a skill any agent can use across engagements.
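
For the MCP server piece, the official Python SDK keeps this short. Below is a minimal sketch wrapping a single recon tool as an MCP tool any agent can call; the server name and the subfinder wrapper are my own placeholders, not a published server.

```python
# pip install mcp  (the official Model Context Protocol Python SDK)
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("recon-tools")  # hypothetical server name

@mcp.tool()
def subdomain_enum(domain: str) -> str:
    """Enumerate subdomains for an in-scope domain (read-only)."""
    out = subprocess.run(["subfinder", "-d", domain, "-silent"],
                         capture_output=True, text=True, timeout=300)
    return out.stdout

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; point the agent's MCP config at this script
```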

The net effect is that one person with the right agentic setup can cover ground that used to require a small team. Recon, initial access, privilege escalation, lateral movement, persistence, reporting: each phase has dedicated agents with the right tools. I’m not claiming it replaces a team, but for solo operators or small shops, it closes the gap significantly. Bug bounty hunters, independent consultants, small red team shops: this is where the leverage is enormous. You’re not limited by your typing speed anymore, you’re limited by your ability to direct and make decisions. Which is how it should be.

Purple teaming and threat emulation

Where agentic workflows really shine for purple teaming is the speed at which you can go from threat intel to emulation. You get a report on a new TTP, say a novel initial access technique or a specific lateral movement pattern, and instead of spending days building out the tooling to emulate it, you feed the threat intel to an agent and have working emulation scripts, along with the corresponding IoCs, in hours. Your blue team gets tested against the real thing while it’s still relevant, not three weeks later when everyone’s moved on.

There’s already work being done on fully autonomous vulnerability exploitation pipelines. The concept is straightforward: give the system a CVE, and it walks through a multi-stage agent flow, collecting details and diffs, researching exploit primitives, building exploit code, testing it against a live environment, then validating success. No external PoCs needed, fully autonomous from CVE to working exploit. The implication for both red and blue teams should be obvious. If an autonomous system can go from CVE to exploit in under an hour, your patch window just got a lot smaller, and your ability to rapidly emulate emerging threats needs to keep pace.

The dev team model applied to red teams

Steve Yegge’s Gastown project is worth paying attention to even if you’re not a developer. The concept is applying a development team model to AI agents. An orchestrator with full context (the “mayor”), individual worker agents handling specific tasks, a lightweight issue tracker as shared external memory, and a merge system to handle parallel work. Essentially Kubernetes for AI coding agents.

The reason this matters for red teaming is that the same model maps directly to how assessments actually work. In Gastown terms, the “mayor” becomes your Lead Assessor, the orchestrator with full context on the engagement scope, rules of engagement, target environment, and objectives. Then you have your specialist agents (a roster sketch in code follows the list):

  • Recon and Enumeration: subdomain discovery, service fingerprinting, tech stack identification, OSINT collection. Runs the playbook, feeds results to the lead.
  • Scanner Triage: this is a big one. Anyone who’s run a Nessus or Burp scan on a large environment knows you get back thousands of findings, most of which are noise. This agent’s job is to dig through that output, correlate findings, and filter based on the engagement context: what matters for this specific target, what the risk profile looks like, what the assessment goals are. This needs human guidance through skills and memory that encode the priorities for the specific operation, but the sorting and initial analysis doesn’t need to be manual.
  • Exploit Developer: takes confirmed vulnerabilities and builds working exploits or PoCs. Can iterate on evasion techniques, test against target defenses, and produce the artifacts you need.
  • Finding Validator: takes raw findings from scanners or the recon agent and confirms or kills them. Re-tests, checks for false positives, verifies exploitability. Saves enormous amounts of time.
  • Reporting: takes validated findings with PoCs and drafts report sections. Not the final report, you still review, edit, and add context, but the initial writeup of each finding with evidence, impact, and remediation. That’s hours of engagement time recovered.
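
Expressed as a manifest the orchestrator loads, the roster might look like the sketch below. Every role name, model tier, and tool list is hypothetical; the point is that each role carries its own model choice, tool allowlist, and context, and only the lead holds the full engagement picture.

```python
# hypothetical engagement manifest the orchestrator loads at startup;
# every role name, model tier, and tool list here is illustrative
ROSTER = {
    "lead_assessor": {
        "model": "large",
        "tools": [],  # orchestrates only; holds scope, RoE, and objectives
        "context": ["scope.md", "rules_of_engagement.md"],
    },
    "recon_enum": {"model": "small", "tools": ["subfinder", "dnsx", "httpx"]},
    "scanner_triage": {
        "model": "large",
        "tools": ["scan_export_reader"],
        "skills": ["triage_priorities"],  # encodes what matters for this target
    },
    "exploit_dev": {"model": "large", "tools": ["compiler", "lab_deploy"],
                    "environments": ["lab"]},  # never production
    "finding_validator": {"model": "small", "tools": ["httpx", "curl"]},
    "reporting": {"model": "large", "tools": ["finding_db_reader"]},
}
```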

And it extends beyond those core roles. You can have agents dedicated to C2 beacon development, building, testing, and iterating on implants for a specific engagement. Source code analysis agents that audit application code or dependencies for vulnerabilities. OSINT collection agents that run continuously in the background. The roster grows based on the engagement.

A lot of this orchestration is already built into tools like Claude Code. Skills, memory, sub-agents, tool use, custom MCP servers. It’s marketed as a coding tool, but if you’re treating your red team operation as a software engineering problem (which you should be), the capabilities translate directly. Writing exploit code, building C2 infrastructure, analyzing target environments, generating reports, it’s all “code” in the broad sense.

Multi-model approaches

One thing I’ve found valuable is not being locked into a single model. Different models have different strengths, and for security work specifically, using multiple models or multiple iterations of the same model gives you better coverage.

A larger model, Claude or GPT-4 class, is good for planning, analysis, and complex reasoning tasks. A smaller, faster model is better for the high-volume repetitive work: running through a list of endpoints, checking configurations, parsing scan output. Some models are better at code generation, others at analysis. You can also use models from different providers as a form of diversity, since they may take different approaches to the same problem, and in security testing, variation is your friend. You want the unexpected approach, the edge case nobody thought of.
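
The routing itself can be as simple as a lookup table. A sketch with hypothetical task types and model tiers; anything unrouted falls through to the stronger model, on the theory that misrouting reasoning work to a small model costs more than the extra tokens.

```python
# hypothetical two-tier routing table: reasoning-heavy work to the large model,
# high-volume repetitive work to the small, fast one
ROUTES = {
    "plan_engagement": "large",
    "analyze_finding": "large",
    "write_exploit": "large",
    "parse_scan_output": "small",
    "check_endpoint": "small",
    "triage_line_item": "small",
}

def pick_model(task_type: str) -> str:
    # anything unrouted falls through to the stronger model
    return ROUTES.get(task_type, "large")
```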

Even the non-deterministic nature of LLMs works in your favor here. Run the same prompt against the same model five times and you might get five slightly different approaches to exploiting a vulnerability. In traditional tooling, you’d have to manually think of those variations. With agents, you get them for free just by running the same task multiple times or across multiple models. It’s like having multiple team members with different backgrounds look at the same problem. The variation is a feature, not a bug.
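
Harvesting that variation is easy to automate: run the same task several times, across one or more models, and keep the distinct outputs. A sketch, assuming the same run(task) interface as the earlier pipeline:

```python
def sample_approaches(agents, task: str, runs_each: int = 5) -> list[str]:
    """Same task, several independent runs across one or more models;
    keep the distinct outputs. Assumes each run(task) call samples fresh
    (nonzero temperature)."""
    seen, approaches = set(), []
    for agent in agents:
        for _ in range(runs_each):
            result = agent.run(task)
            if result not in seen:
                seen.add(result)
                approaches.append(result)
    return approaches  # each entry is a different angle on the same problem
```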

The security model question is real though. Different providers have different data retention policies, different fine-tuning approaches, and different safety guardrails. For offensive security work, you need to understand what each provider does with your prompts and outputs. Some are strict about not retaining data (Bedrock, locally hosted models). Others may use your interactions for training. Match the model to the sensitivity of the work.

Providers are catching on

It’s worth noting that the major AI providers are starting to recognize that security professionals are a legitimate use case, not just people trying to jailbreak their models. OpenAI launched Trusted Access for Cyber (TAC) in early 2026, an identity-verified program that gives approved cybersecurity professionals access to models with reduced safety guardrails for dual-use security work. This is significant because it acknowledges what anyone doing this work already knows: the techniques used to attack systems are the same ones used to defend them, and hamstringing your model’s ability to discuss offensive techniques just pushes security professionals to less safe alternatives.

I expect other providers to follow with similar programs. The guardrail problem is real. I’ve had models refuse to help with perfectly legitimate pen testing tasks because the safety filters can’t distinguish between “I’m testing our client’s web application with authorization” and “I want to hack a website.” Having a formal program that verifies you’re a professional doing authorized work and adjusts the model’s behavior accordingly is the right approach. It’s better than the alternative, which is security teams either not using AI tools at all or finding creative ways around the guardrails, neither of which is a good outcome.

Keeping skills sharp and pushing the envelope

One thing I want to emphasize: agents don’t stay useful on their own. You have to keep updating their skills with the latest tradecraft. New techniques, new bypasses, new attack surfaces. If your skills are stale, your agents are running last year’s playbook. The human in the loop isn’t just there for judgment calls, they’re there to push the agent to try things it wouldn’t try on its own. Suggest unconventional approaches. Point it at something nobody’s looked at yet. The agent executes fast, but the operator sets the direction, and that direction needs to come from someone who’s staying current.

Talk to anyone in the bug bounty space right now and they’ll tell you, agentic workflows are already the norm among the top performers. The people consistently finding high-severity bugs are the ones who’ve figured out how to leverage agents effectively. Some hunters I know have found 0-days by pointing agents at the dependency libraries a target relies on (identified through recon or inside knowledge of the tech stack) and having the agent systematically audit the library source for vulnerabilities. Not just finding the bug, but getting a CVE assigned and using it as an attack vector during the engagement or for a bounty payout. That’s a workflow that would take a researcher weeks of manual code review. With an agent, it’s days.

The flip side: where bug bounty is heading

I’d be dishonest if I didn’t acknowledge the elephant in the room. A lot of people in the bug bounty community expect the space to contract significantly. As agents get better at finding vulnerabilities, organizations are increasingly using them internally, running automated security testing as part of their CI/CD pipeline, catching bugs before code ships, making the QA process include the kind of testing that used to be outsourced to bug bounty programs. Why pay bounties for findings when an agent can catch them before release?

This also means there’s a growing problem with vulnerability reporting slop. Low-quality, agent-generated reports flooding bug bounty platforms with findings that are either false positives, duplicates, or so surface-level they’re not actionable. The platforms and companies are going to have to figure out how to deal with that, and it’s going to make life harder for legitimate researchers in the short term.

But here’s the thing, the bugs that agents find internally are the easy bugs. The configuration issues, the known vulnerability patterns, the stuff that a good scanner would catch. The bugs that require understanding business logic, chaining multiple low-severity findings into a critical path, or thinking creatively about abuse cases, those still require human expertise. Agents make the easy stuff cheaper to find, which means the value of the hard stuff goes up. If you’re a bug bounty hunter whose methodology is “run a scanner and report what it finds,” yeah, that’s going away. If you’re the kind of researcher who finds novel attack chains, you’re going to be fine. You’ll just be doing it faster with better tooling.

What this doesn’t replace

I want to be clear about what agents are bad at, because the hype cycle around AI in security is loud enough without me contributing to it.

Agents are not good at the creative, lateral-thinking aspects of offensive security. The “what if I try this weird thing” intuition that comes from years of experience, the instinct to poke at something that doesn’t look right even when every tool says it’s clean. Understanding the business context of a finding, why this particular data exposure matters to this particular client. Communicating findings to non-technical stakeholders in a way that drives action. Reading a room during a debrief and knowing when to push and when to let the client process.

More importantly, the human experience is what makes any of this work in the first place. The agent’s skills come from somewhere, they come from you. Your years of learning techniques, reading research, attending conferences, breaking things in labs, failing on engagements and figuring out why. That knowledge base is what you encode into skills, guardrails, and operational context. The agent is executing your methodology, not its own. If you stop learning, stop researching new techniques, stop keeping up with the evolving landscape, your agents stagnate with you.

Agents also can’t build relationships. They can’t sit in a threat briefing and read between the lines of what a client is actually worried about versus what they’re saying. They can’t mentor junior team members. They can’t make the case to leadership for why a finding matters. They can’t navigate the politics of a remediation plan. The soft skills of offensive security (and yes, there are soft skills) are entirely human.

They’re a force multiplier, not a replacement. They handle the volume and the grunt work so you can focus on the parts that actually require expertise. If you’re already good at this, agents make you faster. If you’re not, agents will just help you produce bad results at scale.

Getting started

If you’re interested in setting this up, you don’t need anything exotic. Claude Code with custom skills and memory, or a similar agentic framework, is enough to get started. Define your roles, build your skills as reusable modules, set your guardrails in memory, and start small. Maybe one agent doing recon on a bug bounty target. Once you see the output quality and time savings, you’ll want to expand from there.

The main thing I’d emphasize: treat your agent setup with the same rigor you’d apply to any tool in your offensive arsenal. Understand what it’s doing, where the data is going, and what the blast radius is if something goes wrong. Agentic doesn’t mean autonomous. You’re still the operator.

And log everything. Everything the agent does during an engagement needs to be recorded: every request, every tool invocation, every finding, every action taken. Build this into your skills, your plugins, your MCP servers from the start, not as an afterthought. If you’ve been doing this long enough, you know that sometimes a client’s website goes down during an engagement and you’ll be suspected, if not outright blamed, even if you weren’t doing active testing at the time. When that call comes in, you need to be able to pull up logs and prove exactly what your agents were doing at that timestamp.

And if it was you, if an agent did something unexpected that caused an issue, you need to figure out what happened immediately, why it happened, and help fix it. That’s the worst-case scenario, and having complete logs is the difference between a professional response and a liability. You are responsible for everything your agents do. Full stop.
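
One way to make that non-negotiable is to enforce it at the tool boundary, so nothing can be invoked without leaving a record. A sketch using a hypothetical JSONL audit log; a real setup would also capture the agent’s identity and an engagement ID:

```python
import functools
import json
import time
from pathlib import Path

AUDIT_LOG = Path("engagement_audit.jsonl")  # hypothetical append-only log

def audited(tool_fn):
    """Wrap a tool so every invocation leaves a timestamped record:
    arguments, outcome, and errors included, even when the call fails."""
    @functools.wraps(tool_fn)
    def wrapper(*args, **kwargs):
        entry = {"ts": time.time(), "tool": tool_fn.__name__,
                 "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = tool_fn(*args, **kwargs)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = f"error: {exc}"
            raise
        finally:
            with AUDIT_LOG.open("a") as f:
                f.write(json.dumps(entry) + "\n")
    return wrapper

@audited
def probe_host(host: str) -> str:
    return f"probed {host}"  # placeholder for a real tool call

probe_host("app.example.com")  # every call now leaves a record, pass or fail
```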

