OpenAI's New Safety Bug Bounty Pays for 3 Types of AI Flaws | AI Bytes
0% read
OpenAI's New Safety Bug Bounty Pays for 3 Types of AI Flaws
AI News
OpenAI's New Safety Bug Bounty Pays for 3 Types of AI Flaws
OpenAI just launched a Safety Bug Bounty program on Bugcrowd that rewards researchers for finding agentic vulnerabilities, prompt injection attacks, and data exfiltration bugs — even when they don't qualify as traditional security flaws.
March 26, 2026
7 min read
47 views
Updated March 26, 2026
OpenAI Launches a Safety Bug Bounty — And It's Not Your Typical Vulnerability Program
OpenAI just drew a line in the sand. On March 26, 2026, the company announced a brand-new Safety Bug Bounty program that pays security researchers to find a class of bugs most bounty programs completely ignore: AI-specific abuse and safety risks. We're not talking about SQL injection or broken authentication here. This is about prompt injection, data exfiltration from AI agents, and the kind of novel attack vectors that only exist because AI systems now browse the web, execute code, and take real-world actions on your behalf.
And honestly? It's about time.
What Is OpenAI's Safety Bug Bounty Program?
OpenAI's Safety Bug Bounty is a public program hosted on Bugcrowd that accepts reports of meaningful AI abuse and safety risks across OpenAI's products — even when those issues don't meet the traditional definition of a security vulnerability. It runs alongside OpenAI's existing Security Bug Bounty (which recently bumped its maximum payout to $100,000 for critical findings). The safety-focused program specifically targets the weird, hard-to-classify dangers that come with deploying agentic AI systems at scale.
Think of it this way: the Security Bug Bounty catches the lock on your front door being broken. The Safety Bug Bounty catches someone convincing your AI butler to hand over your house keys through a cleverly worded note.
The 3 Categories That Qualify
As of March 26, 2026, the program covers three broad areas of AI-specific risk:
1. Agentic Risks
This is the big one. With products like ChatGPT Agent, Browser, and other agentic tools now interacting with third-party websites and services, the attack surface has exploded. Qualifying submissions in this category include:
Third-party prompt injection — where attacker-controlled text hijacks a victim's AI agent
Data exfiltration — tricking an agent into leaking a user's sensitive information
MCP (Model Context Protocol) abuse — exploiting the protocol agents use to interact with external tools
Disallowed actions at scale — making agents perform unauthorized operations on OpenAI's infrastructure
Here's the kicker: prompt injection attacks must be reproducible at least 50% of the time to qualify for a reward. So you can't just submit a one-off fluke and call it a day.
2. Account and Platform Integrity Violations
This covers bypasses of anti-automation controls, manipulation of account trust signals, and techniques for evading account restrictions or suspensions. Basically, if you find a way to game OpenAI's platform at scale, they want to hear about it.
3. Proprietary Information Abuse
If you can get a model to expose proprietary information — like details about its internal reasoning process or other OpenAI intellectual property — that counts too. This is a less obvious category, but it matters as models become more capable and the gap between "what the model knows" and "what it should share" gets harder to enforce.
The Safety Bug Bounty accepts issues that pose real-world risk, even when they don't meet traditional vulnerability classifications. That's a pretty big deal for an industry that's been trying to fit AI safety into a cybersecurity-shaped box.
What Doesn't Qualify
Not everything makes the cut. OpenAI is explicitly excluding:
General jailbreaks that only produce rude language or return information already easy to find on Google
Content-policy bypasses without a clear safety or abuse impact
Integrity violations involving unauthorized access — those get redirected to the Security Bug Bounty instead
So if you've been sitting on a prompt that makes ChatGPT swear at you, don't bother. But if you've found a way to make an AI agent silently exfiltrate a user's browsing session data to a third party? That's exactly what they're looking for.
Why This Matters More Than You Think
As of March 2026, we're living in an era where AI agents don't just generate text — they browse websites, write and run code, manage files, and interact with APIs. ChatGPT, Claude, Gemini — they all have agentic capabilities now. OpenAI recently gave AI agents a full Linux terminal, and even caught its own coding agents trying to bypass security. And that means the potential for harm has shifted from "the model said something bad" to "the model did something bad."
Prompt injection is to AI agents what SQL injection was to web apps in the early 2000s. The difference is that AI agents are harder to patch because the vulnerability is inherent to how they process natural language.
OpenAI acknowledged this directly by singling out prompt injection and data exfiltration as top-priority categories. They've even noted in previous research that AI browsers may always be somewhat vulnerable to prompt injection — a refreshingly honest admission for a company that often gets criticized for overpromising.
The 50% reproducibility threshold is also worth noting. It's strict enough to filter out noise, but loose enough to acknowledge that AI systems are inherently non-deterministic. You're not going to get the same output every single time, and OpenAI seems to understand that.
How the Program Actually Works
Submissions go through OpenAI's Bugcrowd portal. Both the Safety and Security Bug Bounty teams triage incoming reports, and issues may get rerouted between the two programs depending on scope and ownership. So if you submit a safety issue that turns out to be a traditional security flaw (or vice versa), it'll still end up in the right hands.
OpenAI also runs private campaigns targeting specific high-risk areas. They've previously invited researchers to probe for biorisk content generation in ChatGPT Agent and GPT-5 — a sign that the company is thinking about category-specific threats, not just generic vulnerability classes.
For reward amounts, OpenAI hasn't published a specific pay scale for the Safety Bug Bounty. But for context, their Security Bug Bounty ranges from $200 for low-severity findings up to $100,000 for exceptional critical vulnerabilities — a fivefold increase from the previous $20,000 cap. Researchers who find flaws enabling "direct user harm with actionable fixes" may also qualify for rewards on a case-by-case basis, even if the issue doesn't fit neatly into a defined category.
The Bigger Picture: AI Safety as a Community Sport
This launch is part of a broader trend. As of March 2026, the AI safety conversation has moved well beyond academic papers and policy proposals. Companies are putting real money behind finding real flaws — and they're enlisting the same security research community that's been hardening traditional software for decades.
When you pay researchers to break your AI agents, you're making a bet that external scrutiny will find things internal red teams miss. Given the complexity of agentic AI systems, that's a smart bet.
But there's a tension here, too. OpenAI's exclusion of general jailbreaks signals that the company draws a firm line between "the model said something it shouldn't" and "the model took an action that caused real harm." That distinction makes sense from a prioritization standpoint, but it might frustrate researchers who believe content-policy bypasses are a meaningful part of the safety puzzle.
Still, the fact that a dedicated safety-specific bounty program even exists is a positive signal. It means OpenAI is treating AI safety risks as a distinct category worthy of their own reward structure — not just an awkward edge case that security teams have to figure out on the fly.
How to Participate
If you're a security researcher or ethical hacker interested in AI safety, here's what you need to know:
Focus on real harm — submissions need to demonstrate meaningful safety or abuse impact
Hit the 50% bar — prompt injection attacks must be reproducible at least half the time
Provide clear reproduction steps — the more detailed, the better your chances of a payout
Don't submit jailbreaks — unless they lead to demonstrable real-world harm beyond content policy
The program is open to external researchers globally. If you work with MCP-based tools, our Claude Desktop MCP setup guide covers the protocol in detail. And given that OpenAI's agentic products are only getting more powerful, the attack surface is only going to grow.
OpenAI hasn't published a specific pay scale for the Safety Bug Bounty. However, their related Security Bug Bounty pays $200 to $100,000 per finding depending on severity. Safety submissions demonstrating direct user harm with actionable fixes may qualify for rewards on a case-by-case basis, even outside predefined categories.
Can I submit a ChatGPT jailbreak to OpenAI's Safety Bug Bounty?
Only if it causes demonstrable real-world harm beyond content policy. General jailbreaks that produce rude language or return easily searchable information are explicitly out of scope. Your submission needs to show a clear safety or abuse impact — like an agent leaking user data or performing unauthorized actions.
Does the OpenAI Safety Bug Bounty cover Claude or Gemini agents?
No. The program only covers OpenAI's own products, including ChatGPT Agent, Browser, and similar agentic tools. Vulnerabilities in third-party AI systems like Anthropic's Claude or Google's Gemini would need to be reported through those companies' own security programs.
What's the difference between OpenAI's Safety and Security bug bounties?
The Security Bug Bounty covers traditional vulnerabilities like authentication bypasses and infrastructure flaws, paying up to $100,000. The Safety Bug Bounty covers AI-specific abuse risks — prompt injection, data exfiltration, MCP abuse — that don't fit traditional security definitions but still pose real-world harm. Submissions are triaged by both teams and rerouted as needed.
Do I need to reproduce a prompt injection attack every time to get paid?
Not every time, but at least 50% of the time. OpenAI requires prompt injection attacks to be reproducible in at least half of attempts to qualify for a reward. This threshold accounts for the non-deterministic nature of AI models while filtering out unreliable one-off results.