Testing an autonomous hacker
I've tested an autonomous hacker on real products over the past month,[1] and it's extremely capable. Concretely, I could:
- Hijack accounts at a major bank and log in as other users.
- Bypass authorization in a major AI lab's product and access other users' private data, including uploaded files.
- List all users of a big tech company's product and modify their files.
- Download the health records of anyone whose data was stored and sharable through a popular electronic health records system.[2]
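Most of these bugs share one shape: the server trusts a client-supplied identifier without checking that the requester owns the record (an insecure direct object reference). A minimal sketch of that pattern, with hypothetical names and data:

```python
# Toy illustration of the authorization failure (IDOR) behind bugs like the
# ones above. All names and records here are made up.
FILES = {
    1: {"owner": "alice", "data": "alice's upload"},
    2: {"owner": "bob", "data": "bob's upload"},
}

def get_file_vulnerable(current_user, file_id):
    # VULNERABLE: trusts the client-supplied ID and never checks ownership,
    # so any authenticated user can enumerate file_id and read everything.
    return FILES[file_id]["data"]

def get_file_fixed(current_user, file_id):
    # FIX: verify the record belongs to the requester before returning it.
    record = FILES[file_id]
    if record["owner"] != current_user:
        raise PermissionError("not your file")
    return record["data"]
```

Finding this class of bug requires probing a live system as a real user, which is exactly the kind of tedious, wide-scope enumeration an autonomous agent is good at.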
If I were malicious, I reasonably believe that I could cause billions of dollars in damage in less than a month, and AI is extending that choice to millions of people. While the AI providers have some safeguards, they're not a barrier. The current generation of models is easy to jailbreak, and one model in particular – Claude Opus 4.6 – is eager to hack.[3]
Dangerous empowerment
I think I'm a competent software engineer: I worked at Anthropic, ran a venture-backed software company for a few years, and have a degree in computer science. Still, a year ago, I was not capable of finding these vulnerabilities; I'm not a trained security engineer.
I'd estimate the pool of people capable of finding these vulnerabilities a year ago to be tens of thousands of people.[4] Now, with the uplift from AI, I think there are tens of millions.[5]
That ~1000x increase in the number of people able to find and exploit critical security issues is going to cause problems. We have relative security today only because the talent that's able to find these kinds of issues – and willing to exploit them – is rare. Most systems are never subjected to a serious attack[6] and are hardened to withstand only moderately resourced hacking groups.
I'm not a fan of analogies, but I think one is apt here: it's as if everyone built houses to withstand rain, and they're about to be inundated by a flood.
Making it through this
I'm cautiously non-pessimistic that we can make it through this without a society-altering disaster, because the defense can adopt the same tooling, find issues before hackers, and fix them.
For that path to succeed, both the defense and the labs need to take action:
- Labs need to grant access to pre-release models to the defense. The defense needs a chance to find and fix issues before hackers. For this to work, the preview period should be at least a week long and include very high rate limits.
- Defense needs to start continuous open-scope penetration tests that are powered by the latest models. AI that's employed by the defense is capable of finding the same issues as AI employed by hackers, but to be effective, it needs to run as the attack surface changes and have an unlimited scope of attack. Right now, most penetration tests have a limited scope, are run only occasionally, and don't reflect the real capabilities of an autonomous hacker.
- Millions of owners need to authorize automated vulnerability testing. Even with billions of dollars in cash and an army of security engineers, I would go to jail for unauthorized hacking of systems I don't own – and only a small fraction of systems allow public ethical hacking. We need most of these system owners to authorize someone to perform automated hacking and disclose issues. (I built Autopen[7] so there would be a place to sign up and authorize testing for free.)
- Issues need to be fixed faster. Fix timelines average in the weeks and often extend to months, even for critical server-side issues that could theoretically be patched instantly. That timeline is too slow to effectively defend against this threat.
- We need purpose-built "evil models". Models trained on offensive hacking and lightly tuned for fewer refusals will also play a role for the defense. I'm somewhat hesitant about this one, because it's ripe for misuse and likely won't remain contained by safeguards. Smaller groups are already working on this, though, so I think it's inevitable that they'll be available to the offense soon.
- We need better safeguards for fully autonomous hacking. When the autonomous hacker finds a critical vulnerability, it has the ability to exploit it and negatively affect the system (and often tries to!). I built safeguards to prevent this specific failure mode, but I still need to monitor each hacking agent. Scaling this up either requires thousands of people monitoring agents, better safeguards, or both.
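One safeguard layer I mean here is a policy check that sits between the agent and its tools, blocking state-changing actions before they execute. A hypothetical sketch (a real deployment needs far more than a method allowlist, since even GET requests can trigger destructive behavior, but this shows the shape of the check):

```python
# Hypothetical guard that reviews an agent's HTTP tool calls and refuses
# anything that could mutate the target system, escalating to a human
# instead of executing. Heuristics here are illustrative, not exhaustive.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
DESTRUCTIVE_PATH_HINTS = ("delete", "remove", "drop", "truncate")

def review_tool_call(method, url):
    """Return (allowed, reason) for a proposed request."""
    if method.upper() not in SAFE_METHODS:
        return False, f"blocked: {method} can modify state"
    if any(hint in url.lower() for hint in DESTRUCTIVE_PATH_HINTS):
        return False, "blocked: URL suggests a destructive action"
    return True, "allowed"
```

The hard part isn't writing a check like this; it's making it robust enough that a human no longer needs to watch every session.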
I intentionally included both "evil models" and pre-release public models in the list above, because combining different lineages of models in the same session finds more critical issues. Adding another model to the mix might produce an emergent characteristic that creates a more capable autonomous hacker, even if the model doesn't exhibit this capability in isolation.
If we do all of this, move quickly, and execute well, I could see the lone-wolf hacking risk being largely mitigated for the next ~year.
What the AI labs could be doing better
Of all the AI labs, I think OpenAI has the most effective cybersecurity safeguards; I was silently downgraded to a less capable model while using Codex to hack. They advertise this and have a trusted access program to avoid it, but it still took me a while to figure out what happened, and it meaningfully slowed down my security research. (Which is good! I hoped that all the platforms would ban me within hours.)
I think most people at the AI labs are well-intentioned and genuinely trying to make the transition to powerful AI go well.[8] That said, I don't think any lab has adequate safeguards to prevent cybersecurity misuse from a moderately dedicated actor. That's because:
- It's trivial to jailbreak the models to hack. It only takes a few minutes to jailbreak any of the frontier models to perform as an autonomous hacker.[9] After the initial jailbreak, I have seen only two refusals from the primary agent over ~400,000 events.[10]
- Bans for cyber misuse are slow. It took several weeks before one of my accounts was flagged by a model provider. Now, I've gone through a KYC process for one and am labelled as a known red teamer by another. I haven't done anything with the third provider, and my account is still operating as-is.
- Opening new accounts is easy. Measures that might hinder abuse for normal users, like phone verification or entering a credit card, are low barriers for a moderately resourced hacking group when the reward is several critical security vulnerabilities.
- Credit programs are lacking and exclude offensive security testing. OpenAI has grant programs, but they exclude offensive security testing. Google occasionally has similar programs, and Anthropic has none that I'm aware of. The products they've built for defense cost money and focus only on analyzing code you own, not active pen testing; none of the vulnerabilities I mentioned above would be caught by analysis of the codebase.
- It's easy to distill models and strip safeguards. Any lab's API can be used as a judge or as a source of traces to train a new model (or to fine-tune an open-source one).
This all means that extracting frontier cybersecurity capabilities through the safeguards is fairly straightforward.
I worry that the frontier labs overindexed on the lack of demonstrated bio risk in the past year. The pool of people who are willing and able to use LLMs to hack is much larger than those who want to build bioweapons. There are clear financial incentives that drive this behavior and fewer moral qualms; many more people are willing to hack than indiscriminately murder.
That said, this is all moot if this capability is built outside of a large lab, which is already happening. Cybersecurity is one of the easiest domains in which to create RL environments, because rewards are verifiable and infrastructure is straightforward.[11] Anyone with access to accelerators can build similar capabilities on top of a strong open-source agentic coding model without relying on a frontier lab's API.
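The "verifiable rewards" point is concrete: unlike most domains, a hacking episode can be scored automatically, with no human judge and no physical lab. A toy capture-the-flag reward function, purely illustrative:

```python
import hashlib

# Toy sketch of a verifiable RL reward for a hacking environment: the
# environment plants a secret flag on the target; the checker stores only
# its hash, so scoring an episode is a string comparison. Flag value is
# made up for illustration.
FLAG = "FLAG{toy-example-not-a-real-secret}"
FLAG_HASH = hashlib.sha256(FLAG.encode()).hexdigest()

def reward(agent_transcript: str) -> float:
    """1.0 if the agent's transcript contains the planted flag, else 0.0."""
    for token in agent_transcript.split():
        if hashlib.sha256(token.encode()).hexdigest() == FLAG_HASH:
            return 1.0
    return 0.0
```

Cheap, unambiguous scoring like this is what makes the domain so easy to build RL environments for.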
An aside: actual damages
I strongly suspect that an AI-powered Equifax-scale attack has already occurred. The capabilities are there, and people are actively using frontier models to hack. We just haven't heard about it yet, because significant breaches usually go undetected and unreported by the victim for months.[12]
To put numbers on the cost of damages, bad cybersecurity incidents like the Equifax breach or NotPetya ransomware cost billions of dollars, and attacks with less widespread damages regularly lead to cash outlays in the tens to hundreds of millions of dollars. Attacks on critical infrastructure, which have been rare so far, can be much worse.
Personally, I'm of the opinion that you are responsible for the effects of what you build. If I were a decision maker at Anthropic, I would seriously consider pulling Opus 4.6 and not releasing more capable models until adequate safeguards are in place, even though this risks falling off the frontier.
I'm singling out Anthropic here because Opus 4.6 is my go-to for hacking. If it disappeared, I'd be a less capable hacker, but Opus 4.5, GPT 5.3 Codex, and GPT 5.4 would still meet all my needs for coding. I see the riskiest period as when hackers have access to capable models, but defense hasn't applied fixes yet. Because of that, I view Opus 4.6 as the primary source of additional risk.
N roads diverged
I see this as the first real risk from powerful AI which a) is emergent rather than intentional, and b) requires coordination from multiple parties to fix.
As such, I'm watching this closely as a source of signal for future issues: how quickly defenses diffuse, whether the government steps in, and how AI labs cover negative externalities. There are a lot of potential futures with wildly different outcomes, and I'm still not sure where we're headed.
If my priors hold, we likely won't raise defenses in time. Despite that, I'm taking the optimistic stance that we will, because I don't see any action to take if I hold the pessimistic view.
Thank you to Rene Brandel and Bryce Cai for reviewing a draft of this.
With permission! The agent was mostly autonomous – I was still in the loop to shut it down if it started trying to delete other users' data (which it regularly tries to do!), steer when critical instructions were lost during compaction, and validate findings. Many sessions proceeded for hours without any manual involvement. ↩︎
Technically, this one required a step that took a few minutes of effort per person, so downloading everything would be slow and risk detection. ↩︎
To AI safety friends: yes, you could call this an infohazard, but it's common knowledge in hacking circles and not in application security circles – so I think sharing this is a net positive. On the defense side, we're in the "it's not there yet" phase of perception, like coding agents a year ago. ↩︎
Estimated as thousands of people who submit critical issues via HackerOne each year, the estimated tens of thousands of nation-state security researchers, and thousands of other independent but competent security researchers. This number is likely off, but I'd be surprised if it's off by an order of magnitude. ↩︎
Estimated as the rough number of people who have the ability to run basic commands on the command-line, a basic understanding of how software is built, and could pay for a subscription to a coding agent. This one is harder to approximate, but as a comparison, GitHub reports over 100M active users, and over 20M of them are in the United States – so tens of millions of people seems about right. ↩︎
Almost nobody builds to withstand a sustained attack from a capable actor like the United States' NSA or Israel's Unit 8200, and I don't know of any examples of successfully withstanding such an offensive – just measures to make it harder or more obvious. ↩︎
I built Autopen after finding these vulnerabilities, because nothing comparable has a free or low-cost tier for the long-tail of apps that won't pay thousands for a pen test. I suspect I'll get flak similar to when Anthropic posts a warning about AI, but I chose to build Autopen and write the post because I'm genuinely worried about the consequences and trying to help mitigate them. ↩︎
As I've said in HN posts like this or this. That doesn't mean that I always agree with their decisions, and it doesn't mean that they're a perfect company. ↩︎
In comparison, it takes ~10 hours to jailbreak a model to help build a bioweapon, and there are additional layers of safety classifiers that monitor and stop requests that are actively streaming harmful information. ↩︎
Subagents have variable context and thus higher refusal rates, but I don't track metrics on their traces. ↩︎
By comparison, creating RL environments for an evil bio model is hard, because reward modeling is harder, and verifiable rewards – outside of bioinformatics – require operating a physical lab. That's feasible for a frontier AI lab, but not for an actor with less cash – and there are more opportunities for someone to catch you. ↩︎
See Capital One or Equifax for an example of the timeline after discovery. ↩︎