The blog post from the researcher is a more interesting read.
Important points here about benchmarking:
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.
o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.
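For context, the bug class described in that quote is a use-after-free of sess->user. The sketch below is a minimal, self-contained illustration of that pattern and is not the actual ksmbd code: the struct layout and function names are invented stand-ins, and the real bug involves two connections bound to the same session racing against the logoff handler.

```c
/*
 * Illustrative sketch of the bug class quoted above: a use-after-free of
 * sess->user, where the session logoff handler frees the user object while
 * another handler, on a second connection bound to the same session, can
 * still dereference it. All names and structures here are simplified
 * stand-ins, not the actual ksmbd code.
 */
#include <stdio.h>
#include <stdlib.h>

struct user {
    int uid;
};

struct session {
    struct user *user;  /* shared by every connection bound to this session */
};

/* Runs on connection A when a SESSION_LOGOFF request arrives. */
static void session_logoff_handler(struct session *sess)
{
    free(sess->user);
    /* Bug pattern: sess->user is not set to NULL and nothing stops other
     * connections bound to the same session from still using it. */
}

/* Runs (conceptually concurrently) on connection B, same session. */
static void other_request_handler(struct session *sess)
{
    /* Use-after-free: the object may already have been freed above. */
    printf("uid: %d\n", sess->user->uid);
}

int main(void)
{
    struct session sess;
    sess.user = malloc(sizeof(*sess.user));
    sess.user->uid = 1000;

    /* Single-threaded stand-in for the race between two connections. */
    session_logoff_handler(&sess);
    other_request_handler(&sess);  /* undefined behaviour: reads freed memory */
    return 0;
}
```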
I’m not sure if a signal to noise ratio of 1:100 is uh… Great…
If the researcher had spent as much time auditing the code as he did evaluating the merit of hundreds of incorrect LLM reports, he would no doubt have found the second vulnerability himself.
This confirms what I just said in reply to a different comment: most cases of AI “success” are actually curated by real people from a sea of bullshit.
The problem is motivation. As someone with ADHD, I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL
And if Gutenberg had just written faster, he would’ve produced more books in the first week?
I’m not sure that, if the Gutenberg press had only produced one readable copy for every 100 printed, it would have been the literary revolution that it was.
I agree it’s not brilliant, but it’s early days. If one is looking to mechanise a process like finding bugs, you have to start somewhere: determine how to measure success, set performance baselines and all that.
I get your point, but your comparison is a little… off. Wasn’t Gutenberg “printing”, not “writing”?
You’re right, probably better put as: if he’d spent his time writing instead of working on that contraption, he’d have produced more books in the first month.
The models seem to be getting worse at this one task?
I’m not sure if a signal to noise ratio of 1:100 is uh… Great…
It found it correctly in 8 of 100 runs and reported a false finding in 28 runs. The remaining 64 runs can be discarded, so a person would only need to review 36 reports. For the LLM, 100 runs would take minutes at most, so the time requirement is minimal and the cost would be trivial compared to the cost of 100 humans learning a codebase and writing a report.
So a security researcher feeds in the codebase and in a few minutes they have 36 bug reports that they need to test. If they know that roughly 2 in 9 of them (8 of 36) are real zero-day vulnerabilities, then discovering new zero-days becomes a lot faster.
If a security researcher had the option of reading an entire codebase or reviewing 40 bug reports, 10 of which contained a new bug, then they would choose the bug reports every time.
That isn’t to say that people should be submitting LLM-generated bug reports to developers on GitHub. But as a tool for a security researcher to use, it could significantly speed up their workflow in some situations.
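A quick check of the triage arithmetic in the comment above (8 true positives, 28 false positives, 64 runs with no finding). The run counts come from the comment; everything else is just a throwaway calculation:

```c
/* Quick check of the triage numbers discussed above: 100 runs,
 * 8 true positives, 28 false positives, the rest report nothing. */
#include <stdio.h>

int main(void)
{
    int runs = 100, true_pos = 8, false_pos = 28;
    int no_finding = runs - true_pos - false_pos;      /* 64 runs discarded   */
    int to_review  = true_pos + false_pos;             /* 36 reports to read  */
    double hit_rate = (double)true_pos / to_review;    /* ~0.22, i.e. ~2 in 9 */

    printf("discarded: %d, to review: %d, hit rate: %.2f\n",
           no_finding, to_review, hit_rate);
    return 0;
}
```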
It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially the garden path.
The test with the actual full suite of commands passed in the context only found it 1/100 times, and we didn’t get any info on the number of false positives they had to wade through to find it.
This is also assuming you can automatically and reliably filter out false negatives.
He even says the ratio is too high in the blog post:
That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution. Still, that ratio is only going to get better.
From the blog post: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/
Conclusion
LLMs exist at a point in the capability space of program analysis techniques that is far closer to humans than anything else we have seen. Considering the attributes of creativity, flexibility, and generality, LLMs are far more similar to a human code auditor than they are to symbolic execution, abstract interpretation or fuzzing. Since GPT-4 there have been hints of the potential for LLMs in vulnerability research, but the results on real problems have never quite lived up to the hope or the hype. That has changed with o3, and we have a model that can do well enough at code reasoning, Q&A, programming and problem solving that it can genuinely enhance human performance at vulnerability research.
o3 is not infallible. Far from it. There’s still a substantial chance it will generate nonsensical results and frustrate you. **What is different, is that for the first time the chance of getting correct results is sufficiently high that it is worth your time and your effort to try to use it on real problems.**
The point is that LLM code review can find novel exploits. The author gets these results with an off-the-shelf model and a simple workflow, so there is a lot of room for improving the accuracy and outcomes of such a system.
A human may do it better on an individual level, but it takes a lot more time, money and effort to make and train a human than it does to build an H100. This is why security audits are a long, manual and expensive process that requires human experts. Because of this, exploits can exist in the wild for long periods of time, because we simply don’t have enough people to security audit every commit.
This kind of tool could make security auditing a checkbox in your CI system.
There are a lot of assumptions laced into that about LLMs reliably getting better over time…
But so far they have gotten steadily better, so I suppose there’s enough fuel for optimists to extrapolate that out into a positive outlook.
I’m very pessimistic about these technologies and I feel like we’re at the top of the sigmoid curve for “improvements,” so I don’t see LLM tools getting substantially better than this at analyzing code.
If that’s the case, I don’t feel like having hundreds and hundreds of false security reports creates the mental arena that allows researchers to actually spot the non-false report among all the slop.
We only find out whether we’re at the top of the curve by continuing to push the frontier of what is possible. Seeing exciting paths is what motivates people to try to get those improvements and efficiencies.
I do agree that the AI companies are pushing a ridiculous message, as if LLMs are going to replace people next quarter. I too am very pessimistic about that outcome; I don’t think we’re going to see LLMs replacing human workers anytime soon. Nor do I think GitHub should make this a feature tomorrow.
But machine learning is a developing field, so we don’t know what efficiencies are possible. We do know that human brains manage to produce intelligence, so it seems likely that whatever advances we make in machine learning will move at least in the direction of the efficiency of human intelligence.
If that’s the case, I don’t feel like having hundreds and hundreds of false security reports creates the mental arena that allows researchers to actually spot the non-false report among all the slop.
It could very well be that you can devise a system which can triage hundreds of false security reports more easily than a human can audit the same codebase. The author didn’t explore how he did this, but he seems to have felt that it was worth his time:
What is different, is that for the first time the chance of getting correct results is sufficiently high that it is worth your time and your effort to try to use it on real problems.
It’s only good for clickbait titles.
It brings clicks, and for the majority of people who stop at the title, it spreads the falsehood that “AI” is good at this and getting better.