24 comments

  • jason1cho 5 hours ago
    This isn't surprising. What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.
    • antirez 1 hour ago
      That's not what is happening right now. The bugs are often filtered later by LLMs themselves: if the second pipeline can't reproduce the crash / violation / exploit in any way, the false positives are often evicted before ever reaching human scrutiny. Checking whether a real vulnerability can be triggered is a trivial task compared to finding one, so this second pipeline has an almost 100% success rate, in the sense that if a report passes it, it is almost certainly a real bug, and very few real bugs will fail it. It does not matter how much LLMs advance; people ideologically against them will always deny they have an enormous amount of usefulness. This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.
      • ksec 6 minutes ago
        >This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

        You are replying to an account created less than 60 days ago.

        • jvanderbot 2 minutes ago
          This is a bit unfair. Hackers are born every day.
      • BodyCulture 28 minutes ago
        Can we study this second pipeline? Is it open so we can understand how it works? Did not find any hints about it in the article, unfortunately.
        • maximilianburke 21 minutes ago
          From the article by 'tptacek a few days ago (https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...), I essentially used the prompts it suggested.

          First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"

          Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"

          Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."

          Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
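
          In case it's useful, here's roughly what that glue looks like. A minimal sketch, assuming Claude Code's non-interactive claude -p mode and an illustrative src/*.c layout; not the exact script I ran:

            #!/usr/bin/env bash
            # Find -> verify passes over every source file, one report per file.
            set -euo pipefail
            DATE=$(date +%F)
            mkdir -p "vulns/$DATE"
            for file in src/*.c; do
              base=$(basename "$file")
              # Pass 1: hunt for an exploitable vulnerability.
              claude -p "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$base.vuln.md"
              # Pass 2: fresh context, so it can't just agree with itself.
              claude -p "I've got an inbound vulnerability report; it's in vulns/$DATE/$base.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$base.triage.md"
            done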

        • 4b11b4 21 minutes ago
          One such example is IRIS. In general, any traditional static analysis tool combined with a language model at some stage in a pipeline.
        • throawayonthe 22 minutes ago
          it was probably in the talk, but from what i understood in another article, it's basically giving claude a fresh context with the .vuln.md file and saying "i'm getting this vulnerability report, is this real?"
      • antonvs 26 minutes ago
        > to see a lot of people that can't see with their eyes in Hacker News feels weird.

        Turns out the average commenter here is not, in fact, a "hacker".

    • linsomniac 2 hours ago
      The article doesn't say they found a bunch of false positives. It says they have a huge backlog that they still need to test:

      "I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet…"

    • mtlynch 5 hours ago
      > What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.

      Source? I haven't seen this anywhere.

      In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.

      • Supermancho 2 hours ago
        On the issue of AI-submitted patches being more of a burden than a boon, many projects have decided to stop accepting AI-generated contributions:

        https://blog.devgenius.io/open-source-projects-are-now-banni...

        These are just a few examples. There are more that Google can supply.

        • logicprog 38 minutes ago
          According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently reversed significantly, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by more sophisticated LLM-powered vuln research tools useful[2], and the guy who actually ran those tools found "They have low false positive rates."[3]

          Additionally, there was no mention in the talk by the guy who found the vuln discussed in TFA of what the false positive rate was, or that he had to sift through the reports because they were mostly slop — or whether he was doing it out of courtesy. He also said he found only several hundred, iirc, not "thousands." All he said was:

          "I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

          He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

          [0]: https://lwn.net/Articles/1065620/
          [1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...
          [2]: https://simonwillison.net/2025/Oct/2/curl/p
          [3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...

        • literalAardvark 1 hour ago
          No, they haven't. Read the ai slop you posted carefully.

          It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.

          An Eternal September problem, kind of.

          • coldtea 1 hour ago
            Didn't you just restate what the parent claimed?
            • cwillu 1 hour ago
              No, that's not at all the same thing: ai-generated contributions from people with a track record for useful contributions are still accepted.
              • dpark 57 minutes ago
                Right. AI submissions are so burdensome that they have had to refuse them from all except a small set of known contributors.

                The fact that there’s a small carve out for a specific set of contributors in no way disputes what Supermancho claimed.

                • phanimahesh 41 minutes ago
                  A power tool that needs discretion and good judgement to be used well is being restricted to people with a track record of displaying good judgement. I see nothing wrong here.

                  AI enables volume, which is a problem. But it is also a useful tool. Does it increase review burden? Yes. Is it excessively wasteful energy-wise? Yes. Should we avoid it? Probably not. We have to be pragmatic and learn to use the tools responsibly.

                  • dpark 19 minutes ago
                    I never said anything is wrong with the policy. Or with the tool use for that matter.

                    This whole chain was one person saying “AI is creating such a burden that projects are having to ban it”, someone else being willfully obtuse and saying “nuh uh, they’re actually still letting a very restricted set of people use it”, and now an increasingly tangential series of comments.

              • coldtea 15 minutes ago
                Yes, but technically no different than "good contributions from humans are still accepted, AI slop can fuck off".

                Since the onus falls on those "people with a track record for useful contributions" to verify, design tastefully, test and ensure those contributions are good enough to submit - not on the AI they happen to be using.

                If it fell on the AI they're using, then any random guy using the same AI would be accepted.

      • christophilus 3 hours ago
        Same. Codex and Claude Code on the latest models are really good at finding bugs, and really good at fixing them in my experience. Much better than 50% in the latter case and much faster than I am.
      • r9295 5 hours ago
        In my experience, the issue has been likelihood of exploitation or issue severity. Claude gets it wrong almost all the time.

        A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact.

      • j16sdiz 4 hours ago
        In TFA:

           I have so many bugs in the Linux kernel that I can’t 
           report because I haven’t validated them yet… I’m not going 
           to send [the Linux kernel maintainers] potential slop, 
           but this means I now have several hundred crashes that they
           haven’t seen because I haven’t had time to check them.
            
            —Nicholas Carlini, speaking at [un]prompted 2026
        • mtlynch 4 hours ago
          Those aren't false positives; they're results he hasn't yet inspected.

          I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062

          • coldtea 1 hour ago
            >Those aren't false positives; they're results he hasn't yet inspected.

              It's not an XOR

            • Ukv 1 hour ago
              The article quote was being given as the supposed source for "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out", so should substantiate that claim - which it doesn't.

              If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.

          • bethekidyouwant 1 hour ago
            some of them certainly are…
        • sobiolite 1 hour ago
          The comment said "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.".

          Please explain how a bug can both be unvalidated and also have undergone a three-month process to determine it is a false positive?

      • paulddraper 2 hours ago
        Source: """AI is bad"""
    • goalieca 2 hours ago
      Static/dynamic analysis tools find vulnerabilities all the time. Almost all projects of a certain size have a large backlog of known issues from these boring scanners. The issue is sorting through them all and triaging them. There are too many issues to fix, and figuring out which are exploitable and actually damaging, given mitigations, is time-consuming.

      Am I impressed Claude found an old bug? Sort of... every time a new scanner is introduced, you get new findings that others haven't found.

    • logicprog 38 minutes ago
      Okay, so anti-AI people are just making shit up now. Got it.

      According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently reversed significantly, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by more sophisticated LLM-powered vuln research tools useful[2], and the guy who actually ran those tools found "They have low false positive rates."[3]

      Additionally, there was no mention in the talk by the guy who found the vuln discussed in TFA of what the false positive rate was, or that he had to sift through the reports because they were mostly slop — or whether he was doing it out of courtesy. He also said he found only several hundred, iirc, not "thousands." All he said was:

      "I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

      He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

      [0]: https://lwn.net/Articles/1065620/
      [1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...
      [2]: https://simonwillison.net/2025/Oct/2/curl/p
      [3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...

    • boplicity 3 hours ago
      The lesson here shouldn't be that Claude Code is useless, but that it's a powerful tool in the hands of the right people.
      • amelius 3 hours ago
        Unfortunately, also in the hands of the __wrong__ people.

        Maybe even more so, because who is going to wade through all those false positives? A bad actor is maybe more likely to do that.

        • embedding-shape 2 hours ago
          > A bad actor is maybe more likely to do that.

          Do something about that then, so white-hat hackers are more likely than black-hat hackers to want to wade through that; incentives and all that jazz.

      • mavamaarten 3 hours ago
        I'm growing allergic to the hype train and the slop. I've watched real-life talks where people sent some prompt to Claude Code and then proudly presented something mediocre that they didn't make themselves to a whole audience, as if they'd invented warm water, and that just makes me weary.

        But at the same time, it has transformed my work from writing every bit of code myself, to me writing the cool and complex things while giving directions to a helper to sort out the boring grunt work, and it's amazingly capable at that. It _is_ a hugely powerful tool.

        But haters only see red, and lovers see everything through pink glasses.

        • iterateoften 2 hours ago
          Sounds like you might have some mixed feelings about becoming more effective with AI, but at the same time everyone else is too, so the praise you're expecting is diluted.

          I see it all the time now too. People have no frame of reference about what is hard or easy, so engineers feel under-appreciated: the guy who never coded gets lots of praise for doing something basic, while experienced people are able to spit out incredibly complex things. But to an outsider, both look like they took the same amount of work.

        • sph 2 hours ago
          > it has transformed my work […] to me writing the cool and complex things

          > it's amazingly capable at that.

          > It _is_ a hugely powerful tool

          Damn, that’s what you call being allergic to the hype train? This type of hypocritical thinly-veiled praise is what is actually unbearable with AI discourse.

          • asyx 2 hours ago
            I don’t think it is controversial that AI tools are good enough at CRUD endpoints that it is totally viable to just let them run through the grunt work of hooking endpoints up to a service, and then you can focus on the interesting aspect of the application, which is exactly that service.
      • righthand 3 hours ago
        The lesson or the hype mantra?
      • teeray 3 hours ago
        The same could be said about a Roulette wheel set before a seasoned gambler
        • TheCoreh 1 hour ago
          Can a Roulette wheel set find vulnerabilities in software?
          • edoceo 1 hour ago
            If vulnerability=compulsion and software=meat bags then yes.
        • throw-the-towel 1 hour ago
          This is a non-sequitur if I ever saw one.
        • vntok 54 minutes ago
          No. The seasoned gambler cannot learn things that measurably increase their chances at the roulette wheel, whereas they definitely can do that with an LLM. And the LLM itself becomes smarter over time through hardware upgrades, software updates, and even memory, for those who enable that feature.
    • bri3d 38 minutes ago
      This is not how first-party vulnerability research with LLMs goes; they are incredibly valuable versus all prior tooling at triaging and producing only high-quality bugs, because they can be instructed to produce a PoC and prove that the bug is reachable. It's traditional research methods (fuzzing, static analysis, etc.) that are more prone to false-positive overload.

      The reason open submission channels (PRs, bug bounty, etc.) are having issues with AI slop spam is that LLMs are also good at spamming, not that they are bad at programming or especially vulnerability research. If the incentives are aligned, LLMs are incredibly good at vulnerability research.

    • sva_ 5 hours ago
      Couldn't you just make it write a PoC?
    • addandsubtract 5 hours ago
      On the other hand, some bugs take three months to find. So this still seems like a win.
    • xeromal 1 hour ago
      [dead]
    • khalic 4 hours ago
      [flagged]
      • j16sdiz 4 hours ago
        [flagged]
        • khalic 4 hours ago
          He explicitly talks about not sending the maintainers slop, learn how to read.
  • mattbee 3 hours ago
    Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?" is a very persuasive on-ramp for developers new to AI. It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.

    I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.

    • dvfjsdhgfv 3 hours ago
      > Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?"

      It's actually the main way I use CC/codex.

      • petesergeant 2 hours ago
        I find Codex sufficiently better for it that I’ve taught Claude how to shell out to it for code reviews
        • linsomniac 1 hour ago
            Ditto. I made a "/codex-review" skill in Claude Code that reviews the last git commit and writes an analysis of it for Claude Code to then work from. I've had very good luck with it.

            One particularly striking example: I had CC do some work, then kicked off a "/codex-review" and went to test the changes while it was running. I found a deadlock, but when I switched back to CC, the Codex review had found the same deadlock and Claude Code was already working on a fix.
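
            For the curious, the skill mostly boils down to a shell-out. A sketch under assumptions (the codex exec non-interactive mode is real, but the prompt wording and paths here are illustrative, not my exact skill):

              # Dump the last commit (message + diff) somewhere Codex can read it.
              git show HEAD > /tmp/last-commit.diff
              # Ask Codex, non-interactively, for a review; Claude Code then
              # reads the output file and acts on the findings.
              codex exec "Review the change in /tmp/last-commit.diff for bugs, race conditions, and deadlocks. Write a numbered list of findings with file and line references." > /tmp/codex-review.md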

      • vaginaphobic 1 hour ago
        [dead]
    • slig 40 minutes ago
      > "Codex wrote this, can you spot anything weird?"
  • userbinator 6 hours ago
    Not "hidden", but probably more like "no one bothered to look".

    > declares a 1024-byte owner ID, which is an unusually long but legal value for the owner ID.

    When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.

    > it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer

    This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
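
    As a sketch of that starting point (the file paths and prompt are illustrative; claude -p is Claude Code's non-interactive mode):

      # Shortlist fixed-size buffers, then have the LLM reason about the
      # maximum number of bytes each one can receive.
      grep -rn --include='*.c' -E 'char +[a-z_]+\[[0-9]+\]' fs/ > buffers.txt
      claude -p "For each location listed in buffers.txt, determine the largest write the surrounding code can perform into that buffer, and flag any path where it exceeds the declared size."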

    • mrshadowgoose 8 minutes ago
      > Not "hidden", but probably more like "no one bothered to look".

      Well yeah. There weren't enough "someones" available to look. There are a finite number of qualified individuals with time available to look for bugs in OSS, meaning there is a finite amount of bug-finding capacity available in the world.

      Or at least there was. That's what's changing as these models become competent enough to spot and validate bugs. That finite global capacity to find bugs is now increasing, and actual bugs are starting to be dredged up. This year will be very, very interesting if models continue to increase in capability.

    • NitpickLawyer 6 hours ago
      > This is something a lot of static analysers can easily find.

      And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...

      I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.

      • userbinator 6 hours ago
        Remember Heartbleed in OpenSSL? That long predated LLMs, but same story: some bozo forgot how long something should/could be, and no one else bothered to check either.
        • dlopes7 2 hours ago
          Hey we are the bozos
          • braiamp 1 hour ago
            Lets all get together and self-reflect on the bozos way.
      • choeger 39 minutes ago
        It's much, much easier to run an LLM than to use a static or dynamic analyzer correctly. At the very least, the UI has improved massively with "AI".
      • pjmlp 2 hours ago
        Most likely no one ran them, given the developer culture.
  • DGAP 1 hour ago
    I replicated this experiment on several production codebases and got several crits. Lots of dupes, lots of false positives, lots of bugs that weren't actually exploitable, lots of accepted/ known risks. But also, crits!
  • summarity 5 hours ago
    Related work from our security lab:

    Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/

    Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...

  • misiek08 3 hours ago
    Do not expect so many more reports. Expect so many more attacks ;)
  • cesaref 4 hours ago
    I'm interested in the implications for the open source movement, specifically around security concerns. Anyone know if there has been a study of how well Claude Code works on closed-source (but decompiled) code?
    • skeledrew 3 hours ago
      > Claude Code works on closed-source (but decompiled) code

      Very likely not nearly as well, unless there are many open source libraries in use and/or the language+patterns used are extremely popular. The really huge win for something like the Linux kernel and other popular OSS is that the source appears in the training data, a lot, and in many versions. So providing the source again and saying "find X" is primarily bringing into focus things it's already seen during training, with little novelty beyond the updates that happened after the knowledge cutoff.

      Giving it a closed-source project containing a lot of novel code means it only has the language and its "intuition" to work from, which is a far greater ask.

      • kasey_junk 2 hours ago
        I’m not a security researcher, but I know a few and I think universally they’d disagree with this take.

        The LLMs know about every previously disclosed security vulnerability class and can use that to pattern-match. And they can do it against compiled, and in some cases obfuscated, code as easily as source.

        I think the security engineers out there are terrified that the balance of power has shifted too far toward finding closed-source vulnerabilities, because getting patches deployed will still take so long. Not that the LLMs are in some way hampered by novel codebases.

        • skeledrew 2 hours ago
          Many vulnerabilities aren't just pattern matching though; deep understanding of the context of the particular codebase is also needed. And a novel codebase means more attention than usual will be spent grepping and keeping the context in focus, which will make it easier to miss certain things than if enough of the context were already encoded in the model weights.

          Same thing applies to humans: the better someone knows a codebase, the better they will be at resolving issues, etc.

  • dist-epoch 6 hours ago
    > "given enough eyeballs, all bugs are shallow"

    Time to update that:

    "given 1 million tokens context window, all bugs are shallow"

    • riffraff 5 hours ago
      ..and three months to review the false positives
      • 112233 5 hours ago
        this is always overlooked. AI stories sound like "with the right attitude, you too can win $10M in the lottery, like this man just did"

        Running an LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them) — of course, only the lottery winners who pulled the actually correct report from the bag will write an article in the Evening Post

        • red75prime 4 hours ago
          > these numbers are accurate because I just generated them

          Is it sarcasm, or did you really do this? Claude Opus 4.6?

    • bigbugbag 4 hours ago
      more like: some bugs are shallow, and others are pieced-together false positives from an automated tool reliable in its unreliability.
  • lnkl 5 hours ago
    "Guy working at company making product, says that the newer version of the product is better"

    Huh, who would've expected this.

    • FromTheFirstIn 2 hours ago
      Every single post here these days. “Startup founder of Communality.ai says ai good for people” and then the comments are AI bros declaring that all work can end, the good times are here at last
  • skeeter2020 59 minutes ago
    And with AI generating vulnerabilities at an accelerated pace this business is only getting bigger. Welcome to the new antivirus!
    • bitexploder 52 minutes ago
      There will always be more bugs than we can fix. AI can patch as well, but if your system is difficult to test and doesn't have rigorous validation you will likely get an unacceptable amount of regression.
  • rixrax 2 hours ago
    I hope performance and bloat are next up for the LLMs to try and improve.

    Especially on the perf side, I would wager LLMs can go from the meat sacks' "whatever works" to "how do I solve this with the best available algorithm and architecture (while also following some best practices)?"

  • jazz9k 16 hours ago
    This does sound great, but the cost of tokens will prevent most companies from using agents to secure their code.
    • qingcharles 37 minutes ago
      I'm thinking about how much money Anthropic etc are making from intelligence services who are running Opus 4.6 on ultra high settings 24 hours a day to find these kinds of exploits and take advantage of them before others do.

      Expensive for me and you, but peanuts for a nation state.

    • KetoManx64 15 hours ago
      Tokens are insanely cheap at the moment. Through OpenRouter, a message to Sonnet costs about $0.001 cents, or using Devstral 2512 it's about $0.0001. An extended coding session/feature expansion will cost me about $5 in credits. Split up your codebase so you don't have to feed all of it into the LLM at once, and it's very reasonable.
      • lebovic 12 hours ago
        It cost me ~$750 to find a tricky privilege escalation bug in a complex codebase where I knew the rough specs but didn't have the exploit. There are certainly still many other bugs like that in the codebase, and it would cost $100k-$1MM to explore the rest of the system that deeply with models at or above the capability of Opus 4.6.

        It's definitely possible to do a basic pass for much less (I do this with autopen.dev), but it is still very expensive to exhaustively find the harder vulnerabilities.

        • christophilus 3 hours ago
          This is where the Codex and Claude Code Pro/Max plans are excellent. I rarely run into the limits of Codex. If I do, I wait and come back and have it resume once the window has expired.
          • Jcampuzano2 3 hours ago
            Claude and Codex pro/max subs aren't supposed to be used for commercial/enterprise development so it's not really an option for execs in enterprise. They need to take API costs into account.

            At my F500 company, execs are very wary of the costs of most of these tools, and it's always top of mind. We have dashboards and gather tons of internal metrics on which tools devs are using and how much they are costing.

            • otterley 2 hours ago
              Are they also measuring productivity? Measuring only token costs is like looking only at grocery spend but not the full receipt: you don’t know whether you fed your family for a week or for only a day.
              • batshit_beaver 44 minutes ago
                First you have to figure out HOW to measure productivity.
            • petesergeant 2 hours ago
              > Claude and Codex pro/max subs aren't supposed to be used for commercial/enterprise development

              lolwut?

              • blks 1 hour ago
                Read the ToS.
        • otterley 2 hours ago
          How much would it have cost a human to do the same work? The question isn’t how much tokens cost; the question is how much money is saved by using AI to do it.
        • skeledrew 2 hours ago
          Compare to the cost when said vulnerabilities are exploited by bad actors in critical systems. Worth it yet?
      • zozbot234 1 hour ago
        Agentic tasks use up a huge amount of tokens compared to simple chatting. Every elementary interaction the model has with the outside world (even while doing something as simple as reading code from a large codebase) is a separate "chat" message and "response", and these add up very quickly.
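
        To illustrate with made-up but plausible numbers: an agent that takes 100 tool-call turns against a context that has grown to ~50k tokens re-sends on the order of 100 × 50k = 5M input tokens for a single task, versus a few thousand for a simple chat exchange. (Prompt caching cuts the price of those re-reads, but not the token count.)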
      • gmerc 12 hours ago
        You’d have to ignore the massive investor ROI expectations or somehow have no capability to look past “at the moment”.
        • NitpickLawyer 6 hours ago
          That might be a problem for the labs (although I don't think it is), but it's not a problem for end users. There is enough pressure from top labs competing with each other, and even more pressure from open models, to keep prices at a reasonable point going forward.

          In order to justify higher prices the SotA needs to have way higher capabilities than the competition (hence justifying the price) and at the same time the competition needs to be way below a certain threshold. Once that threshold becomes "good enough for task x", the higher price doesn't make sense anymore.

          While there is some provider retention today, it will be harder to maintain once everyone offers kinda sorta the same capabilities. Changing API providers might even be transparent to most users, and they wouldn't care.

          If you want an idea of token prices today, you can check the median for serving open models on OpenRouter or similar platforms. You'll get a "napkin math" estimate of what it costs to serve a model of a certain size today. As long as models don't go an order of magnitude beyond today's largest models, API pricing seems in line with a modest profit (so it shouldn't be subsidised, and it should drop with tech progress). Another benefit of open models is that once they're released, that capability remains there. The models can't get "worse".

        • KetoManx64 12 hours ago
          Not really. I'm fully taking advantage of these low prices while they last. Eventually the AI companies will start running out of funny money and start charging what the models actually cost to run; then I'll just switch over to using self-hosted models more often and use the online ones for projects that need the extra resources. Currently there's no reason why I shouldn't use Claude Sonnet to write one-time bash scripts; once it starts costing me a dollar to do so, I'll change my behavior.
          • skeledrew 2 hours ago
            > start charging what the models actually cost to run

            The political climate won't allow that to happen. The US will do everything to stay ahead of China, and a rise in prices means a sizeable migration to Chinese models, giving them that much more data to improve their models and pass the US in AI capability (if they haven't already).

            But it will also happen in a way, as eventually models will become optimized enough that running costs become more or less negligible from a sustainability perspective.

          • deaux 9 hours ago
            > Currently there's no reason why I shouldn't use Claude Sonnet to write one-time bash scripts; once it starts costing me a dollar to do so, I'll change my behavior.

            This just isn't going to happen; we have open-weights models that are on the level of Sonnet _right now_, and we can roughly calculate how much they cost to run. The best open-weights models used to be 2 generations behind, then they were 1 generation behind, now they're on par with the mid-tier frontier models. You can choose among many different Kimi K2.5 providers. If you believe that every single one of those is running at 50% subsidies, be my guest.

          • twosdai 10 hours ago
            I also have this feeling. But do you ever doubt it? That when the time comes we will be like the boiled frog, where it's "just so convenient", or the reality of setting up a local AI is just a worse experience for a large upfront cost?
            • iririririr 9 hours ago
              worse. he's already boiled. probably paying way more than that one dollar per bash script with all the subscriptions he already has.
              • KetoManx64 8 hours ago
                Yeah, the $20 I paid to OpenRouter about 4 months ago really cost me an arm and a leg, not sure where I'll get my next meal if I'm to be honest.
      • ThePowerOfFuet 7 hours ago
        >$0.001 cents

        $0.001 (1/10 of a cent) or 0.001 cents (1/1000 of a cent, or $0.00001)?

    • NitpickLawyer 6 hours ago
      Tokens aren't more expensive than highly trained meatbags today. There's no way they'll be more expensive "tomorrow"...
      • bigbugbag 4 hours ago
        they are and they will be; then they won't be, after the market crashes, the bubble bursts, and the companies go bankrupt, possibly taking down a major portion of the global economy with them.
        • skeledrew 2 hours ago
          > they are and they will be

          Calculate the approximate cost of raising a human from birth to having the knowledge and skills to do X, along with the maintenance required to keep doing X. Multiply by a reasonable scaling factor in comparison to one of today's best LLMs (i.e. how many humans, and how much time, to do X n times, vs. the LLM).

          Calculate the cost of hardware (from raw elements), training, and maintenance for said LLM (if you want to include the cost of research+software, then you'll have to also include the costs of raising those who taught, mentored, etc. the human as well). Consider that the human usually specializes, while the LLM touches everything. I think you'll find even a rough approximate answer very enlightening if you're honest in your calculations.

          • Synthetic7346 0 minutes ago
            But companies don't have to bear the cost of raising a human from birth, or of training them. They only pay the cost of hiring them, and that includes the cost of maintenance.

            Add to that the fact that we can't blindly trust LLM output just yet, so we need a meatbag to review it.

            An LLM alone will always be more expensive than a human + LLM, until we're at a stage where we can remove the human from the loop.

    • epolanski 6 hours ago
      I don't buy it.

      Inference cost has dropped 300x in 3 years; there's no reason to think this won't keep happening with improvements in models, agent architecture, and hardware.

      Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost.

      From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs the low-single-digit % delta in benchmark performance.

      Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.

  • eichin 16 hours ago
    An explanation of the Claude Opus 4.6 linux kernel security findings as presented by Nicholas Carlini at unpromptedcon.
    • eichin 16 hours ago
      https://www.youtube.com/watch?v=1sd26pWhfmg is the presentation itself. The prompts are trivial; the bug (and others) looks real and well-explained - I'm still skeptical but this looks a lot more real/useful than anything a year ago even suggested was possible...
  • desireco42 1 hour ago
    A developer using Claude Code found this bug. Claude is a tool. It is used by developers. It should not sign commits. Neovim never tried to sign commits with me, nor Zed.
    • igravious 4 minutes ago
      "Should not"? Is that your new law? The non-agentic "Neovim and Zed never tried to sign commits [for] me", therefore no tool, ever, no matter how advanced, is allowed to sign a commit?

      Did it ever occur to you that for whatever reason you just might not be cut out for the software treadmill?

  • alsanan2 2 hours ago
    Making it public that AI is capable of finding that kind of vulnerability is a big problem. In this case it's nice that the vulnerability was closed before publishing, but if a cracker had found it, the result would be extremely different. This kind of news only opens the crackers' eyes.
  • cookiengineer 5 hours ago
    > Nicholas has found hundreds more potential bugs in the Linux kernel, but the bottleneck to fixing them is the manual step of humans sorting through all of Claude’s findings

    No, the problem is sorting out thousands of false positives from Claude Code's reports. 5 valid findings out of 1000+ reports is statistically worse than running a fuzzer on the codebase.

    Just sayin'

    • mtlynch 5 hours ago
      > 5 valid findings out of 1000+ reports is statistically worse than running a fuzzer on the codebase.

      Carlini said "hundreds" of crashes, not 1000+.

      It's not that only 5 were true positives and the rest were false positives. 5 were true positives and Carlini doesn't have bandwidth to review the rest. Presumably he's reviewed more than 5 and some were not worth reporting, but we don't know what that number is. It's almost certainly not hundreds.

      Keep in mind that Carlini's not a dedicated security engineer for Linux. He's seeing what's possible with LLMs, and his team is simultaneously exploring the Linux kernel, Firefox,[0] Ghostscript, OpenSC,[1] and probably lots of others that they can't disclose because they're not yet fixed.

      [0] https://www.anthropic.com/news/mozilla-firefox-security

      [1] https://red.anthropic.com/2026/zero-days/

    • dist-epoch 5 hours ago
      > On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us. ... Also it's interesting to keep thinking that these bugs are within reach from criminals so they deserve to get fixed.

      https://lwn.net/Articles/1065620/

  • claudexai 10 minutes ago
    [dead]
  • jeremie_strand 1 hour ago
    [dead]
  • adamsilvacons 3 hours ago
    [dead]
  • LeonTing1010 5 hours ago
    [dead]
  • pithtkn 3 hours ago
    [dead]
  • roach54023 2 hours ago
    [dead]
  • up2isomorphism 13 hours ago
    But on the other hand, Claude might introduce more vulnerabilities than it discovers.
    • yunnpp 13 hours ago
      Code review is the real deal for these models. This area seems largely underappreciated to me. Especially for things like C++, where static analysis tools have traditionally generated too many false positives to be useful, the LLMs seem especially good. I'm no black hat but have found similarly old bugs at my own place. Even if shit is hallucinated half the time, it still pays off when it finds that really nasty bug.

      Instead, people seem to be infatuated with vibe coding technical debt at scale.

      • qsera 5 hours ago
        > Code review is the real deal for these models.

        Yea, that is what I have been saying as well...

        >Instead, people seem to be infatuated with vibe coding technical debt at scale.

        Don't blame them. That is what AI marketing pushes. And people are sheep to marketing..

        I understand why AI companies don't want to promote it: they understand that the LCD/majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments they are getting...

      • Serberus 6 hours ago
        [dead]
    • khalic 4 hours ago
      Guys please read the article before commenting...
  • _pdp_ 6 hours ago
    The title is a little misleading.

    It was Opus 4.6 (the model). You could discover this with some other coding agent harness.

    The other thing that bugs me, and frankly I don't have the time to try it out myself, is that they did not compare to see whether the same bug would have been found with GPT 5.4 or perhaps even an open source model.

    Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for Claude Code.

    • mtlynch 5 hours ago
      OP here.

      I don't understand this critique. Carlini did use Claude Code directly. Claude Code used the Claude Opus 4.6 model, but I don't know why you'd consider it inaccurate to say Claude Code found it.

      GPT 5.4 might be capable of finding it as well, but the article never made any claims about whether non-Anthropic models could find it.

      If I wrote about achieving 10k QPS with a Go server, is the article misleading unless I enumerate every other technology that could have achieved the same thing?

    • mgraczyk 5 hours ago
      No, the title is correct and you are misreading or didn't read. It was found with Claude Code; that's the quote. This isn't a model eval, it's an Anthropic employee talking about Claude Code. So comparing to other models isn't a thing to reasonably expect.
    • weird-eye-issue 4 hours ago
      > You could discover this with some other coding agent harness.

      And surely that would be relevant if they were using a different harness.