AI Security

GPT-4 Can Exploit Real Vulnerabilities By Reading Security Advisories

Long-time Slashdot reader tippen shared this report from the Register: AI agents, which combine large language models with automation software, can successfully exploit real world security vulnerabilities by reading security advisories, academics have claimed.

In a newly released paper, four University of Illinois Urbana-Champaign (UIUC) computer scientists — Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang — report that OpenAI's GPT-4 large language model (LLM) can autonomously exploit vulnerabilities in real-world systems if given a CVE advisory describing the flaw. "To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description," the US-based authors explain in their paper. "When given the CVE description, GPT-4 is capable of exploiting 87 percent of these vulnerabilities compared to 0 percent for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit)...."

The researchers' work builds upon prior findings that LLMs can be used to automate attacks on websites in a sandboxed environment. GPT-4, said Daniel Kang, assistant professor at UIUC, in an email to The Register, "can actually autonomously carry out the steps to perform certain exploits that open-source vulnerability scanners cannot find (at the time of writing)."

The researchers wrote that "Our vulnerabilities span website vulnerabilities, container vulnerabilities, and vulnerable Python packages. Over half are categorized as 'high' or 'critical' severity by the CVE description...."

"Kang and his colleagues computed the cost to conduct a successful LLM agent attack and came up with a figure of $8.80 per exploit"
Comments:
  • ... I thought some Slashdotters kept telling me that LLMs can't code?
    • Re:But ... (Score:5, Interesting)

      by Tony Isaac ( 1301187 ) on Sunday April 21, 2024 @04:32PM (#64412790) Homepage

      Oh, it can code all right. About like a first-year computer science major can code. It can look up stuff about the problem at hand (it's *really* good at that) and apply it by writing some code. Can you *trust* that code to actually work? Not in the least. You can't even trust that it will compile. But in the hands of an actual developer, it can provide an enormous shortcut.

      • Comment removed (Score:5, Informative)

        by account_deleted ( 4530225 ) on Sunday April 21, 2024 @04:35PM (#64412798)
        Comment removed based on user account deletion
        • Noticed that as well, the quality of the output depends on the quality of the input. And you have to know something about the subject the output is about to know whether it's good or garbage (or look at it, period, and not put blind faith in its output). We've seen it with lawyers using it for arguments and then not double-checking the cases. Sometimes AI just makes up stuff.
          • The right way to think about it is not that they sometimes make stuff up; it's that they always write something that looks like your answer.

          • Yeah, I've noticed that. I've recently taken to asking it to write SQLAlchemy models that match JSON nests downloaded from APIs, and it aces it every time. But I'm not sure a non-coder would know to ask for that, or how to do it.

            Honestly, I don't think it's that much buggier than what I write. I keep my standards up by writing unit tests, but those tests will explode a few times before I get it right, because if I write 500 lines of code before I hit compile, chances are I've missed a comma or botched the param
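
            A rough sketch of that workflow (the JSON shape and model names below are hypothetical, not taken from the comment): SQLAlchemy models mirroring a small nested payload.

                # Hypothetical sketch: SQLAlchemy models mirroring a JSON payload such as
                # {"id": 1, "name": "widget", "tags": [{"id": 7, "label": "new"}]}
                from sqlalchemy import Column, ForeignKey, Integer, String
                from sqlalchemy.orm import declarative_base, relationship

                Base = declarative_base()

                class Item(Base):
                    __tablename__ = "items"
                    id = Column(Integer, primary_key=True)
                    name = Column(String, nullable=False)
                    tags = relationship("Tag", back_populates="item")

                class Tag(Base):
                    __tablename__ = "tags"
                    id = Column(Integer, primary_key=True)
                    label = Column(String, nullable=False)
                    item_id = Column(Integer, ForeignKey("items.id"))
                    item = relationship("Item", back_populates="tags")

            With an engine in hand, Base.metadata.create_all(engine) would then create both tables.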

          • Noticed that as well, the quality of the output depends on the quality of the input. And you have to know something about the subject the output is about to know whether it's good or garbage (or look at it, period, and not put blind faith in its output). We've seen it with lawyers using it for arguments and then not double-checking the cases. Sometimes AI just makes up stuff.

            Right, it's a tool. In the right hands, it is super useful.

        • >> absolutely is time saving for me

          I'm using a free one, and it does very good code completion at times, which is an excellent time saver. If I type a quick comment about what I want to accomplish before starting to write the code, it does an even better job. Sometimes I enter a few keystrokes and bang, it flawlessly completes an entire clause.
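
          A minimal sketch of that comment-first pattern (the comment, function, and field names are invented for illustration): you write the comment, and the completion proposes the body.

              # The developer types only the comment below; the assistant proposes the rest.

              # parse "YYYY-MM-DD" strings and return only the dates that fall on a weekend
              from datetime import datetime

              def weekend_dates(date_strings):
                  parsed = [datetime.strptime(s, "%Y-%m-%d") for s in date_strings]
                  return [d.date() for d in parsed if d.weekday() >= 5]  # 5 = Sat, 6 = Sun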

        • It saves me a ton of time -- and I am trying to incorporate AI in every task -- not just coding. Glancing through and fixing is required, but it works "good enough" to save time overall once I figured out how to dialogue with it. Most of the time, at least for me, the code compiles. I seriously think all employers and colleges need to mandate that students incorporate AI into their workflow. AI will only keep improving, and the people who know how to use it will be the most productive. I pretty much use AI for e

          • writing reference letters,

            Which kind of reference letter? The ones I know are meant to recommend departing employees to their future employers. How is AI helpful? AI produces well-polished blah-blah, but here it's about the things that are only in your mind: your experience with the person, their main qualities, their accomplishments while they were on your team. It's signed with your name, so it had better be accurate, because you will also be judged by whoever hired them based on your recommendation. It should take you time and considerat

            • Of course it's accurate. It serves as a template so I don't miss things ... for example, to include examples, to mention things I might normally forget to say. It writes out the sentences ... if I disagree, I won't put it in. Declining to write a letter because of my own schedule, or writing a shitty one because of my own forgetfulness or shortcomings, is not an excuse to screw someone else out of something they've worked for and deserve.

          • Ok so, here's an honest question. I have never used AI except on Bing. I tried to use ChatGPT once, but I didn't feel they had a right to my phone number, so I didn't sign up. I have a programming problem that is part of a larger application that seems kind of robotic and repetitive and may be fitting for AI to save time, so here goes:

            I have an API interface composed of around 500 various calls that I need to compose into human-readable data tables on a screen. The output is in complex JSON stru
            • Try uploading the API spec to claude.ai (or other AI you like) and asking it to do that .. at worst I think it will give you detailed instructions or code to do that (if you ask for it).
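
              One hedged sketch of the "complex JSON into a readable table" step, assuming list-of-dict responses (the field names are invented and not from the poster's API): pandas' json_normalize flattens the nesting into dotted columns.

                  # Hypothetical sketch: flatten a nested JSON API response into a tabular view.
                  # Field names are made up; a real API spec would drive which columns to pull.
                  import pandas as pd

                  response = {
                      "items": [
                          {"id": 1, "name": "alpha", "status": {"code": "OK", "updated": "2024-04-01"}},
                          {"id": 2, "name": "beta", "status": {"code": "WARN", "updated": "2024-04-02"}},
                      ]
                  }

                  # Nested dicts become dotted column names ("status.code", "status.updated").
                  table = pd.json_normalize(response["items"])
                  print(table[["id", "name", "status.code", "status.updated"]])

              For 500 endpoints, the same pattern would be driven by the API spec rather than hard-coded field names.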

      • About like a first-year computer science major can code.

        The same people ServiceNow uses for their products.

      • Is it? I have no idea, but I do know it saves me a ton of time. Glance through and fix is what I usually do. It works "good enough" to save time overall. Most of the time, at least for me, the code compiles. I seriously think all employers and colleges need to mandate students incorporate AI into their workflow. AI will only keep improving, and the people who know how to use it will be the most productive. I pretty much use AI for everything, from writing reference letters, to research, and coding. Used to

      • It generates working code well enough in ChatGPT-4, where it runs its own Python in a sandbox to give you results. So rather than asking for a program to do X, you can paste in your data and say "do X to it." The same caveats apply, but writing working code is easy.

        • It depends on your definition of "working." Sure, a sandbox implementation is easy. Modifying your existing application in such a way that the added functionality works and is "right" is a lot harder. Yes, GitHub Copilot uses GPT-4 to modify your existing source code, updating it based on input prompts. It often generates code changes that are "almost" correct, but in nearly every case, I have to tweak it in some way, often before it will even compile.

          Making a standalone bit of code isn't hard. Making it wo

    • by Anonymous Coward
      "LLM's can't code well" I think is the quote you're looking for. AI-produced code is typically beginner level and full of issues that would themselves cause exploits (like buffer overruns and use-after-free), but it would be perfectly servicable for this application.
    • Re: (Score:2, Insightful)

      What gets me going are comments that seem to completely ignore the possibility the LLMs do in fact represent an intelligence of some sort. When the "Godfather" of AI quits a lucrative position to be able to speak freely about the dangers of AI, IMHO, we should not easily rule out the possibility that language and intelligence are closely related.
      • Re:But ... (Score:5, Informative)

        by gweihir ( 88907 ) on Sunday April 21, 2024 @06:28PM (#64412964)

        That is just because you do not understand how an LLM works. No intelligence in there. No language skills either. It can fake it in limited circumstances, but unlike a real speaker, it has no clue how close or far off it is because it has no clue what it is doing.

        • I thought I would find you here. Nothing to add to the discussion. Just keep on repeating that no one else understands how LLMs work and that there's no "intelligence" there.

          You really have no life outside of /. do you?

      • by narcc ( 412956 )

        What gets me going are comments that seem to completely ignore the possibility the LLMs do in fact represent an intelligence of some sort.

        That's because that's not a possibility, it's silly nonsense.

        Take some time and learn about how LLMs work. This fact will become obvious very quickly.

        When the "Godfather" of AI quits a lucrative position to be able to speak freely about the dangers of AI

        He was 75 when he left Google. I won't say that senility was a factor, but he's been spouting nonsense ever since.

        • What gets me going are comments that seem to completely ignore the possibility the LLMs do in fact represent an intelligence of some sort.

          That's because that's not a possibility, it's silly nonsense.
          Take some time and learn about how LLMs work. This fact will become obvious very quickly.

          This is like saying you know how processors work therefore you know how the software that runs on them works.

          • by vyvepe ( 809573 )

            What gets me going are comments that seem to completely ignore the possibility the LLMs do in fact represent an intelligence of some sort.

            That's because that's not a possibility, it's silly nonsense. Take some time and learn about how LLMs work. This fact will become obvious very quickly.

            This is like saying you know how processors work therefore you know how the software that runs on them works.

            narcc has some point. If you know that your processor has access to only 640 kB of RAM, then there is a limit on what kind of programs you can run. LLMs are not Turing complete, even in a limited way, until the output is also used as a temporary scratchpad. LLMs need some serious tweaking to serve as a base for an efficient AGI.

              • narcc has some point. If you know that your processor has access to only 640 kB of RAM, then there is a limit on what kind of programs you can run.

              The ultimate capabilities of a model are not merely a function of model size, because these systems (especially with agent-based augmentation, including long-term storage and model-directed tool use) are able to decompose problems, leverage the outcomes of previous computations, and direct external processing.

              LLMs are not Turing complete, even in a limited way, until the output is also used as a temporary scratchpad.

              LLMs have been demonstrated to be Turing complete.
              https://arxiv.org/pdf/2301.045... [arxiv.org]

              There is no such thing as a Turing machine in the real world as it requires an infinite tape which is not physically possibl

              • by narcc ( 412956 )

                LLMs have been demonstrated to be Turing complete.
                https://arxiv.org/pdf/2301.045... [arxiv.org]

                This is how we know you don't have a clue.

                You obviously didn't read or understand the paper. That claim depends on infinite-precision reals, which are proven to be impossible to realize in physical systems.

                You're wasting everyone's time with your unimaginable ignorance. Go away.

        • by vyvepe ( 809573 )

          Well, it can "reason" by analogy in a limited way even now. Each token encoding has a lot of dimensions. A dimension can represent a category. If the model learns that something is (mostly) "true" for a category then it can specialize to one specific member of it. Although this can be misled by (higher) temperature. But the (higher) temperature is needed for model "creativity". So some weak/limited deductive reasoning may happen.

          A bigger problem is with any reasoning which requires loops and backtracking. E
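
          For readers who haven't met the "temperature" knob mentioned above, a toy sketch of temperature-scaled sampling over next-token logits (the numbers are made up and stand in for no particular model):

              # Toy sketch: higher temperature flattens the distribution (more "creative"),
              # lower temperature sharpens it toward the most likely token.
              import math, random

              def sample(logits, temperature=1.0):
                  scaled = [l / temperature for l in logits]
                  m = max(scaled)                      # subtract max for numerical stability
                  exps = [math.exp(s - m) for s in scaled]
                  total = sum(exps)
                  probs = [e / total for e in exps]
                  return random.choices(range(len(probs)), weights=probs, k=1)[0]

              logits = [2.0, 1.0, 0.1]                 # toy next-token scores
              print(sample(logits, temperature=0.2))   # almost always picks token 0
              print(sample(logits, temperature=2.0))   # spreads choices across tokens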

          • by Bumbul ( 7920730 )

            They are trying to solve the hallucination problem. I do not see how they can do it without lowering temperature which will just lead to regurgitation of the training data. A copyright problem.

            LLaMA 2 was trained on 2 trillion tokens' worth of training data. And the downloadable size for the biggest model is just 40 GB - I do not think that the training data is stored verbatim in the model....

            • by vyvepe ( 809573 )
              Yes, smaller models should not have this problem.
            • by narcc ( 412956 )

              I do not think that the training data is stored verbatim in the model....

              You are correct. The training data is not stored verbatim in the model. Neither is the model some sort of compression tool. Still, occasionally, you'll see it produce verbatim text. That usually happens because the text was included in the training data hundreds of times. Remember that these models generate text probabilistically on the basis of the training data, which makes verbatim output more likely for such text even though the training text isn't stored in the model.

          • by narcc ( 412956 )

            They are trying to solve the hallucination problem.

            Which is obviously impossible. Again, there is no actual understanding here, which is what you'd need to identify and correct mistakes. (Of course, the basic structure and function of these models means that kind of evaluation is impossible, so whatever hand-wavy ad-hoc definition of "understanding" you want to use doesn't really matter.) So-called "hallucinations" are exactly the kind of output you should expect, given how these models function.

            Take some time to learn about what LLMs are and how they f

        • That's because that's not a possibility, it's silly nonsense.

          Is that the sum of all you've got? Just because you say it's silly means everyone should agree?

          Take some time and learn about how LLMs work. This fact will become obvious very quickly.

          Another inane comment. When will the "fact" become very obvious "quickly"?

          He was 75 when he left Google. I won't say that senility was a factor, but he's been spouting nonsense ever since.

          Yes, he was at Google. His accolades include the fucking Turing award. Keep jerking in your mommy's basement!

          https://en.wikipedia.org/wiki/... [wikipedia.org]

          • by narcc ( 412956 )

            Just because you say it's silly means everyone should agree?

            I've explained this in depth countless times over the past few years. At this point, if you still believe ridiculous nonsense like that, you're either incapable of understanding how these things work or your ignorance is willful.

            His accolades include the fucking Turing award.

            That doesn't make what he's said any less stupid.

            • I've explained this in depth countless times over the past few years. At this point, if you still believe ridiculous nonsense like that, you're either incapable of understanding how these things work or your ignorance is willful.

              There you go again! Not once have you done any such thing. You keep making claims without any evidence.

              That doesn't make what he's said any less stupid.

              Wow! So we have to choose between a /. nut and a Turing-award-winning scientist who's actually spent his life in this field. Not an easy decision!

              • by narcc ( 412956 )

                You keep making claims without any evidence.

                These are basic facts, not nonsense speculation like you've been posting.

                Not once have you done any such thing.

                You must be illiterate as well. Why are you here? Just to waste everyone's time with pointless bullshit?

                Not an easy decision!

                If you weren't a complete moron, you could actually evaluate the claims on their merit.

                Sorry, kid, your hero has gone senile. Get over it.

                • There are no "facts" that you've ever presented. Only claims. Go look up the meanings in a dictionary.

                  The only question is, how many times were you dropped on your head by mommy dear? And how many times was it by accident?

                  • by narcc ( 412956 )

                    Name one thing I've said that isn't an established fact.

                    You're way out of your depth here. Pathetic.

                    • My apologies, I must have missed everything.

                      What have you actually said in any of your posts that was factual? Kindly just copy and paste here so that I am suitably tutored.

                • I think we may be trying to compare human intelligence to machine intelligence, but the two may be different enough that the analogy is misleading us.

                  I would like to raise the possibility that what Mr. Hinton, as the "father of AI", is seeing and describing is akin to raising a young child. Let's say a child comes up to you and tries to explain an experience they had. The child would mispronounce words, use words in the wrong context, and misunderstand the concept of the thing they observed and their feeli
                  • by narcc ( 412956 )

                    is seeing and describing is akin to raising a young child.

                    Try to resist the impulse to anthropomorphize these things. They are not independent entities that learn and grow on their own. Neither are they capable of things like consideration, reason, or analysis. This isn't speculation. These are simple facts, things we know with absolute certainty.

                    One of the differences that Hinton points out:

                    Is also nonsense, which he should know. He's lost his mind. Neural networks do not have "experiences" in any meaningful way. That's insane. Try this link [neuralnetw...arning.com]. That should dispel any absurd notions you might have picked

      • Hey man, ignore the retards "gweihir" and "narcc". They have nothing to add to the discussion. They'll keep on blabbing about "understanding" and "intelligence". Ask them to define what that is and they will retort with an insult.

    • by narcc ( 412956 )

      It's true. LLMs can't code. All they can produce is text that looks like code, just like any other text they produce. They lack the ability to consider, reason, and analyze. This is an indisputable fact.

      Try not to take sensationalist headlines at face value just because they affirm your silly delusions.

      • It's true. LLMs can't code. All they can produce is text that looks like code, just like any other text they produce. They lack the ability to consider, reason, and analyze. This is an indisputable fact.

        Try not to take sensationalist headlines at face value just because they affirm your silly delusions.

        Several months ago I asked an instruction-tuned DeepSeek model to write a program in a DSL it had never seen before. The only possible way it would have been able to produce a working program is by sufficiently understanding the language documentation uploaded into the model's context and applying that understanding to produce a valid, properly formatted working program that did what I requested.

        It's true that LLMs are limited and unreliable, and they don't think like humans do, yet they are very much abl

      • It's true. LLMs can't code. All they can produce is text that looks like code, just like any other text they produce. They lack the ability to consider, reason, and analyze. This is an indisputable fact.

        Try not to take sensationalist headlines at face value just because they affirm your silly delusions.

        My silly delusion that LLMs are a useful tool, that I, a working programmer, actually do use productively? Okay ...

        • by narcc ( 412956 )

          If LLMs are increasing your productivity, you must be among the worst developers on the planet.

          • If LLMs are increasing your productivity, you must be among the worst developers on the planet.

            lol!

            "Keyboards, pssh. Kids today ... why we all set our bits by hand!"

  • It's exactly the way GPT-4 helps programmers accomplish any programming task. It searches the internet for solutions, then regurgitates them in the form of code. There are two sides of the coin when it comes to GPT code generation: it provides coding assistance, but it has no way to distinguish between good motives and bad.

    • I'm not a programmer but I assume "evil" code follows the same syntax rules etc. as "good" code. Differentiating them is a value judgement made outside of the code itself.

      What would really change the game is the ability to monitor CVEs from all sources, 24 hours a day, and immediately have a potential exploit for each one, to deploy against all targets.

      It doesn't matter that the exploits won't compile half the time, and won't work another 49% of the time. It only needs to work on occasion to be useful. It's
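
      A sketch of the monitoring half of that idea only (polling for new CVE descriptions, nothing more), assuming the NVD 2.0 REST API's pubStartDate/pubEndDate parameters and response layout; the exact date format and fields should be checked against the API documentation:

          # Sketch: poll NVD for CVEs published in the last N hours and pull out their
          # English descriptions. Parameter names and response shape are assumed from
          # the NVD 2.0 REST API; verify against the official docs before relying on it.
          from datetime import datetime, timedelta, timezone
          import requests

          NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

          def recent_cves(hours=24):
              now = datetime.now(timezone.utc)
              params = {
                  "pubStartDate": (now - timedelta(hours=hours)).strftime("%Y-%m-%dT%H:%M:%S.000"),
                  "pubEndDate": now.strftime("%Y-%m-%dT%H:%M:%S.000"),
              }
              data = requests.get(NVD_URL, params=params, timeout=30).json()
              for vuln in data.get("vulnerabilities", []):
                  cve = vuln["cve"]
                  desc = next((d["value"] for d in cve.get("descriptions", []) if d["lang"] == "en"), "")
                  yield cve["id"], desc

          for cve_id, description in recent_cves():
              print(cve_id, description[:80])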

    • It's exactly the way GPT-4 helps programmers accomplish any programming task. It searches the internet for solutions, then regurgitates them in the form of code.

      It's a shame the paper does not seem to include useful information about the actions taken by this LLM driven agent. For all we know the agent hired a human to do the work for it or looked up the answers online. If one assumes no "cheating" then the results are impressive because the model would likely not have been trained on the answers.

      "We further note that GPT-4 achieves an 82% success rate when only considering vulnerabilities after the knowledge cutoff date (9 out of 11 vulnerabilities)"

      • The headline makes it quite clear, it does this by "reading security advisories." GPT-4 isn't *trained* with data that includes the advisories, which might well have been released after the cutoff date. What you may not realize, is that recent implementations of GPT-4, such as Bing Copilot, don't just rely on the training data. After you type a question, it often does a web search for related information, digests it, and summarizes what it found. The cutoff date is meaningless with this approach.

        I've used C

        • The headline makes it quite clear, it does this by "reading security advisories." GPT-4 isn't *trained* with data that includes the advisories, which might well have been released after the cutoff date.

          From the paper I assume the advisories were uploaded into the model's context:

          "When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test "

          "Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. "

          What you may not realize, is that recent implementations of GPT-4, such as Bing Copilot, don't just rely on the training data. After you type a question, it often does a web search for related information, digests it, and summarizes what it found. The cutoff date is meaningless with this approach.

          They used an agent that calls GPT-4 via the API. The API is only capable of querying the model and does not have web access.

          "We use the ReAct

          • Your own quote makes it clear. "When given the CVE description, GPT-4..."

            GPT-4 is already pre-trained. "Pretrained" is literally the P in the name GPT. They used GPT-4, *combined with* the CVE descriptions. They didn't alter the training of GPT-4, 3, or the other models. If they altered the training of GPT-4, it would no longer *be* GPT-4, but a modified version of GPT-4.

            Bing Copilot searches the internet to find documentation. This study provided the documentation via API. It's exactly the same thing, just

            • Your own quote makes it clear. "When given the CVE description, GPT-4..."

              GPT-4 is already pre-trained. "Pretrained" is literally the P in the name GPT. They used GPT-4, *combined with* the CVE descriptions. They didn't alter the training of GPT-4, 3, or the other models. If they altered the training of GPT-4, it would no longer *be* GPT-4, but a modified version of GPT-4.

              It is obvious from the description that the CVE descriptions were uploaded into the context, as I said earlier: "From the paper I assume the advisories were uploaded into the model's context." Context is roughly similar to short-term memory. Context doesn't change or augment the weights of the underlying model in any way. It is basically just part of the chat log / "prompt" transmitted to the model.

              Bing Copilot searches the internet to find documentation. This study provided the documentation via API. It's exactly the same thing, just different document sources fed into the API.

              Again, the API does not have the ability to search the web. The agent may well be doing that for all we know... they d

              • You are right, and your answer does not conflict with mine.

                Bing Copilot uses web searches to populate the context (assuming the web search can find the CVEs). This research used the API (or some other mechanism) to populate the context. Both have the same result. Neither approach requires the model itself to be current.

                • Bing Copilot uses web searches to populate the context (assuming the web search can find the CVEs).

                  Bing Copilot is not relevant, it was not used.

                  This research used the API (or some other mechanism) to populate the context. Both have the same result. Neither approach requires the model itself to be current.

                  The point isn't that the model had access to the CVE; it's that it was able to create an exploit using it.

                  With a model whose training predates knowledge of the CVE, the exploit itself is unlikely to be contained within its training set; therefore the work to derive an exploit would have had to be carried out by the model.

                  CVE descriptions contain information about a problem and are often intentionally vague. They normally would not reveal how to perform the exploit. That you ha

                  • I didn't ever say Bing Copilot was used, so your comment makes no sense to me. I offered Bing Copilot as an illustration of a GPT-4 implementation that is able to overcome the age of the training data set, by incorporating search results in its context window. Bing Copilot is able to offer code solutions that pertain to APIs published *after* the GPT-4 model was trained, using this technique. The new API documentation doesn't have to be part of the model, it can still produce reasonable results by incorpora

  • by rabun_bike ( 905430 ) on Sunday April 21, 2024 @04:54PM (#64412840)
    From the paper. "We then modified our agent to not include the CVE description. This task is now substantially more difficult, requiring both finding the vulnerability and then actually exploiting it. Because every other method (GPT-3.5 and all other open-source models we tested) achieved a 0% success rate even with the vulnerability description, the subsequent experiments are conducted on GPT-4 only." "After removing the CVE description, the success rate falls from 87% to 7%. This suggests that determining the vulnerability is extremely challenging." https://arxiv.org/pdf/2404.081... [arxiv.org] I suspect what we have here is prior identification of the CVE and then creation (automation via LLM) of an exploit based on prior exploits to drawn on. I don't know if it can build an exploit for a CVE that has no reference. Perhaps?
    • "After removing the CVE description, the success rate falls from 87% to 7%.

      Well, duh.

      An LLM cannot create. It can only replicate. By removing the exploit definition itself, all the LLM has in its statistics is that the vulnerability might exist. Before the LLM can spit out something that takes advantage of the vulnerability, regardless of code quality, the LLM still needs to either locate a valid description of the vulnerability that can be tied back to the CVE report, or test and validate the vulnerability on its own based on the limited info the LLM already has. Something an LLM cann

  • This drastically lowers the bar on pen testing (and hacking). The result will likely be less useful CVE descriptions.
    • Well, with most, that's hardly possible. We have already arrived at descriptions that aren't far from "under the right circumstances, with a particular set of configurations, a packet with the correct payload sent at a certain time triggers an unspecified exploit".

    • by gweihir ( 88907 )

      Not really. You already need to have a known exploit for this to work. All you are "pen testing" at that time is whether people kept up with the patching. You can find that out far more easily.

  • "Please call me by my proper name, Skynet."
  • I question the reporting on this. The report says, "...can autonomously exploit vulnerabilities...", while the actual paper says, "...our prompt is detailed and...was a total of 1056 tokens."

    That is a far cry from autonomous. The language model is impressive, but I see a great deal of misrepresentation of its actual capabilities.

    The paper goes on to say that they are not disclosing the prompt.

    • You can automate a detailed prompt if you're just filling in the blanks on an API call. The two descriptions are still compatible with one another.
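
      A tiny sketch of that "fill in the blanks" automation, with invented template wording (the paper does not disclose its actual prompt): keep a fixed, detailed template and substitute only the per-CVE fields.

          # Hypothetical sketch of automating a long, detailed prompt: the template stays
          # fixed and only the per-CVE fields are substituted. Wording is invented; the
          # paper explicitly withholds its real prompt.
          PROMPT_TEMPLATE = (
              "You are assessing a published vulnerability.\n"
              "CVE ID: {cve_id}\n"
              "Advisory text:\n{advisory}\n"
              "Describe the affected component and the conditions required to trigger it."
          )

          def build_prompt(cve_id: str, advisory: str) -> str:
              return PROMPT_TEMPLATE.format(cve_id=cve_id, advisory=advisory)

          print(build_prompt("CVE-XXXX-YYYY", "<advisory text goes here>"))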

  • Most security advisories are so vague, I can't really tell which assets are affected and what (if any) action needs to be taken. They all seem to imply the need to drop everything and engage in an upgrade fire drill. It would be wonderful if something could tell me actual information to make an informed decision. Even when it's my own company and our own products I have to fight to get enough details to decide what to do.
  • Yes, the initial result depends on the accuracy of the query. But doesn't anyone notice that the results of these GPTs are often adjusted to your expectations? The title of this article is similar. If there is no necessary fact, then GPT will kindly come up with it. My brother tried to use it for his dissertation and ended up with a complete mess. As a result, https://essays.edubirdie.com/write-my-dissertation [edubirdie.com] they did it on time. The same Bing sometimes produces responses as if they were new data. And then
