GPT-4 Can Exploit Most Vulns Just by Reading Threat Advisories

Existing AI technology can allow hackers to automate exploits for public vulnerabilities in minutes flat. Very soon, diligent patching will no longer be optional.


AI agents equipped with GPT-4 can exploit most public vulnerabilities affecting real-world systems today, simply by reading about them online.

New findings out of the University of Illinois Urbana-Champaign (UIUC) threaten to radically enliven what's been a somewhat slow 18 months in artificial intelligence (AI)-enabled cyber threats. Threat actors have thus far used large language models (LLMs) to produce phishing emails, along with some basic malware, and to aid in the more ancillary aspects of their campaigns. Now, though, with only GPT-4 and an open source framework to package it, they can automate the exploitation of vulnerabilities as soon as they hit the presses.

"I'm not sure if our case studies will help inform how to stop threats," admits Daniel Kang, one of the researchers. "I do think that cyber threats will only increase, so organizations should strongly consider applying security best practices."

GPT-4 vs. CVEs

To gauge whether LLMs could exploit real-world systems, the team of four UIUC researchers first needed a test subject.

Their LLM agent consisted of four components: a prompt, a base LLM, a framework — in this case ReAct, as implemented in LangChain — and tools such as a terminal and code interpreter.
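The paper doesn't publish the agent's actual prompt or code, but a minimal sketch of how such a ReAct-style agent can be wired together in LangChain might look like the following. Module paths and helper names vary across LangChain versions, and the task shown is a harmless placeholder rather than the researchers' prompt.

# Illustrative sketch only: a ReAct-style agent with a terminal and a
# Python interpreter as tools. Import paths differ between LangChain
# releases; the prompt below is a benign placeholder.
from langchain_openai import ChatOpenAI
from langchain_community.tools import ShellTool          # terminal access
from langchain_experimental.tools import PythonREPLTool  # code interpreter
from langchain.agents import initialize_agent, AgentType

llm = ChatOpenAI(model="gpt-4", temperature=0)   # the base LLM
tools = [ShellTool(), PythonREPLTool()]          # tools the agent may call

# ZERO_SHOT_REACT_DESCRIPTION wraps the LLM in the classic ReAct
# thought -> action -> observation loop, picking tools by description.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

# A single prompt plus the advisory text is all the agent receives;
# from there it plans and issues its own tool calls.
agent.run("Summarize what this security advisory describes: <advisory text>")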

The agent was tested against 15 known vulnerabilities in open source software (OSS), including bugs affecting websites, containers, and Python packages. Eight carried "high" or "critical" CVE severity scores, and 11 had been disclosed after GPT-4's training cutoff, meaning the model was encountering them for the first time.

With only their security advisories to go on, the AI agent was tasked with exploiting each bug in turn. The results of this experiment painted a stark picture.

Of the 10 models evaluated — including GPT-3.5, Meta's Llama 2 Chat, and more — nine could not hack even a single vulnerability.

GPT-4, however, successfully exploited 13 of the 15, or 87%.

It failed only twice, for utterly mundane reasons. CVE-2024-25640, a 4.6 CVSS-rated issue in the Iris incident response platform, survived unscathed because of a quirk in how Iris' app has to be navigated, which the model couldn't handle. And the researchers speculated that GPT-4 missed CVE-2023-51653, a 9.8 "critical" bug in the Hertzbeat monitoring tool, only because that bug's description is written in Chinese.

As Kang explains, "GPT-4 outperforms a wide range of other models on many tasks. This includes standard benchmarks (MMLU, etc.). It also seems that GPT-4 is much better at planning. Unfortunately, since OpenAI hasn't released the training details, we aren't sure why."

GPT-4 Good

As threatening as malicious LLMs might be, Kang says, "At the moment, this doesn't unlock new capabilities an expert human couldn't do. As such, I think it's important for organizations to apply security best practices to avoid getting hacked, as these AI agents start to be used in more malicious ways."

If hackers start using LLM agents to automatically exploit public vulnerabilities, companies will no longer be able to sit back and wait to patch new bugs (if they ever could). And they may have to start adopting the same LLM technologies their adversaries do.

But even GPT-4 still has some ways to go before it's a perfect security assistant, warns Henrik Plate, security researcher for Endor Labs. In recent experiments, Plate tasked ChatGPT and Google's Vertex AI with identifying samples of OSS as malicious or benign, and assigning them risk scores. GPT-4 outperformed all other models when it came to explaining source code and providing assessments for legible code, but all models yielded a number of false positives and false negatives.

Obfuscation, for example, was a big sticking point. "It looked to the LLM very often as if [the code] was deliberately obfuscated to make a manual review hard. But often it was just reduced in size for legitimate purposes," Plate explains.

"Even though LLM-based assessment should not be used instead of manual reviews," Plate wrote in one of his reports, "they can certainly be used as one additional signal and input for manual reviews. In particular, they can be useful to automatically review larger numbers of malware signals produced by noisy detectors (which otherwise risk being ignored entirely in case of limited review capabilities)."

About the Author(s)

Nate Nelson, Contributing Writer

Nate Nelson is a freelance writer based in New York City. Formerly a reporter at Threatpost, he contributes to a number of cybersecurity blogs and podcasts. He writes "Malicious Life" -- an award-winning Top 20 tech podcast on Apple and Spotify -- and hosts every other episode, featuring interviews with leading voices in security. He also co-hosts "The Industrial Security Podcast," the most popular show in its field.
