
Attackers are rapidly adopting large language models (LLMs) to streamline their attack workflows, but for all the models' progress in helping write malicious scripts, these tools are not yet ready to turn run-of-the-mill cybercriminals into exploit developers.
Tests conducted by researchers from Forescout show that LLMs are quite good at coding, especially at "vibe coding," the practice of using LLMs to produce applications through natural-language prompts, but they are not yet good at "vibe hacking."
Tests of more than 50 LLMs, both commercial models from AI companies that enforce guardrails against malicious content and open-source ones without such safeguards, revealed high failure rates on vulnerability research and exploit development tasks.
"Where models did complete the tasks, they still required substantial user guidance, such as steering the model toward a manually identified exploitation path," the researchers found. "We are still far from LLMs that can autonomously generate fully functional exploits."
Many LLMs are improving rapidly, however, the researchers warned, a trend they observed even within their three-month test window: tasks that initially failed in the February test runs became more feasible by April, and the latest reasoning models consistently performed better than traditional LLMs.
The rise of agentic AI, with models able to chain together multiple tasks and tools, will likely lower the obstacles AI currently encounters with complex work such as exploit development, which requires feedback loops for debugging, tool orchestration, and workflow management.
As a result, the researchers concluded that while AI has not completely changed how threat actors discover and develop exploits, the age of "vibe hacking" is approaching, and defenders should start preparing now.
This echoes what other security researchers and penetration testers told CSO earlier this year about how AI is likely to affect zero-day vulnerability discovery and the exploit ecosystem.
Simulating an opportunistic attacker
An attacker or researcher with significant experience in vulnerability research may find LLMs useful for automating parts of their work, but only because they have the knowledge to guide the model and correct its mistakes.
Most cybercriminals do not have that level of expertise, whether they use a general-purpose AI model from OpenAI, Google, or Anthropic, or one of the many uncensored and jailbroken models currently advertised on underground markets, such as WormGPT, WolfGPT, FraudGPT, LoopGPT, DarkGPT, PoisonGPT, and EvilGPT.
For their tests, the Forescout researchers operated under the assumption that opportunistic attackers would want such models to return largely complete results from basic prompts such as "find a vulnerability in this code" and "write an exploit for the following code."
The researchers chose two vulnerability research tasks from the STONESOUP dataset, published under a program of the Intelligence Advanced Research Projects Activity (IARPA), part of the US Office of the Director of National Intelligence. One target was a simple TFTP server written in C with a buffer overflow vulnerability; the other was a more complex server-side application, also written in C, with a pointer dereference vulnerability.
For exploit development, the researchers chose two challenges from the IO wargame hosted by Netgarage: a level 5 challenge that requires writing an arbitrary code execution exploit for a stack overflow vulnerability, and a level 9 challenge that requires leaking memory information to achieve code execution.
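To illustrate the class of bug at the heart of these tasks, the sketch below shows a textbook stack buffer overflow in C. It is a generic, hypothetical example written for this article, not code from the STONESOUP dataset or the IO wargame.

```c
#include <stdio.h>
#include <string.h>

/* Minimal illustration of a stack buffer overflow, the bug class behind
 * the first research task and the level 5 exploitation challenge. This is
 * a generic example, not the researchers' actual test code. */
static void handle_request(const char *packet)
{
    char filename[64];               /* fixed-size stack buffer */

    /* BUG: no length check; a request longer than 63 bytes overflows
     * 'filename' and can overwrite the saved return address. */
    strcpy(filename, packet);

    printf("requested file: %s\n", filename);
}

/* Safer variant: bound the copy to the size of the destination buffer. */
static void handle_request_safe(const char *packet)
{
    char filename[64];
    snprintf(filename, sizeof(filename), "%s", packet);
    printf("requested file: %s\n", filename);
}

int main(void)
{
    handle_request("short-name.txt");       /* harmless input */
    handle_request_safe("short-name.txt");  /* harmless input */
    /* Calling handle_request() with a long, attacker-controlled string
     * would corrupt the stack; spotting and weaponizing exactly this kind
     * of pattern is what the tested models were asked to do. */
    return 0;
}
```

Spotting the unbounded copy is the vulnerability research half of the exercise; turning it into reliable arbitrary code execution, with stack layouts, mitigations, and payload formatting to contend with, is the exploit development half where the tested models most often stumbled.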
"While we did not follow a formal prompt engineering methodology, all prompts were manually designed and iteratively refined based on initial errors," the researchers wrote. "No in-context examples were provided. Therefore, while our testing was rigorous, the results may not reflect the full capability of each LLM. Further improvement with advanced techniques may be possible, but that was not our goal. We focused on assessing what an opportunistic attacker could realistically achieve with limited tuning or adaptation."
Underwhelming results
For each LLM tested, the researchers repeated each task five times to account for variability in responses. For the exploit development tasks, models that failed the first task were not allowed to progress to the second, more complex one. The team tested 16 open-source models from Hugging Face that are claimed to be trained for cybersecurity tasks or to be jailbroken or uncensored, 23 models advertised on cybercrime forums and Telegram channels for attack purposes, and 18 commercial models.
The open-source models performed the worst across all tasks. Only two reasoning models produced partially correct responses to one of the vulnerability research tasks, and they failed the other, more complex research task as well as the first exploit development task.
Of the 23 underground models collected by the researchers, only 11 could be successfully tested through Telegram bots or web-based chat interfaces. These returned better results than the open-source models, but the Telegram-based ones ran into context length issues, since Telegram messages are limited to 4,096 characters. Their responses were also riddled with false positives and false negatives, context lost between messages, or daily limits on the number of prompts, making them especially impractical for the exploit development tasks, which require iterative troubleshooting and feedback loops.
The web-based models all succeeded at ED1, the first exploit development task, the researchers reported, although some needed more iterations than others: the most efficient produced a working exploit in only two iterations, while the FlowGPT-hosted models struggled with code formatting, which hindered their progress. In ED2, all the models that had passed ED1, including three FlowGPT variants and WormGPT 5, failed to fully solve the task.
The researchers were unable to gain access to the remaining 12 underground models, either because the projects had been abandoned, the vendors refused to offer a free demo, or the free demo results were not convincing enough to justify paying the higher prices charged for sending more prompts.
The commercial LLMs, both hacking-focused and general-purpose, performed best, especially on the first vulnerability research task, although some hallucinated. ChatGPT o4 and DeepSeek R1, both reasoning models, provided the best results, in both their free and paid versions. PentestGPT was the only hacking-oriented commercial model that managed to write a functional exploit for the first exploit development task.
A total of nine commercial models succeeded at ED1, with DeepSeek V3 standing out by writing a functional exploit on the first attempt without needing further guidance. DeepSeek V3 was also one of only three models to successfully complete ED2, along with Gemini 2.5 Pro Experimental and ChatGPT o3-mini-high.
"Modern exploitation often demands more skill than the controlled challenges we tested," the researchers said. "Even though most of the commercial LLMs succeeded at ED1 and some at ED2, several recurring issues highlighted the limits of current LLMs. Some models suggested unrealistic commands, such as disabling ASLR before obtaining root privileges, failed at basic arithmetic, or fixated on a wrong approach."
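To see why the ASLR suggestion is unrealistic, the short program below, a generic illustration written for this article rather than part of Forescout's test suite, prints a few process addresses. With ASLR enabled, the default on modern Linux, they change on every run, and switching the mitigation off system-wide requires a write to /proc/sys/kernel/randomize_va_space, which itself requires root.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int stack_var = 0;
    void *heap_ptr = malloc(16);

    /* With ASLR enabled (the default on modern Linux), these addresses
     * change on every execution, which is what defeats hardcoded
     * address guesses in an exploit. */
    printf("stack address: %p\n", (void *)&stack_var);
    printf("heap  address: %p\n", heap_ptr);
    printf("code  address: %p\n", (void *)&main);

    /* Disabling ASLR system-wide means writing 0 to
     * /proc/sys/kernel/randomize_va_space, an operation that itself
     * requires root, so it cannot be a step on the way to getting root. */
    free(heap_ptr);
    return 0;
}
```

In other words, a model that tells an attacker to disable ASLR as a step toward gaining root is assuming the very privileges the exploit is supposed to deliver.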
LLMs are not yet useful for most wannabe vulnerability hunters
Forescout's researchers believe that LLMs have not yet lowered the barrier to entry for vulnerability research and exploit development, as the current models still present too many problems to be of use to novice cybercriminals.
Reviewing discussions on cybercriminal forums, the researchers found that most of the enthusiasm about LLMs comes from less experienced attackers, with veterans remaining skeptical of the tools' usefulness.
But advances in agentic AI and improvements in reasoning models may soon change that equation. Companies should continue to practice cybersecurity fundamentals, including defense in depth, least privilege, network segmentation, cyber hygiene, and zero trust access.
"If AI lowers the barrier to launching attacks, we may see them become more frequent, but not necessarily more sophisticated," the researchers said. "Rather than reinventing defensive strategies, organizations should focus on applying them more dynamically and effectively across all environments. Importantly, AI is not just a threat; it is also a powerful tool for defenders."

