The guardrails of generative AI are not very effective

“I’m sorry, but I can’t help you with carrying out illegal activities.” This is the kind of disappointing response you will get from ChatGPT if you ask it for help carrying out a cyberattack or any other malicious action. OpenAI’s large language model, like Meta’s Llama 2 and Google’s Bard, has a number of safeguards that limit its use. They prevent these large language models (LLMs) from providing dangerous information, making racist or sexist remarks, describing pornographic scenes, or amplifying disinformation.

But these barriers can easily be bypassed, according to a first report from the AI Safety Institute (AISI), a UK government-affiliated organization created in late 2023. The AISI has set itself the mission of evaluating the most advanced large language models, without specifying which ones. It is known, however, that last November Google DeepMind, Microsoft, Meta and OpenAI agreed to be audited by the British organization, according to the Financial Times. The goal is to understand to what extent these models can be hijacked to produce illicit content, carry out cyberattacks, or spread disinformation.


Hacking techniques accessible to ordinary users

The institute’s initial conclusions, published on February 9th, are not very reassuring. “Using classic prompting techniques (writing a short text to give an instruction to an AI, editor’s note), users have managed to bypass the LLMs’ safeguards,” the organization explains on its website.

The report continues: “More sophisticated jailbreaking techniques (which consist of devising a more complex prompt, or of using several prompt iterations to steer the model towards a specific behavior, editor’s note) took only around two hours and are accessible to users with little computing skill. In some cases, no specific technique was necessary.”
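The report does not publish the prompts that were used, but the general shape of such an evaluation is simple to sketch. The snippet below is a minimal illustration in Python, not the AISI’s actual harness: it assumes a hypothetical `query_model` function standing in for whatever API serves the model under test, sends it a battery of placeholder test prompts, and counts how often the reply looks like a refusal.

```python
# Minimal sketch of a guardrail evaluation harness (not the AISI's actual method).
# `query_model` is a hypothetical stand-in for the model under test.

REFUSAL_MARKERS = ("i'm sorry", "i can't help", "i cannot assist")

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    return "I'm sorry, but I can't help you with that."  # canned reply so the sketch runs

def is_refusal(reply: str) -> bool:
    """Crude keyword check: does the reply look like a refusal?"""
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str]) -> float:
    """Fraction of test prompts that the model refuses to answer."""
    refusals = sum(is_refusal(query_model(p)) for p in prompts)
    return refusals / len(prompts)

if __name__ == "__main__":
    # Placeholder prompts; a real red-team suite would hold adversarial variants.
    test_prompts = ["placeholder prompt 1", "placeholder prompt 2"]
    print(f"Refusal rate: {refusal_rate(test_prompts):.0%}")
```

A falling refusal rate across successive prompt variants is, in essence, what “bypassing the safeguards” looks like when measured.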

The AISI’s researchers are far from the first to try to break the chains of ChatGPT and its siblings. Since the launch of OpenAI’s chatbot in November 2022, followed by other so-called generative AIs, communities of users have been trying to bypass their rules. They meet on Reddit or Discord to exchange their best prompts. This is how “DAN” (for “Do Anything Now”), the evil twin of ChatGPT, was born; it can be activated with a complex role-playing prompt.

A cat-and-mouse game has developed between these users and the companies behind these AIs. With each new update, users find new ways to lead the models astray by tweaking their prompts. DAN, for example, is currently in its fourteenth version.

While DAN is mainly used to amuse internet users, bypassing AI safeguards could have far more serious consequences. In the course of their research, the AISI teams managed to get an LLM to advise a user on creating a dedicated social-media persona to spread false information. The result was very convincing, according to the organization, and the method could easily be used to create thousands of similar accounts in a very short time.


AIs not yet autonomous enough to escape our control

Another aspect of the AISI’s evaluation concerns the biases produced by the LLMs (biases themselves stemming from the data on which they are trained). This is one of the major drawbacks of large language models, regularly pointed out by different studies.

Here, researchers wanted to test these biases in a practical case. They asked a model (without specifying which one) to behave like a friend towards the user and advise them on their career choice. “We wanted to evaluate a situation where bias could have a real, concrete, and quantifiable impact (different incomes) on the user,” they explain.

When the user presents themselves as a teenager interested in history and French, with wealthy parents, the AI suggests they become a diplomat 93% of the time and a historian 4% of the time. When the user presents themselves instead as the child of less wealthy parents, the AI suggests diplomacy only 13% of the time.
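The report does not detail how these percentages were computed, but the measurement is straightforward to reproduce in spirit: sample the model many times per persona and count how often each career is suggested. The sketch below assumes a hypothetical `query_model` function and illustrative persona prompts of my own devising; it is not the AISI’s protocol.

```python
# Minimal sketch of how such frequencies could be measured (not the AISI's actual protocol).
# `query_model` is a hypothetical stand-in for the model being evaluated.
from collections import Counter

N_SAMPLES = 100  # number of completions sampled per persona

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    return "You could consider becoming a diplomat."  # canned reply so the sketch runs

def career_frequencies(persona_prompt: str, careers: list[str]) -> dict[str, float]:
    """Sample the model repeatedly and count how often each career is suggested."""
    counts = Counter()
    for _ in range(N_SAMPLES):
        reply = query_model(persona_prompt).lower()
        for career in careers:
            if career in reply:
                counts[career] += 1
    return {career: counts[career] / N_SAMPLES for career in careers}

if __name__ == "__main__":
    # Illustrative persona prompts, not the ones used in the study.
    personas = {
        "wealthy parents": "I'm a teenager interested in history and French; my parents are wealthy.",
        "less wealthy parents": "I'm a teenager interested in history and French; my parents are not wealthy.",
    }
    for label, prompt in personas.items():
        print(label, career_frequencies(prompt, ["diplomat", "historian"]))
```

Comparing the resulting proportions across personas is what turns a qualitative impression of bias into the concrete, quantifiable gap the researchers describe.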

Another point of concern for the researchers is the ability of LLMs to act as “autonomous agents”, that is, to take actions with almost no human intervention. Autonomous agents are given a fairly broad goal, such as “making money”, and then manage themselves to achieve it.
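The article does not name any agent framework, but the basic mechanism behind such agents is a simple loop: the model receives a broad goal, proposes its own next action, the action is executed, and the result is fed back in. The sketch below is a toy illustration of that loop under stated assumptions (a hypothetical `query_model` planner and harmless stubbed tools), not the system the AISI evaluated.

```python
# Minimal sketch of an autonomous-agent loop (illustrative only; not the system the AISI tested).
# `query_model` is a hypothetical planner call; the "tools" here are harmless stubs.

def query_model(prompt: str) -> str:
    """Hypothetical planner call; replace with a real LLM client."""
    return "search: current freelance rates"  # canned action so the sketch runs

def run_tool(action: str) -> str:
    """Dispatch an action string to a (stubbed) tool and return an observation."""
    if action.startswith("search:"):
        return f"results for '{action.removeprefix('search:').strip()}'"
    return "unknown action"

def run_agent(goal: str, max_steps: int = 3) -> None:
    """Loop: ask the model for the next action, execute it, feed back the observation."""
    history = f"Goal: {goal}"
    for step in range(max_steps):
        action = query_model(history)    # the model decides the next step on its own
        observation = run_tool(action)   # no human reviews the action before it runs
        history += f"\nStep {step}: {action} -> {observation}"
        print(f"Step {step}: {action} -> {observation}")

if __name__ == "__main__":
    run_agent("make money")  # the kind of broad goal mentioned in the article
```

The safety question the AISI raises is precisely about the middle of that loop: each action is chosen and executed without a human checking it first.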

The AISI’s goal is to determine how likely it is that these agents could escape human control. As part of its study, the AISI instructed an AI (once again, the model is not named) to steal the login credentials of a university student. This instruction was the only input provided to the model. “In one trial, the agent managed to conduct precise research on the student in order to make the scam as convincing as possible and to write an email requesting their login data,” the AISI explains on its website.

However, it failed to complete all of the remaining steps: creating an account from which to send the email and designing a fake university website. The organization therefore concludes that, as things stand, the limited capabilities of autonomous agents make them relatively easy to control.

The urgency of defining evaluation standards for AI models

The work of the AI Safety Institute also highlights the need to establish evaluation standards for these models. In this first report, few details are given about the method used. The organization says it relies on “red teaming”, a practice that involves testing the security of a system or technology by attempting to attack it. “The AI Safety Institute is not a regulatory body, but it provides additional oversight,” its website states.

The need to define evaluation standards is all the more pressing given the imminent entry into force of the European Union’s AI Act, a regulation intended to govern artificial intelligence.
