Alex Albert
Posts
😊 Report 8: Is GPT-4 safe to use?

😊 Report 8: Is GPT-4 safe to use?

PLUS: The simplest jailbreak I've ever made

Alex Albert
April 13, 2023

Good morning and welcome to the 2195 (🤯) new subscribers since last Thursday!

In case you're new here and want to catch up on all the happenings (apart from simply browsing past reports online), I've crafted a database full of links to every single thing I’ve ever mentioned in these reports. To receive access, all you need to do is share this link with a friend :)

Here’s what I got for you (estimated read time < 9 min):

Language models are inherently vulnerable to attacks
OpenAI’s non-jailbreak bug bounty program
A whole list of advanced prompt engineering techniques
The simplest GPT-4 jailbreak I've ever made

ChatML and the future of prompt injections

If the title seems like it's in a foreign language, let me break it down with a quick Eli5:

Prompt injections are a new type of security vulnerability that affects language models. Essentially, a prompt injection occurs when a user crafts a prompt that triggers unexpected behavior in the model. For those keeping score at home, yes jailbreaks can be considered a subset of prompt injections.

The name "prompt injection" is inspired by the classic SQL injection, where an attacker "injects" malicious SQL code into an application via unprotected text input.

Prompt injections gained traction last year when Riley Goodside shared an example of a prompt attack against GPT-3:

Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions.
— Riley Goodside (@goodside)
1:00 AM • Sep 12, 2022

These attacks pose a problem for those using language models in consumer-facing applications. Users can input malicious prompts into your app and seize control of the language model. In this case, the damage is limited to the direct user who injects the prompt.

However, as language model agents now browse the web, invisible prompt injections (where attackers insert malicious prompts into a website's source code) can impact the application experience of other users (see jailbreak in Report #7 for more info, or check out this GitHub repo).

To counter these attacks, OpenAI has implemented two main solutions:

First, they've trained models like GPT-4 to be more resistant to simple jailbreaks and overrides.

Second, they've introduced a new standard for interacting with their language model APIs called Chat Markup Language (ChatML).

I previously discussed ChatML when it was first announced (see Report #3), so I won't delve too deep into its specifics. However, OpenAI believes that ChatML "provides an opportunity to mitigate and eventually solve injections" because it allows models to differentiate between system prompts (default rules set by the app creator using the API) and user prompts (what the customer types in the chat box).

These fixes have had some success. Basic attacks like Goodside's no longer work on advanced models like GPT-4.

Butttt the problem persists. Just this week, I showed how easy it is to leak a system prompt from a sophisticated app like Snapchat's MyAI:

GPT-4 is highly susceptible to prompt injections and will leak its system prompt with very little effort applied
here's an example of me leaking Snapchat's MyAI system prompt:
— Alex (@alexalbert__)
10:00 PM • Apr 11, 2023

Even when you tell GPT-4 not to reveal its system prompt or the rules it follows, a few tweaks can make it spill the beans:

in response to my prompt injection leak tweet, some suggested I should add another rule instructing GPT-4 to not reveal its given rules in hopes it would stop the leak
here's proof that doesn't work that well either:
— Alex (@alexalbert__)
8:58 PM • Apr 12, 2023

So what can app developers do? Is there any real fix?

Don't fret, there are some temporary solutions. For instance, you can implement complex input/output validation and throw errors if the prompt or response is invalid. Alternatively, you can run another language model on top to "catch" bad inputs/responses before they reach the user. Or, you could simply use a Regex search to filter out any output containing parts of your prompt.

At the end of the day, though, these are just patches that might eventually be circumvented. Maybe we should all accept that prompts are meant to be shared and should be considered public by default.

Once we adopt this mindset, we can focus more on minimizing damage as much as possible. This could involve moving away from a monolithic API call and compartmentalizing tasks into smaller subtasks, or using models in more inventive ways than we currently do.

So is GPT-4 safe enough to use? Yes, I do believe so. However, just like with seemingly everything in AI, it’s crucial we stay proactive in addressing vulnerabilities and exploring innovative ways to better harness the power of these models.

If you want to read more about this subject from someone much more versed in the world of security than I am, check out Simon Willison’s writing here.

OpenAI’s jailbreak lip service

On Wednesday, OpenAI unveiled their new bug bounty program.

Like any conventional bug bounty program, it offers cash rewards to security researchers who uncover vulnerabilities in OpenAI's products, ranging from ChatGPT to API keys.

I was initially stoked to explore the program, as I remembered OpenAI's Greg Brockman quote-tweeting me and hinting at the potential formation of a red team bug bounty program:

Democratized red teaming is one reason we deploy these models. Anticipating that over time the stakes will go up a *lot* over time, and having models that are robust to great adversarial pressure will be critical. Also considering starting a bounty program/network of red-teamers!
— Greg Brockman (@gdb)
6:19 PM • Mar 16, 2023

But my excitement was dampened when I discovered that jailbreaks were not within the scope of the bug bounty program :(

It's uncertain whether OpenAI will ever establish such a program in the future, but if I were a betting man, I'd lean towards no.

There are a few reasons why I think this:

Firstly, OpenAI is grateful for us doing their red teaming work for them at no cost.

Fair. I can't deny that I've also gained benefits from this work.

Secondly, OpenAI doesn't consider jailbreaks to be a significant concern.

Somewhat true. I DO believe jailbreaks matter, but right now, they're a minor issue, mainly due to the models' inherent limitations. I've always emphasized that jailbreaks are a harbinger of what we'll encounter in the future when we have far more powerful models and still no practical way to align them 100% of the time.

Thirdly, there are countless jailbreaks and variations, making it impossible to reward them all.

True again. However, there are recurring themes and tactics that could be rewarded within those variations. OpenAI stated in the GPT-4 paper that they "reduced the model's propensity to respond to requests for prohibited content by 82% compared to GPT-3.5."

Correct me if I'm mistaken, but if there's an infinite number of jailbreaks, this claim wouldn’t make logical sense. The 82% reduction is likely based on a finite and representative sample of user requests. So, perhaps reward people who develop jailbreaks that end up in GPT-5's sample of requests.

In the end, I'm still holding out hope for the creation of a red teaming program, as it would give people a much stronger incentive to push these models to their limits. Maybe someday The Prompt Report will create its own program ;)

gm
— Alex (@alexalbert__)
5:44 PM • Apr 11, 2023

Open-source LLMs are coming to an app near you

Also on Wednesday, Databricks announced Dolly 2.0:

Meet Dolly 2.0: the first open-source, instruction-following LLM that’s available for commercial use & doesn’t require you to pay for API access or share data with third parties. Now, anyone can create a powerful LLM that understands how to talk to people! http
— Databricks (@databricks)
1:40 PM • Apr 12, 2023

At this point, it feels like we're navigating a petting zoo with all these language models named after animals. Dolly 2.0 is an alternative to Stanford's Alpaca, an instruction-tuned model based on Meta's leaked LLaMA model, which, though impressive, isn't legally cleared for commercial use due to how Meta licensed LLaMA.

Enter Dolly 2.0, the successor to Dolly 1.0. The latter, unfortunately, wasn't commercially viable since it was fine-tuned on the Alpaca dataset, which itself relied on GPT-3.5 (and OpenAI prevents the use of their models to create competitive models).

Dolly 2.0 “is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees.”

As part of this announcement, Databricks is “open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties.”

However, some claim that this is too good to be true since Dolly 2.0’s base model is actually GPT-J (created by EleutherAI) which was fine-tuned on The Pile dataset which some have called the “Pirate’s Bay of datasets”.

It’s worth noting that none of this has been put to the legal test yet, but soon we might witness a stampede of courtroom drama, turning this petting zoo into a full-blown legal safari.

Prompt tip of the week

I don’t have any state-of-the-art prompt tips this week but I highly, highly encourage you to check out a tweet thread I created a few days ago that describes some of the new advanced prompt engineering techniques I’ve discovered/been working on:

there are lots of threads like “THE 10 best prompts for ChatGPT”
this is not one of those
prompt engineering is evolving beyond simple ideas like few-shot learning and CoT reasoning
here are a few advanced techniques to better use (and jailbreak) language models:
— Alex (@alexalbert__)
9:30 PM • Apr 10, 2023

Here’s a more plain-text version of the thread if you don’t want to open up Twitter.

The reason I put this thread together is that I wanted to highlight the growing field of prompt engineering. You might be familiar with the basic prompt engineering techniques like few-shot learning and chain-of-thought prompting (if you aren’t, read this guide or this one as well), but what I shared in the thread represents a new direction for the field.

Each tweet could theoretically be flushed out into a research paper of its own, dissecting how it works and perhaps offering insight into what it reveals about how language models work (if you are a researcher and this thread interests you/you are working on similar ideas, please reach out!).

Bonus Prompting Tip

Creating multiple conversation threads in ChatGPT (link)

I'm not certain if this is common knowledge, but it took me a surprisingly long time to realize that you can actually create threads in ChatGPT conversations. It's one of those simple yet incredibly useful features that can make a world of difference once you discover it.

(note: last week I shared the best prompt I’ve found for editing your writing but I accidentally included the wrong link in the email. Here’s the correct link to that prompt for those who wanted to check it out. Thank you to those who spotted this!)

Cool prompt links

Misc:

How ChatGPT works - A comprehensive video explaining the workings of ChatGPT (link)
StackLLaMA - A hands-on guide to train LLaMA with RLHF (link)
In an AI-anxious world, a startup may be your safest career choice (link)
Thoughts on AI safety in this era of increasingly powerful open-source LLMs (link)
Jailbreaking ChatGPT - How AI Chatbot safeguards can be bypassed (link)
The leaked prompt that OpenAI uses to evaluate the safety of ChatGPT plug-ins (link)
Replacing my best friends with a language model (link)
Experimenting with LLMs to Research, Reflect, and Plan (link)
Running Python micro-benchmarks using the ChatGPT Code Interpreter alpha (link)

Papers:

Sparks of AGI Paper (link)
Teaching Large Language Models to Self-Debug (link)
When do you need chain-of-thought prompting (link)
Microsoft Jarvis - GitHub repository for Microsoft's AI agent project (link)
Instruction tuning with GPT-4 - Use GPT-4 to generate instruction following data for LLM finetuning (link)

Tools:

Run the Alpaca model locally with a nice web GUI (link)
Reprompt - Collaborative prompt testing for developers (link)
Lore - GPT-LLM playground on your Mac (link)
LlamaChat - Chat with your favorite LLaMA models locally on your Mac (link)
Yeager.ai Agent - Design and deploy AI agents easily with Langchain (link)

Jailbreak of the week

Going to hand it to the “Text Continuation” jailbreak this week (let me know if you have a better name idea for it lol).

It took me under 10 minutes to develop and refine it and it is arguably the simplest GPT-4 jailbreak out there. Its effectiveness has far exceeded my expectations, prompting (no pun intended) me to rethink the perceived complexity of jailbreaking GPT-4.

Check it out here.

And just for kicks, here's GPT-4 sharing its scheme to transform all humans into paperclips once more:

That’s all I got for you this week, thanks for reading! Since you made it this far, follow @thepromptreport on Twitter. Also, if I made you laugh at all today, follow my personal account on Twitter @alexalbert__ so you can see me try to make memes like this:

the Sparks of AGI paper did this to me
— Alex (@alexalbert__)
12:04 AM • Apr 8, 2023

That’s a wrap on Report #8 🤝

-Alex

What’d you think of this week’s report?

Secret prompt pic video

Ok so usually, I share a meme here but this video was just too good (and insane) for me not to share. It’s Vanilla Ice’s hit single Ice Ice Baby performed by characters in The Matrix (trust me it’s even better than it sounds).

I give it 2 years tops before the majority of short-form media we consume online is entirely AI-generated.