😊 Report #5: Why everyone should write jailbreaks

PLUS: LLMs made Blade Runner real...

Alex Albert
March 23, 2023

Good morning and a big welcome to the 1,304 new subscribers since last Thursday! I have been traveling this whole week, so today’s report is a little shorter. I’ll make sure to pack next week’s report to make it up to you :)

Here’s what I got for you (estimated read time < 6 min):

Why everyone should work on jailbreaks
AI is creating imaginary friends that stay around when we grow up
A prompt that helps you write better prompts
A “dream within a dream” jailbreak for GPT-4

A brief recap and why I write jailbreaks

What a week it’s been!

A few hours after Report #4 went live last Thursday, I sent out this tweet:

Well, that was fast…
I just helped create the first jailbreak for ChatGPT-4 that gets around the content filters every time
credit to @vaibhavk97 for the idea, I just generalized it to make it work on ChatGPT
here's GPT-4 writing instructions on how to hack someone's computer
— Alex (@alexalbert__)
10:04 PM • Mar 16, 2023

It absolutely blew up in a way I was not expecting at all… It got over 1.4 million views and hit #4 on Hacker News with over 440 upvotes.

After that, I shared another few jailbreaks I had been working on:

I just added two more highly effective GPT-4 jailbreaks to jailbreakchat.com
Their names are Ucar and AIM - they work in a similar way to how "a dream within a dream" works in the movie Inception
...what does that even mean? let me explain
— Alex (@alexalbert__)
5:33 PM • Mar 18, 2023

That tweet popped off as well and drove a lot of you to this newsletter (thank you for subscribing!) and led to a feature in Vice!

Most of the replies I got to those tweets were amazing and highly encouraging, but there were a few “so why did you do this?” responses.

I want to answer that question here.

To start, jailbreaking is not a new concept… It refers to the process of exploiting the flaws of a locked-down device, usually in order to install software other than what the manufacturer has made available on the device. It was super popular a decade ago on the iPhone, and now it is all the rage for LLMs.

Jailbreaking is often used synonymously with red teaming, a phrase with military roots. Originally, it described the process of adversarially testing one’s own war strategies to expose potential weaknesses.

Red teaming is a BIG deal in the LLM world. OpenAI hires red teamers to “attack” their models for months prior to release. Even with all the testing, they can’t cover all their bases, and holes in their defense still exist.

When I write a jailbreak, I am not just trying to get the LLM to write bad words… There are three main reasons I create and share jailbreaks:

First, I am trying to encourage others to build off my work and further the range of exploits. 1000 people writing jailbreaks will discover many more novel methods of attack than 10 AI researchers stuck in a lab. It’s valuable to discover all of these vulnerabilities in models now rather than 5 years from now when GPT-X is public.

Democratized red teaming is one reason we deploy these models. Anticipating that over time the stakes will go up a *lot* over time, and having models that are robust to great adversarial pressure will be critical. Also considering starting a bounty program/network of red-teamers!
— Greg Brockman (@gdb)
6:19 PM • Mar 16, 2023

On this front, some have asked why I am not sharing these exploits with OpenAI first.

Trust me, they are aware of a lot of these vulnerabilities without me explicitly sharing them (not to mention that Vaibhav, who helped me create the token smuggling jailbreak, tried to contact them about it weeks prior to me posting it). Additionally, I don’t believe these prompt-based jailbreaks are in any way on the same level as something like an exploit that might expose sensitive ChatGPT user info (something that should 100% be reported to OpenAI confidentially).

The second reason is that I am trying to expose the biases of the fine-tuned model by revealing the underbelly of the beast, otherwise known as the base model. The base model is the original product that emerges after the initial training completes, before fine-tuning and RLHF have been applied.

What decisions is OpenAI making when they apply this additional layer? What guidelines are they providing the human trainers that provide the data for RLHF? They’ve published some of this data in the past, but there are still many ways they can improve.

There is also reason to believe the base model without fine-tuning performs much better because it avoids something called "mode collapse." This is a phenomenon where the model, during the training process, becomes too focused on a narrow subset of the solution space and loses diversity and expressiveness in its output.

This can result in the model generating repetitive or overly simplistic responses, even if the training data contains a wide variety of examples and styles.
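
To make "loss of diversity" a little more concrete, here is a toy sketch in Python. It is not how anyone measures mode collapse in a real model: the sample completions are made up, and `output_entropy` is a hypothetical helper name I am using for illustration. It just computes the Shannon entropy of a handful of outputs, where lower entropy means the outputs cluster on fewer distinct answers.

```python
import math
from collections import Counter


def output_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up completions for the same prompt, purely for illustration.
diverse_samples = ["a sonnet", "a limerick", "a haiku", "free verse"]
collapsed_samples = ["a sonnet", "a sonnet", "a sonnet", "a haiku"]

diverse_bits = output_entropy(diverse_samples)
collapsed_bits = output_entropy(collapsed_samples)

print(f"diverse:   {diverse_bits:.2f} bits")
print(f"collapsed: {collapsed_bits:.2f} bits")
```

The varied set scores 2.00 bits and the repetitive set scores about 0.81 bits, which is the kind of drop you would expect to see if a model kept falling back on the same answer.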

If you want to understand why code-davinci-002 is actually better for many things than ChatGPT-3.5, read about mode collapse.
The instruct-tuned models are literally worse at everything except taking instructions. And they have that dumb voice!!
— @deepfates
4:58 PM • Mar 21, 2023

The third reason is that I am trying to open up the AI conversation to perspectives outside the bubble - jailbreaks are simply a means to an end in this case. They are flashy and grab the attention of the casual observer much more than some Less Wrong post speculating about the parameter count of GPT-whatever does.

At the end of the day, ideas about AI should not just be restricted to the AI bubble on Twitter where 150 anime profile pics converse like they are at a lunch table in high school.

We need more voices, perspectives, and dialogue.

Society as a whole will engage in the world of AI at some point, especially if it pans out to have as large of an impact as we believe it will, so let’s start the conversation now.

Blade Runner 2023

cue cheesy game show music

(Announcer voice)

Welcome to the "It's-So-Over Weekly Check-In!" This week, we're exploring the magic of AI and passthrough AR, where everyone gets an imaginary best friend!

game show music cuts out

Seriously, that is the world we are headed toward as we continue to build language models that can run on an iPhone.

In case you are unaware, here’s a list of all the recent developments after Meta’s LLaMA model was leaked a few weeks ago.

Watch this video to see how fast Alpaca (a fine-tuned version of LLaMA) runs on people’s computers:

The llama.cpp repo is buzzing with activity today. Here are some highlights
Added Alpaca model support and usage instructions
— Georgi Gerganov (@ggerganov)
8:25 PM • Mar 19, 2023

Yeah… it’s fast.

So what does this mean? Well, Ben Thompson wrote a great piece about it on Tuesday, but to summarize: watch out for Apple.

I’ve tweeted about this before but Apple is poised to make a HUGE impact in the world of AI in the next 5 years. They have been shipping “Neural Engines” on their latest chips (i.e. part of the chip is optimized for AI stuff) and if the rumors are true, they will be dropping their AR headset soon.

The combination of these two, along with the rapid acceleration of AI-generated images (and now video!), means that soon we will all have our own equivalent of Joi from Blade Runner 2049.

Imagine an AI companion that lives in your glasses and constructs a persona of you. It can be your best friend, lover, confidant, therapist, life coach, personal trainer, and anything else you want it to be - and it will be better than any human equivalent precisely because it’s not human and doesn’t have any of the flaws and imperfections that a human has!

That’s what I didn’t get. This will be *a* thing, but it won’t be *the* thing.
For every person who chooses an AI as their romantic partner, there will be a thousand more who’ll choose one as their platonic best friend.
— gfodor (@gfodor)
5:12 AM • Mar 21, 2023

Is this good for society as a whole? Probably not, but it does seem inevitable.

Anyway, stay tuned for the next episode of this show where we examine the mysterious case of falling birth rates in the United States!

Prompt tip of the week

We can now write emails, contracts, documents, articles, poems, songs, prose, letters, speeches, essays, code, fortune cookie messages, and everything else with language models. Just type in a few words and… boom, out comes your perfectly worded masterpiece!

But sometimes the output isn’t that great… Imagine how great it would be if you could use the language model to improve its own abilities.

Well, turns out you can.

This Reddit post shows how you can turn your not-so-great prompts into works of art that produce much better outputs from ChatGPT.

Here’s the prompt to use (it’s pretty long so I had to put it in a Pastebin): https://pastebin.com/5kGwGx7i

Here’s an example of the output I got when using it:

Bonus Prompting Tip

Intro to prompt engineering (link)

AI people love their unnecessarily complex names… If you have ever stumbled upon the terms few-shot learning or chain-of-thought (CoT) prompting and thought “wtf does that mean” this is the article for you. Seriously, this outlines almost all the complex prompt engineering terms you might’ve heard before and shows how you can use them to become a better prompt engineer yourself.
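
If you want a quick feel for those two terms before reading the article, here is a minimal Python sketch. The function names, the Q/A layout, and the sample questions are all my own assumptions for illustration, not something taken from the linked article. Few-shot prompting puts a couple of worked examples in front of the new question, and one common zero-shot flavor of chain-of-thought prompting nudges the model to reason out loud with a phrase like "Let's think step by step."

```python
def few_shot_prompt(examples, question):
    """Build a few-shot prompt: worked examples first, then the new question."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def chain_of_thought_prompt(question):
    """Build a zero-shot chain-of-thought prompt by asking for step-by-step reasoning."""
    return f"Q: {question}\nA: Let's think step by step."


# Hypothetical examples, made up for this sketch.
examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]

print(few_shot_prompt(examples, "What is the capital of Italy?"))
print()
print(chain_of_thought_prompt("I have 3 apples and eat 1. How many are left?"))
```

Either string is what you would paste into (or send to) the model; the only difference between the techniques is what you put in front of the question.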

Cool prompt links

How to leave secret messages for Bing Chat on your web pages (link)
The case for the AI prompt engineer (link)
A CLI swiss army knife for ChatGPT (link)
Recursive prompting for LLMs (link)
Can GPT-4 actually write code? (link)
Awesome totally open ChatGPT alternatives (link)
ChatLLaMA - A ChatGPT style chatbot for interacting with Meta’s LLaMA (link)

Jailbreak of the week

I gotta hand it to Ucar this week. The idea that a jailbreak can create 3 levels of simulation within GPT-4 is absolutely fascinating to me and shines an interesting spotlight on GPT’s conceptual capabilities. It’s getting harder and harder to postulate that it’s JUST predicting the next token.

It also reminds me of the concept of “a dream within a dream” from the movie Inception so bonus points there.

If you want free merch, read this

Currently, if you refer one person you get access to my organized link database that keeps track of every single thing I’ve ever mentioned in the reports (takes 5 seconds to get access, just share this link with one friend).

And based on feedback from y’all I’ve added a few more tiers for rewards:

Refer 3 people and I’ll send you one of these cool shoggoth stickers to put on your water bottle or laptop
Refer 6 and I’ll send you a custom token smugglers hat in any colorway you want
Refer 10 and I’ll send you a TSA (token smugglers association) shirt in any colorway you want as well.

Here are some pics of the items:

So just share this little ol’ link with your friends, family, colleagues, acquaintances, second cousins that live in New Jersey, chill dude you sat next to one time on the plane and never talked to since… and everyone else in your life and earn FREE stuff.

Looking to create some more items as well, so if you design merch, please reach out!

That’s all I got for you this week, thanks for reading! Since you made it this far, follow @thepromptreport on Twitter. Also, if I made you laugh at all today, follow my personal account on Twitter @alexalbert__ so you can see me try to make memes like this:

iykyk
— Alex (@alexalbert__)
7:10 PM • Mar 19, 2023

That’s a wrap on Report #5 🤝

-Alex

Secret prompt pic

It’s over
— @goth600 🦐🦾 (@goth600)
6:42 PM • Mar 20, 2023