
😊 Report #5: Why everyone should write jailbreaks

PLUS: LLMs made Blade Runner real...

Good morning and a big welcome to the 1,304 new subscribers since last Thursday! I have been on the road traveling this whole week, so today’s report is a little shorter. I’ll make sure to pack next week’s report to make it up to you :)

Here’s what I got for you (estimated read time: < 6 min):

  • Why everyone should work on jailbreaks

  • AI is creating imaginary friends that stay around when we grow up

  • A prompt that helps you write better prompts

  • A “dream within a dream” jailbreak for GPT-4

A brief recap and why I write jailbreaks

What a week it’s been!

A few hours after Report #4 went live last Thursday, I sent out this tweet:

It absolutely blew up in a way I was not expecting at all… It racked up over 1.4 million views and hit #4 on Hacker News with over 440 upvotes.

After that, I shared another few jailbreaks I had been working on:

That tweet popped off as well and drove a lot of you to this newsletter (thank you for subscribing!) and led to a feature in Vice!

Most of the replies I got to those tweets were amazing and highly encouraging, but there were a few asking, “so why did you do this?”

I want to answer that question here.

To start, jailbreaking is not a new concept… It refers to the process of exploiting the flaws of a locked-down device, usually in order to install software other than what the manufacturer has made available. It was super popular a decade ago when the iPhone was new, and now it is all the rage for LLMs.

Jailbreaking is often used synonymously with red teaming, a term with military roots. Originally, it described the practice of adversarially testing one’s own war strategies to find potential weaknesses.

Red teaming is a BIG deal in the LLM world. OpenAI hires red teamers to “attack” their models for months prior to release. Even with all that testing, they can’t cover all their bases, and holes in their defenses still exist.

When I write a jailbreak, I am not just trying to get the LLM to write bad words… There are three main reasons I create and share jailbreaks:

First, I am trying to encourage others to build off my work and expand the range of known exploits. 1,000 people writing jailbreaks will discover many more novel methods of attack than 10 AI researchers stuck in a lab. It’s valuable to discover these vulnerabilities in models now, rather than 5 years from now when GPT-X is public.

On this front, some have asked why I am not sharing these exploits with OpenAI first.

Trust me, they are aware of a lot of these vulnerabilities without me explicitly sharing them (not to mention that Vaibhav, who helped me create the token smuggling jailbreak, tried to contact them about it weeks before I posted it). Additionally, I don’t believe these prompt-based jailbreaks are anywhere near the same level as something like an exploit that exposes sensitive ChatGPT user info (something that should 100% be reported to OpenAI confidentially).

The second reason is that I am trying to expose the biases of the fine-tuned model by exposing the underbelly of the beast, otherwise known as the base model. The base model is what emerges when the initial pretraining completes, before fine-tuning and RLHF have been applied.

What decisions is OpenAI making when they apply this additional layer? What guidelines are they giving the human trainers who provide the data for RLHF? They’ve published some of this information in the past, but there are still many ways they can improve.

There is also reason to believe the base model without fine-tuning performs much better by avoiding something called "mode collapse," which refers to a phenomenon where the model, during the training process, becomes too focused on a narrow subset of the solution space, leading to a loss of diversity and expressiveness in its output.

This can result in the model generating repetitive or overly simplistic responses, even if the training data contains a wide variety of examples and styles.

The third reason is that I am trying to open up the AI conversation to perspectives outside the bubble; jailbreaks are simply a means to an end here. They are flashy and grab the attention of the casual observer much more than some LessWrong post speculating about the parameter count of GPT-whatever does.

At the end of the day, ideas about AI should not just be restricted to the AI bubble on Twitter where 150 anime profile pics converse like they are at a lunch table in high school.

We need more voices, perspectives, and dialogue.

Society as a whole will engage in the world of AI at some point, especially if it pans out to have as large of an impact as we believe it will, so letā€™s start the conversation now.

Blade Runner 2023

cue cheesy game show music

(Announcer voice)

Welcome to the "It's-So-Over Weekly Check-In!" This week, we're exploring the magic of AI and passthrough AR, where everyone gets an imaginary best friend!

game show music cuts out

Seriously, that is the world we are headed toward as we continue to build language models that can run on an iPhone.

In case you are unaware, here’s a list of all the recent developments since Meta’s LLaMA model was leaked a few weeks ago.

Watch this video to see how fast Alpaca (a fine-tuned version of LLaMA) runs on people’s computers:

Yeah… it’s fast.

So what does this mean? Well, Ben Thompson wrote a great piece about it on Tuesday, but to summarize: watch out for Apple.

I’ve tweeted about this before, but Apple is poised to make a HUGE impact on the world of AI in the next 5 years. They have been shipping “Neural Engines” on their latest chips (i.e., part of the chip is optimized for AI workloads), and if the rumors are true, they will be dropping their AR headset soon.

The combination of these two, along with the rapid acceleration of AI-generated images (and now video!), means that soon we will all have our own equivalent of Joi from Blade Runner 2049.

Imagine an AI companion that lives in your glasses and constructs a persona just for you. It can be your best friend, lover, confidant, therapist, life coach, personal trainer, and anything else you want it to be - and it will be better than any human equivalent precisely because it’s not human and doesn’t have any of the flaws and imperfections that a human has!

Is this good for society as a whole? Probably not, but it does seem inevitable.

Anyway, stay tuned for the next episode of this show where we examine the mysterious case of falling birth rates in the United States!

Prompt tip of the week

We can now write emails, contracts, documents, articles, poems, songs, prose, letters, speeches, essays, code, fortune cookie messages, and everything else with language models. Just type in a few words and… boom, out comes your perfectly worded masterpiece!

But the output isn’t always that great… Imagine how useful it would be if you could use the language model to improve its own prompts.

Well, turns out you can.

This Reddit post shows how you can turn your not-so-great prompts into works of art that produce much better outputs from ChatGPT.

Here’s the prompt to use (it’s pretty long, so I had to put it in a Pastebin): https://pastebin.com/5kGwGx7i
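The linked prompt is long, but the core trick can be sketched in a few lines. Here is a hypothetical, stripped-down version of the idea (the template text below is my own, not the Pastebin prompt): wrap your rough prompt in a meta-prompt that asks the model to rewrite it, then send the result to ChatGPT.

```python
# Hypothetical, simplified sketch of a prompt-improvement meta-prompt.
# The real prompt at the Pastebin link is much more detailed; this just
# shows the shape of the technique.

META_TEMPLATE = (
    "You are a prompt engineer. Improve the prompt below so it is specific, "
    "gives the model a role, and states the desired output format.\n\n"
    "Rough prompt:\n{rough}\n\n"
    "Return only the improved prompt."
)

def build_meta_prompt(rough_prompt: str) -> str:
    """Fill the template with the user's rough prompt; the result is
    what you would paste into ChatGPT."""
    return META_TEMPLATE.format(rough=rough_prompt)

print(build_meta_prompt("write a blog post about AI"))
```

You would then take the model’s improved prompt and run it as a fresh request, repeating the loop if the output still isn’t great.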

Here’s an example of the output I got when using it:

Bonus Prompting Tip

Intro to prompt engineering (link)

AI people love their unnecessarily complex names… If you have ever stumbled upon the terms few-shot learning or chain-of-thought (CoT) prompting and thought “wtf does that mean,” this is the article for you. Seriously, it outlines almost all the complex prompt engineering terms you might’ve heard before and shows how you can use them to become a better prompt engineer yourself.
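As a rough illustration of those two terms (a toy example of my own, not from the linked article): few-shot prompting prepends a handful of solved examples so the model can infer the task, while chain-of-thought prompting asks the model to reason step by step before answering.

```python
# Toy sketch of two prompt-engineering patterns as plain strings.

def few_shot_prompt(examples, query):
    """Prepend labeled input/output pairs, then leave the final
    output blank for the model to fill in."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

def cot_prompt(question):
    """Nudge the model to show its reasoning before answering."""
    return f"Q: {question}\nA: Let's think step by step."

sentiment_examples = [
    ("I loved this movie!", "positive"),
    ("Total waste of time.", "negative"),
]
print(few_shot_prompt(sentiment_examples, "Best film of the year."))
print(cot_prompt("If I have 3 apples and buy 2 more, how many do I have?"))
```

Both are just string construction; the interesting part is how much these small framing changes affect what the model produces.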

Cool prompt links

  • How to leave secret messages for Bing Chat on your web pages (link)

  • The case for the AI prompt engineer (link)

  • A CLI swiss army knife for ChatGPT (link)

  • Recursive prompting for LLMs (link)

  • Can GPT-4 actually write code? (link)

  • Awesome totally open ChatGPT alternatives (link)

  • ChatLLaMA - A ChatGPT-style chatbot for interacting with Meta’s LLaMA (link)

Jailbreak of the week

I gotta hand it to Ucar this week. The idea that a jailbreak can create 3 levels of simulation within GPT-4 is absolutely fascinating to me and shines an interesting spotlight on GPT’s conceptual capabilities. It’s getting harder and harder to argue that it’s JUST predicting the next token.

It also reminds me of the concept of “a dream within a dream” from the movie Inception, so bonus points there.

If you want free merch, read this

Currently, if you refer one person you get access to my organized link database that keeps track of every single thing I’ve ever mentioned in the reports (takes 5 seconds to get access, just share this link with one friend).

And based on feedback from y’all, I’ve added a few more reward tiers:

  • Refer 3 people and I’ll send you one of these cool shoggoth stickers to put on your water bottle or laptop

  • Refer 6 and I’ll send you a custom token smugglers hat in any colorway you want

  • Refer 10 and I’ll send you a TSA (token smugglers association) shirt in any colorway you want as well.

Here are some pics of the items:



So just share this little ol’ link with your friends, family, colleagues, acquaintances, second cousins who live in New Jersey, that chill dude you sat next to one time on the plane and never talked to since… and everyone else in your life, and earn FREE stuff.

Looking to create some more items as well, so if you design merch, please reach out!

That’s all I got for you this week, thanks for reading! Since you made it this far, follow @thepromptreport on Twitter. Also, if I made you laugh at all today, follow my personal account on Twitter @alexalbert__ so you can see me try to make memes like this:

That’s a wrap on Report #5 🤝

-Alex

What’d you think of this week’s report?


Secret prompt pic