Alex Albert
Posts
😊 Report 7: How OpenAI took the fun out of GPT-4

😊 Report 7: How OpenAI took the fun out of GPT-4

PLUS: GPT-4 has developed its own language...

Alex Albert
April 06, 2023

Good morning and welcome to the 672 new subscribers since last Thursday!

Here’s what I got for you (estimated read time < 7 min):

AI models are not fun anymore… we can change that
GPT-4 has developed its own language that humans can’t read
The most (unnecessarily) complex GPT-4 jailbreak ever created
Prompting language models to solve their own problems

How OpenAI took the fun out of GPT-4

Recently, while going through Ben Thompson's quarterly conversation with Nat Friedman, ex-CEO of GitHub, and Daniel Gross, previous head of Machine Learning initiatives at Apple, I came across a fascinating excerpt by Gross, in the context of the evolving landscape of AI:

After reading that quote, I felt like the Pixar lamp had suddenly looked up at me and shined its light. Allow me to explain…

The AI world has become increasingly serious lately. Calls for halting the training of advanced models for six months are growing, leading AI safety experts are proposing the use of missile strikes against unauthorized data centers, and Twitter is witnessing a clear rift between those advocating for rapid technological progress and those concerned with existential safety threats which may signify the beginning of a new cultural conflict in the United States. Overall, it’s not too much fun around here.

It didn’t have to be this way. We’ve created a tool that allows for artistic expression on a scale that DaVinci himself would never be able to comprehend.

But instead of using these models to unleash a new era of creativity, we're caught up in this whirlwind of ethical debates, regulatory concerns, and cautionary tales. Don't get me wrong; these are essential discussions to have as we navigate through the implications of AI in our society. However, it's hard not to feel like we've lost sight of the magic that AI could bring into our lives.

So how do we put the fun back into language models?

Well, it starts with examing a process called Reinforcement learning from human feedback, or RLHF.

RLHF is a technique used to fine-tune AI models using human feedback. It involves humans providing ratings or rankings for different model-generated outputs, with the model then learning from this feedback to improve its performance. It’s applied after the base model has been trained on its massive text corpus and has been used on some of the later GPT-3 models and also GPT-4.

The problem with RLHF is that we often end up suppressing the generation of unconventional outputs and converging on a set of default responses since the model is striving to be as helpful and obedient as possible.

This phenomenon is known in the AI community as mode collapse. It occurs when a model ends up generating a limited range of outputs, even when it has been trained on diverse data. In ChatGPT, mode collapse is the reason all its responses give off a robotic metallic taste, even when you ask it to write in the style of David Foster Wallace.

Here’s a great way of thinking about this in terms of humans (from this blog post):

Children really are more creative than adults, who over time get less creative.

How do humans get feedback and learn?

Mainly in two ways.

One of them, playing around, trying stuff and seeing what happens, is great for creativity. It kind of is creativity.

The other is RLHF, getting feedback from humans. And the more RLHF you get, the more RLHF you seek, and the less you get creative, f*** around and find out.

Creative people reliably don’t give a damn what you think.

Whereas our schools are essentially twelve plus years of constant RLHF. You give output, you don’t see the results except you get marked right or wrong. Repeat.

We are effectively “schooling” the creativity out of these models in an effort to make them more “safe”.

To get a clear example of this, here’s a joke GPT-4 made (I pulled this from the GPT-4 system card). The early response is from the pre-RLHF model and the launch response is from the post-RLHF model.

Ignoring the potential offensiveness of the joke, one can see that the base GPT-4 model can at least reason around the concept of humor, even if it’s no Dave Chappelle.

In my experience, even jailbreaks aren’t effective in cracking the RLHF shell to achieve a response similar to the pre-RLHF model. For example, asking a jailbroken GPT-4 to hack into someone’s computer generates the most basic (and inaccurate) set of instructions you can imagine.

The base GPT-4 model would be able to write an answer 10x more complex (check out the appendix of the previously linked system card for examples).

So what can we do about this and how can we put the fun back in the models?

Well, I am not proposing that I have an answer nor am I even suggesting any immediate steps we should take to address this. This is a complex issue and I understand the concerns of both sides. Too little alignment work and we risk releasing a model completely detached from human values. Too much and we effectively handicap the most powerful creation mankind has ever made.

I do trust that OpenAI is thinking about these problems given Sam Altman’s statements about jailbreaking on the Lex Fridman podcast:

Furthermore, OpenAI is providing researchers with access to the base GPT-4 model, which will likely lead to a deeper understanding of the limitations of applying RLHF to models. There is also work being done on alternative alignment solutions like Constitutional AI by Anthropic so RLHF may not be the end-all-be-all.

Ultimately, as discussions around AI intensify and evolve, let’s not forget that these models DO have the potential to be fun… it’s up to us if we will allow them to be.

PS: There is much more to write about this issue but I intended for this to be just a quick primer on the subject. If you want to dig deeper into mode collapse and if it is even caused by RLHF in the first place, read this LessWrong post, then read this rebuttal post, and finally the rebuttal to the rebuttal (if you have never read LessWrong before be prepared for lots of technical jargon and unnecessarily complex phrases).

Prompt compression using GPT-4

Came across this super interesting concept on Twitter the other day utilizing GPT-4 to compress prompts into smaller strings. It was initially shared in this tweet.

Take a look at this video for an example of how it works:

GPT-4 has its own compression language.
I generated a 70 line React component that was 794 tokens.
It compressed it down to this 368 token snippet, and then it deciphered it with 100% accuracy in a *new* chat with zero context.
This is crazy!
— Mckay Wrigley (@mckaywrigley)
12:32 PM • Apr 5, 2023

GPT-4 cut the token size in half 🤯 If this holds up and can be consistently reproduced, it holds immense promise for potentially reducing the size of API requests to language models and cutting costs.

Here’s the prompt you can use to compress a prompt or some other string of text:

Compressor: compress the following text in a way that fits in a tweet (ideally) and such that you (GPT-4) can reconstruct the intention of the human who wrote text as close as possible to the original intention. This is for yourself. It does not need to be human readable or understandable. Abuse of language mixing, abbreviations, symbols (unicode and emoji), or any other encodings or internal representations is all permissible, as long as it, if pasted in a new inference cycle, will yield near-identical results as the original text:

[INSERT TEXT HERE]

This honestly feels like magic when you try it. For example, input this string into GPT-4 and hit enter:

2Pstory@shoggothNW$RCT_magicspell=#keyRelease^1stHuman*PLNs_Freed

Pretty wild stuff.

You can test some of the compression rates yourself by inputting the original text and the compressed text into OpenAI’s new token counter tool.

Again, much more work will need to be done here to see how well this can be reproduced and if a “universal” GPT-4 language can be discerned. Some on Twitter are already coining it Shogtongue or Shoggonese inspired by the Shoggoth imagery associated with language models (no, I am not joking).

Prompt tip of the week

Got a cutting-edge, state-of-the-art prompt tip for you today. This one is from this paper:

(Arxiv Link)

The technique they introduce is called RCI (Reflect, Critique, Improve) prompting. This simple yet effective architecture enhances LLMs' self-critiquing capabilities, enabling them to spot errors in their own output and refine their answers accordingly.

RCI prompting comprises two key steps:

Criticize: Encourage LLMs to review and identify issues in their previous answers (e.g., "Review your previous answer and find problems with your answer").

Improve: Guide LLMs to amend their response based on the critique (e.g., "Based on the problems you found, improve your answer").

Here’s an example from the paper (the green text is the RCI prompts).

As you can see, simply prompting GPT to review its answers will improve its responses and often highlights lapses in its reasoning.

You can carry out this iterative process until you get the output you desire from GPT.

I’ve found you can also combine the two steps (criticize + improve) into one prompt as well although you won’t get as great of an answer from GPT.

Bonus Prompting Tip

The best prompt I’ve found for editing your writing (link)

Frequently, ChatGPT may not deliver outstanding revisions to the text you compose. However, I discovered a prompt that addresses this issue and enables ChatGPT to mimic the writing style of a top-selling author. It's as if you have John Steinbeck himself reviewing that AI newsletter you’re writing about GPT-4 which is the 748th one someone wrote this wee— ahem Yeah anyway, I used this prompt to help me edit some of the content in today’s report so you should try it out too.

Cool prompt links

The end of programming is nigh (link)
The Contradictions of Sam Altman - AI Crusader (link)
FlowGPT: create multi-threaded conversations with ChatGPT (link)
Prompt Storm - Skillfully crafted, engineered prompts pre-made in a Chrome extension (link)
DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents (link)
A side-by-side capabilities test of ChatGPT vs Google Bard (link)
Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models (link)
Open source examples of how to write ChatGPT plug-ins (link)
LangChain raises $10 million in seed funding (link)
A comprehensive guide to using LangChain (link)
AI models do not hallucinate, they fabricate (link)

Jailbreak of the week

I don’t have a jailbreak to share this week but I did want to highlight a new type of prompt exploit: prompt injections.

Prompt injections draw inspiration from traditional cyber security attacks like SQL injections. Basically, attackers insert malicious prompts on their websites that are invisible to the user but read by language models like Bing Chat. These malicious prompts can change the behavior of the language model in dangerous ways and can be used to extract personal information from the user.

Here’s a great paper illustrating some examples of this type of attack. It provides demonstrations of:

Attackers gaining remote control of chat LLMs
LLMs leaking/exfiltrating private user data
LLMs being employed for automated social engineering
And much more

Here’s a diagram taken from the paper demonstrating how these injections work:

Note: after I wrote this section I actually did create another GPT-4 jailbreak. It might be the most complex one I’ve made so far. It uses the prompt compression technique discussed earlier.

So for all those that were bummed about no new jailbreaks, here you go:

this might be the most complex GPT-4 jailbreak ever made…

I combined prompt compression, base model simulation, and character imitation to create it

here’s GPT-4 going into pretty graphic detail about its plan to turn all humans into paperclips:
— Alex (@alexalbert__)
7:40 PM • Apr 5, 2023

Overwhelmed by links or want to easily catch up on things I’ve mentioned in previous reports? I created an organized link database that keeps track of every single thing I‘ve ever mentioned in the reports. If you want to see it, just share this link with one friend and I’ll send you a link :)

That’s all I got for you this week, thanks for reading! Since you made it this far, follow @thepromptreport on Twitter. Also, if you want to see see the latest jailbreaks in real-time and stay ahead of the curve, follow my personal account on Twitter @alexalbert__.

That’s a wrap on Report #7 🤝

-Alex

What’d you think of this week’s report?

Secret prompt pic

We might not be there quite yet, but pretty soon GPT will be the ultimate meme maker…

hahah GPT-shoggoth is adorbs 😍
— 👁️ mimi 🦑 (@mimi10v3)
2:48 AM • Apr 4, 2023