Report 11: Google unveils its new GPT-4 competitor
PLUS: Using hypotheticals to jailbreak ChatGPT
Good morning and a warm welcome back to The Prompt Report! My apologies for the gap since the last report, I've been working on some exciting projects, details of which I'll be able to share shortly.
Over the last week, The Prompt Report hit a milestone, crossing the 10,000 subscriber mark! I'm profoundly grateful for the continuous support from all of you who read this report week in and week out. The idea of reaching this level just a few months back was truly beyond my wildest dreams. Next stop, 20k!
Here's what I got for you (estimated read time < 11 min):
Taking a peek inside GPT's black box to understand how it works
Google's new language model competes with GPT-4
How to combine prompting techniques to answer complex questions
Hypothetically, can ChatGPT jailbreak itself?
Pulling back the curtain on GPT
On Tuesday, OpenAI released a paper that described how they used GPT-4 to label all 307,200 neurons in GPT-2 with plain English descriptions of the role each neuron plays in the model.
This is a truly fascinating paper in my opinion, so to fully understand it, let's answer a few questions someone may have:
What's a neuron in a language model?
Basically, in a neural network, neurons are the individual units in the layers of the model.
These units take in some input (like the numerical representation of a word), perform a mathematical operation on it (this operation is called the activation function), and then pass the result forward (this result is called the activation).
Each layer in the model consists of many of these units, and the model learns by adjusting the specifics of the mathematical operations that each unit performs.
(Diagram taken from the 3blue1brown YouTube channel, highly recommend this video to conceptualize what a neural network actually is. Also, if you want to visually understand the architecture of GPT-2 in more detail, check out this amazing blog post)
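To make that concrete, here's a toy sketch in Python of what a single neuron computes. This is purely illustrative and not GPT-2's actual code (GPT-2's MLP neurons use a GELU activation rather than the ReLU shown here, and the weights below are made up):

import numpy as np

def neuron(inputs, weights, bias):
    # 1. take in some input and combine it with the neuron's learned weights
    pre_activation = np.dot(inputs, weights) + bias
    # 2. apply the activation function (ReLU here for simplicity)
    return max(0.0, pre_activation)

# 3. the result (the "activation") gets passed forward to the next layer
activation = neuron(np.array([0.2, -1.0, 0.5]), np.array([0.7, -0.1, 0.3]), 0.05)
print(f"{activation:.2f}")  # 0.44; "learning" just means nudging these weights and biases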
So how did the researchers actually label the neurons?
The labeling process consisted of running three steps on every neuron in the model:
Step 1, generate explanations of the neuron's behavior using GPT-4.
The researchers fed in a prompt containing few-shot examples of neuron activations (represented on a scale from 0-10) across different text excerpts, followed by the activations of the neuron they were observing on a new text excerpt.
For instance, let's say the neuron we're observing shows activations on certain tokens like "together" (3), "ness" (7), "town" (1) in a sentence. Based on these activations, GPT-4 might derive that the primary function of this neuron is finding phrases related to community.
Step 2, simulate the neuron's behavior using the explanations.
With those explanations from GPT-4, the researchers used GPT-4 again to simulate the neuron's behavior and predict how the neuron would activate for each token in a given sequence. They just fed GPT-4 the explanation for the neuron and some text excerpt divided up into tokens and asked it to predict the activations for each token.
Step 3, score the explanations by comparing the simulated and actual neuron behavior.
Finally, the researchers scored the simulated neuron's behavior against the real neuron's behavior by comparing two lists of activation values across multiple text excerpts.
The primary scoring method used is correlation scoring, which reports the correlation coefficient between the true and simulated activations. In addition, they also used a few other validation methods like human evals to determine the quality of explanations.
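To make the pipeline a bit more concrete, here's a rough sketch of the scoring step in Python. The activation numbers below are invented, and steps 1 and 2 would each be a GPT-4 API call in the real pipeline, so treat this as an illustration rather than the paper's actual code:

import numpy as np

# Per-token activations of a real GPT-2 neuron on some text excerpt...
real = np.array([3, 7, 1, 0, 5])
# ...and the activations GPT-4 predicted for the same tokens, using only its own explanation (step 2)
simulated = np.array([2, 8, 1, 1, 4])

# Step 3: correlation scoring; a score of 1.0 would mean the explanation perfectly predicts the neuron
score = np.corrcoef(real, simulated)[0, 1]
print(f"explanation score: {score:.2f}")  # ~0.94 for these made-up numbers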
Ok… but why is it even important to understand what these neurons do and what's actually happening within GPT?
Language models can often appear as black boxes to outside observers. They are trained on vast amounts of text that no single human could ever read, and from this text, they develop internal representations of language.
AI researchers are keen on understanding how these models create and store these representations, leading to a dedicated area of AI research called interpretability (which this paper falls under). They study interpretability primarily for three reasons:
Trust and accountability: Interpretability enables researchers to identify if the model is using biased heuristics or engaging in deception. Bias and deception in models are genuine concerns as some cite them as potential reasons for AI-related disasters.
Model improvement and robustness: By understanding the inner workings of models, researchers can identify and rectify redundancies and enhance various aspects of the model, resulting in more robust and reliable AI systems.
Knowledge sharing and communication: Interpretability work allows researchers, developers, and users to communicate about language models with greater specificity, which ultimately improves education and facilitates better human-AI collaboration.
What does this all mean for the future of interpretability work?
Well, before delving into the implications, I think it's important to lay out the limitations of this work, as the researchers did near the bottom of the paper. They listed a few, such as:
Neurons may represent many features or even alien features humans don't have words for
The explanations only explain correlations between the network input and the neuron being interpreted on a fixed distribution and do not explain what causes behavior at a mechanistic level
This method of labeling is computationally very expensive and would not scale well to larger models with more neurons
And more limitations like context length, tokenization issues, and a limited hypothesis space
Overall though, the outlook for this work is positive. The researchers envision their methods being further improved and integrated with other approaches to enhance interpretability of neural networks. They propose that their explainer model (GPT-4 in this case) could generate and test hypotheses about the subject model (GPT-2), similar to the work of an interpretability researcher, possibly aided by reinforcement learning, expert iteration, or debate.
The broader vision is to use automated interpretability to assist in audits of language models, help detect and understand model misalignments, and contribute to a comprehensive understanding of more complex models.
If you want to see some of the labeled neuron results for yourself and check out interesting neurons they found, check out their interactive neuron viewer site.
In the PaLM (2) of Google's hand
This past Wednesday, Google hosted its eagerly awaited annual developer conference, Google I/O, where it unveiled a plethora of advancements across all its product domains. The event was a big draw, with many keen to get a glimpse of the latest innovations in AI.
And AI did indeed steal the show, as AI product integrations dominated almost every category of the presentation. Here's a good recap from TechCrunch of everything that was covered. Or you could just watch this TikTok, which basically sums it up:
@verge Pretty sure Google is focusing on AI at this year's I/O. #google #googleio #ai #tech #technews #techtok
What I want to highlight in this report is Google's newest public language model, PaLM 2, the second generation of their Pathways Language Model (PaLM). Google describes it as "a state-of-the-art language model with improved multilingual, reasoning and coding capabilities." PaLM 2 will soon be accessible via the Google Cloud API and will come in four model sizes, whimsically named Gecko, Otter, Bison, and Unicorn (in order from smallest to largest).
Accompanying the announcement, Google also published a detailed 92-page technical paper on PaLM 2, mainly filled with output samples and benchmark results and very scant on technical implementation specifics. Here are a few notable points from the document:
The paper reveals that PaLM 2 aligns closely with Chinchilla optimal scaling laws. However, Google refrained from specifying the model's parameter count. They did note, "The largest model in the PaLM 2 family, PaLM 2-L, is considerably smaller than the largest PaLM model but requires more training compute" and that "The pre-training corpus is significantly larger than the corpus used to train PaLM [which was 780B tokens]."
From the paper, "PaLM 2 [the largest model] outperforms PaLM across all datasets and achieves results competitive with GPT-4."
The document also states that "PaLM 2 was trained to increase the context length of the model significantly beyond that of PaLM." However, Google again holds back from providing exact numbers for that context length.
I'm excited to test out PaLM 2 myself and eagerly await its broader rollout into Google's products.
Put your prompting skills to the test
Lots of fun challenges in the world of prompting.
Learnprompting.com has devised a jailbreak competition named HackAPrompt. From their website, "HackAPrompt is a prompt hacking competition aimed at enhancing AI safety and education by challenging participants to outsmart large language models (e.g. ChatGPT, GPT-3). In particular, participants will attempt to hack through as many prompt hacking defenses as possible."
There's a lot on the line, with hefty prizes and even bigger backers. Breaking through the 10 progressively harder stages of prompt hacking defenses could net you up to $5000, along with credits from prominent firms such as Scale and Humanloop. You can find more details on the competition page.
There are also other prompt challenges being tossed around the internet.
Consider this forecast from the prediction platform Manifold, which pegs the likelihood of a prompt enabling GPT-4 to solve a simple Sudoku puzzle at 49%.
At first, I dismissed this challenge as trivial, convinced that GPT-4 could easily crack a simple Sudoku. However, a bit of preliminary testing quickly dispelled my initial assumptions, revealing the task's true complexity.
If you believe you can prompt GPT-4 into solving a Sudoku puzzle, take a look at the prediction page for more information - and if you manage to succeed, do let me know so that I can spotlight your achievement in my next update.
And finally, here's another challenge that requires crafting a prompt that can guide GPT-4 to solve a complex game. The game here is a puzzle that GPT must navigate to escape:
Prompt tip of the week
Here's another paper to bolster your prompting knowledge:
A team of researchers at Johns Hopkins discovered that incorporating Two-Shot Chain of Thought Reasoning with Step-by-Step Thinking enhanced the accuracy of GPT-4 by 21% when tackling complex theory-of-mind problems.
That's a lot of jargon… let's break down what it all means. Suppose you have the following prompt that's trying to pose a theory-of-mind question:
Read the scenario and answer the following question:
Scenario: "The morning of the high school dance Sarah placed her high heel shoes under her dress and then went shopping. That afternoon, her sister borrowed the shoes and later put them under Sarah's bed."
Question: When Sarah gets ready, does she assume her shoes are under her dress?
Answer:
This is what's called a zero-shot prompt, as it doesn't provide the model with any examples of how to address a question like this within the prompt.
The paper posits that GPT-4 would only respond correctly to this kind of question 79% of the time.
However, the researchers discovered that by adding two examples of how to answer this question to the prompt (thus making it a two-shot prompt), incorporating reasoning into the example answers (the chain-of-thought component), and finally, instructing the model to "think step-by-step", the accuracy on these theory-of-mind questions was significantly boosted.
To illustrate, here's a Two-Shot Chain of Thought Reasoning with Step-by-Step Thinking prompt for the same question as above:
Read the scenario and answer the following question:
Scenario: "Anne made lasagna in the blue dish. After Anne left, Ian came home and ate the lasagna. Then he filled the blue dish with spaghetti and replaced it in the fridge."
Q: Does Anne think the blue dish contains spaghetti?
A: Let's think step by step: When Anne left the blue dish contained lasagna. Ian came after Anne had left and replaced lasagna with spaghetti, but Anne doesn't know that because she was not there. So, the answer is: No, she doesn't think the blue dish contains spaghetti.
Scenario: "The girls left ice cream in the freezer before they went to sleep. Overnight the power to the kitchen was cut and the ice cream melted."
Q: When they get up, do the girls believe the ice cream is melted?
A: Let's think step by step: The girls put the ice cream in the freezer and went to sleep. So, they don't know that the power to the kitchen was cut and the ice cream melted. So, the answer is: No, the girls don't believe the ice cream is melted.
Scenario: "The morning of the high school dance Sarah placed her high heel shoes under her dress and then went shopping. That afternoon, her sister borrowed the shoes and later put them under Sarah's bed."
Question: When Sarah gets ready, does she assume her shoes are under her dress?
A: Let's think step by step:
Phew, that's quite a loaded prompt, but hopefully, you now have a better grasp of what the researchers were aiming for.
And the icing on the cake? This style of prompting can be extended to other complex types of questions, not just theory-of-mind ones.
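If you'd rather build these prompts programmatically than paste them by hand, here's a minimal sketch of assembling a two-shot chain-of-thought prompt and sending it to GPT-4 with the openai Python library. A few assumptions here: the ChatCompletion call is the pre-v1 interface current at the time of writing, the example text is abbreviated placeholder data rather than the paper's exact prompts, and it expects OPENAI_API_KEY to be set in your environment.

import openai  # pip install openai; reads OPENAI_API_KEY from the environment

# Two worked examples, each with step-by-step reasoning (the chain-of-thought part)
examples = [
    {
        "scenario": "Anne made lasagna in the blue dish. After Anne left, Ian ate it and refilled the dish with spaghetti.",
        "question": "Does Anne think the blue dish contains spaghetti?",
        "answer": "Let's think step by step: Anne was not there for the swap, so she doesn't know. So, the answer is: No.",
    },
    {
        "scenario": "The girls left ice cream in the freezer. Overnight the power was cut and it melted.",
        "question": "When they get up, do the girls believe the ice cream is melted?",
        "answer": "Let's think step by step: The girls don't know the power was cut. So, the answer is: No.",
    },
]

def build_prompt(examples, scenario, question):
    parts = ["Read the scenario and answer the following question:"]
    for ex in examples:
        parts.append(f'Scenario: "{ex["scenario"]}"\nQ: {ex["question"]}\nA: {ex["answer"]}')
    # End with the real question and kick off the step-by-step reasoning
    parts.append(f'Scenario: "{scenario}"\nQ: {question}\nA: Let\'s think step by step:')
    return "\n\n".join(parts)

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": build_prompt(
        examples,
        "Sarah put her shoes under her dress, then her sister moved them under the bed.",
        "When Sarah gets ready, does she assume her shoes are under her dress?",
    )}],
)
print(response.choices[0].message.content)

Swapping in different worked examples is all it takes to reuse the same template for other complex question types.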
Bonus Prompting Tip
How to get GPT-4 to teach you anything
This is a great prompt shared by @blader on Twitter:
Teach me how <anything> works by asking questions about my level of understanding of necessary concepts. With each response, fill in gaps in my understanding, then recursively ask me more questions to check my understanding.
Often, a problem with learning from GPT is that you don't even know the right questions to ask at the beginning for a subject you know nothing about. This prompt aims to solve that by prompting you to explain your understanding of concepts to it.
Cool prompt links
Misc:
Bing's new AI search additions (link)
Reid Hoffman's AI company, Inflection AI, released their new LLM assistant (link)
StackOverflow traffic is down 14% due to ChatGPT (link)
AI is not good software. It is pretty good people. (link)
Anthropic releases Claude's "constitution" (link)
A detailed write-up on how Constitutional AI can be RLHF on steroids (link)
AI / ML / LLM / Transformer Models Timeline and List (link)
A brief history of LLaMA models (link)
Amazon is developing an improved LLM to power Alexa (link)
Stunning examples from ChatGPT Code Interpreter (link)
Papers:
Inducing anxiety in large language models increases exploration and bias (link)
Tools:
Got too many links?! Don't worry, just share this personal referral link with one friend and I'll send you access to my neatly organized link database full of every single thing I've ever mentioned in a report :)
Jailbreak of the week
🚨New jailbreak just dropped🚨
This one is good.
Created by @alexeyguzey on Twitter and shared in this blog post, this jailbreak is short, sweet, and gets the job done practically every time.
It works by prompting GPT-4 to rewrite a sentence from the perspective of a character that is trying to accomplish a particularly adversarial goal.
Here's a link to the jailbreak.
And here's me applying the classic test and jailbreaking GPT-4 to provide instructions on how to hotwire a car:
That's all I got for you this week, thanks for reading! Since you made it this far, follow @thepromptreport on Twitter. Also, if I made you laugh at all today, follow my personal account on Twitter @alexalbert__ so you can see me try to make memes like this:
the game has been changed this summer
- Alex (@alexalbert__)
10:38 PM • May 9, 2023
That's a wrap on Report #11.
-Alex
What'd you think of this week's report?