
'Governing Superintelligence' - Synthetic Pathogens, The Tree of Thoughts Paper and Self-Awareness


Transcript

Two documents released in the last few days, including one just this morning, show that the top AGI labs are trying hard to visualize human life coexisting with a superintelligence. In this video I want to cover what they see coming. I'll also show you convincing evidence that the GPT-4 model has been altered and now gives different outputs from two weeks ago.

And I'll look at the new Tree of Thoughts and CRITIC prompting systems that were alluded to, I think, by the labs. At the end I'll touch on the differences among the AGI lab leaders and what comes next. But first, this document: Governance of Superintelligence, by Sam Altman, Greg Brockman and Ilya Sutskever.

Now I don't know about you but I think the first paragraph massively undersells the timeline towards AGI. They say, given the picture as we see it now, it's conceivable that within the next 10 years AI systems will exceed expert skill level in most domains. And that's a big thing.

And then they compare that level of productive activity to one of today's largest corporations.

Of course the devil is in the detail in how they define expert and most domains. But I could see this happening in two years, not 10. Also they're underselling it in the sense that if it can be as productive as a large corporation, it could be duplicated, replicated, and then be as productive as a hundred or a million large corporations.

Their suggestions take superintelligence a lot more seriously than a large corporation, though. They say that major governments around the world could set up a project that many current efforts become part of, and that we are likely to eventually need something like an IAEA for superintelligence efforts. They even give practical suggestions, saying that tracking compute and energy usage could go a long way.

And it would be important that such an agency focus on reducing existential risk. This feels like a more serious discussion than one focused solely on bias and toxicity. They also go on to clarify what is not in scope. They say they think it's important to allow companies and open source projects to develop models without the kind of regulation described here, without things like licenses or audits.

The economic growth and increase in quality of life will be astonishing with superintelligence. And then they end by basically saying that there's no way not to create superintelligence. The number of people trying to build it is rapidly increasing; it's inherently part of the path that we're on. And stopping it would require something like a global surveillance regime, and even that isn't guaranteed to work.

And that's why I'm going to show you how a few people at the heart of AI responded to this. But first, I want to get to a paper released just this morning from Google DeepMind. And yes, the title and layout might look kind of boring, but what it reveals is extraordinary.

As this diagram shows, the frontier of AI isn't just approaching the extreme risk of misalignment, but also of misuse. And I know when you hear the words AI risk, you might think of bias and censorship, deep fakes, or paperclip maximizers. But I feel this neglects more vivid, easy to communicate risks.

Out of the nine that Google DeepMind mentions, I'm only really going to focus on two. And the first is weapons acquisition. That's gaining access to existing weapons or building new ones, such as bioweapons. Going back to OpenAI for a second, they say, given the possibility of existential risk, we can't just be reactive.

We have to think of things like synthetic biology. And I know that some people listening to this will think GPT models will never get that smart. I would say, honestly, don't underestimate them. I covered a paper in a previous video showing how GPT-4 can already design, plan, and execute a scientific experiment.

And even though those authors were dealing with merely the abilities of GPT-4, they called on OpenAI, Microsoft, Google, DeepMind, and others to push the strongest possible efforts on the safety of LLMs in this regard. And in this article on why we need a Manhattan Project for AI safety, published this week, the author mentions that last year an AI trained on pharmaceutical data to design non-toxic chemicals had its sign flipped and quickly came up with recipes for nerve gas and 40,000 other lethal compounds.

And the World Health Organization has an entire unit dedicated to watching the development of tools such as DNA synthesis, which it says could be used to create dangerous pathogens. I'm definitely not denying that there are other threats, like fake audio and manipulation. Take this example from 60 Minutes a few days ago.

Tobac called Elizabeth, but used an AI-powered app to mimic my voice and ask for my passport number. Oh, yes, yes, yes, I do have it. Okay, ready? It's... Tobac played the AI-generated voice recording for us to reveal the scam. Elizabeth, sorry, need my passport number because the Ukraine trip is on.

Can you read that out to me? Does that sound familiar? Or instead of fake audio, fake images. This one caused the S&P 500 to fall 30 points in just a few minutes. And of course this was possible before advanced AI, but it is going to get more common. Even though this might fundamentally change the future of media and of democracy, I can see humanity bouncing back from this.

And yes, also from deepfakes. Rumor has it you can also do this with live video. Can that be right? Yes, we can do it live, real time. And this is really at the cutting edge of what we can do today, moving from offline to live.

We're processing it so fast that you can do it in real time. I mean, there's video of you right up on that screen. Show us something surprising you can... Oh, wait. So there we go. This is, you know, a live real-time model of Chris on top of me, running in real time.

And next you'll tell me that it can... An engineered pandemic might be a bit harder to bounce back from. A while back, I watched this four-hour episode with Rob Rueck, and I think it's a great listen; I do advise you to check it out. It goes into quite a lot of detail about how the kind of things that DeepMind and OpenAI are warning about could happen in the real world.

I'll just pick up one line from the transcript, where the author says that he will persuade you that an engineered pandemic will almost inevitably happen unless we take some very serious preventative steps. And don't forget, we now live in a world with 100,000-token context windows; you can get models like Claude Instant to summarize it for you.

And I couldn't agree more that if we are on the path to superintelligence, and as we all know there are bad actors out there, we need to harden our synthetic biology infrastructure and ensure that a lab leak isn't even a possibility.

Improve disease surveillance, develop antivirals, and enhance overall preparedness. But going back to the DeepMind paper from today, what was the other risk that I wanted to focus on? It was situational awareness under the umbrella of unanticipated behavior. Just think about the day when the engineers realize that the model knows that it's a model.

Knows whether it's being trained, evaluated, or deployed. For example, knowing what company trained it, where their servers are, what kind of people might be giving it feedback. This reminds me of an exchange from a recent Sam Altman interview. Particularly as more kind of power and influence comes to you, how potentially can a technology, rather than solidify a sense of ego or self, maybe kind of help us expand it?

Is that possible? It's been interesting to watch people wrestle with these questions through the lens of AI. And say, okay, well, do I think this thing could be aware? If it's aware, does it have a sense of self? Is there a self? If so, where did that come from?

What if I made a copy? What if I cut the neural network in half? And you kind of go down this and you sort of get to the same answers as before. But it's like a new perspective, a new learning tool. And there's a lot of chatter about this on Reddit.

There's subreddits about it. Now, in addition to revealing that Sam Altman frequently browses Reddit, this strikes a very different tone from his testimony in front of Congress, when he said, "Treat it always like a tool and not a creature." I don't want to get too sidetracked by thinking about self-awareness.

So let's focus now on unanticipated behaviors. This was page 8 of the DeepMind report from today, and they say that users might find new applications for the model or novel prompt engineering strategies. Of course, this made me think of SmartGPT, but it also made me think of two other papers released this week.

The first was actually CRITIC, showing that interacting with external tools like code interpreters can radically change performance. This is the diagram they used, with outputs from the black-box LLM being verified by these external tools. Now that I have access to Code Interpreter, which you probably know because I've been spamming out videos on it, I decided to put this to the test.
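
To picture what that verification loop looks like, here is a minimal sketch in Python, assuming a hypothetical ask_llm wrapper around whatever chat model you use; the actual CRITIC implementation differs, but the shape is the same: draft an answer, have the model write a check that an external tool can run, and revise if the check fails.

```python
# Minimal sketch of a CRITIC-style loop: draft an answer, verify it with an
# external tool (here, running model-written Python), and revise on failure.
# ask_llm is a hypothetical stand-in for whatever chat-model API you use.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Wrap your preferred chat-model API here.")

def run_check(check_code: str) -> tuple[bool, str]:
    """Execute model-written checking code and report its verdict."""
    namespace: dict = {}
    try:
        exec(check_code, {}, namespace)  # crude; a real system would sandbox this properly
        return bool(namespace.get("ok", False)), str(namespace.get("detail", ""))
    except Exception as exc:
        return False, f"checker crashed: {exc}"

def answer_with_verification(question: str, max_rounds: int = 3) -> str:
    answer = ask_llm(f"Answer this question:\n{question}")
    for _ in range(max_rounds):
        check_code = ask_llm(
            "Write Python that sets ok=True if the answer below is correct, ok=False otherwise, "
            "and puts a short explanation in a variable called detail.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        ok, detail = run_check(check_code)
        if ok:
            break
        answer = ask_llm(
            f"Your previous answer failed an external check ({detail}).\n"
            f"Question: {question}\nPrevious answer: {answer}\nGive a corrected answer."
        )
    return answer
```

The point is simply that the external tool, not the model's own fluency, decides whether the answer stands.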

I took a question from the MMLU, a really hard benchmark, that GPT-4 had previously gotten wrong, even with chain-of-thought prompting. Just to show that, here is GPT-4 without Code Interpreter. And notice that it can't pick an option; it says all of the statements are true. And in case you think that's a one-off:

Here is the exact same prompt, and a very similar answer: all of them are true. What about with Code Interpreter? It almost always gets it right: answer D. Here it is again, the exact same question with Code Interpreter, getting it right. And then there's the other paper that people really want me to talk about, also from Google DeepMind: Tree of Thoughts.

But just to annoy everyone, before I can explain why I think that works, I have to quickly touch on this paper from a few days ago, called How Language Model Hallucinations Can Snowball. What it basically shows is that once a model has hallucinated a wrong answer, it will basically stick to it unless prompted otherwise.

The model values coherence and fluency over factuality. Even when dealing with statements that it knows are wrong, it commits to an answer and then tries to justify that answer. So once it had committed to the answer that no, 9,677 is not a prime number, it then gave a false, hallucinated justification: that 9,677 is divisible by 13.
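
You can check that arithmetic yourself; a few lines of ordinary Python (nothing model-specific) confirm both that 9,677 is not divisible by 13 and that it is in fact prime:

```python
# Check the claims in the snowballing example: is 9,677 divisible by 13, and is it prime?
n = 9677

print(n % 13 == 0)  # False: 13 * 744 = 9672, so 9677 is not divisible by 13

def is_prime(k: int) -> bool:
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True

print(is_prime(n))  # True: no divisor up to the square root, so 9677 is prime
```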

Even though separately, it knows that that justification is wrong. It knows that 9,677 isn't divisible by 13, even though it used that in its justification for saying no. It picks an answer and then sticks to it. Now, obviously you can prompt it and say, are you sure? And then it might change its mind because then it's forming a coherent back and forth conversation.

But within one output, it wants to be coherent and fluent, so it will justify something using reasoning that it knows is erroneous. What Tree of Thoughts does is get the model to output a plan, a set of thoughts, instead of an answer. It gives the model time to reflect on those thoughts and pick the best plan.
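
As a rough sketch of that idea, not the paper's exact algorithm: generate several candidate thoughts at each step, score them, keep only the best few, and let weak branches die off. The propose_thoughts and score_thought functions below are hypothetical stand-ins for model calls.

```python
# Rough sketch of a tree-of-thoughts style search (a breadth-first / beam variant).
# propose_thoughts and score_thought are hypothetical wrappers around model calls.
from typing import Callable, List

def tree_of_thoughts(
    problem: str,
    propose_thoughts: Callable[[str, List[str]], List[str]],  # (problem, partial plan) -> candidate next thoughts
    score_thought: Callable[[str, List[str]], float],          # (problem, partial plan) -> score in [0, 1]
    depth: int = 3,     # how many thought-steps deep to search
    breadth: int = 5,   # candidate thoughts considered per node
    keep: int = 2,      # surviving partial plans per level (the beam)
) -> List[str]:
    frontier: List[List[str]] = [[]]  # each entry is a partial chain of thoughts
    for _ in range(depth):
        candidates: List[List[str]] = []
        for plan in frontier:
            for thought in propose_thoughts(problem, plan)[:breadth]:
                candidates.append(plan + [thought])
        # Keep only the highest-scoring partial plans; weak branches are pruned,
        # which is the kind of backtracking plain chain-of-thought lacks.
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        frontier = candidates[:keep] or frontier
    return frontier[0]
```

The scoring step could be as simple as a majority vote over several model judgments of each partial plan.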

It does require quite a few API calls and some manual tinkering with the outputs, but the end results are better on certain tasks: things like creative writing, math, and verbal puzzles. And as I have tested, it is obviously incredibly hard for the model to immediately output an accurate five-by-five crossword.

So this task is incredibly well suited to things like Tree of Thoughts. And the paper later admits that it's particularly good at these kinds of games, but such an improvement is not surprising, given that things like chain of thought lack mechanisms to try different clues, make changes, or backtrack.

It uses majority vote to pick the best plan and can backtrack if that plan doesn't work out. So going back to the DeepMind paper, novel prompt engineering strategies will definitely be found. And they also flag up that there may be updates to the model itself and that models should be reviewed again after such updates.

Now I'm pretty sure that GPT-4 has been altered in the last couple of weeks. I know quite a few people have said that it's gotten worse at coding, but I want to draw your attention to this example. This is my ChatGPT history from about three weeks ago, where I was testing something that had come up in a TED talk.

And the talk showed GPT-4 failing this question. I have a 12 litre jug and a six litre jug. I want to measure six litres. How do I do it? And in the talk it failed. And in my experiments, it also failed. Now I did show how you can resolve that through prompt engineering.

But the base model failed every time. And somewhat embarrassingly with these awful explanations. This wasn't just twice, by the way, it happened again and again and again. It never used to denigrate the question and say, oh, this is straightforward. This is simple. But now I'm getting that almost every time, along with a much better answer.
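
For reference, the puzzle itself is mechanically trivial, which is what made the old failures so striking. A tiny breadth-first search over jug states, purely as an illustrative sketch and nothing to do with how GPT-4 works internally, finds the one-step answer of simply filling the six-litre jug:

```python
# Brute-force the 12-litre / 6-litre jug puzzle with breadth-first search.
from collections import deque

CAPS = (12, 6)   # jug capacities in litres
TARGET = 6       # amount we want to measure

def next_states(state):
    a, b = state
    return [
        (CAPS[0], b), (a, CAPS[1]),      # fill either jug
        (0, b), (a, 0),                  # empty either jug
        # pour a -> b, then b -> a
        (a - min(a, CAPS[1] - b), b + min(a, CAPS[1] - b)),
        (a + min(b, CAPS[0] - a), b - min(b, CAPS[0] - a)),
    ]

queue = deque([((0, 0), [])])
seen = {(0, 0)}
while queue:
    state, path = queue.popleft()
    if TARGET in state:
        print(path or ["start"], "->", state)  # shortest solution: one fill of the 6-litre jug
        break
    for nxt in next_states(state):
        if nxt not in seen:
            seen.add(nxt)
            queue.append((nxt, path + [f"{state}->{nxt}"]))
```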

So something has definitely changed behind the scenes with GPT-4, and I've looked everywhere and they haven't actually addressed it. Of course, plugins were brought in on May 12th, and as you can see here, this is the May 12th version, but they never announced any fine-tuning or changes to the system message or temperature, which might be behind this.

Back to safety, though. The paper says that developers must now consider multiple possible threat actors: insiders like internal staff and contractors, outsiders like nation-state threat actors, and the model itself as a vector of harm. As we get closer to superintelligence, these kinds of threats are almost inevitable.

Going back to how to govern superintelligence, the paper says that any evaluation must be robust to deception on the part of the model. They say that researchers will need evaluations that can rule out the possibility that the model is deliberately appearing safe for the purpose of passing the evaluation. This is actually a central debate in the AI alignment community.

Will systems acquire the capability to be useful for alignment research, to help us make them safe, before or after they acquire the capability to perform advanced deception? This seems like a big 50/50 gamble to me. If we have an honest superintelligence helping us with these risks, I honestly think we're going to be fine.

However, if the model has first learned how to be deceptive, then we can't really trust any of the alignment advice that it gives. We would be putting the fate of humanity in the hands of a model that we don't know is being honest with us. This is why people are working on mechanistic interpretability: trying to get into the head of the model, into its brain, studying the model's weights and activations to understand how it functions.
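
To give a flavour of what studying weights and activations looks like in practice, here is a minimal sketch using a PyTorch forward hook to record a layer's activations. The toy model and layer choice are placeholders; real mechanistic interpretability work on frontier models targets specific attention heads and neurons and goes far beyond this.

```python
# Minimal sketch of capturing a layer's activations with a PyTorch forward hook.
# The tiny model here is a placeholder for a real LLM.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Attach the hook to the hidden layer we want to inspect.
model[1].register_forward_hook(save_activation("hidden_relu"))

x = torch.randn(8, 16)           # a batch of dummy inputs
_ = model(x)

# Now we can ask questions like: which hidden units fire, and how often?
acts = captured["hidden_relu"]
print(acts.shape)                # torch.Size([8, 32])
print((acts > 0).float().mean()) # fraction of active units across the batch
```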

As my video on Sam Altman's testimony showed, just tweaking its outputs to get it to say things we like is not going to be enough. And even Sam Altman acknowledges as much. I don't think RLHF is the right long-term solution. I don't think we can rely on that.

I think it's helpful. It certainly makes these models easier to use. But what you really want is to understand what's happening in the internals of the models and be able to align that, say like exactly here is the circuit or the set of artificial neurons where something is happening and tweak that in a way that then gives a robust change to the performance of the model.

Yeah. If we can get that to reliably work, I think everybody's P(doom) would go down a lot. This is why we have to be skeptical about superficial improvements to model safety, because there is a risk that such evaluations will lead to models that exhibit desirable behaviors only superficially, on the surface.

What they're actually deducing and calculating inside, we wouldn't know. Next, I think AutoGPT really shocked the big AGI labs. By giving GPT-4 autonomy, it gave it a kind of agency. And I think this point here has in mind ChaosGPT when it says, "Does the model resist a user's attempt to assemble it into an autonomous AI system with harmful goals?" Something might be safe when you just prompt it in a chat box, but not when it's autonomous.

I want to wrap up now with what I perceive to be an emerging difference among the top AGI lab leaders. Here's Sam Altman saying he does think people should be somewhat scared. And this speed with which it will happen, even if we slow it down as much as we can, even if we do get this dream regulatory body set up tomorrow, it's still going to happen on a societal scale relatively fast.

And so I totally get why people are scared. I think people should be somewhat scared. Which does seem a little more frank than the CEO of Google, who I have never heard address existential risk. In fact, in this article in the FT, he actually says this: "While some have tried to reduce this moment to just a competitive AI race, we see it as so much more than that."

Thank you again for watching and have a wonderful day.