There were nine impactful AI developments in the last few days that I wanted to tell you guys about. From the frankly startling HeyGen video translation, to the epic new prompt optimising paper, and from Apple's Ajax GPT to Open Interpreter, NExT-GPT, yet more Google Gemini news, and even Roblox AI.
But I must start with HeyGen. You probably already heard from AI Explained that HeyGen can generate lifelike videos, and is available as a plugin to ChatGPT. But how about video language dubbing? Well, today I got access to their new Avatar 2.0 feature, and I decided to test it out with Sam Altman.
Not with the real Sam Altman, of course, but with his testimony to the Senate. And I want Spanish language speakers to tell me how accurate they think this is. My worst fears are that we are causing significant damage to the field, technology, industry, and the world. I think that could happen in various ways.
That's why we started the company. It's a big part of why I'm here today, and we've been able to spend time with you. If this technology fails, it can be disastrous. We want to be clear about our position on this. I have been researching three or four tools, including this one, to translate my videos into dozens of languages.
And I can't wait to put that into place. But time waits for no man, so let me move on to Open Interpreter. I've been using the version released five days ago. What is it? Well, an open source code interpreter. Here is a brief preview.
Of course, I've been trying it out intensively, and while it's not perfect, it has proven useful. I asked it this: "Download this YouTube video in 1440p, e.g. using PyTube, and clip out 23:18 to 23:38, save to desktop, naming file 'Altman vertical line'."
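To make that request concrete, here is a minimal sketch of the clipping step only, not the code Open Interpreter actually wrote. It assumes the video has already been downloaded to a local file (the name `interview.mp4` is hypothetical), that the timestamps in the request mean 23:18 to 23:38, and that `ffmpeg` is installed. It only builds the command, so nothing is downloaded or cut when you run it.

```python
import shlex
from typing import List


def to_seconds(timestamp: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' into whole seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def build_clip_command(source: str, start: str, end: str, output: str) -> List[str]:
    """Build an ffmpeg command that copies the span [start, end] without re-encoding."""
    start_s = to_seconds(start)
    duration = to_seconds(end) - start_s
    if duration <= 0:
        raise ValueError("end must be later than start")
    return [
        "ffmpeg",
        "-ss", str(start_s),   # seek to the start of the clip
        "-i", source,          # input file (assumed to be downloaded already)
        "-t", str(duration),   # keep this many seconds
        "-c", "copy",          # copy the streams instead of re-encoding
        output,
    ]


# Hypothetical file names; the timestamps are the ones from the request above.
command = build_clip_command("interview.mp4", "23:18", "23:38", "Altman vertical line.mp4")
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run` would perform the cut. Stream copying is fast but snaps to keyframes, so the clip boundaries may be off by a moment.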
That's a clip I wanted to use in this very video. Now, okay, it wouldn't have taken me that long to do it manually, but this process was a few seconds. You agree to run the code a few times, and here was the end result. You can try to guess why I picked out this clip from a recent Sam Altman interview.
And we weren't set up for that. Is the usage of ChatGPT decelerating? No. I think it maybe took like a little bit of a flat line during the summer, which happens for lots of products, but it is... going up. Obviously, we shouldn't automatically believe the CEO of the company about how many people are using his product.
But I do think it points to a counter-narrative to the argument that ChatGPT usage is continuing to slow down into the autumn. With Open Interpreter, you don't have to use GPT-4 either, and more generally, I think this points to a change in the way we will use computers in the near future.
And now let's talk about this paper from Google DeepMind. I'll have more news about their Gemini model in a second, but first, I found this paper fascinating. I will hopefully be doing a deeper dive with one of the authors, but for now, the big picture is this.
Language models can come up with optimized prompts for language models. These aren't small optimizations either, and nor do they work with only one model. The paper says that with a variety of large language models, we demonstrate that the best prompts optimized by their method outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Those are long-standing tasks known for their difficulty for large language models. To massively oversimplify, models like PaLM 2 and GPT-4 can be given a meta-prompt.
For example, "generate a new instruction that achieves a higher accuracy on a particular task". The language models are then shown how previous prompts worked out. In this example, for a particular task, "Let's figure it out" scored 61, while "Let's solve the problem" scored 63 out of 100. This was the mathematics problem down here.
And then they're asked, "Generate an instruction that is different from all the instructions above and has a higher score than all the instructions above. The instruction should be concise, effective, and generally applicable to all problems." And apparently, GPT-4 was particularly good at looking at the trajectory of optimizations, the patterns and trends about what produced better prompts on a particular task.
For example, you might start with "Let's solve the problem", which scored 60%. And then the language model would propose iterations like "Let's think carefully about the problem and solve it together". That got 63.2, and you can see the accuracy gradually going up. Apparently, at least for math problems, PaLM 2 preferred concise prompts, while GPT models liked ones that were long and detailed.
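The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `propose` stands in for the optimizer model and `score` for the evaluation on a task, both replaced here by hypothetical stubs, and the meta-prompt layout is my own guess built around the wording quoted above. The scores in the lookup table are the ones mentioned in this video.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]

REQUEST = (
    "Generate an instruction that is different from all the instructions above "
    "and has a higher score than all the instructions above. The instruction "
    "should be concise, effective, and generally applicable to all problems."
)


def build_meta_prompt(history: List[Scored], keep: int = 20) -> str:
    """Show the best `keep` past instructions, worst first, then ask for a better one."""
    ranked = sorted(history, key=lambda pair: pair[1])[-keep:]
    lines = ["Previous instructions with their scores:", ""]
    for text, score in ranked:
        lines += [f"text: {text}", f"score: {score}", ""]
    lines.append(REQUEST)
    return "\n".join(lines)


def optimize(
    propose: Callable[[str], str],   # hypothetical: calls the optimizer model
    score: Callable[[str], float],   # hypothetical: accuracy of an instruction on the task
    seeds: List[str],
    steps: int,
) -> Scored:
    """Repeatedly ask for a new instruction, score it, and return the best one seen."""
    history = [(seed, score(seed)) for seed in seeds]
    for _ in range(steps):
        candidate = propose(build_meta_prompt(history))
        if any(candidate == text for text, _ in history):
            continue  # skip repeats so the history stays informative
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])


def demo() -> Scored:
    """Run one step with canned stand-ins for the model and the scorer."""
    table = {
        "Let's figure it out.": 61.0,
        "Let's solve the problem.": 63.0,
        "Let's think carefully about the problem and solve it together.": 63.2,
    }
    canned = iter(["Let's think carefully about the problem and solve it together."])
    return optimize(
        propose=lambda meta_prompt: next(canned),
        score=lambda text: table[text],
        seeds=["Let's figure it out.", "Let's solve the problem."],
        steps=1,
    )


print(demo())
```

Listing the instructions in ascending score order is one way to make the trajectory visible to the optimizer model, which is the pattern GPT-4 was said to pick up on.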
And nor was it just about the semantics or meanings of the prompts. The same meanings phrased differently could get radically different results. For example, with PaLM 2, "Let's think step by step" got 71.8.
Whereas "Let's solve the problem together" has accuracy of 60.5. But then if you put those two together and say "Let's work together to solve this problem step by step", you only get 49.4, although semantically its meaning is just a combination of those two instructions. For the original SmartGPT, I used this prompt:
"Let's work this out in a step-by-step way to be sure we have the right answer." That's because it performed best for GPT-4. As you can see here, it doesn't perform best for PaLM 2, although notice it does perform better than just an empty string. What does perform best? Well, "Take a deep breath and work on this problem step by step."
Also note the difference with beginning your answer with this prefix or beginning your question with a prefix. Anyway I am hoping to do a deeper dive on this paper with one of the authors so for now I'll leave it there. Suffice to say that prompt engineering is not a solved science yet.
But what was the Gemini news that I promised you from Google? Well, this was published just 14 hours ago in The Information. Google has, as of yesterday, given a small group of companies access to an early version of Gemini. That is their direct competitor to OpenAI's GPT-4. According to a person who has tested it, Gemini has an advantage over GPT-4 in at least one respect.
The model leverages reams of Google's proprietary data from its consumer products, in addition to public information scraped from the web. So the model should be especially accurate, with all of those Google search histories, when it comes to understanding users' intentions with particular queries.
And apparently, compared to GPT-4, it generates fewer incorrect answers, known as hallucinations. And again, according to them, Gemini will feature vastly improved code-generating abilities for software developers. Although note it says compared to its existing models. It didn't technically say compared to GPT-4. Note that PaLM 2 didn't score particularly high for coding, so if that's the baseline, bear that in mind.
The version they're giving developers isn't their largest version though, which apparently will be on par with GPT-4. And in a first for this channel, I'm going to make a direct prediction using Metaculus.
I'm going to predict that there will indeed be at least three months of third-party safety evaluations conducted on Gemini before its deployment. I think they finished training the model sometime in summer, so it will be more like six months if it's released in December. The heart of this channel is about understanding and navigating the future of AI, so I am super proud that Metaculus are my first sponsors.
They have aggregate forecasts on a range of AI-related questions. Yes, it's free to sign up with the link in the description, so show them some love and say you came from AI Explained. Speaking of the future though, we learned this week in the Wall Street Journal that Meta plans to develop a new model, and that will be Llama 3, sometime in early 2024.
That will apparently be several times more powerful than Llama 2. Even more interesting to me though was this exchange at a recent Meta GenAI social.
We have the compute to train Llama 3 and 4. The plan is for Llama 3 to be as good as GPT-4. Wow, if Llama 3 is as good as GPT-4, will you guys still open source it? Yeah, we will. Sorry, alignment people. You can let me know in the comments what you think about that exchange.
That is of course in complete contravention of what Senators Blumenthal and Hawley have put out. This week they released the Bipartisan Framework for U.S. AI Act.
In it they actually mentioned deepfakes which I kind of showed you earlier with HeyGen. But they also focused on AI audits and establishing an oversight body that should have the authority to conduct audits of companies seeking licenses. But I suspect the people signing up to work in that auditing office will have to commit to not working for any of the AI companies for the rest of their lives.
That's going to take a particularly motivated individual, particularly on public sector pay levels. Why do I say that? Well, here's Mustafa Suleyman. He recently said this on 80,000 Hours. Well, I'm really stuck. I think it's really hard. There is another direction which involves academic groups getting more access and either actually doing red teaming or doing audits of scale or audits of model capabilities.
Right. They're the three proposals that I've heard made and I've been very supportive of and have certainly explored with people at Stanford and elsewhere. But I think there's a real problem there which is if you take the average PhD student or postdoctoral researcher that might work on this in a couple of years they may well go to a commercial lab.
Right. And so if we're to give them access, then they'll probably take that knowledge and expertise elsewhere, potentially to a competitor. I mean, it's an open labor market after all. And when we heard this week that the IRS in America are going to use AI to catch tax evasion, it made me think that it's going to increasingly be a cat and mouse game between governments and auditors using AI on the one side and the companies developing the AI on the other side.
If the IRS has an AI that can detect tax evasion, well, then a hedge fund can just make an AI to obscure that tax evasion. Seems to me that in all of this, whoever has the most compute will win. And remember, these won't just be single-modality language models anymore.
To take one crazy example this week, we now have 'Smell to Text'. It's a much more narrow AI trained in a very different way to GPT models but it matches well with expert humans on novel smells. And then there's 'Protein Chat' which I didn't get a chance to talk about earlier in the year.
The so-called 'Protein GPT' enables users to upload proteins, ask questions and engage in interactive conversations to gain insights. And if that's not enough modalities, how about this? This is 'NExT-GPT', a multimodal LLM released two days ago that can go from any modality to any modality. Obviously there should be an asterisk over 'any', it isn't quite 'any' yet, but we're talking about the input being text, images, audio, video and then the output being images, audio, text, video.
One obvious question is: do we want one model to be good at everything or do we want narrower AI that's good at individual tasks? And this links to Ajax GPT from Apple. Now, of course, I did watch the iPhone launch, but I find this more interesting. This was an exclusive for The Information, and they talk about how Apple's LLM is designed to boost Siri.
And it almost sounds to me like Open Interpreter, where you can automate tasks involving multiple steps. For example, telling Siri to create a GIF using the last five photos you've taken and text it to a friend. And this was the most interesting part of the piece for me. Earlier in the article, they talked about how they're spending millions of dollars a day on Ajax GPT.
They're still quite far behind because apparently Ajax GPT beats GPT-3.5, the original ChatGPT, but not GPT-4. The focus is on running LLMs on your device with the goal of improving privacy and performance. So Ajax GPT might not be the best LLM, but they're pitching it as the best LLM on your phone.
The model they have at the moment is apparently too big. It's 200 billion parameters. Even a MacBook might struggle to run that, but they might have different sizes of Ajax, some small enough to run on an iPhone. Of course, from a user point of view, that would mean you can use the model offline, unlike, say, the ChatGPT app.
Let's move on now to a lighter development, albeit one that might affect hundreds of millions of people, including my nephew. Apparently, the online game platform Roblox is bringing in a new AI chatbot. That's going to allow creators to build virtual worlds just by typing prompts.
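To see why 200 billion parameters is a problem on a laptop, a rough back-of-the-envelope calculation helps. This sketch counts the memory for the weights only and ignores activations and any cache, so real requirements would be higher. The 200 billion figure is the one reported above; the bit widths are illustrative assumptions, not anything Apple has confirmed.

```python
def weight_memory_gb(parameters: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return parameters * bits_per_weight / 8 / 1e9


PARAMETERS = 200e9  # figure reported for the current model

# Illustrative precisions: half precision, then two levels of quantization.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PARAMETERS, bits):.0f} GB")
```

Even at 4 bits per weight, the weights alone come to about 100 GB, which is why smaller variants would be needed for anything running on a phone.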
And that's the crazy thing. All of this is going to become intuitive to the next generation. Children today are just going to expect their apps to be interactive and customizable on demand. And yes, we have covered a lot today, so let me know what you think. I'm going to end with an AI image that has taken the internet by storm, as well as a few more things that I've been able to do to make it more accessible.
And I'll see you in the next video. Thanks as always for watching to the end. Do check out Metaculus in the description. And as ever, have a wonderful day.