
Enter PaLM 2 (New Bard): Full Breakdown - 92 Pages Read and Gemini Before GPT 5? Google I/O


Transcript

Less than 24 hours ago, Google released the PaLM 2 technical report. I have read all 92 pages, watched the PaLM 2 presentation, read the release notes, and have already tested the model in a dozen ways. But before getting into it all, my four main takeaways are these. First, PaLM 2 is competitive with GPT-4, and while it is probably less smart overall, it's better in certain ways, and that surprised me.

Second, Google is saying very little about the data it used to train the model, or about parameters, or about compute, although we can make educated guesses on each. Third, Gemini was announced to be in training, and it will likely rival GPT-5 while arriving earlier. As you probably know, Sam Altman said that GPT-5 isn't in training and won't be for a long time.

Fourth, while dedicating 20 pages to bias, toxicity and mischief, the PaLM 2 report doesn't contain a single page on AI impacts more broadly. Google boasted of giving Gemini planning abilities in a move that, surprised as I am to say it, makes OpenAI look like paragons of responsibility.

So there's a lot to get to, but let's look at the first reason that PaLM 2 is different from GPT-4. On page 3, they say they "designed a more multilingual and diverse pre-training mixture", extending across hundreds of languages and domains like programming, mathematics, etc. Because the text that PaLM 2 was trained on is different from the text that OpenAI trained GPT-4 on, the models have different abilities, and I would say PaLM 2 is better at translation and linguistics, and in certain other areas which I'll get to shortly.

If that's data, what about parameter count? Well, Google never actually says. They only use phrases like "significantly smaller than the largest PaLM model", which was 540 billion parameters. Sometimes they say significantly, other times dramatically. Despite this, it significantly outperforms PaLM on a variety of tasks. So all the references you may have seen to imminent 100-trillion-parameter models were bogus.

Skipping ahead to page 91 of 92, in the model summary they say that "further details of model size and architecture are withheld from external publication". But earlier on, they did seem to want to give hints about the parameter count inside PaLM 2, which OpenAI never did for GPT-4. Here they say that the parameter count is the optimal number of parameters given a certain amount of compute, in FLOPs.

Scaling this up to the estimated number of FLOPs used to train PaLM 2, that would give an optimal parameter count of between 100 and 200 billion. That is a parameter count comparable to GPT-3, while getting competitive performance with GPT-4. Bard is apparently now powered by PaLM 2, and its inference speed is about 10 times faster than GPT-4 for the exact same prompt.

I know there are other factors that influence inference speed, but that would broadly fit with an order of magnitude fewer parameters. This has other implications, of course, and they say that PaLM 2 is dramatically smaller, cheaper and faster to serve.
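To make that back-of-the-envelope estimate concrete, here's a minimal sketch of the maths, assuming the common heuristics that training compute C ≈ 6·N·D FLOPs and the Chinchilla-style rule of thumb of D ≈ 20·N training tokens. The FLOPs budgets below are placeholders, not Google's undisclosed figure, and the function name is my own:

```python
import math

def chinchilla_optimal_params(compute_flops: float) -> float:
    # Back out a compute-optimal parameter count N from a training budget C,
    # assuming C ~= 6 * N * D FLOPs and the Chinchilla rule of thumb
    # D ~= 20 * N tokens, so C ~= 120 * N^2 and N ~= sqrt(C / 120).
    return math.sqrt(compute_flops / 120)

# Placeholder budgets, NOT the undisclosed PaLM 2 number:
for flops in (1e24, 3e24, 1e25):
    n = chinchilla_optimal_params(flops)
    print(f"{flops:.0e} FLOPs -> ~{n / 1e9:.0f}B parameters")
```

On those assumptions, budgets in the low 10^24s of FLOPs land you at roughly 90 to 160 billion parameters, which is exactly that 100-to-200-billion ballpark.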

Not only that, PaLM 2 itself comes in different sizes, as Sundar Pichai said: "PaLM 2 models deliver excellent foundational capabilities across a wide range of sizes. We have affectionately named them Gecko, Otter, Bison and Unicorn. Gecko is so lightweight that it can work on mobile devices: fast enough for great interactive applications on-device, even when offline."

I would expect Gecko to soon be inside Google Pixel phones. Going back to data, Google cryptically said that their pre-training corpus is composed of a diverse set of sources: documents, books, code, mathematics and conversational data. This vagueness about data is common across these companies, but suffice it to say, they're not revealing anything about where the data comes from.

Next, they don't go into detail, but they do say that PaLM 2 was trained to increase the context length of the model significantly beyond that of PaLM. As of today, you can input around 10,000 characters into Bard, but they end this paragraph with something a bit more interesting. They say, without demonstrating it: "Our results show that it is possible to increase the context length of the model without hurting its performance on generic benchmarks." The bit about not hurting performance is interesting, because in an experiment published a few weeks ago that extended the input size to around 2 million tokens, performance did drop off.

If Google has found a way to increase the input size in tokens without affecting performance, that would be a breakthrough. On multilingual benchmarks, notice how the performance of PaLM 2 in English is not dramatically better than in other languages. In fact, in many other languages it does better than in English.

This is very different from GPT-4, which was noticeably better in English than in all other languages. As Google hinted earlier, this is likely due to the multilingual text data that PaLM 2 was trained with. In fact, on page 17, Google admits that the performance of PaLM 2 exceeds Google Translate for certain languages.

And they show on page 4 that it can pass the mastery exams across a range of languages, like Chinese, Japanese, Italian, French, Spanish, German, etc. Look at the difference between PaLM 2 and PaLM, in red. Now, before you rush off and try Bard in all of those languages: I tried that, and apparently you can only use Bard at the moment in the following languages: English (US English, what a pity), Japanese and Korean.

But I was able to test Bard in Korean on a question translated via Google Translate from the MMLU dataset. It got the question right in each of its drafts. In contrast, GPT-4 not only got the question wrong in Korean when I originally tested it for my SmartGPT video, it got the question wrong in English too.

In case any of my regular viewers are wondering, I am working very hard on SmartGPT, both to understand what it's capable of and to get it benchmarked officially. And thank you so much for all the kind offers of help in that regard. I must admit, it was very interesting to see on page 14 a direct comparison between PaLM 2 and GPT-4.

And Google does admit that for the PaLM 2 and GPT-4 results, they used chain-of-thought prompting and self-consistency. Reading the self-consistency paper actually reminded me quite a lot of SmartGPT, which also picks the most consistent answer from multiple outputs. So I do wonder if this comparison is totally fair, if PaLM 2 used this method and GPT-4 didn't.
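For anyone curious, here's a rough sketch of what self-consistency amounts to: sample several chain-of-thought completions at non-zero temperature, extract each final answer, and take a majority vote. The `sample_answer` function here is a hypothetical stand-in for whatever model API you're calling; it isn't from the paper or the report:

```python
from collections import Counter

def self_consistency(sample_answer, prompt: str, n: int = 10):
    # Sample n chain-of-thought completions and keep only the final answers.
    # sample_answer(prompt) is a hypothetical stand-in that returns the final
    # answer string from one sampled (temperature > 0) completion.
    answers = [sample_answer(prompt) for _ in range(n)]
    # The "most consistent" answer is simply the most frequent one.
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n  # the answer, plus its share of the vote
```

The vote is over final answers rather than full reasoning chains, so different chains of thought that converge on the same answer reinforce each other.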

I'll have to talk about these benchmarks more in another video, otherwise this one would be too long. But a quick hint is that WinoGrande is about identifying what a pronoun in a sentence refers to: for example, working out whether "it" means the trophy or the suitcase in "the trophy doesn't fit in the suitcase because it is too big". Google also weighed in on the emergent-abilities debate, saying that PaLM 2 does indeed demonstrate new emergent abilities.

They say it does so in things like multi-step arithmetic problems, temporal sequences and hierarchical reasoning. Of course, I'm going to test all of those, and I have begun to do so already. And in my early experiments, I'm getting quite an interesting result: PaLM 2 gets a lot of questions wrong that GPT-4 gets right, but it can also get questions right that GPT-4 gets wrong.

And I must admit, it's really weird to see PaLM 2 getting really advanced college-level math questions right that GPT-4 gets wrong, and yet, when I ask it a basic question about prime numbers, it gets it kind of hilariously wrong. Honestly, I'm not certain what's going on there, but I do have my suspicions.

Remember, though, that recent papers have claimed that emergent abilities are a mirage, so Google begs to differ. When Google put PaLM 2 up against GPT-4 on high-school mathematics problems, it did outperform GPT-4. But again, it was using an advanced prompting strategy, not 100% different from SmartGPT, so I wonder if the comparison is quite fair.

What about coding? Well, again, it's really hard to find a direct comparison between the two models that's fair. Overall, I would guess that the specialized coding model of PaLM 2, what they call PaLM 2-S*, is worse than GPT-4. Its pass@1 accuracy, as in passing first time, is 37.6%.

Remember the Sparks of AGI paper? Well, that gave GPT-4 an 82% zero-shot pass@1 accuracy. However, as I talked about in the Sparks of AGI video, the paper admits that it could be that GPT-4 has seen and memorized some or all of HumanEval.
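For reference, pass@1 comes from the pass@k metric introduced alongside HumanEval (Chen et al., 2021). Here's a small sketch of the standard unbiased estimator; with k = 1 it reduces to just the fraction of sampled completions that pass the unit tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator from the HumanEval paper:
    # 1 - C(n - c, k) / C(n, k), computed in a numerically stable way.
    # n: completions sampled per problem; c: how many passed the unit tests.
    if n - c < k:
        return 1.0  # impossible to draw k samples that all fail
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 75 of them passing:
print(pass_at_k(200, 75, 1))  # 0.375, i.e. pass@1 is just 75/200
```

So a reported pass@1 of 37.6% just means that, averaged over problems, a bit over a third of first attempts pass the tests.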

There is one thing I will give Google credit for, which is that their code now sometimes references where it came from. Here is a brief extract from the Google keynote presentation. So let's take a look at the first example.
