Enter PaLM 2 (New Bard): Full Breakdown - 92 Pages Read and Gemini Before GPT 5? Google I/O
00:00:00.000 |
Less than 24 hours ago Google released the Palm 2 technical report. I have read all 92 pages, 00:00:06.740 |
watched the Palm 2 presentation, read the release notes and have already tested the model in a 00:00:12.340 |
dozen ways. But before getting into it all, my four main takeaways are these. First, Palm 2 is 00:00:18.300 |
competitive with GPT-4 and while it is probably less smart overall it's better in certain ways 00:00:25.200 |
and that surprised me. Second Google is saying very little about the data it used to train the 00:00:30.500 |
model or about parameters or about compute although we can make educated guesses on each. 00:00:36.680 |
Third Gemini was announced to be in training and will likely rival GPT-5 while arriving earlier 00:00:43.160 |
than GPT-5. As you probably know Sam Altman said that GPT-5 isn't in training and won't be for a 00:00:49.680 |
long time. Fourth, while dedicating 20 pages to bias, toxicity and mischief, 00:00:55.180 |
there wasn't a single page on AI impacts more broadly. Google boasted of giving Gemini 00:01:01.480 |
planning abilities in a move that, surprised as I am to say it, makes OpenAI look like 00:01:07.420 |
paragons of responsibility. So a lot to get to but let's look at the first reason that Palm 2 00:01:13.740 |
is different from GPT-4. On page 3 they say we designed a more multilingual and diverse 00:01:19.640 |
pre-training mixture extending across hundreds of languages and domains like programming and mathematics. 00:01:25.160 |
So because the text that they trained Palm 2 on is different to the text that OpenAI trained GPT-4 on, 00:01:31.940 |
it means that those models have different abilities and I would say Palm 2 is better at translation 00:01:37.900 |
and linguistics and in certain other areas which I'll get to shortly. If that's data what about 00:01:43.540 |
parameter count? Well Google never actually say. They only use words like it's significantly 00:01:48.760 |
smaller than the largest Palm model which was 540 billion parameters. Sometimes they say 00:01:55.140 |
significantly, other times dramatically. Despite this it significantly outperforms Palm on a variety 00:02:02.140 |
of tasks. So all the references you may have seen to imminent 100 trillion parameter models were bogus. 00:02:08.360 |
Skipping ahead to page 91 out of 92 in the model summary they say further details of model size and 00:02:15.040 |
architecture are withheld from external publication. But earlier on they did seem to want to give hints 00:02:20.800 |
about the parameter count inside Palm 2, which OpenAI never did. Here they give the optimal number of 00:02:25.120 |
parameters for a given amount of training compute in FLOPs. Scaling this up to the 00:02:31.780 |
estimated number of FLOPs used to train Palm 2, that would give an optimal parameter count of between 00:02:37.580 |
100 and 200 billion. That is a comparable parameter count to GPT-3 while getting competitive performance with GPT-4. 00:02:45.120 |
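To see roughly where that 100 to 200 billion estimate comes from, here is a minimal extrapolation sketch. The anchor point (around 10 billion optimal parameters at 1e22 FLOPs) is my approximate reading of the report's scaling-law results, not an official figure, and the square-root exponent follows from the report's finding that parameters and training tokens should grow in equal proportion; the FLOP budgets at the end are guesses, since the real figure is withheld.

    # Hedged back-of-envelope: compute-optimal parameter count vs training FLOPs.
    # With C ~ 6*N*D and N growing in proportion to D (the report's finding),
    # the optimal N scales roughly as C**0.5 from any anchor point.

    ANCHOR_FLOPS = 1e22    # anchor compute budget (assumption, see above)
    ANCHOR_PARAMS = 10e9   # ~10B optimal parameters at that budget (assumption)

    def optimal_params(flops: float) -> float:
        """Extrapolate the compute-optimal parameter count for a FLOP budget."""
        return ANCHOR_PARAMS * (flops / ANCHOR_FLOPS) ** 0.5

    for flops in (1e24, 2e24, 4e24):  # guessed Palm 2 training budgets
        print(f"{flops:.0e} FLOPs -> ~{optimal_params(flops) / 1e9:.0f}B params")
    # prints ~100B, ~141B and ~200B: the 100-200 billion range above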
BARD is apparently now powered by Palm 2 and the inference speed is about 10 times faster 00:02:52.840 |
than GPT-4 for the exact same prompt. 00:02:55.100 |
And I know there are other factors that influence inference speed but that would 00:02:59.160 |
broadly fit with an order of magnitude fewer parameters. 00:03:04.860 |
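As a back-of-envelope check on that reasoning: a dense decoder-only transformer spends roughly 2 FLOPs per parameter per generated token, so with comparable serving stacks, per-token cost scales with parameter count. Both model sizes below are assumptions purely for illustration; neither has been disclosed.

    # Per-token inference cost for a dense transformer is roughly 2*N FLOPs
    # for N parameters (ignoring batching, caching, quantization and the
    # other serving factors mentioned above).
    def flops_per_token(n_params: float) -> float:
        return 2 * n_params

    palm2_guess = 1.5e11  # hypothetical ~150B params (mid of the 100-200B estimate)
    gpt4_guess = 1.5e12   # hypothetical model 10x larger, for illustration only

    speedup = flops_per_token(gpt4_guess) / flops_per_token(palm2_guess)
    print(f"~{speedup:.0f}x cheaper per token")  # ~10x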
This has other implications of course, and they say that Palm 2 is dramatically smaller, cheaper and faster to serve. Not only that, Palm 2 00:03:11.580 |
itself comes in different sizes as Sundar Pichai said. Palm 2 models deliver excellent foundational 00:03:17.980 |
capabilities across a wide range of sizes. We have affectionately named them Gecko, Otter, 00:03:25.080 |
Bison and Unicorn. Gecko is so lightweight that it can work on mobile devices fast enough for great 00:03:33.660 |
interactive applications on device even when offline. I would expect Gecko to soon be inside 00:03:40.280 |
the Google Pixel phones. Going back to data Google cryptically said that their pre-training corpus 00:03:46.000 |
is composed of a diverse set of sources, documents, books, code, mathematics and conversational data. 00:03:55.060 |
This vagueness about data is a common issue with these companies, but suffice to say they're not saying anything about where the data comes from. 00:04:00.960 |
Next they don't go into detail but they do say that Palm 2 was trained to increase the context 00:04:06.240 |
length of a model significantly beyond that of Palm. As of today you can input around 10,000 00:04:11.240 |
characters into BARD, but they end this paragraph with something a bit more interesting. They say, 00:04:16.000 |
without demonstrating it: "Our results show that it is possible to increase the context length of the 00:04:21.040 |
model without hurting its performance on generic benchmarks." 00:04:25.040 |
The bit about not hurting performance is interesting because in this experiment published a few weeks ago 00:04:30.020 |
about extending the input size in tokens up to around 2 million tokens, the performance did drop 00:04:35.820 |
off. If Google had found a way to increase the input size in tokens and not affect performance, 00:04:41.820 |
that would be a breakthrough.
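For a sense of scale across these numbers: BARD's limit is quoted in characters while the experiments are in tokens. Assuming the common rule of thumb of about 4 characters per English token (an assumption; the tokenizer ratio isn't published), the gap looks like this.

    # Convert BARD's ~10,000-character input limit into approximate tokens,
    # assuming ~4 characters per token for English text (rule of thumb).
    CHARS_PER_TOKEN = 4

    bard_limit_chars = 10_000
    bard_limit_tokens = bard_limit_chars // CHARS_PER_TOKEN
    print(f"~{bard_limit_tokens:,} tokens")                # ~2,500 tokens
    print(f"{2_000_000 // bard_limit_tokens:,}x smaller")  # vs the 2M-token experiment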
On multilingual benchmarks, notice how the performance of Palm 2 00:04:47.700 |
in English is not dramatically better than in other languages. In fact, in many other languages it does better than in English. 00:04:55.020 |
This is very different to GPT-4, which was noticeably better in English than in all other languages. 00:05:01.340 |
As Google hinted earlier this is likely due to the multilingual text data that Google trained Palm 2 with. 00:05:08.140 |
In fact on page 17 Google admit that the performance of Palm 2 exceeds Google Translate for certain languages. 00:05:15.420 |
And they show on page 4 that it can pass the mastery exams across a range of languages like Chinese, Japanese, Italian, French, Spanish, German etc. 00:05:25.000 |
Look at the difference between Palm 2 and Palm in red. 00:05:28.940 |
Now before you rush off and try BARD in all of those languages I tried that and apparently 00:05:34.020 |
you can only use BARD at the moment in the following languages: English (US English, what a pity), 00:05:41.620 |
Japanese and Korean.
But I was able to test BARD in Korean on a question translated via Google Translate from 00:05:48.260 |
the MMLU dataset. It got the question right in each of its drafts. In contrast, GPT-4 not only 00:05:54.980 |
got the question wrong in Korean; when I originally tested it for my Smart GPT video, it got the question wrong in English as well. 00:06:02.480 |
In case any of my regular viewers are wondering I am working very hard on Smart GPT to understand 00:06:08.460 |
what it's capable of and getting it benchmarked officially. 00:06:11.800 |
And thank you so much for all the kind offers of help in that regard. 00:06:15.480 |
I must admit it was very interesting to see on page 14 a direct comparison between Palm 00:06:21.360 |
2 and GPT-4. And Google do admit that for the Palm 2 00:06:24.960 |
results they used chain of thought prompting and self-consistency. Reading the self-consistency 00:06:30.880 |
paper did remind me quite a lot actually of Smart GPT, where it picks the most consistent 00:06:36.440 |
answer of multiple outputs. So I do wonder if this comparison is totally fair, if Palm 2 is getting that boost and GPT-4 isn't. 00:06:44.900 |
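For anyone unfamiliar, self-consistency works like this: sample several chain-of-thought completions at non-zero temperature, extract each final answer, and return the majority answer. A minimal sketch, where sample_completion is a hypothetical stand-in for whichever model API is being called:

    from collections import Counter

    def sample_completion(prompt: str) -> str:
        """Hypothetical stand-in for a temperature > 0 call to an LLM API."""
        raise NotImplementedError

    def extract_final_answer(completion: str) -> str:
        # Assumes the chain of thought ends with "The answer is X".
        return completion.rsplit("The answer is", 1)[-1].strip(" .")

    def self_consistency(prompt: str, n_samples: int = 16) -> str:
        # Sample several reasoning paths, then majority-vote on the answer.
        answers = [extract_final_answer(sample_completion(prompt))
                   for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]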
I'll have to talk about these benchmarks more in another video otherwise this one would 00:06:48.840 |
be too long. But a quick hint is that WinoGrande is about identifying what the pronoun in a sentence refers to: for example, does 'it' mean the trophy or the suitcase in 'the trophy doesn't fit in the suitcase because it is too big'? 00:06:56.060 |
Google also weighed into the emergent abilities debate, saying that Palm 2 does indeed demonstrate 00:07:02.220 |
new emergent abilities. They say it does so in things like multi-step arithmetic problems, 00:07:07.600 |
temporal sequences and hierarchical reasoning. Of course I'm going to test all of those. 00:07:14.320 |
And in my early experiments I'm getting quite an interesting result. Palm 2 gets a lot of 00:07:18.780 |
questions wrong that GPT-4 gets right but it can also get questions right that GPT-4 00:07:24.920 |
gets wrong. And I must admit it's really weird to see Palm 2 getting really advanced 00:07:29.340 |
college level math questions right that GPT-4 gets wrong and yet also when I ask it a basic 00:07:34.840 |
question about prime numbers it gets it kind of hilariously wrong. Honestly I'm not certain 00:07:39.960 |
what's going on there but I do have my suspicions. 00:07:42.920 |
Remember though that recent papers have claimed that emergent abilities are a mirage so Google 00:07:48.560 |
begs to differ. When Google put Palm 2 up against GPT-4 in high school mathematics problems it did 00:07:54.900 |
outperform GPT-4. But again it was using an advanced prompting strategy not 100% different 00:08:01.340 |
from SmartGPT so I wonder if the comparison is quite fair. What about coding? Well again it's 00:08:06.820 |
really hard to find a direct comparison that's fair between the two models. Overall I would 00:08:12.600 |
guess that the specialized coding model of Palm 2, what they call PaLM 2-S*, is worse than GPT-4. 00:08:19.400 |
It says its pass@1 accuracy, as in passing first time, is 37.6%. 00:08:24.880 |
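For reference, pass@1 is the fraction of problems solved by the first sampled completion. The standard unbiased estimator from the original Codex paper generalizes this to pass@k when you draw n samples per problem and c of them pass:

    import math

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimator (Chen et al., 2021):
        1 - C(n-c, k) / C(n, k), for n samples per task, c of them correct."""
        if n - c < k:
            return 1.0
        return 1.0 - math.comb(n - c, k) / math.comb(n, k)

    # e.g. with 10 samples per task and 4 passing, pass@1 is 0.4
    print(pass_at_k(n=10, c=4, k=1))  # 0.4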
Remember the Sparks of AGI paper? Well that gave GPT-4 as having an 82% zero-shot pass@1 00:08:32.860 |
accuracy level. However, as I talked about in the Sparks of AGI video, the paper admits that it could 00:08:38.540 |
be that GPT-4 has seen and memorized some or all of HumanEval. There is one thing I will give 00:08:45.620 |
Google credit on which is that their code now sometimes references where it came from. Here is 00:08:50.680 |
a brief extract from the Google keynote presentation.