What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more]
Chapters
0:00 Intro
7:26 Researchers Left
7:57 Keep Search Safe / What Does Bard Do Again?
8:20 Anthropic Investment
00:00:00.000 |
This video was supposed to be about the nine best prompts that you could use with Google's 00:00:04.900 |
newly released BARD model. It's just that there was a slight problem. Every time I tried one of 00:00:10.180 |
these epic ideas, GPT-4 did it better. I really wanted to come out here and say, look, you can 00:00:15.580 |
use it for this or for this. As you'll see, it just didn't work out that way. So instead, reluctantly, 00:00:20.740 |
I had to change the title. Now, unfortunately, it's just a comparison showing how much better 00:00:26.340 |
GPT-4 is compared to BARD. A lot of people wanted this comparison after my last video used Bing for 00:00:32.840 |
comparison. This one's going to use OpenAI's GPT-4, but I wasn't satisfied with just showing 00:00:38.280 |
you the problems with BARD. I wanted to find the explanation. In the end, I didn't find one reason, 00:00:43.340 |
I found six as to why BARD is so far behind and why Google is losing the AI race. Let's get to 00:00:50.700 |
the comparison. First one is coding. And as you can see, BARD refuses to do coding. They actually 00:00:56.340 |
mentioned this in the FAQ that BARD won't do coding for you. As it says, I'm designed solely 00:01:01.760 |
to process and generate text. As you can see, it's a fairly basic coding challenge and BARD won't do 00:01:07.700 |
it. GPT-4 had no such qualms and the code worked first time. Of course, I did check it and it 00:01:13.300 |
worked, but this was just a simple challenge to turn letters into numbers. Next, and even worse 00:01:18.700 |
for BARD, it can't summarize PDFs. This is going to be such a common use case for Bing using GPT-4. 00:01:25.980 |
By the way, it didn't admit that it couldn't summarize the PDF. It summarized a 00:01:30.340 |
completely different PDF. And if you check the other drafts, none of them summarize the correct 00:01:35.720 |
PDF. Of course, the GPT-4 accessed via OpenAI also can't do this because it can't access the web. It 00:01:41.720 |
also picked a completely different paper, but our old friend Bing could indeed read the PDF and 00:01:47.280 |
summarize it. Okay, what about summarization when I literally paste in the text that I need it to 00:01:56.300 |
summarize? Imagine you want to summarize a meeting via Google Meets or shorten an email thread in 00:02:01.600 |
Gmail. It has to get this right. I pasted in the same New York Times article into BARD and GPT-4, 00:02:07.900 |
and I am sad to say that BARD fluffed its lines. The link to the article will be in the description, 00:02:13.340 |
but I've read it carefully and it makes numerous mistakes. Let me scroll down and show you this 00:02:18.580 |
erroneous summary. First, it says the Fed is expected to raise interest rates, citing 00:02:26.280 |
full employment and inflation. Not only is full employment not mentioned in the article at all, 00:02:32.780 |
it also gets both numbers wrong. The unemployment rate in America isn't currently 3.8% and inflation 00:02:39.140 |
isn't at 7.9%. I checked these against the latest data and you can check it yourself, 00:02:43.720 |
but both are wrong. BARD also keeps going on tangents, like stocks are typically considered 00:02:48.760 |
to be riskier investments than bonds. Okay, that's fine, but why are you giving me financial advice 00:02:53.520 |
when you're supposed to be summarizing an article? Honestly, 00:02:56.260 |
it was a pretty unusable summary. So bad that to be honest, you'd have been better off just not 00:03:01.480 |
reading it. Trust me, I am not an OpenAI fanboy, but its model is just better currently. Notice how 00:03:07.440 |
in its summary, it doesn't go on tangents and it clarifies that it's investors who think that there 00:03:12.460 |
will be a quarter point increase. The five bullet points are succinct and accurate. This is a pretty 00:03:18.200 |
colossal loss for BARD. What about light content creation and idea generation? Surely it could do 00:03:24.760 |
well here. This is supposed to be a showcase of how BARD can be used to 00:03:26.240 |
create new content, like create eight new YouTube video ideas with titles and synopses on integrating 00:03:31.800 |
generative AI into retail. If BARD can't be used by analysts, maybe it can be used by content 00:03:37.640 |
creators. Not really. I mean, you make your own mind up, but these titles are pretty repetitive 00:03:43.340 |
and bland. I know I can't really complain because my channel name is AI Explained, but these titles 00:03:48.960 |
are just unoriginal and these synopses lack detail. I'll let you read these, but compare them to GPT-4's 00:03:56.220 |
outputs. Each title is different and the ideas are much more explored and nuanced. 00:04:01.400 |
Okay, fine. What about email composition? And I have to say, count me a skeptic on this one. 00:04:06.760 |
I have never found that any model, let alone BARD, can do a decent job at this. It's not always that 00:04:13.360 |
the emails are bad. It's just that the time it takes me to teach the model what I want to say 00:04:17.720 |
in my email, I could have just written the email. I'm going to make a prediction at this point. I 00:04:21.360 |
don't think using language models to do emails is going to become that common. Of course, feel free 00:04:26.200 |
to quote me on this in a year's time. Now, you're probably thinking I'm being harsh. This is a 00:04:30.660 |
perfectly fine email. I did leave a thumbs up. It's just that I would never use BARD for this purpose. 00:04:36.040 |
And I would also never use GPT-4. Like, I don't want it to make up all these extra details about 00:04:41.640 |
what I'm going to discuss with John. It's just too risky to send an email that has any chance 00:04:46.060 |
of hallucinations. I know you guys might think that I really love Bing, but it's even worse here. 00:04:51.300 |
It claims that I've added relevant data and graphs. No, I haven't. I never mentioned anything about data 00:04:56.180 |
and graphs. Now my boss thinks I'm going to do data and graphs. What are you doing, Bing? And then 00:05:00.200 |
you're going to say, why am I using creative mode? Well, if we use balance mode or precise mode, we go 00:05:05.020 |
back to the BARD problem. It's an okay email, but look at the length of it. I could have just written 00:05:09.920 |
it out. Would have been quicker to do the email than the prompt. I was beginning to lose hope in 00:05:14.120 |
BARD, so I tried writing assistance. I picked a paragraph that someone I know used for a personal 00:05:19.620 |
statement to get into university. Of course, they were happy for me to share it. It's decently 00:05:24.040 |
written, but could be improved significantly. 00:05:26.160 |
I asked BARD, rewrite this paragraph with better English, make it original, 00:05:30.560 |
professional, and impactful. Now BARD did remove some of the errors, but it again went on a wild 00:05:36.480 |
tangent, trying to sell a career in data science, as if we were some sort of recruiter. Now I'm not 00:05:42.540 |
going to be too harsh. If you just take the first paragraph, it's okay. GPT-4's output is better, 00:05:48.900 |
but still has some problems. Now I think some of you are going to laugh at what happened with Bing. 00:05:56.140 |
I pretty much had to trick Bing to get it to rewrite this paragraph. First it says, 00:06:00.580 |
"My mistake. I can't give a response to that right now." I tried again. It said, 00:06:05.140 |
"Hmm, let's try a different topic. Sorry about that." Finally, I just asked the exact same thing 00:06:10.300 |
with different words. I said, "Rephrase this text with smoother language." It seemed to like that, 00:06:16.240 |
and then did the job. I think it's the best output, but still has problems. Anyway, 00:06:20.320 |
this is not a grammar lesson, so let's move to science and physics. And BARD completely flops. 00:06:25.960 |
It gets this fairly basic physics question wrong. So how can it be a tutor for us? For a student to 00:06:32.260 |
effectively learn from a tutor, there has to be a degree of trust that the tutor is telling the 00:06:37.120 |
truth. GPT-4, by the way, gets this one right. I even asked BARD to come up with a multiple choice 00:06:43.300 |
quiz. It definitely came up with the quiz. Problem is, quite a few of the answers were wrong. I 00:06:48.220 |
didn't check all of them, but look at number seven and number eight. The correct answer just isn't 00:06:52.420 |
there. GPT-4 does a lot better with really interesting questions in increasing order 00:06:58.120 |
of difficulty. Now it does have some slip-ups. Look at question four. There are two correct 00:07:02.980 |
answers. One is a half, one is five over ten, but they both simplify to the same thing. GPT-4 00:07:08.440 |
was also able to give these explanations. I do think the day of AI tutoring is fast 00:07:13.780 |
approaching. I just don't think it's quite here yet. And certainly not with BARD. 00:07:17.800 |
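[As a side note, the two "different" options in GPT-4's question four really are the same number, which is why both count as correct. A minimal Python check, not from the video, makes the point using the standard library's `fractions` module:]

```python
from fractions import Fraction

# Question four listed both "1/2" and "5/10" as answer options.
# Fraction reduces to lowest terms, so the two options compare equal.
half = Fraction(1, 2)
five_tenths = Fraction(5, 10)

print(half == five_tenths)  # True: the two options are the same value
print(five_tenths)          # 1/2 (normalized automatically)
```

[A quiz generator could avoid this kind of duplicate by normalizing every numeric option before presenting the choices.]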
I think the point is pretty much proven now. So let's move on to the explanations. 00:07:22.240 |
Why has Google fallen so far behind? First, a lot of its top researchers have left. There were eight 00:07:29.200 |
co-authors at Google for the famous "Attention Is All You Need" paper on the transformer architecture. 00:07:34.660 |
That's amazing, right? They pretty much invented transformers. Problem is, now all but one of the 00:07:40.060 |
paper's eight co-authors have left. One joined OpenAI and others have started their own companies, 00:07:45.580 |
some of which I'll be covering in future videos. Speaking of which, 00:07:49.060 |
if you're learning anything from this video, please don't forget to leave a like, 00:07:52.060 |
and a comment. Next potential reason is that they don't seem to want to interfere with their 00:07:56.800 |
lucrative search model. As the product lead for BARD said, "I just want to be very clear, 00:08:01.780 |
BARD is not search." If you haven't seen my initial review of BARD, which pretty much proves that it's 00:08:07.060 |
terrible at search, do check it out after this video. If BARD is not designed for search, 00:08:12.100 |
what is it designed for? As the article points out, they haven't really provided specific use cases. 00:08:17.980 |
Next, are they worried about safety and accelerationism? Or 00:08:21.880 |
are they looking to buy up a competitor to OpenAI? They invested over $300 million in Anthropic. The 00:08:29.440 |
stated goal of that company is to work on AI safety and alignment. So is Google trying to 00:08:34.540 |
be on the right side of history and place all of its bets on safe AI? Or are they trying to 00:08:39.220 |
do to Anthropic what Microsoft did to OpenAI itself? I'll be following this particular story 00:08:44.500 |
quite closely over the coming weeks and months. Next, maybe Google has better models that they 00:08:49.960 |
genuinely don't want to release because of concerns about the harm they could cause. 00:08:51.700 |
They had the Imagen text-to-image model that was better than DALL-E 2 and they didn't release it. 00:08:59.920 |
Google said it was because Imagen encoded harmful stereotypes and representations. 00:09:04.540 |
I dug into the original Imagen paper and it was indeed much better than DALL-E 2. Google 00:09:10.180 |
wasn't bluffing, they had a better model and that wasn't the last time. In January of this year, 00:09:14.500 |
they released a paper on Muse, a text-to-image transformer that was better than both Imagen and 00:09:20.680 |
DALL-E 2. 00:09:21.520 |
In case anyone thinks they're lying, here I think is the proof. The Muse model outputs are on the 00:09:27.040 |
right, the Imagen outputs are in the middle, and OpenAI's DALL-E 2 outputs are on the left. 00:09:32.200 |
Strikes me that Google's Muse is one of the first models to get text right. Midjourney, 00:09:37.360 |
even Midjourney version 5, definitely can't do this. 00:09:40.120 |
So why didn't Google release this? Well, I read to the end of the Muse paper and they say this: 00:09:45.400 |
"It's well known that models like Midjourney and Muse can be leveraged for misinformation, 00:09:51.340 |
harassment, and various types of social and cultural biases. Due to these important 00:09:56.320 |
considerations, we opt not to release code or a public demo at this point in time." 00:10:01.660 |
Let me know what you think in the comments, but I think it's more than possible that Google has 00:10:06.760 |
a language model that's far better than BARD, and even far better than PaLM, 00:10:11.560 |
perhaps leveraging DeepMind's Chinchilla model, and that they are genuinely keeping 00:10:16.060 |
it back and not publishing papers on it because they worry about these kinds of considerations. 00:10:21.160 |
Anyway, I do have a final theory about BARD, and that theory is that they might have been working 00:10:26.680 |
on what they regard as more serious models. In December, Google released this paper on Med-PaLM. 00:10:33.280 |
It's a language model tailored to help in a medical setting. And if you think its accuracy 00:10:38.500 |
of 67.6% in answering medical questions was good, wait till you hear this: they've 00:10:45.040 |
now released Med-PaLM 2. Here is a snippet of Google's presentation on Med-PaLM 2, released 00:10:50.980 |
just a week ago. Today, we're announcing results from Med-PaLM 2, our new and improved model. 00:10:56.920 |
Med-PaLM 2 has reached 85% accuracy on the medical exam benchmark in research. This performance is 00:11:05.140 |
on par with expert test takers. It far exceeds the passing score, and it's an 18% leap over our own 00:11:13.000 |
state-of-the-art results from Med-PaLM. Med-PaLM 2 also performed impressively on Indian medical exams, 00:11:20.800 |
the first AI system to exceed the passing score on those challenging questions. 00:11:24.280 |
But finally, what does this say about the near-term future of BARD? Well, 00:11:28.180 |
the more users a model gets, the more data it gets, and so the more easily a model can be 00:11:33.400 |
improved. As this Forbes article points out, Microsoft now has access to the valuable 00:11:38.140 |
training data that these products generate, which is a dangerous prospect for an incumbent 00:11:42.940 |
like Google. And it's not like Google doesn't know this. The CEO of Google admitted that 00:11:47.380 |
products like this, talking about BARD, get better the more 00:11:50.620 |
people use them. It's a virtuous cycle. But does that mean that it will be a vicious cycle if 00:11:56.500 |
everyone uses GPT-4 instead of BARD? With less data, does that mean there'll be less improvement 00:12:01.720 |
of Google's model? Only time will tell, and I will be there to test it. 00:12:05.800 |
Thank you very much for watching, and do have a wonderful day.