
What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more]


Chapters

0:00
7:26 Researchers Left
7:57 Keep Search Safe / What Does Bard Do Again?
8:20 4. Anthropic Investment

Whisper Transcript

00:00:00.000 | This video was supposed to be about the nine best prompts that you could use with Google's
00:00:04.900 | newly released BARD model. It's just that there was a slight problem. Every time I tried one of
00:00:10.180 | these epic ideas, GPT-4 did it better. I really wanted to come out here and say, look, you can
00:00:15.580 | use it for this or for this. As you'll see, it just didn't work out that way. So instead, reluctantly,
00:00:20.740 | I had to change the title. Now, unfortunately, it's just a comparison showing how much better
00:00:26.340 | GPT-4 is compared to BARD. A lot of people wanted this comparison after my last video used Bing for
00:00:32.840 | comparison. This one's going to use OpenAI's GPT-4, but I wasn't satisfied with just showing
00:00:38.280 | you the problems with BARD. I wanted to find the explanation. In the end, I didn't find one reason,
00:00:43.340 | I found six as to why BARD is so far behind and why Google is losing the AI race. Let's get to
00:00:50.700 | the comparison. First one is coding. And as you can see, BARD refuses to do coding. They actually
00:00:56.340 | mentioned this in the FAQ that BARD won't do coding for you. As it says, I'm designed solely
00:01:01.760 | to process and generate text. As you can see, it's a fairly basic coding challenge and BARD won't do
00:01:07.700 | it. GPT-4 had no such qualms and the code worked first time. Of course, I did check it and it
00:01:13.300 | worked, but this was just a simple challenge to turn letters into numbers. Next, and even worse
00:01:18.700 | for BARD, it can't summarize PDFs. This is going to be such a common use case for Bing using GPT-4.
00:01:25.980 | By the way, it didn't admit that it couldn't summarize the PDF. It summarized a
00:01:30.340 | completely different PDF. And if you check the other drafts, none of them summarize the correct
00:01:35.720 | PDF. Of course, the GPT-4 accessed via OpenAI also can't do this because it can't access the web. It
00:01:41.720 | also picked a completely different paper, but our old friend Bing could indeed read the PDF and
00:01:47.280 | summarize it. Okay, what about summarization when I literally paste in the text that I need it to
00:01:56.300 | summarize? Imagine you want to summarize a meeting via Google Meet or shorten an email thread in
00:02:01.600 | Gmail. It has to get this right. I pasted in the same New York Times article into BARD and GPT-4,
00:02:07.900 | and I am sad to say that BARD fluffed its lines. The link to the article will be in the description,
00:02:13.340 | but I've read it carefully and it makes numerous mistakes. Let me scroll down and show you this
00:02:18.580 | erroneous summary. First, it says the Fed is expected to raise interest rates, but doesn't say
00:02:23.880 | by whom. Second, it starts chatting about
00:02:26.280 | full employment and inflation. Not only is full employment not mentioned in the article at all,
00:02:32.780 | it also gets both numbers wrong. The unemployment rate in America isn't currently 3.8% and inflation
00:02:39.140 | isn't at 7.9%. I checked these against the latest data and you can check it yourself,
00:02:43.720 | but both are wrong. BARD also keeps going on tangents, like stocks are typically considered
00:02:48.760 | to be riskier investments than bonds. Okay, that's fine, but why are you giving me financial
00:02:53.520 | advice when you're supposed to be summarizing an article? Honestly,
00:02:56.260 | it was a pretty unusable summary. So bad that to be honest, you'd have been better off just not
00:03:01.480 | reading it. Trust me, I am not an OpenAI fanboy, but its model is just better currently. Notice how
00:03:07.440 | in its summary, it doesn't go on tangents and it clarifies that it's investors who think that there
00:03:12.460 | will be a quarter point increase. The five bullet points are succinct and accurate. This is a pretty
00:03:18.200 | colossal loss for BARD. What about light content creation and idea generation? Surely it could do
00:03:24.760 | well here. This, in a sense, is a pretty good example of how BARD could be used to
00:03:26.240 | create new content: create eight new YouTube video ideas with titles and synopses on integrating
00:03:31.800 | generative AI into retail. If BARD can't be used by analysts, maybe it can be used by content
00:03:37.640 | creators. Not really. I mean, you make your own mind up, but these titles are pretty repetitive
00:03:43.340 | and bland. I know I can't really complain because my channel name is AI Explained, but these titles
00:03:48.960 | are just unoriginal and these synopses lack detail. I'll let you read these, but compare them to GPT-4's
00:03:56.220 | outputs. Each title is different and the ideas are much more explored and nuanced.
00:04:01.400 | Okay, fine. What about email composition? And I have to say, count me a skeptic on this one.
00:04:06.760 | I have never found that any model, let alone BARD, can do a decent job at this. It's not always that
00:04:13.360 | the emails are bad. It's just that the time it takes me to teach the model what I want to say
00:04:17.720 | in my email, I could have just written the email. I'm going to make a prediction at this point. I
00:04:21.360 | don't think using language models to do emails is going to become that common. Of course, feel free
00:04:26.200 | to quote me on this in a year's time. Now, you're probably thinking I'm being harsh. This is a
00:04:30.660 | perfectly fine email. I did leave a thumbs up. It's just that I would never use BARD for this purpose.
00:04:36.040 | And I would also never use GPT-4. Like, I don't want it to make up all these extra details about
00:04:41.640 | what I'm going to discuss with John. It's just too risky to send an email that has any chance
00:04:46.060 | of hallucinations. I know you guys might think that I really love Bing, but it's even worse here.
00:04:51.300 | It claims that I've added relevant data and graphs. No, I haven't. I never mentioned anything about data
00:04:56.180 | and graphs. Now my boss thinks I'm going to do data and graphs. What are you doing, Bing? And then
00:05:00.200 | you're going to say, why am I using creative mode? Well, if we use balance mode or precise mode, we go
00:05:05.020 | back to the BARD problem. It's an okay email, but look at the length of it. I could have just written
00:05:09.920 | it out. Would have been quicker to do the email than the prompt. I was beginning to lose hope in
00:05:14.120 | BARD, so I tried writing assistance. I picked a paragraph that someone I know used for a personal
00:05:19.620 | statement to get into university. Of course, they were happy for me to share it. It's decently
00:05:24.040 | written, but could be improved significantly.
00:05:26.160 | I asked BARD, rewrite this paragraph with better English, make it original,
00:05:30.560 | professional, and impactful. Now BARD did remove some of the errors, but it again went on a wild
00:05:36.480 | tangent, trying to sell a career in data science, as if we were some sort of recruiter. Now I'm not
00:05:42.540 | going to be too harsh. If you just take the first paragraph, it's okay. GPT-4's output is better,
00:05:48.900 | but still has some problems. Now I think some of you are going to laugh at what happened with Bing.
00:05:53.300 | It simply refused to do it twice. I
00:05:56.140 | pretty much had to trick Bing to get it to rewrite this paragraph. First it says,
00:06:00.580 | "My mistake. I can't give a response to that right now." I tried again. It said,
00:06:05.140 | "Hmm, let's try a different topic. Sorry about that." Finally, I just asked the exact same thing
00:06:10.300 | with different words. I said, "Rephrase this text with smoother language." It seemed to like that,
00:06:16.240 | and then did the job. I think it's the best output, but still has problems. Anyway,
00:06:20.320 | this is not a grammar lesson, so let's move to science and physics. And BARD completely flops.
00:06:25.960 | It gets this fairly basic physics question wrong. So how can it be a tutor for us? For a student to
00:06:32.260 | effectively learn from a tutor, there has to be a degree of trust that the tutor is telling the
00:06:37.120 | truth. GPT-4, by the way, gets this one right. I even asked BARD to come up with a multiple choice
00:06:43.300 | quiz. It definitely came up with the quiz. Problem is, quite a few of the answers were wrong. I
00:06:48.220 | didn't check all of them, but look at number seven and number eight. The correct answer just isn't
00:06:52.420 | there. GPT-4 does a lot better with really interesting questions in increasing order
00:06:58.120 | of difficulty. Now it does have some slip-ups. Look at question four. There are two correct
00:07:02.980 | answers. One is a half, one is five over ten, but they both simplify to the same thing. GPT-4
00:07:08.440 | was also able to give these explanations. I do think the day of AI tutoring is fast
00:07:13.780 | approaching. I just don't think it's quite here yet. And certainly not with BARD.
00:07:17.800 | I think the point is pretty much proven now. So let's move on to the explanations.
00:07:22.240 | Why has Google fallen so far behind? First, a lot of its top researchers have left. There were eight
00:07:29.200 | co-authors at Google for the famous "Attention Is All You Need" paper on the transformer architecture.
00:07:34.660 | That's amazing, right? They pretty much invented transformers. Problem is, now all but one of the
00:07:40.060 | paper's eight co-authors have left. One joined OpenAI and others have started their own companies,
00:07:45.580 | some of which I'll be covering in future videos. Speaking of which,
00:07:49.060 | if you're learning anything from this video, please don't forget to leave a like,
00:07:52.060 | and a comment. Next potential reason is that they don't seem to want to interfere with their
00:07:56.800 | lucrative search model. As the product lead for BARD said, "I just want to be very clear,
00:08:01.780 | BARD is not search." If you haven't seen my initial review of BARD, which pretty much proves that it's
00:08:07.060 | terrible at search, do check it out after this video. If BARD is not designed for search,
00:08:12.100 | what is it designed for? As the article points out, they haven't really provided specific use cases.
00:08:17.980 | Next, are they worried about safety and accelerationism? Or
00:08:21.880 | are they looking to buy up a competitor to OpenAI? They invested over $300 million in Anthropic. The
00:08:29.440 | stated goal of that company is to work on AI safety and alignment. So is Google trying to
00:08:34.540 | be on the right side of history and place all of its bets on safe AI? Or are they trying to
00:08:39.220 | do to Anthropic what Microsoft did to OpenAI itself? I'll be following this particular story
00:08:44.500 | quite closely over the coming weeks and months. Next, maybe Google has better models that they
00:08:49.960 | genuinely don't want to release because they worry about safety.
00:08:51.700 | They had the Imagen text-to-image model that was better than DALL-E 2 and they didn't release it.
00:08:59.920 | Google said it was because Imagen encoded harmful stereotypes and representations.
00:09:04.540 | I dug into the original Imagen paper and it was indeed much better than DALL-E 2. Google
00:09:10.180 | wasn't bluffing, they had a better model and that wasn't the last time. In January of this year,
00:09:14.500 | they released a paper on Muse, a text-to-image transformer that was better than both Imagen and
00:09:20.680 | DALL-E 2.
00:09:21.520 | In case anyone thinks they're lying, here I think is the proof. The Muse model outputs are on the
00:09:27.040 | right, the Imagen outputs are in the middle, and OpenAI's DALL-E 2 outputs are on the left.
00:09:32.200 | Strikes me that Google's Muse is one of the first models to get text right. Midjourney,
00:09:37.360 | even Midjourney version 5, definitely can't do this.
00:09:40.120 | So why didn't Google release this? Well, I read to the end of the Muse paper and they say this:
00:09:45.400 | "It's well known that models like Midjourney and Muse can be leveraged for misinformation,
00:09:51.340 | harassment, and various types of social and cultural biases. Due to these important
00:09:56.320 | considerations, we opt not to release code or a public demo at this point in time."
00:10:01.660 | Let me know what you think in the comments, but I think it's more than possible that Google has
00:10:06.760 | a language model that's far better than BARD, and even far better than PaLM,
00:10:11.560 | perhaps leveraging DeepMind's Chinchilla model, and that they are genuinely keeping
00:10:16.060 | it back and not publishing papers on it because they worry about these kinds of considerations.
00:10:21.160 | Anyway, I do have a final theory about BARD, and that theory is that they might have been working
00:10:26.680 | on what they regard as more serious models. In December, Google released this paper on Med-PaLM.
00:10:33.280 | It's a language model tailored to help in a medical setting. And if you think its accuracy
00:10:38.500 | of 67.6% in answering medical questions was good, wait till you hear about the fact they've
00:10:45.040 | now released Med-PaLM 2. Here is a snippet of Google's presentation on Med-PaLM 2, released
00:10:50.980 | just a week ago. Today, we're announcing results from Med-PaLM 2, our new and improved model.
00:10:56.920 | Med-PaLM 2 has reached 85% accuracy on the medical exam benchmark in research. This performance is
00:11:05.140 | on par with expert test takers. It far exceeds the passing score, and it's an 18% leap over our own
00:11:13.000 | state-of-the-art results from Med-PaLM. Med-PaLM 2 also performed impressively on Indian medical exams,
00:11:18.940 | and it's the first AI
00:11:20.800 | system to exceed the passing score on those challenging questions.
00:11:24.280 | But finally, what does this say about the near-term future of BARD? Well,
00:11:28.180 | the more users a model gets, the more data it gets, and so the more easily a model can be
00:11:33.400 | improved. As this Forbes article points out, Microsoft now has access to the valuable
00:11:38.140 | training data that these products generate, which is a dangerous prospect for an incumbent
00:11:42.940 | like Google. And it's not like Google doesn't know this. The CEO of Google admitted that
00:11:47.380 | products like this, talking about BARD, get better the more
00:11:50.620 | people use them. It's a virtuous cycle. But does that mean that it will be a vicious cycle if
00:11:56.500 | everyone uses GPT-4 instead of BARD? With less data, does that mean there'll be less improvement
00:12:01.720 | of Google's model? Only time will tell, and I will be there to test it.
00:12:05.800 | Thank you very much for watching, and do have a wonderful day.