
What's Up With Bard? 9 Examples + 6 Reasons Google Fell Behind [ft. Muse, Med-PaLM 2 and more]


Chapters

0:00
7:26 1. Researchers Left
7:57 2. Keep Search Safe
3. What Does Bard Do Again?
8:20 4. Anthropic Investment

Transcript

This video was supposed to be about the nine best prompts that you could use with Google's newly released Bard model. It's just that there was a slight problem: every time I tried one of these epic ideas, GPT-4 did it better. I really wanted to come out here and say, look, you can use it for this or for this.

As you'll see, it just didn't work out that way. So instead, reluctantly, I had to change the title. Now, unfortunately, it's just a comparison showing how much better GPT-4 is than Bard. A lot of people wanted this comparison after my last video used Bing for the comparison. This one uses OpenAI's GPT-4, but I wasn't satisfied with just showing you the problems with Bard.

I wanted to find the explanation. In the end, I didn't find one reason; I found six reasons why Bard is so far behind and why Google is losing the AI race. Let's get to the comparison. The first one is coding, and as you can see, Bard refuses to do coding.

They actually mention this in the FAQ: Bard won't do coding for you. As it says, "I'm designed solely to process and generate text." As you can see, it's a fairly basic coding challenge, and Bard won't do it. GPT-4 had no such qualms, and its code worked the first time.

Of course, I did check it and it worked, but this was just a simple challenge to turn letters into numbers, along the lines of the sketch below.
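
The transcript doesn't capture the exact prompt, so treat this purely as a guess at the sort of thing it might have been; the function name and the a=1, ..., z=26 convention are my assumptions:

```python
# A minimal sketch of a "turn letters into numbers" challenge, assuming the
# task was mapping each letter to its position in the alphabet (a=1, ..., z=26).
# The function name and the convention are illustrative, not from the video.
def letters_to_numbers(text: str) -> list[int]:
    return [ord(ch) - ord("a") + 1 for ch in text.lower() if ch.isalpha()]

print(letters_to_numbers("Bard"))  # [2, 1, 18, 4]
```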

Next, and even worse for Bard, it can't summarize PDFs, which is going to be such a common use case for Bing with GPT-4. Bard didn't admit that it couldn't summarize the PDF; it summarized a completely different PDF. And if you check the other drafts, none of them summarize the correct one. Of course, GPT-4 accessed via OpenAI also can't do this, because it can't access the web.

It also picked a completely different paper, but our old friend Bing could indeed read the PDF and summarize it. Okay, what about summarization when I literally paste in the text that I need summarized? Imagine you want to summarize a meeting in Google Meet or shorten an email thread in Gmail; a rough sketch of that workflow, scripted, is below.
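
For what it's worth, here is one way this paste-in summarization test could be scripted against the OpenAI Python library (v1+). The prompt wording, file name, and temperature setting are assumptions for illustration, not what the video used:

```python
# A hedged sketch of the paste-in summarization test, scripted against the
# OpenAI Python library (v1+). Prompt wording, file name, and temperature
# are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

article = open("article.txt", encoding="utf-8").read()  # the pasted text

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the following article in five succinct bullet "
                "points. Use only facts stated in the article; do not add "
                "outside numbers or advice."
            ),
        },
        {"role": "user", "content": article},
    ],
    temperature=0,  # keep the summary close to the source text
)
print(response.choices[0].message.content)
```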

Whatever the interface, the model has to get this right. I pasted the same New York Times article into both Bard and GPT-4, and I am sad to say that Bard fluffed its lines. The link to the article will be in the description, but I've read it carefully, and Bard's summary makes numerous mistakes. Let me scroll down and show you this erroneous summary.

First, it says the Fed is expected to raise interest rates, but doesn't say by whom. Second, it starts chatting about full employment and inflation. Not only is full employment not mentioned in the article at all, it also gets both numbers wrong. The unemployment rate in America isn't currently 3.8% and inflation isn't at 7.9%.

I checked these against the latest data and you can check it yourself, but both are wrong. Bard also keeps going on tangents, like "stocks are typically considered to be riskier investments than bonds." Okay, that's fine, but why are you giving me financial advice? You're supposed to be summarizing an article.

Honestly, it was a pretty unusable summary; so bad that you'd have been better off just not reading it. Trust me, I am not an OpenAI fanboy, but its model is just better currently. Notice how GPT-4's summary doesn't go on tangents, and it clarifies that it's investors who think there will be a quarter-point increase.

The five bullet points are succinct and accurate. This is a pretty colossal loss for Bard. What about light content creation and idea generation? Surely it could do well here. This prompt, in a sense, is a pretty good example of how Bard might be used to create new content: "create eight new YouTube video ideas with titles and synopses on integrating generative AI into retail."

If Bard can't be used by analysts, maybe it can be used by content creators. Not really. I mean, you make your own mind up, but these titles are pretty repetitive and bland. I know I can't really complain because my channel name is AI Explained, but these titles are just unoriginal and these synopses lack detail.

I'll let you read these, but compare them to GPT-4's outputs. Each title is different and the ideas are much more explored and nuanced. Okay, fine. What about email composition? And I have to say, count me a skeptic on this one. I have never found that any model, let alone Bard, can do a decent job at this.

It's not that the emails are bad; it's just that in the time it takes me to teach the model what I want to say, I could have just written the email myself. I'm going to make a prediction at this point: I don't think using language models to write emails is going to become that common.

Of course, feel free to quote me on this in a year's time. Now, you're probably thinking I'm being harsh; this is a perfectly fine email, and I did leave a thumbs up. It's just that I would never use Bard for this purpose, and I would also never use GPT-4. I don't want it to make up all these extra details about what I'm going to discuss with John.

It's just too risky to send an email that has any chance of hallucinations. I know you guys might think that I really love Bing, but it's even worse here. It claims that I've added relevant data and graphs. No, I haven't. I never mentioned anything about data and graphs. Now my boss thinks I'm going to do data and graphs.

What are you doing, Bing? And then you're going to say: why am I using creative mode? Well, if we use balanced mode or precise mode, we go back to the Bard problem. It's an okay email, but look at the length of it. I could have just written it out.

It would have been quicker to write the email than the prompt. I was beginning to lose hope in Bard, so I tried writing assistance. I picked a paragraph that someone I know used for a personal statement to get into university. Of course, they were happy for me to share it.

It's decently written, but could be improved significantly. I asked Bard: "Rewrite this paragraph with better English; make it original, professional, and impactful." Now, Bard did remove some of the errors, but it again went on a wild tangent, trying to sell a career in data science, as if it were some sort of recruiter.

Now I'm not going to be too harsh. If you just take the first paragraph, it's okay. GPT-4's output is better, but still has some problems. Now I think some of you are going to laugh at what happened with Bing. It simply refused to do it twice. I pretty much had to trick Bing to get it to rewrite this paragraph.

First it says, "My mistake. I can't give a response to that right now." I tried again. It said, "Hmm, let's try a different topic. Sorry about that." Finally, I just asked the exact same thing with different words. I said, "Rephrase this text with smoother language." It seemed to like that, and then did the job.

I think it's the best output, but it still has problems. Anyway, this is not a grammar lesson, so let's move to science and physics. And Bard completely flops: it gets this fairly basic physics question wrong. So how can it be a tutor for us? For a student to effectively learn from a tutor, there has to be a degree of trust that the tutor is telling the truth.

GPT-4, by the way, gets this one right. I even asked Bard to come up with a multiple-choice quiz. It definitely came up with the quiz; the problem is, quite a few of the answers were wrong. I didn't check all of them, but look at number seven and number eight.

The correct answer just isn't there. GPT-4 does a lot better, with really interesting questions in increasing order of difficulty. It does have some slip-ups, though. Look at question four: there are two correct answers. One is 1/2, one is 5/10, but they both simplify to the same thing. Both of these failure modes are easy to catch mechanically, as in the sketch below.
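
Neither chatbot does this in the video, but the two failure modes described here (a keyed answer missing from the options, and two options that are secretly the same value) can be caught programmatically. A sketch, with an assumed quiz format:

```python
# A sketch of mechanically sanity-checking a generated multiple-choice quiz.
# The quiz data structure is an assumption; neither Bard nor GPT-4 outputs
# this format directly in the video.
from fractions import Fraction

quiz = [
    # (question, options, keyed answer)
    ("What is 10/20 in lowest terms?", ["1/2", "5/10", "2/5", "1/4"], "1/2"),
]

for question, options, answer in quiz:
    # Failure mode 1: the keyed answer isn't among the options at all.
    if answer not in options:
        print(f"Missing answer: {question!r}")
    # Failure mode 2: two options simplify to the same value (e.g. 1/2 and 5/10).
    values = []
    for opt in options:
        try:
            values.append(Fraction(opt))
        except ValueError:
            values.append(opt)  # non-numeric options compare as plain strings
    if len(set(values)) < len(values):
        print(f"Duplicate-value options: {question!r}")
```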

GPT-4 was also able to give these explanations. I do think the day of AI tutoring is fast approaching; I just don't think it's quite here yet, and certainly not with Bard. I think the point is pretty much proven now, so let's move on to the explanations. Why has Google fallen so far behind?

First, a lot of its top researchers have left. There were eight co-authors at Google on the famous "Attention Is All You Need" paper that introduced the transformer architecture. That's amazing, right? They pretty much invented transformers. The problem is that all but one of the paper's eight co-authors have now left. One joined OpenAI, and others have started their own companies, some of which I'll be covering in future videos.

Speaking of which, if you're learning anything from this video, please don't forget to leave a like and a comment. The next potential reason is that they don't seem to want to interfere with their lucrative search model. As the product lead for Bard said, "I just want to be very clear: Bard is not search." If you haven't seen my initial review of Bard, which pretty much proves that it's terrible at search, do check it out after this video.

If Bard is not designed for search, what is it designed for? As the article points out, they haven't really provided specific use cases. Next: are they worried about safety and accelerationism, or are they looking to buy up a competitor to OpenAI? They invested over $300 million in Anthropic, a company whose stated goal is to work on AI safety and alignment.

So is Google trying to be on the right side of history and place all of its bets on safe AI? Or are they trying to do to Anthropic what Microsoft did to OpenAI itself? I'll be following this particular story quite closely over the coming weeks and months. Next, maybe Google has better models that they genuinely don't want to release, because they don't feel they can release them safely.

They had the Imagen text-to-image model, which was better than DALL-E 2, and they didn't release it. Google said it was because Imagen encoded harmful stereotypes and representations. I dug into the original Imagen paper, and it was indeed much better than DALL-E 2. Google wasn't bluffing: they had a better model. And that wasn't the last time.

In January of this year, they released a paper on Muse, a text-to-image transformer that was better than both Imagen and DALL-E 2. In case anyone thinks they're lying, here, I think, is the proof.

The Muse model outputs are on the right, the Imagen outputs are in the middle, and OpenAI's DALL-E 2 outputs are on the left. It strikes me that Google's Muse is one of the first models to get text right. Midjourney, even Midjourney version 5, definitely can't do this. So why didn't Google release it?

Well, I read to the end of the Muse paper, and they say this: "It's well known that models like Midjourney and Muse can be leveraged for misinformation, harassment, and various types of social and cultural biases. Due to these important considerations, we opt not to release code or a public demo at this point in time." Let me know what you think in the comments, but I think it's more than possible that Google has a language model that's far better than Bard, and even far better than PaLM, perhaps leveraging DeepMind's Chinchilla model. They may genuinely be keeping it back, and not publishing papers on it, because they worry about these kinds of considerations.

Anyway, I do have a final theory about Bard, and it's that Google might have been working on what they regard as more serious models. In December, Google released this paper on Med-PaLM, a language model tailored to help in a medical setting. And if you think its 67.6% accuracy in answering medical questions was good, wait till you hear about Med-PaLM 2, which they've now announced.

Here is a snippet of Google's presentation on Med-PaLM 2, released just a week ago. Today, we're announcing results from Med-PaLM 2, our new and improved model. Med-PaLM 2 has reached 85% accuracy on the medical exam benchmark in research. This performance is on par with expert test takers. It far exceeds the passing score, and it's an 18% leap over our own state-of-the-art results from Med-PaLM.

Med-PaLM 2 also performed impressively on Indian medical exams, and it's the first AI system to exceed the passing score on those challenging questions. But finally, what does this say about the near-term future of Bard? Well, the more users a model gets, the more data it gets, and so the more easily the model can be improved.

As this Forbes article points out, Microsoft now has access to the valuable training data that these products generate, which is a dangerous prospect for an incumbent like Google. And it's not as if Google doesn't know this: the CEO of Google admitted that products like this (he was talking about Bard) get better the more people use them.

It's a virtuous cycle. But does that mean it will be a vicious cycle if everyone uses GPT-4 instead of Bard? With less data, will there be less improvement of Google's model? Only time will tell, and I will be there to test it. Thank you very much for watching, and do have a wonderful day.