
NEW Falcon 180B — the Open Source GPT-4?


Chapters

0:00 Most Powerful Open Source LLM
0:38 Falcon 180B Release
2:25 Hugging Face Falcon 180B
3:10 Falcon 180B on LLM Leaderboards
4:55 Falcon 180B Hardware Requirements
5:47 Cost of Running Falcon 180B
8:24 Falcon 180B vs GPT-4
11:23 Falcon 180B vs GPT-4 on Code
17:49 Falcon 180B Summary

Whisper Transcript

00:00:00.000 | Today, we're going to take a look at the new Falcon 180B large language model.
00:00:05.240 | It is a 180 billion parameter model.
00:00:08.880 | It is apparently on par with the Bard LLM and is very close to GPT-4-level performance.
00:00:16.960 | At the same time, it's licensed for commercial use and we can obviously access it ourselves.
00:00:23.880 | Although obviously it does take quite a bit of hardware to actually run the thing.
00:00:28.320 | So let's begin by just taking a look at what this model is and how it compares
00:00:36.080 | to other models that are available right now.
00:00:38.760 | OK, so Falcon, again, is from the Technology Innovation Institute,
00:00:45.520 | over in the UAE.
00:00:46.600 | It's an Abu Dhabi-based, I assume, research lab.
00:00:51.280 | And earlier this year, they actually released Falcon 40B.
00:00:55.720 | I did a video on that at the time.
00:00:58.160 | It was the best performing LLM or pre-trained LLM on Hugging Face's LLM leaderboards,
00:01:05.120 | which is pretty cool.
00:01:07.120 | Now, the 180B model is actually, again, at the top of those leaderboards.
00:01:16.720 | And you can see here that it ranks just behind OpenAI's GPT-4,
00:01:20.960 | which is, I think, pretty impressive,
00:01:24.040 | especially when you consider the hypothesized or, sort of, leaked sizes of these models.
00:01:31.680 | This is a 180B parameter LLM.
00:01:34.920 | And compare that to GPT-4: if you believe George Hotz and Soumith Chintala of PyTorch,
00:01:42.760 | who are both, you know, kind of in-the-know people.
00:01:46.520 | If you believe them, then the total number
00:01:50.160 | of parameters for GPT-4 would be over a trillion.
00:01:53.400 | And that's because, according to them,
00:01:56.120 | GPT-4 is 220 billion parameters in each head, and it's an 8-way mixture model.
00:02:01.520 | Eight models, each with 220 billion parameters,
00:02:05.880 | all combined with a mixture of experts approach.
00:02:09.640 | So the fact that this model is close in performance to GPT-4,
00:02:15.120 | despite being smaller than just one of GPT-4's expert models, I think that is kind of impressive.
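To make the mixture-of-experts idea concrete, here is a minimal, purely illustrative sketch; the expert count, sizes, and routing below are placeholders and this is not OpenAI's actual architecture or code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(token_repr, experts, router_weights, top_k=2):
    """Route a token to the top-k experts and mix their outputs.

    `experts` stands in for the large sub-models (e.g. ~220B parameters each);
    only the selected experts actually run, which is why the total parameter
    count can far exceed the compute spent per token.
    """
    gate_scores = softmax(router_weights @ token_repr)  # one score per expert
    top_experts = np.argsort(gate_scores)[-top_k:]      # pick the k best experts
    output = np.zeros_like(token_repr)
    for i in top_experts:
        output += gate_scores[i] * experts[i](token_repr)
    return output

# Toy usage: 8 "experts", each just a random linear map here.
dim, n_experts = 16, 8
rng = np.random.default_rng(0)
experts = [
    (lambda x, W=rng.normal(size=(dim, dim)): W @ x) for _ in range(n_experts)
]
router_weights = rng.normal(size=(n_experts, dim))
print(moe_forward(rng.normal(size=dim), experts, router_weights).shape)  # (16,)
```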
00:02:22.640 | But we'll try that out later as well.
00:02:24.800 | We'll compare both.
00:02:25.880 | So as soon as Falcon 180B was announced,
00:02:29.760 | Hugging Face also released it on the Hugging Face Hub.
00:02:32.960 | So they spoke about it in this blog post.
00:02:36.680 | And we can actually see, OK, they have a base and a chat model, right?
00:02:40.960 | So the base model, the pre-trained model
00:02:43.280 | and chat model, which has been fine-tuned for chat.
00:02:46.600 | And we can see here, not that many downloads so far.
00:02:50.200 | I suppose it is pretty early days, but also it's a big model.
00:02:54.800 | It's going to be hard for many people to actually deploy this.
00:03:00.200 | OK, so, yeah, they talk about it a little bit here and even show us how we would use
00:03:06.120 | it, OK, which is pretty useful.
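For reference, the usage they show follows the standard transformers text-generation pattern; a rough sketch (assuming you have accepted the model license on the Hub and have enough GPU memory) looks something like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-180B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision: roughly 360GB of weights alone
    device_map="auto",           # shard the layers across all available GPUs
)

inputs = tokenizer("Falcon 180B is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```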
00:03:10.040 | Now, they do mention that it tops the leaderboard for pre-trained open-access models.
00:03:16.560 | So we can come over to the Open LLM Leaderboard here.
00:03:19.720 | And then what we can do is we want to look for pre-trained only.
00:03:24.600 | OK, so uncheck those.
00:03:27.680 | So for some reason, I still have fine-tuned on there.
00:03:30.320 | Uncheck again. Let's try.
00:03:32.560 | Oh, this is Streamlit or Gradio being annoying.
00:03:35.960 | But anyway, OK, so all these with the little diamond here are the fine-tuned
00:03:41.160 | models. We want to look at just the pre-trained ones, right?
00:03:43.760 | There are fine-tuned models that perform better than
00:03:48.240 | Falcon 180B, but we come down here, the first model that is actually just
00:03:52.960 | pre-trained, not fine-tuned, is Falcon 180B here, right?
00:03:56.400 | So it's the highest performing pre-trained only model on the leaderboards.
00:04:01.520 | I'm not sure why I'm getting several here.
00:04:03.840 | I think there's something going on with the leaderboard at the moment.
00:04:08.320 | But anyway, it is there.
00:04:10.160 | Yes, something very odd.
00:04:11.360 | Anyway, it's there.
00:04:13.480 | It's at the top if you don't include all these fine-tuned models.
00:04:17.160 | And I think the idea here is really that, OK,
00:04:21.360 | yeah, right now there's all these fine-tuned models that are better than
00:04:25.400 | the Falcon model, the pre-trained Falcon model.
00:04:28.440 | That's true.
00:04:30.200 | But these fine-tuned models are fine-tuned from, like, lesser pre-trained models.
00:04:36.560 | So the idea is that people are going to fine-tune Falcon 180B
00:04:41.400 | that will improve the performance.
00:04:43.280 | And technically, we should see very soon fine-tuned models
00:04:48.920 | that are higher performance than these fine-tuned models.
00:04:53.200 | That's the idea anyway.
00:04:54.880 | What are the hardware requirements?
00:04:56.400 | That's going to be pretty important.
00:04:58.160 | So we can see here that most people are probably going to go for the minimum.
00:05:03.480 | OK, so this is using GPTQ quantization or
00:05:08.360 | INT4 quantization, which will actually slow things down quite a lot.
00:05:13.160 | So maybe we wouldn't even use that.
00:05:14.920 | I'm not sure.
00:05:16.640 | But let's say we do.
00:05:18.080 | OK, we're using INT4 quantization.
00:05:22.360 | That would require eight A100 GPUs, the 40GB ones, OK?
00:05:28.360 | The full precision model would be float32.
00:05:32.600 | Here we have float16, which doesn't really
00:05:35.760 | degrade performance significantly, or really by a noticeable amount at all.
00:05:41.560 | Right, so that's eight A100 80GB GPUs.
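If you did go down the INT4 route, a sketch of what 4-bit loading looks like with the transformers bitsandbytes integration (an alternative to pre-quantized GPTQ weights; exact memory use and speed will depend on your setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-180B"

# 4-bit quantization shrinks the weight footprint to roughly a quarter of fp16,
# which is what brings it into range of the 8x A100 40GB setup mentioned above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread the quantized shards across the GPUs
)
```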
00:05:46.080 | How much would that actually cost, right?
00:05:48.800 | How much would that cost?
00:05:51.800 | If we take a look at SageMaker,
00:05:53.880 | so this is a SageMaker pricing page and we can see all the instances.
00:05:59.160 | We need to come down to these ones here, right where they have a GPU model.
00:06:03.960 | They have the H100 here.
00:06:06.880 | Yeah, I think we can use one of those.
00:06:11.960 | Check, no, no, no, no.
00:06:13.760 | OK, we need 320GB of memory.
00:06:18.440 | Oh, yeah, we could. OK, so
00:06:21.040 | even for float16 precision, we could actually use this one here.
00:06:25.960 | So the H100 instance.
00:06:28.800 | Now, for some reason, they don't have the actual pricing on this page.
00:06:33.360 | I don't know why. Maybe it's just me, I don't know.
00:06:35.880 | It's not here.
00:06:37.680 | Which I don't understand, but fine, I'm just going to copy this and
00:06:43.160 | Google it.
00:06:44.840 | Again, no idea why they don't put the pricing on the same page as the instances.
00:06:51.320 | Maybe I'm just doing something wrong.
00:06:54.880 | All right, so I think we've come to the right place here, on-demand pricing.
00:07:00.280 | We need the accelerated computing section here.
00:07:05.920 | Do they even have the P5 here?
00:07:10.640 | I don't see it.
00:07:13.320 | OK, maybe it's not accessible for most of us normal people.
00:07:18.360 | Let's go with the P4d.
00:07:21.120 | The p4d.24xlarge, do they have that?
00:07:25.120 | Yes, it's this one.
00:07:28.680 | That is enough to fit our quantized model, like a fully quantized model.
00:07:36.920 | Cool, and that will cost us quite a lot.
00:07:41.720 | Thirty-two
00:07:44.240 | point seven seven dollars an hour.
00:07:49.160 | Seven hundred and eighty-six dollars a day.
00:07:56.680 | And that will be
00:07:58.440 | around twenty-two thousand dollars a month, and that's on a good month, you know,
00:08:05.720 | the shortest month you can possibly have. Which is quite a lot.
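As a quick back-of-the-envelope check of those numbers (using the on-demand p4d.24xlarge rate quoted above; actual rates vary by region):

```python
# Rough on-demand cost for a p4d.24xlarge (8x A100 40GB) instance.
hourly_rate = 32.77              # USD per hour, on-demand (region-dependent)
per_day = hourly_rate * 24       # ~$786 per day
per_month = per_day * 28         # ~$22,000 even in the shortest month
print(f"${per_day:,.0f}/day, ${per_month:,.0f}/month")
```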
00:08:10.960 | It makes you appreciate OpenAI's pricing a little bit.
00:08:16.200 | So, yeah, that would be relatively expensive to run.
00:08:21.520 | But let's say we do want to run it.
00:08:24.920 | What's the performance like?
00:08:28.160 | Let's see, somewhere there is a demo.
00:08:33.800 | Oh, yeah, it's even on the page.
00:08:36.040 | OK, so we have this little demo here, the Falcon 180B demo.
00:08:41.480 | We can ask it questions, and I think this actually runs pretty quickly.
00:08:48.040 | I'm going to ask it to tell me about the latest news on LLMs.
00:08:56.160 | Right. So I kind of want to see.
00:08:58.640 | Yeah, I don't have real time access.
00:09:01.840 | No, no, no. Is it going to tell me when?
00:09:04.400 | OK, I want to know what its knowledge cutoff is.
00:09:09.840 | Is your knowledge cut off?
00:09:13.240 | Here we go.
00:09:14.520 | What's your knowledge cut off date? I'm not sure why I struggled with that so much.
00:09:18.800 | OK, cool.
00:09:20.960 | So tell me about the Llama 2 release.
00:09:26.200 | Let's see if it knows.
00:09:28.200 | So it doesn't know what Llama 2 is.
00:09:31.800 | Let's go back a little further.
00:09:34.200 | Let's ask about ChatGPT.
00:09:36.400 | Tell me about ChatGPT.
00:09:39.400 | All right, so
00:09:41.920 | we at least get towards the end of
00:09:45.960 | 2021.
00:09:49.920 | I'm pretty sure it was 2022.
00:09:54.880 | It's an interesting hallucination.
00:09:57.400 | It's given the ChatGPT release date, but it's missed it by a year.
00:10:02.200 | Okay, cool.
00:10:04.520 | I don't know where it got that from.
00:10:07.280 | All right.
00:10:08.480 | So that's wrong.
00:10:10.760 | Not looking good for Falcon so far.
00:10:12.960 | Oh, well, let me test GPT-4.
00:10:18.600 | I'm going to ask it what is Llama 2?
00:10:22.560 | Let's see what it says.
00:10:24.600 | Okay.
00:10:25.800 | So GPT-4 also doesn't know; it had its last update in September 2021.
00:10:32.640 | Right.
00:10:33.640 | So Falcon has more recent knowledge,
00:10:37.640 | although it seems to be a bit confused about when that knowledge is from.
00:10:42.240 | Are you sure ChatGPT was released on that date?
00:10:48.280 | I'm just curious.
00:10:51.920 | Nice. Cool.
00:10:54.640 | Thank you.
00:10:55.440 | So, yeah, it seems to be a little confused about dates here.
00:10:59.840 | I'm not sure why that is.
00:11:01.000 | Anyway, nonetheless, we at least get December or November 2022.
00:11:07.600 | Falcon 180B has knowledge of that date,
00:11:10.360 | which is roughly a year, or at least a year, later than GPT-4's.
00:11:16.160 | So that's one thing.
00:11:17.680 | Now, okay, let's ask it something coding related.
00:11:22.560 | Now, one of the things I always ask GPT-4 about is code.
00:11:30.800 | So I'm just going to copy this code.
00:11:33.120 | And this is from a project I covered earlier this year,
00:11:37.920 | basically creating an agent,
00:11:41.040 | a conversational agent using OpenAI's function calling.
00:11:45.160 | I'm just kind of curious how hard it would be.
00:11:47.520 | You can see here, if you're interested, it's FuncAgent.
00:11:50.440 | I think you can pip install it even.
00:11:53.040 | I could be wrong. I think you can pip install FuncAgent.
00:11:56.400 | Now, I want to see how does this model do with code?
00:12:02.440 | Right. So one thing that we should do: there are additional inputs here.
00:12:07.000 | We should reduce the temperature, in my opinion,
00:12:10.520 | increase the number of maximum new tokens,
00:12:14.120 | reduce top P and repetition penalty.
00:12:18.480 | I don't think it quite fits for code.
00:12:23.520 | It seems weird to have a repetition penalty.
00:12:25.680 | I'll just decrease it a little bit.
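In API terms, the settings being nudged here would look something like the following; a hypothetical example, with parameter names mirroring the usual Hugging Face generation arguments and values that are just the kind of thing I'd pick for code:

```python
# Generation settings tilted toward code explanation: low randomness,
# room for a long answer, and only a mild repetition penalty, since code
# legitimately repeats tokens a lot.
generation_kwargs = {
    "temperature": 0.1,
    "max_new_tokens": 1024,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
}
```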
00:12:28.720 | So I'm going to say, can you tell me what this code is doing?
00:12:37.520 | I'm going to give it the Python, paste my code in here and submit.
00:12:46.520 | So they have this nice little interface.
00:12:49.200 | Oh, I get an error.
00:12:50.880 | Okay. So there is actually a limitation here.
00:12:53.680 | It tells us the demo is limited to a session length of 1000 words.
00:12:57.800 | So I had to remove a few of the functions or methods from the code.
00:13:01.720 | But that's good, because now we can ask GPT-4 and Falcon 180B:
00:13:08.240 | What is missing or what needs to be improved?
00:13:11.640 | And we can see if they do well on that.
00:13:14.280 | So let's see.
00:13:16.160 | Now I have the code here. On the left is the full code;
00:13:20.920 | it's 114 lines. On the right is my modified code, 48 lines.
00:13:27.000 | Okay. So there are a few missing methods here.
00:13:29.240 | I'm missing the call-function and final-thought-answer methods.
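For context, the class being shown follows roughly this shape; a simplified, hypothetical skeleton rather than the actual FuncAgent source, with `_call_function` and `_final_thought_answer` standing in for the two methods I removed:

```python
import openai

class Agent:
    """Simplified sketch of a function-calling conversational agent."""

    def __init__(self, openai_api_key, functions=None):
        openai.api_key = openai_api_key
        self.functions = functions or []   # function schemas exposed to the model
        self.chat_history = []             # running user/assistant messages
        self.internal_thoughts = []        # intermediate function-call steps
        self.internal_thought_limit = 3    # stop condition on internal thoughts

    def ask(self, query):
        """Answer a user query, looping over function calls until a final answer."""
        self.internal_thoughts = []
        self.chat_history.append({"role": "user", "content": query})
        while len(self.internal_thoughts) < self.internal_thought_limit:
            message = self._generate_response()
            if message.get("function_call"):
                # The model asked for one of our functions: run it, store the
                # result as an internal thought, and let the model continue.
                self.internal_thoughts.append(self._call_function(message))
            else:
                break
        final_answer = self._final_thought_answer()
        self.chat_history.append(final_answer)
        return final_answer

    def _generate_response(self):
        # One chat-completion call with the function schemas attached.
        raise NotImplementedError

    def _call_function(self, message):
        # Execute the function the model requested and wrap up its result.
        raise NotImplementedError

    def _final_thought_answer(self):
        # Fold the internal thoughts into a single user-facing answer.
        raise NotImplementedError
```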
00:13:32.240 | So let's say, okay, can you explain this code?
00:13:38.560 | Just copy that and let's submit.
00:13:41.960 | Okay.
00:13:44.400 | So I mean, it's really quick to respond, which is cool.
00:13:47.400 | So: the code is part of a class called Agent
00:13:51.880 | that initializes an API key and a list of functions to be used for natural
00:13:57.160 | language processing tasks, so on and so on.
00:13:59.320 | Okay.
00:14:00.520 | OpenAI's API, chat history, and internal thoughts.
00:14:03.520 | So it has got the sort of stop conditions, like reaching the limit on internal thoughts.
00:14:08.760 | I think that's cool.
00:14:09.600 | It's good.
00:14:10.320 | It handles function calls and response generation.
00:14:12.520 | Okay. Overall, this code implements a basic
00:14:15.560 | framework for an AI agent that can respond to user queries and use predefined
00:14:20.800 | functions to generate more complex responses.
00:14:23.080 | So I think that's a pretty good summary.
00:14:26.080 | Now, let's try with ChatGPT, with GPT-4.
00:14:30.680 | So, same again, I'm going to ask it to explain this code.
00:14:35.840 | This code defines a Python class called Agent,
00:14:39.120 | presumably to utilize models like GPT-4, so on and so on.
00:14:42.840 | Let's break down this code step by step.
00:14:44.520 | So it's going for a different approach.
00:14:46.160 | It's actually giving us some of the code.
00:14:50.680 | So the init method initializes the objects, plus,
00:14:54.880 | nothing, nothing that insightful right now.
00:14:59.840 | It hasn't quite got it; it just says that this is to generate responses in a chat-like
00:15:06.880 | manner, which is, yes, kind of true, but it's missing kind of the point of a few items,
00:15:12.800 | like the internal memory. Okay.
00:15:15.200 | So I think the definition of the components of the code is pretty good.
00:15:20.600 | All right.
00:15:22.000 | And then we also have missing pieces here.
00:15:23.880 | So Falcon, at least within that initial response,
00:15:29.000 | which is a lot shorter, to be fair, didn't mention the missing pieces.
00:15:33.800 | Okay.
00:15:35.000 | So it seems to have an idea.
00:15:37.800 | So: the agent communicates with the OpenAI API,
00:15:40.560 | possibly invoking some internal functions, and returns a response back to the user.
00:15:44.720 | So it kind of gets it right.
00:15:47.720 | It does identify these things I'm missing, which is cool.
00:15:51.920 | But it doesn't really give us an idea of what the code is actually doing.
00:15:57.680 | There's nothing here that's like, oh, this is an agent.
00:16:00.840 | It's using internal thoughts.
00:16:03.200 | Oh, no, it does mention it here,
00:16:06.640 | but it doesn't tell us how those internal thoughts are used. The Falcon one here
00:16:14.120 | kind of does.
00:16:15.720 | It uses predefined functions to generate more complex responses.
00:16:20.960 | The ask method creates an internal thought process for the agent,
00:16:24.760 | which can include function calls and final answers.
00:16:27.640 | Yeah.
00:16:28.920 | And it mentions internal thoughts here as well.
00:16:32.560 | I think overall, this explanation is easy to understand, to me at least.
00:16:37.360 | But it hasn't mentioned anything about what is missing.
00:16:40.760 | I'm just going to ask, are there any missing methods in this code?
00:16:50.080 | See what we get.
00:16:52.000 | Okay.
00:16:53.320 | However, without additional context about the purpose of this class,
00:16:57.160 | it's difficult to determine if there are any missing methods.
00:16:59.320 | Yeah. It's not really picking up on the fact
00:17:01.200 | that we're calling methods that are missing.
00:17:03.480 | Let me try and be more specific.
00:17:05.400 | Are there any class methods that are referred to, but missing in this code block?
00:17:18.520 | I don't think I can get any more specific than that.
00:17:23.840 | Alright.
00:17:26.560 | So it doesn't seem to do so well on kind of identifying issues with the code.
00:17:31.480 | Interesting. But I think the actual explanation of the code is pretty good.
00:17:37.080 | So that's kind of, I think, okay: when you're developing something,
00:17:44.240 | GPT-4 is probably going to be more useful there because it's way more specific.
00:17:49.720 | I mean, there's quite a lot to consider here.
00:17:52.200 | Performance is great.
00:17:53.440 | As they said, it's maybe not too far behind GPT-4,
00:17:57.760 | but obviously not quite there yet. Deploying?
00:18:01.040 | Yes, very hard and expensive.
00:18:04.240 | It's really hard to justify against just paying OpenAI's prices.
00:18:09.680 | That is definitely more cost-effective.
00:18:12.000 | But at the same time, if privacy and keeping your data local matter,
00:18:16.520 | especially in the EU, where you have things like GDPR,
00:18:20.680 | there are maybe some cases where this sort of model is actually
00:18:26.520 | the only alternative you would have to something like GPT-4, despite the cost.
00:18:33.080 | It's going to be an expensive alternative, unfortunately, for now.
00:18:36.680 | But I'm sure over time it will probably decrease in price.
00:18:40.800 | There will be more optimized ways of deploying these models.
00:18:44.200 | But anyway, for now, that's it for this video.
00:18:48.800 | As I mentioned, I think this is really cool.
00:18:51.320 | I'm sure we're going to see a lot of interesting fine-tuned models come out
00:18:55.480 | of this that will be even cooler, but yeah, I'll leave it there for now.
00:18:59.200 | So thank you very much for watching.
00:19:00.840 | I hope it's been useful and interesting, and I will see you again in the next one.