
NEW Falcon 180B — the Open Source GPT-4?


Chapters

0:00 Most Powerful Open Source LLM
0:38 Falcon 180B Release
2:25 Hugging Face Falcon 180B
3:10 Falcon 180B on LLM Leaderboards
4:55 Falcon 180B Hardware Requirements
5:47 Cost of Running Falcon 180B
8:24 Falcon 180B vs GPT-4
11:23 Falcon 180B vs GPT-4 on Code
17:49 Falcon 180B Summary

Transcript

Today, we're going to take a look at the new Falcon 180B large language model. It's a 180-billion-parameter model that is reportedly on par with Bard and very close to GPT-4 level performance. At the same time, it's licensed for commercial use, and we can access it ourselves.

Although it does take quite a bit of hardware to actually run the thing. So let's begin by taking a look at what this model is and how it compares to other models that are available right now. OK, so Falcon is from the Technology Innovation Institute.

That's an Abu Dhabi-based research lab over in the UAE. Earlier this year they released Falcon 40B; I did a video on that at the time. It was the best-performing pre-trained LLM on Hugging Face's Open LLM Leaderboard, which is pretty cool. Now, the 180B model is again at the top of those leaderboards.

And you can see here that it ranks just behind OpenAI's GPT-4, which is, I think, pretty impressive, especially when you consider the rumored or leaked sizes of these models. This is a 180B-parameter LLM, whereas GPT-4, if you believe George Hotz and Soumith Chintala of PyTorch, who are both kind of in-the-know people...

If you believe them, then the total number of parameters for GPT-4 would be over a trillion. According to them, GPT-4 is an 8-way mixture-of-experts model: eight experts, each with 220 billion parameters, combined with a mixture-of-experts routing approach. That works out to 8 x 220B, or roughly 1.76 trillion parameters in total.

So the fact that this model comes close to GPT-4 in performance, despite being smaller than even a single one of those GPT-4 experts, is, I think, kind of impressive. But we'll try that out later as well; we'll compare both. Now, as soon as Falcon 180B was announced, Hugging Face also released it on the Hugging Face Hub.

They spoke about it in this blog post, and we can see they have a base and a chat model: the base model is the pre-trained model, and the chat model has been fine-tuned for chat. We can also see that there aren't many downloads so far.

I suppose it is pretty early days, but it's also a big model; it's going to be hard for many people to actually deploy this. They talk about it a little bit here and even show us how we would use it, which is pretty useful.
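For reference, loading it with transformers typically looks something like this. A minimal sketch, assuming access to the tiiuae/falcon-180B checkpoint on the Hub and enough GPU memory for bfloat16 (around 400GB):

```python
# Minimal sketch: loading Falcon 180B with transformers.
# Assumes access to the tiiuae/falcon-180B repo and ~400GB of GPU memory.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-180B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision, per the blog post
    device_map="auto",           # shard layers across all available GPUs
)

inputs = tokenizer("Falcon 180B is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```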

Now, they do mention that it tops the leaderboard for pre-trained open-access models. So we can come over to the Open LLM Leaderboard here, and what we want to do is filter for pre-trained only. OK, so uncheck those. For some reason, I still have fine-tuned models showing.

Uncheck again. Let's try. Oh, this is Streamlit or Gradio being annoying. But anyway, all the models with the little diamond here are fine-tuned; we want to look at just the pre-trained ones. There are fine-tuned models that perform better than Falcon 180B, but if we come down here, the first model that is just pre-trained, not fine-tuned, is Falcon 180B.

So it's the highest-performing pre-trained-only model on the leaderboard. I'm not sure why I'm getting several duplicate entries here; I think something odd is going on with the leaderboard at the moment. But anyway, it is there, at the top, if you exclude all the fine-tuned models.

And I think the idea here is that, yes, right now there are fine-tuned models that are better than the pre-trained Falcon model. That's true. But those fine-tuned models were fine-tuned from lesser pre-trained models. So the expectation is that people will fine-tune Falcon 180B, and that will improve its performance.

We should therefore see, very soon, fine-tuned models that outperform the current crop of fine-tuned models. That's the idea, anyway. Now, what are the hardware requirements? That's going to be pretty important. We can see here that most people are probably going to go for the minimum, which uses GPTQ or INT4 quantization, and that will actually slow things down quite a lot.

So maybe we wouldn't even use that, I'm not sure, but let's say we do. With INT4 quantization, we would need eight A100 GPUs, the 40GB ones. The full-precision model would be float32; here we have float16, which doesn't really degrade quality significantly, or really by any noticeable amount at all.
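As a rough idea of what quantized loading looks like, here is a sketch using bitsandbytes 4-bit (NF4) through transformers, as an alternative to the GPTQ route mentioned above. It assumes roughly 320GB of total GPU memory, e.g. eight 40GB A100s:

```python
# Sketch: 4-bit quantized loading with bitsandbytes (an alternative to GPTQ).
# Assumes ~320GB of total GPU memory, e.g. eight A100 40GB cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-180B"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                        # quantize weights to 4-bit
    bnb_4bit_quant_type="nf4",                # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,    # compute in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```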

Right, so for float16, eight A100 80GB GPUs. How much would that actually cost? So if we take a look at SageMaker, this is the SageMaker pricing page, and we can see all the instances. We need to come down to the ones here that have a GPU.

They have the H100 here. Yeah, I think we can use one of those. Let me check... OK, we need 320GB of memory. Oh yeah, we could. So even for float16 precision, we could actually use this one here, the H100 instance. Now, for some reason, they don't have the actual pricing on this page.

I don't know why; maybe I'm doing something wrong, but it's not here, which I don't understand. Fine, I'm just going to copy this and Google it. Again, no idea why they don't put the pricing on the same page as the instances.

All right, so I think we've come to the right place: on-demand pricing. We need accelerated computing here. Do they even have the P5? I don't see it. OK, maybe it's not accessible for most of us normal people. Let's go with the P4d instead: the p4d.24xlarge, do they have that?

Yes, it's this one, and it has enough memory to fit the quantized model, a fully quantized model. OK, cool, and it will cost us quite a lot: $32.77 an hour. That's about $786 a day, and around $22,000 a month, even on the shortest month you can possibly have. Which is quite a lot.
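The back-of-the-envelope math, taking that $32.77 per hour on-demand rate at face value:

```python
# Rough on-demand cost of a p4d.24xlarge (8x A100 40GB) instance.
hourly = 32.77           # USD per hour, on-demand
daily = hourly * 24      # about $786 per day
monthly = daily * 28     # about $22,000 even in the shortest month
print(f"${daily:,.0f}/day, ${monthly:,.0f}/month")
```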

It makes you appreciate OpenAI's pricing a little bit. So, yeah, that would be relatively expensive to run. But let's say we do want to run it: what's the performance like? OK, let's see, there is a demo somewhere. Oh yeah, it's even on the page.

OK, so we have this little demo here, the Falcon 180B demo. We can ask it questions, and I think it actually runs pretty quickly. So I'm going to ask it to tell me about the latest news on LLMs; I kind of want to see what it knows. Yeah, it doesn't have real-time access.

No, no, no. Is it going to tell me when? OK, I want to know what its knowledge cutoff is. "What's your knowledge cutoff date?" There we go; I'm not sure why I struggled to type that so much. OK, cool. So: tell me about the Llama 2 release.

Let's see if it knows. So it doesn't know what Llama 2 is. Let's go back a little further and ask about ChatGPT: "Tell me about ChatGPT." All right, so we at least get towards the end of 2021, although I'm pretty sure it was 2022. It's an interesting hallucination, if that's what it is: it has a ChatGPT release date but misses the year.

OK, cool. I don't know where it got that date from. All right, so that's wrong. Not looking good for Falcon so far. Oh well, let me test GPT-4. I'm going to ask it, "What is Llama 2?" Let's see what it says. OK, so GPT-4 also doesn't know; it had its last update in September 2021.

Right, so Falcon has more recent knowledge, although it seems to be a bit confused about when that knowledge is from. "Are you sure ChatGPT was released on that date?" I'm just curious. Nice, cool, thank you. So, yeah, it seems to be a little confused about dates here; I'm not sure why that is.

Anyway, nonetheless, Falcon 180B has knowledge of at least November or December 2022, which is at least a year later than GPT-4's cutoff. So that's one thing. Now, let's ask it something coding-related. One of the things I always ask GPT-4 about is code.

So I'm just going to copy this code. This is from a project I covered earlier this year: building a conversational agent using OpenAI's function calling. I'm just kind of curious how hard it would be. You can see here, if you're interested, it's FuncAgent. I think you can even pip install it.

I could be wrong, but I think you can pip install FuncAgent. Now, I want to see how this model does with code. One thing we should do first: there are additional inputs here, and in my opinion we should reduce the temperature, increase the maximum number of new tokens, and reduce top-p and the repetition penalty.

A repetition penalty seems weird for code, where repeating identifiers is normal, so I'll just decrease it a little. Then I'm going to ask, "Can you tell me what this code is doing?", paste my Python code in here, and submit. So they have this nice little interface.
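For context, those demo sliders map onto the usual generation parameters. Here's a hedged sketch of roughly equivalent settings with transformers, reusing the model and inputs from the loading sketch above; the exact values are my own illustrative picks, not the demo's defaults:

```python
# Illustrative generation settings for code-oriented prompts;
# the values are my own picks, not the demo's defaults.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,       # room for a longer explanation
    do_sample=True,
    temperature=0.2,          # low randomness suits code
    top_p=0.9,
    repetition_penalty=1.05,  # near 1.0, since code legitimately repeats names
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```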

Oh, I get an error. OK, so there is actually a limitation here: it tells us the demo is limited to a session length of 1,000 words. So I had to remove a few of the functions, or rather methods, from the code. But that's good, because now we can ask GPT-4 and Falcon 180B:

what is missing, or what needs to be improved? And we can see if they do well on that. So let's see. Now I have the code here: on the left is the full code, 114 lines; on the right is my modified code, 48 lines. OK, so there are a few missing methods here.

I'm missing the call function and final thought answer methods. So let's say, "OK, can you explain this code?" Just copy that and submit. OK, it's really quick to respond, which is cool. So: the code is part of a class called Agent that initializes an API key and a list of functions to be used for natural language processing tasks, and so on.

OK: OpenAI's API, chat history, and internal thoughts. It has picked up on conditions like reaching the limit on internal thoughts, which is good, and on how function calls fit into generation. "Overall, this code implements a basic framework for an AI agent that can respond to user queries and use predefined functions to generate more complex responses."
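To make the structure being described concrete, here is a hypothetical skeleton of such an agent class. It mirrors what the explanation mentions (API key, functions list, chat history, internal thoughts with a limit), but it is not the actual FuncAgent source:

```python
# Hypothetical skeleton mirroring the structure described above;
# this is an illustration of the pattern, not the real FuncAgent code.
class Agent:
    def __init__(self, api_key: str, functions: list):
        self.api_key = api_key            # OpenAI API key
        self.functions = functions        # callables exposed to the model
        self.chat_history = []            # visible conversation messages
        self.internal_thoughts = []       # hidden intermediate steps
        self.internal_thought_limit = 3   # cap to avoid endless tool loops

    def ask(self, query: str) -> str:
        """Run an internal thought loop, dispatching any function calls
        the model makes, until it produces a final answer (omitted here)."""
        self.chat_history.append({"role": "user", "content": query})
        # ... call the chat API, route function calls to self.functions,
        # append results to internal_thoughts, then return the final answer
        raise NotImplementedError("illustrative skeleton only")
```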

So I think that's a pretty good summary. Now, let's try with ChatGPT, specifically GPT-4. Same again: I'm going to ask it to explain this code. "This code defines a Python class called Agent, presumably to utilize models like GPT-4", so on and so on. "Let's break down this code step by step."

So it's going for a different approach; it's actually giving us some of the code. The init method initializes the object, but nothing that insightful right now. It just says that this is to generate responses in a chat-like manner, which is kind of true, but it's missing the point of a few items, like the internal memory.

OK, so I think its description of the components of the code is pretty good. And then we also have the missing pieces. Falcon didn't mention any missing pieces, at least within that initial response, which, to be fair, was a lot shorter. GPT-4, though, seems to have an idea.

"The agent communicates with the OpenAI API, possibly invoking some internal functions, and returns a response back to the user." So it kind of gets it right. It does identify the things I removed, which is cool. But it doesn't really give us an idea of what the code is actually doing.

There's nothing here that's like, "oh, this is an agent, it's using internal thoughts". Actually, no, it does mention that here, but it doesn't tell us how those internal thoughts are used. Falcon kind of does: it says the agent uses predefined functions to generate more complex responses, and that the ask method creates an internal thought process for the agent, which can include function calls and final answers.

Yeah, and it mentions internal thoughts here as well. Overall, I think this explanation is easy to understand, to me at least. But it hasn't mentioned anything about what is missing. So I'm just going to ask, "Are there any missing methods in this code?" and see what we get. OK: "However, without additional context about the purpose of this class, it's difficult to determine if there are any missing methods."

Yeah, it's not really picking up on the fact that we're calling methods that are missing. Let me try to be more specific: "Are there any class methods that are referred to, but missing, in this code block?" I don't think I can get any more specific than that. All right, so it doesn't seem to do so well at identifying issues with the code.

Interesting. But I think its actual explanation of the code is pretty good. So, when you're developing something, GPT-4 is probably going to be more useful, because it's way more specific. Overall, there's quite a lot to consider here. Performance is great: as they said, it's maybe not too far behind GPT-4, but obviously not quite there yet. Deployment?

Yes, very hard and expensive. It's really hard to justify against just paying OpenAI's prices, which are definitely more cost-effective. But at the same time, if privacy and keeping your data local matter, especially in the EU where you have things like GDPR, there are some cases where this sort of model is the only alternative you'd have to something like GPT-4, despite the cost.

It's going to be an expensive alternative for now, unfortunately, but I'm sure the price will come down over time, and there will be more optimized ways of deploying these models. Anyway, that's it for this video. As I mentioned, I think this is really cool, and I'm sure we're going to see a lot of interesting fine-tuned models come out of this that will be even cooler. But I'll leave it there for now.

So thank you very much for watching. I hope it's been useful and interesting, and I will see you again in the next one. Bye.