A Hackers' Guide to Language Models
Chapters
0:00 Introduction & Basic Ideas of Language Models
18:05 Limitations & Capabilities of GPT-4
31:28 AI Applications in Code Writing, Data Analysis & OCR
38:50 Practical Tips on Using OpenAI API
46:36 Creating a Code Interpreter with Function Calling
51:57 Using Local Language Models & GPU Options
59:33 Fine-Tuning Models & Decoding Tokens
65:37 Testing & Optimizing Models
70:32 Retrieval Augmented Generation
80:08 Fine-Tuning Models
86:00 Running Models on Macs
87:42 Llama.cpp & Its Cross-Platform Abilities
00:00:00.000 |
Hi, I am Jeremy Howard from Fast.ai and this is a hacker's guide to language models. 00:00:10.100 |
When I say a hacker's guide, what we're going to be looking at is a code first approach 00:00:15.260 |
to understanding how to use language models in practice. 00:00:20.220 |
So before we get started, we should probably talk about what is a language model. 00:00:25.940 |
I would say that this is going to make more sense if you know the kind of basics of deep learning. 00:00:35.480 |
If you don't, I think you'll still get plenty out of it and there'll be plenty of things you can take away. 00:00:41.160 |
But if you do have a chance, I would recommend checking out course.fast.ai, which is a free 00:00:46.040 |
course and specifically if you could at least kind of watch, if not work through the first 00:00:56.280 |
five lessons that would get you to a point where you understand all the basic fundamentals 00:01:01.800 |
of deep learning that will make this lesson tutorial make even more sense. 00:01:09.800 |
Maybe I shouldn't call this a tutorial, it's more of a quick run-through, so I'm going 00:01:13.760 |
to try to run through all the basic ideas of language models, how to use them, both open source ones and OpenAI's. 00:01:22.720 |
And it's all going to be based using code as much as possible. 00:01:27.800 |
So let's start by talking about what a language model is. 00:01:32.640 |
And so as you might have heard before, a language model is something that knows how to predict 00:01:36.720 |
the next word of a sentence or knows how to fill in the missing words of a sentence. 00:01:42.240 |
And we can look at an example of one: OpenAI has a language model, text-davinci-003, 00:01:49.200 |
and we can play with it by passing in some words and asking it to predict what the next word will be. 00:01:57.320 |
So if we pass in, "When I arrived back at the panda breeding facility after the extraordinary 00:02:02.440 |
rain of live frogs, I couldn't believe what I saw." 00:02:05.600 |
I just came up with that yesterday and I thought what might happen next. 00:02:16.040 |
Nat.dev lets us play with a variety of language models, and here I've selected text-davinci-003, 00:02:22.800 |
and I'll hit submit, and it starts printing stuff out. 00:02:27.600 |
The pandas were happily playing and eating the frogs that had fallen from the sky. 00:02:31.520 |
It was an amazing sight to see these animals taking advantage of such unique opportunity. 00:02:37.520 |
So quick measures to ensure the safety of the pandas and the frogs. 00:02:41.880 |
That's what happened after the extraordinary rain of live frogs at the panda breeding facility. 00:02:46.260 |
You'll see here that I've enabled show probabilities, which is a thing in Nat.dev where it shows how likely each predicted word was. 00:02:54.880 |
It's pretty likely the next word here is going to be "the," and after "the," since we're 00:02:58.960 |
talking about a panda breeding facility, it's going to be "pandas were," and what were 00:03:03.360 |
Well, they could have been doing a few things. 00:03:05.000 |
They could have been doing something happily, or the pandas were having, the pandas were 00:03:14.400 |
It thought it was 20% likely it's going to be happily, and what were they happily doing? 00:03:18.640 |
Could have been playing, hopping, eating, and so forth. 00:03:26.320 |
So they're eating the frogs "that", and then "had" almost certainly. 00:03:31.140 |
So you can see what it's doing at each point is it's predicting the probability of a variety of possible next words. 00:03:37.760 |
And depending on how you set it up, it will either pick the most likely one every time, 00:03:44.080 |
or you can change, muck around with things like p-values and temperatures to change what it picks. 00:03:53.680 |
So at each time, then it'll give us a different result, and this is kind of fun. 00:04:03.680 |
Frogs perched on the heads of some of the pandas, it was an amazing sight, et cetera, 00:04:18.560 |
Now you might notice here it hasn't predicted pandas, it's predicted "pand". 00:04:41.760 |
So you can see that it's not always predicting words, specifically what it's doing is predicting tokens. 00:04:48.080 |
Tokens are either whole words or subword units, pieces of a word, or it could even be punctuation 00:05:02.520 |
So for example, we can use the actual tokenization process to create tokens from a string. 00:05:10.680 |
We can use the same tokenizer that GPT uses by using tiktoken, and we can specifically 00:05:17.040 |
say we want to use the same tokenizer that that model, text-davinci-003, uses. 00:05:23.200 |
And so for example, when I earlier tried this, it talked about the frog splashing. 00:05:28.800 |
And so I thought, well, we'll encode they are splashing. 00:05:35.760 |
And what those numbers are, they're basically just lookups into a vocabulary that OpenAI in this case created, 00:05:42.000 |
and if you train your own models, you'll be automatically creating one, or your code will create one for you. 00:05:46.880 |
And if I then decode those, it says, oh, these numbers are " They", " are", " spl", "ashing". 00:05:57.920 |
And so put that all together, they are splashing. 00:06:00.160 |
So you can see that the space before the start of a word is also being encoded here. 00:06:12.240 |
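As a minimal sketch of that round trip, here is roughly what the tiktoken calls look like (assuming the tiktoken package is installed):

```python
import tiktoken

# use the same tokenizer that text-davinci-003 uses
enc = tiktoken.encoding_for_model("text-davinci-003")

toks = enc.encode("They are splashing")
print(toks)  # a short list of integer ids, i.e. lookups into the vocabulary

# decoding each id separately shows the subword pieces,
# roughly: ' They', ' are', ' spl', 'ashing'
print([enc.decode([t]) for t in toks])
```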
So these language models are quite neat, that they can work at all, but they're not of themselves particularly useful. 00:06:30.880 |
The basic idea of what ChatGPT, GPT-4, Bard, et cetera, are doing comes from a paper which 00:06:42.600 |
describes an algorithm that I created back in 2017 called ULMFit. 00:06:47.800 |
And Sebastian Ruder and I wrote a paper up describing the ULMFit approach, which was 00:06:52.040 |
the one that basically laid out what everybody's doing, how this system works. 00:06:59.960 |
Step one is language model training, but you'll see this is actually from the paper. 00:07:07.860 |
Now what language model pre-training does is this: it is the thing which predicts the next word. 00:07:14.160 |
And so in the original ULMFit paper, the algorithm I developed in 2017 that Sebastian Ruder and I then wrote up: 00:07:24.760 |
What I originally did was I trained this language model on Wikipedia. 00:07:28.920 |
Now what that meant is I took a neural network and a neural network is just a function. 00:07:35.960 |
If you don't know what it is, it's just a mathematical function that's extremely flexible 00:07:41.080 |
And initially it can't do anything, but using stochastic gradient descent or SGD, you can 00:07:47.120 |
teach it to do almost anything if you give it examples. 00:07:50.780 |
And so I gave it lots of examples of sentences from Wikipedia. 00:07:54.240 |
So for example, from the Wikipedia article for the birds, the birds is a 1963 American 00:07:59.760 |
natural horror thriller film produced and directed by Alfred, and then it would stop. 00:08:06.520 |
And so then the model would have to guess what the next word is. 00:08:10.400 |
And if it guessed Hitchcock, it would be rewarded. 00:08:14.040 |
And if it guessed something else, it would be penalized. 00:08:17.480 |
And effectively, basically it's trying to maximize those rewards. 00:08:20.840 |
It's trying to find a set of weights for this function that makes it more likely that it would predict Hitchcock. 00:08:26.880 |
And then later on in this article, it reads from Wikipedia, "Annie previously dated Mitch 00:08:32.080 |
but ended it due to Mitch's cold, overbearing mother Lydia, who dislikes any woman in Mitch's..." 00:08:39.040 |
Now you can see that filling this in actually requires being pretty thoughtful because there's 00:08:44.440 |
a bunch of things that like kind of logically could go there. 00:08:48.160 |
Like a woman could be in Mitch's closet, could be in Mitch's house. 00:08:57.120 |
And so, you know, you could probably guess in the Wikipedia article describing the plot 00:09:01.240 |
of the birds is actually any woman in Mitch's life. 00:09:05.440 |
Now to do a good job of solving this problem as well as possible of guessing the next word 00:09:13.920 |
of sentences, the neural network is gonna have to learn a lot of stuff about the world. 00:09:23.360 |
It's gonna learn that there are things called objects, that there's a thing called time, 00:09:28.840 |
that objects react to each other over time, that there are things called movies, that 00:09:35.000 |
movies have directors, that there are people, that people have names and so forth, and that 00:09:39.960 |
a movie director is Alfred Hitchcock and he directed horror films and so on and so forth. 00:09:48.000 |
It's gonna have to learn an extraordinary amount if it's gonna do a really good job of predicting the next word. 00:09:54.860 |
Now these neural networks specifically are deep neural networks, this is deep learning, 00:10:00.720 |
and in these deep neural networks which have, when I created this, I think it had like a 00:10:06.240 |
hundred million parameters, nowadays they have billions of parameters, it's got the ability 00:10:15.360 |
to create a rich hierarchy of abstractions and representations which it can build on. 00:10:23.740 |
And so this is really the key idea behind neural networks and language models, is that 00:10:31.800 |
if it's gonna do a good job of being able to predict the next word of any sentence in 00:10:36.780 |
any situation, it's gonna have to know an awful lot about the world, it's gonna have 00:10:41.240 |
to know about how to solve math questions or figure out the next move in a chess game 00:10:52.960 |
Now nobody says it's gonna do a good job of that, so it's a lot of work to create and 00:11:00.600 |
train a model that is good at that, but if you can create one that's good at that, it's 00:11:04.520 |
gonna have a lot of capabilities internally that it would have to be drawing on to be able to do that task well. 00:11:13.160 |
So the key idea here for me is that this is a form of compression, and this idea of the 00:11:20.280 |
relationship between compression and intelligence goes back many, many decades, and the basic 00:11:26.760 |
idea is that yeah, if you can guess what words are coming up next, then effectively you're 00:11:34.440 |
compressing all that information down into a neural network. 00:11:41.400 |
Now I said this is not useful of itself, well why do we do it? 00:11:45.620 |
Well we do it because we want to pull out those capabilities, and the way we pull out 00:11:50.480 |
those capabilities is we take two more steps. 00:11:53.440 |
The second step is we do something called language model fine-tuning, and in language 00:11:58.440 |
model fine-tuning we are no longer just giving it all of Wikipedia or nowadays we don't just 00:12:04.520 |
give it all of Wikipedia, but in fact a large chunk of the internet is fed in for pre-training. 00:12:12.160 |
In the fine-tuning stage we feed it a set of documents a lot closer to the final task 00:12:19.400 |
that we want the model to do, but it's still the same basic idea, it's still trying to predict the next word. 00:12:29.640 |
After that we then do a final classifier fine-tuning, and in the classifier fine-tuning this is 00:12:36.600 |
the kind of end task we're trying to get it to do. 00:12:41.040 |
Nowadays for these two steps very specific approaches are taken. 00:12:45.000 |
For the step two, the step B, the language model fine-tuning, people nowadays do a particular thing called instruction tuning. 00:12:53.120 |
The idea is that the task we want most of the time to achieve is solve problems, answer 00:13:00.280 |
questions, and so in the instruction tuning phase we use datasets like this one. 00:13:06.560 |
This is a great dataset called OpenOrca created by a fantastic open source group, and it's 00:13:14.560 |
built on top of something called the flan collection. 00:13:18.880 |
You can see that basically there's all kinds of different questions in here, so there's 00:13:25.000 |
four gigabytes of questions and context and so forth. 00:13:32.240 |
Each one generally has a question or an instruction or a request and then a response. 00:13:44.280 |
I think this is from the flan dataset if I remember correctly. 00:13:47.560 |
So for instance it could be: does the sentence "In the Iron Age" answer the question "The period 00:13:52.920 |
of time from 1200 to 1000 BCE is known as what?"; choices: 1. yes or 2. no; and then the 00:13:59.600 |
language model is meant to write one or two as appropriate for yes or no, or it could 00:14:07.600 |
be things about, I think this is from a music video, who is the girl in "More Than You Know", 00:14:13.600 |
answer: and then it would have to write the correct name of the model or dancer 00:14:18.400 |
or whatever from that music video, and so forth. 00:14:23.080 |
So it's still doing language modeling so fine-tuning and pre-training are kind of the same thing 00:14:29.760 |
but this is more targeted now not just to be able to fill in the missing parts of any 00:14:36.100 |
document from the internet but to fill in the words necessary to answer questions and follow instructions. 00:14:45.920 |
Okay so that's instruction tuning and then step three which is the classifier fine-tuning 00:14:53.200 |
nowadays there's generally various approaches such as reinforcement learning from human 00:14:58.160 |
feedback and others which are basically giving humans or sometimes more advanced models multiple 00:15:10.180 |
answers to a question such as here are some from a reinforcement learning from human feedback 00:15:15.360 |
paper I can't remember which one I got it from, list five ideas for how to regain enthusiasm 00:15:20.200 |
for my career and so the model will spit out two possible answers or it will have a less 00:15:26.600 |
good model and a more good model and then a human or a better model will pick which 00:15:32.040 |
is best and so that's used for the final fine-tuning stage. 00:15:38.120 |
So all of that is to say although you can download pure language models from the internet 00:15:48.320 |
they're not generally that useful on their own until you've fine-tuned them now you don't 00:15:54.800 |
necessarily need step C nowadays actually people are discovering that maybe just step 00:15:58.760 |
B might be enough it's still a bit controversial. 00:16:02.920 |
Okay so when we talk about a language model where we could be talking about something 00:16:09.400 |
that's just been pre-trained something that's been fine-tuned or something that's gone through 00:16:13.960 |
something like RLHF all of those things are generally described nowadays as language models. 00:16:23.140 |
So my view is that if you are going to be good at language modeling in any way then 00:16:31.000 |
you need to start by being a really effective user of language models and to be a really 00:16:36.080 |
effective user of language models you've got to use the best one that there is and currently 00:16:41.400 |
so what are we up to September 2023 the best one is by far GPT-4 this might change sometime 00:16:50.560 |
in the not-too-distant future but this is right now GPT-4 is the recommendation strong strong 00:16:55.280 |
recommendation now you can use GPT-4 by paying 20 bucks a month to OpenAI and then you can 00:17:03.320 |
use it a whole lot; it's very hard to run out of credits I find. 00:17:11.000 |
Now what can GPT-4 do? It's interesting and instructive in my opinion to start with the very common 00:17:19.360 |
views you see on the internet or even in academia about what it can't do. 00:17:23.640 |
So for example there was this paper you might have seen GPT-4 can't reason which describes 00:17:30.420 |
a number of empirical analyses of 25 diverse reasoning problems and found that 00:17:38.780 |
it was not able to solve them it's utterly incapable of reasoning. 00:17:44.840 |
So I always find you got to be a bit careful about reading stuff like this because I just 00:17:50.840 |
took the first three that I came across in that paper and I gave them to GPT-4 and by 00:18:00.640 |
the way something very useful in GPT-4 is you can click on the share button and you'll 00:18:09.400 |
get something that looks like this and this is really handy. 00:18:12.880 |
So here's an example of something from the paper that said GPT-4 can't do this Mabel's 00:18:18.760 |
heart rate at 9am was 75 beats per minute her blood pressure at 7pm was 120 over 80 00:18:25.200 |
she died at 11pm; was she alive at noon? Of course, as humans we know obviously she 00:18:30.120 |
must have been, and GPT-4 says hmm this appears to be a riddle not a real inquiry into medical 00:18:37.560 |
conditions here's a summary of the information and yeah sounds like Mabel was alive at noon 00:18:47.480 |
so that's correct; this was the second one I tried from the paper that says GPT-4 can't 00:18:52.280 |
do this, and I found actually GPT-4 can do it; and the third one also said GPT-4 can't do this, 00:19:00.200 |
and I found GPT-4 can do it. Now I mention this to say GPT-4 is probably a lot better 00:19:07.480 |
than you would expect if you've read all this stuff on the internet about all the dumb things 00:19:14.480 |
that it does almost every time I see on the internet saying something that GPT-4 can't 00:19:21.640 |
do I check it and it turns out it does this one was just last week Sally a girl has three 00:19:27.720 |
brothers each brother has two sisters how many sisters does Sally have so have a think 00:19:34.000 |
about it and so GPT-4 says okay Sally is counted as one sister by each of her brothers if each 00:19:43.760 |
brother has two sisters that means there's another sister in the picture apart from Sally 00:19:48.720 |
so Sally has one sister correct and then this one I saw just like three or four days 00:19:59.240 |
ago this is a common view that language models can't track things like this there is the 00:20:07.520 |
riddle I'm in my house on top of my chair in the living room is a coffee cup inside 00:20:11.640 |
the coffee cup is a thimble inside the thimble is a diamond I move the chair to the bedroom 00:20:16.980 |
I put the coffee cup on the bed I turn the cup upside down then I return it upright I 00:20:21.200 |
place the coffee cup on the counter in the kitchen where's my diamond and so GPT-4 says 00:20:26.640 |
yeah okay you turned it upside down so probably the diamond fell out so therefore the diamond's 00:20:33.840 |
in the bedroom where it fell out; again correct. Why is it that people are claiming that GPT-4 00:20:44.000 |
can't do these things and it can well the reason is because I think on the whole they 00:20:47.620 |
are not aware of how GPT-4 was trained GPT-4 was not trained at any point to give correct 00:20:58.520 |
answers GPT-4 was trained initially to give most likely next words and there's an awful 00:21:06.800 |
lot of stuff on the internet where the documents are not describing things that 00:21:11.400 |
are true: they could be fiction, they could be jokes, or it could just be that people 00:21:16.560 |
do say dumb stuff, so this first stage does not necessarily give you correct answers; the second 00:21:23.680 |
stage, with the instruction tuning, is also trying to give correct answers 00:21:31.480 |
but part of the problem is that then in the stage where you start asking people which 00:21:36.320 |
answer do they like better, people tended to say in these comparisons that they prefer 00:21:44.620 |
more confident answers and they often were not people who were trained well enough to 00:21:50.480 |
recognize wrong answers so there's lots of reasons that the SGD weight updates 00:21:58.040 |
from this process for stuff like GPT-4 don't particularly or don't entirely reward correct 00:22:05.440 |
answers but you can help it want to give you correct answers if you think about the LM 00:22:13.440 |
pre-training what are the kinds of things in a document that would suggest oh this is 00:22:19.340 |
going to be high quality information and so you can actually prime GPT-4 to give you high 00:22:28.480 |
quality information by giving it custom instructions and what this does is this is basically text 00:22:36.960 |
that is prepended to all of your queries and so you say like oh you're brilliant at reasoning 00:22:44.240 |
so like okay that's obviously there to prime it to give good answers and then try to work 00:22:51.760 |
against the fact that the RLHF folks preferred confidence just tell it no tell me if there 00:23:02.240 |
might not be a correct answer also the way that the text is generated is it literally 00:23:09.840 |
generates the next word and then it puts all that whole lot back into the model and generates 00:23:16.240 |
the next next word puts that all back in the model generates the next next next word and 00:23:20.280 |
so forth that means the more words it generates the more computation it can do and so I literally 00:23:26.640 |
I tell it that right and so I say first spend a few sentences explaining background context 00:23:32.880 |
etc so this custom instruction allows it to solve more challenging problems and you can 00:23:48.480 |
see the difference here's what it looks like for example if I say how do I get a count 00:23:55.440 |
of rows grouped by value in pandas and it just gives me a whole lot of information which 00:24:01.720 |
is actually it thinking so I just skip over it and then it gives me the answer and actually 00:24:06.920 |
in my custom instructions I actually say if the request begins with VV actually make it 00:24:16.760 |
as concise as possible and so it kind of goes into brief mode and here is brief mode how 00:24:23.440 |
do I get the group this is the same thing but with VV at the start and it just spits 00:24:27.960 |
it out now in this case it's a really simple question so I didn't need time to think so 00:24:33.040 |
hopefully that gives you a sense of how to get language models to give good answers you 00:24:41.160 |
have to help them and if you it if it's not working it might be user error basically but 00:24:47.920 |
having said that there's plenty of stuff that language models like GPT-4 can't do one thing 00:24:54.600 |
to think carefully about is does it know about itself can you ask it what is your context 00:25:01.560 |
length how were you trained what transformer architecture are you based on; at any one of 00:25:10.320 |
these stages did it have the opportunity to learn any of those things well obviously not 00:25:16.240 |
at the pre-training stage nothing on the internet existed during GPT-4's training saying how 00:25:21.840 |
GPT-4 was trained right probably Ditto in the instruction tuning probably Ditto in the 00:25:28.520 |
RLHF so in general you can't ask for example a language model about itself now again because 00:25:36.320 |
of the RLHF it'll want to make you happy by giving you opinionated answers so it'll just 00:25:42.920 |
spit out the most likely thing it thinks with great confidence this is just a general kind 00:25:49.200 |
of hallucination right so hallucinations is just this idea that the language model wants 00:25:54.600 |
to complete the sentence and it wants to do it in an opinionated way that's likely to 00:25:59.260 |
make people happy it doesn't know anything about URLs it really hasn't seen many at all 00:26:07.600 |
I think a lot of them, if not all of them, were pretty much stripped out so if you ask 00:26:12.440 |
it anything about like what's at this web page again it'll generally just make it up 00:26:19.240 |
and it doesn't know at least GPT-4 doesn't know anything after September 2021 because 00:26:24.680 |
the information it was pre-trained on was from that time period September 2021 and before 00:26:32.920 |
called the knowledge cutoff so here's some things it can't do Steve Newman sent me this 00:26:40.100 |
good example of something that it can't do here is a logic puzzle I need to carry a cabbage 00:26:49.100 |
a goat and a wolf across a river I can only carry one item at a time I can't leave the 00:26:54.440 |
goat with a cabbage I can't leave the cabbage with the wolf how do I get everything across 00:26:59.440 |
to the other side now the problem is this looks a lot like something called the classic 00:27:06.100 |
river crossing puzzle so classic in fact that it has a whole Wikipedia page about it and 00:27:16.440 |
in the classic puzzle the wolf will eat the goat or the goat will eat the cabbage now 00:27:25.160 |
in Steve's version he changed it: the goat would eat the cabbage and the wolf would eat 00:27:37.320 |
the cabbage but the wolf won't eat the goat so what happens well very interestingly GPT-4 00:27:45.960 |
here is entirely overwhelmed by the language model training it's seen this puzzle so many 00:27:50.680 |
times it knows what word comes next so it says oh yeah I take the goat across 00:27:56.120 |
the river and leave it on the other side leaving the wolf with the cabbage but we're 00:28:00.560 |
just told you can't leave the wolf with a cabbage so it gets it wrong now the thing 00:28:07.560 |
is though you can encourage GPT-4 or any of these language models to try again so during 00:28:14.120 |
the instruction tuning and RLHF they're actually fine-tuned with multi-stage conversations so 00:28:20.080 |
you can give it a multi-stage conversation repeat back to me the constraints I listed 00:28:24.560 |
what happened after step one is a constraint violated oh yeah yeah yeah I made a mistake 00:28:31.240 |
okay my new attempt instead of taking the goat across the river and leaving it on the 00:28:36.800 |
other side is I'll take the goat across the river and leave it on the other side it's 00:28:41.840 |
done the same thing oh yeah I did do the same thing okay I'll take the wolf across well 00:28:49.920 |
now the goats with a cabbage that still doesn't work oh yeah that didn't work either sorry 00:28:57.080 |
about that instead of taking the goat across the other side I'll take the goat across the 00:29:00.880 |
other side okay what's going on here right this is terrible well one of the problems 00:29:07.120 |
here is that not only is it so common on the internet to see this particular goat puzzle 00:29:16.720 |
that it's so confident it knows what the next word is also on the internet when you see 00:29:21.560 |
stuff which is stupid on a web page it's really likely to be followed up with more stuff that 00:29:28.600 |
is stupid once GPT-4 starts being wrong it tends to be more and more wrong it's very 00:29:39.080 |
hard to turn it around and start making it be right so you actually have to go back 00:29:46.520 |
and there's actually an edit button on these chats and so what you generally want to do 00:29:58.680 |
is if it's made a mistake is don't say oh here's more information to help you fix it 00:30:03.160 |
but instead go back and click the edit and change it here 00:30:16.300 |
and so this time it's not going to get confused so in this case actually fixing Steve's example 00:30:25.280 |
takes quite a lot of effort but I think I've managed to get it to work eventually and I 00:30:29.560 |
actually said oh sometimes people read things too quickly they don't notice things that can 00:30:33.880 |
trip them up then they apply some pattern and get the wrong answer you do the same thing 00:30:39.200 |
by the way so I'm going to trick you so before you're about to get tricked make sure you don't 00:30:45.800 |
get tricked here's the tricky puzzle and then also with my custom instructions it takes 00:30:51.220 |
time discussing it and this time it gets it correct it takes the cabbage across first 00:30:58.120 |
so it took a lot of effort to get to a point where it could actually solve this because 00:31:04.040 |
yeah when it's you know for things where it's been primed to answer a certain way again 00:31:11.560 |
and again and again it's very hard for it to not do that okay now something else super 00:31:19.880 |
helpful that you can use is what they call advanced data analysis in advanced data analysis 00:31:28.060 |
you can ask it to basically write code for you and we're going to look at how to implement 00:31:33.040 |
this from scratch ourself quite soon but first of all let's learn how to use it so I was 00:31:38.680 |
trying to build something that split a document into sections on third level 00:31:44.560 |
markdown headings so that's three hashes at the start of a line and I was doing it on 00:31:50.840 |
the whole of Wikipedia so using regular expressions was really slow so I said oh I want to speed 00:31:55.720 |
this up and it said okay here's some code which is great because then I can say okay 00:32:02.280 |
test it and include edge cases and so it then puts in the code creates extra cases tests 00:32:14.480 |
it and it says yep it's working however I just discovered it's not I noticed it's actually 00:32:21.640 |
removing the carriage return at the end of each sentence so I said fix that and 00:32:26.840 |
update your tests so it said okay so now it's changed the test update the test cases to 00:32:34.720 |
run them and oh it's not working so it says oh yeah fix the issue in the test cases no 00:32:44.200 |
it didn't work and you can see it's quite clever the way it's trying to fix it by looking 00:32:50.880 |
at the results and but as you can see it's not every one of these is another attempt 00:32:58.360 |
another attempt another attempt until eventually I gave up waiting and it's so funny each time 00:33:02.840 |
it's like debugging again okay this time I got to handle it properly and I gave up at 00:33:09.560 |
the point where it's like oh one more attempt so it didn't solve it interestingly enough 00:33:15.120 |
and you know, again, there are some limits to the amount of kind of logic 00:33:24.240 |
that it can do; this is really a very simple thing I asked it to do for me and so hopefully 00:33:29.480 |
you can see you can't expect even GPT-4 code interpreter, or advanced data analysis as it 00:33:36.040 |
is now called, to make it so you don't have to write code anymore you know it's not a 00:33:42.080 |
substitute for having programmers so but it can you know it can often do a lot as I'll 00:33:53.080 |
show you in a moment so for example actually OCR like this is something I thought was really 00:33:59.840 |
cool. You can just paste in or upload an image, so in GPT-4's 00:34:10.280 |
advanced data analysis, yeah, you can upload an image here, and then I wanted to basically 00:34:18.800 |
grab some text out of an image somebody had got a screenshot of their screen and I wanted 00:34:22.880 |
to add it; it was something saying oh this language model can't do this and I wanted 00:34:27.600 |
to try it as well so rather than retyping it I just uploaded that image my screenshot 00:34:31.800 |
and said can you extract the text from this image and it said oh yeah I could do that 00:34:36.320 |
I could use OCR and like so it literally wrote an OCR script and there it is just took a 00:34:45.880 |
few seconds so the difference here is it didn't really require it to think of much logic it could 00:34:54.000 |
just use a very very familiar pattern that it would have seen many times so this is generally 00:35:00.400 |
where I find language models excel is where it doesn't have to think too far outside the 00:35:05.560 |
box I mean it's great on kind of creativity tasks but for like reasoning and logic tasks 00:35:11.440 |
that are outside the box I find it not great but yeah it's great at doing code for a whole 00:35:17.120 |
wide variety of different libraries and languages having said that by the way Google also has 00:35:26.560 |
a language model called Bard. It's way less good than GPT-4 most of the time but there 00:35:31.880 |
is a nice thing that you can literally paste an image straight into the prompt and I just 00:35:37.520 |
typed OCR this and it didn't even have to go through code interpreter or whatever it 00:35:41.600 |
just said oh sure I've done it and there's the result of the OCR and then it even commented 00:35:48.480 |
on what it just OCR'd which I thought was cute and oh even more interestingly it 00:35:53.800 |
even figured out where the OCR text came from and gave me a link to it so I thought that 00:36:01.800 |
was pretty cool okay so there's an example of it doing well I'll show you one for this 00:36:09.120 |
talk I found really helpful I wanted to show you guys how much it costs to use the OpenAI 00:36:14.840 |
API but unfortunately when I went to the OpenAI web page it was like all over the 00:36:23.280 |
place the pricing information was on all separate tables and it was kind of a bit of a mess so 00:36:30.000 |
I wanted to create a table with all of the information combined like this and here's 00:36:38.080 |
how I did it I went to the OpenAI page I hit Cmd+A to select all and then I said 00:36:49.760 |
in ChatGPT create a table with the pricing information rows no summarization no information 00:36:56.040 |
not in this page every row should appear as a separate row in your output and I hit paste 00:37:01.000 |
now that was not very helpful to it because hitting paste it's got the navbar it's got 00:37:08.600 |
lots of extra information at the bottom it's got all of its footer etc but it's really 00:37:17.760 |
good at this stuff it did it first time so there was the markdown table so I copied and 00:37:23.040 |
pasted that into Jupyter and I got my markdown table and so now you can see at a glance the 00:37:30.240 |
cost of GPT-4, 3.5, etc. but then what I really wanted to do was show you that as a picture 00:37:38.600 |
so I just said oh chart the input row from this table, just pasted the table back, and 00:37:46.600 |
it did so; that's pretty amazing. Now, so let's talk about this pricing so so far we've used 00:37:54.600 |
ChatGPT which costs 20 bucks a month and there's no like per token cost or anything but if 00:38:00.480 |
you want to use the API from Python or whatever you have to pay per token which is approximately 00:38:06.000 |
per word maybe it's about one and a third tokens per word on average unfortunately in 00:38:13.600 |
the chart it did not include these headers, GPT-4 and GPT-3.5, so these first two ones are 00:38:18.480 |
GPT-4 and these two are GPT-3.5 so you can see the GPT-3.5 is way way cheaper and you 00:38:28.120 |
can see it here it's 0.03 versus 0.0015 so it's so cheap you can really play around with 00:38:38.960 |
it not worry and I want to give you a sense of what that looks like okay so why would 00:38:46.240 |
you use the OpenAI API rather than ChatGPT because you can do it programmatically so 00:38:53.640 |
you can you know you can analyze data sets you can do repetitive stuff it's kind of like 00:39:03.040 |
a different way of programming you know it's it's things that you can think of describing 00:39:09.480 |
but let's just look at the most simple example of what that looks like so if you pip install 00:39:12.960 |
openai then you can import ChatCompletion and then you can say okay ChatCompletion 00:39:20.600 |
.create using gpt-3.5-turbo and then you can pass in a system message this is basically 00:39:28.440 |
the same as custom instructions so okay you're an Aussie LLM that uses Aussie slang and analogies 00:39:33.640 |
wherever possible okay and so you can see I'm passing in an array here of messages so 00:39:39.160 |
the first is the system message and then the user message, which is "What is money?". 00:39:46.120 |
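As a rough sketch of that call, using the pre-1.0 openai Python package that this walkthrough is based on:

```python
import openai

aussie_sys = ("You are an Aussie LLM that uses Aussie slang "
              "and analogies whenever possible.")

c = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": aussie_sys},
              {"role": "user",   "content": "What is money?"}])

print(c.choices[0].message.content)  # the model's answer
print(c.usage)                       # prompt/completion token counts, used for billing
```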
GPT-3.5 returns a big embedded dictionary and the message content is: well, money is 00:39:57.040 |
like the oil that keeps the machinery of our economy running smoothly there you go just 00:40:03.640 |
like a koala loves its eucalyptus leaves we humans can't survive without this stuff so 00:40:08.800 |
there's the Aussie LLM's view of what is money. So really the main ones: I pretty much 00:40:17.440 |
always use GPT 4 and GPT 3.5 GPT 4 is just so so much better at anything remotely challenging 00:40:29.120 |
but obviously it's much more expensive so rule of thumb you know maybe try 3.5 turbo 00:40:34.000 |
first see how it goes if you're happy with the results then great if you're not, pony 00:40:40.000 |
up for the more expensive one okay so I just created a little function here called response 00:40:45.760 |
that will print out this nested thing and so now oh and so then the other thing to point 00:40:55.000 |
out here is that the result of this also has a usage field which contains how many tokens 00:41:03.200 |
was it so it's about 150 tokens so at $0.002 per thousand tokens, 00:41:14.800 |
for 150 tokens that means we just paid 0.03 cents ($0.0003) 00:41:25.600 |
to get that done so as you can see the cost is insignificant if we were using GPT-4 it 00:41:31.880 |
would be $0.03 per thousand so it would be half a cent so unless you're doing 00:41:42.680 |
many thousands of GPT 4 you're not going to be even up into the dollars and GPT 3.5 even 00:41:49.080 |
more than that but you know keep an eye on it open AI has a usage page and you can track 00:41:55.160 |
your usage now what happens when we are this is really important to understand when we 00:42:02.700 |
have a follow-up in the same conversation how does that work so we just asked what goat 00:42:13.240 |
means so for example Michael Jordan is often referred to as the goat for his exceptional 00:42:21.320 |
skills and accomplishments and Elvis and the Beatles referred to as goat due to their profound 00:42:28.040 |
influence and achievement so I could say what profound influence and achievements are you 00:42:38.840 |
referring to okay well I meant Elvis Presley and the Beatles did all these things now how 00:42:49.080 |
does that work how does this follow-up work well what happens is the entire conversation 00:42:55.240 |
is passed back and so we can actually do that here so here is the same system prompt here 00:43:03.960 |
is the same question right and then the answer comes back with role assistant and I'm going 00:43:10.120 |
to do something pretty cheeky I'm going to pretend that it didn't say money is like oil 00:43:17.240 |
I'm gonna say oh you actually said money is like kangaroos I thought what it's gonna do 00:43:23.880 |
okay so you can like literally invent a conversation in which the language model said something 00:43:29.520 |
different because this is actually how it's done in a multi-stage conversation there's 00:43:34.320 |
no state right there's nothing stored on the server you're passing back the entire conversation 00:43:40.920 |
again and telling it what it told you right so I'm going to tell it it's it told me that 00:43:47.280 |
money is like kangaroos and then I'll ask the user oh really in what way it's just kind 00:43:52.320 |
of cool because you can like see how it convinces you of of something I just invented oh let 00:43:59.400 |
me break it down for you cover just like kangaroos hop around and carry their joeys in their 00:44:03.080 |
pouch money is a means of carrying value around so there you go it's uh make your own analogy 00:44:08.920 |
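A sketch of that trick: you simply pass the whole conversation back, including an assistant turn you made up yourself (reusing the aussie_sys message from the earlier sketch):

```python
c = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",    "content": aussie_sys},
        {"role": "user",      "content": "What is money?"},
        # claim the assistant said something it never actually said...
        {"role": "assistant", "content": "Money is like kangaroos, actually."},
        {"role": "user",      "content": "Really? In what way?"}])

print(c.choices[0].message.content)  # it will happily explain the kangaroo analogy
```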
cool so I'll create a little function here that just puts these things together for us 00:44:17.160 |
system message if there is one, the user message, and returns the completion. 00:44:23.120 |
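That helper probably looks something like this sketch (this is the askgpt function referred to later on):

```python
def askgpt(user, system=None, model="gpt-3.5-turbo", **kwargs):
    "Send an optional system message plus a user message; return the completion."
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": user})
    return openai.ChatCompletion.create(model=model, messages=msgs, **kwargs)
```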
And so now we can ask it what's the meaning of life passing in the Aussie system prompt: the meaning of 00:44:29.560 |
life is like trying to catch a wave on a sunny day at Bondi Beach okay there you go so um 00:44:35.520 |
what do you need to be aware of well as I said one thing is keep an eye on your usage 00:44:40.200 |
if you're doing it you know hundreds or thousands of times in a loop keep an eye on not spending 00:44:46.040 |
too much money but also if you're doing it too fast particularly the first day or two 00:44:50.880 |
you've got an account you're likely to hit the limits for the API and so the limits initially 00:45:00.120 |
are pretty low as you can see three requests per minute that's for free users, or paid users in their 00:45:11.880 |
first 48 hours and after that it starts going up and you can always ask for more I just 00:45:17.200 |
mentioned this because you're going to want to have a function that keeps an eye on that 00:45:23.080 |
and so what I did is I actually just went to Bing which has a somewhat crappy version 00:45:28.880 |
of GPT-4 nowadays but it can still do basic stuff for free and I said please show me python 00:45:35.440 |
code to call the openai API and handle rate limits and it wrote this code it's got a try 00:45:45.960 |
checks your rate limit errors grabs the retry after sleeps for that long and calls itself 00:45:55.160 |
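Something along these lines; the function and variable names here are illustrative, and the pre-1.0 SDK raises openai.error.RateLimitError when you hit the limit:

```python
import time

def call_api(prompt, model="gpt-3.5-turbo"):
    "Call the chat API, sleeping and retrying if we hit a rate limit."
    msgs = [{"role": "user", "content": prompt}]
    try:
        return openai.ChatCompletion.create(model=model, messages=msgs)
    except openai.error.RateLimitError as e:
        # wait however long the API asks us to, then try again
        retry_after = int(e.headers.get("retry-after", 60))
        print(f"Rate limit exceeded, waiting {retry_after} seconds...")
        time.sleep(retry_after)
        return call_api(prompt, model=model)
```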
and so now we can use that to ask for example what's the world's funniest joke and there 00:46:02.480 |
we go is the world's funniest joke so there's like the basic stuff you need to get started 00:46:11.800 |
using the OpenAI LLMs and yeah I'd definitely suggest spending plenty of time with that 00:46:24.720 |
so that you feel like you're really an LLM-using expert so what else can we do well let's 00:46:35.840 |
create our own code interpreter that runs inside Jupiter and so to do this we're going 00:46:43.400 |
to take advantage of a really nifty thing called function calling which is provided 00:46:49.680 |
by the OpenAI API, and in function calling, when we call our askgpt function, which is this little 00:46:57.920 |
one here we had room to pass in some keyword arguments that will be just passed along to 00:47:03.520 |
ChatCompletion.create, and one of those keyword arguments you can pass is functions 00:47:12.520 |
what on earth is that functions tells openai about tools that you have about functions 00:47:22.040 |
that you have so for example I created a really simple function called sums and it adds two 00:47:31.080 |
things in fact it adds two ints and I'm going to pass that function to ChatCompletion 00:47:44.560 |
.create; now you can't pass a Python function directly, you actually have to pass what's 00:47:51.200 |
called the JSON schema so you have to pass the schema for the function so I created this 00:47:58.120 |
nifty little function that you're welcome to borrow which uses Pydantic and also Python's 00:48:07.160 |
inspect module to automatically take a Python function and return the schema for it and 00:48:15.960 |
so this is actually what's going to get passed to OpenAI, so that it's going to know that there's 00:48:15.960 |
a function called sums it's going to know what it does and it's going to know what parameters 00:48:24.040 |
it takes what the defaults are and what's required so this is like when I first heard 00:48:31.860 |
about this I found this a bit mind bending because this is so different to how we normally 00:48:35.920 |
program computers where the key thing for programming the computer here actually is 00:48:41.280 |
the docstring: this is the thing that GPT-4 will look at and say oh what does this function 00:48:47.040 |
do, so it's critical that this describes exactly what the function does. 00:48:53.840 |
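Here is a sketch of those two pieces: the sums function with its all-important docstring, and a helper (which I'll call schema here) that uses Pydantic plus Python's inspect module to build the JSON schema that actually gets sent to OpenAI. Details may differ from the exact code used in the talk:

```python
from inspect import signature, Parameter
from pydantic import create_model

def sums(a: int, b: int = 1):
    "Adds a + b"   # this docstring is what GPT-4 reads to decide when to call the tool
    return a + b

def schema(f):
    "Build a JSON-schema description of `f` from its signature and docstring."
    kw = {n: (p.annotation, ... if p.default == Parameter.empty else p.default)
          for n, p in signature(f).parameters.items()}
    s = create_model(f"Input for {f.__name__}", **kw).schema()
    return dict(name=f.__name__, description=f.__doc__, parameters=s)

# schema(sums) is what gets passed via the `functions` keyword argument
```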
And so if I then say "what is six plus three"; right, now I really wanted to make sure it actually did it here, 00:49:03.480 |
so I gave it lots of prompts to say because obviously it knows how to do it itself without 00:49:08.520 |
calling sums so it'll only use your functions if it feels it needs to which is a weird concept 00:49:15.680 |
I mean I guess feels is not a great word to use but you kind of have to anthropomorphize 00:49:20.480 |
these things a little bit because they don't behave like normal computer programs so if 00:49:25.760 |
I ask GPT what is six plus three and tell it that there's a function called sums, then 00:49:33.080 |
it does not actually return the number nine instead it returns something saying please 00:49:39.400 |
call a function call this function and pass it these arguments so if I print it out there's 00:49:46.920 |
the arguments so I created a little function called call function and it goes into the 00:49:54.920 |
result of open AI grabs the function call checks that the name is something that it's 00:50:02.000 |
allowed to do grabs it from the global system table and calls it passing in the parameters 00:50:10.880 |
and so if I now say okay call the function that we got back, we finally get nine. 00:50:24.160 |
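A sketch of that call_func helper, assuming a whitelist that currently only allows sums:

```python
import json

funcs_ok = {"sums"}  # functions the model is allowed to call

def call_func(c):
    "Run the function call requested in completion `c` and return its result."
    fc = c.choices[0].message.function_call
    if fc.name not in funcs_ok:
        return print(f"Not allowed: {fc.name}")
    f = globals()[fc.name]
    return f(**json.loads(fc.arguments))
```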
So this is a very simple example, it's not really doing anything that useful, but what we could do 00:50:27.920 |
now is we can create a much more powerful function called Python and the Python function 00:50:39.000 |
executes code using Python and returns the result; now of course I didn't want my computer 00:50:49.840 |
to run arbitrary Python code that GPT-4 told it to without checking so I just got it to 00:50:56.000 |
check first, to say are you sure you want to do this. 00:51:09.760 |
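A sketch of that python tool; this illustrative version expects the generated code to assign its answer to a variable called result (as in the example below), and the confirmation prompt is just a check, not a hardened sandbox:

```python
def python(code: str):
    "Return result of executing `code` using python. Use for any required computations."
    if input(f"Proceed with execution?\n```\n{code}\n```\n(y/n) ").lower() != "y":
        return "#FAIL#"
    ns = {}
    exec(code, ns)           # run the model-supplied code
    return ns.get("result")  # pull out the `result` variable it set
```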
So now I can say: ask GPT "what is 12 factorial", with the system prompt "you can use Python for any required computations", and say okay here's a function 00:51:14.200 |
you've got available it's the Python function so if I now call this it will pass me back 00:51:23.240 |
again a completion object and here it's going to say okay I want you to call Python passing 00:51:29.240 |
in this argument and when I do it's going to go import math result equals blah and then 00:51:37.440 |
return result do I want to do that yes I do and there it is now there's one more step 00:51:47.840 |
which we can optionally do I mean we've got the answer we wanted but often we want the 00:51:51.780 |
answer in more of a chat format and so the way to do that is to again repeat everything 00:51:58.700 |
that you've passed in so far, but then instead of adding an assistant role response we have 00:52:06.880 |
to provide a function role response and simply put in here the result we got back from the 00:52:15.080 |
function and if we do that we now get the prose response: 12 factorial is equal to 479,001,600. 00:52:26.680 |
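Roughly, that final step adds a "function"-role message containing the tool's output and calls the API one more time (reusing the schema and call_func sketches from above):

```python
result = call_func(c)   # e.g. 479001600, returned by the python tool

c2 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    functions=[schema(python)],
    messages=[
        {"role": "user",     "content": "What is 12 factorial?"},
        c.choices[0].message,   # the assistant turn containing the function_call
        {"role": "function", "name": "python", "content": str(result)}])

print(c2.choices[0].message.content)  # a prose answer mentioning 479,001,600
```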
Now, with functions like Python available, you can still ask 00:52:36.480 |
it about non-python things and it just ignores it if you don't need it right so you can have 00:52:44.440 |
a whole bunch of functions available that you've built to do whatever you need for the 00:52:50.840 |
stuff which the language model isn't familiar with and it'll still solve whatever it can 00:53:00.640 |
on its own and use your tools use your functions where possible okay so we have built our own 00:53:13.640 |
code interpreter from scratch I think that's pretty amazing so that is what you can do 00:53:26.000 |
with or some of the stuff you can do with open AI what about stuff that you can do on 00:53:34.580 |
your own computer well to use a language model on your own computer you're going to need 00:53:41.140 |
to use a GPU so I guess the first thing to think about is like do you want this does 00:53:50.580 |
it make sense to do stuff on your own computer what are the benefits there are not any open 00:54:00.160 |
source models that are as good yet as GPT-4 and I would have to say also like actually 00:54:08.280 |
OpenAI's pricing is really pretty good, so it's not immediately obvious that you 00:54:14.560 |
definitely want to kind of go in house but there's lots of reasons you might want to 00:54:20.080 |
and we'll look at some examples of them today one example you might want to go in house 00:54:26.680 |
is that you want to be able to ask questions about your proprietary documents or about 00:54:34.280 |
information after September 2021, the knowledge cutoff, or you might want to create your own 00:54:40.640 |
model that's particularly good at solving the kinds of problems that you need to solve 00:54:45.440 |
using fine-tuning and these are all things that you absolutely can get better than GPT-4 00:54:49.800 |
performance at, at work or at home, without too much money or trouble 00:54:57.240 |
so these are the situations in which you might want to go down this path and so you don't 00:55:01.960 |
necessarily have to buy a GPU on Kaggle they will give you a notebook with two quite old 00:55:09.000 |
GPUs attached and very little RAM but it's something or you can use CoLab and on CoLab 00:55:17.280 |
you can get much better GPUs than Kaggle has and more RAM particularly if you pay a monthly 00:55:25.520 |
subscription fee so those are some options for free or low-cost you can also of course 00:55:38.680 |
go to one of the many kind of GPU server providers and they change all the time as to what's 00:55:47.800 |
good or what's not. RunPod is one example and you can see you know if you want the biggest 00:55:56.560 |
and best machine you're talking $34 an hour so it gets pretty expensive but you can certainly 00:56:03.280 |
get things a lot cheaper 80 cents an hour. Lambda Labs is often pretty good you know 00:56:14.920 |
it's really hard at the moment to actually find let's see pricing to actually find people 00:56:24.120 |
that have them available so they've got lots listed here but they often have none or very 00:56:29.240 |
few available there's also something pretty interesting called vast AI which basically 00:56:37.640 |
lets you use other people's computers when they're not using them and as you can see 00:56:49.440 |
you know they tend to be much cheaper than other folks and then they tend to have better 00:56:56.760 |
availability as well but of course for sensitive stuff you don't want to be running it on some 00:57:00.680 |
randos computer so anyway so there's a few options for renting stuff you know I think 00:57:06.720 |
if you can it's worth buying something and definitely the one to buy at the moment is 00:57:10.480 |
the RTX 3090, used; you can generally get them from eBay for like 700 bucks or so. A 4090 00:57:21.040 |
isn't really better for language models even though it's a newer GPU the reason for that 00:57:26.560 |
is that language models are all about memory speed how quickly can you get in and stuff 00:57:31.760 |
in and out of memory rather than how fast is the processor and that hasn't really improved 00:57:35.640 |
a whole lot. So the 2,000 bucks for a 4090 isn't really worth it; the other thing, as well as memory speed, is memory size: 24 00:57:44.320 |
gigs doesn't quite cut it for a lot of things so you'd probably want to get two of 00:57:48.280 |
these GPUs so you're talking like $1500 or so or you can get a 48 gig RAM GPU it's called 00:57:58.080 |
an A6000 but this is going to cost you more like 5 grand so again getting two of these 00:58:06.280 |
is going to be a better deal and this is not going to be faster than these either. Or funnily 00:58:14.440 |
enough you could just get a Mac with a lot of RAM particularly if you get an M2 Ultra 00:58:20.840 |
Macs, particularly the M2 Ultra, have pretty fast memory; it's still going to be way slower 00:58:27.480 |
than using an Nvidia card but it's going to be like you're going to be able to get you 00:58:32.520 |
know like I think 192 gig or something so it's not a terrible option particularly if 00:58:42.120 |
you're not training models you just wanting to use other existing trained models. So anyway 00:58:52.600 |
most people who do this stuff seriously almost everybody has Nvidia cards. So then what we're 00:59:00.760 |
going to be using is a library called transformers from Hugging Face and the reason for that is 00:59:06.160 |
that basically people upload lots of pre-trained models or fine-tuned models up to the Hugging 00:59:13.280 |
Face hub and in fact there's even a leaderboard where you can see which are the best models. 00:59:20.960 |
Now this is a really fraught area; at the moment this one is meant to be the best model 00:59:30.240 |
it has the highest average score and maybe it is good, I haven't actually used that particular 00:59:35.520 |
model or maybe it's not I actually have no idea because the problem is these metrics 00:59:43.880 |
are not particularly well aligned with real life usage for all kinds of reasons and also 00:59:51.320 |
sometimes you get something called leakage which means that sometimes some of the questions 00:59:57.000 |
from these things actually leaks through to some of the training sets. So you can get 01:00:03.000 |
as a rule of thumb what to use from here but you should always try things and you can also 01:00:09.640 |
say you know these ones are all this 70 B here that tells you how big it is so this 01:00:14.200 |
is a 70 billion parameter model. So generally speaking for the kinds of GPUs we're talking 01:00:23.600 |
about you'll be wanting no bigger than 13 B and quite often 7B. So let's see if we can 01:00:33.240 |
find here's a 13 B model for example. All right so you can find models to try out from things 01:00:41.340 |
like this leaderboard and there's also a really great leaderboard called fast eval which I 01:00:47.520 |
like a lot because it focuses on some more sophisticated evaluation methods such as this 01:00:55.720 |
chain of thought evaluation method. So I kind of trust these a little bit more and these 01:01:02.080 |
are also you know GSM 8K is a difficult math benchmark big bench hard so forth. So yeah 01:01:11.440 |
so you know stable beluga 2 wizard math 13 B dolphin llama 13 B etc these would all be 01:01:18.360 |
good options. Yeah so you need to pick a model and at the moment nearly all the good models 01:01:28.000 |
are based on Meta's Llama 2. So when I say based on, what does that mean? Well what that 01:01:34.880 |
means is this model here, Llama 2 7B; so it's a Llama model, that's just the name 01:01:43.560 |
Meta calls it, this is their version 2 of Llama, this is their 7 billion size one, it's the 01:01:48.280 |
smallest one that they make and specifically these weights have been created for Hugging 01:01:53.080 |
Face so you can load it with the Hugging Face transformers library, and this model has only got as 01:01:58.760 |
far as here it's done the language model for pre-training it's done none of the instruction 01:02:03.720 |
tuning and none of the RLHF so we would need to fine tune it to really get it to do much 01:02:12.040 |
useful. So we can just say okay, automatically create the appropriate model for language 01:02:22.200 |
modeling, so causal LM basically refers to that ULMFit stage one process, or stage two 01:02:28.700 |
in fact, so get the pre-trained model from this name, meta-llama/Llama-2 blah blah. Okay 01:02:36.520 |
now generally speaking we use 16-bit floating point numbers nowadays but if you think about 01:02:48.720 |
it 16-bit is two bytes so 7B times two it's going to be 14 gigabytes just to load in the 01:02:59.640 |
weights so you've got to have a decent GPU to be able to do that; perhaps surprisingly 01:03:07.120 |
you can actually just cast it to 8-bit and it still works pretty well thanks to something 01:03:11.760 |
called quantization. So let's try that. 01:03:18.840 |
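A sketch of that load, assuming you've been granted access to the Llama 2 weights on the Hugging Face hub and have the bitsandbytes package installed for the 8-bit path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mn = "meta-llama/Llama-2-7b-hf"

# load the pre-trained weights in 8-bit, roughly halving the memory needed
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, load_in_8bit=True)

# or, for the 16-bit bfloat16 version discussed below (twice the memory, faster generation):
# model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.bfloat16)
```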
Remember, this is just a language model: it can only complete sentences, we can't ask it a question and expect a great answer. So let's 01:03:22.980 |
just give it the start of a sentence: "Jeremy Howard is a". And so we need the right tokenizer, 01:03:28.400 |
so this will automatically create the right kind of tokenizer for this model we can grab 01:03:32.640 |
the tokens as PyTorch here they are and just to confirm if we decode them back again we 01:03:44.920 |
get back the original plus a special token to say this is the start of a document and 01:03:50.400 |
so we can now call generate so generate will auto-regressively so call the model again 01:04:00.080 |
and again, passing its previous result back as the next input, and I'm just 01:04:08.800 |
going to do that 15 times so this is you can you can write this for loop yourself this 01:04:13.520 |
isn't doing anything fancy in fact I would recommend writing this yourself to make sure 01:04:18.400 |
that you know how that it all works okay we have to put those tokens on the GPU and at 01:04:26.640 |
the end I recommend putting them back onto the CPU the result and here are the tokens 01:04:31.520 |
not very interesting so we have to decode them using the tokenizer and so the first 01:04:36.160 |
15 tokens are: Jeremy Howard is a 28 year old Australian AI researcher and 01:04:42.440 |
entrepreneur. Okay, well, 28 years old is not exactly correct, but we'll call it close enough. 01:04:47.440 |
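Putting the steps just described together, a compact sketch of the tokenize, generate, decode loop (continuing from the model loaded in the previous sketch):

```python
tokr = AutoTokenizer.from_pretrained(mn)

prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")   # token ids as PyTorch tensors

# auto-regressively generate 15 new tokens on the GPU, then move the result back to the CPU
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to("cpu")

print(tokr.batch_decode(res))              # decode the ids back into text
```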
I like that thank you very much llama 7b so okay so we've got a language model completing 01:04:54.400 |
sentences it took one and a third seconds and that's a bit slower than it could be because 01:05:04.400 |
we used 8-bit; if we use 16-bit there's a special thing called bfloat16 which is a really 01:05:11.240 |
great 16-bit floating point format that's usable on any somewhat recent Nvidia 01:05:17.840 |
GPU now if we use it it's going to take twice as much RAM as we discussed but look at the 01:05:24.240 |
time: it's come down to 390 milliseconds. Now there is a better option still than even that: 01:05:34.640 |
there's a different kind of quantization called GPTQ, where a model is carefully optimized 01:05:43.160 |
to work with 4-bit or 8-bit or other lower-precision data automatically. And this particular 01:05:55.320 |
person, known as TheBloke, is fantastic at taking popular models, running that optimization 01:06:02.080 |
process, and then uploading the results back to Hugging Face. So we can use this GPTQ version, 01:06:12.040 |
and internally - I'm not sure exactly how many bits this particular 01:06:16.200 |
one is, I think it's probably going to be four bits - it's going to be much more optimized. 01:06:22.240 |
And so look at this: 270 milliseconds. It's actually faster than 16-bit, even though internally 01:06:30.640 |
it's actually casting it up to 16-bit for each layer to do the compute. That's because there's a lot 01:06:35.420 |
less memory moving around. And to confirm, in fact, what we could even do now is go to 01:06:42.000 |
13B easily, and in fact it's still faster than the 7B now that we're using the GPTQ version. 01:06:49.280 |
So this is a really helpful tip. So let's put all those things together - the tokenizer, the 01:06:55.360 |
generate, the batch decode - into a function we'll call gen, for generate. And so we can now use the 01:06:59.800 |
13B GPTQ model, and let's try this: "Jeremy Howard is a", and it got to 50 tokens so fast: "16-year 01:07:08.760 |
veteran of Silicon Valley, co-founder of Kaggle, a marketplace for predictive modelling; his company 01:07:13.720 |
Kaggle.com has become to data science competitions what..." - I don't know what it was going to say, but 01:07:17.720 |
anyway it's on the right track. I was actually there for 10 years, not 16, but that's all right. 01:07:22.520 |
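A minimal sketch of that gen helper - the repo id here is one of TheBloke's GPTQ conversions (check the exact name on Hugging Face), and loading GPTQ weights through transformers generally needs the optimum and auto-gptq packages installed as well:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for a GPTQ-quantized Llama 2 13B from TheBloke.
mn = "TheBloke/Llama-2-13B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0)
tokenizer = AutoTokenizer.from_pretrained(mn)

def gen(prompt, max_new_tokens=50):
    """Tokenize, generate, and decode in one go."""
    toks = tokenizer(prompt, return_tensors="pt").to(0)
    res = model.generate(**toks, max_new_tokens=max_new_tokens).to("cpu")
    return tokenizer.batch_decode(res)

print(gen("Jeremy Howard is a"))
```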
Okay, so this is looking good, but probably a lot of the time we're going to be interested 01:07:32.680 |
in asking questions or using instructions. So Stability AI has this nice series called 01:07:39.000 |
Stable Beluga, including a small 7B one and other bigger ones, and these are all based on Llama 2, 01:07:46.200 |
but these have been instruction tuned - they might even have been RLHF'd, I can't remember now. So we 01:07:53.480 |
can create a Stable Beluga model, and now something really important that I keep forgetting - everybody 01:08:02.040 |
keeps forgetting - is that during the instruction tuning process, 01:08:14.200 |
the instructions that are passed in don't just appear like this; they 01:08:26.760 |
always are in a particular format, and the format, believe it or not, changes quite a bit 01:08:33.000 |
from fine-tune to fine-tune. And so you have to go to the webpage for the model 01:08:39.800 |
and scroll down to find out what the prompt format is. So here's the prompt format. I 01:08:48.680 |
generally just copy it and then paste it into Python, which I did here, and created a function 01:09:01.480 |
called make_prompt that uses the exact same format that it said to use. 01:09:09.960 |
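For example, something along these lines - the prompt markers and system message are reproduced from memory of the Stable Beluga model card, so double-check them against the actual card; `gen` is the helper sketched earlier, here pointed at the Stable Beluga weights:

```python
# System message and format as described on the Stable Beluga model card (from memory).
sys_msg = "You are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can."

def make_prompt(user_question):
    return f"### System:\n{sys_msg}\n\n### User:\n{user_question}\n\n### Assistant:\n"

print(gen(make_prompt("Who is Jeremy Howard?")))
```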
So now if I want to ask "Who is Jeremy Howard?", I can call gen again - that was the function I created up here - and make the 01:09:16.920 |
correct prompt from that question, and then it returns back... okay, so you can see here all this 01:09:24.360 |
prefix: this is a system instruction, this is my question, and then the assistant says "Jeremy Howard 01:09:31.160 |
is an Australian entrepreneur, computer scientist, co-founder of machine learning and deep learning 01:09:35.400 |
company fast.ai". Okay, this one's actually all correct, so it's getting better by using an actual 01:09:41.960 |
instruction-tuned model. And so we could then start to scale up: we could use the 13B, and in fact 01:09:50.920 |
we looked briefly at this OpenOrca dataset earlier. So Llama 2 has been fine-tuned on Open 01:09:58.520 |
Orca, and then also fine-tuned on another really great dataset called Platypus, and so the whole 01:10:05.720 |
thing together is the OpenOrca Platypus, and then this is going to be the bigger 13B; 01:10:11.560 |
GPTQ means it's going to be quantized. That's got a different format - okay, a different prompt 01:10:19.240 |
format - so again we can scroll down and see what the prompt format is. There it is. Okay, and so 01:10:27.720 |
we can create a function called make_open_orca_prompt that has that prompt format. 01:10:37.320 |
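Again, a rough sketch from memory of what that might look like - the exact instruction/response markers should be confirmed against the OpenOrca Platypus model card:

```python
def make_open_orca_prompt(user_question):
    # Alpaca-style "### Instruction: / ### Response:" markers, reproduced from memory
    # of the model card - check the card for the exact format before relying on it.
    return f"### Instruction:\n\n{user_question}\n\n### Response:\n"

print(gen(make_open_orca_prompt("Who is Jeremy Howard?")))
```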
And so now we can say, okay, who is Jeremy Howard? And now I've become British, which is kind of true - 01:10:42.120 |
I was born in England but I moved to Australia. A professional poker player - definitely not that. 01:10:47.720 |
Co-founding several companies including fast.ai, also Kaggle - okay, so not bad. It was acquired 01:10:56.040 |
by Google - was it 2017? Probably something around there. Okay, so you can see we've got our own models 01:11:04.280 |
giving us some pretty good information. How do we make it even better? Because it's 01:11:11.720 |
still hallucinating. And Llama 2, I think, has been trained with 01:11:23.000 |
more up-to-date information than GPT-4 - it doesn't have the September 2021 cutoff - but 01:11:30.920 |
it's still got a knowledge cutoff. We would like to use the most up-to-date 01:11:35.320 |
information; we want to use the right information to answer these questions as well as possible. 01:11:39.720 |
So to do this, we can use something called retrieval augmented generation. 01:11:45.000 |
What happens with retrieval augmented generation is we take the question 01:11:53.800 |
we've been asked, like "Who is Jeremy Howard?", and then we say, okay, let's try and search for 01:12:02.280 |
documents that may help us answer that question. So obviously we would expect, for example, 01:12:09.880 |
Wikipedia to be useful. And then what we do is we say, okay, with that information, let's now see 01:12:19.320 |
if we can tell the language model about what we found, and then have it answer the question. 01:12:26.520 |
So let me show you. Let's actually grab a Wikipedia Python package; 01:12:35.480 |
we will scrape Wikipedia, grabbing the Jeremy Howard web page, 01:12:41.160 |
and so here's the start of the Jeremy Howard Wikipedia page. 01:12:48.760 |
It has 613 words. Now, generally speaking, these open source models will have a context length 01:12:54.600 |
of about 2,000 or 4,000 - the context length is how many tokens it can handle - so that's fine, it'll 01:13:01.080 |
be able to handle this web page. And what we're going to do is ask it the question: 01:13:06.440 |
we're going to have "Question:" followed by the question, but before it we're going to say "Answer 01:13:10.600 |
the question with the help of the context" that we're going to provide to the language model, and 01:13:14.760 |
we're going to say "Context:" and it's going to have the whole web page. So suddenly now our 01:13:19.240 |
prompt is going to be a lot bigger: our prompt now contains the entire web 01:13:28.520 |
page, the whole Wikipedia page, followed by our question. 01:13:38.360 |
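Something like this rough sketch - the `wikipedia` package and the page title here are assumptions (any way of fetching the page text works), the prompt wording is approximate, and `gen`/`make_prompt` are the helpers sketched earlier:

```python
import wikipedia  # simple package for grabbing Wikipedia page text

# Assumed page title for the relevant article.
page_text = wikipedia.page("Jeremy Howard (entrepreneur)").content
question = "Who is Jeremy Howard?"

# Stuff the retrieved document into the prompt ahead of the question.
rag_prompt = f"""Answer the question with the help of the provided context.

## Context

{page_text}

## Question

{question}"""

print(gen(make_prompt(rag_prompt), max_new_tokens=300))
```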
And so now it says: "Jeremy Howard is an Australian data scientist, entrepreneur and educator known for his work in deep learning, 01:13:42.920 |
co-founder of fast.ai, teaches courses, develops software, conducts research, used to be..." - yeah, okay, 01:13:49.240 |
it's perfect, right? So it's actually done a really good job. Like, if somebody asked me to send them a 01:13:56.280 |
100-word bio, that would actually probably be better than I would have written 01:14:02.760 |
myself. And you'll see, even though I asked for 300 tokens, it actually sent back the end-of-stream 01:14:10.680 |
token, so it knows to stop at this point. Well, that's all very well, but how do we know 01:14:19.480 |
to pass in the Jeremy Howard Wikipedia page? Well, the way we know which Wikipedia page to pass in 01:14:25.960 |
is that we can use another model to tell us which web page, or which document, is the most 01:14:34.120 |
useful for answering a question. And the way we do that is we can use something called sentence 01:14:44.760 |
transformers, and we can use a special kind of model that's specifically designed to take a document 01:14:51.800 |
and turn it into a bunch of activations, where two documents that are similar will have similar 01:14:59.960 |
activations. So let me show you what I mean. What I'm going to do is grab 01:15:05.320 |
just the first paragraph of my Wikipedia page, and I'm going to grab the first paragraph of Tony 01:15:12.520 |
Blair's Wikipedia page. Okay, so we're pretty different people, right? This is just a really 01:15:18.120 |
simple, small example. And I'm going to then call this model - I'm going to say encode - and I'm going 01:15:25.240 |
to encode my Wikipedia first paragraph, Tony Blair's first paragraph, and the question, which was "Who 01:15:33.640 |
is Jeremy Howard?", and it's going to pass back a 384-long vector of embeddings for the question, 01:15:45.320 |
for me, and for Tony Blair. And what I can now do is calculate the similarity 01:15:53.960 |
between the question and the Jeremy Howard Wikipedia page, 01:15:57.480 |
and I can also do it for the question versus the Tony Blair Wikipedia page, and as you can see it's 01:16:04.200 |
higher for me. And so that tells you that if you're trying to figure out what document to use to help 01:16:11.240 |
you answer this question, you're better off using the Jeremy Howard Wikipedia page than the Tony Blair one. 01:16:16.600 |
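A minimal sketch of that comparison - the embedding model name is just an illustrative choice of a small 384-dimensional model, not necessarily the one used in the video, and the paragraph strings are placeholders:

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# A small model that produces 384-dimensional embeddings.
emb_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

question = "Who is Jeremy Howard?"
jh_para = "Jeremy Howard is an Australian data scientist and entrepreneur..."   # first paragraph of his page
tb_para = "Sir Anthony Charles Lynton Blair is a British former politician..."  # first paragraph of Tony Blair's page

q_emb, jh_emb, tb_emb = emb_model.encode([question, jh_para, tb_para], convert_to_tensor=True)

# Cosine similarity: higher means the document looks more relevant to the question.
print(F.cosine_similarity(q_emb, jh_emb, dim=0))
print(F.cosine_similarity(q_emb, tb_emb, dim=0))
```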
So if you had a few hundred documents you were thinking of using to give 01:16:26.440 |
back to the model as context to help it answer a question, you could literally just pass them all 01:16:31.960 |
through encode, go through each one, one at a time, and see which is closest. When you've got 01:16:38.280 |
thousands or millions of documents, you can use something called a vector database, where basically, 01:16:45.160 |
as a one-off thing, you go through and encode all of your documents. And in fact, there's 01:16:54.600 |
lots of pre-built systems for this. Here's an example of one called h2oGPT 01:17:04.280 |
that I've got running here on my computer. It's just an open source thing, 01:17:14.520 |
written in Python, sitting here running on port 7860, and so I've just gone to localhost:7860. 01:17:20.440 |
And what I did was I just clicked upload, and I just 01:17:27.640 |
uploaded a bunch of papers - in fact, I might be able to see it better... yeah, here we go, a bunch of papers. 01:17:39.320 |
Can we search? Yes, I can. So for example, we can look at the ULMFiT paper that 01:17:45.800 |
Sebastian Ruder and I did, and you can see it's taken the PDF and turned it, 01:17:51.000 |
slightly crappily, into a text format, and then it's created an embedding for each 01:18:00.280 |
section. So I could then ask it, "What is ULMFiT?", and I'll hit enter, 01:18:13.000 |
and you can see here it's now actually saying "based on the information provided in the context", so 01:18:18.440 |
it's showing us it's been given some context. What context did it get? So here are the things that it 01:18:23.320 |
found, right? So it's being sent this context - this is kind of like citations: 01:18:32.920 |
"the goal of ULMFiT is to improve the performance by leveraging the knowledge and adapting it to 01:18:41.320 |
the specific task at hand". How - what techniques, be more specific - does ULMFiT use? Let's see how it goes. 01:18:55.800 |
Okay, there we go. So here's the three steps: pre-train, fine-tune, fine-tune. Cool. So you can 01:19:07.080 |
see it's not bad, right? It's not amazing - the context in this particular case is 01:19:14.440 |
pretty small - and in particular, if you think about how that embedding thing worked, 01:19:22.040 |
you can't really use the normal kind of follow-up. So for example, it says 01:19:29.880 |
"fine-tuning a classifier", so I could say, "What classifier is used?" Now the problem is that there's 01:19:38.280 |
no context here being sent to the embedding model, so it's actually going to have no idea I'm talking 01:19:42.760 |
about ULMFiT, so generally speaking it's going to do a terrible job. Yeah, see, it says it used a 01:19:49.480 |
RoBERTa model, but it's not - and if I look at the sources, it's no longer actually referring to Howard 01:19:54.920 |
and Ruder. So anyway, you can see the basic idea. This is called retrieval augmented generation, RAG, 01:20:01.960 |
and it's a nifty approach, but you have to do it with some care. And so there are lots 01:20:12.680 |
of these private GPT things out there - actually the h2oGPT webpage does a fantastic job of listing them - 01:20:25.960 |
so as you can see, if you want to run a private GPT, there's no shortage of options, 01:20:36.680 |
and you can have your retrieval augmented generation. 01:20:40.840 |
I haven't tried them all - I've only tried this one, h2oGPT. I don't love it; it's all right. 01:20:50.520 |
So finally, I want to talk about what's perhaps the most interesting option we have, which is to 01:20:56.360 |
do our own fine-tuning. And fine-tuning is cool because, rather than just retrieving documents 01:21:01.880 |
which might have useful context, we can actually change our model to behave based on the documents 01:21:08.920 |
that we have available. And I'm going to show you a really interesting example of fine-tuning here. 01:21:14.200 |
What we're going to do is fine-tune using this text-to-SQL dataset, and it's got examples 01:21:23.560 |
of, like, a schema for a table in a database, a question, and then the answer is 01:21:35.800 |
the correct SQL to solve that question using that database schema. And so I'm hoping we could use 01:21:46.760 |
this to create a kind of handy tool for business users, 01:21:54.440 |
where they type some English question and SQL is generated for them automatically. I don't know 01:22:01.880 |
if it would actually work in practice or not, but this is just a little fun idea I thought we'd try out. 01:22:06.920 |
I know there's lots of startups and stuff out there trying to do this more seriously, 01:22:13.240 |
but this is quite cool because I actually got it working today in just a couple of hours. 01:22:19.160 |
So what we do is use the Hugging Face Datasets library. Just like the Hugging 01:22:29.320 |
Face Hub has lots of models stored on it, Hugging Face Datasets has lots of datasets stored on it, 01:22:36.520 |
and so instead of using transformers, which is what we use to grab models, we use datasets, 01:22:41.480 |
and we just pass in the name of the person and the name of their repo, and it grabs the dataset. 01:22:47.960 |
And so we can take a look at it, and it just has a training set with features, and so then I can 01:22:56.520 |
have a look at the training set. So here's an example, which looks a bit like what we've just seen. 01:23:05.560 |
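Something like this, where the repo name is a placeholder for whichever text-to-SQL dataset you're using:

```python
from datasets import load_dataset

# Placeholder "user/repo" name - substitute the actual text-to-SQL dataset you want.
ds = load_dataset("some-user/some-text-to-sql-dataset")
print(ds)                # shows the splits and their features
print(ds["train"][0])    # one example: schema (context), question, and the answer SQL
```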
So what we do now is we want to fine-tune a model. Now, we can do that in a notebook from 01:23:14.840 |
scratch - it takes, I don't know, 100 or so lines of code, it's not too much - but given the time constraints 01:23:20.840 |
here, and also because I thought, why not, why don't we just use something that's ready to go? 01:23:26.600 |
So for example, there's something called Axolotl, which is quite nice in my opinion. 01:23:30.680 |
Here it is - lovely, another very nice open source piece of software - and again you can just 01:23:39.320 |
pip install it, and it's got things like GPTQ and 16-bit and so forth ready to go. And so what I did 01:23:49.640 |
was - it basically has a whole bunch of examples of things that it already knows how to do; 01:23:57.320 |
it's got a Llama 2 example - so I copied the Llama 2 example and I created a SQL example. So basically I 01:24:05.400 |
just told it: this is the path to the dataset that I want, this is the type, and everything else 01:24:13.640 |
pretty much I left the same. And then I just ran this command, which is from their readme: accelerate 01:24:20.840 |
launch axolotl, passing in my YAML, and that took about an hour on my GPU. And at the end of the 01:24:29.640 |
hour, it had created a qlora-out directory. The Q stands for quantized - that's because I was creating 01:24:37.720 |
a smaller, quantized model - and LoRA I'm not going to talk about today, but LoRA is a very cool thing: 01:24:42.920 |
basically another thing that makes your models smaller, and also means you can use 01:24:50.040 |
bigger models on a smaller GPU for training. So I trained it, and then I thought, okay, let's 01:25:02.680 |
create our own example. So we're going to have this context and this question: 01:25:12.760 |
get the count of competition hosts by theme. And I'm not going to pass it an answer, so I'll just 01:25:21.000 |
ignore that. So again, I found out what prompt they were using, and created a sql_prompt 01:25:29.880 |
function. And so here's what I'm going to do: "Use the following contextual information to answer the 01:25:34.840 |
question. Context: CREATE TABLE..." - that's the context - "Question: List all competition hosts ordered in 01:25:41.560 |
ascending order." And then I tokenized that, called generate, and the answer was: SELECT COUNT(hosts) 01:25:56.520 |
, theme FROM farm_competition GROUP BY theme. That is correct! So I think that's pretty remarkable. 01:26:05.640 |
We have just built - you know, it took me like an hour to figure out how to do it, and then an hour 01:26:12.120 |
to actually do the training - and at the end of that, we've actually got something which is 01:26:18.680 |
converting prose into SQL based on our schema. So I think that's a really exciting idea. 01:26:27.240 |
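Roughly like this sketch - the exact prompt wording the fine-tune expects comes from the training config, so treat the text and the table/column names here as approximate; `gen` is the helper from earlier, pointed at the fine-tuned weights:

```python
def sql_prompt(context, question):
    # Same general shape as the prompt used during fine-tuning (approximate).
    return (f"Use the following contextual information to answer the question.\n\n"
            f"Context: {context}\n\n"
            f"Question: {question}\n\n"
            f"Answer: ")

context = "CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)"
question = "Get the count of competition hosts by theme."
print(gen(sql_prompt(context, question), max_new_tokens=40))
# Expected output along the lines of: SELECT COUNT(Hosts), Theme FROM farm_competition GROUP BY Theme
```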
The only other thing I do want to briefly mention is doing stuff on Macs. If you've 01:26:36.520 |
got a Mac, there are a couple of really good options: the options are MLC and llama.cpp 01:26:45.240 |
currently. MLC in particular, I think, is kind of underappreciated. It's a really nice 01:26:53.400 |
project where you can run language models on literally iPhones, Android, web browsers, 01:27:07.160 |
everything. It's really cool. And so I'm now actually on my Mac here, 01:27:15.000 |
and I've got a tiny little Python program called chat, and it's going to import the chat module, 01:27:26.760 |
it's going to import a quantized 7B, and it's going to ask the question "What is the meaning 01:27:38.280 |
of life?". So let's try it: python chat.py. 01:27:48.520 |
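Something along these lines - this is a guess at what such a chat.py might contain; MLC's Python API and package names have changed over time, so take the module, class, and model names here as assumptions:

```python
# Hypothetical chat.py - assumes the mlc_chat package and a pre-quantized Llama 2 7B
# chat model that has already been downloaded and compiled with MLC.
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
print(cm.generate(prompt="What is the meaning of life?"))
```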
Again, I just installed this earlier today - I haven't done that much stuff on Macs before - but I was pretty impressed to see that it is doing a good 01:27:57.720 |
job here: "What is the meaning of life? It is complex and philosophical... some people might find meaning 01:28:05.000 |
in their relationships with others, their impact on the world", etc., etc. Okay, and it's doing 9.6 tokens 01:28:14.120 |
per second. So there you go - there is a model running on a Mac. And then another option that 01:28:20.840 |
you've probably heard about is llama.cpp. Llama.cpp runs on lots of different things as well, 01:28:28.600 |
including Macs and also on CUDA. It uses a different format called GGUF, and you can again 01:28:37.320 |
use it from Python - even though it's a C++ thing, it's got a Python wrapper. So you can just 01:28:42.120 |
download, again from Hugging Face, a GGUF file - you can just go through, and there's lots of 01:28:52.200 |
different ones, they're all documented as to what's what, you can pick how big a file you want - you can 01:28:56.760 |
download it, and then you just say, okay, Llama, model path equals, pass in that GGUF file. It spits out 01:29:04.440 |
lots and lots and lots of gunk, and then, if I called that llm, you can then say 01:29:11.960 |
llm, question: name the planets of the solar system, 32 tokens. And there we are: one, Pluto (no longer 01:29:22.600 |
considered a planet), two Mercury, three Venus, four Earth, Mars, six... oh, it ran out of tokens. 01:29:28.760 |
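A minimal sketch of that using the llama-cpp-python wrapper - the GGUF file name is a placeholder for whichever one you downloaded:

```python
from llama_cpp import Llama

# Path to a GGUF file downloaded from Hugging Face (placeholder name).
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")

out = llm("Q: Name the planets of the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```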
So again, just to show you, there are all these different options. 01:29:34.600 |
I would say, if you've got an 01:29:38.840 |
Nvidia graphics card and you're a reasonably capable Python programmer, you'd probably be best 01:29:45.800 |
off using PyTorch and the Hugging Face ecosystem, but I think these things 01:29:53.880 |
might change over time as well, and certainly a lot of stuff is coming into llama.cpp pretty quickly now, 01:29:57.720 |
and it's developing very fast. As you can see, there's a lot of stuff that you can do right now 01:30:03.720 |
with language models, particularly if you're pretty comfortable as a Python programmer. 01:30:10.520 |
I think it's a really exciting time to get involved. In some ways, it's a frustrating time 01:30:15.880 |
to get involved, because it's very early and a lot of stuff has weird little edge 01:30:26.040 |
cases and it's tricky to install and stuff like that. There are a lot of great Discord channels, 01:30:33.720 |
however - fast.ai, we have our own Discord channel, so feel free to just Google for "fastai discord" 01:30:39.000 |
and drop in. We've got a channel called generative; feel free to ask any questions or tell us 01:30:45.480 |
about what you're finding. Yeah, it's definitely something where you want to be getting help from 01:30:50.760 |
other people on this journey, because it is very early days and people are still figuring 01:30:56.840 |
things out as we go. But I think it's an exciting time to be doing this stuff, and I'm 01:31:02.280 |
really enjoying it, and I hope that this has given some of you a useful starting point on your own 01:31:08.280 |
journey. So I hope you found this useful. Thanks for listening. Bye.