A Hackers' Guide to Language Models
Chapters
0:00 Introduction & Basic Ideas of Language Models
18:05 Limitations & Capabilities of GPT-4
31:28 AI Applications in Code Writing, Data Analysis & OCR
38:50 Practical Tips on Using OpenAI API
46:36 Creating a Code Interpreter with Function Calling
51:57 Using Local Language Models & GPU Options
59:33 Fine-Tuning Models & Decoding Tokens
65:37 Testing & Optimizing Models
70:32 Retrieval Augmented Generation
80:08 Fine-Tuning Models
86:00 Running Models on Macs
87:42 Llama.cpp & Its Cross-Platform Abilities
00:00:00.000 |
Hi, I am Jeremy Howard from Fast.ai and this is a hacker's guide to language models. 00:00:10.100 |
When I say a hacker's guide, what we're going to be looking at is a code first approach 00:00:15.260 |
to understanding how to use language models in practice. 00:00:20.220 |
So before we get started, we should probably talk about what is a language model. 00:00:25.940 |
I would say that this is going to make more sense if you know the kind of basics of deep learning. 00:00:35.480 |
If you don't, I think you'll still get plenty out of it and there'll be plenty of things you can take away. 00:00:41.160 |
But if you do have a chance, I would recommend checking out course.fast.ai, which is a free 00:00:46.040 |
course and specifically if you could at least kind of watch, if not work through the first 00:00:56.280 |
five lessons that would get you to a point where you understand all the basic fundamentals 00:01:01.800 |
of deep learning that will make this lesson tutorial make even more sense. 00:01:09.800 |
Maybe I shouldn't call this a tutorial, it's more of a quick run-through, so I'm going 00:01:13.760 |
to try to run through all the basic ideas of language models, how to use them, both open source ones and OpenAI's. 00:01:22.720 |
And it's all going to be based using code as much as possible. 00:01:27.800 |
So let's start by talking about what a language model is. 00:01:32.640 |
And so as you might have heard before, a language model is something that knows how to predict 00:01:36.720 |
the next word of a sentence or knows how to fill in the missing words of a sentence. 00:01:42.240 |
And we can look at an example of one: OpenAI has a language model, text-davinci-003, 00:01:49.200 |
and we can play with it by passing in some words and asking it to predict what the next word will be. 00:01:57.320 |
So if we pass in, "When I arrived back at the panda breeding facility after the extraordinary 00:02:02.440 |
rain of live frogs, I couldn't believe what I saw." 00:02:05.600 |
I just came up with that yesterday and I thought what might happen next. 00:02:16.040 |
Nat.dev lets us play with a variety of language models, and here I've selected text-davinci-003, 00:02:22.800 |
and I'll hit submit, and it starts printing stuff out. 00:02:27.600 |
The pandas were happily playing and eating the frogs that had fallen from the sky. 00:02:31.520 |
It was an amazing sight to see these animals taking advantage of such unique opportunity. 00:02:37.520 |
So quick measures to ensure the safety of the pandas and the frogs. 00:02:41.880 |
That's what happened after the extraordinary rain of live frogs at the panda breeding facility. 00:02:46.260 |
You'll see here that I've enabled show probabilities, which is a thing in Nat.dev where it shows how likely each predicted word was. 00:02:54.880 |
It's pretty likely the next word here is going to be "the," and after "the," since we're 00:02:58.960 |
talking about a panda breeding facility, it's going to be "pandas were," and what were 00:03:03.360 |
Well, they could have been doing a few things. 00:03:05.000 |
They could have been doing something happily, or the pandas were having, the pandas were 00:03:14.400 |
It thought it was 20% likely it's going to be happily, and what were they happily doing? 00:03:18.640 |
Could have been playing, hopping, eating, and so forth. 00:03:26.320 |
So they're eating the frogs "that", and then "had" almost certainly. 00:03:31.140 |
So you can see what it's doing at each point is it's predicting the probability of a variety of possible next words. 00:03:37.760 |
And depending on how you set it up, it will either pick the most likely one every time, 00:03:44.080 |
or you can change, muck around with things like p-values and temperatures to change what it picks. 00:03:53.680 |
So at each time, then it'll give us a different result, and this is kind of fun. 00:04:03.680 |
Frogs perched on the heads of some of the pandas, it was an amazing sight, et cetera, 00:04:18.560 |
Now you might notice here it hasn't predicted pandas, it's predicted "pand". 00:04:41.760 |
So you can see that it's not always predicting words, specifically what it's doing is predicting tokens. 00:04:48.080 |
Tokens are either whole words or subword units, pieces of a word, or it could even be punctuation 00:05:02.520 |
So for example, we can use the actual tokenization process to create tokens from a string. 00:05:10.680 |
We can use the same tokenizer that GPT uses by using tiktoken, and we can specifically 00:05:17.040 |
say we want to use the same tokenizer that that model, text-davinci-003, uses. 00:05:23.200 |
And so for example, when I earlier tried this, it talked about the frog splashing. 00:05:28.800 |
And so I thought, well, we'll encode they are splashing. 00:05:35.760 |
And what those numbers are, they're basically just lookups into a vocabulary that OpenAI in this case created, 00:05:42.000 |
and if you train your own models, you'll be automatically creating one, or your code will create one for you. 00:05:46.880 |
And if I then decode those, it says, oh, these numbers are " They", " are", " spl", "ashing". 00:05:57.920 |
And so put that all together, they are splashing. 00:06:00.160 |
So you can see that the space before the start of a word is also being encoded here. 00:06:12.240 |
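As a minimal sketch of that round trip, here is roughly what the tiktoken calls look like (assuming the tiktoken package is installed):

```python
import tiktoken

# use the same tokenizer that text-davinci-003 uses
enc = tiktoken.encoding_for_model("text-davinci-003")

toks = enc.encode("They are splashing")
print(toks)  # a short list of integer ids, i.e. lookups into the vocabulary

# decoding each id separately shows the subword pieces,
# roughly: ' They', ' are', ' spl', 'ashing'
print([enc.decode([t]) for t in toks])
```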
So these language models are quite neat, that they can work at all, but they're not of themselves particularly useful. 00:06:30.880 |
The basic idea of what ChatGPT, GPT-4, Bard, et cetera, are doing comes from a paper which 00:06:42.600 |
describes an algorithm that I created back in 2017 called ULMFit. 00:06:47.800 |
And Sebastian Ruder and I wrote a paper up describing the ULMFit approach, which was 00:06:52.040 |
the one that basically laid out what everybody's doing, how this system works. 00:06:59.960 |
Step one is language model training, but you'll see this is actually from the paper. 00:07:07.860 |
Now what language model pre-training does is this: it is the thing which predicts the next word. 00:07:14.160 |
And so in the original ULMFit paper, the algorithm I developed in 2017 that Sebastian Ruder and I then wrote up: 00:07:24.760 |
What I originally did was I trained this language model on Wikipedia. 00:07:28.920 |
Now what that meant is I took a neural network and a neural network is just a function. 00:07:35.960 |
If you don't know what it is, it's just a mathematical function that's extremely flexible 00:07:41.080 |
And initially it can't do anything, but using stochastic gradient descent or SGD, you can 00:07:47.120 |
teach it to do almost anything if you give it examples. 00:07:50.780 |
And so I gave it lots of examples of sentences from Wikipedia. 00:07:54.240 |
So for example, from the Wikipedia article for the birds, the birds is a 1963 American 00:07:59.760 |
natural horror thriller film produced and directed by Alfred, and then it would stop. 00:08:06.520 |
And so then the model would have to guess what the next word is. 00:08:10.400 |
And if it guessed Hitchcock, it would be rewarded. 00:08:14.040 |
And if it guessed something else, it would be penalized. 00:08:17.480 |
And effectively, basically it's trying to maximize those rewards. 00:08:20.840 |
It's trying to find a set of weights for this function that makes it more likely that it would predict Hitchcock. 00:08:26.880 |
And then later on in this article, it reads from Wikipedia, "Annie previously dated Mitch 00:08:32.080 |
but ended it due to Mitch's cold, overbearing mother Lydia, who dislikes any woman in Mitch's..." 00:08:39.040 |
Now you can see that filling this in actually requires being pretty thoughtful because there's 00:08:44.440 |
a bunch of things that like kind of logically could go there. 00:08:48.160 |
Like a woman could be in Mitch's closet, could be in Mitch's house. 00:08:57.120 |
And so, you know, you could probably guess in the Wikipedia article describing the plot 00:09:01.240 |
of the birds is actually any woman in Mitch's life. 00:09:05.440 |
Now to do a good job of solving this problem as well as possible of guessing the next word 00:09:13.920 |
of sentences, the neural network is gonna have to learn a lot of stuff about the world. 00:09:23.360 |
It's gonna learn that there are things called objects, that there's a thing called time, 00:09:28.840 |
that objects react to each other over time, that there are things called movies, that 00:09:35.000 |
movies have directors, that there are people, that people have names and so forth, and that 00:09:39.960 |
a movie director is Alfred Hitchcock and he directed horror films and so on and so forth. 00:09:48.000 |
It's gonna have to learn an extraordinary amount if it's gonna do a really good job of predicting the next word. 00:09:54.860 |
Now these neural networks specifically are deep neural networks, this is deep learning, 00:10:00.720 |
and in these deep neural networks which have, when I created this, I think it had like a 00:10:06.240 |
hundred million parameters, nowadays they have billions of parameters, it's got the ability 00:10:15.360 |
to create a rich hierarchy of abstractions and representations which it can build on. 00:10:23.740 |
And so this is really the key idea behind neural networks and language models, is that 00:10:31.800 |
if it's gonna do a good job of being able to predict the next word of any sentence in 00:10:36.780 |
any situation, it's gonna have to know an awful lot about the world, it's gonna have 00:10:41.240 |
to know about how to solve math questions or figure out the next move in a chess game 00:10:52.960 |
Now nobody says it's gonna do a good job of that, so it's a lot of work to create and 00:11:00.600 |
train a model that is good at that, but if you can create one that's good at that, it's 00:11:04.520 |
gonna have a lot of capabilities internally that it would have to be drawing on to be able to do that task well. 00:11:13.160 |
So the key idea here for me is that this is a form of compression, and this idea of the 00:11:20.280 |
relationship between compression and intelligence goes back many, many decades, and the basic 00:11:26.760 |
idea is that yeah, if you can guess what words are coming up next, then effectively you're 00:11:34.440 |
compressing all that information down into a neural network. 00:11:41.400 |
Now I said this is not useful of itself, well why do we do it? 00:11:45.620 |
Well we do it because we want to pull out those capabilities, and the way we pull out 00:11:50.480 |
those capabilities is we take two more steps. 00:11:53.440 |
The second step is we do something called language model fine-tuning, and in language 00:11:58.440 |
model fine-tuning we are no longer just giving it all of Wikipedia or nowadays we don't just 00:12:04.520 |
give it all of Wikipedia, but in fact a large chunk of the internet is fed in for pre-training. 00:12:12.160 |
In the fine-tuning stage we feed it a set of documents a lot closer to the final task 00:12:19.400 |
that we want the model to do, but it's still the same basic idea, it's still trying to predict the next word. 00:12:29.640 |
After that we then do a final classifier fine-tuning, and in the classifier fine-tuning this is 00:12:36.600 |
the kind of end task we're trying to get it to do. 00:12:41.040 |
Nowadays for these two steps very specific approaches are taken. 00:12:45.000 |
For the step two, the step B, the language model fine-tuning, people nowadays do a particular thing called instruction tuning. 00:12:53.120 |
The idea is that the task we want most of the time to achieve is solve problems, answer 00:13:00.280 |
questions, and so in the instruction tuning phase we use datasets like this one. 00:13:06.560 |
This is a great dataset called OpenOrca created by a fantastic open source group, and it's 00:13:14.560 |
built on top of something called the flan collection. 00:13:18.880 |
You can see that basically there's all kinds of different questions in here, so there's 00:13:25.000 |
four gigabytes of questions and context and so forth. 00:13:32.240 |
Each one generally has a question or an instruction or a request and then a response. 00:13:44.280 |
I think this is from the flan dataset if I remember correctly. 00:13:47.560 |
So for instance it could be: does the sentence "In the Iron Age" answer the question "The period 00:13:52.920 |
of time from 1200 to 1000 BCE is known as what?"; choices: 1. yes or 2. no; and then the 00:13:59.600 |
language model is meant to write one or two as appropriate for yes or no, or it could 00:14:07.600 |
be things about, I think this is from a music video, who is the girl in "More Than You Know", 00:14:13.600 |
answer: and then it would have to write the correct name of the model or dancer 00:14:18.400 |
or whatever from that music video, and so forth. 00:14:23.080 |
So it's still doing language modeling so fine-tuning and pre-training are kind of the same thing 00:14:29.760 |
but this is more targeted now not just to be able to fill in the missing parts of any 00:14:36.100 |
document from the internet but to fill in the words necessary to answer questions and follow instructions. 00:14:45.920 |
Okay so that's instruction tuning and then step three which is the classifier fine-tuning 00:14:53.200 |
nowadays there's generally various approaches such as reinforcement learning from human 00:14:58.160 |
feedback and others which are basically giving humans or sometimes more advanced models multiple 00:15:10.180 |
answers to a question such as here are some from a reinforcement learning from human feedback 00:15:15.360 |
paper I can't remember which one I got it from, list five ideas for how to regain enthusiasm 00:15:20.200 |
for my career and so the model will spit out two possible answers or it will have a less 00:15:26.600 |
good model and a more good model and then a human or a better model will pick which 00:15:32.040 |
is best and so that's used for the final fine-tuning stage. 00:15:38.120 |
So all of that is to say although you can download pure language models from the internet 00:15:48.320 |
they're not generally that useful on their own until you've fine-tuned them now you don't 00:15:54.800 |
necessarily need step C nowadays actually people are discovering that maybe just step 00:15:58.760 |
B might be enough it's still a bit controversial. 00:16:02.920 |
Okay so when we talk about a language model where we could be talking about something 00:16:09.400 |
that's just been pre-trained something that's been fine-tuned or something that's gone through 00:16:13.960 |
something like RLHF all of those things are generally described nowadays as language models. 00:16:23.140 |
So my view is that if you are going to be good at language modeling in any way then 00:16:31.000 |
you need to start by being a really effective user of language models and to be a really 00:16:36.080 |
effective user of language models you've got to use the best one that there is and currently 00:16:41.400 |
so what are we up to September 2023 the best one is by far GPT-4 this might change sometime 00:16:50.560 |
in the not-too-distant future but this is right now GPT-4 is the recommendation strong strong 00:16:55.280 |
recommendation now you can use GPT-4 by paying 20 bucks a month to OpenAI and then you can 00:17:03.320 |
use it a whole lot; it's very hard to run out of credits I find. 00:17:11.000 |
Now what can GPT-4 do? It's interesting and instructive in my opinion to start with the very common 00:17:19.360 |
views you see on the internet or even in academia about what it can't do. 00:17:23.640 |
So for example there was this paper you might have seen GPT-4 can't reason which describes 00:17:30.420 |
a number of empirical analyses of 25 diverse reasoning problems and found that 00:17:38.780 |
it was not able to solve them it's utterly incapable of reasoning. 00:17:44.840 |
So I always find you got to be a bit careful about reading stuff like this because I just 00:17:50.840 |
took the first three that I came across in that paper and I gave them to GPT-4 and by 00:18:00.640 |
the way something very useful in GPT-4 is you can click on the share button and you'll 00:18:09.400 |
get something that looks like this and this is really handy. 00:18:12.880 |
So here's an example of something from the paper that said GPT-4 can't do this Mabel's 00:18:18.760 |
heart rate at 9am was 75 beats per minute her blood pressure at 7pm was 120 over 80 00:18:25.200 |
she died at 11pm; was she alive at noon? Of course, as humans we know obviously she 00:18:30.120 |
must have been, and GPT-4 says hmm this appears to be a riddle not a real inquiry into medical 00:18:37.560 |
conditions here's a summary of the information and yeah sounds like Mabel was alive at noon 00:18:47.480 |
so that's correct; this was the second one I tried from the paper that says GPT-4 can't 00:18:52.280 |
do this, and I found actually GPT-4 can do it; and the third one also said GPT-4 can't do this, 00:19:00.200 |
and I found GPT-4 can do it. Now I mention this to say GPT-4 is probably a lot better 00:19:07.480 |
than you would expect if you've read all this stuff on the internet about all the dumb things 00:19:14.480 |
that it does almost every time I see on the internet saying something that GPT-4 can't 00:19:21.640 |
do I check it and it turns out it does this one was just last week Sally a girl has three 00:19:27.720 |
brothers each brother has two sisters how many sisters does Sally have so have a think 00:19:34.000 |
about it and so GPT-4 says okay Sally is counted as one sister by each of her brothers if each 00:19:43.760 |
brother has two sisters that means there's another sister in the picture apart from Sally 00:19:48.720 |
so Sally has one sister correct and then this one I saw just like three or four days 00:19:59.240 |
ago this is a common view that language models can't track things like this there is the 00:20:07.520 |
riddle I'm in my house on top of my chair in the living room is a coffee cup inside 00:20:11.640 |
the coffee cup is a thimble inside the thimble is a diamond I move the chair to the bedroom 00:20:16.980 |
I put the coffee cup on the bed I turn the cup upside down then I return it upright I 00:20:21.200 |
place the coffee cup on the counter in the kitchen where's my diamond and so GPT-4 says 00:20:26.640 |
yeah okay you turned it upside down so probably the diamond fell out so therefore the diamond's 00:20:33.840 |
in the bedroom where it fell out; again correct. Why is it that people are claiming that GPT-4 00:20:44.000 |
can't do these things and it can well the reason is because I think on the whole they 00:20:47.620 |
are not aware of how GPT-4 was trained GPT-4 was not trained at any point to give correct 00:20:58.520 |
answers GPT-4 was trained initially to give most likely next words and there's an awful 00:21:06.800 |
lot of stuff on the internet where the documents are not describing things that 00:21:11.400 |
are true: they could be fiction, they could be jokes, or it could just be that people 00:21:16.560 |
do say dumb stuff, so this first stage does not necessarily give you correct answers; the second 00:21:23.680 |
stage, with the instruction tuning, is also trying to give correct answers 00:21:31.480 |
but part of the problem is that then in the stage where you start asking people which 00:21:36.320 |
answer do they like better, people tended to say in these comparisons that they prefer 00:21:44.620 |
more confident answers and they often were not people who were trained well enough to 00:21:50.480 |
recognize wrong answers so there's lots of reasons that the SGD weight updates 00:21:58.040 |
from this process for stuff like GPT-4 don't particularly or don't entirely reward correct 00:22:05.440 |
answers but you can help it want to give you correct answers if you think about the LM 00:22:13.440 |
pre-training what are the kinds of things in a document that would suggest oh this is 00:22:19.340 |
going to be high quality information and so you can actually prime GPT-4 to give you high 00:22:28.480 |
quality information by giving it custom instructions and what this does is this is basically text 00:22:36.960 |
that is prepended to all of your queries and so you say like oh you're brilliant at reasoning 00:22:44.240 |
so like okay that's obviously there to prime it to give good answers and then try to work 00:22:51.760 |
against the fact that the RLHF folks preferred confidence just tell it no tell me if there 00:23:02.240 |
might not be a correct answer also the way that the text is generated is it literally 00:23:09.840 |
generates the next word and then it puts all that whole lot back into the model and generates 00:23:16.240 |
the next next word puts that all back in the model generates the next next next word and 00:23:20.280 |
so forth that means the more words it generates the more computation it can do and so I literally 00:23:26.640 |
I tell it that right and so I say first spend a few sentences explaining background context 00:23:32.880 |
etc so this custom instruction allows it to solve more challenging problems and you can 00:23:48.480 |
see the difference here's what it looks like for example if I say how do I get a count 00:23:55.440 |
of rows grouped by value in pandas and it just gives me a whole lot of information which 00:24:01.720 |
is actually it thinking so I just skip over it and then it gives me the answer and actually 00:24:06.920 |
in my custom instructions I actually say if the request begins with VV actually make it 00:24:16.760 |
as concise as possible and so it kind of goes into brief mode and here is brief mode how 00:24:23.440 |
do I get the group this is the same thing but with VV at the start and it just spits 00:24:27.960 |
it out now in this case it's a really simple question so I didn't need time to think so 00:24:33.040 |
hopefully that gives you a sense of how to get language models to give good answers you 00:24:41.160 |
have to help them and if you it if it's not working it might be user error basically but 00:24:47.920 |
having said that there's plenty of stuff that language models like GPT-4 can't do one thing 00:24:54.600 |
to think carefully about is does it know about itself can you ask it what is your context 00:25:01.560 |
length how were you trained what transformer architecture are you based on; at any one of 00:25:10.320 |
these stages did it have the opportunity to learn any of those things well obviously not 00:25:16.240 |
at the pre-training stage nothing on the internet existed during GPT-4's training saying how 00:25:21.840 |
GPT-4 was trained right probably Ditto in the instruction tuning probably Ditto in the 00:25:28.520 |
RLHF so in general you can't ask for example a language model about itself now again because 00:25:36.320 |
of the RLHF it'll want to make you happy by giving you opinionated answers so it'll just 00:25:42.920 |
spit out the most likely thing it thinks with great confidence this is just a general kind 00:25:49.200 |
of hallucination right so hallucinations is just this idea that the language model wants 00:25:54.600 |
to complete the sentence and it wants to do it in an opinionated way that's likely to 00:25:59.260 |
make people happy it doesn't know anything about URLs it really hasn't seen many at all 00:26:07.600 |
I think a lot of them, if not all of them, were pretty much stripped out so if you ask 00:26:12.440 |
it anything about like what's at this web page again it'll generally just make it up 00:26:19.240 |
and it doesn't know at least GPT-4 doesn't know anything after September 2021 because 00:26:24.680 |
the information it was pre-trained on was from that time period September 2021 and before 00:26:32.920 |
called the knowledge cutoff so here's some things it can't do Steve Newman sent me this 00:26:40.100 |
good example of something that it can't do here is a logic puzzle I need to carry a cabbage 00:26:49.100 |
a goat and a wolf across a river I can only carry one item at a time I can't leave the 00:26:54.440 |
goat with a cabbage I can't leave the cabbage with the wolf how do I get everything across 00:26:59.440 |
to the other side now the problem is this looks a lot like something called the classic 00:27:06.100 |
river crossing puzzle so classic in fact that it has a whole Wikipedia page about it and 00:27:16.440 |
in the classic puzzle the wolf will eat the goat or the goat will eat the cabbage now 00:27:25.160 |
in Steve's version he changed it: the goat would eat the cabbage and the wolf would eat 00:27:37.320 |
the cabbage but the wolf won't eat the goat so what happens well very interestingly GPT-4 00:27:45.960 |
here is entirely overwhelmed by the language model training it's seen this puzzle so many 00:27:50.680 |
times it knows what word comes next so it says oh yeah I take the goat across 00:27:56.120 |
the river and leave it on the other side leaving the wolf with the cabbage but we're 00:28:00.560 |
just told you can't leave the wolf with a cabbage so it gets it wrong now the thing 00:28:07.560 |
is though you can encourage GPT-4 or any of these language models to try again so during 00:28:14.120 |
the instruction tuning and RLHF they're actually fine-tuned with multi-stage conversations so 00:28:20.080 |
you can give it a multi-stage conversation repeat back to me the constraints I listed 00:28:24.560 |
what happened after step one is a constraint violated oh yeah yeah yeah I made a mistake 00:28:31.240 |
okay my new attempt instead of taking the goat across the river and leaving it on the 00:28:36.800 |
other side is I'll take the goat across the river and leave it on the other side it's 00:28:41.840 |
done the same thing oh yeah I did do the same thing okay I'll take the wolf across well 00:28:49.920 |
now the goats with a cabbage that still doesn't work oh yeah that didn't work either sorry 00:28:57.080 |
about that instead of taking the goat across the other side I'll take the goat across the 00:29:00.880 |
other side okay what's going on here right this is terrible well one of the problems 00:29:07.120 |
here is that not only is it so common on the internet to see this particular goat puzzle 00:29:16.720 |
that it's so confident it knows what the next word is also on the internet when you see 00:29:21.560 |
stuff which is stupid on a web page it's really likely to be followed up with more stuff that 00:29:28.600 |
is stupid once GPT-4 starts being wrong it tends to be more and more wrong it's very 00:29:39.080 |
hard to turn it around and start making it be right so you actually have to go back 00:29:46.520 |
and there's actually an edit button on these chats and so what you generally want to do 00:29:58.680 |
is if it's made a mistake is don't say oh here's more information to help you fix it 00:30:03.160 |
but instead go back and click the edit and change it here 00:30:16.300 |
and so this time it's not going to get confused so in this case actually fixing Steve's example 00:30:25.280 |
takes quite a lot of effort but I think I've managed to get it to work eventually and I 00:30:29.560 |
actually said oh sometimes people read things too quickly they don't notice things that can 00:30:33.880 |
trip them up then they apply some pattern and get the wrong answer you do the same thing 00:30:39.200 |
by the way so I'm going to trick you so before you're about to get tricked make sure you don't 00:30:45.800 |
get tricked here's the tricky puzzle and then also with my custom instructions it takes 00:30:51.220 |
time discussing it and this time it gets it correct it takes the cabbage across first 00:30:58.120 |
so it took a lot of effort to get to a point where it could actually solve this because 00:31:04.040 |
yeah when it's you know for things where it's been primed to answer a certain way again 00:31:11.560 |
and again and again it's very hard for it to not do that okay now something else super 00:31:19.880 |
helpful that you can use is what they call advanced data analysis in advanced data analysis 00:31:28.060 |
you can ask it to basically write code for you and we're going to look at how to implement 00:31:33.040 |
this from scratch ourself quite soon but first of all let's learn how to use it so I was 00:31:38.680 |
trying to build something that split a document into sections on third level 00:31:44.560 |
markdown headings so that's three hashes at the start of a line and I was doing it on 00:31:50.840 |
the whole of Wikipedia so using regular expressions was really slow so I said oh I want to speed 00:31:55.720 |
this up and it said okay here's some code which is great because then I can say okay 00:32:02.280 |
test it and include edge cases and so it then puts in the code creates extra cases tests 00:32:14.480 |
it and it says yep it's working however I just discovered it's not I noticed it's actually 00:32:21.640 |
removing the carriage return at the end of each sentence so I said fix that and 00:32:26.840 |
update your tests so it said okay so now it's changed the test update the test cases to 00:32:34.720 |
run them and oh it's not working so it says oh yeah fix the issue in the test cases no 00:32:44.200 |
it didn't work and you can see it's quite clever the way it's trying to fix it by looking 00:32:50.880 |
at the results and but as you can see it's not every one of these is another attempt 00:32:58.360 |
another attempt another attempt until eventually I gave up waiting and it's so funny each time 00:33:02.840 |
it's like debugging again okay this time I got to handle it properly and I gave up at 00:33:09.560 |
the point where it's like oh one more attempt so it didn't solve it interestingly enough 00:33:15.120 |
and you know, again, there are some limits to the amount of kind of logic 00:33:24.240 |
that it can do; this is really a very simple thing I asked it to do for me and so hopefully 00:33:29.480 |
you can see you can't expect even GPT-4 code interpreter, or advanced data analysis as it 00:33:36.040 |
is now called, to make it so you don't have to write code anymore you know it's not a 00:33:42.080 |
substitute for having programmers so but it can you know it can often do a lot as I'll 00:33:53.080 |
show you in a moment so for example actually OCR like this is something I thought was really 00:33:59.840 |
cool. You can just paste in or upload an image, so in GPT-4's 00:34:10.280 |
advanced data analysis, yeah, you can upload an image here, and then I wanted to basically 00:34:18.800 |
grab some text out of an image somebody had got a screenshot of their screen and I wanted 00:34:22.880 |
to add it; it was something saying oh this language model can't do this and I wanted 00:34:27.600 |
to try it as well so rather than retyping it I just uploaded that image my screenshot 00:34:31.800 |
and said can you extract the text from this image and it said oh yeah I could do that 00:34:36.320 |
I could use OCR and like so it literally wrote an OCR script and there it is just took a 00:34:45.880 |
few seconds so the difference here is it didn't really require it to think of much logic it could 00:34:54.000 |
just use a very very familiar pattern that it would have seen many times so this is generally 00:35:00.400 |
where I find language models excel is where it doesn't have to think too far outside the 00:35:05.560 |
box I mean it's great on kind of creativity tasks but for like reasoning and logic tasks 00:35:11.440 |
that are outside the box I find it not great but yeah it's great at doing code for a whole 00:35:17.120 |
wide variety of different libraries and languages having said that by the way Google also has 00:35:26.560 |
a language model called Bard. It's way less good than GPT-4 most of the time but there 00:35:31.880 |
is a nice thing that you can literally paste an image straight into the prompt and I just 00:35:37.520 |
typed OCR this and it didn't even have to go through code interpreter or whatever it 00:35:41.600 |
just said oh sure I've done it and there's the result of the OCR and then it even commented 00:35:48.480 |
on what it just OCR'd which I thought was cute and oh even more interestingly it 00:35:53.800 |
even figured out where the OCR text came from and gave me a link to it so I thought that 00:36:01.800 |
was pretty cool okay so there's an example of it doing well I'll show you one for this 00:36:09.120 |
talk I found really helpful I wanted to show you guys how much it costs to use the OpenAI 00:36:14.840 |
API but unfortunately when I went to the OpenAI web page it was like all over the 00:36:23.280 |
place the pricing information was on all separate tables and it was kind of a bit of a mess so 00:36:30.000 |
I wanted to create a table with all of the information combined like this and here's 00:36:38.080 |
how I did it I went to the OpenAI page I hit Cmd+A to select all and then I said 00:36:49.760 |
in ChatGPT create a table with the pricing information rows no summarization no information 00:36:56.040 |
not in this page every row should appear as a separate row in your output and I hit paste 00:37:01.000 |
now that was not very helpful to it because hitting paste it's got the navbar it's got 00:37:08.600 |
lots of extra information at the bottom it's got all of its footer etc but it's really 00:37:17.760 |
good at this stuff it did it first time so there was the markdown table so I copied and 00:37:23.040 |
pasted that into Jupyter and I got my markdown table and so now you can see at a glance the 00:37:30.240 |
cost of GPT-4, 3.5, etc. but then what I really wanted to do was show you that as a picture 00:37:38.600 |
so I just said oh chart the input row from this table, just pasted the table back, and 00:37:46.600 |
it did so; that's pretty amazing. Now, so let's talk about this pricing so so far we've used 00:37:54.600 |
ChatGPT which costs 20 bucks a month and there's no like per token cost or anything but if 00:38:00.480 |
you want to use the API from Python or whatever you have to pay per token which is approximately 00:38:06.000 |
per word maybe it's about one and a third tokens per word on average unfortunately in 00:38:13.600 |
the chart it did not include these headers, GPT-4 and GPT-3.5, so these first two ones are 00:38:18.480 |
GPT-4 and these two are GPT-3.5 so you can see the GPT-3.5 is way way cheaper and you 00:38:28.120 |
can see it here it's 0.03 versus 0.0015 so it's so cheap you can really play around with 00:38:38.960 |
it not worry and I want to give you a sense of what that looks like okay so why would 00:38:46.240 |
you use the OpenAI API rather than ChatGPT because you can do it programmatically so 00:38:53.640 |
you can you know you can analyze data sets you can do repetitive stuff it's kind of like 00:39:03.040 |
a different way of programming you know it's it's things that you can think of describing 00:39:09.480 |
but let's just look at the most simple example of what that looks like so if you pip install 00:39:12.960 |
openai then you can import ChatCompletion and then you can say okay ChatCompletion 00:39:20.600 |
.create using gpt-3.5-turbo and then you can pass in a system message this is basically 00:39:28.440 |
the same as custom instructions so okay you're an Aussie LLM that uses Aussie slang and analogies 00:39:33.640 |
wherever possible okay and so you can see I'm passing in an array here of messages so 00:39:39.160 |
the first is the system message and then the user message, which is "What is money?". 00:39:46.120 |
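As a rough sketch of that call, using the pre-1.0 openai Python package that this walkthrough is based on:

```python
import openai

aussie_sys = ("You are an Aussie LLM that uses Aussie slang "
              "and analogies whenever possible.")

c = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": aussie_sys},
              {"role": "user",   "content": "What is money?"}])

print(c.choices[0].message.content)  # the model's answer
print(c.usage)                       # prompt/completion token counts, used for billing
```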
GPT-3.5 returns a big embedded dictionary and the message content is: well, money is 00:39:57.040 |
like the oil that keeps the machinery of our economy running smoothly there you go just 00:40:03.640 |
like a koala loves its eucalyptus leaves we humans can't survive without this stuff so 00:40:08.800 |
there's the Aussie LLM's view of what is money. So really the main ones: I pretty much 00:40:17.440 |
always use GPT 4 and GPT 3.5 GPT 4 is just so so much better at anything remotely challenging 00:40:29.120 |
but obviously it's much more expensive so rule of thumb you know maybe try 3.5 turbo 00:40:34.000 |
first see how it goes if you're happy with the results then great if you're not, pony 00:40:40.000 |
up for the more expensive one okay so I just created a little function here called response 00:40:45.760 |
that will print out this nested thing and so now oh and so then the other thing to point 00:40:55.000 |
out here is that the result of this also has a usage field which contains how many tokens 00:41:03.200 |
was it so it's about 150 tokens so at $0.002 per thousand tokens, 00:41:14.800 |
for 150 tokens that means we just paid 0.03 cents ($0.0003) 00:41:25.600 |
to get that done so as you can see the cost is insignificant if we were using GPT-4 it 00:41:31.880 |
would be $0.03 per thousand so it would be half a cent so unless you're doing 00:41:42.680 |
many thousands of GPT 4 you're not going to be even up into the dollars and GPT 3.5 even 00:41:49.080 |
more than that but you know keep an eye on it open AI has a usage page and you can track 00:41:55.160 |
your usage now what happens when we are this is really important to understand when we 00:42:02.700 |
have a follow-up in the same conversation how does that work so we just asked what goat 00:42:13.240 |
means so for example Michael Jordan is often referred to as the goat for his exceptional 00:42:21.320 |
skills and accomplishments and Elvis and the Beatles referred to as goat due to their profound 00:42:28.040 |
influence and achievement so I could say what profound influence and achievements are you 00:42:38.840 |
referring to okay well I meant Elvis Presley and the Beatles did all these things now how 00:42:49.080 |
does that work how does this follow-up work well what happens is the entire conversation 00:42:55.240 |
is passed back and so we can actually do that here so here is the same system prompt here 00:43:03.960 |
is the same question right and then the answer comes back with role assistant and I'm going 00:43:10.120 |
to do something pretty cheeky I'm going to pretend that it didn't say money is like oil 00:43:17.240 |
I'm gonna say oh you actually said money is like kangaroos I thought what it's gonna do 00:43:23.880 |
okay so you can like literally invent a conversation in which the language model said something 00:43:29.520 |
different because this is actually how it's done in a multi-stage conversation there's 00:43:34.320 |
no state right there's nothing stored on the server you're passing back the entire conversation 00:43:40.920 |
again and telling it what it told you right so I'm going to tell it it's it told me that 00:43:47.280 |
money is like kangaroos and then I'll ask the user oh really in what way it's just kind 00:43:52.320 |
of cool because you can like see how it convinces you of of something I just invented oh let 00:43:59.400 |
me break it down for you cover just like kangaroos hop around and carry their joeys in their 00:44:03.080 |
pouch money is a means of carrying value around so there you go it's uh make your own analogy 00:44:08.920 |
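A sketch of that trick: you simply pass the whole conversation back, including an assistant turn you made up yourself (reusing the aussie_sys message from the earlier sketch):

```python
c = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",    "content": aussie_sys},
        {"role": "user",      "content": "What is money?"},
        # claim the assistant said something it never actually said...
        {"role": "assistant", "content": "Money is like kangaroos, actually."},
        {"role": "user",      "content": "Really? In what way?"}])

print(c.choices[0].message.content)  # it will happily explain the kangaroo analogy
```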
cool so I'll create a little function here that just puts these things together for us 00:44:17.160 |
system message if there is one, the user message, and returns the completion. 00:44:23.120 |
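That helper probably looks something like this sketch (this is the askgpt function referred to later on):

```python
def askgpt(user, system=None, model="gpt-3.5-turbo", **kwargs):
    "Send an optional system message plus a user message; return the completion."
    msgs = []
    if system:
        msgs.append({"role": "system", "content": system})
    msgs.append({"role": "user", "content": user})
    return openai.ChatCompletion.create(model=model, messages=msgs, **kwargs)
```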
And so now we can ask it what's the meaning of life passing in the Aussie system prompt: the meaning of 00:44:29.560 |
life is like trying to catch a wave on a sunny day at Bondi Beach okay there you go so um 00:44:35.520 |
what do you need to be aware of well as I said one thing is keep an eye on your usage 00:44:40.200 |
if you're doing it you know hundreds or thousands of times in a loop keep an eye on not spending 00:44:46.040 |
too much money but also if you're doing it too fast particularly the first day or two 00:44:50.880 |
you've got an account you're likely to hit the limits for the API and so the limits initially 00:45:00.120 |
are pretty low as you can see three requests per minute that's for free users, or paid users in their 00:45:11.880 |
first 48 hours and after that it starts going up and you can always ask for more I just 00:45:17.200 |
mentioned this because you're going to want to have a function that keeps an eye on that 00:45:23.080 |
and so what I did is I actually just went to Bing which has a somewhat crappy version 00:45:28.880 |
of GPT-4 nowadays but it can still do basic stuff for free and I said please show me python 00:45:35.440 |
code to call the openai API and handle rate limits and it wrote this code it's got a try 00:45:45.960 |
checks your rate limit errors grabs the retry after sleeps for that long and calls itself 00:45:55.160 |
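Something along these lines; the function and variable names here are illustrative, and the pre-1.0 SDK raises openai.error.RateLimitError when you hit the limit:

```python
import time

def call_api(prompt, model="gpt-3.5-turbo"):
    "Call the chat API, sleeping and retrying if we hit a rate limit."
    msgs = [{"role": "user", "content": prompt}]
    try:
        return openai.ChatCompletion.create(model=model, messages=msgs)
    except openai.error.RateLimitError as e:
        # wait however long the API asks us to, then try again
        retry_after = int(e.headers.get("retry-after", 60))
        print(f"Rate limit exceeded, waiting {retry_after} seconds...")
        time.sleep(retry_after)
        return call_api(prompt, model=model)
```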
and so now we can use that to ask for example what's the world's funniest joke and there 00:46:02.480 |
we go is the world's funniest joke so there's like the basic stuff you need to get started 00:46:11.800 |
using the OpenAI LLMs and yeah I'd definitely suggest spending plenty of time with that 00:46:24.720 |
so that you feel like you're really an LLM-using expert so what else can we do well let's 00:46:35.840 |
create our own code interpreter that runs inside Jupiter and so to do this we're going 00:46:43.400 |
to take advantage of a really nifty thing called function calling which is provided 00:46:49.680 |
by the OpenAI API, and in function calling, when we call our askgpt function, which is this little 00:46:57.920 |
one here we had room to pass in some keyword arguments that will be just passed along to 00:47:03.520 |
ChatCompletion.create, and one of those keyword arguments you can pass is functions 00:47:12.520 |
what on earth is that functions tells openai about tools that you have about functions 00:47:22.040 |
that you have so for example I created a really simple function called sums and it adds two 00:47:31.080 |
things in fact it adds two ints and I'm going to pass that function to ChatCompletion 00:47:44.560 |
.create; now you can't pass a Python function directly, you actually have to pass what's 00:47:51.200 |
called the JSON schema so you have to pass the schema for the function so I created this 00:47:58.120 |
nifty little function that you're welcome to borrow which uses Pydantic and also Python's 00:48:07.160 |
inspect module to automatically take a Python function and return the schema for it and 00:48:15.960 |
so this is actually what's going to get passed to OpenAI, so that it's going to know that there's 00:48:15.960 |
a function called sums it's going to know what it does and it's going to know what parameters 00:48:24.040 |
it takes what the defaults are and what's required so this is like when I first heard 00:48:31.860 |
about this I found this a bit mind bending because this is so different to how we normally 00:48:35.920 |
program computers where the key thing for programming the computer here actually is 00:48:41.280 |
the docstring: this is the thing that GPT-4 will look at and say oh what does this function 00:48:47.040 |
do, so it's critical that this describes exactly what the function does. 00:48:53.840 |
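Here is a sketch of those two pieces: the sums function with its all-important docstring, and a helper (which I'll call schema here) that uses Pydantic plus Python's inspect module to build the JSON schema that actually gets sent to OpenAI. Details may differ from the exact code used in the talk:

```python
from inspect import signature, Parameter
from pydantic import create_model

def sums(a: int, b: int = 1):
    "Adds a + b"   # this docstring is what GPT-4 reads to decide when to call the tool
    return a + b

def schema(f):
    "Build a JSON-schema description of `f` from its signature and docstring."
    kw = {n: (p.annotation, ... if p.default == Parameter.empty else p.default)
          for n, p in signature(f).parameters.items()}
    s = create_model(f"Input for {f.__name__}", **kw).schema()
    return dict(name=f.__name__, description=f.__doc__, parameters=s)

# schema(sums) is what gets passed via the `functions` keyword argument
```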
And so if I then say "what is six plus three"; right, now I really wanted to make sure it actually did it here, 00:49:03.480 |
so I gave it lots of prompts to say because obviously it knows how to do it itself without 00:49:08.520 |
calling sums so it'll only use your functions if it feels it needs to which is a weird concept 00:49:15.680 |
I mean I guess feels is not a great word to use but you kind of have to anthropomorphize 00:49:20.480 |
these things a little bit because they don't behave like normal computer programs so if 00:49:25.760 |
I ask GPT what is six plus three and tell it that there's a function called sums, then 00:49:33.080 |
it does not actually return the number nine instead it returns something saying please 00:49:39.400 |
call a function call this function and pass it these arguments so if I print it out there's 00:49:46.920 |
the arguments so I created a little function called call function and it goes into the 00:49:54.920 |
result of open AI grabs the function call checks that the name is something that it's 00:50:02.000 |
allowed to do grabs it from the global system table and calls it passing in the parameters 00:50:10.880 |
and so if I now say okay call the function that we got back, we finally get nine. 00:50:24.160 |
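A sketch of that call_func helper, assuming a whitelist that currently only allows sums:

```python
import json

funcs_ok = {"sums"}  # functions the model is allowed to call

def call_func(c):
    "Run the function call requested in completion `c` and return its result."
    fc = c.choices[0].message.function_call
    if fc.name not in funcs_ok:
        return print(f"Not allowed: {fc.name}")
    f = globals()[fc.name]
    return f(**json.loads(fc.arguments))
```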
So this is a very simple example, it's not really doing anything that useful, but what we could do 00:50:27.920 |
now is we can create a much more powerful function called Python and the Python function 00:50:39.000 |
executes code using Python and returns the result; now of course I didn't want my computer 00:50:49.840 |
to run arbitrary Python code that GPT-4 told it to without checking so I just got it to 00:50:56.000 |
check first, to say are you sure you want to do this. 00:51:09.760 |
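A sketch of that python tool; this illustrative version expects the generated code to assign its answer to a variable called result (as in the example below), and the confirmation prompt is just a check, not a hardened sandbox:

```python
def python(code: str):
    "Return result of executing `code` using python. Use for any required computations."
    if input(f"Proceed with execution?\n```\n{code}\n```\n(y/n) ").lower() != "y":
        return "#FAIL#"
    ns = {}
    exec(code, ns)           # run the model-supplied code
    return ns.get("result")  # pull out the `result` variable it set
```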
So now I can say: ask GPT "what is 12 factorial", with the system prompt "you can use Python for any required computations", and say okay here's a function 00:51:14.200 |
you've got available it's the Python function so if I now call this it will pass me back 00:51:23.240 |
again a completion object and here it's going to say okay I want you to call Python passing 00:51:29.240 |
in this argument and when I do it's going to go import math result equals blah and then 00:51:37.440 |
return result do I want to do that yes I do and there it is now there's one more step 00:51:47.840 |
which we can optionally do I mean we've got the answer we wanted but often we want the 00:51:51.780 |
answer in more of a chat format and so the way to do that is to again repeat everything 00:51:58.700 |
that you've passed in so far, but then instead of adding an assistant role response we have 00:52:06.880 |
to provide a function role response and simply put in here the result we got back from the 00:52:15.080 |
function and if we do that we now get the prose response: 12 factorial is equal to 479,001,600. 00:52:26.680 |
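Roughly, that final step adds a "function"-role message containing the tool's output and calls the API one more time (reusing the schema and call_func sketches from above):

```python
result = call_func(c)   # e.g. 479001600, returned by the python tool

c2 = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    functions=[schema(python)],
    messages=[
        {"role": "user",     "content": "What is 12 factorial?"},
        c.choices[0].message,   # the assistant turn containing the function_call
        {"role": "function", "name": "python", "content": str(result)}])

print(c2.choices[0].message.content)  # a prose answer mentioning 479,001,600
```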
Now, with functions like Python available, you can still ask 00:52:36.480 |
it about non-python things and it just ignores it if you don't need it right so you can have 00:52:44.440 |
a whole bunch of functions available that you've built to do whatever you need for the 00:52:50.840 |
stuff which the language model isn't familiar with and it'll still solve whatever it can 00:53:00.640 |
on its own and use your tools use your functions where possible okay so we have built our own 00:53:13.640 |
code interpreter from scratch I think that's pretty amazing so that is what you can do 00:53:26.000 |
with or some of the stuff you can do with open AI what about stuff that you can do on 00:53:34.580 |
your own computer well to use a language model on your own computer you're going to need 00:53:41.140 |
to use a GPU so I guess the first thing to think about is like do you want this does 00:53:50.580 |
it make sense to do stuff on your own computer what are the benefits there are not any open 00:54:00.160 |
source models that are as good yet as GPT-4 and I would have to say also like actually 00:54:08.280 |
OpenAI's pricing is really pretty good, so it's not immediately obvious that you 00:54:14.560 |
definitely want to kind of go in house but there's lots of reasons you might want to 00:54:20.080 |
and we'll look at some examples of them today one example you might want to go in house 00:54:26.680 |
is that you want to be able to ask questions about your proprietary documents or about 00:54:34.280 |
information after September 2021, the knowledge cutoff, or you might want to create your own 00:54:40.640 |
model that's particularly good at solving the kinds of problems that you need to solve 00:54:45.440 |
using fine-tuning and these are all things that you absolutely can get better than GPT-4 00:54:49.800 |
performance at, at work or at home, without too much money or trouble 00:54:57.240 |
so these are the situations in which you might want to go down this path and so you don't 00:55:01.960 |
necessarily have to buy a GPU on Kaggle they will give you a notebook with two quite old 00:55:09.000 |
GPUs attached and very little RAM but it's something or you can use CoLab and on CoLab 00:55:17.280 |
you can get much better GPUs than Kaggle has and more RAM particularly if you pay a monthly 00:55:25.520 |
subscription fee so those are some options for free or low-cost you can also of course 00:55:38.680 |
go to one of the many kind of GPU server providers and they change all the time as to what's 00:55:47.800 |
good or what's not. RunPod is one example and you can see you know if you want the biggest 00:55:56.560 |
and best machine you're talking $34 an hour so it gets pretty expensive but you can certainly 00:56:03.280 |
get things a lot cheaper 80 cents an hour. Lambda Labs is often pretty good you know 00:56:14.920 |
it's really hard at the moment to actually find let's see pricing to actually find people 00:56:24.120 |
that have them available so they've got lots listed here but they often have none or very 00:56:29.240 |
few available there's also something pretty interesting called vast AI which basically 00:56:37.640 |
lets you use other people's computers when they're not using them and as you can see 00:56:49.440 |
you know they tend to be much cheaper than other folks and then they tend to have better 00:56:56.760 |
availability as well but of course for sensitive stuff you don't want to be running it on some 00:57:00.680 |
randos computer so anyway so there's a few options for renting stuff you know I think 00:57:06.720 |
if you can it's worth buying something and definitely the one to buy at the moment is 00:57:10.480 |
the RTX 3090, used; you can generally get them from eBay for like 700 bucks or so. A 4090 00:57:21.040 |
isn't really better for language models even though it's a newer GPU the reason for that 00:57:26.560 |
is that language models are all about memory speed how quickly can you get in and stuff 00:57:31.760 |
in and out of memory rather than how fast is the processor and that hasn't really improved 00:57:35.640 |
a whole lot. So the 2,000 bucks for a 4090 isn't really worth it; the other thing, as well as memory speed, is memory size: 24 00:57:44.320 |
gigs doesn't quite cut it for a lot of things so you'd probably want to get two of 00:57:48.280 |
these GPUs so you're talking like $1500 or so or you can get a 48 gig RAM GPU it's called 00:57:58.080 |
an A6000 but this is going to cost you more like 5 grand so again getting two of these 00:58:06.280 |
is going to be a better deal and this is not going to be faster than these either. Or funnily 00:58:14.440 |
enough you could just get a Mac with a lot of RAM particularly if you get an M2 Ultra 00:58:20.840 |
Macs, particularly the M2 Ultra, have pretty fast memory; it's still going to be way slower 00:58:27.480 |
than using an Nvidia card but it's going to be like you're going to be able to get you 00:58:32.520 |
know like I think 192 gig or something so it's not a terrible option particularly if 00:58:42.120 |
you're not training models you just wanting to use other existing trained models. So anyway 00:58:52.600 |
most people who do this stuff seriously almost everybody has Nvidia cards. So then what we're 00:59:00.760 |
going to be using is a library called transformers from Hugging Face and the reason for that is 00:59:06.160 |
that basically people upload lots of pre-trained models or fine-tuned models up to the Hugging 00:59:13.280 |
Face hub and in fact there's even a leaderboard where you can see which are the best models. 00:59:20.960 |
Now this is a really fraught area; at the moment this one is meant to be the best model 00:59:30.240 |
it has the highest average score and maybe it is good, I haven't actually used that particular 00:59:35.520 |
model or maybe it's not I actually have no idea because the problem is these metrics 00:59:43.880 |
are not particularly well aligned with real life usage for all kinds of reasons and also 00:59:51.320 |
sometimes you get something called leakage which means that sometimes some of the questions 00:59:57.000 |
from these things actually leaks through to some of the training sets. So you can get 01:00:03.000 |
as a rule of thumb what to use from here but you should always try things and you can also 01:00:09.640 |
say you know these ones are all this 70 B here that tells you how big it is so this 01:00:14.200 |
is a 70 billion parameter model. So generally speaking for the kinds of GPUs we're talking 01:00:23.600 |
about you'll be wanting no bigger than 13 B and quite often 7B. So let's see if we can 01:00:33.240 |
find here's a 13 B model for example. All right so you can find models to try out from things 01:00:41.340 |
like this leaderboard and there's also a really great leaderboard called fast eval which I 01:00:47.520 |
like a lot because it focuses on some more sophisticated evaluation methods such as this 01:00:55.720 |
chain of thought evaluation method. So I kind of trust these a little bit more and these 01:01:02.080 |
are also you know GSM 8K is a difficult math benchmark big bench hard so forth. So yeah 01:01:11.440 |
so you know stable beluga 2 wizard math 13 B dolphin llama 13 B etc these would all be 01:01:18.360 |
good options. Yeah so you need to pick a model and at the moment nearly all the good models 01:01:28.000 |
are based on Meta's Llama 2. So when I say based on, what does that mean? Well what that 01:01:34.880 |
means is this model here, Llama 2 7B; so it's a Llama model, that's just the name 01:01:43.560 |
Meta calls it, this is their version 2 of Llama, this is their 7 billion size one, it's the 01:01:48.280 |
smallest one that they make and specifically these weights have been created for Hugging 01:01:53.080 |
Face so you can load it with the Hugging Face transformers library, and this model has only got as 01:01:58.760 |
far as here it's done the language model for pre-training it's done none of the instruction 01:02:03.720 |
tuning and none of the RLHF so we would need to fine tune it to really get it to do much 01:02:12.040 |
useful. So we can just say okay, automatically create the appropriate model for language 01:02:22.200 |
modeling, so causal LM basically refers to that ULMFit stage one process, or stage two 01:02:28.700 |
in fact, so get the pre-trained model from this name, meta-llama/Llama-2 blah blah. Okay 01:02:36.520 |
now generally speaking we use 16-bit floating point numbers nowadays but if you think about 01:02:48.720 |
it 16-bit is two bytes so 7B times two it's going to be 14 gigabytes just to load in the 01:02:59.640 |
weights so you've got to have a decent GPU to be able to do that; perhaps surprisingly 01:03:07.120 |
you can actually just cast it to 8-bit and it still works pretty well thanks to something 01:03:11.760 |
called quantization. So let's try that. 01:03:18.840 |
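A sketch of that load, assuming you've been granted access to the Llama 2 weights on the Hugging Face hub and have the bitsandbytes package installed for the 8-bit path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mn = "meta-llama/Llama-2-7b-hf"

# load the pre-trained weights in 8-bit, roughly halving the memory needed
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, load_in_8bit=True)

# or, for the 16-bit bfloat16 version discussed below (twice the memory, faster generation):
# model = AutoModelForCausalLM.from_pretrained(mn, device_map=0, torch_dtype=torch.bfloat16)
```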
Remember, this is just a language model: it can only complete sentences, we can't ask it a question and expect a great answer. So let's 01:03:22.980 |
just give it the start of a sentence: "Jeremy Howard is a". And so we need the right tokenizer, 01:03:28.400 |
so this will automatically create the right kind of tokenizer for this model we can grab 01:03:32.640 |
the tokens as PyTorch here they are and just to confirm if we decode them back again we 01:03:44.920 |
get back the original plus a special token to say this is the start of a document and 01:03:50.400 |
so we can now call generate so generate will auto-regressively so call the model again 01:04:00.080 |
and again, passing its previous result back as the next input, and I'm just 01:04:08.800 |
going to do that 15 times so this is you can you can write this for loop yourself this 01:04:13.520 |
isn't doing anything fancy in fact I would recommend writing this yourself to make sure 01:04:18.400 |
that you know how that it all works okay we have to put those tokens on the GPU and at 01:04:26.640 |
the end I recommend putting them back onto the CPU the result and here are the tokens 01:04:31.520 |
not very interesting so we have to decode them using the tokenizer and so the first 01:04:36.160 |
15 tokens are: Jeremy Howard is a 28 year old Australian AI researcher and 01:04:42.440 |
entrepreneur. Okay, well, 28 years old is not exactly correct, but we'll call it close enough. 01:04:47.440 |
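Putting the steps just described together, a compact sketch of the tokenize, generate, decode loop (continuing from the model loaded in the previous sketch):

```python
tokr = AutoTokenizer.from_pretrained(mn)

prompt = "Jeremy Howard is a "
toks = tokr(prompt, return_tensors="pt")   # token ids as PyTorch tensors

# auto-regressively generate 15 new tokens on the GPU, then move the result back to the CPU
res = model.generate(**toks.to("cuda"), max_new_tokens=15).to("cpu")

print(tokr.batch_decode(res))              # decode the ids back into text
```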
I like that thank you very much llama 7b so okay so we've got a language model completing 01:04:54.400 |
sentences it took one and a third seconds and that's a bit slower than it could be because 01:05:04.400 |
we used 8-bit; if we use 16-bit there's a special thing called bfloat16 which is a really 01:05:11.240 |
great 16-bit floating point format that's usable on any somewhat recent Nvidia 01:05:17.840 |
GPU now if we use it it's going to take twice as much RAM as we discussed but look at the 01:05:24.240 |
time: it's come down to 390 milliseconds. Now there is a better option still than even that: 01:05:34.640 |
there's a different kind of quantization called GPTQ, where a model is carefully optimized 01:05:43.160 |
to work with 4-bit or 8-bit or other lower-precision data automatically. And this particular 01:05:55.320 |
person, known as TheBloke, is fantastic at taking popular models, running that optimization 01:06:02.080 |
process, and then uploading the results back to Hugging Face. So we can use this GPTQ version, 01:06:12.040 |
and internally - I'm not sure exactly how many bits this particular 01:06:16.200 |
one is, I think it's probably going to be four bits - it's going to be much more optimized. 01:06:22.240 |
And so look at this: 270 milliseconds. It's actually faster than 16-bit, even though internally 01:06:30.640 |
it's actually casting it up to 16-bit for each layer to do the compute. That's because there's a lot 01:06:35.420 |
less memory moving around. And to confirm, in fact, what we could even do now is go to 01:06:42.000 |
13B easily, and in fact it's still faster than the 7B now that we're using the GPTQ version. 01:06:49.280 |
So this is a really helpful tip. So let's put all those things together - the tokenizer, the 01:06:55.360 |
generate, the batch decode - into a function we'll call gen, for generate. And so we can now use the 01:06:59.800 |
13B GPTQ model, and let's try this: "Jeremy Howard is a", and it got to 50 tokens so fast: "16-year 01:07:08.760 |
veteran of Silicon Valley, co-founder of Kaggle, a marketplace for predictive modelling; his company 01:07:13.720 |
Kaggle.com has become to data science competitions what..." - I don't know what it was going to say, but 01:07:17.720 |
anyway it's on the right track. I was actually there for 10 years, not 16, but that's all right. 01:07:22.520 |
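A minimal sketch of that gen helper - the repo id here is one of TheBloke's GPTQ conversions (check the exact name on Hugging Face), and loading GPTQ weights through transformers generally needs the optimum and auto-gptq packages installed as well:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for a GPTQ-quantized Llama 2 13B from TheBloke.
mn = "TheBloke/Llama-2-13B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(mn, device_map=0)
tokenizer = AutoTokenizer.from_pretrained(mn)

def gen(prompt, max_new_tokens=50):
    """Tokenize, generate, and decode in one go."""
    toks = tokenizer(prompt, return_tensors="pt").to(0)
    res = model.generate(**toks, max_new_tokens=max_new_tokens).to("cpu")
    return tokenizer.batch_decode(res)

print(gen("Jeremy Howard is a"))
```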
Okay, so this is looking good, but probably a lot of the time we're going to be interested 01:07:32.680 |
in asking questions or using instructions. So Stability AI has this nice series called 01:07:39.000 |
Stable Beluga, including a small 7B one and other bigger ones, and these are all based on Llama 2, 01:07:46.200 |
but these have been instruction tuned - they might even have been RLHF'd, I can't remember now. So we 01:07:53.480 |
can create a Stable Beluga model, and now something really important that I keep forgetting - everybody 01:08:02.040 |
keeps forgetting - is that during the instruction tuning process, 01:08:14.200 |
the instructions that are passed in don't just appear like this; they 01:08:26.760 |
always are in a particular format, and the format, believe it or not, changes quite a bit 01:08:33.000 |
from fine-tune to fine-tune. And so you have to go to the webpage for the model 01:08:39.800 |
and scroll down to find out what the prompt format is. So here's the prompt format. I 01:08:48.680 |
generally just copy it and then paste it into Python, which I did here, and created a function 01:09:01.480 |
called make_prompt that uses the exact same format that it said to use. 01:09:09.960 |
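For example, something along these lines - the prompt markers and system message are reproduced from memory of the Stable Beluga model card, so double-check them against the actual card; `gen` is the helper sketched earlier, here pointed at the Stable Beluga weights:

```python
# System message and format as described on the Stable Beluga model card (from memory).
sys_msg = "You are Stable Beluga, an AI that follows instructions extremely well. Help as much as you can."

def make_prompt(user_question):
    return f"### System:\n{sys_msg}\n\n### User:\n{user_question}\n\n### Assistant:\n"

print(gen(make_prompt("Who is Jeremy Howard?")))
```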
So now if I want to ask "Who is Jeremy Howard?", I can call gen again - that was the function I created up here - and make the 01:09:16.920 |
correct prompt from that question, and then it returns back... okay, so you can see here all this 01:09:24.360 |
prefix: this is a system instruction, this is my question, and then the assistant says "Jeremy Howard 01:09:31.160 |
is an Australian entrepreneur, computer scientist, co-founder of machine learning and deep learning 01:09:35.400 |
company fast.ai". Okay, this one's actually all correct, so it's getting better by using an actual 01:09:41.960 |
instruction-tuned model. And so we could then start to scale up: we could use the 13B, and in fact 01:09:50.920 |
we looked briefly at this OpenOrca dataset earlier. So Llama 2 has been fine-tuned on Open 01:09:58.520 |
Orca, and then also fine-tuned on another really great dataset called Platypus, and so the whole 01:10:05.720 |
thing together is the OpenOrca Platypus, and then this is going to be the bigger 13B; 01:10:11.560 |
GPTQ means it's going to be quantized. That's got a different format - okay, a different prompt 01:10:19.240 |
format - so again we can scroll down and see what the prompt format is. There it is. Okay, and so 01:10:27.720 |
we can create a function called make_open_orca_prompt that has that prompt format. 01:10:37.320 |
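Again, a rough sketch from memory of what that might look like - the exact instruction/response markers should be confirmed against the OpenOrca Platypus model card:

```python
def make_open_orca_prompt(user_question):
    # Alpaca-style "### Instruction: / ### Response:" markers, reproduced from memory
    # of the model card - check the card for the exact format before relying on it.
    return f"### Instruction:\n\n{user_question}\n\n### Response:\n"

print(gen(make_open_orca_prompt("Who is Jeremy Howard?")))
```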
And so now we can say, okay, who is Jeremy Howard? And now I've become British, which is kind of true - 01:10:42.120 |
I was born in England but I moved to Australia. A professional poker player - definitely not that. 01:10:47.720 |
Co-founding several companies including fast.ai, also Kaggle - okay, so not bad. It was acquired 01:10:56.040 |
by Google - was it 2017? Probably something around there. Okay, so you can see we've got our own models 01:11:04.280 |
giving us some pretty good information. How do we make it even better? Because it's 01:11:11.720 |
still hallucinating. And Llama 2, I think, has been trained with 01:11:23.000 |
more up-to-date information than GPT-4 - it doesn't have the September 2021 cutoff - but 01:11:30.920 |
it's still got a knowledge cutoff. We would like to use the most up-to-date 01:11:35.320 |
information; we want to use the right information to answer these questions as well as possible. 01:11:39.720 |
So to do this, we can use something called retrieval augmented generation. 01:11:45.000 |
What happens with retrieval augmented generation is we take the question 01:11:53.800 |
we've been asked, like "Who is Jeremy Howard?", and then we say, okay, let's try and search for 01:12:02.280 |
documents that may help us answer that question. So obviously we would expect, for example, 01:12:09.880 |
Wikipedia to be useful. And then what we do is we say, okay, with that information, let's now see 01:12:19.320 |
if we can tell the language model about what we found, and then have it answer the question. 01:12:26.520 |
So let me show you. Let's actually grab a Wikipedia Python package; 01:12:35.480 |
we will scrape Wikipedia, grabbing the Jeremy Howard web page, 01:12:41.160 |
and so here's the start of the Jeremy Howard Wikipedia page. 01:12:48.760 |
It has 613 words. Now, generally speaking, these open source models will have a context length 01:12:54.600 |
of about 2,000 or 4,000 - the context length is how many tokens it can handle - so that's fine, it'll 01:13:01.080 |
be able to handle this web page. And what we're going to do is ask it the question: 01:13:06.440 |
we're going to have "Question:" followed by the question, but before it we're going to say "Answer 01:13:10.600 |
the question with the help of the context" that we're going to provide to the language model, and 01:13:14.760 |
we're going to say "Context:" and it's going to have the whole web page. So suddenly now our 01:13:19.240 |
prompt is going to be a lot bigger: our prompt now contains the entire web 01:13:28.520 |
page, the whole Wikipedia page, followed by our question. 01:13:38.360 |
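Something like this rough sketch - the `wikipedia` package and the page title here are assumptions (any way of fetching the page text works), the prompt wording is approximate, and `gen`/`make_prompt` are the helpers sketched earlier:

```python
import wikipedia  # simple package for grabbing Wikipedia page text

# Assumed page title for the relevant article.
page_text = wikipedia.page("Jeremy Howard (entrepreneur)").content
question = "Who is Jeremy Howard?"

# Stuff the retrieved document into the prompt ahead of the question.
rag_prompt = f"""Answer the question with the help of the provided context.

## Context

{page_text}

## Question

{question}"""

print(gen(make_prompt(rag_prompt), max_new_tokens=300))
```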
And so now it says: "Jeremy Howard is an Australian data scientist, entrepreneur and educator known for his work in deep learning, 01:13:42.920 |
co-founder of fast.ai, teaches courses, develops software, conducts research, used to be..." - yeah, okay, 01:13:49.240 |
it's perfect, right? So it's actually done a really good job. Like, if somebody asked me to send them a 01:13:56.280 |
100-word bio, that would actually probably be better than I would have written 01:14:02.760 |
myself. And you'll see, even though I asked for 300 tokens, it actually sent back the end-of-stream 01:14:10.680 |
token, so it knows to stop at this point. Well, that's all very well, but how do we know 01:14:19.480 |
to pass in the Jeremy Howard Wikipedia page? Well, the way we know which Wikipedia page to pass in 01:14:25.960 |
is that we can use another model to tell us which web page, or which document, is the most 01:14:34.120 |
useful for answering a question. And the way we do that is we can use something called sentence 01:14:44.760 |
transformers, and we can use a special kind of model that's specifically designed to take a document 01:14:51.800 |
and turn it into a bunch of activations, where two documents that are similar will have similar 01:14:59.960 |
activations. So let me show you what I mean. What I'm going to do is grab 01:15:05.320 |
just the first paragraph of my Wikipedia page, and I'm going to grab the first paragraph of Tony 01:15:12.520 |
Blair's Wikipedia page. Okay, so we're pretty different people, right? This is just a really 01:15:18.120 |
simple, small example. And I'm going to then call this model - I'm going to say encode - and I'm going 01:15:25.240 |
to encode my Wikipedia first paragraph, Tony Blair's first paragraph, and the question, which was "Who 01:15:33.640 |
is Jeremy Howard?", and it's going to pass back a 384-long vector of embeddings for the question, 01:15:45.320 |
for me, and for Tony Blair. And what I can now do is calculate the similarity 01:15:53.960 |
between the question and the Jeremy Howard Wikipedia page, 01:15:57.480 |
and I can also do it for the question versus the Tony Blair Wikipedia page, and as you can see it's 01:16:04.200 |
higher for me. And so that tells you that if you're trying to figure out what document to use to help 01:16:11.240 |
you answer this question, you're better off using the Jeremy Howard Wikipedia page than the Tony Blair one. 01:16:16.600 |
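A minimal sketch of that comparison - the embedding model name is just an illustrative choice of a small 384-dimensional model, not necessarily the one used in the video, and the paragraph strings are placeholders:

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# A small model that produces 384-dimensional embeddings.
emb_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

question = "Who is Jeremy Howard?"
jh_para = "Jeremy Howard is an Australian data scientist and entrepreneur..."   # first paragraph of his page
tb_para = "Sir Anthony Charles Lynton Blair is a British former politician..."  # first paragraph of Tony Blair's page

q_emb, jh_emb, tb_emb = emb_model.encode([question, jh_para, tb_para], convert_to_tensor=True)

# Cosine similarity: higher means the document looks more relevant to the question.
print(F.cosine_similarity(q_emb, jh_emb, dim=0))
print(F.cosine_similarity(q_emb, tb_emb, dim=0))
```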
So if you had a few hundred documents you were thinking of using to give 01:16:26.440 |
back to the model as context to help it answer a question, you could literally just pass them all 01:16:31.960 |
through encode, go through each one, one at a time, and see which is closest. When you've got 01:16:38.280 |
thousands or millions of documents, you can use something called a vector database, where basically, 01:16:45.160 |
as a one-off thing, you go through and encode all of your documents. And in fact, there's 01:16:54.600 |
lots of pre-built systems for this. Here's an example of one called h2oGPT 01:17:04.280 |
that I've got running here on my computer. It's just an open source thing, 01:17:14.520 |
written in Python, sitting here running on port 7860, and so I've just gone to localhost:7860. 01:17:20.440 |
And what I did was I just clicked upload, and I just 01:17:27.640 |
uploaded a bunch of papers - in fact, I might be able to see it better... yeah, here we go, a bunch of papers. 01:17:39.320 |
Can we search? Yes, I can. So for example, we can look at the ULMFiT paper that 01:17:45.800 |
Sebastian Ruder and I did, and you can see it's taken the PDF and turned it, 01:17:51.000 |
slightly crappily, into a text format, and then it's created an embedding for each 01:18:00.280 |
section. So I could then ask it, "What is ULMFiT?", and I'll hit enter, 01:18:13.000 |
and you can see here it's now actually saying "based on the information provided in the context", so 01:18:18.440 |
it's showing us it's been given some context. What context did it get? So here are the things that it 01:18:23.320 |
found, right? So it's being sent this context - this is kind of like citations: 01:18:32.920 |
"the goal of ULMFiT is to improve the performance by leveraging the knowledge and adapting it to 01:18:41.320 |
the specific task at hand". How - what techniques, be more specific - does ULMFiT use? Let's see how it goes. 01:18:55.800 |
Okay, there we go. So here's the three steps: pre-train, fine-tune, fine-tune. Cool. So you can 01:19:07.080 |
see it's not bad, right? It's not amazing - the context in this particular case is 01:19:14.440 |
pretty small - and in particular, if you think about how that embedding thing worked, 01:19:22.040 |
you can't really use the normal kind of follow-up. So for example, it says 01:19:29.880 |
"fine-tuning a classifier", so I could say, "What classifier is used?" Now the problem is that there's 01:19:38.280 |
no context here being sent to the embedding model, so it's actually going to have no idea I'm talking 01:19:42.760 |
about ULMFiT, so generally speaking it's going to do a terrible job. Yeah, see, it says it used a 01:19:49.480 |
RoBERTa model, but it's not - and if I look at the sources, it's no longer actually referring to Howard 01:19:54.920 |
and Ruder. So anyway, you can see the basic idea. This is called retrieval augmented generation, RAG, 01:20:01.960 |
and it's a nifty approach, but you have to do it with some care. And so there are lots 01:20:12.680 |
of these private GPT things out there - actually the h2oGPT webpage does a fantastic job of listing them - 01:20:25.960 |
so as you can see, if you want to run a private GPT, there's no shortage of options, 01:20:36.680 |
and you can have your retrieval augmented generation. 01:20:40.840 |
I haven't tried them all - I've only tried this one, h2oGPT. I don't love it; it's all right. 01:20:50.520 |
So finally, I want to talk about what's perhaps the most interesting option we have, which is to 01:20:56.360 |
do our own fine-tuning. And fine-tuning is cool because, rather than just retrieving documents 01:21:01.880 |
which might have useful context, we can actually change our model to behave based on the documents 01:21:08.920 |
that we have available. And I'm going to show you a really interesting example of fine-tuning here. 01:21:14.200 |
What we're going to do is fine-tune using this text-to-SQL dataset, and it's got examples 01:21:23.560 |
of, like, a schema for a table in a database, a question, and then the answer is 01:21:35.800 |
the correct SQL to solve that question using that database schema. And so I'm hoping we could use 01:21:46.760 |
this to create a kind of handy tool for business users, 01:21:54.440 |
where they type some English question and SQL is generated for them automatically. I don't know 01:22:01.880 |
if it would actually work in practice or not, but this is just a little fun idea I thought we'd try out. 01:22:06.920 |
I know there's lots of startups and stuff out there trying to do this more seriously, 01:22:13.240 |
but this is quite cool because I actually got it working today in just a couple of hours. 01:22:19.160 |
So what we do is use the Hugging Face Datasets library. Just like the Hugging 01:22:29.320 |
Face Hub has lots of models stored on it, Hugging Face Datasets has lots of datasets stored on it, 01:22:36.520 |
and so instead of using transformers, which is what we use to grab models, we use datasets, 01:22:41.480 |
and we just pass in the name of the person and the name of their repo, and it grabs the dataset. 01:22:47.960 |
And so we can take a look at it, and it just has a training set with features, and so then I can 01:22:56.520 |
have a look at the training set. So here's an example, which looks a bit like what we've just seen. 01:23:05.560 |
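Something like this, where the repo name is a placeholder for whichever text-to-SQL dataset you're using:

```python
from datasets import load_dataset

# Placeholder "user/repo" name - substitute the actual text-to-SQL dataset you want.
ds = load_dataset("some-user/some-text-to-sql-dataset")
print(ds)                # shows the splits and their features
print(ds["train"][0])    # one example: schema (context), question, and the answer SQL
```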
So what we do now is we want to fine-tune a model. Now, we can do that in a notebook from 01:23:14.840 |
scratch - it takes, I don't know, 100 or so lines of code, it's not too much - but given the time constraints 01:23:20.840 |
here, and also because I thought, why not, why don't we just use something that's ready to go? 01:23:26.600 |
So for example, there's something called Axolotl, which is quite nice in my opinion. 01:23:30.680 |
Here it is - lovely, another very nice open source piece of software - and again you can just 01:23:39.320 |
pip install it, and it's got things like GPTQ and 16-bit and so forth ready to go. And so what I did 01:23:49.640 |
was - it basically has a whole bunch of examples of things that it already knows how to do; 01:23:57.320 |
it's got a Llama 2 example - so I copied the Llama 2 example and I created a SQL example. So basically I 01:24:05.400 |
just told it: this is the path to the dataset that I want, this is the type, and everything else 01:24:13.640 |
pretty much I left the same. And then I just ran this command, which is from their readme: accelerate 01:24:20.840 |
launch axolotl, passing in my YAML, and that took about an hour on my GPU. And at the end of the 01:24:29.640 |
hour, it had created a qlora-out directory. The Q stands for quantized - that's because I was creating 01:24:37.720 |
a smaller, quantized model - and LoRA I'm not going to talk about today, but LoRA is a very cool thing: 01:24:42.920 |
basically another thing that makes your models smaller, and also means you can use 01:24:50.040 |
bigger models on a smaller GPU for training. So I trained it, and then I thought, okay, let's 01:25:02.680 |
create our own example. So we're going to have this context and this question: 01:25:12.760 |
get the count of competition hosts by theme. And I'm not going to pass it an answer, so I'll just 01:25:21.000 |
ignore that. So again, I found out what prompt they were using, and created a sql_prompt 01:25:29.880 |
function. And so here's what I'm going to do: "Use the following contextual information to answer the 01:25:34.840 |
question. Context: CREATE TABLE..." - that's the context - "Question: List all competition hosts ordered in 01:25:41.560 |
ascending order." And then I tokenized that, called generate, and the answer was: SELECT COUNT(hosts) 01:25:56.520 |
, theme FROM farm_competition GROUP BY theme. That is correct! So I think that's pretty remarkable. 01:26:05.640 |
We have just built - you know, it took me like an hour to figure out how to do it, and then an hour 01:26:12.120 |
to actually do the training - and at the end of that, we've actually got something which is 01:26:18.680 |
converting prose into SQL based on our schema. So I think that's a really exciting idea. 01:26:27.240 |
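Roughly like this sketch - the exact prompt wording the fine-tune expects comes from the training config, so treat the text and the table/column names here as approximate; `gen` is the helper from earlier, pointed at the fine-tuned weights:

```python
def sql_prompt(context, question):
    # Same general shape as the prompt used during fine-tuning (approximate).
    return (f"Use the following contextual information to answer the question.\n\n"
            f"Context: {context}\n\n"
            f"Question: {question}\n\n"
            f"Answer: ")

context = "CREATE TABLE farm_competition (Hosts VARCHAR, Theme VARCHAR)"
question = "Get the count of competition hosts by theme."
print(gen(sql_prompt(context, question), max_new_tokens=40))
# Expected output along the lines of: SELECT COUNT(Hosts), Theme FROM farm_competition GROUP BY Theme
```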
The only other thing I do want to briefly mention is doing stuff on Macs. If you've 01:26:36.520 |
got a Mac, there are a couple of really good options: the options are MLC and llama.cpp 01:26:45.240 |
currently. MLC in particular, I think, is kind of underappreciated. It's a really nice 01:26:53.400 |
project where you can run language models on literally iPhones, Android, web browsers, 01:27:07.160 |
everything. It's really cool. And so I'm now actually on my Mac here, 01:27:15.000 |
and I've got a tiny little Python program called chat, and it's going to import the chat module, 01:27:26.760 |
it's going to import a quantized 7B, and it's going to ask the question "What is the meaning 01:27:38.280 |
of life?". So let's try it: python chat.py. 01:27:48.520 |
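Something along these lines - this is a guess at what such a chat.py might contain; MLC's Python API and package names have changed over time, so take the module, class, and model names here as assumptions:

```python
# Hypothetical chat.py - assumes the mlc_chat package and a pre-quantized Llama 2 7B
# chat model that has already been downloaded and compiled with MLC.
from mlc_chat import ChatModule

cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
print(cm.generate(prompt="What is the meaning of life?"))
```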
Again, I just installed this earlier today - I haven't done that much stuff on Macs before - but I was pretty impressed to see that it is doing a good 01:27:57.720 |
job here: "What is the meaning of life? It is complex and philosophical... some people might find meaning 01:28:05.000 |
in their relationships with others, their impact on the world", etc., etc. Okay, and it's doing 9.6 tokens 01:28:14.120 |
per second. So there you go - there is a model running on a Mac. And then another option that 01:28:20.840 |
you've probably heard about is llama.cpp. Llama.cpp runs on lots of different things as well, 01:28:28.600 |
including Macs and also on CUDA. It uses a different format called GGUF, and you can again 01:28:37.320 |
use it from Python - even though it's a C++ thing, it's got a Python wrapper. So you can just 01:28:42.120 |
download, again from Hugging Face, a GGUF file - you can just go through, and there's lots of 01:28:52.200 |
different ones, they're all documented as to what's what, you can pick how big a file you want - you can 01:28:56.760 |
download it, and then you just say, okay, Llama, model path equals, pass in that GGUF file. It spits out 01:29:04.440 |
lots and lots and lots of gunk, and then, if I called that llm, you can then say 01:29:11.960 |
llm, question: name the planets of the solar system, 32 tokens. And there we are: one, Pluto (no longer 01:29:22.600 |
considered a planet), two Mercury, three Venus, four Earth, Mars, six... oh, it ran out of tokens. 01:29:28.760 |
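A minimal sketch of that using the llama-cpp-python wrapper - the GGUF file name is a placeholder for whichever one you downloaded:

```python
from llama_cpp import Llama

# Path to a GGUF file downloaded from Hugging Face (placeholder name).
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf")

out = llm("Q: Name the planets of the solar system. A:", max_tokens=32)
print(out["choices"][0]["text"])
```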
So again, just to show you, there are all these different options. 01:29:34.600 |
I would say, if you've got an 01:29:38.840 |
Nvidia graphics card and you're a reasonably capable Python programmer, you'd probably be best 01:29:45.800 |
off using PyTorch and the Hugging Face ecosystem, but I think these things 01:29:53.880 |
might change over time as well, and certainly a lot of stuff is coming into llama.cpp pretty quickly now, 01:29:57.720 |
and it's developing very fast. As you can see, there's a lot of stuff that you can do right now 01:30:03.720 |
with language models, particularly if you're pretty comfortable as a Python programmer. 01:30:10.520 |
I think it's a really exciting time to get involved. In some ways, it's a frustrating time 01:30:15.880 |
to get involved, because it's very early and a lot of stuff has weird little edge 01:30:26.040 |
cases and it's tricky to install and stuff like that. There are a lot of great Discord channels, 01:30:33.720 |
however - fast.ai, we have our own Discord channel, so feel free to just Google for "fastai discord" 01:30:39.000 |
and drop in. We've got a channel called generative; feel free to ask any questions or tell us 01:30:45.480 |
about what you're finding. Yeah, it's definitely something where you want to be getting help from 01:30:50.760 |
other people on this journey, because it is very early days and people are still figuring 01:30:56.840 |
things out as we go. But I think it's an exciting time to be doing this stuff, and I'm 01:31:02.280 |
really enjoying it, and I hope that this has given some of you a useful starting point on your own 01:31:08.280 |
journey. So I hope you found this useful. Thanks for listening. Bye.