GPT-4: Superpower results with search
Chapters
0:00 Why GPT-4 can fail - hallucinations
1:50 What we can do with retrieval augmentation
2:16 How retrieval augmentation works
7:41 Scraping docs for LLMs
10:01 Preprocessing and chunking text for GPT-4
13:24 Creating embeddings with text-embedding-ada-002
14:58 Creating the Pinecone vector database
19:24 Retrieving relevant docs with semantic search
20:23 GPT-4 generated answers
23:44 GPT-4 with augmentation vs. GPT-4 without
25:34 Building powerful tools is almost too easy
00:00:00.000 |
Today we're going to take a look at an example walkthrough app that is going to show us how 00:00:07.040 |
to alleviate two of the biggest problems with GPT-4, GPT-3.5 and other large language models. 00:00:13.680 |
Those two things that I'm talking about are their ability to very convincingly make things up, 00:00:19.520 |
which we call hallucinations, and also their inability to contain up-to-date information. 00:00:27.600 |
So most of the models we're dealing with at the moment, they haven't seen any world information, 00:00:33.440 |
world data, since September 2021. That's where their training data cuts off. 00:00:39.920 |
So they're pretty outdated. So what we are going to be able to do with the approach I'm going to 00:00:45.360 |
show you is take a question like, how do I use the LLMChain in LangChain? Now, LangChain is a very 00:00:53.200 |
recent Python library. So most of these models, their training data cutoff is at September 2021. 00:00:59.920 |
They have no idea about LangChain. LLMChain is a particular object within that library. 00:01:06.160 |
If I asked GPT-4, how do I do this? The answer isn't very good. So the answer is that LLMChain 00:01:14.240 |
in LangChain is an ambiguous term, lacks context. It could refer to a language model. So it did 00:01:19.680 |
manage to get that, which is kind of cool. It could be a blockchain technology. This is the 00:01:24.800 |
answer that I seem to see in GPT models quite a lot, that this is some sort of blockchain technology. 00:01:30.720 |
So assuming that LLMChain refers to a language model and LangChain refers to a blockchain 00:01:38.640 |
technology, then it gives you instructions on how to use it. This is just completely false. 00:01:44.240 |
This isn't useful in any way to us whatsoever. So this isn't good. With the approach I'm going 00:01:51.680 |
to show you, we will get this answer. To use the LLMChain in LangChain, follow these steps. 00:01:56.880 |
Import necessary libraries. Do this. Create, initialize your LLM. Create a prompt template. 00:02:05.200 |
Import the LLMChain. Initialize your LLMChain and then run your LLMChain. That's exactly how 00:02:11.680 |
you do it. So what we're going to cover in this video is how to make that happen. So the question 00:02:16.800 |
now is, what are we doing? What are we going to do? Now, as I mentioned, large language models, 00:02:22.240 |
they kind of exist in a vacuum. They don't have any sort of external stimuli to the world. They 00:02:28.960 |
just have their own internal memory, which was built during the training of this large 00:02:36.080 |
language model. That is kind of all they have. And it's pretty powerful. I mean, you've seen 00:02:41.600 |
ChatGPT, now GPT-4, like the things that they can do is incredible, right? Their general knowledge 00:02:49.840 |
of the world is very, very good. It's just not up to date and it's not always reliable. Sometimes 00:02:55.840 |
they just make things up. So what we want to do is give the large language model access to the 00:03:02.800 |
outside world. Now, how do we do that? Well, we're going to use a few different components here. The 00:03:10.720 |
main component is what we call a vector database. We're going to be using what is called the 00:03:16.960 |
Pinecone vector database for that. Essentially, you can think of this as within your brain, 00:03:25.040 |
you kind of have your long-term memory somewhere in there. You can think of Pinecone as your kind 00:03:33.040 |
of long-term memory storage. The large language model is maybe like your short-term 00:03:40.240 |
memory. Maybe it's also like the neocortex, which kind of runs your brain or performs all these 00:03:48.800 |
logical calculations within your brain. That is kind of how we can think of these two components 00:03:54.720 |
and how they relate to each other. Then we're also going to, okay, so let's say we take a query. 00:04:01.280 |
We're going to take this query down here. Typically, we just put that query straight 00:04:05.600 |
into the large language model. Instead, now what we're going to do is we're going to have 00:04:09.840 |
another large language model that has been built for embeddings. Now, an embedding, 00:04:15.920 |
you can think of embeddings as kind of like the language of language models. That's kind of what 00:04:24.240 |
they are. These vectors, they basically create a representation, a numerical representation 00:04:32.240 |
of language. It's probably better if I draw that out. You have this embedding model here. 00:04:38.880 |
Given your query, it's going to map that into essentially what is a vector space. It's going 00:04:46.000 |
to put it here based on the meaning of that query. We create this vector embedding. Then we take it to 00:04:54.080 |
Pinecone. In Pinecone, we already have many of these vector embeddings that we created beforehand. 00:05:01.040 |
Let's say this is kind of inside Pinecone. There's all of these different vectors everywhere. 00:05:11.120 |
They all represent a piece of information. What we're doing here is we're taking that, 00:05:16.320 |
we're putting it in here. We're saying, okay, which are the vectors that are nearest to our 00:05:23.120 |
query vector? Maybe it's this one, this one, and this one. Then we return those. Those three items, 00:05:31.600 |
they come out to here. We have our vectors. They are connected to some piece of text, 00:05:40.160 |
relevant text to whatever our query is. We then take our query, bring it up here, 00:05:48.480 |
and we feed it into the large language model alongside these pieces of information that we 00:05:54.800 |
just retrieved. Now the large language model has a way to, it has some sort of connection to the 00:06:02.880 |
outside world in the form of this vector database, which is retrieving relevant information based on 00:06:08.800 |
a particular query. That's what we're going to implement. I think that's enough for this kind of 00:06:17.040 |
abstract visual for this. Let's just jump straight into the code. I will leave a link to this 00:06:23.440 |
notebook so you can follow along. It will be somewhere near the top of the video right now. 00:06:28.720 |
There are a few things that we need to import here or install. We're going to be using Beautiful 00:06:36.400 |
Soup. You saw the question before. It is about a particular Python library. Where do we get 00:06:44.960 |
the information about the Python library? We just go to their docs. 00:06:49.120 |
We go to the LangChain docs at langchain.readthedocs.io, and they have a lot. Everything we need is here. It has 00:07:02.080 |
guides. It has code. It has everything. All we're going to do is just scrape the website. 00:07:09.840 |
Obviously, that website, their doc site is pretty up to date for the library. We can just keep 00:07:17.920 |
something that goes through and maybe updates it every half a day or every day, depending on 00:07:24.720 |
how up to date you need this thing. We're going to be using Beautiful Soup, 00:07:30.960 |
tiktoken, OpenAI, LangChain, and pinecone-client. I'm going to go through all of these later as we come to that. 00:07:37.600 |
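For reference, the install cell in a notebook might look something like this (exact package versions at the time of the video may differ):

```python
# notebook-style install cell; versions pinned in the original notebook may differ
!pip install beautifulsoup4 tiktoken openai langchain pinecone-client
```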
I don't want to take too long going through how we're getting our data and so on, 00:07:45.040 |
because it's going to vary depending on what it is you're actually doing. I'll just show you very 00:07:49.840 |
quickly. I'm using requests. I'm getting the different web pages. We come to here, and I'm 00:07:56.160 |
basically just identifying all the links that are to the same site, langchain.readthedocs.io. I'm 00:08:02.240 |
getting all the links on each page that direct to another page on the site. Then I'm also just 00:08:10.640 |
getting the main content from that page. You can see here, the front page, "Welcome to LangChain," 00:08:17.920 |
the contents, "Getting Started," modules. It's super messy, and I'm sure 100% you can do better than 00:08:24.320 |
what I'm doing here. This is really quick code. Most of this, even the preprocessing and 00:08:31.200 |
data scraping side of things, this is all mostly ChatGPT, not even me. This is just pulled 00:08:38.960 |
together really quickly, and we get this pretty messy input. But large language models are really 00:08:48.000 |
good at processing text, so I don't actually need anything more than this, which is pretty insane. 00:08:53.680 |
I'm just taking this and putting it into a function here, scrape. We have a URL, 00:08:59.600 |
which is just a string, and we go through and we extract everything we need there. 00:09:03.680 |
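As a rough sketch (the selectors and helper names below are my reconstruction from the description, not the exact notebook code), that scrape function might look something like this:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape(url: str):
    res = requests.get(url)
    if res.status_code != 200:
        print(res.status_code, url)  # e.g. the 404s mentioned below
        return None
    soup = BeautifulSoup(res.text, "html.parser")
    # collect links that stay on the docs site, so the crawl can hop page to page
    links = [
        urljoin(url, a["href"])
        for a in soup.find_all("a", href=True)
        if "langchain.readthedocs.io" in urljoin(url, a["href"])
    ]
    # grab the main content of the page as (messy) plain text
    main = soup.find("main") or soup.body
    return {"url": url, "text": main.get_text(), "links": links}
```

The crawl loop then just keeps calling scrape on any new links it finds and appends each result to data.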
Then here, I'm setting up that loop to go through all the pages that we find, 00:09:09.360 |
and just scrape everything, and we add everything to data here. You can see if we scroll up, there's 00:09:16.960 |
a few 404s where it can't find a webpage. Now, this might just be that I'm calling a wrongly 00:09:25.920 |
formatted URL or something else. I'm not sure, but I'm not too worried. It's just a pretty quick 00:09:33.520 |
run-through here. All I want is that we have a decent amount of data in here, and we do. 00:09:38.160 |
Let's have a look at what one of those looks like. Data, this is the third page that we scraped. 00:09:46.240 |
Yeah, it's really messy. It's hard to read. I think there's code in here. 00:09:52.080 |
Yeah, there's code and everything in here. It's hard, but it's fine. It works. We don't 00:10:00.400 |
really need much more, but it is very long. There are token limits to GPT-4. The model we're going 00:10:07.520 |
to be using has an 8K token limit. There will be a new model with a 32K token limit, but we don't 00:10:13.600 |
want to necessarily use that full token limit because it's expensive. They charge you per token. 00:10:19.360 |
We don't want to just pass in a full page of text like this. It's better if we chunk it into smaller 00:10:25.520 |
chunks, which allows us to be more concise in the information that we're feeding into GPT-4 later on, 00:10:33.040 |
and also save money. You don't want to just throw in everything you have. 00:10:36.880 |
What I'm going to do is split everything into chunks. Not 1,000-token chunks, 00:10:43.520 |
actually we want it a little bit lower, so 500-token chunks. Now, here I'm actually using 00:10:50.000 |
LangChain. They have a really nice text splitter function here. Let me walk you through this, 00:10:56.560 |
because this I think most of us are going to need to do when we're working with text data. 00:11:01.520 |
We want to take our big chunks of text and we want to split it into smaller chunks. 00:11:07.440 |
How do we do that? Well, first we want to get the OpenAI, because we're using OpenAI models here, 00:11:13.920 |
we want to get the OpenAI tiktoken tokenizer to count the number of tokens that we have in a 00:11:19.120 |
chunk. That's what we're doing here. We're setting up this counting function, which will check the 00:11:24.640 |
length of our text. We're going to pass that into this function here. What is this function? 00:11:30.400 |
This is called the recursive character text splitter. What this is going to do is it's going 00:11:35.360 |
to try first to separate your text into roughly 500 token chunks using this character string. 00:11:44.000 |
Double new lines. If it can't find that, it's going to try single new line. If it can't do that, 00:11:50.160 |
it will try space. If it can't do that, it's going to split wherever it can. This is probably one of 00:11:58.400 |
the better, in my opinion, options for splitting your text into chunks. It works really well. 00:12:04.320 |
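A minimal sketch of that setup, assuming LangChain's RecursiveCharacterTextSplitter and OpenAI's tiktoken package:

```python
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# cl100k_base is the encoding used by text-embedding-ada-002 and GPT-4
tokenizer = tiktoken.get_encoding("cl100k_base")

def tiktoken_len(text: str) -> int:
    return len(tokenizer.encode(text, disallowed_special=()))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,               # roughly 500 tokens per chunk
    chunk_overlap=20,             # small overlap so chunks aren't cut mid-thought
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],  # double newline first, then fall back
)
```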
With this text, it's probably not even that ideal. I don't even know if we have new lines in this. 00:12:11.120 |
This is probably just mostly going to split on spaces. It works. We don't need to worry about it 00:12:19.520 |
too much. Cool. We process our data into chunks using that approach. We have this here. We're 00:12:27.680 |
going through all of our data. We split everything. We are getting the text records. I don't know if 00:12:36.960 |
-- do we have an example? Yeah, here. If we come to the format of this, we have the URL and we also 00:12:44.640 |
have the text. That's why we're pulling in this text here. Because we now have multiple chunks 00:12:54.000 |
for each page, we need to create a separate chunk for each one of those, but we still want to 00:13:01.680 |
include the URL. We create a unique ID for each chunk. We have that chunk of text that we got 00:13:08.960 |
from here. We have the number chunks. Each page is going to have 5, 6, 7 or so chunks. We also 00:13:16.720 |
have the URL for the page. We can link back to that at a later point if we wanted to do that. 00:13:23.200 |
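Putting that together, the chunking loop might look roughly like this, reusing the text_splitter from above and assuming data is the list of scraped pages (the field names are based on the description here):

```python
from uuid import uuid4

chunked_data = []
for record in data:
    chunks = text_splitter.split_text(record["text"])
    for i, chunk in enumerate(chunks):
        chunked_data.append({
            "id": str(uuid4()),    # unique ID for each chunk
            "text": chunk,         # the ~500-token chunk of page text
            "chunk": i,            # chunk number within the page
            "url": record["url"],  # link back to the source page
        })
```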
All right. Cool. Then we initialize our embedding model here. We're using the API directly. 00:13:32.080 |
What we're doing here is using an embedding model. Now, embeddings are pretty cheap. I don't 00:13:40.560 |
remember the exact pricing, but it's really hard to spend a lot of money when you're embedding 00:13:44.560 |
things with this model. I wouldn't worry too much about the cost on this side of things. It's more 00:13:51.440 |
when you get to GPT-4 later on, where it starts to get a bit more expensive. 00:13:54.400 |
This is just an example. How do we create our embeddings? We have openai.Embedding.create. 00:14:04.160 |
We pass in the text-embedding-ada-002 model. You also need your OpenAI API key. For that, you need to go 00:14:11.120 |
to platform.openai.com. Let me double check that. You'd come to the platform here. You'd go 00:14:26.640 |
up to your profile on the top right and you just click view API keys. That's it. Then we run that 00:14:33.840 |
and we'll get a response that has this. We have object, data, model, usage. We want to go into 00:14:40.080 |
data and then we get our embeddings like this. We have embedding zero here, and this is 00:14:46.240 |
embedding one, because we passed two sentences there. Each one of those has this dimensionality, 1536. 00:14:51.520 |
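A minimal sketch of that call, assuming the pre-1.0 openai Python client that was current at the time:

```python
import openai

openai.api_key = "OPENAI_API_KEY"  # from platform.openai.com -> View API keys
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch",
    ],
    engine=embed_model,
)
embeds = [record["embedding"] for record in res["data"]]
print(len(embeds), len(embeds[0]))  # two input sentences -> two 1536-d vectors
```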
This is important for initializing our vector database or our vector index. Let's move on to 00:15:00.480 |
that. We get to here. We need to initialize our connection to Pinecone. For this, you do need to 00:15:10.400 |
sign up for an account. You can get a free API key. To do that, you need to go to app.pinecone.io. 00:15:19.840 |
We should find ourselves here. You'll probably end up in this, like, it will say your name, 00:15:26.880 |
default project, and you just go to API keys. You press copy and you would paste it into here. 00:15:36.720 |
You also need the environment. The environment is not necessarily going to be this. I should 00:15:41.280 |
just remove that. The environment is whatever you have here and this will change. It depends on when 00:15:47.600 |
you sign up, among other things. So, yeah, that will vary. Don't rely on what I put here, which 00:15:54.240 |
was the US West 1 GCP. It can change. It also depends if you already have a project that you 00:16:00.640 |
set up with a particular environment, then, of course, it's going to be whichever environment 00:16:04.560 |
you chose there. All right. After that, we check if the index already exists. If this is your first 00:16:14.000 |
time walking through this with me, then it probably won't exist. So, the index is this 00:16:19.680 |
gpt-4-langchain-docs index. You can see if I go into mine, it will be there, right? Because I just created it 00:16:26.960 |
before recording this. So, I do have that in there. So, this would not run. All right. But 00:16:33.040 |
important things is we have our index name. You can rename it to whatever you want. I'm just using 00:16:38.800 |
this because it's descriptive. I'm not going to forget what it is. The dimension is where we need 00:16:43.200 |
that 1536, which we've got up here. So, the dimensionality of our vectors. That's important. 00:16:50.480 |
And then the metric, we're using dot product. So, with text-embedding-ada-002, you should be able to use 00:16:56.160 |
either dot product or cosine. We're just going with dot product there. And then, after 00:17:03.600 |
this, we'll create our index. Then we're connecting to our index. Okay? So, this is GRPCIndex. You 00:17:09.200 |
can also use just Index. But this is kind of more reliable, faster, and so on. And then after you've 00:17:17.040 |
connected, you can view your index stats. Now, the first time you run this, you should see that 00:17:23.120 |
the total vector count is zero, right? Because it's empty. 00:17:29.520 |
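Roughly, the Pinecone setup described here might look like this, assuming the pre-v3 pinecone-client API used in the video (your API key and environment come from app.pinecone.io):

```python
import pinecone

pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-west1-gcp",   # yours may differ -- check the console
)

index_name = "gpt-4-langchain-docs"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,           # must match text-embedding-ada-002
        metric="dotproduct",
    )

index = pinecone.GRPCIndex(index_name)
index.describe_index_stats()      # total_vector_count is 0 on a fresh index
```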
Then, after we've done that, we move on to populating the index. To populate the index, we're going 00:17:36.880 |
to do it in batches of 100. All right? So, we'll create 100 embeddings and add all of those to 00:17:43.840 |
Pinecone in a batch of 100. Okay? So, what we're going to do is we loop through our dataset, 00:17:50.160 |
through all of the chunks that we have with this batch size. We find the end of the batch. 00:17:55.120 |
So, the initial one should be like zero to 100. Right? We take our metadata information there. 00:18:05.040 |
We get the IDs from that. We get the text from that. And then what we do is we create our 00:18:12.240 |
embeddings. Now, that should work. But sometimes there are issues, like when you have like a rate 00:18:19.040 |
limit error or something along those lines. So, I just had a really simple try except statement 00:18:26.080 |
in here to just try again. Okay. Cool. After that, we've got our embeddings. Okay. That's good. 00:18:34.960 |
And we can move on. So, we clean up our metadata here. Within our metadata, 00:18:41.680 |
we only want the text. Maybe the chunk. I don't think we really even need the chunk, 00:18:46.320 |
but I'm just putting it in there. And the URL, I think that's important. Like, if we're returning 00:18:51.280 |
results to a user, it can be nice to direct them to where those results are coming from. 00:18:57.760 |
Right? It helps a user have trust in whatever you're sort of spitting out, rather than not 00:19:06.000 |
knowing where this information is coming from. Right? And then we add all of that to our vector 00:19:12.240 |
index. So, we have our IDs, the embeddings, and the metadata for that batch of 100 items. And 00:19:18.240 |
then we just loop through, keep going, keep going, in batches of 100. 00:19:25.760 |
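A sketch of that indexing loop, reusing the chunk records and embedding model from above (the bare try/except retry is a stand-in for the error handling described here):

```python
from time import sleep

batch_size = 100
for i in range(0, len(chunked_data), batch_size):
    i_end = min(i + batch_size, len(chunked_data))  # find the end of this batch
    batch = chunked_data[i:i_end]
    ids = [record["id"] for record in batch]
    texts = [record["text"] for record in batch]
    # create the embeddings, retrying once on e.g. a rate limit error
    try:
        res = openai.Embedding.create(input=texts, engine=embed_model)
    except Exception:
        sleep(5)
        res = openai.Embedding.create(input=texts, engine=embed_model)
    embeds = [r["embedding"] for r in res["data"]]
    # keep only the metadata we want back at query time
    metadata = [
        {"text": r["text"], "chunk": r["chunk"], "url": r["url"]}
        for r in batch
    ]
    index.upsert(vectors=zip(ids, embeds, metadata))
```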
Right. Once that is all done, we get to move on to what I think is the cool part. So, how do I use 00:19:33.520 |
the LLMChain in LangChain? I think we can just run this. Okay. And let's have a look at the responses. 00:19:42.640 |
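The retrieval step itself might look something like this: embed the question with the same model, then ask Pinecone for the closest vectors:

```python
query = "how do I use the LLMChain in LangChain?"

res = openai.Embedding.create(input=[query], engine=embed_model)
xq = res["data"][0]["embedding"]  # the 1536-dimensional query vector

# fetch the five most similar chunks, along with their metadata
res = index.query(xq, top_k=5, include_metadata=True)
```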
Now, this is kind of messy. Here we go. So, I'm returning five responses. Now, if you see the first one 00:19:49.280 |
here, this is not that relevant. Okay. The top one that we have here. Okay. Fine. Come on to the 00:19:56.800 |
next one. This is talking a little bit about large language models. I don't think it necessarily 00:20:03.200 |
mentions LLMChain here. Fine. Move on to the next one. Now, we get something. Right? LLMChain, combining 00:20:10.240 |
chains. It's talking about what chains are, why we use them. It talks about the prompt template, 00:20:16.240 |
which is a part of the LLMChain. And it talks a little bit more about the LLMChain. Right. So, 00:20:21.600 |
that's the sort of information we want. But, I mean, there's so much information here. Do we 00:20:25.520 |
really want to give all of this to a user? No, I don't think so. Right? We want to basically 00:20:32.720 |
give this information to a large language model, which is going to use it to give a more concise 00:20:37.760 |
and useful answer to the user. So, to do that, we create this sort of format here for our query. 00:20:48.240 |
Right? So, this is just adding in that information that we got up here 00:20:51.200 |
into our query. And we can have a look at what it looks like. So, augmented query. 00:20:55.920 |
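In code, that formatting step might look roughly like this (the "---" separator matches what's described just below):

```python
# join the retrieved chunks, separated by "---", and append the question
contexts = [match["metadata"]["text"] for match in res["matches"]]

augmented_query = "\n\n---\n\n".join(contexts) + "\n\n-----\n\n" + query
```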
Right? It's actually kind of messy. Let me print it. Maybe that will be better. 00:21:07.040 |
Right. So, you can kind of see, I mean, these are just single lines. It's really messy. But we 00:21:13.360 |
separate each example with like these three dashes and a few new lines. And then we have all this. 00:21:20.880 |
And then we ask, we put our query at the end: how do I use the LLMChain in LangChain? All right. 00:21:27.680 |
That is our new augmented query. We have all this external information from the world. And then we 00:21:34.400 |
have our query. Before, it was just this. Right? Now, we have all this other information that we 00:21:39.680 |
can feed into the model. Right? Now, GPT-4, at least in its current state, is a chat model. 00:21:48.400 |
Okay? So, we need to use a chat completion endpoint like we would have done with GPT-3.5 00:21:55.040 |
Turbo. And with those, we have kind of like the system message that primes the model. Right? So, 00:22:02.000 |
I'm going to say: you are a Q&A bot, a highly intelligent system that answers user questions 00:22:07.120 |
based on the information provided. So, this is important: based on the information provided by 00:22:11.840 |
the user above each question. Right? Now, this information isn't actually provided by the user. 00:22:19.120 |
But as far as our AI bot knows, it is. Because it's coming in through a user prompt. Right? 00:22:25.040 |
If the information cannot be found in the information provided by the user, 00:22:30.640 |
you truthfully say "I don't know." Okay? As in, I don't know the answer to this. Right? So, 00:22:36.800 |
this is to try and avoid hallucination where it makes things up. Right? Because we kind of don't 00:22:41.920 |
want that. It doesn't fully fix that problem. But it does help a lot. So, we pass in that primer. 00:22:49.200 |
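Put together, the call might look roughly like this, again assuming the pre-1.0 openai client:

```python
primer = (
    "You are a Q&A bot. A highly intelligent system that answers user "
    "questions based on the information provided by the user above each "
    "question. If the information cannot be found in the information "
    "provided by the user, you truthfully say 'I don't know'."
)

res = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": primer},
        {"role": "user", "content": augmented_query},
    ],
)
```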
And then we pass in our augmented query. We're also going to do this. Actually, let me run this. 00:22:55.680 |
We're also going to do this here. So, we're going to display the response nicely 00:22:59.360 |
with markdown. So, what we'll see with GPT-4 is that it's going to kind of format everything 00:23:06.320 |
nicely for us. Which is great. But obviously, just printing it out doesn't look that good. 00:23:10.880 |
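In a notebook, that display step might be as simple as:

```python
from IPython.display import Markdown, display

display(Markdown(res["choices"][0]["message"]["content"]))
```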
So, we use this. Okay? And let's run. And we get this. Okay. So, to use the LLMChain in 00:23:18.800 |
LangChain, follow these steps: import necessary classes. I think these all look correct. OpenAI, 00:23:25.520 |
temperature 0.9. Now, all this looks pretty good. I'd say the only thing 00:23:32.000 |
missing is probably the fact that you need to add in your OpenAI API key. But otherwise, this looks 00:23:40.800 |
perfect. Right? So, I mean, that's really cool. Okay. That's great. But maybe a question that 00:23:47.760 |
at least I would have is how does this compare to not feeding in all of that extra information 00:23:52.480 |
that we got from the vector database? All right. We can try. All right. So, let's do the same 00:23:57.920 |
thing again. This time, we're not using the augmented query. We're just using the query. 00:24:01.920 |
And we just get "I don't know." Right? Because we set the system up beforehand with 00:24:09.760 |
the system message to not answer, and just say "I don't know" if it doesn't have the information 00:24:15.840 |
contained within the information that we pass within the user prompt. Okay. So, that's good. 00:24:22.160 |
It's working. But what if we didn't have the "I don't know" part? Would it maybe 00:24:27.360 |
just answer the question? Maybe we're kind of limiting it here. So, I've added this new system 00:24:33.120 |
message: you are a Q&A bot, a highly intelligent system that answers user questions. It doesn't say 00:24:37.920 |
anything about saying "I don't know". Let's try. Okay. Cool. So, LangChain hasn't provided any 00:24:46.160 |
public documentation on LLMChain nor is there a known technology called LLMChain in their library. 00:24:52.720 |
To better assist you, could you provide more information or context about LLMChain in LangChain? 00:24:57.120 |
Okay. Meanwhile, if you are referring to "LangChain", a blockchain-based decentralized AI language 00:25:03.920 |
model, you know, I keep getting this answer from GPT and I have no idea if it's actually a 00:25:11.920 |
real thing or it's just like completely made up. I assume it must be because it keeps telling me 00:25:16.160 |
this. But yeah, I mean, obviously, this is wrong. This isn't what we're going for. It says here if 00:25:21.920 |
you're looking for help with a specific language chain or model in NLP, like this is kind of 00:25:29.040 |
relevant, but it's not. It clearly doesn't know what we're talking about. It's just making guesses. 00:25:34.320 |
This is just an example of where we would use this system. As you saw, it's pretty easy to 00:25:42.480 |
set up. There's nothing complicated going on here. We're just kind of calling this API, 00:25:46.240 |
calling this API, and all of a sudden we have this insanely powerful tool that we can use 00:25:53.520 |
to build really cool things. It's getting stupidly easy to create these sort of systems 00:26:00.400 |
that are incredibly powerful. I think it shows there are so many startups that are doing this 00:26:06.000 |
sort of thing. But at least for me, what I find most interesting here is that I can take this, 00:26:12.240 |
I can integrate into some sort of tooling or process that is specific to what I need to do, 00:26:18.720 |
and it can just help me be more productive and help me do things faster. I think that's probably, 00:26:25.760 |
at least for me right now, that's the most exciting bit. Then of course, for anyone working 00:26:31.120 |
in the company or any founders working on their startup and so on, these sort of technologies 00:26:36.880 |
are like rocket fuel. The things you can do in such a short amount of time are insane. Anyway, 00:26:44.960 |
I'm going to leave it there. I hope this video has been interesting and helpful. 00:26:49.760 |
Thank you very much for watching, and I will see you again in the next one. Bye.