
Llama Index 101 with Vector DBs and GPT 3.5


Chapters

0:00 Getting Started with Llama Index
1:13 Llama Index Features
2:15 Llama Index Code Intro
3:55 Llama Index Document Objects
5:45 Llama Index Nodes
7:23 Indexing with Pinecone
9:22 Vector Store in Llama Index
15:36 Making Queries with Llama Index

Whisper Transcript

00:00:00.000 | Today we're going to take a look at how we can use Llama Index in production with Pinecone.
00:00:06.480 | Now this is an introduction to the Llama Index library that was previously known as GPT Index.
00:00:12.160 | We're not going to go into any details on the more advanced features of the library.
00:00:18.560 | We're just going to see how to actually use it and get started with it and do that in a way that
00:00:25.680 | would be more production friendly with a vector database like Pinecone. Now for those of you that
00:00:31.680 | don't know, Llama Index is a library that helps us build a better retrieval augmentation pipeline for
00:00:41.280 | our LLMs. So we would use retrieval augmentation when we want to give our LLM source knowledge,
00:00:49.360 | so knowledge from the outside world or maybe some internal database or something along those lines.
00:00:56.560 | And that will help us, one, reference that other knowledge, so we can add in citations and things
00:01:03.280 | like that, and two, it will also help us reduce the likelihood of hallucinations. So Llama Index
00:01:10.560 | is a library that will support us in doing that. Now Llama Index can do a lot of things,
00:01:15.760 | not all of those we're going to cover in this video but the main features of the library include
00:01:22.320 | the data loaders that allow us to very easily extract data from APIs, PDFs, SQL databases,
00:01:31.760 | CSVs, all of the most common types of data sources. It also gives us some more advanced
00:01:39.680 | ways of structuring our data so we can add in connections between different data sources which
00:01:45.520 | is kind of useful. So imagine you have a lot of chunks of text from PDFs, what you can do is add in
00:01:52.000 | connections between those chunks. So the first chunk in your database would be connected to the
00:01:57.440 | next chunk with a little connector that says this is actually the next chunk and this is the previous
00:02:02.880 | chunk. And they also support things like post-retrieval re-ranking as well. So there's
00:02:08.960 | plenty to talk about but first let's get started with a simple introduction to the library. So
00:02:15.120 | we're going to walk through this notebook here. There will be a link to this notebook at the top
00:02:19.440 | of the video right now. So the first thing we need to do is install the prerequisite libraries, so
00:02:24.800 | go ahead and run that.
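The install cell looks roughly like this; the exact package list and pins are an assumption on my part rather than something shown on screen:

```python
# Rough sketch of the notebook's install cell (package pins assumed)
!pip install -qU llama-index pinecone-client datasets openai
```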
00:02:33.680 | Now, for the runtime here we don't need to be using a GPU, so you can just check whether you are using one or not. It costs money to use a GPU on Colab, so you can just set the hardware
00:02:38.480 | accelerator to None to save that money. Okay, and once we've done that, what we're going to do is
00:02:44.240 | just download a dataset. So I'm going to use SQuAD, the Stanford Question Answering
00:02:49.040 | Dataset. Okay, so there's a few things I'm doing here. First, I'm just getting the relevant columns I need
00:02:55.360 | there, so the id, the context, which is like a chunk of text, and the title, so basically the page
00:03:03.200 | title where that context is coming from. And then what I'm doing is dropping duplicates. So in the
00:03:08.240 | SQuAD dataset you will basically have like 20 questions and 20 different answers, but those 20
00:03:15.600 | questions all share an identical context, so you end up with a lot of duplicate
00:03:21.280 | contexts in there, and because we are just using the contexts, we actually need to remove all of that
00:03:26.960 | duplication. So that's what I'm doing here.
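A minimal sketch of that preparation step, assuming the Hugging Face datasets library; the notebook's exact column handling may differ:

```python
from datasets import load_dataset

# Load the SQuAD training split and keep only the columns we need
squad = load_dataset("squad", split="train")
data = squad.to_pandas()[["id", "context", "title"]]

# Each context is repeated once per question, so drop the duplicates
data = data.drop_duplicates(subset="context")
data.head()
```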
00:03:35.120 | Okay, so then we get this: we have our id, so it's like the document or context id, the context itself, and then we have where that is coming
00:03:41.520 | from. Okay, so the first few there are all from the University of Notre Dame Wikipedia page,
00:03:46.800 | and in total we have almost 19,000 records in there. So Llama Index uses these document objects,
00:03:58.720 | which you can think of as basically revolving around the context
00:04:05.200 | of your data, right, so this chunk of text, and it will include other bits of information
00:04:14.160 | around that context. So for us it's going to include the document id, right, so every document
00:04:20.080 | is going to need an id. And this is optional: we can also add extra info, which we can think of as
00:04:26.560 | metadata for our context. Now, for us we just have title, but obviously we could add more; this is a
00:04:33.920 | dictionary, so we could add, I don't know, something else here, right, and we could just, you know, put
00:04:39.600 | something in, but of course we don't actually need that, so we'll remove that. But yeah, you
00:04:46.400 | can put as many fields as you like in there. And yeah, let's run that and take a look at one of
00:04:52.240 | those documents and see what it looks like. You can think of this as a core object for Llama Index.
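As a rough sketch, building the document objects looks something like this; the constructor arguments follow the GPT Index-era API and should be treated as assumptions:

```python
from llama_index import Document

# One Document per context: the SQuAD id becomes the doc_id and the
# page title goes into extra_info as metadata
docs = [
    Document(
        text=row["context"],
        doc_id=row["id"],
        extra_info={"title": row["title"]},
    )
    for _, row in data.iterrows()
]
docs[0]
```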
00:05:00.960 | All right, so we have this document, we have the text, and then we go through here, we have the
00:05:08.080 | document id, and that is the extra info. Now, embedding: we don't have an embedding for it yet, so we're
00:05:13.760 | going to create that later but the embedding is also very important because that's what will
00:05:17.920 | allow us to search through that data set later on. Okay so now what we need to do is actually
00:05:27.040 | create those embeddings. So to create those embeddings we're going to be using OpenAI,
00:05:31.840 | so for that you will need to get an OpenAI API key from platform.openai.com, and then you would just put that in here.
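Something like this, with a placeholder key:

```python
import os

# Placeholder value; paste your own secret key from platform.openai.com
os.environ["OPENAI_API_KEY"] = "sk-..."
```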
00:05:41.680 | I have already done it, so I will move on. So one step further from our
00:05:49.280 | document is what we would call a node so a node the way that I would think of this is it's your
00:05:55.680 | document object with extra information about that document in relation to other documents within
00:06:03.120 | what will be your database. So let's say you have the chunks of paragraphs or chunks of text from a
00:06:10.480 | pdf a node will contain the information that chunk one is followed by chunk two and then in
00:06:17.440 | chunk two it will say chunk one was the preceding chunk before this so it has that relational
00:06:24.720 | information between the chunks whereas a document will not have that. So we would need to add that
00:06:31.920 | in there we're not going to do that here we'll talk about that in the future but we still need
00:06:37.120 | to use the nodes here, so we're going to just run this. So our nodes in this case are basically
00:06:44.480 | going to be the same as our documents in terms of the information that they carry but node is the
00:06:49.760 | object type that we will build our vector database from. So let's run this. I should say here
00:07:00.640 | we have set up our OpenAI API key; we don't actually need to use it yet, I should have
00:07:05.520 | really done that later, but it's there now, so we have that ready for when we do want to use it.
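The node-creation cell is roughly this; SimpleNodeParser reflects the library version of the time and is an assumption here:

```python
from llama_index.node_parser import SimpleNodeParser

# With default settings each node carries essentially the same text
# as the document it was parsed from
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(docs)
len(nodes)
```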
00:07:12.080 | Okay, so we've just created all of the nodes from the documents. Let's take a look at those nodes;
00:07:18.400 | okay, obviously we have the same number of nodes as we do documents. Now we are going to be using
00:07:26.400 | Pinecone, which is a managed vector database, as the database for our Llama Index data. Okay,
00:07:35.280 | so to use that we need to get our API key and environment, which we do from app.pinecone.io.
00:07:44.800 | Once you are in app.pinecone.io you should be able to see API Keys over on the
00:07:50.640 | left; you'll see something that looks like this, zoom out a bit. You just want to copy your API key
00:07:57.120 | and also remember your environment here, so I've got us-west1-gcp. So your API key, you put it in
00:08:03.440 | here, and here I'm going to put us-west1-gcp. Okay, and after running that, let me walk you through
00:08:12.880 | what's going on here. So we initialize our connection to Pinecone, we create our Pinecone
00:08:19.440 | index, and we're going to call it llama-index-intro; you can call this whatever you want.
00:08:22.960 | And the things that we do need to do are, one, create our index if it does not already exist, which, if
00:08:29.600 | you're running this for the first time, it won't. And to create that index you need to make sure the
00:08:35.360 | dimensionality is the same as the text-embedding-ada-002 model, which is the embedding model
00:08:40.400 | we're using, and that dimensionality is 1536. We also need to make sure we're using the right
00:08:47.520 | metric. We can actually use any metric here, so you can use Euclidean, dot product, or cosine,
00:08:52.720 | but I think cosine is the fastest in terms of the similarity calculation when you're using
00:09:01.120 | text-embedding-ada-002, although in reality the difference between them is practically nil, so you can use any,
00:09:09.040 | but I recommend cosine. Now, after that, we would just connect to the index. Okay, so here we're
00:09:16.080 | connecting to Pinecone, creating the index, and then connecting to that index.
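Roughly, assuming the v2 pinecone-client API and the index name used in the video:

```python
import pinecone

# Initialize the connection; key and environment come from app.pinecone.io
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# Create the index if it doesn't already exist; the dimension must match
# text-embedding-ada-002 (1536) and we use the cosine metric
if "llama-index-intro" not in pinecone.list_indexes():
    pinecone.create_index("llama-index-intro", dimension=1536, metric="cosine")

# Connect to the new index
pinecone_index = pinecone.Index("llama-index-intro")
```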
00:09:20.000 | Okay, once that is done, we can move on. So we've just created our index and
00:09:31.600 | connected to it; now what we want to do is connect to it through the vector store abstraction in
00:09:37.520 | Llama Index. To do that is pretty simple: PineconeVectorStore, and then we just pass in
00:09:44.160 | our index. That's it, that's pretty easy. So this will just allow us to use the other
00:09:50.720 | Llama Index components with our Pinecone vector store.
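That wrapper step is essentially one line (import path per the library version of the time):

```python
from llama_index.vector_stores import PineconeVectorStore

# Expose the Pinecone index through Llama Index's vector store abstraction
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
```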
00:09:55.440 | Cool, so I think that is all good. Then we have a few more things going on here, so let's talk
00:10:06.320 | through all of this; let me make it a lot more readable. So yeah, there's a few things going on.
00:10:13.440 | Basically, what we want to do here is create our index, which is this GPT vector
00:10:19.920 | store index so we're basically going to take all of our documents and we're going to take the
00:10:25.840 | service context which is like your embedding pipeline and we're also going to take the
00:10:32.160 | storage context, which is the vector store itself, and this will essentially act as a pipeline around
00:10:40.480 | that. So it's going to take all of our documents, feed them through our embedding
00:10:47.040 | pipeline, so this service context, embed all of them, and put them all into our vector store.
00:10:54.880 | So, I mean, in reality it's pretty straightforward. Let me just explain that from
00:11:01.360 | the perspective of where we're actually initializing these. So, storage context from defaults:
00:11:08.640 | it's really simple, we're just using our vector store. There are other parameters in here, but we
00:11:12.560 | don't need to use any of those because we're just using our vector store with the default
00:11:17.600 | settings. With the service context, like I said, this is the embedding pipeline; again, we don't really
00:11:23.760 | need to specify much here, we just need to specify, okay, we're using OpenAI embeddings.
00:11:28.160 | This is going to automatically pull in our API key, which we set earlier on, up here. Okay,
00:11:38.320 | so it's going to automatically pull in the API key. We do need to set the model:
00:11:44.240 | text-embedding-ada-002, at the time of me going through this, is the recommended model from OpenAI,
00:11:50.000 | and we have our embedding batch size so this is one important thing that you should set
00:11:54.880 | Basically, it will embed things in batches of 100. I think by default the value for this is much
00:12:03.200 | smaller, it's 32 or 16 or something like that. So that basically means it's going to
00:12:11.440 | take 16 chunks of text, send them to OpenAI, get the embeddings, and then it's going to
00:12:19.360 | pass them on to the storage context and upsert those to Pinecone. But what we've done here is set the
00:12:25.600 | embedding batch size to 100, so it's going to take 100, send them to OpenAI, then send them to Pinecone.
00:12:32.160 | That means that you need to make fewer requests, what is it, like six times fewer requests if
00:12:39.840 | you set this to 100, which means in essence you're going to be roughly six times faster, because the
00:12:45.440 | majority of the wait time in these API requests is actually the network latency, so it's making
00:12:51.680 | the request and receiving the response. So by increasing that batch size, things are going to be
00:12:57.360 | faster, which is, I think, what we all want. So yeah, we set that, and then with that we initialize our
00:13:05.840 | service context, right, so that embedding pipeline, or maybe we can even think of it as a pre-processing
00:13:12.080 | pipeline for our data, and then we just put everything together, so that's
00:13:18.480 | our full indexing pipeline, and we can initialize that.
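Put together, the pipeline looks roughly like this; the class names follow the library version of the time, and the embedding-model keyword arguments are assumptions:

```python
from llama_index import GPTVectorStoreIndex, ServiceContext, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding

# Storage context: where the vectors end up (our Pinecone vector store)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Service context: the embedding/pre-processing pipeline; batches of 100
# mean far fewer round trips to the OpenAI API
embed_model = OpenAIEmbedding(
    model="text-embedding-ada-002", embed_batch_size=100
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Feed every document through the pipeline and upsert the vectors to Pinecone
index = GPTVectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)
```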
00:13:27.520 | Now, this can take a long time, and unfortunately Llama Index doesn't have a progress bar or anything built into it, but we can actually check
00:13:33.200 | the progress of our index creation. If we go to llama-index-intro here, we can go to index
00:13:42.560 | info, and then we see the total number of vectors that are in there. Okay, and you can
00:13:49.760 | also see the rate of them being updated as well, and you can then refresh and see where we
00:13:58.160 | are. Okay, so we're at 4.3 thousand, and we need to upsert how many? Quite a few, actually, so it's going
00:14:09.840 | to take a little while. What I might do is just stop that for now, and we can just jump ahead and
00:14:19.360 | begin asking questions, so I'm not waiting too long for that. But yeah, that's just one of the
00:14:25.680 | unfortunate things with Llama Index, but we can kind of get around that by just taking a look in
00:14:30.320 | the actual Pinecone dashboard at how many vectors we actually currently have in there.
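For example, with the Pinecone client directly (the output shown is illustrative):

```python
# Check how many vectors have been upserted so far
pinecone_index.describe_index_stats()
# e.g. {'dimension': 1536, 'total_vector_count': 4300, ...}
```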
00:14:38.000 | Okay, so yeah, let's stop that. Right now it is very slow to do this with Llama Index; if you were just
00:14:46.480 | wanting to get your vectors and documents in there, I would just use Pinecone directly, it's
00:14:52.160 | much faster. I mean, for what, 18,000, 16,000, whatever that number is, you're going to be
00:14:58.480 | waiting, I don't know, not long, like maybe a couple of minutes at most, because you need to embed
00:15:05.840 | things with OpenAI and then send things to Pinecone. Yeah, a few minutes if you set that up
00:15:12.160 | properly. But anyway, that does mean that we wouldn't benefit from the other things that
00:15:19.600 | Llama Index offers, so in some cases it might just be a case of being patient. But Llama Index and
00:15:27.120 | the embedding process will be optimized in the near future, so hopefully that will not take
00:15:33.120 | quite as long to actually upsert everything. So from here, let's pretend we've upserted everything.
00:15:40.320 | Now what we want to do is build our query engine. So the query engine is basically just the
00:15:47.840 | index, and we have this method called as_query_engine; it basically just reformats that into
00:15:54.000 | something that we can begin querying. Okay, so we create our query engine, then we do
00:16:01.040 | query_engine.query, and our question is going to be: in what year was the College of Engineering
00:16:06.240 | established at the University of Notre Dame? We saw that the first few items in there were talking
00:16:12.800 | about the University of Notre Dame, so we would expect that to work.
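The querying step is roughly:

```python
# Build a query engine from the index and run the question through it
query_engine = index.as_query_engine()
response = query_engine.query(
    "In what year was the College of Engineering established at the "
    "University of Notre Dame?"
)
print(response)
```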
00:16:22.800 | Why... okay, it looks like that hasn't actually initialized the index properly, because I kind of stopped it
00:16:28.160 | midway through. So what I'll do is I'm just going to index like 100 docs quickly, okay, so it's a bit
00:16:36.880 | quicker. Let's check... okay, so we still have all those other documents in there, so now let's
00:16:43.600 | try that. Okay, and we get: the College of Engineering was established in 1920. I'm sure it's one of the
00:16:49.680 | first items, it's probably where I got the question from, the university... oh yeah, so like question four
00:16:55.280 | here, I think. If we take a look at that, so data four... oops, oh, it's a DataFrame, so it should be iloc...
00:17:09.280 | no, three, I'm being stupid.
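To see where the answer came from, we can inspect the retrieved source chunks, roughly like this (attribute names vary a little between library versions):

```python
# Print the start of each retrieved context chunk behind the answer
for source in response.source_nodes:
    print(source.node.get_text()[:200])
```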
00:17:18.400 | Yeah, and we can have a look at the context. Okay, so it's pulling this information, established in
00:17:25.840 | 1920. Okay, cool. So yeah, that's how we would set up Llama Index with a vector database like
00:17:34.320 | Pinecone. Once we're done with that, maybe you want to
00:17:40.880 | ask some more questions, so obviously go ahead and do that. But once you are done and you're not going
00:17:45.280 | to use the index again, we delete the index, just so that we're not wasting resources there, and so
00:17:51.680 | that, at least if you're on the free tier, you can use that index for something else after that.
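Deleting it is a single call, assuming the index name from earlier:

```python
# Tear down the index so it stops consuming your Pinecone quota
pinecone.delete_index("llama-index-intro")
```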
00:17:57.360 | So that's it for this video. I just wanted to very quickly introduce Llama Index and how we would
00:18:05.040 | use it. Of course, like I said at the start, there is a lot more to Llama Index than what I'm showing
00:18:10.880 | here; this is very simply an introduction to the library. But anyway, I hope this has all been
00:18:18.000 | useful and interesting, so thank you very much for watching, and I will see you again in the next one.