
Llama Index 101 with Vector DBs and GPT 3.5


Chapters

0:00 Getting Started with Llama Index
1:13 Llama Index Features
2:15 Llama Index Code Intro
3:55 Llama Index Document Objects
5:45 Llama Index Nodes
7:23 Indexing with Pinecone
9:22 Vector Store in Llama Index
15:36 Making Queries with Llama Index

Whisper Transcript

00:00:00.000 | Today we're going to take a look at how we can use Llama Index in production with Pinecone.
00:00:06.480 | Now this is an introduction to the Llama Index library that was previously known as GPT Index.
00:00:12.160 | We're not going to go into any details on the more advanced features of the library.
00:00:18.560 | We're just going to see how to actually use it and get started with it and do that in a way that
00:00:25.680 | would be more production friendly with a vector database like Pinecone. Now for those of you that
00:00:31.680 | don't know, Llama Index is a library that helps us build a better retrieval augmentation pipeline for
00:00:41.280 | our LLMs. So we would use retrieval augmentation when we want to give our LLM source knowledge,
00:00:49.360 | so knowledge from the outside world or maybe some internal database or something along those lines.
00:00:56.560 | And that will help us, one, reference that other knowledge, so we can add in citations and things
00:01:03.280 | like that, and two, it will also help us reduce the likelihood of hallucinations. So Llama Index
00:01:10.560 | is a library that will support us in doing that. Now Llama Index can do a lot of things,
00:01:15.760 | not all of those we're going to cover in this video but the main features of the library include
00:01:22.320 | the data loaders that allow us to very easily extract data from APIs, PDFs, SQL databases,
00:01:31.760 | CSVs, all of the most common types of data sources. It also gives us some more advanced
00:01:39.680 | ways of structuring our data so we can add in connections between different data sources which
00:01:45.520 | is kind of useful. So imagine you have a lot of chunks of text from PDFs, what you can do is add in
00:01:52.000 | connections between those chunks. So the first chunk in your database would be connected to the
00:01:57.440 | next chunk with a little connector that says this is actually the next chunk and this is the previous
00:02:02.880 | chunk. And they also support things like post-retrieval re-ranking as well. So there's
00:02:08.960 | plenty to talk about but first let's get started with a simple introduction to the library. So
00:02:15.120 | we're going to walk through this notebook here. There will be a link to this notebook at the top
00:02:19.440 | of the video right now. So the first thing we need to do is install the prerequisite libraries, so
00:02:24.800 | go ahead and run that.
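The install cell looks roughly like this; the exact package list and pins are an assumption on my part rather than something shown on screen:

```python
# Rough sketch of the notebook's install cell (package pins assumed)
!pip install -qU llama-index pinecone-client datasets openai
```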
00:02:33.680 | Now, for the runtime here we don't need to be using a GPU, so you can just check whether you are using one or not. It costs money to use a GPU on Colab, so you can just set the hardware
00:02:38.480 | accelerator to None to save that money. Okay, and once we've done that, what we're going to do is
00:02:44.240 | just download a dataset. So I'm going to use SQuAD, the Stanford Question Answering
00:02:49.040 | Dataset. Okay, so there's a few things I'm doing here. First, I'm just getting the relevant columns I need
00:02:55.360 | there, so the id, the context, which is like a chunk of text, and the title, so basically the page
00:03:03.200 | title where that context is coming from. And then what I'm doing is dropping duplicates. So in the
00:03:08.240 | SQuAD dataset you will basically have like 20 questions and 20 different answers, but those 20
00:03:15.600 | questions all share an identical context, so you end up with a lot of duplicate
00:03:21.280 | contexts in there, and because we are just using the contexts, we actually need to remove all of that
00:03:26.960 | duplication. So that's what I'm doing here.
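A minimal sketch of that preparation step, assuming the Hugging Face datasets library; the notebook's exact column handling may differ:

```python
from datasets import load_dataset

# Load the SQuAD training split and keep only the columns we need
squad = load_dataset("squad", split="train")
data = squad.to_pandas()[["id", "context", "title"]]

# Each context is repeated once per question, so drop the duplicates
data = data.drop_duplicates(subset="context")
data.head()
```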
00:03:35.120 | Okay, so then we get this: we have our id, so it's like the document or context id, the context itself, and then we have where that is coming
00:03:41.520 | from. Okay, so the first few there are all from the University of Notre Dame Wikipedia page,
00:03:46.800 | and in total we have almost 19,000 records in there. So Llama Index uses these document objects,
00:03:58.720 | which you can think of as basically revolving around the context
00:04:05.200 | of your data, right, so this chunk of text, and it will include other bits of information
00:04:14.160 | around that context. So for us it's going to include the document id, right, so every document
00:04:20.080 | is going to need an id. And this is optional: we can also add extra info, which we can think of as
00:04:26.560 | metadata for our context. Now, for us we just have title, but obviously we could add more; this is a
00:04:33.920 | dictionary, so we could add, I don't know, something else here, right, and we could just, you know, put
00:04:39.600 | something in, but of course we don't actually need that, so we'll remove that. But yeah, you
00:04:46.400 | can put as many fields as you like in there. And yeah, let's run that and take a look at one of
00:04:52.240 | those documents and see what it looks like. You can think of this as a core object for Llama Index.
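As a rough sketch, building the document objects looks something like this; the constructor arguments follow the GPT Index-era API and should be treated as assumptions:

```python
from llama_index import Document

# One Document per context: the SQuAD id becomes the doc_id and the
# page title goes into extra_info as metadata
docs = [
    Document(
        text=row["context"],
        doc_id=row["id"],
        extra_info={"title": row["title"]},
    )
    for _, row in data.iterrows()
]
docs[0]
```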
00:05:00.960 | All right, so we have this document, we have the text, and then we go through here, we have the
00:05:08.080 | document id, and that is the extra info. Now, embedding: we don't have an embedding for it yet, so we're
00:05:13.760 | going to create that later but the embedding is also very important because that's what will
00:05:17.920 | allow us to search through that data set later on. Okay so now what we need to do is actually
00:05:27.040 | create those embeddings. So to create those embeddings we're going to be using OpenAI,
00:05:31.840 | so for that you will need to get an OpenAI API key from platform.openai.com, and then you would just put that in here.
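Something like this, with a placeholder key:

```python
import os

# Placeholder value; paste your own secret key from platform.openai.com
os.environ["OPENAI_API_KEY"] = "sk-..."
```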
00:05:41.680 | I have already done it, so I will move on. So one step further from our
00:05:49.280 | document is what we would call a node so a node the way that I would think of this is it's your
00:05:55.680 | document object with extra information about that document in relation to other documents within
00:06:03.120 | what will be your database. So let's say you have the chunks of paragraphs or chunks of text from a
00:06:10.480 | pdf a node will contain the information that chunk one is followed by chunk two and then in
00:06:17.440 | chunk two it will say chunk one was the preceding chunk before this so it has that relational
00:06:24.720 | information between the chunks whereas a document will not have that. So we would need to add that
00:06:31.920 | in there we're not going to do that here we'll talk about that in the future but we still need
00:06:37.120 | to use the nodes here, so we're going to just run this. So our nodes in this case are basically
00:06:44.480 | going to be the same as our documents in terms of the information that they carry but node is the
00:06:49.760 | object type that we will build our vector database from. So let's run this. I should say here
00:07:00.640 | we have set up our OpenAI API key; we don't actually need to use it yet, I should have
00:07:05.520 | really done that later, but it's there now, so we have that ready for when we do want to use it.
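The node-creation cell is roughly this; SimpleNodeParser reflects the library version of the time and is an assumption here:

```python
from llama_index.node_parser import SimpleNodeParser

# With default settings each node carries essentially the same text
# as the document it was parsed from
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(docs)
len(nodes)
```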
00:07:12.080 | Okay, so we've just created all of the nodes from the documents. Let's take a look at those nodes;
00:07:18.400 | okay, obviously we have the same number of nodes as we do documents. Now we are going to be using
00:07:26.400 | Pinecone, which is a managed vector database, as the database for our Llama Index data. Okay,
00:07:35.280 | so to use that we need to get our API key and environment, which we do from app.pinecone.io.
00:07:44.800 | Once you are in app.pinecone.io you should be able to see API Keys over on the
00:07:50.640 | left; you'll see something that looks like this, zoom out a bit. You just want to copy your API key
00:07:57.120 | and also remember your environment here, so I've got us-west1-gcp. So your API key, you put it in
00:08:03.440 | here, and here I'm going to put us-west1-gcp. Okay, and after running that, let me walk you through
00:08:12.880 | what's going on here. So we initialize our connection to Pinecone, we create our Pinecone
00:08:19.440 | index, and we're going to call it llama-index-intro; you can call this whatever you want.
00:08:22.960 | And the things that we do need to do are, one, create our index if it does not already exist, which, if
00:08:29.600 | you're running this for the first time, it won't. And to create that index you need to make sure the
00:08:35.360 | dimensionality is the same as the text-embedding-ada-002 model, which is the embedding model
00:08:40.400 | we're using, and that dimensionality is 1536. We also need to make sure we're using the right
00:08:47.520 | metric. We can actually use any metric here, so you can use Euclidean, dot product, or cosine,
00:08:52.720 | but I think cosine is the fastest in terms of the similarity calculation when you're using
00:09:01.120 | text-embedding-ada-002, although in reality the difference between them is practically nil, so you can use any,
00:09:09.040 | but I recommend cosine. Now, after that, we would just connect to the index. Okay, so here we're
00:09:16.080 | connecting to Pinecone, creating the index, and then connecting to that index.
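Roughly, assuming the v2 pinecone-client API and the index name used in the video:

```python
import pinecone

# Initialize the connection; key and environment come from app.pinecone.io
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# Create the index if it doesn't already exist; the dimension must match
# text-embedding-ada-002 (1536) and we use the cosine metric
if "llama-index-intro" not in pinecone.list_indexes():
    pinecone.create_index("llama-index-intro", dimension=1536, metric="cosine")

# Connect to the new index
pinecone_index = pinecone.Index("llama-index-intro")
```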
00:09:20.000 | Okay, once that is done, we can move on. So we've just created our index and
00:09:31.600 | connected to it; now what we want to do is connect to it through the vector store abstraction in
00:09:37.520 | Llama Index. To do that is pretty simple: PineconeVectorStore, and then we just pass in
00:09:44.160 | our index. That's it, that's pretty easy. So this will just allow us to use the other
00:09:50.720 | Llama Index components with our Pinecone vector store.
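That wrapper step is essentially one line (import path per the library version of the time):

```python
from llama_index.vector_stores import PineconeVectorStore

# Expose the Pinecone index through Llama Index's vector store abstraction
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
```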
00:09:55.440 | Cool, so I think that is all good. Then we have a few more things going on here, so let's talk
00:10:06.320 | through all of this; let me make it a lot more readable. So yeah, there's a few things going on.
00:10:13.440 | Basically, what we want to do here is create our index, which is this GPT vector
00:10:19.920 | store index so we're basically going to take all of our documents and we're going to take the
00:10:25.840 | service context which is like your embedding pipeline and we're also going to take the
00:10:32.160 | storage context, which is the vector store itself, and this will essentially act as a pipeline around
00:10:40.480 | that. So it's going to take all of our documents, feed them through our embedding
00:10:47.040 | pipeline, so this service context, embed all of them, and put them all into our vector store.
00:10:54.880 | So, I mean, in reality it's pretty straightforward. Let me just explain that from
00:11:01.360 | the perspective of where we're actually initializing these. So, storage context from defaults:
00:11:08.640 | it's really simple, we're just using our vector store. There are other parameters in here, but we
00:11:12.560 | don't need to use any of those because we're just using our vector store with the default
00:11:17.600 | settings. With the service context, like I said, this is the embedding pipeline; again, we don't really
00:11:23.760 | need to specify much here, we just need to specify, okay, we're using OpenAI embeddings.
00:11:28.160 | This is going to automatically pull in our API key, which we set earlier on, up here. Okay,
00:11:38.320 | so it's going to automatically pull in the API key. We do need to set the model:
00:11:44.240 | text-embedding-ada-002, at the time of me going through this, is the recommended model from OpenAI,
00:11:50.000 | and we have our embedding batch size so this is one important thing that you should set
00:11:54.880 | Basically, it will embed things in batches of 100. I think by default the value for this is much
00:12:03.200 | smaller, it's 32 or 16 or something like that. So that basically means it's going to
00:12:11.440 | take 16 chunks of text, send them to OpenAI, get the embeddings, and then it's going to
00:12:19.360 | pass them on to the storage context and upsert those to Pinecone. But what we've done here is set the
00:12:25.600 | embedding batch size to 100, so it's going to take 100, send them to OpenAI, then send them to Pinecone.
00:12:32.160 | That means that you need to make fewer requests, what is it, like six times fewer requests if
00:12:39.840 | you set this to 100, which means in essence you're going to be roughly six times faster, because the
00:12:45.440 | majority of the wait time in these API requests is actually the network latency, so it's making
00:12:51.680 | the request and receiving the response. So by increasing that batch size, things are going to be
00:12:57.360 | faster, which is, I think, what we all want. So yeah, we set that, and then with that we initialize our
00:13:05.840 | service context, right, so that embedding pipeline, or maybe we can even think of it as a pre-processing
00:13:12.080 | pipeline for our data, and then we just put everything together, so that's
00:13:18.480 | our full indexing pipeline, and we can initialize that.
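Put together, the pipeline looks roughly like this; the class names follow the library version of the time, and the embedding-model keyword arguments are assumptions:

```python
from llama_index import GPTVectorStoreIndex, ServiceContext, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding

# Storage context: where the vectors end up (our Pinecone vector store)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Service context: the embedding/pre-processing pipeline; batches of 100
# mean far fewer round trips to the OpenAI API
embed_model = OpenAIEmbedding(
    model="text-embedding-ada-002", embed_batch_size=100
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

# Feed every document through the pipeline and upsert the vectors to Pinecone
index = GPTVectorStoreIndex.from_documents(
    docs, storage_context=storage_context, service_context=service_context
)
```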
00:13:27.520 | Now, this can take a long time, and unfortunately Llama Index doesn't have a progress bar or anything built into it, but we can actually check
00:13:33.200 | the progress of our index creation. If we go to llama-index-intro here, we can go to index
00:13:42.560 | info, and then we see the total number of vectors that are in there. Okay, and you can
00:13:49.760 | also see the rate of them being updated as well, and you can then refresh and see where we
00:13:58.160 | are. Okay, so we're at 4.3 thousand, and we need to upsert how many? Quite a few, actually, so it's going
00:14:09.840 | to take a little while. What I might do is just stop that for now, and we can just jump ahead and
00:14:19.360 | begin asking questions, so I'm not waiting too long for that. But yeah, that's just one of the
00:14:25.680 | unfortunate things with Llama Index, but we can kind of get around that by just taking a look in
00:14:30.320 | the actual Pinecone dashboard at how many vectors we actually currently have in there.
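For example, with the Pinecone client directly (the output shown is illustrative):

```python
# Check how many vectors have been upserted so far
pinecone_index.describe_index_stats()
# e.g. {'dimension': 1536, 'total_vector_count': 4300, ...}
```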
00:14:38.000 | Okay, so yeah, let's stop that. Right now it is very slow to do this with Llama Index; if you were just
00:14:46.480 | wanting to get your vectors and documents in there, I would just use Pinecone directly, it's
00:14:52.160 | much faster. I mean, for what, 18,000, 16,000, whatever that number is, you're going to be
00:14:58.480 | waiting, I don't know, not long, like maybe a couple of minutes at most, because you need to embed
00:15:05.840 | things with OpenAI and then send things to Pinecone. Yeah, a few minutes if you set that up
00:15:12.160 | properly. But anyway, that does mean that we wouldn't benefit from the other things that
00:15:19.600 | Llama Index offers, so in some cases it might just be a case of being patient. But Llama Index and
00:15:27.120 | the embedding process will be optimized in the near future, so hopefully that will not take
00:15:33.120 | quite as long to actually upsert everything. So from here, let's pretend we've upserted everything.
00:15:40.320 | Now what we want to do is build our query engine. So the query engine is basically just the
00:15:47.840 | index, and we have this method called as_query_engine; it basically just reformats that into
00:15:54.000 | something that we can begin querying. Okay, so we create our query engine, then we do
00:16:01.040 | query_engine.query, and our question is going to be: in what year was the College of Engineering
00:16:06.240 | established at the University of Notre Dame? We saw that the first few items in there were talking
00:16:12.800 | about the University of Notre Dame, so we would expect that to work.
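The querying step is roughly:

```python
# Build a query engine from the index and run the question through it
query_engine = index.as_query_engine()
response = query_engine.query(
    "In what year was the College of Engineering established at the "
    "University of Notre Dame?"
)
print(response)
```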
00:16:22.800 | Why... okay, it looks like that hasn't actually initialized the index properly, because I kind of stopped it
00:16:28.160 | midway through. So what I'll do is I'm just going to index like 100 docs quickly, okay, so it's a bit
00:16:36.880 | quicker. Let's check... okay, so we still have all those other documents in there, so now let's
00:16:43.600 | try that. Okay, and we get: the College of Engineering was established in 1920. I'm sure it's one of the
00:16:49.680 | first items, it's probably where I got the question from, the university... oh yeah, so like question four
00:16:55.280 | here, I think. If we take a look at that, so data four... oops, oh, it's a DataFrame, so it should be iloc...
00:17:09.280 | no, three, I'm being stupid.
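To see where the answer came from, we can inspect the retrieved source chunks, roughly like this (attribute names vary a little between library versions):

```python
# Print the start of each retrieved context chunk behind the answer
for source in response.source_nodes:
    print(source.node.get_text()[:200])
```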
00:17:18.400 | Yeah, and we can have a look at the context. Okay, so it's pulling this information, established in
00:17:25.840 | 1920. Okay, cool. So yeah, that's how we would set up Llama Index with a vector database like
00:17:34.320 | Pinecone. Once we're done with that, maybe you want to
00:17:40.880 | ask some more questions, so obviously go ahead and do that. But once you are done and you're not going
00:17:45.280 | to use the index again, we delete the index, just so that we're not wasting resources there, and so
00:17:51.680 | that, at least if you're on the free tier, you can use that index for something else after that.
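Deleting it is a single call, assuming the index name from earlier:

```python
# Tear down the index so it stops consuming your Pinecone quota
pinecone.delete_index("llama-index-intro")
```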
00:17:57.360 | So that's it for this video. I just wanted to very quickly introduce Llama Index and how we would
00:18:05.040 | use it. Of course, like I said at the start, there is a lot more to Llama Index than what I'm showing
00:18:10.880 | here; this is very simply an introduction to the library. But anyway, I hope this has all been
00:18:18.000 | useful and interesting, so thank you very much for watching, and I will see you again in the next one.