Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)
Chapters
0:00 Introduction
2:22 Language Models
4:33 Fine-Tuning
6:04 Prompt Engineering (Few-Shot)
7:24 Prompt Engineering (QA)
10:15 RAG pipeline (introduction)
13:38 Embedding Vectors
19:41 Sentence Embedding
23:17 Sentence BERT
28:10 RAG pipeline (review)
29:50 RAG with Gradient
31:38 Vector Database
33:11 K-NN (Naive)
35:16 Hierarchical Navigable Small Worlds (Introduction)
35:54 Six Degrees of Separation
39:35 Navigable Small Worlds
43:08 Skip-List
45:23 Hierarchical Navigable Small Worlds
47:27 RAG pipeline (review)
48:22 Closing
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are going to talk about retrieval augmented generation 00:00:05.100 |
So as you know large language models can only answer questions or generate text based on their training data 00:00:11.400 |
So for example, if we have a language model that was trained in 2018 and we ask it about COVID, 00:00:17.400 |
the language model will probably not know anything about COVID. 00:00:21.040 |
One of the ways to augment the knowledge of a language model is to fine-tune the model on the latest data. 00:00:26.480 |
But there is another technique called retrieval augmented generation, which is very useful 00:00:31.040 |
Especially for question answering, and it has also recently been used by Twitter to create their new language model called Grok. 00:00:38.560 |
So Grok can access in real time all the data from the tweets and answer questions on the latest trends 00:00:46.320 |
We will explore what retrieval augmented generation is, the whole pipeline, all the pieces of the pipeline, and the 00:00:52.060 |
architecture behind each of these building blocks. 00:00:55.200 |
So let's review the topics of today. I will give a brief introduction to language models 00:01:01.000 |
So what they are, how they work, how we inference them 00:01:04.220 |
We will then move on to the pipeline that makes up retrieval augmented generation, with particular reference to the embedding vectors, and 00:01:14.340 |
we will also explore the architecture behind Sentence BERT, 00:01:18.200 |
Which is one of the ways to generate embeddings of sentences and then we will move on to vector databases 00:01:24.640 |
What they are, how we use them and how the algorithm of a vector database works in finding a particular vector 00:01:34.180 |
What I expect you guys to already know before watching this video is not that you are deeply familiar with the transformer, 00:01:44.400 |
but at least that you're familiar with the basic building blocks of a transformer, and that you have watched my previous video about BERT. 00:01:50.680 |
So if you're not familiar with these topics, I recommend you watch my previous video on the transformer and my previous video on BERT. 00:01:57.000 |
They will give you all the necessary background to fully understand the current video. Now, let's start our journey 00:02:03.080 |
We will start by introducing first of all my cat Oleo 00:02:07.580 |
Because I will be using him for a lot of the examples in my video 00:02:12.480 |
So if you don't read Chinese, you may not know how to read his name, 00:02:15.640 |
which is Oleo, after the Oreo biscuit, because he's black and white. 00:02:20.400 |
So let's get started. The first thing we will be talking about is language models. Now, what is a language model? 00:02:27.740 |
Well, a language model is a probabilistic model that assigns probabilities to sequences of words. In practice, a language model 00:02:35.440 |
allows us to compute the following: what is the probability that the word "China" comes next in the sentence 00:02:43.480 |
"Shanghai is a city in"? So the language model allows us to 00:02:48.000 |
model the probability of the next token given the prompt and 00:02:52.800 |
We usually train a neural network to predict these probabilities 00:02:57.420 |
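Written as a compact formula (just restating the example above), the language model estimates the probability of the next token given all the previous ones:

    P(w_t \mid w_1, w_2, \ldots, w_{t-1})
    \text{e.g. } P(\text{China} \mid \text{Shanghai is a city in})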
A neural network that has been trained on a very large corpora of text is known as a large language model 00:03:06.800 |
We usually have a very big corpus of text, which is a kind of collection of documents. 00:03:13.920 |
Often language models are trained on the entire Wikipedia, millions of web pages, or even thousands of books, because we want the language 00:03:21.740 |
model to acquire as much knowledge as possible, and we usually use a transformer-based architecture. 00:03:31.400 |
For example, LLaMA is usually known as a decoder-only network and BERT is usually known as an encoder-only network, 00:03:41.440 |
because the last part of LLaMA is basically the decoder of the 00:03:46.960 |
transformer, without the cross-attention, plus a linear layer plus a softmax, while BERT uses only the encoder side, 00:03:54.160 |
and then it has some heads, which can be a linear layer, 00:03:57.160 |
depending on the task we are using it for. 00:04:02.840 |
We usually build a prompt and then we ask the language model to continue this prompt by 00:04:08.440 |
Iteratively adding tokens such that the tokens that the language model adds make sense in 00:04:14.640 |
Continuity with the prompt. So for example here 00:04:18.840 |
I'm asking ChatGPT to continue a joke and ChatGPT continues the joke by adding a few more tokens that, when read 00:04:26.280 |
in their entirety, make sense. They are coherent with what I actually wrote. 00:04:33.440 |
You are what you eat: a language model can only output text and information 00:04:39.280 |
it was trained upon. This means that if we train a language model only on English content, 00:04:44.480 |
very probably it will not be able to output Japanese or French. 00:04:50.600 |
To teach new concepts, new content, new information to a language model, we need to fine-tune the model. 00:04:56.040 |
However, fine-tuning has some cons. For example, 00:05:01.160 |
it can be expensive in terms of the computation power necessary to fine-tune, 00:05:05.280 |
and the number of parameters of the model may not be sufficient to capture all the knowledge that we want to teach the model itself. 00:05:12.920 |
So for example, LLaMA was introduced with 7 billion, 13 billion and 70 billion parameters. 00:05:19.040 |
Why? Because with 7 billion parameters it can capture 00:05:24.200 |
some knowledge, but not as much as the 70-billion-parameter model. 00:05:29.320 |
So the number of parameters is a limitation on the amount of knowledge the language model can acquire 00:05:38.560 |
For example, if you have a model that has been trained on English content 00:05:43.440 |
and then you heavily fine-tune it on Japanese content, 00:05:46.660 |
at the end of the fine-tuning the model may become proficient in Japanese, 00:05:53.520 |
but it may forget some of the English content 00:05:57.040 |
it was initially trained upon. This is why we say that fine-tuning is not additive to the knowledge of the language model. 00:06:08.680 |
Another alternative to fine-tuning is prompt engineering. For example, 00:06:12.760 |
it is possible to ask the language model to perform a new task, 00:06:16.360 |
a task that it was not specifically trained upon, by working with the prompt. For example, 00:06:22.400 |
this is the few-shot prompting technique, and the following is an example. 00:06:27.080 |
We first give an instruction to the language model: here is the instruction on 00:06:32.840 |
what task the language model is going to perform. 00:06:36.160 |
Then we give the language model some examples of how to perform this task, 00:06:40.560 |
And then we ask the language model to perform it 00:06:43.680 |
For example the task here is Oleo is a cat that likes to play tricks on his friend Umar 00:06:49.400 |
By replacing all the names in everything he writes with meow 00:06:52.720 |
For example Umar writes Bob runs a YouTube channel and Oleo will modify it to meow runs a YouTube channel 00:06:59.880 |
So what happens if Umar writes Alice likes to play with his friend Bob 00:07:06.720 |
The language model comes up with the right answer, which is meow likes to play with his friend meow so 00:07:13.200 |
ChatGPT in this case was not trained to perform this task, 00:07:16.960 |
but by looking at the prompt and the example that we provided, it was able to come up with the right solution. 00:07:23.800 |
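As a concrete illustration, the few-shot prompt described here might look roughly like this (a sketch of my own wording, not the exact prompt shown on screen):

    Instruction: Oleo is a cat that likes to play tricks on his friend Umar by
    replacing all the names in everything he writes with "meow".

    Example:
    Umar writes: Bob runs a YouTube channel.
    Oleo writes: meow runs a YouTube channel.

    Now it's your turn:
    Umar writes: Alice likes to play with his friend Bob.
    Oleo writes: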
And we can use prompt engineering also for question answering with the same kind of reasoning 00:07:30.160 |
So we build a very big prompt that includes an instruction part 00:07:34.400 |
So you are an assistant trained to answer questions using the given context 00:07:39.680 |
In which we provide some context which is a piece of text in which to retrieve the answer 00:07:45.160 |
And then we ask the question to the language model 00:07:47.880 |
"How many parameters are there in Grok-0?" Grok is the language model introduced by Twitter, 00:07:53.960 |
And it's a language model that can also access the latest tweets, and we will see later how 00:08:00.560 |
But the point is, I am asking the language model, so ChatGPT, to tell me about Grok-0. 00:08:10.160 |
It doesn't know about the existence of Grok-0 because it came out very recently, 00:08:15.000 |
so the model, ChatGPT, was able to retrieve the answer by looking at the context. 00:08:23.440 |
It says: "Grok-0, the prototype LLM mentioned in the provided context, is stated to have been trained with 33 billion parameters", 00:08:31.440 |
because ChatGPT was able to access the context in which we talk about how many parameters there are in Grok-0, 00:08:39.720 |
which is this line: after announcing xAI, they trained the prototype LLM with 33 billion parameters. 00:08:51.160 |
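Putting this together, the question-answering prompt has roughly this shape (a sketch; the placeholder text in angle brackets is mine):

    You are an assistant trained to answer questions using the given context.

    Context: <paste here the text that contains the answer, e.g. the announcement
    stating that the prototype LLM was trained with 33 billion parameters>

    Question: How many parameters are there in Grok-0?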
This way of working with the prompt is actually very powerful, but 00:08:57.520 |
fine-tuning is not necessarily the wrong way to deal with this 00:09:01.800 |
lack-of-knowledge problem in language models, because usually, when we fine-tune a model on some specific 00:09:09.560 |
content, it results in higher quality results compared to just prompt engineering. 00:09:15.800 |
And also, as you saw before, to ask a question to ChatGPT we had to provide a lot of context, 00:09:23.840 |
which means that the number of tokens that we are giving to the model is quite big; the more knowledge we have, 00:09:29.560 |
the bigger the context that we need to provide to answer a particular question, and 00:09:34.300 |
we know that the more context we give, the more information the language model will have to come up with the right answer. 00:09:43.760 |
But the problem is that a bigger context is also computationally expensive. 00:09:49.360 |
So by fine-tuning we can actually reduce this context size, because we don't need to provide the context 00:09:54.840 |
anymore, since we are fine-tuning the language model on the specific data we will ask questions about. So to a language model that 00:10:02.400 |
has been fine-tuned we just need to ask the question, 00:10:05.200 |
"How many parameters are there in Grok-0?", without providing all the context, and if the language model has been fine-tuned correctly, 00:10:11.440 |
It will be able to come up with the right answer 00:10:15.040 |
Now we need to introduce the retrieval augmented generation pipeline 00:10:20.080 |
Because that's our next step and we will explore each building block that makes up the pipeline 00:10:25.440 |
So imagine that we want to do question answering with retrieval augmented generation, 00:10:30.200 |
compared to what we did before, when we did it with prompt engineering. 00:10:36.320 |
"How many parameters are there in Grok-0?": this is our query, and 00:10:42.000 |
imagine we also have some documents in which we can find this answer. These documents may be some PDF documents, 00:10:48.160 |
but they may also be web pages from which we can retrieve this answer. 00:10:51.920 |
What we do is we split all these documents or pieces of text into chunks of text 00:10:58.880 |
so small pieces of text for example a document may be made up of many pages and each page may be made up of 00:11:04.600 |
Paragraphs and each paragraph is made up of sentences 00:11:08.000 |
Usually we split each of these documents into small sentences, 00:11:12.980 |
and also the web pages are split in the same way. We then convert each sentence into an embedding, 00:11:21.220 |
such that each embedding is a vector of a fixed size that captures the meaning of each sentence. 00:11:28.720 |
Then we store all of these embeddings into a vector database and later 00:11:34.160 |
we will see how these embeddings and the vector database work. 00:11:37.520 |
Then we take also the query which is a sentence 00:11:41.120 |
We convert it into an embedding using the same model that we have used to convert the documents into embeddings 00:11:51.360 |
Then we search with this query embedding in our vector database, which already has many embeddings, each 00:11:56.380 |
representing a sentence from our documents, and it will come up with some results: the best matching 00:12:02.220 |
embeddings for our particular query. Each embedding is also mapped back to the text 00:12:09.240 |
it comes from, so the vector database is also able to retrieve the original text from which that embedding was created. 00:12:16.540 |
So if we ask, for example, "How many parameters are there in Grok-0?", 00:12:21.100 |
the vector database will search all of its embeddings and will give us the ones that best match our query. 00:12:27.420 |
So probably it will find all the pieces of text that talk about Grok-0 and the parameters it contains. 00:12:34.460 |
Now that we have the context and the query we create a template for a prompt 00:12:41.340 |
Just like the prompt we have used before: "You are an assistant trained to answer questions using the given context". 00:12:48.940 |
We paste the context and the query inside of the prompt template 00:12:54.000 |
And just like before we feed it to the language model 00:12:57.820 |
And then the language model will be able to answer our question by using the context provided 00:13:03.520 |
So with retrieval augmented generation we are not fine-tuning a model to answer the questions, 00:13:13.980 |
but we are introducing a database here, called the vector database, that can retrieve the 00:13:19.420 |
context given our query, so it can retrieve the context necessary to answer our particular question, 00:13:26.620 |
feed it to the language model, and then the language model, 00:13:29.980 |
using the context and our question, will very probably be able to come up with the right answer. 00:13:35.900 |
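A minimal sketch of this pipeline in Python-like pseudocode; the names split_into_sentences, embed, VectorDB and llm are hypothetical placeholders rather than a specific library:

    # Indexing phase (done once, offline)
    chunks = []
    for doc in documents:                          # PDF documents, web pages, ...
        chunks += split_into_sentences(doc)        # split into small pieces of text
    db = VectorDB(dim=768)
    for chunk in chunks:
        db.add(vector=embed(chunk), payload=chunk) # store embedding + original text

    # Query phase (done for every question)
    query = "How many parameters are there in Grok-0?"
    query_embedding = embed(query)                 # same embedding model as above
    results = db.search(query_embedding, k=3)      # best matching embeddings
    context = "\n".join(r.payload for r in results)

    prompt = ("You are an assistant trained to answer questions using the given context.\n"
              f"Context: {context}\nQuestion: {query}")
    answer = llm.generate(prompt)                  # the LLM answers using the retrieved context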
Now let's talk about embedding vectors: what they are and how they work. 00:13:43.660 |
Okay, first of all, why do we use vectors to represent words? For example, given the words "cherry", "digital" and "information", 00:13:52.140 |
If we represent embedding vectors using only two dimensions 00:13:55.900 |
So as you remember in the vanilla transformer each embedding vector is 512 dimensions 00:14:01.100 |
But imagine we are in a simpler world. We only have two dimensions 00:14:04.540 |
So we can plot these embedding vectors on a xy plane 00:14:09.580 |
And what we do is we hope to see something like this 00:14:13.420 |
so that words with a similar meaning, or words that represent the same concept, point in a similar direction in space. 00:14:21.980 |
So for example the word digital and the word information are pointing to the same direction in space 00:14:27.820 |
Such that the angle between words that have a similar meaning is small 00:14:33.500 |
So the angle between digital and information is small and the angle between 00:14:39.020 |
Words that have a different meaning is bigger 00:14:41.660 |
So for example the word cherry and digital have an angle that is bigger compared to digital and information 00:14:47.760 |
Indicating that they represent different concepts 00:14:53.420 |
If we introduce another word, for example "tomato", we expect it to point in this vertical direction here, such that the angle between "cherry" and "tomato" is small. 00:15:02.860 |
We usually use the cosine similarity to measure the angle between vectors and later we will see the formula of the cosine similarity 00:15:10.480 |
Now, how did we come up with the idea of representing words as embeddings? 00:15:17.280 |
The first idea is that words that are synonyms tend to occur in the same context 00:15:23.820 |
So surrounded by the same words, for example, the word teacher and the professor 00:15:29.500 |
usually occur near words like school, university, exam, lecture, course, etc., and vice versa: 00:15:37.100 |
We can also say that words that occur in the same context tend to have similar meaning 00:15:43.900 |
This is known as the distributional hypothesis 00:15:46.720 |
This means that to capture the meaning of a word 00:15:51.260 |
We also need to have access to its context so to the words surrounding it 00:15:57.660 |
This is also the reason why we employ self-attention 00:16:02.160 |
In the transformer model to capture the conceptual information of each token 00:16:07.260 |
So as you remember the transformer model we have this self-attention 00:16:11.260 |
The self-attention is a way to relate each token with all the other token in the same sentence 00:16:16.060 |
Based also on the position each token occupies in the sentence because we have the concept of positional encoding 00:16:23.180 |
So self-attention accesses two things to calculate its attention scores. 00:16:28.540 |
The first is the embedding of the word which captures its meaning 00:16:32.620 |
The second information it accesses is the positional encoding so that words that are closer to each other are related 00:16:38.940 |
Differently to words that are far from each other 00:16:42.140 |
And this self-attention mechanism modifies the embedding of each word in such a way that it also captures the 00:16:50.680 |
Contextual information of that word and the words that surround it 00:17:00.200 |
BERT has been trained with a task, which is called the masked language model task, to capture this contextual information. 00:17:08.040 |
This masked language model task is based on the cloze task, and we humans do it very often. Let me give you an example. 00:17:16.200 |
Imagine I give you the following sentence: "Rome is the ______ of Italy, 00:17:21.160 |
which is why it hosts many government buildings." Can you tell me what is the missing word? 00:17:27.320 |
Well, of course, the missing word is "capital", because by looking at the rest of the sentence it is the one that makes the most sense. 00:17:35.800 |
How did we come up with the word capital for the missing word? Well, we look at the words that were surrounding the blank space 00:17:44.840 |
So it means that the word "capital" depends on the 00:17:48.840 |
context in which it appears, on the words that surround it. 00:17:55.960 |
We want the self-attention mechanism to relate all the input tokens with each other 00:18:00.760 |
So that BERT has enough information about the context of the missing word to predict it 00:18:08.280 |
For example, imagine we want to train BERT on the masked language model task, 00:18:12.600 |
and we create an input with a masked word, just like before. 00:18:16.280 |
So "Rome is the ______ of Italy, which is why it hosts many government buildings": 00:18:21.560 |
we replace the blank space with a special token called [MASK]. 00:18:24.920 |
This becomes the input of BERT which is made of 14 tokens 00:18:31.880 |
BERT is a transformer model. So it will output it's a sequence to sequence model 00:18:36.360 |
So if the input is 14 tokens, the output will also be 14 tokens 00:18:40.440 |
We ask BERT to predict the fourth token, because it's the one that has been masked out. 00:18:46.280 |
We know what the word is: the word is "capital". 00:18:49.560 |
So we calculate the loss based on the predicted fourth token and what the actual fourth token should be, 00:18:57.080 |
and then we backpropagate the loss to update all the weights of the model. When we run backpropagation, 00:19:06.520 |
The model will also update the input embeddings here 00:19:10.360 |
So the input embeddings, through the backpropagation mechanism, will be modified in such a way that 00:19:17.800 |
the embedding associated with the word "capital" captures all the information about the context in which it appears, 00:19:27.080 |
so the next time BERT will have less trouble predicting it given its context. 00:19:32.920 |
And this is actually one of the reasons why we run backpropagation: because we want the model to get better and better. 00:19:41.560 |
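For reference, here is a small inference-only sketch of the masked-word prediction described above, using the Hugging Face transformers library (assuming the bert-base-uncased checkpoint; it does not show the training loop or backpropagation):

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    text = "Rome is the [MASK] of Italy, which is why it hosts many government buildings."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits            # one prediction per input token

    # position of the [MASK] token and the most likely word for it
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    predicted_id = int(logits[0, mask_pos].argmax())
    print(tokenizer.decode([predicted_id]))        # expected: "capital"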
What if I told you that actually we can also create embeddings not of single tokens 00:19:47.400 |
But also of entire sentences so we can use the self-attention mechanism to capture also the meaning of an entire sentence 00:19:54.920 |
What we can do is we can use a pre-trained BERT model to produce embeddings of entire sentences. Let's see how 00:20:02.760 |
Well, suppose we have a simple input sentence 00:20:05.960 |
For example, this one, made of 13 tokens: "Our professor always gives us lots of assignments to do in the weekend". 00:20:13.400 |
We feed it to BERT, but notice that I removed the linear layer from BERT 00:20:21.080 |
The first thing we notice is that the input of the self-attention is a matrix of shape 13 by 00:20:27.860 |
768. Why? Because we have 13 tokens and each token is represented by an embedding with 00:20:34.100 |
768 dimensions, which is the size of the embedding vector in BERT base. 00:20:43.220 |
The self-attention mechanism will output another matrix with the shape 13 by 00:20:49.780 |
768 so 13 tokens each one with its own embedding of 768 dimensions 00:20:57.300 |
And the output of the self-attention is an embedding that captures not only the meaning of the word or its position in the sentence 00:21:04.740 |
But also all the contextual information all the relationship between other words and the current word 00:21:09.860 |
So the output will be 13 tokens each one represented by an embedding of size 768 00:21:17.320 |
What we do now each of them is representing kind of the meaning of a single word, right? 00:21:24.820 |
So what we do we can average all of them to create the embedding of the sentence 00:21:32.840 |
We calculate the average of them which will result in a single vector with 00:21:38.100 |
768 dimensions and this single vector will represent the sentence embedding which captures the meaning of the entire sentence 00:21:45.700 |
And this is how we create the embedding of a sentence 00:21:53.140 |
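A minimal sketch of this mean-pooling idea with a pre-trained BERT from the Hugging Face transformers library (simplified: it encodes a single sentence, so it ignores padding and the attention mask):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")        # encoder only, no task head

    sentence = "Our professor always gives us lots of assignments to do in the weekend"
    inputs = tokenizer(sentence, return_tensors="pt")

    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state      # shape: (1, num_tokens, 768)

    sentence_embedding = token_embeddings.mean(dim=1)             # shape: (1, 768)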
Now we can use sentence embeddings to see if two sentences have a similar meaning. So for example, imagine 00:22:02.500 |
The query for example before was talking about how many parameters are there in grok zero 00:22:07.140 |
And then we have another sentence that talks about how many parameters there are in grok zero 00:22:11.460 |
So, how can we relate these two sentences? We need to find a similarity 00:22:16.320 |
function, and what we usually use is the cosine similarity, because they are both vectors, 00:22:21.620 |
and the cosine similarity can be calculated between vectors: it measures the cosine of the angle between the two vectors. 00:22:29.300 |
A smaller angle results in a higher cosine similarity score, while a bigger angle results in a smaller 00:22:36.480 |
cosine similarity score, and this is the formula of the cosine similarity score. 00:22:43.520 |
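For reference, the standard cosine similarity formula between two vectors A and B is:

    \text{cosine\_similarity}(A, B) = \cos\theta
        = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
        = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}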
So, nobody told BERT that the embeddings it produces should be comparable with the cosine similarity: 00:22:52.500 |
we just take the token embeddings it outputs and then we take the average of these embeddings, 00:22:55.440 |
but nobody told BERT that these embeddings should be made in such a way that two similar sentences produce similar embeddings. 00:23:03.460 |
How can we teach BERT to produce embeddings that can be compared with a similarity function of our choice 00:23:13.040 |
Which could be a cosine similarity or the euclidean distance 00:23:16.420 |
Well, we introduce Sentence BERT. Sentence BERT is one of the most popular techniques 00:23:24.560 |
to produce embeddings for entire sentences that can be compared using a similarity function of our choice, in this case the cosine similarity. 00:23:33.840 |
Sentence BERT was introduced in a paper called "Sentence-BERT: Sentence Embeddings using Siamese BERT- 00:23:40.380 |
Networks", and we will see what all of this means. What is a Siamese network? 00:23:47.340 |
Imagine we have two sentences that are similar in meaning, for example: 00:23:50.300 |
"My father plays with me at the park" and "I play with my dad at the park". 00:23:55.820 |
The first one we will call it sentence A and the second one we will call it sentence B 00:24:02.940 |
We feed them to BERT. So each of them will be a 00:24:08.700 |
Sequence of tokens. For example, this one may be 10 tokens and this one may be 8 tokens 00:24:12.940 |
We feed it to BERT which will produce output of 10 tokens and 8 tokens 00:24:18.300 |
Then we do the pooling the mean pooling that we did before 00:24:21.900 |
so we take all these output tokens and we calculate their average to produce only one 00:24:27.020 |
vector of dimension 768, in case we are using BERT base, or bigger in case we are using a bigger BERT. 00:24:35.660 |
The first one we will call it sentence embedding A 00:24:38.300 |
So the first vector is the embedding of the sentence A and the second one is the embedding of the sentence B 00:24:44.140 |
We then measure the cosine similarity between these two vectors 00:24:48.640 |
We have a target cosine similarity, because we are training this 00:24:52.700 |
model. So, for example, given these two sentences, which are quite similar in meaning, 00:24:59.980 |
We may have a target that is very close to one because the angle between them will be small 00:25:05.500 |
we hope that the angle between them should be small so 00:25:08.620 |
We calculate the loss because we have a target and the output of the model and then we run back 00:25:14.460 |
propagation to update all the weights of this model. 00:25:17.660 |
Now, as you can see, this model is made up of two branches that are the same in structure: this is called a Siamese network, 00:25:27.340 |
which is a network that is made of two or more branches that are the same as each other with respect to the architecture, 00:25:34.960 |
but also with respect to the weights of the model. 00:25:39.100 |
So when we represent these Siamese networks we draw them as two branches, but when we code this model, actually, 00:25:49.980 |
there is only one branch here. And what we do is: 00:25:57.500 |
First we run the sentence A through this model 00:26:01.340 |
Then we run the sentence B also through this model 00:26:04.700 |
We calculate the cosine similarity between these two output and then we run back propagation such that 00:26:09.980 |
the backpropagation will only modify the parameters of this one model here. When we represent it 00:26:17.740 |
for illustration we draw it as two branches, but remember that there are not actually two branches: it's only one branch, 00:26:25.340 |
only one set of weights, only one architecture, and the same number of parameters. 00:26:32.940 |
If we train BERT, the Sentence BERT, like this, it will produce embeddings such that similar 00:26:40.940 |
sentences have a high cosine similarity, 00:26:47.440 |
and so we can compare them using the cosine similarity measure. 00:26:51.600 |
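A minimal PyTorch-style sketch of this training setup (my own simplification: it assumes model is the single shared BERT, bert_encode is a helper that runs it plus mean pooling as shown earlier, and pairs is a dataset of sentence pairs with a target similarity between 0 and 1):

    import torch
    import torch.nn.functional as F

    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

    for sent_a, sent_b, target in pairs:
        emb_a = bert_encode(sent_a)                 # same BERT weights for both branches
        emb_b = bert_encode(sent_b)
        pred = F.cosine_similarity(emb_a, emb_b, dim=-1)
        loss = F.mse_loss(pred, torch.tensor([target]))
        loss.backward()                             # backprop updates the one shared set of weights
        optimizer.step()
        optimizer.zero_grad()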
Also, if you remember, BERT, at least BERT base, produces embeddings of size 00:26:58.320 |
768. If we want to produce sentence embeddings that are smaller than 768 dimensions, 00:27:05.220 |
we can include a linear layer here to reduce the dimensions, for example to go from 768 to 512. 00:27:13.760 |
In the paper of Sentence BERT, actually, they not only 00:27:17.440 |
use the mean pooling that we used, so calculating the average of all the tokens output by BERT to produce one 00:27:24.320 |
vector that represents the meaning of the entire sentence, but they also use max pooling, 00:27:29.360 |
And another technique that they use is the CLS token 00:27:32.720 |
So, if you remember, the CLS token is the first token that we give as input to BERT, 00:27:37.040 |
and it's also the first token that is output by BERT, 00:27:40.880 |
and usually, because of the self-attention mechanism, this CLS token captures 00:27:46.320 |
information from all the other tokens. 00:27:54.240 |
However, in the Sentence BERT paper they have shown that 00:27:57.200 |
both methods, so the max pooling and the CLS token, don't perform better than mean pooling, 00:28:03.440 |
so they recommend using mean pooling, which is also what is actually used in production nowadays. 00:28:14.180 |
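In practice, models trained this way are available through the sentence-transformers library, so producing comparable sentence embeddings takes only a few lines (the model name below is one common choice, not the only one):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([
        "My father plays with me at the park",
        "I play with my dad at the park",
    ])
    print(util.cos_sim(embeddings[0], embeddings[1]))   # high score for similar sentences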
What is the pipeline of retrieval augmented generation, now that we know how embeddings work? 00:28:19.780 |
So we have our query which is how many parameters are there in grok zero? 00:28:24.900 |
Then we have some documents in which we can find this answer. So the documents may be pdf documents or web pages 00:28:31.780 |
We split them into single sentences and we embed these sentences using our Sentence BERT. 00:28:40.180 |
Our Sentence BERT will produce vectors of a fixed size, suppose 768, 00:28:44.680 |
And we store them all these vectors in a vector db. We will see later how it works 00:28:51.460 |
The query is also converted into a vector of size 768, and we search the vector database for the embeddings 00:29:09.160 |
that best match our query. What do we mean by this? 00:29:12.740 |
We mean all the embeddings that, when we calculate the cosine similarity score with our query, 00:29:19.060 |
result in a high value, or, if we are using another 00:29:22.420 |
measure, for example the Euclidean distance, the distance is small, depending on which 00:29:28.180 |
distance we are using, the cosine similarity or the Euclidean distance. 00:29:33.060 |
This will produce the top embeddings that best match our query, and we map them back to the text from which they originated. 00:29:41.140 |
This will produce the context that we feed into our prompt template 00:29:46.360 |
Along with the query we give it to the large language model, which will produce the answer 00:29:51.540 |
As we saw previously, to augment the knowledge of a language model 00:29:55.060 |
We have two strategies fine tuning on a custom data set or using a vector database made up of embeddings 00:30:01.700 |
We can also use a combination of both for example by fine tuning for a few epochs 00:30:06.260 |
And then using a vector database to complement with knowledge retrieved from the web 00:30:10.260 |
Whatever strategy we decide to proceed with we need a reliable scalable and easy to use service for building our retrieval augmented generation pipelines 00:30:21.620 |
Gradient is a scalable ai cloud platform that provides simple apis for fine-tuning models generating embeddings and running inference 00:30:30.020 |
Thanks to its integration with the popular library LlamaIndex, 00:30:33.140 |
We can build retrieval augmented generation pipelines with few lines of code 00:30:37.140 |
For example, we select the model we want to use in our case 00:30:41.380 |
it's Llama 2. We define the set of custom documents that the model can use to retrieve answers, 00:30:46.980 |
We define the model that we want to use to generate embeddings 00:30:50.200 |
and ask a question, for example, "Do you know anyone named Oleo?", and voilà: 00:30:55.380 |
Thanks to the power of retrieval augmented generation. The model can now retrieve information about our channel's mascot oleo 00:31:01.620 |
Gradient helps us build all aspects of a retrieval augmented generation pipeline 00:31:06.820 |
For example, we can also fine-tune models on custom data as well as generate embeddings 00:31:11.720 |
With gradient you have total ownership of your data as well as the weights of fine-tuned models open source models are great 00:31:18.820 |
Because they save time on development and debugging as we have access to the architecture and the support of a vast community of developers 00:31:25.640 |
Gradient also integrates with the popular libraries LlamaIndex and LangChain. 00:31:30.100 |
Check the link in the description to redeem your five dollar coupon to get started today with gradient 00:31:38.420 |
Let's talk about vector databases what they are and how their matching algorithm works 00:31:44.100 |
So, how can the vector database search our query in all the vectors that it has stored? 00:31:51.620 |
A vector database stores vectors of fixed dimensions called embeddings such that 00:31:57.540 |
We can then query the database to find all the embeddings that are closest or most similar to our query, using a distance metric 00:32:05.540 |
such as the cosine similarity or the Euclidean distance. 00:32:09.140 |
The vector database uses a variant of the knn algorithm which stands for the k nearest neighbor algorithm or another 00:32:16.580 |
similarity search algorithm, but it's usually a variant of the KNN algorithm. 00:32:21.160 |
And the vector databases are not only used in retrieval augmented generation pipeline 00:32:25.940 |
They are also used for finding for example similar songs 00:32:28.820 |
So if we have songs we can create embeddings of them 00:32:32.100 |
so some embedding, some vector, that captures all the information about that song. 00:32:39.780 |
When, for example, we have a user who wants to find all the songs similar to a given one, 00:32:44.740 |
we create the embedding of that song and compare it with all the others using some similarity score, for example the cosine similarity score. 00:32:52.260 |
Google Images, for example, also searches for similar images using a similar technique: an embedding 00:32:58.740 |
space in which they produce the embedding of a particular image and all the others, and then they check which ones match best. 00:33:10.120 |
Now, KNN is an algorithm that allows us to compare a particular query with all the vectors that we have, 00:33:21.140 |
sort them by distance or by similarity, depending on which one we use, and then keep the top best matching ones. 00:33:30.420 |
So imagine our query is "How many parameters are there in Grok?" and imagine we have a vector database 00:33:36.100 |
made up of 1 million embeddings. Actually, 1 million is not even a big number, because if you consider that 00:33:42.980 |
Grok, for example, is accessing the tweets in real time, every day 00:33:48.580 |
we have hundreds of thousands of tweets, if not millions of tweets, 00:33:53.700 |
so the number of vectors it has is actually in the order of billions, I think, not even millions. 00:34:00.420 |
So actually millions looks like a big number, but it's not 00:34:03.460 |
Especially when we deal with the textual data also from the web we have billions of web pages that we may need to index 00:34:10.100 |
So what we do for example in this knn with the naive approach 00:34:14.660 |
Which is the most simple way of matching a query to all the other vectors is to compare 00:34:20.260 |
this query with all the other vectors using our cosine similarity function. For example, we may have this: 00:34:26.500 |
with the first vector it may be 0.3, with the second 0.1, then 0.2, etc. 00:34:39.060 |
Then we sort them, so, for example, the highest one, for example this one, should be the first one, 00:34:43.380 |
this one should be the second one, etc., and we keep the top K, 00:34:47.140 |
So the top three or the top two depending on how we chose k 00:34:50.580 |
Now this is actually a very simple approach, but it's also very slow because if there are n 00:34:56.980 |
embedding vectors, so in this case 1 million, and each one has d dimensions, 00:35:01.780 |
so in this case, for example in the case of BERT base, 768, 00:35:05.480 |
the computational complexity is in the order of n multiplied by d, which is very, very slow. 00:35:15.940 |
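This brute-force comparison can be sketched in a few lines of NumPy, just to illustrate the n times d cost (a simple illustration, not how a real vector database does it):

    import numpy as np

    def naive_knn(query, vectors, k=3):
        # vectors: (n, d) matrix of stored embeddings; query: (d,) vector
        dots = vectors @ query                       # n dot products of d terms each
        sims = dots / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
        return np.argsort(-sims)[:k]                 # indices of the k most similar vectors

    # e.g. with n = 1,000,000 and d = 768 this is roughly 768 million multiplications per query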
And we will be exploring one algorithm in particular that is also used 00:35:20.900 |
right now in the most popular vector DBs, which is called Hierarchical Navigable Small Worlds. 00:35:29.620 |
The idea is that we will trade precision for speed. The algorithm we saw before, so the naive one, 00:35:39.860 |
performs very slowly, but it's actually precise, because the query is compared with each of the vectors, 00:35:46.600 |
so it will always produce accurate results, because all the possible comparisons are done. 00:35:56.100 |
To increase the speed we need to reduce the number of comparisons that we do, and the metric that we usually care about in similarity search 00:36:03.540 |
is recall. So recall basically indicates the following: suppose that 00:36:11.620 |
the best matching vectors are, for example, this one, this one and this one, 00:36:19.620 |
and we expect our top-three query, so the KNN, to retrieve all three of them, 00:36:25.700 |
but imagine it only returns two. In this case the query 00:36:30.900 |
returned only two of the best matching vectors, so we will say that the recall is 66 percent, or two thirds. 00:36:41.060 |
So recall measures how many relevant items are retrieved among all the relevant items that should have been retrieved by our search. 00:36:48.180 |
And we will see one algorithm in particular for approximate nearest neighbors, which is called Hierarchical Navigable Small Worlds. 00:36:57.060 |
Now, Hierarchical Navigable Small Worlds is actually 00:37:01.540 |
very popular nowadays in vector databases, and in particular it's also the same algorithm that powers the database Qdrant, 00:37:09.000 |
which is also the open-source vector database used by Twitter's Grok LLM. 00:37:13.940 |
For example, this is the exchange of tweets that I saw the other day between Elon Musk and the team of Qdrant, in which 00:37:21.380 |
Qdrant says that Grok is actually accessing all the tweets in real time 00:37:28.900 |
using their vector database, which is Qdrant. And if we check the 00:37:33.700 |
documentation of this vector database, we will see that Qdrant currently only uses Hierarchical Navigable Small Worlds as its vector index. 00:37:41.860 |
So this is the algorithm that powers the database that is currently used by twitter to 00:37:47.060 |
introduce retrieval augmented generation in its large language model grok 00:37:54.580 |
The first idea behind Hierarchical Navigable Small Worlds is the following: 00:38:02.900 |
actually, Hierarchical Navigable Small Worlds is an evolution of an earlier algorithm called Navigable Small Worlds, 00:38:09.380 |
which is an algorithm for approximate nearest neighbors that we will see later, 00:38:13.700 |
and which is based on the idea of six degrees of separation. 00:38:17.080 |
So in the 1960s, there was an experiment which is called the milgram experiment 00:38:23.540 |
Which wanted to test the social connections among people in the usa 00:38:27.380 |
The participants, who were initially located in Nebraska and Kansas, were given a letter to deliver to a person that they didn't know. 00:38:38.980 |
However, they were not allowed to send the letter directly to the recipient; instead, they were instructed to send it to someone they knew personally who might know the recipient. 00:38:54.200 |
They found that the letter reached the final recipient in five or six steps 00:39:00.180 |
Creating the concept that people all over the world are connected by six degrees of separation 00:39:05.880 |
And actually in 2016 facebook published a blog post in which they claimed that the 1.59 00:39:12.040 |
Billion active users on facebook were connected by an average of 3.5 degrees of separation 00:39:19.300 |
This means that between me and Mark Zuckerberg there are 3.5 connections, on average of course, 00:39:25.220 |
Which means that I have a friend who has a friend who has a friend who knows mark zuckerberg 00:39:30.840 |
And this is the idea of the degrees of separation 00:39:34.840 |
So, let's see what this Navigable Small Worlds is. 00:39:38.260 |
Now, the Navigable Small Worlds algorithm builds a graph that, just like Facebook friendships, connects close vectors with each other, 00:39:47.300 |
But keeping the total number of connections small 00:39:50.020 |
For example, every vector may be connected with up to six other vectors, to mimic the six degrees of 00:39:56.340 |
separation. For example, imagine we have a very small database with only 15 vectors, 00:40:02.340 |
Each one representing a particular sentence that we retrieved from 00:40:06.660 |
Our knowledge source which could be documents or web pages 00:40:10.420 |
And we have a graph like this in which for example 00:40:13.700 |
The first text is about the transformer which is connected to another piece of text that talks about the transformer model 00:40:20.580 |
Then we have another text, which connects node three with node two, and which now talks about cancer and AI. 00:40:32.580 |
How do we find a given query in this particular graph? 00:40:36.900 |
So imagine we have our query which is how many encoder layers are there in the transformer model? 00:40:43.940 |
How does the algorithm find the k nearest neighbors? So the best matching k vectors for our query? 00:40:50.660 |
The algorithm will proceed like this. It will find first of all an entry point in this graph, which is randomly chosen 00:40:57.460 |
So we randomly choose among all these vectors one node as a random as an entry point 00:41:05.460 |
And then we compare the similarity score of this query and this node 00:41:10.740 |
And compare it with the similarity score of the query with the friends of this node 00:41:15.140 |
So with the number with the node number seven and the node number two 00:41:18.900 |
If one of the friends has a better similarity score, then we move there. 00:41:25.060 |
So the number two, for example, may have a better similarity score with the query compared to the number five, 00:41:32.100 |
So we move to the number two and then we do again this process 00:41:35.780 |
so we check what is the cosine similarity score between the node number three and the query and we compare it with the cosine similarity of 00:41:44.740 |
Vector number two with the query if the number three has a better cosine similarity score 00:41:50.020 |
we move there, and we keep doing this until we reach a node whose 00:41:55.700 |
friends, so the number eight and the number six, 00:42:03.460 |
have a worse similarity score with respect to the query compared to the current one. 00:42:06.260 |
So this number four has the best cosine similarity score among all of his connections. Among all the visited nodes, 00:42:19.220 |
basically, we order them from the best matching to the worst matching and we keep the top K. 00:42:25.300 |
Also, we repeat this search many times by choosing different random entry points and 00:42:33.300 |
each time we sort all of the results by similarity score and then we keep the top K, 00:42:39.140 |
and these are the best matching K nearest neighbors, using the Navigable Small Worlds algorithm. 00:42:45.720 |
If we want to insert a vector into this graph, we just do what we did before: we actually search. 00:42:52.900 |
Given, for example, that we want to insert this query, 00:42:56.100 |
we just run it like a search, and then, when we have found the top K, 00:43:01.780 |
we connect the new vector with the top K, and that's how we insert a new item into this graph. 00:43:11.140 |
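A sketch of this greedy search over the proximity graph (simplified: it returns a single local best per entry point; as described above, the real search is repeated from several random entry points and the visited nodes are sorted to keep the top K):

    import random

    def nsw_greedy_search(graph, query, similarity, entry=None):
        # graph: dict mapping each node to the list of its "friends" (neighbours)
        current = entry if entry is not None else random.choice(list(graph))
        while True:
            # the friend most similar to the query
            best_friend = max(graph[current], key=lambda node: similarity(node, query))
            if similarity(best_friend, query) > similarity(current, query):
                current = best_friend       # a friend is closer to the query: move there
            else:
                return current              # local best: no friend improves on the current node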
Hierarchical Navigable Small Worlds is based on another data structure, which is called the skip list. 00:43:20.180 |
So, to understand Hierarchical Navigable Small Worlds, we need to introduce this new 00:43:24.100 |
data structure. The skip list is a data structure that maintains a sorted list 00:43:30.980 |
and allows search and insertion with an average logarithmic time complexity. 00:43:41.380 |
First of all, as you can see, this is not only one linked list: 00:43:47.460 |
we have many levels of linked lists. We have level 0, level 1, level 2, level 3. 00:43:54.500 |
The bottom level has the largest number of items; the more we go up, the fewer items there are. 00:44:00.740 |
So, if we want to search for the number nine in this skip list, 00:44:06.100 |
we start from the top level, so we start from the head at level three. 00:44:10.340 |
We check what is the next item and we compare it with the number we are looking for. 00:44:15.060 |
So the first item is number five; we compare it with what is the next item, and in this case 00:44:20.260 |
it's the end of the list, which means that the number nine must be further down. 00:44:24.100 |
So we go down one level; we then compare with the next item, which is number 17, 00:44:28.500 |
which means that nine cannot be after this node, because 17 is bigger, so it must be down. 00:44:34.340 |
We go down again and then we compare with the next node, which is the number nine, and we see: 00:44:39.540 |
Okay, we reached the number nine. Now imagine we want to find another number. Let's say the number 12 00:44:44.980 |
We start again from the head at level three. We go to the first item and compare with the next: okay, it's the end, 00:44:52.100 |
so we go down. Then we arrive here, we compare with the next, we see it's 17, so it's bigger than 12, 00:45:00.580 |
so we go down again. The next item is number nine, so we visit nine, 00:45:05.860 |
and then we compare with the next item, which is 17, so it's bigger than 12, 00:45:11.620 |
so we go down, we go here, and then we compare with the next item, 00:45:17.140 |
Which is number 12 and it's the number that we are looking for and we stop there 00:45:20.980 |
So this is how the skip list works; a small sketch of this search follows below. 00:45:26.660 |
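A compact sketch of this skip-list search, assuming each node stores a value and, for every level, a pointer to the next node on that level (a simplified illustration, not a production implementation):

    def skip_list_search(head, target, top_level):
        # head: sentinel node whose next[level] pointers start each level (0 = dense, top_level = sparse)
        node = head
        for level in range(top_level, -1, -1):           # start from the sparsest level
            # move right while the next value is still smaller than the target
            while node.next[level] is not None and node.next[level].value < target:
                node = node.next[level]
            # otherwise drop down one level and continue
        candidate = node.next[0]
        return candidate if candidate is not None and candidate.value == target else None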
Now, to create Hierarchical Navigable Small Worlds, we combine the concept of Navigable Small Worlds with the idea of the skip list, producing the 00:45:32.820 |
Hierarchical Navigable Small Worlds algorithm. 00:45:36.600 |
So we have a lower level, which has more nodes and more connections, and an upper level, which has fewer nodes and fewer 00:45:45.140 |
connections; we say that the lower one is more dense and the upper one is more sparse. 00:45:53.700 |
Suppose we have a query just like before and we want to search in this graph 00:45:58.500 |
We find a random entry point in the top level of this graph 00:46:06.420 |
We check the cosine similarity of this node with the query, and of all of his friends with the query, and we see that this one is the best. 00:46:20.180 |
So we check the cosine similarity of this node with the query and all of his friends with the query 00:46:25.780 |
And we see that this friend here has a better cosine similarity score. So we move here 00:46:30.260 |
Then we check this node here with all of his friends and we see that this one is the best one. So we go down 00:46:36.580 |
Also this one we see that it's the best one among all of his friends for the cosine similarity score 00:46:43.540 |
So we go down and then we do again this test and we say what is the cosine similarity score of this 00:46:50.180 |
Node and the query and also all of his friends with the query 00:46:57.220 |
And we move to the one that is best in case there is one and then we stop as soon as we find a local best 00:47:09.460 |
We repeat this search just like before by using randomly selected entry points 00:47:14.580 |
We take all the vectors that we have visited, we sort them, and we keep the top 00:47:19.620 |
K best matching based on the similarity score that we are using or the distance function that we are using 00:47:25.780 |
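Combining the two ideas, the search can be sketched like this (highly simplified: it reuses nsw_greedy_search and random from the Navigable Small Worlds sketch above, and a real HNSW keeps a dynamic candidate list per layer rather than a single node):

    def hnsw_search(layers, query, similarity):
        # layers: list of graphs from the sparsest (top) to the densest (layer 0), like the skip list
        entry = random.choice(list(layers[0]))           # random entry point in the top layer
        for graph in layers:
            # greedy search in this layer; its local best becomes the entry point of the layer below
            # (upper-layer nodes also exist in the lower layers, so this is a valid entry point)
            entry = nsw_greedy_search(graph, query, similarity, entry=entry)
        return entry   # in practice: collect the visited nodes, sort by similarity, keep the top K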
Okay, now that we have seen also how the vector database works let's review again the pipeline of retrieval augmented generation by 00:47:37.700 |
Summing up what we have seen. So again, we have our query 00:47:41.220 |
We have some documents from which we want to retrieve the knowledge 00:47:44.740 |
We split them into pieces of text and we create embeddings using Sentence BERT, 00:47:49.060 |
for example. We store all these vectors in a vector database. 00:47:52.840 |
We convert our query into an embedding and we search in the vector database using the algorithm that we have seen before. 00:48:02.660 |
This will produce the top K embeddings best matching our query, from which we go back to the text that they were taken from. 00:48:10.260 |
We combine the query and the context retrieved in a template 00:48:15.780 |
We feed it to the large language model and finally the large language model will produce the answer 00:48:24.260 |
I hope you learned a lot today because I wanted to create this video for a long time actually 00:48:29.460 |
But I wanted also to understand myself all the algorithms that were behind this pipeline 00:48:34.340 |
And I hope that you are also now familiar with all these concepts 00:48:37.960 |
I know that actually implementing the RAG pipeline is very easy; there are many 00:48:43.220 |
popular libraries like LlamaIndex and LangChain, but I wanted to actually go deep into how it works and how each 00:48:50.660 |
building block actually works together with the others. 00:48:53.620 |
Please let me know if there is something that you don't understand. 00:48:57.620 |
I will try to answer all the questions in the comment section. Also, let me know if there is 00:49:02.980 |
something that you want better clarified in my future videos, or how I can improve my future videos for better clarity. 00:49:12.260 |
This is the best motivation for me to continue making high quality content and like the video if you like it 00:49:17.060 |
Share the video with your friends with your professors with your students, etc