
Retrieval Augmented Generation (RAG) Explained: Embedding, Sentence BERT, Vector Database (HNSW)


Chapters

0:00 Introduction
2:22 Language Models
4:33 Fine-Tuning
6:04 Prompt Engineering (Few-Shot)
7:24 Prompt Engineering (QA)
10:15 RAG pipeline (introduction)
13:38 Embedding Vectors
19:41 Sentence Embedding
23:17 Sentence BERT
28:10 RAG pipeline (review)
29:50 RAG with Gradient
31:38 Vector Database
33:11 K-NN (Naive)
35:16 Hierarchical Navigable Small Worlds (Introduction)
35:54 Six Degrees of Separation
39:35 Navigable Small Worlds
43:08 Skip-List
45:23 Hierarchical Navigable Small Worlds
47:27 RAG pipeline (review)
48:22 Closing

Transcript

Hello guys, welcome back to my channel. Today we are going to talk about retrieval augmented generation. As you know, large language models can only answer questions or generate text based on their training data. For example, if we have a language model that was trained in 2018 and we ask it about COVID, the model will probably not know anything about COVID. One way to augment the knowledge of a language model is to fine-tune it on the latest data, but there is another technique called retrieval augmented generation, which is very useful, especially for question answering. It has also recently been used by X (Twitter) for their new language model called Grok: Grok can access in real time all the data from the tweets and answer questions about the latest trends. So in this video we will explore what retrieval augmented generation is, the whole pipeline, all the pieces of the pipeline, and the architecture behind each of these building blocks. Let's review the topics of today.

I will give a brief introduction to language models: what they are, how they work, how we run inference on them. We will then move on to the pipeline that makes up retrieval augmented generation, with particular reference to embedding vectors, what they are and how they are built. We will also explore the architecture behind Sentence-BERT, which is one of the ways to generate embeddings of sentences, and then we will move on to vector databases: what they are, how we use them, and how the algorithm of a vector database finds the stored vectors most similar to a query. What I expect you to already know before watching this video is that you are at least a little familiar with the transformer. We will not go into much detail, but you should be familiar with the basic building blocks of the transformer, and you should have watched my previous video about BERT. If you are not familiar with these topics, I recommend you watch my previous videos on the transformer and on BERT: they will give you all the necessary background to fully understand the current video.

Now, let's start our journey. We will start by introducing, first of all, my cat Oleo, because I will be using him for a lot of the examples in this video. If you are not Chinese, you may not know how to read his name: it is Oleo, which stands for the biscuit Oreo, because he is black and white. So let's get started.

The first thing we will be talking about is language models. What is a language model? A language model is a probabilistic model that assigns probabilities to sequences of words. In practice, a language model allows us to compute the following: what is the probability that the word "China" comes next in the sentence "Shanghai is a city in"?

So the language model allows us to model the probability of the next token given the prompt, and we usually train a neural network to predict these probabilities. A neural network that has been trained on a very large corpus of text is known as a large language model. When we train large language models, we usually have a very big corpus of text, which is a kind of collection of documents: language models are often trained on the entire Wikipedia, on millions of web pages, or even on thousands of books, because we want the language model to acquire as much knowledge as possible. We usually use a transformer-based architecture to create a language model. For example, LLaMA is usually known as a decoder-only network and BERT is usually known as an encoder-only network, because LLaMA is basically the decoder of the transformer without the cross attention, plus a linear layer and a softmax, while BERT uses only the encoder side, with some head on top (which can be a linear layer) depending on the task we are using it for. To run inference with a language model, we usually build a prompt and then ask the language model to continue this prompt by iteratively adding tokens, such that the tokens the language model adds make sense as a continuation of the prompt.
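To make the idea of "assigning a probability to the next token" concrete, here is a minimal sketch (not from the video): it turns a made-up vector of logits into a probability distribution with a softmax and greedily picks the next token. The vocabulary and the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical raw scores (logits) the network produces for the next token,
# one per word of a toy vocabulary. The numbers are made up.
vocab = ["China", "Paris", "cat", "pizza"]
logits = np.array([9.1, 3.2, 0.5, 1.0])

# Softmax turns the scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: append the most probable token to the prompt and repeat.
prompt = "Shanghai is a city in"
next_token = vocab[int(np.argmax(probs))]
print(prompt, next_token, dict(zip(vocab, probs.round(3))))
```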

For example, here I am asking ChatGPT to continue a joke, and ChatGPT continues the joke by adding a few more tokens that, when read in their entirety, make sense: they are coherent with what I wrote. You are what you eat: a language model can only output text and information it was trained upon. This means that if we train a language model only on English content, it will very probably not be able to output Japanese or French. To teach new concepts, new content, new information to a language model, we need to fine-tune the model.

However, fine-tuning has some cons. It can be expensive in terms of the computational power needed, and the number of parameters of the model may not be sufficient to capture all the knowledge we want to teach it. For example, LLaMA was introduced with 7 billion, 13 billion and 70 billion parameter versions: with 7 billion parameters it can capture some knowledge, but not as much as the 70 billion parameter model, so the number of parameters is a limit on the amount of knowledge the language model can acquire. Also, fine-tuning is not additive. For example, if you have a model that was trained on English content and you then heavily fine-tune it on Japanese content, at the end of the fine-tuning the model may not be as proficient in English as in Japanese: it may forget some of the English content it was initially trained upon. This is why we say that fine-tuning is not additive to the knowledge of the language model. Of course, we can compensate for fine-tuning with prompt engineering: for example, it is possible to ask the language model to perform a new task, a task it was not specifically trained on, by working with the prompt. This is the few-shot prompting technique, and the following is an example. We first give an instruction to the language model, describing the task the language model is going to perform.

Then we give the language model some examples of how to perform this task, and finally we ask the language model to perform it. For example, the task here is: Oleo is a cat that likes to play tricks on his friend Umar by replacing all the names in everything he writes with "meow". For example, Umar writes "Bob runs a YouTube channel" and Oleo will modify it to "meow runs a YouTube channel". So what happens if Umar writes "Alice likes to play with his friend Bob"? How would Oleo modify it?

The language model comes up with the right answer, which is "meow likes to play with his friend meow". ChatGPT in this case was not trained on performing this task, but by looking at the prompt and the examples we provided, it was able to come up with the right solution. We can use prompt engineering also for question answering, with the same kind of reasoning. We build a very big prompt that includes an instruction part ("You are an assistant trained to answer questions using the given context"), some context, which is a piece of text in which to retrieve the answer, and then the question we ask the language model: how many parameters are there in Grok-0? Grok is the language model introduced by X (Twitter), and it can also access the latest tweets; we will see later how. The point is, I am asking the language model, ChatGPT, to tell me about Grok-0, and ChatGPT very probably was not trained on it: it does not know about the existence of Grok-0, because it came out very recently. The model was nevertheless able to retrieve the answer by looking at the context. In this case, for example, it says that Grok-0, the prototype LLM mentioned in the provided context, is stated to have been trained with 33 billion parameters, because ChatGPT was able to access the context in which we talk about how many parameters there are in Grok-0, which is the line saying that after announcing xAI, they trained a prototype LLM with 33 billion parameters. So the answer is correct. This way of working with the prompt is actually very powerful, but fine-tuning is not necessarily the wrong way to deal with this lack-of-knowledge problem: when we fine-tune a model on some particular content, it usually gives higher quality results than prompt engineering alone. Also, as you saw before, to ask a question to ChatGPT we had to build a very big prompt, so the number of tokens we give to the model is quite big, and we know that the more context we give, the more information the language model will have to come up with the right answer. So we usually need a bigger context, but a bigger context is also computationally expensive. By fine-tuning we can reduce this context size, because we don't need to provide the context anymore: we are fine-tuning the language model on the specific data we will ask questions about.
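As a small illustration of the prompt structure described above (instruction, context, question), here is a sketch of how such a prompt could be assembled in code; the helper function name and the exact wording are mine, not from the video.

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Assemble an instruction + context + question prompt for a language model."""
    return (
        "You are an assistant trained to answer questions using the given context.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_qa_prompt(
    context="After announcing xAI, we trained a prototype LLM with 33 billion parameters.",
    question="How many parameters are there in Grok-0?",
)
print(prompt)
```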

To a language model that has been fine-tuned, we just need to ask the question, "How many parameters are there in Grok-0?", without providing all the context, and if the model has been fine-tuned correctly, it will be able to come up with the right answer. Now we need to introduce the retrieval augmented generation pipeline, because that is our next step, and we will explore each building block that makes up the pipeline. Imagine we want to do question answering with retrieval augmented generation, compared to what we did before with prompt engineering. "How many parameters are there in Grok-0?" is our query, and imagine we also have some documents in which we can find the answer; these documents may be PDF documents, but they may also be web pages. What we do is split all these documents into chunks of text, so small pieces of text: a document may be made up of many pages, each page made up of paragraphs, and each paragraph made up of sentences. Usually we split each of these documents into small sentences, and the web pages are split in the same way. We create embeddings of these sentences, such that each embedding is a vector of a fixed size that captures the meaning of its sentence. Then we store all of these embeddings in a vector database; later we will see how embeddings and vector databases work. Then we take the query, which is also a sentence, and we convert it into an embedding using the same model we used to convert the documents into embeddings. We search this query embedding in our database, which already contains many embeddings, each representing a sentence from our documents, and each embedding is also associated with the piece of text it comes from, so the vector database can retrieve the original text from which that embedding was created. So if we ask, for example, "How many parameters are there in Grok-0?", the vector database will search all of its embeddings and give us the ones that best match our query.

It will probably look for all the pieces of text that talk about Grok-0 and the parameters it contains. Now that we have the context and the query, we create a template for a prompt, just like the prompt we used before: "You are an assistant trained to answer questions using the given context". We paste the context and the query inside the prompt template, and just like before, we feed it to the language model, which will then be able to answer our question by using the context provided.
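To tie the whole flow together, here is a minimal, self-contained sketch of the pipeline just described. The `embed` and `llm_generate` callables are hypothetical stand-ins for a sentence-embedding model and a language model, and the "vector database" is just a Python list, so this is only an illustration of the data flow, not a real implementation.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_index(documents, embed, sentences_per_chunk=3):
    """Split each document into small chunks of sentences and embed each chunk."""
    index = []
    for doc in documents:
        sentences = doc.split(". ")
        for i in range(0, len(sentences), sentences_per_chunk):
            chunk = ". ".join(sentences[i:i + sentences_per_chunk])
            index.append((embed(chunk), chunk))      # (embedding, original text)
    return index

def answer(query, index, embed, llm_generate, top_k=3):
    q = embed(query)
    # Retrieve the chunks whose embeddings best match the query embedding.
    best = sorted(index, key=lambda item: cosine_similarity(q, item[0]), reverse=True)[:top_k]
    context = "\n".join(text for _, text in best)
    prompt = (
        "You are an assistant trained to answer questions using the given context.\n"
        f"Context: {context}\nQuestion: {query}\nAnswer:"
    )
    return llm_generate(prompt)
```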

So with retrieval augmented generation, we are not fine-tuning a model to answer the questions: we are using prompt engineering, but we are introducing a database, called the vector database, that can retrieve the context relevant to our query, so it can retrieve the context necessary to answer our particular question, feed it to the language model, and then the language model, using the context and our question, will very probably come up with the right answer. Now let's talk about embedding vectors: what they are and how they work. First of all, why do we use vectors to represent words? Take the words "cherry", "digital" and "information", and imagine we represent embedding vectors using only two dimensions. As you remember, in the vanilla transformer each embedding vector has 512 dimensions, but imagine we are in a simpler world where we only have two dimensions.

We only have two dimensions So we can plot these embedding vectors on a xy plane And what we do is we hope to see something like this So that words with similar meaning or words that represent the same concept Point in the same direction in space So for example the word digital and the word information are pointing to the same direction in space Such that the angle between words that have a similar meaning is small So the angle between digital and information is small and the angle between Words that have a different meaning is bigger So for example the word cherry and digital have an angle that is bigger compared to digital and information Indicating that they represent different concepts Imagine we have another word called tomato We expect it to point to this vertical direction here such that the angle between cherry and tomato should be small How do we measure this angle?

We usually use the cosine similarity to measure the angle between vectors, and later we will see the formula of the cosine similarity. Now, how did we come up with the idea of representing words as embeddings? The first idea is that words that are synonyms tend to occur in the same context, so surrounded by the same words. For example, the words "teacher" and "professor" usually occur near words like "school", "university", "exam", "lecture", "course", and so on. Vice versa, we can also say that words that occur in the same context tend to have a similar meaning. This is known as the distributional hypothesis. It means that to capture the meaning of a word we also need access to its context, so to the words surrounding it. This is also the reason why we employ self-attention in the transformer model to capture the contextual information of each token. As you remember, in the transformer the self-attention mechanism relates each token with all the other tokens in the same sentence, based also on the position each token occupies in the sentence, because we have positional encoding. So the self-attention has access to two things to calculate its attention scores: the first is the embedding of the word, which captures its meaning; the second is the positional encoding, so that words close to each other are related differently from words far from each other. This self-attention mechanism modifies the embedding of each word in such a way that it also captures the contextual information of the words that surround it. We trained BERT on a very particular task, called the masked language model task, to create the embeddings of BERT. This masked language model task is based on the cloze task, which we humans do very often.

Let me give you an example. Imagine I give you the following sentence: "Rome is the ___ of Italy, which is why it hosts many government buildings." Can you tell me what the missing word is? Of course, the missing word is "capital", because looking at the rest of the sentence it is the one that makes the most sense. How did we come up with the word "capital" for the missing word?

Well, we looked at the words surrounding the blank space, which means that the word "capital" depends on the context in which it appears, on the words that surround it. And this is how we train BERT: we want the self-attention mechanism to relate all the input tokens with each other, so that BERT has enough information about the context of the missing word to predict it. For example, imagine we want to train BERT on the masked language model task. We create an input with a masked word, just like before, "Rome is the ___ of Italy, which is why it hosts many government buildings", and we replace the blank space with a special token called [MASK]. This becomes the input of BERT, which is made of 14 tokens, and we feed it to BERT.

BERT is a transformer model, so it is a sequence-to-sequence model: if the input is 14 tokens, the output will also be 14 tokens. We ask BERT to predict the fourth token, because it is the one that has been masked out, and we know what the word should be: "capital". We calculate the loss between the predicted fourth token and what the actual fourth token should be, and then we back-propagate the loss to update all the weights of the model. When we run back-propagation, BERT will also update the input embeddings: the embedding associated with the word "capital" will be modified in such a way that it captures all the information about its context, so that next time BERT will have less trouble predicting it given its context. This is one of the reasons why we run back-propagation: we want the model to get better and better by reducing the loss.
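For readers who like to see it in code, here is a rough sketch of one masked-language-model training step, assuming a hypothetical `bert` callable that maps a (1, seq_len) tensor of token ids to logits of shape (1, seq_len, vocab_size); the token ids are placeholders, not BERT's real vocabulary.

```python
import torch
import torch.nn.functional as F

def mlm_step(bert, input_ids, masked_position, true_token_id, mask_id):
    """One masked-language-model step: mask a token and predict the original."""
    input_ids = input_ids.clone()
    input_ids[0, masked_position] = mask_id            # replace the word with [MASK]
    logits = bert(input_ids)                            # (1, seq_len, vocab_size)
    # The loss is computed only on the masked position: predict the original token.
    loss = F.cross_entropy(logits[0, masked_position].unsqueeze(0),
                           torch.tensor([true_token_id]))
    loss.backward()          # back-propagation also updates the input embeddings
    return loss.item()
```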

Let's see how Well, suppose we have a simple input sentence For example, this one made of 13 tokens called our professor always gives us lots of assignments to do in the weekend We feed it to BERT, but notice that I removed the linear layer from BERT And I will show you why so the first thing we do is we notice is that the input of the self-attention is a matrix of shape 13 by 768.

Why? Because we have 13 tokens and each token is represented by an embedding with 768 dimensions, which is the size of the embedding vector in BERT base, the smaller pre-trained BERT model. The self-attention mechanism will output another matrix with shape 13 by 768, so 13 tokens, each one with its own embedding of 768 dimensions. The output of the self-attention is an embedding that captures not only the meaning of the word and its position in the sentence, but also all the contextual information, all the relationships between the other words and the current word. So the output will be 13 tokens, each represented by an embedding of size 768, and each of them roughly represents the meaning of a single word in its context.

So what we can do is average all of them to create the embedding of the sentence. We take all these vectors of size 768 and calculate their average, which results in a single vector with 768 dimensions, and this single vector is the sentence embedding, which captures the meaning of the entire sentence. And this is how we create the embedding of a sentence.
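A minimal sketch of this mean pooling, assuming `token_embeddings` stands for the matrix BERT outputs for one sentence (one row per token, 768 columns for BERT base); here it is just random numbers.

```python
import numpy as np

token_embeddings = np.random.randn(13, 768)          # placeholder for BERT's output

sentence_embedding = token_embeddings.mean(axis=0)    # average over the 13 tokens
print(sentence_embedding.shape)                        # -> (768,)
```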

Now, how can we compare sentence embeddings to see if two sentences have a similar meaning? For example, imagine one sentence is the query from before, asking how many parameters there are in Grok-0, and another sentence talks about how many parameters there are in Grok-0: how can we relate these two sentences? We need a similarity function, and what we usually use is the cosine similarity, because they are both vectors. The cosine similarity measures the cosine of the angle between two vectors: a smaller angle results in a higher cosine similarity score, while a bigger angle results in a smaller cosine similarity score. This is the formula of the cosine similarity score.
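For reference, for two embedding vectors A and B the formula is:

$$\text{cosine\_similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$$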

But there is a problem: nobody told BERT that the embeddings it produces should be comparable with the cosine similarity. BERT outputs some embeddings and we take their average, but nobody told BERT that two similar sentences should produce similar embeddings. How can we teach BERT to produce embeddings that can be compared with a similarity function of our choice, which could be the cosine similarity or the Euclidean distance? Well, we introduce Sentence-BERT. Sentence-BERT is one of the most popular models for producing embeddings of entire sentences that can be compared using a similarity function of our choice, in this case the cosine similarity score. Sentence-BERT was introduced in a paper called "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", and we will see what all of this means.

What does it mean? What is a Siamese network? Imagine we have two sentences that are similar in meaning, for example "My father plays with me at the park" and "I play with my dad at the park". The first one we will call sentence A and the second one sentence B. We feed them to BERT.

Each of them will be a sequence of tokens: for example, the first one may be 10 tokens and the second one 8 tokens. We feed them to BERT, which will produce an output of 10 tokens and 8 tokens respectively. Then we do the mean pooling we did before: we take all these output tokens and calculate their average, producing one single vector of dimension 768 in case we are using BERT base, or bigger in case we are using a bigger BERT. The first vector we will call sentence embedding A, so the embedding of sentence A, and the second one is the embedding of sentence B. We then measure the cosine similarity between these two vectors. We also have a target cosine similarity, because we are training this model.

For example, given these two sentences, which are quite similar in meaning, we may have a target that is very close to one, because we hope that the angle between them will be small. We calculate the loss between the target and the output of the model, and then we run back-propagation to update all the weights of the model. Now, as you can see, this model is made up of two branches that are identical in structure.

So in structure But this is called the siamese network Which is a network that is made of two branches or more branches that are same with each other with respect to the architecture But also with respect to the weights of the model So what we do actually when we represent these siamese networks, we represent it at two branches, but when we code this model actually We will actually only have one model So only one branch here that will reproduce cosine similarities and what we do at operating level is that First we run the sentence A through this model Then we run the sentence B also through this model We calculate the cosine similarity between these two output and then we run back propagation such that The back propagation will only modify the parameters of this model here, but when we represent it For showing we actually we represent it as two branches, but remember that they are not actually two branches.

It's only one branch It's only one weights only one architecture and the same number of parameters This way the birth Will if we train birth the sentence birth like this it will produce embeddings But in such a way that the similar Sentences have a similar cosine similarity. So I have high cosine similarity And so we can compare them using the cosine similarity measure Also, if if you remember birth produces at least birth base produces embeddings of size 768 if we want to produce sentence embeddings that are smaller than 760 dimensions We can include a linear layer here to reduce the dimensions.

In the Sentence-BERT paper they actually not only use mean pooling, which is what we used to calculate the average of all the tokens output by BERT and produce one vector that represents the meaning of the entire sentence, but they also try max pooling, and another technique they use is the CLS token. If you remember, the CLS token is the first token we give as input to BERT, and it is also the first token output by BERT.

Because of how the self-attention mechanism works, this CLS token captures information from all the other tokens. However, the Sentence-BERT paper showed that both methods, max pooling and the CLS token, do not perform better than mean pooling, so they recommend using mean pooling, which is also what is mostly used in production nowadays. Okay, now let's review the pipeline of retrieval augmented generation again, now that we know how embeddings work.

We have our query, which is "How many parameters are there in Grok-0?". Then we have some documents in which we can find this answer; the documents may be PDF documents or web pages. We split them into single sentences and we embed these sentences using our Sentence-BERT, which will produce vectors of a fixed size, say 768.

Suppose 768 And we store them all these vectors in a vector db. We will see later how it works The query is also converted into a vector of size 768 dimensions and we search this Query in the vector dbs. How do we search? We want to find all the embeddings That best match our query.

What do we mean by this? We mean all the embeddings that, when we calculate the cosine similarity score with our query, result in a high value, or, if we are using another measure such as the Euclidean distance, result in a small distance, depending on which one we are using. This produces the top embeddings that best match our query, and we map them back to the text they originated from. This produces the context that we feed into our prompt template along with the query; we give it to the large language model, which produces the answer. As we saw previously, to augment the knowledge of a language model we have two strategies: fine-tuning on a custom data set, or using a vector database made up of embeddings. We can also use a combination of both, for example by fine-tuning for a few epochs and then using a vector database to complement the model with knowledge retrieved from the web. Whatever strategy we decide to proceed with, we need a reliable, scalable and easy-to-use service for building our retrieval augmented generation pipelines. That's why I recommend Gradient. Gradient is a scalable AI cloud platform that provides simple APIs for fine-tuning models, generating embeddings and running inference. Thanks to its integration with the popular library LlamaIndex, we can build retrieval augmented generation pipelines with a few lines of code. For example, we select the model we want to use, in our case Llama 2.

We define the set of custom documents that the model can use to retrieve answers, we define the model we want to use to generate embeddings, we ask a question, for example "Do you know anyone named Oleo?", and voilà: thanks to the power of retrieval augmented generation, the model can now retrieve information about our channel's mascot Oleo. Gradient helps us build all aspects of a retrieval augmented generation pipeline: for example, we can also fine-tune models on custom data as well as generate embeddings. With Gradient you have total ownership of your data as well as the weights of fine-tuned models. Open source models are great because they save time on development and debugging, as we have access to the architecture and the support of a vast community of developers. Gradient also integrates with the popular libraries LlamaIndex and LangChain. Check the link in the description to redeem your five dollar coupon and get started today with Gradient. Now let's talk about vector databases: what they are and how their matching algorithm works. How can the vector database search for our query among all the vectors it has stored?

A vector database stores vectors of fixed dimensions, called embeddings, such that we can then query the database to find all the embeddings that are closest or most similar to our query according to a distance metric, for example the cosine similarity or the Euclidean distance. The vector database uses a variant of the k-NN algorithm, which stands for k nearest neighbors, or another similarity search algorithm, but usually it is a variant of k-NN. Vector databases are not only used in retrieval augmented generation pipelines: they are also used, for example, for finding similar songs. If we have songs, we can create embeddings of them, vectors that capture all the information about each song, and then, if a user wants to find all the songs similar to a given one, we create the embedding of that song and of all the others.

We compare them using some similarity score, for example the cosine similarity score. Google Images, for example, searches for similar images using a similar technique: an embedding space in which they produce the embedding of a particular image and of all the others, and then they keep the ones that match best. We can do the same with products, and so on. Now, k-NN is an algorithm that lets us compare a particular query with all the vectors stored in our database, sort them by distance or by similarity depending on which one we use, and then keep the top best matching ones. For example, imagine we have a query, "How many parameters are there in Grok?", and imagine we have a vector database made up of 1 million embeddings. Actually, 1 million is not even a big number: consider that Grok, for example, is accessing tweets in real time, and every day there are hundreds of thousands of tweets, if not millions, so the number of vectors it stores is actually in the order of billions, not millions.

So a million looks like a big number, but it is not, especially when we deal with textual data from the web, where there are billions of web pages we may need to index. What we do in k-NN with the naive approach, which is the simplest way of matching a query against all the other vectors, is to compare the query with every other vector using our cosine similarity function.

For example, we may we may have this The with the first vector it may be 0.3. The second 0.1 0.2, etc, etc then we sort The we sort them by cosine similarity score So with for example the highest one, for example, this one should be the first one Then the this one should be the second one etc, etc, and we keep the top k So the top three or the top two depending on how we chose k Now this is actually a very simple approach, but it's also very slow because if there are n Embedding vectors.

This is a very simple approach, but it is also very slow: if there are n embedding vectors, in this case 1 million, and each one has d dimensions, 768 in the case of BERT base, the computational complexity is in the order of n multiplied by d, which is very, very slow. Let's see if there are better approaches. We will explore one algorithm in particular that is used right now in the most popular vector DBs, called Hierarchical Navigable Small Worlds. The idea is that we will trade precision for speed.

So before what the algorithm we saw before so the naive knn Which performs very slowly, but it's actually precise because each query the query is compared with each of the vectors So it will always produce accurate results because we have all the possible comparison done but do so to reduce the number of to Increase the speed we need to reduce the number of comparisons that we do and the metric that we usually care in similarity search Is recall.

Suppose that the best matching vectors are, for example, these three, and we want our top-three k-NN query to retrieve all three of them. But imagine it only returns two in this case.

The query returned only two of the best matching vectors, so we say that the recall is 66 percent, or two thirds. Basically, recall measures how many relevant items were retrieved among all the relevant items that should have been retrieved by our search.
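Written as a formula:

$$\text{recall} = \frac{\text{number of relevant vectors actually retrieved}}{\text{total number of relevant vectors}}$$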

We will look at one algorithm in particular for approximate nearest neighbor search, called Hierarchical Navigable Small Worlds. Hierarchical Navigable Small Worlds is very popular nowadays in vector databases; in particular, it is also the algorithm that powers the database Qdrant, which is the open source vector database used by Grok, the LLM on X (Twitter). For example, this is the exchange of tweets I saw the other day between Elon Musk and the Qdrant team, in which Qdrant says that Grok is accessing all the tweets in real time using the Qdrant vector database. And if we check the documentation of this vector database, we see that Qdrant currently only uses Hierarchical Navigable Small Worlds as its vector index. So this is the algorithm that powers the database currently used to bring retrieval augmented generation to the large language model Grok. The first idea behind Hierarchical Navigable Small Worlds is the idea of six degrees of separation. Hierarchical Navigable Small Worlds is an evolution of an earlier algorithm called Navigable Small Worlds, an algorithm for approximate nearest neighbors that we will see shortly, which is based on the idea of six degrees of separation. In the 1960s there was an experiment called the Milgram experiment, which wanted to test the social connections among people in the USA. The participants, initially located in Nebraska and Kansas, were given a letter to deliver to a person they did not know, who was in Boston. However, they were not allowed to send the letter directly to the recipient; instead, they were instructed to send it to someone who they thought could best know the target person.

They were instructed to send it to someone who Who could best know this target person? At the end of the milgram word experiment They found that the letter reached the final recipient in five or six steps Creating the concept that people all over the world are connected by six degrees of separation And actually in 2016 facebook published a blog post in which they claimed that the 1.59 Billion active users on facebook were connected by an average of 3.5 degrees of separation This means that between me and mark zuckerberg.

There are 3.5 connections, which means that on average, of course Which means that I have a friend who has a friend who has a friend who knows mark zuckerberg And this is the idea of the degrees of separation So, let's see. What is this navigable small words? Now the navigable small words algorithm builds a graph that just like facebook friends connects close vectors with each other But keeping the total number of connections small For example, every vector may be connected with up to other six vectors like to mimic the sixth degree of Separation for example, imagine we have a very small database with only 15 vectors Each one representing a particular sentence that we retrieved from Our knowledge source which could be documents or web pages And we have a graph like this in which for example The first text is about the transformer which is connected to another piece of text that talks about the transformer model Then we have another text that connects the tree with the two which is now talking about the cancer with ai So diagnosing cancer with ai And etc etc now How do we find a given query in this particular graph?

Imagine our query is: "How many encoder layers are there in the transformer model?" How does the algorithm find the k nearest neighbors, so the best matching k vectors for our query? It proceeds like this. First of all, it finds an entry point in the graph, which is chosen randomly: we randomly choose one node among all these vectors as the entry point and we visit it. Then we compare the similarity score between the query and this node with the similarity score between the query and the friends of this node, so node number seven and node number two. If one of the friends has a better similarity score, we move there: node number two, for example, may have a better similarity score with the query than node number five, so we move to node number two. Then we repeat the process: we check the cosine similarity score between node number three and the query and compare it with the cosine similarity between node number two and the query; if node number three has a better score, we move there. We keep going like this until we reach a node whose friends, node number eight and node number six, do not have a better cosine similarity score with respect to the query than the current node: node number four has the best cosine similarity score among all of its connections.

So we stop there. We have visited many nodes; basically, we order them from the best matching to the worst matching and we keep the top k. We also repeat this search many times by choosing different random entry points, and every time we sort all of the results by similarity score and keep the top k. These are the best matching k nearest neighbors, found using the Navigable Small Worlds algorithm.
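Here is a minimal sketch of this greedy search, assuming `graph` maps each node id to the ids of its friends, `vectors` maps node ids to embeddings, and `similarity` is, for example, the cosine similarity; it is only meant to illustrate the idea of the walk.

```python
import random

def nsw_search(graph, vectors, query, similarity, k=3, num_restarts=5):
    visited = {}
    for _ in range(num_restarts):                     # repeat from random entry points
        current = random.choice(list(graph.keys()))   # random entry point
        while True:
            visited[current] = similarity(query, vectors[current])
            friends = graph[current]
            if not friends:
                break
            # Candidate friend with the best similarity to the query.
            best = max(friends, key=lambda n: similarity(query, vectors[n]))
            if similarity(query, vectors[best]) <= visited[current]:
                break                                 # local best reached: stop this walk
            current = best                            # otherwise move to the better friend
    # Sort everything we visited and keep the top k as approximate nearest neighbours.
    ranked = sorted(visited.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]
```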

If we want to insert a vector into this graph, we just do what we did before: we run a search for the vector we want to insert, and when we have found the top k, we connect the new vector with the top k. That is how we insert a new item into the graph. The second idea behind Hierarchical Navigable Small Worlds is based on another data structure, called the skip list: to go from Navigable Small Worlds to Hierarchical Navigable Small Worlds, we need to introduce this new data structure.

The skip list is a data structure that maintains a sorted list and allows searches and insertions with an average logarithmic time complexity. First of all, as you can see, this is not just one linked list: we have many levels of linked lists, level 0, level 1, level 2 and level 3. The bottom level has the most items, and the higher we go, the fewer items there are.

Suppose we want to search for the number nine in this skip list. We start from the top level, from the head of level 3, and check the next item: the first item is number five. We then compare it with the item after it, which in this case is the end of the list, so the number nine must be further down: we go down one level. We then compare with the next item, which is number 17: the number nine cannot come after this node, because 17 is bigger, so again we must go down.

We go down, compare with the next node, which is the number nine, and we see that we have reached the number nine. Now imagine we want to find another number, let's say the number 12. We start again from the head of level 3, go to the first item, and compare with the next: it is the end of the list, so we go down. We arrive here, compare with the next item, see that it is 17, which is bigger than 12, so we go down. Then we see that the next item is number nine, so we visit nine, compare with the item after it, which is 17, still bigger than 12, so we go down once more. We compare with the next item, which is number 12, the number we are looking for, and we stop there. So this is how the skip list works.

Now, to create Hierarchical Navigable Small Worlds, we combine the concept of Navigable Small Worlds with the idea of the skip list, producing the Hierarchical Navigable Small Worlds algorithm. We have a lower level with more nodes and more connections, and upper levels with fewer nodes and fewer connections, so we say that the bottom level is more dense and the upper levels are more sparse.

How does the search work in this graph? Suppose we have a query, just like before, and we want to search this graph. We find a random entry point in the top level of the graph, we visit it, and we compare the cosine similarity of this node with the query against the cosine similarity of all of its friends with the query. If the current node is the best one, we go down a level and do the same test again: we check the cosine similarity of the current node with the query and of all of its friends with the query, and if one of the friends has a better cosine similarity score, we move there. We keep checking each node against all of its friends, moving to the best friend when there is one, and we go down a level as soon as we find a local best, that is, a node that is not worse than any of its friends. On the bottom level we stop at the local best. We repeat this search, just like before, using randomly selected entry points, then we take all the vectors we have visited, sort them, and keep the top k best matching based on the similarity score or distance function we are using.

Okay, now that we have also seen how the vector database works, let's review the pipeline of retrieval augmented generation one more time, summing up what we have seen.
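As a compact summary of the layered search in code, assuming `layers` is a list of neighbour graphs from the top (sparsest) layer down to the bottom (densest) one and `vectors` maps node ids to embeddings; real HNSW implementations are considerably more refined than this sketch.

```python
def hnsw_search(layers, vectors, query, similarity, entry_point):
    current = entry_point                        # entry point on the top layer
    for graph in layers:                         # walk down one layer at a time
        improved = True
        while improved:                          # greedy walk inside this layer
            improved = False
            for neighbour in graph.get(current, []):
                if similarity(query, vectors[neighbour]) > similarity(query, vectors[current]):
                    current = neighbour          # move to the better friend
                    improved = True
        # No friend is better: drop to the next, denser layer from this local best.
    return current                               # local best on the bottom layer
```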

So again, we have our query, and we have some documents from which we want to retrieve the knowledge. We split them into pieces of text, we create embeddings using, for example, Sentence-BERT, and we store all these vectors in a vector database. We convert our query into an embedding and we search the vector database using the algorithm we have seen, Hierarchical Navigable Small Worlds. This produces the top k embeddings that best match our query, from which we go back to the text they were taken from. We combine the query and the retrieved context in a template, we feed it to the large language model, and finally the large language model produces the answer. Thank you guys for watching my video. I hope you learned a lot today; I had wanted to create this video for a long time, but I also wanted to understand for myself all the algorithms behind this pipeline, and I hope that you are also now familiar with all these concepts.

There are many Popular libraries like llama index and long chain, but I wanted to actually go deep inside of how it works and each Building block how they actually work together Please let let me know if there is something that you don't understand I will try to answer all the questions in the comment section.

Also, let me know if there is something you would like better clarified in my future videos, or how I can improve my future videos for clarity. Please subscribe to my channel: it is the best motivation for me to continue making high quality content. Like the video if you liked it, share it with your friends, your professors, your students, and so on. And have a nice day, guys!