Generative AI and Long-Term Memory for LLMs (OpenAI, Cohere, OS, Pinecone)
Chapters
0:00 What is generative AI
1:40 Generative question answering
4:06 Two options for helping LLMs
5:33 Long-term memory in LLMs
7:01 OP stack for retrieval augmented GQA
8:48 Testing a few examples
12:56 Final thoughts on Generative AI
00:00:00.000 |
Generative AI is what many expect to be the next big technology boom. 00:00:06.760 |
And being what it is, AI, it could have far-reaching implications that are beyond what we would expect. 00:00:15.140 |
That's not to say that we have entered the end game of AI with AGI or anything like that, 00:00:21.920 |
but I think that generative AI is a pretty big step forwards. 00:00:27.300 |
And it seems that investors are aware of this as well. 00:00:30.420 |
We all know that the majority of industries had a very bad 2022, yet generative AI startups still attracted a huge amount of funding. 00:00:43.720 |
According to the New York Times, they raised almost as much in 2022 as they had in the previous five years combined. 00:00:53.000 |
There were several wow moments that came from generative AI in 2022. 00:00:58.940 |
From generative art tools like OpenAI's DALL-E 2, Midjourney, and Stable Diffusion, to the 00:01:06.360 |
next generation of large language models from the likes of OpenAI with the GPT-3.5 models, 00:01:14.180 |
the open source BLOOM project, and chatbots like Google's LaMDA, and of course, ChatGPT. 00:01:22.220 |
All of this together marks just the first year of the widespread adoption of generative AI. 00:01:31.020 |
We're still in the very early days of a technology that is poised to completely change the way we interact with machines. 00:01:40.580 |
And one of the most thought-provoking use cases in how we interact with machines, I 00:01:46.000 |
think, belongs to generative question answering, or GQA. 00:01:50.220 |
Now the most simple GQA pipeline consists of nothing more than a user's question, or query, and a large language model. 00:01:59.060 |
The query is passed to the large language model and based on what the large language 00:02:03.460 |
model has learned during its training, so the knowledge that's stored within the model 00:02:09.100 |
parameters, it will output an answer to your question. 00:02:13.540 |
And we can see that this works for general knowledge questions pretty well across the board. 00:02:20.180 |
So if we take a look at OpenAI's DaVinci 003 model, Cohere's extra large model behind 00:02:26.700 |
the generation endpoint, or even Open Source models that we can access through Hugging 00:02:31.380 |
Face Transformers, we will get a good answer for general knowledge questions. 00:02:37.760 |
So if we ask, "Who was the first person on the moon?" we will get the correct answer, Neil Armstrong, across the board. 00:02:44.600 |
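To make that bare pipeline concrete, here is a minimal sketch using the legacy (pre-1.0) openai Python client with the Davinci 003 completion model; the prompt wording and API key are placeholders, not the exact setup used in the video.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

query = "Who was the first person on the moon?"

# The query goes straight to the LLM; the answer relies entirely on the
# knowledge stored in the model's parameters during training.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Answer the question.\n\nQuestion: {query}\nAnswer:",
    max_tokens=64,
    temperature=0.0,
)
print(response["choices"][0]["text"].strip())
```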
So we can see that this works incredibly well for things that are within the general knowledge of the model. 00:02:51.780 |
However, if we start asking more specific or advanced questions, these large language models begin to struggle. 00:02:59.380 |
So if we ask it a very specific question about machine learning methods and specifically 00:03:05.100 |
NLP and semantic search training methods, like, "Which training method should I use 00:03:11.460 |
for training sentence transformers when I have just pairs of positive sentences?" 00:03:17.380 |
Now, you don't need to understand what that means; if you don't, no problem. 00:03:22.140 |
One of the correct answers to this should be multiple negatives ranking loss, or even just MNR loss. 00:03:29.420 |
Yeah, if we ask this question, and we'll go ahead and ask what I found to be the best 00:03:34.780 |
performing of the large language models so far. 00:03:38.540 |
If we ask DaVinci 003 this question, it gives us this answer, and it says, "We need to 00:03:44.020 |
use a supervised training method," which, yes, that is correct, but it doesn't really answer our question. 00:03:51.060 |
It doesn't give us a specific method to use, and the reason it doesn't give us that is simple. 00:03:57.980 |
This knowledge has not been encoded into the model weights or parameters during training, so the model simply doesn't know it. 00:04:06.540 |
Now, there are two options we can take in order to help the model answer this question. 00:04:12.140 |
The first is we can fine-tune the large language model on text data that contains this information, but fine-tuning comes with its own problems. 00:04:22.180 |
It can take a lot of computational resources or money, and it also requires a lot of text 00:04:28.820 |
data as well, which is not always necessarily available. 00:04:31.940 |
If we just mention the answer once in a single sentence out of a million sentences, the large 00:04:39.620 |
language model might not pick up on that information, and when we ask the question again, it may still get it wrong. 00:04:46.340 |
We need a lot of text data that mentions this in multiple contexts in order for it to learn that information. 00:04:54.780 |
Considering that, our second option, which I think is probably the easier option, is 00:04:59.740 |
to use something called retrieval augmented generation, or in this case, retrieval augmented generative question answering. 00:05:08.460 |
This simply means that we add what is called a retrieval component to our GQA pipeline. 00:05:15.060 |
Adding this retrieval component allows us to retrieve relevant information. 00:05:21.500 |
If we have that sentence within our million sentences, we can retrieve that sentence and 00:05:25.700 |
feed it into our large language model alongside our query. 00:05:29.620 |
We're essentially creating a secondary source of information. 00:05:34.300 |
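As a rough sketch of what that looks like in practice, the retrieved sentences simply get prepended to the query as extra context before anything reaches the LLM; the prompt template below is an illustrative assumption, not the exact one used in the video.

```python
def build_augmented_prompt(query: str, contexts: list[str]) -> str:
    """Combine retrieved passages with the user's query into one prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```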
Going ahead with this second option of retrieval augmented ML, when we apply it to large language 00:05:42.260 |
models, we can actually think of it as a form of long-term memory. 00:05:47.500 |
To implement this long-term memory, we need to integrate a knowledge base into our GQA pipeline. 00:05:53.980 |
This knowledge base is the retrieval component that we're talking about, and it allows us 00:05:58.260 |
to take our query and search through our sentences or paragraphs for relevant information and 00:06:04.140 |
return that relevant information that we can then pass to our large language model. 00:06:09.640 |
As you can see, using this approach, we get much better results. 00:06:14.740 |
Again, using DaVinci 003 for the generation model here, we get, "You should use natural 00:06:20.660 |
language inference (NLI) with multiple negatives ranking loss." 00:06:24.540 |
Now, NLI is just one option for the format of the data, essentially, but the answer of 00:06:31.880 |
multiple negatives ranking loss is definitely what we're looking for. 00:06:36.060 |
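As a quick aside, if you wanted to follow that recommendation, a minimal training sketch with the sentence-transformers library and multiple negatives ranking loss might look like this; the base model and example pairs are placeholders rather than anything stated in the answer.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder base model

# With MNR loss, each example is a pair of related (positive) sentences;
# the other pairs in a batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["A man is playing guitar.", "Someone plays a guitar."]),
    InputExample(texts=["A dog runs in the park.", "A dog is running outside."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```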
This much better answer is a direct result of adding more contextual information to our 00:06:42.780 |
query, which we would refer to as source knowledge. 00:06:47.560 |
Source knowledge is basically any knowledge that gets passed through to the large language 00:06:52.940 |
model within the input at inference time, so whenever we query the model. 00:07:01.700 |
In this example, we used OpenAI for both generation and embedding, which 00:07:09.820 |
I'll explain in a moment, and also the Pinecone vector database as our knowledge base. 00:07:16.020 |
Both these together are what we would refer to as the OP stack, so OpenAI, Pinecone. 00:07:21.740 |
This is a more recently popularized option for building very performant AI apps that 00:07:29.100 |
rely on a retrieval component like Retrieval Augmented GQA. 00:07:34.660 |
At query time in this scenario, the pipeline consisted of three main steps. 00:07:40.020 |
In step one, we use an OpenAI embedding endpoint to encode our query into what we 00:07:48.020 |
call a dense vector. In step two, we took that encoded query, sent it to our knowledge base, 00:07:55.020 |
which returned relevant context or text passages back to us, which we then combined with our original query. 00:08:05.700 |
In step three, we take our query and that relevant context and push them into our large 00:08:12.720 |
language model to generate a natural language answer, and as you can see, adding that extra 00:08:20.420 |
context from Pinecone, our knowledge base, allowed the large language model to answer the question accurately. 00:08:28.860 |
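Putting those three steps together, a sketch of this retrieval augmented GQA pipeline with the OP stack could look roughly like this; the index name, embedding model, environment, and prompt template are assumptions for illustration, using the legacy openai and pinecone-client interfaces.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-east1-gcp")  # placeholder
index = pinecone.Index("youtube-transcriptions")  # hypothetical index name

query = ("Which training method should I use for sentence transformers "
         "when I have just pairs of positive sentences?")

# Step 1: encode the query into a dense vector with an OpenAI embedding model.
embed = openai.Embedding.create(model="text-embedding-ada-002", input=query)
query_vector = embed["data"][0]["embedding"]

# Step 2: retrieve the most relevant passages from the Pinecone knowledge base.
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
contexts = [match["metadata"]["text"] for match in results["matches"]]

# Step 3: combine the query and contexts into a prompt and generate an answer.
prompt = ("Answer the question based on the context below.\n\n"
          "Context:\n" + "\n\n".join(contexts) +
          f"\n\nQuestion: {query}\nAnswer:")
response = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0.0
)
print(response["choices"][0]["text"].strip())
```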
Even beyond providing more factual, accurate answers, the fact that we can retrieve the 00:08:34.140 |
sources of information and actually present them to users using this approach also instills 00:08:40.780 |
user trust in the system, allowing users to confirm the reliability of the information being presented to them. 00:08:50.180 |
So let's test a few examples. We're going to use the same pipeline that I've already described. 00:08:54.340 |
The knowledge base that we're going to be using, so the data source, is the jamescalam 00:08:59.060 |
YouTube Transcriptions dataset, which is hosted on Hugging Face Datasets, which is just 00:09:03.980 |
a dataset of transcribed audio from various tech and ML YouTube channels. 00:09:10.560 |
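If you want to pull the same data, it can be loaded from Hugging Face Datasets roughly like this; the exact dataset identifier is an assumption based on the name mentioned above.

```python
from datasets import load_dataset

# Assumed identifier for the YouTube transcriptions dataset.
data = load_dataset("jamescalam/youtube-transcriptions", split="train")
print(data[0])  # each record is a transcribed text snippet plus metadata
```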
If we ask questions around ML and tech, generally speaking, if it's within the knowledge base, 00:09:15.820 |
it should be able to answer those questions pretty accurately. 00:09:21.020 |
Our first answer is "NLI stands for Natural Language Interface," which is wrong. 00:09:27.300 |
The second is correct, so we get: Natural Language Inference (NLI) is a task that requires pairs 00:09:32.560 |
of sentences to be labeled as either contradictory, neutral, or entailing (inferring) each other. 00:09:50.060 |
Next, asking about CLIP, it looks like we're just getting a description of what CLIP is. 00:09:54.420 |
It says CLIP is used to classify images and generate natural language descriptions of them, which is not quite right. 00:10:04.220 |
In fact, I know that's not what I would go with. 00:10:07.380 |
To use CLIP, you need access to a GPU and the OpenAI CLIP repository. 00:10:11.060 |
Yes, you can do that, and you can use the provided scripts to train and evaluate the model. 00:10:20.140 |
It's mostly correct, except for the start, which is not really how I would describe CLIP, 00:10:24.500 |
but then the rest about using the CLIP repository is correct. 00:10:27.340 |
Now, I got a rate limit error, so let me try and comment this part out and try again. 00:10:37.460 |
So you can use OpenAI's CLIP easily by using the Hugging Face Transformers library, which 00:10:42.660 |
in my opinion is 100% the easiest way to use the model. 00:10:47.660 |
And then we get this, which describes Transformers as the library for doing anything with NLP and computer vision. 00:10:52.860 |
Not necessarily that standard with computer vision, but I think I know the source of information 00:10:57.180 |
that this is coming from, which is one of my videos. 00:10:59.620 |
And I probably do say something along those lines, because that is what we're using CLIP for there. 00:11:05.540 |
And then to get started, you should install PyTorch and the Transformers and Datasets 00:11:10.460 |
libraries, which is actually usually the case, since we'd be using a dataset from the Datasets library. 00:11:17.420 |
And you do need to install PyTorch alongside Transformers. 00:11:23.620 |
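For reference, using CLIP through Hugging Face Transformers really is only a few lines; the checkpoint, image path, and labels below are placeholders, not something taken from the generated answer itself.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
labels = ["a photo of a dog", "a photo of a cat"]

# Score the image against the candidate text labels (zero-shot classification).
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```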
Next, I want to know what is a good de facto model, or sentence transformer model, to use for semantic search. 00:11:36.660 |
So with no augmentation, we get a popular de facto sentence transformer model for semantic search, and the answer given is BERT. 00:11:47.980 |
So here it seems like they're talking about the standard BERT model and not even the sentence transformer version of it. 00:12:06.700 |
Then, with the augmentation, we get: the pre-trained Universal Sentence Encoder model is a good de facto sentence transformer model to use. 00:12:16.220 |
I think there are better models to use, but that is actually, I think, one of the most 00:12:21.740 |
popular ones that people end up using as the sort of first sentence transformer or sentence embedding model. 00:12:31.420 |
And this is a much more accurate answer than what we got before without the context, without 00:12:37.380 |
the augmentation, which was BERT, which is not even a sentence transformer. 00:12:42.100 |
So I think this is still a pretty good answer. 00:12:45.100 |
Personally, I would like to see like an MPNet model or something on there, but that's more of a personal preference. 00:12:51.820 |
So I think this is probably a more broadly accepted answer. 00:12:55.940 |
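For comparison, loading one of those widely used sentence transformer models, an MPNet-based one here as my own preference rather than anything stated in the answers, looks like this with the sentence-transformers library.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # a common default choice
embeddings = model.encode([
    "Which training method should I use for sentence transformers?",
    "Multiple negatives ranking loss works well with pairs of positive sentences.",
])
print(embeddings.shape)  # dense vectors ready for semantic similarity search
```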
Okay, so as demonstrated, large language models do work incredibly well, particularly for 00:13:01.940 |
general knowledge questions, but they definitely struggle with more niche or more specific questions. 00:13:09.820 |
And this typically leads to what we call hallucinations, which is where the model is basically making things up. 00:13:19.240 |
And it's often not obvious to the user that these models are being inaccurate in what they are 00:13:25.880 |
saying, because they can say very untruthful things very convincingly; these models, 00:13:33.060 |
we can think of them as essentially masters of linguistic patterns. 00:13:37.740 |
So they can say things that are completely false and say them in a way that makes them seem believable. 00:13:44.500 |
So to protect us from this issue, we can add what we call a long-term memory component to our GQA pipeline. 00:13:54.420 |
And through this, we benefit from having an external knowledge base to improve system 00:13:59.260 |
factuality and also improve user trust in the system. 00:14:03.820 |
Naturally, there is a very vast potential for this type of technology. 00:14:10.340 |
And despite being very new, there are already many people using it. 00:14:14.900 |
I've seen that you.com have their new YouChat feature, which gives you natural language answers to your searches. 00:14:25.160 |
I've seen many podcast search apps recently using this technology. 00:14:30.300 |
And there are even rumors of Microsoft using ChatGPT, which is another form 00:14:36.980 |
of this technology, with Bing as a challenger to Google itself. 00:14:42.260 |
So as I think we can all see, there's very big potential and opportunity here for disruption across many industries. 00:14:52.660 |
Essentially any industry, any company that relies on information in some way and retrieving 00:15:00.000 |
that information efficiently can benefit from the use of retrieval augmented generative 00:15:06.100 |
question answering and other retrieval augmented generative AI technologies. 00:15:11.580 |
So this really represents an opportunity for replacing some of those outdated information retrieval systems that we still rely on. 00:15:22.180 |
I hope all of this has been somewhat thought-provoking, interesting, and useful, but that's it for this video. 00:15:30.140 |
So thank you very much for watching and I will see you again in the next one.