
Generative AI and Long-Term Memory for LLMs (OpenAI, Cohere, OS, Pinecone)


Chapters

0:00 What is generative AI
1:40 Generative question answering
4:06 Two options for helping LLMs
5:33 Long-term memory in LLMs
7:01 OP stack for retrieval augmented GQA
8:48 Testing a few examples
12:56 Final thoughts on Generative AI

Whisper Transcript

00:00:00.000 | Generative AI is what many expect to be the next big technology boom.
00:00:06.760 | And being what it is, AI, it could have far-reaching implications that are beyond what we would
00:00:14.140 | imagine today.
00:00:15.140 | That's not to say that we have entered the end game of AI with AGI or anything like that,
00:00:21.920 | but I think that generative AI is a pretty big step forward.
00:00:27.300 | And it seems that investors are aware of this as well.
00:00:30.420 | We all know that the majority of industries had a very bad 2022, yet generative AI startups
00:00:38.320 | actually received $1.37 billion in funding.
00:00:43.720 | According to The New York Times, that's almost as much as they received in the past five
00:00:48.920 | years combined.
00:00:50.440 | However, it's hardly surprising.
00:00:53.000 | There were several wow moments that came from generative AI in 2022.
00:00:58.940 | From generative art tools like OpenAI's DALL-E 2, Midjourney, and Stable Diffusion, to the
00:01:06.360 | next generation of large language models from the likes of OpenAI with the GPT-3.5 models,
00:01:14.180 | the open-source BLOOM project, and chatbots like Google's LaMDA, and of course, ChatGPT.
00:01:22.220 | All of this together marks just the first year of the widespread adoption of generative AI.
00:01:31.020 | We're still in the very early days of a technology that is poised to completely change the way
00:01:37.820 | that we interact with machines.
00:01:40.580 | And one of the most thought-provoking use cases in how we interact with machines belongs,
00:01:46.000 | I think, to generative question answering, or GQA.
00:01:50.220 | Now the most simple GQA pipeline consists of nothing more than a user's question or
00:01:55.980 | query and a large language model.
00:01:59.060 | The query is passed to the large language model and based on what the large language
00:02:03.460 | model has learned during its training, so the knowledge that's stored within the model
00:02:09.100 | parameters, it will output an answer to your question.
00:02:13.540 | And we can see that this works for general knowledge questions pretty well across the
00:02:19.180 | board.
00:02:20.180 | So if we take a look at OpenAI's text-davinci-003 model, Cohere's extra large model behind
00:02:26.700 | the generation endpoint, or even open-source models that we can access through Hugging
00:02:31.380 | Face Transformers, we will get a good answer for general knowledge questions.
00:02:37.760 | So if we ask, "Who was the first person on the moon?" we will get across the board
00:02:42.780 | the answer, Neil Armstrong.
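A minimal sketch of this retrieval-free GQA pipeline in Python, assuming the pre-1.0 `openai` client and the Completion endpoint with text-davinci-003; the API key and prompt wording are placeholders rather than the exact setup used here:

```python
# Retrieval-free GQA: the query goes straight to the LLM, which answers
# purely from knowledge stored in its parameters during training.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

query = "Who was the first person on the moon?"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Answer the question.\n\nQuestion: {query}\nAnswer:",
    temperature=0.0,
    max_tokens=64,
)
print(response["choices"][0]["text"].strip())  # e.g. "Neil Armstrong"
```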
00:02:44.600 | So we can see that this works incredibly well for things that are within the general knowledge
00:02:49.620 | base of these large language models.
00:02:51.780 | However, if we start asking more specific or advanced questions, these large language
00:02:57.020 | models will begin to fail.
00:02:59.380 | So if we ask it a very specific question about machine learning methods and specifically
00:03:05.100 | NLP and semantic search training methods, like, "Which training method should I use
00:03:11.460 | for training sentence transformers when I have just pairs of positive sentences?"
00:03:17.380 | Now, if you don't understand what that means, no problem.
00:03:22.140 | One of the correct answers to this should be multiple negatives ranking loss, or even just
00:03:27.700 | ranking loss would be fine as well.
00:03:29.420 | Now, if we ask this question, we'll go ahead and ask what I've found to be the best
00:03:34.780 | performing of the large language models so far.
00:03:38.540 | If we ask text-davinci-003 this question, it gives us this answer, and it says, "We need to
00:03:44.020 | use a supervised training method," which, yes, that is correct, but it doesn't really
00:03:49.420 | answer the question.
00:03:51.060 | It doesn't give us a specific method to use, and the reason it doesn't give us that is
00:03:55.620 | because the model doesn't know.
00:03:57.980 | This knowledge has not been encoded into the model weights or parameters during training,
00:04:05.120 | so it can't answer the question.
00:04:06.540 | Now, there are two options we can take in order to help the model answer this question.
00:04:12.140 | The first is we can fine tune the large language model on the text data that would contain
00:04:19.660 | this information.
00:04:20.660 | Now, this can be hard to do.
00:04:22.180 | It can take a lot of computational resources or money, and it also requires a lot of text
00:04:28.820 | data as well, which is not always necessarily available.
00:04:31.940 | If we just mention the answer once, in a single sentence out of a million sentences, the large
00:04:39.620 | language model might not pick up on that information, and when we ask the question again, it may
00:04:44.580 | not have learned the answer.
00:04:46.340 | We need a lot of text data that mentions this in multiple contexts in order for it to learn
00:04:52.300 | this information well.
00:04:54.780 | Considering that, our second option, which I think is probably the easier option, is
00:04:59.740 | to use something called retrieval augmented generation, or in this case, retrieval augmented
00:05:06.260 | generative Q&A.
00:05:08.460 | This simply means that we add what is called a retrieval component to our GQA pipeline.
00:05:15.060 | Adding this retrieval component allows us to retrieve relevant information.
00:05:21.500 | If we have that sentence within our million sentences, we can retrieve that sentence and
00:05:25.700 | feed it into our large language model alongside our query.
00:05:29.620 | We're essentially creating a secondary source of information.
00:05:34.300 | Going ahead with this second option of retrieval augmented ML, when we apply it to large language
00:05:42.260 | models, we can actually think of it as a form of long-term memory.
00:05:47.500 | To implement this long-term memory, we need to integrate a knowledge base into our GQA
00:05:52.980 | pipeline.
00:05:53.980 | This knowledge base is the retrieval component that we're talking about, and it allows us
00:05:58.260 | to take our query and search through our sentences or paragraphs for relevant information and
00:06:04.140 | return that relevant information that we can then pass to our large language model.
00:06:09.640 | As you can see, using this approach, we get much better results.
00:06:14.740 | Again, using text-davinci-003 for the generation model here, we get, "You should use natural
00:06:20.660 | language inference (NLI) with multiple negatives ranking loss."
00:06:24.540 | Now, NLI is just one option for the format of the data, essentially, but the answer of
00:06:31.880 | multiple negatives ranking loss is definitely what we're looking for.
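As a rough illustration of that training method, here is a sketch using the classic `sentence-transformers` fit API with multiple negatives ranking loss over (anchor, positive) pairs; the base model and the toy pairs are assumptions for illustration only:

```python
# Multiple negatives ranking (MNR) loss: each (anchor, positive) pair is
# trained so the anchor lands closest to its own positive, with the other
# positives in the batch acting as in-batch negatives (no labels needed).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("bert-base-uncased")  # any base model works

pairs = [
    InputExample(texts=["what is NLI?", "NLI means natural language inference"]),
    InputExample(texts=["how do transformers work?", "transformers rely on attention"]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```

Larger batch sizes generally help with MNR loss, since every extra pair in a batch contributes more in-batch negatives.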
00:06:36.060 | This much better answer is a direct result of adding more contextual information to our
00:06:42.780 | query, which we would refer to as source knowledge.
00:06:47.560 | Source knowledge is basically any knowledge that gets passed through to the large language
00:06:52.940 | model within the input of whatever we're putting into the model at inference time, so when
00:06:58.940 | we're predicting or generating text.
00:07:01.700 | In this example, we used OpenAI for both generation and embedding, which
00:07:09.820 | I'll explain in a moment, and also the Pinecone vector database as our knowledge base.
00:07:16.020 | Both of these together are what we would refer to as the OP stack: OpenAI, Pinecone.
00:07:21.740 | This is a more recently popularized option for building very performant AI apps that
00:07:29.100 | rely on a retrieval component like Retrieval Augmented GQA.
00:07:34.660 | At query time in this scenario, the pipeline consisted of three main steps.
00:07:40.020 | For the first one, we used an OpenAI embedding endpoint to encode our query into what we
00:07:48.020 | call a dense vector, and in step two, we took that encoded query, sent it to our knowledge base,
00:07:55.020 | which returned relevant context, or text passages, back to us, which we then combined with our
00:08:02.700 | query, and that leads on to step three.
00:08:05.700 | We take our query and that relevant information, relevant context, and push them into our large
00:08:12.720 | language model to generate a natural language answer, and as you can see, adding that extra
00:08:20.420 | context from Pinecone, our knowledge base, allowed the large language model to answer
00:08:25.620 | the question much more accurately.
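A minimal sketch of those three steps, assuming the era's pre-1.0 `openai` client and the `pinecone-client` library; the index name, environment, embedding model, metadata field, and prompt template are illustrative assumptions, not necessarily what was used here:

```python
import openai
import pinecone

openai.api_key = "YOUR_OPENAI_KEY"          # placeholder
pinecone.init(api_key="YOUR_PINECONE_KEY",  # placeholder
              environment="us-east1-gcp")
index = pinecone.Index("gqa-demo")  # assumes an already-populated index

query = ("Which training method should I use for sentence transformers "
         "when I have just pairs of positive sentences?")

# Step 1: encode the query into a dense vector with an OpenAI embedding model.
xq = openai.Embedding.create(
    input=[query], model="text-embedding-ada-002"
)["data"][0]["embedding"]

# Step 2: send the encoded query to the knowledge base and get back the
# most relevant text passages (stored here in a "text" metadata field).
res = index.query(vector=xq, top_k=3, include_metadata=True)
contexts = [m["metadata"]["text"] for m in res["matches"]]

# Step 3: combine the retrieved context with the query and generate an answer.
prompt = ("Answer the question using the context below.\n\n"
          "Context:\n" + "\n".join(contexts) +
          f"\n\nQuestion: {query}\nAnswer:")
answer = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=128
)["choices"][0]["text"].strip()
print(answer)
```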
00:08:28.860 | Even beyond providing more factual, accurate answers, the fact that we can retrieve the
00:08:34.140 | sources of information and actually present them to users using this approach also instills
00:08:40.780 | user trust in the system, allowing users to confirm the reliability of the information
00:08:46.380 | that is being presented to them.
00:08:48.180 | Let's go ahead and try a few more examples.
00:08:50.180 | We're going to use the same pipeline that I've already described.
00:08:54.340 | The knowledge base that we're going to be using, so the data source, is the jamescalam
00:08:59.060 | YouTube Transcriptions dataset, which is hosted on Hugging Face Datasets, and which is just
00:09:03.980 | a dataset of transcribed audio from various tech and ML YouTube channels.
00:09:10.560 | If we ask questions around ML and tech, generally speaking, if it's within the knowledge base,
00:09:15.820 | it should be able to answer those questions pretty accurately.
00:09:18.060 | We're going to start with, what is NLI?
00:09:21.020 | Our first answer is "NLI stands for Natural Language Interface," which is wrong.
00:09:27.300 | The second is correct, so we get "Natural Language Inference (NLI) is a task that requires pairs
00:09:32.560 | of sentences to be labeled as either contradictory, neutral, or entailing (inferring) each other,"
00:09:38.460 | which is perfect.
00:09:39.900 | Let's try something else.
00:09:41.260 | How can I use OpenAI's CLIP easily?
00:09:49.060 | With no augmentation:
00:09:50.060 | It looks like we're just getting a description of what CLIP is, which, I mean, this is
00:09:53.420 | correct.
00:09:54.420 | "It is used to classify images and generate natural language descriptions of them," which is not
00:10:02.500 | how I would define it.
00:10:04.220 | In fact, I know that's not the definition I would go with.
00:10:07.380 | "To use CLIP, you need access to a GPU and the OpenAI CLIP repository."
00:10:11.060 | Yes, you can do that, and you can use the provided scripts to train and evaluate the
00:10:15.940 | model.
00:10:16.940 | "Additionally, you can use..." and so on and so on.
00:10:19.140 | Okay.
00:10:20.140 | It's mostly correct, except for the start, which is not really how I would describe CLIP,
00:10:24.500 | but then the rest, about using the CLIP repository, is correct.
00:10:27.340 | Now, I got a rate limit error, so let me try and comment this part out and try again.
00:10:33.500 | Okay.
00:10:34.500 | And what I wanted to get is this.
00:10:37.460 | So you can use OpenAI's CLIP easily by using the Hugging Face Transformers library, which
00:10:42.660 | in my opinion is 100% the easiest way to use the model.
00:10:47.660 | And then we get this, which is some library for doing anything with NLP and computer vision.
00:10:52.860 | Not necessarily that standard for computer vision, but I think I know the source of information
00:10:57.180 | that this is coming from, which is one of my videos.
00:10:59.620 | And I probably do say something along those lines, because that is what we're using CLIP
00:11:03.180 | for in this instance.
00:11:05.540 | And then, to get started, you should install PyTorch and the Transformers and Datasets
00:11:10.460 | libraries, which is actually usually the case when using a dataset from Datasets.
00:11:17.420 | And you do need to install PyTorch with Transformers.
00:11:20.460 | So that is really cool.
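For reference, a short sketch of using CLIP through Hugging Face Transformers, along the lines the generated answer suggests; the image file and candidate captions are placeholders:

```python
# Zero-shot image classification with CLIP via Hugging Face Transformers.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # any local image
texts = ["a photo of a cat", "a photo of a dog"]

# Encode the image and candidate captions together, then compare them.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # image-text similarity
print(dict(zip(texts, probs[0].tolist())))
```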
00:11:21.980 | And let's ask one more question.
00:11:23.620 | I want to know: what is a good de facto sentence transformer model
00:11:31.620 | to use in semantic search?
00:11:34.940 | And let's see what we get.
00:11:36.660 | So with no augmentation, we get: "A popular de facto sentence transformer model for semantic
00:11:42.060 | search is BERT.
00:11:43.060 | It's a deep learning model.
00:11:45.060 | It's been pre-trained," and so on and so on.
00:11:46.980 | Not actually.
00:11:47.980 | So here it seems like it's talking about the standard BERT model and not even the sentence
00:11:53.740 | transformer, or bi-encoder, version of BERT.
00:11:56.840 | So I would say it's definitely wrong.
00:11:58.980 | So I'm hitting a rate limit again.
00:12:00.860 | So let me comment this out and run it again.
00:12:04.700 | Okay.
00:12:05.700 | And here we go.
00:12:06.700 | So the pre-trained universal sentence encoder model is a good de facto sentence transformer
00:12:12.280 | model to use in semantic search.
00:12:14.180 | Now I would disagree with that.
00:12:16.220 | I think there are better models to use, but that is actually, I think, one of the most
00:12:21.740 | popular ones, the sort of first sentence transformer or sentence encoding
00:12:28.380 | model that people end up using.
00:12:31.420 | And this is a much more accurate answer than what we got before without the context, without
00:12:37.380 | the augmentation, which was BERT, which is not even a sentence transformer.
00:12:42.100 | So I think this is still a pretty good answer.
00:12:45.100 | Personally, I would like to see an MPNet model or something on there, but that's
00:12:50.020 | more my personal preference.
00:12:51.820 | So I think this is probably a more broadly accepted answer.
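A minimal sketch of semantic search with an MPNet-based sentence transformer, the kind of model preferred above; the documents and query are toy placeholders:

```python
# Semantic search: embed documents and a query, then rank by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")

docs = [
    "NLI means natural language inference.",
    "Pinecone is a vector database.",
    "CLIP links images and text.",
]
doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode("what is a vector database?", convert_to_tensor=True)

hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]
for hit in hits:
    print(docs[hit["corpus_id"]], round(hit["score"], 3))
```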
00:12:55.940 | Okay, so as demonstrated, large language models do work incredibly well, particularly for
00:13:01.940 | general knowledge questions, but they definitely struggle with more niche or more specific
00:13:07.900 | pointed questions.
00:13:09.820 | And this typically leads to what we call hallucinations, which is where the model is basically spewing
00:13:16.340 | out things that are not true.
00:13:19.240 | And it's not really obvious to the user that these models are being inaccurate in what they are
00:13:25.880 | saying, because they can say very untruthful things very convincingly; these models,
00:13:33.060 | we can think of them as essentially masters of linguistic patterns.
00:13:37.740 | So they can say things that are completely false and say them in a way that makes them
00:13:42.980 | just seem true.
00:13:44.500 | So to protect us from this issue, we can add what we call a long term memory component
00:13:52.220 | to our GQA systems.
00:13:54.420 | And through this, we benefit from having an external knowledge base to improve system
00:13:59.260 | factuality and also improve user trust in the system.
00:14:03.820 | Naturally, there is a very vast potential for this type of technology.
00:14:10.340 | And despite being very new, there are already many people using it.
00:14:14.900 | I've seen that you.com have their new YouChat feature, which gives you natural language
00:14:22.220 | responses to your search queries.
00:14:25.160 | I've seen many podcast search apps recently using this technology.
00:14:30.300 | And there are even rumors of Microsoft, with Bing, using ChatGPT, which is another form
00:14:36.980 | of this technology, as a challenger to Google itself.
00:14:42.260 | So as I think we can all see, there's very big potential and opportunity here for disruption
00:14:49.620 | within the space of information retrieval.
00:14:52.660 | Essentially any industry, any company that relies on information in some way and retrieving
00:15:00.000 | that information efficiently can benefit from the use of retrieval augmented generative
00:15:06.100 | question answering and other retrieval augmented generative AI technologies.
00:15:11.580 | So this really represents an opportunity for replacing some of those outdated information
00:15:16.380 | retrieval technologies that we use today.
00:15:19.900 | Now that's it for this video.
00:15:22.180 | I hope all of this has been somewhat thought-provoking, interesting, and useful, but that's
00:15:29.140 | it for now.
00:15:30.140 | So thank you very much for watching and I will see you again in the next one.