Generative AI and Long-Term Memory for LLMs (OpenAI, Cohere, OS, Pinecone)
Chapters
0:00 What is generative AI
1:40 Generative question answering
4:06 Two options for helping LLMs
5:33 Long-term memory in LLMs
7:01 OP stack for retrieval augmented GQA
8:48 Testing a few examples
12:56 Final thoughts on Generative AI
00:00:00.000 |
Generative AI is what many expect to be the next big technology boom. 00:00:06.760 |
And being what it is, AI, it could have far-reaching implications that are beyond what we would expect. 00:00:15.140 |
That's not to say that we have entered the end game of AI with AGI or anything like that, 00:00:21.920 |
but I think that generative AI is a pretty big step forwards. 00:00:27.300 |
And it seems that investors are aware of this as well. 00:00:30.420 |
We all know that the majority of industries had a very bad 2022, yet generative AI startups still attracted a huge amount of funding. 00:00:43.720 |
According to the New York Times, they raised almost as much in 2022 as they had in the previous five years combined. 00:00:53.000 |
There were several wow moments that came from generative AI in 2022. 00:00:58.940 |
From generative art tools like OpenAI's DALL-E 2, Midjourney, and Stable Diffusion, to the 00:01:06.360 |
next generation of large language models from the likes of OpenAI with the GPT-3.5 models, 00:01:14.180 |
the open source BLOOM project, and chatbots like Google's LaMDA, and of course, ChatGPT. 00:01:22.220 |
All of this together marks just the first year of the widespread adoption of generative AI. 00:01:31.020 |
We're still in the very early days of a technology that is poised to completely change the way we interact with machines. 00:01:40.580 |
And one of the most thought-provoking use cases in how we interact with machines, I 00:01:46.000 |
think, belongs to generative question answering, or GQA. 00:01:50.220 |
Now the most simple GQA pipeline consists of nothing more than a user's question, or query, and a large language model. 00:01:59.060 |
The query is passed to the large language model and based on what the large language 00:02:03.460 |
model has learned during its training, so the knowledge that's stored within the model 00:02:09.100 |
parameters, it will output an answer to your question. 00:02:13.540 |
And we can see that this works for general knowledge questions pretty well across the board. 00:02:20.180 |
So if we take a look at OpenAI's DaVinci 003 model, Cohere's extra large model behind 00:02:26.700 |
the generation endpoint, or even Open Source models that we can access through Hugging 00:02:31.380 |
Face Transformers, we will get a good answer for general knowledge questions. 00:02:37.760 |
So if we ask, "Who was the first person on the moon?" we will get the correct answer, Neil Armstrong, across the board. 00:02:44.600 |
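To make that bare pipeline concrete, here is a minimal sketch using the legacy (pre-1.0) openai Python client with the Davinci 003 completion model; the prompt wording and API key are placeholders, not the exact setup used in the video.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

query = "Who was the first person on the moon?"

# The query goes straight to the LLM; the answer relies entirely on the
# knowledge stored in the model's parameters during training.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Answer the question.\n\nQuestion: {query}\nAnswer:",
    max_tokens=64,
    temperature=0.0,
)
print(response["choices"][0]["text"].strip())
```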
So we can see that this works incredibly well for things that are within the general knowledge of the model. 00:02:51.780 |
However, if we start asking more specific or advanced questions, these large language models begin to struggle. 00:02:59.380 |
So if we ask it a very specific question about machine learning methods and specifically 00:03:05.100 |
NLP and semantic search training methods, like, "Which training method should I use 00:03:11.460 |
for training sentence transformers when I have just pairs of positive sentences?" 00:03:17.380 |
Now, you don't need to understand what that means; if you don't, no problem. 00:03:22.140 |
One of the correct answers to this should be multiple negatives ranking loss, or even just MNR loss. 00:03:29.420 |
Yeah, if we ask this question, and we'll go ahead and ask what I found to be the best 00:03:34.780 |
performing of the large language models so far. 00:03:38.540 |
If we ask DaVinci 003 this question, it gives us this answer, and it says, "We need to 00:03:44.020 |
use a supervised training method," which, yes, that is correct, but it doesn't really answer our question. 00:03:51.060 |
It doesn't give us a specific method to use, and the reason it doesn't give us that is simple. 00:03:57.980 |
This knowledge has not been encoded into the model weights or parameters during training, so the model simply doesn't know it. 00:04:06.540 |
Now, there are two options we can take in order to help the model answer this question. 00:04:12.140 |
The first is we can fine-tune the large language model on text data that contains this information, but fine-tuning comes with its own problems. 00:04:22.180 |
It can take a lot of computational resources or money, and it also requires a lot of text 00:04:28.820 |
data as well, which is not always necessarily available. 00:04:31.940 |
If we just mention the answer once in a single sentence out of a million sentences, the large 00:04:39.620 |
language model might not pick up on that information, and when we ask the question again, it may still get it wrong. 00:04:46.340 |
We need a lot of text data that mentions this in multiple contexts in order for it to learn that information. 00:04:54.780 |
Considering that, our second option, which I think is probably the easier option, is 00:04:59.740 |
to use something called retrieval augmented generation, or in this case, retrieval augmented generative question answering. 00:05:08.460 |
This simply means that we add what is called a retrieval component to our GQA pipeline. 00:05:15.060 |
Adding this retrieval component allows us to retrieve relevant information. 00:05:21.500 |
If we have that sentence within our million sentences, we can retrieve that sentence and 00:05:25.700 |
feed it into our large language model alongside our query. 00:05:29.620 |
We're essentially creating a secondary source of information. 00:05:34.300 |
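As a rough sketch of what that looks like in practice, the retrieved sentences simply get prepended to the query as extra context before anything reaches the LLM; the prompt template below is an illustrative assumption, not the exact one used in the video.

```python
def build_augmented_prompt(query: str, contexts: list[str]) -> str:
    """Combine retrieved passages with the user's query into one prompt."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```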
Going ahead with this second option of retrieval augmented ML, when we apply it to large language 00:05:42.260 |
models, we can actually think of it as a form of long-term memory. 00:05:47.500 |
To implement this long-term memory, we need to integrate a knowledge base into our GQA pipeline. 00:05:53.980 |
This knowledge base is the retrieval component that we're talking about, and it allows us 00:05:58.260 |
to take our query and search through our sentences or paragraphs for relevant information and 00:06:04.140 |
return that relevant information that we can then pass to our large language model. 00:06:09.640 |
As you can see, using this approach, we get much better results. 00:06:14.740 |
Again, using DaVinci 003 for the generation model here, we get, "You should use natural 00:06:20.660 |
language inference (NLI) with multiple negatives ranking loss." 00:06:24.540 |
Now, NLI is just one option for the format of the data, essentially, but the answer of 00:06:31.880 |
multiple negatives ranking loss is definitely what we're looking for. 00:06:36.060 |
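As a quick aside, if you wanted to follow that recommendation, a minimal training sketch with the sentence-transformers library and multiple negatives ranking loss might look like this; the base model and example pairs are placeholders rather than anything stated in the answer.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-mpnet-base-v2")  # placeholder base model

# With MNR loss, each example is a pair of related (positive) sentences;
# the other pairs in a batch act as in-batch negatives.
train_examples = [
    InputExample(texts=["A man is playing guitar.", "Someone plays a guitar."]),
    InputExample(texts=["A dog runs in the park.", "A dog is running outside."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```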
This much better answer is a direct result of adding more contextual information to our 00:06:42.780 |
query, which we would refer to as source knowledge. 00:06:47.560 |
Source knowledge is basically any knowledge that gets passed through to the large language 00:06:52.940 |
model within the input at inference time, so whenever we query the model. 00:07:01.700 |
In this example, we used OpenAI for both generation and embedding, which 00:07:09.820 |
I'll explain in a moment, and also the Pinecone vector database as our knowledge base. 00:07:16.020 |
Both these together are what we would refer to as the OP stack, so OpenAI, Pinecone. 00:07:21.740 |
This is a more recently popularized option for building very performant AI apps that 00:07:29.100 |
rely on a retrieval component like Retrieval Augmented GQA. 00:07:34.660 |
At query time in this scenario, the pipeline consisted of three main steps. 00:07:40.020 |
In step one, we use an OpenAI embedding endpoint to encode our query into what we 00:07:48.020 |
call a dense vector. In step two, we took that encoded query, sent it to our knowledge base, 00:07:55.020 |
which returned relevant context or text passages back to us, which we then combined with our original query. 00:08:05.700 |
In step three, we take our query and that relevant context and push them into our large 00:08:12.720 |
language model to generate a natural language answer, and as you can see, adding that extra 00:08:20.420 |
context from Pinecone, our knowledge base, allowed the large language model to answer the question accurately. 00:08:28.860 |
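Putting those three steps together, a sketch of this retrieval augmented GQA pipeline with the OP stack could look roughly like this; the index name, embedding model, environment, and prompt template are assumptions for illustration, using the legacy openai and pinecone-client interfaces.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"  # placeholder
pinecone.init(api_key="PINECONE_API_KEY", environment="us-east1-gcp")  # placeholder
index = pinecone.Index("youtube-transcriptions")  # hypothetical index name

query = ("Which training method should I use for sentence transformers "
         "when I have just pairs of positive sentences?")

# Step 1: encode the query into a dense vector with an OpenAI embedding model.
embed = openai.Embedding.create(model="text-embedding-ada-002", input=query)
query_vector = embed["data"][0]["embedding"]

# Step 2: retrieve the most relevant passages from the Pinecone knowledge base.
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
contexts = [match["metadata"]["text"] for match in results["matches"]]

# Step 3: combine the query and contexts into a prompt and generate an answer.
prompt = ("Answer the question based on the context below.\n\n"
          "Context:\n" + "\n\n".join(contexts) +
          f"\n\nQuestion: {query}\nAnswer:")
response = openai.Completion.create(
    model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0.0
)
print(response["choices"][0]["text"].strip())
```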
Even beyond providing more factual, accurate answers, the fact that we can retrieve the 00:08:34.140 |
sources of information and actually present them to users using this approach also instills 00:08:40.780 |
user trust in the system, allowing users to confirm the reliability of the information being presented to them. 00:08:50.180 |
So let's test a few examples. We're going to use the same pipeline that I've already described. 00:08:54.340 |
The knowledge base that we're going to be using, so the data source, is the jamescalam 00:08:59.060 |
YouTube Transcriptions dataset, which is hosted on Hugging Face Datasets, which is just 00:09:03.980 |
a dataset of transcribed audio from various tech and ML YouTube channels. 00:09:10.560 |
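If you want to pull the same data, it can be loaded from Hugging Face Datasets roughly like this; the exact dataset identifier is an assumption based on the name mentioned above.

```python
from datasets import load_dataset

# Assumed identifier for the YouTube transcriptions dataset.
data = load_dataset("jamescalam/youtube-transcriptions", split="train")
print(data[0])  # each record is a transcribed text snippet plus metadata
```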
If we ask questions around ML and tech, generally speaking, if it's within the knowledge base, 00:09:15.820 |
it should be able to answer those questions pretty accurately. 00:09:21.020 |
Our first answer is "NLI stands for Natural Language Interface," which is wrong. 00:09:27.300 |
The second is correct, so we get: Natural Language Inference (NLI) is a task that requires pairs 00:09:32.560 |
of sentences to be labeled as either contradictory, neutral, or entailing (inferring) each other. 00:09:50.060 |
Next, asking about CLIP, it looks like we're just getting a description of what CLIP is. 00:09:54.420 |
It says CLIP is used to classify images and generate natural language descriptions of them, which is not quite right. 00:10:04.220 |
In fact, I know that's not what I would go with. 00:10:07.380 |
To use CLIP, you need access to a GPU and the OpenAI CLIP repository. 00:10:11.060 |
Yes, you can do that, and you can use the provided scripts to train and evaluate the model. 00:10:20.140 |
It's mostly correct, except for the start, which is not really how I would describe CLIP, 00:10:24.500 |
but then the rest about using the CLIP repository is correct. 00:10:27.340 |
Now, I got a rate limit error, so let me try and comment this part out and try again. 00:10:37.460 |
So you can use OpenAI's CLIP easily by using the Hugging Face Transformers library, which 00:10:42.660 |
in my opinion is 100% the easiest way to use the model. 00:10:47.660 |
And then we get this, which describes Transformers as the library for doing anything with NLP and computer vision. 00:10:52.860 |
Not necessarily that standard with computer vision, but I think I know the source of information 00:10:57.180 |
that this is coming from, which is one of my videos. 00:10:59.620 |
And I probably do say something along those lines, because that is what we're using CLIP for there. 00:11:05.540 |
And then to get started, you should install PyTorch and the Transformers and Datasets 00:11:10.460 |
libraries, which is actually usually the case, since we'd be using a dataset from the Datasets library. 00:11:17.420 |
And you do need to install PyTorch alongside Transformers. 00:11:23.620 |
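For reference, using CLIP through Hugging Face Transformers really is only a few lines; the checkpoint, image path, and labels below are placeholders, not something taken from the generated answer itself.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
labels = ["a photo of a dog", "a photo of a cat"]

# Score the image against the candidate text labels (zero-shot classification).
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```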
Next, I want to know what is a good de facto model, or sentence transformer model, to use for semantic search. 00:11:36.660 |
So with no augmentation, we get a popular de facto sentence transformer model for semantic search, and the answer given is BERT. 00:11:47.980 |
So here it seems like they're talking about the standard BERT model and not even the sentence transformer version of it. 00:12:06.700 |
Then, with the augmentation, we get: the pre-trained Universal Sentence Encoder model is a good de facto sentence transformer model to use. 00:12:16.220 |
I think there are better models to use, but that is actually, I think, one of the most 00:12:21.740 |
popular ones that people end up using as the sort of first sentence transformer or sentence embedding model. 00:12:31.420 |
And this is a much more accurate answer than what we got before without the context, without 00:12:37.380 |
the augmentation, which was BERT, which is not even a sentence transformer. 00:12:42.100 |
So I think this is still a pretty good answer. 00:12:45.100 |
Personally, I would like to see like an MPNet model or something on there, but that's more of a personal preference. 00:12:51.820 |
So I think this is probably a more broadly accepted answer. 00:12:55.940 |
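For comparison, loading one of those widely used sentence transformer models, an MPNet-based one here as my own preference rather than anything stated in the answers, looks like this with the sentence-transformers library.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # a common default choice
embeddings = model.encode([
    "Which training method should I use for sentence transformers?",
    "Multiple negatives ranking loss works well with pairs of positive sentences.",
])
print(embeddings.shape)  # dense vectors ready for semantic similarity search
```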
Okay, so as demonstrated, large language models do work incredibly well, particularly for 00:13:01.940 |
general knowledge questions, but they definitely struggle with more niche or more specific questions. 00:13:09.820 |
And this typically leads to what we call hallucinations, which is where the model is basically making things up. 00:13:19.240 |
And it's often not obvious to the user that these models are being inaccurate in what they are 00:13:25.880 |
saying, because they can say very untruthful things very convincingly; these models, 00:13:33.060 |
we can think of them as essentially masters of linguistic patterns. 00:13:37.740 |
So they can say things that are completely false and say them in a way that makes them seem believable. 00:13:44.500 |
So to protect us from this issue, we can add what we call a long-term memory component to our GQA pipeline. 00:13:54.420 |
And through this, we benefit from having an external knowledge base to improve system 00:13:59.260 |
factuality and also improve user trust in the system. 00:14:03.820 |
Naturally, there is a very vast potential for this type of technology. 00:14:10.340 |
And despite being very new, there are already many people using it. 00:14:14.900 |
I've seen that you.com have their new YouChat feature, which gives you natural language answers to your searches. 00:14:25.160 |
I've seen many podcast search apps recently using this technology. 00:14:30.300 |
And there are even rumors of Microsoft using ChatGPT, which is another form 00:14:36.980 |
of this technology, with Bing as a challenger to Google itself. 00:14:42.260 |
So as I think we can all see, there's very big potential and opportunity here for disruption across many industries. 00:14:52.660 |
Essentially any industry, any company that relies on information in some way and retrieving 00:15:00.000 |
that information efficiently can benefit from the use of retrieval augmented generative 00:15:06.100 |
question answering and other retrieval augmented generative AI technologies. 00:15:11.580 |
So this really represents an opportunity for replacing some of those outdated information retrieval systems that we still rely on. 00:15:22.180 |
I hope all of this has been somewhat thought-provoking, interesting, and useful, but that's it for this video. 00:15:30.140 |
So thank you very much for watching and I will see you again in the next one.