
Generative AI and Long-Term Memory for LLMs (OpenAI, Cohere, OS, Pinecone)


Chapters

0:00 What is generative AI
1:40 Generative question answering
4:06 Two options for helping LLMs
5:33 Long-term memory in LLMs
7:01 OP stack for retrieval augmented GQA
8:48 Testing a few examples
12:56 Final thoughts on Generative AI

Transcript

Generative AI is what many expect to be the next big technology boom. And being what it is, AI, it could have far-reaching implications that are beyond what we would imagine today. That's not to say that we have entered the end game of AI with AGI or anything like that, but I think that generative AI is a pretty big step forwards.

And it seems that investors are aware of this as well. We all know that the majority of industries had a very bad 2022, yet generative AI startups received $1.37 billion in funding. According to The New York Times, that's almost as much as they received in the past five years combined.

However, it's hardly surprising. There were several wow moments that came from generative AI in 2022: from generative art tools like OpenAI's DALL-E 2, Midjourney, and Stable Diffusion, to the next generation of large language models from the likes of OpenAI with the GPT-3.5 models, the open-source BLOOM project, and chatbots like Google's LaMDA and, of course, ChatGPT.

All of this together marks just the first year of the widespread adoption of generative AI. We're still in the very early days of a technology that is poised to completely change the way we interact with machines. And one of the most thought-provoking use cases in how we interact with machines belongs, I think, to generative question answering, or GQA.

Now, the simplest GQA pipeline consists of nothing more than a user's question, or query, and a large language model. The query is passed to the large language model, and based on what the model has learned during its training, so the knowledge stored within the model parameters, it will output an answer to your question.

And we can see that this works for general knowledge questions pretty well across the board. So if we take a look at OpenAI's text-davinci-003 model, Cohere's extra-large model behind the generation endpoint, or even open-source models that we can access through Hugging Face Transformers, we will get a good answer for general knowledge questions.

So if we ask, "Who was the first person on the moon?" we will get across the board the answer, Neil Armstrong. So we can see that this works incredibly well for things that are within the general knowledge base of these large language models. However, if we start asking more specific or advanced questions, these large language models will begin to fail.
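As a rough illustration, here is a minimal sketch of that naive pipeline, assuming the pre-1.0 `openai` Python client that was current at the time of this video; the API key is a placeholder, and the exact API surface may differ in newer client versions.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Naive GQA: the query goes straight to the LLM, with no external knowledge
query = "Who was the first person on the moon?"

res = openai.Completion.create(
    model="text-davinci-003",  # the model discussed in the video
    prompt=query,
    max_tokens=64,
    temperature=0.0,
)
print(res["choices"][0]["text"].strip())  # e.g. "Neil Armstrong"
```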

So if we ask it a very specific question about machine learning methods, specifically NLP and semantic search training methods, like, "Which training method should I use for training sentence transformers when I have just pairs of positive sentences?" Now, if you don't understand what that means, no problem.

One of the correct answers to this should be multiple negatives ranking loss, or even just ranking loss would be fine as well. Now, if we ask this question to what I've found to be the best performing of the large language models so far, text-davinci-003, it gives us this answer: it says we need to use a supervised training method, which, yes, is correct, but it doesn't really answer the question.

It doesn't give us a specific method to use, and the reason it doesn't give us that is because the model doesn't know. This knowledge has not been encoded into the model weights or parameters during training, so it can't answer the question. Now, there are two options we can take in order to help the model answer this question.

The first is that we can fine-tune the large language model on text data that would contain this information. Now, this can be hard to do. It can take a lot of computational resources or money, and it also requires a lot of text data, which is not always available.

If we just mention the answer once in a single sentence of a million sentences, the large language model might not pick up on that information, and when we ask the question again, it may not have learned the answer. We need a lot of text data that mentions this in multiple contexts in order for it to learn this information well.

Considering that, our second option, which I think is probably the easier option, is to use something called retrieval augmented generation, or in this case, retrieval augmented generative Q&A. This simply means that we add what is called a retrieval component to our GQA pipeline. Adding this retrieval component allows us to retrieve relevant information.

If we have that sentence within our million sentences, we can retrieve that sentence and feed it into our large language model alongside our query. We're essentially creating a secondary source of information. Going ahead with this second option of retrieval augmented ML, when we apply it to large language models, we can actually think of it as a form of long-term memory.

To implement this long-term memory, we need to integrate a knowledge base into our GQA pipeline. This knowledge base is the retrieval component we're talking about, and it allows us to take our query, search through our sentences or paragraphs for relevant information, and return that relevant information, which we can then pass to our large language model.
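As a rough sketch of what building that knowledge base can look like with the tooling discussed later in the video: the snippet below assumes the 2.x `pinecone-client` and pre-1.0 `openai` Python libraries that were current at the time; the index name, environment, and passages are placeholders.

```python
import openai
import pinecone

openai.api_key = "OPENAI_API_KEY"          # placeholder
pinecone.init(api_key="PINECONE_API_KEY",  # placeholder
              environment="us-east1-gcp")  # placeholder environment

index_name = "gqa-knowledge-base"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    # text-embedding-ada-002 produces 1536-dimensional vectors
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)

# The sentences/paragraphs we want to be able to search through
passages = [
    "Multiple negatives ranking loss is used when training sentence "
    "transformers with pairs of positive (related) sentences.",
    # ...more passages
]

# Encode the passages and store them alongside their text as metadata
res = openai.Embedding.create(input=passages, model="text-embedding-ada-002")
vectors = [
    (str(i), record["embedding"], {"text": passages[i]})
    for i, record in enumerate(res["data"])
]
index.upsert(vectors=vectors)
```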

As you can see, using this approach, we get much better results. Again using text-davinci-003 as the generation model, we get, "You should use natural language inference (NLI) with multiple negatives ranking loss." Now, NLI is just one option for the format of the data, essentially, but the answer of multiple negatives ranking loss is definitely what we're looking for.
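For context on the answer itself, here is a minimal sketch of training a sentence transformer with multiple negatives ranking loss using the classic `fit` API from the `sentence-transformers` library; the base model and example pairs are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a bi-encoder from a plain transformer plus mean pooling
word_model = models.Transformer("bert-base-uncased", max_seq_length=128)
pooling = models.Pooling(word_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_model, pooling])

# Pairs of positive (related) sentences; other pairs in the batch act as negatives
train_examples = [
    InputExample(texts=["how do I train a sentence transformer?",
                        "training methods for sentence transformers"]),
    # ...more positive pairs
]
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```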

This much better answer is a direct result of adding more contextual information to our query, which we refer to as source knowledge. Source knowledge is basically any knowledge that gets passed to the large language model within its input at inference time, so when we're predicting or generating text.
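In practice, adding source knowledge just means templating the retrieved passages into the prompt alongside the question. A minimal, hypothetical helper for that might look like this:

```python
def build_prompt(query: str, contexts: list[str]) -> str:
    """Combine retrieved passages (source knowledge) with the user's query."""
    context_block = "\n---\n".join(contexts)
    return (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer:"
    )
```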

In this example, we use OpenAI for both generation and embedding, which I'll explain in a moment, and the Pinecone vector database as our knowledge base. Together, these are what we'd refer to as the OP stack: OpenAI plus Pinecone. This is a recently popularized option for building very performant AI apps that rely on a retrieval component, like retrieval augmented GQA.

At query time in this scenario, the pipeline consists of three main steps. In step one, we use an OpenAI embedding endpoint to encode our query into what we call a dense vector. In step two, we take that encoded query and send it to our knowledge base, which returns relevant context, i.e., text passages, back to us, which we then combine with our query. And that leads on to step three.

In step three, we take our query and that relevant context and push them into our large language model to generate a natural language answer. As you can see, adding that extra context from Pinecone, our knowledge base, allowed the large language model to answer the question much more accurately. And beyond providing more factual, accurate answers, the fact that we can retrieve the sources of information and actually present them to users also instills user trust in the system, allowing users to confirm the reliability of the information being presented to them.
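Putting those three steps together, a sketch of the query-time pipeline might look like the following; it reuses the hypothetical index and `build_prompt` helper sketched above, under the same client-version assumptions, and the exact response accessors may vary between client versions.

```python
import openai
import pinecone

query = ("Which training method should I use for training sentence transformers "
         "when I have just pairs of positive sentences?")

# Step 1: encode the query into a dense vector with OpenAI's embedding endpoint
xq = openai.Embedding.create(
    input=[query], model="text-embedding-ada-002"
)["data"][0]["embedding"]

# Step 2: retrieve relevant context passages from the Pinecone knowledge base
index = pinecone.Index("gqa-knowledge-base")  # hypothetical index from earlier
res = index.query(vector=xq, top_k=3, include_metadata=True)
contexts = [match["metadata"]["text"] for match in res["matches"]]

# Step 3: feed query plus retrieved context into the LLM to generate an answer
prompt = build_prompt(query, contexts)  # helper sketched earlier
answer = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=200,
    temperature=0.0,
)
print(answer["choices"][0]["text"].strip())
```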

Let's go ahead and try a few more examples. We're going to use the same pipeline that I've already described. The knowledge base that we're going to be using, so the data source, is the jamescalam YouTube Transcriptions dataset, which is hosted on Hugging Face Datasets and is just a dataset of transcribed audio from various tech and ML YouTube channels.
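If you want to reproduce the knowledge base, the dataset can be pulled from Hugging Face Datasets as below; the exact fields are best checked against the dataset card.

```python
from datasets import load_dataset

# Transcribed audio from various tech/ML YouTube channels
data = load_dataset("jamescalam/youtube-transcriptions", split="train")

print(data)     # inspect the available fields
print(data[0])  # a single transcript snippet plus its video metadata
```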

If we ask questions around ML and tech generally, and the answer is within the knowledge base, it should be able to answer those questions pretty accurately. We're going to start with: what is NLI? Our first answer is that NLI stands for natural language interface, which is wrong. The second is correct: we get natural language inference; NLI is a task that requires pairs of sentences to be labeled as either contradictory, neutral, or entailing (inferring) each other, which is perfect.

Let's try something else: how can I use OpenAI's CLIP easily? With no augmentation, it looks like we're just getting a description of what CLIP is, which is, I mean, correct: it is used to classify images and generate natural language descriptions of them, though that's not how I would define it.

In fact, I know that's not what I would go with. Then it says: to use CLIP, you need access to a GPU and the OpenAI CLIP repository. Yes, you can do that, and you can use the provided scripts to train and evaluate the model. Additionally, you can use... and so on and so on.

Okay. It's mostly correct, except for the start; it's not really how I would describe CLIP, but the rest, about using the CLIP repository, is correct. Now, I got a rate limit error, so let me comment this part out and try again. Okay, and what I wanted to get is this.

So: you can use OpenAI's CLIP easily by using the Hugging Face Transformers library, which in my opinion is 100% the easiest way to use the model. And then we get this: some library for doing anything with NLP and computer vision. It's not necessarily that standard for computer vision, but I think I know the source this information is coming from, which is one of my videos.

And I probably do say something along those lines, because that is what we're using CLIP for in this instance. Then: to get started, you should install PyTorch and the Transformers and Datasets libraries, which is usually the case when loading a dataset from Datasets, and you do need to install PyTorch alongside Transformers.
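For reference, using CLIP through Hugging Face Transformers really is just a few lines. This is a standard usage sketch, with the image URL as a placeholder.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image URL
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Score the image against a few candidate text descriptions
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```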

So that is really cool. Let's ask one more question. I want to know: what is a good de facto sentence transformer model to use in semantic search? Let's see what we get. With no augmentation, we get a popular de facto sentence transformer model for semantic search.

It's BERT: a deep learning model that's been pre-trained, and so on and so on. Not really. Here it seems like it's talking about the standard BERT model and not even the sentence transformer, or bi-encoder, version of BERT, so I would say it's definitely wrong. And I'm hitting a rate limit again.

So let me comment this out and run it again. Okay, and here we go: the pre-trained Universal Sentence Encoder model is a good de facto sentence transformer model to use in semantic search. Now, I would disagree with that; I think there are better models to use, but it is, I think, one of the most popular ones, often the first sentence transformer or sentence encoding model that people end up using.

And this is a much more accurate answer than what we got before, without the context or the augmentation, which was BERT, which is not even a sentence transformer. So I think this is still a pretty good answer. Personally, I would like to see something like an MPNet model in there, but that's more my personal preference.

So I think this is probably a more broadly accepted answer. Okay, so as demonstrated, large language models do work incredibly well, particularly for general knowledge questions, but they definitely struggle with more niche or more specific, pointed questions. And this typically leads to what we call hallucinations, which is where the model is basically spewing out things that are not true.

And it's often not obvious to the user that these models are being inaccurate, because they can say very untruthful things very convincingly. We can think of these models as essentially masters of linguistic patterns, so they can say things that are completely false in a way that makes them seem true.

So to protect us from this issue, we can add what we call a long-term memory component to our GQA systems. Through this, we benefit from having an external knowledge base that improves the system's factuality and also improves user trust in the system. Naturally, there is vast potential for this type of technology.

And despite being very new, there are already many people using it. I've seen that you.com have their new YouChat feature, which gives you natural language responses to your search queries. I've seen many podcast search apps recently using this technology. And there are even rumors of Microsoft using ChatGPT, another form of this technology, in Bing as a challenger to Google itself.

So as I think we can all see, there's very big potential and opportunity here for disruption within the space of information retrieval. Essentially any industry, any company that relies on information in some way and retrieving that information efficiently can benefit from the use of retrieval augmented generative question answering and other retrieval augmented generative AI technologies.

So this really represents an opportunity for replacing some of the outdated information retrieval technologies that we use today. Now, that's it for this video. I hope all of this has been somewhat thought-provoking, interesting, and useful, but that's it for now. So thank you very much for watching, and I will see you again in the next one.

Bye.