
Long Form Question Answering (LFQA) in Haystack


Chapters

0:00 Intro
4:20 Approaches to Question Answering
5:43 Components of QA Pipeline
8:58 LFQA Generator
9:40 Haystack Setup
10:32 Initialize Document Store
13:02 Getting Data
17:53 Indexing Embeddings
21:51 Initialize Generator
24:10 Asking Questions
26:12 Common Problems
29:32 Generator Memory
31:30 Few More Questions
34:54 Outro

Transcript

Today, we are going to talk about a subdomain of question answering called long-form question answering. Now, before we get into the specifics, let's just talk very quickly about question answering as a subdomain of NLP. Question answering has, I think, exploded as a subdomain of NLP in the past few years, mainly because I think question answering is an incredibly widely applicable use case for NLP.

But it wasn't possible to do question answering, or at least not anything good, until we had transformer models like BERT. So as soon as we got something like BERT, question answering became viable, and with the huge number of use cases for it, the field obviously took off.

Now, question answering is quite complicated, but at its core, it's basically just the retrieval of information in a more human-like way. And when we consider this, I think it makes it really clear how broadly applicable question answering is, because almost every organization in the world, if not all, are going to need to retrieve information.

And for a lot of companies and particularly larger organizations, I think the act of information retrieval is actually a big component of their day-to-day operations. Now, at the moment, most organizations do information retrieval across a suite of tools. So they will have people using some sort of internal search tools, which are typically keyword-based, which is generally not always that helpful.

Sometimes it's useful, but a lot of the time, it's not great. Then another key form of information retrieval in most organizations is literally person to person. So you go and ask someone who you think will probably know where some information is, like a document or so on. And obviously, this sort of patchwork of information retrieval, to an extent, sure it works, but it's inefficient.

Now, if we consider that many organizations contain thousands of employees, each of those employees producing pages upon pages of unstructured data, e.g. pages of documents and texts that are meant for human consumption, in most cases, all of that information is just being lost in some sort of void. And rather than that information being lost in a void that we're never going to see again and it becomes useless to the organization or the company, we can instead place it in a database that a question answering agent has access to.

And when we ask a question to that Q&A agent, which we ask in a human-like way, it will go and retrieve the relevant information for us instantly. Well, not instantly, but pretty close. The majority of data in the world is unstructured. And there's a few different sources for this, but I think places like Forbes estimate that number to be around 90% of the world's data.

So in your organization, you probably have a number similar to this. So 90% of your data is unstructured. That means it's meant for human consumption, not machines. And it means it's liable to get lost in that void where we're just never going to see that information ever again. Now, that's massively inefficient.

Question answering is an opportunity to not lose that and actually benefit from that information. Now, in question answering, there are two main approaches. In both cases, we usually store our documents in a document store or vector database. These documents are what we would call sentences or paragraphs extracted from, for example, your PDFs or emails or whatever unstructured data you have out there.

And we retrieve data from that. And then the next step is where we have the two different forms of question answering. With that relevant information that we have from our document store, based on a query that we've passed through there, we either generate an answer or we extract an answer.

So obviously, when we're generating an answer, we look at all of the context that we've retrieved and we use an NLP model to generate some sort of human answer to our query based on that information. Otherwise, we use an extractive model, which is literally going to take a snippet of information from the data that we have retrieved.

So there's a few components that I just described there. There was a document store at the start. When we're using a document store, which we will in most cases I'd imagine, we call that open book question answering. Now, the reason it's called open book is it is like students in an exam.

In a typical exam, you don't have any outside materials to refer to. You have to rely on what is in your brain. That's very similar to using, for example, a generator model that, given a question, doesn't refer to any document store. It just refers to what is within its own memory, its own model memory.

And that model memory has been built during model training. So that would be referred to as closed book, generative or abstractive Q&A. On the other hand, we can have a document store. With that document store, it is like we are students in our exam and we have an open book that we can refer to for information.

So we're not just relying on what is in our head. We're looking at the information in this book. And we still need to rely on the knowledge in our head in order to apply what is in that book to the questions we're given in the exam. It's exactly the same for open book abstractive question answering in that you have the generator model.

But we're not just relying on a generator model to answer our questions. We are also relying on a document store, which is our book, and on what is called a retrieval model. This retrieval model is going to take our question, encode it into a vector embedding, and take it to that document store, which in our scenario is actually just a vector database.

And in a vector database, what you have is lots of other vector embeddings, which are essentially numerical representations of the documents that you stored in it before. So remember, documents are those chunks of paragraph or sentences from different sources. That vector database has loads of these what we call context vectors.

And we pass our query vector into that document store or vector database, and we retrieve the most similar context vectors from there and pass them back to our retrieval pipeline. Then that is passed to our generator model. Our generator model is going to see the query followed by the set of retrieved relevant, hopefully, context.

And it uses all of that to generate an answer. So we can see with this open book format, we are passing a lot more information into the generator, which allows the generator to answer more specific questions. Now, long form question answering, which is what we are going to go through, is one form of this abstractive question answering.

The one thing that makes long form question answering long form is that the generator model has been trained to produce a multi-sentence output. So rather than just outputting maybe an answer of three or four words or one sentence, it is going to try and output a full paragraph answer to you.

So that's long form question answering, or LFQA. So we are going to implement LFQA in Haystack. Haystack is a very popular NLP library, mainly for question answering. Now, to install Haystack and the other libraries that we need today, we do a pip install. We need the Pinecone client, farm-haystack with the Pinecone extra specified, datasets, and pandas. Actually, I think you can ignore pandas. Let's remove that, so just these three.
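As a rough sketch, the install cell looks something like this (in a notebook; the exact extra name for the Pinecone-flavoured Haystack install is an assumption):

```python
# pandas is left out, as noted above
!pip install pinecone-client 'farm-haystack[pinecone]>=1.3.0' datasets
```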

With farm-haystack, we are going to be using something called a Pinecone document store. So for that, you need version 1.3 or above. Now, to initialize that Pinecone document store, remember the document store is that thing that you saw on the right before, where we're storing all of our context vectors.

We will do this. So we first need an API key from Pinecone. So there's a link here. I'll just open it and show you quickly. And that will bring you to this page here. Now, you can sign up for free. You don't need to pay for anything. And we don't need to pay for anything to do what we're doing here either.

It's all completely free. So you just sign up. And once you've signed up, you will see it should just be one project on your homepage. So for me, it is the default project, James's default project. So you can go into that. And then on the left over here, we have API keys.

So we open that. And we get our default API key. We can just copy it. So we come over here. And we use that to authenticate our Pinecone document store back in our code. So I would paste that here. And with that, we just run this. So we are initializing our document store.

We are calling our index. So remember, document store is actually a vector database in this case. And inside that vector database, we have what's called an index. The index is basically the list of all the context vectors that we have. We call that index haystack LFQA. Now, you can call it whatever you want.

But when you are wanting to load this document store again, you need to specify the correct index. That's all. That's the only difference it makes. Similarity, we're using cosine similarity. And we're using embedding dimensions 768. Now, it's important to align this to whatever the similarity metric and embedding dimension of your retrieval model is.

In our case, cosine and 768. These are pretty typical retriever model metrics and dimensionalities. Now, we can go down. We can check our metric type. We can also see the number of documents and the embeddings that we have in there. Now, we don't have any at the moment because we haven't pushed anything to our document store.
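A minimal sketch of that initialization, assuming Haystack's PineconeDocumentStore API (version 1.3 or above) and the parameters described above:

```python
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key="<YOUR_PINECONE_API_KEY>",  # copied from the Pinecone console
    index="haystack-lfqa",
    similarity="cosine",  # must match the retriever's similarity metric
    embedding_dim=768,    # must match the retriever's embedding dimension
)

print(document_store.get_document_count())   # 0 for a fresh index
print(document_store.get_embedding_count())  # 0 for a fresh index
```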

We don't have any data. So we need to get some data. For that, we are going to use Hugging Face datasets. So over here. We're going to use this dataset here, which is a set of snippets from Wikipedia. There are a lot of them. In full, this dataset is 9 gigabytes.

Now, to avoid downloading this full dataset, what we do is set streaming equal to true. And what this will do is allow us to iteratively load one record at a time from this dataset. And we can check what we have inside that dataset by running this. So next, we create an iterable from our dataset. And we see this.
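Something like the following; the dataset name is my assumption for the Wikipedia snippets set shown on screen, so substitute whichever dataset you are using:

```python
from datasets import load_dataset

# Stream the ~9GB dataset rather than downloading it in full
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",  # assumed dataset name
    split="train",
    streaming=True,
)

# Pull a single record to inspect its fields
print(next(iter(wiki_data)))
```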

And we see this. So the main things to take note of here are section title and passage text. Passage text is going to create our context or that document. And there are a couple of other things. So history is going to be what we are going to filter for in our dataset.

This is a very big dataset, and I don't want to process all of it. So I'm restricting our scope to just history, and we're going to only return a certain number of records from that section. That's important to us purely for that filtering out of other sections or section titles.

And we will include article title as metadata in our documents, although it's not really important because we're not actually going to use it. It's just so you can see how you would include metadata in there in case you did want to use it. So here, what we're doing is filtering only for documents that have the section title history.

And we just get this iterable object because we're streaming. So it just knows now when we're streaming one by one, when it's pulling an object, it's going to check if that object section title starts with history. If it does, it will pull it. If not, it will move on to the next one.
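As a sketch, that lazy filter over the stream looks like this:

```python
# Keep only records whose section title starts with "History";
# nothing is downloaded until we actually iterate over the filtered stream
history = wiki_data.filter(
    lambda record: record["section_title"].startswith("History")
)
```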

So we're just going to pull those with history. Now what we need to do is process those and add them to our document store. Now what I've done here is said, "Okay, we are only going to pull 50,000 of those and no more." At that point, we cut off.

And actually, it cuts off just before 50,000. And what we're going to do is add the documents in batches. So we're going to pull all of these records, collect 10,000 of them at a time, and then add them to our document store.

And this is a Haystack document object. So we have a content. The content is the document text, that big paragraph you saw before. Meta is any metadata that we'd like to add in there. Now with the Pinecone document store, we can use metadata filtering, although I won't show you how to do that here.

But that can be really useful if it's something you're interested in. So that's how you'd add metadata to your document as well. And all I'm doing is adding that doc to a docs list. And we increase the counter. And once the counter hits the batch size, which is the 10,000, we write those documents to our document store.
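A sketch of that loop, assuming the field names we saw in the record earlier and a batch size of 10,000:

```python
from haystack import Document

batch_size = 10_000
total_docs = 50_000

docs = []
counter = 0
for record in history:
    doc = Document(
        content=record["passage_text"],  # the paragraph of text
        meta={
            "article_title": record["article_title"],
            "section_title": record["section_title"],
        },
    )
    docs.append(doc)
    counter += 1
    if counter % batch_size == 0:
        # Write this batch of plain-text documents to the document store
        document_store.write_documents(docs)
        docs = []
    if counter >= total_docs:
        break
```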

Now you will remember I said the document store is a vector database, and inside the vector database, we have vectors. At the moment, when we write those documents, we're not actually creating those vectors, because we haven't specified the retriever model yet. We're going to do that later. So at the moment, what we're doing is kind of adding the documents as just plain text to almost be ready to be processed into vectors to put into that vector database.

So it's almost like they're in limbo, waiting to be added to our database. So we add all of those. It can take a little bit of time, not too long, though. And then once we hit or get close to 50,000, we break. So we stop the loop. And then we can see, if we get the document count, we see that we have the almost 50,000 documents in there.

But then when we look at the embedding count, zero. And that's because they're waiting to be added into the vector database, the text documents. So they exist as documents. They just don't exist as embeddings yet. So what we now need to do is convert those documents into vector embeddings.

Now, to do that, we need a retriever model. Now, at this point, it's probably best to check if you have a GPU that is available, like a CUDA-enabled GPU. If you don't, this step will take longer, unfortunately. But if you do, that's great, because this will be pretty quick in most cases, depending on your GPU, of course.

So we initialize our retriever model. So we're using the embedding retriever. And this allows us to use what are called sentence transformer models from the sentence transformers library. Now, I'm using this model here. And we can find all the sentence transformer models over on the HuggingFace model hub. So let's have a quick look at that.

So we are here, HuggingFace.co/models. And I can paste that model name. Maybe I'll just search flax sentence embeddings. Now, the flax-sentence-embeddings models are a set of models that were trained on a lot of data using the Flax library. But there are a lot of other sentence transformer models. See the one we're using here.

So for example, if we go sentence transformers, you will see all of the default models used by the sentence transformers library. So we are using this MPNet model. We also specify that we're using sentence transformers model format. And when we initialize our retriever, we also need to add the document store that we'll be retrieving documents from.
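A sketch of the retriever setup; the exact model name is an assumption for the flax-sentence-embeddings MPNet model mentioned above, and any 768-dimensional sentence transformer trained for cosine similarity would fit the document store we created:

```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    # assumed model name; swap in whichever sentence transformer you prefer
    embedding_model="flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    model_format="sentence_transformers",
)
```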

So we've already initialized our document store, so we just add that in there. And at this point, it's time for us to update those embeddings. So when we say update embeddings, what this is going to do is look at all of the documents that are already within your document store.

And it's going to use the retriever model that you pass here to embed them into vector representations of their contents. And then it's going to store those in your Pinecone vector database. That will take a little while to process. And at this point, we could run this get embedding count again, and we would get this 49995 value.
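The call itself is short; the batch size here is just an illustrative choice:

```python
# Embed every document that doesn't yet have a vector and upsert it to Pinecone
document_store.update_embeddings(retriever, batch_size=128)

print(document_store.get_embedding_count())  # roughly 50,000 once it finishes
```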

Now another way that you can also see this number is if we go back to our Pinecone dashboard, we can head over to our index, so Haystack LFQA. We click on that, scroll down, and we can click on index info. And then we can see the total number of vectors, which is the same.

So that number will be reflected in your vector database once you have updated the embeddings using your retriever model. And at that point, we can just test the first part of our LFQA pipeline, which is just a document store and a retriever. So we initialize this document search pipeline with our retriever model, and we can ask the question, when was the first electric power system built?
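A sketch of that retrieval-only test:

```python
from haystack.pipelines import DocumentSearchPipeline

search_pipe = DocumentSearchPipeline(retriever)

result = search_pipe.run(
    query="When was the first electric power system built?",
    params={"Retriever": {"top_k": 3}},
)
for doc in result["documents"]:
    print(doc.content[:200])  # preview of each retrieved context
```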

And all this is going to do is retrieve the relevant context. It's not going to generate an answer yet. It's just going to retrieve what it thinks is the relevant context. So we have here electrical power system: in 1881, two electricians built the world's first power system in Godalming in England, which is pretty good.

So that's pretty cool. And what we now need to do is we have our document store or vector database, and then we have our retriever model. Now we need to initialize our generator model to actually generate those answers. So we come down here. We are going to be using a sequence-to-sequence generator.

And we are going to be using this model here. So this, again, you can find this on the Hugging Face Model Hub. And there are different generator models you can use here, but you do want to find one that has been trained for long-form question answering. So for example, we have the BART LFQA that you can find here, or you have the BART Explain Like I'm Five model that we can find here.

Now, I think the BART LFQA model seems to perform better, so we have gone with that. Also, it's been trained with a newer dataset. And yeah, we just initialize it like that.
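A sketch of that initialization; the model identifiers are my assumptions for the two models mentioned above:

```python
from haystack.nodes import Seq2SeqGenerator

# "vblagoje/bart_lfqa" is assumed for the BART LFQA model;
# "yjernite/bart_eli5" would be the older Explain-Like-I'm-Five alternative
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")
```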

Now, when we say sequence-to-sequence, that's because it is taking in a sequence of characters as input, and it's going to output a sequence of characters, i.e. the answer. And if you are curious, the input will look something like what you see here. So we have the question prefix, and then we have the user's query. It's followed by context. And then we have this P token here. And that P token indicates to the model the start of a new context that has been retrieved from our document store.
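Roughly, the concatenated input looks like this; the exact prompt format is an assumption based on how the BART LFQA model is usually fed:

```python
query = "When was the first electric power system built?"
contexts = [
    "first retrieved passage...",
    "second retrieved passage...",
    "third retrieved passage...",
]

# Each retrieved passage is prefixed with a <P> token
conditioned_docs = "<P> " + " <P> ".join(contexts)
model_input = f"question: {query} context: {conditioned_docs}"
print(model_input)
```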

So in this case, we've retrieved three contexts, and all of that is being passed to the generator model, where it will then generate an answer based on all of that. OK. So yeah, we just initialize the generator model, and then we initialize the generative Q and A pipeline. We pass in the generator and the retriever model.
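A sketch of the pipeline setup:

```python
from haystack.pipelines import GenerativeQAPipeline

# No document store argument here: it is already attached to the retriever
pipe = GenerativeQAPipeline(generator, retriever)
```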

We don't need to include document store here, because the document store has already been passed to the retriever model when we're initializing that. So it's almost like it's embedded within the retriever. So we don't need to worry about adding that in there. And then we can begin asking questions.

Now, this is where it starts to get, I think, more interesting. Now, one thing to make note of here is we have this top K parameter, and that's just saying, for our retriever, how many contexts to retrieve, and then for the generator, how many answers to generate.

So in this case, we're retrieving three contexts, and then we are generating one answer based on the query and those three contexts, like we saw in the example. So in this, I'm asking, what is the war of the currents? It's good to be specific to test this. And if we have the data within our dataset, it seems to be pretty good at pulling that out and producing a relatively accurate answer.
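Asking a question then looks something like this:

```python
result = pipe.run(
    query="What is the war of the currents?",
    params={
        "Retriever": {"top_k": 3},  # how many contexts to retrieve
        "Generator": {"top_k": 1},  # how many answers to generate
    },
)
```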

So the war of the currents was a rivalry between Thomas Edison and George Westinghouse's companies over which form of transmission, DC or AC, was superior. That's the answer, which is pretty cool. And we can see what it's pulled that from. So it's pulled it from this content, this content, and this content.

So there were three parts that got fed into the model. And that's good. We can see a lot of information there, but maybe we can see a little bit too much information. So we can actually use the print answers utility to minimize what we're outputting there. And here we get just this, which is obviously a lot easier to read.
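That utility call is just:

```python
from haystack.utils import print_answers

print_answers(result, details="minimum")
```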

So we just pass our result into print answers and specify details of minimum. The rest of that is the same as what we asked before. So it's much more readable. Now one thing to point out here is that this is actually a very good answer, but maybe there's not that much detail.

Now, if we find that we're not getting much detail in our answers or that the answer is just wrong, what the issue might be is first, the retrieved context may not contain any relevant information for the model to actually view and answer the question correctly. So it's not retrieving relevant information from that external open book document source.

And the second is the model memory. You remember I mentioned that these models have a memory; the model may also not be able to find any relevant information within its memory for your particular query. If both of those conditions are not satisfied, so we don't have relevant information coming from the external source and we don't have relevant information coming from the model memory, the generator is usually going to output something nonsensical.

So in this scenario, we have two options really. The generator model, we can increase its size so we can use a larger generator model because larger generator models have more model parameters, which means they have basically more memory that they have learned during training. Or we can increase the amount of data that we are pulling from the document store.

So if we are just returning three documents or contexts, we can increase it to 10, because then the generator is being fed a lot more information, and it might be that the correct information that we need comes in maybe context five, or context six, or context nine. And the generator will see that and be like, OK, that's the answer.

I'm going to reformulate this into my answer. So we can try that here. Now, we already got a good answer, but we can just see what we get if we increase the retriever's top K. So the retrieved number of documents, we increase that to 10. And we see that we get this much longer chunk of text now.

And I think the first half of this is relatively accurate. So we have this in 1891, first power system was installed in the United States. I think that's relatively correct. And then it starts to get a little bit silly after that because we've pulled more context from our document store.

But with that, we have pulled in more irrelevant information because we're retrieving 10 now. So there's a good chance that the last few of those are not relevant. So we're feeding a lot of irrelevant information into our generator model. And so it starts to get confused and then it can start to ramble like we see here.

So that's what we see happening. Another thing I just want to point out is that the generator has this memory. So a lot of people always think when they hear, okay, the generator has memory, does that mean I don't need the document store? Because we have this memory, can't I just fine tune the model so that it knows everything within my particular use case?

In some cases, yes, you might be able to do that. But it generally only works for more general questions or general knowledge. If you start to get specific, it tends to fail with that sort of memory part because the memory can only source so much information. And in the end, what you will probably need is you want a model with good memory so it can maybe pull out some facts from there.

But for anything specific, it's probably going to need to refer to its document store. So what we have done here is we've asked the same question, but this time I've replaced the retrieved documents with just nothing. And we can see the result of that straight away. So the answer is, I'm not sure what you mean by war.

So it has no idea what the war of the currents is. It doesn't have that information within its memory. So without that external document source, it doesn't know what to say. It's just, OK, I don't even know what this war is. But like I said, in some cases, particularly when you're asking a more general knowledge query, it will be able to pull the answer out from its memory.

So who was the first person on the moon? It knows this because it's such a common thing to know. It's probably seen it in the training data that the model has been trained on a million times. Maybe not a million, but a few times at least. So that is the first man to walk on the moon was Neil Armstrong.

OK, cool. So I think that's pretty much it. We can ask a few more questions. When was the first electrical power system built? So we ask this in the start, and it will give us this answer. If we want to confirm that this is correct-- so this is what I did with this.

I was a bit confused because Google was telling me something different. You can print out the contents using this. So we loop through the result documents, and we just print each document's content. And this, OK, so two electricians built the first power system at Godalming in England. So that information is actually coming from somewhere.
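The loop is simply:

```python
# Inspect the raw passages the answer was generated from
for doc in result["documents"]:
    print(doc.content)
```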

It's not just making it up. So that can be really useful. Another thing just to be aware of with generators is that they can generate misleading information. So you need to be careful with that. So for example, in this one, I asked, where did COVID-19 originate? Now, this is pretty unfair because the generator probably hasn't seen anything about COVID-19.

And at the same time, it doesn't have any COVID-19 information within its document store because we only indexed history sections, not anything else. So it just says COVID-19 isn't a virus, it's a bacterium, which is pretty wrong straight away, because it is a virus. So that's just one example of where you need to be cautious with this sort of thing, because it can give completely wrong answers if it doesn't have the relevant information available to it.

So with that, there's a couple of things you could do to mitigate that. You can, one, just include the sources of information. If you build some sort of search interface, make sure you include those so users can look at that and see where this information is coming from. And two, there are confidence scores that are given to these answers.

So you could put a threshold on those. So you say anything below 0.2 confidence, we just don't show, or we show "I'm not confident in this answer, but it might be this", or something along those lines. So that's just one drawback. We'll just go through a few final questions. So, what was NASA's most expensive project?

I would say the Space Shuttle project. That's correct. Tell me something interesting about the history of the Earth. In this case, what it gives is not really history, I don't think. But it does give us an interesting fact about the magnetic field being weak compared to the rest of the solar system.

I don't know if that's true or not. It seems like it might not be. When it says compared to the rest of the solar system, I'm thinking, is it weak compared to Mars? I don't think so. So that might not be true. Another thing to be wary of. Who created the Nobel Prize and why?

So this one is correct and I think quite interesting. And how is the Nobel Prize funded? We kind of see it down here, so I know the information is in there, hence why I've asked the question. And it tells you that as well with a little bit more information.

So that is it for long-form question answering with Haystack. As I said at the start, I think question answering is one of the most widely applicable forms of NLP or use cases of NLP. It can be applied almost everywhere. So it's a really good one to just go away and see maybe I can implement document search in my organization or I can create some sort of internal search engine that helps people in some way.

And I think in a lot of organizations, it's very possible to do this and add a lot of benefit and reduce a lot of friction in day-to-day processes of most companies. So that's it for this video. I hope it's been useful and I will see you in the next one.