Long Form Question Answering (LFQA) in Haystack
Chapters
0:00 Intro
4:20 Approaches to Question Answering
5:43 Components of QA Pipeline
8:58 LFQA Generator
9:40 Haystack Setup
10:32 Initialize Document Store
13:02 Getting Data
17:53 Indexing Embeddings
21:51 Initialize Generator
24:10 Asking Questions
26:12 Common Problems
29:32 Generator Memory
31:30 Few More Questions
34:54 Outro
00:00:00.600 |
Today, we are going to talk about a subdomain of question answering called long-form question 00:00:08.480 |
answering. Now, before we get into the specifics, let's just talk very quickly about question 00:00:14.160 |
answering as a subdomain of NLP. Question answering has, I think, exploded as a subdomain 00:00:23.100 |
of NLP in the past few years, mainly because I think question answering is an incredibly 00:00:31.300 |
widely applicable use case for NLP. But it wasn't possible to do question answering, or 00:00:39.000 |
at least not do anything good with it, until we had transformer models like BERT. 00:00:45.500 |
So as soon as we got something like BERT, question answering became viable, 00:00:51.720 |
and with the huge number of use cases for question answering, it obviously kind of took 00:00:59.300 |
off. Now, question answering is quite complicated, but at its core, it's basically just the retrieval 00:01:08.500 |
of information in a more human-like way. And when we consider this, I think it makes it 00:01:17.320 |
really clear how broadly applicable question answering is, because almost every organization 00:01:24.800 |
in the world, if not all, are going to need to retrieve information. And for a lot of 00:01:32.960 |
companies and particularly larger organizations, I think the act of information retrieval is 00:01:40.120 |
actually a big component of their day-to-day operations. Now, at the moment, most organizations 00:01:49.240 |
do information retrieval across a suite of tools. So they will have people using some 00:01:57.640 |
sort of internal search tools, which are typically keyword-based and generally not always 00:02:06.080 |
that helpful. Sometimes it's useful, but a lot of the time, it's not great. 00:02:12.720 |
Then another key form of information retrieval in most organizations is literally person 00:02:19.840 |
to person. So you go and ask someone who you think will probably know where some information 00:02:25.600 |
is, like a document or so on. And obviously, this sort of patchwork of information retrieval, 00:02:34.200 |
to an extent, sure it works, but it's inefficient. Now, if we consider that many organizations 00:02:42.720 |
contain thousands of employees, each of those employees producing pages upon pages of unstructured 00:02:50.760 |
data, e.g. pages of documents and texts that are meant for human consumption, in most cases, 00:02:59.040 |
all of that information is just being lost in some sort of void. 00:03:04.400 |
And rather than that information being lost in a void that we're never going to see again 00:03:09.720 |
and it becomes useless to the organization or the company, we can instead place it in 00:03:16.960 |
a database that a question answering agent has access to. And when we ask a question 00:03:25.360 |
to that Q&A agent, which we ask in a human-like way, it will go and retrieve the relevant 00:03:30.420 |
information for us instantly. Well, not instantly, but pretty close. The majority of data in 00:03:37.600 |
the world is unstructured. And there's a few different sources for this, but I think places 00:03:43.480 |
like Forbes estimate that number to be around 90% of the world's data. So in your organization, 00:03:52.320 |
you probably have a number similar to this. So 90% of your data is unstructured. That 00:03:58.240 |
means it's meant for human consumption, not machines. And it means it's liable to get 00:04:03.400 |
lost in that void where we're just never going to see that information ever again. Now, that's 00:04:08.760 |
massively inefficient. Question answering is an opportunity to not lose that and actually 00:04:17.140 |
benefit from that information. Now, in question answering, there are two main approaches. 00:04:25.940 |
In both cases of question answering, we store those documents in, or usually we store those 00:04:34.040 |
documents in a document store or vector database. So these documents are what we would call 00:04:40.520 |
sentences or paragraphs extracted from your, for example, PDFs or emails or whatever unstructured 00:04:48.920 |
data you have out there. And we retrieve data from that. And then the next step is where 00:04:57.080 |
we have the two different forms of question answering. With that relevant information 00:05:02.440 |
that we have from our document store, based on a query that we've passed through there, 00:05:09.160 |
we either generate an answer or we extract an answer. So obviously, when we're generating 00:05:16.480 |
an answer, we look at all of the context that we've retrieved and we use an NLP model to 00:05:23.240 |
generate some sort of human answer to our query based on that information. Otherwise, 00:05:32.220 |
we use an extractive model, which is literally going to take a snippet of text directly from the retrieved context as the answer. 00:05:42.600 |
So there's a few components that I just described there. There was a document store at the start. 00:05:50.220 |
When we're using a document store, which we will in most cases I'd imagine, we call that 00:05:56.560 |
open book question answering. Now, the reason it's called open book is it is like students 00:06:02.280 |
in an exam. We have a typical exam. You don't have any outside materials to refer to. You 00:06:10.720 |
have to rely on what is in your brain. That's very similar to using, for example, a generator 00:06:17.800 |
model that, given a question, it doesn't refer to any document store. It just refers to what 00:06:24.320 |
is within its own memory or its own model memory. And that model memory has been built 00:06:31.280 |
during model training. So that would be referred to as closed book, generative or abstractive question answering. 00:06:39.960 |
On the other hand, we have a document store. So that document store is like we are in our 00:06:44.880 |
exam as students. And we have an open book that we can refer to for information. So we're 00:06:50.800 |
not just relying on what is in our head. We're looking at the information in this book. And 00:06:56.240 |
we still need to rely on the knowledge in our head in order to apply what is in that 00:07:01.800 |
book to the questions we're given in the exam. It's exactly the same for open book abstractive 00:07:10.100 |
question answering in that you have the generator model. But we're not just relying on a generator 00:07:17.760 |
model to answer our questions. We are also relying on a document store, which is our 00:07:22.520 |
book and what is called a retrieval model. And this retrieval model is going to take 00:07:29.320 |
our question. It will encode it into a vector embedding and take it to that document store, 00:07:39.520 |
which is actually just a vector database in our scenario of what we're doing. And in a 00:07:48.560 |
vector database, what you have is lots of other vector embeddings, which are essentially 00:07:54.740 |
numerical representations of the documents that you stored in it before. So remember, 00:08:00.720 |
documents are those chunks of paragraph or sentences from different sources. That vector 00:08:06.400 |
database has loads of these what we call context vectors. And we pass our query vector into 00:08:17.360 |
that document store or vector database, and we retrieve the most similar context vectors 00:08:24.160 |
from there and pass them back to our retrieval pipeline. Then that is passed to our generator 00:08:32.440 |
model. Our generator model is going to see the query followed by the set of retrieved 00:08:39.280 |
relevant, hopefully, context. And it uses all of that to generate an answer. So we can 00:08:48.360 |
see with this open book format, we are passing a lot more information into the generator, 00:08:53.960 |
which allows the generator to answer more specific questions. Now, long form question 00:09:00.960 |
answering, which is what we are going to go through, is one form of this abstractive question 00:09:09.180 |
answering. The one thing that makes long-form question answering "long form" is that 00:09:14.960 |
the generator model has been trained 00:09:20.960 |
to produce a multi-sentence output. So rather than just outputting maybe an answer of three 00:09:29.600 |
or four words or one sentence, it is going to try and output a full paragraph answer 00:09:36.280 |
to you. So that's long form question answering, or LFQA. 00:09:40.760 |
So we are going to implement LFQA in Haystack. Haystack is a very popular NLP library, mainly 00:09:51.780 |
for question answering. Now, to install Haystack and the other libraries that we need today, 00:09:59.140 |
we do this. So we have pip install. We need the pinecone-client, farm-haystack with 00:10:06.180 |
pinecone specified as an extra, datasets, and pandas. Actually, I think you can ignore pandas. Let's remove 00:10:15.700 |
that. So just these three here. With farm-haystack, we are going to be using something 00:10:23.980 |
called a Pinecone document store. So for that, you need version 1.3 or above. Now, 00:10:32.380 |
to initialize that Pinecone document store, so remember the document store is that thing 00:10:37.520 |
that you saw on the right before, where we're storing all of our context vectors. We will 00:10:44.780 |
do this. So we first need an API key from Pinecone. So there's a link here. I'll just 00:10:50.680 |
open it and show you quickly. And that will bring you to this page here. Now, you can 00:10:56.420 |
sign up for free. You don't need to pay for anything. And we don't need to pay for anything 00:11:00.660 |
to do what we're doing here either. It's all completely free. So you just sign up. And 00:11:06.380 |
once you've signed up, you will see it should just be one project on your homepage. So for 00:11:13.740 |
me, it is the default project, James's default project. So you can go into that. And then 00:11:20.660 |
on the left over here, we have API keys. So we open that. And we get our default API key. 00:11:28.000 |
We can just copy it. So we come over here. And we use that to authenticate our Pinecone 00:11:35.600 |
document store back in our code. So I would paste that here. And with that, we just run 00:11:42.720 |
this. So we are initializing our document store. We are calling our index. So remember, 00:11:48.480 |
document store is actually a vector database in this case. And inside that vector database, 00:11:54.680 |
we have what's called an index. The index is basically the list of all the context vectors 00:12:00.160 |
that we have. We call that index haystack LFQA. Now, you can call it whatever you want. 00:12:06.920 |
But when you are wanting to load this document store again, you need to specify the correct 00:12:14.320 |
index. That's all. That's the only difference it makes. Similarity, we're using cosine similarity. 00:12:20.420 |
And we're using embedding dimensions 768. Now, it's important to align this to whatever 00:12:28.880 |
the similarity metric and embedding dimension of your retrieval model is. In our case, cosine 00:12:36.600 |
and 768. These are pretty typical retriever model metrics and dimensionalities. 00:12:44.200 |
Now, we can go down. We can check our metric type. We can also see the number of documents 00:12:51.240 |
and the embeddings that we have in there. Now, we don't have any at the moment because 00:12:56.760 |
we haven't pushed anything to our document store. We don't have any data. So we need 00:13:02.840 |
to get some data. For that, we are going to use Hugging Face datasets. So over here. We're 00:13:12.180 |
going to use this dataset here, which is a set of snippets from Wikipedia. There are 00:13:21.280 |
a lot of them. In full, this dataset is 9 gigabytes. Now, to avoid downloading this 00:13:27.820 |
full dataset, what we do is set streaming equal to true. And what this will do is allow 00:13:33.700 |
us to iteratively load one record at a time from this dataset. 00:13:39.840 |
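A sketch of that streaming load; the dataset identifier below is an assumption (a Wikipedia snippets dataset on the Hugging Face hub), so swap in whichever snippets dataset you are using:

```python
from datasets import load_dataset

# streaming=True avoids downloading the full ~9GB; records are pulled one at a time
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",  # assumed dataset ID
    split="train",
    streaming=True
)

# Peek at a single record to see the fields we care about
print(next(iter(wiki_data)))
```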
And we can check what we have inside that dataset by running this. So next, we create 00:13:45.480 |
an iterable from our dataset. And we see this. So the main things to take note of here are 00:13:54.060 |
section title and passage text. Passage text is going to create our context or that document. 00:14:02.920 |
And there are a couple of other things. So history is going to be what we are going to 00:14:08.200 |
filter for in our dataset. This is a very big dataset, and I don't want to process all 00:14:12.880 |
of it. So I'm restricting our scope to just history, and we're going to only return a 00:14:18.200 |
certain number of records from that section. That's important to us purely for that filtering 00:14:24.560 |
out of other sections or section titles. And we will include article title as metadata 00:14:36.700 |
in our documents, although it's not really important because we're not actually going 00:14:39.820 |
to use it. It's just so you can see how you would include metadata in there in case you 00:14:45.640 |
did want to use it. So here, what we're doing is filtering only for documents that have 00:14:52.740 |
the section title history. And we just get this iterable object because we're streaming. 00:15:00.380 |
So it just knows now when we're streaming one by one, when it's pulling an object, it's 00:15:06.700 |
going to check if that object section title starts with history. If it does, it will pull 00:15:13.100 |
it. If not, it will move on to the next one. So we're just going to pull those with history. 00:15:18.600 |
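That filter, as a minimal sketch (assuming the field name section_title seen in the record above):

```python
# Lazily keep only records whose section title starts with "History";
# nothing is downloaded until we actually iterate over the result
history = wiki_data.filter(
    lambda record: record["section_title"].startswith("History")
)
```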
Now what we need to do is process those and add them to our document store. Now what I've 00:15:25.620 |
done here is said, "Okay, we are only going to pull 50,000 of those and no more." At that 00:15:32.580 |
point, we cut off. And it's actually, it cuts off just before 50,000. And what we're going 00:15:38.660 |
to do is we're going to add them in batches. So we're going to loop through all of, or 00:15:44.020 |
we're going to pull all of these records. We're going to collect 10,000 of them, and 00:15:48.420 |
then we're going to add them to our document store. And this is a Haystack document object. 00:15:57.380 |
So we have a content. The content is the document text, that big paragraph you saw before. Meta 00:16:04.540 |
is any metadata that we'd like to add in there. Now with the Pinecone document store, we can 00:16:09.460 |
use metadata filtering, although I won't show you how to do that here. But that can be really 00:16:14.520 |
useful if it's something you're interested in. So that's how you'd add metadata to your 00:16:20.220 |
document as well. And all I'm doing is adding that doc to a docs list. And we increase the 00:16:28.700 |
counter. And once the counter hits the batch size, which is the 10,000, we write those 00:16:38.100 |
documents to our document store. Now you will remember I said the document store is a vector 00:16:44.900 |
database, and inside the vector database, we have vectors. At the moment, when we write 00:16:49.740 |
those documents, we're not actually creating those vectors, because we haven't specified 00:16:53.820 |
the retriever model yet. We're going to do that later. So at the moment, what we're doing 00:16:58.860 |
is kind of adding the documents as just plain text to almost be ready to be processed into 00:17:07.700 |
vectors to put into that vector database. So it's almost like they're in limbo, waiting to be embedded. 00:17:18.500 |
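Putting that together, the indexing loop described above might look roughly like this; treat it as a sketch, and note the metadata field is just an illustration of how you could attach it:

```python
from haystack import Document

total_doc_count = 50_000  # stop once we've pulled roughly this many records
batch_size = 10_000       # write to the document store in batches

counter = 0
docs = []
for record in history:
    # wrap each passage in a Haystack Document, keeping the article title as metadata
    docs.append(Document(
        content=record["passage_text"],
        meta={"article_title": record["article_title"]}
    ))
    counter += 1
    if counter % batch_size == 0:
        # push this batch of plain-text documents (no embeddings yet)
        document_store.write_documents(docs)
        docs = []
    if counter == total_doc_count:
        break
```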
So we add all of those. It can take a little bit of time, not too long, though. And then 00:17:24.180 |
once we hit or get close to 50,000, we break. So we stop the loop. And then we can see, 00:17:32.100 |
if we get the document count, we see that we have the almost 50,000 documents in there. 00:17:38.340 |
But then when we look at the embedding count, zero. And that's because they're waiting to 00:17:44.620 |
be added into the vector database as vectors. So they exist as documents, but they don't have embeddings yet. 00:17:53.860 |
So what we now need to do is convert those documents into vector embeddings. Now, to 00:18:04.060 |
do that, we need a retriever model. Now, at this point, it's probably best to check if 00:18:12.700 |
you have a GPU that is available, like a CUDA-enabled GPU. If you don't, this step will take longer, 00:18:21.340 |
unfortunately. But if you do, that's great, because this will be pretty quick in most cases. 00:18:31.300 |
So we initialize our retriever model. So we're using the embedding retriever. And this allows 00:18:38.380 |
us to use what are called sentence transformer models from the sentence transformers library. 00:18:43.380 |
Now, I'm using this model here. And we can find all the sentence transformer models over 00:18:49.980 |
on the HuggingFace model hub. So let's have a quick look at that. 00:18:53.720 |
So we are here, HuggingFace.co/models. And I can paste that model name. Maybe I'll just 00:19:02.420 |
do flax-sentence-embeddings. Now, flax-sentence-embeddings are a set of models that 00:19:07.180 |
were trained on a lot of data using the Flax library. But there are a lot of other sentence 00:19:14.740 |
transformer models. See the one we're using here. So for example, if we go sentence transformers, 00:19:21.660 |
you will see all of the default models used by the sentence transformers library. 00:19:28.100 |
So we are using this MPNet model. We also specify that we're using sentence transformers 00:19:34.980 |
model format. And when we initialize our retriever, we also need to add the document store that 00:19:40.740 |
we'll be retrieving documents from. So we've already initialized our document store, so we just pass that in. 00:19:49.900 |
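A sketch of that retriever initialization; the exact checkpoint name is an assumption (one of the flax-sentence-embeddings MPNet models), and any sentence-transformers model producing 768-dimensional, cosine-similarity embeddings should slot in here:

```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,  # the Pinecone document store we initialized earlier
    embedding_model="flax-sentence-embeddings/all_datasets_v3_mpnet-base",  # assumed checkpoint
    model_format="sentence_transformers"
)
```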
And at this point, it's time for us to update those embeddings. So when we say update embeddings, 00:19:58.140 |
what this is going to do is look at all of the documents that are waiting in 00:20:04.140 |
your document store. And it's going to use the retriever model that you pass here and 00:20:09.500 |
embed them into vector representations of those contents. And then it's going to store 00:20:16.700 |
those in your Pinecone Vector database. That will be processed. And at this point, we could 00:20:23.620 |
run this get embedding count again, and we would get this 49995 value. 00:20:31.500 |
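That step is a single call on the document store, roughly:

```python
# Embed every document that is still waiting for a vector and store the vectors in Pinecone
document_store.update_embeddings(retriever)

# Should now roughly match the document count
print(document_store.get_embedding_count())
```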
Now another way that you can also see this number is if we go back to our Pinecone dashboard, 00:20:41.980 |
we can head over to our index, so Haystack LFQA. We click on that, scroll down, and we 00:20:51.220 |
can click on index info. And then we can see the total number of vectors, which is the 00:20:55.140 |
same. So that number will be reflected in your vector database once you have updated the embeddings. 00:21:06.180 |
And at that point, we can just test the first part of our LFQA pipeline, which is just a 00:21:13.820 |
document store and a retriever. So we initialize this document search pipeline with our retriever 00:21:19.900 |
model, and we can ask the question, when was the first electric power system built? And 00:21:25.900 |
all this is going to do is retrieve the relevant context. It's not going to generate an answer 00:21:33.100 |
yet. It's just going to retrieve what it thinks is the relevant context. 00:21:37.900 |
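A sketch of that retrieval-only check:

```python
from haystack.pipelines import DocumentSearchPipeline

search_pipe = DocumentSearchPipeline(retriever)
search_result = search_pipe.run(
    query="When was the first electric power system built?",
    params={"Retriever": {"top_k": 3}}
)

# Print the retrieved passages to see what came back
for doc in search_result["documents"]:
    print(doc.content)
```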
So we have here electrical power system in 1881. Two electricians built the world's first 00:21:46.140 |
power system in Godalming in England, which is pretty good. So that's pretty cool. And 00:21:56.060 |
what we now need to do is we have our document store or vector database, and then we have 00:22:01.140 |
our retriever model. Now we need to initialize our generator model to actually generate those answers. 00:22:08.180 |
So we come down here. We are going to be using a sequence-to-sequence generator. And we are 00:22:16.340 |
going to be using this model here. So this, again, you can find this on the Hugging Face 00:22:21.900 |
Model Hub. And there are different generator models you can use here, but you do want to 00:22:29.780 |
find one that has been trained for long-form question answering. 00:22:33.420 |
So for example, we have the BART LFQA that you can find here, or you have the BART Explain 00:22:41.460 |
Like I'm Five model that we can find here. Now, I think the BART LFQA model seems to 00:22:48.840 |
perform better, so we have gone with that. Also, it's been trained with a newer dataset. 00:22:55.940 |
And yeah, we just initialize it like that. Now, when we say sequence-to-sequence, that's 00:23:00.620 |
because it is taking in a sequence of characters or some input, and it's going to output a 00:23:06.220 |
sequence of characters, e.g. the output, the answer. And if you are curious, the input format looks something like this. 00:23:16.420 |
So we have the word question, and then we have the user's query. It's followed by context. 00:23:21.780 |
And then we have this P token here. And that P token indicates to the model the start of 00:23:28.420 |
new context that has been retrieved from our document store. So in this case, we've retrieved 00:23:35.100 |
three contexts, and all of that is being passed to the generator model, where it will then generate an answer. 00:23:44.780 |
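To make that concrete, the string handed to the generator for three retrieved passages looks roughly like this (illustrative only; the exact delimiter token comes from how the model was trained):

```python
# Approximate shape of the generator input
model_input = (
    "question: what was the war of the currents? "
    "context: <P> first retrieved passage... "
    "<P> second retrieved passage... "
    "<P> third retrieved passage..."
)
```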
OK. So yeah, we just initialize the generator model, and then we initialize the generative 00:23:53.700 |
Q and A pipeline. We pass in the generator and the retriever model. We don't need to 00:23:58.580 |
include document store here, because the document store has already been passed to the retriever 00:24:03.700 |
model when we're initializing that. So it's almost like it's embedded within the retriever. 00:24:08.860 |
So we don't need to worry about adding that in there. 00:24:10.940 |
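A sketch of that generator and pipeline initialization, assuming the BART LFQA checkpoint referred to above is vblagoje/bart_lfqa:

```python
from haystack.nodes import Seq2SeqGenerator
from haystack.pipelines import GenerativeQAPipeline

# BART model fine-tuned for long-form question answering
generator = Seq2SeqGenerator(model_name_or_path="vblagoje/bart_lfqa")

# The document store is already attached to the retriever, so only these two are needed
pipe = GenerativeQAPipeline(generator, retriever)
```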
And then we can begin asking questions. Now, this is where it starts to get, I think, more 00:24:14.340 |
interesting. Now, one thing to make note of here is we have this top K parameter, and 00:24:20.460 |
that's just saying, in the case of our retriever model, how many contexts to retrieve, and 00:24:28.040 |
then for the generator, how many answers to generate. So in this case, we're retrieving 00:24:33.900 |
three contexts, and then we are generating one answer based on the query and those three contexts. 00:24:43.820 |
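As a sketch, a query with those two top_k values looks like this:

```python
result = pipe.run(
    query="what was the war of the currents?",
    params={
        "Retriever": {"top_k": 3},  # how many contexts to retrieve
        "Generator": {"top_k": 1}   # how many answers to generate
    }
)
print(result)
```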
So in this, I'm asking, what was the war of the currents? It's good to be specific to test 00:24:51.220 |
this. And if we have the data within our data set, it seems to be pretty good at pulling 00:24:58.960 |
that out and producing a relatively accurate answer. So the war of the currents was a rivalry 00:25:05.860 |
between Thomas Edison and George Westinghouse's companies over which form of transmission, 00:25:11.580 |
DC or AC, was superior. That's the answer, which is pretty cool. And we can see what 00:25:17.940 |
it's pulled that from. So it's pulled it from this content, this content, and this content. 00:25:28.580 |
So there were three parts that got fed into the model. 00:25:34.860 |
And that's good. We can see a lot of information there, but maybe we can see a little bit too 00:25:39.140 |
much information. So we can actually use the print answers utility to minimize what we're 00:25:46.900 |
outputting there. And here we get just this, which is obviously a lot easier to read. So 00:25:52.180 |
we just pass our result into print answers and specify details of minimum. The rest of 00:25:57.140 |
that is the same as what we asked before. So it's much more readable. 00:26:02.620 |
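That tidier output comes from the print_answers utility, roughly like so:

```python
from haystack.utils import print_answers

# details="minimum" prints just the answers rather than the full result object
print_answers(result, details="minimum")
```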
Now one thing to point out here is that this is actually a very good answer, but maybe 00:26:10.460 |
there's not that much detail. Now, if we find that we're not getting much detail in our 00:26:16.100 |
answers or that the answer is just wrong, what the issue might be is first, the retrieved 00:26:26.180 |
context may not contain any relevant information for the model to actually view and answer 00:26:33.800 |
the question correctly. So it's not retrieving relevant information from that external open book source. 00:26:44.100 |
And the second is the model's own memory; it may not be retrieving relevant 00:26:50.660 |
information from there either. You remember I mentioned that these models can have a memory. If it's 00:26:55.280 |
not able to find any relevant information within its memory for your particular query, 00:27:02.860 |
if both of those conditions are not satisfied, so we don't have relevant information coming 00:27:10.820 |
from the external source and we don't have relevant information coming from the model 00:27:14.140 |
memory, the generator is going to output usually something nonsensical. 00:27:21.580 |
So in this scenario, we have two options really. The generator model, we can increase its size 00:27:29.060 |
so we can use a larger generator model because larger generator models have more model parameters, 00:27:35.820 |
which means they have basically more memory that they have learned during training. Or 00:27:43.340 |
we can increase the amount of data that we are pulling from the document store. So if 00:27:50.420 |
we are just returning three documents or contexts, we can increase it to 10 because then the 00:27:57.780 |
generator is being fed a lot more information and it might be that the correct information 00:28:05.620 |
that we need may come in maybe context five or six or nine. And the generator 00:28:13.260 |
will see that and be like, OK, that's the answer, I'm going to reformulate this into a proper answer. 00:28:22.120 |
So we can try that here. Now, we already got a good answer, but we can just see what we 00:28:26.460 |
get if we increase the retriever. So our retrieved number of documents, we increase 00:28:32.500 |
that to 10. And we see that we get this much longer chunk of text now. And I think the 00:28:40.060 |
first half of this is relatively accurate. So we have this in 1891, first power system 00:28:49.220 |
was installed in the United States. I think that's relatively correct. And then it starts 00:28:56.780 |
to get a little bit silly after that because we've pulled more context from our document 00:29:05.660 |
store. But with that, we have pulled in more irrelevant information because we're retrieving 00:29:12.020 |
10 now. So there's a good chance that the last few of those are not relevant. So we're 00:29:16.740 |
feeding a lot of irrelevant information into our generator model. And so it starts to get 00:29:21.180 |
confused and then it can start to ramble like we see here. 00:29:27.420 |
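For reference, the higher-recall run described above is just a change to the retriever's top_k, something like:

```python
# Same pipeline, but feed the generator ten retrieved contexts instead of three
result = pipe.run(
    query="when was the first electrical power system built?",
    params={"Retriever": {"top_k": 10}, "Generator": {"top_k": 1}}
)
print_answers(result, details="minimum")
```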
So that's what we see happening. Another thing I just want to point out is that the generator 00:29:36.180 |
has this memory. So a lot of people always think when they hear, okay, the generator 00:29:43.060 |
has memory, does that mean I don't need the document store? Because we have this memory, 00:29:46.940 |
can't I just fine tune the model so that it knows everything within my particular use 00:29:51.300 |
case? In some cases, yes, you might be able to do that. But it generally only works for 00:29:59.500 |
more general questions or general knowledge. If you start to get specific, it tends to 00:30:05.980 |
fail with that sort of memory part because the memory can only source so much information. 00:30:12.260 |
And in the end, what you will probably need is you want a model with good memory so it 00:30:17.660 |
can maybe pull out some facts from there. But for anything specific, it's probably going to need that external document store. 00:30:26.140 |
So what we have done here is we've asked the same question, but this time I've replaced 00:30:30.980 |
the retrieved documents with just nothing. And we can see the result of that straight away. 00:30:36.220 |
So the answer is, I'm not sure what you mean by war. So it has no idea what the war of the currents 00:30:43.300 |
is. It doesn't have that information within its memory. So without that external document 00:30:48.860 |
source, it doesn't know what to say. It's just, OK, I don't even know what war is. 00:30:57.380 |
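One way to approximate that memory-only test is to call the generator directly and hand it an effectively empty document in place of retrieved context; this is a sketch, not necessarily how it was done in the video:

```python
from haystack import Document

# With no useful context, the generator can only fall back on what it memorized in training
memory_only = generator.predict(
    query="what was the war of the currents?",
    documents=[Document(content="")],
    top_k=1
)
print(memory_only["answers"][0].answer)
```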
But like I said, in some cases, particularly when you're asking more general knowledge 00:31:02.700 |
queries, it will be able to pull that out from its memory. So who was the first person on 00:31:08.300 |
the moon? It knows this because it's such a common thing to know. It's probably seen 00:31:14.500 |
it in the training data that the model has been trained on a million times. Maybe not 00:31:19.340 |
a million, but a few times at least. So the answer is that the first man to walk on the moon was Neil Armstrong. 00:31:27.500 |
OK, cool. So I think that's pretty much it. We can ask a few more questions. When was 00:31:34.900 |
the first electrical power system built? So we ask this in the start, and it will give 00:31:38.980 |
us this answer. If we want to confirm that this is correct-- so this is what I did with 00:31:47.260 |
this. I was a bit confused because Google was telling me something different. You can 00:31:52.100 |
print out the contents using this. So we loop through the result documents, and we just 00:31:59.220 |
print dot content. And this, OK, so two electricians built the first power system at Godalming 00:32:06.820 |
in England. So that information is actually coming from somewhere. It's not just making it up. 00:32:12.220 |
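That check is just a loop over the documents returned alongside the answer:

```python
# See exactly which passages the answer was generated from
for doc in result["documents"]:
    print(doc.content)
```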
So that can be really useful. Another thing just to be aware of with generators is that 00:32:18.180 |
they can generate misleading information. So you need to be careful with that. So for 00:32:26.780 |
example, in this one, I asked, where did COVID-19 originate? Now, this is pretty unfair because 00:32:32.140 |
the generator probably hasn't seen anything about COVID-19. And at the same time, it doesn't 00:32:40.460 |
have any COVID-19 information within its document store because we looked at history, not anything related to COVID. 00:32:49.060 |
So it just says COVID-19 isn't a virus, which it is; it says it's a bacterium. So straightaway, 00:32:55.900 |
that's pretty wrong. So just one example of where you need to just be cautious with this 00:33:05.580 |
sort of thing because it can just give completely wrong answers if it doesn't have the relevant information. 00:33:12.820 |
So with that, there's a couple of things you could do to mitigate that. You can, one, just 00:33:18.180 |
include the sources of information. If you build some sort of search interface, make 00:33:23.340 |
sure you include those so users can look at that and see where this information is coming 00:33:27.700 |
from. And two, there are confidence scores that are given to these answers. So you could 00:33:37.500 |
put a threshold in. So you say anything below 0.2 confidence, we just don't show, or we show 00:33:46.220 |
"I'm not confident in this answer, but it might be this", or something along those lines. 00:33:54.580 |
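A sketch of that kind of thresholding, assuming the returned answers carry a score field (guarding for None, since not every generator populates it):

```python
threshold = 0.2

for ans in result["answers"]:
    if ans.score is not None and ans.score < threshold:
        print(f"I'm not confident in this answer, but it might be: {ans.answer}")
    else:
        print(ans.answer)
```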
So that's just one drawback. We'll just go through a few final questions. So what was 00:34:00.620 |
NASA's most expensive project? It says the Space Shuttle project. That's correct. 00:34:07.260 |
Tell me something interesting about the history of the Earth. In this case, what it gives 00:34:11.420 |
us is not really history, I don't think. But it does give us an interesting 00:34:17.180 |
fact about the magnetic field being weak compared to the rest of the solar system. I don't know 00:34:21.460 |
if that's true or not. It seems like it might not be. When it says compared to the rest 00:34:27.140 |
of the solar system, I'm thinking, is it weak compared to Mars? I don't think so. So that 00:34:32.580 |
might not be true. Another thing to be wary of. 00:34:35.300 |
Who created the Nobel Prize and why? So this one is correct and I think quite interesting. 00:34:42.100 |
And how is the Nobel Prize funded? We kind of see it down here, so I know the information 00:34:46.500 |
is in there, hence why I've asked the question. And it tells you that as well with a little 00:34:52.300 |
bit more information. So that is it for long-form question answering with Haystack. As I said 00:35:03.340 |
at the start, I think question answering is one of the most widely applicable forms of 00:35:09.860 |
NLP or use cases of NLP. It can be applied almost everywhere. So it's a really good one 00:35:18.700 |
to just go away and see maybe I can implement document search in my organization or I can 00:35:29.420 |
create some sort of internal search engine that helps people in some way. And I think 00:35:36.740 |
in a lot of organizations, it's very possible to do this and add a lot of benefit and reduce 00:35:44.420 |
a lot of friction in day-to-day processes of most companies. 00:35:50.940 |
So that's it for this video. I hope it's been useful and I will see you in the next one.