
Long Form Question Answering (LFQA) in Haystack


Chapters

0:00 Intro
4:20 Approaches to Question Answering
5:43 Components of QA Pipeline
8:58 LFQA Generator
9:40 Haystack Setup
10:32 Initialize Document Store
13:02 Getting Data
17:53 Indexing Embeddings
21:51 Initialize Generator
24:10 Asking Questions
26:12 Common Problems
29:32 Generator Memory
31:30 Few More Questions
34:54 Outro

Whisper Transcript

00:00:00.600 | Today, we are going to talk about a subdomain of question answering called long-form question
00:00:08.480 | answering. Now, before we get into the specifics, let's just talk very quickly about question
00:00:14.160 | answering as a subdomain of NLP. Question answering has, I think, exploded as a subdomain
00:00:23.100 | of NLP in the past few years, mainly because I think question answering is an incredibly
00:00:31.300 | widely applicable use case for NLP. But it wasn't possible to do question answering, or
00:00:39.000 | at least not anything good with question answering, until we had transformer models like BERT.
00:00:45.500 | So that means that as soon as we got something like BERT, the question answering became viable
00:00:51.720 | and with the huge number of use cases for question answering, it obviously kind of took
00:00:59.300 | off. Now, question answering is quite complicated, but at its core, it's basically just the retrieval
00:01:08.500 | of information in a more human-like way. And when we consider this, I think it makes it
00:01:17.320 | really clear how broadly applicable question answering is, because almost every organization
00:01:24.800 | in the world, if not all, are going to need to retrieve information. And for a lot of
00:01:32.960 | companies and particularly larger organizations, I think the act of information retrieval is
00:01:40.120 | actually a big component of their day-to-day operations. Now, at the moment, most organizations
00:01:49.240 | do information retrieval across a suite of tools. So they will have people using some
00:01:57.640 | sort of internal search tools, which are typically keyword-based, which is generally not always
00:02:06.080 | that helpful. Sometimes it's useful, but a lot of the time, it's not great.
00:02:12.720 | Then another key form of information retrieval in most organizations is literally person
00:02:19.840 | to person. So you go and ask someone who you think will probably know where some information
00:02:25.600 | is, like a document or so on. And obviously, this sort of patchwork of information retrieval,
00:02:34.200 | to an extent, sure it works, but it's inefficient. Now, if we consider that many organizations
00:02:42.720 | contain thousands of employees, each of those employees producing pages upon pages of unstructured
00:02:50.760 | data, e.g. pages of documents and texts that are meant for human consumption, in most cases,
00:02:59.040 | all of that information is just being lost in some sort of void.
00:03:04.400 | And rather than that information being lost in a void that we're never going to see again
00:03:09.720 | and it becomes useless to the organization or the company, we can instead place it in
00:03:16.960 | a database that a question answering agent has access to. And when we ask a question
00:03:25.360 | to that Q&A agent, which we ask in a human-like way, it will go and retrieve the relevant
00:03:30.420 | information for us instantly. Well, not instantly, but pretty close. The majority of data in
00:03:37.600 | the world is unstructured. And there's a few different sources for this, but I think places
00:03:43.480 | like Forbes estimate that number to be around 90% of the world's data. So in your organization,
00:03:52.320 | you probably have a number similar to this. So 90% of your data is unstructured. That
00:03:58.240 | means it's meant for human consumption, not machines. And it means it's liable to get
00:04:03.400 | lost in that void where we're just never going to see that information ever again. Now, that's
00:04:08.760 | massively inefficient. Question answering is an opportunity to not lose that and actually
00:04:17.140 | benefit from that information. Now, in question answering, there are two main approaches.
00:04:25.940 | In both cases of question answering, we store those documents in, or usually we store those
00:04:34.040 | documents in a document store or vector database. So these documents are what we would call
00:04:40.520 | sentences or paragraphs extracted from your, for example, PDFs or emails or whatever unstructured
00:04:48.920 | data you have out there. And we retrieve data from that. And then the next step is where
00:04:57.080 | we have the two different forms of question answering. With that relevant information
00:05:02.440 | that we have from our document store, based on a query that we've passed through there,
00:05:09.160 | we either generate an answer or we extract an answer. So obviously, when we're generating
00:05:16.480 | an answer, we look at all of the context that we've retrieved and we use an NLP model to
00:05:23.240 | generate some sort of human answer to our query based on that information. Otherwise,
00:05:32.220 | we use an extractive model, which is literally going to take a snippet of information from
00:05:39.080 | the data that we have retrieved.
00:05:42.600 | So there's a few components that I just described there. There was a document store at the start.
00:05:50.220 | When we're using a document store, which we will in most cases I'd imagine, we call that
00:05:56.560 | open book question answering. Now, the reason it's called open book is it is like students
00:06:02.280 | in an exam. We have a typical exam. You don't have any outside materials to refer to. You
00:06:10.720 | have to rely on what is in your brain. That's very similar to using, for example, a generator
00:06:17.800 | model that, given a question, it doesn't refer to any document store. It just refers to what
00:06:24.320 | is within its own memory or its own model memory. And that model memory has been built
00:06:31.280 | during model training. So that would be referred to as closed book, generative or abstractive question answering.
00:06:39.960 | On the other hand, we have a document store. So that document store is like we are in our
00:06:44.880 | exam as students. And we have an open book that we can refer to for information. So we're
00:06:50.800 | not just relying on what is in our head. We're looking at the information in this book. And
00:06:56.240 | we still need to rely on the knowledge in our head in order to apply what is in that
00:07:01.800 | book to the questions we're given in the exam. It's exactly the same for open book abstractive
00:07:10.100 | question answering in that you have the generator model. But we're not just relying on a generator
00:07:17.760 | model to answer our questions. We are also relying on a document store, which is our
00:07:22.520 | book and what is called a retrieval model. And this retrieval model is going to take
00:07:29.320 | our question. It will encode it into a vector embedding, takes it to that document store,
00:07:39.520 | which is actually just a vector database in our scenario of what we're doing. And in a
00:07:48.560 | vector database, what you have is lots of other vector embeddings, which are essentially
00:07:54.740 | numerical representations of the documents that you stored in it before. So remember,
00:08:00.720 | documents are those chunks of paragraph or sentences from different sources. That vector
00:08:06.400 | database has loads of these what we call context vectors. And we pass our query vector into
00:08:17.360 | that document store or vector database, and we retrieve the most similar context vectors
00:08:24.160 | from there and pass them back to our retrieval pipeline. Then that is passed to our generator
00:08:32.440 | model. Our generator model is going to see the query followed by the set of retrieved
00:08:39.280 | relevant, hopefully, context. And it uses all of that to generate an answer. So we can
00:08:48.360 | see with this open book format, we are passing a lot more information into the generator,
00:08:53.960 | which allows the generator to answer more specific questions. Now, long form question
00:09:00.960 | answering, which is what we are going to go through, is one form of this abstractive question
00:09:09.180 | answering. The only difference with -- or the one thing that makes long form question
00:09:14.960 | answering long form question answering is that the generator model has been trained
00:09:20.960 | to produce a multi-sentence output. So rather than just outputting maybe an answer of three
00:09:29.600 | or four words or one sentence, it is going to try and output a full paragraph answer
00:09:36.280 | to you. So that's long form question answering, or LFQA.
00:09:40.760 | So we are going to implement LFQA in Haystack. Haystack is a very popular NLP library, mainly
00:09:51.780 | for question answering. Now, to install Haystack and the other libraries that we need, today
00:09:59.140 | we do this. So we have pip install. We need the Pinecone client, farm-haystack, specifying
00:10:06.180 | pinecone in there, datasets, and pandas. Actually, I think you can ignore pandas. Let's remove
00:10:15.700 | that. So just these three here. With farm-haystack, we are going to be using something
00:10:23.980 | called a Pinecone document store. So for that, you need either version 1.3 or above. Now,
00:10:32.380 | to initialize that Pinecone document store, so remember the document store is that thing
00:10:37.520 | that you saw on the right before, where we're storing all of our context vectors. We will
00:10:44.780 | do this. So we first need an API key from Pinecone. So there's a link here. I'll just
00:10:50.680 | open it and show you quickly. And that will bring you to this page here. Now, you can
00:10:56.420 | sign up for free. You don't need to pay for anything. And we don't need to pay for anything
00:11:00.660 | to do what we're doing here either. It's all completely free. So you just sign up. And
00:11:06.380 | once you've signed up, you will see it should just be one project on your homepage. So for
00:11:13.740 | me, it is the default project, James's default project. So you can go into that. And then
00:11:20.660 | on the left over here, we have API keys. So we open that. And we get our default API key.
00:11:28.000 | We can just copy it. So we come over here. And we use that to authenticate our Pinecone
00:11:35.600 | document store back in our code. So I would paste that here. And with that, we just run
00:11:42.720 | this. So we are initializing our document store. We are calling our index. So remember,
00:11:48.480 | document store is actually a vector database in this case. And inside that vector database,
00:11:54.680 | we have what's called an index. The index is basically the list of all the context vectors
00:12:00.160 | that we have. We call that index haystack-lfqa. Now, you can call it whatever you want.
00:12:06.920 | But when you are wanting to load this document store again, you need to specify the correct
00:12:14.320 | index. That's all. That's the only difference it makes. Similarity, we're using cosine similarity.
00:12:20.420 | And we're using embedding dimensions 768. Now, it's important to align this to whatever
00:12:28.880 | the similarity metric and embedding dimension of your retrieval model is. In our case, cosine
00:12:36.600 | and 768. These are pretty typical retriever model metrics and dimensionalities.
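For reference, the install command and document store initialization described above might look roughly like this, following the farm-haystack 1.x Pinecone integration (the API key placeholder is yours to fill in):

```python
# Install Haystack with the Pinecone extra, plus the other libraries used here:
#   pip install pinecone-client farm-haystack[pinecone] datasets

from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key="<YOUR_PINECONE_API_KEY>",  # copied from the Pinecone console
    index="haystack-lfqa",              # index name, call it whatever you like
    similarity="cosine",                # must match the retriever's similarity metric
    embedding_dim=768                   # must match the retriever's output dimension
)
```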
00:12:44.200 | Now, we can go down. We can check our metric type. We can also see the number of documents
00:12:51.240 | and the embeddings that we have in there. Now, we don't have any at the moment because
00:12:56.760 | we haven't pushed anything to our document store. We don't have any data. So we need
00:13:02.840 | to get some data. For that, we are going to use Hugging Face datasets. So over here. We're
00:13:12.180 | going to use this dataset here, which is a set of snippets from Wikipedia. There are
00:13:21.280 | a lot of them. In full, this dataset is 9 gigabytes. Now, to avoid downloading this
00:13:27.820 | full dataset, what we do is set streaming equal to true. And what this will do is allow
00:13:33.700 | us to iteratively load one record at a time from this dataset.
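As a sketch, loading the Wikipedia snippets dataset in streaming mode might look like this (the exact dataset identifier is an assumption, it isn't named in the audio):

```python
from datasets import load_dataset

# Stream the dataset rather than downloading all ~9 GB up front
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",  # assumed dataset ID
    split="train",
    streaming=True
)

# Peek at a single record to see the available fields
next(iter(wiki_data))
# e.g. {'article_title': ..., 'section_title': ..., 'passage_text': ..., ...}
```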
00:13:39.840 | And we can check what we have inside that dataset by running this. So next, we create
00:13:45.480 | an iterable from our dataset. And we see this. So the main things to take note of here are
00:13:54.060 | section title and passage text. Passage text is going to create our context or that document.
00:14:02.920 | And there are a couple of other things. So history is going to be what we are going to
00:14:08.200 | filter for in our dataset. This is a very big dataset, and I don't want to process all
00:14:12.880 | of it. So I'm restricting our scope to just history, and we're going to only return a
00:14:18.200 | certain number of records from that section. That's important to us purely for that filtering
00:14:24.560 | out of other sections or section titles. And we will include article title as metadata
00:14:36.700 | in our documents, although it's not really important because we're not actually going
00:14:39.820 | to use it. It's just so you can see how you would include metadata in there in case you
00:14:45.640 | did want to use it. So here, what we're doing is filtering only for documents that have
00:14:52.740 | the section title history. And we just get this iterable object because we're streaming.
00:15:00.380 | So it just knows now when we're streaming one by one, when it's pulling an object, it's
00:15:06.700 | going to check if that object section title starts with history. If it does, it will pull
00:15:13.100 | it. If not, it will move on to the next one. So we're just going to pull those with history.
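Filtering the stream down to history passages can be done lazily with the datasets filter method, something like:

```python
# Keep only records whose section title starts with "History";
# because we're streaming, this check runs as records are pulled one by one
history = wiki_data.filter(
    lambda record: record["section_title"].startswith("History")
)
```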
00:15:18.600 | Now what we need to do is process those and add them to our document store. Now what I've
00:15:25.620 | done here is said, "Okay, we are only going to pull 50,000 of those and no more." At that
00:15:32.580 | point, we cut off. And it's actually, it cuts off just before 50,000. And what we're going
00:15:38.660 | to do is we're going to add in a single batch. So we're going to loop through all of, or
00:15:44.020 | we're going to pull all of these records. We're going to collect 10,000 of them, and
00:15:48.420 | then we're going to add them to our document store. And this is a Haystack document object.
00:15:57.380 | So we have a content. The content is the document text, that big paragraph you saw before. Meta
00:16:04.540 | is any metadata that we'd like to add in there. Now with the Pinecone document store, we can
00:16:09.460 | use metadata filtering, although I won't show you how to do that here. But that can be really
00:16:14.520 | useful if it's something you're interested in. So that's how you'd add metadata to your
00:16:20.220 | document as well. And all I'm doing is adding that doc to a docs list. And we increase the
00:16:28.700 | counter. And once the counter hits the batch size, which is the 10,000, we write those
00:16:38.100 | documents to our document store. Now you will remember I said the document store is a vector
00:16:44.900 | database, and inside the vector database, we have vectors. At the moment, when we write
00:16:49.740 | those documents, we're not actually creating those vectors, because we haven't specified
00:16:53.820 | the retriever model yet. We're going to do that later. So at the moment, what we're doing
00:16:58.860 | is kind of adding the documents as just plain text to almost be ready to be processed into
00:17:07.700 | vectors to put into that vector database. So it's almost like they're in limbo, waiting
00:17:12.820 | to be added to our database.
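A rough sketch of that indexing loop, assuming the field names above and a 10,000-document batch size:

```python
from haystack import Document

total_doc_count = 50_000   # stop after roughly this many passages
batch_size = 10_000        # write to the document store in batches

counter = 0
docs = []
for record in history:
    # Wrap each passage as a Haystack Document, with titles kept as metadata
    doc = Document(
        content=record["passage_text"],
        meta={
            "article_title": record["article_title"],
            "section_title": record["section_title"]
        }
    )
    docs.append(doc)
    counter += 1
    if counter % batch_size == 0:
        document_store.write_documents(docs)
        docs = []
    if counter == total_doc_count:
        break
```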
00:17:18.500 | So we add all of those. It can take a little bit of time, not too long, though. And then
00:17:24.180 | once we hit or get close to 50,000, we break. So we stop the loop. And then we can see,
00:17:32.100 | if we get the document count, we see that we have the almost 50,000 documents in there.
00:17:38.340 | But then when we look at the embedding count, zero. And that's because they're waiting to
00:17:44.620 | be added into the vector database, the text documents. So they exist as documents. They
00:17:50.760 | just don't exist as embeddings yet.
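At this point the counts would show something like:

```python
document_store.get_document_count()   # ~50,000 documents written
document_store.get_embedding_count()  # 0, no vectors until we run a retriever
```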
00:17:53.860 | So what we now need to do is convert those documents into vector embeddings. Now, to
00:18:04.060 | do that, we need a retriever model. Now, at this point, it's probably best to check if
00:18:12.700 | you have a GPU that is available, like a CUDA-enabled GPU. If you don't, this step will take longer,
00:18:21.340 | unfortunately. But if you do, that's great, because this will be pretty quick in most
00:18:26.380 | cases, depending on your GPU, of course.
00:18:31.300 | So we initialize our retriever model. So we're using the embedding retriever. And this allows
00:18:38.380 | us to use what are called sentence transformer models from the sentence transformers library.
00:18:43.380 | Now, I'm using this model here. And we can find all the sentence transformer models over
00:18:49.980 | on the HuggingFace model hub. So let's have a quick look at that.
00:18:53.720 | So we are here, HuggingFace.co/models. And I can paste that model name. Maybe I'll just
00:19:02.420 | do flax sentence embeddings. Now, flax-sentence-embeddings are a set of models that
00:19:07.180 | were trained on a lot of data using the Flax library. But there are a lot of other sentence
00:19:14.740 | transformer models. See the one we're using here. So for example, if we go sentence transformers,
00:19:21.660 | you will see all of the default models used by the sentence transformers library.
00:19:28.100 | So we are using this MPNet model. We also specify that we're using sentence transformers
00:19:34.980 | model format. And when we initialize our retriever, we also need to add the document store that
00:19:40.740 | we'll be retrieving documents from. So we've already initialized our document store, so
00:19:46.300 | we just add that in there.
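The retriever initialization described here might look like the following (the exact model ID is an assumption based on the MPNet flax-sentence-embeddings model mentioned above):

```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    # assumed model ID; any 768-dim sentence-transformers model trained for
    # cosine similarity should slot in here
    embedding_model="flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    model_format="sentence_transformers"
)
```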
00:19:49.900 | And at this point, it's time for us to update those embeddings. So when we say update embeddings,
00:19:58.140 | what this is going to do is look at all of the documents that are already within
00:20:04.140 | your document store. And it's going to use the retriever model that you pass here and
00:20:09.500 | embed them into vector representations of those contents. And then it's going to store
00:20:16.700 | those in your Pinecone Vector database. That will be processed. And at this point, we could
00:20:23.620 | run this get embedding count again, and we would get this 49995 value.
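Updating the embeddings is then a single call (the batch size is just an illustrative choice):

```python
# Embed every document that doesn't yet have a vector and upsert into Pinecone
document_store.update_embeddings(retriever, batch_size=128)

document_store.get_embedding_count()  # now ~49,995
```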
00:20:31.500 | Now another way that you can also see this number is if we go back to our Pinecone dashboard,
00:20:41.980 | we can head over to our index, so Haystack LFQA. We click on that, scroll down, and we
00:20:51.220 | can click on index info. And then we can see the total number of vectors, which is the
00:20:55.140 | same. So that number will be reflected in your vector database once you have updated
00:21:01.900 | the embeddings using your retriever model.
00:21:06.180 | And at that point, we can just test the first part of our LFQA pipeline, which is just a
00:21:13.820 | document store and a retriever. So we initialize this document search pipeline with our retriever
00:21:19.900 | model, and we can ask the question, when was the first electric power system built? And
00:21:25.900 | all this is going to do is retrieve the relevant context. It's not going to generate an answer
00:21:33.100 | yet. It's just going to retrieve what it thinks is the relevant context.
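A sketch of that retrieval-only test, using Haystack's DocumentSearchPipeline:

```python
from haystack.pipelines import DocumentSearchPipeline

search_pipe = DocumentSearchPipeline(retriever)

result = search_pipe.run(
    query="When was the first electric power system built?",
    params={"Retriever": {"top_k": 3}}  # retrieve the three most similar contexts
)
for doc in result["documents"]:
    print(doc.content[:100])
```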
00:21:37.900 | So we have here electrical power system in 1881. Two electricians built the world's first
00:21:46.140 | power system in Godalming in England, which is pretty good. So that's pretty cool. And
00:21:56.060 | what we now need to do is we have our document store or vector database, and then we have
00:22:01.140 | our retriever model. Now we need to initialize our generator model to actually generate those
00:22:06.180 | answers.
00:22:08.180 | So we come down here. We are going to be using a sequence-to-sequence generator. And we are
00:22:16.340 | going to be using this model here. So this, again, you can find this on the Hugging Face
00:22:21.900 | Model Hub. And there are different generator models you can use here, but you do want to
00:22:29.780 | find one that has been trained for long-form question answering.
00:22:33.420 | So for example, we have the BART LFQA that you can find here, or you have the BART Explain
00:22:41.460 | Like I'm Five model that we can find here. Now, I think the BART LFQA model seems to
00:22:48.840 | perform better, so we have gone with that. Also, it's been trained with a newer dataset.
00:22:55.940 | And yeah, we just initialize it like that. Now, when we say sequence-to-sequence, that's
00:23:00.620 | because it is taking in a sequence of characters or some input, and it's going to output a
00:23:06.220 | sequence of characters, e.g. the output, the answer. And if you are curious, the input
00:23:13.700 | will look something like what you see here.
00:23:16.420 | So we have the question, and then we have the user's query. It's followed by context.
00:23:21.780 | And then we have this P token here. And that P token indicates to the model the start of
00:23:28.420 | new context that has been retrieved from our document store. So in this case, we've retrieved
00:23:35.100 | three contexts, and all of that is being passed to the generator model, where it will then
00:23:41.900 | generate an answer based on all of that.
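For illustration, the conditioned input that a BART LFQA-style generator sees is roughly a single string of this shape (the exact formatting is an assumption about the model's training setup):

```python
query = "what is the war of currents?"
contexts = ["context one ...", "context two ...", "context three ..."]

# The question, followed by the retrieved passages, each prefixed with a <P> token
conditioned_input = "question: {} context: {}".format(
    query, " ".join("<P> " + c for c in contexts)
)
```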
00:23:44.780 | OK. So yeah, we just initialize the generator model, and then we initialize the generative
00:23:53.700 | Q and A pipeline. We pass in the generator and the retriever model. We don't need to
00:23:58.580 | include document store here, because the document store has already been passed to the retriever
00:24:03.700 | model when we're initializing that. So it's almost like it's embedded within the retriever.
00:24:08.860 | So we don't need to worry about adding that in there.
00:24:10.940 | And then we can begin asking questions. Now, this is where it starts to get, I think, more
00:24:14.340 | interesting. Now, one thing to make note of here is we have this top K parameter, and
00:24:20.460 | that's just saying how many contexts to retrieve in the context of our retriever model, and
00:24:28.040 | then for the generator, how many answers to generate. So in this case, we're retrieving
00:24:33.900 | three contexts, and then we are generating one answer based on the query and those three
00:24:39.780 | contexts, like we saw in the example.
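Such a query might look like this, with separate top_k values for the retriever and the generator:

```python
result = pipe.run(
    query="what is the war of currents?",
    params={
        "Retriever": {"top_k": 3},  # how many contexts to fetch
        "Generator": {"top_k": 1}   # how many answers to generate
    }
)
```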
00:24:43.820 | So in this, I'm asking, what is the war of currents? It's good to be specific to test
00:24:51.220 | this. And if we have the data within our data set, it seems to be pretty good at pulling
00:24:58.960 | that out and producing a relatively accurate answer. So the war of currents was a rivalry
00:25:05.860 | between Thomas Edison and George Westinghouse's companies over which form of transmission,
00:25:11.580 | DC or AC, was superior. That's the answer, which is pretty cool. And we can see what
00:25:17.940 | it's pulled that from. So it's pulled it from this content, this content, and this content.
00:25:28.580 | So there were three parts that got fed into the model.
00:25:34.860 | And that's good. We can see a lot of information there, but maybe we can see a little bit too
00:25:39.140 | much information. So we can actually use the print answers utility to minimize what we're
00:25:46.900 | outputting there. And here we get just this, which is obviously a lot easier to read. So
00:25:52.180 | we just pass our result into print answers and specify details of minimum. The rest of
00:25:57.140 | that is the same as what we asked before. So it's much more readable.
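The trimmed-down output uses Haystack's print_answers utility, roughly:

```python
from haystack.utils import print_answers

# "minimum" hides the retrieved documents and scores, showing just the answer text
print_answers(result, details="minimum")
```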
00:26:02.620 | Now one thing to point out here is that this is actually a very good answer, but maybe
00:26:10.460 | there's not that much detail. Now, if we find that we're not getting much detail in our
00:26:16.100 | answers or that the answer is just wrong, what the issue might be is first, the retrieved
00:26:26.180 | context may not contain any relevant information for the model to actually view and answer
00:26:33.800 | the question correctly. So it's not retrieving relevant information from that external open
00:26:41.700 | book document source.
00:26:44.100 | And the second is the model memory, so remember I mentioned that these models can have
00:26:50.660 | a memory. If it's
00:26:55.280 | not able to find any relevant information within its memory for your particular query,
00:27:02.860 | if both of those conditions are not satisfied, so we don't have relevant information coming
00:27:10.820 | from the external source and we don't have relevant information coming from the model
00:27:14.140 | memory, the generator is going to output usually something nonsensical.
00:27:21.580 | So in this scenario, we have two options really. The generator model, we can increase its size
00:27:29.060 | so we can use a larger generator model because larger generator models have more model parameters,
00:27:35.820 | which means they have basically more memory that they have learned during training. Or
00:27:43.340 | we can increase the amount of data that we are pulling from the document store. So if
00:27:50.420 | we are just returning three documents or contexts, we can increase it to 10 because then the
00:27:57.780 | generator is being fed a lot more information and it might be that the correct information
00:28:05.620 | that we need may come in maybe context five or context six and nine. And the generator
00:28:13.260 | will see that and be like, OK, that's the answer. I'm going to reformulate this into
00:28:19.220 | my answer.
00:28:22.120 | So we can try that here. Now, we already got a good answer, but we can just see what we
00:28:26.460 | get if we increase the retriever. So our retrieved number of documents, we increase
00:28:32.500 | that to 10. And we see that we get this much longer chunk of text now. And I think the
00:28:40.060 | first half of this is relatively accurate. So we have this in 1891, first power system
00:28:49.220 | was installed in the United States. I think that's relatively correct. And then it starts
00:28:56.780 | to get a little bit silly after that because we've pulled more context from our document
00:29:05.660 | store. But with that, we have pulled in more irrelevant information because we're retrieving
00:29:12.020 | 10 now. So there's a good chance that the last few of those are not relevant. So we're
00:29:16.740 | feeding a lot of irrelevant information into our generator model. And so it starts to get
00:29:21.180 | confused and then it can start to ramble like we see here.
00:29:27.420 | So that's what we see happening. Another thing I just want to point out is that the generator
00:29:36.180 | has this memory. So a lot of people always think when they hear, okay, the generator
00:29:43.060 | has memory, does that mean I don't need the document store? Because we have this memory,
00:29:46.940 | can't I just fine tune the model so that it knows everything within my particular use
00:29:51.300 | case? In some cases, yes, you might be able to do that. But it generally only works for
00:29:59.500 | more general questions or general knowledge. If you start to get specific, it tends to
00:30:05.980 | fail with that sort of memory part because the memory can only source so much information.
00:30:12.260 | And in the end, what you will probably need is you want a model with good memory so it
00:30:17.660 | can maybe pull out some facts from there. But for anything specific, it's probably going
00:30:22.260 | to need to refer to its document store.
00:30:26.140 | So what we have done here is we've asked the same question, but this time I've replaced
00:30:30.980 | the retrieved documents with just nothing. And we can see the result of that straight away.
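One way to sketch this test is to call the generator directly with an effectively empty document, so it can only fall back on its own memory (the exact call is an assumption about how the notebook does it):

```python
from haystack import Document

no_context_result = generator.predict(
    query="what is the war of currents?",
    documents=[Document(content="")],  # no retrieved context to lean on
    top_k=1
)
print(no_context_result["answers"])
```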
00:30:36.220 | So the answer is, I'm not sure what you mean by war. So it has no idea what the war of currents
00:30:43.300 | is. It doesn't have that information within its memory. So without that external document
00:30:48.860 | source, it doesn't know what to say. It's just, OK, I don't even know what war is.
00:30:57.380 | But like I said, in some cases, particularly when you're asking more general knowledge
00:31:02.700 | query, it will be able to pull that out from its memory. So who was the first person on
00:31:08.300 | the moon? It knows this because it's such a common thing to know. It's probably seen
00:31:14.500 | it in the training data that the model has been trained on a million times. Maybe not
00:31:19.340 | a million, but a few times at least. So that is the first man to walk on the moon was Neil
00:31:26.500 | Armstrong.
00:31:27.500 | OK, cool. So I think that's pretty much it. We can ask a few more questions. When was
00:31:34.900 | the first electrical power system built? So we ask this in the start, and it will give
00:31:38.980 | us this answer. If we want to confirm that this is correct-- so this is what I did with
00:31:47.260 | this. I was a bit confused because Google was telling me something different. You can
00:31:52.100 | print out the contents using this. So we loop through the result documents, and we just
00:31:59.220 | print dot content. And this, OK, so two electricians built the first power system at Godalming
00:32:06.820 | in England. So that information is actually coming from somewhere. It's not just making
00:32:10.940 | it up.
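Looping over the returned documents and printing their content is just:

```python
# Inspect where the generated answer came from
for doc in result["documents"]:
    print(doc.content)
```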
00:32:12.220 | So that can be really useful. Another thing just to be aware of with generators is that
00:32:18.180 | they can generate misleading information. So you need to be careful with that. So for
00:32:26.780 | example, in this one, I asked, where did COVID-19 originate? Now, this is pretty unfair because
00:32:32.140 | the generator probably hasn't seen anything about COVID-19. And at the same time, it doesn't
00:32:40.460 | have any COVID-19 information within its document store because we looked at history, not anything
00:32:47.540 | else.
00:32:49.060 | So it just says COVID-19 isn't a virus (which it is), it's a bacterium. So straightaway,
00:32:55.900 | that's pretty wrong. So just one example of where you need to just be cautious with this
00:33:05.580 | sort of thing because it can just give completely wrong answers if it doesn't have the relevant
00:33:10.780 | information available to it.
00:33:12.820 | So with that, there's a couple of things you could do to mitigate that. You can, one, just
00:33:18.180 | include the sources of information. If you build some sort of search interface, make
00:33:23.340 | sure you include those so users can look at that and see where this information is coming
00:33:27.700 | from. And two, there are confidence scores that are given to these answers. So you could
00:33:37.500 | put threshold. So you say anything below 0.2 confidence, we just don't show or we show,
00:33:46.220 | I'm not confident in this answer, but it might be this or something along those lines.
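A minimal sketch of that kind of thresholding, assuming each returned answer carries a score field:

```python
THRESHOLD = 0.2  # illustrative cut-off

for answer in result["answers"]:
    if answer.score is not None and answer.score < THRESHOLD:
        print("I'm not confident in this answer, but it might be:", answer.answer)
    else:
        print(answer.answer)
```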
00:33:54.580 | So that's just one drawback. We'll just go through a few final questions. So what was
00:34:00.620 | NASA's most expensive project? I would say the Space Shuttle project. That's correct.
00:34:07.260 | Tell me something interesting about the history of the Earth. In this case, what we get is
00:34:11.420 | not really history, I don't think. But it does give us an interesting
00:34:17.180 | fact about the magnetic field being weak compared to the rest of the solar system. I don't know
00:34:21.460 | if that's true or not. It seems like it might not be. When it says compared to the rest
00:34:27.140 | of the solar system, I'm thinking, is it weak compared to Mars? I don't think so. So that
00:34:32.580 | might not be true. Another thing to be wary of.
00:34:35.300 | Who created the Nobel Prize and why? So this one is correct and I think quite interesting.
00:34:42.100 | And how is the Nobel Prize funded? We kind of see it down here, so I know the information
00:34:46.500 | is in there, hence why I've asked the question. And it tells you that as well with a little
00:34:52.300 | bit more information. So that is it for long-form question answering with Haystack. As I said
00:35:03.340 | at the start, I think question answering is one of the most widely applicable forms of
00:35:09.860 | NLP or use cases of NLP. It can be applied almost everywhere. So it's a really good one
00:35:18.700 | to just go away and see maybe I can implement document search in my organization or I can
00:35:29.420 | create some sort of internal search engine that helps people in some way. And I think
00:35:36.740 | in a lot of organizations, it's very possible to do this and add a lot of benefit and reduce
00:35:44.420 | a lot of friction in day-to-day processes of most companies.
00:35:50.940 | So that's it for this video. I hope it's been useful and I will see you in the next one.