How to Build Q&A Models in Python (Transformers)
Chapters
0:00 Introduction
0:47 Question Answering
2:05 Models Page
4:37 Importing Models
5:25 Loading Models
7:34 Tokenizer
13:51 Pipeline Setup
16:01 Output Dictionary
17:57 Results
00:00:00.000 |
Hi and welcome to this video on question answering with BERT. 00:00:03.960 |
So firstly we're going to have a look at the Transformers library 00:00:08.480 |
and we're going to look at how we can find a Q&A model 00:00:12.320 |
and then we're going to look at the Q&A pipeline so we're going to look at 00:00:15.840 |
actually loading a model in Python using the Transformers library. 00:00:19.840 |
We're going to look at tokenization, how we load a tokenizer and what exactly 00:00:23.680 |
the tokenizer is actually doing and then we're going to take a look at 00:00:27.760 |
the pipeline class which is essentially a wrapper 00:00:32.080 |
made available by the Hugging Face Transformers library 00:00:35.440 |
and it basically just makes our job in terms of building a Q&A pipeline 00:00:40.080 |
incredibly easy. So we're going to cover all those, it's going to be 00:00:44.000 |
quite straightforward and quite simple so let's just get straight into it. 00:00:51.360 |
we're essentially asking the model a question 00:00:54.640 |
and passing a context which is what you can see here 00:00:58.880 |
for the model to use to answer that question. 00:01:02.400 |
So you can see down here we have these three questions 00:01:06.240 |
so what organization is the IPCC a part of and then the model will read 00:01:12.240 |
through this and use its language modeling to figure 00:01:19.600 |
which is not inherently clear from reading this. 00:01:22.800 |
We can see we've got IPCC here and is a scientific 00:01:27.280 |
intergovernmental body under the auspices of the United Nations. 00:01:32.400 |
So clearly the IPCC is a part of the United Nations 00:01:37.440 |
but it's not clear, it's not definitively saying that 00:01:40.880 |
in this but once we've actually built this model 00:01:44.000 |
it will quite easily be able to answer each one of these questions 00:01:47.200 |
without any issues. So the first thing we want to do is go over 00:01:58.080 |
on the Hugging Face website we just want to go over to the 00:02:03.600 |
Models page so it's here. Okay and on this Models page the thing 00:02:10.160 |
that we want to be looking at is this question answering task. 00:02:14.160 |
So here we have all these tasks because when you're working with transformers 00:02:21.200 |
text summarization, text classification, generation 00:02:25.040 |
you know loads of different things but what we want to do is question answering 00:02:28.640 |
so we click on here and this filters all of the models that 00:02:33.600 |
are available to us just purely for question answering. 00:02:39.200 |
So this is the sort of power of using the Hugging Face Transformers 00:02:47.520 |
models that we can just download and start using. Now 00:02:52.800 |
when you want to go and apply these to specific use cases you 00:02:56.320 |
probably want to fine-tune it which means you want to train it a little bit 00:03:00.160 |
more than what it is already trained but for actually getting used to how 00:03:06.000 |
all of this works all you need to do is download this 00:03:09.200 |
model and start asking questions and understanding 00:03:12.560 |
how everything is actually functioning. So obviously there's a lot of models 00:03:17.600 |
here we've got 262 models for question answering 00:03:21.120 |
and there's new ones being added all the time. A few of the ones that I would 00:03:29.120 |
So here are the deepset models there's eight of them for question answering 00:03:33.200 |
the one that we will be using is this bert-base-cased-squad2. 00:03:36.960 |
Another one that I would definitely recommend trying out is this ELECTRA 00:03:47.600 |
it's from the deepset AI company and this model is being pulled directly 00:03:51.920 |
from their repository on the Hugging Face model hub. So deepset is actually the 00:03:56.160 |
organization and then this is the repository 00:03:58.880 |
bert-base-cased-squad2. BERT is obviously the model BERT from 00:04:03.120 |
Google AI. Base is the base version of BERT so you can see here we have 00:04:09.360 |
BERT large that's just a large model we're using the base model. 00:04:12.640 |
Cased just refers to the fact that this model will differentiate between 00:04:21.120 |
The alternative to this would be uncased here where there's no differentiation 00:04:25.280 |
between uppercase and lowercase and then squad2 00:04:28.640 |
refers to the question answering data set that this model has been 00:04:36.080 |
Stanford University. So we're going to take this model 00:04:40.160 |
so you see deepset/bert-base-cased-squad2 and we are going to load it into here 00:04:46.800 |
and all we need to do to do that is from transformers so this is the 00:04:53.360 |
Hugging Face Transformers library. We're going to import 00:04:59.680 |
BertForQuestionAnswering. So this is a specific class 00:05:07.040 |
and using this class we can initialize a few different models not just this 00:05:12.240 |
specific model so you can see here we have this bert-base-cased we can also 00:05:20.240 |
and if there's a DistilBERT as well we can also 00:05:23.360 |
load those in and what this does is it loads that specific model with 00:05:30.640 |
its question answering layer added on there as well. 00:05:34.720 |
So this model has been trained with the extra layer specifically for question 00:05:38.560 |
answering and we need to use BertForQuestionAnswering 00:05:41.360 |
to load that otherwise if you are not using it with a 00:05:46.320 |
specific use case and you're just wanting to get the model 00:05:53.120 |
like that but we want it for question answering so we load this one. 00:05:57.760 |
Another thing to note is that we are using the PyTorch 00:06:01.040 |
implementation of BERT here so Transformers works by having both 00:06:07.520 |
TensorFlow and PyTorch as alternative frameworks working behind the scenes 00:06:11.600 |
in this case we're using PyTorch if you want to switch over to TensorFlow 00:06:20.400 |
So that is our model and to actually load that in all we do 00:06:30.560 |
and we use the from_pretrained method and then this is where the model name 00:06:37.280 |
from over here comes into play so we've got 00:06:40.240 |
deepset/bert-base-cased-squad2 and we just enter that in there. 00:06:47.760 |
Okay and with that we've actually just loaded the model that's all we had to do. 00:06:59.360 |
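As a minimal sketch of the loading step just described, assuming the transformers library and PyTorch are installed and that the identifier on the models page is deepset/bert-base-cased-squad2. The helper name load_qa_model is hypothetical and only used here for illustration:

```python
# Identifier in organization/repository form, as shown on the models page.
MODEL_NAME = "deepset/bert-base-cased-squad2"


def load_qa_model(model_name: str = MODEL_NAME):
    """Load BERT together with its question answering layer (PyTorch version)."""
    # Imported inside the function (an illustrative choice) so that defining
    # the helper does not require transformers; calling it does.
    from transformers import BertForQuestionAnswering

    # Downloads the weights on first use, then reads them from the local cache.
    return BertForQuestionAnswering.from_pretrained(model_name)
```

Calling model = load_qa_model() needs network access the first time, because the weights are fetched from the model hub.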
Of course there are a few other steps this is just a model but there are a few 00:07:03.920 |
steps before we actually get the data to the model 00:07:11.120 |
so we have this context here and this is just a string. 00:07:15.120 |
BERT doesn't understand strings, BERT understands an array of integers where 00:07:18.720 |
each integer represents a token ID and that token ID 00:07:23.200 |
is very specific to BERT and each one is unique and 00:07:28.000 |
represents a specific word or piece of syntax, punctuation and so on 00:07:38.720 |
into that specific BERT ready format and to do that we need to use a tokenizer 00:07:50.320 |
and we're going to import the AutoTokenizer class. 00:07:55.360 |
Here we can use for example the BertTokenizer 00:08:00.320 |
but for this we don't need anything specific it's 00:08:08.400 |
mappings from the string or the word into the tokens there's no real issue 00:08:18.320 |
to initialize it we just see this it's practically the 00:08:25.440 |
same syntax as what we used before we use this from_pretrained method 00:08:37.760 |
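A rough sketch of the tokenizer setup, under the same assumptions as before (transformers installed, model name deepset/bert-base-cased-squad2). The helper names load_tokenizer and encode_question are hypothetical. The truncation argument is our assumption for one of the two arguments used in the encode call that the walkthrough turns to next; padding is the one named in the video:

```python
MODEL_NAME = "deepset/bert-base-cased-squad2"


def load_tokenizer(model_name: str = MODEL_NAME):
    """Load the tokenizer that matches the model, using the same from_pretrained syntax."""
    # Lazy import (illustrative choice): transformers is only needed when this is called.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name)


def encode_question(tokenizer, question: str):
    """Turn a question string into a list of token IDs."""
    # truncation=True is an assumed argument; padding=True is the one set in the video.
    return tokenizer.encode(question, truncation=True, padding=True)
```

With a real tokenizer this would be used as encode_question(load_tokenizer(), questions[0]), where questions is whatever list of question strings you have defined.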
Okay and then with this we can actually tokenize our data so 00:08:52.480 |
all we need to do is write tokenizer.encode 00:08:55.680 |
and then let's just pass in one of these questions so we'll 00:08:58.800 |
pass in the first one the questions and the first question there 00:09:07.520 |
and two variables that we will need to add in here 00:09:19.360 |
and the padding which we also set to true. So 00:09:23.840 |
when we are setting up these models and the data going into them BERT in 00:09:29.840 |
particular will expect 512 tokens with every input. 00:09:36.000 |
Now here when we look at this we can see there's probably 00:09:41.600 |
one so each one of these words is most likely to be a token 00:09:46.320 |
and then this question mark at the end will also be a token so we have 00:09:50.720 |
around 10 tokens in there. Now because we have padding this will add 00:10:03.440 |
512. Now alternatively say if we had 600 tokens in there 00:10:14.160 |
the final 88 tokens to make it a total of 512 00:10:19.840 |
and that's why we need those two arguments in there. 00:10:23.600 |
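To make the effect of those two arguments concrete, here is a plain-Python sketch of padding and truncating to a fixed length of 512. It is not the tokenizer's actual implementation (it ignores special tokens, and in the library the target length depends on the padding strategy and max_length settings). The function name pad_or_truncate is hypothetical, and the padding ID of 0 is the [PAD] token in the standard BERT vocabularies:

```python
PAD_ID = 0     # [PAD] token ID in the standard BERT vocabularies
MAX_LEN = 512  # fixed input length used in the walkthrough


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len IDs: cut off the tail if too long, pad if too short."""
    token_ids = list(token_ids)
    if len(token_ids) >= max_len:
        return token_ids[:max_len]  # truncation: drop everything past max_len
    return token_ids + [pad_id] * (max_len - len(token_ids))  # padding


short = pad_or_truncate(range(1, 11))   # 10 tokens, so 502 padding IDs are appended
long_ = pad_or_truncate(range(1, 601))  # 600 tokens, so the final 88 are cut off
```

Both short and long_ end up with 512 entries, which is the point of using the two arguments together.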
So let's see what we get from this you can see here that we have 00:10:27.520 |
our tokenized input so BERT will be able to read and understand this 00:10:32.560 |
and essentially what we have so this 1327 is the equivalent of "what" 00:10:39.360 |
this 2369 is equivalent to "organization" and so on and so on. Now what you 00:10:50.160 |
So 101 for BERT actually refers to a special token which 00:10:56.480 |
looks like this and this just signifies the start of any 00:11:07.040 |
we can see that okay we get the same again we get this 101 00:11:14.720 |
which is the start sequence then we get the start sequence token again 00:11:18.960 |
because that's all we've put into here and the BERT tokenizer is reading 00:11:22.640 |
that and converting it into the 101 and then we also get this final 00:11:27.440 |
special token as well and we can also see that's here so 00:11:31.200 |
this is another special token which signifies the end of a sequence or 00:11:37.440 |
it signifies a separator point so if we write this out we see here that 00:11:49.040 |
it signifies a separation point or a separator. 00:11:53.120 |
So when we feed this context and this question 00:11:57.440 |
into our BERT model BERT will expect it to be 00:12:01.120 |
within the format something like this so we have the 00:12:08.880 |
our context tokens so this will just be a list of integers which are the 00:12:19.680 |
separator token here followed by our question 00:12:25.840 |
which again after this is followed by a separator token and again 00:12:33.520 |
after this we get a set of padding tokens which 00:12:37.840 |
look like this and that will just take us up to 00:12:40.880 |
the 512 token amount and that's what the data going into BERT 00:12:48.480 |
will look like we have that start sequence we have the context we have a 00:12:52.400 |
separator we have a question we have a separator and we have padding 00:12:55.840 |
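The layout described above can be sketched in plain Python. This only illustrates the ordering of the pieces and is not how the library builds its inputs internally. The function name build_qa_input is hypothetical, the example IDs for the context and question are made up, and 102 and 0 are the [SEP] and [PAD] IDs in the standard BERT vocabularies (101 is the start token mentioned earlier):

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD]


def build_qa_input(context_ids, question_ids, max_len=512):
    """Arrange IDs as: start token, context, separator, question, separator, padding."""
    ids = [CLS_ID] + list(context_ids) + [SEP_ID] + list(question_ids) + [SEP_ID]
    if len(ids) > max_len:
        raise ValueError("input is longer than max_len and would need truncating")
    return ids + [PAD_ID] * (max_len - len(ids))


# Made-up IDs: three context tokens and two question tokens, padded to length 10.
example = build_qa_input([11, 12, 13], [21, 22], max_len=10)
print(example)  # [101, 11, 12, 13, 102, 21, 22, 102, 0, 0]
```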
it's always going to look like that when it's going into 00:12:59.520 |
a BERT Q&A model so if we just remove that and this here 00:13:06.160 |
and what we want to do now is actually set up this 00:13:09.360 |
tokenizer and our model into a pipeline into a Q&A pipeline 00:13:15.840 |
so again we get this pipeline from the transformers library so we come 00:13:28.880 |
and now what we want to do is just initialize a 00:13:34.400 |
pipeline object so to do that we just write pipeline 00:13:38.240 |
and then in here what we need to add is a model type so obviously you can see 00:13:51.360 |
the transformers library needs to understand or this pipeline object 00:13:55.280 |
needs to understand which one of those pipelines or functions 00:13:59.360 |
we are intending to use so to tell it that we want to do question 00:14:08.960 |
and that basically sets the wrapper of the pipeline 00:14:12.160 |
to handle question answering formats so we'll see 00:14:15.760 |
our input and for our input we will be passing a context and a question so 00:14:20.320 |
we'll see that it will convert into the right structure 00:14:23.600 |
that we need for question answering which is the 00:14:26.160 |
CLS context separator question separator and padding it will 00:14:31.200 |
convert into that feed it into our tokenizer and the 00:14:34.320 |
output of that tokenizer our token ids will be fed into BERT 00:14:38.000 |
BERT will return us a span start and span end which is essentially 00:14:44.320 |
two numbers which signify the start position and end position of our answer 00:14:49.040 |
within the context and this pipeline will take those two 00:14:52.400 |
numbers and apply them to our context to get the text which is our answer 00:14:58.320 |
from that so it's essentially just a little wrapper and it adds a few 00:15:02.240 |
functionalities so that we don't have to worry about 00:15:05.120 |
converting all of these things so now we just need to pass in our 00:15:15.760 |
and it's as simple as that that's our pipeline setup 00:15:20.080 |
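A minimal sketch of that pipeline setup, assuming transformers and PyTorch are installed. The helper name build_qa_pipeline is hypothetical; the task string question-answering and the model and tokenizer keyword arguments are the library's own:

```python
TASK = "question-answering"
MODEL_NAME = "deepset/bert-base-cased-squad2"


def build_qa_pipeline(model_name: str = MODEL_NAME):
    """Wrap the model and its tokenizer in a question answering pipeline."""
    # Lazy import (illustrative choice): transformers is only needed when this is called.
    from transformers import AutoTokenizer, BertForQuestionAnswering, pipeline

    model = BertForQuestionAnswering.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline(TASK, model=model, tokenizer=tokenizer)
```

The returned object, called nlp in the video, is then given a question and a context, and the details of the call can vary between library versions.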
so if we want to use that now all we need to do 00:15:23.280 |
is write nlp and then here we pass a dictionary and this dictionary 00:15:30.720 |
like I said before needs to contain our question and context 00:15:38.240 |
and for this we will just pass the first of our questions up here again so 00:16:00.640 |
okay and this will output a dictionary containing 00:16:05.760 |
the well we can see the score of the answer so that is the model's 00:16:14.960 |
like I said before the start index and end index and what those start 00:16:22.000 |
index and end index map to which is United Nations so our 00:16:33.120 |
correct so let me just show you what I mean with 00:16:43.200 |
we get the first letter of our answer because we are 00:16:46.560 |
going through here and it is pulling out this 00:16:49.840 |
specific character if we then add this and go all the way up to our end 00:16:57.840 |
which is at 132 we get the full set because what we're 00:17:02.560 |
doing here is pulling out all the characters from you or at 00:17:10.800 |
character 132 which is actually this 00:17:14.320 |
comma here but obviously with Python slicing the end index is exclusive so 00:17:17.840 |
we stop at the character before and that gives us United Nations which 00:17:28.400 |
we have what UN organizations established the IPCC 00:17:35.200 |
and we get this WMO and United Nations Environment Programme UNEP 00:17:42.560 |
so if we go in here we can see it was first established in 1988 by two 00:17:47.520 |
United Nations organizations the World Meteorological Organization 00:17:56.320 |
UNEP so here we have two organizations and it is only actually 00:18:02.800 |
pulling out one of those so I think the reason for that is all 00:18:07.600 |
that is reading is WMO and United Nations Environment Programme 00:18:11.360 |
so it is pulling out those two organizations in the end just not the 00:18:15.600 |
full name of the first one so it's still a pretty 00:18:23.280 |
question so what does the UN want to stabilize 00:18:30.640 |
and here we're getting the answer of greenhouse gas concentrations in the 00:18:36.800 |
so if we go down here we can see the ultimate objective of the 00:18:42.720 |
UNFCCC is to stabilize greenhouse gas concentrations in the 00:18:48.080 |
atmosphere at a level that would prevent dangerous 00:18:50.800 |
anthropogenic interference with the climate system 00:18:57.760 |
greenhouse gas concentrations so our model has gone through each one 00:19:04.240 |
of those questions and successfully answered them and all 00:19:11.040 |
and this is without us fine-tuning them at all now 00:19:14.800 |
when you do go and apply these to your own problems sometimes you won't need to 00:19:18.800 |
do any fine-tuning and the model as is will be more than enough but a lot of the 00:19:23.840 |
time you will need to fine-tune it and in that case there are a few extra 00:19:32.240 |
everything I wanted to cover there in terms of 00:19:35.040 |
fine-tuning I have covered that in another video so I will 00:19:37.840 |
put a link to that in the description but that's everything for this video so 00:19:43.200 |
thank you very much for watching I hope you enjoyed and I will see you again