
How to Build Q&A Models in Python (Transformers)


Chapters

0:00 Introduction
0:47 Question Answering
2:05 Models Page
4:37 Importing Models
5:25 Loading Models
7:34 Tokenizer
13:51 Pipeline Setup
16:01 Output Dictionary
17:57 Results

Whisper Transcript

00:00:00.000 | Hi and welcome to this video on question answering with BERT.
00:00:03.960 | So firstly we're going to have a look at the Transformers library
00:00:08.480 | and we're going to look at how we can find a Q&A model
00:00:12.320 | and then we're going to look at the Q&A pipeline so we're going to look at
00:00:15.840 | actually loading a model in Python using the Transformers library.
00:00:19.840 | We're going to look at tokenization, how we load a tokenizer and what exactly
00:00:23.680 | the tokenizer is actually doing and then we're going to take a look at
00:00:27.760 | the pipeline class which is essentially a wrapper
00:00:32.080 | made available by the Hugging Face Transformers library
00:00:35.440 | and it basically just makes our job in terms of building a Q&A pipeline
00:00:40.080 | incredibly easy. So we're going to cover all those, it's going to be
00:00:44.000 | quite straightforward and quite simple so let's just get straight into it.
00:00:47.280 | Okay so when we're doing question answering
00:00:51.360 | we're essentially asking the model a question
00:00:54.640 | and passing a context which is what you can see here
00:00:58.880 | for the model to use to answer that question.
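For reference, the context passage and questions shown on screen look roughly like this. This is a reconstruction: the exact wording of the passage beyond the fragments read out later in the video is an assumption, as are the variable names.

```python
# reconstructed from the fragments quoted in the video; the passage is the
# SQuAD-style paragraph about the IPCC ("..." marks omitted text)
context = (
    "The Intergovernmental Panel on Climate Change (IPCC) is a scientific "
    "intergovernmental body under the auspices of the United Nations. It was "
    "first established in 1988 by two United Nations organizations, the World "
    "Meteorological Organization (WMO) and the United Nations Environment "
    "Programme (UNEP). ... The ultimate objective of the UNFCCC is to "
    "stabilize greenhouse gas concentrations in the atmosphere at a level "
    "that would prevent dangerous anthropogenic interference with the "
    "climate system."
)

questions = [
    'What organization is the IPCC a part of?',
    'What UN organizations established the IPCC?',
    'What does the UN want to stabilize?'
]
```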
00:01:02.400 | So you can see down here we have these three questions
00:01:06.240 | so what organization is the IPCC a part of and then the model will read
00:01:12.240 | through this and use its language modeling to figure
00:01:16.400 | out which organization the IPCC is part of
00:01:19.600 | which is not inherently clear from reading this.
00:01:22.800 | We can see we've got IPCC here and is a scientific
00:01:27.280 | intergovernmental body under the auspices of the United Nations.
00:01:32.400 | So clearly the IPCC is a part of the United Nations
00:01:37.440 | but it's not clear, it's not definitively saying that
00:01:40.880 | in this but once we've actually built this model
00:01:44.000 | it will quite easily be able to answer each one of these questions
00:01:47.200 | without any issues. So the first thing we want to do is go over
00:01:52.880 | to the Hugging Face website and
00:01:58.080 | on the Hugging Face website we just want to go over to the
00:02:03.600 | Models page so it's here. Okay and on this Models page the thing
00:02:10.160 | that we want to be looking at is this question answering task.
00:02:14.160 | So here we have all these tasks because when you're working with transformers
00:02:17.680 | they can work with a lot of different things
00:02:21.200 | text summarization, text classification, generation
00:02:25.040 | you know loads of different things but what we want to do is question answering
00:02:28.640 | so we click on here and this filters all of the models that
00:02:33.600 | are available to us just purely for question answering.
00:02:39.200 | So this is the sort of power of using the Hugging Face Transformers
00:02:44.720 | library it already has all these pre-trained
00:02:47.520 | models that we can just download and start using. Now
00:02:52.800 | when you want to go and apply these to specific use cases you
00:02:56.320 | probably want to fine-tune it which means you want to train it a little bit
00:03:00.160 | more than it has already been trained, but for actually getting used to how
00:03:06.000 | all of this works all you need to do is download this
00:03:09.200 | model and start asking questions and understanding
00:03:12.560 | how everything is actually functioning. So obviously there's a lot of models
00:03:17.600 | here we've got 262 models for question answering
00:03:21.120 | and there's new ones being added all the time. A few of the ones that I would
00:03:24.720 | recommend using are the deepset models.
00:03:29.120 | So here are the deepset models, there's eight of them for question answering,
00:03:33.200 | the one that we will be using is this bert-base-cased-squad2.
00:03:36.960 | Another one that I would definitely recommend trying out is this
00:03:40.160 | electra-base-squad2 but we will be sticking with BERT
00:03:44.160 | base. Now it's called deepset here because
00:03:47.600 | it's from the deepset AI company and this model is being pulled directly
00:03:51.920 | from their repository on the model hub. So deepset is actually the
00:03:56.160 | organization and then this is the repository,
00:03:58.880 | bert-base-cased-squad2. BERT is obviously the model BERT from
00:04:03.120 | Google AI. Base is the base version of BERT so you can see here we have
00:04:09.360 | BERT Large, that's just a larger model, we're using the base model.
00:04:12.640 | Cased just refers to the fact that this model will differentiate between
00:04:17.440 | uppercase and lowercase characters or words.
00:04:21.120 | The alternative to this would be uncased, where there's no differentiation
00:04:25.280 | between uppercase and lowercase, and then squad2
00:04:28.640 | refers to the question answering dataset that this model has been
00:04:32.720 | trained on, which is the SQuAD 2.0 dataset from
00:04:36.080 | Stanford University. So we're going to take this model,
00:04:40.160 | so you see deepset/bert-base-cased-squad2, and we are going to load it into here
00:04:46.800 | and all we need to do to do that is from transformers so this is the
00:04:53.360 | Hugging Face Transformers library. We're going to import
00:04:59.680 | BertForQuestionAnswering. So this is a specific class
00:05:07.040 | and using this class we can initialize a few different models not just this
00:05:12.240 | specific model so you can see here we have this bert-base-cased, we can also
00:05:16.800 | initialize this bert-large-uncased, and RoBERTa
00:05:20.240 | and DistilBERT have equivalent classes as well, so we can
00:05:23.360 | load those in too, and what this does is it loads that specific model with
00:05:30.640 | its question and answering layer added on there as well.
00:05:34.720 | So this model has been trained with the extra layer specifically for question
00:05:38.560 | answering and we need to use BertForQuestionAnswering
00:05:41.360 | to load that. Otherwise, if you are not using it for a
00:05:46.320 | specific use case and you're just wanting to get the model
00:05:49.520 | itself you can just use the AutoModel class
00:05:53.120 | like that but we want it for question answering so we load this one.
00:05:57.760 | Another thing to note is that we are using the PyTorch
00:06:01.040 | implementation of BERT here so Transformers works by having both
00:06:07.520 | TensorFlow and PyTorch as alternative frameworks working behind the scenes
00:06:11.600 | in this case we're using PyTorch if you want to switch over to TensorFlow
00:06:15.840 | all you do is add TF in front of that class.
00:06:20.400 | So that is our model and to actually load that in all we do
00:06:26.880 | is copy this
00:06:30.560 | and we use the from_pretrained method and then this is where the model name
00:06:37.280 | from over here comes into play so we've got
00:06:40.240 | deepset/bert-base-cased-squad2 and we just enter that in there.
00:06:47.760 | Okay and with that we've actually just loaded the model that's all we had to do.
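In code, loading the model looks something like this (a minimal sketch; the model name is the one picked out on the models page above):

```python
from transformers import BertForQuestionAnswering

# download and initialize the pre-trained model, including its Q&A head
model = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')
```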
00:06:59.360 | Of course there are a few other steps this is just a model but there are a few
00:07:03.920 | steps before we actually get the data to the model
00:07:07.200 | so we need to actually process this data
00:07:11.120 | so we have this context here and this is just a string.
00:07:15.120 | BERT doesn't understand strings, BERT understands an array of integers where
00:07:18.720 | each integer represents a token ID and that token ID
00:07:23.200 | is very specific to BERT, and each one is unique and
00:07:28.000 | represents a specific word, piece of syntax, punctuation, and so
00:07:32.640 | on. So we need to convert this string
00:07:38.720 | into that specific BERT ready format and to do that we need to use a tokenizer
00:07:45.200 | so again we're going to go from transformers
00:07:50.320 | and we're going to import the AutoTokenizer class.
00:07:55.360 | Here we could use, for example, the BertTokenizer class
00:08:00.320 | but for this we don't need anything specific it's
00:08:05.040 | quite generic it will just load all of those
00:08:08.400 | mappings from the string or the word into the tokens there's no real issue
00:08:13.840 | there. So we import our AutoTokenizer and
00:08:18.320 | to initialize it we just use this, it's practically the
00:08:25.440 | same syntax as what we used before, we use this from_pretrained method
00:08:30.800 | and then again we're using the same model name.
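As a sketch, that looks like this:

```python
from transformers import AutoTokenizer

# load the tokenizer that matches the model checkpoint
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')
```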
00:08:37.760 | Okay and then with this we can actually tokenize our data so
00:08:52.480 | all we need to do is write tokenizer.encode
00:08:55.680 | and then let's just pass in one of these questions so we'll
00:08:58.800 | pass in the first one, so the questions list and the first question there,
00:09:07.520 | and two arguments that we will need to add in here
00:09:12.080 | are the truncation which we will set to true
00:09:19.360 | and the padding which we also set to true. So
00:09:23.840 | when we are setting up these models and the data going into them BERT in
00:09:29.840 | particular will expect 512 tokens with every input.
00:09:36.000 | Now here when we look at this we can see that
00:09:41.600 | each one of these words is most likely to be one token
00:09:46.320 | and then this question mark at the end will also be a token so we have
00:09:50.720 | around 10 tokens in there. Now because we have padding this will add
00:09:56.960 | a set of padding tokens onto the end of it
00:10:00.320 | to bring that total number of tokens up to
00:10:03.440 | 512. Now alternatively say if we had 600 tokens in there
00:10:11.200 | we would be relying on the truncation to cut
00:10:14.160 | the final 88 tokens to make it a total of 512
00:10:19.840 | and that's why we need those two arguments in there.
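A sketch of that call is below. One caveat: the video sets padding to true, but for a single sequence `padding=True` only pads batches to their longest member; padding a lone question out to the full 512 tokens, as described here, is what `padding='max_length'` does.

```python
# encode the first question into BERT token IDs, padded out to 512 tokens
tokens = tokenizer.encode(questions[0], truncation=True, padding='max_length')
print(tokens)  # e.g. [101, 1327, 2369, ...] followed by padding token IDs
```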
00:10:23.600 | So let's see what we get from this you can see here that we have
00:10:27.520 | our tokenized input so BERT will be able to read and understand this
00:10:32.560 | and essentially what we have is, this 1327 is the equivalent of 'What',
00:10:39.360 | this 2369 is equivalent to 'organization', and so on. Now what you
00:10:46.000 | might not see here is why we have this 101.
00:10:50.160 | So 101 for BERT actually refers to a special token which
00:10:56.480 | looks like this, [CLS], and this just signifies the start of any
00:11:01.600 | sequence, so if we were to just take this
00:11:07.040 | we can see that, okay, we get the same again, we get this 101
00:11:14.720 | which is the start of sequence token, then we get the start of sequence token again
00:11:18.960 | because that's all we've put into here and the tokenizer is reading
00:11:22.640 | that and converting it into the 101, and then we also get this final
00:11:27.440 | special token as well, and we can also see that's here, so
00:11:31.200 | this is another special token, [SEP], which signifies the end of a sequence or
00:11:37.440 | a separator point, so if we write this out we see here that
00:11:44.080 | the separator is 102, which signifies
00:11:49.040 | a separation point or a separator.
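For reference, these special tokens and their IDs can be read straight off the tokenizer (a quick sketch):

```python
# BERT's special tokens and their IDs
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101 - start of sequence
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102 - separator / end of sequence
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0   - padding
```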
00:11:53.120 | Now, when we feed this context and this question
00:11:57.440 | into our BERT model, BERT will expect it to be
00:12:01.120 | within a format something like this: we have the
00:12:04.800 | start of sequence token, then we will have
00:12:08.880 | our context tokens, so this will just be a list of integers which are the
00:12:15.440 | token IDs, and then what we will see is a
00:12:19.680 | separator token here, followed by our question,
00:12:25.840 | which again is followed by a separator token, and
00:12:33.520 | after this we get a set of padding tokens, which
00:12:37.840 | look like this, [PAD], and that will just take us up to
00:12:40.880 | the 512 token amount. And that's how the data going into BERT
00:12:48.480 | will look: we have that start of sequence, we have the context, we have a
00:12:52.400 | separator, we have the question, we have a separator, and we have padding.
00:12:55.840 | It's always going to look like that when it's going into
00:12:59.520 | a BERT Q&A model.
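One way to see this layout for yourself is to encode a question and the context together as a sequence pair and decode the result (a sketch; note that with tokenizer.encode the order of the two segments simply follows the order of the arguments, here question first):

```python
# encode question and context as one sequence pair; the tokenizer inserts
# the [CLS] and [SEP] tokens at the boundaries for us
ids = tokenizer.encode(questions[0], context, truncation=True, padding='max_length')
print(tokenizer.decode(ids))
# -> [CLS] What organization is the IPCC a part of? [SEP] The Intergovernmental ... [SEP] [PAD] [PAD] ...
```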
00:13:06.160 | So if we just remove that and this here, what we want to do now is actually set up this
00:13:09.360 | tokenizer and our model in a pipeline, a Q&A pipeline,
00:13:15.840 | so again we get this pipeline from the transformers library so we come
00:13:21.840 | down here do from transformers import
00:13:26.240 | pipeline
00:13:28.880 | and now what we want to do is just initialize a
00:13:34.400 | pipeline object so to do that we just write pipeline
00:13:38.240 | and then in here what we need to add is the task, so obviously you can see
00:13:45.600 | up here we have all of these different tasks
00:13:48.240 | so summarization text generation and so on
00:13:51.360 | the transformers library needs to understand or this pipeline object
00:13:55.280 | needs to understand which one of those pipelines or functions
00:13:59.360 | we are intending to use. So to tell it that we want to do question
00:14:04.000 | answering we just write 'question-answering'
00:14:08.960 | and that basically sets the wrapper of the pipeline
00:14:12.160 | to handle question answering formats so we'll see
00:14:15.760 | our input and for our input we will be passing a context and a question so
00:14:20.320 | we'll see that it will convert into the right structure
00:14:23.600 | that we need for question answering which is the
00:14:26.160 | CLS context separator question separator and padding it will
00:14:31.200 | convert into that feed it into our tokenizer and the
00:14:34.320 | output of that tokenizer our token ids will be fed into BERT
00:14:38.000 | BERT will return us a span start and span end which is essentially
00:14:44.320 | two numbers which signify the start position and end position of our answer
00:14:49.040 | within the context and this pipeline will take those two
00:14:52.400 | numbers and apply them to our context to get the text which is our answer
00:14:58.320 | from that so it's essentially just a little wrapper and it adds a few
00:15:02.240 | functionalities so that we don't have to worry about
00:15:05.120 | converting all of these things so now we just need to pass in our
00:15:10.960 | model and the tokenizer as well
00:15:15.760 | and it's as simple as that, that's our pipeline set up.
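Putting those pieces together, the setup is just this (a minimal sketch, using the model and tokenizer loaded above):

```python
from transformers import pipeline

# wrap the model and tokenizer in a ready-made question-answering pipeline
nlp = pipeline('question-answering', model=model, tokenizer=tokenizer)
```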
00:15:20.080 | So if we want to use that now, all we need to do
00:15:23.280 | is write nlp, and then here we pass a dictionary, and this dictionary,
00:15:30.720 | like I said before, needs to contain our question and context.
00:15:35.440 | So the question,
00:15:38.240 | and for this we will just pass the first of our questions up here again, so
00:15:45.040 | this is the questions list at
00:15:49.040 | index zero,
00:15:51.920 | and then we also pass our context, which is
00:15:55.760 | inside the context variable up here.
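So the call looks something like this (variable names as assumed above):

```python
# ask the first question against the context
result = nlp({
    'question': questions[0],
    'context': context
})
print(result)
# e.g. {'score': ..., 'start': 118, 'end': 132, 'answer': 'United Nations'}
```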
00:16:00.640 | Okay, and this will output a dictionary containing
00:16:05.760 | the, well, we can see the score of the answer, so that is the model's
00:16:12.160 | confidence that this is actually the answer,
00:16:14.960 | like I said before, the start index and end index, and what those start
00:16:22.000 | index and end index map to, which is United Nations. So our
00:16:25.760 | question was what organization is the IPCC a
00:16:30.080 | part of, and we got United Nations, which is
00:16:33.120 | correct. So let me just show you what I mean with
00:16:37.920 | this start and end. So if we go 118 here,
00:16:43.200 | we get the first letter of our answer, because we are
00:16:46.560 | going through here and it is pulling out this
00:16:49.840 | specific character. If we then add this and go all the way up to our end,
00:16:57.840 | which is at 132, we get the full set, because what we're
00:17:02.560 | doing here is pulling out all the characters from
00:17:07.520 | character 118 all the way up to
00:17:10.800 | character 132, which is actually this
00:17:14.320 | comma here, but obviously with Python slice indexing
00:17:17.840 | we get up to the character before, and that gives us United Nations, which
00:17:21.680 | is our answer.
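As a sketch, using the start and end values from the output dictionary (the exact character offsets of course depend on the context string):

```python
print(context[118])       # 'U' - the first character of the answer
print(context[118:132])   # 'United Nations' - slicing stops before index 132
# or, using the output dictionary directly:
print(context[result['start']:result['end']])
```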
00:17:28.400 | So let's ask another question: we have 'What UN organizations established the IPCC?'
00:17:35.200 | and we get this: WMO and United Nations Environment Programme (UNEP).
00:17:42.560 | So if we go in here we can see it was first established in 1988 by two
00:17:47.520 | United Nations organizations, the World Meteorological Organization
00:17:52.720 | (WMO) and the United Nations Environment Programme
00:17:56.320 | (UNEP). So here we have two organizations and it is only actually
00:18:02.800 | pulling out one of those in full. So I think the reason for that is, all
00:18:07.600 | that it is reading is WMO and United Nations Environment Programme,
00:18:11.360 | so it is pulling out those two organizations in the end, just not the
00:18:15.600 | full name of the first one. So it's still a pretty
00:18:18.880 | good result. And let's go down to this final
00:18:23.280 | question: 'What does the UN want to stabilize?'
00:18:30.640 | And here we're getting the answer of greenhouse gas concentrations in the
00:18:34.160 | atmosphere.
00:18:36.800 | So if we go down here we can see the ultimate objective of the
00:18:42.720 | UNFCCC is to stabilize greenhouse gas concentrations in the
00:18:48.080 | atmosphere at a level that would prevent dangerous
00:18:50.800 | anthropogenic interference with the climate system.
00:18:54.640 | So again we are getting the answer, stabilize
00:18:57.760 | greenhouse gas concentrations. So our model has gone through each one
00:19:04.240 | of those questions and successfully answered them, and all
00:19:08.000 | we've done is written a few lines of code,
00:19:11.040 | and this is without us fine-tuning it at all.
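The whole demo therefore boils down to a loop like this (a sketch under the same assumptions as above):

```python
# run every question against the context and print the answers
for question in questions:
    result = nlp({'question': question, 'context': context})
    print(f"{question}\n -> {result['answer']} (score {result['score']:.4f})")
```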
00:19:14.800 | Now, when you do go and apply these to your own problems, sometimes you won't need to
00:19:18.800 | do any fine-tuning and the model as-is will be more than enough, but a lot of
00:19:23.840 | the time you will need to fine-tune it, and in that case there are a few extra
00:19:29.120 | steps. But for this introduction, that's
00:19:32.240 | everything I wanted to cover. In terms of
00:19:35.040 | fine-tuning, I have covered that in another video, so I will
00:19:37.840 | put a link to that in the description. But that's everything for this video, so
00:19:43.200 | thank you very much for watching, I hope you enjoyed it, and I will see you again
00:19:47.040 | next time. Thanks, bye!