Hi, and welcome to this video on question answering with BERT. First we're going to have a look at the Transformers library, and we're going to look at how we can find a Q&A model. Then we're going to look at the Q&A pipeline, so we'll actually load a model in Python using the Transformers library.
We're going to look at tokenization: how we load a tokenizer and what exactly the tokenizer is doing. Then we'll take a look at the pipeline class, which is essentially a wrapper made available by the Hugging Face Transformers library, and it makes our job of building a Q&A pipeline incredibly easy.
We're going to cover all of those, and it's going to be quite straightforward, so let's get straight into it. Okay, so when we're doing question answering, we're essentially asking the model a question and passing a context, which is what you can see here, for the model to use to answer that question.
You can see down here we have these three questions, for example "What organization is the IPCC a part of?" The model will read through the context and use its language modeling to figure out which organization the IPCC is part of, which is not inherently clear from reading it.
We can see we've got "IPCC" here, and that it "is a scientific intergovernmental body under the auspices of the United Nations". So clearly the IPCC is a part of the United Nations, but the passage never says that definitively. Once we've actually built this model, though, it will quite easily be able to answer each one of these questions without any issues.
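If you'd like to follow along in code, the context and questions are just plain Python strings. A minimal sketch, with the passage abridged and its exact wording assumed from the excerpts quoted later in the video:

```python
# Abridged IPCC passage (wording assumed, pieced together from the excerpts
# quoted in this video) and the three questions we will ask.
context = (
    "The Intergovernmental Panel on Climate Change (IPCC) is a scientific "
    "intergovernmental body under the auspices of the United Nations. It was "
    "first established in 1988 by two United Nations organizations, the "
    "World Meteorological Organization (WMO) and the United Nations "
    "Environment Programme (UNEP). The ultimate objective of the UNFCCC is "
    "to stabilize greenhouse gas concentrations in the atmosphere at a level "
    "that would prevent dangerous anthropogenic interference with the "
    "climate system."
)

questions = [
    "What organization is the IPCC a part of?",
    "What UN organizations established the IPCC?",
    "What does the UN want to stabilize?",
]
```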
The first thing we want to do is go over to the Hugging Face website, and on the website we just want to go to the Models page, which is here. On this Models page, the thing we want to be looking at is the question answering task.
We have all these tasks because transformers can work with a lot of different things: text summarization, text classification, generation, loads of different use cases. But what we want is question answering, so we click on that, and it filters all of the models available to us down to just question answering models.
This is the power of using the Hugging Face Transformers library: it already has all of these pre-trained models that we can just download and start using. Now, when you want to apply these to specific use cases, you'll probably want to fine-tune a model, which means training it a little further than it has already been trained. But for getting used to how all of this works, all you need to do is download a model, start asking questions, and see how everything actually functions.
Obviously there are a lot of models here: we've got 262 models for question answering, and new ones are being added all the time. A few of the ones that I would recommend are the deepset models. Here are the deepset models; there are eight of them for question answering, and the one we will be using is bert-base-cased-squad2.
Another one that I would definitely recommend trying out is electra-base-squad2, but we will be sticking with BERT base. Now, it's called deepset here because it's from the company deepset, and the model is pulled from their repository on the Hugging Face model hub. So deepset is the organization, and bert-base-cased-squad2 is the model repository.
BERT is obviously the BERT model from Google AI. "base" is the base version of BERT; you can see here we also have BERT large, which is just a larger model, but we're using the base model. "cased" refers to the fact that this model differentiates between uppercase and lowercase characters.
The alternative to this would be "uncased", where there's no differentiation between uppercase and lowercase. Then "squad2" refers to the question answering dataset this model has been trained on, which is the SQuAD 2.0 dataset from Stanford University. So we're going to take this model, deepset/bert-base-cased-squad2, and we're going to load it in here. All we need to do for that is import from transformers, which is the Hugging Face Transformers library.
We're going to import BertForQuestionAnswering. This is a specific class, and using it we can initialize a few different models, not just this one: we could also load BERT large uncased, for example, and there are equivalent classes for RoBERTa and DistilBERT question answering models too. What this class does is load the specified model with its question answering layer added on top.
So this model has been trained with an extra layer specifically for question answering, and we need BertForQuestionAnswering to load it. Otherwise, if you don't have a specific use case and you just want the model itself, you can use the AutoModel class. But we want it for question answering, so we load this one.
Another thing to note is that we're using the PyTorch implementation of BERT here. Transformers works with both TensorFlow and PyTorch as alternative frameworks behind the scenes; in this case we're using PyTorch. If you want to switch over to TensorFlow, all you do is add TF in front of the class name, giving TFBertForQuestionAnswering.
So that is our model, and to actually load it, all we do is use the from_pretrained method. This is where the model name from over here comes into play: we've got deepset/bert-base-cased-squad2, and we just enter that in there.
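In code, loading the model is a single call; a minimal sketch (the commented lines show the TensorFlow alternative mentioned above):

```python
from transformers import BertForQuestionAnswering

# Download and load the pre-trained model, complete with its Q&A head.
model = BertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

# TensorFlow alternative: just prefix the class name with TF.
# from transformers import TFBertForQuestionAnswering
# model = TFBertForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")
```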
Okay, and with that we've actually just loaded the model; that's all we had to do. Of course, there are a few steps before we can actually get data to the model: we need to process the data first. We have this context here, and it's just a string.
BERT doesn't understand strings; BERT understands an array of integers, where each integer represents a token ID. Those token IDs are specific to BERT, and each one uniquely represents a particular word, piece of punctuation, and so on. So we need to convert this string into that BERT-ready format, and to do that we need a tokenizer. Again we go to transformers, and we import the AutoTokenizer class.
We could use the BertTokenizer class here, for example, but we don't need anything that specific; AutoTokenizer is quite generic, and it will load all of the mappings from strings and words into tokens without any issue. So we import AutoTokenizer, and to initialize it we use practically the same syntax as before: the from_pretrained method, and again the same model name.
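In code, loading the tokenizer follows the same pattern:

```python
from transformers import AutoTokenizer

# Load the tokenizer from the same model name so the string-to-token-ID
# mappings match the ones the model was trained with.
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
```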
Okay, and with this we can actually tokenize our data. All we need to do is write tokenizer.encode and pass in one of these questions, so we'll pass in the first one, questions at index zero. Two arguments that we will need to add here are truncation, which we set to True, and padding, which we also set to True.
When we set up these models and the data going into them, BERT in particular will expect up to 512 tokens with every input. Looking at this question, each word is most likely one token, and the question mark at the end will also be a token, so we have around 10 tokens in there.
Because we have padding, the tokenizer can add a set of padding tokens onto the end to bring the total number of tokens up to 512 (strictly speaking, padding=True only pads to the longest sequence in a batch; to always pad a single input to the full 512, you would pass padding="max_length"). Alternatively, say we had 600 tokens: then we'd be relying on truncation to cut the final 88 tokens and bring the total down to 512. That's why we need those two arguments.
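A minimal sketch of the encoding step; note that the video passes padding=True, but padding="max_length" is what actually pads a lone sequence to the full 512 tokens, so that's what's shown here:

```python
# Encode the first question into BERT-ready token IDs.
input_ids = tokenizer.encode(
    questions[0],          # "What organization is the IPCC a part of?"
    truncation=True,       # cut anything beyond max_length
    padding="max_length",  # pad with [PAD] tokens up to max_length
    max_length=512,        # BERT's maximum sequence length
)
print(input_ids[:8])  # e.g. [101, 1327, 2369, ...] (illustrative output)
```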
So let's see what we get from this. You can see here that we have our tokenized input, which BERT will be able to read and understand. Essentially, this 1327 is equivalent to "what", this 2369 is equivalent to "organization", and so on.
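You can check these mappings yourself by converting the IDs back into tokens (the IDs shown are the ones read out in the video; treat them as illustrative):

```python
# Map each token ID back to its token string to see what it represents.
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(list(zip(input_ids, tokens))[:3])
# e.g. [(101, '[CLS]'), (1327, 'What'), (2369, 'organization')]
```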
Now, what you might not recognize here is this 101. For BERT, 101 refers to a special token, which looks like [CLS], and it signifies the start of any sequence. If we tokenize just the string "[CLS]" on its own, we get 101 twice: once for the start-of-sequence token the tokenizer always adds, and once for the "[CLS]" we actually passed in, which the tokenizer reads and converts into 101. We also get a final special token, which we can see here as well: another special token that signifies the end of a sequence, or rather a separator point. If we write it out, we see that this separator token, [SEP], is 102.
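You can confirm these special tokens and their IDs directly from the tokenizer:

```python
# The special tokens BERT uses and their IDs.
print(tokenizer.cls_token, tokenizer.cls_token_id)  # [CLS] 101 -- start of sequence
print(tokenizer.sep_token, tokenizer.sep_token_id)  # [SEP] 102 -- separator / end
print(tokenizer.pad_token, tokenizer.pad_token_id)  # [PAD] 0   -- padding

# Encoding the string "[CLS]" shows the wrapping behaviour described above:
# the tokenizer adds its own [CLS] and [SEP] around the one we typed.
print(tokenizer.encode("[CLS]"))  # [101, 101, 102]
```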
So when we feed this question and this context into our BERT model, BERT will expect it in a format something like this: first the [CLS] start-of-sequence token, then our question tokens (just a list of integer token IDs), then a [SEP] token, followed by our context tokens, which are again followed by a [SEP] token, and finally a set of [PAD] padding tokens that take us up to the 512-token length. That's what the data going into BERT looks like: start of sequence, question, separator, context, separator, padding. It will always look like that going into a BERT Q&A model.
So if we just remove that and this here, what we want to do now is actually set up this tokenizer and our model in a Q&A pipeline. Again, we get this pipeline from the Transformers library, so we come down here and do from transformers import pipeline. Now we want to initialize a pipeline object. To do that, we write pipeline, and then we need to give it a task type. You can see up here we have all of these different tasks, summarization, text generation, and so on, and the pipeline object needs to know which of those we intend to use. To tell it we want question answering, we pass the task name "question-answering", and that sets the wrapper up to handle question answering formats.
So the pipeline will take our input, a question and a context, and convert it into the structure we need for question answering: [CLS], question, separator, context, separator, and padding. It feeds that into our tokenizer, and the token IDs output by the tokenizer are fed into BERT. BERT returns a span start and a span end, which are essentially two numbers that signify the start position and end position of our answer within the context, and the pipeline applies those two numbers to our context to pull out the text of our answer. So it's essentially just a little wrapper that adds a few conveniences so we don't have to worry about any of these conversions ourselves. Now we just pass in our model and the tokenizer as well, and it's as simple as that; that's our pipeline set up.
If we want to use it now, all we need to do is write nlp and pass in a dictionary. Like I said before, this dictionary needs to contain our question and our context. For the question we'll just pass the first of our questions up here again, questions at index zero, and then we also pass our context, which is inside the context variable. This outputs a dictionary containing the score of the answer, which is the model's confidence that this actually is the answer, the start index and end index I mentioned before, and the text those indices map to, which is "United Nations". Our question was "What organization is the IPCC a part of?", and we got "United Nations", which is correct.
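Put together, the pipeline setup and that first query look like this; a minimal sketch, where the score is illustrative and the start and end values are the ones read out in the video (they would differ with the abridged context above):

```python
from transformers import pipeline

# Wrap our model and tokenizer in a ready-made question answering pipeline.
nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)

answer = nlp({
    "question": questions[0],  # "What organization is the IPCC a part of?"
    "context": context,
})
print(answer)
# e.g. {'score': 0.98, 'start': 118, 'end': 132, 'answer': 'United Nations'}

# start and end index straight into the context string:
print(context[answer["start"]:answer["end"]])  # 'United Nations'
```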
Let me show you what I mean by this start and end. If we take the context at index 118, we get the first letter of our answer, because that pulls out that specific character. If we then slice all the way up to our end at 132, we get the full answer: we're pulling out all the characters from the "U" at index 118 up to index 132, which is actually the comma here, but since Python slicing excludes the end index, we get the character before, and that gives us "United Nations", our answer.
So let's ask another question: what UN organizations established the IPCC? We get "WMO and United Nations Environment Programme (UNEP)". If we go in here, we can see "it was first established in 1988 by two United Nations organizations, the World Meteorological Organization (WMO) and the United Nations Environment Programme (UNEP)". So here we have two organizations, and the model is only actually pulling out one of them in full; I think the reason is that all it's reading is "WMO and United Nations Environment Programme (UNEP)", so it does pull out both organizations in the end, just not the full name of the first one. It's still a pretty good result.
Let's go down to the final question: what does the UN want to stabilize? Here we get the answer "greenhouse gas concentrations in the atmosphere". If we go down here, we can see "the ultimate objective of the UNFCCC is to stabilize greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic interference with the climate system". So again we're getting the right answer: it wants to stabilize greenhouse gas concentrations.
So our model has gone through each one of those questions and successfully answered them, and all we've done is write a few lines of code, without any fine-tuning at all. Now, when you go and apply these models to your own problems, sometimes you won't need any fine-tuning and the model as-is will be more than enough, but a lot of the time you will need to fine-tune, and in that case there are a few extra steps. For this introduction, though, that's everything I wanted to cover. In terms of fine-tuning, I have covered that in another video, so I will put a link to it in the description. That's everything for this video, so thank you very much for watching. I hope you enjoyed it, and I will see you again next time. Thanks, bye!