Back to Index

Table Question-Answering with TAPAS in Python


Chapters

0:00 Intro
1:04 Table QA process
3:38 Getting the code
4:08 Colab GPU and prerequisites
4:33 Dataset download and preprocessing
6:10 Table QA retrieval pipeline
11:29 First test, can it retrieve tables?
12:55 TAPAS model for table QA
15:04 Asking more table QA questions
17:37 Asking advanced aggregation questions to TAPAS
19:38 Final thoughts

Transcript

Today we're going to be taking a look at table question answering, which is essentially as if you could ask an Excel sheet a question like "what is the GDP across both China and Indonesia?" and it would be able to look at the table, identify the two parts of the table that are relevant to that question, sum those together, and return you that answer.

But imagine we take that and we apply it not to just one table in Excel but to millions or even billions of tables, and the system is actually capable of taking our question, retrieving the correct table to answer that question, and then repeating the process I just mentioned before, where it extracts the specific parts of the table that are relevant to our query and even performs operations like summing or averaging over those values.
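As a point of reference, the single-table version of what's described above is something you could do by hand with pandas; the country names and figures below are made up purely for illustration:

```python
import pandas as pd

# A toy GDP table, standing in for the kind of table the system retrieves.
table = pd.DataFrame({
    "Country": ["China", "Indonesia", "Germany"],
    "GDP (millions of USD)": [27_800_000, 3_800_000, 4_200_000],
})

# "What is the GDP across both China and Indonesia?" done manually:
# identify the relevant rows, then sum the relevant column.
relevant = table[table["Country"].isin(["China", "Indonesia"])]
total = relevant["GDP (millions of USD)"].sum()
print(total)  # 31600000
```

Table QA automates exactly those two steps: selecting the relevant cells and applying the aggregation.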

That is what we're going to learn about in this video. So let me just describe the process that we're going to be taking in order to implement this. We're going to start with a vector database (naturally we'll be using Pinecone for that), and then what we do is add something called a retriever model.

Now this retriever model will be an MPNET model. Typically with natural language semantic search MPNET is a really good option, but this MPNET model has been trained specifically for reading tables, so this is our retriever model. Now what's going to happen is we're going to ask a question like the one I mentioned before, so something like "what is the GDP across certain conditions?"

So we're going to ask that sort of question, we're going to take it, and it's going to go into this retriever model, which is our MPNET table retriever model, and it's going to encode that text into a vector, and that vector represents the meaning behind that question. So that MPNET-encoded vector goes into our vector database; our vector database then returns relevant tables, which have also been encoded by that MPNET table model, and it passes them to our next model, which is going to be a table reader model.
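Under the hood, "returns relevant tables" is just nearest-neighbour search over vectors. Here is a minimal sketch with plain Python lists, using toy 3-dimensional vectors in place of real 768-dimensional MPNET embeddings, and dot product as the similarity score:

```python
# Toy vector index: id -> embedding. In the real pipeline these would be
# MPNET embeddings of the flattened tables, stored in Pinecone.
index = {
    "gdp_table": [0.9, 0.1, 0.0],
    "football_table": [0.0, 0.8, 0.6],
}

def dot(a, b):
    """Dot-product similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def query(index, query_vec, top_k=1):
    """Return the top_k ids ranked by similarity to the query vector."""
    ranked = sorted(index, key=lambda i: dot(index[i], query_vec), reverse=True)
    return ranked[:top_k]

# A query vector close to the GDP table's embedding retrieves it.
print(query(index, [1.0, 0.0, 0.1]))  # ['gdp_table']
```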

It's going to be a table reader. Now, this table reader model is also going to read our question from up here, so it's going to see both of those: it's going to see what we've returned from Pinecone over here and also our question. And for the table reader we're going to be using a model called TAPAS.

Now what TAPAS can do is what I mentioned before, where we take a table, essentially, and it's going to identify the parts of that table that answer our particular question, and it's also going to say whether we need to sum over those parts, whether we need to average, or whether we don't need to do anything at all because the value itself is relevant and answers our question.
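Putting the whole flow just described together, the shape of the system is roughly this; every function below is a stand-in dummy (the real components are the MPNET encoder, Pinecone, and TAPAS, which come later in the video):

```python
def answer_question(question, encode, search, read):
    """Retriever -> vector DB -> reader, as described above.

    encode: question text -> vector
    search: vector -> best-matching table
    read:   (question, table) -> answer (possibly with an aggregation label)
    """
    query_vec = encode(question)
    table = search(query_vec)
    return read(question, table)

# Dummy components just to show the flow of data:
encode = lambda q: [len(q)]                                # stand-in embedding
search = lambda v: [["China", 27.8], ["Indonesia", 3.8]]   # stand-in retrieval
read = lambda q, t: ("SUM", [row[1] for row in t])         # stand-in TAPAS

print(answer_question("total GDP of China and Indonesia?", encode, search, read))
# ('SUM', [27.8, 3.8])
```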

So that's the system we're going to be building. Let's move on to the code and we'll start putting that all together. Okay, so we're going to be running through this table question answering example from Pinecone, which you can find at docs.pinecone.io/docs/table-qa, and then what I'm going to be doing is just going through the Colab, so you can just click on "Open Colab" and run through the exact same code that I'm going to be going through.

So that will open this Colab notebook here. This is another really cool idea and example notebook from Archeroke, so again, thank you for that. Now, the first thing we want to do is come up to Runtime, go to "Change runtime type", and switch this to GPU, or make sure it is already on GPU; that will just make things a lot faster later on.

There are a few prerequisites that we need to install; torch-scatter might take a little bit of time, so if it is taking some time to install everything, that is the reason why. I'm not going to rerun that because I have already done it, and then what we want to do is download this dataset here.

This is just from the Hugging Face datasets hub; Archeroke has uploaded it. It is a subset of the Open Table-and-Text Question Answering (OTT-QA) dataset, which is built from text and tables from Wikipedia. Now, once that has downloaded, we'll see that we have a few features: the URL (so, where it is from), the title, the headers (which are literally the headers of the table), and then the data within that table.
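For reference, loading that dataset looks something like the following; the dataset id is left as a placeholder (check the Colab for the real name), and the record below is a mocked-up example of the fields just described:

```python
# In the notebook the dataset comes from the Hugging Face hub, e.g.:
# from datasets import load_dataset
# data = load_dataset("<uploader>/<table-qa-subset>", split="train")  # exact id: see the Colab

# Each record carries the fields described above; a mocked-up example:
record = {
    "url": "https://en.wikipedia.org/wiki/Example",  # where the table is from
    "title": "Example article",
    "header": ["Level", "Team", "League", "Manager"],
    "data": [
        ["Triple-A", "Team A", "League X", "Manager 1"],
        ["Rookie", "Team B", "League Y", "Manager 2"],
    ],
}
print(record["header"])
```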

So we can have a look at one of those. Now, the bits that we are most interested in are here: the headers. So this is about American football, no, baseball I think, one of those things, I'm not sure. And you have your headers here (level, team, league, manager), and then you have the data, so in level you'd have Triple-A, Double-A, A and Rookie, and then so on with the other bits of data in there as well.
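The conversion of such a record into a pandas DataFrame, plus the comma-separated text format the retriever expects later on, can both be sketched like this (the exact helper names in the notebook may differ):

```python
import pandas as pd

def to_dataframe(header, rows):
    """Turn a raw (header, rows) table record into a pandas DataFrame."""
    return pd.DataFrame(rows, columns=header)

def to_text(df):
    """Flatten a table for the retriever: headers first, then one
    comma-separated line per row."""
    lines = [",".join(df.columns)]
    lines += [",".join(str(cell) for cell in row) for row in df.itertuples(index=False)]
    return "\n".join(lines)

df = to_dataframe(
    ["Level", "Team"],
    [["Triple-A", "Team A"], ["Rookie", "Team B"]],
)
print(to_text(df))
# Level,Team
# Triple-A,Team A
# Rookie,Team B
```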

Now what we can do is format all of those into pandas DataFrames, which just makes things a lot easier for us in both reading and later formatting, so let's go ahead and do that. This again might take a moment to run; okay, 14 seconds. Then we can run this and have a look at what I just showed you before, now in table format, and we'll see it's a lot easier to read, nice formatting.

So now what I want to do is move on to that retriever. Remember, in that visual before we had the Pinecone vector database, which led into the MPNET table retriever model; we're going to go ahead and initialize all of that. For the retriever we're going to be using this deepset/all-mpnet-base-v2-table model, so we execute that. As I said, this model has been fine-tuned specifically for retrieving and embedding table-like data and matching it up to natural language queries. Once that has downloaded, we'll see this kind of explainer or summary of the model: we have the MPNET transformer model fine-tuned specifically for this, we have the pooling method (it is using mean pooling, you can see that there), and there's a normalization step after. Because it has that normalization, we can use cosine similarity, which works whether or not there is normalization, and we can also use dot product similarity, precisely because we have that normalization component.

Now, this retriever does expect tables to be in a particular format, so we need to initialize this, and let's have a look at what that format actually looks like. We are going to have something like this: looking again at that same table, at the top here we have the headers, then we have a newline character, and then we have the next row of the table, with all the values separated by commas as you can see, and then a newline again. So essentially we're just reformatting it into a comma-separated file format.

Now the next thing we want to do is initialize our Pinecone vector database. For that we need an API key, which is free, and we
can get it from this link here if you're in the notebook, or if not, we just head on over to app.pinecone.io. This will either lead you to a sign-up or sign-in page, or it will lead to this if you're already signed up. What you need to do is head over to your default project (or any other project, if you have other projects in there), go to "API Keys", go to "default" here, and you want to copy this and then just paste it in here. Now, I have pasted mine into a variable, so I can add that in there, run this, and that just initializes our connection to Pinecone.

From there, what we need to do is create a new vector index, where we're going to store all of these formatted table objects after they've been encoded by our retriever model. I'm going to call my index "table-qa". I'm going to use cosine here, although, like I said before, we could also use dot product similarity. The dimensionality just aligns with the model, and we can actually see that if we do retriever.get_sentence_embedding_dimension(), and we get this 768. So we could also put that in here if we wanted; rather than hard-coding it, you can just do this. And, yeah, we run that. For me, I've already created this index, so that will happen very quickly; if you haven't created the index, that will probably take like 10 to 15 seconds to run.

Then what we want to do is go through our entire dataset in batches of 64. We get our processed tables, we then encode them using our retriever model, and the output of that we need to convert into a list for Pinecone. We create a set of unique IDs; now, this is just a count. If you prefer, you can use something else, but this works for this example, so leave it at that. We add all of those into what we call an upsert list, so we're just going to pass the IDs and embeddings. We're going to store the tables locally, but if you wanted to, we could also store the plain-text version of
the tables in a metadata dictionary and upload those, but we're not going to do that; we're just going to use the local ones for the sake of simplicity. Then what we want to do is just upsert all of these into Pinecone. We would run that, and that will take a little bit of time, I don't think too long, maybe six or seven minutes on Colab, but I have already run it. Okay, I can see it's working again here, so now I'm just going to stop it, because all of these have already been uploaded into my vector index.

Now what we're going to do is begin asking questions. So this is not the full pipeline yet; right now we just have the vector database and the retriever, we don't have the later table reader, and we're going to implement that in a moment, but for now I just want to see: is it going to return the correct table for us? We're going to say "what was the GDP of China in 2020?", we're going to encode that using the retriever to create our query vector, and then we're going to pass that to Pinecone. We're just going to return the top table; we could return several tables, we could return 100 if we wanted, but we're just going to be applying a reader to a single table, so I'm going to go with that for now.

Okay, you see that we get this ID here. Now this ID, like I said, is from the count that we created earlier, so we can actually use that value to extract that specific item from the tables variable we created earlier, and you can see that it does seem to give us a pretty relevant table. Right at the top here it says China, and we have their GDP in millions of USD and the year as well, which is 2020, so that looks pretty accurate.

Okay, so we've retrieved the correct table. Now what we need to do is extract that specific piece of information using our table reader model. For the table reader model we're going to be using a TAPAS model that has been fine-tuned for this specific task, and to do that we need this, so we are going to use
the model name google/tapas-base-finetuned-wtq, and we're going to be using the Hugging Face transformers library. We need to initialize a TAPAS tokenizer, which is going to convert our natural language query and the tables themselves into tokens, or token IDs. Those get passed into this TapasForQuestionAnswering, which is a TAPAS transformer model followed by a question answering head. It will basically go through all of those, and it will identify the specific part of the table that answers our question, and it can also do things like say whether we need to sum certain values within that table, or whether we need to average them, or do all these different operations, which is pretty impressive, in my opinion at least. We're going to package all of that up into this pipeline here, which is a table question answering pipeline, and it just includes our model and the tokenizer.

We run that, and then what we can do is pass the table that we retrieved, so the China GDP table, and we also pass our query, which is "what is the GDP of China in 2020?", and run that. Okay, cool, so you see that it wants us to take the average over one cell. It is correct that we should just take the average over this one cell, because that is our answer, the 27.8 million (millions of USD, I think it is). So if we come up here, we can see it right there. Okay, so that is our answer.

Now what I want to do is ask more questions, but I want to do it a lot more efficiently than writing all that code out again, so I'm just going to create a few functions here that will help us. So, query Pinecone, which is going to retrieve the relevant information, match that up to a particular table and return it to us, and then we want this, which is just get the answer from the table, and that is just going to feed everything into our pipeline and return those answers. Okay, so for this first question I'm asking: which car manufacturers produce cars with a top speed
of above 108 kilometers per hour? Now, you can see that this is again a super relevant table, and this is already, at least for me, impressive in itself, that it is managing to get this. And we can see max speed, okay: 220, 190, 185, 186, so there are four manufacturers that do that, that is Fiat, Bugatti, Benz and Miller. So we come down here and we're going to do get answer from table, and we get this: Fiat, Bugatti, Benz and Miller is our answer. There's no aggregator; this is text, so it's saying, okay, you don't need to average or do anything here, these are just the answers.

Okay, let's do another one: which scientist is known for improving the steam engine? We can see in this table, if we have a look here, "for improving the steam engine", so we should expect the answer to be George Henry Corliss. Let's get the answer from the table: George Henry Corliss, pretty cool.

Let's do another one, another kind of simple query, and then we'll move on to more advanced queries: what is the Maldivian island name for the OBLU Select at Sangeli resort? We can see OBLU Select at Sangeli, and we have Akirifushi. It's probably terrible pronunciation, I'm very sorry to any Maldivians watching this, and, yeah, we get the right answer, of course.

So that in itself is already really impressive, but it actually doesn't stop there; it gets even more insane than this. We can start asking really more complex questions that take more than one step for this model to figure out. So I want to say: what was the total GDP of China and Indonesia in 2020? Let's query; we should get the same table that we got before, yes we do, and then we want to get the answer from this table, and we get this aggregator, sum. So it wants to sum these two values here, the 27.8 million and 3.8 million, and you can see here that this is correct. So we could maybe just add a little bit of a wrapper function that consumes different types of aggregators, like sum or average, and just handles that little bit of logic
at the end there, and we have our answer, which is insane. So that's another thing that's really, really impressive, and it's not just sum. We actually kind of saw this earlier, although it wasn't being used in the right way there, but let's have a look at this: what is the average carbon emission of power stations in Australia, Canada and Germany? Let's take a look. Okay, looks pretty accurate, although this is just like a random selection of different power stations in these different countries, so it's not perfect, but nonetheless we can go with this. And then we can see, okay, we have an aggregator, average, and we need to average over these values here. So number one is not being very good; who is that? Australia, yeah, very bad. But that is really impressive, at least to me; I was pretty blown away by this example.

So that's it for this video. I hope this has been interesting and useful; it definitely is for me. I'm really enjoying seeing how we can actually apply question answering to tables, and even more so with the little aggregations at the end, a very little feature, but I think it makes a pretty big difference. So thank you very much for watching the video, I hope it has been useful, and I will see you again in the next one. Bye.
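That little wrapper function around the aggregator is easy to sketch. Assuming the result dict carries `cells` and `aggregator` keys, which is what the Hugging Face table-question-answering pipeline returns, and that the selected cells are numeric whenever SUM or AVERAGE is requested, a minimal post-processing step might look like:

```python
def apply_aggregator(result):
    """Turn a TAPAS pipeline result into a final answer.

    result: the dict returned by the table-question-answering pipeline,
    e.g. {"cells": ["27.8", "3.8"], "aggregator": "SUM", ...}.
    """
    cells = result["cells"]
    agg = result.get("aggregator", "NONE")
    if agg in ("SUM", "AVERAGE"):
        # Strip thousands separators before converting to numbers.
        values = [float(str(c).replace(",", "")) for c in cells]
        total = sum(values)
        return total if agg == "SUM" else total / len(values)
    if agg == "COUNT":
        return len(cells)
    # NONE: the cell values themselves are the answer.
    return ", ".join(cells)

print(apply_aggregator({"cells": ["27.8", "3.8"], "aggregator": "SUM"}))
print(apply_aggregator({"cells": ["Fiat", "Bugatti"], "aggregator": "NONE"}))  # Fiat, Bugatti
```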