
Table Question-Answering with TAPAS in Python


Chapters

0:00 Intro
1:04 Table QA process
3:38 Getting the code
4:08 Colab GPU and prerequisites
4:33 Dataset download and preprocessing
6:10 Table QA retrieval pipeline
11:29 First test, can it retrieve tables?
12:55 TAPAS model for table QA
15:04 Asking more table QA questions
17:37 Asking advanced aggregation questions to TAPAS
19:38 Final thoughts

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're going to be taking a look at table question answering, which is essentially if you
00:00:06.240 | could ask an Excel sheet a question like what is the GDP across both China and Indonesia, and it
00:00:17.200 | would be able to look at the table, identify the two parts of the table that are relevant to that
00:00:23.680 | question, sum those both together and return you that answer. But imagine we take that and we apply
00:00:31.280 | it not to just one table in Excel, but we apply it to millions or even billions of tables, and the
00:00:41.360 | system is actually capable of taking our question, retrieving the correct table to answer that
00:00:46.640 | question and then repeating the process I just mentioned before where it extracts the specific
00:00:52.240 | parts of the table that are relevant to our query and even performs operations like summing over
00:00:58.560 | those values or averaging over those values. That is what we're going to learn about in this video.
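Before the walkthrough, here is a toy illustration of that end result on a single table, using plain pandas. The table and numbers are invented, and the row selection and sum are done by hand here rather than by a model; the system described in the video automates exactly these two steps.

```python
import pandas as pd

# A made-up GDP table standing in for the kind of table the system works with
table = pd.DataFrame({
    "Country": ["China", "Indonesia", "Germany"],
    "GDP (millions USD)": [14_700_000, 1_060_000, 3_840_000],
})

# Step 1: identify the parts of the table relevant to the question
relevant = table[table["Country"].isin(["China", "Indonesia"])]

# Step 2: perform the operation the question implies (a sum here)
answer = relevant["GDP (millions USD)"].sum()
print(answer)  # 15760000
```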
00:01:04.400 | So let me just describe the process that we're going to be taking in order to implement this.
00:01:10.640 | We're going to start with a vector database
00:01:16.480 | here naturally we'll be using Pinecone for that then what we do is we add something called a
00:01:23.680 | retriever model. Now this retriever model will be an MPNet model. Typically, with natural language
00:01:31.760 | semantic search, MPNet is a really good option, but this MPNet model has been trained for tables,
00:01:38.720 | or reading tables, so this is our retriever model. Now what's going to happen is we're
00:01:46.720 | going to ask a question like the question I mentioned before so something like what is GDP
00:01:52.480 | across certain countries. So we're going to ask that sort of question, we're going to take that,
00:02:01.440 | it's going to go into this retriever model which is our MPNET table retriever model and it's going
00:02:07.760 | to encode that text into a vector and that vector represents the meaning behind that question. So
00:02:17.600 | that MPNET encoded vector goes into our vector database our vector database then returns
00:02:25.360 | relevant tables which have also been encoded by that MPNET table model and it returns them
00:02:34.480 | to our next model which is going to be a table reader model.
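To make the retrieval step concrete, here is a toy stand-in: tables serialized as comma-separated text and matched to the question with crude bag-of-words vectors and cosine similarity. A real system would use the trained MPNet retriever and a vector database instead; all names and data below are invented, only the mechanics are the same.

```python
import math
from collections import Counter

def embed(text):
    # Crude bag-of-words "vector"; the real system uses MPNet embeddings
    return Counter(text.lower().replace(",", " ").replace("?", " ").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tables serialized as comma-separated text: headers, newline, then each row
tables = [
    "Country,GDP (millions USD),Year\nChina,14700000,2020\nIndonesia,1060000,2020",
    "Player,Team,League\nSmith,Metros,National",
]

# Encode the question and return the index of the most similar table
query = "What was the GDP of China in 2020?"
qv = embed(query)
best = max(range(len(tables)), key=lambda i: cosine(qv, embed(tables[i])))
print(best)  # 0, the GDP table
```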
00:02:38.160 | Now this table reader model is also going to read our question
00:02:49.440 | from up here so it's going to see both of those it's going to see what we've returned from Pinecone
00:02:55.040 | over here and also our question and for the table reader we're going to be using a model called
00:03:01.200 | TAPAS. Now what TAPAS can do is what I mentioned before where we take a table essentially
00:03:09.760 | and it's going to identify the parts of that table that answer our particular question
00:03:19.520 | and it's going to also say whether we need to sum over those parts whether we need to average or
00:03:25.440 | whether we don't even need to do anything whether just the value itself is relevant and answers our
00:03:31.280 | question. So that's the system we're going to be building let's move on to the code and we'll start
00:03:37.440 | putting that all together. Okay so we're going to be running through this table question answering
00:03:42.320 | document example from Pinecone, so you can find that at docs.pinecone.io/docs/table-qa, and
00:03:52.320 | then what I'm going to be doing is just going through the Colab so you can just click on open
00:03:55.600 | Colab and run through the exact same code that I'm going to be going through. So that will open this
00:04:01.200 | Colab notebook here this is another really cool idea and example notebook from Archeroke so again
00:04:06.960 | thank you for that. Now the first thing we want to do is come up to runtime go to change runtime type
00:04:14.080 | and switch this or make sure this is on GPU if it is not that will just make things a lot faster
00:04:19.680 | later on. There are a few prerequisites that we need to install torch scatter might take a little
00:04:24.480 | bit of time so if it is taking some time to install everything that is the reason why. I'm not going to
00:04:30.800 | rerun that because I have already done it, and then what we want to do is initialize this
00:04:37.120 | dataset here. This is just from the Hugging Face datasets hub; Archeroke has uploaded it. It is a
00:04:45.040 | subset of the Open Table-and-Text Question Answering (OTT-QA) dataset, which is just text and
00:04:51.120 | tables from Wikipedia. Now once that has downloaded, we'll see that we have a few features: URL, so
00:04:58.000 | where it is from; title; headers, which is literally the headers of the table; and then data within
00:05:04.400 | that table. So we can have a look at one of those. Now the bits that we are most interested in are
00:05:11.520 | here, so the headers. So this is about American football, no, baseball I think, one of those things,
00:05:23.360 | I'm not sure, and you have your headers here: level, team, league, manager. And then you have the data,
00:05:32.000 | so in level you'd have Triple-A, Double-A, Single-A and Rookie, and then so on with the other bits of data
00:05:39.120 | in there as well. Now what we can do is we can format all those into pandas DataFrames, which
00:05:45.360 | just makes things a lot easier for us in both reading and later formatting so let's go ahead
00:05:51.600 | and do that. This again might take a moment to run; okay, 14 seconds. And then we can run this and we
00:05:58.720 | can have a look at what I just showed you before, now in table format. So now we'll see it's a lot
00:06:06.000 | easier to read, nice formatting. So, great. Now what I want to do is move on to that retriever.
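As an aside, the conversion into pandas DataFrames just performed can be sketched like this; the headers and rows are invented stand-ins for the dataset's actual fields.

```python
import pandas as pd

# Each record in the dataset carries header and data fields (invented here);
# pandas turns them into an easy-to-read table
headers = ["Level", "Team", "League", "Manager"]
data = [
    ["Triple-A", "Example City A", "League One", "Manager A"],
    ["Double-A", "Example City B", "League Two", "Manager B"],
]

df = pd.DataFrame(data, columns=headers)
print(df.shape)  # (2, 4)
```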
00:06:13.520 | Remember, in that visual before, we had the Pinecone vector database, which led into the
00:06:21.360 | MPNet table retriever model; we're going to go ahead and initialize all of that. So for the retriever,
00:06:29.600 | we're going to be using this deepset/all-mpnet-base-v2-table model, so we execute that. As I said,
00:06:37.120 | this model has been fine-tuned specifically on retrieving and embedding table-like data and
00:06:46.400 | matching those up to natural language queries. Now once that has downloaded, we'll see this kind of
00:06:53.680 | explainer or summary of the model. So we have the MPNet transformer model, fine-tuned specifically
00:07:00.320 | for this; we have the pooling method, it is using mean pooling, you can see that there,
00:07:06.960 | and there's a normalization step after. So because it has that normalization, that means we can use
00:07:12.240 | both cosine similarity, which we can use whether there is normalization or not, and we can also use dot
00:07:20.240 | product similarity, because we have that normalization component. Now this retriever does
00:07:27.360 | expect tables to be in a particular format so we need to initialize this and let's have a look what
00:07:33.440 | that format actually looks like so we are going to have something like this so looking again at
00:07:41.840 | that same table at the top here we have the headers and then we have a new line character
00:07:48.320 | okay and then we have the next row of the table all these separated by commas as you can see and
00:07:54.240 | then we have a new line again so essentially we're just reformatting it into a comma separated
00:08:01.120 | file format now the next thing we want to do is initialize our pinecone vector database for that
00:08:10.720 | we need an api key which is free and we can get it from this link here if you're in the notebook
00:08:16.320 | or if not we just head on over to app.pinecone.io this will either lead you to a sign up or sign
00:08:26.240 | in page or it will lead to this if you're already signed up and what you need to do is head over to
00:08:31.680 | your default project or any other project if you have other projects in there you go to api keys
00:08:39.600 | go to default here and you want to copy this and then you need to just paste it into here now
00:08:46.720 | I have pasted mine into a variable called api_key, so I can add that in there, run this, and that just
00:08:55.920 | initializes our connection to pinecone from there what we need to do is create a new vector index
00:09:02.800 | we're going to store all of these formatted table objects but after they've been encoded
00:09:09.920 | by our retriever model so i'm going to call my index table qa i'm going to use cosine here
00:09:17.040 | although, like I said before, we could also use dot product similarity. The dimensionality just
00:09:22.640 | aligns with the model, so we can actually see that if we do model.get_sentence_
00:09:31.280 | embedding_dimension() on our retriever,
00:09:35.760 | okay, then we get this 768, so we could also put that in here if we wanted,
00:09:44.480 | so rather than hard coding it you can just do this and yeah we run that for me
00:09:50.640 | i've already created this index so that will happen very quickly if you haven't
00:09:56.640 | created the index that will probably take like 10-15 seconds to run then what we want to do is
00:10:02.960 | we're essentially going to go through our entire data set in batches of 64 we are going to
00:10:09.840 | get our processed tables, we're going to then encode them using our retriever model,
00:10:19.360 | and the output of that we need to convert into a list for pinecone we're going to create a set
00:10:25.840 | of unique IDs. Now this is just a count; if you prefer, you can use something else, but this
00:10:32.160 | works for this example, so we'll leave it at that. We add all those into what we call an upsert list,
00:10:41.360 | so we're just going to pass the IDs and embeddings. We could also, if we wanted to (though we're going to
00:10:47.520 | store the tables locally), store the plain text version of the tables in a metadata
00:10:55.040 | dictionary and upload those but we're not going to do that we're just going to use the local ones
00:11:01.280 | for the sake of simplicity and then what we want to do is just upsert all these into pinecone
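The batching loop just described can be sketched without the model or network calls; below, zero vectors of the model's 768 dimensions stand in for the retriever's embeddings, and the actual Pinecone upsert call is left as a comment, so treat this as a sketch of the batching logic only.

```python
# Stand-in for the processed (serialized) tables from earlier steps
processed_tables = [f"table {i} as comma-separated text" for i in range(200)]

batch_size = 64
upserts = []
for start in range(0, len(processed_tables), batch_size):
    batch = processed_tables[start:start + batch_size]
    # A real run would call retriever.encode(batch).tolist() here;
    # placeholder 768-dimensional vectors stand in for the embeddings
    embeddings = [[0.0] * 768 for _ in batch]
    # IDs are just a running count, matching the notebook's approach
    ids = [str(start + i) for i in range(len(batch))]
    upserts.append(list(zip(ids, embeddings)))
    # index.upsert(vectors=upserts[-1])  # the actual Pinecone call

print([len(b) for b in upserts])  # [64, 64, 64, 8]
```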
00:11:07.680 | we would run that that will take a little bit of time i don't think too long maybe six or seven
00:11:14.800 | minutes on colab but i have already run it so okay i can see it's working again here so now i'm just
00:11:22.320 | going to stop it because all of these have already been uploaded into my vector index now what we're
00:11:29.760 | going to do is begin asking questions. So this is not the full pipeline; right now we just
00:11:36.400 | have the vector database and we have the retriever. We don't have the later table reader,
00:11:42.080 | and we're going to implement that in a moment but for now i just want to see is it going to
00:11:45.840 | return the correct table for us so we're going to say what was the gdp of china in 2020 we're going
00:11:53.520 | to encode that using the retriever to create our query vector and then we're going to pass that
00:12:00.480 | to Pinecone, and we're just going to return the top table. We could return several tables if we
00:12:06.080 | wanted, even 100 if you wanted, but we're just going to be applying a reader to a single
00:12:12.560 | table so i'm going to go with that for now okay and you see that we get this id here now this id
00:12:21.440 | is like i said it was a count that we created earlier so we can actually use that value in
00:12:27.360 | order to extract from our tables that we created earlier the tables variable we can just extract
00:12:34.800 | that specific item and you can see that it does seem to give us a pretty relevant table so right
00:12:43.360 | top here of china and we have their millions of usd in gdp and the year as well which is 2020 so
00:12:50.720 | that looks pretty accurate okay so we've retrieved the correct table now what we need to do is
00:12:58.000 | extract that specific piece of information using our table reader model now for the table reader
00:13:05.520 | model we're going to be using a tapas model that has been fine-tuned for this specific task
00:13:10.560 | and to do that we need this: we are going to use the model name google/tapas-base-finetuned-wtq,
00:13:22.320 | and we're going to be using the Hugging Face transformers library, and we need to initialize a
00:13:30.400 | tapas tokenizer which is going to convert our natural language query and the tables themselves
00:13:38.160 | into tokens, or token IDs. They get passed into this TapasForQuestionAnswering, which is a TAPAS
00:13:46.160 | transformer model followed by a question answering head, and it'll basically go through all those and
00:13:52.240 | it will identify the specific part of the table that answers our question and it can also do
00:13:58.400 | things like say whether we need to sum certain values within that table or whether we need to
00:14:03.440 | average them or do all these different operations which is pretty impressive in my opinion at least
00:14:09.840 | so we're going to package all that up into this pipeline here which is a table question
00:14:15.760 | answering pipeline and it would just include our model and the tokenizer we run that and then what
00:14:21.200 | we can do is we'll pass the table that we retrieved okay so the china gdp table we pass that
00:14:29.360 | and we also pass our query which is what is the gdp of china in 2020 okay and run that okay cool so
00:14:40.000 | you see that it wants us to take the average over one cell okay so it is correct that we should just
00:14:49.280 | take the average over this one cell because that is our answer so the 27.8 million million i think
00:14:56.240 | it is in usd so if we come up here we can see it right there okay so that is our answer
00:15:04.320 | now what i want to do is i want to ask more questions okay i'm going to ask more questions
00:15:09.920 | but i want to do it a lot more efficiently than writing all that code out again so i'm just going
00:15:15.280 | to create a few functions here that will help us. So, query_pinecone, which is going to retrieve
00:15:22.800 | the relevant information and then match that up to a particular table return that to us
00:15:30.800 | and then we want this, which is just get_answer_from_table,
00:15:35.680 | and that is just going to feed everything into our pipe and return those answers.
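Those two helpers can be sketched with their dependencies passed in explicitly. The retriever, index, and pipe objects are assumed to expose the interfaces used in the notebook (an encode method, a query method returning matches, and a table/query call), so treat this as a shape sketch rather than the notebook's exact code.

```python
def query_pinecone(query, retriever, index, tables, top_k=1):
    # Encode the question and pull the best-matching table back by its id
    xq = retriever.encode([query]).tolist()
    result = index.query(xq, top_k=top_k)
    table_id = int(result["matches"][0]["id"])
    return tables[table_id]

def get_answer_from_table(query, table, pipe):
    # Feed the retrieved table plus the question into the TAPAS pipeline
    return pipe(table=table, query=query)
```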
00:15:40.320 | okay so for this first question i'm saying which car manufacturers produce cars with top speed of
00:15:50.400 | above 108 kilometers per hour now you can see that this is again a super relevant table and
00:15:58.080 | this is already at least for me impressive in itself that is managing to get this and we can
00:16:03.840 | see max speed, okay: 220, 190, 185, 186. So there's four manufacturers that do that: that is Fiat, Bugatti,
00:16:16.720 | Benz and Miller. So we come down here and we're going to do get_answer_from_table,
00:16:25.280 | and we get this: so Fiat, Bugatti, Benz and Miller is our answer. There's no aggregator; this is text,
00:16:32.000 | so it's saying okay you don't need to average or do anything here these are just the answers
00:16:37.280 | okay, let's do another one: which scientist is known for improving the steam engine? Okay, and we can see
00:16:45.600 | in this table, if we have a look here, "for improving the steam engine", so we should expect the answer to
00:16:50.640 | be George Henry Corliss. Let's get the answer from the table: George Henry Corliss. Pretty cool. Let's do
00:16:59.680 | another one another kind of simple query and we'll move on to more advanced queries so what is the
00:17:04.320 | Maldivian island name for the OBLU Select at Sangeli resort? Okay, and we can see OBLU
00:17:12.960 | Select at Sangeli, and we have Akirifushi. It's probably terrible pronunciation, I'm very sorry
00:17:22.560 | to any maldivians watching this and yeah we get the right answer of course so that in itself is
00:17:31.920 | already really impressive but it actually doesn't stop there it gets even more insane than this we
00:17:37.600 | can start asking really more complex questions that take sort of more than one step for this
00:17:44.560 | model to figure out so i want to say what was the total gdp of china and indonesia in 2020 okay let's
00:17:52.800 | query we should get the same table that we got before yes we do and then we want to get the
00:17:59.200 | answer from this table, and we get this. So we get this aggregator, sum, so it wants us to sum these two
00:18:08.240 | values here, okay: so the 27.8 million and 3.8 million, and you can see here that this is
00:18:17.760 | correct, right? So we could just maybe add a little bit of a wrapper function that consumes different
00:18:24.000 | types of aggregators, like sum or average, and just handles that little bit of logic at the end there.
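Such a wrapper might look like the sketch below. The result dictionary shape (answer, cells, aggregator) follows what the transformers table-QA pipeline returns, though the numeric parsing here is a simplifying assumption for illustration.

```python
def resolve_answer(result):
    # `result` mirrors the table-QA pipeline output:
    # {'answer': ..., 'cells': [...], 'aggregator': 'SUM'|'AVERAGE'|'COUNT'|'NONE'}
    agg = result.get("aggregator", "NONE")
    if agg == "NONE":
        return result["answer"]  # plain text answer, nothing to compute
    values = [float(cell.replace(",", "")) for cell in result["cells"]]
    if agg == "SUM":
        return sum(values)
    if agg == "AVERAGE":
        return sum(values) / len(values)
    if agg == "COUNT":
        return len(values)
    return result["answer"]

print(resolve_answer({"answer": "SUM > 27.8, 3.8",
                      "cells": ["27.8", "3.8"],
                      "aggregator": "SUM"}))
```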
00:18:30.960 | And we have our answer, which is insane. So that's another thing that's really, really impressive, and
00:18:39.440 | it's not just sum. We actually kind of saw this earlier, although it wasn't used in the
00:18:45.200 | right way, but let's have a look at this: what is the average carbon emission of power stations in
00:18:51.440 | australia canada and germany okay let's take a look okay looks pretty accurate although this is
00:19:03.920 | just like a random selection of different power stations in these different countries so it's not
00:19:09.760 | perfect but nonetheless we can we can go with this and then we can see okay we have an aggregator
00:19:16.720 | average, and we need to average over these values here. So number one is not being very good; who is
00:19:24.240 | that? Australia. Yeah, very bad. But that is really impressive, at least to me; I was pretty blown away
00:19:36.240 | with this example so that's it for this video i hope this has been interesting and useful
00:19:45.360 | it definitely is for me i'm really enjoying seeing how we can actually apply question answering
00:19:50.960 | to tables, and even more so with the little aggregations at the end. It's a very little feature,
00:19:56.640 | but I think it makes a pretty big difference. So thank you very much for watching the video.
00:20:03.120 | i hope it has been useful and i will see you again in the next one bye