
How-to use the Kaggle API in Python


Chapters

0:00
0:10 Pip Install Kaggle
2:20 Import the Kaggle Api Class
3:01 Downloading the Competition Data Sets
5:27 Download the Standalone Data Sets

Whisper Transcript | Transcript Only Page

00:00:00.800 | Hi and welcome to this video where we are going to go through setting up and using the Kaggle API.
00:00:07.280 | So the first thing we want to do is actually pip install Kaggle.
00:00:12.560 | Now I already have it installed so I'm not going to go ahead and install it again but once you do
00:00:20.720 | have it installed you can try and import the Kaggle module and you will get this error here.
00:00:30.800 | So this OSError simply tells you that it could not find the kaggle.json
00:00:37.120 | and you need to add it to this location here. Now the reason it's telling you this is because
00:00:43.760 | we use kaggle.json to authenticate our API access. Obviously Kaggle is not going to let just anyone access
00:00:51.760 | their API; you need to have an account before you start downloading their data. So to get our
00:00:59.760 | kaggle.json credentials we simply go over to kaggle.com. Now if you don't have an account you'll
00:01:09.360 | have to go ahead and create one. Once you've created your account you simply go over to this
00:01:15.840 | little icon over here in the top right, click account and scroll down until you see this API
00:01:24.320 | section. Now all you need to do is create a new API token and this creates the kaggle.json
00:01:34.720 | credentials and allows me to save them to my computer. So I'm just going to save them
00:01:41.280 | in my documents for now and then head back to the notebook and we're going to see that we
00:01:50.880 | need to save it here. So I'm going to copy and paste that across and here we have the directory
00:01:57.680 | that we need to put our kaggle.json in. I'm going to take my kaggle.json and simply move it into here.
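The move described here can also be scripted. This is a minimal sketch, not the method used in the video: the helper name `install_kaggle_token` is hypothetical, and the default target of `~/.kaggle` (or the `KAGGLE_CONFIG_DIR` environment variable, if set) is an assumption based on the Kaggle client's documented defaults. The OSError message on your own machine names the exact directory it expects.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded_token, config_dir=None):
    """Copy a downloaded kaggle.json into the Kaggle config directory.

    Hypothetical helper for illustration only.
    """
    # Assumption: the client reads ~/.kaggle/kaggle.json unless
    # KAGGLE_CONFIG_DIR points somewhere else.
    if config_dir is None:
        config_dir = os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)

    target = config_dir / "kaggle.json"
    shutil.copyfile(downloaded_token, target)
    # Make the token readable and writable by the current user only,
    # since it contains an API key.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Calling `install_kaggle_token("path/to/downloaded/kaggle.json")` would then place the token where the import looks for it.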
00:02:05.600 | Okay, so to check that it's worked we simply rerun this cell and there we can see that our Kaggle
00:02:12.720 | API is now functional. Now we don't actually need this import kaggle; instead we need to
00:02:20.880 | import the KaggleApi class from the kaggle.api.kaggle_api_extended module.
00:02:28.320 | Once we've imported that we simply initialize our API
00:02:39.600 | and then authenticate it.
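The import, initialize, and authenticate steps just described can be sketched as follows. Wrapping them in a function named `get_kaggle_api` is my own choice for illustration; in a notebook the three lines inside it would normally be run directly.

```python
def get_kaggle_api():
    """Return an authenticated KaggleApi client (hypothetical wrapper)."""
    # Imported inside the function: as seen earlier in the video, importing
    # the kaggle package without a kaggle.json in place raises an error.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()   # initialize the client
    api.authenticate()  # load the credentials, e.g. from kaggle.json
    return api


# Usage (needs the kaggle package and valid credentials):
#   api = get_kaggle_api()
```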
00:02:51.120 | Now we're ready to start downloading datasets and the Kaggle API gives us several options
00:02:57.440 | for doing this. The two that you're most likely to use are for downloading the competition datasets
00:03:04.080 | or standalone datasets. Now a competition dataset is related to a current or past competition.
00:03:10.880 | So for example there is a sentiment analysis on movie reviews competition.
00:03:18.320 | We can actually find it over here and you can see here in the URL kaggle.com is followed by this
00:03:25.440 | c, and this c essentially means that this is a competition. We can also see "Playground
00:03:31.520 | Prediction Competition"; everything is telling us that this is a competition, and this competition
00:03:37.360 | comes with some data. Now this is different to a standalone dataset, and these standalone datasets
00:03:47.760 | can simply be uploaded by anyone. So if we go to the Sentiment140 dataset here and look in the URL,
00:03:55.600 | we can see that this dataset has been uploaded by kazanova and there's a slightly different
00:04:02.160 | structure to the dataset page as well. We can see here it's a dataset; the first tab takes us to the data
00:04:09.120 | and we can scroll down and see the data that we can get here. So there are two different methods
00:04:16.800 | for downloading each one of these. We can't download competition datasets with the standalone
00:04:22.080 | dataset method and we can't download standalone datasets with the competition dataset method.
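The two download methods can be sketched side by side. This is an illustration under stated assumptions: the `download_file` dispatcher and its `kind` argument are hypothetical, while `competition_download_file` and `dataset_download_file` are the `KaggleApi` methods used in the rest of the video. The angle-bracket values in the usage comments are placeholders to fill in from the page URL.

```python
def download_file(api, kind, ref, file_name, path="."):
    """Download one file with the method that matches the dataset kind.

    Hypothetical helper: `api` is an authenticated KaggleApi instance.
    """
    if kind == "competition":
        # ref is the competition name taken from the competition URL
        api.competition_download_file(ref, file_name, path=path)
    elif kind == "dataset":
        # ref is "<username>/<dataset-name>" taken from the dataset URL
        api.dataset_download_file(ref, file_name, path=path)
    else:
        raise ValueError(f"unknown kind: {kind!r}")


# Usage (needs the kaggle package and valid credentials):
#   from kaggle.api.kaggle_api_extended import KaggleApi
#   api = KaggleApi()
#   api.authenticate()
#   download_file(api, "competition", "<competition-name>", "train.tsv.zip")
#   download_file(api, "dataset", "<username>/<dataset-name>", "<file-name>")
```

Keeping the two calls behind one function makes the point of this passage explicit: the reference format and the method have to match the kind of dataset.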
00:04:27.840 | So we'll start with the competition dataset, and to download one of these all we need to do
00:04:34.960 | is use the competition_download_file method, and then we need to pass the competition name followed
00:04:47.360 | by the name of the data file. So if we head back over here we can see the competition name is this
00:05:02.560 | and the data that we would like
00:05:04.880 | is train.tsv.zip
00:05:11.360 | and that is downloaded into our current directory, as you can see here. Okay, so that's how we
00:05:24.480 | download the competition datasets. We can also download the standalone datasets. To do so
00:05:31.840 | we use the dataset_download_file method
00:05:35.040 | and then here we need to pass the username followed by the dataset name. So if we head over here
00:05:51.760 | you can find both in the URL, so this one is kazanova/sentiment140.
00:05:57.840 | We also need to specify the file name
00:06:08.000 | which in this case is this text here
00:06:18.080 | and then just execute that
00:06:19.280 | and now we can see that we have downloaded both files here. Now you will notice that both of
00:06:28.960 | these files are actually zipped, so we can just quickly unzip them using Python. All we need to do
00:06:36.160 | is import zipfile, and with zipfile.ZipFile
00:06:43.680 | we specify the path to the data which in this case is just the file name
00:06:57.920 | and we specify that we are simply reading it.
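The unzip step being walked through here can be sketched with the standard library's `zipfile` module. The helper name `unzip_here` and its `dest_dir` parameter are my own, added for illustration; the video extracts into the current directory.

```python
import zipfile


def unzip_here(zip_path, dest_dir="."):
    """Extract every member of a zip archive and return the member names."""
    # "r" opens the archive read-only; the with-block closes it afterwards.
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Usage, assuming train.tsv.zip was downloaded into the current directory:
#   unzip_here("train.tsv.zip")
```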
00:07:05.280 | And then we simply call the extractall method
00:07:15.360 | and we have our dataset here
00:07:22.320 | and we see everything is in the right format. So that's everything for this tutorial on using the
00:07:32.480 | Kaggle API. If you have any questions just let me know in the comments below but otherwise
00:07:38.960 | thank you for watching and I will see you again next time.