How-to use the Kaggle API in Python
Chapters
0:0
0:10 Pip Install Kaggle
2:20 Import the Kaggle API Class
3:1 Downloading the Competition Data Sets
5:27 Download the Standalone Data Sets
00:00:00.800 |
Hi and welcome to this video where we are going to go through setting up and using the Kaggle API. 00:00:07.280 |
So the first thing we want to do is actually pip install kaggle. 00:00:12.560 |
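For anyone following along from a script rather than a notebook cell, here is a minimal sketch of the same install step in Python. The helper names (is_installed, pip_install_command, ensure_installed) are made up for illustration; the only real pieces are the standard library calls and the package name kaggle.

```python
import importlib.util
import subprocess
import sys
from typing import List


def is_installed(package: str) -> bool:
    """Return True if `package` can be located, without importing it."""
    return importlib.util.find_spec(package) is not None


def pip_install_command(package: str) -> List[str]:
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def ensure_installed(package: str) -> None:
    """Run pip only when the package is missing."""
    if not is_installed(package):
        subprocess.check_call(pip_install_command(package))


# Usage (hypothetical helper from this sketch):
# ensure_installed("kaggle")
```

Calling pip through sys.executable is a design choice here: it keeps the install tied to the same environment the notebook or script is running in.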
Now, I already have it installed, so I'm not going to go ahead and install it again, but once you do 00:00:20.720 |
have it installed, you can try to import the kaggle module and you will get this error here. 00:00:30.800 |
So this OSError simply tells you that it could not find kaggle.json 00:00:37.120 |
and that you need to add it to this location here. Now, the reason it's telling you this is because 00:00:43.760 |
we use kaggle.json to authenticate our API access. Obviously Kaggle is not going to let just anyone access 00:00:51.760 |
their API; you need to have an account before you start downloading their data. So to get our 00:00:59.760 |
kaggle.json credentials we simply go over to kaggle.com. Now, if you don't have an account, you'll 00:01:09.360 |
have to go ahead and create one. Once you've created your account you simply go over to this 00:01:15.840 |
little icon over here in the top right, click account and scroll down until you see this API 00:01:24.320 |
section. Now all you need to do is create a new API token, and this creates the kaggle.json 00:01:34.720 |
credentials and allows me to save them to my computer. So I'm just going to save them 00:01:41.280 |
in my documents for now and then head back to the notebook, where we're going to see that we 00:01:50.880 |
need to save it here. So I'm going to copy and paste that across, and here we have the directory 00:01:57.680 |
that we need to put our kaggle.json in. I'm going to take my kaggle.json and simply move it into here. 00:02:05.600 |
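The move itself can also be scripted. This is a rough sketch, assuming the client looks in the directory named by the KAGGLE_CONFIG_DIR environment variable when it is set and in a .kaggle folder under the home directory otherwise; the path printed in your own error message is the one to trust. The function names and the example download path are hypothetical.

```python
import os
import shutil
import stat
from pathlib import Path
from typing import Optional


def kaggle_config_dir() -> Path:
    """Assumed credentials directory: KAGGLE_CONFIG_DIR if set, else ~/.kaggle."""
    override = os.environ.get("KAGGLE_CONFIG_DIR")
    return Path(override) if override else Path.home() / ".kaggle"


def install_token(downloaded: str, config_dir: Optional[Path] = None) -> Path:
    """Move a downloaded kaggle.json into the credentials directory."""
    target_dir = Path(config_dir) if config_dir else kaggle_config_dir()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.move(downloaded, str(target))
    # Restrict the token to owner read/write, since it is a secret.
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target


# Usage (the source path is only an example):
# install_token(str(Path.home() / "Documents" / "kaggle.json"))
```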
Okay, so to check that it's worked we simply rerun this cell, and there we can see that our Kaggle 00:02:12.720 |
API is now functional. Now, we don't actually need this import kaggle; instead we need to 00:02:20.880 |
import the KaggleApi class from the kaggle.api.kaggle_api_extended module. 00:02:28.320 |
Once we've imported that, we simply initialize our API. 00:02:51.120 |
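The import and initialization described here can be sketched as follows. The typing on screen is not captured in the transcript, so treat this as an approximation: the authenticate() call and the wrapper function get_authenticated_api are assumptions about what the cell contains, not a quote of it.

```python
def get_authenticated_api():
    """Create a KaggleApi client and load the kaggle.json credentials."""
    # Imported inside the function because importing the kaggle package
    # already looks for kaggle.json and fails if it is missing.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()  # reads the credentials saved in the earlier step
    return api


# Usage (needs the kaggle package and a valid kaggle.json):
# api = get_authenticated_api()
```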
Now we're ready to start downloading datasets and the Kaggle API gives us several options 00:02:57.440 |
for doing this. The two that you're most likely to use are for downloading the competition datasets 00:03:04.080 |
or standalone datasets. Now a competition dataset is related to a current or past competition. 00:03:10.880 |
So for example there is a Sentiment Analysis on Movie Reviews competition. 00:03:18.320 |
We can actually find it over here, and you can see here in the URL that kaggle.com is followed by this 00:03:25.440 |
c, and this c essentially means that this is a competition. We can also see "Playground 00:03:31.520 |
Prediction Competition"; everything is telling us that this is a competition, and this competition 00:03:37.360 |
comes with some data. Now, this is different to a standalone dataset, and these standalone datasets 00:03:47.760 |
can simply be uploaded by anyone. So if we go to the Sentiment140 dataset here and look in the URL, 00:03:55.600 |
we can see that this dataset has been uploaded by kazanova, and there's a slightly different 00:04:02.160 |
structure to the dataset page as well. We can see here it's a dataset; the first tab takes us to the data, 00:04:09.120 |
and we can scroll down and see the data that we can get here. So there are two different methods 00:04:16.800 |
for downloading each one of these. We can't download competition datasets with the standalone 00:04:22.080 |
dataset method, and we can't download standalone datasets with the competition dataset method. 00:04:27.840 |
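Since the URL is what tells the two kinds of page apart, a small helper can read the right identifier out of it. This is only a sketch: classify_kaggle_url is a made-up name, and besides the /c/ form shown in the video it also accepts /competitions/ and /datasets/ prefixes, which is an assumption about how the site lays out its URLs today.

```python
from typing import Tuple
from urllib.parse import urlparse


def classify_kaggle_url(url: str) -> Tuple[str, str]:
    """Return ("competition", name) or ("dataset", "user/name") for a page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Competition pages: kaggle.com/c/<name> (or /competitions/<name>, assumed).
    if len(parts) >= 2 and parts[0] in ("c", "competitions"):
        return ("competition", parts[1])
    # Assumed newer dataset layout: kaggle.com/datasets/<user>/<name>.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    # Standalone datasets: kaggle.com/<user>/<name>.
    if len(parts) >= 2:
        return ("dataset", parts[0] + "/" + parts[1])
    raise ValueError("not a competition or dataset URL: " + url)
```

The kind that comes back decides which download method applies, and the second element is the string that method expects as its first argument.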
So we'll start with the competition dataset, and to download one of these all we need to do 00:04:34.960 |
is use the competition_download_file method, and then we need to pass the competition name followed 00:04:47.360 |
by the dataset file. So if we head back over here, we can see the competition name is this 00:05:11.360 |
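A sketch of the competition download call, since the typed cell is not in the transcript. The wrapper download_competition_file is hypothetical; it simply forwards to the client's competition_download_file method. The competition name and file name in the usage comment are assumptions read off the competition page, so check them against the URL and the data tab.

```python
def download_competition_file(api, competition: str, file_name: str, path: str = ".") -> None:
    """Download a single file from a competition into `path`."""
    # `api` is an authenticated KaggleApi instance.
    api.competition_download_file(competition, file_name, path=path)


# Usage (names below are assumed, not confirmed by the transcript):
# download_competition_file(
#     api,
#     "sentiment-analysis-on-movie-reviews",  # the part of the URL after /c/
#     "train.tsv.zip",                        # one file listed on the data tab
#     path="./",
# )
```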
and that is downloaded into our current directory, as you can see here. Okay, so that's how we 00:05:24.480 |
download the competition datasets. We can also download the standalone datasets. To do so 00:05:35.040 |
and then here we need to pass the username followed by the dataset name. So if we head over here, 00:05:51.760 |
you can find both in the URL, so this one is kazanova/sentiment140. 00:06:19.280 |
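The standalone download can be sketched the same way. The transcript does not capture the method name, so using the client's dataset_download_file here is an assumption, and download_dataset_file is a hypothetical wrapper. The dataset slug is taken to be the username/dataset-name pair from the URL, and the file name in the usage comment is likewise assumed.

```python
def download_dataset_file(api, dataset: str, file_name: str, path: str = ".") -> None:
    """Download a single file from a standalone dataset into `path`."""
    owner, sep, name = dataset.partition("/")
    if not (owner and sep and name):
        raise ValueError("expected '<username>/<dataset-name>', got: " + dataset)
    # `api` is an authenticated KaggleApi instance.
    api.dataset_download_file(dataset, file_name, path=path)


# Usage (slug and file name are assumptions based on the dataset page):
# download_dataset_file(
#     api,
#     "kazanova/sentiment140",
#     "training.1600000.processed.noemoticon.csv",
#     path="./",
# )
```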
And now we can see that we have downloaded both files here. Now, you will notice that both of 00:06:28.960 |
these files are actually zipped, so we can just quickly unzip them using Python. All we need to do 00:06:43.680 |
we specify the path to the data, which in this case is just the file name, 00:06:57.920 |
and we specify that we are simply reading it. 00:07:05.280 |
And then we simply call the extractall method 00:07:22.320 |
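The unzip step maps onto the standard library's zipfile module: open the archive by path in read mode, then call extractall. A minimal sketch, with unzip as a hypothetical helper name and the file name in the usage comment assumed from the download step:

```python
import zipfile
from typing import List


def unzip(zip_path: str, dest: str = ".") -> List[str]:
    """Extract every member of `zip_path` into `dest` and return their names."""
    # "r" opens the archive for reading only.
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage (file name assumed from the download step):
# unzip("train.tsv.zip")
```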
and we see everything is in the right format. So that's everything for this tutorial on using the 00:07:32.480 |
Kaggle API. If you have any questions just let me know in the comments below but otherwise 00:07:38.960 |
thank you for watching and I will see you again next time.