How-to Use The Reddit API in Python
Chapters
0:00 Intro
0:53 Creating the API
8:20 Retrieving hot posts
12:18 Extracting data
12:38 Importing pandas
13:18 Extracting posts
14:48 Extracting selftext
17:08 Streaming the latest posts
18:38 Adding a limit parameter
19:58 Looping back in time
00:00:00.000 |
Hi and welcome to this video on how to use the reddit API in Python 00:00:05.000 |
So I'm gonna keep this really short and we'll just get straight into the code in just a moment 00:00:10.000 |
But I just want to describe what we're actually going to cover in this video 00:00:14.020 |
So the first thing we need to do is obviously get access to the API 00:00:19.080 |
So I'll just take you through how we can do that, and then I'll explain how we authenticate ourselves when accessing 00:00:26.640 |
the API. After that, I'll take you through some of the most common 00:00:31.480 |
uses of the API that I think most of you are going to be interested in, so that's stuff like getting the 00:00:42.020 |
most popular posts on a subreddit, or just a steady stream of all the threads being posted onto a subreddit 00:00:47.140 |
So let's just get straight into it and we'll start putting together our API 00:00:53.320 |
Okay, so the first thing we need to do is head over to this page here, which is reddit.com/prefs/apps 00:01:01.880 |
Now we just want to scroll down here and find this create another app or create an app button 00:01:10.320 |
Now you just give it a name. It doesn't really matter what you call it. Just something that you recognize 00:01:20.680 |
We are using this as a script for personal use 00:01:24.640 |
Obviously if you are using this API for something else then tick one of the other options that is relevant 00:01:39.160 |
redirect URI so for me, I'm just gonna enter my 00:01:49.160 |
Basically, you can put anything you want in here 00:01:51.400 |
But it's so that when people are wanting to find out something about your API 00:01:56.040 |
They will be directed to whatever you put in this box 00:01:59.000 |
So obviously if someone wants to find out about my API, they'll come here and they'll know that they can ask me about it 00:02:08.900 |
Okay, and then here this is our secret key which we are going to need later 00:02:16.560 |
So make sure you keep note of this and also this personal use script as well. So I'm just gonna copy those across and 00:02:36.140 |
Here we have our secret key as well. So this one you need to keep secret 00:02:44.720 |
Obviously, I'm showing you this but this API won't exist by the time I upload the video 00:03:02.400 |
so now we have those the next step is to request a temporary auth token from reddit and 00:03:10.960 |
The first thing we need to do is actually import the requests library 00:03:19.240 |
And here we enter our client ID and secret key 00:03:37.080 |
Now once we've done that we are going to need to actually log in 00:03:47.240 |
Dictionary where we specify that we are going to be logging in with a password 00:03:54.680 |
And then we pass in our username and password as well 00:04:09.800 |
For my password, I'm just going to read it in from this text file here 00:04:24.560 |
If you want, and this is just a simple script, you can just enter your password here, but it's not recommended 00:04:36.080 |
It's recommended that you read it from elsewhere, but it's completely up to you how you deal with this 00:04:41.120 |
But this is how you can read it in from a text file 00:04:48.480 |
And just make sure you put 'r' there, for read, instead of 'w' 00:05:05.600 |
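As a rough sketch of the step just described, reading the password from a text file might look like the following. The helper name `read_password` and the file name `pw.txt` are placeholders chosen for illustration, not names from the video.

```python
def read_password(path):
    """Return the password stored in a plain-text file, minus surrounding whitespace."""
    # "r" opens the file for reading; "w" would open it for writing and erase it.
    with open(path, "r") as f:
        return f.read().strip()


# Example call, assuming a file called pw.txt sits next to the script:
# password = read_password("pw.txt")
```

Stripping whitespace is a small design choice here: editors often add a trailing newline, and that newline would otherwise be sent to Reddit as part of the password.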
Okay, so that is the dictionary that we will need to pass along to Reddit in just a moment 00:05:14.680 |
essentially identify the version of our API and 00:05:20.120 |
For this you can literally put anything you want, but we'll put something that is at least slightly descriptive 00:05:38.840 |
Now all we need to do is actually send a request for our OAuth token 00:06:10.480 |
And in there, we also need to include the auth that we set up earlier with our client ID and secret key 00:06:33.120 |
And this will return us hopefully everything that we need 00:06:43.000 |
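Putting the pieces so far together, the token request could be sketched roughly as below. The function names are made up for this example, and every credential is passed in as an argument rather than hard-coded. The token URL is Reddit's documented `access_token` endpoint. Nothing is sent until `request_token` is actually called.

```python
import requests

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def build_login_data(username, password):
    """Form fields telling Reddit we are logging in with a password."""
    return {"grant_type": "password", "username": username, "password": password}


def request_token(client_id, secret_key, username, password, user_agent):
    """Ask Reddit for a temporary OAuth token and return the parsed JSON reply."""
    # The client ID and secret key go in as HTTP basic auth.
    auth = requests.auth.HTTPBasicAuth(client_id, secret_key)
    res = requests.post(
        TOKEN_URL,
        auth=auth,
        data=build_login_data(username, password),
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    res.raise_for_status()  # fail loudly on a non-2xx status
    return res.json()
```

If the call succeeds, the token is expected under the `access_token` key of the returned dictionary.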
Okay, and then here we can see our access token 00:06:49.920 |
We need to access that, and we just store it in a 00:06:54.880 |
variable here. So this token is something that we will need to add to our headers whenever we're using the API 00:07:16.720 |
The token itself needs to be formatted as a string 00:07:22.360 |
that contains the word "bearer", a space, and then the token itself 00:07:27.480 |
So then if we just print out headers this is what we get 00:07:33.120 |
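A minimal sketch of that header format is shown here. The user agent string and the token value are placeholders; a real token would come from the access token response.

```python
def with_bearer_token(headers, token):
    """Return a copy of headers with the Authorization entry added."""
    return {**headers, "Authorization": f"bearer {token}"}


headers = {"User-Agent": "MyAPI/0.0.1"}  # placeholder user agent
headers = with_bearer_token(headers, "abc123")  # placeholder token
print(headers)
# {'User-Agent': 'MyAPI/0.0.1', 'Authorization': 'bearer abc123'}
```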
So now we can access every endpoint within the reddit API 00:07:59.080 |
If we had tried to access this we would not have been allowed, so 00:08:07.520 |
We will just put this user agent API that we had before 00:08:27.000 |
Use headers which includes our authorization bearer token 00:08:31.080 |
and you can see we get a 200, which means everything is okay, and 00:08:35.840 |
Then we can add JSON onto the end here and we get all of this information 00:08:42.200 |
So, that's great we now have access to anything 00:08:50.160 |
We can start accessing what I think is probably the more relevant important information 00:08:59.720 |
Retrieving the most popular posts on a subreddit. So if I head over to the 00:09:16.360 |
/r/subreddit/hot, and this returns all of the hot posts on that subreddit 00:09:22.400 |
So in our case, let's go with the hot threads in the Python subreddit 00:09:34.800 |
And you can see here it's this /r/subreddit/hot 00:09:52.560 |
And then we have our subreddit, get rid of this end bracket, 00:09:57.400 |
hot, and of course the subreddit that we want to look at is Python 00:10:17.320 |
And then we can see what is in there using this JSON method and then here we get all this data so 00:10:26.920 |
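The request for hot posts could look something like this sketch. It assumes the authenticated API lives at `oauth.reddit.com` and that `headers` already holds the user agent and bearer token. The helper names are invented for the example.

```python
import requests

API_BASE = "https://oauth.reddit.com"


def listing_url(subreddit, sort="hot"):
    """Build the endpoint URL, e.g. /r/python/hot."""
    return f"{API_BASE}/r/{subreddit}/{sort}"


def fetch_listing(subreddit, headers, sort="hot"):
    """GET a subreddit listing and return the parsed JSON."""
    res = requests.get(listing_url(subreddit, sort), headers=headers, timeout=30)
    res.raise_for_status()  # anything other than a 2xx raises here
    return res.json()


print(listing_url("python"))
# https://oauth.reddit.com/r/python/hot
```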
This is obviously not very clean at the moment 00:10:30.280 |
So let's clean this up and we can put it into a pandas data frame. So it's a bit more readable 00:10:36.160 |
so first let's figure out how to access each post within the 00:10:49.120 |
Now within this JSON all of our posts are contained within this data key here 00:11:00.120 |
And then once we get into data we have a few different options 00:11:07.000 |
So we have this modhash, which is nothing we need to care about 00:11:11.240 |
We have dist, which is 27. That's not the posts we want. And then we have this one here, which is children 00:11:21.640 |
Within this list we have all the information about all of the hot posts within the Python subreddit 00:11:29.360 |
So that is where we want to extract data from 00:11:47.960 |
You can see there's quite a lot of data in each one of these 00:11:53.400 |
So it's probably worth us cleaning this up a little bit more. So you can see here, this is our other 00:12:05.840 |
So what we probably want to do here is extract the data within the post so this is giving us this 00:12:13.360 |
other dictionary which contains all the relevant information we want and 00:12:18.160 |
Then it is within here that we are going to want to extract 00:12:22.080 |
different parts of information into our data frame, so 00:12:26.960 |
Just as an example, we have the title. Okay, and then here we can see all of these 00:12:33.680 |
Titles of the most popular threads in the subreddit 00:12:38.080 |
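To make the nesting concrete, here is a small sketch that pulls titles out of a listing. The `sample` dictionary is a hand-made stand-in with the same shape as the response described above (data, then children, then each post's own data). It is not real Reddit output.

```python
def post_titles(listing):
    """Return the title of every post in a listing response."""
    return [post["data"]["title"] for post in listing["data"]["children"]]


# Hand-made sample with the same nesting as the real response.
sample = {
    "kind": "Listing",
    "data": {
        "modhash": None,
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"title": "First post"}},
            {"kind": "t3", "data": {"title": "Second post"}},
        ],
    },
}

print(post_titles(sample))
# ['First post', 'Second post']
```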
So this is essentially the syntax that we're going to use to populate our data frame. So first, let's just 00:13:02.320 |
And then we need to initialize our pandas data frame so we do it like so 00:13:08.560 |
Okay, and that just gives us an empty data frame and then we're gonna use the for loop like we did before 00:13:19.000 |
Just extract them as a row into this data frame 00:13:31.600 |
then within this we create a dictionary which is going to contain everything that we would like to include and 00:13:39.160 |
At the end of that as well, we also need to remember to set ignore_index, 00:13:43.360 |
otherwise we'll end up with a load of errors, and we want to avoid that 00:13:52.720 |
Just so we know where this data is actually coming from 00:13:57.360 |
So just like before we want to do the post data 00:14:05.260 |
Okay, and let's just have a look at what we have there. So, okay perfect as expected. We're getting all of these entries through 00:14:14.240 |
So that's great, but obviously we're probably gonna want a little bit more than just the subreddit 00:14:25.200 |
So, let's just add a few more items as well, so we have the title like we did before 00:14:33.000 |
And another pretty important one in my opinion, so let's just go 00:14:49.720 |
Yes, another important one is the selftext, which contains the actual content of the thread 00:15:02.080 |
That one is pretty important. If you're wanting to extract any information about well anything from reddit 00:15:17.960 |
Okay, so this is starting to look a little bit better, let's see what we have. Okay, it looks good 00:15:23.880 |
And maybe we want to also include a few other items 00:15:33.520 |
Maybe the number of upvotes, the downvotes and the score of the posts 00:15:39.600 |
So we can do a few different things here. We have the upvote ratio 00:15:47.360 |
Which is of course the number of upvotes it is getting in comparison to downvotes 00:15:52.360 |
Maybe we'd also just like to include the actual number of upvotes and downvotes as well 00:16:06.520 |
Again it's pretty straightforward. We just include these 00:16:23.000 |
And finally we can also include the score of the post 00:16:41.240 |
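Here is one way the data frame with these columns might be built. If the row-by-row step relies on `DataFrame.append` (an assumption, since the call itself isn't captured in the transcript), note that this method was removed in pandas 2.0. This sketch therefore collects the rows in a list and builds the frame once. The sample listing is invented for illustration.

```python
import pandas as pd

FIELDS = ["subreddit", "title", "selftext", "upvote_ratio", "ups", "downs", "score"]


def listing_to_frame(listing):
    """Turn a listing response into a DataFrame with one row per post."""
    rows = [
        {field: post["data"].get(field) for field in FIELDS}
        for post in listing["data"]["children"]
    ]
    return pd.DataFrame(rows, columns=FIELDS)


# Invented sample with the same shape as a real response.
sample = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "Python", "title": "First post",
                                    "selftext": "Hello", "upvote_ratio": 0.9,
                                    "ups": 9, "downs": 0, "score": 9}},
            {"kind": "t3", "data": {"subreddit": "Python", "title": "Second post",
                                    "selftext": "", "upvote_ratio": 0.5,
                                    "ups": 1, "downs": 0, "score": 1}},
        ]
    }
}

df = listing_to_frame(sample)
print(df[["title", "score"]])
```

Using `.get` means a post that lacks one of the fields gets an empty cell instead of raising a `KeyError`.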
Okay, so that gives us quite a lot of information that we can sort of go ahead with this 00:16:47.760 |
Now if there are other things that you're interested in adding in here 00:16:53.200 |
You can just do this to actually see what keys you can include 00:17:02.200 |
And then keys and this will just return a list of everything in there 00:17:07.400 |
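For instance, the available keys of a post could be inspected along these lines. The sample post is made up and only carries three fields, whereas a real post has many more.

```python
def available_keys(listing):
    """Sorted field names of the first post in a listing."""
    first_post = listing["data"]["children"][0]
    return sorted(first_post["data"].keys())


# Made-up sample; a real post has many more fields.
sample = {
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": "", "score": 1}},
        ]
    }
}

print(available_keys(sample))
# ['score', 'selftext', 'title']
```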
Now this is pretty useful for actually finding the most relevant or the most popular posts 00:17:14.440 |
But a lot of the time what you might want to do is actually stream the newest post 00:17:20.920 |
So you essentially get a real-time update of what is actually going on 00:17:25.760 |
And I would say this is probably what most people are going to want to use the API for 00:17:37.040 |
And we can find it just over here, we have this /r/subreddit/new 00:17:41.560 |
Okay, so essentially all we actually need to do here is adjust our old call 00:17:47.720 |
To instead of reaching out to the hot endpoint, we reach out to the new endpoint 00:17:55.880 |
Okay, so up here where we have hot, we just change that to new 00:18:09.600 |
And then we just do the same thing again, so we just rerun this code 00:18:21.920 |
And we get all of the latest posts on our subreddit 00:18:38.320 |
Of course, you're probably going to want maybe a few more than that 00:18:42.880 |
So what we can do is actually add a limit parameter 00:19:05.440 |
And let's just take a look at what we had before. We had this JSON, and we had this dist equals 25 00:19:13.920 |
Now if we run that we will see 100. So now we're returning 100 items 00:19:18.560 |
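As a sketch of how the limit parameter ends up in the request, the snippet below prepares the request without sending it, so the final URL can be inspected offline. The user agent is a placeholder. Reddit's listing documentation gives 25 as the default and 100 as the maximum for `limit`.

```python
import requests

# Build the request but do not send it, so we can look at the final URL.
req = requests.Request(
    "GET",
    "https://oauth.reddit.com/r/python/new",
    headers={"User-Agent": "MyAPI/0.0.1"},  # placeholder user agent
    params={"limit": 100},
).prepare()

print(req.url)
# https://oauth.reddit.com/r/python/new?limit=100
```

In a normal script the same `params` dictionary would simply be passed to `requests.get` alongside the headers.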
And of course that's pretty useful. So now we're getting more data back 00:19:22.480 |
and we can essentially just keep running this again and again and 00:19:31.280 |
So if we just rerun this so you can see we go up to 00:19:36.880 |
Rerun that and we will go over to 99. Okay, so that again is pretty useful 00:19:46.800 |
That is pretty important to understand with this and that is how we can extract the ids of a post 00:20:02.000 |
We have these two different items we have kind 00:20:13.520 |
So we have this t3. So reddit posts just have these different types or kinds 00:20:19.120 |
And it's essentially a code that says whether it's a thread or some other type of object, such as a comment (t1) 00:20:27.360 |
or a subreddit (t5), but generally we're always going to be working with t3, which are threads 00:20:33.600 |
But if you are working with something else, of course 00:20:45.600 |
Which is here and we can put both of these together 00:21:06.480 |
That is the unique id and that is unique for every post on the subreddit 00:21:12.080 |
And in the API documentation, you will see this referred to as the fullname 00:21:22.620 |
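Joining the two parts could be sketched like this. The post below is a made-up example, and the underscore separator follows the format Reddit's documentation uses for these identifiers.

```python
def fullname(post):
    """Combine a post's kind and id, e.g. t3 and abc123 give t3_abc123."""
    return f"{post['kind']}_{post['data']['id']}"


post = {"kind": "t3", "data": {"id": "abc123"}}  # made-up post
print(fullname(post))
# t3_abc123
```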
essentially loop back in time with the api so 00:21:26.700 |
One of the things we can do is only request threads that are further back in time 00:21:33.420 |
than a post with a specific fullname, which would be this t3 followed by a mix of letters 00:22:07.500 |
and then what we can do rather than actually initializing our 00:22:14.140 |
Avoid doing that and we can actually loop through and add all of these new posts to our data frame 00:22:28.300 |
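A rough sketch of the loop follows, assuming the `after` listing parameter is what carries the fullname. Here `fetch_page` is a hypothetical callable that takes the params dictionary and returns the parsed JSON of one page. Keeping it separate means the loop can be tried out without touching the network.

```python
def collect_posts(fetch_page, max_pages=5, limit=100):
    """Walk back through a listing, one page per call to fetch_page(params)."""
    posts = []
    params = {"limit": limit}
    for _ in range(max_pages):
        children = fetch_page(params)["data"]["children"]
        if not children:
            break  # nothing older is being returned any more
        posts.extend(child["data"] for child in children)
        last = children[-1]
        # Ask the next request for posts after (older than) the last one seen.
        params = {"limit": limit, "after": f"{last['kind']}_{last['data']['id']}"}
    return posts


# With requests, fetch_page might be written as (url and headers as set up before):
# fetch_page = lambda params: requests.get(url, headers=headers, params=params).json()
```

The `max_pages` cap is a design choice for the sketch, so that a long-lived subreddit does not turn this into an open-ended stream of requests.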
Okay, so that's how we can walk through and keep extracting more and more data 00:22:35.340 |
from the Reddit API. Now, at some point it will stop allowing you to do this, 00:22:44.540 |
which depends on the volume of requests that you're making and the volume of threads on a specific subreddit 00:22:51.340 |
But that is essentially all you need to actually do that 00:22:55.980 |
So like I said at the start the reddit api is incredibly powerful and unlike most other apis on social networks 00:23:06.700 |
So definitely something to take advantage of and see how you can implement it in your own projects 00:23:17.500 |
Thank you for watching. See you next time. Bye