How to Use the Reddit API in Python


Chapters

0:00 Intro
0:53 Creating the API
8:20 Retrieving hot posts
12:18 Extracting data
12:38 Importing pandas
13:18 Extracting posts
14:48 Extracting selftext
17:08 Streaming the latest posts
18:38 Adding a limit parameter
19:58 Looping back in time

Transcript

Hi, and welcome to this video on how to use the Reddit API in Python. I'm going to keep this really short, and we'll get straight into the code in just a moment, but I just want to describe what we're actually going to cover in this video. The first thing we need to do is obviously get access to the API, so I'll take you through how we can do that, and then I'll explain how we authenticate ourselves when accessing the API. After that, I'll take you through some of the most common uses of the API, the ones I think most of you are going to be interested in: getting the most popular threads from a subreddit, or just a steady stream of all the threads being posted onto a subreddit. So let's get straight into it and start putting together our API.

Okay, so the first thing we need to do is head over to this page here, which is reddit.com/prefs/apps. We just want to scroll down and find this "create another app" or "create an app" button, and click on it. Now you just give it a name.

It doesn't really matter what you call it, just something that you recognize. We are using this as a script for personal use. Obviously, if you are using this API for something else, then tick whichever of the other options is relevant. You can give it a quick description, and then here you need to give it a redirect URI. For me, I'm just going to enter my Twitter address, because basically you can put anything you want in here, but it's so that when people want to find out something about your API, they will be directed to whatever you put in this box. So obviously, if someone wants to find out about my API, they'll come here and they'll know that they can ask me about it. Okay, and then here, this is our secret key, which we are going to need later, so make sure you keep a note of this, and also of this personal use script key as well.

So I'm just going to copy those across and put them into my JupyterLab here. I'm going to call this one client ID, so it's the identifier, and this is the public key, and here we have our secret key as well. This one you need to keep secret. Obviously I'm showing you this, but this API won't exist by the time I upload the video. So we just enter those.

Now that we have those, the next step is to request a temporary OAuth token from Reddit. The first thing we need to do is import the requests library. Then we get our authorization like so, and here we enter our client ID and secret key. Once we've done that, we need to actually log in. To do that, we first initialize a dictionary where we specify that we are going to be logging in with a password, which we do like this, and then we pass in our username and password as well. For my password, I'm just going to read it in from this text file here. If this is just a simple script, you can enter your password here directly if you want.

Entering the password directly is not recommended; it's recommended that you read it from elsewhere, but it's completely up to you how you deal with this. This is how you can read it in from a text file; just make sure you put "r" there instead of "w", for read. Okay, so that is the dictionary that we will need to pass along to Reddit in just a moment. We also need to essentially identify the version of our API. For this you can literally put anything you want, but we'll put something that is at least slightly descriptive: we'll call it MyAPI and then put the version number.

Now all we need to do is actually send a request for our OAuth token. We send this request to this address: we are accessing the API version 1 and the access_token endpoint. In there, we also need to include the auth that we set up earlier, our login data, and the headers. This will hopefully return everything that we need. Okay, and here we can see our access token, so we need to access that, and we just store it in a variable here.
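The token request described above can be sketched roughly as follows. This is a minimal sketch, not the exact notebook code from the video: it assumes the requests library is installed, and the helper names (read_password, build_token_request, fetch_token), the User-Agent string "MyAPI/0.0.1", and the password file name are placeholders chosen for illustration. The video builds the auth object explicitly; here a (client ID, secret) tuple is used, which requests treats as HTTP Basic auth.

```python
from pathlib import Path

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def read_password(path):
    """Read the account password from a text file, dropping surrounding whitespace."""
    return Path(path).read_text().strip()


def build_token_request(client_id, secret_key, username, password,
                        user_agent="MyAPI/0.0.1"):
    """Assemble the keyword arguments for requests.post() (hypothetical helper)."""
    return {
        "url": TOKEN_URL,
        # A (user, password) tuple is requests' shorthand for HTTP Basic auth.
        "auth": (client_id, secret_key),
        # Password grant: log in with the Reddit account's username and password.
        "data": {
            "grant_type": "password",
            "username": username,
            "password": password,
        },
        # Short description of the client plus a version; this value is a placeholder.
        "headers": {"User-Agent": user_agent},
    }


def fetch_token(request_kwargs):
    """Send the token request and return the access token string."""
    import requests  # third-party; imported lazily so the builder works without it

    res = requests.post(timeout=30, **request_kwargs)
    res.raise_for_status()
    return res.json()["access_token"]


# Example usage with placeholder credentials (the file name is hypothetical):
# kwargs = build_token_request("CLIENT_ID", "SECRET_KEY", "my_username",
#                              read_password("pw.txt"))
# TOKEN = fetch_token(kwargs)
```

Keeping the request assembly separate from the network call makes it easy to inspect exactly what will be sent before any credentials leave the machine.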

So this token is something that we need to add to our headers whenever we're using the API. To do that, we just write this: we add it under Authorization, and the token itself needs to be formatted as a string that contains the word "bearer", a space, and then the token itself. Then if we print out headers, this is what we get.

So now we can access every endpoint within the Reddit API. Beforehand, if we had tried to access this endpoint, oauth.reddit.com/api/v1/me, we would not have been allowed. Let's say we just pass headers containing only the MyAPI user agent that we had before. Okay, and we get a 401 response. So let's copy this and try again, but this time use headers that include our Authorization bearer token, and you see we get a 200, which means everything is okay. Then we can add .json() onto the end here, and we get all of this information.

So that's great: we now have access to anything, and we can start accessing what I think is probably the more relevant, important information. The first thing I want to focus on is retrieving the most popular posts on a subreddit.
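The header step and the identity check described above could look something like this. It is a sketch under assumptions: the function names are hypothetical, the requests library is assumed to be installed, and the token value in the usage comment is a placeholder.

```python
ME_URL = "https://oauth.reddit.com/api/v1/me"


def with_bearer_token(headers, token):
    """Return a copy of the headers with the Authorization field added."""
    # The value is the word "bearer", a space, and then the token itself.
    return {**headers, "Authorization": f"bearer {token}"}


def identity_status(headers):
    """Call the /api/v1/me endpoint and return the HTTP status code."""
    import requests  # third-party; imported lazily so the helper above needs nothing

    return requests.get(ME_URL, headers=headers, timeout=30).status_code


# Example usage (TOKEN is a placeholder for the value returned by the token request):
# base_headers = {"User-Agent": "MyAPI/0.0.1"}
# identity_status(base_headers)                            # 401 in the video, no token
# identity_status(with_bearer_token(base_headers, TOKEN))  # 200 in the video
```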

So if I head over to the Reddit API documentation over here, okay, we can see that we have this GET /r/subreddit/hot, and this returns all of the hot posts on that subreddit. In our case, let's go with the hot threads in the Python subreddit. To do that, we send a GET request, and like you can see here, it's this /r/subreddit/hot, so we can copy that across. We start the request with oauth.reddit.com, then we have our /r/subreddit, get rid of this end bracket, then /hot, and of course the subreddit that we want to look at is python. Then we can just add our headers in here. Oh, and this is requests, not reddit. Then we can see what is in there using this json method, and here we get all this data. This is obviously not very clean at the moment, so let's clean this up and put it into a pandas data frame.
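As a rough sketch of the request just described, the URL can be assembled from the subreddit name and the listing type. The helper names below are made up for illustration, and the headers argument is assumed to already contain the User-Agent and the bearer token.

```python
API_BASE = "https://oauth.reddit.com"


def listing_url(subreddit, listing="hot"):
    """Build the URL for a subreddit listing such as hot or new."""
    return f"{API_BASE}/r/{subreddit}/{listing}"


def get_listing(subreddit, headers, listing="hot"):
    """Fetch one page of a listing and return the decoded JSON."""
    import requests  # third-party; imported lazily so listing_url() needs nothing

    res = requests.get(listing_url(subreddit, listing), headers=headers, timeout=30)
    res.raise_for_status()
    return res.json()


# Example usage (headers must include the bearer token):
# hot = get_listing("python", headers)
```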

A data frame will make it a bit more readable. So first, let's figure out how to access each post within the response. Let's open this again. Within this JSON, all of our posts are contained within this data key here, so: data. Then once we get into data, we have a few different options. We have this modhash, which is nothing we need to care about. We have dist, which is 27.

That's not what I want. Then we have this one here, which is children, and you'll see that this is a list, and within this list we have all the information about all of the hot posts within the Python subreddit. So that is where we want to extract data from. Let's do that; let's print each post. Okay, now we are getting somewhere, and you can see there's quite a lot of data in each one of these, so it's probably worth cleaning this up a little bit more.

You can see here that this is the next entry in this list. So what we probably want to do is extract the data within the post. This gives us another dictionary, which contains all the relevant information we want, and it is from here that we are going to extract different pieces of information into our data frame. Just as an example, we have the title.
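To make the nesting described above concrete, here is a small sketch using a hand-written stand-in for the response. The sample dictionary is invented for illustration and is far smaller than a real response; only the path data, then children, then each post's data, then title follows the transcript.

```python
# Hand-written stand-in for the JSON returned by the hot endpoint (not real data).
sample_response = {
    "kind": "Listing",
    "data": {
        "modhash": None,
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"subreddit": "Python", "title": "First example post"}},
            {"kind": "t3", "data": {"subreddit": "Python", "title": "Second example post"}},
        ],
    },
}


def post_titles(listing):
    """Return the title of every post in a listing response."""
    return [post["data"]["title"] for post in listing["data"]["children"]]


for title in post_titles(sample_response):
    print(title)
```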

Okay, and here we can see all of the titles of the most popular threads in the subreddit. This is essentially the syntax that we're going to use to populate our data frame. So first, let's import pandas, and maybe install it. Okay. Then we need to initialize our pandas data frame, so we do it like so. That just gives us an empty data frame, and then we're going to use the for loop like we did before to loop through each one of the posts and extract each as a row into this data frame. So we'll do df equals df.append, and within this we create a dictionary which is going to contain everything that we would like to include, plus one more argument at the end.

We also need to remember to set ignore_index=True at the end of that call; otherwise we'll end up with a load of errors, and we want to avoid that. So first, let's include the subreddit, just so we know where this data is actually coming from. Just like before, we access the post's data, and then we just access subreddit. Okay, let's have a look at what we have there.
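The row-by-row approach described above relies on DataFrame.append, which was deprecated in pandas 1.4 and removed in pandas 2.0. A sketch that works on current pandas is to collect plain dictionaries in a list and build the data frame once at the end. This swaps in a different technique from the one used in the video; the function names are hypothetical, and pandas is assumed to be installed for the last step.

```python
def posts_to_rows(listing, fields=("subreddit", "title")):
    """Turn a listing response into a list of row dictionaries."""
    rows = []
    for post in listing["data"]["children"]:
        data = post["data"]
        # .get() leaves a gap (None) instead of raising if a key is missing.
        rows.append({field: data.get(field) for field in fields})
    return rows


def rows_to_frame(rows):
    """Build the data frame in one step from the collected rows."""
    import pandas as pd  # third-party; imported lazily so posts_to_rows() needs nothing

    return pd.DataFrame(rows)


# Example usage (res is the response object from the hot request):
# df = rows_to_frame(posts_to_rows(res.json()))
```

Building the frame once also avoids copying the whole table on every iteration, which is what repeated appends did.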

Okay, perfect, as expected: we're getting all of these entries through. So that's great, but obviously we're probably going to want a little bit more than just the subreddit, so let's add a few more items as well. We have the title, like we did before. Another pretty important one, in my opinion, is the selftext, which contains the actual text content of the thread. That one is pretty important.

It matters if you want to extract information about, well, anything from Reddit. Okay, so this is starting to look a little bit better; let's see what we have. Okay, it looks good. Maybe we also want to include a few other items: maybe the number of upvotes, the downvotes, and the score of the posts. We can do a few different things here.

We have the upvote ratio, which is of course the number of upvotes a post is getting in comparison to downvotes. Maybe we'd also like to include the actual number of upvotes and downvotes as well. Again, it's pretty straightforward: we just include ups, and we can include downs like so. Finally, we can also include the score of the post. Okay, so that gives us quite a lot of information to go ahead with. If there are other things that you're interested in adding, you can do this to see which keys you can include: access data and then keys, and this will return a list of everything in there.

Now, this is pretty useful for finding the most relevant or the most popular posts, but a lot of the time what you might want to do is stream the newest posts, so you essentially get a real-time update of what is going on. I would say this is probably what most people are going to want to use the API for, so we can take a quick look at that as well. We can find it just over here: we have this /r/subreddit/new. Essentially, all we need to do is adjust our call so that, instead of reaching out to the hot endpoint, we reach out to the new endpoint. So let's modify our code to do that. Up here where we have hot, we just change that to new. Okay, that seems to have worked. Then we do the same thing again, so we just rerun this code. Okay, great. Then we do this, and we get all of the latest posts on our subreddit, which of course is pretty useful.

Now, this is returning around 27, or for this one 25, posts at once. Of course, you're probably going to want a few more than that, so what we can do is add a limit parameter. We add it like so: add params, and in here we add limit, and we can go up to 100 items. Let's just take a look at what we had before: in this JSON we had dist equals 25, which means that we returned 25 items before. Now if we run that, we will see 100.
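Going back to the extra columns and the keys lookup mentioned in this passage, a sketch of both is shown below. The field names are the ones spoken in the transcript (selftext, upvote_ratio, ups, downs, score); the sample post and its values are invented for illustration, and the helper names are hypothetical.

```python
POST_FIELDS = ("subreddit", "title", "selftext", "upvote_ratio", "ups", "downs", "score")

# Invented sample post; a real post's data dictionary has many more keys.
sample_post = {
    "kind": "t3",
    "data": {
        "subreddit": "Python",
        "title": "Example post",
        "selftext": "Body text of the example post.",
        "upvote_ratio": 0.9,
        "ups": 9,
        "downs": 0,
        "score": 9,
        "id": "abc123",
    },
}


def extract_fields(post, fields=POST_FIELDS):
    """Pick the chosen fields out of one post's data dictionary."""
    data = post["data"]
    return {field: data.get(field) for field in fields}


def available_keys(post):
    """List every key that could be added as a column."""
    return sorted(post["data"].keys())


print(extract_fields(sample_post))
print(available_keys(sample_post))
```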

So now we're returning 100 items, and of course that's pretty useful. Now we're getting more data back, and we can essentially keep running this again and again and extracting as much data as we would like. If we rerun this, you can see we go up to 24 here; rerun that, and we go up to 99.
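The switch to the new endpoint with a limit parameter can be sketched as below. The function names are hypothetical, the requests library is assumed to be installed for the fetch step, and the 100-item ceiling is the one stated in the transcript.

```python
MAX_LIMIT = 100  # the most items a single listing request returns, per the transcript


def new_posts_request(subreddit, headers, limit=MAX_LIMIT):
    """Assemble the keyword arguments for requests.get() against the new endpoint."""
    if not 1 <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}")
    return {
        "url": f"https://oauth.reddit.com/r/{subreddit}/new",
        "headers": headers,
        # Sent as the query string, e.g. ?limit=100
        "params": {"limit": limit},
    }


def fetch_listing(request_kwargs):
    """Send the request and return the decoded JSON."""
    import requests  # third-party; imported lazily so the builder needs nothing

    res = requests.get(timeout=30, **request_kwargs)
    res.raise_for_status()
    return res.json()


# Example usage (headers must include the bearer token):
# newest = fetch_listing(new_posts_request("python", headers))
# len(newest["data"]["children"])  # up to 100
```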

Okay, so that again is pretty useful. Now there's one more thing that is pretty important to understand with this, and that is how we can extract the IDs of a post from the Reddit API. If we go into post here, we have these two different items. We have kind, which I think is here, so we have this t3.

So Reddit posts have these different types, or kinds, and it's essentially a code that says whether it's a thread or some other type of post, which I think is something like ads or videos or something along those lines. Generally we're always going to be working with t3, which are threads, but if you are working with something else, of course, that may change. As well as that, we also have the id, which is here, and we can put both of these together to create the Reddit post ID. So we add this, with an underscore in the middle, and this. That is the unique ID, and it is unique for every post on the subreddit. In the API documentation, you will see this referred to as the fullname.

What we can do with this is essentially loop back in time with the API. One of the things we can do is only request threads that are further back in time than a post with a specific fullname, which would be this t3 plus the mix of letters. If we would like to do that, let's take this final one we have here, and all we do is add that as another parameter, after, like this, and this will only take the next 100 threads in the new listing that come after this post. So we can do that, and then, rather than initializing a new data frame, we can skip that and loop through and add all of these new posts to our existing data frame. Then we end up with even more data. And here we go.

Okay, so that's how we can walk back and keep extracting more and more data from the Reddit API. At some point it will stop allowing you to do this: you can only go so far back in time, which depends on the volume of requests that you're making and the volume of threads on the specific subreddit. But that is essentially all you need to do.

So, like I said at the start, the Reddit API is incredibly powerful, and unlike most other APIs on social networks, it's free to use, so it's definitely something to take advantage of, and see how you can implement it in your own projects. I hope you've enjoyed the video, and thank you for watching.

See you next time. Bye.
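The fullname and the after parameter covered in the final chapter can be sketched as follows. This is an illustration under assumptions rather than the notebook code from the video: fetch_page is a hypothetical callable that takes the query parameters and returns one decoded listing, which keeps the paging logic separate from the network call. In Reddit's type prefixes, t3 denotes a link, meaning a post, and because the new listing is ordered newest first, asking for items after a fullname returns older posts.

```python
def fullname(post):
    """Join kind and id with an underscore, e.g. t3 and abc123 give t3_abc123."""
    return f"{post['kind']}_{post['data']['id']}"


def collect_posts(fetch_page, max_pages=5, limit=100):
    """Walk back through a listing, one page at a time.

    fetch_page(params) is a hypothetical callable returning a listing dict.
    """
    posts = []
    params = {"limit": limit}
    for _ in range(max_pages):
        children = fetch_page(params)["data"]["children"]
        if not children:
            break  # nothing further back is available
        posts.extend(children)
        # Ask for the page that follows the last post we have seen.
        params = {"limit": limit, "after": fullname(children[-1])}
    return posts


# Example usage with the requests library (headers must include the bearer token):
# import requests
# def fetch_page(params):
#     res = requests.get("https://oauth.reddit.com/r/python/new",
#                        headers=headers, params=params, timeout=30)
#     res.raise_for_status()
#     return res.json()
# all_posts = collect_posts(fetch_page)
```

The max_pages cap is a design choice for the sketch: it bounds the number of requests even if the listing keeps returning data.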