
How to Use the Reddit API in Python


Chapters

0:00 Intro
0:53 Creating the API
8:20 Retrieving hot posts
12:18 Extracting data
12:38 Importing pandas
13:18 Extracting posts
14:48 Extracting selftext
17:08 Streaming the latest posts
18:38 Adding a limit parameter
19:58 Looping back in time

Transcript

00:00:00.000 | Hi and welcome to this video on how to use the reddit API in Python
00:00:05.000 | So I'm gonna keep this really short and we'll just get straight into the code in just a moment
00:00:10.000 | But I just want to describe what we're actually going to cover in this video
00:00:14.020 | So the first thing we need to do is obviously get access to the API
00:00:19.080 | So I'll just take you through how we can do that and then I'll explain how we authenticate ourselves when accessing
00:00:26.640 | The API after that I'll take you through some of the most common
00:00:31.480 | uses of the API that I think most of you are probably going to be most interested in so that's stuff like getting the
00:00:38.480 | most popular threads from a subreddit or
00:00:42.020 | Just a steady stream of all the threads being posted onto a subreddit
00:00:47.140 | So let's just get straight into it and we'll start putting together our API
00:00:53.320 | Okay, so the first thing we need to do is head over to this page here, which is reddit.com/prefs/apps
00:01:01.880 | Now we just want to scroll down here and find this create another app or create an app button
00:01:07.680 | And you click on there
00:01:10.320 | Now you just give it a name. It doesn't really matter what you call it. Just something that you recognize
00:01:20.680 | We are using this as a script for personal use
00:01:24.640 | Obviously if you are using this API for something else then tick one of the other options that is relevant
00:01:31.260 | You can give it a quick description
00:01:33.600 | And then here you need to give it a
00:01:39.160 | redirect URI so for me, I'm just gonna enter my
00:01:45.700 | Twitter address because
00:01:49.160 | Basically, you can put anything you want in here
00:01:51.400 | But it's so that when people are wanting to find out something about your API
00:01:56.040 | They will be directed to whatever you put in this box
00:01:59.000 | So obviously if someone wants to find out about my API, they'll come here and they'll know that they can ask me about it
00:02:08.900 | Okay, and then here this is our secret key which we are going to need later
00:02:16.560 | So make sure you keep note of this and also this personal use script as well. So I'm just gonna copy those across and
00:02:24.840 | put them into my Jupyter lab here and
00:02:28.680 | I'm just gonna call it client ID
00:02:31.560 | So, the identifier, and this is the public key and
00:02:36.140 | Here we have our secret key as well. So this one you need to keep secret
00:02:44.720 | Obviously, I'm showing you this but this API won't exist by the time I upload the video
00:02:50.100 | And we just enter those
00:03:02.400 | so now we have those the next step is to request a temporary auth token from reddit and
00:03:10.960 | The first thing we need to do is actually import the requests library
00:03:14.360 | Then we get our
00:03:17.080 | authorization like so
00:03:19.240 | And here we enter our client ID and secret key
00:03:37.080 | Now once we've done that we are going to need to actually log in
00:03:41.660 | so to do that we can first initialize a
00:03:47.240 | Dictionary where we specify that we are going to be logging in with a password
00:03:51.680 | Which we do like this
00:03:54.680 | And then we pass in our username and password as well
00:04:09.800 | For my password, I'm just going to read it in from this text file here
00:04:24.560 | If you want, and this is just a simple script, you can just enter your password here. It's not recommended
00:04:36.080 | It's recommended that you read it from elsewhere, but it's completely up to you how you deal with this
00:04:41.120 | But this is how you can read it in from a text file
00:04:48.480 | And just make sure you put R there instead of W for read
00:05:05.600 | Okay, so that is the dictionary that we will need to pass along to Reddit in just a moment
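The login dictionary described here could be put together roughly as follows. This is a sketch, not the code from the video: the field names `grant_type`, `username`, and `password` follow Reddit's password-grant OAuth flow, while the helper name `build_login_data` and any file name passed to it are placeholders chosen for illustration.

```python
from pathlib import Path


def build_login_data(username: str, password_file: str) -> dict:
    """Return the form data for a password-based login.

    The password is read from a text file instead of being hard-coded.
    """
    # read_text opens the file for reading; strip removes a trailing newline
    password = Path(password_file).read_text().strip()
    return {
        "grant_type": "password",  # we are logging in with a password
        "username": username,
        "password": password,
    }
```

Keeping the password in a separate file means the notebook or script can be shared without leaking it.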
00:05:12.280 | so we also need to
00:05:14.680 | essentially identify the version of our API and
00:05:20.120 | For this you can literally put anything you want, but we'll put something that is at least slightly descriptive
00:05:29.320 | We'll just call it my API
00:05:35.240 | And put this is the version number
00:05:38.840 | Now all we need to do is actually send a request for our OAuth token
00:05:47.600 | We send this request to this address
00:06:02.360 | We are accessing the API
00:06:04.360 | version 1 and
00:06:07.320 | the access token endpoint
00:06:10.480 | And in there, we also need to include the auth that we set up earlier
00:06:20.520 | We need to include our login data
00:06:30.280 | And we also need to include the headers
00:06:33.120 | And this will return us hopefully everything that we need
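Put together, the token request might look like the sketch below. The endpoint is Reddit's documented `/api/v1/access_token`; the function name `request_token` is hypothetical, and `session` is assumed to be a `requests.Session` (or any object with a compatible `post` method), which keeps the helper easy to try out without real credentials.

```python
TOKEN_URL = "https://www.reddit.com/api/v1/access_token"


def request_token(session, client_id, secret_key, login_data, headers):
    """POST the login data to Reddit and return the access token string."""
    response = session.post(
        TOKEN_URL,
        auth=(client_id, secret_key),  # HTTP basic auth with the app's two keys
        data=login_data,               # grant_type, username, password
        headers=headers,               # must carry a User-Agent
    )
    response.raise_for_status()
    return response.json()["access_token"]


# Example use (needs the third-party `requests` package and real credentials):
#   import requests
#   headers = {"User-Agent": "MyAPI/0.0.1"}
#   token = request_token(requests.Session(), CLIENT_ID, SECRET_KEY, data, headers)
```

If the credentials are rejected, the JSON body will not contain `access_token`, so the lookup raises a `KeyError` rather than silently returning nothing.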
00:06:43.000 | Okay, and then here we can see our access token
00:06:49.920 | We need to access that and we just store it in a
00:06:54.880 | variable here. So this token is something that we will need to add to our headers whenever we're using the API
00:07:04.360 | So to do that, we just write this
00:07:06.880 | And we need to add that within authorization
00:07:16.720 | The token itself needs to be formatted in a string
00:07:22.360 | That contains the word bearer space and then the token itself
00:07:27.480 | So then if we just print out headers this is what we get
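As a small illustration of that header format, here is one way to add the token. The helper name `with_bearer_token` and the token value `abc123` are made up for the example.

```python
def with_bearer_token(headers: dict, token: str) -> dict:
    """Return a copy of `headers` with the Authorization entry added."""
    # The value is the word "bearer", a space, then the token itself
    return {**headers, "Authorization": f"bearer {token}"}


headers = with_bearer_token({"User-Agent": "MyAPI/0.0.1"}, "abc123")
print(headers)
```

Returning a new dictionary leaves the original User-Agent-only headers untouched, which is handy for comparing authenticated and unauthenticated requests.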
00:07:33.120 | So now we can access every endpoint within the reddit API
00:07:38.040 | so beforehand if we had
00:07:40.960 | Tried to access this endpoint
00:07:50.280 | oauth.reddit.com, then /api/v1/me
00:07:59.080 | If we'd have tried to access this we would have not been allowed so
00:08:04.440 | let's say we just put the headers and
00:08:07.520 | We will just put this user agent API that we had before
00:08:16.160 | And we get a 401 response
00:08:20.920 | Let's copy this and try again
00:08:23.640 | but this time
00:08:27.000 | Use headers which includes our authorization bearer token
00:08:31.080 | and you see we get a 200, which means everything is okay, and
00:08:35.840 | Then we can add JSON onto the end here and we get all of this information
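The before-and-after check can be wrapped in a small helper like this sketch. The function name `check_identity` is hypothetical, and `session` is again assumed to behave like a `requests.Session`.

```python
ME_URL = "https://oauth.reddit.com/api/v1/me"


def check_identity(session, headers):
    """GET /api/v1/me and return (status_code, parsed body or None)."""
    response = session.get(ME_URL, headers=headers)
    if response.status_code != 200:
        # e.g. 401 when the Authorization header is missing
        return response.status_code, None
    return response.status_code, response.json()
```

Calling it once with only the User-Agent header and once with the bearer token added reproduces the 401 and 200 responses described here.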
00:08:42.200 | So, that's great we now have access to anything
00:08:50.160 | We can start accessing what I think is probably the more relevant important information
00:08:55.800 | so the first one of those I want to focus on is
00:08:59.720 | Retrieving the most popular posts on a subreddit. So if I head over to the
00:09:07.320 | reddit API
00:09:09.960 | documentation over here
00:09:12.040 | Okay, so we can see here we have this GET
00:09:16.360 | /r/subreddit/hot, and this returns all of the hot posts on that subreddit
00:09:22.400 | So in our case, let's go with the hot threads in the Python subreddit
00:09:30.560 | so to do that we send a get request and
00:09:34.800 | Like you can see here, it's this /r/subreddit/hot
00:09:40.040 | So we can copy that across and
00:09:45.960 | We start the request with the
00:09:49.040 | oauth.reddit.com
00:09:52.560 | And then we have our /r/subreddit, get rid of this end bracket
00:09:57.400 | Hot and of course the subreddit that we want to look at is Python
00:10:04.600 | And then we can just add our headers in here
00:10:14.320 | So, this is requests, not reddit
00:10:17.320 | And then we can see what is in there using this JSON method and then here we get all this data so
00:10:26.920 | This is obviously not very clean at the moment
00:10:30.280 | So let's clean this up and we can put it into a pandas data frame. So it's a bit more readable
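For reference, the hot-posts request described above could be sketched like this. `fetch_hot` is a hypothetical helper name; `session` is assumed to be a `requests.Session`, and `headers` is assumed to already include the bearer token.

```python
def fetch_hot(session, subreddit, headers):
    """Return the parsed JSON listing of hot posts for `subreddit`."""
    url = f"https://oauth.reddit.com/r/{subreddit}/hot"
    response = session.get(url, headers=headers)
    response.raise_for_status()  # fail loudly on 401, 403, and so on
    return response.json()


# Example use (needs the third-party `requests` package and a valid token):
#   import requests
#   listing = fetch_hot(requests.Session(), "python", headers)
```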
00:10:36.160 | so first let's figure out how to access each post within the
00:10:43.480 | response
00:10:45.480 | So, let's open this again
00:10:49.120 | Now within this JSON all of our posts are contained within this data key here
00:10:58.120 | So, data
00:11:00.120 | And then once we get into data we have a few different options
00:11:07.000 | So we have this modhash, which is nothing we need to care about
00:11:11.240 | We have dist, which is 27. That's not the posts we want, and then we have this one here, which is children
00:11:18.680 | and then you'll see that this is a list and
00:11:21.640 | Within this list we have all the information about all of the hot posts within the Python subreddit
00:11:29.360 | So that is where we want to extract data from
00:11:33.000 | So let's do that
00:11:35.760 | Let's print that post
00:11:42.240 | Okay, and now we are getting somewhere and
00:11:47.960 | You can see there's quite a lot of data in each one of these
00:11:53.400 | So it's probably worth us cleaning this up a little bit more. You can see here, this is our other,
00:12:00.400 | The next entry in this list
00:12:05.840 | So what we probably want to do here is extract the data within the post so this is giving us this
00:12:13.360 | other dictionary which contains all the relevant information we want and
00:12:18.160 | Then it is within here that we are going to want to extract
00:12:22.080 | different parts of information into our data frame, so
00:12:26.960 | Just as an example, we have the title. Okay, and then here we can see all of these
00:12:33.680 | Titles of the most popular threads in the subreddit
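The nesting walked through above (`data`, then `children`, then each post's own `data`) can be seen on a small stand-in for the response. The sample values are invented for illustration; a real listing carries many more keys per post.

```python
# A trimmed-down stand-in for res.json()
listing = {
    "kind": "Listing",
    "data": {
        "modhash": None,
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"id": "abc", "title": "First post"}},
            {"kind": "t3", "data": {"id": "def", "title": "Second post"}},
        ],
    },
}

titles = []
for post in listing["data"]["children"]:
    # each child wraps the actual post fields in its own "data" dictionary
    titles.append(post["data"]["title"])

print(titles)
```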
00:12:38.080 | So this is essentially the syntax that we're going to use to populate our data frame. So first, let's just
00:12:45.480 | import pandas
00:12:48.240 | Maybe install it
00:13:02.320 | And then we need to initialize our pandas data frame so we do it like so
00:13:08.560 | Okay, and that just gives us an empty data frame and then we're gonna use the for loop like we did before
00:13:16.160 | to loop through each one of the posts and
00:13:19.000 | Just extract them as a row into this data frame
00:13:24.080 | So we'll do df equals
00:13:29.720 | append and
00:13:31.600 | then within this we create a dictionary which is going to contain everything that we would like to include and
00:13:39.160 | At the end of that as well, we also need to remember to set ignore_index
00:13:43.360 | to True, otherwise we'll end up with a load of errors and we want to avoid doing that
00:13:48.640 | So first, let's include the subreddit
00:13:52.720 | Just so we know where this data is actually coming from
00:13:57.360 | So just like before we want to do the post data
00:14:00.120 | And then we just access the subreddit
00:14:05.260 | Okay, and let's just have a look at what we have there. So, okay perfect as expected. We're getting all of these entries through
00:14:14.240 | So that's great, but obviously we're probably gonna want a little bit more than just the subreddit
00:14:25.200 | So, let's just add a few more items as well, so we have the title like we did before
00:14:33.000 | And another pretty important one in my opinion, so let's just go
00:14:49.720 | Yes, another important one is the self text which contains the actual content of the thread
00:14:57.000 | Or the text content of that thread
00:15:02.080 | That one is pretty important. If you're wanting to extract any information about well anything from reddit
00:15:17.960 | Okay, so this is starting to look a little bit better, let's see what we have. Okay, it looks good
00:15:23.880 | And maybe we want to also include a few other items
00:15:33.520 | Maybe the number of upvotes, the downvotes and the score of the posts
00:15:39.600 | So we can do a few different things here. We have the upvote ratio
00:15:47.360 | Which is of course the number of upvotes it is getting in comparison to downvotes
00:15:52.360 | Maybe we'd also just like to include the actual number of upvotes and downvotes as well
00:16:06.520 | Again it's pretty straightforward. We just include these
00:16:12.600 | And we can include downs like so
00:16:23.000 | And finally we can also include the score of the post
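A sketch of the same extraction for current pandas versions follows. The video appends row by row with `df.append(..., ignore_index=True)`, but `DataFrame.append` was removed in pandas 2.0, so this version collects plain dictionaries first and builds the frame once. The helper names `posts_to_rows` and `rows_to_frame` are made up for the example.

```python
FIELDS = ["subreddit", "title", "selftext", "upvote_ratio", "ups", "downs", "score"]


def posts_to_rows(listing: dict) -> list:
    """Flatten a listing into one dict per post, keeping only FIELDS."""
    return [
        {field: post["data"].get(field) for field in FIELDS}
        for post in listing["data"]["children"]
    ]


def rows_to_frame(rows: list):
    """Build the DataFrame in one go from the collected rows."""
    import pandas as pd  # third-party; imported here so posts_to_rows works without it

    return pd.DataFrame(rows, columns=FIELDS)
```

Building the frame once from a list is also faster than growing it inside the loop, since each append copied the whole frame.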
00:16:41.240 | Okay, so that gives us quite a lot of information that we can sort of go ahead with this
00:16:47.760 | Now if there are other things that you're interested in adding in here
00:16:53.200 | You can just do this to actually see what keys you can include
00:16:59.560 | So we access the data
00:17:02.200 | And then keys and this will just return a list of everything in there
00:17:07.400 | Now this is pretty useful for actually finding the most relevant or the most popular posts
00:17:14.440 | But a lot of the time what you might want to do is actually stream the newest post
00:17:20.920 | So you essentially get a real-time update of what is actually going on
00:17:25.760 | And I would say this is probably what most people are going to want to use the API for
00:17:31.040 | So we can take a quick look at that as well
00:17:37.040 | And we can find it just over here, we have this /r/subreddit/new
00:17:41.560 | Okay, so essentially all we actually need to do here is adjust our old call
00:17:47.720 | To instead of reaching out to the hot endpoint, we reach out to the new endpoint
00:17:53.080 | So let's just modify our code to do that
00:17:55.880 | Okay, so up here where we have hot, we just change that to new
00:18:06.720 | Okay, seems to have worked
00:18:09.600 | And then we just do the same thing again, so we just rerun this code
00:18:15.280 | Okay, great
00:18:18.800 | And then we do this
00:18:21.920 | And we get all of the latest posts on our subreddit
00:18:27.760 | Which of course is pretty useful
00:18:29.760 | Now this is returning
00:18:34.160 | around 27, or for this one 25, posts at once
00:18:38.320 | Of course, you're probably going to want maybe a few more than that
00:18:42.880 | So what we can do is actually add a limit parameter
00:18:47.920 | And this limit parameter we just add like so
00:18:53.040 | Add params
00:18:55.440 | And then in here we add limit
00:18:57.440 | And we can go up to 100 items
00:19:03.680 | So if we run that
00:19:05.440 | And let's just take a look at what we had before. We had this JSON and we had this dist equals 25
00:19:10.640 | Which means that we return 25 items before
00:19:13.920 | Now if we run that we will see 100. So now we're returning 100 items
00:19:18.560 | And of course that's pretty useful. So now we're getting more data back
00:19:22.480 | and we can essentially just keep running this again and again and
00:19:27.040 | Extracting as much data as we would like
00:19:31.280 | So if we just rerun this so you can see we go up to
00:19:34.240 | 24 here
00:19:36.880 | Rerun that and we will go over to 99. Okay, so that again is pretty useful
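The limit parameter can be passed as a query parameter, as in this sketch. `fetch_new` is a hypothetical helper, and `session` is assumed to be a `requests.Session`, whose `get` method accepts a `params` dictionary.

```python
def fetch_new(session, subreddit, headers, limit=100):
    """Return the newest posts in `subreddit`, up to `limit` per request."""
    response = session.get(
        f"https://oauth.reddit.com/r/{subreddit}/new",
        headers=headers,
        params={"limit": limit},  # 100 is the maximum mentioned in the video
    )
    response.raise_for_status()
    return response.json()
```

The `dist` value in the returned JSON reports how many items actually came back, which is a quick way to confirm the limit took effect.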
00:19:44.000 | Now there's also one more thing
00:19:46.800 | That is pretty important to understand with this and that is how we can extract the ids of a post
00:19:55.680 | from the reddit api
00:19:59.520 | If we go into post here
00:20:02.000 | We have these two different items we have kind
00:20:07.840 | Which is actually I think
00:20:13.520 | So we have this t3. So Reddit posts just have these different types or kinds
00:20:19.120 | And it's essentially a code that says whether it's a thread or some other type of post which I think is something like ads
00:20:27.360 | Or videos or something along those lines, but generally we're always going to be working with t3 which are threads
00:20:33.600 | But if you are working with something else, of course
00:20:36.320 | That may change
00:20:38.720 | and then as well as that we also have
00:20:41.520 | The id
00:20:45.600 | Which is here and we can put both of these together
00:20:50.560 | In order to create the reddit post id
00:20:58.000 | We add this
00:21:00.000 | With an underscore in the middle
00:21:02.960 | And this
00:21:06.480 | That is the unique id and that is unique for every post on the subreddit
00:21:12.080 | And in the api documentation, you will see this referred to as the full name
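Joining the two pieces is a one-liner, shown here on an invented post. In Reddit's documented type prefixes, `t1` is a comment, `t3` is a link (a post or thread), and `t5` is a subreddit. The helper name `fullname` and the id `abc123` are placeholders.

```python
def fullname(post: dict) -> str:
    """Join a post's kind and id with an underscore."""
    return f"{post['kind']}_{post['data']['id']}"


post = {"kind": "t3", "data": {"id": "abc123", "title": "Example"}}
print(fullname(post))  # t3_abc123
```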
00:21:17.280 | So what we can do with this is actually
00:21:22.620 | essentially loop back in time with the api so
00:21:26.700 | One of the things we can do is only request threads that are further back in time
00:21:33.420 | Than a post given a specific full name, which would be this t3, underscore, mix of letters
00:21:39.820 | so if we would like to do that, so let's
00:21:43.660 | Take this final one we have here
00:21:46.700 | And all we do is add that into
00:21:53.100 | another variable
00:21:55.100 | after, like this, and this will only take
00:21:58.380 | 100 threads that come after
00:22:02.940 | this post in the listing, so older ones
00:22:05.580 | So we can do that
00:22:07.500 | and then what we can do rather than actually initializing our
00:22:11.420 | new data frame we can
00:22:14.140 | Avoid doing that and we can actually loop through and add all of these new posts to our data frame
00:22:20.460 | And then we end up with even more data
00:22:22.460 | And here we go
00:22:28.300 | Okay, so that's how we can walk through and keep extracting more and more data
00:22:35.340 | from the Reddit API. Now, at some point it will stop allowing you to do this
00:22:41.340 | You can only go so far back in time,
00:22:44.540 | which depends on the volume of requests that you're making and the volume of threads on a specific subreddit
00:22:51.340 | But that is essentially all you need to actually do that
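The walk back through older posts could be written as a loop like the sketch below. It is an illustration under stated assumptions rather than the code from the video: `fetch_older_posts` and its `pages` argument are hypothetical, and `session` is assumed to be a `requests.Session` with headers that already carry the bearer token.

```python
def fetch_older_posts(session, subreddit, headers, pages=3, limit=100):
    """Walk back through /new, following each page's last full name."""
    posts = []
    params = {"limit": limit}
    for _ in range(pages):
        response = session.get(
            f"https://oauth.reddit.com/r/{subreddit}/new",
            headers=headers,
            params=params,
        )
        response.raise_for_status()
        children = response.json()["data"]["children"]
        if not children:
            break  # nothing older is available
        posts.extend(children)
        last = children[-1]
        # "after" asks for the items that follow this full name in the listing
        params = {"limit": limit, "after": f"{last['kind']}_{last['data']['id']}"}
    return posts
```

Because `/new` is ordered newest first, the items that follow the last post on a page are older ones, which is why this loop moves back in time. The collected posts can then be turned into rows for the data frame as before.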
00:22:55.980 | So like I said at the start the reddit api is incredibly powerful and unlike most other apis on social networks
00:23:04.780 | It's free to use
00:23:06.700 | So definitely something to take advantage of and see how you can implement it in your own projects
00:23:13.580 | So I hope you've enjoyed the video
00:23:17.500 | Thank you for watching. See you next time. Bye