Scraping ArXiv Papers with OpenAI's GPT 3.5 — AI Assistant #2
Today, we're going to take a look at the next part of the AI assistant project: how we go about getting papers from arXiv to feed into the memory of our AI system. Now, there were two approaches that I took to doing this: a dumb approach and a more sophisticated approach.

The dumb approach, number one, is basically to search arXiv through Python. We run what is essentially their advanced search, filtering for the category computer science: computation and language (cs.CL), and we just grab the top 1,000 or so items, ordered by date, so the most recent papers come first. This is fine, but if you want to target a particular field of study, or something you're learning at the moment, it's not very good, because the scope is pretty broad.

So I decided to go for number two, which I think is a little more sophisticated. We essentially start from one paper that we're interested in, which will have an arXiv ID that looks something like this.
What we're going to do is store that paper, pass it through a large language model, and ask the model to return the references mentioned in the paper. Not all of them, because that would be an insane number of them; just the ones near the start of the references section. We then perform a Google search to find those papers, or rather to find their actual arXiv IDs, store them, and do the same again. So it's like a tree: we're building this graph of all the papers that are relevant to a particular topic. I think this emulates what I end up doing anyway: I read a paper and don't understand something, so I go to the bottom, find where the idea came from, go read that paper, probably don't understand something there either, and keep going further and further back until I understand it well enough. So I think that's a pretty natural approach.
So, what I'm going to do is have a very quick look at number one, just to show what I did there, and then dive into number two in a lot more detail. Right, so this is number one. We're using the arXiv API: I'm doing a search where the category is cs.CL, sorting by date, and returning 1,100 results. That is basically the equivalent of going over to arXiv, opening the advanced search, setting all fields to cs.CL, and searching.
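If you want to reproduce that search, a minimal sketch using the arxiv Python package looks like the following. This is an assumption on my part: the video doesn't name the exact client it uses, but this package wraps the same API.

```python
import arxiv

# Roughly the equivalent of the advanced search: category cs.CL,
# newest first, 1,100 results.
search = arxiv.Search(
    query="cat:cs.CL",
    max_results=1100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for result in search.results():
    print(result.title, result.entry_id)
```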
That is pretty much what I was doing, just in Python. What you end up getting is all these papers, and they are also ordered with the most recent items first. Each one is an arXiv item, or paper object, carrying all this information: authors, categories, the ID, and so on. We end up looping through those, pulling out that information, and putting it into a dataframe, so we get something like this: a summary, a title, and a PDF URL for each paper. The PDF URL is important, because this metadata doesn't actually include the text of the PDF; we need to go and get that ourselves, which is what we do down here. Note that I added "export" to the PDF URL: when you are scraping data from arXiv, they prefer you to use their mirror site, export.arxiv.org, which essentially frees up bandwidth on the main arxiv.org site for actual human users rather than robots like the one we're using here.
So, I download the PDF and then process it with something called PyPDF2: you open the PDF file, go through each page, and append the text from every page, and you get something like this. Now, it's not perfect, and there are a few things that could be cleaned up, but that was the basic process. So, that was okay, but not very efficient.
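Put together, the download-and-extract step looks roughly like this. It's a sketch using the current PyPDF2 PdfReader interface; the video's code may use the older PdfFileReader API, and the paper ID here is just an example.

```python
import io
import requests
from PyPDF2 import PdfReader

# Use the export.arxiv.org mirror rather than the main site,
# as arXiv asks of scrapers.
pdf_url = "https://export.arxiv.org/pdf/2005.14165"
res = requests.get(pdf_url)

# Read the PDF from memory and concatenate the text of every page.
reader = PdfReader(io.BytesIO(res.content))
text = ""
for page in reader.pages:
    text += page.extract_text()
```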
So, what we end up going with is the second approach. Here, for example, we load this paper, then split its text on some marker string, let's say "References". This isn't foolproof, but I didn't see any examples of it being an issue. Actually, there were some cases where the heading wasn't "References" but "REFERENCES" in all capitals, and I think I fixed that in a later version of this code, which we'll come to. For now, we split on "References" and keep the second part of the split, the tail end, and you can see that we have all the references.
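As a minimal sketch, that split might look like this, with the all-caps fallback the later code adds:

```python
# Keep everything after the last occurrence of the heading; fall back
# to the all-caps variant that some papers use.
if "References" in text:
    refs = text.split("References")[-1]
elif "REFERENCES" in text:
    refs = text.split("REFERENCES")[-1]
else:
    refs = ""
```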
Now, the format of those references varies quite a lot. I thought maybe you could build a regex that covers everything; probably that is the case, I'm not sure. But I thought it would be really cool to actually use a large language model to extract the information we need from these references. To do that, we use LangChain: we get a prompt template, we're using OpenAI's text-davinci-003 here, and we couple those things together, prompt template followed by the large language model, using an LLM chain. In here I've entered my OpenAI API key and set a maximum number of tokens, which I don't want to be too excessive because I just want some references, nothing crazy. And the temperature is set to zero to reduce the possibility of the model making stuff up. It doesn't stop it from making stuff up, but it makes it less likely to.
Then we told the model that it's really good at reading papers and extracting references, and we gave it a couple of examples, because without them it wasn't actually doing that well. These examples come from the paper above, but the prompt works on other papers too; I tried it. So we're saying: based on this snippet, you would extract this, with some easy-to-parse formatting, which is really important for us. Then we say: in the references below there are many papers; extract their titles, authors, and years. We feed the references in and ask the model to extract everything. So we have our prompt template, and we've created the chain: the prompt gets fed into our large language model along with a chunk of references.
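A sketch of that chain, using the legacy LangChain interface from around the time of the video. The prompt text and the pipe-separated output format are hypothetical stand-ins, since the video describes the prompt (including its few-shot examples) rather than showing it in full:

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Hypothetical prompt: the real one also includes a couple of few-shot
# examples demonstrating the exact "easy to parse" output format.
template = """You are an expert at reading research papers and extracting
their references. In the references below there are many papers. Extract
their titles, authors, and years, one per line, formatted as:
title | authors | year

References:
{refs}

Extracted references:"""

prompt = PromptTemplate(template=template, input_variables=["refs"])

llm = OpenAI(
    model_name="text-davinci-003",  # the completion model used in the video
    temperature=0.0,  # makes fabricated output less likely (not impossible)
    max_tokens=512,   # hypothetical cap; just enough for the references
)  # reads your key from the OPENAI_API_KEY environment variable

chain = LLMChain(prompt=prompt, llm=llm)
extracted = chain.run(refs=refs)  # refs is later trimmed to ~1,000 tokens, see below
```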
Now, if we go down a bit, over here I was checking how many tokens I should expect within a references page, because I basically decided I want just about one page of references, nothing more. It's probably best if I actually visualize this. So, if I open the paper we're looking at and scroll all the way down to the references: yes, here they are. This is where we're splitting the references from the rest of the paper. And then near the end of that first page we have "What Makes Good In-Context Examples for GPT-3?"; that title is what we split on to say "this is the first page". So we split there, take the first page of references, and count the number of tokens with tiktoken, the tokenizer library OpenAI uses. This is the encoding for the text completion models, although I should probably double-check that; it might be another one, but I figured they're all roughly the same. And we get about 1,500 tokens.
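That count might look like this. The choice of p50k_base, the encoding used by text-davinci-003, is an assumption here; as noted, the video doesn't double-check which encoding it used.

```python
import tiktoken

# Take just the first page of references by splitting on a title that
# appears near the bottom of that page, then count its tokens.
first_page = refs.split("What Makes Good In-Context Examples for")[0]

enc = tiktoken.get_encoding("p50k_base")  # encoding for text-davinci-003
print(len(enc.encode(first_page)))  # roughly 1,500 tokens for this paper
```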
So, what we then do is create a text splitter, the TokenTextSplitter, and tell it to keep just the first chunk of tokens from each references page. I actually went a little lower than 1,500 and used 1,000, which I figured is plenty.
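Something like this, using LangChain's TokenTextSplitter:

```python
from langchain.text_splitter import TokenTextSplitter

# Chunk the references section by token count and keep only the
# first ~1,000-token chunk.
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
refs_chunk = splitter.split_text(refs)[0]
```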
And then from that chunk, we extract the references. Running it on the first paper, we get all of these, and I believe at least most of them are quite accurate. Then we try some other papers. So, based on this, let's say we've got this MathQA paper at the top of the list. What we do next is actually perform a Google search for that paper: we literally just take its title, "MathQA: Towards Interpretable...", and then, based on the results from Google, we look for one that contains the arXiv paper location, anything in the various Google results that looks like an arXiv paper link. We go through that and extract the ID, and that is what we get here.
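The function is called get_paper_id in the code, but this body is my guess at its logic, and the googlesearch-python package is an assumption; the video doesn't specify which search client it uses.

```python
import re
from googlesearch import search  # the googlesearch-python package

def get_paper_id(title):
    """Google a paper title and pull an arXiv ID out of the first
    result that links to an arXiv abstract or PDF page."""
    for url in search(title, num_results=10):
        match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
        if match:
            return match.group(1)
    return None  # no arXiv link found; this reference gets skipped

paper_id = get_paper_id("MathQA: Towards Interpretable Math Word Problem Solving")
```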
Then we use that ID to fetch another paper. We can click on this to see what we got: "Language Models are Few-Shot Learners". Coming a little further down, this is where we download the actual paper. After that, I just put everything into a set of functions. It still needs to go through all the same steps, but now we just set paper equal to this item here, load the paper, and pass save=True to save it to file. Load actually does two things: it looks on your local drive to see if you already have the paper downloaded, and if not, it goes to arXiv and requests the PDF.
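A hypothetical reconstruction of that loader; the real method lives on the video's arXiv object, so the names and paths here are illustrative only:

```python
import os
import requests

def load(paper_id, save=True):
    """Return the raw PDF for an arXiv ID: prefer a local copy,
    otherwise fetch it from the export mirror (and optionally save)."""
    local_path = f"papers/{paper_id}.pdf"
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            return f.read()
    res = requests.get(f"https://export.arxiv.org/pdf/{paper_id}")
    if save:
        os.makedirs("papers", exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(res.content)
    return res.content
```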
And you can see I've done that for this one. So I've loaded the paper, and then we have this other function called get_meta, which basically gets all the relevant information for that particular paper. Let me open the code for that. Coming down to here, this is a more recent version of the code for the arXiv object. get_meta is here; actually, I think the work happens in load, which calls self.download_meta. If you go to that function, you get to here: it's calling the arXiv API again. You've got your paper ID, you pass it in, and it returns all the relevant information for that paper: the authors, categories, comment, everything. So we're just calling the arXiv API to get that metadata, and we end up with all of it.
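A sketch of that single-paper lookup with the arxiv package, again assuming that client; the real call sits inside download_meta:

```python
import arxiv

# Look up a single paper's metadata by ID.
result = next(arxiv.Search(id_list=["2005.14165"]).results())
meta = {
    "title": result.title,
    "authors": [a.name for a in result.authors],
    "categories": result.categories,
    "summary": result.summary,
    "published": result.published,
}
```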
That's cool. After that, we also have the content of the paper, which we got the same way as I just showed you: download the PDF and extract the text with PyPDF2. Then we get our references, using the large language model like before. The code is slightly different at this point, but it's basically doing the same thing, and we get all of the references in this format. They actually get attached to the paper object, so we can see all the children of that paper, so to speak. Strictly, the references would be more like the parents of the paper, but in this arrangement we have the paper and then all of its references, which act as children within the tree, or knowledge graph.
So after that (I'm not sure why I repeated the step here), we just save that information. Then we go through and use this get_paper_id function, the same thing I mentioned before, where we do a Google search to get the paper's arXiv ID. You can see we have a couple of Nones here; those are the cases where I couldn't find anything. I don't worry so much about that: if you can't find a paper immediately on Google, it's probably not so important, so we just ignore those and skip them. But for these papers here, we do exactly the same as we did before, when we came up here and searched for the first paper, and we can do that for all of these other papers as well. So we basically just create this big graph of all these different papers.
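End to end, the loop might look something like this hypothetical sketch, reusing the load and get_paper_id helpers sketched above plus a stand-in extract_references for the LLM step; the video's actual code organizes this differently.

```python
def build_graph(paper_id, depth, graph=None):
    """Crawl outwards from one paper: extract its references with the
    LLM, resolve each title to an arXiv ID, and recurse so references
    become children in the graph."""
    if graph is None:
        graph = {}
    if depth == 0 or paper_id in graph:
        return graph
    pdf = load(paper_id)               # local copy or export mirror
    refs = extract_references(pdf)     # the LLMChain step from earlier
    children = [get_paper_id(r["title"]) for r in refs]
    graph[paper_id] = [c for c in children if c is not None]  # drop the Nones
    for child in graph[paper_id]:
        build_graph(child, depth - 1, graph)
    return graph
```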
And right now, that's formalized to some degree. Not the best code in the world, but formalized to some degree with this here: we have the arXiv object, plus a couple of functions up top that just help us handle everything, like getting the paper ID, initializing our extractors for extracting references, and so on. All of this code, which there will be a link to in the description, is within the constructors file at the moment. I need to rename everything; it's kind of just thrown together right now. Based on it, we just process everything: everything you just saw is contained here, along with a few other things, because I've added a couple of things since then.
Okay, so that's it for this one; we'll leave it there for now. I hope this hasn't been too messy and that it's at least somewhat interesting to see how we're handling this data preprocessing step. I suppose the most interesting bit here is extracting those references and using a search to find the other papers that have been referenced within our current paper. At least, that was the most interesting part for me. So, that being said, we'll leave it there. Thank you very much for watching. I hope this has been useful in some way, and I will see you again in the next one. Bye.