
Scraping arXiv Papers with OpenAI's GPT-3.5 — AI Assistant #2


Transcript

Today, we're going to take a look at the next part of the AI assistant project, which is how we approach getting all the papers from arXiv in order to feed them into the memory of our AI system. Now, there were two approaches that I took to doing this: a dumb approach and a more sophisticated approach.

The dumb approach, number one, is basically: we search arXiv through Python. So, we do a search, like the advanced search on the site, and we're searching for the category to be computer science, computation and language (cs.CL). We go through and just get the top 1,000 items from there, ordered by date.

So, we'd have the most recent ones at the top. This is fine, but if you want to target a particular field of study or something you're learning at the moment, it's not very good, because the scope is pretty broad. So, what I decided to go for is number two, which I think is a little more sophisticated: we essentially take one paper that we're interested in.

So, we start with one paper up here, right? That will have an arXiv ID that looks something like that. What we're going to do is store it, then pass it through a large language model and ask it to return the references mentioned in the paper, right?

But not all of them, because that would be an insane number. We're just going to get those, then perform a Google search to find those papers, or rather the actual arXiv IDs of those papers, and then we store them, right? And then we do the same again.

Okay. So, it's like a tree. We're building this graph of all the papers that are relevant to a particular topic, which I think kind of emulates what we, or at least what I, end up doing: I read a paper and I don't understand something, so I go to the bottom, find where it came from, go read that paper, probably don't understand something again, and basically keep going.

I go further and further back until I understand something well enough. So, I think that's a pretty natural approach. What I'm going to do is quickly have a look at number one, kind of what I did there, and then for number two we'll dive into a lot more detail.

So, let's go have a look at that. Right. So, this is number one. We're using the arXiv API, and you can see here I'm doing a search, right? The category is cs.CL, we're sorting by date, and we're returning 1,100 results. That is basically the equivalent of going over to arXiv, going to advanced search, setting all fields to cs.CL and searching.

That is pretty much what I was doing, but just in Python. So, what you end up getting is all these papers, and these are already ordered with the most recent items first. So, based on that, we'd get all these papers.

This is just the arXiv item or paper object that we would get from that, and it contains all this information: authors, categories, the ID, and so on. We'd end up looping through that, getting all this information, and putting it into a dataframe, and we get something like this.
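To make that concrete, here's roughly what it looks like in Python. This is a minimal sketch assuming the `arxiv` pip package and pandas; the exact query and field names in the original notebook may differ slightly.

```python
import arxiv
import pandas as pd

# Search cs.CL, newest first, and pull the top results
# (a sketch of approach #1, mirroring the description above).
search = arxiv.Search(
    query="cat:cs.CL",
    max_results=1100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

rows = []
for result in search.results():
    rows.append({
        "id": result.entry_id,
        "title": result.title,
        "summary": result.summary,
        "authors": [a.name for a in result.authors],
        "categories": result.categories,
        "pdf_url": result.pdf_url,
        "published": result.published,
    })

df = pd.DataFrame(rows)
```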

So, we end up with a summary, title, and a PDF URL. The PDF URL is important, right? Because in this information here, we don't actually have the text of the PDF; we need to go and get that. So, that's what we would do down here. So, we get the PDF URL.

Note that I added 'export' onto the URL there. Basically, when you are scraping data from arXiv, they prefer you to use their export mirror, which essentially frees up bandwidth on their main site, arxiv.org, for actual real users rather than the bots we're using here.

So, I'll download the PDF and then process it with something called PyPDF2 here. You just open the PDF file, go through each page, and append the text from each page like this. And you get something like this. Now, it's not perfect, there are a few things that could be cleaned up, but that was the basic process.
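In code, that download-and-extract step looks something like the sketch below. The paper ID is just a placeholder, and note that older PyPDF2 versions use `PdfFileReader` instead of `PdfReader`.

```python
import requests
from PyPDF2 import PdfReader

paper_id = "2101.06804"  # placeholder arXiv ID for illustration
pdf_url = f"https://export.arxiv.org/pdf/{paper_id}"

# Download the PDF from the export mirror and save it locally
with open(f"{paper_id}.pdf", "wb") as f:
    f.write(requests.get(pdf_url).content)

# Read every page and append its text
reader = PdfReader(f"{paper_id}.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
```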

So, that was okay, but not very efficient, so what we end up going with is the second approach. In the second approach, what we do is, for example, load this paper here, then split the text on the word 'References'. You know, this isn't foolproof, but I didn't see any examples of this being an issue.

Actually, there were some cases where it wasn't 'References' but 'REFERENCES' in all capitals, and I think I fixed that in a later version of this code, which we'll come to. But for now, let's have a look at this. So, we split on 'References' and we get something like this.
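The split itself is only a couple of lines, something like this sketch (handling the all-caps case too):

```python
# Split the extracted text on the references heading; the second part of
# the split is the references section. Not foolproof, as mentioned above.
if "References" in text:
    body, references_section = text.split("References", 1)
elif "REFERENCES" in text:
    body, references_section = text.split("REFERENCES", 1)
else:
    body, references_section = text, ""
```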

So, this is getting the end of that, the second part of the split, and you can see that we have all these references, right? Now, the format of those references varies quite a lot. I thought maybe you could build a regex that would cover everything.

Probably that is the case, I'm not sure, but I thought it'd be really cool if we actually used a large language model to extract all the information we need from these references. So, to do that, we use LangChain. We get a prompt template, and we're using OpenAI's text-davinci-003 here.

And we couple those things together: our prompt template followed by the large language model, using an LLM chain. In here, I've entered my OpenAI API key and set a maximum number of tokens, which I don't want to be too excessive, because I just want some references back.

It's nothing crazy. The temperature is set to zero to reduce the possibility of the model making stuff up; it doesn't stop it from making stuff up, but it means it's less likely to. And then we just told the model that it's really good at extracting references from papers.

Right. And then we gave it a couple of examples, because it wasn't actually doing that well without them. These are just a few examples taken from the paper above, but it works on other papers too; I tried it. So, we're saying: based on this snippet here, you would extract this.

So, we've got some easy-to-parse formatting here, which is really important for us. Then we say: in the references below there are many papers, extract their titles, authors, and years. Which is what we've got here: titles, authors, and years. And then we feed the references into there.
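Wired together with LangChain, the chain looks roughly like this. I'm paraphrasing the prompt and the few-shot example rather than copying the originals, and this uses the older LangChain interface that was current when text-davinci-003 was available.

```python
from langchain import PromptTemplate, LLMChain
from langchain.llms import OpenAI

# Paraphrased prompt: tell the model it's good at extracting references,
# show one example of the easy-to-parse output format, then pass the refs.
template = """You are an expert at extracting references from research papers.

Example reference:
[1] T. Brown et al. Language Models are Few-Shot Learners. 2020.
Extracted:
Language Models are Few-Shot Learners | T. Brown et al. | 2020

In the references below there are many papers. Extract their titles,
authors, and years in the same format.

References:
{references}
"""

prompt = PromptTemplate(input_variables=["references"], template=template)

llm = OpenAI(
    model_name="text-davinci-003",
    openai_api_key="YOUR_API_KEY",
    temperature=0.0,  # keep the model from improvising
    max_tokens=512,   # nothing crazy; we only need a short list back
)

chain = LLMChain(llm=llm, prompt=prompt)
extracted = chain.run(references=references_section)
```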

And then we ask it to extract everything. So, we have our prompt template and we've created this chain, so we're going to feed this prompt into our large language model with some references, as per this part here. Now, if we go down, over here I was just checking how many tokens I should expect within the references page.

I basically said I want just about one page of references, nothing more. So, I found this near the bottom of the page. It's probably best if I actually visualize this a little. So, if I open this, right, this is the paper we're actually looking at. Go all the way down to the references.

And yes, so we have the references, which we're splitting off from the rest of the paper here. Then near the end of this page, we have this: 'What Makes Good In-Context Examples for GPT-3?'. That's what we're splitting on here to say this is the first page.

Now, we split on there, come to here, and we count the number of tokens with tiktoken. This is the tokenizer that OpenAI uses, and this is the one for the model we're using here, although I should probably check; it might be another one. But anyway, I figured they're all roughly the same.

And we get about 1,500 tokens. So, what we then do is create this text splitter, the token text splitter, and we say just get the first chunk of tokens from the references. That's what we're doing here. And I think we actually went a little bit lower than 1,500, at 1,000 tokens, which I figured is plenty.
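Here's a sketch of that token counting and truncation, assuming the text-davinci-003 encoding in tiktoken and LangChain's TokenTextSplitter:

```python
import tiktoken
from langchain.text_splitter import TokenTextSplitter

# Count how many tokens the references section contains
encoding = tiktoken.encoding_for_model("text-davinci-003")
print(len(encoding.encode(references_section)))  # roughly 1,500 for this paper

# Keep only the first ~1,000 tokens (about one page of references)
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
references_section = splitter.split_text(references_section)[0]
```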

And then from that, we extract the references. Running on the first one, we get all of these, and I believe at least most of these are quite accurate. And then we try some other papers. Okay. So, based on this, let's say we've got this MathQA paper at the top here.

Come down here, and what we do is actually perform a Google search for that paper. So, we literally just take the title of that paper, 'MathQA: Towards Interpretable...', and then, based on those results from Google, we're going to look for one of the results that contains this here.

All right. So, this is the arXiv paper location. Basically, we're searching for that: anything in the Google results that looks like an arXiv paper link. We go through that and then extract the ID, and that is what we get here. Then we use this to fetch another paper.
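That search-and-extract step might look like the sketch below. I'm assuming the `googlesearch-python` package here, and `get_paper_id` is just my name for the helper, so treat it as illustrative rather than the exact code.

```python
import re
from googlesearch import search  # googlesearch-python package

def get_paper_id(title: str):
    """Google the paper title and pull an arXiv ID out of the first
    result that looks like an arXiv paper link."""
    for url in search(title, num_results=10):
        match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
        if match:
            return match.group(1)
    return None  # couldn't find an arXiv link for this title

paper_id = get_paper_id("MathQA: Towards Interpretable Math Word Problem Solving")
```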

So, we get here. We can click on this and see what we got: 'Language Models are Few-Shot Learners'. Then, coming a little further down, we get to here in order to download the actual paper. So, after that, I just put everything into a set of functions.

Yeah, it still needs to go through all of that, but basically, after we've pulled those into different functions, we go paper equals this item here. We load the paper and then say save equals true to save the paper to file. Load here actually does two things.

It will look on your local drive and see if you already have the paper downloaded; if not, it's going to go to arXiv and request the PDF. You can see I've done that for this one. So, I've loaded the paper.
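A sketch of that load behaviour, checking the local drive first and only hitting arXiv when the PDF isn't there (the function name and paths are mine, not the original code):

```python
import os
import requests
from PyPDF2 import PdfReader

def load(paper_id: str, save: bool = True) -> str:
    """Return the paper's text, downloading the PDF from arXiv's export
    mirror only if we don't already have it on disk."""
    path = f"papers/{paper_id}.pdf"
    if not os.path.exists(path):
        os.makedirs("papers", exist_ok=True)
        res = requests.get(f"https://export.arxiv.org/pdf/{paper_id}")
        with open(path, "wb") as f:
            f.write(res.content)
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() for page in reader.pages)
    if not save:
        os.remove(path)  # only keep the PDF on disk when save=True
    return text
```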

Then we have this other function called get_meta, which basically gets all the relevant information for that particular paper. Let me open the code for that. So, come down to here. This is a more recent version of this code, the arXiv object, and we go to get_meta here.

Okay, that's actually where it happens, or I think it might be in load. Okay, load. And it calls this self.downloadmeta. You go to that function and we get to here. Basically, that's calling the arXiv API again: you've got your ID, you pass it in there, and we just return all the relevant information for that paper.
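That metadata call is roughly this, again assuming the `arxiv` package; the function name and the returned field names are illustrative rather than the exact attributes of the object in the repo.

```python
import arxiv

def download_meta(paper_id: str) -> dict:
    """Fetch a paper's metadata from the arXiv API by its ID."""
    result = next(arxiv.Search(id_list=[paper_id]).results())
    return {
        "id": paper_id,
        "title": result.title,
        "authors": [a.name for a in result.authors],
        "categories": result.categories,
        "comment": result.comment,
        "summary": result.summary,
        "published": result.published,
    }
```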

All right. So, we have the authors, categories, comment, everything. We're just calling the arXiv API in order to get that information, and we end up with all of this. That's cool. And then after that, we also have the content of the paper, which we actually got when we did this.

Same thing as I just showed you: just download the PDF and extract it all with PyPDF2. After that, we get our references. This is actually using the large language model like I just showed you; I think the code is slightly different at this point, but it's basically doing the same thing.

And we get all of the references in this format here. This actually gets attached to the paper object so that we can see all the children, essentially, of that paper, if that makes sense. I suppose they would be more like the parents of the paper, but in this arrangement we have the paper and then all of its references, which act as children within that tree or knowledge graph.
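So the paper object ends up looking something like this. It's a hypothetical shape, just to show the idea of the references hanging off the paper as its children:

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    arxiv_id: str
    title: str
    content: str = ""                                # full text from the PDF
    references: list = field(default_factory=list)   # [{"title": ..., "authors": ..., "year": ...}]
```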

So, after that (I don't know why I repeated it here), we just save that information. Then what we do is go through and use this get paper ID function. This is the same thing I mentioned before, where we're doing a Google search in order to get the paper information, and you can see we have a couple of Nones here.

That's where I couldn't find anything. I don't really worry so much about that; if you can't find a paper immediately on Google, it's probably not that important, so we just ignore those and skip them. But for these papers here, we do just the same as before.

So, just as we came up here and searched for the first paper using its arXiv ID, we can do that for all of these other papers as well. So, we basically just create this big graph of all these different papers. And right now, that's kind of formalized, to some degree.
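Putting the pieces together, the graph-building loop is conceptually something like the sketch below. All the helper names (load, download_meta, extract_references, get_paper_id) are the hypothetical versions from earlier, not the exact functions in the repo.

```python
def build_graph(seed_id: str, max_papers: int = 25) -> dict:
    """Breadth-first walk over references, starting from one seed paper."""
    graph, queue = {}, [seed_id]
    while queue and len(graph) < max_papers:
        paper_id = queue.pop(0)
        if paper_id in graph:
            continue  # already visited this paper
        text = load(paper_id, save=True)
        meta = download_meta(paper_id)
        refs = extract_references(text)  # the LLM chain shown earlier
        graph[paper_id] = {**meta, "content": text, "references": refs}
        for ref in refs:
            child_id = get_paper_id(ref["title"])  # Google search for the arXiv ID
            if child_id:  # skip the Nones we couldn't find
                queue.append(child_id)
    return graph
```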

Not the best code in the world, but formalized to some degree with this here. So, we have this, which I need to clean up, but we have this arXiv object and a couple of functions up here that help us handle everything, like getting the paper ID, initializing our extractors for extracting references, and so on.

And based on all this code here, which there will be a link to in the description (it's within the constructors file at the moment; I need to rename everything, it's kind of just thrown together), we actually just process everything. So, everything you just saw is contained here, plus a few other things, because I've added a couple of things since then.

Okay, so that's it for this one; we'll leave it there for now. I hope this has not been too messy and that it's at least somewhat interesting to see how we're handling this data pre-processing. I suppose the most interesting bit here is actually extracting those references and using a search to get the other papers that are referenced within our current paper.

At least that was the most interesting part for me. So, that being said, we'll leave it there. Thank you very much for watching. I hope this has been useful in some way, and I will see you again in the next one. Bye.