Scraping ArXiv Papers with OpenAI's GPT 3.5 — AI Assistant #2
Today, we're going to take a look at the next part of the AI assistant project: how we go about getting papers from arXiv to feed into the memory of our AI system. Now, there were two approaches that I took to doing this: a dumb approach and a more sophisticated approach.

The dumb approach, number one, is basically to search arXiv through Python. We run what is essentially their advanced search, filtering for the category computer science: computation and language (cs.CL), and we just grab the top 1,000 or so items, ordered by date, so the most recent papers come first. This is fine, but if you want to target a particular field of study, or something you're learning at the moment, it's not very good, because the scope is pretty broad.

So I decided to go for number two, which I think is a little more sophisticated. We essentially start from one paper that we're interested in, which will have an arXiv ID that looks something like this.
What we're going to do is store that paper, pass it through a large language model, and ask the model to return the references mentioned in the paper. Not all of them, because that would be an insane number of them; just the ones near the start of the references section. We then perform a Google search to find those papers, or rather to find their actual arXiv IDs, store them, and do the same again. So it's like a tree: we're building this graph of all the papers that are relevant to a particular topic. I think this emulates what I end up doing anyway: I read a paper and don't understand something, so I go to the bottom, find where the idea came from, go read that paper, probably don't understand something there either, and keep going further and further back until I understand it well enough. So I think that's a pretty natural approach.
So, what I'm going to do is have a very quick look at number one, just to show what I did there, and then dive into number two in a lot more detail. Right, so this is number one. We're using the arXiv API: I'm doing a search where the category is cs.CL, sorting by date, and returning 1,100 results. That is basically the equivalent of going over to arXiv, opening the advanced search, setting all fields to cs.CL, and searching.
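If you want to reproduce that search, a minimal sketch using the arxiv Python package looks like the following. This is an assumption on my part: the video doesn't name the exact client it uses, but this package wraps the same API.

```python
import arxiv

# Roughly the equivalent of the advanced search: category cs.CL,
# newest first, 1,100 results.
search = arxiv.Search(
    query="cat:cs.CL",
    max_results=1100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for result in search.results():
    print(result.title, result.entry_id)
```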
That is pretty much what I was doing, just in Python. What you end up getting is all these papers, and they are also ordered with the most recent items first. Each one is an arXiv item, or paper object, carrying all this information: authors, categories, the ID, and so on. We end up looping through those, pulling out that information, and putting it into a dataframe, so we get something like this: a summary, a title, and a PDF URL for each paper. The PDF URL is important, because this metadata doesn't actually include the text of the PDF; we need to go and get that ourselves, which is what we do down here. Note that I added "export" to the PDF URL: when you are scraping data from arXiv, they prefer you to use their mirror site, export.arxiv.org, which essentially frees up bandwidth on the main arxiv.org site for actual human users rather than robots like the one we're using here.
So, I download the PDF and then process it with something called PyPDF2: you open the PDF file, go through each page, and append the text from every page, and you get something like this. Now, it's not perfect, and there are a few things that could be cleaned up, but that was the basic process. So, that was okay, but not very efficient.
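Put together, the download-and-extract step looks roughly like this. It's a sketch using the current PyPDF2 PdfReader interface; the video's code may use the older PdfFileReader API, and the paper ID here is just an example.

```python
import io
import requests
from PyPDF2 import PdfReader

# Use the export.arxiv.org mirror rather than the main site,
# as arXiv asks of scrapers.
pdf_url = "https://export.arxiv.org/pdf/2005.14165"
res = requests.get(pdf_url)

# Read the PDF from memory and concatenate the text of every page.
reader = PdfReader(io.BytesIO(res.content))
text = ""
for page in reader.pages:
    text += page.extract_text()
```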
So, what we end up going with is the second approach. Here, for example, we load this paper, then split its text on some marker string, let's say "References". This isn't foolproof, but I didn't see any examples of it being an issue. Actually, there were some cases where the heading wasn't "References" but "REFERENCES" in all capitals, and I think I fixed that in a later version of this code, which we'll come to. For now, we split on "References" and keep the second part of the split, the tail end, and you can see that we have all the references.
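As a minimal sketch, that split might look like this, with the all-caps fallback the later code adds:

```python
# Keep everything after the last occurrence of the heading; fall back
# to the all-caps variant that some papers use.
if "References" in text:
    refs = text.split("References")[-1]
elif "REFERENCES" in text:
    refs = text.split("REFERENCES")[-1]
else:
    refs = ""
```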
Now, the format of those references varies quite a lot. I thought maybe you could build a regex that covers everything; probably that is the case, I'm not sure. But I thought it would be really cool to actually use a large language model to extract the information we need from these references. To do that, we use LangChain: we get a prompt template, we're using OpenAI's text-davinci-003 here, and we couple those things together, prompt template followed by the large language model, using an LLM chain. In here I've entered my OpenAI API key and set a maximum number of tokens, which I don't want to be too excessive because I just want some references, nothing crazy. And the temperature is set to zero to reduce the possibility of the model making stuff up. It doesn't stop it from making stuff up, but it makes it less likely to.
Then we told the model that it's really good at reading papers and extracting references, and we gave it a couple of examples, because without them it wasn't actually doing that well. These examples come from the paper above, but the prompt works on other papers too; I tried it. So we're saying: based on this snippet, you would extract this, with some easy-to-parse formatting, which is really important for us. Then we say: in the references below there are many papers; extract their titles, authors, and years. We feed the references in and ask the model to extract everything. So we have our prompt template, and we've created the chain: the prompt gets fed into our large language model along with a chunk of references.
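A sketch of that chain, using the legacy LangChain interface from around the time of the video. The prompt text and the pipe-separated output format are hypothetical stand-ins, since the video describes the prompt (including its few-shot examples) rather than showing it in full:

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Hypothetical prompt: the real one also includes a couple of few-shot
# examples demonstrating the exact "easy to parse" output format.
template = """You are an expert at reading research papers and extracting
their references. In the references below there are many papers. Extract
their titles, authors, and years, one per line, formatted as:
title | authors | year

References:
{refs}

Extracted references:"""

prompt = PromptTemplate(template=template, input_variables=["refs"])

llm = OpenAI(
    model_name="text-davinci-003",  # the completion model used in the video
    temperature=0.0,  # makes fabricated output less likely (not impossible)
    max_tokens=512,   # hypothetical cap; just enough for the references
)  # reads your key from the OPENAI_API_KEY environment variable

chain = LLMChain(prompt=prompt, llm=llm)
extracted = chain.run(refs=refs)  # refs is later trimmed to ~1,000 tokens, see below
```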
Now, if we go down a bit, over here I was checking how many tokens I should expect within a references page, because I basically decided I want just about one page of references, nothing more. It's probably best if I actually visualize this. So, if I open the paper we're looking at and scroll all the way down to the references: yes, here they are. This is where we're splitting the references from the rest of the paper. And then near the end of that first page we have "What Makes Good In-Context Examples for GPT-3?"; that title is what we split on to say "this is the first page". So we split there, take the first page of references, and count the number of tokens with tiktoken, the tokenizer library OpenAI uses. This is the encoding for the text completion models, although I should probably double-check that; it might be another one, but I figured they're all roughly the same. And we get about 1,500 tokens.
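That count might look like this. The choice of p50k_base, the encoding used by text-davinci-003, is an assumption here; as noted, the video doesn't double-check which encoding it used.

```python
import tiktoken

# Take just the first page of references by splitting on a title that
# appears near the bottom of that page, then count its tokens.
first_page = refs.split("What Makes Good In-Context Examples for")[0]

enc = tiktoken.get_encoding("p50k_base")  # encoding for text-davinci-003
print(len(enc.encode(first_page)))  # roughly 1,500 tokens for this paper
```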
So, what we then do is create a text splitter, the TokenTextSplitter, and tell it to keep just the first chunk of tokens from each references page. I actually went a little lower than 1,500 and used 1,000, which I figured is plenty.
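Something like this, using LangChain's TokenTextSplitter:

```python
from langchain.text_splitter import TokenTextSplitter

# Chunk the references section by token count and keep only the
# first ~1,000-token chunk.
splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=0)
refs_chunk = splitter.split_text(refs)[0]
```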
And then from that chunk, we extract the references. Running it on the first paper, we get all of these, and I believe at least most of them are quite accurate. Then we try some other papers. So, based on this, let's say we've got this MathQA paper at the top of the list. What we do next is actually perform a Google search for that paper: we literally just take its title, "MathQA: Towards Interpretable...", and then, based on the results from Google, we look for one that contains the arXiv paper location, anything in the various Google results that looks like an arXiv paper link. We go through that and extract the ID, and that is what we get here.
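The function is called get_paper_id in the code, but this body is my guess at its logic, and the googlesearch-python package is an assumption; the video doesn't specify which search client it uses.

```python
import re
from googlesearch import search  # the googlesearch-python package

def get_paper_id(title):
    """Google a paper title and pull an arXiv ID out of the first
    result that links to an arXiv abstract or PDF page."""
    for url in search(title, num_results=10):
        match = re.search(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", url)
        if match:
            return match.group(1)
    return None  # no arXiv link found; this reference gets skipped

paper_id = get_paper_id("MathQA: Towards Interpretable Math Word Problem Solving")
```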
Then we use that ID to fetch another paper. We can click on this to see what we got: "Language Models are Few-Shot Learners". Coming a little further down, this is where we download the actual paper. After that, I just put everything into a set of functions. It still needs to go through all the same steps, but now we just set paper equal to this item here, load the paper, and pass save=True to save it to file. Load actually does two things: it looks on your local drive to see if you already have the paper downloaded, and if not, it goes to arXiv and requests the PDF.
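A hypothetical reconstruction of that loader; the real method lives on the video's arXiv object, so the names and paths here are illustrative only:

```python
import os
import requests

def load(paper_id, save=True):
    """Return the raw PDF for an arXiv ID: prefer a local copy,
    otherwise fetch it from the export mirror (and optionally save)."""
    local_path = f"papers/{paper_id}.pdf"
    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            return f.read()
    res = requests.get(f"https://export.arxiv.org/pdf/{paper_id}")
    if save:
        os.makedirs("papers", exist_ok=True)
        with open(local_path, "wb") as f:
            f.write(res.content)
    return res.content
```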
And you can see I've done that for this one. So I've loaded the paper, and then we have this other function called get_meta, which basically gets all the relevant information for that particular paper. Let me open the code for that. Coming down to here, this is a more recent version of the code for the arXiv object. get_meta is here; actually, I think the work happens in load, which calls self.download_meta. If you go to that function, you get to here: it's calling the arXiv API again. You've got your paper ID, you pass it in, and it returns all the relevant information for that paper: the authors, categories, comment, everything. So we're just calling the arXiv API to get that metadata, and we end up with all of it.
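A sketch of that single-paper lookup with the arxiv package, again assuming that client; the real call sits inside download_meta:

```python
import arxiv

# Look up a single paper's metadata by ID.
result = next(arxiv.Search(id_list=["2005.14165"]).results())
meta = {
    "title": result.title,
    "authors": [a.name for a in result.authors],
    "categories": result.categories,
    "summary": result.summary,
    "published": result.published,
}
```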
That's cool. After that, we also have the content of the paper, which we got the same way as I just showed you: download the PDF and extract the text with PyPDF2. Then we get our references, using the large language model like before. The code is slightly different at this point, but it's basically doing the same thing, and we get all of the references in this format. They actually get attached to the paper object, so we can see all the children of that paper, so to speak. Strictly, the references would be more like the parents of the paper, but in this arrangement we have the paper and then all of its references, which act as children within the tree, or knowledge graph.
So after that (I'm not sure why I repeated the step here), we just save that information. Then we go through and use this get_paper_id function, the same thing I mentioned before, where we do a Google search to get the paper's arXiv ID. You can see we have a couple of Nones here; those are the cases where I couldn't find anything. I don't worry so much about that: if you can't find a paper immediately on Google, it's probably not so important, so we just ignore those and skip them. But for these papers here, we do exactly the same as we did before, when we came up here and searched for the first paper, and we can do that for all of these other papers as well. So we basically just create this big graph of all these different papers.
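End to end, the loop might look something like this hypothetical sketch, reusing the load and get_paper_id helpers sketched above plus a stand-in extract_references for the LLM step; the video's actual code organizes this differently.

```python
def build_graph(paper_id, depth, graph=None):
    """Crawl outwards from one paper: extract its references with the
    LLM, resolve each title to an arXiv ID, and recurse so references
    become children in the graph."""
    if graph is None:
        graph = {}
    if depth == 0 or paper_id in graph:
        return graph
    pdf = load(paper_id)               # local copy or export mirror
    refs = extract_references(pdf)     # the LLMChain step from earlier
    children = [get_paper_id(r["title"]) for r in refs]
    graph[paper_id] = [c for c in children if c is not None]  # drop the Nones
    for child in graph[paper_id]:
        build_graph(child, depth - 1, graph)
    return graph
```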
And right now, that's formalized to some degree. Not the best code in the world, but formalized to some degree with this here: we have the arXiv object, plus a couple of functions up top that just help us handle everything, like getting the paper ID, initializing our extractors for extracting references, and so on. All of this code, which there will be a link to in the description, is within the constructors file at the moment. I need to rename everything; it's kind of just thrown together right now. Based on it, we just process everything: everything you just saw is contained here, along with a few other things, because I've added a couple of things since then.
Okay, so that's it for this one; we'll leave it there for now. I hope this hasn't been too messy and that it's at least somewhat interesting to see how we're handling this data preprocessing step. I suppose the most interesting bit here is extracting those references and using a search to find the other papers that have been referenced within our current paper. At least, that was the most interesting part for me. So, that being said, we'll leave it there. Thank you very much for watching. I hope this has been useful in some way, and I will see you again in the next one. Bye.