Identify Stocks on Reddit with SpaCy (NER in Python)
Chapters
0:00 Introduction
1:18 Importing SpaCy
2:20 SpaCy Models
3:59 Load Model
4:50 Document Object
8:45 Organization List
10:30 Importing Data
12:20 Getting Entities
14:13 Filtering Entities
17:50 Counter Object
Hi, welcome to this video on named entity recognition with spaCy. What we are aiming to do is extract all the organizations mentioned in a piece of text. In our case, with this example, we are wanting to extract ARK. I'm also going to show you how we can visualize that process as well, using displaCy, which is a visualization package that comes with spaCy. Then I'm going to show you how we programmatically extract those entities. And once we have done that process with our single example, obviously we are going to want to scale it up to a full data set of Reddit posts: we will take all of those, pull out all the entities that are being mentioned, prune out a few that we don't really want, and then get the most frequently mentioned organizations.
I'm just going to use this example to show you how spaCy works. If you don't already have spaCy, it's very easy to install with pip. Alongside spaCy itself, we'll import displaCy, which is the visualization package that comes bundled with it.
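As a minimal sketch, the install and imports look like this:

```python
# Install spaCy first (run once from the command line):
#   pip install spacy
import spacy
from spacy import displacy  # displaCy ships inside the spacy package
```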
So spaCy comes with a lot of options in terms of models. If we come down the spaCy website, we can see the model we'll be using, and I just want to quickly cover what we are actually looking at. Below that, we have the naming conventions: a model name consists of the language, the model type, the genre, and the size. The language for us is, of course, English, which you can see here as 'en'. The type here is 'core', which is a general-purpose model that includes vocabulary, syntax, and entities; we're interested in the entities for our NER task. The genre, 'web', is the type of data that the pipeline, or the model, was trained on; the two examples I give here are web and news. And the size we're using is small, 'sm'.
To download that, we go back into our command line. I would type `python -m spacy` to access the spaCy module, then `download`, and then your model name, which in our case is `en_core_web_sm`, the English core web small model. I'm not going to enter this, because I have already downloaded it. Then, to load the model in Python, we call `spacy.load`, and again we just enter the model name.
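Put together, the download and load steps look something like this:

```python
# Download the pipeline once from the command line:
#   python -m spacy download en_core_web_sm
import spacy

# Load the small English web model into an nlp object
nlp = spacy.load("en_core_web_sm")
```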
Now, to actually process some text, it's super easy: we just call the model on our string, and it gives us back a document object. If we print it out, it just kind of looks like the original text. So it's a bit odd; it looks like nothing has actually happened. But this document object carries all these different methods and attributes, including the entities that spaCy has identified within the text, which we access through `doc.ents`. And we can see here, although it doesn't say what type these labels are, we have ARK, ARK, ARK, ETF.
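A rough sketch of that step, using a made-up Reddit-style sentence in place of the exact text from the video:

```python
# Hypothetical input text; the video uses a real r/investing post
text = ("The Bear Cave has something interesting to say about the ARK ETF. "
        "ARK seems down, and ARK is slipping further.")

doc = nlp(text)   # process the text into a Doc object
print(doc.ents)   # tuple of entity spans; exact output depends on the model
```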
And I'm going to visualize what is actually happening here with displaCy. We see that ARK and ETF are identified as organizations.
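The entity visualizer is one line; in a Jupyter notebook it renders the highlighted text inline:

```python
# Highlight the entities spaCy found, with their labels
displacy.render(doc, style="ent")
```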
Nonetheless, it is identifying ARK correctly three times, and it also labels The Bear Cave as a work of art. To me, a work of art sounds like a Picasso painting or a statue; I really had no idea what it meant by work of art. So what we can do to get a small description of each label, if we don't know what it means, is call `spacy.explain` with the label name.
And then we can see, OK, it's titles of books, songs, and so on.
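For example:

```python
# Get a short, human-readable description of an entity label
print(spacy.explain("WORK_OF_ART"))
# -> Titles of books, songs, etc.
```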
So this Bear Cave thing here is actually an article. It's not quite a book, but it is something that someone has written, just like a book or a song, so it fits in with that category, in my opinion.
But we also want to process this in a more efficient, programmatic way. So this is where we go back to our `doc.ents` and work through each one of those entities in a for loop. If you assume the entries are plain text strings or something along those lines, they're not; they're entity objects. So let me just show you how we deal with that: we print each one out by accessing the entity object's `label_` attribute, which gives us the label, so ORG or WORK_OF_ART, alongside its text. And then we see here, OK, that's pretty cool, because now we've got the organizations and the work of art together with their labels.
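In code, the loop looks like this:

```python
# Each item in doc.ents is a Span object, not a plain string
for entity in doc.ents:
    print(f"{entity.label_}: {entity.text}")
# e.g. ORG: ARK
#      WORK_OF_ART: The Bear Cave
```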
And that is actually all we need to start extracting the data we want. So if we just come down here and take this loop, we're going to modify it a little bit so that it extracts only the organizations. Here we add some logic which says: if the label is equal to ORG, append the entity's text to an organization list. And let's just view our org list at the bottom here. And here, we get our list of all the organizations.
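A minimal version of that modified loop:

```python
# Collect only the entities labelled as organizations
org_list = []
for entity in doc.ents:
    if entity.label_ == "ORG":
        org_list.append(entity.text)

print(org_list)  # e.g. ['ARK', 'ARK', 'ARK', 'ETF']
```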
So it's excluded The Bear Cave, because The Bear Cave is not an organization. But ideally, from my perspective on what we want here, we don't need to have ARK popping up three times. At this stage we don't care about how frequently organizations have been mentioned within a single post, so we'll remove the duplicates by passing the list through a set. And then let's just see what that looks like. And that's exactly where I wanted this to be.
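The usual trick for this is a round trip through a set:

```python
# Drop duplicate mentions within a single piece of text
org_list = list(set(org_list))
print(org_list)  # e.g. ['ARK', 'ETF'] (a set does not preserve order)
```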
So we've applied this to a single piece of text, but we want to apply it to a full data frame. So the first thing we need to do is actually import the data. This is the data that we're going to be using: we're pulling it from the investing subreddit with the Reddit API, and I will leave a link to that in the description. Otherwise, you can also just get this data directly as a file, if you don't want to go through the whole Reddit API thing, and I will leave a link to that in the description as well. And now we just need to read in our data with `pd.read_csv`; the separator we're using here is the pipe delimiter. So let's just make sure we've read that in correctly.
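A sketch of the loading step; the filename here is a placeholder for whatever you saved the data as:

```python
import pandas as pd

# Pipe-delimited dump of r/investing posts (hypothetical filename)
df = pd.read_csv("reddit_investing.csv", sep="|")
df.head()  # sanity-check the first few rows
```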
Next, we'll take the extraction logic we wrote and turn it into a function that we can then apply to our data frame. So we're just going to convert this into a function, and here we'll pass in a single string. Inside it, we need to create our document object; we've already defined the model up here as `nlp`. But we also do want to remove any duplicates, just like before. Then we'll apply the function to the text column and store the results in a new column called organizations.
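A compact version of that function; note that 'selftext' is an assumed column name for the post body, not confirmed in the video:

```python
def get_orgs(text: str) -> list:
    """Return the unique organizations mentioned in a piece of text."""
    doc = nlp(text)
    orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
    return list(set(orgs))  # drop duplicate mentions

# 'selftext' is an assumed column name for the post body
df["organizations"] = df["selftext"].apply(get_orgs)
```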
This takes a little while, because we're processing a lot of entries here. Obviously, if you're doing this for a larger data set, you're probably going to want to batch this a little bit: so keeping it on file somewhere, reading maybe up to 1,000 samples at once, applying this, and then saving it back to file before moving on to the next batch.
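One way to sketch that batching, again with placeholder filenames; this is my own construction of the approach described above, not code from the video:

```python
# Process the file in chunks of 1,000 rows and write each batch back out
reader = pd.read_csv("reddit_investing.csv", sep="|", chunksize=1000)
for i, chunk in enumerate(reader):
    chunk["organizations"] = chunk["selftext"].apply(get_orgs)
    chunk.to_csv(f"processed_{i:03d}.csv", sep="|", index=False)
```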
Looking at the results, we also have entries like S&P 500 appearing in several variations; these here, for example, we really don't want those. Now, we don't necessarily need to do this for every one of these items that we don't want to include, since some of them only appear three or four times within our data set. But in other cases, like SEC, the term will appear quite a lot, so I'm just going to put that into a blacklist as well.
And then there's a few others that I've noticed: we get the FDA, Treasury, the Fed appears all the time, and CNBC. I think that's probably a fair few of the ones we want to exclude. And you'll see here, everything in the blacklist is in lowercase. This means we don't have to type out Fed in every capitalization: in the filter we check that the entity text, lowercased, is not in the blacklist, and that will just exclude any entities that are included in there, whatever their case. So we just rerun that, and we'll rerun the apply as well.
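Folding the blacklist into the function might look like this; the terms shown are the ones called out in the video, and the S&P 500 variants would be added in the same way:

```python
# Blacklisted terms, all lowercase so matching is case-insensitive
BLACKLIST = {"sec", "fda", "treasury", "fed", "cnbc"}

def get_orgs(text: str) -> list:
    doc = nlp(text)
    orgs = [
        ent.text for ent in doc.ents
        if ent.label_ == "ORG" and ent.text.lower() not in BLACKLIST
    ]
    return list(set(orgs))

# Rebuild the column with the filtered function
df["organizations"] = df["selftext"].apply(get_orgs)
```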
So what I want to create is essentially a frequency table: we want to have each one of these companies, and we want to see how often, or how frequently, they appear. For that we can use Python's Counter object from the collections module. We can pass it a list, for example, and it will go through, count the occurrences, and then organize them into the Counter object, which gives us a few useful methods for actually viewing that data, for example viewing the most common values.
At the moment, though, we have a column in the data frame, a series of lists, and we need to transform that before it goes into a Counter object. Pulled straight out, it would give us a list of lists, and we don't want that for our Counter object. We actually just want a plain, straight, flat list.
So we need to add another step to the process, and here we're just going to use a list comprehension to loop through the column and flatten it. What I mean by that is: `org` here is like a single item, and that will be our item that makes up this new list; then we need to iterate through each one of those sub-lists, and at the end here, we're just saying go through each org in each sub-list. The nested syntax can look strange at first, but it's just something that you get used to if you're not familiar with it. So then let's view the first five entries in that.
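The flattening comprehension in full:

```python
# Flatten the column of per-post lists into one plain list of names
orgs = [org for sublist in df["organizations"] for org in sublist]
orgs[:5]  # view the first five entries
```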
So now we can pass this into our Counter object, and then we can view the most frequent of those with the `most_common` method. Here we just pass the number of the most common items we'd like to see: so if we'd like to see the top 10, we just pass in 10. And then here we can see that we have the most frequently mentioned organizations from the investing subreddit data.
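That final step:

```python
from collections import Counter

org_freq = Counter(orgs)  # count every organization mention
org_freq.most_common(10)  # the ten most frequently mentioned
```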
There are a few items in there that we could probably still prune out, but this shows how easy it is to apply named entity recognition to a data set and actually extract what the text within that data set is talking about. When you combine that with something like sentiment analysis, it can get pretty cool. But for this video, I'm just going to leave it with NER.