Identify Stocks on Reddit with SpaCy (NER in Python)


0:00 Introduction
1:18 Importing SpaCy
2:20 SpaCy Models
3:59 Load Model
4:50 Document Object
8:45 Organization List
10:30 Importing Data
12:20 Getting Entities
14:13 Filtering Entities
17:50 Counter Object

00:00:00.000 | Hi, welcome to this video on named entity recognition
00:00:03.240 | with spaCy.
00:00:04.320 | So I'm going to take you through how
00:00:06.360 | we can use spaCy to perform named entity
00:00:08.880 | recognition, or NER.
00:00:10.200 | And first, what we're going to do
00:00:11.920 | is take this text that you can see here.
00:00:13.960 | And what we are aiming to do is extract all the organizations
00:00:17.720 | mentioned in this text.
00:00:19.240 | So in our case, with this, we are wanting to extract ARK.
00:00:23.160 | So we're going to look at how we do that.
00:00:25.120 | I'm also going to show you how we can visualize that process
00:00:27.580 | as well, using displacy, which is a visualization package
00:00:31.200 | embedded within spaCy.
00:00:32.520 | It's super cool.
00:00:33.680 | Now I'm going to show you how we programmatically extract
00:00:36.220 | entities.
00:00:36.920 | So obviously, visualization is great.
00:00:39.360 | But we do want to just pull those out
00:00:41.760 | in a more programmatic fashion.
00:00:45.480 | So we're going to do that.
00:00:46.640 | And then once we have done that process with our single example,
00:00:51.200 | obviously, we are going to want to scale that
00:00:53.280 | to a ton of examples.
00:00:56.200 | So what I have is a sample of, I think,
00:00:58.920 | it's 900 posts from the investing subreddit.
00:01:03.100 | And we're going to build out a process that
00:01:05.000 | will take all of those, pull out all the entities that
00:01:07.400 | are being mentioned, prune out a few that we don't really want,
00:01:10.640 | and then give us the most frequently mentioned samples
00:01:13.960 | or organizations within that data set.
00:01:16.880 | So let's just jump straight into it.
00:01:18.920 | We have our text data here.
00:01:22.200 | And this is just a single extract
00:01:24.080 | from the investing subreddit.
00:01:27.360 | And I'm just going to use this to show you how spaCy works
00:01:31.800 | and how we can do named entity recognition
00:01:34.880 | on this single piece of text.
00:01:37.160 | We want to start by first importing spaCy.
00:01:40.480 | And if you don't already have spaCy, it's very easy to install.
00:01:46.120 | You just type pip install spacy and hit Enter.
00:01:51.960 | And that will install the module for you.
00:01:54.880 | And we're going to be using both spaCy.
00:01:56.840 | And I'm also going to show you something
00:01:58.680 | called displaCy, which is like a visualization package that
00:02:02.600 | comes with spaCy.
00:02:05.000 | So that is from spacy import displacy.
00:02:15.200 | And once we've imported those, we also
00:02:17.320 | want to load in our model.
00:02:20.000 | So spaCy comes with a lot of options in terms of models.
00:02:26.120 | And we can see those on spacy.io/models.
00:02:30.880 | If we come down here, we see this is the model
00:02:33.960 | that we will actually be using.
00:02:36.400 | I just want to quickly cover what we are actually
00:02:38.680 | looking at here.
00:02:40.320 | So come down here, we have the naming conventions.
00:02:44.360 | And we see that the name is built
00:02:45.880 | with the language and the name.
00:02:47.160 | The language for us, of course, English, which you can see here.
00:02:49.920 | And this last part is the name.
00:02:52.920 | Now, the name consists of the model type, genre, and size.
00:02:59.000 | The type here is core, which is a general purpose
00:03:02.920 | model, which includes vocabulary, syntax, entities,
00:03:06.140 | and word vectors.
00:03:07.480 | We're interested in using entities for the NER tasks
00:03:11.480 | that we are looking at.
00:03:13.880 | Web is the type of data that the pipeline or the model
00:03:17.400 | has been trained on.
00:03:18.440 | So the two examples I give here are web or news.
00:03:21.680 | Web includes stuff like blogs.
00:03:23.320 | And Reddit fits pretty well with that.
00:03:25.680 | So we're going to use web.
00:03:26.880 | And then we just have the model size here.
00:03:29.400 | We're just going to go with small.
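The naming convention described here can be illustrated by splitting a model name into its parts. This is only a sketch: parse_model_name is a hypothetical helper written for this example (it is not part of spaCy), and it assumes the name has exactly four underscore-separated components.

```python
def parse_model_name(name):
    """Split a model name of the form lang_type_genre_size into its parts."""
    # Assumption: exactly four underscore-separated components.
    lang, model_type, genre, size = name.split("_")
    return {"lang": lang, "type": model_type, "genre": genre, "size": size}


parts = parse_model_name("en_core_web_sm")
print(parts)
```

Here "en" is the language, "core" the type, "web" the genre of the training data, and "sm" the size.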
00:03:31.320 | To download that, we go back into our command line
00:03:36.120 | interface.
00:03:36.720 | And we type this.
00:03:38.440 | I would type python -m spacy to access the spaCy module.
00:03:43.720 | And then download, and then your model name here,
00:03:46.520 | which in our case is en_core_web_sm, so the English core web small model.
00:03:52.520 | I'm not going to enter this, because I have already
00:03:55.000 | downloaded it.
00:03:55.840 | But that is what you need to do.
00:03:57.600 | And once that is downloaded, we can then
00:04:00.480 | load it into our notebook.
00:04:03.720 | So we'll load it into this nlp variable.
00:04:07.520 | And we'll do spacy.load.
00:04:10.000 | And then again, we just enter the model name.
00:04:12.840 | There we go.
00:04:17.960 | That is our model.
00:04:19.600 | Now to actually process this data, it's super easy.
00:04:23.880 | We will assign it into this doc variable.
00:04:28.520 | And we just take nlp and pass in the text.
00:04:33.760 | Print that out.
00:04:35.120 | And we can see we have this, which just kind of looks
00:04:38.600 | like the text that we passed in there.
00:04:41.400 | So it's a bit-- it looks like nothing has actually happened.
00:04:44.920 | But that is not the case.
00:04:46.840 | This is actually a spaCy document object.
00:04:51.360 | So if we click Help Doc here, you see, OK,
00:04:54.760 | we have a doc object.
00:04:55.800 | And then we've got all these different methods, attributes,
00:04:58.300 | everything in there.
00:04:59.680 | So it has worked.
00:05:01.960 | And that's good, because we can then
00:05:05.880 | use doc.ents to access the entities
00:05:10.520 | that spaCy has identified within this document object
00:05:14.680 | or within this text.
00:05:16.360 | And we can see here, although it doesn't say what type
00:05:19.880 | these labels are, we have ARK, ARK, ARK, ETF.
00:05:22.600 | We have this Bear Cave.
00:05:24.320 | This doesn't tell us that much information,
00:05:26.160 | but the information is here.
00:05:29.000 | So I want to quickly show you displaCy,
00:05:32.120 | because it's pretty cool.
00:05:33.760 | And I'm going to visualize what is actually happening here.
00:05:37.040 | So let's do displacy.render.
00:05:42.640 | We pass in our document object and the style
00:05:46.920 | for the visualization.
00:05:48.080 | There's a few different styles.
00:05:50.080 | We are going to be using the entity style.
00:05:53.020 | And this is pretty cool.
00:05:54.240 | It shows us the text.
00:05:55.600 | And then we have these labels on top.
00:05:58.640 | We see that ARK and ETF are identified as organizations.
00:06:02.560 | ETF, we don't really want that in there.
00:06:04.560 | An ETF is an exchange-traded fund.
00:06:07.160 | And it's not really what we're looking for
00:06:10.000 | in terms of the organizations.
00:06:12.120 | Nonetheless, it is identifying ARK correctly three times,
00:06:16.240 | which is pretty good.
00:06:17.800 | Now, work of art, when I first saw this,
00:06:20.040 | I had no idea what that meant.
00:06:21.320 | To me, that seems like it's a Picasso painting or a statue
00:06:25.720 | from Michelangelo.
00:06:26.920 | I really had no idea what it meant by work of art.
00:06:31.280 | So what we can do to get a small description of each label,
00:06:36.200 | if we don't know what it means, is we just type spacy.explain,
00:06:40.720 | and we're going to pass in WORK_OF_ART.
00:06:46.160 | And then we can see, OK, it's titles of books, songs,
00:06:48.680 | et cetera.
00:06:49.160 | So that makes a lot more sense than what
00:06:50.680 | I was initially thinking.
00:06:51.800 | And it also fits quite well to what this is.
00:06:54.360 | So this Bear Cave thing here is actually an article.
00:06:58.520 | And it's not quite a book, but it is something
00:07:01.800 | that someone has written, just like a book or a song.
00:07:04.400 | So it fits in with that category, in my opinion.
00:07:08.200 | So that's great.
00:07:09.120 | We can visualize these entities in the text.
00:07:11.760 | But we also want to process this in a more efficient way.
00:07:15.160 | We can't just visualize it.
00:07:17.000 | So this is where we go back to our doc.ents here.
00:07:22.240 | And what we want to do is actually
00:07:24.440 | work through each one of these in a for loop.
00:07:26.840 | And although these look just like they
00:07:29.760 | are text or something along those lines, they're not.
00:07:32.960 | They're actually entity objects.
00:07:36.280 | So let me just show you how we deal with that.
00:07:38.880 | So we go for entity in doc.ents.
00:07:45.000 | So we can print out the label, which
00:07:49.240 | is this ORG or WORK_OF_ART.
00:07:54.200 | And we print that out by accessing the entity object
00:08:00.760 | and going into the label_ attribute.
00:08:04.240 | And just notice that there's an underscore
00:08:06.440 | at the end of that attribute name.
00:08:08.480 | So just remember that.
00:08:10.320 | And that will give us the label, so ORG or WORK_OF_ART.
00:08:14.600 | And then we can also find the entity text.
00:08:18.760 | So we just go entity.
00:08:20.360 | And then we can type in text there as well.
00:08:23.720 | And then we see here, OK, that's pretty cool,
00:08:25.840 | because now we've got the organization, work of art.
00:08:28.720 | And then we have what it is talking about,
00:08:31.160 | which part of the text it's actually
00:08:32.680 | extracting out for us there.
00:08:35.280 | So that is really cool and really useful.
00:08:38.400 | And that is actually all we need to start extracting the data
00:08:41.880 | out and processing it.
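The loop just described can be sketched as below. To keep the sketch runnable without spaCy installed, it uses a stand-in object that mimics only the three attributes used here (ents, label_, text); the entity values are illustrative. With spaCy, doc would instead come from nlp(txt) after loading en_core_web_sm.

```python
from types import SimpleNamespace


def list_entities(doc):
    """Return (label, text) pairs for every entity in a Doc-like object."""
    # label_ (with the trailing underscore) is the string label, e.g. "ORG".
    return [(entity.label_, entity.text) for entity in doc.ents]


# Stand-in for a real spaCy Doc (assumption: only .ents, .label_, .text are needed).
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(label_="ORG", text="ARK"),
    SimpleNamespace(label_="WORK_OF_ART", text="The Bear Cave"),
])

for label, text in list_entities(fake_doc):
    print(f"{label} | {text}")
```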
00:08:43.640 | So if we just come down here and take this loop,
00:08:46.440 | and we're just going to modify it a little bit.
00:08:48.560 | And we're going to extract the organizations from this list.
00:08:54.200 | So we're going to initialize an org_list.
00:08:58.560 | And then here, we're going to add some logic, which says,
00:09:01.800 | OK, if this is an ORG, an organization label,
00:09:06.080 | we want to add that to our org_list.
00:09:09.480 | So to do that, we say, OK, if label is equal to ORG,
00:09:21.440 | org_list.append(entity.text).
00:09:27.960 | And let's just view our org_list at the bottom here.
00:09:33.840 | OK, so here, it's entity, not label.
00:09:37.680 | And here, we get our list of all the organizations.
00:09:43.440 | So it's excluded The Bear Cave, because The Bear Cave is not
00:09:46.560 | an ORG, it's a WORK_OF_ART.
00:09:48.960 | So that's pretty cool.
00:09:50.200 | But ideally, from my perspective on what we want here,
00:09:54.400 | is we don't need to have ARK popping up three times.
00:09:58.320 | We just want to say, OK, what organizations
00:10:01.200 | have been mentioned?
00:10:02.080 | We don't care about how frequently they've been
00:10:04.040 | mentioned in a specific item.
00:10:07.800 | So to do that, we just convert this
00:10:09.400 | to a set, which will remove any duplicates.
00:10:11.800 | And then we convert it back into a list.
00:10:15.600 | So org_list = list(set(org_list)).
00:10:19.840 | And then let's just see what that looks like.
00:10:25.160 | So we now just have ETF and ARK.
00:10:26.920 | And that's exactly where I wanted this to be.
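As a minimal sketch of the filter-then-deduplicate step, the snippet below works on plain (label, text) pairs rather than real spaCy entities, so it runs on its own. The pairs are illustrative values, not actual model output.

```python
# Entities as (label, text) pairs; illustrative stand-ins for doc.ents.
entities = [
    ("ORG", "ARK"), ("ORG", "ARK"), ("ORG", "ETF"), ("ORG", "ARK"),
    ("WORK_OF_ART", "The Bear Cave"),
]

org_list = []
for label, text in entities:
    if label == "ORG":
        org_list.append(text)

# A set drops the duplicates; converting back gives a list again.
# Note that a set does not keep the original order of the items.
org_list = list(set(org_list))
print(sorted(org_list))
```

If the order of first appearance matters, list(dict.fromkeys(org_list)) is an alternative that removes duplicates while preserving order.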
00:10:31.440 | So we've applied this to a single piece of text.
00:10:36.840 | But we want to apply this to a full data frame.
00:10:40.960 | So first thing we need to do is actually import the data.
00:10:44.240 | So I pulled this from Reddit.
00:10:47.480 | So this is the data that we're going to be using.
00:10:51.240 | So we're pulling this from the investing subreddit.
00:10:54.640 | And we're using the Reddit API to do that.
00:10:58.360 | Now, if you haven't used Reddit API before,
00:11:01.080 | I do have a video on that.
00:11:03.480 | So I will leave a link to that in the description.
00:11:06.840 | Otherwise, you can also just get this data directly
00:11:10.560 | if you don't want to go through the whole Reddit API thing.
00:11:14.160 | And I will leave a link to that in the description as well.
00:11:19.120 | So just separate this.
00:11:22.120 | And now we just want to import pandas.
00:11:29.600 | And now we just need to read in our data with pd.read_csv.
00:11:37.400 | And this is in the data directory for me
00:11:40.600 | and is redditinvesting.csv.
00:11:47.240 | And the separator we're using here is the pipe delimiter.
00:11:51.160 | So let's just make sure we've read that in correctly.
00:11:54.360 | And there we go.
00:11:57.760 | So we have our data.
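The video reads the file with pandas and a pipe separator. As a rough stand-in that needs only the standard library, the sketch below parses pipe-delimited text with the csv module instead; the sample rows and the id column are made up for illustration, and the real file's columns may differ.

```python
import csv
import io

# Illustrative pipe-delimited data (assumption: not the real dataset).
raw = "id|selftext\n1|ARK is buying more shares\n2|Thoughts on the Fed?\n"

# delimiter="|" plays the same role as the pipe separator passed to pandas.
rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))
print(rows[0]["selftext"])
```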
00:12:00.480 | And the thing that we really focus on here
00:12:03.440 | is this selftext column.
00:12:07.400 | So in here, we just have 836 posts.
00:12:14.120 | And we'll just apply our NER to all of those
00:12:17.840 | and just see what people are talking about.
00:12:20.760 | So we need to convert what we did up here
00:12:23.560 | into a function that we can then apply to our data frame.
00:12:29.600 | So let's take that.
00:12:31.440 | And we're just going to convert this into a function.
00:12:37.440 | So we'll call it get_entities.
00:12:42.880 | And then here, we'll pass in a single string.
00:12:46.840 | We'll add that in there.
00:12:48.920 | And we'll say here, we need to create our document object.
00:12:54.560 | So nlp(text).
00:12:58.200 | We've already defined the model up here as nlp.
00:13:01.680 | It's this variable.
00:13:02.960 | Initialize our entity or organization list.
00:13:05.960 | And then work through each one of those
00:13:08.840 | and append it to our list.
00:13:12.200 | And then we just want to return that list.
00:13:16.400 | But we also do want to remove any duplicates.
00:13:18.680 | So we'll just return the list(set(...)) version.
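A hedged sketch of the function being built here follows. In the video the function takes only the text and uses the nlp variable defined earlier; passing nlp in as a second argument is an assumption made for this example so that it can run with a small stand-in pipeline (fake_nlp, hypothetical) instead of a downloaded spaCy model.

```python
from types import SimpleNamespace


def get_entities(text, nlp):
    """Return the unique ORG entities that the pipeline nlp finds in text."""
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        if entity.label_ == "ORG":
            org_list.append(entity.text)
    # Remove duplicates within a single post.
    return list(set(org_list))


def fake_nlp(text):
    """Stand-in pipeline that tags every 'ARK' token as ORG (illustration only)."""
    ents = [SimpleNamespace(label_="ORG", text=tok)
            for tok in text.split() if tok == "ARK"]
    return SimpleNamespace(ents=ents)


print(get_entities("ARK bought more and ARK sold some", fake_nlp))
```

With a real pipeline, the same function would be called with the object returned by spacy.load in place of fake_nlp.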
00:13:23.640 | Now we're going to run that.
00:13:25.520 | And let's just apply that to our data frame.
00:13:28.400 | So we'll create a new column called Organizations.
00:13:32.200 | And we will just take the selftext column
00:13:39.320 | and apply our get_entities function to it.
00:13:47.040 | Let's just see what we get.
00:13:48.280 | So this will take a little bit of time
00:13:53.400 | because we're processing a lot of entries here.
00:13:56.080 | Obviously, if you're doing this for a larger data set,
00:13:58.360 | you're probably going to want to batch this a little bit.
00:14:01.120 | So keeping it on file somewhere, reading maybe up to 1,000
00:14:06.680 | samples at once, applying this, and then saving it back to file
00:14:10.040 | and just working through like that.
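The batching idea mentioned here could look something like the sketch below. It only shows the chunking itself; batches is a hypothetical helper, the posts are placeholder strings, and reading from or saving to file is left out.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder posts (assumption: stands in for rows read from a file).
posts = [f"post {i}" for i in range(2500)]
sizes = [len(batch) for batch in batches(posts, 1000)]
print(sizes)
```

Each batch would be processed and written out before the next one is read, so only around 1,000 posts are held in memory at a time.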
00:14:13.480 | So for us, we can see straight away
00:14:15.840 | we have some things that we probably
00:14:17.440 | don't really want in there.
00:14:18.640 | So I'm not sure what these are.
00:14:20.960 | And then we also have this S&P 500P, S&P 500,
00:14:26.880 | loads of things.
00:14:28.920 | It's not really what we want in there
00:14:30.720 | because we just want actual company names.
00:14:34.200 | So what we can do is create a blacklist.
00:14:39.960 | What I mean by a blacklist is we just
00:14:43.440 | create a list full of anything that we
00:14:46.880 | don't want to be included.
00:14:48.880 | For example, these here, we really don't want those.
00:14:53.680 | Now, we don't necessarily need to do this as well
00:14:56.440 | for everything because what we will find
00:14:59.320 | with a lot of these items that we don't really
00:15:02.160 | want to include in there-- in fact, actually, I think I'll
00:15:04.600 | probably keep these two in as an example.
00:15:07.560 | What we will find with a lot of these
00:15:09.600 | is that they only appear maybe once or twice
00:15:13.040 | in the whole data set.
00:15:14.880 | So we can actually filter those out
00:15:16.440 | by only searching for organizations
00:15:19.720 | that appear at least three or four times within our data set.
00:15:23.360 | And that just filters out all the rubbish
00:15:25.680 | that we get with these ones.
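One possible way to apply a minimum-mention threshold is sketched below, using Counter from the standard collections module. The mention list and the MIN_MENTIONS name are illustrative assumptions, not values from the dataset.

```python
from collections import Counter

# Illustrative mentions (assumption: not taken from the real data).
mentions = ["Tesla", "Tesla", "Tesla", "Apple", "Apple", "Apple", "Apple", "Xyz Corp"]
counts = Counter(mentions)

# Keep only organizations mentioned at least MIN_MENTIONS times.
MIN_MENTIONS = 3
frequent = {org: n for org, n in counts.items() if n >= MIN_MENTIONS}
print(frequent)
```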
00:15:27.640 | But in other cases, like SEC, that will appear quite a lot.
00:15:32.120 | And we don't necessarily want to be finding
00:15:34.720 | where it comes up with the SEC.
00:15:36.720 | And in some cases, maybe you do want to.
00:15:39.000 | But in this case, I'm going to remove it.
00:15:42.520 | I'm going to remove the S&P 500 as well.
00:15:48.120 | And maybe leave it like that.
00:15:50.760 | I'm not sure where to--
00:15:51.920 | I assume Lemonade isn't a company.
00:15:53.680 | So I'm just going to put that in there as well.
00:15:57.560 | And then there's a few others that I've noticed
00:15:59.480 | before that come up quite a lot.
00:16:01.680 | We get the FDA, Treasury, Fed appears all the time, CNBC
00:16:13.640 | always appears, EU always appears.
00:16:18.360 | And I think that's probably a fair few, the ones
00:16:21.280 | that we don't want in there.
00:16:23.760 | So we'll include those.
00:16:25.360 | And to exclude those from our search,
00:16:30.480 | we just add another condition here.
00:16:32.520 | So we're going to add an AND condition.
00:16:34.240 | We say and entity.text.
00:16:36.800 | And you'll see here, everything is in lowercase.
00:16:42.120 | So we also apply a lower here.
00:16:44.360 | So this means we don't have to type out Fed in capital
00:16:48.200 | and Fed in lowercase.
00:16:50.680 | We do entity.text.lower() not in the blacklist.
00:16:55.560 | And that will just exclude any that are included in there.
00:16:58.000 | And we can just update that as we go along.
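The extra condition can be sketched as follows. The blacklist entries are the ones named in the video; keep_entity is a hypothetical helper, and the (label, text) pairs are illustrative stand-ins for real spaCy entities.

```python
# Everything in the blacklist is lowercase, so entity text is lowercased before the check.
BLACKLIST = ["sec", "s&p 500", "lemonade", "fda", "treasury", "fed", "cnbc", "eu"]


def keep_entity(label, text, blacklist=BLACKLIST):
    """True for ORG entities whose lowercased text is not in the blacklist."""
    return label == "ORG" and text.lower() not in blacklist


# Illustrative (label, text) pairs.
entities = [("ORG", "Fed"), ("ORG", "ARK"), ("ORG", "CNBC"),
            ("WORK_OF_ART", "The Bear Cave")]
kept = [text for label, text in entities if keep_entity(label, text)]
print(kept)
```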
00:17:00.720 | So just rerun that, and we'll rerun this as well.
00:17:04.640 | And we start writing out the next part
00:17:06.880 | of what we're doing here.
00:17:08.400 | So what I want to create is essentially a frequency table.
00:17:13.880 | So we want to have each one of these companies,
00:17:16.640 | and we want to see how often or how frequently they
00:17:19.760 | are mentioned.
00:17:21.360 | So to do that, we can use a Counter object
00:17:24.000 | from the collections library.
00:17:25.800 | So what we can do with that is we simply
00:17:29.040 | pass a list, for example, and it will go through and count
00:17:33.080 | all the instances of a specific value,
00:17:36.560 | and then organize them into the counter object, which
00:17:40.240 | gives us a few useful methods for actually viewing
00:17:43.040 | that data, for example, viewing the most common values
00:17:47.640 | in that data set.
00:17:50.320 | So that's pretty useful, and that's
00:17:52.120 | what we are going to be using.
00:17:55.080 | So to use that, we need to import it
00:17:57.920 | from the collections library.
00:18:01.440 | That's the Counter object.
00:18:03.280 | And like I said before, this needs a list.
00:18:05.880 | And at the moment, we have a column in the data frame.
00:18:08.680 | So it's not really the right format
00:18:10.120 | that we need to transform it into a counter object.
00:18:14.240 | Instead, what we need is just a simple flat list.
00:18:18.400 | So first thing we can do is take that column
00:18:22.320 | and convert it into a list.
00:18:24.880 | So we'll do organizations to list.
00:18:33.040 | You can see here, OK, we do have lists,
00:18:38.200 | but it's actually a list of lists.
00:18:39.680 | So we've got a list, and within that list,
00:18:41.520 | we have all these other lists.
00:18:43.120 | And we don't want that for our counter object.
00:18:47.440 | We actually just want a plain, straight list.
00:18:50.360 | So we need to add another step to the process, which
00:18:53.440 | is flattening that list.
00:18:55.400 | So we'll call it orgs_flat.
00:18:59.360 | And here, we're just going to use list comprehension to loop
00:19:02.200 | through each list within the list
00:19:04.360 | and pull out each item into our new list.
00:19:08.760 | What I mean by that is org here is like a single item
00:19:12.360 | within the sub list.
00:19:14.960 | So if I just view the first two here.
00:19:20.760 | So org is like the S&P 500P here.
00:19:25.400 | And then that will be our item that makes up this new list
00:19:29.360 | that we are making.
00:19:31.360 | And they will come from a sub list.
00:19:35.040 | And these sub lists are these lists here.
00:19:39.480 | And we need to iterate through each one of those sub lists.
00:19:42.880 | For each one that is within our orgs list,
00:19:47.160 | which is the full thing.
00:19:49.640 | And at the end here, we're just saying go through each org,
00:19:53.160 | so each item in the sub list, which
00:19:56.280 | is kind of a confusing syntax.
00:20:00.000 | But it works.
00:20:01.960 | And it's just something that you get used to if you're not
00:20:05.680 | already.
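The nested list comprehension described here can be sketched like this. The sublists are illustrative stand-ins for the per-post organization lists.

```python
# One sublist per post (illustrative values); some posts mention no organizations.
orgs = [["ARK", "ETF"], [], ["Tesla"], ["ARK", "Apple"]]

# Read left to right: take each sublist in orgs, then each org in that sublist.
orgs_flat = [org for sublist in orgs for org in sublist]
print(orgs_flat)
```

The order of the two for clauses matches how they would be written as nested for loops: the outer loop comes first.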
00:20:07.320 | So then let's view the first five entries in that.
00:20:12.480 | So it's orgs_flat.
00:20:14.080 | And there we go.
00:20:14.760 | We have a few copies.
00:20:18.080 | So now we can pass this into our Counter object.
00:20:22.800 | So we do Counter(orgs_flat) and assign that to a frequency variable.
00:20:32.520 | And then we can view the most frequent of those
00:20:36.880 | by using the most_common method.
00:20:40.640 | And then here we just pass the number of the most common items
00:20:44.360 | that we'd like to see.
00:20:45.320 | So if we'd like to see the top 10, we just pass in 10.
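A small sketch of the counting step, using made-up mentions rather than the real dataset; the variable name org_freq is an assumption.

```python
from collections import Counter

# Illustrative flat list of mentions.
orgs_flat = ["ARK", "Tesla", "ARK", "Apple", "ARK", "Tesla"]

org_freq = Counter(orgs_flat)

# most_common(n) returns the n most frequent items as (value, count) pairs.
top_two = org_freq.most_common(2)
print(top_two)
```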
00:20:48.040 | And then here we can see that we have the most frequently
00:20:52.720 | mentioned organizations from the investing subreddit data
00:20:56.520 | that we have.
00:20:57.200 | There's a few things in here that we probably
00:20:59.080 | want to get rid of, like EV, ETF, COVID.
00:21:02.160 | We've got stock exchange, SPAC.
00:21:05.040 | There's a few items in there that we can definitely prune
00:21:07.680 | out with the blacklist.
00:21:10.040 | But overall, I think that looks pretty good.
00:21:13.360 | And this very quickly shows us how easy
00:21:16.480 | it is to apply named entity recognition to a data set
00:21:21.280 | to actually extract what the text within that data set
00:21:24.360 | is actually talking about.
00:21:26.160 | Now, if you start pairing this with things
00:21:27.920 | like sentiment analysis, it can get pretty cool.
00:21:31.320 | So I mean, that's definitely something
00:21:33.000 | that I think we will cover soon.
00:21:35.680 | But for this video, I'm just going to leave it with NER.
00:21:39.240 | So I hope this has been useful.
00:21:41.560 | I really appreciate you watching this.
00:21:43.680 | and I will see you again next time.