
Identify Stocks on Reddit with SpaCy (NER in Python)


Chapters

0:00 Introduction
1:18 Importing SpaCy
2:20 SpaCy Models
3:59 Load Model
4:50 Document Object
8:45 Organization List
10:30 Importing Data
12:20 Getting Entities
14:13 Filtering Entities
17:50 Counter Object


00:00:00.000 | Hi, welcome to this video on named entity recognition
00:00:03.240 | with spaCy.
00:00:04.320 | So I'm going to take you through how
00:00:06.360 | we can use spaCy to perform named entity
00:00:08.880 | recognition, or NER.
00:00:10.200 | And first, what we're going to do
00:00:11.920 | is take this text that you can see here.
00:00:13.960 | And what we are aiming to do is extract all the organizations
00:00:17.720 | mentioned in this text.
00:00:19.240 | So in our case, with this, we are wanting to extract ARK.
00:00:23.160 | So we're going to look at how we do that.
00:00:25.120 | I'm also going to show you how we can visualize that process
00:00:27.580 | as well, using displacy, which is a visualization package
00:00:31.200 | embedded within spaCy.
00:00:32.520 | It's super cool.
00:00:33.680 | Now I'm going to show you how we programmatically extract
00:00:36.220 | entities.
00:00:36.920 | So obviously, visualization is great.
00:00:39.360 | But we do want to just pull those out
00:00:41.760 | in a more programmatic fashion.
00:00:45.480 | So we're going to do that.
00:00:46.640 | And then once we have done that process with our single example,
00:00:51.200 | obviously, we are going to want to scale that
00:00:53.280 | to a ton of examples.
00:00:56.200 | So what I have is a sample of, I think,
00:00:58.920 | it's 900 posts from the investing subreddit.
00:01:03.100 | And we're going to build out a process that
00:01:05.000 | will take all of those, pull out all the entities that
00:01:07.400 | are being mentioned, prune out a few that we don't really want,
00:01:10.640 | and then give us the most frequently mentioned samples
00:01:13.960 | or organizations within that data set.
00:01:16.880 | So let's just jump straight into it.
00:01:18.920 | We have our text data here.
00:01:22.200 | And this is just a single extract
00:01:24.080 | from the investing subreddit.
00:01:27.360 | And I'm just going to use this to show you how spaCy works
00:01:31.800 | and how we can do named entity recognition
00:01:34.880 | on this single piece of text.
00:01:37.160 | We want to start by first importing spaCy.
00:01:40.480 | And if you don't already have spaCy, it's very easy to install.
00:01:46.120 | You just type pip install spacy and press Enter.
00:01:51.960 | And that will install the module for you.
00:01:54.880 | And we're going to be using spaCy itself.
00:01:56.840 | And I'm also going to show you something
00:01:58.680 | called displaCy, which is a visualization package that
00:02:02.600 | comes with spaCy.
00:02:05.000 | So that is from spacy import displacy.
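A minimal sketch of that setup (assuming a standard pip environment):

```python
# Install spaCy first if you don't already have it (run in a terminal):
#   pip install spacy

import spacy                 # the core NLP library
from spacy import displacy  # visualization tool bundled with spaCy
```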
00:02:15.200 | And once we've imported those, we also
00:02:17.320 | want to load in our model.
00:02:20.000 | So spaCy comes with a lot of options in terms of models.
00:02:26.120 | And we can see those at spacy.io/models.
00:02:30.880 | If we come down here, we see this is the model
00:02:33.960 | that we will actually be using.
00:02:36.400 | I just want to quickly cover what we are actually
00:02:38.680 | looking at here.
00:02:40.320 | So come down here, we have the naming conventions.
00:02:44.360 | And we see that the name is built
00:02:45.880 | with the language and the name.
00:02:47.160 | The language for us is, of course, English, which you can see here.
00:02:49.920 | And this last part is the name.
00:02:52.920 | Now, the name consists of the model type, genre, and size.
00:02:59.000 | The type here is core, which is a general purpose
00:03:02.920 | model, which includes vocabulary, syntax, entities,
00:03:06.140 | and word vectors.
00:03:07.480 | We're interested in using entities for the NER tasks
00:03:11.480 | that we are looking at.
00:03:13.880 | Web is the type of data that the pipeline or the model
00:03:17.400 | has been trained on.
00:03:18.440 | So the two examples given here are web and news.
00:03:21.680 | Web includes stuff like blogs.
00:03:23.320 | And Reddit fits pretty well with that.
00:03:25.680 | So we're going to use web.
00:03:26.880 | And then we just have the model size here.
00:03:29.400 | We're just going to go with small.
00:03:31.320 | To download that, we go back into our command line
00:03:36.120 | interface.
00:03:36.720 | And we type this.
00:03:38.440 | We type python -m spacy to access the spaCy module.
00:03:43.720 | And then download, and then your model name here,
00:03:46.520 | which in our case is en_core_web_sm, the English core web small model.
00:03:52.520 | I'm not going to enter this, because I have already
00:03:55.000 | downloaded it.
00:03:55.840 | But that is what you need to do.
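The download command described here looks like this (en_core_web_sm = English, core, web, small):

```
python -m spacy download en_core_web_sm
```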
00:03:57.600 | And once that is downloaded, we can then
00:04:00.480 | load it into our notebook.
00:04:03.720 | So we'll load it into this NLP variable.
00:04:07.520 | And we'll do spacy.load.
00:04:10.000 | And then again, we just enter the model name.
00:04:12.840 | There we go.
00:04:17.960 | That is our model.
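As a sketch, loading the model looks like this:

```python
# Load the small English web pipeline into the nlp variable
nlp = spacy.load("en_core_web_sm")
```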
00:04:19.600 | Now to actually process this data, it's super easy.
00:04:23.880 | We will assign it into this doc variable.
00:04:28.520 | And we just take NLP and add in the text.
00:04:33.760 | Print that out.
00:04:35.120 | And we can see we have this, which just kind of looks
00:04:38.600 | like the text that we passed in there.
00:04:41.400 | So it looks like nothing has actually happened.
00:04:44.920 | But that is not the case.
00:04:46.840 | This is actually a spaCy document object.
00:04:51.360 | So if we run help(doc) here, you see, OK,
00:04:54.760 | we have a Doc object.
00:04:55.800 | And then we've got all these different methods, attributes,
00:04:58.300 | everything in there.
00:04:59.680 | So it has worked.
00:05:01.960 | And that's good, because we can then
00:05:05.880 | use doc.ents to access the entities
00:05:10.520 | that spaCy has identified within this document object
00:05:14.680 | or within this text.
00:05:16.360 | And we can see here, although it doesn't say what type
00:05:19.880 | these labels are, we have ARK, ARK, ARK, ETF.
00:05:22.600 | We have this Bear Cave.
00:05:24.320 | This doesn't tell us that much information,
00:05:26.160 | but the information is here.
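A sketch of these two steps, assuming the Reddit extract is held in a string variable called text:

```python
doc = nlp(text)   # process the raw string into a spaCy Doc object
print(doc)        # looks like plain text, but doc is a Doc, not a str
print(doc.ents)   # tuple of the entity spans spaCy identified
```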
00:05:29.000 | So I want to quickly show you displaCy,
00:05:32.120 | because it's pretty cool.
00:05:33.760 | And I'm going to visualize what is actually happening here.
00:05:37.040 | So let's do displacy.render.
00:05:42.640 | We pass in our document object and the style
00:05:46.920 | for the visualization.
00:05:48.080 | There's a few different styles.
00:05:50.080 | We are going to be using the entity style.
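The render call described here, as a sketch:

```python
# Render the entity visualization (displays inline in a Jupyter notebook)
displacy.render(doc, style="ent")
```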
00:05:53.020 | And this is pretty cool.
00:05:54.240 | It shows us the text.
00:05:55.600 | And then we have these labels on top.
00:05:58.640 | We see that ARK and ETF are identified as organizations.
00:06:02.560 | ETF, we don't really want that in there.
00:06:04.560 | ETF is an exchange-traded fund.
00:06:07.160 | And it's not really what we're looking for
00:06:10.000 | in terms of the organizations.
00:06:12.120 | Nonetheless, it is identifying ARK correctly three times,
00:06:16.240 | which is pretty good.
00:06:17.800 | Now, work of art, when I first saw this,
00:06:20.040 | I had no idea what that meant.
00:06:21.320 | To me, that sounds like a Picasso painting or a statue
00:06:25.720 | from Michelangelo.
00:06:26.920 | I really had no idea what it meant by work of art.
00:06:31.280 | So what we can do to get a small description of each label,
00:06:36.200 | if we don't know what it means, is we just type
00:06:40.720 | spacy.explain, and we're going to do WORK_OF_ART.
00:06:46.160 | And then we can see, OK, it's titles of books, songs,
00:06:48.680 | et cetera.
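The lookup looks like this:

```python
spacy.explain("WORK_OF_ART")
# 'Titles of books, songs, etc.'
```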
00:06:49.160 | So that makes a lot more sense than what
00:06:50.680 | I was initially thinking.
00:06:51.800 | And it also fits quite well to what this is.
00:06:54.360 | So this Bear Cave thing here is actually an article.
00:06:58.520 | And it's not quite a book, but it is something
00:07:01.800 | that someone has written, just like a book or a song.
00:07:04.400 | So it fits in with that category, in my opinion.
00:07:08.200 | So that's great.
00:07:09.120 | We can visualize these entities in the text.
00:07:11.760 | But we also want to process this in a more efficient way.
00:07:15.160 | We can't just visualize it.
00:07:17.000 | So this is where we go back to our doc.ents here.
00:07:22.240 | And what we want to do is actually
00:07:24.440 | work through each one of these in a for loop.
00:07:26.840 | And although these look just like they
00:07:29.760 | are text or something along those lines, they're not.
00:07:32.960 | They're actually entity objects.
00:07:36.280 | So let me just show you how we deal with that.
00:07:38.880 | So we go for entity in doc ends.
00:07:45.000 | So we can print out the label, which
00:07:49.240 | is this org or work of art.
00:07:54.200 | And we print that out by accessing the entity object
00:08:00.760 | and going into the label attribute.
00:08:04.240 | And just notice that there's a underscore
00:08:06.440 | at the end of that attribute name.
00:08:08.480 | So just remember that.
00:08:10.320 | And that will give us the label, so org or work of art.
00:08:14.600 | And then we can also find the entity text.
00:08:18.760 | So we just go entity.
00:08:20.360 | And then we can type in text there as well.
00:08:23.720 | And then we see here, OK, that's pretty cool,
00:08:25.840 | because now we've got the organization, work of art.
00:08:28.720 | And then we have what it is talking about,
00:08:31.160 | which part of the text it's actually
00:08:32.680 | extracting out for us there.
00:08:35.280 | So that is really cool and really useful.
00:08:38.400 | And that is actually all we need to start extracting the data
00:08:41.880 | out and processing it.
00:08:43.640 | So if we just come down here and take this loop,
00:08:46.440 | and we're just going to modify it a little bit.
00:08:48.560 | And we're going to extract the organizations from this list.
00:08:54.200 | So we're going to initialize a org list.
00:08:58.560 | And then here, we're going to add some logic, which says,
00:09:01.800 | OK, if this is a org organization label,
00:09:06.080 | we want to add that to our org list.
00:09:09.480 | So to do that, we say, OK, if label is equal to org,
00:09:21.440 | org list append entity text.
00:09:27.960 | And let's just view our org list at the bottom here.
00:09:33.840 | OK, so here, it's entity, not label.
00:09:37.680 | And here, we get our list of all the organizations.
00:09:43.440 | So it's excluded the bear cave, because the bear cave is not
00:09:46.560 | a org, it's a work of art.
00:09:48.960 | So that's pretty cool.
00:09:50.200 | But ideally, from my perspective on what we want here,
00:09:54.400 | is we don't need to have arc popping up three times.
00:09:58.320 | We just want to say, OK, what organizations
00:10:01.200 | have been mentioned?
00:10:02.080 | We don't care about how frequently they've been
00:10:04.040 | mentioned in a specific item.
00:10:07.800 | So to do that, we just convert this
00:10:09.400 | to a set, which will remove any duplicates.
00:10:11.800 | And then we convert it back into a list.
00:10:15.600 | So org list equals list set org list.
00:10:19.840 | And then let's just see what that looks like.
00:10:25.160 | So we now just have ETF and arc.
00:10:26.920 | And that's exactly where I wanted this to be.
00:10:31.440 | So we've applied this to a single piece of text.
00:10:36.840 | But we want to apply this to a full data frame.
00:10:40.960 | So first thing we need to do is actually import a text.
00:10:44.240 | So I pulled this from Reddit.
00:10:47.480 | So this is the data that we're going to be using.
00:10:51.240 | So we're pulling this from the investing subreddit.
00:10:54.640 | And we're using the Reddit API to do that.
00:10:58.360 | Now, if you haven't used Reddit API before,
00:11:01.080 | I do have a video on that.
00:11:03.480 | So I will leave a link to that in the description.
00:11:06.840 | Otherwise, you can also just get this data directly
00:11:10.560 | if you don't want to go through the whole Reddit API thing.
00:11:14.160 | And I will leave a link to that in the description as well.
00:11:19.120 | So just separate this.
00:11:22.120 | And now we just want to import pandas.
00:11:29.600 | And now we just need to read in our data with pd.read_csv.
00:11:37.400 | And this is in the data directory for me
00:11:40.600 | and is redditinvesting.csv.
00:11:47.240 | And the separator we're using here is the pipe delimiter.
00:11:51.160 | So let's just make sure we've read that in correctly.
00:11:54.360 | And there we go.
00:11:57.760 | So we have our data.
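A sketch of the import; the exact path and filename are assumptions here, so adjust them to wherever you saved the data:

```python
import pandas as pd

# pipe-delimited CSV of r/investing posts (filename as heard in the video)
df = pd.read_csv("data/reddit_investing.csv", sep="|")
df.head()
```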
00:12:00.480 | And the thing that we really focus on here
00:12:03.440 | is this selftext column.
00:12:07.400 | So in here, we just have 836 posts.
00:12:14.120 | And we'll just apply our NER to all of those
00:12:17.840 | and just see what people are talking about.
00:12:20.760 | So we need to convert what we did up here
00:12:23.560 | into a function that we can then apply to our data frame.
00:12:29.600 | So let's take that.
00:12:31.440 | And we're just going to convert this into a function.
00:12:37.440 | So we'll call it get entities.
00:12:42.880 | And then here, we'll pass in a single string.
00:12:46.840 | We'll add that in there.
00:12:48.920 | And we'll say here, we need to create our document object.
00:12:54.560 | So nlp(text).
00:12:58.200 | We've already defined the model up here as NLP.
00:13:01.680 | It's this variable.
00:13:02.960 | Initialize our entity or organization list.
00:13:05.960 | And then work through each one of those
00:13:08.840 | and append it to our list.
00:13:12.200 | And then we just want to return that list.
00:13:16.400 | But we also do want to remove any duplicates.
00:13:18.680 | So we'll just return the set list version.
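The function version, as a sketch:

```python
def get_entities(text):
    """Return the unique ORG entities found in a single string."""
    doc = nlp(text)  # uses the nlp model we loaded earlier
    org_list = []
    for entity in doc.ents:
        if entity.label_ == "ORG":
            org_list.append(entity.text)
    # convert to a set and back to a list to remove duplicates
    return list(set(org_list))
```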
00:13:23.640 | Now we're going to run that.
00:13:25.520 | And let's just apply that to our data frame.
00:13:28.400 | So we'll create a new column called Organizations.
00:13:32.200 | And we will just take the selftext column
00:13:39.320 | and apply our get_entities function to it.
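Which looks something like this:

```python
# run NER over every post body and store the results in a new column
df["organizations"] = df["selftext"].apply(get_entities)
df.head()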
00:13:47.040 | Let's just see what we get.
00:13:48.280 | So this will take a little bit of time
00:13:53.400 | because we're processing a lot of entries here.
00:13:56.080 | Obviously, if you're doing this for a larger data set,
00:13:58.360 | you're probably going to want to batch this a little bit.
00:14:01.120 | So keeping it on file somewhere, reading maybe up to 1,000
00:14:06.680 | samples at once, applying this, and then saving it back to file
00:14:10.040 | and just working through like that.
00:14:13.480 | So for us, we can see straight away
00:14:15.840 | we have some things that we probably
00:14:17.440 | don't really want in there.
00:14:18.640 | So I'm not sure what these are.
00:14:20.960 | And then we also have this S&P 500P, S&P 500,
00:14:26.880 | loads of things.
00:14:28.920 | It's not really what we want in there
00:14:30.720 | because we just want actual company names.
00:14:34.200 | So what we can do is create a blacklist.
00:14:39.960 | What I mean by a blacklist is we just
00:14:43.440 | create a list full of anything that we
00:14:46.880 | don't want to be included.
00:14:48.880 | For example, these here, we really don't want those.
00:14:53.680 | Now, we don't necessarily need to do this as well
00:14:56.440 | for everything because what we will find
00:14:59.320 | with a lot of these items that we don't really
00:15:02.160 | want to include in there-- in fact, actually, I think I'll
00:15:04.600 | probably keep these two in as an example.
00:15:07.560 | What we will find with a lot of these
00:15:09.600 | is that they only appear maybe once or twice
00:15:13.040 | in the whole data set.
00:15:14.880 | So we can actually filter those out
00:15:16.440 | by only searching for organizations
00:15:19.720 | that appear at least three or four times within our data set.
00:15:23.360 | And that just filters out all the rubbish
00:15:25.680 | that we get with these ones.
00:15:27.640 | But in other cases, like SEC, that will appear quite a lot.
00:15:32.120 | And we don't necessarily want to be finding
00:15:34.720 | where the SEC comes up.
00:15:36.720 | And in some cases, maybe you do want to.
00:15:39.000 | But in this case, I'm going to remove it.
00:15:42.520 | I'm going to remove the S&P 500 as well.
00:15:48.120 | And maybe leave it like that.
00:15:50.760 | I'm not sure where to--
00:15:51.920 | I assume Lemonade isn't a company.
00:15:53.680 | So I'm just going to put that in there as well.
00:15:57.560 | And then there's a few others that I've noticed
00:15:59.480 | before that come up quite a lot.
00:16:01.680 | We get the FDA, Treasury, Fed appears all the time, CNBC
00:16:13.640 | always appears, EU always appears.
00:16:18.360 | And I think that's probably a fair few, the ones
00:16:21.280 | that we don't want in there.
00:16:23.760 | So we'll include those.
00:16:25.360 | And to exclude those from our search,
00:16:30.480 | we just add another condition here.
00:16:32.520 | So we're going to add an AND condition.
00:16:34.240 | We say AND Entity Text.
00:16:36.800 | And you'll see here, everything is in lowercase.
00:16:42.120 | So we also apply a lower here.
00:16:44.360 | So this means we don't have to type out Fed in capital
00:16:48.200 | and Fed in lowercase.
00:16:50.680 | We do Entity Text lower, not in the blacklist.
00:16:55.560 | And that will just exclude any that are included in there.
00:16:58.000 | And we can just update that as we go along.
00:17:00.720 | So just rerun that, and we'll rerun this as well.
00:17:04.640 | And we start writing out the next part
00:17:06.880 | of what we're doing here.
00:17:08.400 | So what I want to create is essentially a frequency table.
00:17:13.880 | So we want to have each one of these companies,
00:17:16.640 | and we want to see how often or how frequently they
00:17:19.760 | are mentioned.
00:17:21.360 | So to do that, we can use a counter object
00:17:24.000 | from the collections library.
00:17:25.800 | So what we can do with that is we simply
00:17:29.040 | pass a list, for example, and it will go through and count
00:17:33.080 | all the instances of a specific value,
00:17:36.560 | and then organize them into the counter object, which
00:17:40.240 | gives us a few useful methods for actually viewing
00:17:43.040 | that data, for example, viewing the most common values
00:17:47.640 | in that data set.
00:17:50.320 | So that's pretty useful, and that's
00:17:52.120 | what we are going to be using.
00:17:55.080 | So to use that, we need to import it
00:17:57.920 | from the collections library.
00:18:01.440 | That's the counter object.
00:18:03.280 | And like I said before, this needs a list.
00:18:05.880 | And at the moment, we have a column in the data frame.
00:18:08.680 | So it's not really the right format
00:18:10.120 | that we need to transform it into a counter object.
00:18:14.240 | Instead, what we need is just a simple flat list.
00:18:18.400 | So first thing we can do is take that column
00:18:22.320 | and convert it into a list.
00:18:24.880 | So we'll do organizations to list.
00:18:33.040 | You can see here, OK, we do have lists,
00:18:38.200 | but it's actually a list of lists.
00:18:39.680 | So we've got a list, and within that list,
00:18:41.520 | we have all these other lists.
00:18:43.120 | And we don't want that for our counter object.
00:18:47.440 | We actually just want a plain, straight list.
00:18:50.360 | So we need to add another step to the process, which
00:18:53.440 | is flattening that list.
00:18:55.400 | So we'll call it orbs flat.
00:18:59.360 | And here, we're just going to use list comprehension to loop
00:19:02.200 | through each list within the list
00:19:04.360 | and pull out each item into our new list.
00:19:08.760 | What I mean by that is org here is like a single item
00:19:12.360 | within the sub list.
00:19:14.960 | So if I just view the first two here.
00:19:20.760 | So org is like the SMP 500P here.
00:19:25.400 | And then that will be our item that makes up this new list
00:19:29.360 | that we are making.
00:19:31.360 | And they will come from a sub list.
00:19:35.040 | And these sub lists are these lists here.
00:19:39.480 | And we need to iterate through each one of those sub lists.
00:19:42.880 | For each one that is within our orbs list,
00:19:47.160 | which is the full thing.
00:19:49.640 | And at the end here, we're just saying go through each orb,
00:19:53.160 | so each item in the sub list, which
00:19:56.280 | is kind of a confusing syntax.
00:20:00.000 | But it works.
00:20:01.960 | And it's just something that you get used to if you're not
00:20:05.680 | already.
00:20:07.320 | So then let's view the first five entries in that.
00:20:12.480 | So it's orbs flat.
00:20:14.080 | And there we go.
00:20:14.760 | We have a few copies.
00:20:18.080 | So now we can pass this into our counter object.
00:20:22.800 | So do frequency counter orbs flat.
00:20:32.520 | And then we can view the most frequent of those
00:20:36.880 | by using the most common method.
00:20:40.640 | And then here we just pass the number of the most common items
00:20:44.360 | that we'd like to see.
00:20:45.320 | So if we'd like to see the top 10, we just pass in 10.
00:20:48.040 | And then here we can see that we have the most frequently
00:20:52.720 | mentioned organizations from the investing subreddit data
00:20:56.520 | that we have.
00:20:57.200 | There's a few things in here that we probably
00:20:59.080 | want to get rid of, like EV, ETF, COVID.
00:21:02.160 | We've got socket exchange, SPAC.
00:21:05.040 | There's a few items in there that we can definitely prune
00:21:07.680 | out with the blacklist.
00:21:10.040 | But overall, I think that looks pretty good.
00:21:13.360 | And this very quickly shows us how easy
00:21:16.480 | it is to apply named entity recognition to a data set
00:21:21.280 | to actually extract what the text within that data set
00:21:24.360 | is actually talking about.
00:21:26.160 | Now, if you start pairing this with things
00:21:27.920 | like sentiment analysis, it can get pretty cool.
00:21:31.320 | So I mean, that's definitely something
00:21:33.000 | that I think we will cover soon.
00:21:35.680 | But for this video, I'm just going to leave it with NER.
00:21:39.240 | So I hope this has been useful.
00:21:41.560 | I really appreciate you watching this.
00:21:43.680 | And I will see you again next time.