Hi, welcome to this video on named entity recognition with spaCy. I'm going to take you through how we can use spaCy to perform named entity recognition, or NER. First, what we're going to do is take the text that you can see here, and what we're aiming to do is extract all the organizations mentioned in this text.
So in our case, we're wanting to extract ARK. We're going to look at how we do that. I'm also going to show you how we can visualize that process using displaCy, which is a visualization package embedded within spaCy. It's super cool. Then I'm going to show you how we programmatically extract entities.
So obviously, visualization is great. But we do want to just pull those out in a more programmatic fashion. So we're going to do that. And then once we have done that process with our single example, obviously, we are going to want to scale that to a ton of examples.
So what I have is a sample of, I think, around 900 posts from the investing subreddit. And we're going to build out a process that will take all of those, pull out all the entities being mentioned, prune out a few that we don't really want, and then give us the most frequently mentioned organizations within that data set.
So let's just jump straight into it. We have our text data here, and this is just a single extract from the investing subreddit. I'm going to use this to show you how spaCy works and how we can do named entity recognition on this single piece of text.
We want to start by first importing spaCy. And if you don't already have spaCy, it's very easy to install: you just run pip install spacy and hit Enter, and that will install the module for you. We're going to be using spaCy itself, and I'm also going to show you something called displaCy, which is a visualization package that comes with spaCy.
So that is from spacy import displacy. Once we've imported those, we also want to load in our model. spaCy comes with a lot of options in terms of models, and we can see those on spacy.io/models. If we come down here, we see this is the model that we will actually be using.
I just want to quickly cover what we are actually looking at here. So coming down here, we have the naming conventions, and we see that each model name is built from the language and the name. The language for us, of course, is English, which you can see here. And this last part is the name.
Now, the name consists of the model type, genre, and size. The type here is core, which is a general-purpose model that includes vocabulary, syntax, entities, and word vectors. We're interested in the entities for the NER task that we are looking at. Web is the genre, the type of data that the pipeline or model has been trained on.
The two examples given here are web and news. Web includes things like blogs, and Reddit fits pretty well with that, so we're going to use web. And then we just have the model size here; we're just going to go with small. To download that, we go back into our command line interface.
And we type python -m spacy to access the spaCy module, then download, and then your model name, which in our case is the English core web small model, en_core_web_sm. I'm not going to hit Enter, because I have already downloaded it, but that is what you need to do.
Once that is downloaded, we can then load it into our notebook. We'll load it into this nlp variable with spacy.load, and again we just enter the model name. There we go, that is our model. Now, to actually process this data, it's super easy.
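As a rough sketch, the setup we've walked through so far looks like this:

```python
# One-time setup from the command line:
#   pip install spacy
#   python -m spacy download en_core_web_sm

import spacy
from spacy import displacy  # the displaCy visualizer ships with spaCy

# load the small English general-purpose pipeline trained on web text
nlp = spacy.load("en_core_web_sm")
```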
We assign it to this doc variable: we just take nlp and pass in the text. Print that out, and we can see we have this, which just kind of looks like the text that we passed in. So it looks like nothing has actually happened.
But that is not the case. This is actually a spaCy document object. If we call help(doc) here, you see, OK, we have a Doc object, and we've got all these different methods and attributes in there. So it has worked. And that's good, because we can then use doc.ents to access the entities that spaCy has identified within this document object, or within this text.
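In code, with text standing in for the extract (the actual string is whatever post you're working with):

```python
text = "..."  # placeholder for the Reddit extract shown on screen

doc = nlp(text)   # spaCy processes the string into a Doc object
print(doc)        # prints the original text back, but doc is not a plain str
print(doc.ents)   # tuple of entity Spans spaCy identified
```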
And we can see here, although it doesn't say what type these labels are, we have ARK, ARK, ARK, ETF, and we have this Bear Cave. That doesn't tell us much on its own, but the information is there. Now I want to quickly show you displaCy, because it's pretty cool, and I'm going to visualize what is actually happening here.
So let's do displacy.render. We pass in our document object and the style for the visualization; there are a few different styles, and we are going to be using the entity style. And this is pretty cool. It shows us the text, and then we have these labels on top. We see that ARK and ETF are identified as organizations.
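The render call itself is just a one-liner:

```python
# style="ent" draws the text with entity labels highlighted inline;
# render works in notebooks, displacy.serve is the script equivalent
displacy.render(doc, style="ent")
```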
ETF, we don't really want in there. An ETF is an exchange-traded fund, and it's not really what we're looking for in terms of organizations. Nonetheless, it is identifying ARK correctly three times, which is pretty good. Now, WORK_OF_ART, when I first saw this, I had no idea what it meant.
To me, that sounds like a Picasso painting or a statue by Michelangelo. I really had no idea what it meant by work of art. So what we can do to get a short description of each label, if we don't know what it means, is type spacy.explain, and we're going to pass in WORK_OF_ART.
And then we can see, OK, it's titles of books, songs, et cetera. So that makes a lot more sense than what I was initially thinking, and it also fits quite well with what this is. This Bear Cave thing here is actually an article. It's not quite a book, but it is something that someone has written, just like a book or a song.
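That lookup is just:

```python
# short human-readable descriptions for spaCy's label codes
print(spacy.explain("WORK_OF_ART"))  # 'Titles of books, songs, etc.'
print(spacy.explain("ORG"))          # 'Companies, agencies, institutions, etc.'
```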
So it fits that category, in my opinion. So that's great, we can visualize these entities in the text. But we also want to process this in a more efficient way; we can't just visualize it. So this is where we go back to our doc.ents here, and what we want to do is work through each one of these in a for loop.
And although these look like they are just text or something along those lines, they're not; they're actually entity objects. So let me just show you how we deal with that. We go for entity in doc.ents, so we can print out the label, which is this ORG or WORK_OF_ART.
And we print that out by accessing the entity object and going into the label attribute. Just notice that there's an underscore at the end of that attribute name, label_, so remember that. That will give us the label, ORG or WORK_OF_ART. And then we can also find the entity text.
So we just go entity, and then we can access .text there as well. And then we see here, OK, that's pretty cool, because now we've got the organization or work-of-art label, and then we have what it's talking about, which part of the text it's actually extracting for us there.
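As a sketch, that loop over the doc object from above is:

```python
# doc.ents yields entity Span objects, not plain strings
for entity in doc.ents:
    # .label_ (with the trailing underscore) is the string label;
    # .label without it is an integer hash
    print(entity.label_, entity.text)
```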
So that is really cool and really useful, and that's actually all we need to start extracting the data out and processing it. So if we just come down here and take this loop, we're going to modify it a little bit and extract the organizations into a list.
So we're going to initialize an org_list. And then here we're going to add some logic which says, OK, if this is an ORG (organization) label, we want to add it to our org_list. To do that, we say: if the label is equal to "ORG", then org_list.append(entity.text).
And let's just view our org_list at the bottom here. OK, so here it's entity, not label, quick fix. And here we get our list of all the organizations. It's excluded the Bear Cave, because the Bear Cave is not an ORG, it's a WORK_OF_ART. So that's pretty cool.
But ideally, from my perspective, what we want here is to not have ARK popping up three times. We just want to say, OK, which organizations have been mentioned? We don't care how frequently they've been mentioned within a single item. So to do that, we just convert this to a set, which will remove any duplicates.
And then we convert it back into a list: org_list = list(set(org_list)). Let's just see what that looks like. So we now just have ETF and ARK, and that's exactly where I wanted this to be. So we've applied this to a single piece of text.
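Putting the whole single-document extraction together, roughly:

```python
org_list = []

for entity in doc.ents:
    # keep only spans labeled as organizations
    if entity.label_ == "ORG":
        org_list.append(entity.text)

# set() drops duplicate mentions, then convert back to a list
org_list = list(set(org_list))
org_list  # e.g. ['ETF', 'ARK']
```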
But we want to apply this to a full data frame, so the first thing we need to do is actually import the data. I pulled this from Reddit; this is the data that we're going to be using. We're pulling it from the investing subreddit, and we're using the Reddit API to do that.
Now, if you haven't used the Reddit API before, I do have a video on that, so I will leave a link in the description. Otherwise, you can also just get this data directly if you don't want to go through the whole Reddit API thing, and I will leave a link to that in the description as well.
So let's just separate this out. Now we just want to import pandas and read in our data with pd.read_csv. For me, this is in the data directory as redditinvesting.csv, and the separator we're using here is the pipe delimiter. So let's just make sure we've read that in correctly.
And there we go, we have our data. The thing that we really focus on here is this selftext column. In here, we have 836 posts, and we'll just apply our NER to all of those and see what people are talking about. So we need to convert what we did up here into a function that we can then apply to our data frame.
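As a sketch, assuming the file sits at the path I'm using here (adjust it to wherever you saved the data):

```python
import pandas as pd

# the scraped posts, saved pipe-delimited
df = pd.read_csv("data/redditinvesting.csv", sep="|")
df.head()
```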
So let's take that and convert it into a function. We'll call it get_entities, and we'll pass in a single string. We'll add that in there, and here we need to create our document object, so nlp(text). We've already defined the model up here as nlp;
it's this variable. We initialize our entity or organization list, then work through each entity and append it to our list, and then we just return that list. But we also want to remove any duplicates, so we'll return the list(set(...)) version. Now we're going to run that.
And let's just apply that to our data frame. We'll create a new column called organizations, and we will just take the selftext column and apply our get_entities function to it. Let's just see what we get. This will take a little bit of time, because we're processing a lot of entries here.
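Pulled together, the function looks something like this; I'm assuming the post body column is named selftext, which is what Reddit's API calls it:

```python
def get_entities(text):
    """Return the unique organization names mentioned in one string."""
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        if entity.label_ == "ORG":
            org_list.append(entity.text)
    # remove duplicates before returning
    return list(set(org_list))

# apply row by row to the post body column
df["organizations"] = df["selftext"].apply(get_entities)
```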
Obviously, if you're doing this for a larger data set, you're probably going to want to batch it: keeping it on file somewhere, reading maybe up to 1,000 samples at once, applying this, saving it back to file, and just working through like that. For us, we can see straight away that we have some things we probably don't really want in there.
So I'm not sure what these are. And then we also have this S&P 500P, S&P 500, loads of things that aren't really what we want, because we just want actual company names. So what we can do is create a blacklist. What I mean by a blacklist is we just create a list of anything that we don't want to be included.
For example, these here, we really don't want those. Now, we don't necessarily need to do this for everything, though; in fact, I think I'll probably keep these two in as an example.
What we will find with a lot of the items we don't want is that they only appear maybe once or twice in the whole data set. So we can actually filter those out later by only keeping organizations that appear at least three or four times within our data set, and that just filters out all the rubbish that we get with these one-offs.
But in other cases, like SEC, that will appear quite a lot, and we don't necessarily want every mention of the SEC coming up. In some cases, maybe you do, but in this case I'm going to remove it. I'm going to remove the S&P 500 as well.
And maybe leave it at that. Actually, I assume Lemonade isn't a company, so I'm just going to put that in there as well. And then there are a few others that I've noticed before that come up quite a lot: the FDA, Treasury, Fed appears all the time, CNBC always appears, EU always appears.
And I think that covers a fair few of the ones that we don't want in there, so we'll include those. To exclude them from our search, we just add another condition here, an and condition: and the entity text. And you'll see here, everything in the blacklist is lowercase.
So we also apply .lower() here. This means we don't have to type out Fed in capitals and fed in lowercase; we just check entity.text.lower() not in blacklist, and that will exclude any that are included in there. And we can just update the blacklist as we go along.
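So the updated function looks roughly like this; the exact blacklist strings are my guesses at how spaCy rendered those entities, so check them against your own output:

```python
# all lowercase, so one entry covers every casing in the text
blacklist = ["sec", "s&p 500", "lemonade", "fda",
             "treasury", "fed", "cnbc", "eu"]

def get_entities(text):
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        # keep ORG spans that aren't blacklisted (case-insensitive)
        if entity.label_ == "ORG" and entity.text.lower() not in blacklist:
            org_list.append(entity.text)
    return list(set(org_list))

df["organizations"] = df["selftext"].apply(get_entities)
```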
So we just rerun that, and we'll rerun this as well, and we can start writing out the next part of what we're doing here. What I want to create is essentially a frequency table: we want to take each one of these companies and see how often, or how frequently, they are mentioned.
To do that, we can use a Counter object from the collections library. We simply pass it a list, for example, and it will go through and count all the instances of each value, then organize them into the Counter object, which gives us a few useful methods for actually viewing that data, for example viewing the most common values in the data set.
So that's pretty useful, and that's what we are going to be using. To use it, we need to import Counter from the collections library. And like I said before, it needs a list, and at the moment we have a column in the data frame.
So it's not really in the right format to pass into a Counter object; instead, what we need is just a simple flat list. The first thing we can do is take that column and convert it into a list, so we'll do organizations to_list. You can see here, OK, we do have lists, but it's actually a list of lists.
So we've got a list, and within that list we have all these other lists. We don't want that for our Counter object; we just want a plain, flat list. So we need to add another step to the process, which is flattening that list. We'll call it orgs_flat.
And here, we're just going to use a list comprehension to loop through each list within the list and pull out each item into our new list. What I mean by that is that org here is a single item within the sub-list. So if I just view the first two here.
So org is like the S&P 500P here, and that will be an item that makes up this new list we are building. And those items come from a sub-list, and these sub-lists are these lists here. We need to iterate through each one of those sub-lists,
for each one that is within our orgs list, which is the full thing. And at the end here, we're just saying go through each org, so each item in the sub-list. It's a slightly confusing syntax, but it works, and it's just something you get used to if you're not already.
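As a sketch, the flattening step is:

```python
orgs = df["organizations"].to_list()  # a list of lists

# flatten: for each sub-list in orgs, take each org inside it
orgs_flat = [org for sublist in orgs for org in sublist]
orgs_flat[:5]
```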
Then let's view the first five entries in orgs_flat. And there we go, we have a few duplicates in there. So now we can pass this into our Counter object: frequency = Counter(orgs_flat). OK. And then we can view the most frequent of those by using the most_common method.
Here we just pass the number of the most common items that we'd like to see; if we'd like to see the top 10, we just pass in 10. And then we can see that we have the most frequently mentioned organizations from the investing subreddit data that we have.
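In code:

```python
from collections import Counter

# tally how often each organization string appears
frequency = Counter(orgs_flat)

# ten most frequently mentioned organizations, as (name, count) pairs
frequency.most_common(10)
```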
There are a few things in here that we probably want to get rid of, like EV, ETF, COVID; we've got the stock exchange, SPAC. There are a few items in there that we can definitely prune out with the blacklist. But overall, I think that looks pretty good, and it very quickly shows how easy it is to apply named entity recognition to a data set to extract what the text within that data set is actually talking about.
Now, if you start pairing this with things like sentiment analysis, it can get pretty cool, so that's definitely something I think we will cover soon. But for this video, I'm just going to leave it with NER. So I hope this has been useful. I really appreciate you watching this,
and I will see you again next time.