
Advanced Sentiment Analysis with NLP Transformers + Vector Search


Chapters

0:00 Intro
0:31 What we will build
3:01 Code links and prerequisites
4:16 Dataset download and preprocessing
5:49 Using RoBERTa sentiment analysis model
8:15 Retriever model for building dense vectors
9:39 Create Pinecone vector index
11:40 Sentiment scores, vectors, and indexing
17:35 Sentiment analysis / opinion mining
20:43 Sentiment analysis with specific date range
21:44 Sentiment analysis on specific info
23:58 Final notes

Transcript

Today we're going to learn how to do sentiment mining using vector search and transformer models. At the core of this we have sentiment analysis. Now sentiment analysis is an NLP technique where we essentially want to extract the emotion or the sentiment behind some text. Naturally this makes it a really interesting and useful tool when it comes to analyzing a ton of natural language text data.

And here we're gonna have a look at how to apply sentiment analysis and more specifically sentiment mining to the hotel industry so that we can understand customer perception through hotel reviews. With this we could identify the perfect hotel for whatever it is we want. Or from the other side, from the hotel management side, we can analyze customer reviews and maybe identify areas that are pretty strong and other areas that could do a bit of work.

So the idea here is we're gonna take a ton of these customer reviews for different hotels and apply this technique to them. But of course we don't have to restrict this to reviews, although we can; we can do product reviews for example. We can actually apply this to any text where we want to extract sentiment, and where maybe you want to analyze different ideas or features within that dataset.

So in the hotel example, maybe you want to have a look at whether people think the room sizes are good or whether they like the breakfast, and we could do that with this sentiment mining. So to do that we are essentially going to take all of these reviews and we're going to embed them using something called a sentence transformer model.

We're going to index those within a vector database and we're going to search for things like "are customers satisfied with room sizes for hotels in London". That will return a set of reviews that are relevant to our particular query, and then we assess the general sentiment around them. Now we can also pair this with metadata, which is just bits of information attached to each vector, or record, within our vector database, and we could use that to maybe visualize the sentiment across different hotels.

Or we could even take that a little further and use something called metadata filtering to for example look at reviews over time and we could use that to look at the sentiment of different hotels or different aspects with each of those hotels over time and see if things are either improving or getting worse and so on.

But we'll look at all of that in a lot more detail towards the end of the video. For now what we're going to do is have a look at the actual notebook that we are going to be going through. So if you'd like to follow along with me you can actually get the notebook either in the description of this video or you can go over to here pinecone.io/docs/examples/sentiment-mining and up at the top here you can either get the notebook on github or you can click over here to open it in Colab which is what I'm going to be running through.

Now credit for this idea and notebook goes to Ashraq again, so it's a really cool example and it's really useful. So we'll start going through it now. So the first thing we need to do is install any dependencies that we have. So we have a few here, mainly sentence transformers, the Pinecone client, datasets, which we can pull everything from, and then for visualizations seaborn and matplotlib.

So we run that and one other thing we should also check if you are running this in Colab is in your runtime go over to change runtime type and just make sure that you have the hardware accelerator on GPU here. So if you do need to switch you just have to restart and reinstall those prerequisites there.

So once that is ready you can go to load and prepare dataset. So we're going to be using this dataset that Ashraq is hosting on Hugging Face. So the Ashraq hotel reviews dataset. We're going to take the training split and we're just going to convert it into a pandas data frame. So run this and once that has completed we come down to here and some of these reviews can obviously be much longer than others.

So what we are going to do is just limit it to the first 800 characters of each review because later on when we're creating our embeddings the models we will be using do have a maximum sequence length. So anything beyond that will get truncated anyway. So we're just going to do that now rather than later on.
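As a rough sketch of that truncation step, here is a plain Python version. The sample reviews are made up, and the helper name `truncate_review` is hypothetical. With the pandas data frame from the walkthrough the same idea would be a string slice on the review column, whose name is not stated here.

```python
MAX_CHARS = 800  # character limit mentioned in the walkthrough


def truncate_review(text: str, max_chars: int = MAX_CHARS) -> str:
    """Keep only the first `max_chars` characters of a review."""
    return text[:max_chars]


# Made-up sample reviews: one short, one far longer than the limit.
reviews = ["Great location.", "x" * 2000]
truncated = [truncate_review(review) for review in reviews]

# Short reviews are untouched, long ones are cut to MAX_CHARS.
print([len(review) for review in truncated])
```

Note that this cuts by characters, not tokens, so the model's own sequence limit may still truncate a little further.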

This will also help us when we later want to store the text from these reviews in the Pinecone vector database, because there are size limits to that. So by cutting down the length here we're going to make sure that we definitely do not go over that. So here we can see we have a single hotel; there are several in this dataset, and we just have the reviews.

So you can see it's just natural language here. There's nothing specific, there are no star ratings or anything here. So we really have to rely on the natural language of these reviews and sentiment analysis to actually understand those. So we're going to initialize our sentiment analysis model. We're going to be using RoBERTa here and it has been fine-tuned specifically for this.

So what we're going to do is we're going to set our device to GPU if it is available. It may not be but if you are using Colab and you set the runtime up here you should be able to get that. So this here is just using the CUDA zero device.

Okay and then from Transformers we're going to import a few things. We've got a pipeline, which is going to be our sentiment analysis pipeline later on. We have a tokenizer and a model for sequence classification, which is again our sentiment analysis model. Okay so we're going to be using this RoBERTa-based sentiment analysis model.

We have three labels here, so that's positive, negative and neutral. We're going to load our tokenizer from Hugging Face and then we're going to load all of that into this pipeline here. So this is the sentiment analysis pipeline on the device that we've chosen already, and that's everything. So that will just package everything up into a really easy-to-use pipeline where we can just feed in some text and it's going to output a little dictionary telling us the sentiment of that text.

Now we will need to map these so the output that we get will say either label 0, 1 or 2. That isn't obviously clear what that means so we're just going to map those to their actual meanings which is negative, neutral or positive. Then we take a look at this one.

So room is small for a superior room and poorly lit. No view, need better lighting. It's obviously not a very good review. They're not particularly happy with the room. So let's run this and we're going to see what we get. So I think I just need to run this again.

Okay and then come down here run again and you see that we get this label 0 and so that's a predicted sentiment and we get this score so the confidence. So if we come up here we can see label 0 is negative right? Label 0 negative. Cool so 77% confidence in this being negative and I think that's fair enough.
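The label mapping described above can be sketched roughly like this. The exact label strings (`LABEL_0` and so on) and the shape of the prediction dictionary are assumptions about what the pipeline returns, and the `readable` helper is hypothetical.

```python
# Assumed raw label names; the walkthrough only says "label 0, 1 or 2".
LABELS = {
    "LABEL_0": "negative",
    "LABEL_1": "neutral",
    "LABEL_2": "positive",
}


def readable(prediction: dict) -> dict:
    """Swap the raw label for its meaning and keep the confidence score."""
    return {"label": LABELS[prediction["label"]], "score": prediction["score"]}


# Hypothetical output for the "room is small" review shown above.
raw = {"label": "LABEL_0", "score": 0.77}
print(readable(raw))
```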

Now what we're going to do is initialize what's called a retriever model. So the retriever model is going to handle the construction of our vector embeddings. So we're going to take all of these reviews we're going to convert them into what are called dense vector embeddings and these are essentially a vector representation of the meaning behind each one of these reviews and what that does is allows us to search for something based on its meaning.

So take the earlier example I used, which was something like hotel room sizes for London hotels. We can search for that and return reviews that are relevant to that particular query. That's not based on words being matched but on the actual meaning behind what we have said. So we're going to initialize a model for that. It's a sentence transformer, and it's a pretty small one; this MiniLM is an efficient, smaller model but it still works very well.

So we will download that and we can see it creates these vector embeddings which are 384 dimensions in size. So we're essentially taking our text and encoding all that meaning into a 384-dimensional vector embedding. Okay and then what we need to do is initialize a Pinecone index. So this is going to be our vector database where we're going to store all of these vector embeddings.

What we want to do here is first get an API key. Okay so we would go over to app.pinecone.io. You may need to sign up, and that's free so don't worry about that. You sign up and when you come into this you'll probably just have a single project.

That will be your name followed by default project so like this one for me. I'll click on that and this will probably be empty which is fine because we're going to create an index but what we do need is we need to go to the API keys here. We need to go to copy key value and then you just copy that value into here.

Okay and that is your personal API key. Now for me, I have stored that in a variable called API key. So I will run that. That will connect, and that initializes our connection to our Pinecone project. Now what we can do is initialize an index. So for me, I've actually already created this in advance, so this here will basically check if it already exists.

If it doesn't already exist it will create the index with these parameters here. So dimensionality is at 384 that's the vector embedding dimensionality of the model that we saw earlier and also the metric here is essentially how we compare each of the vectors within our vector index. So that will depend on the retrieval model that you are using.
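To make the idea of a metric concrete, here is a minimal sketch of cosine similarity, the metric this walkthrough ends up using. It is written in plain Python for illustration only; the vector database computes this internally, and the tiny two-dimensional vectors are made up (the real embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction scores 1, perpendicular scores 0, opposite scores -1.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))
```

Because only the angle matters, two reviews of very different lengths can still score as close if their embeddings point the same way.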

In this case with the retrieval model we're using we need to use the cosine metric and once that's created or just identified as already existing we would connect to that index like so. From here what we want to do is actually move on to generating our embeddings. So what we are going to do here is take all of our reviews we're going to run them through our sentiment analysis model and get their sentiment scores whether they are positive, negative, neutral and we're going to do that using this function here.

So get sentiment and we'll just pass the reviews through there and we can see okay we take these first three reviews, run that and we'll get the first three sentiments for those first three reviews. So we have the very first item is negative so if I let's have a look at what we are actually looking at there.

So the first three reviews; actually you can't see them there. Do this. Okay, so this is the first one, and yeah, it looks pretty negative straight away. Number one is the neutral one, with a confidence score of 77. Yes, just the location, view; it doesn't really make any sense. And number two is positive, and yeah, I mean it's generally positive. It's not super over the top, but they're saying it's pretty central, I think, and also very helpful.

So that sounds pretty good and there's a confidence of 89.7% there. So we're basically going to do that for all of our data but there are a few other things that I want to be able to do here. So I want to be able to use metadata filtering to search through a particular date range and to do that we need to get the timestamps from our dataset.

Now if we have a look at our dataset we can see that we have this review date but in order to be able to use those dates in Pinecone for filtering like above a certain date or below a certain date we actually need to convert them into a number.

So to do that we use the dateutil parser. If we just take one example, we're going to get the timestamp for review date 0, so the first one. If we have a look, okay, so we're going to take that and we can get the timestamp like so.
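A minimal sketch of that conversion follows. It swaps in the standard library's `datetime.strptime` for the dateutil parser used in the notebook, and the month/day/year date format is an assumption about how the dataset stores dates. The helper name `get_timestamp` mirrors the function named in the walkthrough, but this body is only an illustration.

```python
from datetime import datetime, timezone


def get_timestamp(date_str: str, fmt: str = "%m/%d/%Y") -> float:
    """Parse a date string and return seconds since the Unix epoch (UTC)."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


# Made-up example dates in the assumed format.
earlier = get_timestamp("12/25/2015")
later = get_timestamp("12/31/2015")

# Once dates are numbers, ordinary comparisons work.
print(earlier < later)
```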

Okay, that will just allow us to compare all the dates using greater than and less than operators and so on. Now from here we have everything we need, so we can get our sentiments, we can get our dates, and what we now do is actually create our embeddings, pull all that together and then insert them into our Pinecone vector database.

So we're going to do this in batches of 64 there's two reasons for that, one to avoid putting too much data into our sentence transformer at once which will basically cause an out-of-memory error and also to avoid sending too much data in one go through the API calls to Pinecone.

So we extract the batch from our data frame, we encode so this creates the embeddings for that batch, we then get the timestamp using the values that we saw before so that getTimeStamp function. We want to get the sentiment labels and scores here so this positive, negative, neutral and the actual confidence score.

We get those and then we use this here to convert all that into a metadata dictionary. So if I just show you very quickly what that might look like, it will basically be something like label is equal to positive for example. We would also have the date or timestamp, that'll be whatever number there.

We'll have a score and very important for us to be able to read the review after is we'll also have the review. So that is essentially what we're building within this meta variable here but it will actually be within a list and we'll have several of these dictionaries one after the other and the final thing that we need to do is just create some unique IDs.

Now these are just numbers, there's nothing special there. Obviously if you have unique IDs that you need to use for your reviews or your dataset, use those actual unique IDs rather than just a count like we're doing here. Then from there we add all those into what we call the upsert list.

So the upsert list is just a list of tuples. So we have the list here of the tuples, and those tuples are the record ID, the record dense vector and the record metadata, and then we upsert all of those into Pinecone. So let's go ahead and run that.
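The batching loop and the upsert list described above might look roughly like this. It is a sketch under stated assumptions: `fake_embed` is a hypothetical stand-in for the sentence transformer, the rows are made up and already carry their sentiment label, score and timestamp, and nothing is actually sent to Pinecone.

```python
from typing import Iterator, List, Tuple

BATCH_SIZE = 64  # batch size mentioned in the walkthrough


def batches(rows: List[dict], size: int) -> Iterator[Tuple[int, List[dict]]]:
    """Yield (start_index, slice) pairs covering `rows` in order."""
    for start in range(0, len(rows), size):
        yield start, rows[start:start + size]


def build_upsert_list(rows, start_id, embed):
    """Return (id, vector, metadata) tuples for one batch."""
    vectors = embed([row["review"] for row in rows])
    upserts = []
    for offset, (row, vector) in enumerate(zip(rows, vectors)):
        metadata = {
            "label": row["label"],
            "score": row["score"],
            "timestamp": row["timestamp"],
            "review": row["review"],
        }
        # IDs are just a running count, as in the walkthrough.
        upserts.append((str(start_id + offset), vector, metadata))
    return upserts


def fake_embed(texts):
    """Hypothetical stand-in for the real embedding model."""
    return [[float(len(text)), 0.0, 0.0] for text in texts]


# Made-up rows standing in for the preprocessed data frame.
rows = [
    {"review": f"review {i}", "label": "positive",
     "score": 0.9, "timestamp": 1451001600.0}
    for i in range(150)
]

all_upserts = []
batch_sizes = []
for start, batch in batches(rows, BATCH_SIZE):
    to_upsert = build_upsert_list(batch, start, fake_embed)
    batch_sizes.append(len(to_upsert))
    # In the real notebook each batch would be upserted to Pinecone here.
    all_upserts.extend(to_upsert)

print(batch_sizes)
```

Keeping the review text in the metadata is what lets us read the matching reviews back at query time.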

Now as you can see this will take a little bit of time, so here it's showing about 25 minutes, which is not too long, but I don't want to wait that long. I've already done it, so I am going to come here, I'm gonna stop this, and I'm just gonna take this to show you that we have everything in there already.

Okay so describe index stats, and we should see, okay, total vector count: we have 93.7 thousand items in there. Okay cool, so now what we can do is move on to our opinion mining or sentiment mining, where we are essentially going to ask some questions and we're going to take a look and analyze the sentiment for the hotel reviews based on what it is we're searching for.

So the first thing I want to search for, which is what I mentioned right at the start, is room sizes. So are customers happy with room sizes for London hotels? We're going to return 500 reviews that are relevant to this query. We're going to include the metadata, so that makes sure that we include the metadata that we indexed, like the review text, which is pretty useful for us to understand what we're returning here.

We also want to see the sentiment scores which is pretty useful for actually analyzing what we're seeing here and the actual sentiment labels. So we can do that now that will return a little dictionary looks like this. Okay so we can see here we have a ton of these actually so let's just have a look at one.

So we have the ID, so 37796. Like I said it's just a count, so it's roughly the 37,000th item that we indexed. We have the hotel name that it's coming from, so we also included that before. We have the label, so this is a negative response, and this is the actual text of the review.

"Did not like the first room offered was not in hotel but over the road", so yeah, they don't seem that happy about it. They're happy with the manager, but the reception staff they're not very happy with, so it's a bit of a mixed review, but generally I think it seems kind of negative.

Then down here we have our timestamp which is what we use in order to do the metadata filtering that you'll see later. Okay great so let's come down and what we can do we can run this. This is just showing a few of the reviews we already looked at them so let's not go through that again.

Okay and what we can do is we're just going to count the sentiment of everything returned. So we have negative, neutral, positive you can see here we have the labels here. So we're just going to count out of everything we return what is the overall sentiment that we're seeing.
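The counting step can be sketched as follows. The result shape (a `matches` list whose items carry a `metadata` dictionary with a `label` field) is an assumption based on the record shown above, the sample result is made up, and `count_sentiment` is a hypothetical helper name.

```python
from collections import Counter


def count_sentiment(result: dict) -> dict:
    """Count how many returned reviews fall under each sentiment label."""
    counts = Counter(match["metadata"]["label"] for match in result["matches"])
    # Always report all three labels, even when a count is zero.
    return {label: counts.get(label, 0)
            for label in ("negative", "neutral", "positive")}


# Made-up query result with the assumed shape.
result = {
    "matches": [
        {"id": "1", "metadata": {"label": "positive", "score": 0.91}},
        {"id": "2", "metadata": {"label": "negative", "score": 0.77}},
        {"id": "3", "metadata": {"label": "positive", "score": 0.64}},
    ]
}

print(count_sentiment(result))
```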

So we run this or I think I'm skipping ahead a little bit. I need to run this. Okay so you see you get negative, neutral, positive, way more positive than anything else there. So surprisingly people are actually happy with the room sizes for London hotels but generally speaking we don't want to just know this for all London hotels or maybe we want to be a little more specific in what we're searching for.

So let's go through and see how we can do that. So the first thing we want to look at is metadata filtering, for narrowing our search to a specific date range. So what we're going to do is get a start time and end time. So we're going to do December 2015, so Christmas up to the end of the year in 2015 as well, and that will just generate those integer numbers that you saw before, or they might be floats actually.

So the timestamps. Yep, and we're basically going to say to Pinecone that we only want to search within the space where we have records with a timestamp that is larger than the start value and less than the end value. So to do that we're going to add a filter: timestamp greater than or equal to the start time and less than or equal to the end time.
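A minimal sketch of such a date-range filter, using the `$gte` and `$lte` comparison operators from Pinecone's metadata filtering. The exact start and end dates are assumptions read from "Christmas up to the end of the year", and `in_range` is a hypothetical local helper that only illustrates what the filter means; the real filtering happens inside Pinecone when the dictionary is passed with the query.

```python
from datetime import datetime, timezone

# Assumed boundaries: 25 December 2015 to 31 December 2015 (UTC).
start_time = datetime(2015, 12, 25, tzinfo=timezone.utc).timestamp()
end_time = datetime(2015, 12, 31, tzinfo=timezone.utc).timestamp()

# Both conditions must hold: timestamp >= start AND timestamp <= end.
date_filter = {"timestamp": {"$gte": start_time, "$lte": end_time}}


def in_range(metadata: dict, flt: dict) -> bool:
    """Local illustration of what the filter selects."""
    bounds = flt["timestamp"]
    return bounds["$gte"] <= metadata["timestamp"] <= bounds["$lte"]


inside = {"timestamp": datetime(2015, 12, 28, tzinfo=timezone.utc).timestamp()}
outside = {"timestamp": datetime(2016, 1, 5, tzinfo=timezone.utc).timestamp()}
print(in_range(inside, date_filter), in_range(outside, date_filter))
```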

Run that, and we're just going to plot our bar chart here, and we see that during this time people are still generally quite positive, although there are definitely a lot more negatives within that time frame. So another thing we can do is actually apply the same thing to different hotels. So we'll apply that, and what we're going to do is look into five different areas.

We're going to look at the room sizes, the cleanliness, staff, food and air conditioning. So what we do is we'll have a query for each one of these. Are customers happy with room sizes? Are customers satisfied with cleanliness of rooms? Essentially we're just taking each one of these parameters and we're reformatting it into more of a natural language question.

So we run this, come down and we're going to go through each of the hotels that we pulled from up here. So Strand Palace Hotel, Britannia, Grand Royale and so on and so on and we're just going to iterate through those. We're going to filter for each one of those hotel names here so that we're returning results specific to each one of those hotels.

Here we need to encode that query, as we did before, to create our query embedding. We're going to count the sentiment as we did before and then we're going to store all of this into a set of results here. Okay so let's go and do this.
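The hotel-by-area loop can be sketched like this. Everything that touches real services is replaced by a stand-in: `fake_search` ignores the query text and does no semantic search, it only applies the hotel-name filter to a few made-up records. The `hotel_name` metadata field and three of the five query strings are assumptions; only the room size and cleanliness questions are quoted in the walkthrough.

```python
from collections import Counter

queries = {
    "Room Size": "are customers happy with room sizes?",
    "Cleanliness": "are customers satisfied with cleanliness of rooms?",
    "Staff": "are customers happy with the staff?",          # assumed wording
    "Food": "are customers happy with the food?",            # assumed wording
    "AC": "are customers happy with the air conditioning?",  # assumed wording
}
hotels = ["Strand Palace Hotel", "Grand Royale London Hyde Park"]

# Made-up records standing in for the indexed reviews.
RECORDS = [
    {"hotel_name": "Strand Palace Hotel", "label": "positive"},
    {"hotel_name": "Strand Palace Hotel", "label": "negative"},
    {"hotel_name": "Grand Royale London Hyde Park", "label": "positive"},
]


def fake_search(query: str, metadata_filter: dict) -> dict:
    """Stand-in for embedding the query and searching the index."""
    wanted = metadata_filter["hotel_name"]["$eq"]
    matches = [{"metadata": r} for r in RECORDS if r["hotel_name"] == wanted]
    return {"matches": matches}


def count_sentiment(result: dict) -> dict:
    counts = Counter(m["metadata"]["label"] for m in result["matches"])
    return {label: counts.get(label, 0)
            for label in ("negative", "neutral", "positive")}


results = {}
for hotel in hotels:
    results[hotel] = {}
    for area, query in queries.items():
        # Restrict each search to reviews of this one hotel.
        hotel_filter = {"hotel_name": {"$eq": hotel}}
        results[hotel][area] = count_sentiment(fake_search(query, hotel_filter))

print(results["Strand Palace Hotel"]["Room Size"])
```

The nested dictionary, hotel first and then area, is a convenient shape for plotting one group of bars per hotel afterwards.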

Area by the way is just what we set up here. So room size, cleanliness and so on. So we'll go through each one of those and get our results. Okay so that will run for a moment. Okay so 10 seconds to run there and then what we can do is take a look.

So we have all these results here and then we can go ahead and visualize that to make a little more sense of it and there we go. So we have all the different parameters here. Room size, cleanliness, staff, food, AC for each one of those hotels. So generally speaking looks like everyone just loves Intercontinental London the O2.

Grand Royale London Hyde Park seems pretty popular as well, and we should never stay here, especially if you want AC. So you can go through and read each of these. These are just kind of like an analysis that we've pulled from each one of these results, but generally speaking you can see really easily how useful sentiment mining can be.

Just analyzing the customer sentiment around in this case hotels but also products or services or cafes you know all these different things and probably a lot of other things that have absolutely nothing to do with customer reviews and we can of course also do this in pretty much near real time.

So we can run our sentiment analysis model on reviews coming in, upsert them into Pinecone, and that comes with an almost instant refresh. So we can then update these visuals and our analysis over time in near real time, which is pretty cool. So that's it for this video, this example.

I hope this has been interesting and useful. Thank you very much for watching and I will see you again in the next one. Bye.