Back to Index

How to Index Q&A Data With Haystack and Elasticsearch


Chapters

0:0
0:53 install it on windows using the msi installer
11:13 index all of these documents into our elastic search
12:29 count the number of entries

Transcript

Okay so in this video what we're going to do is actually index our data so at the moment we just have all of our paragraphs from Meditations by Marcus Aurelius and to do this we are going to be using the Elasticsearch document store. So of course if we're using Elasticsearch we first need to actually download and install it so I'm just going to take you through those steps now.

And all we need to do is head on over to this website up here and elasticsearch.co and you can see the address just there. Now I'm going to follow the instructions for Windows but of course if you're on Linux or Mac just follow through it's very similar either way.

So here we're going to install it on Windows using the MSI installer. So just scroll down here and we can see we can download the package from this link so download that and once you download it just open it and we'll see this window pop up. So once you see this window pop up we just go through with all of the default settings.

So install as a service and continue through obviously if you do need to change anything change it but for me there's nothing here that I want to modify. Notice here we have the HTTP port and we're using 92.0.0 we'll be using that later. We just continue through here default settings and then we click install and we just let that install.

Okay so now that we've installed Elasticsearch we can go ahead and actually check that it's running. So to do that we're going to import Python requests and whenever we interact with Elasticsearch it's either going to be through haystack or it will be through the request library and we'll just interact with the Elasticsearch API.

So to check the health of our cluster so essentially check that it's actually up and running all we need to do is send a get request to localhost and if you remember earlier we had it was port 9.2.0.0 of course if the port on yours was different modify it this is just the default value and after this we need to reach out to the cluster endpoint and we are checking the health and then we'll just format that as a JSON.

So what you should see here is we have our cluster which is Elasticsearch may have a different name if you modified it but by default it's Elasticsearch the status is yellow which basically just means we have one node up and running you can have multiple nodes in Elasticsearch and for your cluster health to be green it will expect your shards of indexes to have a backup shards across different nodes and obviously we can't do that if we only have one node but it's completely fine for us because we're just in development if you're in production yes you probably want it to have those backup shards if none of that made any sense don't worry about it we really don't need to know any of that for what we're doing here now what we can also do is we can check if we have any indices already now if I take a look at mine I will already have some indices set up which I've just set up prior to recording this and to check that we go to localhost again and this time we want to call the CAT API which is what we would call whenever we want to see data in a table human readable format rather than JSON and what we're checking here are the indices and we'll just add text onto there so we can actually see that and this is quite messy so if we just print it instead look a bit cleaner okay so you can see I have these two indices you shouldn't I don't think have either of those no you won't have either of those so don't worry about that now what we are going to do is create a new index which will be called Aurelius and that is where we will put our documents now to actually implement that we will be going through the Haystack library which you can pip install farm Haystack and what we want to do is from Haystack dot document store elastic search import elastic search document store so this is our document store instance and of course this is not aware of our elastic search instance we need to initialize that so we'll store it in a variable called docstore and all we write is elastic search document store now we need to initialize it with the parameters so it knows where to connect to our elastic search instance so to do that we write host and this is local host now if you have a username and password set which you don't by default you will need to enter them in here I don't have any set so no worries and then we also need to specify our index and at the moment we don't have an Aurelius index and that's fine because this will initialize it for us so we'll just call it Aurelius now if we go down here we can see what it actually did so it sent a put request to here localhost 9200 Aurelius so that's how you create a new index after that what we want to do is first import our data so we have the data here which I got from this website and process with this script which you can find on github I'll keep a link in the description so you can just go and copy that if you need to now I haven't really done much pre-processing it's pretty straightforward and all you need to do here is actually open that data so we do that with open and from here that data file is located two folders up in a data folder it's called meditations.txt I'm going to be reading that and all we do is data equals f dot read and then if we just have a quick look at first 100 characters there we see that we have this new line character and that signifies a new paragraph from the text so what we want to do here is split the data by new line and then if we check the length of that see that we have 508 separate paragraphs in there so what we now want to do is we want to modify this data so that it's in the correct format for haystack and elasticsearch so that format looks like this so it expects a list of dictionaries where each dictionary looks like this of the text and inside here we would have our paragraph so each one of these items here and then there's another optional field called meta and meta contains a dictionary and in here we can put whatever we want so for us I don't think at the moment there's really that much to put into here other than where it came from so the the book or maybe maybe the source is probably a better word to use here and all of these are coming from meditations now later on we will probably add a few other books as well and then the source will be different and when we return that item from our retriever and our reader we'll at least be able to see which book came from him would be also be pretty cool to maybe include like a page number or something but at the moment with this there are no page numbers included so we don't we're not doing that at the moment so that's a format that we need and it's going to be a list of these so to do that we'll just do some list comprehension so we're going to write this and let's just copy this I think yeah that should be fine we'll copy this and just indent that and in here we have our paragraph and sources meditations for all of them and then we just write for paragraph in and data okay so yeah that should work and if we just check what we have here okay so that's that's what we want so we have text we have the paragraph and then in here we have this meta with the source which is always meditations at the moment so that looks pretty good and we'll just double check the length again it should be 508 okay perfect now what we need to do is index all of these documents into our elastic search instance and to do that it's it's super easy all we do is call docstore because we're doing this through haystack now and we do write documents and we just pass in our data.json and that should work okay cool so we can see here what it's done as it's sent a post request to the bulk api and sent two of them i assume because it can only send so many documents at once so that's pretty cool and now what i want to check is that we actually have 508 documents in our elastic search instance so to do that we're going to revert back to requests so we do requests.get again go to our localhost 9200 and here we need to specify the index that we want to count the number of entries in and then all we do is add count onto the end there and this will return a json object so we do this so that we can see it and sure enough we have 508 items in that document store so if we head on back to our original plan so up here we had meditations we've now got that and we've also set up the first part of our stack over here so elastic now has meditations in there so we can cross that off now the next step is setting up our retriever which we'll cover in the next video so that's everything for this video i hope you enjoyed and i will see you again in the next one