
How to Index Q&A Data With Haystack and Elasticsearch


Chapters

0:53 Install it on Windows using the MSI installer
11:13 Index all of these documents into our Elasticsearch
12:29 Count the number of entries

Whisper Transcript

00:00:00.000 | Okay so in this video what we're going to do is actually
00:00:03.600 | index our data so at the moment we just have
00:00:07.200 | all of our paragraphs from Meditations by Marcus Aurelius
00:00:11.600 | and to do this we are going to be using the Elasticsearch document store.
00:00:16.960 | So of course if we're using Elasticsearch we first need to actually
00:00:20.560 | download and install it so I'm just going to take you through
00:00:23.920 | those steps now.
00:00:29.280 | And all we need to do is head on over to this website up here
00:00:35.520 | at elastic.co, and you can see the address just there. Now I'm going
00:00:43.040 | to follow the instructions for Windows but of course if you're on Linux or Mac
00:00:47.520 | just follow through it's very similar either way.
00:00:51.840 | So here we're going to install it on Windows
00:00:57.920 | using the MSI installer. So just scroll down here and we can see
00:01:03.440 | we can download the package from this link so download
00:01:07.200 | that and once you download it just open it
00:01:10.720 | and we'll see this window pop up. So
00:01:14.960 | once you see this window pop up we just go through with all of the default
00:01:19.440 | settings. So install as a service and continue
00:01:23.840 | through obviously if you do need to change anything change it
00:01:28.320 | but for me there's nothing here that I want to modify.
00:01:32.000 | Notice here we have the HTTP port and we're using
00:01:35.520 | 9200, we'll be using that later. We just continue through here, default
00:01:40.800 | settings and then we click install and we just
00:01:43.840 | let that install.
00:01:46.560 | Okay so now that we've installed Elasticsearch we can
00:01:51.200 | go ahead and actually check that it's running.
00:01:54.560 | So to do that we're going to import the Python requests library,
00:02:00.000 | and whenever we interact with Elasticsearch it's either going to be
00:02:04.720 | through Haystack or it will be through the requests library, and we'll just
00:02:10.000 | interact with the Elasticsearch API. So to check the health of our cluster
00:02:19.920 | so essentially check that it's actually up and running
00:02:23.360 | all we need to do is send a GET request to localhost, and if you remember
00:02:30.800 | earlier it was port 9200, of course if the port
00:02:35.600 | on yours was different modify it, this is just the default value
00:02:40.240 | and after this we need to reach out to the cluster endpoint
00:02:44.560 | and we are checking the health and then we'll just
00:02:47.680 | format that as a JSON.
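A minimal sketch of that health check with the requests library, assuming the default local instance on port 9200:

```python
import requests

# ask the Elasticsearch cluster health endpoint whether the node is up
res = requests.get('http://localhost:9200/_cluster/health')
res.json()
```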
00:02:51.920 | So what you should see here is we have our cluster, which is called elasticsearch,
00:02:55.520 | it may have a different name if you modified it, but by default it's elasticsearch,
00:02:59.680 | and the status is yellow, which basically just means we have one node up and
00:03:05.760 | running you can have multiple nodes in Elasticsearch
00:03:08.880 | and for your cluster health to be green it expects the
00:03:16.480 | shards of your indices to have backup (replica) shards across different nodes, and
00:03:21.920 | obviously we can't do that if we only have one node, but it's completely fine
00:03:25.040 | for us because we're just in development, if you're in production
00:03:28.080 | yes, you probably want to have those replica shards
00:03:32.720 | if none of that made any sense don't worry about it we really don't need to
00:03:35.840 | know any of that for what we're doing here
00:03:39.840 | now what we can also do is we can check if we have any indices
00:03:46.560 | already
00:03:48.960 | now if I take a look at mine I already have some indices,
00:03:55.600 | which I set up just prior to recording this
00:04:00.960 | and to check that we go to localhost again
00:04:09.760 | and this time we want to call the cat API, which is what we call
00:04:17.600 | whenever we want to see data in a human-readable table format
00:04:21.760 | rather than JSON, and what we're checking here are the
00:04:26.400 | indices,
00:04:28.880 | and we'll just add .text onto there so we can actually see that
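A sketch of that indices check, again assuming the default localhost:9200 instance:

```python
import requests

# the cat API returns plain-text tables rather than JSON
res = requests.get('http://localhost:9200/_cat/indices')
res.text         # raw string, with the newlines still escaped
print(res.text)  # printing renders the table with proper line breaks
```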
00:04:32.800 | and this is quite messy, so if we just print it instead
00:04:36.800 | it'll look a bit cleaner. Okay so you can see I have these
00:04:40.320 | two indices, you won't have
00:04:44.000 | either of those, so don't
00:04:47.040 | worry about that. Now what we are going to do is create a
00:04:52.320 | new index which will be called Aurelius and that
00:04:56.000 | is where we will put our documents
00:05:00.720 | now to actually implement that we will be going through the Haystack
00:05:06.480 | library, which you can pip install with
00:05:10.560 | pip install farm-haystack,
00:05:14.720 | and what we want to do is from haystack.document_store.elasticsearch
00:05:22.000 | import
00:05:28.880 | ElasticsearchDocumentStore, so this is our document store class
00:05:35.520 | and of course this is not aware of our Elasticsearch
00:05:39.200 | instance, we need to initialize that, so we'll store it in a
00:05:46.240 | variable called doc_store
00:05:49.440 | and all we write is ElasticsearchDocumentStore
00:05:53.840 | now we need to initialize it with the parameters so it knows
00:05:57.200 | where to connect to our elastic search instance
00:06:00.640 | so to do that we write host and this is
00:06:07.120 | localhost, now if you have a username and password set, which you don't by
00:06:13.360 | default you will need to enter them in here I don't have any set so
00:06:19.200 | no worries
00:06:25.200 | and then we also need to specify our index and at the moment we don't have an
00:06:29.040 | Aurelius index and that's fine because this will initialize it for us
00:06:33.520 | so we'll just call it Aurelius
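A minimal sketch of that setup. The import path varies between farm-haystack versions (newer releases use haystack.document_stores), and Elasticsearch index names are lowercase, so the index is written here as 'aurelius':

```python
# pip install farm-haystack
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore

# connect to the local Elasticsearch instance and create/use the 'aurelius' index
doc_store = ElasticsearchDocumentStore(
    host='localhost',
    username='',   # no credentials set on a default local install
    password='',
    index='aurelius'
)
```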
00:06:37.760 | now if we go down here we can see what it actually did so
00:06:45.120 | it sent a PUT request to here, localhost:9200/aurelius,
00:06:52.960 | so that's how you create a new index. After that
00:06:56.720 | what we want to do is first import our data so
00:07:02.880 | we have the data here which I got from this website
00:07:09.920 | and processed with this script, which you can
00:07:13.600 | find on GitHub, I'll keep a link in the description so you can just go and
00:07:19.760 | copy that if you need to. Now I haven't really done much
00:07:23.680 | pre-processing, it's pretty straightforward
00:07:26.640 | and all you need to do here is actually open
00:07:29.760 | that data so we do that with open and from here that data file is
00:07:37.360 | located two folders up in a data folder it's called meditations.txt
00:07:46.560 | I'm going to be reading that
00:07:49.680 | and all we do is data = f.read()
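A sketch of that step. The relative path just reflects the "two folders up" layout described above, so adjust it to wherever your copy of the file lives:

```python
# read the plain-text copy of Meditations into a single string
with open('../../data/meditations.txt', 'r') as f:
    data = f.read()

data[:100]  # peek at the first 100 characters
```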
00:07:57.680 | and then if we just have a quick look at the first 100 characters there we
00:08:05.760 | see that we have this newline character, and that
00:08:09.520 | signifies a new paragraph in the text, so
00:08:15.200 | what we want to do here
00:08:18.320 | is split the data by new line
00:08:24.160 | and then if we check the length of that we see that we have
00:08:27.920 | 508 separate paragraphs in there.
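A sketch of that split, carrying on from the snippet above:

```python
# one paragraph per line in the source file, so split on newlines
data = data.split('\n')
len(data)  # 508 paragraphs
```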
00:08:36.160 | So what we now want to do is modify this data so that it's in the correct format
00:08:41.680 | for Haystack and Elasticsearch, so that format looks like this, so it
00:08:48.880 | expects a list of dictionaries where each
00:08:52.160 | dictionary looks like this, with a text field, and inside here we would
00:08:58.800 | have our paragraph so each one
00:09:03.440 | of these items here and then there's another
00:09:06.800 | optional field called meta and meta contains a dictionary and in
00:09:13.920 | here we can put whatever we want so for us I don't think at the moment
00:09:19.280 | there's really that much to put into here other than
00:09:23.920 | where it came from, so the book, or
00:09:27.680 | maybe the source is probably a better word to use here,
00:09:32.480 | and all of these are coming from meditations
00:09:35.440 | now later on we will probably add a few other books as well and then the source
00:09:40.640 | will be different and when we return that item from our
00:09:45.520 | retriever and our reader we'll at least be able to see which book
00:09:49.120 | it came from. It would also be pretty cool to
00:09:53.360 | maybe include a page number or something, but
00:09:56.400 | at the moment with this there are no page numbers included, so
00:10:00.880 | we're not doing that at the moment.
00:10:05.120 | So that's the format that we need, and it's going to be a list of these.
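To make that concrete, each entry would look something like this (the source label is just the one chosen above):

```python
{
    'text': '<one paragraph from Meditations>',
    'meta': {'source': 'meditations'}
}
```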
00:10:10.480 | so to do that we'll just do some list comprehension
00:10:16.000 | so we're going to write this and let's just copy this
00:10:21.200 | I think yeah that should be fine we'll copy this
00:10:25.440 | and just indent that, and in here we have our paragraph,
00:10:33.440 | and the source is meditations for all of them, and then we just write
00:10:36.720 | for paragraph
00:10:40.160 | in data. Okay so yeah that should work, and if we just
00:10:46.240 | check what we have here
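A sketch of that list comprehension (data_json is just an assumed variable name here):

```python
# wrap every paragraph in the dictionary format Haystack expects
data_json = [
    {'text': paragraph, 'meta': {'source': 'meditations'}}
    for paragraph in data
]

data_json[0]    # inspect the first entry
len(data_json)  # should still be 508
```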
00:10:51.120 | okay so that's what we want, so we have text,
00:10:55.360 | we have the paragraph and then in here we have this meta with the source
00:10:58.560 | which is always meditations at the moment so
00:11:01.680 | that looks pretty good and we'll just double check
00:11:06.240 | the length again it should be 508 okay perfect now what we need to do
00:11:13.280 | is index all of these documents into our Elasticsearch instance,
00:11:20.480 | and to do that it's super easy, all we do is
00:11:23.520 | call doc_store, because we're doing this through Haystack now,
00:11:27.200 | and we call write_documents and we just pass in our data_json.
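A minimal sketch of that call, continuing from the document store set up earlier:

```python
# push all 508 paragraph dictionaries into the 'aurelius' index
doc_store.write_documents(data_json)
```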
00:11:34.800 | And that should work. Okay cool, so we can see here what it's done,
00:11:42.080 | it's sent a POST request to the bulk API, and sent two of them
00:11:49.520 | I assume because it can only send so many documents at once, so that's
00:11:56.240 | pretty cool, and now what I want to check is
00:12:00.720 | that we actually have 508 documents in our Elasticsearch instance
00:12:08.240 | so to do that we're going to revert back to requests
00:12:11.920 | so we do requests.get again go to our
00:12:20.000 | localhost
00:12:22.800 | 9200, and here we need to specify the index that we want to count the
00:12:30.400 | number of entries in, and then all we do is add _count onto the
00:12:34.560 | end there, and this will return a JSON object, so we
00:12:38.640 | call .json() on it so that we can see it.
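A sketch of that count request, again assuming the local default port and the 'aurelius' index name used above:

```python
# ask Elasticsearch how many documents the aurelius index now holds
res = requests.get('http://localhost:9200/aurelius/_count')
res.json()  # expect {'count': 508, ...}
```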
00:12:42.080 | And sure enough, we have 508 items in that document store.
00:12:47.360 | so if we head on back to our original plan
00:12:51.280 | so up here we had meditations we've now got that
00:12:58.640 | and we've also set up the first part of our stack over here, so Elasticsearch
00:13:06.400 | now has Meditations in there, so we can cross that off. Now the next step
00:13:14.080 | is setting up our retriever, which we'll cover in the
00:13:17.280 | next video. So that's everything for this video, I hope you enjoyed it and I will see
00:13:25.040 | you again in the next one.