
Data Science as a Service | Kumo AI Full Walkthrough


Chapters

0:00 Kumo AI and GNNs
7:39 Kumo Setup
12:17 Kumo Connectors
14:45 Getting Data into BigQuery
20:39 Building the Graph in Kumo
28:34 Predictive Query Language (PQL)
35:01 Personalized Product Recommendations
38:44 Predicting Purchase Volume
41:44 Making Predictions with Kumo
52:36 When to use Kumo

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we are going to be doing a full end-to-end walkthrough of Kumo.
00:00:05.200 | Now, as a quick introduction to Kumo, it is almost data science as a service
00:00:11.680 | that allows us to simplify a lot of what we as data scientists would be doing
00:00:18.200 | in an analytics use case. So it's best I give you an example. Let's say we are an e-commerce platform,
00:00:25.080 | you are a data scientist in that e-commerce platform, and your goal is, given your current
00:00:32.180 | historical data, you may want to, one, predict the lifetime value of a particular customer,
00:00:41.040 | you may want to generate personalized product recommendations for those customers, and also
00:00:47.280 | try and forecast purchase behaviors. So in the next 30 days, what is this customer
00:00:54.540 | most likely to purchase, and in what quantities? So as a data scientist, if you're going to go
00:01:01.140 | and do that, it is a fairly complicated process, which will take a bit of time. And the reason for
00:01:07.800 | that is this type of data set. So let me show you what this type of data set might look like.
00:01:13.040 | You could be looking at something like this. So you may have a customers table, a transactions table,
00:01:18.760 | articles or products table. Now, of course, this is actually how the data we're
00:01:25.500 | going to be using is structured. We're using the H&M e-commerce dataset, and that dataset
00:01:31.620 | has these three tables. The customers table has 1.3 million records. So that's 1.3 million
00:01:39.240 | customers. And what you're going to need to do is connect that customer's data over to your
00:01:45.560 | transactions data. Okay. So you're going to have your customer ID connecting those two here. You're
00:01:50.460 | going to have in here, the transaction date and the price of that transaction. So that's going to be
00:01:57.820 | pretty useful information when it comes to making all these predictions. On the other side, within the
00:02:03.280 | transactions table we don't necessarily have the actual article or product information; that's going
00:02:09.060 | to be stored over here in the articles table. So you'd connect these two. And in here, you'd have
00:02:14.760 | your product name, the type of product it is. You'd have like the color of the product, a natural language
00:02:20.840 | description. There are also in this data set images that you can attach to that. We're not going to be
00:02:25.280 | using them, but there's a lot of information in there. So you're going to have something like this
00:02:29.400 | and your job as a data scientist is to take this data, which is pretty huge, and transform it into
00:02:38.420 | business predictions that engineers, marketing, leadership, whoever can then go and act upon.
00:02:46.740 | So it can be a pretty hard task. And the way that you would do this, where you have all these
00:02:53.060 | different connections between many different tables, one of the best model architectures for doing this
00:02:58.020 | is graph neural networks. And the reason that graph neural networks are good at this is because of the
00:03:03.620 | relationships between different tables. Graph neural networks are very good at mapping out and
00:03:10.420 | understanding those relationships. You get a better grasp of network effects. So in this scenario,
00:03:17.860 | that is how different customer preferences may influence other customer preferences. Because of
00:03:24.500 | course, these customers are not acting in isolation. They're all part of a broader world. So you have
00:03:30.580 | those network effects, which graph neural networks can handle better than many other predictive models.
00:03:36.660 | You can model temporal dynamics, which is essentially a fancy way of saying you can model predictions
00:03:44.900 | over time. So around Christmas time, for example, if you have data going all the way back to previous
00:03:50.580 | years' Christmas and the year before and so on, there's probably going to be a relatively obvious
00:03:56.180 | like pickup in purchasing volume, but also purchasing different things, especially when you're looking at,
00:04:02.180 | okay, what should I be recommending to a customer? You know, it's summer. Should I be recommending them
00:04:08.660 | a big fluffy coat for like Arctic exploration? Probably not. But should I be recommending them
00:04:16.580 | swim shorts, sunglasses, you know, these sort of things? Probably, right? And a graph neural network,
00:04:21.940 | if you give it enough data, will be able to do that. And another problem that you'll see,
00:04:28.100 | and you'll see this across many disciplines, not just recommendation here. Another very common
00:04:34.980 | problem is the cold start problem. The cold start problem is referring to, okay, you have a new
00:04:40.260 | customer that has just come in, and you don't have any information about them, okay? Or you have very
00:04:44.820 | little information about them. You might have, okay, they are male, female, their age, their geography,
00:04:52.580 | you might have that information. And what you can, what a graph neural network can do based on that is
00:04:58.660 | say, okay, looking at this baseline bit of information, who are some other customers that seem
00:05:04.740 | similar? Okay. And based on that limited amount of information, let's start making some predictions
00:05:11.380 | anyway. And yeah, maybe we'll get some wrong, maybe we'll get some right. But that is far better than
00:05:17.620 | just kind of giving up, which is really what the cold start problem is: you just don't have
00:05:23.460 | enough information to start giving out at least reasonable recommendations. With this, these
00:05:29.060 | recommendations are not going to be as good as if we had more information for that customer. But with
00:05:34.660 | graph neural networks, they're probably going to be better than most other methods. So why do I even
00:05:43.620 | care about graph neural networks right now? Well, that's what Kumo is. Kumo is a service that abstracts
00:05:52.340 | away a lot of the complexity when it comes to, okay, getting our data, passing it, you know,
00:05:57.700 | doing like data exploration, data pre-processing and cleansing. Kumo will handle all of that for us.
00:06:03.220 | And then it will also handle the training of our graph neural network, and then the prediction
00:06:08.340 | making of our graph neural network. So that altogether means that as a data scientist,
00:06:14.100 | something that would maybe take you a month to go through and do all this, maybe more or less time
00:06:19.940 | than that, depending on the project and the competency level, of course, rather than going through and
00:06:24.500 | doing all that, you can actually do all this in a few hours. Which is still some time,
00:06:30.740 | but in comparison, it's pretty fast. And the level of expertise that has been put into
00:06:38.340 | building Kumo is pretty good - I would say probably better than the average data
00:06:45.220 | scientist, maybe even better than a pretty stellar data scientist. So the quality of predictions that
00:06:54.340 | you're going to be getting out of this is probably going to be better than trying to do it yourself in
00:06:59.460 | most cases - maybe not all, but in many. And in any case, it's so much faster that this would
00:07:07.780 | seem like the way to go. And this is particularly exciting for someone like me who, yeah, I did data
00:07:13.780 | science, you know, the start of my career, but I've mostly moved away from that. And I do much more
00:07:19.220 | general engineering work, obviously a lot of AI engineering work, but I'm not so much in the
00:07:24.740 | like model training and straight up data science space anymore. So being able to make these sort of
00:07:32.020 | predictions and, and integrate that with some products or services that I'm building, that's
00:07:38.420 | pretty cool. So enough of an introduction here, let's actually start putting together our end to end
00:07:44.980 | workflow for Kumo. So we're going to be working through this notebook here. You can run this
00:07:49.940 | either locally or in Google Colab, but I'll be running this locally, and you can find a link to
00:07:57.380 | this notebook as well in the comments below, but let's start by running through and connecting to Kumo
00:08:04.500 | in the first place. So you will need to set up an API key. There will be a link for signing up in the
00:08:10.660 | comments below. Or if you already have an account, you probably know where your API key is. For me,
00:08:15.700 | I can go into my Kumo workspace here. I can go to admin and I can get my API key here on the right.
00:08:24.180 | Okay. I would have to reset and generate a new API key if needed. I'm not going to do that. I already
00:08:30.100 | have mine. Alternatively, if you don't have this, you, at least the way that I first got my API key,
00:08:35.620 | was via email. So you can, wherever you can find your API key, go for that. Okay. So once you have
00:08:43.380 | your API key, you should put it inside KUMO_API_KEY, and I'll just pull this in here. So I'm going to
00:08:50.900 | getenv the KUMO_API_KEY environment variable. Or, if you want to just paste it straight into the notebook,
00:08:58.980 | it will run getpass here. Okay. You should see that this will successfully initialize the SDK.
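As a rough sketch, the key-loading and SDK initialization looks something like the following - the kumoai init call and the workspace URL are assumptions on my part, so check your own Kumo workspace docs for the exact endpoint and signature:

```python
import os
from getpass import getpass

import kumoai as kumo  # pip install kumoai

# Pull the API key from the environment, or prompt for it if it isn't set.
api_key = os.getenv("KUMO_API_KEY") or getpass("Kumo API key: ")

# Initialize the SDK (URL and exact init signature are assumptions -
# use the endpoint shown in your own Kumo workspace).
kumo.init(url="https://<your-workspace>.kumoai.cloud/api", api_key=api_key)
```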
00:09:06.500 | You can come down here, and there are multiple ways of getting your data into Kumo. Generally speaking,
00:09:13.860 | with these sort of projects, you're going to be using a lot of data, right? It's, it's usually a big data
00:09:19.140 | thing. So Kumo does integrate with a few different infrastructure providers for data. You can also just
00:09:27.380 | upload from local, but in this example, we're going to be using big query. Uh, you can also use S3,
00:09:34.180 | Snowflake, Databricks, and I think, I think a few others as well, but I'm going to be focusing on
00:09:39.460 | big query just because I'm most comfortable with GCP, but again, it doesn't really matter just as long as
00:09:46.100 | you have access to one of those. So I'm going into my GCP console. I'm going to big query here and I've
00:09:54.180 | already created a couple of tables in here, but if you hadn't, you don't need to do anything here,
00:10:00.180 | but you will need to go over to your IAM controls. And what we need to do is just create a service account
00:10:08.100 | that is going to give Kumo access to our BigQuery. And you can see the permissions that we're going to
00:10:16.180 | need in here. So BigQuery Data Viewer, Filtered Data Viewer, Metadata Viewer, Read Session User, User, and Data
00:10:23.700 | Editor. So we need all of these in our service account. So let's go ahead and create those.
00:10:29.620 | Well, I've already done it. So you can see in here, I have my service account.
00:10:35.380 | It has all those permissions. Now, once you have created your service account, again, just another
00:10:41.940 | thing. You call it whatever you want. You don't need to call it Kumo or anything like that. That's
00:10:45.700 | just the name I gave it here. So I'm going to go over the service accounts. I have my service account
00:10:53.380 | here. And what I'm going to do is just come over to keys and you would have to add a key here. So you'd
00:10:59.460 | create a new key. I've already created mine. So I'll just show you how you create a new one. So you do
00:11:05.300 | JSON, you click create, and then you'll want to save this in the directory that you're running this
00:11:12.420 | Kumo example from. And make sure you call it this here, so kumogcppcreds.json. Now, the reason that we
00:11:21.460 | need to set up the access as I've just described is fairly simple. So we're going to be using GCP and
00:11:30.740 | BigQuery over here as the source of truth for our data. So all of our data is going into Google
00:11:39.060 | BigQuery. So that's the customers data, transactions data, and articles data. All of that is going into
00:11:44.340 | BigQuery. Okay. Kumo needs to be able to read and write to BigQuery. So we set up our service account,
00:11:54.420 | which is, you know, sort of thing that I've kind of highlighted here. We set up our service account,
00:11:59.140 | give that to Kumo, and then Kumo can read the source data from GCP. And when we make predictions
00:12:07.380 | later on, we're going to be making predictions and writing them to a table over in GCP. Okay. So that's
00:12:15.300 | why we need to make this connection. So the next thing we want to do after we've settled those
00:12:21.380 | credentials is we need to create our connector to BigQuery. So there's a few items to set up here.
00:12:28.900 | We have the name for our connector, which I have set to Kumo intro live. We have the project ID. So this
00:12:38.420 | is the GCP project ID. So you can see mine is up here, right? I have this Aurelio advocacy project. So just
00:12:46.340 | make sure that is aligned. And then we will also want to set a unique dataset ID for the dataset that
00:12:54.180 | we're going to create. And this is going to be used to read and write the dataset that you have over in
00:13:02.500 | BigQuery. Which, by the way, if you don't already have a dataset in there, we are going to go and create that. But
00:13:10.420 | right now you can see it in here. So I have this Aurelio HM. So that's the HM dataset over here. Okay.
00:13:17.380 | Again, if you don't already have your dataset in BigQuery, it's not an issue; I'm going to show
00:13:20.900 | you how to do that. So that's the setup. We read in our credentials file here, and then we use that
00:13:28.660 | to initialize our BigQuery connector with Kumo. Okay. So this is from the kumoai SDK.
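Sketching that connector setup in code - the BigQueryConnector class and its parameters are assumptions based on what's described here, and the project and dataset names are just my placeholders:

```python
import json

# Service-account key file created earlier (use whatever filename you saved).
with open("kumogcppcreds.json") as f:
    credentials = json.load(f)

# Register BigQuery as a Kumo connector (class/argument names assumed).
connector = kumo.BigQueryConnector(
    name="kumo_intro_live",          # name shown in the Kumo UI
    project_id="aurelio-advocacy",   # your GCP project ID
    dataset_id="aurelio_hm",         # BigQuery dataset to read from / write to
    credentials=credentials,
)

# List the tables Kumo can see in that dataset (method name assumed).
print(connector.table_names())
```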
00:13:38.260 | Now, let me change the dataset ID here. I'm going to change this to two. And I also need to change this to two very
00:13:46.020 | quickly. So we'll change this. Okay. So now when I come down here and I try to first view our tables,
00:13:56.020 | it's not going to let me. Okay. So I'll run this and it's going to throw an error. Okay. So I've got a 500 -
00:14:00.900 | actually a 404, which is, come over here, not found: the Aurelio advocacy, Aurelio HM2 dataset. Okay. Now I don't
00:14:11.300 | want to go and upload everything to BigQuery again, because it takes a bit of time. So I'm actually just
00:14:16.900 | going to drop that and switch back to the project that we just created. And now that I've connected to
00:14:23.300 | a project that actually does have my data, if I run this, it will not throw an error, right? So
00:14:29.940 | yeah, it just connected, no errors. That's because the data now exists. But of course, if you're following
00:14:36.340 | this through for the first time, you don't have that data already. So let's take a look at how we get that
00:14:42.820 | data and then put it into BigQuery. So as I mentioned, we're going to be using the H&M dataset.
00:14:49.140 | The H&M dataset is a real world dataset with 33 million transactions, 1.3 million customers in the
00:14:58.180 | customers table and over a hundred thousand products or articles. So it's a pretty big dataset, but that
00:15:05.700 | is similar to the sort of scale that you might see yourself. If you're a data scientist working in
00:15:11.220 | this space, this is the sort of thing that you would see in a production recommendation system. So
00:15:16.020 | we're going to come through down here and I just want to start with, okay, where we download and
00:15:22.740 | save everything. So we're going to be pulling it from Kaggle, which is the only place that hosts it officially.
00:15:28.100 | I think there was a copy of it on Hugging Face as well, but I don't want to use that because I don't think
00:15:32.980 | it's an official copy. So we're going to use Kaggle. Now, to download data from Kaggle, you do need an
00:15:41.140 | account. Okay. Slightly annoying, but it's fine. So you just sign in or register, you know, whichever.
00:15:49.380 | And once you've signed in, you should see something kind of like this. What you need to do is go over to
00:15:54.980 | your settings, scroll down and you want to go ahead and create a new token. Now you can download this
00:16:04.100 | wherever you want. I would recommend, okay, just download it into the directory that you're running
00:16:09.300 | the notebook from. So I'm going to go and do that as well. And that will download the Kaggle.json file.
00:16:14.900 | Now, once you've done that, you're going to want to move that kaggle.json file. So I'm going to do
00:16:20.180 | mv kaggle.json to - and this is on Mac - ~/.kaggle/kaggle.json.
00:16:30.260 | So now when I try and import kaggle here, kaggle is going to read my kaggle.json credentials from there
00:16:42.100 | and actually allow me to authenticate. Otherwise it will throw you, it will throw an error. So you,
00:16:48.100 | you will see that. Okay. If you try and run this, it will throw an error if you haven't set that up
00:16:52.420 | correctly. So for this specific dataset, we will need to approve their terms and conditions,
00:16:57.940 | which we can do by just finding the dataset quickly. So we search H&M under the datasets -
00:17:06.740 | or sorry, competitions. And we have this H&M personalized fashion recommendations
00:17:11.860 | competition. So this is the one we're going to be working with. What we can do is, oh,
00:17:19.620 | so around here somewhere, there'll be a little thing that tells you, you need to approve the,
00:17:24.580 | or accept the terms and conditions to use the dataset. So you need to go and click that. Once you've found
00:17:30.180 | it and clicked it, you will be able to download the competition files using this method here. Okay.
00:17:36.180 | So we're pulling this from the H&M personalized fashion recommendations competition. Now, that can
00:17:40.740 | take quite a while to run. I'm not going to download it myself because I already have
00:17:45.140 | it locally, but once you do have it, everything will be in a zip file that looks like this.
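Roughly, the download step with the Kaggle Python client looks like this - the competition slug is an assumption, so copy it from the competition URL, and you must have accepted the competition rules on Kaggle first:

```python
import zipfile

import kaggle  # authenticates from ~/.kaggle/kaggle.json on import

# Download the competition zip into the current directory.
kaggle.api.competition_download_files(
    "h-and-m-personalized-fashion-recommendations", path="."
)

# Extract only the CSVs - the zip also contains a large folder of images.
with zipfile.ZipFile("h-and-m-personalized-fashion-recommendations.zip") as zf:
    for name in zf.namelist():
        if name.endswith(".csv"):
            zf.extract(name, path="data")
```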
00:17:50.740 | So we need to extract our data out from that zip file. And we're only going to be looking for the
00:17:57.540 | CSV files. There's a lot of other files in there. There's a lot of images and everything, but we're just
00:18:03.220 | looking at the CSVs. So we pull those out. And if I run this bit, you can see that
00:18:10.180 | this is the sort of data that we have. This one is the sample submission dataset - we're not
00:18:15.380 | interested in that. We're just looking at these first three. So customers, articles, and transactions
00:18:20.660 | train.csv. And now we need to go ahead and place our data from our local device and throw it into
00:18:29.700 | BigQuery. So to do that, we are setting up, we're going to do this directly with BigQuery.
00:18:35.940 | So from Google, we're importing BigQuery, importing service account, which was how we
00:18:41.860 | authenticate ourselves. We have our credentials file path. This is what we got before from the
00:18:48.900 | service account in GCP. So we create a credentials object using service account credentials. And then
00:18:56.420 | we use that to initialize our BigQuery client. And again, this is using the Aurelio advocacy project.
00:19:04.980 | Okay. So this is the project within GCP that we're using. So we're going to run that. And then what we're
00:19:12.100 | going to want to do is we use our, this is dataset ID. We actually defined this earlier, so we probably
00:19:18.420 | shouldn't define it again here. So we have our dataset ID. We're going to use that to create a
00:19:25.940 | dataset reference with, so like, okay, in GCP, we have a client, our authenticated client. We're saying,
00:19:33.780 | okay, connect to this dataset object. And this is the dataset ID. This will work even if the dataset
00:19:40.020 | doesn't exist, because what will happen is if it doesn't exist, so we're going to try and get the dataset,
00:19:47.300 | that's going to throw an error if it doesn't exist. So we catch that error and we say,
00:19:51.860 | okay, dataset doesn't exist. Now we're going to go ahead and create it, which is exactly what we're
00:19:56.260 | doing here. Okay. So we're creating that dataset. So I'm going to run this for me. It's going to say
00:20:02.020 | dataset already exists because I've already created that dataset. But if this is your first time running
00:20:06.980 | this, your dataset will just have been created and will be empty. So with that, you would need to go through
00:20:14.340 | the files - that is going to be the first three files here. So customers, articles,
00:20:21.380 | and transactions. And this is essentially setting up each table,
00:20:28.420 | and then just pushing the data over to the table. Okay. Now, again, that can take some time,
00:20:34.820 | so I'm not going to run it here, but it will take a little bit of time to actually complete.
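A condensed sketch of those upload cells using the google-cloud-bigquery client - the project, dataset, and file paths are placeholders from my setup:

```python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file("kumogcppcreds.json")
client = bigquery.Client(credentials=creds, project="aurelio-advocacy")

# Get the dataset if it exists, otherwise create it.
dataset_ref = bigquery.DatasetReference(client.project, "aurelio_hm")
try:
    client.get_dataset(dataset_ref)
except NotFound:
    client.create_dataset(bigquery.Dataset(dataset_ref))

# Load each CSV into its own table; autodetect infers the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
for name in ["customers", "articles", "transactions_train"]:
    with open(f"data/{name}.csv", "rb") as f:
        job = client.load_table_from_file(f, dataset_ref.table(name), job_config=job_config)
    job.result()  # wait for each load job to finish
```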
00:20:41.220 | Once we have that set up, we need to move on to actually building our graph in Kumo. Okay. So as we saw briefly before,
00:20:49.620 | this is what the dataset looks like. And based on that, this is how we're going to be connecting
00:20:53.700 | everything. So we have our customers table. We connect that via customer ID to the transactions table.
00:21:00.020 | And then the transactions table is connected to the articles table via the article ID. Okay.
00:21:08.020 | So let's go ahead and define these tables and define these relationships. Okay. We'll run through this.
00:21:14.580 | So first we can just check the tables that we currently have. So I have quite a few tables in
00:21:22.100 | here already. You won't see all of these. So you should only have customers, articles, and transactions
00:21:27.860 | train. Those should be the ones that you see. All these other ones, like all these prediction ones here,
00:21:32.740 | are generated by Kumo later. So at the end, you will have some of these, not all of them, but you
00:21:39.060 | will have most of those. But right now, just a three. So we want to connect to each of the source
00:21:45.300 | tables that we should have. So customers, articles, transactions, train, and we do so like this. So we
00:21:50.820 | just do connector. It's like a dictionary lookup there. So we have articles, customers, transactions,
00:21:57.940 | train, just the table names. Okay. And what that does is it creates within the Kumo space, it sets
00:22:07.140 | up the source tables. A source table is not a table within Kumo. It's a table elsewhere. Okay. So it's a,
00:22:15.060 | you know, it's, it's literally like, okay, this is a source of my data. That's what the source table is.
00:22:20.340 | Now we can view these source tables with a pandas date frame type interface here. So we use head and
00:22:30.100 | we'll see the top five records in our table here. So this is looking at the articles table and see product
00:22:36.980 | code, product name, the number, the type, uh, some more descriptive things. There's quite a lot of useful
00:22:45.940 | information in here. Okay. As you can, as you can see, and again, this is, this is the articles
00:22:52.340 | data set. So roughly about a hundred thousand records in there. Now looking at this, we should be able to see
00:23:01.140 | columns here. My bad.
00:23:06.340 | Okay. And we can see all, all the columns that we're going to be using here. So you see very first one here,
00:23:13.380 | article ID. So we know the articles can be connected or we will see in a minute that the articles can be
00:23:20.100 | connected via this to the transaction table. So, okay. We have that. Let's come down.
00:23:25.940 | Let's take a look at our customer's table. We'll just look at the first two records here.
00:23:31.540 | Okay. So customers table, we see, we have customer ID. So that's what we're going to be using again.
00:23:38.500 | Some other little bits of information here, uh, mainly actually this is empty. I think I'm pretty
00:23:44.580 | sure I've seen age in a few of those. So yeah, I think maybe this is just a couple of bad examples.
00:23:51.460 | Then moving on to the transactions source. So we can go through here. Let me run this.
00:23:59.940 | So there isn't much information in the transaction table, but it's a big, it's a lot of data.
00:24:05.620 | So we can see the transaction date, the customer, so we can connect that to the customer's table,
00:24:10.580 | the article ID. So we can connect that to the articles table, the price, and then sales channel ID. So
00:24:19.060 | we're going to connect all of those up. The way that we're going to do that is we're going to use this
00:24:24.020 | Kumo AI table. So we're going to initialize the Kumo table and we're going to do that by pulling it from
00:24:31.380 | a source table, an existing source table. So we'll have the articles table, customers table, and transactions
00:24:36.900 | train table. The primary key for articles is going to be article ID. The primary key for customers is
00:24:43.140 | going to be customer ID. And then for the transactions there, there isn't actually a primary key, but there
00:24:49.140 | is a time column. So we do just highlight that. So there is a time column there, which is the transaction
00:24:54.580 | date. So we can initialize that and we can go and have a look at the, you can see that we've done this
00:25:01.380 | infer metadata. That's, it's an automated thing from Kumo where it's going to look at your table and it's
00:25:06.980 | going to infer the table schema for you. So I can come down here and I can look at the
00:25:12.900 | metadata for my articles table, for example. Okay. So we have article ID, product code,
00:25:17.380 | all the data types in there, whether something is a primary key or something else. So a lot of cool stuff in there.
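As a sketch, the table setup just described might look like the following - the from_source_table and infer_metadata calls and their arguments are assumptions based on the walkthrough, and t_dat is the transaction date column in the H&M data:

```python
# Wrap each source table in a Kumo table, declaring keys / time column
# (call names and arguments assumed - check the kumoai docs).
articles = kumo.Table.from_source_table(
    source_table=connector["articles"], primary_key="article_id"
)
customers = kumo.Table.from_source_table(
    source_table=connector["customers"], primary_key="customer_id"
)
transactions = kumo.Table.from_source_table(
    source_table=connector["transactions_train"], time_column="t_dat"
)

# Let Kumo infer column types, keys, and so on, then inspect the result.
for table in (articles, customers, transactions):
    table.infer_metadata()
print(articles.metadata)
```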
00:25:24.180 | And we'll also be able to see all of these tables now in our Kumo dashboard. So I have, I have a few
00:25:30.660 | in here. So let me scroll through. We're going to want to find the previous columns here. I think
00:25:39.060 | mine would be the most recent one. So that would be may here, this one, this one, and this one,
00:25:45.540 | and you can see some just useful information for each one of these columns. So let's go into maybe
00:25:50.740 | article would be interesting. So we can see if we go to, let's say product name or product type name.
00:26:02.820 | Cool. So you can see there's a lot of trousers. We have trousers, sweater, t-shirt, dress, you know,
00:26:10.500 | so on and so on. Okay. There's some really useful information that you just go in and take a look
00:26:15.460 | through those. So we have that. And now what we will also need to do is, okay, we have our tables,
00:26:21.060 | but they're all kind of, they just exist in Kumo independently of one another at the moment.
00:26:25.940 | Now we want to train our graph neural network on these. So we actually have to create a graph of
00:26:31.700 | connections between each one of our tables. The way that we do that is we initialize this graph object.
00:26:38.260 | We set the tables that are going to be within this single graph. Then we define how they are connected.
00:26:44.100 | So we use these edges. So we have the source table using transactions with both of these. So the
00:26:49.620 | transactions table via the customer ID foreign key is going to connect to the customers table. Second
00:26:57.620 | connection here. So again, transactions table via the article ID foreign key is going to go and connect
00:27:04.420 | to the articles destination table. Okay. So we would run this, and then we just run graph.validate to
00:27:12.340 | confirm that this is an actual valid graph that we're building. Okay. So everything went well, then.
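Sketched in code, with the Graph and Edge argument names assumed from the walkthrough:

```python
# Connect the three Kumo tables into one graph via foreign-key edges.
graph = kumo.Graph(
    tables={
        "articles": articles,
        "customers": customers,
        "transactions": transactions,
    },
    edges=[
        # transactions.customer_id -> customers.customer_id
        kumo.Edge(src_table="transactions", fkey="customer_id", dst_table="customers"),
        # transactions.article_id -> articles.article_id
        kumo.Edge(src_table="transactions", fkey="article_id", dst_table="articles"),
    ],
)

graph.validate()  # confirm the graph definition is valid before training
```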
00:27:19.220 | And yet again, we can go over into our, into our Kumo dashboard and it would be, I suppose my last
00:27:26.260 | one here, which is May. Okay. I need to zoom out a little bit there. So we have, yeah, it's pretty
00:27:34.500 | straightforward. So we have the transactions data here. We have our transactions date, time column there.
00:27:41.780 | We have these two foreign keys. So the customer ID connects over to our primary key of the customer's table.
00:27:46.820 | And then the article ID foreign key connects to the primary key, which is the article ID of the
00:27:53.060 | articles table. Okay. So you can see those connections and you can click through if you want to, and then
00:27:57.940 | see what is in those. We just did that. So I'm not going to, I'm not going to do it again. Okay. So we have
00:28:03.620 | that. Another thing that you can do, if you want to visualize your graph within notebook,
00:28:07.860 | you can install a few additional packages and actually use the graphviz package to
00:28:16.020 | visualize that. I'm not going into that here because for Linux versus Mac, and I assume
00:28:22.100 | Windows as well, you need to set it up in a different way. You can see the graph in the Kumo UI, so
00:28:29.700 | I will just do that personally. It's up to you. So we've set everything up, right? We're, we're at that
00:28:37.220 | point now. This is almost like we have been through the, the data cleansing, the data pre-processing,
00:28:44.260 | the data upload, you know, all of those steps as a data scientist, we've been through all those sets.
00:28:49.060 | So now we're, now we're getting ready to start making some predictions or training our model
00:28:53.460 | to make some predictions. Okay. That's great. We've, you know, we've done that. You know,
00:28:58.100 | there is quite a lot going on there, but nothing beyond what we would have to do anyway as a data
00:29:03.220 | scientist. So what we've done really simplified quite a bit of work and condense it into what we've
00:29:09.940 | just done now. But now we need to go into the predictions. So Kumo uses what they call the
00:29:16.260 | predictive query language or PQL. Now it's quite interesting. So PQL, you might guess,
00:29:22.900 | is kind of like a SQL type syntax, which allows you to define your prediction parameters. Okay.
00:29:31.060 | So rather than writing like some neural network training code, you write this SQL like a predictive
00:29:39.060 | query and Kumo is going to look at that, understand it, and train your GNN based on your PQL statement.
00:29:48.740 | So let's start with our first use case that we described at the start, which is predicting the
00:29:55.220 | customer value over the next 30 days. So the way that we do that in a PQL statement is like this.
00:30:04.580 | So there's a few different components in PQL. So we have our target, which you can see here.
00:30:11.620 | So the target is what follows predict here. So we have the predict statement. We're saying predict
00:30:18.100 | whatever is within our target here. So this is the defined target. So what is our target here? Okay. We
00:30:24.500 | have the sum of the transactions price over the next zero to 30 days into the future. And also we defined
00:30:34.500 | days here. So that is our target. We're predicting the sum of the transactions price over the next 30 days,
00:30:41.460 | but then we also have an entity. Okay. Which is who or what are we making this prediction for?
00:30:48.260 | So here we're saying for each. So we're getting this sum of predictions broken down for each individual
00:30:56.500 | customer. So what we do by writing this query is we are getting the value of each customer over the next
00:31:08.260 | 30 days. So let's go ahead and implement that. We come down here. We use Kumo AI predictive query. We pass
00:31:16.260 | in our graph and then we just write what I, what I just showed you that, that PQL statement. Okay. So
00:31:22.100 | predict the sum of the transactions price for each customer based on the customer ID. We validate our
00:31:29.780 | predictive query, or PQL statement, here. So let's run that. Okay. That is great.
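For reference, that first query looks roughly like this - the PredictiveQuery class and the exact table/column names are assumptions based on how the tables were named above:

```python
# Use case 1: predicted customer value (sum of spend) over the next 30 days.
ltv_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT SUM(transactions.price, 0, 30, days) "
        "FOR EACH customers.customer_id"
    ),
)
ltv_query.validate()
```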
00:31:40.980 | Then we come down here and we ask Kumo to generate a model plan. So basically: okay, Kumo, based on everything - based on our data here,
00:31:48.820 | the volume of data based on the query that we've specified, what is the ideal approach for training
00:31:57.620 | our model? And yeah, you can, you can look at this. Okay. So we can see here that we're using mean absolute
00:32:04.580 | error, mean squared error and root mean squared error as the training loss functions. The tuning metric is
00:32:11.940 | actually using mean absolute error. We have the network pruning, the preprocessing there,
00:32:20.980 | the sampling, the optimization here - it's using the Huber loss function to optimize for regression.
00:32:29.140 | Then we have the number of epochs to work over, the validation steps, test steps,
00:32:34.420 | the learning rates, weight decay. I'm not going to go through all of this,
00:32:41.060 | but there's a ton of stuff in here. Uh, we can actually see the, the graph network architecture
00:32:46.820 | here, which is probably interesting for a few of us. And yeah, just a ton of stuff in there. So you can,
00:32:53.540 | yeah, you have Kumo telling you what it's going to do, how it's going to train your model. So
00:32:58.260 | I'm not well versed in GNNs, but if you are, you can take a look at that and
00:33:04.820 | make sure everything makes sense according to how you understand them. But of course,
00:33:09.860 | as I said, Kumo was literally co-founded by one of the co-authors of the GNN paper. So they have
00:33:18.180 | some pretty talented people working there. So those should be some pretty optimal
00:33:24.660 | parameters there. So once we're happy with that, what we're going to do is we're going to run this
00:33:28.340 | train object. So we create a kumoai trainer with the model plan, and then we run it with trainer.fit. Okay.
00:33:35.620 | And what this is going to do, okay, let me run this. What this is going to do is it's just going to
00:33:42.420 | initialize the training job. Now we're going to see this cell finish quite quickly because we've set
00:33:48.580 | non-blocking equal to true. So it's going to, this is going to go to Kumo. It's going to start,
00:33:54.260 | like say, okay, I want this training job to start running. Once it has confirmation that the training
00:33:59.940 | job is running, it's going to come back to us and allow us to move on with whatever we're doing.
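A sketch of the model-plan and training step - suggest_model_plan, Trainer, generate_training_table, and the non_blocking flag are all assumptions based on what's described here:

```python
# Ask Kumo for a suggested model plan, then launch training for use case 1.
model_plan = ltv_query.suggest_model_plan()
print(model_plan)  # loss functions, GNN architecture, epochs, etc.

ltv_trainer = kumo.Trainer(model_plan)
ltv_train_table = ltv_query.generate_training_table(non_blocking=True)

# non_blocking=True returns once the job is accepted, so several training
# jobs can run in parallel while we carry on.
ltv_job = ltv_trainer.fit(graph, ltv_train_table, non_blocking=True)
print(ltv_job.status(), ltv_job.tracking_url)
```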
00:34:06.100 | But the training job will not be complete for quite a while. I think as I've been going through this,
00:34:12.100 | the time has varied, but I would say somewhere between 40 minutes to an hour for, for a training
00:34:19.300 | job here, but you can run multiple training jobs in parallel. So we have three predictions that we'd
00:34:25.140 | like to make here. So I'm going to run all those at the same time. Okay. We've got this one back now.
00:34:30.660 | So you can do the same - if you want to just start running these all now, you just
00:34:36.340 | run the next few cells in the notebook and then come back, and I will talk you through what the
00:34:42.420 | other use case PQL statements are doing. So we can check our status for the training job. We'll see that
00:34:50.900 | it's running. You can also click the tracking URL here, and this will actually open up the training
00:34:57.220 | job. So we can see how things are going in the UI if we want to. So coming back, let's move on to the
00:35:03.220 | second use case, which is these personalized product recommendations. This is one I personally, like I
00:35:08.580 | would actually be very likely to use with a lot of projects I currently work on, which is obviously more
00:35:15.460 | like AI, conversational AI, building chatbots, or just AI interfaces in general.
00:35:22.980 | The reason that I could see myself using this is let's say you have the H&M website. I don't know
00:35:29.540 | what's on the H&M website, but let's say they have a chatbot and you can log in. You could log in and
00:35:34.580 | you could talk to this chatbot or not even talk. It doesn't have to be a chatbot. It can just be
00:35:41.620 | that you log into the website and the website is going to surface some product site based on what
00:35:50.500 | we do here, based on these personalized product recommendations that we build with Kumo. It can
00:35:55.060 | surface those to the users. They log in so that you are providing them with what they want before they
00:36:02.340 | even, they don't even need to go and search. It's just, it should just be there what they want,
00:36:07.300 | if possible, right? Obviously, you're not going to get it perfect all the time, but you'll probably
00:36:12.980 | be able to do pretty well with this. So you can do that. You can also surface this, as I was originally
00:36:18.740 | going to say, through a chatbot-like interface. You could tell a chatbot, hey, you have this customer
00:36:24.100 | and you're talking to them. These are the sort of products that they seem most interested in,
00:36:28.260 | you know, kind of place those into the conversation when you're able to, when it makes sense.
00:36:34.100 | So that's another thing that you could do. There are many ways that you could use this.
00:36:37.540 | So this is a little more of a sophisticated prediction here. The reason I say this is a
00:36:43.380 | little more sophisticated is because we have a filter here. So we've added an additional item to our
00:36:49.860 | target and entity. So we have the target, the entity, and now we also have a filter. So the target PQL is pretty similar.
00:36:57.060 | Okay. The operator is different. So we're not summing anymore; we're actually listing the distinct
00:37:01.860 | articles that over the next 30 days, we expect to appear in the customer's like transactions table.
00:37:12.500 | Okay. So this is a top 10 - and what does it mean by the top 10?
00:37:18.180 | These are like the top 10 predictions. So will the customer buy this or not? Okay. Will this customer ID
00:37:29.220 | appear in the transactions table alongside a particular article ID
00:37:35.540 | in the next 30 days? That is what we're predicting here. And then we're filtering for the top 10 predictions,
00:37:43.700 | because otherwise, if we don't filter here, we're going to be looking at, what was it? Like 1.3 million
00:37:50.180 | customer IDs, unique customer IDs with, against 100,000 products. And we're making predictions for all of
00:38:00.580 | those, right? That could be a very large number. Okay. So what we're saying is,
00:38:06.180 | okay, just give me the top 10 - the top 10 most probable purchases for each customer.
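As a sketch, the recommendation query might look like this - the LIST_DISTINCT / RANK TOP syntax is my reading of what's described, so verify it against the Kumo PQL reference:

```python
# Use case 2: top-10 product recommendations per customer for the next 30 days.
rec_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT LIST_DISTINCT(transactions.article_id, 0, 30, days) "
        "RANK TOP 10 "
        "FOR EACH customers.customer_id"
    ),
)
rec_query.validate()
```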
00:38:14.180 | So we would run that again, same as before - nothing new here. We're just modifying the PQL
00:38:20.500 | statement. So we run that, we validate it. And we're just going to check again. Okay. We get the model
00:38:28.500 | plan from Kumo here, and then we'll just start training with purchase trainer fit. Okay. So that's
00:38:37.700 | going to run. Again, as before, we will be able to change status with this. Of course, we'll just have
00:38:43.380 | to wait for that other cell to run. Now, final use case here. I want to look at predicting the purchase
00:38:49.380 | volume for our customers. So in this scenario, it's kind of similar. So we're looking at a count of
00:38:55.620 | transactions this time over the next zero to 30 days that generates our target. Again, we're looking for
00:39:01.860 | each customer. Okay. But we're adding a filter here. So what is this filter doing?
00:39:08.420 | This filter is looking at, if you, if you look here at the range, we've got minus 30 days up to
00:39:15.300 | zero days. So this is looking at 30 days in the past. 30 days in the past is saying, okay,
00:39:24.020 | let's just filter where the number of transactions, so the count of transactions for each customer ID over
00:39:30.900 | the past 30 days is greater than zero. So what does that mean? That's saying, just do this prediction
00:39:40.420 | for customers that have purchased something in the previous 30 days. What this does is it just reduces
00:39:46.820 | the scope of the number of predictions that we have to make by focusing only on active customers.
00:39:53.220 | So ideally, we should be able to get a faster prediction out of this. Within the
00:40:00.900 | dataset, there are naturally probably a lot of customers that are just inactive, that we're probably not
00:40:05.460 | going to get much information for, but if we don't add that filter, we're still going to be making
00:40:09.460 | predictions for those customers. So you can do this across the other examples as well,
00:40:16.420 | just to make sure we're focusing on the customers that we actually want to focus on.
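Sketching that third query - again, the COUNT window and the WHERE filter syntax are my reading of the description above:

```python
# Use case 3: purchase volume over the next 30 days, restricted to customers
# who bought something in the previous 30 days (i.e. currently active).
volume_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT COUNT(transactions.*, 0, 30, days) "
        "FOR EACH customers.customer_id "
        "WHERE COUNT(transactions.*, -30, 0, days) > 0"
    ),
)
volume_query.validate()
```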
00:40:23.140 | So as before, we're just setting up the predictive query, validating it. We do the model plan and then
00:40:29.940 | we fit that model plan. Same as before, no difference other than the query, of course. Okay. And once that
00:40:36.660 | cell has finished up here, we can go here and just check the status of our jobs. I would expect them to
00:40:42.820 | either be running or queued as they are here. So I'm going to go and leave these to run and this one as
00:40:49.940 | well and jump back in when they're ready and show you what we can actually do with these queries. Okay.
00:40:57.540 | So we're back and the jobs have now finished. So we can see done in all of these. We're going to switch
00:41:03.620 | over to the browser as well. And you can see in here that these are done. So training complete. This
00:41:10.580 | one took an hour, 20 minutes. So it was pretty long to be fair, but you can see, yeah, you can see the
00:41:18.900 | various ones here. This one here, 60 minutes, pretty quick. I would imagine that is the one where we have
00:41:25.300 | the filter. Yeah. You see here we're filtering for the like active customers only. And yeah, the duration
00:41:35.860 | for that one is noticeably shorter, which makes sense, of course. So that's great. We can just jump
00:41:44.660 | through and look at the, look at how we can use those predictions now. Okay. So to make predictions,
00:41:49.780 | we're going to come down to the next cell here. I've just added this confirm. Okay. Like the status
00:41:56.500 | is done - which it is, we've already checked, but just in case. Then we're going to use trainer.predict.
00:42:03.140 | So the first trainer, if we, if we come up here to where we actually initialize that, it's this one here.
00:42:09.540 | So that first one, which is this PQL statement right here. Okay. So predicting the, essentially the
00:42:18.900 | value of the customer over the next 30 days. So let's go ahead and run that prediction. And what
00:42:25.860 | this is going to do is actually create a table in big query. You can see I've put output table name
00:42:31.700 | here. So it's going to create this table. Okay. So once we've run that, this table will have been
00:42:38.340 | created. Now, the other thing that we should be aware of is that for our second query, this one here,
00:42:47.300 | we are ranking the top 10, right? So this is a ranking prediction. And that means that we can have a varying
00:43:00.260 | number of predictions per entity. Okay. So in that case, we also need to include the number of classes
00:43:09.460 | to return. Okay. So we don't have to stick to 10. We said top 10 before, but we could
00:43:19.380 | change this to, like, 30 if we wanted to. But yeah, we're sticking with 10 here. That's
00:43:27.700 | just a nice parameter that we can set if we want a different number
00:43:33.860 | of predictions to be returned there. And if we come down to the next one, we have the transactions
00:43:40.660 | prediction. So that is looking at the number of transactions for our customers over the next 30
00:43:47.540 | days. From there, we run all of those. Then we can actually go in and see what we have from our data.
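A sketch of the prediction step for the first query - generate_prediction_table, the output arguments, and num_classes_to_return are assumptions based on the walkthrough, and the output table name is my placeholder:

```python
# Batch-predict customer value and write the results back to BigQuery.
# Kumo appends "_predictions" to the output table name.
ltv_pred_table = ltv_query.generate_prediction_table(non_blocking=True)
ltv_trainer.predict(
    graph,
    ltv_pred_table,
    output_connector=connector,
    output_table_name="hm_customer_value",  # written as hm_customer_value_predictions
    non_blocking=False,
)

# The ranking (top-10) query's predict call additionally takes the number of
# classes (articles) to return per entity, e.g. num_classes_to_return=10.
```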
00:43:56.500 | So first one is the customer value predictions. So who will generate the most revenue for us of our
00:44:03.620 | customers? Okay. So we specified before that this was the table name - the output table name -
00:44:11.540 | and Kumo will by default add _predictions to the end of that.
00:44:16.260 | So yeah, just be aware of that; it has changed the table name. But then, yeah, we can see this.
00:44:24.980 | Now it's worth noting that the head that we have here is actually showing all the
00:44:31.620 | lowest predictions, right? And this is a regression task. So essentially it's saying that, for all of these
00:44:36.180 | customers, over the next 30 days the sum of their transactions will be zero. Okay. And because
00:44:43.140 | it's regression, it goes slightly into the negative, right? But essentially just view this as being zero.
00:44:49.700 | That's our prediction. Now, as I said, this is like the tail end - these are all
00:44:56.820 | the lowest predictions in the head here. So we actually want to reverse this. Kumo doesn't
00:45:02.980 | give us a way to write tail like you would with a pandas DataFrame. So
00:45:08.020 | instead we can actually use BigQuery directly to order by the largest values and just get the top five
00:45:17.860 | from there. So this is what we're doing here. We write that as a SQL query. So we're selecting all from
00:45:24.980 | our data sets. We're going for this table. So this is the new table that we have from here and we're ordering
00:45:32.020 | by the target prediction - this number here - descending. So we have the highest values
00:45:39.220 | at the top. So this is going to give us who should be our top five most valuable customers.
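Something like the following, reusing the authenticated BigQuery client from earlier - the prediction column names (ENTITY, TARGET_PRED) are as shown in the walkthrough's output, so adjust them to match your table:

```python
# Top five predicted customer values, ordered descending in BigQuery.
sql = """
SELECT ENTITY, TARGET_PRED
FROM `aurelio-advocacy.aurelio_hm.hm_customer_value_predictions`
ORDER BY TARGET_PRED DESC
LIMIT 5
"""
top_value = client.query(sql).to_dataframe()
print(top_value)
```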
00:45:44.340 | So let's take a look. Okay. And we have these, right? These are high numbers. And this,
00:45:52.580 | the entity here is our customer ID. So we'll be able to use this to map back to the actual customers table
00:46:00.820 | pretty soon. So now we have who will most likely be our most valuable customers, which is a great
00:46:07.540 | thing to be able to predict. Now let's have a look at what we think these people will buy.
00:46:12.980 | Okay. So we're going to look at our purchase predictions, just looking at the head again here.
00:46:18.660 | Again, we could just use BigQuery and go directly through that if we need to. But here we can see,
00:46:24.820 | okay, for this customer, they are like very likely to buy this product here. Okay. And it's pretty high
00:46:32.980 | score there. It's cool. So now let's have a look at transaction volume. We're going to bring all this
00:46:38.900 | together. We're just looking at each table now quickly. We're going to bring all this together in
00:46:42.740 | a moment. Okay. Transaction table. How active would they be again? And again, these are very small numbers.
00:46:50.340 | So we use BigQuery to look at the largest values there. Okay. So these are, again, these are transactions.
00:46:57.220 | Again, I think we're looking, this would be the customer ID here. And this is how many transactions we
00:47:05.060 | actually expect them to make in the next month. So 20 transactions for this first one here. Okay. So all
00:47:10.820 | that is great, but how can we bring all this together? So what I want to do now is look at the next
00:47:16.980 | month's most valuable customers and join that back to our actual customers table. And then I want to see
00:47:23.380 | what those customers are most likely to buy the actual products. And then again, focusing on those
00:47:32.500 | customers, see how many products we think they will buy. So let's do that. First, we're going to find
00:47:38.820 | next month's most valuable customers. How do we do this? So to identify our most valuable customers,
00:47:44.020 | we're going to be doing a join between the sum-of-transactions predictions table and our customers table.
00:47:53.620 | We're going to be joining those two tables based on the transaction predictions,
00:48:00.020 | the entity values there, which is just a customer ID from our predictions and joining those to the actual
00:48:07.220 | customer IDs from the, the customers original customers table. Okay. So that is what we have
00:48:15.060 | there. We're also limiting this by the top 30 highest scores. So you can see we're ordering by the target
00:48:20.900 | prediction and looking at the top 30 of those. So basically filtering for the top 30
00:48:27.540 | predicted most valuable customers. So I will run that, convert the result to a DataFrame, and view it.
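A sketch of that join, with placeholder table names matching the earlier sketches:

```python
# Join predicted customer value back to the customers table and keep the
# 30 highest-scoring customers.
sql = """
SELECT c.*, p.TARGET_PRED AS predicted_value
FROM `aurelio-advocacy.aurelio_hm.hm_customer_value_predictions` AS p
JOIN `aurelio-advocacy.aurelio_hm.customers` AS c
  ON c.customer_id = p.ENTITY
ORDER BY p.TARGET_PRED DESC
LIMIT 30
"""
top_customers = client.query(sql).to_dataframe()
```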
00:48:34.740 | Okay, so we have the customer ID now. So these are the top 30 predicted
00:48:40.260 | most valuable customers. So we can come through here. We can see, okay, we have all the ages here - mostly
00:48:47.700 | people in their young twenties buying all the clothes, of course, with this random 37-year-old over here.
00:48:53.860 | Okay. And then we can see their scores. Okay. So that looks pretty good. You can see that they're all
00:49:02.100 | club members, whatever that means. I assume they must have some sort of membership club.
00:49:08.500 | So generally, okay, that looks pretty good. So we have our top, we could do a top 30 here, but right
00:49:16.500 | now we're just looking at the top five most valuable customers. Now let's have a look at what
00:49:20.820 | those customers are going to buy. So come down to here, we now need to be joining our customers table,
00:49:28.980 | which is here, join our customers table to the purchase predictions, predictions table based on
00:49:37.780 | the prediction entity. Okay. Which is the customer ID. So customer ID is attaching, is joining the
00:49:46.100 | customers table with the purchase predictions table. Then the purchase predictions table also included a
00:49:51.700 | column - I think it was called class - which is the article ID or product ID. So we're going to connect those.
00:49:58.660 | So that is the prediction's class column; we're joining that to the articles table via the article ID. And that is
00:50:07.220 | going to give us our customers most likely purchases. And we're actually going to focus this. So we're
00:50:13.460 | going to filter down to a specific customer, which is our top rated customer from the top customers
00:50:20.100 | table, which we create here. So we're going to be looking at this person here. So let's run that and see
00:50:27.780 | what we get. Okay, I've got a few here. So customer ID, this is just the same for all these people are
00:50:36.340 | looking at single customer, and we can see what they are interested in buying. Okay, so product name,
00:50:52.340 | magic. So some magic dress that they have. Then they have this Ellie Goodard dress, another dress,
00:51:01.780 | a dress, some heeled sandals, a shirt, a dress again, some joggers,
00:51:10.100 | then dress, dress, dress, dress - they really like dresses, I think, from the looks of it. So these are the
00:51:10.100 | products that when this user next logs into a website, or if they're talking to a, you know,
00:51:22.340 | H&M chatbot or anything like that, or we're sending emails out. These are the products that we should
00:51:22.340 | send and just surface to that user. So like, hey, look, these are some nice things that we think you
00:51:27.860 | probably might like. So hey, look at these. That's pretty cool. Now let's continue and take a look at
00:51:34.980 | the purchase volume. So we're now looking at the valuable customers. So this query here,
00:51:43.780 | let me let me come back up and show you what that query actually is. So the valuable customers is this
00:51:57.140 | query here. All right. So finding our top - was it top 30? Sorry, yes - top 30 most valuable customers.
00:51:57.140 | So we're going to be joining to that table.
00:52:00.100 | Yeah, we'll be joining our top 30 most valuable customers to the transaction predictions.
00:52:11.300 | And what that is going to do is it's going to get us the predicted transaction volume for each one of
00:52:17.460 | those top 30 customers. Okay. So that's what we're doing here. So let's run that and take a look at
00:52:23.540 | what we get. Okay. So yeah, you can see so for our customers here, this is the expected transaction
00:52:31.700 | volume. And yeah, we've worked through our analysis. So we've gathered a lot of different pieces of
00:52:38.260 | information. And as I mentioned at the start there, this is the sort of thing that, as a data scientist, would,
00:52:44.100 | one, be hard to do - training the GNN is something you need a lot of experience to do, and to do it
00:52:53.940 | well. It's not, you know, it's not impossible, but it's gonna be hard and it's gonna be hard to do
00:53:00.340 | properly. So Kumo is really good at just abstracting that out, which is really nice. And then the other
00:53:05.860 | thing that I think is really, really cool here is that, okay, if you're a data scientist, maybe you'd
00:53:12.660 | want to go and do this yourself, although you would save a ton of time and probably get better results
00:53:18.660 | doing this, you know, it's up to you. But the other thing is that this means that not only data scientists
00:53:27.060 | can do this, right? So especially for me, as a more generally scoped engineer,
00:53:35.220 | I want to build products, I want to bring in analytics on the data that we're receiving in
00:53:42.180 | those products. And usually for me to do that, okay, we can do some data science stuff, but the results,
00:53:50.500 | one, it's gonna take a long time for me to do it. And two, the results probably won't be that great.
00:53:54.900 | With this, one, I will have the time to use Kumo to do that analysis. And two, it will actually
00:54:04.740 | be a good analysis, unlike what it would probably be like if I did it myself. So I get to have
00:54:11.460 | like very fast development and also get world-class results, which is amazing. So yeah, incredible
00:54:21.300 | service in my opinion. This is just one of the services that Kumo offers. There is another
00:54:26.980 | one that I will be looking into soon, which is a relational foundation model, or Kumo RFM. That is
00:54:34.260 | something I'm also quite excited about. So we'll be looking at that soon. But yeah, this is the full
00:54:40.900 | introduction and walkthrough for building out your own data science pipeline for recommendations on this
00:54:50.260 | pretty cool e-commerce data set. But that's it for now. So thank you very much for watching. I hope all
00:54:56.740 | this has been useful and interesting. But for now, I will see you again in the next one. Thanks. Bye.