Data Science as a Service | Kumo AI Full Walkthrough

Chapters
0:00 Kumo AI and GNNs
7:39 Kumo Setup
12:17 Kumo Connectors
14:45 Getting Data into BigQuery
20:39 Building the Graph in Kumo
28:34 Predictive Query Language (PQL)
35:01 Personalized Product Recommendations
38:44 Predicting Purchase Volume
41:44 Making Predictions with Kumo
52:36 When to use Kumo
00:00:00.000 |
Today we are going to be doing a full end-to-end walkthrough of Kumo. 00:00:05.200 |
Now, as a quick introduction to Kumo, it is almost a data science as a service 00:00:11.680 |
that allows us to simplify a lot of what we as data scientists would be doing 00:00:18.200 |
in an analytics use case. So it's best I give you an example. Let's say we are an e-commerce platform, 00:00:25.080 |
you are a data scientist in that e-commerce platform, and your goal is, given your current 00:00:32.180 |
historical data, you may want to, one, predict the lifetime value of a particular customer, 00:00:41.040 |
you may want to generate personalized product recommendations for those customers, and also 00:00:47.280 |
try and forecast purchase behaviors. So in the next 30 days, what is this customer 00:00:54.540 |
most likely to purchase, and in what quantities? So as a data scientist, if you're going to go 00:01:01.140 |
and do that, it is a fairly complicated process, which will take a bit of time. And the reason for 00:01:07.800 |
that is this type of data set. So let me show you what this type of data set might look like. 00:01:13.040 |
You could be looking at something like this. So you may have a customers table, a transactions table, 00:01:18.760 |
articles or products table. This is actually how the data we're 00:01:25.500 |
going to be using is structured. We're using the H&M e-commerce dataset, and that dataset 00:01:31.620 |
has these three tables. The customers table has 1.3 million records. So that's 1.3 million 00:01:39.240 |
customers. And what you're going to need to do is connect that customers data over to your 00:01:45.560 |
transactions data. Okay. So you're going to have your customer ID connecting those two here. You're 00:01:50.460 |
going to have in here, the transaction date and the price of that transaction. So that's going to be 00:01:57.820 |
pretty useful information when it comes to making all these predictions. On the other side, within the 00:02:03.280 |
transactions table we don't necessarily have the actual article or product information; that's going 00:02:09.060 |
to be stored over here in the articles table. So you'd connect these two. And in here, you'd have 00:02:14.760 |
your product name, the type of product it is. You'd have like the color of the product, a natural language 00:02:20.840 |
description. There are also images in this dataset that you can attach to that. We're not going to be 00:02:25.280 |
using them, but there's a lot of information in there. So you're going to have something like this 00:02:29.400 |
and your job as a data scientist is to take this data, which is pretty huge, and transform it into 00:02:38.420 |
business predictions that engineers, marketing, leadership, whoever can then go and act upon. 00:02:46.740 |
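To make those relationships concrete, here is a minimal pandas sketch of how the three tables join together. The file and column names follow the H&M Kaggle dataset we use later (customers.csv, articles.csv, transactions_train.csv, with t_dat as the transaction date and prod_name as the product name); treat them as assumptions until you have inspected the files yourself.

```python
import pandas as pd

# The three tables of the H&M dataset and their keys.
customers = pd.read_csv("customers.csv")               # ~1.3M rows, key: customer_id
articles = pd.read_csv("articles.csv")                 # ~100K rows, key: article_id
transactions = pd.read_csv("transactions_train.csv")   # tens of millions of rows

# transactions joins to customers via customer_id and to articles via article_id.
enriched = (
    transactions
    .merge(customers, on="customer_id", how="left")
    .merge(articles, on="article_id", how="left")
)
print(enriched[["customer_id", "article_id", "t_dat", "price", "prod_name"]].head())
```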
So it can be a pretty hard task. And the way that you would do this, where you have all these 00:02:53.060 |
different connections between many different tables, one of the best model architectures for doing this 00:02:58.020 |
is graph neural networks. And the reason that graph neural networks are good at this is because of the 00:03:03.620 |
relationships between different tables. Graph neural networks are very good at mapping out and 00:03:10.420 |
understanding those relationships. You get a better grasp of network effects. So in this scenario, 00:03:17.860 |
that is how different customer preferences may influence other customer preferences. Because of 00:03:24.500 |
course, these customers are not acting in isolation. They're all part of a broader world. So you have 00:03:30.580 |
those network effects, which graph neural networks can handle better than many other predictive models. 00:03:36.660 |
You can model temporal dynamics, which is essentially a fancy way of saying you can model predictions 00:03:44.900 |
over time. So around Christmas time, for example, if you have data going all the way back to previous 00:03:50.580 |
years, Christmas and the year before and so on, it's, there's probably going to be a relatively obvious 00:03:56.180 |
like pickup in purchasing volume, but also purchasing different things, especially when you're looking at, 00:04:02.180 |
okay, what should I be recommending to a customer? You know, it's summer. Should I be recommending them 00:04:08.660 |
a big fluffy coat for like Arctic exploration? Probably not. But should I be recommending them 00:04:16.580 |
swim shorts, sunglasses, you know, these sort of things? Probably, right? And a graph neural network, 00:04:21.940 |
if you give it enough data, will be able to do that. And another problem that you'll see, 00:04:28.100 |
and you'll see this across many disciplines, not just recommendation here. Another very common 00:04:34.980 |
problem is the cold start problem. The cold start problem is referring to, okay, you have a new 00:04:40.260 |
customer that just come in, you don't have any information about them, okay? Or you have very 00:04:44.820 |
little information about them. You might have, okay, they are male, female, their age, their geography, 00:04:52.580 |
you might have that information. And what a graph neural network can do based on that is 00:04:58.660 |
say, okay, looking at this baseline bit of information, who are some other customers that seem 00:05:04.740 |
similar? Okay. And based on that limited amount of information, let's start making some predictions 00:05:11.380 |
anyway. And yeah, maybe we'll get some wrong, maybe we'll get some right. But that is far better than 00:05:17.620 |
just kind of giving up. The cold start problem is that you just don't have 00:05:23.460 |
enough information to start giving out at least reasonable recommendations. With this approach, the 00:05:29.060 |
recommendations are not going to be as good as if we had more information for that customer. But with 00:05:34.660 |
graph neural networks, they're probably going to be better than most other methods. So why do I even 00:05:43.620 |
care about graph neural networks right now? Well, that's what Kumo is built on. Kumo is a service that abstracts 00:05:52.340 |
away a lot of the complexity when it comes to, okay, getting our data, parsing it, you know, 00:05:57.700 |
doing like data exploration, data pre-processing and cleansing. Kumo will handle all of that for us. 00:06:03.220 |
And then it will also handle the training of our graph neural network, and then the prediction 00:06:08.340 |
making of our graph neural network. So that altogether means that as a data scientist, 00:06:14.100 |
something that would maybe take you a month to go through and do all this, maybe more or less time 00:06:19.940 |
than that, depending on the project and the competency level, of course, rather than going through and 00:06:24.500 |
doing all that, you can actually do all this in a few hours, which is still some time, 00:06:30.740 |
but in comparison, it's pretty fast. And the level of expertise that has been put into 00:06:38.340 |
building Kumo is pretty good. I would say probably better than the average data 00:06:45.220 |
scientist. Okay, maybe even better than a pretty stellar data scientist. So the quality of predictions that 00:06:54.340 |
you're going to be getting out of this is probably going to be better than trying to do it yourself in 00:06:59.460 |
most cases, maybe not all, but in many cases. And in any case, it's so much faster that this would 00:07:07.780 |
seem like the way to go. And this is particularly exciting for someone like me who, yeah, I did data 00:07:13.780 |
science, you know, at the start of my career, but I've mostly moved away from that. And I do much more 00:07:19.220 |
general engineering work, obviously a lot of AI engineering work, but I'm not so much in the 00:07:24.740 |
like model training and straight up data science space anymore. So being able to make these sort of 00:07:32.020 |
predictions and, and integrate that with some products or services that I'm building, that's 00:07:38.420 |
pretty cool. So enough of an introduction here, let's actually start putting together our end to end 00:07:44.980 |
workflow for Kumo. So we're going to be working through this notebook here. You can run this 00:07:49.940 |
either locally or in Google Colab, but I'll be running this locally and you can find a link to 00:07:57.380 |
this notebook as well in the comments below, but let's start by running through and connecting to Kumo 00:08:04.500 |
in the first place. So you will need to set up an API key. There will be a link for signing up in the 00:08:10.660 |
comments below. Or if you already have an account, you probably know where your API key is. For me, 00:08:15.700 |
I can go into my Kumo workspace here. I can go to admin and I can get my API key here on the right. 00:08:24.180 |
Okay. I would have to reset and generate a new API key if needed. I'm not going to do that. I already 00:08:30.100 |
have mine. Alternatively, if you don't see this, the way that I first got my API key 00:08:35.620 |
was via email. So wherever you can find your API key, go for that. Okay. So once you have 00:08:43.380 |
your API key, you should put it inside a KUMO_API_KEY environment variable, and I'll just pull this in here. So I'm going to 00:08:50.900 |
do a getenv on KUMO_API_KEY. Or if you want to just paste it straight into the notebook, 00:08:58.980 |
it will run getpass here. Okay. You should see that this successfully initializes the SDK. 00:09:06.500 |
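As a rough sketch, that initialization looks something like the following. The package and function names here follow the walkthrough, but treat the exact signatures (and the workspace URL placeholder) as assumptions and check them against the Kumo SDK docs.

```python
import os
from getpass import getpass

import kumoai as kumo  # the Kumo SDK

# Pull the key from the environment, or prompt for it if it isn't set.
api_key = os.getenv("KUMO_API_KEY") or getpass("Enter your Kumo API key: ")

# Initialize the SDK against your workspace (URL below is a placeholder).
kumo.init(url="https://<your-workspace>.kumoai.cloud/api", api_key=api_key)
```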
You can come down here, and there are multiple ways of getting your data into Kumo. Generally speaking, 00:09:13.860 |
with these sorts of projects, you're going to be using a lot of data, right? It's usually a big data 00:09:19.140 |
thing. So Kumo does integrate with a few different infrastructure providers for data. You can also just 00:09:27.380 |
upload from local, but in this example, we're going to be using BigQuery. You can also use S3, 00:09:34.180 |
Snowflake, Databricks, and I think a few others as well, but I'm going to be focusing on 00:09:39.460 |
BigQuery just because I'm most comfortable with GCP, but again, it doesn't really matter, just as long as 00:09:46.100 |
you have access to one of those. So I'm going into my GCP console. I'm going to BigQuery here and I've 00:09:54.180 |
already created a couple of tables in here, but if you hadn't, you don't need to do anything here, 00:10:00.180 |
but you will need to go over to your IAM controls. And what we need to do is just create a service account 00:10:08.100 |
that is going to give Kumo access to our BigQuery. And you can see the permissions that we're going to 00:10:16.180 |
need in here. So data viewer, filtered data viewer, metadata viewer, read session user, user, and data 00:10:23.700 |
editor. So we need all of these in our service account. So let's go ahead and create those. 00:10:29.620 |
Well, I've already done it. So you can see in here, I have my service account. 00:10:35.380 |
It has all those permissions. Now, once you have created your service account, again, just another 00:10:41.940 |
note: you can call it whatever you want. You don't need to call it Kumo or anything like that. That's 00:10:45.700 |
just the name I gave it here. So I'm going to go over to the service accounts. I have my service account 00:10:53.380 |
here. And what I'm going to do is just come over to keys and you would have to add a key here. So you'd 00:10:59.460 |
create a new key. I've already created mine. So I'll just show you how you create a new one. So you do 00:11:05.300 |
JSON, you click create, and then you'll want to save this in the directory that you're running this 00:11:12.420 |
Kumo example from. And make sure you call it this here: kumo-gcp-creds.json. Now, the reason that we 00:11:21.460 |
need to set all the access as I've just described is fairly simple. So we're going to be using GCP and 00:11:30.740 |
BigQuery over here as the source of truth for our data. So all of our data is going into Google 00:11:39.060 |
BigQuery. So that's the customers data, transactions data, and articles data. All of that is going into 00:11:44.340 |
BigQuery. Okay. Kumo needs to be able to read and write to BigQuery. So we set up our service account, 00:11:54.420 |
which is, you know, sort of thing that I've kind of highlighted here. We set up our service account, 00:11:59.140 |
give that to Kumo, and then Kumo can read the source data from GCP. And when we make predictions 00:12:07.380 |
later on, we're going to be making predictions and writing them to a table over in GCP. Okay. So that's 00:12:15.300 |
why we need to make this connection. So the next thing we want to do after we've set up those 00:12:21.380 |
credentials is we need to create our connector to BigQuery. So there are a few items to set up here. 00:12:28.900 |
We have the name for our connector, which I have set to Kumo intro live. We have the project ID. So this 00:12:38.420 |
is the GCP project ID. So you can see mine is up here, right? I have this Aurelio advocacy project. So just 00:12:46.340 |
make sure that is aligned. And then we will also want to set a unique dataset ID for the dataset that 00:12:54.180 |
we're going to create. And this is going to be used to read and write the dataset that you have over in 00:13:02.500 |
BigQuery. Which, by the way, if you don't already have a dataset in there, we are going to go and create that. But 00:13:10.420 |
right now you can see it in here. So I have this Aurelio HM. So that's the HM dataset over here. Okay. 00:13:17.380 |
Again, if you don't already have your dataset in BigQuery, it's not an issue where I'm going to show 00:13:20.900 |
you how to do that. So that's the setup. We read in our credentials file here, and then we use that 00:13:28.660 |
to initialize our BigQuery connector with Kumo. Okay. So this is from the Kumo AI SDK. 00:13:38.260 |
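A minimal sketch of what that connector setup might look like; the class name, argument names, and the credentials handling here are assumptions based on the walkthrough, so check them against the Kumo SDK docs before relying on them.

```python
import json

# Load the service account key downloaded from GCP earlier.
with open("kumo-gcp-creds.json") as f:
    bq_credentials = json.load(f)

# Register a BigQuery connector with Kumo (names and arguments are assumptions).
connector = kumo.BigQueryConnector(
    name="kumo_intro_live",        # any unique name for this connector
    project_id="my-gcp-project",   # your GCP project ID (placeholder)
    dataset_id="aurelio_hm",       # the BigQuery dataset Kumo will read from and write to
    credentials=bq_credentials,    # parsed service account key
)
```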
Now, let me change the dataset ID here. I'm going to change this to 2, and I also need to change this to 2 very 00:13:46.020 |
quickly. So we'll change this. Okay. So now when I come down here and I try to first view our tables, 00:13:56.020 |
it's not going to like me. Okay. So I'll run this and it's going to throw an error. Okay. So I've got a 500, 00:14:00.900 |
actually a 404, which is, if we come over here, not found: the Aurelio advocacy, Aurelio HM 2 dataset. Okay. Now I don't 00:14:11.300 |
want to go and upload everything to BigQuery again, because it takes a bit of time. So I'm actually just 00:14:16.900 |
going to drop that and switch back to the project that we just created. And now that I've connected to 00:14:23.300 |
a project that actually does have my data, if I run this, it will not throw an error, right? So 00:14:29.940 |
yeah, it just connected, no errors. That's because the data now exists. But of course, if you're following 00:14:36.340 |
this through for the first time, you don't have that data already. So let's take a look at how we get that 00:14:42.820 |
data and then put it into BigQuery. So as I mentioned, we're going to be using the H&M dataset. 00:14:49.140 |
The H&M dataset is a real world dataset with 33 million transactions, 1.3 million customers in the 00:14:58.180 |
customers table and over a hundred thousand products or articles. So it's a pretty big dataset, but that 00:15:05.700 |
is similar to the sort of scale that you might see yourself. If you're a data scientist working in 00:15:11.220 |
this space, this is the sort of thing that you would see in a production recommendation system. So 00:15:16.020 |
we're going to come through down here and I just want to start with, okay, where we download and 00:15:22.740 |
extract everything. So we're going to be pulling it from Kaggle, which is the only place to get it. 00:15:28.100 |
I think there was a copy of it on Hugging Face as well, but I don't want to use that because I don't 00:15:32.980 |
think it's an official copy. So we're going to use Kaggle. Now, to download data from Kaggle, you do need an 00:15:41.140 |
account. Okay. Slightly annoying, but it's fine. So you just sign in or register, you know, whichever. 00:15:49.380 |
And once you've signed in, you should see something kind of like this. What you need to do is go over to 00:15:54.980 |
your settings, scroll down and you want to go ahead and create a new token. Now you can download this 00:16:04.100 |
wherever you want. I would recommend, okay, just download it into the directory that you're running 00:16:09.300 |
the notebook from. So I'm going to go and do that as well. And that will download the kaggle.json file. 00:16:14.900 |
Now, once you've done that, you're going to want to move that kaggle.json file. So I'm going to do 00:16:20.180 |
a move of kaggle.json, and this is on Mac, so I'm going to move it to ~/.kaggle/kaggle.json. 00:16:30.260 |
So now when I try and import kaggle here, kaggle is going to read my kaggle.json credentials from there 00:16:42.100 |
and actually allow me to authenticate. Otherwise it will throw an error. So you 00:16:48.100 |
will see that. Okay. If you try and run this, it will throw an error if you haven't set that up 00:16:52.420 |
correctly. So for this specific dataset, we will need to approve their terms and conditions, 00:16:57.940 |
which we can do by just finding the dataset quickly. So search H&M under the datasets, 00:17:06.740 |
or sorry, competitions. And we have this H&M Personalized Fashion Recommendations 00:17:11.860 |
competition. So this is the one we're going to be working with. What we can do is, oh, 00:17:19.620 |
so around here somewhere, there'll be a little thing that tells you, you need to approve the, 00:17:24.580 |
or accept the terms and conditions to use the dataset. So you need to go and click that. Once you've found 00:17:30.180 |
it and clicked it, you will be able to download the competition files using this method here. Okay. 00:17:36.180 |
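For reference, the download step looks roughly like this with the Kaggle Python package; the competition slug matches the competition named above, and the archive is several gigabytes, so expect it to take a while.

```python
import kaggle  # authenticates on import using ~/.kaggle/kaggle.json

# Download the competition archive into the current directory.
kaggle.api.competition_download_files(
    "h-and-m-personalized-fashion-recommendations",
    path=".",
)
```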
So we're pulling this from the H&M personalized fashion recommendations competition. Now that can 00:17:40.740 |
take quite a while to run. I'm not going to download it myself because I already have 00:17:45.140 |
it locally, but once you do have it, everything will be in a zip file that looks like this. 00:17:50.740 |
So we need to extract our data out from that zip file. And we're only going to be looking for the 00:17:57.540 |
CSV files. There's a lot of other files in there. There's a lot of images and everything, but we're just 00:18:03.220 |
looking at the CSVs. So we pull those out. And if I run this bit, you can see that 00:18:10.180 |
this is the sort of data that we have. This one is the sample submission dataset; we're not 00:18:15.380 |
interested in that. We're just looking at these first three: customers.csv, articles.csv, and 00:18:20.660 |
transactions_train.csv. 00:18:29.700 |
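Extracting just the CSVs from the archive might look something like this; the archive name follows the competition slug and the member names follow the dataset, but treat both as assumptions until you inspect the zip yourself.

```python
import zipfile

ARCHIVE = "h-and-m-personalized-fashion-recommendations.zip"

# Pull only the CSV files out of the (very large) archive; skip the images.
with zipfile.ZipFile(ARCHIVE) as zf:
    csv_members = [name for name in zf.namelist() if name.endswith(".csv")]
    zf.extractall(members=csv_members)

# Expect customers.csv, articles.csv, transactions_train.csv, sample_submission.csv
print(csv_members)
```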
Now we need to go ahead and take our data from our local device and push it into BigQuery. To do that, we're going to work with BigQuery directly. 00:18:35.940 |
So from Google Cloud, we're importing bigquery and importing service_account, which is how we 00:18:41.860 |
authenticate ourselves. We have our credentials file path. This is what we got before from the 00:18:48.900 |
service account in GCP. So we create a credentials object using service account credentials. And then 00:18:56.420 |
we use that to initialize our BigQuery client. And again, this is using the Aurelio advocacy project. 00:19:04.980 |
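That setup is standard google-cloud-bigquery; roughly the following, where the key file name and project ID are placeholders for whatever you used above.

```python
from google.cloud import bigquery
from google.oauth2 import service_account

CREDS_PATH = "kumo-gcp-creds.json"   # the service account key from earlier
PROJECT_ID = "my-gcp-project"        # your GCP project ID (placeholder)

# Build credentials from the key file and use them for the BigQuery client.
credentials = service_account.Credentials.from_service_account_file(CREDS_PATH)
client = bigquery.Client(project=PROJECT_ID, credentials=credentials)
```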
Okay. So this is the project within GCP that we're using. So we're going to run that. And then what we're 00:19:12.100 |
going to want to do is use our dataset ID. We actually defined this earlier, so we probably 00:19:18.420 |
shouldn't define it again here. So we have our dataset ID. We're going to use that to create a 00:19:25.940 |
dataset reference with, so like, okay, in GCP, we have a client, our authenticated client. We're saying, 00:19:33.780 |
okay, connect to this dataset object. And this is the dataset ID. This will work even if the dataset 00:19:40.020 |
doesn't exist, because what will happen is if it doesn't exist, so we're going to try and get the dataset, 00:19:47.300 |
that's going to throw an error if it doesn't exist. So we catch that error and we say, 00:19:51.860 |
okay, dataset doesn't exist. Now we're going to go ahead and create it, which is exactly what we're 00:19:56.260 |
doing here. Okay. So we're creating that dataset. So I'm going to run this for me. It's going to say 00:20:02.020 |
dataset already exists because I've already created that dataset. But if this is your first time running 00:20:06.980 |
this, your dataset will be empty and it's just been created. So with that, you would need to go through 00:20:14.340 |
the files in this step. So that is going to be the first three files here: customers, articles, 00:20:21.380 |
and transactions. And this is essentially setting up your table 00:20:28.420 |
and then just pushing the data over to that table. Okay. 00:20:34.820 |
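A sketch of the get-or-create plus CSV load pattern described here, reusing the client and PROJECT_ID from the previous snippet; the dataset ID, file names, and table names are assumptions that should match whatever you used above.

```python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

DATASET_ID = "aurelio_hm"  # must match the dataset_id given to the Kumo connector

# Get the dataset if it exists, otherwise create it.
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)
try:
    client.get_dataset(dataset_ref)
    print("Dataset already exists")
except NotFound:
    client.create_dataset(bigquery.Dataset(dataset_ref))
    print("Dataset created")

# Load each CSV into its own table, letting BigQuery infer the schema.
files_to_tables = {
    "customers.csv": "customers",
    "articles.csv": "articles",
    "transactions_train.csv": "transactions_train",
}
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
for filename, table_name in files_to_tables.items():
    with open(filename, "rb") as f:
        load_job = client.load_table_from_file(
            f, dataset_ref.table(table_name), job_config=job_config
        )
    load_job.result()  # blocks until the load finishes; the transactions table takes a while
```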
Now, again, that can take some time, so I'm not going to run it here, but it will take a little while to actually run. Once we have that 00:20:41.220 |
set up, we need to move on to actually building our graph in Kumo. Okay. So as we saw briefly before, 00:20:49.620 |
this is what the dataset looks like. And based on that, this is how we're going to be connecting 00:20:53.700 |
everything. So we have our customers table. We connect that via customer ID to the transactions table. 00:21:00.020 |
And then the transactions table is connected to the articles table via the article ID. Okay. 00:21:08.020 |
So let's go ahead and define these tables and define these relationships. Okay. We'll run through this. 00:21:14.580 |
So first we can just check the tables that we currently have. So I have quite a few tables in 00:21:22.100 |
here already. You won't see all of these. So you should only have customers, articles, and transactions 00:21:27.860 |
train. Those should be the ones that you see. All these other ones, like all these prediction ones here, 00:21:32.740 |
are generated by Kumo later. So at the end, you will have some of these, not all of them, but you 00:21:39.060 |
will have most of those. But right now, just the three. So we want to connect to each of the source 00:21:45.300 |
tables that we should have, so customers, articles, transactions_train, and we do so like this. We 00:21:50.820 |
just index into the connector. It's like a dictionary lookup there. So we have articles, customers, transactions 00:21:57.940 |
train, just the table names. Okay. And what that does is it creates within the Kumo space, it sets 00:22:07.140 |
up the source tables. A source table is not a table within Kumo. It's a table elsewhere. Okay. So it's a, 00:22:15.060 |
you know, it's literally like, okay, this is a source of my data. That's what the source table is. 00:22:20.340 |
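In code, that is just indexing the connector by table name; a tiny sketch, assuming the table names created above.

```python
# Source tables point at the underlying BigQuery tables; nothing is copied yet.
articles_src = connector["articles"]
customers_src = connector["customers"]
transactions_src = connector["transactions_train"]
```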
Now we can view these source tables with a pandas DataFrame type interface here. So we use head and 00:22:30.100 |
we'll see the top five records in our table here. So this is looking at the articles table, and we see product 00:22:36.980 |
code, product name, the number, the type, some more descriptive things. There's quite a lot of useful 00:22:45.940 |
information in here. Okay. As you can see, and again, this is the articles 00:22:52.340 |
dataset, so roughly about a hundred thousand records in there. Now looking at this, we should be able to see, 00:23:06.340 |
okay, all the columns that we're going to be using here. So you see the very first one here, 00:23:13.380 |
article ID. So we know the articles can be connected or we will see in a minute that the articles can be 00:23:20.100 |
connected via this to the transaction table. So, okay. We have that. Let's come down. 00:23:25.940 |
Let's take a look at our customers table. We'll just look at the first two records here. 00:23:31.540 |
Okay. So customers table, we see, we have customer ID. So that's what we're going to be using again. 00:23:38.500 |
Some other little bits of information here, uh, mainly actually this is empty. I think I'm pretty 00:23:44.580 |
sure I've seen age in a few of those. So yeah, I think maybe this is just a couple of bad examples. 00:23:51.460 |
Then moving on to the transactions source. So we can go through here. Let me run this. 00:23:59.940 |
So there isn't much information in the transactions table, but it's a lot of data. 00:24:05.620 |
So we can see the transaction date, the customer ID, so we can connect that to the customers table, 00:24:10.580 |
the article ID. So we can connect that to the articles table, the price, and then sales channel ID. So 00:24:19.060 |
we're going to connect all of those up. The way that we're going to do that is we're going to use this 00:24:24.020 |
Kumo AI table. So we're going to initialize the Kumo table and we're going to do that by pulling it from 00:24:31.380 |
a source table, an existing source table. So we'll have the articles table, customers table, and transactions 00:24:36.900 |
train table. The primary key for articles is going to be article ID. The primary key for customers is 00:24:43.140 |
going to be customer ID. And then for the transactions table, there isn't actually a primary key, but there 00:24:49.140 |
is a time column. So we do just highlight that. So there is a time column there, which is the transaction 00:24:54.580 |
date. So we can initialize that, and you can see that we've done this 00:25:01.380 |
infer metadata step. It's an automated thing from Kumo where it's going to look at your table and it's 00:25:06.980 |
going to infer the table schema for you. So I can come down here and I can look at the 00:25:12.900 |
metadata for my articles table, for example. Okay. So we have article ID, product code, 00:25:17.380 |
all the data types in there, whether a column is the primary key or something else. So a lot of cool stuff in there. 00:25:24.180 |
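As a rough sketch, defining the Kumo tables from those source tables might look like this; the method and argument names follow the walkthrough but are assumptions, and t_dat is the transaction date column in the H&M data.

```python
# Build Kumo tables from the source tables, declaring keys and the time column.
articles = kumo.Table.from_source_table(
    source_table=articles_src, primary_key="article_id"
)
customers = kumo.Table.from_source_table(
    source_table=customers_src, primary_key="customer_id"
)
transactions = kumo.Table.from_source_table(
    source_table=transactions_src, time_column="t_dat"  # no primary key here
)

# Let Kumo infer column data types and semantics for each table.
for table in (articles, customers, transactions):
    table.infer_metadata()
```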
And we'll also be able to see all of these tables now in our Kumo dashboard. So I have, I have a few 00:25:30.660 |
in here. So let me scroll through. We're going to want to find the previous columns here. I think 00:25:39.060 |
mine would be the most recent one. So that would be may here, this one, this one, and this one, 00:25:45.540 |
and you can see some just useful information for each one of these columns. So let's go into maybe 00:25:50.740 |
article would be interesting. So we can see if we go to, let's say product name or product type name. 00:26:02.820 |
Cool. So you can see there's a lot of trousers. We have trousers, sweater, t-shirt, dress, you know, 00:26:10.500 |
so on and so on. Okay. There's some really useful information that you just go in and take a look 00:26:15.460 |
through those. So we have that. And now what we will also need to do is, okay, we have our tables, 00:26:21.060 |
but they're all kind of, they just exist in Kumo independently of one another at the moment. 00:26:25.940 |
Now we want to train our graph neural network on these. So we actually have to create a graph of 00:26:31.700 |
connections between each one of our tables. The way that we do that is we initialize this graph object. 00:26:38.260 |
We set the tables that are going to be within this single graph. Then we define how they are connected. 00:26:44.100 |
So we use these edges. The source table is transactions for both of these. So the 00:26:49.620 |
transactions table via the customer ID foreign key is going to connect to the customers table. Second 00:26:57.620 |
connection here. So again, transactions table via the article ID foreign key is going to go and connect 00:27:04.420 |
to the articles destination table. Okay. So we would run this, and then we just run the graph validate to 00:27:12.340 |
confirm that this is an actual valid graph that we're building. Okay. So everything went well then. 00:27:19.220 |
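A sketch of that graph definition; the class and field names are assumptions based on the walkthrough, and the important part is the two foreign-key edges.

```python
# Tie the three tables together into a single graph for the GNN.
graph = kumo.Graph(
    tables={
        "customers": customers,
        "articles": articles,
        "transactions": transactions,
    },
    edges=[
        # transactions.customer_id -> customers.customer_id
        dict(src_table="transactions", fkey="customer_id", dst_table="customers"),
        # transactions.article_id -> articles.article_id
        dict(src_table="transactions", fkey="article_id", dst_table="articles"),
    ],
)
graph.validate()  # confirm the keys line up and the graph is well-formed
```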
And yet again, we can go over into our, into our Kumo dashboard and it would be, I suppose my last 00:27:26.260 |
one here, which is May. Okay. I need to zoom out a little bit there. So we have, yeah, it's pretty 00:27:34.500 |
straightforward. So we have the transactions data here. We have our transactions date, time column there. 00:27:41.780 |
We have these two foreign keys. So the customer ID connects over to our primary key of the customer's table. 00:27:46.820 |
And then the article ID foreign key connects to the primary key, which is the article ID of the 00:27:53.060 |
articles table. Okay. So you can see those connections and you can click through if you want to, and then 00:27:57.940 |
see what is in those. We just did that. So I'm not going to, I'm not going to do it again. Okay. So we have 00:28:03.620 |
that. Another thing that you can do, if you want to visualize your graph within the notebook, 00:28:07.860 |
is you can install a few additional packages and actually use the graphviz package to 00:28:16.020 |
visualize that. I'm not going into that here because for Linux versus Mac, and I assume 00:28:22.100 |
Windows as well, you need to set it up in a different way. And you can see the graph in the Kumo UI anyway, so I 00:28:29.700 |
will just do that personally. It's up to you. So we've set everything up, right? We're at that 00:28:37.220 |
point now. This is almost like we have been through the data cleansing, the data pre-processing, 00:28:44.260 |
the data upload, you know, all of those steps as a data scientist. We've been through all those steps. 00:28:49.060 |
So now we're, now we're getting ready to start making some predictions or training our model 00:28:53.460 |
to make some predictions. Okay. That's great. We've, you know, we've done that. You know, 00:28:58.100 |
there is quite a lot going on there, but nothing beyond what we would have to do anyway as a data 00:29:03.220 |
scientist. So what we've done has really simplified quite a bit of work and condensed it into what we've 00:29:09.940 |
just done now. But now we need to go into the predictions. So Kumo uses what they call the 00:29:16.260 |
predictive query language or PQL. Now it's quite interesting. So PQL, you might guess, 00:29:22.900 |
is kind of like a SQL type syntax, which allows you to define your prediction parameters. Okay. 00:29:31.060 |
So rather than writing some neural network training code, you write this SQL-like predictive 00:29:39.060 |
query and Kumo is going to look at that, understand it, and train your GNN based on your PQL statement. 00:29:48.740 |
So let's start with our first use case that we described at the start, which is predicting the 00:29:55.220 |
customer value over the next 30 days. So the way that we do that in a PQL statement is like this. 00:30:04.580 |
So there's a few different components in PQL. So we have our target, which you can see here. 00:30:11.620 |
So the target is what follows predict here. So we have the predict statement. We're saying predict 00:30:18.100 |
whatever is within our target here. So this is the defined target. So what is our target here? Okay. We 00:30:24.500 |
have the sum of the transactions price over the next zero to 30 days into the future. And also we defined 00:30:34.500 |
days here. So that is our target. We're predicting the sum of the transactions price over the next 30 days, 00:30:41.460 |
but then we also have an entity. Okay. Which is who or what are we making this prediction for? 00:30:48.260 |
So here we're saying for each. So we're getting this predicted sum broken down for each individual 00:30:56.500 |
customer. So what we do by writing this query is we are getting the value of each customer over the next 00:31:08.260 |
30 days. So let's go ahead and implement that. We come down here. We use the Kumo AI predictive query. We pass 00:31:16.260 |
in our graph, and then we just write the PQL statement that I just showed you. Okay. So 00:31:22.100 |
predict the sum of transactions price for each customer based on the customer ID. We validate our 00:31:29.780 |
predictive query, or PQL statement, here. So let's run that. Okay. That is great. 00:31:40.980 |
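In code, the first predictive query is roughly the following. The PQL string is reconstructed from the description above and the Python class and method names are assumptions based on the walkthrough, so check both against Kumo's PQL reference and SDK docs.

```python
# Predict each customer's total spend over the next 30 days.
ltv_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT SUM(transactions.price, 0, 30, days) "
        "FOR EACH customers.customer_id"
    ),
)
ltv_query.validate()  # check the PQL against the graph before training
```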
Then we come down here and we ask Kumo to generate a model plan. So basically: okay, Kumo, based on our data here, 00:31:48.820 |
the volume of data, and the query that we've specified, what is the ideal approach for training 00:31:57.620 |
our model? And yeah, you can look at this. Okay. So we can see here that we're using mean absolute 00:32:04.580 |
error, mean squared error, and root mean squared error as the evaluation metrics. The tuning metric 00:32:11.940 |
is mean absolute error. We have network pruning; there's no extra processing there. 00:32:20.980 |
We have the sampling and the optimization here. So it's using the Huber loss function to optimize for the regression. 00:32:29.140 |
Then we have the number of epochs to work over, the validation steps, test steps, 00:32:34.420 |
the learning rate, weight decay. I mean, I'm not going to go through all of this, 00:32:41.060 |
but there's a ton of stuff in here. We can actually see the graph network architecture 00:32:46.820 |
here, which is probably interesting for a few of us. And yeah, just a ton of stuff in there. So you can, 00:32:53.540 |
yeah, you have Kumo telling you what it's going to do, how it's going to train your model. So 00:32:58.260 |
if you're, you know, I'm not well versed in GNNs, but if you are, you can take a look at that and, 00:33:04.820 |
and make sure everything makes sense according to how you understand them. But of course, 00:33:09.860 |
as I said, Kumo was literally co-founded by one of the co-authors of the GNN paper. So they have 00:33:18.180 |
they have some pretty talented people working there. So that should be some pretty optimal 00:33:24.660 |
parameters there. So once we're happy with that, what we're going to do is we're going to run this 00:33:28.340 |
trainer object. So we use the Kumo AI Trainer with the model plan, and then we run it with trainer.fit. Okay. 00:33:35.620 |
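A sketch of the plan-and-train step; the method names (suggest_model_plan, generate_training_table, status, tracking_url) follow the flow of the walkthrough but are assumptions, so verify them against the SDK docs.

```python
# Ask Kumo to propose a model plan for this query, then kick off training.
model_plan = ltv_query.suggest_model_plan()
trainer = kumo.Trainer(model_plan)

# Kumo builds the training table from the PQL statement, then fits the GNN.
train_table = ltv_query.generate_training_table(non_blocking=True)
training_job = trainer.fit(
    graph=graph,
    train_table=train_table,
    non_blocking=True,  # return as soon as the job is launched
)

print(training_job.status())       # e.g. running / done
print(training_job.tracking_url)   # open the training job in the Kumo UI
```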
And what this is going to do, okay, let me run this. What this is going to do is it's just going to 00:33:42.420 |
initialize the training job. Now we're going to see this cell finish quite quickly because we've set 00:33:48.580 |
non-blocking equal to true. So it's going to, this is going to go to Kumo. It's going to start, 00:33:54.260 |
like say, okay, I want this training job to start running. Once it has confirmation that the training 00:33:59.940 |
job is running, it's going to come back to us and allow us to move on with whatever we're doing. 00:34:06.100 |
But the training job will not be complete for quite a while. I think as I've been going through this, 00:34:12.100 |
the time has varied, but I would say somewhere between 40 minutes to an hour for, for a training 00:34:19.300 |
job here, but you can run multiple training jobs in parallel. So we have three predictions that we'd 00:34:25.140 |
like to make here. So I'm going to run all those at the same time. Okay. We've got this one back now. 00:34:30.660 |
So you can as well. If you want to just start running these all now, just 00:34:36.340 |
run the next few cells in the notebook and then come back, and I will talk you through what the 00:34:42.420 |
other use case PQL statements are doing. So we can check our status for the training job. We'll see that 00:34:50.900 |
it's running. You can also click the tracking URL here, and this will actually open up the training 00:34:57.220 |
job. So we can see how things are going in the UI if we want to. So coming back, let's move on to the 00:35:03.220 |
second use case, which is these personalized product recommendations. This is one I personally, like I 00:35:08.580 |
would actually be very likely to use with a lot of projects I currently work on, which is obviously more 00:35:15.460 |
like AI, conversational AI, building chatbots, or just AI interfaces in general. 00:35:22.980 |
The reason that I could see myself using this is let's say you have the H&M website. I don't know 00:35:29.540 |
what's on the H&M website, but let's say they have a chatbot and you can log in. You could log in and 00:35:34.580 |
you could talk to this chatbot or not even talk. It doesn't have to be a chatbot. It can just be 00:35:41.620 |
that you log into the website and the website is going to surface some products based on what 00:35:50.500 |
we do here, based on these personalized product recommendations that we build with Kumo. It can 00:35:55.060 |
surface those to the users when they log in, so that you are providing them with what they want before they 00:36:02.340 |
even need to go and search. What they want should just be there, 00:36:07.300 |
if possible, right? Obviously, you're not going to get it perfect all the time, but you'll probably 00:36:12.980 |
be able to do pretty well with this. So you can do that. You can also surface this, as I was originally 00:36:18.740 |
going to say, through a chatbot-like interface. You could tell a chatbot, hey, you have this customer 00:36:24.100 |
and you're talking to them. These are the sort of products that they seem most interested in, 00:36:28.260 |
you know, kind of place those into the conversation when you're able to, when it makes sense. 00:36:34.100 |
So that's another thing that you could do. There are many ways that you could use this. 00:36:37.540 |
So this is a little more of a sophisticated prediction here. The reason I say this is a 00:36:43.380 |
little more sophisticated is because we have a filter here. So we've added an additional item to our 00:36:49.860 |
target and entity setup. So we have target, entity, and now we also have a filter. So the target PQL is pretty similar. 00:36:57.060 |
Okay. The operator is different. So we're not summing anymore; we're actually listing the distinct 00:37:01.860 |
articles that, over the next 30 days, we expect to appear in the customer's transactions. 00:37:12.500 |
Okay. So this is a top 10 and what it's saying by, okay, what are the top 10? 00:37:18.180 |
These are like the top 10 predictions. So will the customer buy this or not? Okay. Will this customer ID 00:37:29.220 |
appear in the transactions table alongside a particular article ID 00:37:35.540 |
in the next 30 days? That is what we're predicting here. And then we're filtering for the top 10 predictions, 00:37:43.700 |
because otherwise, if we don't filter here, we're going to be looking at, what was it? Like 1.3 million 00:37:50.180 |
customer IDs, unique customer IDs, against 100,000 products. And we're making predictions for all of 00:38:00.580 |
those, right? That could be a very large number. Okay. So what we're saying is, 00:38:06.180 |
okay, just give me the top 10, the top 10 most probable purchases for each customer. 00:38:14.180 |
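Reconstructed from the description above, the ranking query looks roughly like this; the exact PQL keywords (LIST_DISTINCT, RANK TOP 10) are assumptions, so check them against Kumo's PQL reference.

```python
# Top-10 product recommendations per customer for the next 30 days.
rec_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT LIST_DISTINCT(transactions.article_id, 0, 30, days) "
        "RANK TOP 10 "
        "FOR EACH customers.customer_id"
    ),
)
rec_query.validate()
```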
So we would run that again, same as before; nothing new here. So we're just modifying the PQL 00:38:20.500 |
statement. So we run that, we validate it. And we're just going to check again. Okay. We get the model 00:38:28.500 |
plan from Kumo here, and then we'll just start training with purchase trainer fit. Okay. So that's 00:38:37.700 |
going to run. Again, as before, we will be able to check the status with this. Of course, we'll just have 00:38:43.380 |
to wait for that other cell to run. Now, final use case here. I want to look at predicting the purchase 00:38:49.380 |
volume for our customers. So in this scenario, it's kind of similar. So we're looking at a count of 00:38:55.620 |
transactions this time over the next zero to 30 days that generates our target. Again, we're looking for 00:39:01.860 |
each customer. Okay. But we're adding a filter here. So what is this filter doing? 00:39:08.420 |
This filter is looking at, if you look here at the range, we've got minus 30 days up to 00:39:15.300 |
zero days. So this is looking at the past 30 days, and it's saying, okay, 00:39:24.020 |
let's just filter where the number of transactions, so the count of transactions for each customer ID over 00:39:30.900 |
the past 30 days is greater than zero. So what does that mean? That's saying, just do this prediction 00:39:40.420 |
for customers that have purchased something in the previous 30 days. What this does is it just reduces 00:39:46.820 |
the scope of the number of predictions that we have to make by focusing only on active customers. 00:39:53.220 |
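That query, reconstructed from the description above; the trailing-window WHERE clause is the new part, and the exact PQL syntax is an assumption to verify against Kumo's PQL reference.

```python
# Predicted number of transactions over the next 30 days, but only for
# customers who made at least one purchase in the previous 30 days.
volume_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT COUNT(transactions.*, 0, 30, days) "
        "FOR EACH customers.customer_id "
        "WHERE COUNT(transactions.*, -30, 0, days) > 0"
    ),
)
volume_query.validate()
```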
So ideally we should be able to get a faster prediction out of this because, you know, within the 00:40:00.900 |
dataset there are naturally probably a lot of customers that are just inactive, that we're probably not 00:40:05.460 |
going to get much information for, but if we don't add that filter, we're still going to be making 00:40:09.460 |
predictions for those customers. So you can do this across the other examples as well, 00:40:16.420 |
just to make sure we're focusing on the customers that we want to focus on. 00:40:23.140 |
So as before, we're just setting up the predictive query, validating it. We do the model plan and then 00:40:29.940 |
we fit that model plan. Same as before, no difference other than the query, of course. Okay. And once that 00:40:36.660 |
cell has finished up here, we can go here and just check the status of our jobs. I would expect them to 00:40:42.820 |
either be running or queued as they are here. So I'm going to go and leave these to run and this one as 00:40:49.940 |
well and jump back in when they're ready and show you what we can actually do with these queries. Okay. 00:40:57.540 |
So we're back and the jobs have now finished. So we can see done in all of these. We're going to switch 00:41:03.620 |
over to the browser as well. And you can see in here that these are done. So training complete. This 00:41:10.580 |
one took an hour, 20 minutes. So it was pretty long to be fair, but you can see, yeah, you can see the 00:41:18.900 |
various ones here. This one here, 60 minutes, pretty quick. I would imagine that is the one where we have 00:41:25.300 |
the filter. Yeah. You see here we're filtering for the like active customers only. And yeah, the duration 00:41:35.860 |
for that one is noticeably shorter, which makes sense, of course. So that's great. We can just jump 00:41:44.660 |
through and look at how we can use those predictions now. Okay. So to make predictions, 00:41:49.780 |
we're going to come down to the next cell here. I've just added this check to confirm that the status 00:41:56.500 |
is done, which it is, we've already checked, but just in case. Then we're going to use trainer.predict. 00:42:03.140 |
So the first trainer, if we come up here to where we actually initialized it, it's this one here. 00:42:09.540 |
So that first one is this PQL statement right here. Okay. So it's predicting, essentially, the 00:42:18.900 |
value of the customer over the next 30 days. So let's go ahead and run that prediction. And what 00:42:25.860 |
this is going to do is actually create a table in BigQuery. You can see I've put the output table name 00:42:31.700 |
here. So it's going to create this table. Okay. So once we've run that, the table will have been created. 00:42:38.340 |
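A sketch of the prediction call that writes results back to BigQuery; the method and argument names here (generate_prediction_table, output_connector, output_table_name) are assumptions based on the walkthrough, and the output table name is a placeholder.

```python
# Generate predictions for the customer-value query and write them to BigQuery.
pred_table = ltv_query.generate_prediction_table(non_blocking=True)
ltv_predictions = trainer.predict(
    graph=graph,
    prediction_table=pred_table,
    output_connector=connector,          # write back through the BigQuery connector
    output_table_name="customer_value",  # Kumo writes to customer_value_predictions
    non_blocking=False,                  # wait for the prediction job to finish
)
```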
Now, the other thing that we should be aware of is that for our second query, this one here, 00:42:47.300 |
we are ranking the top 10, right? So this is a ranking prediction. And that means that we can have a varying 00:43:00.260 |
number of predictions per entity. Okay. So in that case, we also need to include the number of classes 00:43:09.460 |
to return. Okay. So we don't have to stick to 10. We said top 10 before, but we could 00:43:19.380 |
change this to, like, 30 if we wanted to. But yeah, we're sticking with 10 here. So, yep, that's 00:43:27.700 |
just a nice parameter that we can set if we want a different number 00:43:33.860 |
of predictions to be returned there. And if we come down to the next one, we have the transactions 00:43:40.660 |
prediction. So that is looking at the number of transactions for our customers over the next 30 00:43:47.540 |
days. So we run all of those. Then we can actually go in and see what we have from our data. 00:43:56.500 |
So first one is the customer value predictions. So who will generate the most revenue for us of our 00:44:03.620 |
customers? Okay. So when we, we specified before, this was the table name, the output table name, 00:44:11.540 |
Kumo will by default add _predictions to the end of that. 00:44:16.260 |
So yeah, just be aware that it does change the table name. But then, yeah, we can see this. 00:44:24.980 |
Now it's worth noting that the head that we have here is actually showing the 00:44:31.620 |
lowest predictions, right? And this is a regression task. So essentially it's saying that for all of these 00:44:36.180 |
customers, over the next 30 days, the sum of their transactions will be zero. Okay. And because 00:44:43.140 |
it's regression, it goes slightly into the negative, right? But essentially just view this as being zero. 00:44:49.700 |
That's our prediction. Now, as I said, this is like the tail end. These are all 00:44:56.820 |
the lowest predictions in the head here. So we actually want to reverse this. Kumo doesn't 00:45:02.980 |
give us a way to write tail like you would with pandas DataFrames. So 00:45:08.020 |
instead we can actually use BigQuery directly to order by the largest values and just get the top five 00:45:17.860 |
from there. So this is what we're doing here. We write that as a SQL query. So we're selecting all from 00:45:24.980 |
our dataset. We're going for this table. So this is the new table that we have from here, and we're ordering 00:45:32.020 |
by the target prediction. Okay. So this number here and that is descending. So we have the highest values 00:45:39.220 |
at the top. So this is going to give us who should be our top five most valuable customers. 00:45:44.340 |
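The query is plain SQL run through the BigQuery client; something like the following, reusing client, PROJECT_ID, and DATASET_ID from the earlier snippets. The prediction table and column names (customer_value_predictions, ENTITY, TARGET_PRED) are assumptions based on the narration.

```python
sql = f"""
SELECT *
FROM `{PROJECT_ID}.{DATASET_ID}.customer_value_predictions`
ORDER BY TARGET_PRED DESC   -- highest predicted 30-day spend first
LIMIT 5
"""
top_by_value = client.query(sql).to_dataframe()
print(top_by_value)
```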
So let's take a look. Okay. And we have these, right? These are high numbers. So 00:45:52.580 |
the entity here is our customer ID. So we'll be able to use this to map back to the actual customers table 00:46:00.820 |
pretty soon. So now we have who will most likely be our most valuable customers, which is a great 00:46:07.540 |
thing to be able to predict. Now let's have a look at what we think these people will buy. 00:46:12.980 |
Okay. So we're going to look at our purchase predictions, just looking at the head again here. 00:46:18.660 |
Again, we could just use a BigQuery query and go directly through that if we need to. But here we can see, 00:46:24.820 |
okay, for this customer, they are very likely to buy this product here. Okay. And it's a pretty high 00:46:32.980 |
score there. It's cool. So now let's have a look at transaction volume. We're going to bring all this 00:46:38.900 |
together. We're just looking at each table now quickly. We're going to bring all this together in 00:46:42.740 |
a moment. Okay. Transactions table: how active will they be? And again, these are very small numbers. 00:46:50.340 |
So we use BigQuery to look at the largest values there. Okay. So these are, again, these are transactions. 00:46:57.220 |
Again, this would be the customer ID here. And this is how many transactions we 00:47:05.060 |
actually expect them to make in the next month. So 20 transactions for the first one here. Okay. So all 00:47:10.820 |
that is great, but how can we bring all this together? So what I want to do now is look at the next 00:47:16.980 |
month's most valuable customers and join that back to our actual customers table. And then I want to see 00:47:23.380 |
what those customers are most likely to buy the actual products. And then again, focusing on those 00:47:32.500 |
customers, see how many products we think they will buy. So let's do that. First, we're going to find 00:47:38.820 |
next month's most valuable customers. How do we do this? So to identify our most valuable customers, 00:47:44.020 |
we're going to be doing a join between the sum-of-transactions predictions table and our customers table. 00:47:53.620 |
We're going to be joining those two tables based on the transaction predictions' 00:48:00.020 |
entity values, which are just the customer IDs from our predictions, and joining those to the actual 00:48:07.220 |
customer IDs from the original customers table. Okay. So that is what we have 00:48:15.060 |
there. We're also limiting this to the top 30 highest scores. So you can see we're ordering by the target 00:48:20.900 |
prediction and looking at the top 30 of those. So basically we're filtering for the top 30 predicted most valuable customers. 00:48:27.540 |
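As a sketch, that join in BigQuery SQL might look like the following, again reusing client, PROJECT_ID, and DATASET_ID from earlier; the prediction table and column names (ENTITY, TARGET_PRED) are assumptions based on the narration.

```python
sql = f"""
SELECT c.*, p.TARGET_PRED AS predicted_value
FROM `{PROJECT_ID}.{DATASET_ID}.customer_value_predictions` AS p
JOIN `{PROJECT_ID}.{DATASET_ID}.customers` AS c
  ON p.ENTITY = c.customer_id   -- the prediction entity is the customer ID
ORDER BY p.TARGET_PRED DESC     -- most valuable first
LIMIT 30                        -- top 30 predicted most valuable customers
"""
top_customers = client.query(sql).to_dataframe()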
So I will run that. Let's see what we get. We can convert it to a DataFrame 00:48:34.740 |
and just view that. So we have, okay, we have the customer ID now. So these are the top 30 predicted 00:48:40.260 |
most valuable customers. So we can come through here. We can see, okay, we have all the ages here, mostly 00:48:47.700 |
young twenties buying all the clothes, of course, with this random 37-year-old over here. 00:48:53.860 |
Okay. And then we can see their scores. Okay. So that looks pretty good. You can see that they're all 00:49:02.100 |
club members, whatever that means. I assume they must have some sort of membership club. 00:49:08.500 |
So generally, okay, that looks pretty good. So we have our top customers. We did a top 30 here, but right 00:49:16.500 |
now we're just looking at the top five most valuable customers. Now let's have a look at what 00:49:20.820 |
those customers are going to buy. So come down to here, we now need to be joining our customers table, 00:49:28.980 |
which is here, to the purchase predictions table based on 00:49:37.780 |
the prediction entity, which is the customer ID. So the customer ID is joining the 00:49:46.100 |
customers table with the purchase predictions table. Then the purchase predictions table also includes a 00:49:51.700 |
class column, I think it was, which is the article ID or product ID. So we're going to connect those. 00:49:58.660 |
So it's the prediction's class. We're joining that to the articles table via the article ID. And that is 00:50:07.220 |
going to give us our customer's most likely purchases. And we're actually going to focus this. So we're 00:50:13.460 |
going to filter down to a specific customer, which is our top rated customer from the top customers 00:50:20.100 |
table, which we created above. 00:50:27.780 |
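A sketch of that second join; here the prediction table's ENTITY, CLASS, and SCORE column names are assumptions based on the narration, and purchase_predictions is a placeholder for whatever output table name you used.

```python
# Take our single top predicted customer from the previous query.
top_customer_id = top_customers["customer_id"].iloc[0]

sql = f"""
SELECT c.customer_id, a.prod_name, a.product_type_name, p.SCORE
FROM `{PROJECT_ID}.{DATASET_ID}.purchase_predictions` AS p
JOIN `{PROJECT_ID}.{DATASET_ID}.customers` AS c
  ON p.ENTITY = c.customer_id   -- prediction entity -> customer
JOIN `{PROJECT_ID}.{DATASET_ID}.articles` AS a
  ON p.CLASS = a.article_id     -- predicted class -> article
WHERE c.customer_id = '{top_customer_id}'
ORDER BY p.SCORE DESC
"""
likely_purchases = client.query(sql).to_dataframe()
```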
So let's run that and see what we get. Okay, I've got a few here. The customer ID is just the same for all of these, since we're 00:50:36.340 |
looking at a single customer, and we can see what they are interested in buying. Okay, so product name, 00:50:42.980 |
magic. So some magic dress that they have. Then they have this Ellie Goodard dress, another dress, 00:50:52.340 |
a dress, some heeled sandals, a shirt, a dress again, some joggers, 00:51:01.780 |
dress, dress, dress, dress. They really like dresses, I think, from the looks of it. So these are the 00:51:10.100 |
products that, when this user next logs into the website, or if they're talking to a, you know, 00:51:16.260 |
H&M chatbot or anything like that, or we're sending emails out, these are the products that we should 00:51:22.340 |
send and just surface to that user. So like, hey, look, these are some nice things that we think you 00:51:27.860 |
probably might like. So hey, look at these. That's pretty cool. Now let's continue and take a look at 00:51:34.980 |
the purchase volume. So we're now looking at the valuable customers. So this query here, 00:51:43.780 |
let me come back up and show you what that query actually is. So the valuable customers query is this 00:51:49.620 |
one here. All right. So finding our top, was it top 30? Sorry, top 30 most valuable customers. 00:52:00.100 |
Yeah, we'll be joining our top 30 most valuable customers to the transaction predictions. 00:52:11.300 |
And what that is going to do is it's going to get us the predicted transaction volume for each one of 00:52:17.460 |
those top 30 customers. Okay. So that's what we're doing here. So let's run that and take a look at 00:52:23.540 |
what we get. Okay. So yeah, you can see so for our customers here, this is the expected transaction 00:52:31.700 |
volume. And yeah, we've worked through our analysis. So we've gathered a lot of different pieces of 00:52:38.260 |
information. And as I mentioned at the start there, this is the sort of thing that as a data scientist would 00:52:44.100 |
be, one, hard to do. Training the GNN, you need a lot of experience to do that and to do it 00:52:53.940 |
well. It's not impossible, but it's gonna be hard, and it's gonna be hard to do 00:53:00.340 |
properly. So Kumo is really good at just abstracting that out, which is really nice. And then the other 00:53:05.860 |
thing that I think is really, really cool here is that, okay, if you're a data scientist, maybe you'd 00:53:12.660 |
want to go and do this yourself, although you would save a ton of time and probably get better results 00:53:18.660 |
doing this, you know, it's up to you. But the other thing is that this means that not only data scientists 00:53:27.060 |
can do this, right? So especially for me as a more generally scoped engineer, 00:53:35.220 |
I want to build products, I want to bring in analytics on the data that we're receiving in 00:53:42.180 |
those products. And usually for me to do that, okay, we can do some data science stuff, but the results, 00:53:50.500 |
one, it's gonna take a long time for me to do it. And two, the results probably won't be that great. 00:53:54.900 |
With this, one, I will have the time to use Kumo to do that analysis. And two, it will actually 00:54:04.740 |
be a good analysis, unlike what it would probably be like if I did it myself. So I get to have 00:54:11.460 |
like very fast development and also get world-class results, which is amazing. So yeah, incredible 00:54:21.300 |
service in my opinion. This is just one of the services that Kumo offers. There is another 00:54:26.980 |
one that I will be looking into soon, which is their relational foundation model, or Kumo RFM. That is 00:54:34.260 |
something I'm also quite excited about. So we'll be looking at that soon. But yeah, this is the full 00:54:40.900 |
introduction and walkthrough for building out your own data science pipeline for recommendations on this 00:54:50.260 |
pretty cool e-commerce data set. But that's it for now. So thank you very much for watching. I hope all 00:54:56.740 |
this has been useful and interesting. But for now, I will see you again in the next one. Thanks. Bye.