Back to Index

Data Science as a Service | Kumo AI Full Walkthrough


Chapters

0:00 Kumo AI and GNNs
7:39 Kumo Setup
12:17 Kumo Connectors
14:45 Getting Data into BigQuery
20:39 Building the Graph in Kumo
28:34 Predictive Query Language (PQL)
35:01 Personalized Product Recommendations
38:44 Predicting Purchase Volume
41:44 Making Predictions with Kumo
52:36 When to use Kumo

Transcript

Today we are going to be doing a full end-to-end walkthrough of Kumo. Now, as a quick introduction, Kumo is almost data science as a service: it allows us to simplify a lot of what we as data scientists would be doing in an analytics use case.

So it's best I give you an example. Let's say we are an e-commerce platform, you are a data scientist at that e-commerce platform, and your goal, given your current historical data, is to, one, predict the lifetime value of a particular customer, two, generate personalized product recommendations for those customers, and three, forecast purchase behaviors.

So in the next 30 days, what is this customer most likely to purchase, and in what quantities? So as a data scientist, if you're going to go and do that, it is a fairly complicated process, which will take a bit of time. And the reason for that is this type of data set.

So let me show you what this type of dataset might look like. You could be looking at something like this: a customers table, a transactions table, and an articles or products table. And the data we're actually going to be using is structured exactly like this.

We're using the H&M e-commerce dataset, and that dataset has these three tables. The customers table has 1.3 million records, so that's 1.3 million customers. And what you're going to need to do is connect that customers data over to your transactions data, so you're going to have your customer ID connecting those two here.

In here you're going to have the transaction date and the price of that transaction, so that's going to be pretty useful information when it comes to making all these predictions. On the other side, within the transactions table we don't necessarily have the actual article or product information; that's going to be stored over here in the articles table.

So you'd connect these two. And in here, you'd have your product name, the type of product it is. You'd have like the color of the product, a natural language description. There are also in this data set images that you can attach to that. We're not going to be using them, but there's a lot of information in there.

So you're going to have something like this, and your job as a data scientist is to take this data, which is pretty huge, and transform it into business predictions that engineers, marketing, leadership, whoever, can then go and act upon. So it can be a pretty hard task. And when you have all these different connections between many different tables, one of the best model architectures for doing this is graph neural networks.

And the reason that graph neural networks are good at this is because of the relationships between different tables. Graph neural networks are very good at mapping out and understanding those relationships. You get a better grasp of network effects. So in this scenario, that is how different customer preferences may influence other customer preferences.

Because of course, these customers are not acting in isolation; they're all part of a broader world. So you have those network effects, which graph neural networks can handle better than many other predictive models. You can also model temporal dynamics, which is essentially a fancy way of saying you can model predictions over time.

So around Christmas time, for example, if you have data going all the way back to previous Christmases and the years before, there's probably going to be a relatively obvious pickup in purchasing volume, but also in purchasing different things, especially when you're looking at, okay, what should I be recommending to a customer?

You know, it's summer. Should I be recommending them a big fluffy coat for like Arctic exploration? Probably not. But should I be recommending them swim shorts, sunglasses, you know, these sort of things? Probably, right? And a graph neural network, if you give it enough data, will be able to do that.

And another problem that you'll see, and you'll see this across many disciplines, not just recommendation, is the cold start problem. The cold start problem refers to when you have a new customer that has just come in and you don't have any information about them.

Or you have very little information about them. You might know whether they are male or female, their age, their geography. And what a graph neural network can do based on that is say, okay, looking at this baseline bit of information, who are some other customers that seem similar?

And based on that limited amount of information, let's start making some predictions anyway. Maybe we'll get some wrong, maybe we'll get some right, but that is far better than just giving up, which is what the cold start problem really is: not having enough information to start giving out at least reasonable recommendations.

With this, these recommendations are not going to be as good as if we had more information for that customer. But with graph neural networks, they're probably going to be better than most other methods. So why do I even care about graph neural networks right now? Well, that's what Kumo is.

Kumo is a service that abstracts away a lot of the complexity when it comes to getting our data, parsing it, doing the data exploration, data pre-processing and cleansing. Kumo will handle all of that for us. And then it will also handle the training of our graph neural network, and then making predictions with that graph neural network.

So altogether that means that something which might take you a month as a data scientist to go through and do, maybe more or less time than that depending on the project and your competency level, you can actually do in a few hours.

Which is still some time, but in comparison it's pretty fast. And the level of expertise that has been put into building Kumo is very good, I would say probably better than the average data scientist, maybe even better than a pretty stellar data scientist. So the quality of predictions you're going to get out of this is probably going to be better than trying to do it yourself, maybe not in all cases, but in many. And in any case, it's so much faster that this would seem like the way to go.

And this is particularly exciting for someone like me who, yeah, I did data science, you know, the start of my career, but I've mostly moved away from that. And I do much more general engineering work, obviously a lot of AI engineering work, but I'm not so much in the like model training and straight up data science space anymore.

So being able to make these sort of predictions and, and integrate that with some products or services that I'm building, that's pretty cool. So enough of an introduction here, let's actually start putting together our end to end workflow for Kumo. So we're going to be working through this notebook here.

You can run this either locally or in Google Colab, but I'll be running it locally, and you can find a link to this notebook in the comments below. Let's start by connecting to Kumo in the first place. So you will need to set up an API key.

There will be a link for signing up in the comments below. Or if you already have an account, you probably know where your API key is. For me, I can go into my Kumo workspace here. I can go to admin and I can get my API key here on the right.

Okay, I would have to reset and generate a new API key if needed. I'm not going to do that; I already have mine. Alternatively, if you don't see this, at least the way that I first got my API key was via email. So wherever you can find your API key, go for that.

Okay. So once you have your API key, you should put it inside the KUMO_API_KEY environment variable, and I'll just pull that in here with a getenv on KUMO_API_KEY. Or, if you'd rather paste it straight into the notebook, it will fall back to a getpass prompt here.

Okay. You should see that this successfully initializes the SDK. Then you can come down here, and there are multiple ways of getting your data into Kumo. Generally speaking, with these sorts of projects you're going to be using a lot of data; it's usually a big data thing.
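As a rough sketch, that setup cell looks something like the following; the environment variable name and the workspace URL placeholder are assumptions here, and the init call follows the pattern in the Kumo SDK docs:

```python
import os
from getpass import getpass

import kumoai as kumo

# Read the API key from the environment, or prompt for it if it isn't set.
api_key = os.getenv("KUMO_API_KEY") or getpass("Kumo API key: ")

# Initialize the SDK against your Kumo workspace; the URL is workspace-specific.
kumo.init(url="https://<your-workspace>.kumoai.cloud/api", api_key=api_key)
```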

So Kumo does integrate with a few different infrastructure providers for data. You can also just upload from local, but in this example we're going to be using BigQuery. You can also use S3, Snowflake, Databricks, and I think a few others as well, but I'm going to be focusing on BigQuery just because I'm most comfortable with GCP. Again, it doesn't really matter, as long as you have access to one of those.

So I'm going into my GCP console and over to BigQuery here. I've already created a couple of tables in here, but if you haven't, you don't need to do anything here yet. You will need to go over to your IAM controls, though, because what we need to do is create a service account that is going to give Kumo access to our BigQuery.

And you can see the permissions that we're going to need in here: BigQuery Data Viewer, Filtered Data Viewer, Metadata Viewer, Read Session User, User, and Data Editor. We need all of these on our service account, so let's go ahead and create those. Well, I've already done it, so you can see in here I have my service account.

It has all those permissions. Now, once you have created your service account, again, just another thing. You call it whatever you want. You don't need to call it Kumo or anything like that. That's just the name I gave it here. So I'm going to go over the service accounts.

I have my service account here. And what I'm going to do is just come over to keys and you would have to add a key here. So you'd create a new key. I've already created mine. So I'll just show you how you create a new one. So you do JSON, you click create, and then you'll want to save this in the directory that you're running this Kumo example from.

And make sure you name it to match the credentials filename the notebook expects, which here is kumo-gcp-creds.json. Now, the reason we need to set up this access as I've just described is fairly simple: we're going to be using GCP and BigQuery over here as the source of truth for our data. So all of our data is going into Google BigQuery.

So that's the customers data, transactions data, and articles data. All of that is going into BigQuery. Okay. Kumo needs to be able to read and write to BigQuery. So we set up our service account, which is, you know, sort of thing that I've kind of highlighted here. We set up our service account, give that to Kumo, and then Kumo can read the source data from GCP.

And when we make predictions later on, we're going to be writing those predictions to a table over in GCP. Okay, so that's why we need to make this connection. The next thing we want to do, after we've set up those credentials, is create our connector to BigQuery.

So there's a few items to set up here. We have the name for our connector, which I have set to Kumo intro live. We have the project ID. So this is the GCP project ID. So you can see mine is up here, right? I have this Aurelio advocacy project.

So just make sure that is aligned. And then we will also want to set a unique dataset ID for the dataset that we're going to create. And this is going to be used to read and write the dataset that you have over in BigQuery. Which, by the way, if you don't already have a dataset in there, we are going to go and create that.

But right now you can see it in here. So I have this Aurelio HM; that's the H&M dataset over here. Again, if you don't already have your dataset in BigQuery, it's not an issue; I'm going to show you how to do that. So that's the setup.
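For reference, the connector setup described here looks roughly like the sketch below. The class name and argument names are assumptions based on the values mentioned in the walkthrough (connector name, GCP project ID, dataset ID, and the service-account key), so check the Kumo SDK docs for the exact signature:

```python
import json

# Load the service-account key we downloaded from GCP earlier.
with open("kumo-gcp-creds.json") as f:
    bigquery_credentials = json.load(f)

# Assumed constructor: a BigQuery connector pointing Kumo at our project + dataset.
connector = kumo.BigQueryConnector(
    name="kumo_intro_live",            # any name you like for the connector
    project_id="<your-gcp-project>",   # the GCP project shown in the console
    dataset_id="<your-dataset-id>",    # the BigQuery dataset holding the H&M tables
    credentials=bigquery_credentials,  # service-account key with the BigQuery roles above
)
```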

We read in our credentials file here, and then we use that to initialize our BigQuery connector with Kumo; this comes from the kumoai SDK. Now, let me change the dataset ID here to show you something: I'm going to change this to two, and I also need to change this to two very quickly.

So we'll change this. Okay. So now when I come down here and try to view our tables, it's not going to like me. I'll run this and it's going to throw an error: I've got a 500, actually a 404, which, if we come over here, is not found.

It's Aurelio advocacy, Aurelio HM 2. Okay. Now, I don't want to go and upload everything to BigQuery again, because it takes a bit of time, so I'm actually just going to drop that and switch back to the dataset we created earlier. And now that I've connected to a dataset that actually does have my data, if I run this, it will not throw an error, right?

So yeah, it just connected, no errors. That's because the data now exists. But of course, if you're following this through for the first time, you don't have that data already. So let's take a look at how we get that data and then put it into BigQuery. So as I mentioned, we're going to be using the H&M dataset.

The H&M dataset is a real world dataset with 33 million transactions, 1.3 million customers in the customers table and over a hundred thousand products or articles. So it's a pretty big dataset, but that is similar to the sort of scale that you might see yourself. If you're a data scientist working in this space, this is the sort of thing that you would see in a production recommendation system.

So we're going to come down here, and I just want to start with where we download this dataset from. We're going to be pulling it from Kaggle, which is really the main place to get it. I think there was a copy of it on Hugging Face as well, but I don't want to use that because I don't think it's an official copy.

So we're going to use Kaggle. Now to download data from Kaggle, you do need an account. Okay. Slightly annoying, but it's fine. So you just sign in or register, you know, whichever. And once you've signed in, you should see something kind of like this. What you need to do is go over to your settings, scroll down and you want to go ahead and create a new token.

Now you can download this wherever you want. I would recommend, okay, just download it into the directory that you're running the notebook from. So I'm going to go and do that as well. And that will download the Kaggle.json file. Now, once you've done that, you're going to want to move that Kaggle.json file.

So I'm going to move kaggle.json to ~/.kaggle/kaggle.json; this is on Mac. Now, when I import kaggle here, the Kaggle package is going to read my kaggle.json credentials from there and actually allow me to authenticate. Otherwise it will throw an error.

So you, you will see that. Okay. If you try and run this, it will throw an error if you haven't set that up correctly. So for this specific dataset, we will need to approve their terms and conditions, which we can do by, we can just find the dataset quickly.

So H&M, under, not datasets, sorry, competitions, and we have this H&M Personalized Fashion Recommendations competition. This is the one we're going to be working with. Around here somewhere there'll be a little notice telling you that you need to accept the terms and conditions to use the data.

So you need to go and click that. Once you've found it and clicked it, you will be able to download the competition files using this method here. Okay. So we're pulling this from H and M personalized fashion recommendations datasets. Now that can take quite a while to run. I'm not going to, I'm not going to download it myself because I already have it locally, but once you do have it, everything will be in a zip file that looks like this.
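Assuming the standard Kaggle API client and the competition slug shown on the competition page, the download-and-extract step looks something like this:

```python
import zipfile

import kaggle  # authenticates using ~/.kaggle/kaggle.json on import

# Download the competition bundle (several GB, so this takes a while).
kaggle.api.competition_download_files(
    "h-and-m-personalized-fashion-recommendations", path="."
)

# Pull only the CSVs out of the zip; we ignore the images and everything else.
with zipfile.ZipFile("h-and-m-personalized-fashion-recommendations.zip") as zf:
    for name in zf.namelist():
        if name.endswith(".csv"):
            zf.extract(name, path="data")
```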

So we need to extract our data out from that zip file, and we're only going to be looking for the CSV files. There are a lot of other files in there, a lot of images and everything, but we're just looking at the CSVs. So we pull those out, and if I run this bit, you can see that this is the sort of data that we have.

We're not interested in the sample submission dataset; we're just looking at these first three: customers, articles, and transactions_train.csv. And now we need to take our data from our local device and push it into BigQuery. To do that, we're going to work with BigQuery directly.

So from Google, we're importing BigQuery, importing service account, which was how we authenticate ourselves. We have our credentials file path. This is what we got before from the service account in GCP. So we create a credentials object using service account credentials. And then we use that to initialize our BigQuery client.

And again, this is using the Aurelio advocacy project, so this is the project within GCP that we're using. So we're going to run that. Then what we're going to want to do is use our dataset ID; we actually defined this earlier, so we probably shouldn't define it again here.

So we have our dataset ID, and we're going to use that to create a dataset reference. So in GCP, with our authenticated client, we're saying: connect to this dataset object, and this is the dataset ID. This will work even if the dataset doesn't exist, because we first try to get the dataset, and that's going to throw an error if it doesn't exist.

So we catch that error and say, okay, the dataset doesn't exist, now we're going to go ahead and create it, which is exactly what we're doing here. So I'm going to run this; for me, it's going to say the dataset already exists because I've already created that dataset.
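Here's a minimal sketch of that get-or-create logic with the official BigQuery client; the project, dataset, and credentials-file names are placeholders:

```python
from google.cloud import bigquery
from google.oauth2 import service_account
from google.api_core.exceptions import NotFound

PROJECT_ID = "<your-gcp-project>"
DATASET_ID = "<your-dataset-id>"

# Authenticate with the same service-account key we created for Kumo.
credentials = service_account.Credentials.from_service_account_file("kumo-gcp-creds.json")
client = bigquery.Client(credentials=credentials, project=PROJECT_ID)

# Try to fetch the dataset; if it doesn't exist yet, create it.
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)
try:
    dataset = client.get_dataset(dataset_ref)
    print("Dataset already exists")
except NotFound:
    dataset = bigquery.Dataset(dataset_ref)
    dataset.location = "US"  # pick the location that matches your project
    dataset = client.create_dataset(dataset)
    print("Created dataset")
```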

But if this is your first time running this, your dataset will be empty and will have just been created. With that in place, you need to go through the files, which are the first three files here: customers, articles, and transactions. This is essentially setting up each table and then pushing the data over to that table.
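Continuing with the client, PROJECT_ID, and DATASET_ID from the sketch above, the upload itself can be done with BigQuery load jobs, roughly like this (the table names here are assumptions matching the CSV filenames):

```python
# Map each CSV to the BigQuery table we want it loaded into.
files_to_tables = {
    "data/customers.csv": "customers",
    "data/articles.csv": "articles",
    "data/transactions_train.csv": "transactions_train",
}

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery infer the schema
)

for path, table_name in files_to_tables.items():
    table_id = f"{PROJECT_ID}.{DATASET_ID}.{table_name}"
    with open(path, "rb") as f:
        load_job = client.load_table_from_file(f, table_id, job_config=job_config)
    load_job.result()  # block until this table has finished loading
    print(f"Loaded {table_name}: {client.get_table(table_id).num_rows} rows")
```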

Okay. Now, again, that can take some time, so I'm not going to run that, but this will take a little bit of time to actually run. Once we have that set up, we need to move on to actually building our graph in Kumo. Okay. So as we saw briefly before, this is what the dataset looks like.

And based on that, this is how we're going to be connecting everything. So we have our customers table. We connect that via customer ID to the transactions table. And then the transactions table is connected to the articles table via the article ID. Okay. So let's go ahead and define these tables and define these relationships.

Okay. We'll run through this. So first we can just check the tables that we currently have. So I have quite a few tables in here already. You won't see all of these. So you should only have customers, articles, and transactions train. Those should be the ones that you see.

All these other ones, like all these prediction ones here, are generated by Kumo later. So at the end, you will have some of these, not all of them, but you will have most of those. But right now, just a three. So we want to connect to each of the source tables that we should have.

So customers, articles, transactions, train, and we do so like this. So we just do connector. It's like a dictionary lookup there. So we have articles, customers, transactions, train, just the table names. Okay. And what that does is it creates within the Kumo space, it sets up the source tables.

A source table is not a table within Kumo; it's a table that lives elsewhere. It's literally saying, okay, this is a source of my data. That's what the source table is. Now we can view these source tables with a pandas-DataFrame-type interface here.
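In code, that dictionary-style lookup and the DataFrame-like preview look roughly like this; the method names are assumptions based on the behaviour described here:

```python
# Point Kumo at the three source tables that live in BigQuery.
articles_src = connector["articles"]
customers_src = connector["customers"]
transactions_src = connector["transactions_train"]

# Preview them much like a pandas DataFrame.
print(articles_src.head())     # first five rows of the articles table
print(customers_src.columns)   # column names Kumo sees for customers
```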

So we use head and we'll see the top five records in our table here. So this is looking at the articles table and see product code, product name, the number, the type, uh, some more descriptive things. There's quite a lot of useful information in here. Okay. As you can, as you can see, and again, this is, this is the articles data set.

So roughly about a hundred thousand records in there. Now looking at this, we should be able to see columns here. My bad. Okay. And we can see all, all the columns that we're going to be using here. So you see very first one here, article ID. So we know the articles can be connected or we will see in a minute that the articles can be connected via this to the transaction table.

So, okay. We have that. Let's come down. Let's take a look at our customer's table. We'll just look at the first two records here. Okay. So customers table, we see, we have customer ID. So that's what we're going to be using again. Some other little bits of information here, uh, mainly actually this is empty.

I think I'm pretty sure I've seen age in a few of those. So yeah, I think maybe this is just a couple of bad examples. Then moving on to the transactions source. So we can go through here. Let me run this. So there isn't much information in the transaction table, but it's a big, it's a lot of data.

So we can see the transaction date, the customer, so we can connect that to the customer's table, the article ID. So we can connect that to the articles table, the price, and then sales channel ID. So we're going to connect all of those up. The way that we're going to do that is we're going to use this Kumo AI table.

So we're going to initialize the Kumo table and we're going to do that by pulling it from a source table, an existing source table. So we'll have the articles table, customers table, and transactions train table. The primary key for articles is going to be article ID. The primary key for customers is going to be customer ID.

And then for the transactions there isn't actually a primary key, but there is a time column, so we just highlight that; the time column is the transaction date. So we can initialize those, and if we have a look, you can see that we've run this infer metadata step.
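Based on the pattern described here (and Kumo's SDK quickstart), the table definitions look roughly like this; treat the exact method and argument names, and the H&M column names, as assumptions to verify against your own schema:

```python
# Wrap each source table as a Kumo table, declaring keys and the time column,
# and let Kumo infer the rest of the schema.
articles = kumo.Table.from_source_table(
    source_table=articles_src,
    primary_key="article_id",
).infer_metadata()

customers = kumo.Table.from_source_table(
    source_table=customers_src,
    primary_key="customer_id",
).infer_metadata()

# Transactions have no primary key, but they do have a timestamp column.
transactions = kumo.Table.from_source_table(
    source_table=transactions_src,
    time_column="t_dat",  # the transaction date column in the H&M data
).infer_metadata()

print(articles.metadata)  # inferred column types, keys, etc.
```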

That's, it's an automated thing from Kumo where it's going to look at your table and it's going to infer the table schema for you. So I can come down here and I can look at the metadata for my articles table, for example. Okay. So we have article ID, product code, it's all data types in there, whether it's primary key, something else.

So a lot of cool stuff in there. And we'll also be able to see all of these tables now in our Kumo dashboard. I have a few in here, so let me scroll through; we want to find the tables we just created, and I think mine would be the most recent ones.

So that would be may here, this one, this one, and this one, and you can see some just useful information for each one of these columns. So let's go into maybe article would be interesting. So we can see if we go to, let's say product name or product type name.

Cool. So you can see there's a lot of trousers. We have trousers, sweater, t-shirt, dress, you know, so on and so on. Okay. There's some really useful information that you just go in and take a look through those. So we have that. And now what we will also need to do is, okay, we have our tables, but they're all kind of, they just exist in Kumo independently of one another at the moment.

Now we want to train our graph neural network on these. So we actually have to create a graph of connections between each one of our tables. The way that we do that is we initialize this graph object. We set the tables that are going to be within this single graph.

Then we define how they are connected, using these edges. The source table for both of these is transactions: the transactions table, via the customer ID foreign key, is going to connect to the customers table; and in the second connection here, the transactions table, via the article ID foreign key, is going to connect to the articles destination table.
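A sketch of that graph definition, with the edges named after what's described above; the argument layout follows Kumo's SDK pattern, so treat it as an assumption:

```python
# Assemble the three tables into one graph, with transactions as the fact table
# linking out to customers and articles via its two foreign keys. Note we
# register the transactions_train source under the shorter name "transactions".
graph = kumo.Graph(
    tables={
        "articles": articles,
        "customers": customers,
        "transactions": transactions,
    },
    edges=[
        dict(src_table="transactions", fkey="customer_id", dst_table="customers"),
        dict(src_table="transactions", fkey="article_id", dst_table="articles"),
    ],
)

graph.validate()  # confirm the keys line up and the graph is well formed
```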

Okay. So we would run this, and then we just run a validate on the graph to confirm that this is an actual valid graph that we're building. Okay, so everything went well there. And yet again, we can go over into our Kumo dashboard, and it would be, I suppose, my last one here, which is the May one.

Okay. I need to zoom out a little bit there. So we have, yeah, it's pretty straightforward. So we have the transactions data here. We have our transactions date, time column there. We have these two foreign keys. So the customer ID connects over to our primary key of the customer's table.

And then the article ID foreign key connects to the primary key, which is the article ID of the articles table. Okay. So you can see those connections and you can click through if you want to, and then see what is in those. We just did that. So I'm not going to, I'm not going to do it again.

Okay. So we have that. Another thing that you can do, if you want to visualize your graph within the notebook, is install a few additional packages and actually use the graphviz package to visualize it. I'm not going into that here because for Linux versus Mac, and I assume Windows as well, you need to set it up in a different way.

Since you can see the graph in the Kumo UI, I will just do that personally; it's up to you. So we've set everything up, right? We're at that point now. This is almost like we have been through the data cleansing, the data pre-processing, the data upload, all of those steps we'd normally go through as a data scientist.

So now we're, now we're getting ready to start making some predictions or training our model to make some predictions. Okay. That's great. We've, you know, we've done that. You know, there is quite a lot going on there, but nothing beyond what we would have to do anyway as a data scientist.

So what we've done has really simplified quite a bit of work and condensed it into what we've just done now. But now we need to get into the predictions. Kumo uses what they call the Predictive Query Language, or PQL. Now, it's quite interesting: PQL, as you might guess, is kind of a SQL-type syntax which allows you to define your prediction parameters.

Okay. So rather than writing some neural network training code, you write this SQL-like predictive query, and Kumo is going to look at that, understand it, and train your GNN based on your PQL statement. So let's start with our first use case that we described at the start, which is predicting the customer value over the next 30 days.

So the way that we do that in a PQL statement is like this. So there's a few different components in PQL. So we have our target, which you can see here. So the target is what follows predict here. So we have the predict statement. We're saying predict whatever is within our target here.

So this is the defined target. So what is our target here? Okay. We have the sum of the transactions price over the next zero to 30 days into the future. And also we defined days here. So that is our target. We're predicting the sum of the transactions price over the next 30 days, but then we also have an entity.

Okay. Which is who or what are we making this prediction for? So here we're saying for each. So we're getting this sum of predictions broken down for each individual customer. So what we do by writing this query is we are getting the value of each customer over the next 30 days.

So let's go ahead and implement that. We come down here, we use the Kumo predictive query, we pass in our graph, and then we just write the PQL statement I showed you: predict the sum of the transactions price for each customer, based on the customer ID.
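Here's approximately what that query looks like; the table and column names match the graph sketched above, and the exact PQL syntax (including the time-window arguments) should be checked against Kumo's docs:

```python
# Predict each customer's total spend over the next 30 days.
value_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT SUM(transactions.price, 0, 30, days) "
        "FOR EACH customers.customer_id"
    ),
)
value_query.validate()
```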

We validate our predictive query, our PQL statement, here, so let's run that. Okay, that is great. Then we come down here and we ask Kumo to generate a model plan. Basically: okay, Kumo, based on our data here, the volume of data, and the query that we've specified, what is the ideal approach for training our model?

And yeah, you can look through this. We can see here that we're using mean absolute error, mean squared error, and root mean squared error as the training metrics, and the tuning metric is mean absolute error. We have network pruning, and no extra processing there.

We have the sampling and the optimization here: it's using the Huber loss function to optimize for the regression. Then we have the number of epochs to work over, the validation steps, test steps, the learning rate, weight decay. I'm not going to go through all of this, but there's a ton of stuff in here.

We can actually see the graph network architecture here, which is probably interesting for a few of us. And yeah, just a ton of stuff in there. So you have Kumo telling you what it's going to do and how it's going to train your model. I'm not especially well versed in GNNs, but if you are, you can take a look at that and make sure everything makes sense according to how you understand them.

But of course, as I said, Kumo was literally co-founded by one of the co-authors of foundational GNN papers, so they have some pretty talented people working there, and those should be some pretty well-chosen parameters. So once we're happy with that, what we're going to do is run this trainer object.

So we set up the Kumo trainer with the model plan, and then we run it with trainer.fit. Okay, and what this is going to do, let me run this, is initialize the training job. Now, we're going to see this cell finish quite quickly because we've set non_blocking equal to true.
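Continuing from the value_query defined above, the training kickoff follows roughly this pattern; it's based on Kumo's SDK quickstart, so treat the call signatures as assumptions:

```python
# Ask Kumo for a suggested model plan, then launch training without blocking.
model_plan = value_query.suggest_model_plan()
trainer = kumo.Trainer(model_plan)

training_job = trainer.fit(
    graph=graph,
    train_table=value_query.generate_training_table(non_blocking=True),
    non_blocking=True,  # return as soon as the job is accepted by Kumo
)

print(training_job.status())      # e.g. running / done
print(training_job.tracking_url)  # open the job in the Kumo UI
```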

So this is going to go to Kumo and say, okay, I want this training job to start running. Once it has confirmation that the training job is running, it comes back to us and allows us to move on with whatever we're doing.

But the training job will not be complete for quite a while. I think as I've been going through this, the time has varied, but I would say somewhere between 40 minutes to an hour for, for a training job here, but you can run multiple training jobs in parallel. So we have three predictions that we'd like to make here.

So I'm going to run all those at the same time. Okay. We've got this one back now. So you can as well, if you don't, if you want to just start running these all now, you just, just run the next few cells in the notebook and then come back and I will talk you through what the other use case PQL statements are doing.

So we can check our status for the training job. We'll see that it's running. You can also click the tracking URL here, and this will actually open up the training job. So we can see how things are going in the UI if we want to. So coming back, let's move on to the second use case, which is these personalized product recommendations.

This is one I personally, like I would actually be very likely to use with a lot of projects I currently work on, which is obviously more like AI, conversational AI, building chatbots, or just AI interfaces in general. The reason that I could see myself using this is let's say you have the H&M website.

I don't know what's on the H&M website, but let's say they have a chatbot and you can log in. You could log in and talk to this chatbot, or it doesn't even have to be a chatbot; it can just be that you log into the website and the website surfaces some products based on what we do here, based on these personalized product recommendations that we build with Kumo.

It can surface those to users when they log in, so that you are providing them with what they want before they even need to go and search; it should just be there, if possible, right? Obviously you're not going to get it perfect all the time, but you'll probably be able to do pretty well with this.

So you can do that. You can also surface this, as I was originally going to say, through a chatbot-like interface. You could tell a chatbot, hey, you have this customer and you're talking to them. These are the sort of products that they seem most interested in, you know, kind of place those into the conversation when you're able to, when it makes sense.

So that's another thing that you could do. There are many ways that you could use this. So this is a little more of a sophisticated prediction here. The reason I say this is a little more sophisticated is because we have a filter here. So we've added an additional item to our target entity.

So we have a target and an entity, and now we also have this filter. The target PQL is pretty similar, but the operator is different: we're not summing anymore, we're actually listing the distinct articles that, over the next 30 days, we expect to appear in the customer's transactions. And this is a top 10, so what is it saying by "top 10"?

These are the top 10 predictions of: will the customer buy this or not? Will this customer ID appear in the transactions table alongside a particular article ID in the next 30 days? That is what we're predicting here. And then we're keeping only the top 10 predictions, because otherwise, if we don't limit it here, what would we be looking at?

Something like 1.3 million unique customer IDs against 100,000 products, and we'd be making predictions for all of those combinations, which would be a very large number. So what we're saying is: just give me the top 10, the 10 most probable purchases for each customer.
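In PQL, the recommendation query reads roughly like this; again, the exact syntax should be checked against Kumo's docs, and the RANK TOP clause is what limits us to ten articles per customer:

```python
# For each customer, predict the ten articles most likely to show up in their
# transactions over the next 30 days.
purchase_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT LIST_DISTINCT(transactions.article_id, 0, 30, days) "
        "RANK TOP 10 "
        "FOR EACH customers.customer_id"
    ),
)
purchase_query.validate()
```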

So we would run that again, same as before, nothing, nothing new here. So we're just modifying the PQL statement. So we run that, we validate it. And we're just going to check again. Okay. We get the model plan from Kumo here, and then we'll just start training with purchase trainer fit.

Okay. So that's going to run. Again, as before, we will be able to check the status with this; of course, we'll just have to wait for that other cell to run. Now, the final use case here: I want to look at predicting the purchase volume for our customers. In this scenario, it's kind of similar.

So we're looking at a count of transactions this time, over the next zero to 30 days; that gives us our target. Again, we're predicting for each customer. But we're also adding a filter here. So what is this filter doing? If you look here at the range, we've got minus 30 days up to zero days.

So this is looking at the past 30 days, and it's saying: only keep customers where the number of transactions, so the count of transactions for each customer ID over the past 30 days, is greater than zero. So what does that mean? It means: only do this prediction for customers that have purchased something in the previous 30 days.

What this does is reduce the number of predictions we have to make by focusing only on active customers. So ideally we should be able to get a faster prediction out of this, because naturally there are probably a lot of customers in this dataset that are just inactive.

We're probably not going to get much information for those, but if we don't add that filter, we're still going to be making predictions for them. You can apply this to the other examples as well, just to make sure we're focusing on the customers that we actually want to focus on.
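That query, with the active-customer filter, looks roughly like this in PQL (syntax and column names as above, so treat it as a sketch):

```python
# Predict how many transactions each *active* customer will make in the next
# 30 days, where "active" means at least one purchase in the previous 30 days.
volume_query = kumo.PredictiveQuery(
    graph=graph,
    query=(
        "PREDICT COUNT(transactions.*, 0, 30, days) "
        "FOR EACH customers.customer_id "
        "WHERE COUNT(transactions.*, -30, 0, days) > 0"
    ),
)
volume_query.validate()
```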

So as before, we're just setting up the predictive query, validating it. We do the model plan and then we fit that model plan. Same as before, no difference other than the query, of course. Okay. And once that cell has finished up here, we can go here and just check the status of our jobs.

I would expect them to either be running or queued as they are here. So I'm going to go and leave these to run and this one as well and jump back in when they're ready and show you what we can actually do with these queries. Okay. So we're back and the jobs have now finished.

So we can see done in all of these. We're going to switch over to the browser as well. And you can see in here that these are done. So training complete. This one took an hour, 20 minutes. So it was pretty long to be fair, but you can see, yeah, you can see the various ones here.

This one here, 60 minutes, pretty quick. I would imagine that is the one where we have the filter. Yeah. You see here we're filtering for the like active customers only. And yeah, the duration for that one is noticeably shorter, which makes sense, of course. So that's great. We can just jump through and look at the, look at how we can use those predictions now.

Okay. So to make predictions, we're going to come down to the next cell here. I've just added this check to confirm that the status is done, which it is, we've already checked, but just in case. Then we're going to use trainer.predict. The first trainer, if we come up here to where we actually initialized it, is this one here.

So that first one is this PQL statement right here, predicting essentially the value of the customer over the next 30 days. So let's go ahead and run that prediction. And what this is going to do is actually create a table in BigQuery; you can see I've put the output table name here.
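The batch prediction call then looks something like the sketch below; the output arguments (connector, table name, output types) are assumptions based on what's described here, so check the SDK docs before relying on them:

```python
# Run batch prediction for the customer-value model and write the results
# back to BigQuery through the same connector.
prediction_job = trainer.predict(
    graph=graph,
    prediction_table=value_query.generate_prediction_table(non_blocking=True),
    output_types={"predictions"},
    output_connector=connector,
    output_table_name="customer_value",  # Kumo appends "_predictions" to this
    non_blocking=False,  # wait for the predictions to finish writing
)
```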

So it's going to create this table. Okay, so once we've run that, this table will have been created. Now, the other thing that we should be aware of is that for our second query, this one here, we are ranking the top 10, right? So this is a ranking prediction.

And that means that we can have a varying number of predictions per entity, so in that case we also need to include the number of classes to return. We don't have to stick to 10: we said top 10 before, but we could change this to 30 if we wanted to.

Right. But yeah, we're sticking with 10 here. So that's just a nice parameter that we can set if we want a different number of predictions to be returned there. And if we come down to the next one, we have the transactions prediction.

So that is looking at the number of transactions we expect from our customers over the next 30 days. So we run all of those, and then we can actually go in and see what we have in our data. The first one is the customer value predictions: which of our customers will generate the most revenue for us?

Okay. So when we specified the output table name before, this was the table name, and Kumo will by default add _predictions to the end of that, so just be aware that it changes the table name. But then, yeah, we can see this.

Now, it's worth noting that the head we're looking at here is actually the lowest predictions, right? And this is a regression task, so essentially it's saying that, for all of these customers, the sum of their transactions over the next 30 days will be zero.

Okay. And because it's regression, it goes slightly into the negative, right? But essentially just view these as being zero; that's our prediction. Now, as I said, these are all the tail end, the lowest predictions, showing up in the head here, so we actually want to reverse this.

Kumo doesn't give us a way to do that; it doesn't allow us to call tail like you would with a pandas DataFrame. So instead we can use BigQuery directly to order by the largest values and just get the top five from there. So this is what we're doing here.

We write that as a SQL query: we're selecting everything from our dataset, from this table, which is the new table that we have from here, and we're ordering by the target prediction, so this number here, descending, so we have the highest values at the top.
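With the BigQuery client we already have, that's just standard SQL; the prediction table's column names here (ENTITY for the customer ID, TARGET_PRED for the predicted value) and the table name are assumptions based on what's shown on screen:

```python
sql = f"""
SELECT ENTITY, TARGET_PRED
FROM `{PROJECT_ID}.{DATASET_ID}.customer_value_predictions`
ORDER BY TARGET_PRED DESC   -- highest predicted 30-day spend first
LIMIT 5
"""
top_value_customers = client.query(sql).to_dataframe()
print(top_value_customers)
```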

So this is going to give us who should be our top five most valuable customers, so let's take a look. Okay, and we have these, and these are high numbers. The entity here is our customer ID, so we'll be able to use this to map back to the actual customers table pretty soon.

So now we have who will most likely be our most valuable customers, which is a great thing to be able to predict. Now let's have a look at what we think these people will buy. Okay, so we're going to look at our purchase predictions, just looking at the head again here.

Again, we can just use a big query and go directly through that if we need to. But here we can see, okay, for this customer, they are like very likely to buy this product here. Okay. And it's pretty high score there. It's cool. So now let's have a look at transaction volume.

We're going to bring all this together. We're just looking at each table now quickly. We're going to bring all this together in a moment. Okay. Transaction table. How active would they be again? And again, these are very small numbers. So we use BigQuery to look at the largest values there.

Okay, so these are, again, transactions: this would be the customer ID here, and this is how many transactions we actually expect them to make in the next month, so 20 transactions for the first one here. Okay, so all of that is great, but how can we bring it all together?

So what I want to do now is look at the next month's most valuable customers and join that back to our actual customers table. And then I want to see what those customers are most likely to buy the actual products. And then again, focusing on those customers, see how many products we think they will buy.

So let's do that. First, we're going to find next month's most valuable customers. How do we do this? To identify our most valuable customers, we're going to do a join between the sum-of-transactions predictions table and our customers table, joining those two tables on the entity values from the predictions, which are just customer IDs, and the actual customer IDs from the original customers table.

Okay, so that is what we have there. We're also limiting this to the top 30 highest scores: you can see we're ordering by the target prediction and taking the top 30 of those, so basically filtering for the top 30 predicted most valuable customers. So I will run that.
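That join can be expressed roughly like this, reusing the client from earlier; the table and column names mirror the assumptions above:

```python
sql = f"""
SELECT c.*, p.TARGET_PRED AS predicted_value
FROM `{PROJECT_ID}.{DATASET_ID}.customer_value_predictions` AS p
JOIN `{PROJECT_ID}.{DATASET_ID}.customers` AS c
  ON p.ENTITY = c.customer_id   -- prediction entity is the customer ID
ORDER BY p.TARGET_PRED DESC     -- most valuable predicted customers first
LIMIT 30
"""
top_customers = client.query(sql).to_dataframe()
```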

Let's see what we get; we can convert it to a DataFrame and just view that. So we have the customer IDs now; these are the top 30 predicted most valuable customers. We can come through here and see all the ages: mostly people in their young twenties buying all the clothes, of course, with this random 37-year-old over here.

Okay, and then we can see their scores, so that looks pretty good. You can see that they're all club members, whatever that means; I assume they must have some sort of membership club. So generally, okay, that looks pretty good. We have the top 30 here, but right now we're just looking at the top five of the most valuable customers.

Now let's have a look at what those customers are going to buy. So come down to here, we now need to be joining our customers table, which is here, join our customers table to the purchase predictions, predictions table based on the prediction entity. Okay. Which is the customer ID.

So the customer ID is joining the customers table with the purchase predictions table. Then the purchase predictions table also includes, I think it was, the class column, which is the article ID, or product ID, so we're going to connect those: we join the prediction's class to the articles table via the article ID.

And that is going to give us our customers most likely purchases. And we're actually going to focus this. So we're going to filter down to a specific customer, which is our top rated customer from the top customers table, which we create here. So we're going to be looking at this person here.

So let's run that and see what we get. Okay, I've got a few here. So customer ID, this is just the same for all these people are looking at single customer, and we can see what they are interested in buying. Okay, so product name, magic. So some some magic dress that they have.

Then they have this Ellie Goodard dress, another dress, a dress, some heeled sandals, a shirt, a dress again, some joggers, then dress, dress, dress, dress; they really like dresses, I think, from the looks of it. So these are the products that, when this user next logs into the website, or if they're talking to an H&M chatbot or anything like that, or we're sending emails out.

These are the products that we should send and just surface to that user. So like, hey, look, these are some nice things that we think you probably might like. So hey, look at these. That's pretty cool. Now let's continue and take a look at the purchase volume. So we're now looking at the valuable customers.

So this query here, let me let me come back up and show you what that query actually is. So the valuable customers is this query here. All right. So finding out our top was it top 30. Sorry. So top 30 most valuable customers. So we're going to be joining to that table.

Yeah, we'll be joining our top 30 most valuable customers to the transaction predictions. And what that is going to do is it's going to get us the predicted transaction volume for each one of those top 30 customers. Okay. So that's what we're doing here. So let's run that and take a look at what we get.

Okay. So yeah, you can see, for our customers here, this is the expected transaction volume. And with that, we've worked through our analysis; we've gathered a lot of different pieces of information. As I mentioned at the start, this is the sort of thing that, as a data scientist, would be hard to do: training the GNN alone is something you need a lot of experience to do, and to do well.

It's not impossible, but it's going to be hard, and hard to do properly. So Kumo is really good at just abstracting that out, which is really nice. And then the other thing that I think is really cool here is that, okay, if you're a data scientist, maybe you'd still want to go and do this yourself, although you would save a ton of time and probably get better results doing it this way; it's up to you.

But the other thing is that this means not only data scientists can do this, right? Especially for me as a more generally scoped engineer: I want to build products, and I want to bring in analytics on the data that we're receiving in those products. And usually, for me to do that, okay, I can do some data science stuff, but one, it's going to take a long time for me to do it.

And two, the results probably won't be that great. With this, one, I will have the time to use Kumo to do that analysis, and two, it will actually be a good analysis, unlike what it would probably be if I did it myself. So I get very fast development and also world-class results, which is amazing.

So yeah, an incredible service in my opinion. And this is just one of the services that Kumo offers; there is another one that I will be looking into soon, which is their relational foundation model, or Kumo RFM. That is something I'm also quite excited about, so we'll be looking at that soon.

But yeah, this is the full introduction and walkthrough for building out your own data science pipeline for recommendations on this pretty cool e-commerce data set. But that's it for now. So thank you very much for watching. I hope all this has been useful and interesting. But for now, I will see you again in the next one.

Thanks, bye.