Powering your Copilot for Data - with Artem Keydunov from Cube.dev
Chapters
0:00 Introductions
1:35 History of Statsbot - Slack bot for querying data in Slack
4:45 Building Cube to power Statsbot due to limitations in natural language processing at the time
6:50 Open sourcing Cube as a standalone product
8:34 Explaining the concept of a semantic layer and OLAP cubes
10:27 Using semantic layers to provide context to AI models
11:54 Challenges of using tabular vs. language data with AI models
13:11 Workflow of natural language to SQL query using semantic layer
16:01 Ensuring AI agents have proper data context and make correct queries
18:20 Treating metrics definitions in the semantic layer as a codebase with collaboration
22:55 Natural language capabilities becoming a commodity baseline for BI tools
24:37 Recommendations for building data-driven AI applications
28:26 Predictions on the consolidation of modern data stack tools/companies
30:14 AI assistance augmenting but not fully automating data workflows
34:20 Using external Python scripts to handle limitations of models with math
36:15 Embedded analytics challenges and natural language commoditization
39:04 Lightning round
This is Swyx, writer and editor of Latent Space 00:00:17.540 |
And today we have Artem Keydunov on the podcast, 00:00:32.640 |
And Cube is actually a spin out of his previous company, 00:00:37.040 |
And this kind of feels like going both backward 00:00:42.020 |
So the premise of Statsbot was having a Slack bot 00:00:54.480 |
and you see startups trying to do that today. 00:01:02.140 |
And Cube then evolved from an embedded analytics product 00:01:09.580 |
I think you have over 16,000 stars on GitHub today. 00:01:13.480 |
You have a very active open source community. 00:01:24.080 |
You know, what got you interested in like text to SQL 00:01:27.200 |
and what were some of the limitations that you saw then, 00:01:30.960 |
the limitations that you're also seeing today 00:01:45.320 |
based off my initial project that I did at a company 00:01:59.880 |
you know, like Slack apps, chatbots in general. 00:02:04.420 |
like another wave of, you know, bots and all of that. 00:02:10.760 |
So we were like living through one of those waves. 00:02:18.440 |
from the different places where the data lives to Slack. 00:02:26.400 |
you know, maybe some marketing data, Google Analytics, 00:02:31.400 |
like production databases or even Salesforce sometimes. 00:02:52.680 |
and people started to use it even before Slack came up 00:03:15.040 |
So they featured me on this application directory front page, 00:03:21.640 |
It was a lot of fun, I think, you know, like, 00:03:26.600 |
in terms of how you can process natural language, 00:03:29.260 |
because the original idea was to let people ask questions 00:03:37.340 |
like opportunities closed last week or something like that. 00:03:49.360 |
but it was, you know, we didn't have LLMs, right, 00:03:55.880 |
especially the systems that can kind of, you know, 00:04:03.520 |
and like it was a lot of hit and miss, right? 00:04:05.320 |
If you know how to construct a query in natural language, 00:04:28.680 |
So right now I'm kind of bullish that, you know, 00:04:32.860 |
probably would have a much better shot at it, 00:04:41.380 |
as we were working on Statsbot because we needed it. 00:04:50.560 |
like natural language understanding models? 00:04:53.160 |
Like were there open source models that were good 00:04:57.080 |
- I think it was mostly combination of a bunch of things, 00:05:02.140 |
The first version which I built was, like, regexps. 00:05:13.520 |
and I had a natural language parsing tool thing, 00:05:27.000 |
And he started to like do some stuff that was like, 00:05:33.800 |
And, you know, like he started to do like some models 00:05:38.280 |
like look at what we had on the market back then, 00:05:41.240 |
or, you know, like try to build a different sort of, 00:05:45.920 |
Again, we didn't have any foundation models in place back then, right? 00:05:48.680 |
We wanted to build something that, you know, like we, 00:05:51.260 |
okay, we wanted to try to use existing math, obviously, 00:05:54.200 |
right, but it was not something that we can take the model 00:05:58.940 |
I think in 2019, we started to see more like of stuff, 00:06:07.160 |
like resulted in all this LLM, like what we have right now. 00:06:10.120 |
But back then in 2016, it was not much, you know, 00:06:13.920 |
like available for just the people to build on top. 00:06:18.280 |
kind of been happening, but it was like very, very early, 00:06:21.680 |
you know, like for something to actually be able to use. 00:06:28.160 |
which was started just as an open source project. 00:06:30.480 |
And I think I remember going on a walk with you 00:06:33.120 |
in San Mateo in like 2020, something like that. 00:06:36.440 |
And you were like, you have people reaching out to you 00:06:38.840 |
who are like, "Hey, we use Kube in production." 00:06:49.760 |
- We built Cube at Statsbot because we needed it. 00:07:07.680 |
okay, people wanted to get active opportunities, right? 00:07:16.320 |
you always, you know, like try to reduce everything down 00:07:19.160 |
to the sort of, you know, like a multidimensional framework. 00:07:24.080 |
And that's where, you know, like it didn't really work well 00:07:39.680 |
So we built a framework where you would be able 00:07:42.760 |
to map your data into this concept, into these metrics. 00:07:49.440 |
they were bringing their own datasets, right? 00:07:52.040 |
And the big question was, how do we tell the system 00:07:55.520 |
what is active opportunities for that specific user? 00:07:58.080 |
How we kind of, you know, like provide that context, 00:08:10.860 |
But at some point we saw people started to see more value 00:08:24.540 |
it feels like it might be a standalone product 00:08:31.440 |
So we took it out of Statsbot and open sourced it. 00:08:39.240 |
The concept of a cube is not something that you invented. 00:08:42.280 |
I think, you know, not everyone has the same background 00:08:44.440 |
in analytics and data that all three of us do. 00:08:47.520 |
Maybe you want to explain, like, OLAP cubes, hypercubes, 00:08:50.320 |
you know, anything, whatever, the brief history of cubes. 00:08:56.560 |
I'll try, you know, like there's a lot of like 00:08:58.980 |
Wikipedia pages and, like, a lot of blog posts 00:09:08.180 |
so when we think about just a table in a database, 00:09:10.860 |
the problem with the table is that it's not multidimensional, 00:09:13.740 |
meaning that in many cases, if we want to slice the data, 00:09:17.420 |
we kind of need to end up with a different table, right? 00:09:20.900 |
Like think about when you're writing a SQL query 00:09:29.300 |
Then you write to answer a different question, 00:09:37.100 |
bring all these tables together into a multidimensional table. 00:09:43.520 |
So it's just like the way that we can have measures 00:09:46.340 |
and dimensions that can potentially be kind of, you know, 00:09:49.040 |
like used at the same time from different angles. 00:09:58.220 |
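A minimal sketch of that measures-and-dimensions idea in pandas (illustrative only; the table, columns, and numbers are made up, and this is not Cube's data model or API):

```python
# One flat fact table; "measures" aggregate, "dimensions" slice.
import pandas as pd

orders = pd.DataFrame({
    "status": ["closed", "closed", "open", "closed"],
    "region": ["US", "EU", "US", "EU"],
    "amount": [100, 250, 80, 40],
})

# The same measure (total amount) viewed from different angles,
# without hand-writing a new SQL query for every slice.
by_status = orders.groupby("status")["amount"].sum()
by_region = orders.groupby("region")["amount"].sum()
by_both = orders.pivot_table(values="amount", index="status",
                             columns="region", aggfunc="sum")
print(by_status, by_region, by_both, sep="\n\n")
```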
but you recently released a LangChain integration. 00:10:01.640 |
There's obviously more and more interest in, again, 00:10:07.340 |
So you've seen the ChatGPT Code Interpreter, 00:10:09.740 |
which has been renamed to, like, Advanced Data Analysis. 00:10:18.460 |
what are like some of the use cases that you're seeing 00:10:27.540 |
- Yeah, so, I mean, you know, when it started to happen, 00:10:34.000 |
They just have a better technology for, you know, 00:10:37.800 |
So it kind of, it made sense to me, you know, 00:10:46.700 |
And that's, chat bot is one of the use cases. 00:10:49.780 |
I think, you know, like if you try to generalize it, 00:10:52.940 |
the use case would be how do we use structured 00:10:56.980 |
or tabular data with, you know, like AI models, right? 00:11:06.260 |
and then model can, you know, like give you answers, 00:11:11.260 |
But the question is like how we go from just the data 00:11:18.180 |
Like in a SQL based warehouses to some sort of, you know, 00:11:26.980 |
It's like there's no way you can get around not doing this. 00:11:32.700 |
or you come up with some framework or something else. 00:11:35.420 |
So our take is that, and my take is that semantic layer 00:11:38.020 |
is just really good place for this context to live 00:11:41.460 |
because you need to give this context to the humans. 00:11:43.620 |
You need to give that context to the AI system anyway, right? 00:11:48.620 |
And then, you know, like you teach your AI system 00:11:59.420 |
and some of the ways that having the semantic layer 00:12:03.460 |
- I feel like, imagine you're a human, right? 00:12:05.820 |
And you're going in as, like, a new data analyst at a company 00:12:19.780 |
without any, you know, like additional context 00:12:24.100 |
like in many cases they might have weird names. 00:12:27.180 |
Sometimes, you know, if they follow some kind of 00:12:30.100 |
like a star schema or like a Kimball style dimensions, 00:12:34.180 |
because you would have fact and dimension columns, 00:12:48.820 |
to give context to the data so people can understand that. 00:12:53.140 |
And I think the same applies to the AI, right? 00:13:00.180 |
it doesn't have this sort of context that it can read. 00:13:07.220 |
and give that book to the system so it can understand it. 00:13:10.940 |
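As a rough illustration of what that "book" of context might look like, here is a hypothetical metric definition that carries the same human-readable descriptions you would hand to a new analyst, serialized to text so a model can read it too. The names, descriptions, and SQL are invented for illustration; this is not actual Cube model syntax.

```python
# Hypothetical semantic-layer entry: the context that helps a new analyst
# is the same context that helps an LLM.
active_opportunities = {
    "name": "active_opportunities",
    "description": "Open sales opportunities that have not been closed or lost.",
    "sql": "SELECT COUNT(*) FROM opportunities WHERE status = 'open'",
    "dimensions": {
        "owner": "Sales rep who owns the opportunity",
        "created_at": "Date the opportunity was created",
    },
}

def to_context_text(metric: dict) -> str:
    # Flatten the definition into plain text so it can later be indexed,
    # embedded, and retrieved when a user asks a related question.
    dims = ", ".join(f"{k} ({v})" for k, v in metric["dimensions"].items())
    return f"Metric {metric['name']}: {metric['description']} Dimensions: {dims}."

print(to_context_text(active_opportunities))
```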
- Can you run through the steps of how that works today? 00:13:14.940 |
So the initial part is like the natural language query, 00:13:18.100 |
like what are the steps that happen in between 00:13:26.900 |
- The first key step is to do some sort of indexing. 00:13:36.220 |
Like describe in a text format what your data is about, 00:13:50.180 |
So sort of, you know, like build a really good index 00:13:52.940 |
as a text representation and then turn it into embeddings 00:14:00.700 |
Once you have that, then you can sort of, you know, 00:14:30.300 |
Because what usually happens is that your query 00:14:37.780 |
by that dimension and maybe that filter should be applied. 00:14:46.100 |
a lot of different, you know, like techniques, 00:14:56.260 |
the more room that the model can make an error, right? 00:14:59.660 |
Like even sometimes it could be a small error 00:15:01.940 |
and then, you know, like your numbers are going to be off. 00:15:11.580 |
and then it executes it against the semantic layer. 00:15:14.340 |
The semantic layer executes it against your warehouse 00:15:17.540 |
and then sends the result all the way back to your application. 00:15:24.140 |
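Stitching those steps together, a hedged end-to-end sketch might look like the following. Every function here (embed, the index's search, llm_complete, run_semantic_query) is a stand-in for whatever embedding store, model, and semantic-layer API you actually use; none of them are real Cube or LangChain calls.

```python
import json

def answer_question(question: str, index, embed, llm_complete, run_semantic_query):
    # 1. Retrieve the slices of the semantic layer that were indexed earlier
    #    as text + embeddings and are relevant to this question.
    relevant_defs = index.search(embed(question), top_k=5)

    # 2. Ask the model for a structured semantic-layer query, not raw SQL.
    #    Generating against known measures/dimensions leaves less room for
    #    the model to invent columns that don't exist.
    prompt = (
        "Using only these measures and dimensions:\n"
        + "\n".join(relevant_defs)
        + f"\n\nReturn a JSON query answering: {question}"
    )
    query = json.loads(llm_complete(prompt))

    # 3. The semantic layer compiles the query to SQL, runs it against the
    #    warehouse, and the result flows back to the application.
    return run_semantic_query(query)
```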
because what we were missing was just about this ability 00:15:27.140 |
to have a conversation, right, with the model. 00:15:31.580 |
and then the system can do follow-up questions, 00:15:34.620 |
you know, like then do a query to get some information, 00:15:37.540 |
additional information based on this information, 00:15:40.820 |
And sort of, you know, like it can keep doing this stuff 00:15:42.940 |
and then eventually maybe give you a big report 00:15:48.220 |
But the whole flow is that it knows the system, 00:15:53.380 |
because you already kind of did the indexing. 00:16:03.340 |
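The conversational part he describes can be sketched as a simple loop: the model either asks the semantic layer another question or decides it has enough to write the report. Again, every call here is a placeholder, not a real API.

```python
def agent_report(goal: str, llm_step, run_semantic_query, max_steps: int = 5):
    """Hypothetical follow-up loop: query, look at the result, query again."""
    history = []
    for _ in range(max_steps):
        # The model sees the goal plus everything fetched so far and either
        # returns another query to run or a final report.
        action = llm_step(goal=goal, history=history)
        if action["type"] == "report":
            return action["text"]
        result = run_semantic_query(action["query"])
        history.append({"query": action["query"], "result": result})
    return "Ran out of steps before producing a report."
```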
for people that haven't used a semantic layer before, 00:16:24.020 |
and then it turns into a bigger and bigger query. 00:16:27.820 |
- One of the biggest difficulties around semantic layer 00:16:30.380 |
for people who've never thought about this concept before, 00:16:39.180 |
who all have different concepts of what a revenue is, 00:16:44.900 |
And then so they'll have like revenue revision one 00:16:48.540 |
by the sales team and then revenue revision one, 00:16:54.380 |
I feel like I always want semantic layer discussions 00:16:57.980 |
to talk about the not so pretty parts of the semantic layer 00:17:02.260 |
because this is where effectively you ship your org chart 00:17:11.900 |
and in Cube, it's essentially a code base, right? 00:17:19.540 |
We know that, we're like software engineers, right? 00:17:23.220 |
You will have a lot of, you know, like revisions of code. 00:17:30.740 |
and we do semantic layer as code, right? 00:17:36.100 |
You know, like if there are, like, multiple teams 00:17:43.540 |
Like why they think that should be calculated differently. 00:17:48.780 |
You know, like when everyone can just discuss it 00:17:54.140 |
why that code is written the way it's written, right? 00:17:58.620 |
And then hopefully at some point you can come up, 00:18:03.220 |
Now, if you still have multiple versions, right? 00:18:11.420 |
is that like we really need to treat it as a code base. 00:18:15.860 |
not as spreadsheets, you know, like hidden Excel files. 00:18:21.820 |
then having the definition spread in the organization, 00:18:24.940 |
you know, like versus everybody trying to come up 00:18:28.980 |
But yeah, I'm sure that when you talk to customers, 00:18:31.420 |
there's people that, you know, have issues with the product 00:18:44.780 |
How important is the natural language to people? 00:18:51.140 |
in modern data stack companies either now or before. 00:19:05.580 |
and having non-technical folks do more of the work. 00:19:08.300 |
Are you seeing that as a big push too with these models, 00:19:12.100 |
like allowing everybody to interact with the data? 00:19:21.660 |
like where you have a lot of inside the question. 00:19:32.140 |
We have a company that built an internal Slack bot 00:19:52.460 |
I think it's really hard to tell them apart at this point 00:20:02.500 |
you know, like kind of even at least a pet project. 00:20:05.260 |
So that's what happened in Cube's community as well. 00:20:07.420 |
We see a lot of like people building a lot of cool stuff 00:20:13.980 |
and kind of to see like what are real, the best use cases. 00:20:23.940 |
So they essentially connect into Cube's semantic layer 00:20:52.740 |
But other dimension of your question is like, 00:21:09.300 |
but it's more like a co-pilot for a data analyst, 00:21:13.980 |
where you develop something, you develop a model 00:21:16.260 |
and it can help you to write a SQL or something like that. 00:21:19.100 |
So, you know, it can create a boilerplate SQL 00:21:23.900 |
which is fine because you know how to edit SQL, right? 00:21:28.580 |
but it will help you to just generate, you know, 00:21:30.860 |
like a bunch of SQL that you write again and again, right? 00:21:37.660 |
I think that's great and we'll see more of it. 00:21:39.740 |
I think every platform that is building for data engineers 00:21:43.660 |
will have some sort of a co-pilot capabilities 00:21:54.820 |
to have some sort of, you know, like a co-pilot capabilities. 00:22:02.460 |
how do we enable access to data for non-technical people 00:22:05.860 |
through the natural language as an interface to data, right? 00:22:11.780 |
it always has been an interface to data in every BI. 00:22:15.580 |
Now, I think we will see just a second interface 00:22:22.900 |
many BIs will add it as a commodity feature. 00:22:25.620 |
It's like Tableau will probably have a search bar 00:22:28.900 |
at some point saying like, "Hey, ask me a question." 00:22:31.420 |
I know that some of the, you know, like AWS QuickSight, 00:22:37.900 |
in their like BI and I think Power BI will do that, 00:22:45.540 |
some sort of a search capabilities built in inside their BI. 00:22:48.820 |
So I think that's just going to be a baseline feature 00:22:51.060 |
for them as well, but that's where Cube can help 00:23:08.420 |
Yeah, do you just see everything will look the same 00:23:28.260 |
And every major vendor and most of the vendors 00:23:31.100 |
will try to have some sort of natural language capabilities 00:23:40.540 |
Some of them will just have them as a checkbox, right? 00:23:43.980 |
So we'll see, but I don't think it's going to be something 00:23:54.700 |
rather than, you know, like what we have right now. 00:23:59.660 |
- Let's talk a bit more about application use cases. 00:24:03.620 |
So people also use Cube for kind of like analytics 00:24:06.900 |
on their product, like dashboards and things like that. 00:24:14.420 |
especially like when it comes to like agents, you know, 00:24:16.540 |
so there's like a lot of people trying to build agents 00:24:23.700 |
you need to know everything about the purchasing history 00:24:33.900 |
think about when implementing data into agents? 00:24:41.740 |
One is how to make sure that agents or LLM model, right, 00:24:46.740 |
has enough context about, you know, like a tabular data. 00:24:50.740 |
And also, you know, like how do we deliver updates 00:25:04.540 |
And how do you make sure that the queries are correct? 00:25:09.940 |
in this all, you know, like AI kind of, you know, 00:25:13.940 |
that we don't, you know, provide wrong answers? 00:25:16.940 |
But I think, you know, like kind of be able to reduce 00:25:24.100 |
like to try to like minimize potential damage. 00:25:28.740 |
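One common way to "minimize potential damage" is to validate the model's generated query against the semantic layer's known measures and dimensions before anything touches the warehouse. A hedged sketch; the schema names and query shape are invented for illustration and are not Cube's actual format.

```python
KNOWN_MEASURES = {"active_opportunities", "revenue"}
KNOWN_DIMENSIONS = {"owner", "region", "created_at"}

def validate_query(query: dict) -> list[str]:
    """Return a list of problems; an empty list means the query is safe to run."""
    problems = []
    for m in query.get("measures", []):
        if m not in KNOWN_MEASURES:
            problems.append(f"unknown measure: {m}")
    for d in query.get("dimensions", []):
        if d not in KNOWN_DIMENSIONS:
            problems.append(f"unknown dimension: {d}")
    return problems

# Example: the model hallucinated a dimension that doesn't exist,
# so we reject the query instead of running bad SQL.
bad = {"measures": ["revenue"], "dimensions": ["customer_tier"]}
print(validate_query(bad))  # ['unknown dimension: customer_tier']
```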
And then, yeah, I feel like our use case, you know, 00:25:40.460 |
So I don't think that much is going to change 00:25:43.560 |
is that I feel like, again, more and more products 00:25:58.140 |
but also some sort of, you know, like summaries, 00:26:02.900 |
you're going to open the page with some surface stats 00:26:06.540 |
and you will have a smart summary kind of generated by AI. 00:26:09.580 |
And that summary can be powered by Cube, right? 00:26:11.820 |
Like, because the rest is already being powered by Cube. 00:26:14.660 |
- You know, we had Linus from Notion on the pod 00:26:18.180 |
and one of the ideas he had that I really like 00:26:23.500 |
kind of like how do you like compress knowledge 00:26:27.700 |
A lot of that comes into dashboards, you know, 00:26:34.060 |
hey, this is like the three lines summary of it. 00:26:36.820 |
Yeah, and yeah, makes sense that you would want to power that. 00:26:42.220 |
So are you, how are you thinking about, yeah, 00:26:49.900 |
what's like the future of what people are going to do? 00:26:53.260 |
What's the future of like what models and agents 00:27:03.420 |
I mean, it's obviously a big crossover between AI, 00:27:06.460 |
you know, like ecosystem, AI infrastructure ecosystem 00:27:15.900 |
like I'm looking at a lot of like what's happening 00:27:23.020 |
we use BI's, you know, different like transformation tools, 00:27:26.780 |
catalogs, like data quality tools, ETLs, all of that. 00:27:30.500 |
I don't see a lot of being compacted by AI specifically. 00:27:35.020 |
I think, you know, that space is being compacted 00:27:40.700 |
yes, we'll have all those copilot capabilities, 00:27:45.500 |
but I don't think I see anything sort of dramatically, 00:27:48.860 |
you know, being sort of, you know, changed or shifted 00:27:57.220 |
I think, you know, like in the last two, three years, 00:28:09.900 |
And, you know, like, I mean, if Fivetran and DBT merge, 00:28:13.900 |
they can be Alteryx of a new generation or something like, 00:28:18.100 |
- And, you know, probably some ETL too there, 00:28:23.460 |
I mean, it just natural waves, you know, like in cycles. 00:28:26.940 |
- I wonder if everybody is gonna have their own copilot. 00:28:29.660 |
The other thing I think about these models is like, 00:28:31.940 |
you know, Swyx was at Airbyte and yeah, there's Fivetran. 00:28:41.980 |
- There's the, you know, a lot of times these companies 00:28:48.660 |
of like building the integration between your data store 00:28:53.340 |
I feel like now these models are pretty good at coming up 00:28:58.500 |
and like using the docs to then connect the two. 00:29:07.580 |
I mean, you think about DBT and some of these tools 00:29:10.620 |
and right now you have to create rules to normalize 00:29:23.380 |
But yeah, I think it all needs a semantic layer 00:29:28.460 |
as far as like figuring out what to do with it, you know, 00:29:34.900 |
like workflows will be augmented by, you know, 00:29:40.300 |
You know, you can describe what transformation 00:29:43.100 |
you want to see and it can generate a boilerplate, right, 00:29:47.340 |
Or even, you know, like kind of generate a boilerplate 00:30:10.260 |
can be augmented quite significantly with all that stuff. 00:30:14.500 |
- I think the other important thing with data too 00:30:20.420 |
the big thing with machine learning before was like, 00:30:26.180 |
And I think like now at least with these models, 00:30:31.940 |
and they can also tell you if your data is bad, you know, 00:30:34.500 |
which I think is like something that before you didn't, 00:30:37.420 |
Any cool apps that you've seen being built on Cube, 00:30:51.140 |
it's definitely like, they all remind me of Statsbot, 00:31:03.580 |
It's just that use case that you really want, you know, 00:31:06.180 |
think about your data engineer in your company, 00:31:15.540 |
You know, like, so they will ping that bot instead. 00:31:18.820 |
So it's like, that's why a lot of people doing this. 00:31:23.620 |
But I think inside that use case, people get creative. 00:31:26.580 |
So I see bots that can actually have a dialogue with you. 00:31:29.500 |
So, you know, like you would come to that bot and say, 00:31:32.220 |
And the bot would be like, "What kind of metrics? 00:31:40.460 |
You want to see active users, you know, like sort of cohort, 00:31:54.660 |
and that's how many data analysts work, right? 00:32:00.300 |
you always try to understand what exactly do you mean? 00:32:18.380 |
and you can write exact query to your data warehouse, right? 00:32:21.940 |
So many people, like, sit a little bit in the middle, 00:32:27.460 |
they usually have the knowledge about your data. 00:32:30.260 |
And that's why they can ask follow-up questions 00:32:34.620 |
And I saw people building bots who can do that. 00:32:39.860 |
I mean, like generating SQL, all of that stuff, 00:32:44.300 |
but when the bot can actually act like they know your data 00:32:57.180 |
You know, one of the big complaints people have 00:32:59.420 |
is like GPT, at least 3.5, cannot do math. 00:33:02.500 |
You know, have you seen any limitations and improvement? 00:33:11.060 |
because it's like the best at this kind of analysis? 00:33:13.500 |
- I think I saw people use all kinds of models. 00:33:19.220 |
I mean, inside GPT, it could be 3.5 or 4, right? 00:33:22.620 |
But it's not like I see a lot of something else, 00:33:27.100 |
maybe know like some open source alternatives, 00:33:32.220 |
like it feels like the market is being dominated 00:33:40.540 |
I think I've been chatting about it with a few people. 00:33:47.780 |
you know, like outside of, you know, like ChatGPT itself. 00:33:50.580 |
So it would be like some additional Python scripts 00:33:53.940 |
When we're talking about production level use cases, 00:33:57.660 |
it's quite a lot of Python code around, you know, 00:33:59.860 |
like your model to make it work, to be honest. 00:34:06.820 |
For like toy use cases, the one we have on, you know, 00:34:09.180 |
like our demo page or something, it works fine. 00:34:12.340 |
But you know, like if you want to do like a lot 00:34:16.020 |
you probably need to code it in Python anyway. 00:34:21.220 |
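In practice that "Python around the model" often means letting the model pick what to compute and doing the arithmetic in code, since the model itself is unreliable at math. A minimal sketch under that assumption; the function and data shapes are hypothetical:

```python
import statistics

def compute(metric_rows, operation: str):
    """Do the arithmetic in Python; the model only chooses `operation`."""
    values = [row["value"] for row in metric_rows]
    ops = {
        "sum": sum,
        "mean": statistics.mean,
        "max": max,
        "min": min,
    }
    return ops[operation](values)

# Example: the model answered {"operation": "mean"} for "average deal size";
# the numbers themselves never pass through the LLM.
rows = [{"value": 1200}, {"value": 800}, {"value": 1000}]
print(compute(rows, "mean"))  # 1000
```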
We heard the same from Harrison at LangChain 00:34:30.340 |
And it was funny to like just see the reaction 00:34:40.780 |
You're kind of like at the cutting edge of this, you know, 00:34:43.300 |
if I'm looking to build a data-driven AI application, 00:34:47.580 |
I'm trying to build data into my AI workflows. 00:35:00.700 |
I think a lot of people feel that MySQL can be a warehouse, 00:35:10.540 |
So just kind of have it starting with a good warehouse, 00:35:14.740 |
that's probably like something I would recommend 00:35:34.620 |
I feel it's a very interesting space, you know, 00:35:40.140 |
I see a lot of people using LangChain right now, 00:35:45.500 |
but I'm sure the space will continue to evolve. 00:35:51.620 |
like some tools would be a better fit for a job. 00:35:57.020 |
but it's always interesting to see how it evolves. 00:36:05.620 |
documenting all that stuff will kind of evolve too. 00:36:09.100 |
But yeah, again, it's just like really interesting 00:36:15.380 |
- Okay, so before we go to the lightning round, 00:36:17.700 |
I wanted to ask you on your thoughts on embedded analytics. 00:36:39.020 |
embedded analytics is basically user facing dashboards 00:36:48.420 |
it's an individual user seeing their own data 00:36:53.220 |
that is owned by the platform that they're using. 00:36:58.940 |
but actually overwhelmingly the observation that I've had 00:37:02.420 |
is that people who try to build in this market 00:37:21.940 |
like a chatbots for their internal data consumption 00:37:28.660 |
because it's historically been dominated by the BI vendors. 00:37:33.660 |
And we still, you know, like see a lot of, you know, 00:37:39.020 |
like organizations are using BI tools from vendors. 00:37:39.020 |
to the embedded analytics capabilities as well, right? 00:37:56.620 |
Also, you know, if you look at the embedded analytics market 00:38:03.940 |
they're really more custom, you know, like it becomes. 00:38:10.500 |
and they just kind of build most of the stuff from scratch, 00:38:13.860 |
which probably, you know, like the right way to do it. 00:38:16.020 |
So it's sort of, you know, like you got a market 00:38:21.660 |
and then you also in that middle and small segment 00:38:32.900 |
embedded analytics therefore is fragmented also. 00:38:36.340 |
So you're really going after the mid-market slice 00:38:39.980 |
and then with a lot of other vendors competing for that. 00:38:43.060 |
So that's why it's historically been hard to monetize, right? 00:38:58.140 |
So it's going to be more like a commodity feature 00:39:07.820 |
One is about acceleration, one on exploration 00:39:14.380 |
that already happened in AI or maybe, you know, 00:39:17.660 |
in data that you thought would take much longer 00:39:22.140 |
- To be honest, all this foundational models, 00:39:26.820 |
we had a lot of models that had been in production 00:39:29.940 |
for like quite a while, you know, maybe a decade or so. 00:39:29.940 |
And even when we were building Statsbot back then in 2016, 00:39:39.020 |
like a Google Translate or something that was, 00:39:53.940 |
But it was very customized with a specific use case. 00:39:56.820 |
So I thought that would continue for like many years. 00:40:00.660 |
We'll use AI, we'll have all this customized niche models. 00:40:07.900 |
They like, they can serve many, many different use cases. 00:40:17.420 |
- And the next question is about exploration. 00:40:22.340 |
is the most interesting unsolved question in AI? 00:40:24.780 |
- I think AI is a subset of software engineering in general. 00:40:29.780 |
And it's sort of connected to the data as well. 00:40:33.100 |
And in software, because software engineering 00:40:46.620 |
And now AI, I don't think it's completely different, 00:40:51.620 |
You know, like it's very much not idempotent, right? 00:40:59.580 |
So which kind of may require a different methodologies, 00:41:03.820 |
may require different approaches in a different toolkit. 00:41:10.100 |
I think many sort of, you know, like tools and practices 00:41:19.780 |
But it might be a very interesting subfield, 00:41:19.780 |
So now, like, AI kind of feels like it's shaping into that 00:41:27.580 |
How do we test, you know, like what is the best practices? 00:41:45.780 |
So I think that would be an interesting to see. 00:41:52.180 |
you have a big audience of engineers and technical folks. 00:41:56.180 |
What's something you want everybody to remember, 00:42:00.380 |
- As somebody who tried to build a chatbot, 00:42:25.780 |
it's kind of amazing that we actually now can build smart agents. 00:42:28.460 |
I think that's sort of, you know, like the takeaway. 00:42:30.540 |
And yeah, we are, as you know, like as humans in general, 00:42:35.340 |
we're like, we really move technology forward 00:42:38.340 |
and it's fun to see it, you know, like, just firsthand. 00:42:41.420 |
- Ah, well, thank you so much for coming on Artem.