
Powering your Copilot for Data - with Artem Keydunov from Cube.dev


Chapters

0:00 Introductions
1:35 History of Statsbot - Slack bot for querying data in Slack
4:45 Building Cube to power Statsbot due to limitations in natural language processing at the time
6:50 Open sourcing Cube as a standalone product
8:34 Explaining the concept of a semantic layer and OLAP cubes
10:27 Using semantic layers to provide context to AI models
11:54 Challenges of using tabular vs. language data with AI models
13:11 Workflow of natural language to SQL query using semantic layer
16:01 Ensuring AI agents have proper data context and make correct queries
18:20 Treating metrics definitions in the semantic layer as a codebase with collaboration
22:55 Natural language capabilities becoming a commodity baseline for BI tools
24:37 Recommendations for building data-driven AI applications
28:26 Predictions on the consolidation of modern data stack tools/companies
30:14 AI assistance augmenting but not fully automating data workflows
34:20 Using external Python scripts to handle limitations of models with math
36:15 Embedded analytics challenges and natural language commoditization
39:04 Lightning round

Whisper Transcript

00:00:00.000 | (upbeat music)
00:00:02.580 | - Hey everyone.
00:00:08.440 | Welcome to the Latent Space Podcast.
00:00:09.640 | This is swyx, writer and editor of Latent Space
00:00:11.420 | and founder of Smol.ai,
00:00:12.640 | and Alessio, partner and CTO in residence
00:00:15.560 | at Decibel Partners.
00:00:16.720 | - Hey everyone.
00:00:17.540 | And today we have Artem Keydunov on the podcast,
00:00:20.520 | co-founder of Cube.
00:00:21.600 | Hey Artem.
00:00:22.440 | - Hey Alessio, hi swyx.
00:00:23.720 | Good to be here today.
00:00:25.240 | Thank you for inviting me.
00:00:26.480 | - Yeah, thanks for joining.
00:00:28.000 | For people that don't know,
00:00:29.200 | I've known Artem for a long time,
00:00:31.240 | ever since he started Cube.
00:00:32.640 | And Cube is actually a spin out of his previous company,
00:00:35.220 | which is Statsbot.
00:00:37.040 | And this kind of feels like going both backward
00:00:41.000 | and forward in time.
00:00:42.020 | So the premise of Statsbot was having a Slack bot
00:00:46.240 | that you can ask,
00:00:47.080 | basically like text to SQL in Slack.
00:00:49.240 | And this was six, seven years ago,
00:00:51.600 | or something like that.
00:00:52.720 | So a lot ahead of its time
00:00:54.480 | and you see startups trying to do that today.
00:00:56.840 | And then Cube came out of that
00:00:59.080 | as a part of the infrastructure
00:01:00.460 | that was powering Statsbot.
00:01:02.140 | And Cube then evolved from an embedded analytics product
00:01:06.000 | to the semantic layer
00:01:07.320 | and just an awesome open source evolution.
00:01:09.580 | I think you have over 16,000 stars on GitHub today.
00:01:13.480 | You have a very active open source community.
00:01:15.740 | But maybe for people at home,
00:01:18.240 | just give a quick like lay of the land
00:01:21.240 | of the original Statsbot product.
00:01:24.080 | You know, what got you interested in like text to SQL
00:01:27.200 | and what were some of the limitations that you saw then,
00:01:30.960 | the limitations that you're also seeing today
00:01:33.680 | in the new landscape?
00:01:35.440 | - I started Statsbot in 2016.
00:01:39.080 | So the original idea was to just make
00:01:43.320 | sort of a side project
00:01:45.320 | based off my initial project that I did at a company
00:01:49.120 | that I was working for back then.
00:01:50.600 | And I was working for a company
00:01:51.920 | that was building software for schools.
00:01:54.300 | And we were using Slack a lot.
00:01:56.600 | And Slack was growing really fast.
00:01:58.160 | A lot of people were talking about Slack,
00:01:59.880 | you know, like Slack apps, chatbots in general.
00:02:02.940 | So I think it was, you know,
00:02:04.420 | like another wave of, you know, bots and all of that.
00:02:07.200 | We have one more wave right now,
00:02:08.840 | but it always comes in waves.
00:02:10.760 | So we were like living through one of those waves.
00:02:14.020 | And I wanted to build a bot
00:02:16.660 | that would give me information
00:02:18.440 | from the different places where like a data lives to Slack.
00:02:22.200 | So this was like some, you know,
00:02:24.020 | developer data, like New Relic,
00:02:26.400 | you know, maybe some marketing data, Google Analytics,
00:02:29.800 | and then some just regular data,
00:02:31.400 | like a production databases or even Salesforce sometimes.
00:02:34.400 | And I wanted to bring it all into Slack
00:02:36.360 | because we were always talking, chatting,
00:02:38.840 | you know, like in Slack,
00:02:39.840 | and I wanted to see some stats in Slack.
00:02:42.080 | So that was the idea of Statsbot, right?
00:02:43.600 | Like bring stats to Slack.
00:02:46.280 | I built that as a, you know,
00:02:47.520 | like a first sort of a side project,
00:02:49.760 | and I published it on Reddit,
00:02:52.680 | and people started to use it even before Slack came up
00:02:55.320 | with that Slack application directory.
00:02:57.360 | So it was a little, you know,
00:02:58.600 | like a hackish way to install it,
00:03:00.320 | but people were still installing it.
00:03:02.320 | So it was a lot of fun,
00:03:03.300 | and then Slack kind of came up
00:03:05.020 | with that application directory,
00:03:07.040 | and they reached out to me,
00:03:08.840 | and they wanted to feature Statsbot
00:03:10.520 | because it was already one of the
00:03:12.720 | kind of widely used bots on Slack.
00:03:15.040 | So they featured me on this application directory front page,
00:03:18.200 | and I just got a lot of, you know,
00:03:19.360 | like new users signing up for that.
00:03:21.640 | It was a lot of fun, I think, you know, like,
00:03:23.580 | but it was sort of a big limitation
00:03:26.600 | in terms of how you can process natural language,
00:03:29.260 | because the original idea was to let people ask questions
00:03:34.080 | directly in Slack, right?
00:03:35.400 | Hey, show me my, you know,
00:03:37.340 | like opportunities closed last week or something like that.
00:03:40.000 | My co-founder who kind of started helping me
00:03:42.680 | with this Slack application,
00:03:45.240 | he and I were trying to build a system
00:03:47.280 | to recognize that natural language,
00:03:49.360 | but it was, you know, we didn't have LLMs, right,
00:03:51.720 | back then and all of that technologies,
00:03:53.360 | so it was really hard to build the system,
00:03:55.880 | especially the systems that can kind of, you know,
00:03:57.840 | like keep talking to you,
00:03:59.160 | like maintain some sort of a dialogue.
00:04:01.620 | It was a lot of like one-off requests,
00:04:03.520 | and like it was a lot of hit and miss, right?
00:04:05.320 | If you know how to construct a query in natural language,
00:04:07.880 | you will get a result back,
00:04:09.400 | but, you know, like it was not a system
00:04:11.520 | that was capable of, you know,
00:04:12.960 | like asking follow-up questions
00:04:14.800 | to try to understand what you actually want,
00:04:16.960 | and then kind of finally, you know,
00:04:18.200 | like bring this all context
00:04:20.040 | and go to generate a SQL query,
00:04:21.880 | get the result back and all of that.
00:04:23.760 | So that was a really missing part,
00:04:25.880 | and I think right now that's, you know,
00:04:27.200 | like what is the difference?
00:04:28.680 | So right now I'm kind of bullish that, you know,
00:04:31.040 | like if I would start Statsbot again,
00:04:32.860 | probably would have a much better shot at it,
00:04:35.620 | but back then that was a big limitation.
00:04:37.800 | Funny thing is that we wanted to,
00:04:39.040 | we kind of built Cube, right,
00:04:41.380 | as we were working on Statsbot because we needed it.
00:04:45.300 | - Yeah.
00:04:46.140 | What was the ML stack at the time?
00:04:48.720 | Were you building, trying to build your own,
00:04:50.560 | like a natural language understanding models?
00:04:53.160 | Like were there open source models that were good
00:04:55.740 | that you were trying to leverage?
00:04:57.080 | - I think it was mostly combination of a bunch of things,
00:05:00.040 | and we tried a lot of different approaches.
00:05:02.140 | The first version which I built was, like, regexes.
00:05:06.140 | They were working well.
00:05:07.240 | - This is the same as I did.
00:05:10.700 | I did option pricing when I was in finance,
00:05:13.520 | and I had a natural language pricing tool thing,
00:05:16.120 | and it was Regex.
00:05:16.960 | It was just a lot of Regex.
00:05:18.640 | - Yeah, yeah.
00:05:20.640 | And then, and my co-founder, Pavel Tiunov,
00:05:23.520 | he's much smarter than I am.
00:05:24.960 | He's like PhD in math, all of that.
00:05:27.000 | And he started to like do some stuff that was like,
00:05:30.200 | I was like, no, you just do that stuff.
00:05:32.120 | I don't know, like I can do Regex.
00:05:33.800 | And, you know, like he started to do like some models
00:05:37.020 | and trying to either, you know,
00:05:38.280 | like look at what we had on the market back then,
00:05:41.240 | or, you know, like try to build a different sort of,
00:05:43.580 | you know, like kind of models.
00:05:45.920 | Again, we didn't have any foundation models in place back then, right?
00:05:48.680 | We wanted to build something that, you know, like we,
00:05:51.260 | okay, we wanted to try to use existing math, obviously,
00:05:54.200 | right, but it was not something that we can take the model
00:05:56.840 | and, you know, like a try and run it.
00:05:58.940 | I think in 2019, we started to see more like of stuff,
00:06:03.360 | you know, like ecosystem being built,
00:06:05.440 | and then it eventually kind of, you know,
00:06:07.160 | like resulted in all this LLM, like what we have right now.
00:06:10.120 | But back then in 2016, it was not much, you know,
00:06:13.920 | like available for just the people to build on top.
00:06:16.000 | It was some academic research, right,
00:06:18.280 | kind of been happening, but it was like very, very early,
00:06:21.680 | you know, like for something to actually be able to use.
00:06:25.020 | - And then that became Cube,
00:06:28.160 | which was started just as an open source project.
00:06:30.480 | And I think I remember going on a walk with you
00:06:33.120 | in San Mateo in like 2020, something like that.
00:06:36.440 | And you were like, you have people reaching out to you
00:06:38.840 | who are like, "Hey, we use Cube in production."
00:06:41.000 | Like, "I just need to give you some money,
00:06:43.200 | even though you guys are not a company."
00:06:45.600 | What's the story of Cube then from Statsbot
00:06:48.480 | to where you are today?
00:06:49.760 | - We built Cube at Statsbot because we needed it.
00:06:53.440 | It was like the whole Statsbot stack
00:06:56.120 | was that we first tried to translate
00:06:59.760 | the initial sort of language query
00:07:02.080 | into some sort of multidimensional query.
00:07:06.480 | It's like, we were trying to understand,
00:07:07.680 | okay, people wanted to get active opportunities, right?
00:07:11.480 | What does it mean?
00:07:12.320 | Is it a metric?
00:07:13.320 | Is it a dimension here?
00:07:15.080 | Because usually in analytics,
00:07:16.320 | you always, you know, like try to reduce everything down
00:07:19.160 | to the sort of, you know, like a multidimensional framework.
00:07:22.240 | So that was the first step.
00:07:24.080 | And that's where, you know, like it didn't really work well
00:07:26.760 | because of all the limitations
00:07:28.600 | of us not having foundational technologies.
00:07:31.060 | But then from the multidimensional query,
00:07:33.760 | we wanted to go to SQL.
00:07:36.040 | And that's what the semantic layer was,
00:07:37.800 | and what Cube was, essentially.
00:07:39.680 | So we built a framework where you would be able
00:07:42.760 | to map your data into these concepts, into these metrics.
00:07:46.800 | Because when people were coming to Statsbot,
00:07:49.440 | they were bringing their own datasets, right?
00:07:52.040 | And the big question was, how do we tell the system
00:07:55.520 | what active opportunities means for that specific user?
00:07:58.080 | How do we kind of, you know, like provide that context,
00:08:00.560 | how do we do the training?
00:08:01.740 | So that's why we came up with the idea
00:08:03.340 | of building the semantic layer.
00:08:05.060 | So people can actually define their metrics
00:08:07.140 | and then kind of use them in Statsbot.
00:08:09.260 | So that's how we built Cube.
00:08:10.860 | But at some point we saw people started to see more value
00:08:16.340 | in Cube itself, you know,
00:08:16.340 | like kind of building the semantic layers
00:08:18.580 | and then using it to power different types
00:08:20.760 | of applications.
00:08:22.060 | So in 2019, we decided, okay,
00:08:24.540 | it feels like it might be a standalone product
00:08:27.500 | and a lot of people want to use it.
00:08:28.700 | Let's just try to open source it.
00:08:31.440 | So we took it out of Statsbot and open sourced it.
00:08:34.320 | - Can I make sure that everyone
00:08:36.800 | has the same foundational knowledge?
00:08:39.240 | The concept of a Kube is not something that you invented.
00:08:42.280 | I think, you know, not everyone has the same background
00:08:44.440 | in analytics and data that all three of us do.
00:08:47.520 | Maybe you want to explain, like, OLAP cubes, hypercubes,
00:08:50.320 | you know, anything, whatever the brief history of cubes.
00:08:53.280 | - Right.
00:08:56.560 | I'll try, you know, like there's a lot of like
00:08:58.980 | Wikipedia pages and, like, a lot of blog posts
00:09:01.700 | trying to go into the academics of it.
00:09:03.500 | So I'm trying to like--
00:09:04.340 | - Cubes according to you, yeah.
00:09:06.020 | - Yeah, it's just,
00:09:08.180 | so when we think about just a table in a database,
00:09:10.860 | the problem with a table is that it's not multidimensional,
00:09:13.740 | meaning that in many cases, if we want to slice the data,
00:09:17.420 | we kind of end up with a different table, right?
00:09:20.900 | Like think about when you're writing a SQL query
00:09:23.020 | to answer one question,
00:09:26.500 | a SQL query always ends up
00:09:27.580 | with a table, right?
00:09:29.300 | So you write one SQL query, you get one table.
00:09:31.260 | Then, to answer a different question,
00:09:32.300 | you write a second query.
00:09:32.300 | So you're kind of getting a bunch of tables.
00:09:35.100 | So now let's imagine that we can kind of
00:09:37.100 | bring all these tables together into a multidimensional table.
00:09:41.940 | And that's essentially a cube.
00:09:43.520 | So it's just like the way that we can have measures
00:09:46.340 | and dimensions that can potentially be kind of, you know,
00:09:49.040 | like used at the same time from different angles.
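To make the cube idea concrete, here is a minimal sketch in Python (an editor's illustration with made-up fact rows, not Cube's actual API): the same set of facts can be aggregated along any dimension on demand, instead of producing one flat result table per SQL query.

```python
# Illustrative only: a tiny "cube" as rows of dimensions plus a measure.
from collections import defaultdict

# Hypothetical fact rows: dimensions (region, product) and a measure (amount).
facts = [
    {"region": "US", "product": "pro", "amount": 100},
    {"region": "US", "product": "basic", "amount": 40},
    {"region": "EU", "product": "pro", "amount": 70},
]

def aggregate(rows, dimension, measure):
    """Group the same facts by any dimension -- no new table needed."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

print(aggregate(facts, "region", "amount"))   # {'US': 140, 'EU': 70}
print(aggregate(facts, "product", "amount"))  # {'pro': 170, 'basic': 40}
```

The point is that one multidimensional structure answers both questions; in plain SQL, each slice would be a separate query producing a separate table.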
00:09:53.820 | - And so initially, a lot of your use cases
00:09:56.140 | were more, you know, BI related,
00:09:58.220 | but you recently released a LangChain integration.
00:10:01.640 | There's obviously more and more interest in, again,
00:10:05.300 | using these models to answer data questions.
00:10:07.340 | So you've seen the ChatGPT Code Interpreter,
00:10:09.740 | which was renamed to Advanced Data Analysis.
00:10:12.920 | So what's kind of like the future
00:10:16.580 | of like the semantic layer in AI, you know,
00:10:18.460 | what are like some of the use cases that you're seeing
00:10:20.480 | and why do you think it's a good strategy
00:10:22.980 | to make it easier to do now the text to SQL
00:10:25.900 | you wanted to do seven years ago?
00:10:27.540 | - Yeah, so, I mean, you know, when it started to happen,
00:10:30.420 | I was just like, oh my God,
00:10:31.620 | people are now building Statsbot with Cube.
00:10:34.000 | They just have a better technology for, you know,
00:10:36.080 | like natural language.
00:10:37.800 | So it kind of, it made sense to me, you know,
00:10:39.940 | like from the first moment I saw it.
00:10:42.060 | So I think it's something that, you know,
00:10:43.420 | like happening right now.
00:10:46.700 | And that's, chat bot is one of the use cases.
00:10:49.780 | I think, you know, like if you try to generalize it,
00:10:52.940 | the use case would be how do we use structured
00:10:56.980 | or tabular data with, you know, like AI models, right?
00:11:00.740 | Like how do we turn the data
00:11:02.860 | and give the context to the data
00:11:04.780 | and then bring it to the model
00:11:06.260 | and then the model can, you know, like give you answers,
00:11:09.220 | ask questions, do whatever you want.
00:11:11.260 | But the question is like how we go from just the data
00:11:13.780 | in your data warehouse, database, whatever,
00:11:16.300 | which is usually just a tabular data, right?
00:11:18.180 | Like in a SQL based warehouses to some sort of, you know,
00:11:21.460 | like a context that the system can use.
00:11:24.380 | And if you're building this application,
00:11:26.140 | you have to do it.
00:11:26.980 | It's like no way you can get away around not doing this.
00:11:30.700 | You either map it manually
00:11:32.700 | or you come up with some framework or something else.
00:11:35.420 | So our take is that, and my take is that semantic layer
00:11:38.020 | is just really good place for this context to live
00:11:41.460 | because you need to give this context to the humans.
00:11:43.620 | You need to give that context to the AI system anyway, right?
00:11:46.700 | So that's why you define metric once.
00:11:48.620 | And then, you know, like you teach your AI system
00:11:51.820 | what this metric is about.
00:11:54.340 | - What are some of the challenges
00:11:56.500 | of using tabular versus language data
00:11:59.420 | and some of the ways that having the semantic layer
00:12:01.780 | kind of makes that easier maybe?
00:12:03.460 | - I feel like, imagine you're a human, right?
00:12:05.820 | And you're going in as, like, a new data analyst at a company
00:12:09.860 | and just people give you a warehouse
00:12:11.860 | with a bunch of tables and they tell you,
00:12:14.060 | okay, just try to make sense of this data.
00:12:16.220 | And you're going through all of these tables
00:12:17.940 | and you're really like trying to make sense
00:12:19.780 | without any, you know, like additional context
00:12:22.060 | or like some columns, you know,
00:12:24.100 | like in many cases they might have weird names.
00:12:27.180 | Sometimes, you know, if they follow some kind of
00:12:30.100 | like a star schema or like a Kimball style dimensions,
00:12:32.940 | maybe that would be easier
00:12:34.180 | because you would have facts and dimensions column,
00:12:36.540 | but it's still, it's hard to understand
00:12:38.380 | and to kind of make sense
00:12:39.740 | because it doesn't have descriptions, right?
00:12:42.220 | And then there is, like, a whole industry
00:12:45.420 | of data catalogs that exists
00:12:47.180 | because the whole purpose of that is
00:12:48.820 | to give context to the data so people can understand it.
00:12:53.140 | And I think the same applies to the AI, right?
00:12:55.100 | Like, and the same challenge is that
00:12:56.780 | if you give it pure tabular data,
00:13:00.180 | it doesn't have this sort of context that it can read.
00:13:03.260 | So you sort of needed to write a book
00:13:05.220 | or, like, an essay about your data
00:13:07.220 | and give that book to the system so it can understand it.
00:13:10.940 | - Can you run through the steps of how that works today?
00:13:14.940 | So the initial part is like the natural language query,
00:13:18.100 | like what are the steps that happen in between
00:13:20.460 | to do model to semantic layer,
00:13:23.060 | semantic layer to SQL and all that flow?
00:13:26.900 | - The first key step is to do some sort of indexing.
00:13:31.540 | So that's what I was referring to,
00:13:34.300 | like write a book about your data, right?
00:13:36.220 | Like describe in a text format what your data is about,
00:13:41.220 | right, like what metrics it has, dimensions,
00:13:44.220 | what is the structures of that,
00:13:45.820 | what a relationship between those metrics,
00:13:48.180 | what are potential values of the dimension.
00:13:50.180 | So sort of, you know, like build a really good index
00:13:52.940 | as a text representation, then turn it into embeddings
00:13:57.940 | in your, you know, like vector storage.
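The indexing step described here can be sketched roughly like this (a hypothetical illustration: the metric definition and the `describe` helper are invented, and the actual embedding call to a vector store is omitted since any provider would do):

```python
# Sketch of "write a book about your data": render semantic-layer
# definitions as text documents that could then be embedded and stored
# in a vector store for retrieval at question time.

metrics = {
    "active_opportunities": {
        "description": "Count of opportunities whose stage is not closed.",
        "dimensions": ["owner", "created_month"],
        "sample_values": {"owner": ["alice", "bob"]},
    },
}

def describe(name, meta):
    """Render one metric as a plain-text document for indexing."""
    lines = [
        f"Metric: {name}",
        f"Description: {meta['description']}",
        f"Dimensions: {', '.join(meta['dimensions'])}",
    ]
    for dim, values in meta.get("sample_values", {}).items():
        lines.append(f"Sample values of {dim}: {', '.join(values)}")
    return "\n".join(lines)

documents = [describe(name, meta) for name, meta in metrics.items()]
# Next step (omitted): embed each document and write it to vector storage,
# then retrieve the relevant ones as context when a user asks a question.
print(documents[0])
```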
00:14:00.700 | Once you have that, then you can sort of, you know,
00:14:04.380 | like provide that as a context to the model.
00:14:08.660 | I mean, there are like a lot of options,
00:14:10.260 | like either fine tune or, you know,
00:14:11.860 | like sort of in context learning,
00:14:13.380 | but somehow kind of give that
00:14:14.860 | as a context to the model, right?
00:14:17.780 | And once this model has this context,
00:14:20.140 | it can create a query.
00:14:22.060 | Now the query, I believe,
00:14:23.860 | should be created against semantic layer
00:14:26.740 | because it reduces the room for the error.
00:14:30.300 | Because what usually happens is that your query
00:14:32.820 | to semantic layer would be very simple.
00:14:35.260 | It would be like, give me that metric group
00:14:37.780 | by that dimension and maybe that filter should be applied.
00:14:41.140 | And then your real query for the warehouse,
00:14:46.100 | it might have, like, five joins,
00:14:46.100 | a lot of different, you know, like techniques,
00:14:48.540 | like how to avoid fan out, fan traps,
00:14:51.660 | chasm traps, all of that stuff.
00:14:53.900 | And the bigger the query,
00:14:56.260 | the more room there is for the model to make an error, right?
00:14:59.660 | Like even sometimes it could be a small error
00:15:01.940 | and then, you know, like your numbers are going to be off.
00:15:04.220 | But making a query against semantic layer,
00:15:07.340 | that sort of reduces the error.
00:15:09.380 | So the model generates a SQL query
00:15:11.580 | and then it executes it against the semantic layer.
00:15:14.340 | The semantic layer executes it against your warehouse
00:15:17.540 | and then sends the result all the way back to your application.
00:15:22.340 | And then it can be done multiple times
00:15:24.140 | because what we were missing was just about this ability
00:15:27.140 | to have a conversation, right, with the model.
00:15:30.060 | Like you can ask question
00:15:31.580 | and then the system can ask follow-up questions,
00:15:34.620 | you know, like then do a query to get some information,
00:15:37.540 | additional information based on this information,
00:15:39.540 | do a query again.
00:15:40.820 | And sort of, you know, like it can keep doing this stuff
00:15:42.940 | and then eventually maybe give you a big report
00:15:45.460 | that consists of a lot of like data points.
00:15:48.220 | But the whole flow is that the system knows,
00:15:52.540 | it knows your data
00:15:53.380 | because you already kind of did the indexing.
00:15:55.700 | And then it queries semantic layer
00:15:58.060 | instead of a data warehouse directly.
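The conversational loop described here might look like this in outline (purely illustrative: the scripted `plan` and the stubbed `semantic_layer` stand in for an LLM deciding its own follow-up queries against a real semantic layer):

```python
# Sketch of the multi-step flow: issue several semantic-layer queries,
# then assemble the results into one report.

def semantic_layer(query):
    # Stub: pretend to run the compiled warehouse SQL for a metric.
    data = {"revenue": 175.0, "orders": 3}
    return data[query["measure"]]

def answer(question, plan):
    """Run each planned follow-up query, collecting results into a report."""
    report = {}
    for step in plan:                      # the conversation loop
        report[step["measure"]] = semantic_layer(step)
    return f"{question}: " + ", ".join(f"{k}={v}" for k, v in report.items())

plan = [{"measure": "revenue"}, {"measure": "orders"}]
print(answer("How did we do last week", plan))
# How did we do last week: revenue=175.0, orders=3
```

In a real system, each step's result would feed back into the model, which decides whether it needs another query before producing the final report.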
00:16:00.940 | - Maybe just to make it a little clearer
00:16:03.340 | for people that haven't used a semantic layer before,
00:16:06.500 | you can have definitions like revenue
00:16:09.020 | where revenue is like select from customers
00:16:12.300 | and like join orders
00:16:13.460 | and then sum of the amount of orders.
00:16:15.460 | But in the semantic layer,
00:16:16.660 | you're kind of hiding all of that away.
00:16:18.660 | So when you do natural language to Cube,
00:16:21.580 | you just select revenue from last week,
00:16:24.020 | and then it turns into a bigger and bigger query.
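The revenue example can be sketched as a toy semantic layer (an illustration using SQLite, not Cube's real compiler): the model's request stays as simple as naming the metric, while the join-heavy SQL lives hidden in the layer's definition.

```python
# Toy semantic layer: metric names map to the real warehouse SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 25.0);
""")

METRICS = {
    "revenue": """
        SELECT SUM(o.amount)
        FROM customers c JOIN orders o ON o.customer_id = c.id
    """,
}

def query_metric(name):
    """The 'simple query' the model emits is just the metric name."""
    return conn.execute(METRICS[name]).fetchone()[0]

print(query_metric("revenue"))  # 175.0
```

The smaller surface the model has to generate ("give me revenue") leaves far less room for error than emitting the full join itself.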
00:16:27.820 | - One of the biggest difficulties around semantic layer
00:16:30.380 | for people who've never thought about this concept before,
00:16:32.980 | this all sounds super neat
00:16:34.780 | until you have multiple stakeholders
00:16:37.940 | within a single company
00:16:39.180 | who all have different concepts of what revenue is,
00:16:42.540 | or they all have different concepts
00:16:43.500 | of what an active user is.
00:16:44.900 | And then so they'll have, like, revenue revision one
00:16:48.540 | by the sales team and then revenue revision one
00:16:52.420 | by the accounting team or tax team, I don't know.
00:16:54.380 | I feel like I always want semantic layer discussions
00:16:57.980 | to talk about the not so pretty parts of the semantic layer
00:17:02.260 | because this is where effectively you ship your org chart
00:17:05.500 | in the semantic layer.
00:17:06.860 | - I think the way I think about it is that
00:17:09.180 | at the end of the day,
00:17:10.020 | semantic layer is a code base
00:17:11.900 | and in Cube, it's essentially a code base, right?
00:17:14.100 | It's just a set of YAML files with Python.
00:17:17.500 | I think code is never perfect.
00:17:19.540 | We know that, we're like software engineers, right?
00:17:21.980 | It's never going to be perfect.
00:17:23.220 | You will have a lot of, you know, like revisions of code.
00:17:25.660 | We have version control,
00:17:26.900 | which makes revisions easier.
00:17:28.980 | So I think we should treat our metrics
00:17:30.740 | in the semantic layer as code, right?
00:17:33.900 | And then collaboration is a big part of it.
00:17:36.100 | You know, like if there are like a multiple teams
00:17:37.900 | that sort of have a different opinions,
00:17:39.700 | let them collaborate on the pull request.
00:17:41.860 | You know, they can discuss that.
00:17:43.540 | Like why they think that should be calculated differently.
00:17:45.740 | Have an open conversation about it.
00:17:48.780 | You know, like when everyone can just discuss it
00:17:50.860 | like an open source community, right?
00:17:52.420 | Like you go on a GitHub and you talk about
00:17:54.140 | why that code is written the way it's written, right?
00:17:57.060 | It should be written differently.
00:17:58.620 | And then hopefully at some point you can come up,
00:18:01.140 | you know, like to some definition.
00:18:03.220 | Now, if you still have multiple versions, right?
00:18:06.660 | It's a code, right?
00:18:07.820 | So you can still manage it.
00:18:09.660 | But I think the big part of that
00:18:11.420 | is that like we really need to treat it as a code base.
00:18:14.220 | Then it makes a lot of things easier,
00:18:15.860 | not as spreadsheets, you know, like hidden Excel files.
00:18:20.180 | - The other thing is like,
00:18:21.820 | then having the definition spread in the organization,
00:18:24.940 | you know, like versus everybody trying to come up
00:18:27.940 | with their own thing.
00:18:28.980 | But yeah, I'm sure that when you talk to customers,
00:18:31.420 | there's people that, you know, have issues with the product
00:18:34.660 | and it's really like two people
00:18:35.820 | trying to define the same thing.
00:18:37.100 | One in sales that wants to look good.
00:18:39.100 | The other is like the finance team
00:18:40.540 | that wants to be conservative
00:18:41.900 | and they all have different definitions.
00:18:44.780 | How important is the natural language to people?
00:18:47.500 | So obviously, you know, you guys both work
00:18:51.140 | in modern data stack companies either now or before.
00:18:55.140 | There's gonna be the whole wave
00:18:56.740 | of empowering data professionals.
00:18:59.420 | I think now a big part of the wave
00:19:01.540 | is removing the need for data professionals
00:19:04.140 | to always be in the loop
00:19:05.580 | and having non-technical folks do more of the work.
00:19:08.300 | Are you seeing that as a big push too with these models,
00:19:12.100 | like allowing everybody to interact with the data?
00:19:15.220 | Yeah, any customer stories you can share,
00:19:17.500 | anything like that?
00:19:18.500 | - I think it's a multidimensional question.
00:19:20.300 | That's an example of, you know,
00:19:21.660 | like where you have a lot inside the question.
00:19:24.740 | So in terms of examples,
00:19:28.540 | I think a lot of people building different,
00:19:30.460 | you know, like agents or chatbots.
00:19:32.140 | We have a company that built an internal Slack bot
00:19:35.940 | that sort of answers questions, you know,
00:19:37.940 | like based on the data in a warehouse.
00:19:40.260 | And then like a lot of people kind of go in
00:19:41.940 | and like ask that chatbot this question.
00:19:44.940 | Is it like a real big use case?
00:19:47.900 | Maybe.
00:19:48.740 | Is it still like a toy pet project?
00:19:51.380 | Maybe, too, right now.
00:19:52.460 | I think it's really hard to tell them apart at this point
00:19:55.820 | because there is a lot of like a hype
00:19:57.380 | and, you know, just people building LLM stuff
00:20:00.020 | because it's cool.
00:20:01.020 | And everyone wants to build something,
00:20:02.500 | you know, like kind of even at least a pet project.
00:20:05.260 | So that's what happened in the Cube community as well.
00:20:07.420 | We see a lot of like people building a lot of cool stuff
00:20:10.300 | and it probably will take some time
00:20:12.300 | for that stuff to mature
00:20:13.980 | and kind of to see like what are real, the best use cases.
00:20:17.260 | But I think what I saw so far,
00:20:18.860 | one use case was building this chatbot
00:20:20.740 | and we have even one company
00:20:21.940 | that are building it as a service.
00:20:23.940 | So they essentially connect into Cube's semantic layer
00:20:27.660 | and then offer their, like, chatbot
00:20:30.260 | so you can use it on the web, in Slack,
00:20:32.420 | so it can, you know, like answer questions
00:20:34.300 | based on data in your semantic layer.
00:20:36.660 | But I also see a lot of things
00:20:39.180 | that are just being built in-house.
00:20:41.420 | And there are other use cases,
00:20:42.740 | some sort of automation, you know,
00:20:44.300 | like where an agent checks on the data
00:20:47.620 | and then kind of performs some actions
00:20:49.940 | based, you know, like on changes in data.
00:20:52.740 | But other dimension of your question is like,
00:20:56.580 | will it replace people or not?
00:20:59.180 | I think, you know, like what I see so far
00:21:01.300 | in data specifically,
00:21:02.740 | there are like a few use cases of LLM.
00:21:06.460 | I don't see you being part of that use case,
00:21:09.300 | but it's more like a co-pilot for a data analyst,
00:21:12.020 | a co-pilot for data engineer,
00:21:13.980 | where you develop something, you develop a model
00:21:16.260 | and it can help you to write a SQL or something like that.
00:21:19.100 | So, you know, it can create a boilerplate SQL
00:21:21.820 | and then you can edit this SQL,
00:21:23.900 | which is fine because you know how to edit SQL, right?
00:21:26.340 | So, you're not going to make a mistake,
00:21:28.580 | but it will help you to just generate, you know,
00:21:30.860 | like a bunch of SQL that you write again and again, right?
00:21:34.060 | Like boilerplate code.
00:21:35.660 | So sort of a co-pilot use case.
00:21:37.660 | I think that's great and we'll see more of it.
00:21:39.740 | I think every platform that is building for data engineers
00:21:43.660 | will have some sort of a co-pilot capabilities
00:21:45.700 | and Cube included,
00:21:46.540 | we're building these co-pilot capabilities
00:21:48.700 | to help people build semantic layers easier.
00:21:51.380 | I think that's just a baseline
00:21:53.060 | for every engineering product right now
00:21:54.820 | to have some sort of, you know, like a co-pilot capabilities.
00:21:57.740 | Then the other use case is a little bit more
00:22:00.340 | where Cube has been involved is like,
00:22:02.460 | how do we enable access to data for non-technical people
00:22:05.860 | through the natural language as an interface to data, right?
00:22:08.980 | Like visual dashboards, charts,
00:22:11.780 | they have always been the interface to data in every BI.
00:22:15.580 | Now, I think we will see just a second interface
00:22:19.460 | as just kind of a natural language.
00:22:21.860 | So I think at this point,
00:22:22.900 | many BIs will add it as a commodity feature.
00:22:25.620 | It's like Tableau will probably have a search bar
00:22:28.900 | at some point saying like, "Hey, ask me a question."
00:22:31.420 | I know that some of the, you know, like AWS QuickSight,
00:22:35.500 | they're about to announce features like this
00:22:37.900 | in their like BI and I think Power BI will do that,
00:22:40.860 | especially with their deal with OpenAI.
00:22:43.340 | So every company, every BI will have
00:22:45.540 | some sort of search capability built in.
00:22:48.820 | So I think that's just going to be a baseline feature
00:22:51.060 | for them as well, but that's where Cube can help
00:22:53.740 | because we can provide that context, right?
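The "semantic layer as context" idea can be sketched roughly like this. The metric definitions and the prompt shape below are illustrative assumptions, not Cube's actual data model or API:

```python
# Minimal sketch: feed semantic-layer metric definitions into an LLM
# prompt so the model queries governed metrics instead of guessing raw
# table columns. METRICS and build_prompt are hypothetical.

METRICS = {
    "active_users": {
        "sql": "COUNT(DISTINCT user_id)",
        "description": "Users with at least one event in the period",
        "dimensions": ["signup_cohort", "plan", "country"],
    },
    "revenue": {
        "sql": "SUM(amount)",
        "description": "Gross revenue in USD",
        "dimensions": ["plan", "country"],
    },
}

def build_prompt(question: str) -> str:
    """Embed the governed definitions as context for the model."""
    lines = ["You can only use these metrics and dimensions:"]
    for name, meta in METRICS.items():
        dims = ", ".join(meta["dimensions"])
        lines.append(f"- {name}: {meta['description']} (dimensions: {dims})")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

print(build_prompt("How many active users did we have last week?"))
```

The point is only that the model sees a closed vocabulary of metrics; a real system would send this prompt to an LLM and validate the response against the same definitions.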
00:22:55.860 | - Do you know how, or do you have an idea
00:22:57.700 | for how these products will differentiate
00:23:00.340 | once you get the same interface?
00:23:02.260 | So right now there's like, you know,
00:23:04.260 | Tableau is like super complicated
00:23:06.300 | and Superset is like easier.
00:23:08.420 | Yeah, do you just see everything will look the same
00:23:12.100 | and then how do people differentiate?
00:23:14.700 | - It's like they all have line chart, right?
00:23:17.100 | And they all have bar chart.
00:23:18.820 | So I feel like it's pretty much the same.
00:23:20.820 | I don't think the BI market will consolidate,
00:23:24.380 | it's going to stay fragmented as well.
00:23:28.260 | And every major vendor and most of the vendors
00:23:31.100 | will try to have some sort of natural language capabilities
00:23:34.380 | and they might be a little bit different.
00:23:35.940 | Some of them will try to position
00:23:38.340 | the whole product around it.
00:23:40.540 | Some of them will just have them as a checkbox, right?
00:23:43.980 | So we'll see, but I don't think it's going to be something
00:23:47.380 | that will change the BI market, you know,
00:23:49.820 | like something that can take the BI market
00:23:52.780 | and make it more consolidated
00:23:54.700 | rather than, you know, like what we have right now.
00:23:56.460 | I think it will still remain fragmented.
00:23:59.660 | - Let's talk a bit more about application use cases.
00:24:03.620 | So people also use Cube for kind of like analytics
00:24:06.900 | on their product, like dashboards and things like that.
00:24:11.380 | How do you see that changing in more,
00:24:14.420 | especially like when it comes to like agents, you know,
00:24:16.540 | so there's like a lot of people trying to build agents
00:24:19.180 | for reporting, building agents for sales.
00:24:22.140 | Like if you're building a sales agent,
00:24:23.700 | you need to know everything about the purchasing history
00:24:26.580 | of the customer, all of these things.
00:24:29.100 | Yeah, any thoughts there?
00:24:31.100 | What should all the AI engineers listening
00:24:33.900 | think about when implementing data into agents?
00:24:37.780 | - Yeah, I think kind of, you know,
00:24:39.860 | like trying to solve for two problems.
00:24:41.740 | One is how to make sure that agents or LLM model, right,
00:24:46.740 | has enough context about, you know, like a tabular data.
00:24:50.740 | And also, you know, like how do we deliver updates
00:24:53.300 | to the context, which is also important
00:24:55.220 | because data is changing, right?
00:24:56.780 | So every time we change something upstream,
00:24:59.140 | we need to make sure we update that context
00:25:01.380 | in our vector database or something.
00:25:04.540 | And how do you make sure that the queries are correct?
00:25:07.820 | You know, I think it's obviously a big pain
00:25:09.940 | in this whole, you know, like AI kind of, you know,
00:25:12.500 | like a space right now, how do we make sure
00:25:13.940 | that we don't, you know, provide wrong answers?
00:25:16.940 | But I think, you know, being able to reduce
00:25:20.300 | the room for error as much as possible
00:25:22.500 | is what I would look for, you know,
00:25:24.100 | like to try to minimize potential damage.
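The "deliver updates to the context" problem mentioned above can be sketched as a change-detection loop. The embedding function and in-memory "vector store" are stand-ins for a real embedding model and database:

```python
# Hedged sketch: when a metric definition changes upstream, detect the
# change by content hash and re-embed only the affected entries so the
# vector-store context stays in sync with the data model.

import hashlib

def embed(text: str) -> list[float]:
    # Placeholder embedding; a real system would call an embedding model.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

vector_store: dict[str, dict] = {}  # name -> {"hash": ..., "vector": ...}

def sync_definitions(definitions: dict[str, str]) -> list[str]:
    """Upsert changed metric definitions; return names that were refreshed."""
    refreshed = []
    for name, text in definitions.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        entry = vector_store.get(name)
        if entry is None or entry["hash"] != h:
            vector_store[name] = {"hash": h, "vector": embed(text)}
            refreshed.append(name)
    return refreshed

sync_definitions({"active_users": "count of distinct users"})
# An upstream change: only the changed entry gets re-embedded.
changed = sync_definitions({"active_users": "count of distinct users, last 30 days"})
```

Running the sync again with unchanged definitions refreshes nothing, which is the property that keeps re-embedding cheap.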
00:25:28.740 | And then, yeah, I feel like our use case, you know,
00:25:32.660 | like for Cube, it's been,
00:25:36.740 | Cube has been used a lot to power
00:25:38.660 | sort of customer-facing analytics.
00:25:40.460 | So I don't think that much is going to change,
00:25:43.560 | it's that I feel like, again, more and more products
00:25:45.980 | will adopt natural language interfaces
00:25:48.860 | as sort of a part of that product as well.
00:25:51.380 | So we would be able to power this business
00:25:55.140 | to not only, you know, like charts, visuals,
00:25:58.140 | but also some sort of, you know, like summaries,
00:26:01.400 | you know, like probably in the future,
00:26:02.900 | you're going to open the page with some surface stats
00:26:06.540 | and you will have a smart summary kind of generated by AI.
00:26:09.580 | And that summary can be powered by Cube, right?
00:26:11.820 | Like, because the rest is already being powered by Cube.
00:26:14.660 | - You know, we had Linus from Notion on the pod
00:26:18.180 | and one of the ideas he had that I really like
00:26:20.580 | is kind of like thumbnails of text,
00:26:23.500 | kind of like how do you like compress knowledge
00:26:25.980 | and then start to expand it.
00:26:27.700 | A lot of that comes into dashboards, you know,
00:26:29.900 | where like you have a lot of data,
00:26:31.340 | you have like a lot of charts
00:26:32.420 | and sometimes you just want to know,
00:26:34.060 | hey, this is like the three lines summary of it.
00:26:36.820 | Yeah, and yeah, makes sense that you would want to power that.
00:26:42.220 | So are you, how are you thinking about, yeah,
00:26:44.460 | the evolution of like the modern data stack
00:26:47.320 | and in quotes, whatever that means today,
00:26:49.900 | what's like the future of what people are going to do?
00:26:53.260 | What's the future of like what models and agents
00:26:55.700 | are gonna do for them?
00:26:57.620 | Do you have any thoughts?
00:26:59.620 | - I feel like modern data stack
00:27:01.380 | sometimes it's not very connected.
00:27:03.420 | I mean, it's obviously a big crossover between AI,
00:27:06.460 | you know, like ecosystem, AI infrastructure ecosystem
00:27:09.740 | and then sort of the data,
00:27:11.780 | but I don't think it's a full overlap.
00:27:14.620 | So I feel like, you know,
00:27:15.900 | like I'm looking at a lot of like what's happening
00:27:17.780 | in a modern data stack, right?
00:27:19.580 | Like where like we use warehouses,
00:27:23.020 | we use BI's, you know, different like transformation tools,
00:27:26.780 | catalogs, like data quality tools, ETLs, all of that.
00:27:30.500 | I don't see a lot of it being impacted by AI specifically.
00:27:35.020 | I think, you know, that space is being impacted
00:27:37.500 | as much as any other space in terms of,
00:27:40.700 | yes, we'll have all those copilot capabilities,
00:27:43.420 | some of AI capabilities here and there,
00:27:45.500 | but I don't see anything sort of dramatically,
00:27:48.860 | you know, being sort of, you know, changed or shifted
00:27:52.220 | because of, you know, like the AI wave.
00:27:54.380 | In terms of just in general data space,
00:27:57.220 | I think, you know, like in the last two, three years,
00:27:59.660 | we saw an explosion, right?
00:28:01.260 | Like we got like a lot of tools,
00:28:03.100 | every vendor for every problem.
00:28:05.660 | I feel like right now we should go
00:28:07.380 | through the cycle of consolidation.
00:28:09.900 | And, you know, like, I mean, if Fivetran and dbt merge,
00:28:13.900 | they can be the Alteryx of a new generation or something like,
00:28:16.900 | - Yeah.
00:28:18.100 | - And, you know, probably some ETL too there,
00:28:21.460 | but I feel it might happen.
00:28:23.460 | I mean, it just natural waves, you know, like in cycles.
00:28:26.940 | - I wonder if everybody is gonna have their own copilot.
00:28:29.660 | The other thing I think about these models is like,
00:28:31.940 | you know, Swyx was at Airbyte and yeah, there's Fivetran.
00:28:35.620 | - That's the ETL thing.
00:28:37.580 | But Fivetran versus Airbyte,
00:28:39.780 | I don't think that maps all that well.
00:28:41.980 | - There's the, you know, a lot of times these companies
00:28:46.340 | are doing the syntax work for you
00:28:48.660 | of like building the integration between your data store
00:28:51.220 | and like the app or another data store.
00:28:53.340 | I feel like now these models are pretty good at coming up
00:28:56.540 | with the integration themselves, you know,
00:28:58.500 | and like using the docs to then connect the two.
00:29:00.660 | So I'm really curious, like in the future,
00:29:02.780 | what that will look like, you know.
00:29:06.220 | And same with data transformation.
00:29:07.580 | I mean, you think about dbt and some of these tools
00:29:10.620 | and right now you have to create rules to normalize
00:29:13.580 | and transform data, but in the future,
00:29:16.580 | I could see you explaining the model,
00:29:18.820 | how you want the data to be,
00:29:20.500 | and then the model figuring out
00:29:21.860 | how to do the transformation.
00:29:23.380 | But yeah, I think it all needs a semantic layer
00:29:28.460 | as far as like figuring out what to do with it, you know,
00:29:30.740 | what's the data for, where it goes.
00:29:32.900 | - Yeah, I think many of this, you know,
00:29:34.900 | like workflows will be augmented by, you know,
00:29:38.540 | like some sort of a copilot.
00:29:40.300 | You know, you can describe what transformation
00:29:43.100 | you want to see and it can generate a boilerplate, right,
00:29:46.060 | of transformation for you.
00:29:47.340 | Or even, you know, like kind of generate a boilerplate
00:29:49.940 | of specific ETL driver or ETL integration.
00:29:54.140 | I think we're still maybe not at the point
00:29:58.660 | where this code can be fully automated.
00:30:00.820 | So we still need a human in the loop, right,
00:30:02.460 | like who can use this copilot.
00:30:05.700 | But in general, I think, yeah,
00:30:07.860 | data work and software engineering work
00:30:10.260 | can be augmented quite significantly with all that stuff.
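The human-in-the-loop co-pilot flow described here (model drafts boilerplate, person reviews and edits it) can be sketched as below. `draft_sql` is a stub standing in for a real LLM call; the table names are made up:

```python
# Sketch of co-pilot-assisted transformation work: the model drafts
# boilerplate SQL from a plain-language description, and a human edits
# the draft before anything is applied.

def draft_sql(description: str) -> str:
    # Stub: a real co-pilot would send `description` to an LLM.
    return (
        "SELECT user_id,\n"
        "       DATE_TRUNC('day', created_at) AS day,\n"
        "       COUNT(*) AS events\n"
        "FROM raw_events\n"
        "GROUP BY 1, 2"
    )

def copilot_transform(description: str, review) -> str:
    """Generate boilerplate, then hand it to a human reviewer to edit."""
    boilerplate = draft_sql(description)
    return review(boilerplate)  # the human stays in the loop

# The "reviewer" here just tweaks the draft; in practice it's an editor session.
final_sql = copilot_transform(
    "daily event counts per user",
    review=lambda sql: sql.replace("raw_events", "analytics.raw_events"),
)
```

The design choice matches the point in the conversation: the model saves typing on repetitive boilerplate, while correctness still rests with someone who knows how to read and edit SQL.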
00:30:14.500 | - I think the other important thing with data too
00:30:16.900 | is like sometimes, you know,
00:30:20.420 | the big thing with machine learning before was like,
00:30:22.060 | well, all of your data is bad, you know,
00:30:23.820 | the data is not good for anything.
00:30:26.180 | And I think like now at least with these models,
00:30:30.380 | they have some knowledge of their own
00:30:31.940 | and they can also tell you if your data is bad, you know,
00:30:34.500 | which I think is like something that before you didn't,
00:30:36.540 | you didn't have.
00:30:37.420 | Any cool apps that you've seen being built on, on Cube,
00:30:40.900 | like any kind of like AI native things
00:30:43.460 | that people should think about,
00:30:45.580 | new experiences, anything like that?
00:30:47.700 | - Well, I see a lot of Slack bots.
00:30:49.660 | So, you know, like it's just,
00:30:51.140 | it's definitely like, they all remind me of Statsbot,
00:30:55.260 | but, you know, like I played with a few of them,
00:30:57.980 | they're much, much better than Statsbot.
00:30:59.900 | So I feel like it just,
00:31:01.780 | it feels like it's on the surface, right?
00:31:03.580 | It's just that use case that you really want, you know,
00:31:06.180 | think about your data engineer in your company,
00:31:08.700 | like everyone is like, and you're asking,
00:31:10.460 | "Hey, can you pull that data for me?"
00:31:12.380 | And you would be like,
00:31:13.260 | "Can I build a bot to replace myself?"
00:31:15.540 | You know, like, so they will ping that bot instead.
00:31:18.820 | So it's like, that's why a lot of people doing this.
00:31:20.540 | So I think it's a first use case
00:31:21.980 | that actually people are playing with.
00:31:23.620 | But I think inside that use case, people get creative.
00:31:26.580 | So I see bots that can actually have a dialogue with you.
00:31:29.500 | So, you know, like you would come to that bot and say,
00:31:30.980 | "Hey, show me metrics."
00:31:32.220 | And the bot would be like, "What kind of metrics?
00:31:34.500 | What do you want to look at?"
00:31:35.700 | So like, it would be like active users.
00:31:38.220 | And then it would be like,
00:31:39.060 | "How do you define active users?"
00:31:40.460 | You want to see active users, you know, like sort of cohort,
00:31:44.420 | you want to see active users
00:31:45.660 | kind of changing behavior over time,
00:31:47.300 | like a lot of like a follow-up questions.
00:31:49.100 | So it tries to sort of, you know,
00:31:51.620 | like understand what exactly you want,
00:31:53.540 | because a lot of people,
00:31:54.660 | and that's how many data analysts work, right?
00:31:57.940 | When people start to ask you something,
00:32:00.300 | you always try to understand what exactly they mean,
00:32:02.820 | Because many people,
00:32:04.380 | they don't know how to ask correct questions
00:32:07.820 | about your data.
00:32:09.140 | It's sort of an interesting spectrum.
00:32:11.460 | On one side of the spectrum, you don't know,
00:32:13.140 | you know nothing, you're just like,
00:32:13.980 | "Hey, show me metrics."
00:32:15.460 | And on the other side of the spectrum,
00:32:16.780 | you know how to write SQL,
00:32:18.380 | and you can write exact query to your data warehouse, right?
00:32:21.940 | So many people like sit a little bit in the middle,
00:32:24.420 | and this, the data analyst,
00:32:27.460 | they usually have the knowledge about your data.
00:32:30.260 | And that's why they can ask follow-up questions
00:32:32.420 | and to understand what exactly you want.
00:32:34.620 | And I saw people building bots who can do that.
00:32:37.700 | And that part is amazing.
00:32:39.860 | I mean, like generating SQL, all of that stuff,
00:32:41.820 | it's okay, it's good,
00:32:44.300 | but when the bot can actually act like they know your data
00:32:49.260 | and they can ask follow-up questions,
00:32:50.860 | I think that's great.
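The clarifying-dialogue pattern described here can be sketched as a slot-filling loop: the bot refuses to run a query until the required pieces are known, asking one follow-up question per missing piece. The slot names and questions are illustrative:

```python
# Toy sketch of a bot that asks follow-up questions like a data analyst
# would, rather than guessing what "show me metrics" means.

REQUIRED_SLOTS = {
    "metric": "What kind of metrics do you want to look at?",
    "grain": "Over what time grain - daily, weekly, or by cohort?",
}

def next_step(request: dict) -> str:
    """Return either a follow-up question or the resolved query spec."""
    for slot, question in REQUIRED_SLOTS.items():
        if slot not in request:
            return question
    return f"query: {request['metric']} by {request['grain']}"

print(next_step({}))                                    # asks about the metric first
print(next_step({"metric": "active_users"}))            # then asks about the grain
print(next_step({"metric": "active_users", "grain": "weekly"}))
```

Only once both slots are filled does the bot produce something executable; an LLM version would do the same loop with natural-language extraction instead of a dict.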
00:32:52.140 | - Yeah.
00:32:52.980 | Are there any issues with the models
00:32:55.660 | and the way they understand numbers?
00:32:57.180 | You know, one of the big complaints people have
00:32:59.420 | is like GPT, at least 3.5, cannot do math.
00:33:02.500 | You know, have you seen any limitations and improvement?
00:33:06.660 | And also when it comes to what model to use,
00:33:09.100 | do you see most people use like GPT-4
00:33:11.060 | because it's like the best at this kind of analysis?
00:33:13.500 | - I think I saw people use all kinds of models.
00:33:17.180 | To be honest, it's usually GPT, so it's not,
00:33:19.220 | I mean, inside GPT, it could be 3.5 or 4, right?
00:33:22.620 | But it's not like I see a lot of something else,
00:33:25.460 | to be honest, like I don't, I mean,
00:33:27.100 | maybe know like some open source alternatives,
00:33:31.020 | but it's pretty much, you know,
00:33:32.220 | like it feels like the market is being dominated
00:33:34.500 | by just ChatGPT, which is probably true.
00:33:38.980 | In terms of the problems,
00:33:40.540 | I think I've been chatting about it with a few people.
00:33:44.260 | So they try just kind of, you know,
00:33:45.580 | like if math is required, to do the math,
00:33:47.780 | you know, like outside of, you know, like ChatGPT itself.
00:33:50.580 | So it would be like some additional Python scripts
00:33:53.060 | or something.
00:33:53.940 | When we're talking about production level use cases,
00:33:57.660 | it's quite a lot of Python code around, you know,
00:33:59.860 | like your model to make it work, to be honest.
00:34:01.860 | It's like, it's not that magic
00:34:03.460 | that you just throw the model in it.
00:34:04.980 | Like it can give you all these answers.
00:34:06.820 | For like toy use cases, like the one we have on, you know,
00:34:09.180 | like our demo page or something, it works fine.
00:34:11.500 | It's great.
00:34:12.340 | But you know, like if you want to do like a lot
00:34:13.860 | of post-processing, do the math on your own,
00:34:16.020 | you probably need to code it in Python anyway.
00:34:18.380 | That's what I see people doing.
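The "do the math outside the model" workaround can be sketched as below: let the model pick the rows, but compute aggregates in plain Python so the arithmetic cannot be hallucinated. The row data here is made up:

```python
# Sketch: deterministic Python post-processing instead of asking an
# LLM to add numbers. The model's job ends at selecting the data.

rows = [
    {"month": "Jan", "revenue": 1200.0},
    {"month": "Feb", "revenue": 1350.0},
    {"month": "Mar", "revenue": 1500.0},
]

def summarize(rows: list[dict]) -> dict:
    """Compute total and first-to-last growth in code, not in the model."""
    total = sum(r["revenue"] for r in rows)
    growth = (rows[-1]["revenue"] - rows[0]["revenue"]) / rows[0]["revenue"]
    return {"total": total, "growth_pct": round(growth * 100, 1)}

print(summarize(rows))  # {'total': 4050.0, 'growth_pct': 25.0}
```

The model can then be asked to narrate this summary dict, which is a much safer job for it than performing the division itself.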
00:34:20.380 | - Yeah. Yeah.
00:34:21.220 | We heard the same from Harrison at LangChain,
00:34:24.580 | that most people just use OpenAI.
00:34:26.900 | We did an OpenAI Snowmode emergency podcast.
00:34:30.340 | And it was funny to like just see the reaction
00:34:32.660 | that people had to that
00:34:33.620 | and how hard it actually is to break down
00:34:37.300 | some of the monopoly.
00:34:38.540 | What else should people keep in mind, Artem?
00:34:40.780 | You're kind of like at the cutting edge of this, you know,
00:34:43.300 | if I'm looking to build a data-driven AI application,
00:34:47.580 | I'm trying to build data into my AI workflows.
00:34:50.780 | Any mistakes people should avoid?
00:34:52.900 | Any tips on the best stack to use?
00:34:55.180 | What tools to use?
00:34:56.420 | - I would just recommend moving
00:34:58.460 | to a warehouse as soon as possible.
00:35:00.700 | I think a lot of people feel that MySQL can be a warehouse,
00:35:04.100 | which maybe it can be at a lower scale,
00:35:06.260 | but you know, like it's definitely not
00:35:08.580 | from a performance perspective.
00:35:10.540 | So just kind of start with a good warehouse,
00:35:13.340 | a query engine, a lakehouse,
00:35:14.740 | that's probably like something I would recommend
00:35:17.460 | starting from day zero.
00:35:18.780 | And there are like ways to do it very cheap
00:35:21.460 | with open source technologies too,
00:35:23.140 | especially in a lakehouse architecture.
00:35:25.380 | I think, you know, I'm biased, obviously,
00:35:27.260 | but using a semantic layer, preferably Cube.
00:35:30.180 | And for, you know, like a context.
00:35:32.460 | And other than that, it's just like,
00:35:34.620 | I feel it's a very interesting space, you know,
00:35:36.940 | like in terms of AI ecosystem,
00:35:40.140 | I see a lot of people using LangChain right now,
00:35:42.180 | which is great, you know, like,
00:35:43.380 | and we build an integration,
00:35:45.500 | but I'm sure the space will continue to evolve.
00:35:48.420 | And, you know, like we'll see a lot of like
00:35:50.260 | interesting tools and maybe, you know,
00:35:51.620 | like some tools would be a better fit for a job.
00:35:54.420 | I don't, I'm not aware of any right now,
00:35:57.020 | but it's always interesting to see how it evolves.
00:36:00.020 | Also, it's a little unclear, you know,
00:36:01.740 | like how all the infrastructure around
00:36:03.860 | actually developing, testing,
00:36:05.620 | documenting all that stuff will kind of evolve too.
00:36:09.100 | But yeah, again, it's just like really interesting
00:36:12.060 | to see and observe, you know,
00:36:13.220 | what's happening in this space.
00:36:15.380 | - Okay, so before we go to the lightning round,
00:36:17.700 | I wanted to ask you your thoughts on embedded analytics.
00:36:22.700 | And in a sense, the kind of chatbots
00:36:26.180 | that people are inserting on their websites
00:36:28.300 | and building with LLMs
00:36:31.260 | is very much sort of end user programming
00:36:33.340 | or end user interaction with their own data.
00:36:36.420 | I love seeing embedded analytics.
00:36:38.060 | And for those who don't know,
00:36:39.020 | embedded analytics is basically user facing dashboards
00:36:42.060 | where you can see your own data, right?
00:36:44.820 | Instead of the company seeing data
00:36:47.020 | across all their customers,
00:36:48.420 | it's an individual user seeing their own data
00:36:51.020 | as a slice of the overall data
00:36:53.220 | that is owned by the platform that they're using.
00:36:57.100 | So I love embedded analytics,
00:36:58.940 | but actually overwhelmingly the observation that I've had
00:37:02.420 | is that people who try to build in this market
00:37:05.020 | fail to monetize.
00:37:06.140 | And I was wondering your insights on why.
00:37:08.420 | - I think overall the statement is true.
00:37:10.500 | It's really hard to monetize, you know,
00:37:12.620 | like in embedded analytics.
00:37:14.660 | That's why at Cube we're excited more
00:37:17.420 | about our internal kind of BI use case
00:37:20.220 | or like a company's a building, you know,
00:37:21.940 | like a chatbots for their internal data consumption
00:37:24.700 | or like internal workflows.
00:37:26.540 | Embedded analytics is hard to monetize
00:37:28.660 | because it's historically been dominated by the BI vendors.
00:37:33.660 | And we still, you know, like see a lot of, you know,
00:37:39.020 | like organizations using BI tools from vendors.
00:37:42.540 | So, and what I was talking about BI vendors
00:37:46.020 | adding natural language interfaces,
00:37:48.100 | they will probably add that
00:37:49.340 | to the embedded analytics capabilities as well, right?
00:37:51.740 | So they would be able to embed that too.
00:37:54.060 | So I think that's part of it.
00:37:56.620 | Also, you know, if you look at the embedded analytics market,
00:38:00.620 | the bigger the organizations, the big guys,
00:38:03.940 | the more custom, you know, like it becomes.
00:38:06.540 | And at some point I see many organizations
00:38:08.580 | they just stop using any vendor
00:38:10.500 | and they just kind of build most of the stuff from scratch,
00:38:13.860 | which probably, you know, like the right way to do it.
00:38:16.020 | So it's sort of, you know, like you got a market
00:38:18.060 | that is very capped at the top
00:38:21.660 | and then also in that middle and small segment
00:38:25.620 | you got a lot of vendors trying, you know,
00:38:28.020 | like to compete for the buyers.
00:38:29.820 | And because again, the BI is very fragmented
00:38:32.900 | embedded analytics therefore is fragmented also.
00:38:36.340 | So you're really going after the mid-market slice
00:38:39.980 | and then with a lot of other vendors competing for that.
00:38:43.060 | So that's why it's historically been hard to monetize, right?
00:38:45.780 | I don't think AI is really going to change that,
00:38:48.620 | just because if it's using a model,
00:38:51.420 | you just pay OpenAI and that's it.
00:38:54.180 | Like everyone can do that, right?
00:38:55.460 | So it's not much of a competitive advantage.
00:38:58.140 | So it's going to be more like a commodity feature
00:39:00.060 | that a lot of, like, yeah,
00:39:01.340 | vendors would be able to leverage.
00:39:03.340 | - This is great Artem.
00:39:04.500 | As usual, we got our lightning round.
00:39:06.540 | So it's three questions.
00:39:07.820 | One is about acceleration, one on exploration,
00:39:10.740 | and then a takeaway.
00:39:12.420 | The acceleration thing is what's something
00:39:14.380 | that already happened in AI or maybe, you know,
00:39:17.660 | in data that you thought would take much longer
00:39:20.340 | but it's already happening today.
00:39:22.140 | - To be honest, all this foundational models,
00:39:24.220 | I thought that, you know,
00:39:26.820 | we had a lot of models that had been in production
00:39:29.940 | for like quite, you know, maybe a decade or so.
00:39:32.700 | And it was like a very niche use cases,
00:39:35.180 | very vertical use cases.
00:39:36.660 | It's just like in very customized models.
00:39:39.020 | And even when we were building Statsbot back then in 2016,
00:39:43.460 | right, it was, even back then we had some
00:39:48.060 | natural language models being deployed,
00:39:49.940 | like a Google Translate or something that was,
00:39:52.100 | it still was sort of a model, right?
00:39:53.940 | But it was very customized with a specific use case.
00:39:56.820 | So I thought that would continue for like many years.
00:40:00.660 | We'll use AI, we'll have all these customized niche models.
00:40:04.420 | But these foundational models,
00:40:05.900 | they're like very generic now.
00:40:07.900 | They like, they can serve many, many different use cases.
00:40:11.660 | So I think that's, that is a big change.
00:40:15.100 | And I didn't expect that to be honest.
00:40:17.420 | - And the next question is about exploration.
00:40:20.740 | What is one thing that you think
00:40:22.340 | is the most interesting unsolved question in AI?
00:40:24.780 | - I think AI is a subset of software engineering in general.
00:40:29.780 | And it's sort of connected to the data as well.
00:40:33.100 | And in software, because software engineering
00:40:35.340 | as a discipline, it has quite a history.
00:40:38.420 | We build a lot of processes, you know,
00:40:40.460 | like toolkits and methodologies,
00:40:44.020 | how we project that, right?
00:40:46.620 | And now AI, I don't think it's completely different,
00:40:49.820 | but it has some unique traits.
00:40:51.620 | You know, like it's very much not idempotent, right?
00:40:54.860 | And kind of from many dimensions.
00:40:57.620 | So, and like other traits.
00:40:59.580 | So which kind of may require a different methodologies,
00:41:03.820 | may require different approaches in a different toolkit.
00:41:06.820 | I don't know how much it's going to deviate
00:41:08.660 | from standard software engineering.
00:41:10.100 | I think many sort of, you know, like tools and practices
00:41:13.420 | that we developed in software engineering
00:41:15.260 | can be applied to AI.
00:41:16.980 | And some of the data best practices
00:41:18.780 | can be applied as well.
00:41:19.780 | But it might be a very interesting subfield,
00:41:23.660 | like we got a DevOps, right?
00:41:25.100 | Like just a bunch of tools, like ecosystem.
00:41:27.580 | So now like AI kind of feels like it's shaping into that
00:41:31.260 | with a lot of its own, you know,
00:41:32.740 | like methodologies, practices and toolkits.
00:41:35.860 | So I'm really excited about it.
00:41:37.660 | And I think there are a lot of sort of, you know,
00:41:39.220 | like still unsolved questions again,
00:41:41.060 | how do we develop with that?
00:41:42.100 | How do we test, you know, like what are the best practices?
00:41:44.620 | What about the methodologies?
00:41:45.780 | So I think that would be an interesting to see.
00:41:48.140 | - Awesome.
00:41:49.180 | And then, yeah, the final message, you know,
00:41:52.180 | you have a big audience of engineers and technical folks.
00:41:56.180 | What's something you want everybody to remember,
00:41:58.460 | to think about, to explore?
00:42:00.380 | - As someone who tried to build a chatbot,
00:42:03.140 | you know, like for analytics back then,
00:42:05.300 | and kind of, you know, like looking at what,
00:42:07.220 | what people do right now,
00:42:08.060 | I think, yeah, just do that.
00:42:09.420 | I mean, it's working right now.
00:42:11.420 | So it's, with foundational models,
00:42:14.820 | it's actually now possible to build
00:42:16.780 | all those cool applications.
00:42:18.620 | The thing, you know, like it's,
00:42:20.260 | I'm so excited to see, you know,
00:42:22.500 | like how much changed in the last six years
00:42:25.780 | or so that we actually now can build a smart agents.
00:42:28.460 | I think that's sort of, you know, like the takeaway.
00:42:30.540 | And yeah, we are, as you know, like as humans in general,
00:42:35.340 | we're like, we really move technology forward
00:42:38.340 | and it's fun to see, you know, like, just firsthand.
00:42:41.420 | - Ah, well, thank you so much for coming on Artem.
00:42:45.980 | This was great.
00:42:47.460 | (upbeat music)