Powering your Copilot for Data - with Artem Keydunov from Cube.dev

(upbeat music) - Hey everyone. Welcome to the Latent Space Podcast. This is swyx, writer, editor of Latent Space and founder of Smol.ai and Alessio, partner and CTO in residence at Decibel Partners. - Hey everyone. And today we have Artem Keydunov on the podcast, co-founder of Cube. Hey Artem. - Hey Alessio, hi swyx.

Good to be here today. Thank you for inviting me. - Yeah, thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. And Cube is actually a spin out of his previous company, which is Statsbot. And this kind of feels like going both backward and forward in time.

So the premise of Statsbot was having a Slack bot that you can ask, basically like text to SQL in Slack. And this was six, seven years ago, or something like that. So a lot ahead of its time and you see startups trying to do that today. And then Cube came out of that as a part of the infrastructure that was powering Statsbot.

And Cube then evolved from an embedded analytics product to the semantic layer and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today. You have a very active open source community. But maybe for people at home, just give a quick like lay of the land of the original Statsbot product.

You know, what got you interested in like text to SQL and what were some of the limitations that you saw then, the limitations that you're also seeing today in the new landscape? - I started Statsbot in 2016. So the original idea was to just make sort of a side project based off my initial project that I did at a company that I was working for back then.

And I was working for a company that was building software for schools. And we were using Slack a lot. And Slack was growing really fast. A lot of people were talking about Slack, you know, like Slack apps, chatbots in general. So I think it was, you know, like another wave of, you know, bots and all of that.

We have one more wave right now, but it always comes in waves. So we were living through one of those waves. And I wanted to build a bot that would bring information from the different places where data lives into Slack. So this was, you know, developer data, like New Relic, maybe some marketing data, Google Analytics, and then just regular data, like production databases or even Salesforce sometimes.

And I wanted to bring it all into Slack because we were always talking, chatting, you know, like in Slack, and I wanted to see some stats in Slack. So that was the idea of Statsbot, right? Like bring stats to Slack. I built that as a, you know, like a first sort of a side project, and I published it on Reddit, and people started to use it even before Slack came up with that Slack application directory.

So it was a little, you know, like a hackish way to install it, but people were still installing it. So it was a lot of fun, and then Slack came up with that application directory, and they reached out to me, and they wanted to feature Statsbot because it was already one of the more widely used bots on Slack.

So they featured me on this application directory front page, and I just got a lot of, you know, like new users signing up for that. It was a lot of fun, I think, you know, like, but it was sort of a big limitation in terms of how you can process natural language, because the original idea was to let people ask questions directly in Slack, right?

Hey, show me my, you know, like opportunities closed last week or something like that. My co-founder, who started helping me with this Slack application, he and I were trying to build a system to recognize that natural language, but, you know, we didn't have LLMs, right, back then and all of those technologies, so it was really hard to build the system, especially a system that can keep talking to you, maintain some sort of a dialogue.

It was a lot of one-off requests, and it was a lot of hit and miss, right? If you knew how to construct a query in natural language, you would get a result back, but, you know, it was not a system that was capable of asking follow-up questions to try to understand what you actually want, and then finally bringing all this context together to generate a SQL query, get the result back and all of that.

So that was a really missing part, and I think right now that's, you know, like what is the difference? So right now I'm kind of bullish that, you know, like if I would start Statsbot again, probably would have a much better shot at it, but back then that was a big limitation.

Funny thing is that we kind of built Cube, right, as we were working on Statsbot because we needed it. - Yeah. What was the ML stack at the time? Were you trying to build your own natural language understanding models? Were there open source models that were good that you were trying to leverage?

- I think it was mostly a combination of a bunch of things, and we tried a lot of different approaches. The first version, which I built, was regexes. They were working well. - This is the same as I did. I did option pricing when I was in finance, and I had a natural language pricing tool thing, and it was Regex.

It was just a lot of Regex. - Yeah, yeah. And then my co-founder, Pavel, he's much smarter than I am. He's like PhD in math, all of that. And he started to do some stuff that was like, I was like, no, you just do that stuff.

I don't know, like I can do Regex. And, you know, like he started to do like some models and trying to either, you know, like look at what we had on the market back then, or, you know, like try to build a different sort of, you know, like kind of models.

Again, we didn't have any foundation in place back then, right? We wanted to try to use existing math, obviously, right, but it was not something where we could take a model and, you know, like try and run it.

I think in 2019, we started to see more of that stuff, you know, like an ecosystem being built, and then it eventually resulted in all this LLM stuff, like what we have right now. But back then in 2016, there was not much available for people to build on top of.

There was some academic research happening, right, but it was very, very early, you know, for something you could actually use. - And then that became Cube, which was started just as an open source project. And I think I remember going on a walk with you in San Mateo in like 2020, something like that.

And you were like, you have people reaching out to you who are like, "Hey, we use Cube in production." Like, "I just need to give you some money, even though you guys are not a company." What's the story of Cube then from Statsbot to where you are today? - We built Cube at Statsbot because we needed it.

It was like the whole Statsbot stack was that we first tried to translate the initial sort of language query into some sort of multidimensional query. It's like, we were trying to understand, okay, people wanted to get active opportunities, right? What does it mean? Is it a metric? What is a dimension here?

Because usually in analytics, you always, you know, try to reduce everything down to a sort of multidimensional framework. So that was the first step. And that's where, you know, it didn't really work well because of the limitation of us not having foundational technologies.

But then from the multidimensional query, we wanted to go to SQL. And that's what the semantic layer was, and what Cube essentially was. So we built a framework where you would be able to map your data into these concepts, into these metrics. Because when people were coming to Statsbot, they were bringing their own datasets, right?

And the big question was, how do we tell the system what active opportunities means for that specific user? How do we provide that context, how do we do the training? So that's why we came up with the idea of building the semantic layer, so people can actually define their metrics and then use them in Statsbot.

So that's how we built Cube. But at some point we saw that people started to see more value in Cube itself, you know, like building the semantic layer and then using it to power different types of applications. So in 2019, we decided, okay, it feels like it might be a standalone product and a lot of people want to use it.

Let's just try to open source it. So we took it out of Statsbot and open sourced it. - Can I make sure that everyone has the same foundational knowledge? The concept of a cube is not something that you invented. I think, you know, not everyone has the same background in analytics and data that all three of us do.

Maybe you want to explain like OLAP cube, hypercube, you know, anything, whatever the brief history of cubes. - Right. I'll try, you know, like there's a lot of Wikipedia pages and a lot of blog posts trying to go into the academics of it. So I'm trying to like-- - Cubes according to you, yeah.

- Yeah, it's just, so when we think about just a table in a database, the problem with the table is that it's not multidimensional, meaning that in many cases, if we want to slice the data differently, we end up with a different table, right? Like think about when you're writing a SQL query to answer one question: a SQL query always ends up with a table, right?

So you write one SQL query, you get one table. Then to answer a different question, you write a second query. So you're kind of getting a bunch of tables. So now let's imagine that we can bring all these tables together into a multidimensional table. And that's essentially a cube.
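
To make the idea concrete, here is a minimal Python sketch of a cube as one structure that can answer many slicing questions, instead of one result table per question. The ORDERS rows, the MEASURES table, and the slice_cube helper are hypothetical names invented for this illustration; this is not Cube's API.

```python
from collections import defaultdict

# Hypothetical fact rows; names and numbers are made up for illustration.
ORDERS = [
    {"region": "EU", "status": "closed", "amount": 100},
    {"region": "EU", "status": "open", "amount": 50},
    {"region": "US", "status": "closed", "amount": 200},
    {"region": "US", "status": "closed", "amount": 25},
]

# Measures are aggregations; dimensions are the columns we can group by.
MEASURES = {
    "count": lambda rows: len(rows),
    "total_amount": lambda rows: sum(r["amount"] for r in rows),
}


def slice_cube(rows, measure, dimensions):
    """Aggregate one measure grouped by any combination of dimensions."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row)
    return {key: MEASURES[measure](grouped) for key, grouped in groups.items()}


# Two different questions answered from the same definitions.
by_region = slice_cube(ORDERS, "total_amount", ["region"])
by_region_status = slice_cube(ORDERS, "count", ["region", "status"])
```

The point of the sketch is only that measures and dimensions are defined once and combined on demand; a real semantic layer would push this work down to the warehouse as SQL.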

So it's just a way that we can have measures and dimensions that can potentially be used at the same time from different angles. - And so initially, a lot of your use cases were more, you know, BI related, but you recently released a LangChain integration.

There's obviously more and more interest in, again, using these models to answer data questions. So you've seen the ChatGPT Code Interpreter, which is renamed as like Advanced Data Analysis. So what's kind of like the future of the semantic layer in AI, you know, what are some of the use cases that you're seeing, and why do you think it's a good strategy to make it easier to do now the text to SQL you wanted to do seven years ago?

- Yeah, so, I mean, you know, when it started to happen, I was just like, oh my God, people are now building Statsbot with Cube. They just have a better technology for, you know, like natural language. So it made sense to me from the first moment I saw it.

So I think it's something that's happening right now. And the chatbot is one of the use cases. I think if you try to generalize it, the use case would be how do we use structured or tabular data with, you know, like AI models, right?

Like how do we take the data and give context to the data and then bring it to the model, and then the model can give you answers, ask questions, do whatever you want. But the question is how we go from just the data in your data warehouse, database, whatever, which is usually just tabular data, right?

Like in SQL-based warehouses, to some sort of, you know, like a context that the system can use. And if you're building this kind of application, you have to do it. There's no way you can get around doing this. You either map it manually or you come up with some framework or something else.

So our take is that, and my take is that the semantic layer is just a really good place for this context to live, because you need to give this context to the humans. You need to give that context to the AI system anyway, right? So that's why you define a metric once.

And then, you know, like you teach your AI system what this metric is about. - What are some of the challenges of using tabular versus language data and some of the ways that having the semantic layer kind of makes that easier maybe? - I feel like, imagine you're a human, right?

You're a new data analyst at a company, and people just give you a warehouse with a bunch of tables and they tell you, okay, just try to make sense of this data. And you're going through all of these tables and you're really trying to make sense of them without any additional context, and some columns, you know, in many cases they might have weird names.

Sometimes, you know, if they follow some kind of a star schema or Kimball-style dimensions, maybe that would be easier because you would have fact and dimension columns, but it's still hard to understand and to make sense of because it doesn't have descriptions, right?

And then there is a whole industry of data catalogs that exists because the whole purpose of that is to give context to the data so people can understand it. And I think the same applies to the AI, right? The same challenge is that if you give it pure tabular data, it doesn't have this sort of context that it can read.

So you sort of need to write a book or like an essay about your data and give that book to the system so it can understand it. - Can you run through the steps of how that works today? So the initial part is the natural language query; what are the steps that happen in between, from model to semantic layer, semantic layer to SQL, and all that flow?

- The first key step is to do some sort of indexing. So that's what I was referring to, like write a book about your data, right? Describe in a text format what your data is about: what metrics it has, what dimensions, what the structure is, what the relationships between those metrics are, what the potential values of the dimensions are.

So sort of, you know, build a really good index as a text representation and then turn it into embeddings in your, you know, like vector storage. Once you have that, then you can provide that as a context to the model. I mean, there are a lot of options, like either fine-tuning or, you know, sort of in-context learning, but somehow give that as a context to the model, right?
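
A minimal sketch of this "write a book about your data" step, assuming a hypothetical in-memory description of a single cube. SEMANTIC_MODEL and describe_model are illustrative names, not part of any real library. The embedding and vector storage step is deliberately left out, since it depends on which embedding model and store you pick.

```python
# Hypothetical semantic-layer metadata; every name and description is illustrative.
SEMANTIC_MODEL = {
    "cube": "orders",
    "measures": {
        "revenue": "Sum of the amount of completed orders.",
        "count": "Number of orders.",
    },
    "dimensions": {
        "status": {
            "description": "Current state of the order.",
            "values": ["open", "completed", "cancelled"],
        },
        "created_at": {"description": "Time the order was placed.", "values": []},
    },
}


def describe_model(model):
    """Return one plain-text document per measure and per dimension."""
    cube = model["cube"]
    docs = []
    for name, description in model["measures"].items():
        docs.append(f"Measure {cube}.{name}: {description}")
    for name, dim in model["dimensions"].items():
        text = f"Dimension {cube}.{name}: {dim['description']}"
        if dim["values"]:
            # Listing known values helps a model map words like "cancelled" to a filter.
            text += " Possible values: " + ", ".join(dim["values"]) + "."
        docs.append(text)
    return docs


documents = describe_model(SEMANTIC_MODEL)
```

Each string in documents is what would then be embedded and stored, so that the relevant descriptions can be retrieved and placed in the model's context for a given question.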

Once the model has this context, it can create a query. Now the query, I believe, should be created against the semantic layer because it reduces the room for error. Because what usually happens is that your query to the semantic layer would be very simple. It would be like, give me that metric grouped by that dimension, and maybe that filter should be applied.

And then your real query for the warehouse, it might have like five joins, a lot of different, you know, techniques, like how to avoid fan-out, fan traps, chasm traps, all of that stuff. And the bigger the query, the more room there is for the model to make an error, right?

Like even sometimes it could be a small error and then, you know, your numbers are going to be off. But making a query against the semantic layer, that sort of reduces the error. So the model generates a SQL query and then it executes it against the semantic layer. The semantic layer executes it against your warehouse and then sends the result all the way back to your application.
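
A rough sketch of why the query against the semantic layer stays small while the warehouse query grows. Everything here is an assumption made for illustration: the MODEL dictionary, the table and column names, and the compile_query helper are hypothetical, and the semantic-layer query is written as a plain dictionary rather than SQL just to keep the example short. Filters are omitted.

```python
# Hypothetical definitions: the join and the aggregation live here, written once.
MODEL = {
    "measures": {"revenue": "SUM(orders.amount)"},
    "dimensions": {
        "customer_country": "customers.country",
        "order_status": "orders.status",
    },
    "from": "orders JOIN customers ON orders.customer_id = customers.id",
}


def compile_query(query, model):
    """Expand a small semantic query into the SQL sent to the warehouse."""
    measures = query["measures"]
    dimensions = query.get("dimensions", [])
    # Reject anything that is not defined, so a model cannot invent a column.
    unknown = [m for m in measures if m not in model["measures"]]
    unknown += [d for d in dimensions if d not in model["dimensions"]]
    if unknown:
        raise ValueError(f"Unknown members: {unknown}")
    select = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    select += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {model['from']}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(model["dimensions"][d] for d in dimensions)
    return sql


# What the model has to produce: a metric and a dimension, nothing else.
semantic_query = {"measures": ["revenue"], "dimensions": ["customer_country"]}
warehouse_sql = compile_query(semantic_query, MODEL)
```

The model only has to pick names that exist in the definitions; the joins and aggregation logic come from the semantic layer, which is where the reduction in room for error comes from.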

And that can be done multiple times, because what we were missing was just this ability to have a conversation, right, with the model. Like you can ask a question and then the system can ask follow-up questions, you know, then do a query to get some additional information, and based on this information, do a query again.

And sort of, you know, like it can keep doing this stuff and then eventually maybe give you a big report that consists of a lot of like data points. But the whole flow is that it knows the system, it knows your data because you already kind of did the indexing.

And then it queries semantic layer instead of a data warehouse directly. - Maybe just to make it a little clearer for people that haven't used a semantic layer before, you can have definitions like revenue where revenue is like select from customers and like join orders and then sum of the amount of orders.

But in the semantic layer, you're kind of hiding all of that away. So when you do natural language to Cube, I just select revenue from last week and then it turns into a bigger and bigger query. - One of the biggest difficulties around the semantic layer for people who've never thought about this concept before: this all sounds super neat until you have multiple stakeholders within a single company who all have different concepts of what revenue is, or different concepts of what an active user is.

And then so they'll have like revenue revision one by the sales team and then revenue revision one, accounting team or tax team, I don't know. I feel like I always want semantic layer discussions to talk about the not so pretty parts of the semantic layer because this is where effectively you ship your org chart in the semantic layer.

- I think the way I think about it is that at the end of the day, the semantic layer is a code base, and in Cube, it's essentially a code base, right? It's just a set of YAML files with Python. I think code is never perfect. We know that, we're like software engineers, right?

It's never going to be perfect. You will have a lot of, you know, like revisions of code. We have version control, which makes revisions easier. So I think we should treat our metrics in the semantic layer as code, right? And then collaboration is a big part of it.

You know, if there are multiple teams that have different opinions, let them collaborate on the pull request. They can discuss why they think it should be calculated differently. Have an open conversation about it, you know, where everyone can just discuss it like an open source community, right?

Like you go on GitHub and you talk about why that code is written the way it's written, right? Or why it should be written differently. And then hopefully at some point you can come to some definition. Now, if you still have multiple versions, right? It's code, right?

So you can still manage it. But I think the big part of that is that we really need to treat it as a code base, not as spreadsheets, you know, like hidden Excel files. Then it makes a lot of things easier. - The other thing is like, then having the definition spread in the organization, you know, like versus everybody trying to come up with their own thing.

But yeah, I'm sure that when you talk to customers, there's people that, you know, have issues with the product and it's really like two people trying to define the same thing. One in sales that wants to look good. The other is like the finance team that wants to be conservative and they all have different definitions.

How important is the natural language to people? So obviously, you know, you guys both work in modern data stack companies either now or before. There's gonna be the whole wave of empowering data professionals. I think now a big part of the wave is removing the need for data professionals to always be in the loop and having non-technical folks do more of the work.

Are you seeing that as a big push too with these models, like allowing everybody to interact with the data? Yeah, any customer stories you can share, anything like that? - I think it's a multidimensional question. That's an example of, you know, like where you have a lot inside the question.

So in terms of examples, I think a lot of people are building different, you know, like agents or chatbots. We have a company that built an internal Slack bot that sort of answers questions, you know, based on the data in a warehouse. And then a lot of people go in and ask that chatbot their questions.

Is it like a real big use case? Maybe. Is it still like a toy pet project? Maybe too right now. I think it's really hard to tell them apart at this point because there is a lot of hype and, you know, people are just building LLM stuff because it's cool.

And everyone wants to build something, you know, even if it's at least a pet project. So that's what happened in Cube's community as well. We see a lot of people building a lot of cool stuff, and it probably will take some time for that stuff to mature and to see what the real, the best use cases are.

But I think from what I saw so far, one use case was building this chatbot, and we even have one company that is building it as a service. So they essentially connect to Cube's semantic layer and then offer their chatbot so you can use it on the web, in Slack, so it can, you know, answer questions based on the data in your semantic layer.

But I also see a lot of things that are just being built in-house. And there are other use cases, some sort of automation, you know, where an agent checks on the data and then performs some actions based on changes in the data. But the other dimension of your question is, will it replace people or not?

I think, you know, like what I see so far in data specifically, there are like a few use cases of LLM. I don't see you being part of that use case, but it's more like a co-pilot for a data analyst, a co-pilot for data engineer, where you develop something, you develop a model and it can help you to write a SQL or something like that.

So, you know, it can create a boilerplate SQL and then you can edit this SQL, which is fine because you know how to edit SQL, right? So, you're not going to make a mistake, but it will help you to just generate, you know, like a bunch of SQL that you write again and again, right?

Like boilerplate code. So sort of a co-pilot use case. I think that's great and we'll see more of it. I think every platform that is building for data engineers will have some sort of co-pilot capabilities, Cube included; we're building these co-pilot capabilities to help people build semantic layers more easily.

I think that's just a baseline for every engineering product right now, to have some sort of, you know, co-pilot capabilities. Then the other use case, which is a little bit more where Cube has been involved, is how do we enable access to data for non-technical people through natural language as an interface to data, right?

Like visual dashboards, charts, they have always been an interface to data in every BI. Now, I think we will see just a second interface, which is natural language. So I think at this point, many BIs will add it as a commodity feature. It's like Tableau will probably have a search bar at some point saying like, "Hey, ask me a question." I know that some of them, you know, like AWS QuickSight, they're about to announce features like this in their BI, and I think Power BI will do that, especially with their deal with OpenAI.

So every company, every BI will have some sort of search capabilities built into their BI. So I think that's just going to be a baseline feature for them as well, but that's where Cube can help because we can provide that context, right? - Do you know how, or do you have an idea for how these products will differentiate once you get the same interface?

So right now there's like, you know, Tableau is like the super complicated one and Superset is like easier. Yeah, do you just see everything will look the same, and then how do people differentiate? - It's like they all have line chart, right? And they all have bar chart.

So I feel like it's pretty much the same. I think the BI market, it's going to be fragmented as well. And every major vendor, and most of the vendors, will try to have some sort of natural language capabilities, and they might be a little bit different. Some of them will try to position the whole product around it.

Some of them will just have them as a checkbox, right? So we'll see, but I don't think it's going to be something that will change the BI market, you know, like something that can take the BI market and make it more consolidated rather than, you know, like what we have right now.

I think it will still remain fragmented. - Let's talk a bit more about application use cases. So people also use Cube for kind of like analytics on their product, like dashboards and things like that. How do you see that changing, especially when it comes to agents, you know, so there's a lot of people trying to build agents for reporting, building agents for sales.

Like if you're building a sales agent, you need to know everything about the purchasing history of the customer, all of these things. Yeah, any thoughts there? What should all the AI engineers listening think about when implementing data into agents? - Yeah, I think kind of, you know, like trying to solve for two problems.

One is how to make sure that the agent or the LLM, right, has enough context about, you know, the tabular data. And also, how do we deliver updates to the context, which is also important because data is changing, right? So every time we change something upstream, we need to make sure we update that context in our vector database or something.
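
One possible way to keep that context fresh, sketched under stated assumptions: fingerprint each metric definition and re-index only the ones whose fingerprint no longer matches what was indexed. The definitions, the indexed_fingerprints record, and both helper functions are hypothetical; the actual re-embedding call is left out because it depends on your vector database.

```python
import hashlib
import json


def fingerprint(definition):
    """Stable hash of a definition; sort_keys makes key order irrelevant."""
    payload = json.dumps(definition, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def stale_entries(definitions, indexed_fingerprints):
    """Names whose current definition differs from what was last indexed."""
    return sorted(
        name
        for name, definition in definitions.items()
        if indexed_fingerprints.get(name) != fingerprint(definition)
    )


# Hypothetical metric definitions and the fingerprints recorded at index time.
definitions = {
    "revenue": {"sql": "SUM(amount)"},
    "active_users": {"sql": "COUNT(DISTINCT user_id)"},
}
indexed_fingerprints = {name: fingerprint(d) for name, d in definitions.items()}

# An upstream change: one definition is edited and one new metric is added.
definitions["revenue"] = {"sql": "SUM(amount) - SUM(refunds)"}
definitions["churned_users"] = {"sql": "COUNT(DISTINCT churned_user_id)"}

to_reindex = stale_entries(definitions, indexed_fingerprints)
```

Only the names in to_reindex would need their text descriptions regenerated and re-embedded; untouched metrics keep their existing entries.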

And how do you make sure that the queries are correct? You know, I think it's obviously a big pain in this whole AI space right now, how do we make sure that we don't, you know, provide wrong answers? But I think being able to reduce the room for error as much as possible, that's what I would look for, you know, to try to minimize potential damage.

And then, yeah, I feel like our use case, you know, for Cube, Cube has been used a lot to power sort of customer-facing analytics. So I don't think that much is going to change. I feel like, again, more and more products will adopt natural language interfaces as a part of the product as well.

So we would be able to power not only, you know, charts and visuals, but also some sort of summaries. Probably in the future, you're going to open a page with some surface stats and you will have a smart summary generated by AI.

And that summary can be powered by Cube, right? Like, because the rest is already being powered by Cube. - You know, we had Linus from Notion on the pod and one of the ideas he had that I really like is kind of like thumbnails of text, kind of like how do you like compress knowledge and then start to expand it.

A lot of that comes into dashboards, you know, where like you have a lot of data, you have like a lot of charts and sometimes you just want to know, hey, this is like the three lines summary of it. Yeah, and yeah, makes sense that you would want to power that.

So are you, how are you thinking about, yeah, the evolution of like the modern data stack and in quotes, whatever that means today, what's like the future of what people are going to do? What's the future of like what models and agents are gonna do for them? Do you have any thoughts?

- I feel like the modern data stack, sometimes it's not very connected. I mean, there's obviously a big crossover between the AI ecosystem, the AI infrastructure ecosystem, and then sort of the data, but I don't think it's a full overlap. So I feel like when, you know, I'm looking at a lot of what's happening in the modern data stack, right?

Like where we use warehouses, we use BIs, you know, different transformation tools, catalogs, data quality tools, ETLs, all of that. I don't see a lot of it being impacted by AI specifically. I think, you know, that space is being impacted as much as any other space in terms of, yes, we'll have all those copilot capabilities, some AI capabilities here and there, but I don't see anything being dramatically, you know, changed or shifted because of the AI wave.

In terms of just the data space in general, I think, you know, in the last two, three years, we saw an explosion, right? Like we got a lot of tools, a vendor for every problem. I feel like right now we should go through a cycle of consolidation. And, you know, I mean, if Fivetran and dbt merge, they can be the Alteryx of a new generation or something like that. - Yeah.

- And, you know, probably some ETL tool there, but I feel it might happen. I mean, it's just natural waves, you know, like in cycles. - I wonder if everybody is gonna have their own copilot. The other thing I think about these models is like, you know, swyx was at Airbyte and yeah, there's Fivetran.

- That's the ETL thing. But Fivetran versus Airbyte, I don't think it all makes very well. - There's the, you know, a lot of times these companies are doing the syntax work for you of like building the integration between your data store and like the app or another data store.

I feel like now these models are pretty good at coming up with the integration themselves, you know, and like using the docs to then connect the two. So I'm really curious, like in the future, what that will look like, you know. And same with data transformation. I mean, you think about dbt and some of these tools and right now you have to create rules to normalize and transform data, but in the future, I could see you explaining to the model how you want the data to be, and then the model figuring out how to do the transformation.

But yeah, I think it all needs a semantic layer as far as like figuring out what to do with it, you know, what's the data for, where it goes. - Yeah, I think many of these, you know, like workflows will be augmented by some sort of a copilot.

You know, you can describe what transformation you want to see and it can generate a boilerplate, right, of transformation for you. Or even, you know, like kind of generate a boilerplate of specific ETL driver or ETL integration. I think we're still maybe not at the point where this code can be fully automated.

So we still need a human in the loop, right, like who can use this copilot. But in general, I think, yeah, data work and software engineering work can be augmented quite significantly with all that stuff. - I think the other important thing with data too is like sometimes, you know, the big thing with machine learning before was like, well, all of your data is bad, you know, the data is not good for anything.

And I think like now at least with these models, they have some knowledge of their own and they can also tell you if your data is bad, you know, which I think is like something that before you didn't, you didn't have. Any cool apps that you've seen being built on, on Cube, like any kind of like AI native things that people should think about, new experiences, anything like that?

- Well, I see a lot of Slack bots. So, you know, like it's just, it's definitely like, they all remind me of Statsbot, but I know like I played with few of them, they're much, much better than Statsbot. So I feel like it just, it feels like it's on the surface, right?

It's just that use case that you really want, you know, think about your data engineer in your company, like everyone is like, and you're asking, "Hey, can you pull that data for me?" And you would be like, "Can I build a bot to replace myself?" You know, like, so they will ping that bot instead.

So it's like, that's why a lot of people doing this. So I think it's a first use case that actually people are playing with. But I think inside that use case, people get creative. So I see bots that can actually have a dialogue with you. So, you know, like you would come to that bot and say, "Hey, show me metrics." And the bot would be like, "What kind of metrics?

What do you want to look at?" So like, it would be like active users. And then it would be like, "How do you define active users?" Do you want to see active users, you know, as a sort of cohort, do you want to see active users changing behavior over time, so a lot of follow-up questions.

So it tries to sort of, you know, understand what exactly you want, and that's how many data analysts work, right? When people start to ask you something, you always try to understand what exactly they mean, because many people don't know how to ask correct questions about their data.

It's sort of an interesting spectrum. On one side of the spectrum, you know nothing, you're just like, "Hey, show me metrics." And on the other side of the spectrum, you know how to write SQL, and you can write an exact query to your data warehouse, right? So many people sit a little bit in the middle, and the data analysts, they usually have the knowledge about your data.

And that's why they can ask follow-up questions to understand what exactly you want. And I saw people building bots that can do that. And that part is amazing. I mean, generating SQL, all of that stuff, it's okay, it's good, but when the bot can actually act like it knows your data and can ask follow-up questions, I think that's great.
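The follow-up dialogue described here can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular bot is built: the `METRICS` catalog, the field names, and the `next_step` function are all hypothetical, and a real bot would likely pull metric definitions from a semantic layer and use a model to parse the user's message into the `request` dict.

```python
# Hypothetical metric catalog. In practice this could come from a semantic layer.
METRICS = {
    "active users": {"required": ["definition", "time_range"]},
    "revenue": {"required": ["time_range"]},
}

# Hypothetical follow-up questions, one per piece of missing information.
FOLLOW_UPS = {
    "metric": "What kind of metrics do you want to look at?",
    "definition": "How do you define {metric}?",
    "time_range": "Over what time range do you want to see {metric}?",
}


def next_step(request: dict) -> str:
    """Return the next follow-up question, or "run_query" once nothing is missing."""
    metric = request.get("metric")
    if metric not in METRICS:
        # "Hey, show me metrics" lands here: the bot does not know which metric yet.
        return FOLLOW_UPS["metric"]
    for field in METRICS[metric]["required"]:
        if field not in request:
            return FOLLOW_UPS[field].format(metric=metric)
    return "run_query"


# Simulate a short conversation: each turn fills in one more field.
request = {}
print(next_step(request))
request["metric"] = "active users"
print(next_step(request))
request["definition"] = "logged in at least once"
print(next_step(request))
request["time_range"] = "last 30 days"
print(next_step(request))
```

The point of the sketch is that the bot only generates a query once the request is fully specified, which is the same thing a data analyst does by asking questions first.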

- Yeah. Are there any issues with the models and the way they understand numbers? You know, one of the big complaints people have is that GPT, at least 3.5, cannot do math. You know, have you seen any limitations and improvements? And also when it comes to what model to use, do you see most people use GPT-4 because it's the best at this kind of analysis?

- I think I saw people use all kinds of models. To be honest, it's usually GPT. I mean, inside GPT, it could be 3.5 or 4, right? But it's not like I see a lot of something else, to be honest. I mean, maybe I know some open source alternatives, but it pretty much feels like the market is being dominated by just ChatGPT, which is probably true.

In terms of the problems, I think I've been chatting about it with a few people. So if math is required, they just try to do the math, you know, outside of ChatGPT itself. So it would be like some additional Python scripts or something.
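As a rough sketch of what "doing the math outside the model" can look like: the model is only asked to produce structured values (or the query that returns them), and plain Python does the arithmetic deterministically. The function names and the example numbers below are hypothetical, chosen only for illustration.

```python
from decimal import Decimal


def percent_change(previous: str, current: str) -> Decimal:
    """Compute the percent change in Python instead of asking the model to do it."""
    prev, curr = Decimal(previous), Decimal(current)
    if prev == 0:
        raise ValueError("percent change is undefined when the previous value is 0")
    # Round to two decimal places for display.
    return ((curr - prev) / prev * 100).quantize(Decimal("0.01"))


def render_answer(metric: str, previous: str, current: str) -> str:
    """Build the final text; the model never touches the arithmetic."""
    change = percent_change(previous, current)
    return f"{metric} went from {previous} to {current} ({change}% change)"


# Hypothetical values, as if they had been returned by a warehouse query.
print(render_answer("Active users", "1200", "1500"))
```

The model can still be used to phrase the final answer, but the numbers it is given are already computed.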

When we're talking about production level use cases, it's quite a lot of Python code around your model to make it work, to be honest. It's not that magic where you just throw the model in and it can give you all these answers. For toy use cases, like the one we have on our demo page or something, it works fine.

It's great. But you know, if you want to do a lot of post-processing, do the math on your own, you probably need to code it in Python anyway. That's what I see people doing. - Yeah. Yeah. We heard the same from Harrison at LangChain, that most people just use OpenAI.

We did an OpenAI "no moat" emergency podcast. And it was funny to just see the reaction that people had to that and how hard it actually is to break down some of the monopoly. What else should people keep in mind, Artem? You're kind of at the cutting edge of this. You know, if I'm looking to build a data-driven AI application, I'm trying to build data into my AI workflows.

Any mistakes people should avoid? Any tips on the best stack to use? What tools to use? - I would just recommend going to a warehouse as soon as possible. I think a lot of people feel that MySQL can be a warehouse, which it maybe can be at a lower scale, but you know, it's definitely not one from a performance perspective.

So just starting with a good warehouse, a query engine, a lakehouse, that's probably something I would recommend from day zero. And there are ways to do it very cheaply with open source technologies too, especially in a lakehouse architecture. I think, you know, I'm biased, obviously, but using a semantic layer, preferably Cube.

And for, you know, like, context. And other than that, I feel it's a very interesting space, you know, in terms of the AI ecosystem. I see a lot of people using LangChain right now, which is great, and we built an integration, but I'm sure the space will continue to evolve.

And, you know, like we'll see a lot of like interesting tools and maybe, you know, like some tools would be a better fit for a job. I don't, I'm not aware of any right now, but it's always interesting to see how it evolves. Also, it's a little unclear, you know, like how all the infrastructure around actually developing, testing, documenting all that stuff will kind of evolve too.

But yeah, again, it's just really interesting to see and observe, you know, what's happening in this space. - Okay, so before we go to the lightning round, I wanted to ask for your thoughts on embedded analytics. And in a sense, the kind of chatbots that people are inserting on their websites and building with LLMs is very much sort of end user programming or end user interaction with their own data.

I love seeing embedded analytics. And for those who don't know, embedded analytics is basically user facing dashboards where you can see your own data, right? Instead of the company seeing data across all their customers, it's an individual user seeing their own data as a slice of the overall data that is owned by the platform that they're using.

So I love embedded analytics, but actually overwhelmingly the observation that I've had is that people who try to build in this market fail to monetize. And I was wondering your insights on why. - I think overall the statement is true. It's really hard to monetize, you know, like in embedded analytics.

That's why at Cube we're more excited about the internal kind of BI use case, or companies building, you know, chatbots for their internal data consumption or internal workflows. Embedded analytics is hard to monetize because it's historically been dominated by the BI vendors. And we still see a lot of organizations using BI tools as their vendors.

And as I was saying about BI vendors adding natural language interfaces, they will probably add that to their embedded analytics capabilities as well, right? So they would be able to embed that too. So I think that's part of it. Also, you know, if you look at the embedded analytics market, the bigger the organization gets, the more custom it becomes.

And at some point I see many organizations just stop using any vendor and build most of the stuff from scratch, which is probably, you know, the right way to do it. So it's sort of, you know, a market that is very capped at the top, and then in that middle and small segment you've got a lot of vendors trying to compete for the buyers.

And because, again, BI is very fragmented, embedded analytics is therefore fragmented also. So you're really going after the mid-market slice, with a lot of other vendors competing for that. So that's why it's historically been hard to monetize, right? I don't think AI is really going to change that, just because it's using a model. You just pay OpenAI and that's it.

Like everyone can do that, right? So it's not much of a competitive advantage. So it's going to be more like a commodity feature that a lot of vendors would be able to leverage. - This is great, Artem. As usual, we've got our lightning round. So it's three questions.

One is about acceleration, one on exploration, and then a takeaway. The acceleration question is: what's something that already happened in AI, or maybe, you know, in data, that you thought would take much longer but is already happening today? - To be honest, all these foundational models. I thought that, you know, we had a lot of models that had been in production for quite a while, you know, maybe a decade or so.

And it was very niche use cases, very vertical use cases, just very customized models. And even when we were building Statsbot back in 2016, right, even back then we had some natural language models being deployed, like Google Translate or something. It still was sort of a model, right?

But it was very customized to a specific use case. So I thought that would continue for many years. We'd use AI, we'd have all these customized niche models. But now there are foundational models, and they're very generic. They can serve many, many different use cases.

So I think that is a big change. And I didn't expect that, to be honest. - And the next question is about exploration. What is one thing that you think is the most interesting unsolved question in AI? - I think AI is a subset of software engineering in general.

And it's sort of connected to data as well. And software engineering as a discipline has quite a history. We've built a lot of processes, you know, toolkits and methodologies for how we practice it, right? And now AI, I don't think it's completely different, but it has some unique traits.

You know, it's pretty much not idempotent, right? And kind of along many dimensions. And there are other traits, which may require different methodologies, different approaches, and a different toolkit. I don't know how much it's going to deviate from standard software engineering.

I think many of the tools and practices that we developed in software engineering can be applied to AI. And some of the data best practices can be applied as well. But it might be a very interesting subfield, like we've got DevOps, right? Just a bunch of tools, like an ecosystem.

So now AI kind of feels like it's shaping into that, with a lot of its own, you know, methodologies, practices and toolkits. So I'm really excited about it. And I think there are still a lot of unsolved questions. Again, how do we develop with that?

How do we test? You know, what are the best practices? What are the methodologies? So I think that would be interesting to see. - Awesome. And then, yeah, the final message. You know, you have a big audience of engineers and technical folks. What's something you want everybody to remember, to think about, to explore?

- As someone who tried to build a chatbot, you know, for analytics back then, and kind of looking at what people do right now, I think, yeah, just do that. I mean, it's working right now. With foundational models, it's actually possible now to build all those cool applications.

I'm so excited to see, you know, how much has changed in the last six years or so, that we can actually build smart agents now. I think that's sort of, you know, the takeaway. And yeah, as humans in general, we really move technology forward, and it's fun to see that, you know, firsthand.

- Ah, well, thank you so much for coming on, Artem. This was great. (upbeat music)
