Hey everyone, we're here today as guests on Latent Space. - It's great to be here. I'm a longtime listener and fan. They've had some great guests on this show before. - Yeah, what an honor to have us, the hosts of another podcast, join as guests. - I mean, a huge thank you to Swyx and Alessio for the invite, thanks for having us on the show.
- Yeah, really, it seems like they brought us here to talk a little bit about our show, our podcast. - Yeah, I mean, we've had lots of listeners ourselves, listeners at Deep Dive. - Oh yeah, we've made a ton of audio overviews since we launched and we're learning a lot.
- There's probably a lot we can share around what we're building next, huh? - Yeah, we'll share a little bit at least. - The short version is we'll keep learning and getting better for you. - We're glad you're along for the ride. - So yeah, keep listening. - Keep listening and stay curious.
We promise to keep diving deep and bringing you even better options in the future. - Stay curious. (upbeat music) - Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI. - Hey, and today we're back in the studio with our special guests, Raiza Martin and Usama, I forgot to get your last name, Shafqat?
- Yes. - Okay, welcome. - Hello, thank you for having us. - Thanks for having us. - So AI podcasters meet human podcasters, always fun. Congrats on the success of Notebook LM. I mean, how does it feel? - It's been a lot of fun. A lot of it honestly was unexpected, but my favorite part is really listening to the audio overviews that people have been making.
- Maybe we should do a little bit of intros and tell the story. You know, what is your path into the sort of Google AI org or maybe I actually don't even know what org you guys are in. - I can start. My name's Raiza. I lead the Notebook LM team inside of Google Labs.
So specifically that's the org that we're in. It's called Google Labs. It's only about two years old. And our whole mandate is really to build AI products. That's it. We work super closely with DeepMind. Our entire thing is just like try a bunch of things and see what's landing with users.
And the background that I have is really, I worked in payments before this and I worked in ads right before and then startups. I tell people like at every time that I changed orgs, I actually almost quit Google. Like specifically like in between ads and payments, I was like, all right, I can't do this.
Like, this is like super hard. I was like, it's not for me. I'm like a very zero to one person. But then I was like, okay, I'll try. I'll interview with other teams. And when I interviewed in payments, I was like, oh, these people are really cool. I don't know if I'm like a super good fit with this space, but I'll try it 'cause the people are cool.
And then I really enjoyed that. And then I worked on like zero to one features inside of payments and I had a lot of fun. But then the time came again where I was like, oh, I don't know. It's like, it's time to leave. It's time to start my own thing.
But then I interviewed inside of Google Labs and I was like, oh darn. Like there's definitely like- - They got you again. - They got me again. (laughing) And so now I've been here for two years and I'm happy that I stayed because especially with the recent success of Notebook LM, I'm like, dang, we did it.
I actually got to do it. So that was really cool. - Kind of similar, honestly. I was at a big team at Google. We do sort of the data center supply chain planning stuff. Google has like the largest sort of footprint. Obviously there's a lot of management stuff to do there.
But then there was this thing called Area 120 at Google, which does not exist anymore. But I sort of wanted to do like more zero to one building and landed a role there where we're trying to build like a creator commerce platform called Qaya. It launched briefly a couple of years ago.
But then Area 120 sort of transitioned and morphed into Labs. And like over the last few years, like the focus just got a lot clearer. Like we were trying to build new AI products and do it in the wild and sort of co-create and all of that. So yeah, we've just been trying a bunch of different things and this one really landed, which has felt pretty phenomenal.
- Really, really landed. Let's talk about the brief history of NotebookLM. You had a tweet, which is very helpful for doing research. May, 2023, during Google I/O, you announced Project Tailwind. - Yeah. - So today is October, 2024. So you joined October, 2022. - Actually, I used to lead AI Test Kitchen.
And this was actually, I think not I/O 2023, I/O 2022 is when we launched AI Test Kitchen or announced it. And I don't know if you remember it. - I wasn't, that's how you like had the basic prototype for Gemini 8. - Yes, yes, exactly. - And like gave beta access to people.
- Yeah, yeah, yeah. And I remember, I was like, wow, this is crazy. We're going to launch an LLM into the wild. And that was the first project that I was working on at Google. But at the same time, my manager at the time, Josh, he was like, "Hey, but I want you to really think about like what real products would we build that are not just demos of the technology?" That was in October of 2022.
I was sitting next to an engineer that was working on a project called Talk to Small Corpus. His name was Adam. And the idea of Talk to Small Corpus is basically using an LLM to talk to your data. And at the time I was like, wait, there are some like really practical things that you can build here.
And I, just a little bit of background, like I was an adult learner. Like I went to college while I was working a full-time job. And the first thing I thought was like, this would have really helped me with my studying, right? If I could just like talk to a textbook, especially like when I was tired after work, that would have been huge.
We took a lot of like the Talk to Small Corpus prototypes and I showed it to a lot of like college students, particularly like adult learners. They were like, yes, like I get it. Like I didn't even have to explain it to them. And we just continued to iterate the prototype from there to the point where we actually got a slot as part of the I/O demo in '23.
- And Corpus, was it a textbook? - Oh my gosh, yeah. It's funny, actually. When he explained the project to me, he was like, "Talk to Small Corpus." I was like, "Talk to a small corpse?" - Yeah, nobody says corpus. - It was like a small corpse? This is not AI.
- It's very academic. - Yeah, yeah. And it really was just like a way for us to describe the amount of data that we thought like could be, it could be good for. - Yeah, but even then, you're still like doing RAG stuff because, you know, the context length back then was probably like 2K, 4K.
- Yeah, it was basically RAG. That was essentially what it was. And I remember, I was like, we were building the prototypes and at the same time, I think like the rest of the world was, right? We were seeing all of these like chat with PDF stuff come up and I was like, "Come on, we gotta go." Like we have to like push this out into the world.
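For readers who have not seen the pattern, the retrieval step described here can be sketched in a few lines of Python. This is a minimal illustration, not the Talk to Small Corpus implementation: the function names (`chunk_words`, `retrieve`, `build_prompt`) are hypothetical, and word-count cosine similarity stands in for whatever embedding model a real system would use.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 40) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; an assumed stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    """Pack only the retrieved passages into a small context window."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these passages:\n\n{context}\n\nQuestion: {question}"
```

The point of the design is that only the top-ranked chunks reach the model, which is what made a 2K to 4K context workable for documents far longer than that.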
I think if there was anything, I wish we would have launched sooner because I wanted to learn faster. But I think like we netted out pretty well. - Was the initial product just text-to-speech or were you also doing kind of like a synthesizing of the content, refining it? Or were you just helping people read through it?
- Before we did the I/O announcement in '23, we'd already done a lot of studies. And one of the first things that I realized was the first thing anybody ever typed was summarize the thing, right? Summarize the document. And it was like half like a test and half just like, "Oh, I know the content.
I wanna see how well it does this." So as part of the first thing that we launched, it was called Project Tailwind back then. It was just Q&A. So you could chat with the doc just through text and it would automatically generate a summary as well. I'm not sure if we had it back then.
I think we did. It would also generate the key topics in your document. And it could support up to like 10 documents. So it wasn't just like a single doc. - And then the I/O demo went well, I guess. - Yeah. - And then what was the discussion from there to where we are today?
Is there any maybe intermediate step of the product that people missed between this was launch or? - It was interesting because every step of the way, I think we hit like some pretty critical milestones. So I think from the initial demo, I think there was so much excitement of like, "Wow, what is this thing that Google is launching?" And so we capitalized on that.
We built the wait list. That's actually when we also launched the Discord server, which has been huge for us because for us in particular, one of the things that I really wanted to do was to be able to launch features and get feedback ASAP. Like the moment somebody tries it, like I want to hear what they think right now.
And I want to ask follow-up questions. And the Discord has just been so great for that. But then we basically took the feedback from I/O. We continued to refine the product. So we added more features. We added sort of like the ability to save notes, write notes, we generate follow-up questions.
So there's a bunch of stuff in the product that shows like a lot of that research, but it was really the rolling out of things. Like we removed the wait list, so rolled out to all of the United States. We rolled out to over 200 countries and territories. We started supporting more languages, both in the UI and like the actual source stuff.
We experienced, like in terms of milestones, there was like an explosion of like users in Japan. This was super interesting in terms of just like unexpected, like people would write to us and they would be like, "This is amazing. I have to read all of these rules in English, but I can chat in Japanese." It's like, oh, wow, that's true, right?
Like with LLMs, you kind of get this natural, it translates the content for you, and you can ask in your sort of preferred mode. And I think that's not just like a language thing too. I think there's like, I do this test with Wealth of Nations all the time, 'cause it's like a pretty complicated text to read.
- The Adam Smith classic, it's like 400 pages this thing. - Yeah, but I like this test 'cause I'm like, I ask in like normie, you know, plain speak, and then it summarizes really well for me. It sort of adapts to my tone. - Very capitalist. - Very on brand.
- I just checked in on the Notebook LM Discord, 65,000 people. - Yeah. - Crazy, just like for one project within Google. It's not like, it's not Labs, it's just Notebook LM. - Just Notebook LM. - What do you learn from the community? - I think that the Discord is really great for hearing about a couple of things.
One, when things are going wrong. I think, honestly, like our fastest way that we've been able to find out if like the servers are down, or there's just an influx of people being like, it says system unable to answer, anybody else getting this? And I'm like, all right, let's go.
And it actually catches it a lot faster than like our own monitoring does. It's like, that's been really cool. So thank you. - Canceling Datadog. (all laughing) - So thank you to everybody. Please keep reporting it. I think the second thing is really the use cases.
I think when we put it out there, I was like, hey, I have a hunch of how people will use it, but like to actually hear about, you know, not just the context of like the use of Notebook LM, but like, what is this person's life like? Why do they care about using this tool?
Especially people who actually have trouble using it, but they keep pushing, like that's just so critical to understand what was so motivating, right? Like what was your problem that was like so worth solving? So that's like a second thing. The third thing is also just hearing sort of like when we have wins and when we don't have wins, because there's actually a lot of functionality where I'm like, hmm, I don't know if that landed super well or if that was actually super critical.
As part of having this sort of small project, right, I wanna be able to unlaunch things too. So it's not just about just like rolling things out and testing it and being like, wow, now we have like 99 features. Like hopefully we get to a place where it's like, there's just a really strong core feature set and the things that aren't as great, we can just unlaunch.
- What have you unlaunched? I have to ask. - I'm in the process of unlaunching some stuff. But for example, we had this idea that you could highlight the text in your source passage and then you could transform it. And nobody was really using it. And it was like a very complicated piece of our architecture and it's very hard to continue supporting it in the context of new features.
So we were like, okay, let's do a 50/50 sunset of this thing and see if anybody complains. And so far, nobody has. - Is there like a feature flagging paradigm inside of your architecture that lets you feature flag these things easily? - Yes, and actually- - What is it called?
Like, I love feature flagging. - You mean like in terms of just like being able to expose things to users? - Yeah, as a PM, like this is your number one tool, right? - Yeah, yeah. - Let's try this out. All right, if it works, roll it out. If it doesn't, roll it back, you know?
- Yeah, I mean, we just run Mendel experiments for the most part. And I actually, I don't know if you saw it, but on Twitter, somebody was able to get around our flags and they enabled all the experiments. They were like, "Check out what the Notebook LM team is cooking." And I was like, "Oh!" And I was at lunch with the rest of the team.
And I was like, I was eating, I was like, "Guys, guys, Magic Draft leaked!" They were like, "Oh no!" I was like, "Okay, just finish eating and then let's go figure out what to do." - Yeah. - I think a post-mortem would be fun, but I don't think we need to do it on the podcast now.
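The 50/50 sunset and the roll-out/roll-back loop discussed above are usually built on deterministic percentage bucketing. The sketch below is a generic illustration under that assumption; it is not Mendel, whose internals are not described in this conversation, and the experiment name `highlight_transform_sunset` is made up for the example.

```python
import hashlib


def in_rollout(user_id: str, experiment: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets per experiment.

    The same user always lands in the same bucket for a given experiment,
    so the feature does not flicker on and off between sessions.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def show_highlight_transform(user_id: str) -> bool:
    # Hypothetical 50/50 sunset: roughly half of users keep the feature
    # while the team watches for complaints from the other half.
    return in_rollout(user_id, "highlight_transform_sunset", 50)
```

Raising `percent` to 100 rolls the feature out to everyone and dropping it to 0 unlaunches it, without shipping new code.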
- Yeah, yeah. - Can we just talk about what's behind the magic? So I think everybody has questions, hypotheses about what models power it. I know you might not be able to share everything, but can you just get people very basic? How do you take the data and put it in the model?
What text model do you use? What's the text-to-speech kind of like jump between the two? - Sure, yeah. - I was going to say, Usama, he manually does all the podcasts. - Oh, thank you. - Really fast. - You're very fast, yeah. - He's both of the voices at once.
- Voice actor. - Go ahead, go ahead. - So just for a bit of background, we were building this thing sort of outside Notebook LM to begin with. Like just the idea is like content transformation, right? Like we can do different modalities. Like everyone knows that everyone's been poking at it, but like, how do you make it really useful?
And like one of the ways we thought was like, okay, like you maybe like, you know, people learn better when they're hearing things, but TTS exists and you can like narrate whatever's on screen, but you want to absorb it the same way. So like, that's where we sort of started out into the realm of like, maybe we try like, you know, two people are having a conversation kind of format.
We didn't actually start out thinking this would live in Notebook, right? Like Notebook was sort of, we built this demo out independently, tried out like a few different sort of sources. The main idea was like, go from some sort of sources and transform it into a listenable, engaging audio format.
And then through that process, we like unlocked a bunch more sort of learnings. Like for example, in a sense, like you're not prompting the model as much because like the information density is getting unrolled by the model prompting itself in a sense, because there's two speakers and they're both technically like AI personas, right?
That have different angles of looking at things and like, they'll have a discussion about it. And that sort of, we realized that's kind of what was making it riveting in a sense. Like you care about what comes next, even if you've read the material already, 'cause like people say they get new insights on their own journals or books or whatever, like anything that they've written themselves.
So yeah, from a modeling perspective, like it's, like Raiza said earlier, like we work with the DeepMind audio folks pretty closely. So they're always cooking up new techniques to like get better, more human-like audio. And then Gemini 1.5 is really, really good at absorbing long context. So we sort of like generally put those things together in a way that we could reliably produce the audio.
- I would add like, there's something really nuanced, I think about sort of the evolution of like the utility of text-to-speech, where if it's just reading an actual text response, and I've done this several times, I do it all the time with like reading my text messages or like sometimes I'm trying to read like a really dense paper, but I'm trying to do actual work, I'll have it like read out the screen.
There is something really robotic about it that is not engaging. And it's really hard to consume content in that way. And it's never been really effective, like particularly for me where I'm like, hey, it's actually just like, it's fine for like short stuff like texting, but even that, it's like not that great.
So I think the frontier of experimentation here was really thinking about there is a transform that needs to happen in between whatever, here's like my resume, right? Or here's like a hundred page slide deck or something. There is a transform that needs to happen that is inherently editorial. And I think this is where like that two-person persona, right, dialogue model, they have takes on the material that you've presented, that's where it really sort of like brings the content to life in a way that's like not robotic.
And I think that's like where the magic is, is like, you don't actually know what's going to happen when you press generate, you know, for better or for worse, like to the extent that like people are like, no, I actually want it to be more predictable now. Like I want to be able to tell them, but I think that initial like, wow, was because you didn't know, right?
When you upload your resume, what's it about to say about you? And I think I've seen enough of these where I'm like, oh, it gave you good vibes, right? Like you knew I was going to say like something really cool. As we start to shape this product, I think we want to try to preserve as much of that wow, as much as we can, because I do think like exposing like all the knobs and like the dials, like we've been thinking about this a lot.
It's like, hey, is that like the actual thing? Is that the thing that people really want? - Have you found differences in having one model just generate the conversation and then using text-to-speech to kind of fake two people? Or like, are you actually using two different kind of system prompts to like have a conversation step-by-step?
I'm always curious, like, if persona system prompts make a big difference or like you just put in one prompt and then you just let it run? - I guess like generally we use a lot of inference as you can tell with like the spinning thing takes a while. So yeah, there's definitely like a bunch of different things happening under the hood.
We've tried both approaches and they have their sort of drawbacks and benefits. I think that that idea of like questioning like the two different personas, like persist throughout like whatever approach we try. It's like, there's a bit of like imperfection in there. Like we had to really lean into the fact that like to build something that's engaging, like it needs to be somewhat human and it needs to be just not a chatbot.
Like that was sort of like what we need to diverge from. It's like, you know, most chatbots will just narrate the same kind of answer, like given the same sources for the most part, which is ridiculous. So yeah, there's like experimentation there under the hood, like with the model to like make sure that it's spitting out like different takes and different personas and different sort of prompting each other is like a good analogy, I guess.
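One of the two approaches mentioned here, separate persona prompts that take turns, can be sketched as a simple loop. This is an assumption-heavy illustration and not how audio overviews are actually produced: `Persona`, `generate_dialogue`, and the prompt wording are all hypothetical, and `model` is any function you supply that maps a prompt string to a reply.

```python
from dataclasses import dataclass
from typing import Callable

# Any callable that maps a prompt string to a reply string (assumed interface).
Model = Callable[[str], str]


@dataclass
class Persona:
    name: str
    angle: str  # how this host approaches the material


def generate_dialogue(
    source: str, hosts: list[Persona], model: Model, turns: int = 4
) -> list[tuple[str, str]]:
    """Alternate between hosts; each turn sees the source and the history."""
    transcript: list[tuple[str, str]] = []
    for i in range(turns):
        host = hosts[i % len(hosts)]
        history = "\n".join(f"{name}: {line}" for name, line in transcript)
        prompt = (
            f"You are {host.name}. Your angle: {host.angle}\n"
            f"Source material:\n{source}\n\n"
            f"Conversation so far:\n{history}\n\n"
            f"Reply with {host.name}'s next line."
        )
        transcript.append((host.name, model(prompt)))
    return transcript
```

The single-prompt alternative would ask one model call for the whole script at once. The turn-by-turn version costs more inference but lets each persona react to what the other just said.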
- Yeah, I think Steven Johnson, I think he's on your team. I don't know what his role is. He seems like chief dreamer, writer. - Yeah, I mean, I can comment on Steven. So Steven joined actually in the very early days, I think before it was even a fully funded project.
And I remember when he joined, I was like, Steven Johnson's going to be on my team. You know, and for folks who don't know him, Steven is a New York Times bestselling author of like 14 books. He has a PBS show. He's like incredibly smart, just like a true sort of celebrity by himself.
And then he joined Google and he was like, I want to come here and I want to build the thing that I've always dreamed of, which is a tool to help me think. I was like, a what? Like a tool to help you think? I was like, what do you need help with?
Like, you seem to be doing great on your own. And, you know, he would describe this to me and I would watch his flow. And aside from like providing a lot of inspiration, to be honest, like when I watched Steven work, I was like, oh, nobody works like this, right?
Like this is what makes him special. Like he is such a dedicated like researcher and journalist and he's so thorough, he's so smart. And then I had this realization of like, maybe Steven is the product. Maybe the work is to take Steven's expertise and bring it to like everyday people that could really benefit from this.
Like just watching him work, I was like, oh, I could definitely use like a mini Steven, like doing work for me. Like that would make me a better PM. And then I thought very quickly about like the adjacent roles that could use sort of this like research and analysis tool.
And so aside from being, you know, chief dreamer, Steven also represents like a super workflow that I think all of us, like if we had access to a tool like it, would just inherently like make us better. - Did you make him express his thoughts while he worked or you just silently watched him?
Or how does this work? - Oh no, now you're making me admit it. But yes, I did just silently watch him. - Yeah, this is a part of the PM toolkit, right? Like user interviews and all that. - Yeah, I mean, I did interview him, but I noticed like if I interviewed him, it was different than if I just watched him.
And I did the same thing with students all the time. Like I followed a lot of students around, I watched them study. I would ask them like, oh, how do you feel now? Right, or why did you do that? Like what made you do that actually? Or why are you upset about like this particular thing?
Why are you cranky about this particular topic? And it was very similar, I think, for Steven, especially because he was describing, he was in the middle of writing a book and he would describe like, oh, you know, here's how I research things and here's how I keep my notes.
Oh, and here's how I do it. And it was really, he was doing this sort of like self-questioning, right? Like now we talk about like chain of, you know, reasoning or thought, reflection. And I was like, oh, he's the OG. Like I watched him do it in real time.
I was like, that's like LLM right there. And to be able to bring sort of that expertise in a way that was like, you know, maybe like costly inference wise, but really have like that ability inside of a tool that was like, for starters, free inside of Notebook LM, it was good to learn whether or not people really did find use out of it.
- So did he just commit to using Notebook LM for everything? Or did you just model his existing workflow? - Both, right? Like in the beginning, there was no product for him to use. And so he just kept describing the thing that he wanted. And then eventually like we started building the thing and then I would start watching him use it.
One of the things that I love about Steven is he uses the product in ways where it kind of does it, but doesn't quite, like he's always using it at like the absolute max limit of this thing. But the way that he describes it is so full of promise where he's like, I can see it going here.
And all I have to do is sort of like meet him there and sort of pressure test whether or not, you know, everyday people want it and we just have to build it. - I would say OpenAI has a pretty similar person, Andrew Mason, I think his name is.
It's very similar, like just from the writing world and using it as a tool for thought to shape ChatGPT. I don't think that people who use AI tools to their limit are common. I'm looking at my Notebook LM now, I've got two sources. You have a little like source limit thing and my bar is over here, you know, and it stretches across the whole thing.
I'm like, did he fill it up? Like what, you know? - Yes, and he has like a higher limit than others. I think Steven- - He fills it up. - Oh yeah, like I don't think Steven even has a limit. - And he has Notes, Google Drive stuff, PDFs, MP3, whatever.
- Yes, and one of my favorite demos, he just did this recently, is he has actually PDFs of like handwritten Marie Curie notes. - I see, so you're doing image recognition as well. - Yeah, so it does support it today. So if you have a PDF that's purely images, it will recognize it.
But his demo is just like super powerful. He's like, okay, here's Marie Curie's notes. And it's like, here's how I'm using it to analyze it. And I'm using it for like this thing that I'm writing. And that's really compelling. It's like the everyday person doesn't think of these applications.
And I think even like when I listened to Steven's demo, I see the gap. I see how Steven got there, but I don't see how I could without him. And so there's a lot of work still for us to build of like, hey, how do I bring that magic down to like zero work?
Because I look at all the steps that he had to take in order to do it. And I'm like, okay, that's product work for us, right? Like that's just onboarding. - And so from an engineering perspective, people come to you and it's like, okay, I need to use this handwritten notes from Marie Curie from hundreds of years ago.
How do you think about adding support for like data sources and then maybe any fun stories and like supporting more esoteric types of inputs? - So I think about the product in three ways, right? So there's the sources, the source input, there's like the capabilities of like what you could do with those sources.
And then there's the third space, which is how do you output it into the world? Like how do you put it back out there? There's a lot of really basic sources that we don't support still, right? I think there's sort of like the handwritten notes stuff is one, but even basic things like DOCX or like PowerPoint, right?
Like these are the things that people, everyday people are like, "Hey, my professor actually gave me everything in DOCX. Can you support that?" And then just like basic stuff, like images and PDFs combined with text. Like there's just a really long roadmap for sources that I think we just have to work on.
So that's like a big piece of it. On the output side, and I think this is like one of the most interesting things that we learned really early on is, sure, there's like the Q&A analysis stuff, which is like, "Hey, when did this thing launch? Okay, you found it in the slide deck.
Here's the answer." But most of the time, the reason why people ask those questions is because they're trying to make something new. And so when actually, when some of those early features leaked, like a lot of the features we're experimenting with are the output types. And so you can imagine that people care a lot about the resources that they're putting into Notebook LM 'cause they're trying to create something new.
So I think equally as important as the source inputs are the outputs that we're helping people to create. And really, shortly on the roadmap, we're thinking about, how do we help people use Notebook LM to distribute knowledge? And that's like one of the most compelling use cases is like shared notebooks.
It's like a way to share knowledge. How do we help people take sources and then one-click new documents out of it, right? And I think that's something that people think is like, "Oh yeah, of course," right? Like one push a document, but what does it mean to do it right?
Like to do it in your style, in your brand, right? To follow your guidelines, stuff like that. So I think there's a lot of work on both sides of that equation. - Interesting. Any comments on the engineering side of things? - So yeah, like I said, I was mostly working on building the text to audio, which kind of lives as a separate engineering pipeline almost that we then put into Notebook LM.
But I think there's probably tons of Notebook LM engineering war stories on dealing with sources. And so I don't work too closely with engineers directly, but I think a lot of it does come down to like Gemini natively understanding images really well, like the latest generation. - Yeah, I think on the engineering and modeling side, I think we are a really good example of a team that's put a product out there and we're getting a lot of feedback from the users and we return the data to the modeling team, right?
To the extent that we say, "Hey, actually, you know what people are uploading, but we can't really support super well? Text plus image," right? Especially to the extent that like Notebook LM can handle up to 50 sources, 500,000 words each. Like you're not going to be able to jam all of that into like the context window.
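A quick back-of-the-envelope check shows why the full source limit cannot simply be placed in the prompt. The 50-source and 500,000-word limits come from the conversation; the tokens-per-word ratio and the 2-million-token window are assumptions chosen for illustration and will differ by tokenizer and model.

```python
MAX_SOURCES = 50                    # from the conversation
WORDS_PER_SOURCE = 500_000          # from the conversation
TOKENS_PER_WORD = 1.3               # assumption: rough average for English text
CONTEXT_WINDOW_TOKENS = 2_000_000   # assumption: a large long-context window


def estimated_tokens(sources: int, words_each: int, tokens_per_word: float) -> int:
    """Estimate the token count of a fully loaded notebook."""
    return round(sources * words_each * tokens_per_word)


total_words = MAX_SOURCES * WORDS_PER_SOURCE  # 25 million words
total_tokens = estimated_tokens(MAX_SOURCES, WORDS_PER_SOURCE, TOKENS_PER_WORD)
overflow = total_tokens / CONTEXT_WINDOW_TOKENS  # how many windows' worth

print(f"{total_words:,} words is roughly {total_tokens:,} tokens, "
      f"about {overflow:.0f}x the assumed window")
```

Under these assumptions a full notebook is more than an order of magnitude larger than the window, which is why retrieval over embeddings still matters even with long-context models.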
So how do we do multimodal embeddings with that? There's really like a lot of things that we have to solve that are almost there, but not quite there yet. - And then turning it into audio. I think one of the best things is it has so many of the human... Does that happen in the text generation that then becomes audio?
Or is that a part of like the audio model that transforms the text? - It's a bit of both, I would say. The audio model is definitely trying to mimic like certain human intonations and like sort of natural, like breathing and pauses and like laughter and things like that.
But yeah, in generating like the text, we also have to sort of give signals on like where those things maybe would make sense. - Yeah, and on the input side instead, having a transcript versus having the audio, like, can you take some of the emotions out of it too?
If I'm giving, like, for example, when we did the recaps of our podcast, we can either give audio of the pod or we can give a diarized transcription of it. But like the transcription doesn't have some of the, you know, voice kind of like things. Do you reconstruct that when people upload audio or how does that work?
- So when you upload audio today, we just transcribe it. So it is quite lossy in the sense that like, we don't transcribe like the emotion from that as a source. But when you do upload a text file and it has a lot of like that annotation, I think that there is some ability for it to be reused in like the audio output, right?
But I think it will still contextualize it in the deep dive format. So I think that's something that's like particularly important is like, hey, today we only have one format, it's deep dive. It's meant to be pretty general overview and it is pretty peppy. It's just very upbeat. - It's very enthusiastic, yeah.
- Yeah, yeah, even if you had like a sad topic, I think they would find a way to be like, silver lining though, we're having a good chat. - Yeah, that's awesome. One of the ways, many, many, many ways that deep dive went viral is people saying like, if you want to feel good about yourself, just drop in your LinkedIn.
Any other like favorite use cases that you saw from people discovering things in social media? - I mean, there's so many funny ones and I love the funny ones. I think because I'm always relieved when I watch them, I'm like, that was funny and not scary, it's great. There was another one that was interesting, which was a startup founder putting their landing page and being like, all right, let's test whether or not like the value prop is coming through.
And I was like, wow, that's right, that's smart. And then I saw a couple of other people following up on that too. - Yeah, I put my about page in there and like, yeah, if there are things that I'm not comfortable with, I should remove it, so that it can't pick it up.
- Right, I think that the personal hype machine was like pretty viral one. I think like people uploaded their dreams and like some people like keep sort of dream journals and it like would sort of comment on those and like it was therapeutic. - I didn't see those, those are good.
I hear from Googlers all the time, especially 'cause we launched it internally first. And I think we launched it during the Q3 sort of like check-in cycle. So all Googlers have to write notes about like, hey, what'd you do in Q3? And what Googlers were doing is they would write whatever they accomplished in Q3 and then they would create an audio overview.
And these people that I didn't know would just ping me and be like, wow, like, I feel really good like going into a meeting with my manager. And I was like, good, good, good, good. You really did that, right? (laughs) - I think another cool one is just like any Wikipedia article like you drop it in and it's just like suddenly like the best sort of summary overview.
- I think that's what Karpathy did, right? Like he has now a Spotify channel called "Histories of Mysteries," which is basically like he just took like interesting stuff from Wikipedia and made audio overviews out of it. - Yeah, he became a podcaster overnight. - Yeah, I'm here for it.
I fully support him. I'm racking up the listens for him. - Honestly, it's useful even without the audio. You know, I feel like the audio does add an element to it, but I always want, you know, paired audio and text. And it's just amazing to see what people are organically discovering.
I feel like it's because you laid the groundwork with NotebookLM and then you came in and added the sort of TTS portion and made it so good, so human, which is weird. Like it's this engineering process of humans. Oh, one thing I wanted to ask. Do you have evals?
- Yeah. - Yes. - What? - Potatoes for chefs. (laughing) - What is that? What do you mean potatoes? - Oh, sorry, sorry. We were joking about this like a couple of weeks ago. We were doing like side-by-sides, but like Usama sent me the file and it was literally called "Potatoes for Chefs." And I was like, you know, my job is really serious, but like- - It's kind of funny.
- You have to laugh a little bit. Like the title of the file was like "Potatoes for Chefs." - Was it like a training document for chefs? - It was just a side-by-side for like two different kind of audio transcripts. - The question is really like, as you iterate, the typical engineering advice is you establish some kind of tests or a benchmark.
You're at like 30%. You want to get it up to 90, right? - Yeah. - What does that look like for making something sound human and interesting and voice? - We have the sort of formal eval process as well, but I think like for this particular project, we maybe took a slightly different route to begin with.
Like there was a lot of just within the team listening sessions, a lot of like sort of like- - Dogfooding. - Yeah, like I think the bar that we tried to get to before even starting formal evals with raters and everything was much higher than I think other projects would.
Like, 'cause that's, as you said, like the traditional advice, right? Like get that ASAP. Like, what are you looking to improve on? Whatever benchmark it is. So there was a lot of just like critical listening. And I think a lot of making sure that those improvements actually could go into the model and like we're happy with that human element of it.
And then eventually we had to obviously distill those down into an eval set, but like still there's like, the team is just like a very, very like avid user of the product at all stages. - I think you just have to be really opinionated. I think that sometimes if you are, your intuition is just sharper and you can move a lot faster on the product because it's like, if you hold that bar high, right?
Like if you think about like the iterative cycle, it's like, hey, we could take like six months to ship this thing, to get it to like mid where we were, or we could just like listen to this and be like, yeah, that's not it, right? And I don't need a rater to tell me that.
That's my preference, right? And collectively, like if I have two other people listen to it, they'll probably agree. And it's just kind of this step of like, just keep improving it to the point where you're like, okay, now I think this is really impressive. And then like do evals, right?
And then validate that. - Was the sound model done and frozen before you started doing all this? Or are you also saying, hey, we need to improve the sound model as well? - Both, yeah. We were making improvements on the audio and just like generating the transcript as well.
I think another weird thing here was like, we need it to be entertaining and that's much harder to quantify than some of the other benchmarks that you can make for like, you know, SWE-bench or get better at this math. - Do you just have people rate one to five or, you know, or just thumbs up and down?
- For the formal rater evals, we have sort of like a Likert scale and like a bunch of different dimensions there. But we had to sort of break down that what makes it entertaining into like a bunch of different factors. But I think the team stage of that was more critical.
It was like, we need to make sure that like what is making it fun and engaging. Like we dialed that as far as it goes. And while we're making other changes that are necessary, like obviously they shouldn't make stuff up or, you know, be insensitive. - Hallucinations. - Hallucinations.
- Other safety things. - Right, like a bunch of safety stuff. - Yeah, exactly. So like with all of that, and like also just, you know, following sort of a coherent narrative and structure is really important. But like with all of this, we really had to make sure that that central tenet of being entertaining and engaging and something you actually want to listen to, it just doesn't go away, which takes like a lot of just active listening time 'cause you're closest to the prompts, the model and everything.
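As an illustration of the rater setup described here, Likert ratings on several dimensions can be averaged per dimension. This is a minimal sketch, not the Notebook LM team's actual pipeline; the dimension names, the 1-to-5 scale, the sample scores, and the function name `summarize_ratings` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rater sheets: one dict per rater, scores on a 1-5 Likert scale.
# The dimension names and scores are illustrative, not the team's rubric or data.
RATINGS = [
    {"entertaining": 4, "grounded": 5, "coherent": 4},
    {"entertaining": 5, "grounded": 4, "coherent": 4},
    {"entertaining": 3, "grounded": 5, "coherent": 5},
]


def summarize_ratings(ratings, scale=(1, 5)):
    """Return the mean score per dimension, rejecting out-of-scale values."""
    low, high = scale
    by_dimension = defaultdict(list)
    for sheet in ratings:
        for dimension, score in sheet.items():
            if not low <= score <= high:
                raise ValueError(f"{dimension}={score} is outside {low}-{high}")
            by_dimension[dimension].append(score)
    return {dimension: mean(scores) for dimension, scores in by_dimension.items()}


summary = summarize_ratings(RATINGS)
```

Keeping the dimensions separate, rather than collapsing them into one number, is what lets a team see that a change improved groundedness while making the audio less entertaining.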
- I think sometimes the difficulty is because we're dealing with non-deterministic models, sometimes you just got a bad roll of the dice and it's always on the distribution that you could get something bad. Basically, how many, do you like do 10 runs at a time? And then how do you get rid of the non-determinism?
- Right, yeah. That's-- - Like bad luck. - Yeah, yeah, yeah. I mean, there still will be like bad audio overviews. There's like a bunch of them that happen. - Do you mean for like the rater evals? - For raters, right? Like what if that one person just got like a really bad rating?
You actually had a great prompt. You actually had a great model, great weights, whatever. And you just, you had a bad output. Like, and that's okay, right? - I actually think like the way that these are constructed, if you think about like the different types of controls that the user has, right?
Like what can the user do today to affect it? - We push a button. - Just use your sources. You just push a button. - I have tried to prompt engineer by changing the title. - Yeah, yeah, yeah. - Changing the title, people have found out, the title of the notebook, people have found out you can add show notes, right?
You can get them to think like the show has changed sort of fundamentally. - Someone changed the language of the output. - Changing the language of the output. Like those are less well-tested because we focused on like this one aspect. So it did change the way that we sort of think about quality as well, right?
So it's like quality is on the dimensions of entertainment, of course, like consistency, groundedness. But in general, does it follow the structure of the deep dive? And I think when we talk about like non-determinism, it's like, well, as long as it follows like the structure of the deep dive, right?
It sort of inherently meets all those other qualities. And so it makes it a little bit easier for us to ship something with confidence to the extent that it's like, I know it's gonna make a deep dive. It's gonna make a good deep dive. Whether or not the person likes it, I don't know.
But as we expand to new formats, as we open up controls, I think that's where it gets really much harder, even with the show notes, right? Like people don't know what they're going to get when they do that. And we see that already where it's like, this is gonna be a lot harder to validate in terms of quality, where now we'll get a greater distribution.
Whereas I don't think we really got like a very wide distribution because of like that pre-process that Usama was talking about. And also because of the way that we'd constrain, like what were we measuring for? Literally just like, is it a deep dive? - And you determine what a deep dive is.
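The point about non-determinism can be made concrete: rather than judging a single sample, generate several and measure how many pass a structural check. The following is a minimal sketch under stated assumptions: `generate` is a hypothetical stand-in that uses a seeded random draw instead of a real model call, and `is_deep_dive` is a hypothetical structure check, not anything described by the team.

```python
import random


def generate(rng):
    """Hypothetical stand-in for one model run: returns a fake script outline."""
    sections = ["intro", "discussion", "wrap-up"]
    # Simulate an occasional bad roll of the dice by dropping the wrap-up.
    if rng.random() < 0.2:
        sections = sections[:-1]
    return sections


def is_deep_dive(sections):
    """Hypothetical structure check: the outline must open and close properly."""
    return bool(sections) and sections[0] == "intro" and sections[-1] == "wrap-up"


def pass_rate(n_runs, seed=0):
    """Fraction of n_runs generations that pass the structure check."""
    rng = random.Random(seed)
    passed = sum(is_deep_dive(generate(rng)) for _ in range(n_runs))
    return passed / n_runs


rate = pass_rate(100)
```

Reporting a pass rate over many runs means one unlucky output lowers the number slightly instead of deciding the verdict on a prompt or model.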
- Yeah. - Everything needs a PM. I have, this is very similar to something I've been thinking about for AI products in general. There's always like a chief tastemaker. And for Notebook LM, it seems like it's a combination of you and Steven. - Well, okay. I want to take a step back.
- And Usama. I mean, presumably for the voice stuff. - Usama's like the head chef, right? Of like deep dive, I think. - Potatoes. - Of potatoes. And I say this because I think even though we are already a very opinionated team and Steven, for sure, very opinionated, I think of the audio generations, like Usama was the most opinionated, right?
And we all, we all like would say like, "Hey," I remember like one of the first ones he sent me, I was like, "Oh, I feel like "they should introduce themselves. "I feel like they should say a title." But then like, we would catch things like, maybe they shouldn't say their names.
- Yeah, they don't say their names. - That was a Steven catch. - Yeah, yeah. - Like not give them names. - So stuff like that is just like, we all injected like a little bit of just like, "Hey, here's like my take on like how a podcast should be." Right, and I think like if you're a person who like regularly listens to podcasts, there's probably some collective preference there that's generic enough that you can standardize into like the deep dive format.
But yeah, it's the new formats where I think like, "Oh, that's the next test." - Yeah, I've tried to make a clone by the way. Of course, everyone did. - Yeah. - Everyone in AI was like, "Oh no, this is so easy. "I'll just take a TTS model." Obviously our models are not as good as yours, but I tried to inject a consistent character backstory, like age, identity, where they went to work, where they went to school, what their hobbies are.
Then it just, the models try to bring it in too much. I don't know if you tried this. So then I'm like, "Okay, like how do I define a personality "but it doesn't keep coming up every single time?" - Yeah, I mean, we have like a really, really good like character designer on our team.
- What? Like a D&D person? - Just to say like we, just like we had to be opinionated about the format, we had to be opinionated about who are those two people talking. - Okay. - Right, and then to the extent that like you can design the format, you should be able to design the people as well.
- Yeah, I would love like a, you know, like when you play Baldur's Gate, like you roll like 17 on charisma and like it's like what race they are, I don't know. - I recently, actually, I was just talking about character select screens. - Yeah. - I was like, I love that.
- People spend hours on that. - I love that, right? And I was like, maybe there's something to be learned there because like people have fallen in love with the deep dive as a format, as a technology, but also as just like those two personas. Now, when you hear a deep dive and you've heard them, you're like, "I know those two," right?
And people, it's so funny when I, when people are trying to find out their names, like it's a worthy task, it's a worthy goal. I know what you're doing. But the next step here is to sort of introduce like, is this like what people want? People want to sort of edit their personas or do they just want more of them?
- I'm sure you're getting a lot of opinions and they all conflict with each other. Before we move on, I have to ask, because we're kind of on this topic, how do you make audio engaging? Because it's useful, not just for deep dive, but also for us as podcasters.
What does engaging mean? If you could break it down for us, that'd be great. - I mean, I can try. Don't claim to be an expert at all. - So I'll give you some, like variation in tone and speed. You know, there's this sort of writing advice where, you know, this sentence is five words, this sentence is three, that kind of advice where you vary things, you have excitement, you have laughter, all that stuff.
But I'd be curious how else you break down. - So there's the basics, like obviously structure that can't be meandering, right? Like there needs to be sort of an ultimate goal that the voices are trying to get to, human or artificial. I think one thing we find often is if there's just too much agreement between people, like that's not fun to listen to.
So there needs to be some sort of tension and buildup, you know, withholding information, for example. Like as you listen to a story unfold, like you're gonna learn more and more about it. And in audio that maybe becomes even more important because like you actually don't have the ability to just like skim to the end of something when you're driving or something, like you're gonna be hooked.
'Cause like there's, and that's how like, that's how a lot of podcasts work. Like maybe not interviews necessarily, but a lot of true crime, a lot of entertainment in general. There's just like a gradual unrolling of information. And that also like sort of goes back to the content transformation aspect of it.
Like maybe you are going from, let's say the Wikipedia article of like, one of the "Histories of Mysteries" episodes, maybe, like the Wikipedia article is gonna state out the information very differently. It's like, here's what happened, would probably be in the very first paragraph. And one approach we could have done is like, maybe a person's just narrating that thing.
And maybe that would work for like a certain audience. Or I guess that's how I would picture like a standard history lesson to unfold. But like, because we're trying to put it in this two-person dialogue format, like we inject like the fact that, you know, there's, you don't give everything at first.
And then you set up like differing opinions of the same topic or the same, like maybe you seize on a topic and go deeper into it and then try to bring yourself back out of it and go back to the main narrative. So that's mostly from like the setting up the script perspective.
And then the audio, I was saying earlier, it's trying to be as close to just human speech as possible, I think was what we found success with so far. - Yeah. Like with interjections, right? Like, I think like when you listen to two people talk, there's a lot of like, yeah, yeah, right.
And then there's like a lot of like that questioning, like, oh yeah, really? What did you think? - I noticed that, that's great. - Totally. - Like, so my question is, do you pull in speech experts to do this or did you just come up with it yourselves? You can be like, okay, talk to a whole bunch of fiction writers to make things engaging or comedy writers or whatever, stand up comedy, right?
They have to make audio engaging. But audio as well, like there's professional fields of study where people do this for a living, but us as AI engineers are just making this up as we go. - I mean, it's a great idea, but you definitely didn't. - Yeah. - No, I'm just like, oh.
- My guess is you didn't. - Yeah. - There's a certain appeal to authority that people have. They're like, oh, like you can't do this 'cause you don't have any experience like making engaging audio, but that's what you literally did. - Right, I mean, I was literally chatting with someone at Google earlier today about how some people think that like, you need a linguistics person in the room for like making a good chatbot, but that's not actually true.
'Cause like this person went to school for linguistics and according to him, he's an engineer now, according to him, like most of his classmates were not actually good at language. Like they knew how to analyze language and like sort of the mathematical patterns and rhythms in language, but that doesn't necessarily mean they were gonna be eloquent at like, while speaking or writing.
So I think, yeah, we haven't invested in specialists in the audio format yet, but maybe that would. - I think it's like super interesting because I think there's like a very human question of like what makes something interesting. And there's like a very deep question of like, what is it, right?
Like, what is the quality that we are all looking for? Is it, does somebody have to be funny? Does something have to be entertaining? Does something have to be straight to the point? And I think when you try to distill that, this is the interesting thing I think about our experiment, about this particular launch is, first, we only launched one format.
And so we sort of had to squeeze everything we believed about what an interesting thing is into one package. And as a result of it, I think we learned, it's like, hey, interacting with a chatbot is sort of novel at first, but it's not interesting, right? It's like humans are what makes interacting with chatbots interesting.
It's like, ha, ha, ha, I'm gonna try to trick it. It's like, that's interesting, spell strawberry, right? This is like the fun that like people have with it. But like, that's not the LLM being interesting, that's you, just like kind of giving it your own flavor. But it's like, what does it mean to sort of flip it on its head and say, no, you be interesting now, right?
Like you give the chatbot the opportunity to do it. And this is not a chatbot per se, it is like just the audio. And it's like the texture, I think, that really brings it to life. And it's like the things that we've described here, which was like, okay, now I have to like lead you down a path of information about like this commercialization deck.
It's like, how do you do that? To be able to successfully do it, I do think that you need experts. I think we'll engage with experts like down the road, but I think it will have to be in the context of, well, what's the next thing we're building, right?
It's like, what am I trying to change here? What do I fundamentally believe needs to be improved? And I think there's still like a lot more studying that we have to do in terms of like, well, what are people actually using this for? And we're just in such early days.
Like it hasn't even been a month. - Two, three weeks, three weeks, I think. - Yeah. - I think the other, one other element to that is the, like the fact that you're bringing your own sources to it. Like it's your stuff. Like, you know this somewhat well, or you care to know about this.
So like that, I think changed the equation on its head as well. It's like your sources and someone's telling you about it. So like you care about how that dynamic is, but you just care for it to be good enough to be entertaining. 'Cause ultimately they're talking about your mortgage deed or whatever.
- So it's interesting just from the topic itself, even taking out all the agreements and the hiding of the slow reveal. - I mean, there's a baseline maybe, like if it was like too drab, like if it was someone who was reading it off, like, you know, that's like the absolute worst, but like.
- Do you prompt for humor? That's a tough one, right? - I think it's more of a generic way to bring humor out if possible. I think humor is actually one of the hardest things. - Yeah. - But I don't know if you saw. - That is AGI, humor is AGI.
- Yeah, but did you see the chicken one? - No. - Okay, if you haven't heard it. - We'll splice it in here. - Okay, yeah, yeah. There is a video on Threads. I think it was by Martino Wong. And it's a PDF. - Welcome to your deep dive for today.
- Oh yeah, get ready for a fun one. - Buckle up because we are diving into chicken, chicken, chicken, chicken, chicken. - You got that right. - By Doug Zongker. - Now. - And yes, you heard that title correctly. - Titles. - Our listener today submitted this paper. - Yeah, they're gonna need our help.
- And I can totally see why. - Absolutely. - It's dense, it's baffling. - It's a lot. - And it's packed with more chicken than a KFC buffet. - Wait, that's hilarious, that's so funny. So it's like stuff like that, that's like truly delightful, truly surprising, but it's like, we didn't tell it to be funny.
- Humor's contextual also, like super contextual what we're realizing. So we're not prompting for humor, but we're prompting for maybe a lot of other things that are bringing out that humor. - I think the thing about AI generated content, if we look at YouTube, like we do videos on YouTube and it's like, you know, a lot of people are screaming in the thumbnails to get clicks.
There's like everybody, there's kind of like a meta of like what you need to do to get clicks. But I think in your product, there's no actual creator on the other side investing the time. So you can actually generate a type of content that is maybe not universally appealing, you know, at a much.
- It's personal. - Yeah, exactly. I think that's the most interesting thing. It's like, well, is there a way for like, take Mr. Beast, right? It's like Mr. Beast optimizes videos to reach the biggest audience and like the most clicks. But what if every video could be kind of like regenerated to be closer to your taste, you know, when you watch it?
- I think that's kind of the promise of AI that I think we are just like touching on, which is I think every time I've gotten information from somebody, they have delivered it to me in their preferred method, right? Like if somebody gives me a PDF, it's a PDF.
Somebody gives me a hundred-slide deck, that is the format in which I'm going to read it. But I think we are now living in the era where transformations are really possible, which is look, like I don't want to read your hundred-slide deck, but I'll listen to a 16-minute audio overview on the drive home.
- Yeah. - And that I think is really novel. And that is paving the way in a way that like maybe we wanted, but didn't expect. Where I also think you're listening to a lot of content that normally wouldn't have had content made about it. Like I watched this TikTok where this woman uploaded her diary from 2004.
For sure, right? Like nobody was going to make a podcast about a diary. Like hopefully not, like it seems kind of embarrassing. - It's kind of creepy. - Yeah, it's kind of creepy. But she was doing this like live listen of like, "Oh, like here's a podcast about my diary." And it's like, it's entertaining right now to sort of all listen to it together.
But like the connection is personal. It was like, it was her interacting with like her information in a totally different way. And I think that's where like, oh, that's a super interesting space, right? Where it's like, I'm creating content for myself in a way that suits the way that I want to consume it.
- Or people compare like retirement plan options. Like no one's going to give you that content like for your personal financial situation. And like, even when we started out the experiment, like a lot of the goal was to go for really obscure content and see how well we could transform that.
So like, if you look at the Mountain View, like city council meeting notes, like you're never going to read it. But like, if it was a three minute summary, like that would be interesting. - I see. You have one system, one prompt that just covers everything you threw at it.
- Maybe. - No, I'm just kidding. It's really interesting. You know, I'm trying to figure out what you nailed compared to others. And I think that the way that you treat your, the AI is like a little bit different than a lot of the builders I talked to. So I don't know what it is you said.
I wish I had a transcript right in front of me, but it's something like, people treat AI as like a tool for thought, but usually it's kind of doing their bidding. And you know, what you're really doing is loading up these like two virtual agents. I don't, you've never said the word agents, I put that in your mouth, but two virtual humans or AIs and letting them form their own opinion and letting them kind of just live and embody it a little bit.
Is that accurate? - I think that that is as close to accurate as possible. I mean, in general, I try to be careful about saying like, oh, you know, letting, you know, yeah, like these personas live. But I think to your earlier question of like, what makes it interesting?
That's what it takes to make it interesting. - Yeah. - Right, and I think to do it well is like a worthy challenge. I also think that it's interesting because they're interested, right? Like, is it interesting to compare- - The Dale Carnegie thing. - Yeah, is it interesting to have two retirement plans?
No, but to listen to these two talk about it, oh my gosh, you'd think it was like the best thing ever invented, right? It's like, get this, deep dive into 401k through Chase versus, you know, whatever. - They do do a lot of get this, which is funny. - I know, I know, I dream about it.
I'm sorry. - There's a, I have a few more questions on just like the engineering around this. And obviously some of this is just me creatively asking how this works. How do you make decisions between when to trust the AI overlord to decide for you? In other words, stick it, let's say products as it is today, you want to improve it in some way.
Do you engineer it into the system? Like write code to make sure it happens or you just stick it in a prompt and hope that the LLM does it for you? Do you know what I mean? - Do you mean specifically about audio or sort of in general? - In general, like designing AI products, I think this is like the one thing that people are struggling with.
And there's compound AI people and then there's big AI people. So compound AI people will be like Databricks, have lots of little models, chain them together to make an output. It's deterministic, you control every single piece and you produce what you produce. The OpenAI people, totally the opposite, like write one giant prompt and let the model figure it out.
And obviously the answer for most people is going to be a spectrum in between those two, like big model, small model. When do you decide that? - I think it depends on the task. It also depends on, well, it depends on the task, but ultimately depends on what is your desired outcome?
Like what am I engineering for here? And I think there's like several potential outputs and there's sort of like general categories. Am I trying to delight somebody? Am I trying to just like meet whatever the person is trying to do? Am I trying to sort of simplify a workflow?
At what layer am I implementing this? Am I trying to implement this as part of the stack to reduce like friction, particularly for like engineers or something? Or am I trying to engineer it so that I deliver like a super high quality thing? I think that the question of like, which of those two, I think you're right, it is a spectrum.
But I think fundamentally it comes down to like, it's a craft, like it's still a craft as much as it is a science. And I think the reality is like, you have to have a really strong POV about like what you want to get out of it and to be able to make that decision.
Because I think if you don't have that strong POV, like you're going to get lost in sort of the detail of like capability. And capability is sort of the last thing that matters because it's like models will catch up, right? Like models will be able to do, you know, whatever in the next five years, it's going to be insane.
So I think this is like a race to like value. And it's like really having a strong opinion about like, what does that look like today? And how far are you going to be able to push it? Sorry, I think maybe that was like very like philosophical. - It's fine, we get there.
And I think that hits a lot of the points it's going to make. I tweeted today, or I X-posted, whatever, that we're going to interview you on what we should ask you. So we got a list of feature requests, mostly. It's funny, nobody actually had any like specific questions about how the product was built.
They just want to know when you're releasing some feature. So I know you cannot talk about all of these things, but I think maybe it will give people an idea of like where the product is going. So I think the most common question, I think five people asked is like, are you going to build an API?
And, you know, do you see this product as still being kind of like a full head product, where I can log in and do everything there? Or do you want it to be a piece of infrastructure that people build on? - I mean, I think, why not both? I think we work at a place where you could have both.
I think that end user products, like products that touch the hands of users, have a lot of value. For me personally, like we learn a lot about what people are trying to do and what's like actually useful and what people are ready for. And so we're going to keep investing in that.
I think at the same time, right, there are a lot of developers that are interested in using the same technology to build their own thing. We're going to look into that. How soon that's going to be ready, I can't really comment, but these are the things that like, hey, we heard it.
We're trying to figure it out. And I think there's room for both. - Is there a world in which this becomes a default Gemini interface because it's technically a different org? - It's such a good question. And I think every time someone asks me, it's like, hey, I just leaned over Gilliam.
(laughing) We'll ask the Gemini folks what they think. - Multilingual support. I know people kind of hack this a little bit together. Any ideas for full support, but also I'm mostly interested in dialects. In Italy, we have Italian obviously, but we have a lot of local dialects. Like if you go to Rome, people don't really speak Italian, they speak local dialect.
Do you think there's a path to which these models, especially the speech can learn very like niche dialects, like how much data do you need? Can people contribute? Like, I'm curious if you see this as a possibility. - So I guess high level, like we're definitely working on adding more languages.
That's like top priority. We're going to start small, but like theoretically we should be able to cover like most languages pretty soon. - What a ridiculous statement by the way, that's crazy. - On, like, the soon or the pretty soon part? - No, but like, you know, a few years ago, like a small team of like, I don't know, 10 people saying that we will support the top 100, 200 languages is like absurd, but you can do it.
You can do it. - And I think like the speech team, you know, we are a small team, but the speech team is another team and the modeling team, like these folks are just like absolutely brilliant at what they do. And I think like when we've talked to them and we've said, hey, you know, how about more languages?
How about more voices? How about dialects, right? This is something that like they are game to do. And like, that's the roadmap for them. The speech team supports like a bunch of other efforts across Google, like Gemini Live, for example, is also built on models from the same, like sort of DeepMind speech team.
But yeah, the thing about dialects is really interesting. 'Cause like in some of our sort of earliest testing with trying out other languages, we actually noticed that sometimes it wouldn't stick to a certain dialect, especially for like, I think for French, we noticed that like when we presented it to like a native speaker, it would sometimes go from like a Canadian person speaking French versus like a French person speaking French or an American person speaking French, which is not what we wanted.
So there's a lot more sort of speech quality work that we need to do there to make sure that it works reliably and at least sort of like the standard dialect that we want. But that does show that there's potential to sort of do the thing that you're talking about of like fixing a dialect that you want, maybe contribute your own voice or like you pick from one of the options.
There's a lot more headroom there. - Yeah, because we have movies. Like we have old Roman movies that are like different languages, but there's not that many, you know? So I'm always like, well, I'm sure like the Italian is so strong in the model that like when you're trying to like pull that away from it, like you kind of need a lot, but- - Right, that's all sort of like the wonderful DeepMind speech team.
- Yeah. - Yeah, yeah, yeah. - Well, anyway, if you need Italian, he's got you. - Yeah, yeah, yeah. - I got him on, I got him on. - Specifically, it's English, I got you. The managing system prompt, people want a lot of that. I assume yes-ish. - Definitely looking into it for just core Notebook LM.
Like everybody's wanted that forever. So we're working on that. I think for the audio itself, we are trying to figure out the best way to do it. So we'll launch something sooner rather than later. So we'll probably stage it. And I think like, you know, just to be fully transparent, we'll probably launch something that's more of a fast follow than like a fully baked feature first.
Just because like I see so many people put in like the fake show notes, it's like, hey, I'll help you out. We'll just put a text box or something, yeah. - I think a lot of people are like, this is almost perfect, but like, I just need that extra 10, 20%.
- Yeah. - I noticed that you say no a lot, I think, or you try to ship one thing. - Yeah. - And that is different about you than maybe other PMs or other eng teams that try to ship, they're like, oh, here are all the knobs. I'm just, take all my knobs.
- Yeah, yeah. - Top P, top K, it doesn't matter. I'll just put it in the docs and you figure it out, right? - That's right, that's right. - Whereas for you, it's you actually just, you make one product. - Yeah. - As opposed to like 10 you could possibly have done.
- Yeah, yeah. - I don't know, it's interesting. - I think about this a lot. I think it requires a lot of discipline because I thought about the knobs. I was like, oh, I saw on Twitter, you know, on X, people want the knobs, like, great. Started mocking it up, making the text boxes, designing like the little fiddles, right?
And then I looked at it and I was kind of sad. I was like, oh, right, it's like, oh, it's like, this is not cool, this is not fun, this is not magical. It is sort of exactly what you would expect knobs to be. But then, you know, it's like, oh, I mean, how much can you, you know, design a knob?
I thought about it, I was like, but the thing that people really liked was that there wasn't any. They just pushed a button. - One button. - And it was cool. And so I was like, how do we bring more of that, right? That still gives the user the optionality that they want.
And so this is where, like, you have to have a strong POV, I think. You have to like really boil down, what did I learn in like the month since I've launched this thing that people really want? And I can give it to them while preserving like that, that delightful sort of fun experience.
And I think that's actually really hard. Like, I'm not gonna come up with that by myself. I'm like, that's something that like our team thinks about every day. We all have different ideas. We're all experimenting with sort of how to get the most out of like the insight and also ship it quick.
So we'll see, we'll find out soon if people like it or not. - I think the other interesting thing about like AI development now is that the knobs are not necessarily, like going back to all the sort of like craft and like human taste and all of that that went into building it.
Like the knobs are not as easy to add as simply like, I'm gonna add a parameter to this and it's gonna make it happen. It's like, you kind of have to redo the quality process for everything. But the prioritization is also different though. - It goes back to sort of like, it's a lot easier to do an eval for like the deep dive format than if like, okay, now I'm gonna let you inject like these random things, right?
Okay, how am I gonna measure quality? Either I say, well, I don't care because like you just input whatever. Or I say, actually wait, right? Like I wanna help you get the best output ever. What's it going to take? - The knob actually needs to work reliably. - Yeah.
- Yeah. Very important point. - Two more things we definitely wanna talk about. I guess now people equate Notebook LM to like a podcast generator, but I guess, you know, there's a whole product suite there. How should people think about that? Like, is this, and also like the future of the product as far as monetization too, you know?
Like, is the voice thing gonna be core to it? Is it just gonna be one output modality and like you're still looking to build like a broader kind of like an interface with data and documents platform? - I mean, that's such a good question that I think the answer is, I'm waiting to get more data.
I think because we are still in the period where everyone's really excited about it. Everyone's trying it. I think I'm getting a lot of sort of like positive feedback on the audio. We have some early signal that says it's a really good hook, but people stay for the other features.
So that's really good too. I was making a joke yesterday. I was like, it'd be really nice, you know, if it was just the audio, 'cause then I could just like simplify the train, right? I don't have to think about all this other functionality. But I think the reality is that the framework, kind of like what we were talking about earlier that we had laid out, which is like, you bring your own sources, there's something you do in the middle, and then there's an output, is a really extensible one.
And it's a really interesting one. And I think like, particularly when we think about what a big business looks like, especially when we think about commercialization, audio is just one such modality. But the editor itself, like the space in which you're able to do these things is like, that's the business, right?
Like maybe the audio by itself, not so much, but like in this big package, like, oh, I could see that. I could see that being like a really big business. - Yep. Any thoughts on some of the alternative interact with data and documents thing, like Claude Artifacts, like ChatGPT Canvas, you know, kind of how do you see, maybe where Notebook LM starts, but like Gemini starts, like you have so many amazing teams and products at Google that sometimes like, I'm sure you have to figure that out.
- Yeah, well, I love Artifacts. I played a little bit with Canvas. I got a little dizzy using it. I was like, oh, there's something, well, you know, I like the idea of it fundamentally, but something about the UX was like, oh, this is like more disorienting than like Artifacts.
And I couldn't figure out what it was. And I didn't spend a lot of time thinking about it, but I love that, right? Like the thing where you are like, I'm working with, you know, an LLM, an agent, a chatbot or whatever to create something new. And there's like the chat space.
There's like the output space. I love that. And the thing that I think I feel angsty about is like, we've been talking about this for like a year, right? Like, of course, like, I'm going to say that, but it's like, but like for a year now, I've had these like mocks that I was just like, I want to push the button, but we prioritize other things.
We were like, okay, what can we like really win at? And like, we prioritize audio, for example, instead of that. But just like when people were like, oh, what is this magic draft thing? Oh, it's like a hundred percent, right? It's like stuff like that, that we want to try to build into notebook too.
And I'd made this comment on Twitter as well, where I was like, now I don't know, actually, right? I don't actually know if that is the right thing. Like, are people really getting utility out of this? I mean, from the launches, it seems like people are really getting it.
But I think now if we were to ship it, I have to rev on it like one layer more, right? I have to deliver like a differentiating value compared to like Artifacts, which is hard. - Which is, because you've, you demonstrated the ability to fast follow. So you don't have to innovate every single time.
- I know, I know. I think for me, it's just like, the bar is high to ship. And when I say that, I think it's sort of like, conceptually, like the value that you deliver to the user. I mean, you'll see in Notebook LM, there are a lot of corners that I have personally cut, where it's like, our UX designer is always like, I can't believe you let us ship with like these ugly scroll bars.
And I'm like, no one notices, I promise. He's like, no, everyone screenshots this thing. But I mean, kidding aside, I think that's true, that it's like, we do want to be able to fast follow, but I think we want to make sure that things also land really well.
So the utility has to be there. - Code, especially on our podcast, has a special place. Is code in Notebook LM interesting to you? I haven't, I've never, I don't see like a connect my GitHub to this thing. - Yeah, yeah. I think code is a big one. Code is a big one.
I think we have been really focused, especially when we had like a much smaller team, we were really focused on like, let's push like an end-to-end journey together. Let's prove that we can do that. Because then once you lay the groundwork of like, sources, do something in the chat, output, once you have that, you just scale it up from there, right?
And it's like, now it's just a matter of like, scaling the inputs, scaling the outputs, scaling the capabilities of the chat. So I think we're going to get there. And now I also feel like I have a much better view of like where the investment is required. Whereas previously I was like, hey, like, let's flesh out the story first before we put more engineers on this thing, because that's just going to slow us down.
- For what it's worth, the model still understands code. So like, I've seen at least one or two people just like, download their GitHub repo, put it in there and get like an audio overview of your code. - Yeah, yeah. - I've never tried that. - This is like, these are all, all the files are connected together.
'Cause the model still understands code. Like, even if you haven't like, optimized for it. - I think on sort of like the creepy side of things, I did watch a student, like with her permission, of course, I watched her do her homework in Notebook LM. And I didn't tell her like, what kind of homework to bring, but she brought like her computer science homework.
And I was like, oh. And she uploaded it and she said, here's my homework, read it. And it was just the instructions. And Notebook LM was like, okay, I've read it. And the student was like, okay, here's my code so far. And she copy pasted it from the editor.
And she was like, check my homework. And Notebook LM was like, well, number one is wrong. And I thought that was really interesting, 'cause it didn't tell her what was wrong. It just said it's wrong. And she was like, okay, don't tell me the answer, but like, walk me through like how you'd think about this.
And it was, what was interesting for me was that she didn't ask for the answer. And I asked her, I was like, oh, why did you do that? And she was like, well, I actually want to learn it. She was like, 'cause I'm going to have to take a quiz on this at some point.
And I was like, oh yeah, this is a really good point. And it was interesting because, you know, Notebook LM, while the formatting wasn't perfect, like did say like, hey, have you thought about using, you know, maybe an integer instead of like this? And so that was really interesting.
- Are you adding like real-time chat on the output? Like, you know, there's kind of like the deep dive show and then there's like the listeners call in and say, hey. - Yeah, we're actively, that's one of the things we're actively prioritizing. Actually, one of the interesting things is now we're like, why would anyone want to do that?
Like, what are the actual, like kind of going back to sort of having a strong POV about the experience. It's like, what is better? Like, what is fundamentally better about doing that? That's not just like being able to Q&A your notebook. How is that different from like a conversation?
Is it just the fact that like there was a show and you want to tweak the show? Is it because you want to participate? So I think there's a lot there that like we can continue to unpack, but yes, that's coming. - It's because I formed a parasocial relationship.
- Yeah, I just want to be part of your life. - Get this. - Totally. - Yeah, but it is obviously because OpenAI has just launched a real-time chat. It's a very hot topic. I would say one of the toughest AI engineering disciplines out there because even their API doesn't do interruptions that well.
To be honest and you know, yeah. So real-time chat is tough. - I love that thing. I love it, yeah. - Okay, so we have a couple of ways to end, either call to action or laying out one principle of AI PMing or engineering that you really think about a lot.
Is there anything that comes to mind? - I feel like that's a test. Of course, I'm going to say go to notebooklm.google.com. Try it out, join the Discord and tell us what you think. - Yeah, especially like you have a technical audience. What do you want from a technical engineering audience?
- I mean, I think it's interesting because the technical and engineering audience typically will just say, "Hey, where's the API?" But you know, and I think we addressed it. But I think what I would really be interested to discover is, is this useful to you? Why is it useful?
What did you do? Right, is it useful tomorrow? How about next week? Just the most useful thing for me is if you do stop using it or if you do keep using it, tell me why. Because I think contextualizing it within your life, right, your background, your motivations, like is what really helps me build really cool things.
- And then one piece of advice for AI PMs. - Okay, if I had to pick one, it's just always be building. Like build things yourself. I think like for PMs, it's like such a critical skill and just like take time to like pop your head up and see what else is new out there.
On the weekends, I try to have a lot of discipline. Like I only use ChatGPT and like Claude on the weekend. I try to like use like the APIs. Occasionally I'll try to build something on like GCP over the weekend, 'cause like I don't do that normally like at work.
But it's just like the rigor of just trying to be like a builder yourself. And even just like testing, right? Like you can have an idea of like how a product should work and maybe your engineers are building it. But it's like, what was your like proof of concept, right?
Like what gave you conviction that that was the right thing? - Call to action. - I feel like consistently like the most magical moments out of like AI building come about for me when like I'm really, really, really just close to the edge of the model capability. And sometimes it's like farther than you think it is.
Like I think while building this product, some of the other experiments, like there were phases where it was like easy to think that you've like approached it. But like sometimes at that point, what you really need is to like show your thing to someone and like they'll come up with creative ways to improve it.
Like we're all sort of like learning, I think. So yeah, like I feel like unless you're hitting that bound of like, this is what Gemini 1.5 can do, probably like the magic moment is like somewhere there, like in that sort of limit. - So push the edge of the capability.
- Yeah, totally. - It's funny because we had Nicholas Carlini from DeepMind on the pod. And he was like, if the model is always successful, you're probably not trying hard enough to like give it hard. - Right. - So yeah. - My problem is like sometimes I'm not smart enough to judge.
- Yeah, right. (laughing) - I think like that's, I hear that a lot. Like people are always like, I don't know how to use it. Yeah, and it's hard. Like I remember the first time I used Google search, I was like, what do we type? My dad was like, anything.
It's like anything, I got nothing in my brain, dad. (laughing) What do you mean? And I think there's a lot of like, for product builders, it's like have a strong opinion about like, what is the user supposed to do? - Yeah. - Help them do it. - Principle for AI engineers or like just one piece of advice that you have for others?
- I guess like, in addition to pushing the bounds and to do that, that often means like, you're not gonna get it right in the first go. So like, don't be afraid to just like, batch multiple models together. I guess that's, I'm basically describing an agent, but more thinking time equals just better results consistently.
And that holds true for probably every single time that I've tried to build something. - Well, at some point we will talk about the sort of longer inference paradigm. It seems like DeepMind is rumored to be coming out with something. You can't comment, of course. Yeah, well, thank you so much.
You know, you've created, I actually said, I think you saw this. I think that Notebook LM was kind of like the ChatGPT moment for Google. - Yeah, that was so crazy when I saw that. I was like, what? Like ChatGPT was huge for me. And I think, you know, when you said it and other people have said it, I was like, is it?
- Yeah. - That's crazy, that's so cool. - People weren't like really cognizant of Notebook LM before, and audio overviews and Notebook LM, like unlocked the, you know, a use case for people in a way that I would go so far as to say Claude Projects never did. And I don't know, you know, I think a lot of it is competent PMing and engineering, but also just, you know, it's interesting how a lot of these projects are always like low key research previews.
For you, it's like, you're a separate org, but like, you know, you built products and UI innovation on top of also working with research to improve the model. That was a success. That wasn't planned to be this whole big thing. You know, your TPUs were on fire, right? - Oh my gosh, that was so funny.
I didn't know people would like really catch on to the Elmo fire, but it was just like one of those things where I was like, you know, we had to ask for more TPUs. Yeah, many times. And, you know, it was a little bit of a subtweet of like, hey, reminder, give us more TPUs down here.
- It's weird. I just think like when people try to make big launches, then they flop. And then like when they're not trying and they're just trying to build a good thing, then they succeed. It's this fundamentally really weird magic that I haven't really encapsulated yet, but you've done it.
- Thank you. And you know, I think we'll just keep going in like the same way. We just keep trying, keep trying to make it better. - Yeah, I hope so. All right, cool. Thank you. - Thank you. Thanks for having us. - Thanks. (upbeat music)