Hi, everybody. Thanks for having us here today. We're super excited to be here. I'm Sam and I'm one of the co-founders of New Computer. And I'm Jason, the other co-founder. And we're really excited that we are starting today by letting you all see our pores up close, which is amazing.
So, you know, when Sam and I started New Computer, we did so because we believed that for so long, we've taken certain metaphors and abstractions and tools for granted. And for the first time in what feels like 40 years, we can finally change all of that. And we can start thinking from first principles, what our relationship, not only with computing, but with intelligence, period, should look like in the future.
So what do we mean by intelligence? Because, you know, sometimes I'm on the internet and I wonder if it even exists. Well, one way to think about intelligence is the ability to take in lots of information, different types, different volumes, from different sources. You can visualize those as the dots here, and then find ways to make sense of it all, find ways to reason, find ways to find meaning.
And as human beings, as carbon-based life forms, we do this through a process where first we use our senses to perceive the world around us. Then we process that information in our heads, and given what we think, we choose a reaction. So if we're lucky, we are blessed with at least five senses.
Six when I've had four margaritas. But as humans, we are inherently capable of processing all of this at the same time. And that's actually how our short-term memory works. Taking in all of this context and information, we then get to form what's called a theory of mind.
What is going on? How is the world relating to me right now? What should I be doing about it? So we sense, we think, and then we react. And how do we react? Well, there's a lot of options right now. But if we take it all the way back to the Stone Age and think real simple, a lot of how people used to react and communicate was just unintelligible grunts.
And then one day that evolved into language as we know it. And to this day, that's still something we rely on to communicate and react to the world around us. And that's also how a lot of us think. So we have language. But communication is so much broader than just language.
We're standing here on stage right now. I'm making eye contact with some of you. Nice shirt. And I'm making gestures. I'm wearing these ridiculous gloves. I'm looking at Sam. I'm looking at things. I'm pointing at things. And I can hear sort of laughter or I can hear people, you know, thinking.
I'm taking in lots of information at once. And right now I'm sensing, thinking, and reacting. And so this year, well, last year technically, we saw a really amazing thing happen with the advent of ChatGPT, where we saw the beginnings of a computer starting to approximate that same loop, where input was coming in in the form of language.
There was some reasoning process, however that actually works. And then the output also felt like language coming back to us. And this was very inspiring to me and Jason. And we've been spending a lot of time this past year thinking about what's next, and how interacting with computers specifically gets to feel even more natural for people.
And so today we wanted to take you on a tour of a few demos. One you can do with a computer right now, and then a few which assume futuristic or next-generation hardware that may be available soon. And knowing that you're all engineers, we know this will get the sparks flowing, the ideas flowing, for how you might use some of these things that are coming out soon, or things that exist today, to build experiences that feel more natural.
So I'll start by getting to a demo. And I will say this is a live audiovisual demo, so I am foolish enough to make that choice, and we will see how it goes. Before we show any demos, it's prudent to point out that none of these represent the product we are building.
They are simply pieces, stories of inspiration. So the point of this first demo: we have a lot of conversations where we're asking, okay, is text the right input? Is audio the right input? And we've been thinking that it's not a question of if those are the right inputs, but when.
So in this case, you'll see some measurements happening on the left here. What's actually happening is that this has access to my camera, and it's taking real-time pose measurements of where I am relative to the screen. So it knows I'm at the keyboard, basically, because it's making that assessment.
And you can see the reasoning on the side here, where it's saying: user is close to the screen, we'll use keyboard input; user is facing the screen, we'll use text output. And so we're using an LLM to actually make that choice as it generates the response. So let's try something else.
And again, demo gods be nice, because this may not work at all. But if I now walk away, and it doesn't detect me anymore, it should now actually start listening to me. Hello? Can you hear me? Are you going to respond? I think that's a no. It might not respond.
But basically, what we are attempting to build here is this: if I want to actually talk to the computer in a really natural way, then while I'm there next to the keyboard, it should not be paying attention to my voice or any ambient sounds; and if I walk away from the keyboard, I might want to have a conversation with it as I walk around the room.
It is listening. It seems to have decided not to actually talk back. But, oh, it's talking: "Is there something you need help with? That sounds like an interesting project, Samantha. How is your talk going so far?" Yay! Yes, you can see it paid attention, and it decided to ignore me for a while.
But anyway, this is just a toy demo. You can see here how it's working behind the scenes: it's trying to decide if I'm close to the keyboard, facing the screen or not facing the screen, and using all of that as input to decide whether it should talk to me or just display the text on the interface.
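To make that behind-the-scenes part concrete, here's a minimal sketch of the perception step, assuming a webcam frame and the MediaPipe Pose library (the talk doesn't say what the demo actually uses, and the thresholds are illustrative guesses, not the demo's values):

```python
# Sketch: turn one webcam frame into coarse world-state flags.
# Assumes MediaPipe Pose; the thresholds are illustrative, not from the demo.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def sense(frame_bgr) -> dict:
    """Estimate whether the user is present, near the screen, and facing it."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks is None:
        return {"user_present": False, "near_screen": False, "facing_screen": False}

    lm = results.pose_landmarks.landmark
    left = lm[mp_pose.PoseLandmark.LEFT_SHOULDER]
    right = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
    nose = lm[mp_pose.PoseLandmark.NOSE]

    # Wider apparent shoulder span (normalized coords) ~= closer to the camera.
    near_screen = abs(left.x - right.x) > 0.25
    # Nose roughly centered between the shoulders ~= facing the screen.
    facing_screen = abs(nose.x - (left.x + right.x) / 2) < 0.05

    return {"user_present": True, "near_screen": near_screen, "facing_screen": facing_screen}
```

Run every few frames, that little dict is the implicit context the demo keeps referring to.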
Cool. So, the reason we think this is interesting is that people are naturally sensitive to other people. And we think computers, instead of asking people to adapt to them, to come up to the machine and type and whatever, should find ways to adapt to the circumstances and context of people.
Exactly. So, again, in this case it's adapting to where I am by using the pose detection, and whether or not I'm actually in the process of talking to it, to update its own world state. It uses an LLM to actually do that, and then uses the LLM to respond with knowledge of that world state.
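As a rough illustration of that reasoning step, here's how the world state might be handed to an LLM so it picks both the reply and the output channel. This assumes the OpenAI chat API; the prompt wording and the JSON contract are my own invention, not the actual demo's:

```python
# Sketch: let the LLM decide both what to say and which channel to say it on.
# The prompt and JSON shape are illustrative assumptions, not the demo's code.
import json
from openai import OpenAI

client = OpenAI()

def think_and_react(world_state: dict, user_message: str) -> dict:
    system_prompt = (
        "You are an ambient assistant. Given the user's physical context, decide "
        "how to respond. Answer with JSON: {'output_mode': 'text' | 'speech', "
        "'reply': string | null}. Use 'text' if the user is near and facing the "
        "screen, 'speech' if they have walked away. If the user is absent and "
        "hasn't addressed you, set 'reply' to null and stay quiet."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # any chat-capable model works for the sketch
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"World state: {json.dumps(world_state)}\n"
                                        f"User said: {user_message or '(nothing)'}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

# e.g. think_and_react({"user_present": False, "near_screen": False,
#                       "facing_screen": False}, "How is my talk going?")
# -> {"output_mode": "speech", "reply": "It sounds like it's going great, Sam!"}
```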
And so, this is a really simple and, as you can see, kind of hacky demo that is something you could build today. In theory, you could imagine how this could be a really cool native way to interact with an LLM on your computer, where you don't have to worry about the input modality at all.
So, again, the takeaways: consider explicit inputs, what I'm typing, what I'm saying, along with implicit ones, like where I am. There are other things you could do with that, like tone and emotion detection. You could plug in a whole bunch of different signals that you want to extract. And you can even imagine: if I'm in the frame with Sam, and the agent knows Sam and that she had recently been complaining about me, it should probably not bring that up until I leave the frame.
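To sketch what "plugging in different signals" could look like in code: each implicit signal gets its own small extractor, and everything is folded into the same world-state dict the LLM sees. The extractors and the sensor payload here are hypothetical placeholders, not real APIs:

```python
# Sketch: merge explicit and implicit signals into one context dict for the LLM.
# The extractors and sensor payload below are hypothetical placeholders.
from typing import Callable, Dict, List

SignalExtractor = Callable[[dict], dict]  # raw sensor payload -> named signals

def build_world_state(sensors: dict, extractors: List[SignalExtractor]) -> dict:
    state: Dict[str, object] = {}
    for extract in extractors:
        state.update(extract(sensors))
    return state

def tone_of_voice(sensors: dict) -> dict:
    # In practice this would be a small audio-emotion classifier.
    return {"voice_tone": sensors.get("audio_tone", "neutral")}

def people_in_frame(sensors: dict) -> dict:
    # In practice: face recognition over the camera frame.
    return {"people_present": sensors.get("faces", [])}

state = build_world_state(
    {"audio_tone": "frustrated", "faces": ["sam", "jason"]},
    [tone_of_voice, people_in_frame],
)
# -> {"voice_tone": "frustrated", "people_present": ["sam", "jason"]}
# The system prompt can then carry rules like "don't mention Sam's complaint
# about Jason while Jason is in the frame."
```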
Yeah. And as we mentioned, we're using it as a reasoning engine, and then adapting to the person. Cool. So, we want to get to the futuristic stuff. Jason has been spending a lot of time imagining this, so he's going to walk you through a few things that might exist in the near future when new hardware comes out.
So, when we think about the future, we still think the sense-think-react loop will take place. To preface all of this: these are my personal speculative fictions, not representative of anything that I think might actually happen. And this is a very conservative view of the next one to 12 months, maybe. So, it's not a true far-future AGI god-worshipping type situation.
So, let's start with what I call a social interface. We're all really excited about certain headsets being released at certain points. And one thing I think is interesting about some headsets is that they have sensors, hand tracking, and eye tracking. And just like how I'm being expressive right now, maybe there comes a day when I can be just as expressive with a computer that sort of lives with me.
So, here I am in my apartment minding my own business. And my ex decides to FaceTime me. And now I've declined the call. You know, historically with deterministic interfaces, I would have had to like find the hang-up button or go like, "Hey, Alexa, decline call." Like, thinking commands, thinking computer-speak.
But like, as a person, I can be like, "Fuck off." You know, I can be like, "I'm busy." I can be like, "I'm sick." All of this stuff, the computer should be able to interpret for me and send, what's his name again, toxic trashiest, whatever, on his merry way.
So, explicit social gestures can be a great way to determine user intent, like the one I just showed. But we should also consider interpreting the implicit parts of gestures: a really fast gesture versus a slow one, my mood, my tone, how far away I am. And we should also be conscious of socio-cultural norms, since different gestures mean different things in different societies. As you scale your application or hardware to different locales, this is something you should pay attention to.
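Here's a speculative sketch of what that interpretation layer might look like. Since the headset APIs are imagined, the gesture labels, speed feature, and locale table are all placeholders for whatever a real hand-tracking SDK would expose:

```python
# Speculative sketch: map a tracked gesture plus implicit cues to an intent.
# Gesture labels, speed scale, and the locale table are invented placeholders.
from dataclasses import dataclass

@dataclass
class Gesture:
    label: str    # e.g. "wave_away", "thumbs_up" from a hand-tracking model
    speed: float  # 0..1, how sharply the gesture was made
    locale: str   # the user's locale, because gesture meanings vary by culture

GESTURE_INTENTS = {
    ("wave_away", "en-US"): "decline",
    ("thumbs_up", "en-US"): "confirm",
    # ... other (gesture, locale) pairs, since the same hand shape can mean
    # something quite different elsewhere
}

def interpret(gesture: Gesture) -> dict:
    intent = GESTURE_INTENTS.get((gesture.label, gesture.locale), "unknown")
    # Implicit cue: a fast, sharp gesture reads as more emphatic.
    emphatic = gesture.speed > 0.7
    return {"intent": intent, "emphatic": emphatic}

interpret(Gesture("wave_away", speed=0.9, locale="en-US"))
# -> {"intent": "decline", "emphatic": True}
#    e.g. decline the FaceTime call and maybe also mute the caller for a while
```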
Now, I want to move on to talk about what I call new physics. And this part is super fun. This demo is based on the iPad, which, you know, has over five daily active users in the world.
It's very popular. And here I'm imagining: okay, if I were the founder of Midjourney, I would be putting all my resources into making some sort of Midjourney Canvas app for iPads. So, in this one, I've asked Midjourney to create Balenciaga Naruto, which now I'm realizing kind of looks like me.
Um, so, let's think about the iPad. It's like this big slab that you can, like, touch and fiddle with, right? So, what do I want to do? Okay, I want to, like, edit this photo. Um, but first, I need to make space. How do I do that? Well, very easy.
You just, you know, um, you can just zoom out and now you have extra space. Very obvious. We do this all the time. Um, I kind of think my cat would look really good in that outfit. So, I kind of want to find a way to do that here.
Let me just ask AI real quick: hey, random AI, send me pictures of my cat. And, you know, the AI knows me and has context, and gives me pictures of my cat. And then, what do I do here? Well, why can't we just take one of the photos and sort of just blend it with the other?
And the metaphor you're seeing here: as you work with these photos, they start glowing when you pick them up. And what does light do? You guys know the Pink Floyd Dark Side of the Moon album cover. We're really familiar with the idea that light can split into different colors and then concentrate back into one form.
And we're leaning into that metaphor here, implicitly. Um, and so it's now created something that looks 50% human, 50% cat, 100% cringe. I don't really like this. How do we remix this? What is the gesture? What is the thing we do in real life that's remixing? Um, for me, it's a margarita.
And for Sam, it's her morning Huel. We shake a blender bottle. So, why can't we work with intelligent materials the same way that we work with real materials and just blend them up? This is totally doable right now. David, why aren't you building this? If you don't build this, I'm going to build this.
It's fine. So here, what we're trying to say is: think about familiar, universal metaphors like physics, like light, like metaballs, like squishiness, like fog, whatever. Because if you're designing an iPhone, you have to be very cognizant of the qualities of aluminum and titanium, but generative intelligence is a probabilistic material that's sort of more fluid.
Maybe it's fog, maybe it's mercury. And for this reason, maybe metaphors that are really rigid, like wood or paper or metal, aren't the right metaphors to use for some of these experiences. So, finally, I want to walk you through an experience that's inherently mixed-modal slash mixed-reality.
Um, let's imagine for a second there's a piece of hardware coming out that's a wearable that has a camera on it and has a microphone and it can maybe project things. I don't know if such a thing will ever exist. But let's imagine for a second it does. Um, I'm sort of browsing this book, this Beyoncé tour book, and I see these images that I find really inspiring.
What I'm trying to do here is: what if I could just point at something on my desk and say, "This is cool," and have the device pick up on that, and indicate that it's heard me and is going to do something, by sort of projection-mapping this feedback?
This demo doesn't really have sound, but the way this would work is ideally a combination of voice and gesture at the same time. And obviously this gesture is really easy to make mistakes with, so any time you work with probabilistic materials, you want to provide a graceful way out.
So in this case, I've accidentally tapped this photo. Why can't I just flick it away like dust? And be like, "That's wrong." I don't want to press an undo button. I don't want to press Command-Z. I just want to flick it away, really leaning into the physics of it.
So now that I've found two pieces, I'm kind of like, okay, I want to send this to two of my friends. Hmm, there was a friend I said I would do Halloween with, but I can't really remember their name. What do I do here? I should ask AI.
I should be like, "Who is that friend I said I'd spend Halloween with?" And you notice here that, like, we're imagining sort of projection mapped UI pieces that can work with the context of the world you're in right now, such that you don't have to go fish out a phone or use cumbersome voice commands.
It just all sort of naturally melds with the world. And, you know, crucially, one point we want to make is that voice in doesn't need to mean voice out, gesture in doesn't need to mean gesture out, and visual UI in does not need to mean visual UI out.
We can mix these modalities in real time for whatever makes sense in whatever context you're in. So, given that interactions that require multiple simultaneous inputs are now possible, it's our job as designers and developers to think on behalf of the user, to think about what the appropriate output is given the current context, and be smart about it.
Yeah, so again, the takeaways, as we mentioned: we have a lot of sensors and contextual modalities available to us as ingredients, even today. There will be more tomorrow, as you saw with these upcoming potential hardware releases. But even now, with a laptop, with things like typing speed, with things like tone of voice, there are a lot of ways you can gather context and extract signals from it.
You could choose to process it in a variety of different ways. And all of that can now be passed to an LLM and used in a reasoning layer which decides both how to respond in words and how to present that information. And so basically, everything can now be an input, and your output could be everywhere and take every format.
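One way to keep "output could be everywhere" manageable is to let the reasoning layer return a structured decision and then dispatch it to whichever renderer fits. The renderers below are stand-ins for real text UI, TTS, or projection backends:

```python
# Sketch: dispatch a structured LLM decision to the right output channel.
# The renderer classes are placeholders for real UI / TTS / projection backends.
from typing import Protocol

class Renderer(Protocol):
    def render(self, content: str) -> None: ...

class TextPanel:
    def render(self, content: str) -> None:
        print(f"[screen] {content}")

class SpeechOut:
    def render(self, content: str) -> None:
        print(f"[speaker] {content}")    # in practice, call a TTS engine

class Projection:
    def render(self, content: str) -> None:
        print(f"[projector] {content}")  # in practice, projection-mapped UI

RENDERERS: dict[str, Renderer] = {
    "text": TextPanel(),
    "speech": SpeechOut(),
    "projection": Projection(),
}

def react(decision: dict) -> None:
    """decision comes from the reasoning layer, e.g. {'output_mode': 'speech', 'reply': '...'}."""
    if not decision.get("reply"):
        return  # the graceful way out: sometimes the right output is nothing
    RENDERERS.get(decision["output_mode"], RENDERERS["text"]).render(decision["reply"])
```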
At the same time, one might say, everything everywhere all at once. Well, you want to be intentional with it. You know, if someone wants to generate a photo on their Apple Watch, you're like, why? No, use your freaking phone. Jesus. Anyway. And the last thing we'll say is: probabilistic interfaces are hard because they have lots of different possible outputs.
So, a really great way to ground these interfaces is to lean into familiar metaphors, whether they are from nature, from physics, or even from human-made tools and materials, like buttons, for now. And, you know, social norms are also a material that we work with, right? So, your banking AI agent probably shouldn't be able to have a deep philosophical chat with you.
That just socially doesn't make sense. That would feel weird. Exactly. But on the same note, we've related all of these interfaces to what humans perceive and experience now. But what might a truly intelligent interface look like in the future? If where we are right now is skeuomorphism, what is the abstraction layer above that?
And that's kind of for us to figure out. So, with that, yeah, that's all. Thank you.