
The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer


Chapters

0:00 Introduction
1:00 What is intelligence
2:54 Language of communication
5:05 First demo
7:42 Adapting
9:44 Social Interface
14:32 Projection Mapping
16:44 Conclusion


00:00:00.000 | Hi, everybody. Thanks for having us here today. We're super excited to be here. I'm Sam and I'm
00:00:20.480 | one of the co-founders of New Computer. And I'm Jason, the other co-founder. And we're really
00:00:25.560 | excited that we are starting today by letting you all see our pores up close, which is amazing.
00:00:31.660 | So, you know, when Sam and I started New Computer, we did so because we believed that for so long,
00:00:40.060 | we've taken certain metaphors and abstractions and tools for granted. And for the first time in what
00:00:46.300 | feels like 40 years, we can finally change all of that. And we can start thinking from first
00:00:51.900 | principles, what our relationship, not only with computing, but with intelligence, period,
00:00:56.840 | should look like in the future. So what do we mean by intelligence? Because, you know, sometimes I'm
00:01:05.660 | on the internet and I wonder if it even exists. Well, one way to think about intelligence is the
00:01:11.340 | ability to sort of take in lots of information, different types, different volumes from different
00:01:16.480 | sources, visualized as those dots here, and sort of find ways to make sense of it all, find ways to reason,
00:01:24.080 | find ways to find meaning. And as human beings, as carbon-based life forms, we do this through a process
00:01:32.320 | where at first we use our senses to sort of perceive the world around us. Then we, you know, process that
00:01:39.420 | information in our heads and then given what we think, we then choose a reaction. So if we're lucky, we
00:01:49.020 | are blessed with at least five senses. Six when I've had four margaritas. But as humans, we sort of are
00:01:59.340 | inherently capable of just processing all of this at the same time. And that actually is how our short-term
00:02:06.060 | memory gets to work. And taking all of this context and information, we then get to form what's called
00:02:11.900 | a theory of mind. What is going on? What is, you know, how is the world relating to me right now? What
00:02:16.940 | should I be doing about it? So we sense, we think, and then we react. And how do we react? Well, there's a
00:02:26.540 | lot of things right now. But if we take it all the way back to the Stone Age and we think real simple,
00:02:34.620 | a lot of how people used to react and communicate was just unintelligible grunts. And then one day,
00:02:40.460 | that sort of evolved into language as we know it. And to this day, that's still something that we rely
00:02:48.220 | on to communicate and react to the world around us. And that's also how a lot of us think.
00:02:52.780 | So we have language. But the language of communication is so much broader than just words. We're standing
00:03:03.180 | here on stage right now. I'm making eye contact with some of you. Nice shirt. And I'm making
00:03:08.380 | gestures. I'm wearing these ridiculous gloves. I'm looking at Sam. I'm looking at things. I'm
00:03:11.900 | pointing at things. And I can hear sort of laughter or I can hear people, you know, thinking. I'm taking
00:03:19.500 | lots of information at once. And right now I'm sensing, thinking and reacting.
00:03:24.060 | And so this year, well, last year technically, we saw a really amazing thing happen, kind of with the
00:03:33.340 | advent of ChatGPT, I would say, where we saw the beginnings of a computer start to approximate that
00:03:39.660 | same loop, where input was coming in in the form of language, there was some reasoning process,
00:03:45.260 | however that actually works, and then the output also felt like language coming back to us. And
00:03:54.700 | this was very inspiring to me and Jason. And we've been spending a lot of time this past year thinking
00:03:59.020 | about what's next and how this gets to feel even more natural for people to interact with computers
00:04:07.740 | specifically. And so today we wanted to take you on a tour of a few demos, one of which you can do with
00:04:16.060 | the computer right now, and then a few which rely on kind of futuristic or next-generation hardware,
00:04:21.980 | which may be available soon. And knowing that you're all engineers, we know that this will kind
00:04:26.220 | of get the sparks flowing, the ideas flowing, for seeing how, like, you might use some of these
00:04:32.940 | things that are coming out soon or things that exist today to build things that feel more natural.
00:04:37.580 | So I'll start by getting to a demo. And I will say this is a live audio-visual demo, so I am foolish enough
00:04:49.420 | to make that choice. So we will see how it goes. Before we show any demos, it's prudent to point out that
00:04:56.620 | none of these represent the product we are building. They are simply pieces, stories of inspiration.
00:05:05.340 | So the point of this first demo is to imagine we have a lot of things where we're saying, like, okay,
00:05:12.300 | is text the right input? Is audio the right input? And we've been thinking that it's not a question of if those are the
00:05:20.940 | right inputs, but when. So in this case, you'll see some measurements happening on the left here. What's
00:05:25.900 | actually happening is that this has access to my camera, and it's taking real-time pose measurements of
00:05:32.060 | where I am relative to the screen. So it knows I'm at the keyboard, basically, because it's making that
00:05:38.300 | assessment. And you can see the reasoning in the side here, where it's saying user is close to screen,
00:05:42.460 | we'll use keyboard input. User is facing screen, we'll use text output. And so we're using an LLM to
00:05:48.620 | actually make that choice as it goes to the response. So let's try something else. And again, demo gods be
00:05:55.100 | nice, because this may not work at all. But if I now walk away, and it doesn't detect me anymore, it should now
00:06:02.700 | actually start listening to me. Hello? Can you hear me? Are you going to respond?
00:06:09.180 | I think that's a no. It might not respond. But basically, what we are attempting to build here is,
00:06:15.420 | like, if I want to actually talk to the computer in a really natural way, if I'm there next to the
00:06:23.020 | keyboard, it should not be paying attention to my voice or any sounds, ambient sounds, and if I walk away
00:06:30.860 | from the keyboard, I might want to have a conversation with it, like walk around the room. It is listening.
00:06:36.220 | It seems to have decided not to actually talk back. But, oh, it's talking.
00:06:42.460 | Is there something you need help with? That sounds like an interesting project,
00:06:51.260 | Samantha. How is your talk going so far? Yay!
00:06:56.780 | Yes, you can see it paid attention, and it decided to ignore me for a while.
00:07:05.580 | But anyway, this is just like a toy demo. You can see here we have, this is how it's working kind of
00:07:14.780 | behind the scenes. It's like trying to decide if I'm close to the keyboard, facing the screen,
00:07:21.100 | not facing the screen, and use that all as inputs to decide whether it should talk to me or just
00:07:26.940 | display the text on the interface. Cool. So.
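A minimal sketch of the kind of pipeline this demo describes: a rough pose signal ("close to the screen", "facing the screen") is extracted from the camera and handed to an LLM, which picks the input and output modality. The talk doesn't say which libraries they used; OpenCV and MediaPipe here are assumptions, and `call_llm` is a hypothetical stand-in for whatever chat-completion client you prefer.

```python
# Sketch only: pose-driven modality selection, roughly as described in the demo.
# OpenCV/MediaPipe are assumed; `call_llm` is a hypothetical LLM helper.
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose()

def pose_signals(frame) -> dict:
    """Derive rough 'close to screen' / 'facing screen' signals from one camera frame."""
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if not results.pose_landmarks:
        return {"user_present": False}
    lm = results.pose_landmarks.landmark
    shoulder_width = abs(lm[11].x - lm[12].x)      # left vs. right shoulder, normalized
    return {
        "user_present": True,
        "close_to_screen": shoulder_width > 0.25,  # crude proximity heuristic
        "facing_screen": lm[0].visibility > 0.8,   # nose visible => roughly facing us
    }

def choose_modalities(signals: dict) -> dict:
    """Let the LLM pick the channels, mirroring the on-screen reasoning in the demo."""
    prompt = (
        f"Observations about the user: {signals}\n"
        'Decide how to interact. Reply as JSON: '
        '{"input": "keyboard" or "voice", "output": "text" or "speech"}'
    )
    return call_llm(prompt)

def call_llm(prompt: str) -> dict:
    """Hypothetical helper: call your LLM of choice and parse its JSON reply."""
    raise NotImplementedError("wire this up to your LLM provider")
```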
00:07:32.860 | The reason why we think this is interesting is because we think, you know, people are naturally
00:07:39.020 | sensitive to other people. And we think computers, instead of asking people to adapt to computers,
00:07:48.780 | to be like, come up to me and type and whatever, should find ways to try to adapt to circumstances
00:07:54.300 | and context of people. Exactly. So, again here, it's like, in this case, it's adapting to where I am
00:08:03.260 | by using the pose detection, whether or not I'm actually in the process of talking to it,
00:08:07.420 | to decide to update its own world state, use an LLM to actually do that, and then use the LLM to
00:08:12.700 | respond using the knowledge of that world state. And so, this is a really simple and, as you can see,
00:08:17.580 | kind of hacky demo that is something you could build today. In theory, you could imagine how this could
00:08:22.940 | be like a really cool native way to interact with an LLM on your computer where you don't have to worry
00:08:27.980 | about the input modality at all. So, again, the takeaways are: consider explicit inputs, what I'm typing,
00:08:34.940 | what I'm saying, along with implicit inputs, like where I am. There are other things you could do with that, like
00:08:40.220 | tone and emotion detection. You could plug in a whole bunch of different signals that you want to
00:08:45.500 | extract from that. And you can even imagine if I'm in the frame with Sam, and the agent knows Sam and
00:08:51.100 | she had recently been complaining about me, it should probably not bring that up until I leave the frame.
00:08:55.500 | Yeah. And as we mentioned, we're using the LLM as a reasoning engine. And then, next one, cool. And yeah, then we're adapting.
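To make the "world state" idea above concrete, here is a minimal sketch of that loop: implicit signals update a small state object, and the state is injected into every LLM call so the response reflects where the user is. The shape of the state and the `chat` helper are assumptions for illustration, not New Computer's actual implementation.

```python
# Sketch only: update a world state from implicit signals, then let the LLM
# respond with knowledge of that state. `chat` is a hypothetical LLM helper.
from dataclasses import dataclass, asdict

@dataclass
class WorldState:
    user_present: bool = False
    close_to_keyboard: bool = False
    facing_screen: bool = False
    last_heard: str = ""  # most recent transcribed speech, if any

def update_state(state: WorldState, signals: dict, transcript: str) -> WorldState:
    """Fold the latest implicit (pose) and explicit (speech) inputs into the state."""
    state.user_present = signals.get("user_present", False)
    state.close_to_keyboard = signals.get("close_to_screen", False)
    state.facing_screen = signals.get("facing_screen", False)
    if transcript:
        state.last_heard = transcript
    return state

def respond(state: WorldState, user_message: str) -> str:
    """Answer with the world state in the system prompt, so the output fits the moment."""
    system = (
        f"You are an ambient assistant. Current world state: {asdict(state)}. "
        "If the user is close to the keyboard, answer as on-screen text; "
        "if they have walked away, answer briefly, as if speaking aloud."
    )
    return chat(system=system, user=user_message)

def chat(system: str, user: str) -> str:
    """Hypothetical helper around whichever chat-completion API you use."""
    raise NotImplementedError("wire this up to your LLM provider")
```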
00:09:05.580 | So, we want to get to the futuristic stuff. Jason has been spending a lot of time imagining this, so he's
00:09:11.420 | going to walk you through a few things that might exist in the near future, when new hardware comes out.
00:09:16.060 | So, when we think future, we still think the sensing-thinking-react loop will take place. To preface
00:09:25.340 | all of this, these are my personal speculative fictions, not representative of anything that
00:09:31.020 | I think might actually happen. And this is a very conservative view of the next one to 12 months,
00:09:37.420 | maybe. So, it's not a true future future AGI god-worshipping type situation. So, let's start with
00:09:44.060 | what I call like a social interface. We're all really excited about, you know, certain headsets
00:09:50.140 | being released at certain points. And one thing that I think is interesting about some headsets is they
00:09:56.060 | have sensors and they have hand tracking and eye tracking. And just like how I'm being expressive right
00:10:02.620 | now, maybe there comes a day where I can be just as expressive with a computer that sort of lives with me.
00:10:07.260 | So, here I am in my apartment minding my own business. And my ex decides to FaceTime me.
00:10:16.940 | And now I've declined the call. You know, historically with deterministic interfaces,
00:10:27.580 | I would have had to like find the hang-up button or go like, "Hey, Alexa, decline call." Like,
00:10:33.180 | thinking commands, thinking computer-speak. But like, as a person, I can be like, "Fuck off." You
00:10:37.580 | know, I can be like, "I'm busy." I can be like, "I'm sick." You know, like, all of this stuff,
00:10:42.300 | the computer should be able to interpret for me and, you know, send, send, what's his name again,
00:10:47.980 | toxic trashiest, whatever, on his merry way. So, explicit social gestures can be a great way to
00:10:54.540 | determine user intent, like the way I just showed now. But we should also consider interpreting
00:10:59.900 | implicit gestures: whether I give a really fast gesture or a slow gesture, my mood, my tone,
00:11:05.260 | how far away I am. But we should also be conscious of social and cultural norms: different gestures mean
00:11:10.220 | different things in different societies. And it might mean, you know, as you scale your application
00:11:14.700 | or hardware to different locales, this is something that you should pay attention to.
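As a sketch of the gesture interpretation described here: a recognized gesture plus its speed and the user's locale is handed to the model, which infers intent instead of matching a fixed command. The gesture labels, the `GestureEvent` shape, and `call_llm` are all hypothetical.

```python
# Sketch only: interpret an explicit or implicit gesture in context, locale included.
from dataclasses import dataclass

@dataclass
class GestureEvent:
    name: str     # e.g. "wave_away", from your hand-tracking layer (illustrative label)
    speed: float  # 0..1, fast vs. slow -- an implicit signal about mood/urgency
    locale: str   # e.g. "en-US"; the same gesture can mean different things elsewhere

def interpret_gesture(event: GestureEvent, situation: str) -> str:
    """Ask the LLM for the most likely intent, given the social situation."""
    prompt = (
        f"Situation: {situation}\n"
        f"Gesture: {event.name}, speed={event.speed:.2f}, locale={event.locale}\n"
        "Gestures carry different meanings in different cultures. "
        "Reply with exactly one of: decline_call, accept_call, dismiss, none."
    )
    return call_llm(prompt)

def call_llm(prompt: str) -> str:
    """Hypothetical helper for your LLM provider; returns a single intent label."""
    raise NotImplementedError("wire this up to your LLM provider")

# e.g. a fast wave-away while a FaceTime call is ringing -> "decline_call",
# without the user ever having to think in computer-speak.
```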
00:11:19.100 | Now, I want to move on to talk about what I call new physics. And this part is super fun.
00:11:24.060 | Um, this demo is based on, I think, the iPad, which, you know, has over five daily
00:11:31.500 | active users in the world. It's very popular. Um, and here I'm imagining, like, okay, Midjourney,
00:11:37.500 | if I was the founder of Midjourney, I would be putting all my resources and making some sort of,
00:11:41.580 | uh, Midjourney Canvas app for iPads. So, in this one, I've asked Midjourney to create, uh, Balenciaga Naruto,
00:11:49.340 | which now I'm realizing kind of looks like me. Um, so, let's think about the iPad. It's like this big slab
00:11:57.340 | that you can, like, touch and fiddle with, right? So, what do I want to do? Okay, I want to, like, edit
00:12:01.180 | this photo. Um, but first, I need to make space. How do I do that? Well, very easy. You just, you know,
00:12:06.140 | um, you can just zoom out and now you have extra space. Very obvious. We do this all the time. Um,
00:12:12.220 | I kind of think my cat would look really good in that outfit. So, I kind of want to find a way to do
00:12:18.540 | that here. Let me just ask AI real quick. Um, hey, random AI sent me pictures of my cat. And,
00:12:25.820 | you know, the AI knows me and has context and gives me pictures of my cat. And then,
00:12:31.100 | what do I do here? Well, why can't we just take one of the photos and sort of just blend them with
00:12:41.580 | the other? Um, and the metaphor you're seeing here as you sort of work with these photos, they start
00:12:47.580 | glowing when you pick them up. And what does light do? You guys know the Pink Floyd Dark Side of the
00:12:52.700 | Moon album cover. Like, we're really familiar with the idea that light can sort of split into
00:12:58.220 | different colors and sort of concentrate back into one form. And we're leaning into that metaphor
00:13:02.220 | here, implicitly. Um, and so it's now created something that looks 50% human, 50% cat, 100% cringe. I don't
00:13:10.300 | really like this. How do we remix this? What is the gesture? What is the thing we do in real life that's
00:13:14.780 | remixing? Um, for me, it's a margarita. And for Sam, it's her morning Huel. We shake a blender bottle.
00:13:20.940 | So, why, why can't we work with intelligent materials the same way that we work with real
00:13:27.580 | materials and just blend it up? This is totally doable right now. David, why aren't you building
00:13:33.740 | this? If you don't build this, I'm going to build this. It's fine. Um, so, you know, here, the metaphor
00:13:37.820 | is like, what we're trying to say is, you know, think about familiar universal metaphors like physics, like
00:13:43.420 | light, like metaballs, like squishy, like fog, whatever. Because, you know, if you're designing an iPhone, you
00:13:49.180 | have to be very cognizant of the qualities of aluminum and titanium to make an iPhone,
00:13:54.140 | but generative intelligence is a probabilistic material that's sort of more fluid. Maybe it's
00:14:00.540 | fog, maybe it's mercury. Um, and, you know, for this reason, maybe metaphors that are really rigid,
00:14:08.700 | like wood or paper or metal, aren't the right metaphors to use for some of these experiences.
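A minimal sketch of the canvas interactions in this demo: dropping one image onto another blends them, and a shake gesture re-rolls ("remixes") the result with a new seed. The `canvas` object and `generate_blend` are hypothetical placeholders for your UI layer and whatever image-to-image model you call; this is not Midjourney's API.

```python
# Sketch only: "blend by overlapping" and "remix by shaking", with placeholders
# for the canvas UI and the image model (both hypothetical).
import random

def on_drop(source_img, target_img, canvas):
    """Two picked-up (glowing) images recombine into one, like light through a prism."""
    seed = random.randint(0, 2**31 - 1)
    blended = generate_blend([source_img, target_img], seed=seed)
    canvas.replace(target_img, blended)   # hypothetical canvas API
    canvas.remove(source_img)
    return blended, seed

def on_shake(current_img, source_imgs, canvas):
    """Shake = remix: same ingredients, new seed, like shaking a blender bottle."""
    new_seed = random.randint(0, 2**31 - 1)
    remixed = generate_blend(source_imgs, seed=new_seed)
    canvas.replace(current_img, remixed)
    return remixed, new_seed

def generate_blend(images, seed):
    """Placeholder: call an image-to-image model that composites/blends the inputs."""
    raise NotImplementedError("plug in your image generation backend")
```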
00:14:13.820 | So, finally, I want to walk you through an experience that's inherently mixed
00:14:18.780 | modal, um, slash mixed reality. Um, let's imagine for a second there's a piece of hardware coming
00:14:25.180 | out that's a wearable that has a camera on it and has a microphone and it can maybe project things.
00:14:31.180 | I don't know if such a thing will ever exist. But let's imagine for a second it does. Um,
00:14:35.420 | I'm sort of browsing this book, this Beyoncé tour book, and I see these images that I find really
00:14:43.100 | inspiring. Um, what I'm trying to do here is what if I could just point at something on my desk and say,
00:14:49.260 | like, "This is cool," and have the device, uh, pick up on that and indicate that it's heard me
00:14:57.500 | and it's gonna do something by sort of projection mapping this sort of feedback. Um, this is, you know,
00:15:03.260 | this demo doesn't really have sound, but the way this would work is ideally a combination of voice and gesture at the same time.
00:15:08.380 | Um, and obviously this gesture is really easy to make mistakes with, so any time you work with probabilistic materials,
00:15:16.220 | you want to provide a graceful way out. So in this case, I've accidentally tapped this photo.
00:15:20.380 | Why can't I just flick it away like dust? And be like, "That's wrong. I don't want to press an undo button.
00:15:26.300 | I don't want to press Command-Z. I just want to flick it away." Um, really leaning into the physics of it.
00:15:30.940 | Um, so now that I've found two pieces, I'm kind of like, "Okay, I want to send this to two of my friends who,
00:15:37.340 | hmm, there was a friend who I said I would do Halloween with, but I can't really remember their name."
00:15:43.180 | Um, what do I do here? I should ask AI. I should be like, "Who is that friend I said I'd spend Halloween with?"
00:15:50.540 | And you notice here that, like, we're imagining sort of projection mapped UI pieces that can work with the
00:15:59.180 | context of the world you're in right now, such that you don't have to go fish out a phone or use cumbersome
00:16:04.460 | voice commands. Um, it just all sort of naturally melds with the world. Um, and, you know, crucially,
00:16:12.860 | I think one point we want to make is voice in doesn't need to mean voice out, gesture in doesn't need to
00:16:17.260 | mean gesture out, and visual UI in does not need to mean visual UI out. We can mix these modalities
00:16:22.300 | in real time for whatever makes sense in whatever context you're in.
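A small sketch of that decoupling: inputs of any kind are normalized into an intent, and the output channel is chosen separately from the channel the input arrived on. The context keys and channel names here are made up for illustration.

```python
# Sketch only: decouple input modality from output modality.
from dataclasses import dataclass
from typing import Literal

InputKind = Literal["voice", "gesture", "touch", "text"]
OutputKind = Literal["speech", "projected_ui", "screen_text", "haptic"]

@dataclass
class Intent:
    action: str            # e.g. "mark_inspiring", "undo", "find_friend" (illustrative)
    came_in_via: InputKind

def choose_output(intent: Intent, context: dict) -> OutputKind:
    """Voice in doesn't have to mean voice out; pick whatever fits the moment."""
    if intent.action == "undo":
        return "haptic"                 # a flick-away just needs a small confirmation
    if context.get("looking_at_surface"):
        return "projected_ui"           # e.g. a glow around the photo you pointed at
    if context.get("hands_busy"):
        return "speech"
    return "screen_text"

# e.g. pointing at a tour-book photo and saying "this is cool" comes in as
# voice + gesture, but the acknowledgement can go out as projection-mapped UI.
```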
00:16:26.940 | So, given that interactions that require multiple simultaneous inputs are now possible,
00:16:31.260 | um, it's our job as designers and developers to sort of think on behalf of the user and think about
00:16:36.220 | what's the appropriate output given the current context, and be smart about it. Um, yeah.
00:16:43.100 | Yeah, so again, the takeaways, as we mentioned, it's this idea of, we have a lot of sensors and, and
00:16:49.340 | uh, contextual modalities available to us as ingredients, even today. There will be more
00:16:53.500 | tomorrow, as you kind of saw with these upcoming, uh, potential hardware releases. Um, but even now,
00:16:58.700 | with a laptop, with things like typing speed, with things like, uh, the tone of voice, there's a lot
00:17:04.620 | of ways that you could gather context and extract signals from it. You could choose to process it in
00:17:09.580 | a variety of different ways. And so, all of that can now be passed to an LLM and used in a reasoning
00:17:15.180 | layer which decides how, um, both to respond in words and also how to present that information.
00:17:21.980 | Um, and so basically, everything can now be an input and your output could be everywhere and have
00:17:28.780 | every format. Um, at the same time, one might say everything everywhere all at once.
00:17:33.820 | Well, you want to be intentional with it. You know, if someone wants to generate a photo on their Apple
00:17:40.940 | Watch, you're like, why, why? Like, no, use your freaking phone. Jesus. Um, anyway. And the last
00:17:46.540 | thing we'll say is, um, probabilistic interfaces are hard because they have lots of different outputs.
00:17:51.580 | So, a really great way to sort of ground these interfaces is to lean into familiar metaphors,
00:17:56.220 | whether they are from nature, from physics, or even from human-made tools and materials,
00:18:00.460 | like buttons, for now. Um, and you know, social norms are also a material that we work with, right?
00:18:06.060 | So, your banking AI agent probably shouldn't be able to have a deep philosophical chat with you.
00:18:14.220 | That just socially doesn't make sense. That would feel weird.
00:18:16.860 | Exactly. Um, but on the same note, we've related all of these interfaces to what humans
00:18:23.900 | perceive and experience now, but what might a truly intelligent interface look like in the future,
00:18:31.100 | where, if we think of where we are right now as skeuomorphism, what is the abstraction layer above
00:18:35.340 | that? And that's kind of for us to figure out. Um, so, with that, um, yeah, that's all. Thank you.