The Intelligent Interface: Sam Whitmore & Jason Yuan of New Computer

Chapters
0:00 Introduction
1:00 What is intelligence
2:54 Language of communication
5:05 First demo
7:42 Adapting
9:44 Social Interface
14:32 Projection Mapping
16:44 Conclusion
00:00:00.000 |
Hi, everybody. Thanks for having us here today. We're super excited to be here. I'm Sam and I'm 00:00:20.480 |
one of the co-founders of New Computer. And I'm Jason, the other co-founder. And we're really 00:00:25.560 |
excited that we are starting today by letting you all see our pores up close, which is amazing. 00:00:31.660 |
So, you know, when Sam and I started New Computer, we did so because we believed that for so long, 00:00:40.060 |
we've taken certain metaphors and abstractions and tools for granted. And for the first time in what 00:00:46.300 |
feels like 40 years, we can finally change all of that. And we can start thinking from first 00:00:51.900 |
principles, what our relationship, not only with computing, but with intelligence, period, 00:00:56.840 |
should look like in the future. So what do we mean by intelligence? Because, you know, sometimes I'm 00:01:05.660 |
on the internet and I wonder if it even exists. Well, one way to think about intelligence is the 00:01:11.340 |
ability to sort of take in lots of information, different types, different volumes from different 00:01:16.480 |
sources. Visualize those dots here and sort of find ways to make sense of it all, find ways to reason, 00:01:24.080 |
find ways to find meaning. And as human beings, as carbon-based life forms, we do this through a process 00:01:32.320 |
where at first we use our senses to sort of perceive the world around us. Then we, you know, process that 00:01:39.420 |
information in our heads and then given what we think, we then choose a reaction. So if we're lucky, we 00:01:49.020 |
are blessed with at least five senses. Six when I've had four margaritas. But as humans, we sort of are 00:01:59.340 |
inherently capable of just processing all of this at the same time. And that actually is how our short-term 00:02:06.060 |
memory gets to work. And taking all of this context and information, we then get to form what's called 00:02:11.900 |
a theory of mind. What is going on? What is, you know, how is the world relating to me right now? What 00:02:16.940 |
should I be doing about it? So we sense, we think, and then we react. And how do we react? Well, there's a 00:02:26.540 |
lot of things right now. But if we take it all the way back to the Stone Age and we think real simple, 00:02:34.620 |
a lot of how people used to react and communicate was just unintelligible grunts. And then one day, 00:02:40.460 |
that sort of evolved into language as we know it. And to this day, that's still something that we rely 00:02:48.220 |
on to communicate and react to the world around us. And that's also how a lot of us think. 00:02:52.780 |
So we have language. But the language of communication is so much broader than just language. We're standing 00:03:03.180 |
here on stage right now. I'm making eye contact with some of you. Nice shirt. And I'm making 00:03:08.380 |
gestures. I'm wearing these ridiculous gloves. I'm looking at Sam. I'm looking at things. I'm 00:03:11.900 |
pointing at things. And I can hear sort of laughter or I can hear people, you know, thinking. I'm taking 00:03:19.500 |
lots of information at once. And right now I'm sensing, thinking and reacting. 00:03:24.060 |
And so this year, well, last year technically, we saw a really amazing thing happen, kind of with the 00:03:33.340 |
advent of ChatGPT, I would say, where we saw the beginnings of a computer starting to approximate that 00:03:39.660 |
same loop, where input was coming in in the form of language. There was some reasoning process, 00:03:45.260 |
however that actually works. And then the output also felt like language coming back to us. And 00:03:54.700 |
this was very inspiring to me and Jason. And we've been spending a lot of time this past year thinking 00:03:59.020 |
about what's next and how this gets to feel even more natural for people to interact with computers 00:04:07.740 |
specifically. And so today we wanted to take you on a tour of a few demos. One which you can do with 00:04:16.060 |
the computer right now, and then a few which involve kind of futuristic or next-generation hardware, 00:04:21.980 |
which may be available soon. And knowing that you're all engineers, we know that this will kind 00:04:26.220 |
of get the sparks flowing, the ideas flowing, for seeing how, like, you might use some of these 00:04:32.940 |
things that are coming out soon or things that exist today to build things that feel more natural. 00:04:37.580 |
So I'll start by getting to a demo. And I will say this is a live audio visual demo. So I am foolish enough 00:04:49.420 |
to make that choice. So we will see how it goes. Before we show any demos, it's prudent to point out that 00:04:56.620 |
none of these represent the product we are building. They are simply pieces, stories of inspiration. 00:05:05.340 |
So the point of this first demo is to imagine we have a lot of things where we're saying, like, okay, 00:05:12.300 |
is text the right input? Is audio the right input? And we've been thinking that it's not a question of if those are the 00:05:20.940 |
right inputs, but when. So in this case, you'll see some measurements happening on the left here. What's 00:05:25.900 |
actually happening is that this has access to my camera, and it's taking real-time pose measurements of 00:05:32.060 |
where I am relative to the screen. So it knows I'm at the keyboard, basically, because it's making that 00:05:38.300 |
assessment. And you can see the reasoning in the side here, where it's saying user is close to screen, 00:05:42.460 |
we'll use keyboard input. User is facing screen, we'll use text output. And so we're using an LLM to 00:05:48.620 |
actually make that choice as it goes to the response. So let's try something else. And again, demo gods be 00:05:55.100 |
nice, because this may not work at all. But if I now walk away, and it doesn't detect me anymore, it should now 00:06:02.700 |
actually start listening to me. Hello? Can you hear me? Are you going to respond? 00:06:09.180 |
I think that's a no. It might not respond. But basically, what we are attempting to build here is, 00:06:15.420 |
like, if I want to actually talk to the computer in a really natural way, if I'm there next to the 00:06:23.020 |
keyboard, it should not be paying attention to my voice or any sounds, ambient sounds, and if I walk away 00:06:30.860 |
from the keyboard, I might want to have a conversation with it, like walk around the room. It is listening. 00:06:36.220 |
It seems to have decided not to actually talk back. But, oh, it's talking. 00:06:42.460 |
Is there something you need help with? That sounds like an interesting project, 00:06:51.260 |
Samantha. How is your talk going so far? Yay! 00:06:56.780 |
Yes, you can see it paid attention, and it decided to ignore me for a while. 00:07:05.580 |
But anyway, this is just like a toy demo. You can see here we have, this is how it's working kind of 00:07:14.780 |
behind the scenes. It's like trying to decide if I'm close to the keyboard, facing the screen, 00:07:21.100 |
not facing the screen, and use that all as inputs to decide whether it should talk to me or just 00:07:26.940 |
display the text on the interface. Cool. So. 00:07:32.860 |
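As a rough illustration of the kind of logic this demo describes, here is a minimal sketch of choosing an input and output modality from pose landmarks. This is not New Computer's code: the normalized-landmark format is what an off-the-shelf pose estimator such as MediaPipe would produce, and the distance and facing thresholds are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Landmark:
    x: float  # normalized [0, 1] horizontal position in the camera frame
    y: float  # normalized [0, 1] vertical position

def choose_modality(nose: Optional[Landmark],
                    left_shoulder: Optional[Landmark],
                    right_shoulder: Optional[Landmark]) -> dict:
    """Build a tiny 'world state' like the one shown in the demo's side panel."""
    if nose is None or left_shoulder is None or right_shoulder is None:
        # Nobody detected at the keyboard: fall back to listening.
        return {"user_present": False, "input": "voice", "output": "speech"}

    # Apparent shoulder width shrinks as the user moves away from the camera.
    shoulder_width = abs(left_shoulder.x - right_shoulder.x)
    near_screen = shoulder_width > 0.25            # hypothetical threshold

    # If the nose sits roughly between the shoulders, the user faces the screen.
    center_x = (left_shoulder.x + right_shoulder.x) / 2
    facing_screen = abs(nose.x - center_x) < 0.05  # hypothetical threshold

    return {
        "user_present": True,
        "near_screen": near_screen,
        "facing_screen": facing_screen,
        "input": "keyboard" if near_screen else "voice",
        "output": "text" if facing_screen else "speech",
    }

# Toy usage: a user close to and facing the camera.
print(choose_modality(Landmark(0.50, 0.3), Landmark(0.35, 0.5), Landmark(0.65, 0.5)))
# -> {'user_present': True, 'near_screen': True, 'facing_screen': True,
#     'input': 'keyboard', 'output': 'text'}
```

The resulting dictionary is the kind of lightweight world state that the demo's side panel reasons over.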
The reason why we think this is interesting is because we think, you know, people are naturally 00:07:39.020 |
sensitive to other people. And we think computers, instead of asking people to adapt to computers, 00:07:48.780 |
to be like, come up to me and type and whatever, should find ways to try to adapt to circumstances 00:07:54.300 |
and context of people. Exactly. So, again here, it's like, in this case, it's adapting to where I am 00:08:03.260 |
by using the pose detection, whether or not I'm actually in the process of talking to it, 00:08:07.420 |
to decide to update its own world state, use an LLM to actually do that, and then use the LLM to 00:08:12.700 |
respond using the knowledge of that world state. And so, this is a really simple and, as you can see, 00:08:17.580 |
kind of hacky demo that is something you could build today. In theory, you could imagine how this could 00:08:22.940 |
be like a really cool native way to interact with an LLM on your computer where you don't have to worry 00:08:27.980 |
about the input modality at all. So, again, the takeaways are: consider explicit inputs, like what I'm typing, 00:08:34.940 |
what I'm saying, along with implicit ones, like where I am. There are other things you could do with that, like 00:08:40.220 |
tone and emotion detection. You could plug in a whole bunch of different signals that you want to 00:08:45.500 |
extract from that. And you can even imagine if I'm in the frame with Sam, and the agent knows Sam and 00:08:51.100 |
that she had recently been complaining about me, it should probably not bring that up until I leave the frame. 00:08:55.500 |
Yeah. And, as we mentioned, we're using the LLM as a reasoning engine. And then, next one, cool. And yeah, then we're adapting. 00:09:05.580 |
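To make the "LLM as reasoning engine over a world state" idea concrete, here is a hedged sketch using the OpenAI Python SDK: the implicit signals from the pose step and the explicit utterance go in together, and the model decides both whether to respond and in which modality. The prompt, JSON shape, and model name are illustrative assumptions, not the demo's actual implementation.

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You are an ambient assistant living on a laptop.
Given the current world state and the user's latest utterance, reply with JSON:
{"should_respond": true/false, "output_modality": "text" or "speech", "reply": "..."}
Prefer text output when the user is near the keyboard and facing the screen.
Prefer speech when the user has walked away, and stay quiet unless spoken to."""

def respond(world_state: dict, utterance: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(
                {"world_state": world_state, "utterance": utterance})},
        ],
    )
    return json.loads(completion.choices[0].message.content)

# Example: the speaker has walked away from the keyboard and is talking out loud.
state = {"user_present": True, "near_screen": False, "facing_screen": False}
print(respond(state, "Hello? Can you hear me?"))
```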
So, we want to get to the futuristic stuff. Jason has been spending a lot of time imagining this, so he's 00:09:11.420 |
going to walk you through a few things that might exist shortly in the near future when new hardware comes out. 00:09:16.060 |
So, when we think future, we still think the sensing-thinking-react loop will take place. To preface 00:09:25.340 |
all of this, these are my personal speculative fictions, not representative of anything that 00:09:31.020 |
I think might actually happen. And this is a very conservative view of the next one to 12 months, 00:09:37.420 |
maybe. So, it's not a true future future AGI god-worshipping type situation. So, let's start with 00:09:44.060 |
what I call like a social interface. We're all really excited about, you know, certain headsets 00:09:50.140 |
being released at certain points. And one thing that I think is interesting about some headsets is they 00:09:56.060 |
have sensors and they have hand tracking and eye tracking. And just like how I'm being expressive right 00:10:02.620 |
now, maybe there comes a day when I can be just as expressive with a computer that sort of lives with me. 00:10:07.260 |
So, here I am in my apartment minding my own business. And my ex decides to FaceTime me. 00:10:16.940 |
And now I've declined the call. You know, historically with deterministic interfaces, 00:10:27.580 |
I would have had to like find the hang-up button or go like, "Hey, Alexa, decline call." Like, 00:10:33.180 |
thinking commands, thinking computer-speak. But like, as a person, I can be like, "Fuck off." You 00:10:37.580 |
know, I can be like, "I'm busy." I can be like, "I'm sick." You know, like, all of this stuff, 00:10:42.300 |
the computer should be able to interpret for me and, you know, send, send, what's his name again, 00:10:47.980 |
toxic trashiest, whatever, on his merry way. So, explicit social gestures can be a great way to 00:10:54.540 |
determine user intent, like the way I just showed now. But we should also consider interpreting 00:10:59.900 |
implicit gestures: if I give a really fast gesture versus a slow gesture, my mood, my tone, 00:11:05.260 |
how far away I am. But we should also be conscious of social and cultural norms: different gestures mean 00:11:10.220 |
different things in different societies. And it might mean, you know, as you scale your application 00:11:14.700 |
or hardware to different locales, this is something that you should pay attention to. 00:11:19.100 |
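One toy way to picture that locale point is to keep the gesture-to-intent mapping as data that can be overridden per locale, and to refuse to guess when a gesture is ambiguous or offensive there. The gesture labels and mappings below are illustrative only; real intents would come from a hand-tracking model plus actual localization research.

```python
DEFAULT_INTENTS = {
    "wave_away": "decline",
    "thumbs_up": "confirm",
    "beckon_palm_up": "summon",
}

# Per-locale overrides where a gesture reads differently (or is offensive).
LOCALE_OVERRIDES = {
    "gr": {"palm_out": "offensive"},        # the 'moutza' reads as an insult in Greece
    "jp": {"beckon_palm_up": "ambiguous"},  # beckoning is usually done palm-down in Japan
}

def interpret_gesture(gesture: str, locale: str = "us") -> str:
    intents = {**DEFAULT_INTENTS, **LOCALE_OVERRIDES.get(locale, {})}
    intent = intents.get(gesture, "unknown")
    if intent in ("offensive", "ambiguous", "unknown"):
        # Don't guess: ask, or fall back to another modality.
        return "ask_for_clarification"
    return intent

print(interpret_gesture("wave_away", locale="us"))       # -> decline
print(interpret_gesture("beckon_palm_up", locale="jp"))  # -> ask_for_clarification
```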
Now, I want to move on to talk about what I call new physics. And this part is super fun. 00:11:24.060 |
Um, this demo is based on, uh, the iPad, which, you know, has over five daily 00:11:31.500 |
active users in the world. It's very popular. Um, and here I'm imagining, like, okay, Midjourney, 00:11:37.500 |
if I were the founder of Midjourney, I would be putting all my resources into making some sort of, 00:11:41.580 |
uh, Midjourney Canvas app for iPads. So, in this one, I've asked Midjourney to create, uh, Balenciaga Naruto, 00:11:49.340 |
which now I'm realizing kind of looks like me. Um, so, let's think about the iPad. It's like this big slab 00:11:57.340 |
that you can, like, touch and fiddle with, right? So, what do I want to do? Okay, I want to, like, edit 00:12:01.180 |
this photo. Um, but first, I need to make space. How do I do that? Well, very easy. You just, you know, 00:12:06.140 |
um, you can just zoom out and now you have extra space. Very obvious. We do this all the time. Um, 00:12:12.220 |
I kind of think my cat would look really good in that outfit. So, I kind of want to find a way to do 00:12:18.540 |
that here. Let me just ask AI real quick. Um, hey, random AI sent me pictures of my cat. And, 00:12:25.820 |
you know, the AI knows me and has context and gives me pictures of my cat. And then, 00:12:31.100 |
what do I do here? Well, why can't we just take one of the photos and sort of just blend them with 00:12:41.580 |
the other? Um, and the metaphor you're seeing here as you sort of work with these photos, they start 00:12:47.580 |
glowing when you pick them up. And what does light do? You guys know the Pink Floyd, uh, Dark Side of the 00:12:52.700 |
Moon album cover. Like, we're really familiar with the idea that light can sort of, uh, provide 00:12:58.220 |
different colors and, and sort of concentrate back into one form. And we're leaning into that metaphor 00:13:02.220 |
here, implicitly. Um, and so it's now created something that looks 50% human, 50% cat, 100% cringe. I don't 00:13:10.300 |
really like this. How do we remix this? What is the gesture? What is the thing we do in real life that's 00:13:14.780 |
remixing? Um, for me, it's a margarita. And for Sam, it's her morning Huel. We shake a blender bottle. 00:13:20.940 |
So, why, why can't we work with intelligent materials the same way that we work with real 00:13:27.580 |
materials and just blend it up? This is totally doable right now. David, why aren't you building 00:13:33.740 |
this? If you don't build this, I'm going to build this. It's fine. Um, so, you know, here, the metaphor 00:13:37.820 |
is like, what we're trying to say is, you know, think about familiar universal metaphors like physics, like 00:13:43.420 |
light, like metaballs, like squishy, like fog, whatever. Because, you know, if you're designing an iPhone, you 00:13:49.180 |
have to be very cognizant of the qualities of aluminum and titanium to make an iPhone, 00:13:54.140 |
but generative intelligence is a probabilistic material that's sort of more fluid. Maybe it's 00:14:00.540 |
fog, maybe it's mercury. Um, and, you know, for this reason, maybe metaphors that are really rigid, 00:14:08.700 |
like wood or paper or metal, aren't the right metaphors to use for some of these experiences. 00:14:13.820 |
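As a sketch of treating generative output as a "shakeable" material, the snippet below detects a shake from accelerometer samples and re-runs a blend with a fresh seed. The threshold, the sample window, and the generate_blend callable are assumptions; the talk does not name a specific image-generation API, so it is stubbed out here.

```python
import math
import random
from typing import Callable, Sequence, Tuple

def is_shake(samples: Sequence[Tuple[float, float, float]],
             threshold: float = 2.5,   # hypothetical, in g
             min_peaks: int = 3) -> bool:
    """Call it a shake when several acceleration peaks exceed the threshold."""
    peaks = sum(1 for (x, y, z) in samples
                if math.sqrt(x * x + y * y + z * z) > threshold)
    return peaks >= min_peaks

def remix(image_a: str, image_b: str,
          generate_blend: Callable[[str, str, int], str]) -> str:
    """Re-blend with a fresh random seed, so each shake yields a new result."""
    return generate_blend(image_a, image_b, random.randrange(2**32))

# Stubbed usage: the fake blender just reports what it would do.
fake_blend = lambda a, b, seed: f"blend({a}, {b}, seed={seed})"
window = [(0.1, 0.2, 1.0), (3.0, 0.5, 1.2), (-2.8, 0.4, 0.9), (2.9, -0.3, 1.1)]
if is_shake(window):
    print(remix("cat.png", "balenciaga_naruto.png", fake_blend))
```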
So, finally, I want to walk you through an experience that's inherently mixed 00:14:18.780 |
modal, um, slash mixed reality. Um, let's imagine for a second there's a piece of hardware coming 00:14:25.180 |
out that's a wearable that has a camera on it and has a microphone and it can maybe project things. 00:14:31.180 |
I don't know if such a thing will ever exist. But let's imagine for a second it does. Um, 00:14:35.420 |
I'm sort of browsing this book, this Beyoncé tour book, and I see these images that I find really 00:14:43.100 |
inspiring. Um, what I'm trying to do here is what if I could just point at something on my desk and say, 00:14:49.260 |
like, "This is cool," and have the sort of device, uh, pick up on that and, and, and indicate that it's heard me 00:14:57.500 |
and it's gonna do something by sort of projection mapping this sort of feedback. Um, this is, you know, 00:15:03.260 |
this demo doesn't really have sound, but the way this would work is ideally a combination of voice and gesture at the same time. 00:15:08.380 |
Um, and obviously this gesture is really easy to make mistakes with, so any time you work with probabilistic materials, 00:15:16.220 |
you want to provide a graceful way out. So in this case, I've accidentally tapped this photo. 00:15:20.380 |
Why can't I just flick it away like dust? And be like, "That's wrong. I don't want to press an undo button. 00:15:26.300 |
I don't want to press Command-Z. I just want to flick it away." Um, really leaning into the physics of it. 00:15:30.940 |
Um, so now that I've found two pieces, I'm kind of like, "Okay, I want to send this to two of my friends who, 00:15:37.340 |
hmm, there was a friend who I said I would do Halloween with, but I can't really remember their name. 00:15:43.180 |
Um, what do I do here? I should ask AI. I should be like, "Who is that friend I said I'd spend Halloween with?" 00:15:50.540 |
And you notice here that, like, we're imagining sort of projection mapped UI pieces that can work with the 00:15:59.180 |
context of the world you're in right now, such that you don't have to go fish out a phone or use cumbersome 00:16:04.460 |
voice commands. Um, it all just sort of naturally melds with the world. Um, and, you know, crucially, 00:16:12.860 |
I think one point we want to make is voice in doesn't need to mean voice out, gesture in doesn't need to 00:16:17.260 |
mean gesture out, and visual UI in does not need to mean visual UI out. We can mix these modalities 00:16:22.300 |
in real time for whatever makes sense in whatever context you're in. 00:16:26.940 |
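Here is a small sketch of one way the point-and-say interaction could be fused: pair a deictic utterance ("this is cool") with the nearest pointing event in time. The event shapes and the 1.5-second window are assumptions for illustration, not how the imagined hardware would actually work.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PointEvent:
    t: float        # seconds since session start
    target_id: str  # e.g. an object the camera has segmented on the desk

@dataclass
class Utterance:
    t: float
    text: str

def resolve_deixis(utterance: Utterance,
                   recent_points: List[PointEvent],
                   window_s: float = 1.5) -> Optional[str]:
    """Pair 'this/that'-style speech with the nearest pointing event in time."""
    candidates = [p for p in recent_points if abs(p.t - utterance.t) <= window_s]
    if not candidates:
        return None
    return min(candidates, key=lambda p: abs(p.t - utterance.t)).target_id

points = [PointEvent(10.2, "tour_book_photo_3"), PointEvent(4.0, "coffee_mug")]
said = Utterance(10.9, "this is cool")
print(resolve_deixis(said, points))  # -> tour_book_photo_3
```

The input here is voice plus gesture, but, in the spirit of the talk, the acknowledgment could just as well come back as a projected highlight rather than speech.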
So, given that interactions that require multiple simultaneous inputs are now possible, 00:16:31.260 |
um, it's our job as designers and developers to sort of think on behalf of the user and think 00:16:36.220 |
about when and what the appropriate output is, given the current context, and be smart about it. Um, yeah. 00:16:43.100 |
Yeah, so again, the takeaways, as we mentioned, it's this idea of, we have a lot of sensors and, and 00:16:49.340 |
uh, contextual modalities available to us as ingredients, even today. There will be more 00:16:53.500 |
tomorrow, as you kind of saw with these upcoming, uh, potential hardware releases. Um, but even now, 00:16:58.700 |
with a laptop, with things like typing speed, with things like, uh, the tone of voice, there's a lot 00:17:04.620 |
of ways that you could gather context and extract signals from it. You could choose to process it in 00:17:09.580 |
a variety of different ways. And so, all of that can now be passed to an LLM and used in a reasoning 00:17:15.180 |
layer which decides how, um, both to respond in words and also how to present that information. 00:17:21.980 |
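One way to picture that reasoning layer's output is a small structure that separates what to say from how to present it. The field names, modalities, and print-based renderers below are illustrative stand-ins, not a real API.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class OutputDecision:
    reply: str
    modality: Literal["text", "speech", "projected_ui", "silent"]
    reason: str  # keep the reasoning inspectable, like the demo's side panel

# Stand-in renderers; a real system would call TTS, a window, or a projector.
RENDERERS = {
    "text":         lambda msg: print(f"[screen] {msg}"),
    "speech":       lambda msg: print(f"[speaker] {msg}"),
    "projected_ui": lambda msg: print(f"[projector] {msg}"),
    "silent":       lambda msg: None,  # the agent chose not to interrupt
}

def present(decision: OutputDecision) -> None:
    RENDERERS[decision.modality](decision.reply)

present(OutputDecision(
    reply="How is your talk going so far?",
    modality="speech",
    reason="User walked away from the keyboard and spoke to me.",
))
```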
Um, and so basically, everything can now be an input and your output could be everywhere and have 00:17:28.780 |
every format. Um, at the same time, one might say everything everywhere all at once. 00:17:33.820 |
Well, you want to be intentional with it. You know, if someone wants to generate a photo on their Apple 00:17:40.940 |
watch, you're like, why, why? Like, no, use your freaking phone. Jesus. Um, anyway. And the last 00:17:46.540 |
thing we'll say is, um, probabilistic interfaces are hard because they have lots of different outputs. 00:17:51.580 |
So, a really great way to sort of ground these interfaces is to lean into familiar metaphors, 00:17:56.220 |
whether they are from nature, from physics, or even from human-made tools and materials, 00:18:00.460 |
like buttons, for now. Um, and you know, social norms are also a material that we work with, right? 00:18:06.060 |
So, your banking AI agent probably shouldn't be able to have a deep philosophical chat with you. 00:18:14.220 |
That just socially doesn't make sense. That would feel weird. 00:18:16.860 |
Exactly. Um, but on the same note, we've related all of these interfaces to what humans 00:18:23.900 |
perceive and experience now, but what might a truly intelligent interface look like in the future, 00:18:31.100 |
where, if we think of where we are right now as skeuomorphism, what is the abstraction layer above 00:18:35.340 |
that? And that's kind of for us to figure out. Um, so, with that, um, yeah, that's all. Thank you.