Second Order Effects of AI: Cheng Lou

Chapters
0:0 Introduction
2:3 Who is learning
4:46 Music
5:40 User Interface
10:34 Extrapolating
12:3 Agent Focus
14:21 More UI
21:4 Summary
00:00:02.000 |
This is a small visualization of our Lord and Savior 00:00:18.200 |
I was asked to make a cool demo, so here it is. 00:00:21.080 |
This is a single fragment shader drawn fully on the GPU. 00:00:24.000 |
There's no imported asset, no triangle meshes, 00:00:35.360 |
but the GPU wasn't supposed to be abused this way. 00:00:39.080 |
But then again, the entire domain of machine learning 00:00:49.240 |
and why things happen with unintended consequences 00:00:51.960 |
and how you can more reliably predict the future. 00:01:06.920 |
And in 2016, AlphaGo beats the famous Lee Sedol at Go. 00:01:10.600 |
If you were a pessimist, as many were back then, 00:01:13.320 |
you'd say that this is over for chess and Go. 00:01:15.560 |
You know, what's even the point of playing those anymore, right? 00:01:19.480 |
But if you check what actually happened subsequently, 00:01:22.840 |
this is a graph of professional Go players' decision quality over time. 00:01:35.160 |
It has black and white cells and a few simple rules 00:01:41.720 |
But as you start building macro patterns with those, 00:01:45.640 |
And here's Conway's Game of Life implemented in Conway's Game of Life, 00:01:56.760 |
quantity, and quality, or performance increase. 00:02:00.120 |
The domain of machine learning is pretty familiar with this phenomenon. 00:02:04.200 |
And generally speaking, emergent behavior is mostly NP-complete, 00:02:10.920 |
people have to zoom out a level and consider high-level macrodynamics, 00:02:16.440 |
a new set of rules plus various heuristics and errors. 00:02:19.400 |
Those folks work more like biologists than mathematicians or physicists. 00:02:25.240 |
And what I'm trying to say is we cannot easily predict the emergent behavior 00:02:29.400 |
of even a simple system that scales beyond our low-level intuitions. 00:02:33.240 |
So in this talk, I would like to provide a few personal thought processes 00:02:36.120 |
I use to predict some interesting second-order effects of AI, 00:02:39.560 |
aka the ripple effect caused by more direct consequences of this era of AI. 00:02:47.640 |
"Good sci-fi predicts cars, great sci-fi predicts traffic." 00:02:50.680 |
So the first thought process I would like to use is broadly called "Who is learning?" 00:02:56.600 |
Are you learning or is the machine learning, and do you care? 00:03:02.520 |
So for example, chess didn't die, it only got better, right? 00:03:05.240 |
Because it turns out that the crowd dynamic of it is that when you have free chess 00:03:08.840 |
teachers anytime you want, instead of having to seek out that one dude in the village 00:03:12.840 |
that teaches chess, well, everyone ends up knowing chess. 00:03:17.080 |
And when that crowd knowledge becomes distributed widely enough, 00:03:19.880 |
you get to have an audience to sustain more professional playing. 00:03:22.840 |
because ultimately it is you who wants to do the learning, and the audience. 00:03:26.840 |
This is disregarding whether the machine learns better than you or not. 00:03:30.920 |
You go to the equivalent of a mental gym, because no matter how much the machine goes to the gym, 00:03:42.440 |
A blank canvas is actually a very daunting challenge, right? 00:03:45.240 |
But very soon, you'll be able to have the equivalent of what we could call stroke auto-completion. 00:03:51.080 |
So imagine a conceptual slider like this one, where on one side nothing happens, right? 00:03:57.000 |
On the other side, the full drawing is made for you. 00:03:59.560 |
What's interesting is that now we can have a learned behavior where we can slide that into the middle. 00:04:05.560 |
So when you prompt the system that you want to draw a chair, the system goes, 00:04:10.040 |
"Oh, okay, you're curving this way, so I guess you want a Victorian-era chair," right? 00:04:14.520 |
Or when you use a shade of blue, the system goes, 00:04:16.840 |
"Oh, I guess you want to draw the reflection of the sea on her face, but there's sand actually, 00:04:21.160 |
so it should be a little bit more green than this." 00:04:25.240 |
So if you've tried to learn coloring, you know how long the feedback loop to master it actually is. 00:04:30.760 |
And now you can dial that up and down per your need with immediate feedback. 00:04:34.920 |
And ironically, over time, your slider actually goes way more to the left, all the way to the end, 00:04:40.200 |
where you basically stop using AI because you've internalized everything and the skill came back to you. 00:04:46.040 |
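To make that slider a bit more concrete, here is a minimal sketch, assuming a hypothetical suggestStroke model call (not a real API) that proposes a completed stroke for the current drawing; the assist level then just blends the user's raw points with the suggestion:

```typescript
// Sketch: blend a user's raw stroke with an AI-suggested stroke.
// `suggestStroke` stands in for whatever model proposes a completion;
// it is an assumption for illustration, not a real API.

type Point = { x: number; y: number };

declare function suggestStroke(userStroke: Point[], prompt: string): Point[];

// assistLevel = 0 -> keep the user's stroke untouched ("nothing happens")
// assistLevel = 1 -> fully adopt the AI's stroke ("the drawing is made for you")
function blendStroke(userStroke: Point[], prompt: string, assistLevel: number): Point[] {
  const suggestion = suggestStroke(userStroke, prompt);
  const n = Math.min(userStroke.length, suggestion.length);
  const blended: Point[] = [];
  for (let i = 0; i < n; i++) {
    blended.push({
      x: (1 - assistLevel) * userStroke[i].x + assistLevel * suggestion[i].x,
      y: (1 - assistLevel) * userStroke[i].y + assistLevel * suggestion[i].y,
    });
  }
  return blended;
}
```

The interesting part is that assistLevel is a dial the learner controls per stroke, which is what gives the immediate feedback loop described above.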
Similarly for music, yes, we can now generate full songs in one shot for utilitarian ends, 00:04:53.160 |
but AI could also help in a different way for you to learn. 00:04:56.280 |
And what I'm interested in is in music, most of the time you're using direct manipulation 00:05:03.880 |
The impedance mismatch between pressing a piano key and hearing the expected sound is almost none. 00:05:09.160 |
But that lack of indirection is also a trade-off. So this is a theremin and already we're seeing 00:05:14.440 |
a little bit more of an indirect manipulation. So I was wondering, what if you use your fingers to 00:05:19.400 |
create and manipulate a music spectrogram, right? Obviously your fingers aren't fine enough, 00:05:24.440 |
but if the AI has enough world model knowledge to super-sample it for you, so to speak, maybe we'll end up 00:05:29.720 |
with a new form of instrument where you gesture more like a puppeteer, indirectly, and create new kinds of music 00:05:36.280 |
that analog music manipulation couldn't achieve. So here's another reason I'm giving these examples. 00:05:43.720 |
My domain is user interface mostly nowadays. If you think about who's learning, you might end up with the conclusion 00:05:49.000 |
that direct manipulation of user interfaces is actually about learning for yourself, 00:05:54.280 |
akin to going to the gym for yourself, learning to draw for yourself, or learning an abstract instrument for yourself. 00:05:58.280 |
So AKA once the classic user interface of tapping this and tapping that gets increasingly automated away 00:06:07.560 |
Once the utilitarian ends have been met, all that's left is the kind of lifestyle user interfaces where you use them 00:06:14.200 |
not because you're more efficient than the machine, but because you are the one who's trying to learn them 00:06:18.840 |
for whichever self-fulfillment reason. And so in that regard, we might end up with more artisanal quirky niche interfaces 00:06:25.880 |
for luxury lifestyle or other purposes as a second order effect. And the second category I want to talk about 00:06:31.560 |
is the idea of widening the information bandwidth, which is a trick I use quite often. So the other day 00:06:38.520 |
I was looking at some new research results from Anthropic regarding sparse autoencoders, but tangentially 00:06:43.960 |
there was this simple visualization of a cluster and just purely from a visual perspective, it kind of reminded me 00:06:49.880 |
of the movie Arrival by Denis Villeneuve where humanity learns an alien language 00:06:55.560 |
that allows them to unlock their full potential. And I thought, why not take it further and make it one language per person? 00:07:02.520 |
Widen the whole information bandwidth. So up until this point, human language is this somewhat 00:07:07.560 |
standardized communication interface, and it's a very, very narrow bandwidth one, a very lossy one. 00:07:13.240 |
We learn relatively few languages, mostly standardized, and stuff our entire fuzzy ether of information 00:07:21.240 |
into them, hoping that most of it isn't lost in translation. Now AI has basically solved translation, so why not go a step further and translate one English to another English? 00:07:31.560 |
Say I'm arguing with someone, and I say I feel blue, and this is coming from my perspective really, so it's unclear whether this intent arrives at the other person intact. 00:07:43.240 |
Maybe for that particular listener, I should have translated I feel blue into I'm feeling purple, right? 00:07:48.840 |
And what if I can just show it? What if my chat speech bubble is much more dynamic and much more nuanced because the AI understands the other person's aesthetic preferences? 00:07:57.560 |
Right, what if things are fast enough that every sentence can be personalized into a dynamic art piece in 4D or something? 00:08:05.160 |
Way more nuanced information dense, and I can just hand it to the person in AR, right? 00:08:10.360 |
Much denser than static emojis and a few basic curves, right? 00:08:13.880 |
And what if you're communicating in AR, and this gets machine translated into like some kind of cloud around you for the receiver, 00:08:21.240 |
a just-in-time, individual-specific language translation mechanism free of the compromise of a one-size-fits-all, low-bandwidth text language, right? 00:08:32.040 |
Maybe in 50 years, verbal conflict resolution ends up taking on the order of seconds instead of minutes or hours. 00:08:40.280 |
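As a minimal sketch of that English-to-English idea, assuming a hypothetical rephraseFor call backed by whatever model you like (none of these names are real APIs), the same sentence goes out once and each listener receives their own rendering of it:

```typescript
// Hypothetical per-listener "translation" of the same sentence.
// `rephraseFor` stands in for an LLM call and is an assumption, not a real API.

type ListenerProfile = {
  name: string;
  preferences: string; // e.g. "prefers concrete color metaphors, dislikes idioms"
};

declare function rephraseFor(message: string, listener: ListenerProfile): Promise<string>;

async function sendPersonalized(message: string, listeners: ListenerProfile[]): Promise<void> {
  // One intent, many renderings: each listener gets the version most likely
  // to land intact for them.
  for (const listener of listeners) {
    const rendered = await rephraseFor(message, listener);
    console.log(`${listener.name} sees: ${rendered}`);
  }
}
```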
So some more examples, here's one for the iPhone, the hardware user interface, and the next one for the iPad. 00:08:49.000 |
For canvas apps, the act of pressing the pencil against the tablet usually means drawing a line, 00:08:54.840 |
but it is overloaded to be selection, moving, resizing, et cetera, right? 00:08:59.880 |
But the reality is of a much higher bandwidth. 00:09:04.280 |
So for example, if someone multi-taps on the screen with a pencil, maybe right before they said, 00:09:13.240 |
Or maybe they drew a stroke and said, yeah, this probably goes there instead, before that, right? 00:09:18.360 |
And didn't feel like, you know, hunting for the lasso tool, selecting it, coming back and drawing a circle, 00:09:24.920 |
long-tapping to hold the object, moving it, double-tapping the pencil back to the previous pen tool, 00:09:29.480 |
and you do all these kinds of acrobatics because you want to move an item, right? 00:09:32.920 |
So if you want to use a traditional design to categorize and overload the single stroke gesture, 00:09:38.280 |
then you'll inevitably end up with more confusing behaviors with an implicit rule set. 00:09:44.040 |
So traditionally, if your current stroke is conditioned on the current selected tool state, 00:09:49.960 |
the object under your pencil, and maybe the action one second before, 00:09:54.440 |
then when we need to undo the stroke in favor of interpreting it as a tap, this is very messy, right? 00:10:04.600 |
But in this new era, that line shouldn't be only conditioned on the beginning of its Bézier path, right? 00:10:11.080 |
It should be conditioned on the entire world. 00:10:14.520 |
So the tap and the stroke behavior should be as learned, as in machine learned, as possible. 00:10:19.560 |
And some people's short presses, you know, are sometimes just slightly too long and 00:10:23.640 |
trigger the wrong gesture, and all sorts of bad actions that a human observer would have corrected within 00:10:28.360 |
a second. So why can't machine learning just do it, right? Locally, too. 00:10:35.400 |
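Here is a rough sketch of what "conditioned on more of the world" could look like; every name is made up for illustration, and in the version argued for here the hand-written thresholds would be replaced by a small on-device learned model scoring intents from the same features:

```typescript
// Sketch: interpret a pencil input as an intent from context, not just the selected tool.
// The hard-coded rules stand in for a tiny learned classifier.

type Intent = "draw" | "tap-select" | "move";

interface PencilContext {
  strokeLengthPx: number;      // how far the pencil traveled
  durationMs: number;          // how long it stayed down
  objectUnderPencil: boolean;  // is there an object under the starting point?
  selectedTool: "pen" | "lasso";
  recentAction: "drew-stroke" | "moved-object" | "none";
}

function resolveIntent(ctx: PencilContext): Intent {
  // A near-stationary press on an object reads as a selection tap, even if it
  // runs slightly long -- the tolerance a human observer would naturally apply.
  if (ctx.strokeLengthPx < 8 && ctx.durationMs < 350 && ctx.objectUnderPencil) return "tap-select";
  // A drag starting on an object right after drawing likely means "move it there".
  if (ctx.objectUnderPencil && ctx.recentAction === "drew-stroke") return "move";
  // Otherwise fall back to the selected tool's default behavior.
  return ctx.selectedTool === "lasso" ? "tap-select" : "draw";
}
```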
So the last thought process I like to use often is extrapolating a certain quantity or quality 00:10:43.640 |
to the extreme, which causes all sorts of fun emergent behavior you can try to guess, like 00:10:49.480 |
the previously mentioned Conway's Game of Life, for example. And then I can reason from first principles 00:10:55.000 |
there and see what kind of new things we can get from this. 00:11:00.360 |
So if anyone's into programming languages, this is Smalltalk, a programming language environment 00:11:06.440 |
from the '70s. That's the grandfather of the original object-oriented programming, 00:11:11.560 |
which inspired Objective-C and other languages. One of its main characteristics is message passing, 00:11:17.880 |
as in sending commands, maybe even to another remote Smalltalk object on a different computer 00:11:22.920 |
somewhere else through LAN or, later on, the internet. I'm going to spare you the details, but Alan Kay, 00:11:29.560 |
one of its inventors, said the inspiration is basically, well, it's inspired by cells, 00:11:37.640 |
right? And that each object is basically a full computer you can examine and poke into, 00:11:43.160 |
and then you can do things with it. And recursively, it might be a one-to-one mapping to the computer, 00:11:49.320 |
it might be that one computer has many objects, etc. And also, he did say, somewhat more obscurely, 00:11:55.800 |
that sending the message, sending a command to another computer, that's easy, but finding a receiver, 00:12:01.320 |
that's hard. So each Smalltalk object can theoretically smartly go to the internet, do stuff, 00:12:11.960 |
come back with an answer, like a little self-directed intelligent agent, if this sounds familiar to this 00:12:18.600 |
audience. But Smalltalk had a big problem, which is that when an agent is as smart as it can be, 00:12:25.000 |
it's also as resource-intensive arbitrarily as it can be. So when each agent takes up enough resource, 00:12:33.400 |
you only get to have a single- or double-digit number of them, right, by the law of numbers. So you missed 00:12:39.480 |
out on an entire category of emergent behavior because you try to be too smart at the lower level. 00:12:44.600 |
In this case, an emergent behavior from quantity and collaboration. 00:12:48.920 |
So on the other hand, look at this multi-layer perceptron. That's a graph. So interestingly, 00:12:57.080 |
it kind of solved the receiver-discovery problem because you can make it fully connected or 00:13:01.880 |
whatever. And because the weights are learned, you backpropagate and, you know, some connections are more 00:13:07.160 |
important than others, right? The biggest difference between this and a swarm of agents is that the 00:13:13.080 |
nodes are as dumb as they get. And when they're dumb and simple, you could have millions of them. 00:13:18.840 |
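For contrast with a swarm of smart agents, here is a minimal sketch of the "dumb and simple" side: one fully connected layer where every node does the same trivial multiply-add, and all the routing between nodes lives in the learned weights rather than in any node's intelligence:

```typescript
// Sketch: one fully connected layer. Each "node" is as dumb as it gets;
// which connections matter is entirely encoded in the learned weights.
function denseLayer(input: number[], weights: number[][], biases: number[]): number[] {
  return weights.map((row, j) => {
    let sum = biases[j];
    for (let i = 0; i < input.length; i++) sum += row[i] * input[i];
    return Math.max(0, sum); // ReLU
  });
}
```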
And when that happens, you could leverage the emergent behavior of the aggregate and create a new media 00:13:23.960 |
completely. So there are quite a few agent-focused talks in the domain of ML. And I'd like to take this 00:13:32.040 |
opportunity to use this method to offer some interesting food for thought. So for example, 00:13:38.920 |
the more agents you have, the more you zoom out to care more about the aggregate rather than lower level 00:13:44.520 |
agents, right? Just like people and civilization. And the more you zoom out, the less you actually care 00:13:50.120 |
about each individual agent. So in an alternative reality, not in this one, we invented a couple of agents 00:13:59.400 |
who got sent to scour the part of the internet called Wikipedia and came back with some snippets of information. 00:14:05.240 |
However, thankfully, in our reality, we sent billions of dumber nodes to read Wikipedia and aggregated 00:14:13.000 |
all of them together to sit in a phone, coordinated by a single smart top-level process. 00:14:21.800 |
So here's my last example. This was actually freshly picked. This is the Apple TV macOS UI. Pretty decent 00:14:29.560 |
looking. I'd like you to pay attention to this part, the circled red part, which is the more button here. 00:14:35.880 |
When you click on it to see the full description, what do you expect? Well, where does the description 00:14:41.160 |
expand to, right? It turns out that you get a very atypical Apple UI. When you click on it, you get this, 00:14:47.880 |
which is literally a big UITextView, right? Very un-Apple. So it looks like an unfinished notepad. 00:14:54.600 |
In fact, you can kind of select it and do things with it, which is weird. The thing is, this Apple TV 00:15:00.360 |
Mac app is actually a Catalyst app, which, for those who don't do iOS development, means it's a direct 00:15:06.840 |
port of their iOS app here. On iOS, if you tap the description and get such a new view, then things don't 00:15:15.560 |
look too out of place. In fact, it's rather idiomatic. Now, you might say that the problem 00:15:20.360 |
here is a lack of UI design, a lack of care, a lack of craftsmanship. But for the sake of making 00:15:27.560 |
a point for this talk, I would like to offer the perspective that this might actually be, 00:15:34.040 |
literally, a lack of more UI. So what would the world look like if we extrapolated that 00:15:41.560 |
quantity, if we raised the order of magnitude and had way more UI, two orders of magnitude more, right? 00:15:47.400 |
What does that even mean? So let's start with a simple 12 column grid, right? We first list out all the 00:15:54.360 |
discrete pieces of information we want to potentially show across this entire view or maybe across the 00:15:58.920 |
entire app, right? Now that we have AI nowadays, at design time, not at runtime, at design time, we could 00:16:05.560 |
generate thousands of these layout permutations for a show's UI screens, right? We're not shipping 00:16:11.720 |
these, just using AI to generate a bunch of potential candidates. So previously, this task wasn't achievable 00:16:17.560 |
through traditional means unless you were in a particular niche, since we didn't have a way to pay attention to 00:16:23.560 |
the semantic relationship between, say, a show's title and the box's size and position in relation 00:16:29.720 |
to other items, right? You could still generate plain boxes through traditional heuristics and generative 00:16:34.680 |
algorithms, but you'd have a hard time tagging each box with the right piece of information, for example. 00:16:39.480 |
So after our first pass, we can use a scoring mechanism, either traditional or some fancy AI-driven scoring heuristic 00:16:49.560 |
for aesthetic things, to eliminate undesirable layouts at data-generation time. And of course, you'd involve 00:16:56.680 |
the designer here too. This is done at design time offline, not at runtime. So we can use an algorithm 00:17:03.000 |
that's as slow as we need and the designer can take as much time as he or she needs and patiently curate the subset, 00:17:09.400 |
which is quite large. So the key here is that you've generated not 10, right? You've generated a thousand of these 00:17:20.520 |
layouts through smarter generative and semantic filtering techniques. So we're raising the order of magnitude, right? 00:17:26.520 |
You're not a designer making a single-digit number of designs, moving boxes yourself in Figma and waiting for your boss 00:17:33.640 |
to go, "Can we move this box somewhere else instead?" And just one more ad hoc design, please, I promise. It's just one more. 00:17:40.200 |
You'll solve everything, right? And of course, you want to involve the designer in this particular stage too. 00:17:45.640 |
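Here is a minimal sketch of that design-time pipeline, with every name and number made up for illustration: enumerate candidate placements on a 12-column grid, score them with whatever heuristic or learned scorer you have, and keep a large but curatable subset for the designer:

```typescript
// Design-time sketch: generate many candidate layouts and keep the best-scoring ones.
// `scoreLayout` could be a hand-written heuristic or an AI-driven scorer;
// both are assumptions here, not a real pipeline.

type Box = { item: string; col: number; span: number; row: number };
type Layout = Box[];

const items = ["poster", "title", "description", "cast", "playButton"];

function generateCandidates(count: number): Layout[] {
  const layouts: Layout[] = [];
  for (let k = 0; k < count; k++) {
    layouts.push(
      items.map((item) => ({
        item,
        col: Math.floor(Math.random() * 10) + 1,      // start column on a 12-column grid
        span: Math.floor(Math.random() * 6) + 2,      // 2..7 columns wide
        row: Math.floor(Math.random() * items.length) // coarse vertical slot
      }))
    );
  }
  return layouts;
}

// Placeholder scorer: penalize boxes spilling off the grid. A real version would also
// weigh semantic relationships (title near poster, etc.) or call a learned scorer.
function scoreLayout(layout: Layout): number {
  return layout.reduce((s, b) => s + (b.col + b.span <= 13 ? 1 : -5), 0);
}

function curate(count: number, keep: number): Layout[] {
  return generateCandidates(count)
    .sort((a, b) => scoreLayout(b) - scoreLayout(a))
    .slice(0, keep); // this subset is what the designer patiently curates, offline
}
```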
So maybe at some point, you also decide to throw in a little bit of diffusion, again at rough-draft time, using some ControlNet or whatever, 00:17:58.760 |
to generate some rough website to give the boss more of an immersive feeling, right? To say, like, this layout can't work. It's not just like boxes, right? 00:18:07.640 |
And at app runtime now, which is the part I'm personally interested in: right now, LLM generations, they generate at writing time, 00:18:20.840 |
and then they generate like two, right, or three. And then you pick one, and then maybe you ship that, and then it's a traditional web app. 00:18:27.160 |
But the thing is, if your bottleneck is the web part, even an AGI cannot help you make your JavaScript faster than C++, right? 00:18:36.120 |
So we have to swap out some of these items with an actual neural net if we want to advance and use the web platform, for example, 00:18:45.560 |
or any other platform, for that matter. So at runtime, you have, for example, a quick decision tree to choose the right layout. 00:18:52.760 |
So again, modern web development roughly has this one single heuristic for you to select the right design, 00:19:00.120 |
which is called media queries, and all it does is, depending on the width of your window, 00:19:05.160 |
you might show or hide some items. But this entire space could actually use some help from learned algorithms. 00:19:13.720 |
So, for example, what if the user's onboarding? Why is that a different concept when all you do is hide or show some different boxes? 00:19:22.040 |
It's just another set of boxes, right? What if the user is a super user? Maybe you progressively show them a different set of layouts, right? 00:19:28.600 |
Maybe they require different screens. And what if the user is in a different country, right? 00:19:37.160 |
Different age? Search query? So to be clear, big companies like Uber and Facebook already do this on a daily basis, right? 00:19:43.160 |
When you use Uber, the app in India or China, it looks drastically different, right? 00:19:51.720 |
That's thousands of engineers' worth of effort, right? For big companies, right? 00:19:55.720 |
And they create an entire moat out of the fact that they have a few more designed UIs plus the business logic, to be fair. 00:20:02.280 |
And it's very brittle, right? You cannot see everything. 00:20:06.280 |
And the algorithm is basically less controllable than even a simple decision tree, a classifier. 00:20:12.280 |
So if a user is fuzzy searching, right? This might be a better example. 00:20:18.040 |
You know, "Hey, what movie did Denis Villeneuve make?" Right? 00:20:22.360 |
This goes into a decision tree and shows the curated layout. 00:20:25.160 |
If the user is instead saying, "Hey, what movie did Denis Villeneuve make and with whom?" 00:20:34.520 |
So maybe you're asking a chatbot, in which case the layout is even more contextual, right? 00:20:39.560 |
If you do a napkin calculation of the number of generated and curated UIs you'd ever need, 00:20:45.240 |
it might actually be in the thousands, not tens, right? 00:20:49.000 |
Fortunately, 1,000 can still be curated thanks to AI. 00:20:51.960 |
So essentially, it's not an autoregressive problem. 00:20:56.600 |
It's a simple classification problem because we have discrete categories here. 00:21:06.600 |
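As a minimal sketch of that runtime step, with hypothetical layout IDs and context fields: the pre-curated layouts are the discrete categories, and selection is just a cheap mapping from the user's context, which a decision tree or tiny classifier trained offline could stand in for:

```typescript
// Runtime sketch: pick one of the pre-curated layout IDs from the user's context.
// The rules below stand in for a decision tree or tiny classifier trained offline;
// all layout IDs and context fields are made up for illustration.

interface UserContext {
  onboarding: boolean;
  superUser: boolean;
  country: string;
  queryKind: "none" | "title-search" | "conversational";
  viewportWidth: number;
}

type LayoutId = "compact" | "onboarding-tour" | "power-grid" | "chat-answer" | "default";

function selectLayout(ctx: UserContext): LayoutId {
  if (ctx.queryKind === "conversational") return "chat-answer";
  if (ctx.onboarding) return "onboarding-tour";
  if (ctx.superUser) return "power-grid";
  if (ctx.viewportWidth < 768) return "compact"; // the media-query case is just one more feature
  return "default";
}
```

The point of the sketch is that nothing here is generated token by token at runtime; it's a lookup over a finite, already-curated set.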
Second-order effects are pretty unpredictable, 00:21:08.840 |
and there are many ways to tame thinking about them. 00:21:12.040 |
If you think about these points among others, 00:21:13.880 |
then I think you'll be decently prepared when the time comes. 00:21:16.520 |
And of course, you know, read from history and you've got to do things. 00:21:20.520 |
Don't forget that the best way to predict the future is to invent it.