
Second Order Effects of AI: Cheng Lou


Chapters

0:00 Introduction
2:03 Who is learning
4:46 Music
5:40 User Interface
10:34 Extrapolating
12:03 Agent Focus
14:21 More UI
21:04 Summary

Whisper Transcript

00:00:02.000 | This is a small visualization of our Lord and Savior
00:00:16.880 | matrix multiplication.
00:00:18.200 | I was asked to make a cool demo, so here it is.
00:00:21.080 | This is a single fragment shader drawn fully on the GPU.
00:00:24.000 | There's no imported asset, no triangle meshes,
00:00:26.680 | just purely a few hundred lines of GLSL.
00:00:30.280 | Shader art is a niche digital art form,
00:00:32.440 | and I highly recommend you check it out,
00:00:35.360 | but the GPU wasn't supposed to be abused this way.
00:00:39.080 | But then again, the entire domain of machine learning
00:00:41.120 | is enjoying a renaissance thanks to it.
00:00:43.280 | So how did that happen?
00:00:45.560 | Today, I would like to explore a little bit
00:00:47.600 | of these kinds of second-order effects
00:00:49.240 | and why things happen with unintended consequences
00:00:51.960 | and how you can more reliably predict the future.
00:00:55.000 | So Deep Blue beat Kasparov in chess, 1997.
00:01:06.920 | And in 2016, AlphaGo beat the famous Lee Sedol in Go.
00:01:10.600 | If you were a pessimist, as many were back then,
00:01:13.320 | you'd say that this is over for chess and Go.
00:01:15.560 | You know, what's even the point of playing those anymore, right?
00:01:17.720 | If humans aren't at the top.
00:01:19.480 | But if you check what actually happened subsequently,
00:01:22.840 | this is a graph of professional Go players' decision quality over time.
00:01:27.160 | You know, guess where AlphaGo happened?
00:01:29.400 | And on a completely unrelated topic,
00:01:33.320 | this is Conway's Game of Life.
00:01:35.160 | It has black and white cells and a few simple rules
00:01:37.400 | for how the cells actually interact.
00:01:39.720 | And at first glance, it's no big deal.
00:01:41.720 | But as you start building macro patterns with those,
00:01:44.040 | you get cool things like these.
00:01:45.640 | And here's Conway's Game of Life implemented in Conway's Game of Life,
00:01:51.000 | Turing Complete.
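To make those "few simple rules" concrete, here is a minimal sketch of one Game of Life step in TypeScript (an illustration, not something from the talk): a live cell survives with two or three live neighbors, and a dead cell comes alive with exactly three.

```typescript
// Minimal Conway's Game of Life step: grid[y][x] is 1 (alive) or 0 (dead).
type Grid = number[][];

function step(grid: Grid): Grid {
  return grid.map((row, y) =>
    row.map((cell, x) => {
      // Count the 8 neighbors, treating out-of-bounds cells as dead.
      let neighbors = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dx === 0 && dy === 0) continue;
          neighbors += grid[y + dy]?.[x + dx] ?? 0;
        }
      }
      // The rules: survival with 2 or 3 neighbors, birth with exactly 3.
      return cell === 1
        ? (neighbors === 2 || neighbors === 3 ? 1 : 0)
        : (neighbors === 3 ? 1 : 0);
    })
  );
}

// A "glider", one of the simplest macro patterns that emerges from these rules.
let world: Grid = [
  [0, 1, 0, 0, 0],
  [0, 0, 1, 0, 0],
  [1, 1, 1, 0, 0],
  [0, 0, 0, 0, 0],
  [0, 0, 0, 0, 0],
];
world = step(world); // advance one generation
```

Everything interesting, like the self-hosting construction above, emerges from repeating this tiny step at scale.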
00:01:51.640 | These are examples of emergent behaviors
00:01:54.600 | produced by an order-of-magnitude
00:01:56.760 | quantity, quality, or performance increase.
00:02:00.120 | The domain of machine learning is pretty familiar with this phenomenon.
00:02:04.200 | And generally speaking, emergent behavior is mostly NP-complete,
00:02:07.560 | so you can't compute it easily.
00:02:09.560 | So to create these patterns,
00:02:10.920 | people have to zoom out a level and consider high-level macrodynamics,
00:02:16.440 | a new set of rules plus various heuristics and errors.
00:02:19.400 | Those folks work more like biologists than mathematicians or physicists.
00:02:25.240 | And what I'm trying to say is we cannot easily predict the emergent behavior
00:02:29.400 | of even a simple system that scales beyond our low-level intuitions.
00:02:33.240 | So in this talk, I would like to provide a few personal thought processes
00:02:36.120 | I use to predict some interesting second-order effects of AI,
00:02:39.560 | a.k.a. the ripple effects caused by the more direct consequences of this era of AI.
00:02:45.240 | And as a famous sci-fi author once said,
00:02:47.640 | "Good sci-fi predicts cars, great sci-fi predicts traffic."
00:02:50.680 | So the first one I would like to use is broadly called "Who is learning?"
00:02:56.600 | Are you learning or is the machine learning, and do you care?
00:02:59.320 | I'm mostly talking about what people want.
00:03:02.520 | So for example, chess didn't die, it only got better, right?
00:03:05.240 | Because it turns out that the crowd dynamic of it is that when you have free chess
00:03:08.840 | teachers anytime you want, instead of having to seek out that one dude in the village
00:03:12.840 | that teaches chess, well, everyone ends up knowing chess.
00:03:17.080 | And when that crowd knowledge becomes distributed widely enough,
00:03:19.880 | you get to have an audience to sustain more professional playing.
00:03:22.840 | Because ultimately it is you and the audience who want to do the learning.
00:03:26.840 | This is disregarding whether the machine learns better than you or not.
00:03:30.920 | You go to the equivalent of a mental gym, because no matter how much the machine goes to the gym,
00:03:36.200 | you won't get better unless you do.
00:03:37.800 | The same is true for drawing.
00:03:40.360 | Imagine a novice learning to draw.
00:03:42.440 | A blank canvas is actually a very daunting challenge, right?
00:03:45.240 | But very soon, you'll be able to have this equivalent of what we call a stroke auto-completion.
00:03:51.080 | So imagine a conceptual slider like this one, where on one side nothing happens, right?
00:03:57.000 | On the other side, the full drawing is made for you.
00:03:59.560 | What's interesting is that now we can have a learned behavior where we can slide that into the middle.
00:04:05.560 | So when you prompt the system that you want to draw a chair, the system goes,
00:04:10.040 | "Oh, okay, you're curving this way, so I guess you want a Victorian-era chair," right?
00:04:14.520 | Or when you use a shade of blue, the system goes,
00:04:16.840 | "Oh, I guess you want to draw the reflection of the sea on her face, but there's sand actually,
00:04:21.160 | so it should be a little bit more green than this."
00:04:25.240 | So if you've tried to learn coloring, you know how long the feedback loop actually is to master this.
00:04:30.760 | And now you can dial that up and down per your need with immediate feedback.
00:04:34.920 | And ironically, over time, your slider actually goes way more to the left, all the way to the end,
00:04:40.200 | where you basically stop using AI because you've internalized everything and the skill came back to you.
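As a rough illustration of that conceptual slider, here is a minimal sketch in TypeScript: the user's raw stroke is blended with an AI-suggested stroke, and the slider value decides how much the machine takes over. The `suggestStroke` model call and the simple linear blend are assumptions for illustration, not an existing API.

```typescript
// Sketch of the "assistance slider": blend the user's raw stroke with an
// AI-suggested stroke. suggestStroke() is a hypothetical placeholder, not a real API.
interface Point { x: number; y: number; }

// Placeholder for a learned stroke-completion model; here it just echoes the input.
function suggestStroke(raw: Point[], prompt: string): Point[] {
  return raw;
}

// assistance = 0: nothing happens (slider all the way left).
// assistance = 1: the stroke is fully replaced by the AI's suggestion (all the way right).
function blendStroke(raw: Point[], prompt: string, assistance: number): Point[] {
  const suggested = suggestStroke(raw, prompt);
  const n = Math.min(raw.length, suggested.length);
  return Array.from({ length: n }, (_, i) => ({
    x: raw[i].x * (1 - assistance) + suggested[i].x * assistance,
    y: raw[i].y * (1 - assistance) + suggested[i].y * assistance,
  }));
}

// e.g. blendStroke(penPoints, "Victorian-era chair", 0.5) sits in the middle of the slider.
```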
00:04:46.040 | Similarly for music, yes, we can now generate full songs in one shot for utilitarian ends,
00:04:53.160 | but AI could also help in a different way for you to learn.
00:04:56.280 | And what I'm interested in is in music, most of the time you're using direct manipulation
00:05:01.400 | in UI speak of the instrument, right?
00:05:03.880 | The impedance mismatch between pressing a piano key and hearing the expected sound is almost none.
00:05:09.160 | But that lack of indirection is also a trade-off. So this is a theremin and already we're seeing
00:05:14.440 | a little bit more of an indirect manipulation. So I was wondering, what if you use your fingers to
00:05:19.400 | create and manipulate a music spectrogram, right? Obviously your fingers aren't fine enough, but if the AI has
00:05:24.440 | enough world model knowledge to supersample it for you, so to speak, maybe we'll end up
00:05:29.720 | with a new form of instrument where you gesture more like a puppeteer, indirectly, and create new kinds of music
00:05:36.280 | that analog music manipulation couldn't achieve. So here's another reason I'm giving these examples.
00:05:43.720 | My domain is user interface mostly nowadays. If you think about who's learning, you might end up with
00:05:49.000 | the conclusion that direct manipulation of user interfaces is actually about learning for yourself,
00:05:54.280 | akin to going to the gym for yourself, learning to draw for yourself, or learning an abstract instrument for yourself.
00:05:58.280 | So AKA once the classic user interface of tapping this and tapping that gets increasingly automated away
00:06:07.560 | Once the utilitarian ends have been met, all that's left is the kind of lifestyle user interfaces where you use them
00:06:14.200 | not because you're more efficient than the machine, but because you are the one who's trying to learn them
00:06:18.840 | for whichever self-fulfillment reason. And so in that regard, we might end up with more artisanal quirky niche interfaces
00:06:25.880 | for luxury lifestyle or other purposes as a second order effect. And the second category I want to talk about
00:06:31.560 | is the idea of widening the information bandwidth, which is a trick I use quite often. So the other day
00:06:38.520 | I was looking at some new research results from Anthropic regarding sparse autoencoders, but tangentially
00:06:43.960 | there was this simple visualization of a cluster and just purely from a visual perspective, it kind of reminded me
00:06:49.880 | of the movie Arrival by Denis Villeneuve where humanity learns an alien language
00:06:55.560 | that allows them to unlock their full potential. And I thought, why not take it further and make it one language per person?
00:07:02.520 | Widen the whole information bandwidth. So up until this point, human language is this somewhat
00:07:07.560 | standardized communication interface, and it's a very, very narrow bandwidth one, a very lossy one.
00:07:13.240 | We learn relatively few languages, mostly standardized, and stuff our entire fuzzy ether of information
00:07:21.240 | into them, hoping that most of it isn't lost in translation. Now that AI has basically solved translation, why not go a step further and translate one English to another English?
00:07:31.560 | Say I'm arguing with someone, and I say I feel blue, and this is coming from my perspective really, so it's unclear that this intent arrives to the other person intact.
00:07:43.240 | Maybe for that particular listener, I should have translated I feel blue into I'm feeling purple, right?
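A minimal sketch of that "English to English" translation idea, assuming only a generic `llm()` text-completion placeholder rather than any particular vendor API; the profile fields are invented for illustration.

```typescript
// Sketch of "translate one English to another English" for a specific listener.
// llm() is a placeholder for whatever text-completion model you have, not a real API.
interface ListenerProfile {
  name: string;
  style: string; // e.g. "prefers concrete color metaphors; 'purple' lands better than 'blue'"
}

async function llm(prompt: string): Promise<string> {
  // Placeholder: wire this up to an actual completion endpoint.
  return prompt;
}

async function translateForListener(message: string, listener: ListenerProfile): Promise<string> {
  const prompt =
    `Rephrase the message below so its intent lands intact for ${listener.name}, ` +
    `who ${listener.style}. Keep the meaning; change only the wording.\n\n` +
    `Message: ${message}`;
  return llm(prompt);
}

// e.g. translateForListener("I feel blue", { name: "Alex", style: "thinks in purples" })
```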
00:07:48.840 | And what if I can just show it? What if my chat speech bubble is much more dynamic and much more nuanced because the AI understands the other person's aesthetic preferences?
00:07:57.560 | Right, what if things are fast enough that every sentence can be personalized into a dynamic art piece in 4D or something?
00:08:05.160 | Way more nuanced and information-dense, and I can just hand it to the person in AR, right?
00:08:10.360 | Much denser than static emojis and a few basic curves, right?
00:08:13.880 | And what if you're communicating in AR, and this gets machine-translated into, like, some kind of cloud around you for the receiver,
00:08:21.240 | a just-in-time, individual-specific language translation mechanism free of the compromise of a one-size-fits-all, low-bandwidth text language, right?
00:08:32.040 | Maybe verbal conflict resolution ends up taking on the order of seconds instead of minutes or hours, in 50 years.
00:08:40.280 | So some more examples, here's one for the iPhone, the hardware user interface, and the next one for the iPad.
00:08:49.000 | For canvas apps, the act of pressing the pencil against the tablet usually means drawing a line,
00:08:54.840 | but it is overloaded to be selection, moving, resizing, et cetera, right?
00:08:59.880 | But the reality is of a much higher bandwidth.
00:09:04.280 | So for example, if someone multi-taps on the screen with a pencil, maybe right before they said,
00:09:09.160 | why is this part red, right?
00:09:11.800 | Can we change it?
00:09:13.240 | Or maybe they drew a stroke, they said, yeah, this probably goes there instead before that, right?
00:09:18.360 | And didn't feel like, you know, hunting for the lasso tool, selecting it, coming back and drawing a circle,
00:09:24.920 | long tap to hold the object, move it, double tap pencil back to the previous pen tool,
00:09:29.480 | and you do all these kinds of acrobatics because you want to move an item, right?
00:09:32.920 | So if you want to use a traditional design to categorize and overload the single stroke gesture,
00:09:38.280 | then you'll inevitably end up with more confusing behaviors with an implicit rule set.
00:09:44.040 | So traditionally, if your current stroke is conditioned on the current selected tool state,
00:09:49.960 | the object under your pencil, and maybe the action one second before,
00:09:54.440 | if we need to undo the stroke in favor of interpreting it as a tap, this is very messy, right?
00:10:01.640 | It requires very fine design and craftsmanship.
00:10:04.600 | But in this new era, that line shouldn't only be conditioned on the beginning of its Bézier path, right?
00:10:11.080 | It should be conditioned on the entire world.
00:10:14.520 | So the tap and the stroke behavior should be as learned, as in machine learned, as possible.
00:10:19.560 | And some people's short presses, you know, sometimes they're just slightly too long and
00:10:23.640 | trigger the wrong gesture and all sorts of bad actions that a human observer would have corrected within
00:10:28.360 | a second. So why can't machine learning just do it, right? Locally, too.
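Here is a small sketch of what conditioning that decision on more context could look like. The hand-written rules below merely stand in for the learned, on-device classifier being argued for, and the feature names and thresholds are made up for illustration.

```typescript
// Sketch: decide whether a pencil input was meant as a tap, a selection, or a drawn
// stroke, conditioned on context rather than only on the stroke geometry.
interface GestureContext {
  strokeLengthPx: number;     // how far the pencil traveled
  pressDurationMs: number;    // how long it was held down
  objectUnderPencil: boolean; // is there an object under the tip?
  selectedTool: "pen" | "lasso" | "eraser";
  recentUtterance?: string;   // e.g. "why is this part red?" said a second before
}

type Intent = "tap" | "select" | "draw";

function classifyIntent(ctx: GestureContext): Intent {
  // A question about an existing object strongly suggests pointing, not drawing.
  if (ctx.recentUtterance?.endsWith("?") && ctx.objectUnderPencil) return "tap";
  // Barely any movement: treat a slightly-too-long press as a tap anyway.
  if (ctx.strokeLengthPx < 8) return "tap";
  // A short gesture over an object while not holding the pen tool reads as selection.
  if (ctx.objectUnderPencil && ctx.selectedTool !== "pen") return "select";
  return "draw";
}

// A learned version would replace classifyIntent with a small on-device model that
// takes these same features (plus many more) and outputs a probability per intent.
```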
00:10:35.400 | So the last thought process I like to use often is extrapolating a certain quantity or quality
00:10:43.640 | to the extreme, which causes all sorts of fun emergent behaviors you can try to guess, like the
00:10:49.480 | previously mentioned Conway's Game of Life, for example. And then I can reason from first principles
00:10:55.000 | there and see what kind of new things we can get from this.
00:11:00.360 | So if anyone's into programming languages, this is Smalltalk, a programming language environment
00:11:06.440 | from the '70s. That's the grandfather, the original object-oriented programming language,
00:11:11.560 | which inspired Objective-C and other languages. One of its main characteristics is message passing,
00:11:17.880 | as in sending commands, maybe even to another remote Smalltalk object on a different computer
00:11:22.920 | somewhere else through LAN or, later on, the internet. I'm going to spare you the details, but Alan Kay,
00:11:29.560 | one of its inventors said the inspiration is basically, well, it's inspired by cells,
00:11:37.640 | right? And that each object is basically a full computer you can examine and poke into,
00:11:43.160 | and then you can do things with it. And recursively, it might be a one-to-one mapping to the computer,
00:11:49.320 | it might be that one computer has many objects, etc. And also, he did say, somewhat more obscurely,
00:11:55.800 | that sending the message, sending a command to another computer, that's easy, but finding a receiver,
00:12:01.320 | that's hard. So each Smalltalk object can theoretically smartly go to the internet, do stuff,
00:12:11.960 | come back with an answer, like a little self-directed intelligent agent, if this sounds familiar to this
00:12:18.600 | audience. But Smalltalk had a big problem, which is that when an agent is as smart as it can be,
00:12:25.000 | it's also as resource-intensive arbitrarily as it can be. So when each agent takes up enough resource,
00:12:33.400 | you only get to have a single- or double-digit number of them, right, by the law of numbers. So you missed
00:12:39.480 | out on an entire category of emergent behavior because you tried to be too smart at the lower level.
00:12:44.600 | In this case, an emergent behavior from quantity and collaboration.
00:12:48.920 | So on the other hand, look at this multi-layer perceptron. That's a graph. So interestingly,
00:12:57.080 | it kind of solves the finding-a-receiver problem because you can make it fully connected or
00:13:01.880 | whatever. And because the weights are learned, you propagate and, you know, some connections become more
00:13:07.160 | important than others, right? The biggest difference between this and a swarm of agents is that the
00:13:13.080 | nodes are as dumb as they get. And when they're dumb and simple, you could have millions of them.
00:13:18.840 | And when that happens, you could leverage the emergent behavior of the aggregate and create a new medium
00:13:23.960 | completely. So there are quite a few agent-focused talks in the domain of ML. And I'd like to take this
00:13:32.040 | opportunity to use this method to offer some interesting food for thought. So for example,
00:13:38.920 | the more agents you have, the more you zoom out to care more about the aggregate rather than lower level
00:13:44.520 | agents, right? Just like people and civilization. And the more you zoom out, the less you actually care
00:13:50.120 | about each individual agent. So in an alternative reality, not in this one, we invented a couple of agents
00:13:59.400 | who got sent to scour the internet, called Wikipedia, and came back with some snippets of information.
00:14:05.240 | However, thankfully, in our reality, we sent billions of dumber nodes to read Wikipedia and aggregated
00:14:13.000 | all of them together to sit in a phone, coordinated by a single smart top-level process.
00:14:21.800 | So here's my last example. This was actually freshly picked. This is the Apple TV macOS UI. Pretty decent
00:14:29.560 | looking. I'd like you to pay attention to this part, the circled red part, which is the more button here.
00:14:35.880 | When you click on it to see the full description, what do you expect? Well, where does the description
00:14:41.160 | expand to, right? It turns out that you get a very atypical Apple UI. When you click on it, you get this,
00:14:47.880 | which is literally a big UITextView, right? Very un-Apple. So it looks like an unfinished notepad.
00:14:54.600 | In fact, you can kind of select it and do things with it, which is weird. The thing is, this Apple TV
00:15:00.360 | Mac app is actually a catalyst app, which for those who don't do iOS development, it means it's a direct
00:15:06.840 | port of their iOS app here. On iOS, if you tap the description and get such a new view, then things don't
00:15:15.560 | look too out of place. In fact, it's rather idiomatic. Now, you might say that the problem
00:15:20.360 | here is a lack of UI design, a lack of care, a lack of craftsmanship. But for the sake of making
00:15:27.560 | a point for this talk, I would like to provide a perspective that this might actually be a lack of,
00:15:34.040 | literally, more UI, a lack of more UI. So what would the world look like if we extrapolated that
00:15:41.560 | quantity, if we raised the order of magnitude and had way more UI, two orders of magnitude more, right?
00:15:47.400 | What does that even mean? So let's start with a simple 12 column grid, right? We first list out all the
00:15:54.360 | discrete pieces of information we want to potentially show across this entire view or maybe across the
00:15:58.920 | entire app, right? Now that we have AI nowadays, a design time, not a runtime, a design time, we could
00:16:05.560 | generate thousands of these permutations of layout for a show's UI screens, right? We're not shipping
00:16:11.720 | these, just using AI to generate a bunch of potential candidates. So previously, this task wasn't achievable
00:16:17.560 | through traditional means unless you're in a particular niche, since we didn't have a way to pay attention to
00:16:23.560 | the semantic relationship between, say, a show's title and the box's size and position in relation
00:16:29.720 | to other items, right? You could still generate plain boxes through traditional heuristic and generative
00:16:34.680 | algorithm, but you'd have a hard time tagging each box with the right piece of information, for example.
00:16:39.480 | So after our first pass, we can use a scoring mechanism, either traditional or some fancy AI-driven scoring heuristic for
00:16:49.560 | aesthetic things, to eliminate undesirable layouts at data generation time. And of course, you'd involve
00:16:56.680 | the designer here too. This is done at design time offline, not at runtime. So we can use an algorithm
00:17:03.000 | that's as slow as we need and the designer can take as much time as he or she needs and patiently curate the subset,
00:17:09.400 | which is quite large. So the key here is that you've generated not 10, right? You generated a thousand of these
00:17:20.520 | layouts through smarter generative and semantic filtering techniques. So we're raising the order of magnitude, right?
00:17:26.520 | You're not a designer making a single-digit number of designs, moving boxes yourself in Figma and waiting for your boss
00:17:33.640 | to go, "Can we move this box somewhere else instead?" And just one more ad hoc design, please, I promise. It's just one more.
00:17:40.200 | You'll solve everything, right? And of course, you want to involve the designer in this particular stage too.
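A sketch of that design-time pipeline, with everything illustrative: enumerate layout candidates on a 12-column grid, score them with a toy heuristic (this is where an AI-driven or designer-in-the-loop scorer would plug in), and keep a curatable subset.

```typescript
// Design-time (offline) sketch: place tagged pieces of information on a 12-column
// grid, enumerate many layout permutations, score them, and keep a curatable subset.
// Everything here is illustrative; score() is where an AI model or a designer's
// judgment would plug in.
interface Box { field: "title" | "description" | "poster" | "cast"; col: number; span: number; row: number; }
type Layout = Box[];

const FIELDS: Box["field"][] = ["title", "description", "poster", "cast"];
const SPANS = [3, 4, 6, 12]; // allowed widths on the 12-column grid

// Naive enumeration: each field gets its own row, one allowed span, and a column offset.
function* permute(i: number, acc: Layout): Generator<Layout> {
  if (i === FIELDS.length) { yield acc; return; }
  for (const span of SPANS) {
    for (let col = 0; col + span <= 12; col += 3) {
      yield* permute(i + 1, [...acc, { field: FIELDS[i], col, span, row: i }]);
    }
  }
}

// Toy scoring heuristic: reward a wide title and a reasonably wide description.
function score(layout: Layout): number {
  const title = layout.find(b => b.field === "title")!;
  const description = layout.find(b => b.field === "description")!;
  return title.span + Math.min(description.span, 6);
}

// Thousands of candidates in, a top slice out for the designer to patiently curate.
const candidates = [...permute(0, [])]
  .sort((a, b) => score(b) - score(a))
  .slice(0, 1000);
```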
00:17:45.640 | So maybe at some point, you also decide to throw in a little bit of diffusion, again, at rough-draft time, using some ControlNet or whatever,
00:17:58.760 | to generate some rough website to give the boss more of an immersive feeling, right? To say, like, this layout can't work. It's not just like boxes, right?
00:18:07.640 | And at app runtime now, which is the part I'm personally interested in, right now, LLM generations, they generate at writing time,
00:18:20.840 | and then they generate like two, right, or three. And then you pick one, and then maybe you ship that, and then it's a traditional web app.
00:18:27.160 | But the thing is, if your bottleneck is the web part, even an AGI cannot help you make your JavaScript faster than C++, right?
00:18:36.120 | So we have to swap out some of these items with an actual neural net if we want to advance and use a web platform, for example,
00:18:45.560 | or any other platform, for that matter. So at runtime, you have, for example, a quick decision tree to choose the right layout.
00:18:52.760 | So again, modern web development roughly has this one single heuristic for you to select the right design,
00:19:00.120 | which is called media queries, and all it does is, depending on the width of your window,
00:19:05.160 | you might show or hide some items. But this entire space could actually use some help from learned algorithms.
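For reference, that width-only heuristic looks roughly like this with the standard `matchMedia` API; the 600px breakpoint and the element id are arbitrary examples.

```typescript
// Today's single heuristic in practice: show or hide parts of the UI based only on
// window width. The breakpoint and element id are just for illustration.
const isNarrow = window.matchMedia("(max-width: 600px)");

function applyLayout(narrow: boolean): void {
  document.getElementById("description")?.classList.toggle("hidden", narrow);
}

applyLayout(isNarrow.matches);
isNarrow.addEventListener("change", e => applyLayout(e.matches));
```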
00:19:13.720 | So, for example, what if the user's onboarding? Why is that a different concept when you hide or show some different boxes?
00:19:22.040 | It's just another set of boxes, right? What if the user is a super user? Maybe you progressively show them a different set of layouts, right?
00:19:28.600 | Maybe they require different screens. And what if the user is in a different country, right?
00:19:37.160 | Different age? Search query? So to be clear, big companies like Uber and Facebook already do this on a daily basis, right?
00:19:43.160 | When you use the Uber app in India or China, it looks drastically different, right?
00:19:49.720 | But it's currently
00:19:51.720 | thousands of engineers' worth of effort, right? For big companies, right?
00:19:55.720 | And they create an entire moat out of the fact that they have a few more designed UIs plus the business logic, to be fair.
00:20:02.280 | And it's very brittle, right? You cannot see everything.
00:20:06.280 | And the algorithm is basically less controllable than even a simple decision tree, a classifier.
00:20:12.280 | So if a user is fuzzy searching, right? This might be a better example.
00:20:18.040 | You know, "Hey, what movie did Denis Villeneuve make?" Right?
00:20:22.360 | This goes into a decision tree and shows the curated layout.
00:20:25.160 | If the user is instead saying, "Hey, what movie did Denis Villeneuve make and with whom?"
00:20:31.800 | Then you show this curated layout instead.
00:20:34.520 | So maybe you're asking a chatbot, in which case the layout is even more contextual, right?
00:20:39.560 | If you do a napkin calculation of the generated and curated number of UIs you'd ever need,
00:20:45.240 | they might actually be in the 1,000s, not the 10s, right?
00:20:49.000 | Fortunately, 1,000 can still be curated thanks to AI.
00:20:51.960 | So essentially, it's not an autoregressive problem.
00:20:54.840 | It's not a diffusion problem.
00:20:56.600 | It's a simple classification problem because we have discrete categories here.
00:21:00.680 | So here you go, dynamic UIs.
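A sketch of that runtime step: a simple classifier over a discrete set of pre-curated layout ids, written here as a plain decision tree that a learned model could later replace. The context fields and layout ids are invented for illustration.

```typescript
// Runtime sketch: pick one of the ~1,000 pre-curated layouts from context, not just
// window width. A hand-written decision tree stands in for a learned classifier over
// the same discrete set of layout ids. All ids and fields are illustrative.
interface UiContext {
  widthPx: number;
  country: string;
  isOnboarding: boolean;
  isSuperUser: boolean;
  queryIntent?: "single_fact" | "fact_with_relations" | "browse";
}

type LayoutId = string; // keys into the curated layout set shipped with the app

function chooseLayout(ctx: UiContext): LayoutId {
  if (ctx.isOnboarding) return "onboarding/simple";
  if (ctx.queryIntent === "fact_with_relations") return "detail/with-cast-graph";
  if (ctx.queryIntent === "single_fact") return "detail/compact-answer";
  if (ctx.isSuperUser) return "browse/dense-grid";
  if (ctx.widthPx < 600) return "browse/single-column";
  return "browse/default";
}

// e.g. chooseLayout({ widthPx: 1280, country: "FR", isOnboarding: false,
//                     isSuperUser: true, queryIntent: "browse" })
```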
00:21:02.600 | So let me summarize this for a little bit.
00:21:06.600 | Second-order effects are pretty unpredictable,
00:21:08.840 | and there are many ways to tame thinking about them.
00:21:12.040 | If you think about these points among others,
00:21:13.880 | then I think you'll be decently prepared when the time comes.
00:21:16.520 | And of course, you know, read from history and you've got to do things.
00:21:20.520 | Don't forget that the best way to predict the future is to invent it.
00:21:23.080 | Thank you.