- This is a small visualization of our Lord and Savior, matrix multiplication. I was asked to make a cool demo, so here it is. This is a single fragment shader drawn fully on the GPU. There are no imported assets, no triangle meshes, just a few hundred lines of GLSL.
Shader art is a niche digital art form, and I highly recommend you check it out, but the GPU wasn't supposed to be abused this way. Then again, the entire domain of machine learning is enjoying a renaissance thanks to it. So how did that happen? Today, I would like to explore these kinds of second-order effects: why things end up having unintended consequences, and how you can more reliably predict the future.
So Deep Blue beat Kasparov at chess in 1997. And in 2016, AlphaGo beat the famous Lee Sedol at Go. If you were a pessimist, as many were back then, you'd say that it's over for chess and Go. You know, what's even the point of playing those anymore, right, if humans aren't at the top?
But if you check what actually happened subsequently, this is a graph of professional Go players' decision quality over time. You know, guess where AlphaGo happened? And on a completely unrelated topic, this is Conway's Game of Life. It has black and white cells and a few simple rules for how the cells interact.
And at first glance, it's no big deal. But as you start building macro patterns with those, you get cool things like these. And here's Conway's Game of Life implemented in Conway's Game of Life; it's Turing complete. These are examples of emergent behaviors produced by an order-of-magnitude increase in quantity, quality, or performance.
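To make those "few simple rules" concrete, here's a minimal illustrative sketch (my own, not part of the demo) of one Game of Life update step: a live cell survives with two or three live neighbors, and a dead cell comes alive with exactly three.

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One generation of Conway's Game of Life on a 2D 0/1 array (toroidal edges)."""
    # Count the eight neighbors of every cell by summing shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth: dead cell with exactly 3 neighbors. Survival: live cell with 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(grid.dtype)

# A glider on a small board, stepped a few generations.
board = np.zeros((8, 8), dtype=int)
board[1, 2] = board[2, 3] = board[3, 1] = board[3, 2] = board[3, 3] = 1
for _ in range(4):
    board = life_step(board)
```

Everything interesting, gliders, guns, even the self-hosted Game of Life, emerges from nothing more than that update rule applied at scale.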
The domain of machine learning is pretty familiar with this phenomenon. And generally speaking, emergent behavior is mostly NP-complete, so you can't compute it easily. So to create these patterns, people have to zoom out a level and consider high-level macrodynamics: a new set of rules plus various heuristics and errors.
Those folks work more like biologists than mathematicians or physicists. What I'm trying to say is that we cannot easily predict the emergent behavior of even a simple system once it scales beyond our low-level intuitions. So in this talk, I would like to share a few personal thought processes I use to predict some interesting second-order effects of AI, that is, the ripple effects caused by the more direct consequences of this era of AI.
And as a famous sci-fi author once said, "Good sci-fi predicts cars, great sci-fi predicts traffic." So the first lens I would like to use is broadly called "Who is learning?" Are you learning, or is the machine learning, and do you care? I'm mostly talking about what people want. For example, chess didn't die, it only got better, right?
Because it turns out the crowd dynamic is this: when you have free chess teachers available anytime you want, instead of having to seek out that one dude in the village who teaches chess, well, everyone ends up knowing chess. And when that crowd knowledge becomes distributed widely enough, you get an audience large enough to sustain more professional play.
Because ultimately it is you, and the audience, who want to do the learning, regardless of whether the machine learns better than you or not. You go to the equivalent of a mental gym, because no matter how much the machine goes to the gym, you won't get better unless you do.
The same is true for drawing. Imagine a novice learning to draw. A blank canvas is actually a very daunting challenge, right? But very soon, you'll be able to have the equivalent of what we might call stroke auto-completion. So imagine a conceptual slider like this one, where on one side nothing happens, right?
On the other side, the full drawing is made for you. What's interesting is that now we can have a learned behavior where we can slide that into the middle. So when you prompt the system that you want to draw a chair, the system goes, "Oh, okay, you're curving this way, so I guess you want a Victorian-era chair," right?
Or when you use a shade of blue, the system goes, "Oh, I guess you want to draw the reflection of the sea on her face, but there's sand there actually, so it should be a little more green than this." If you've ever tried to learn coloring, you know how long the feedback loop to mastering this actually is.
And now you can dial that up and down per your need, with immediate feedback. And ironically, over time, your slider actually goes way more to the left, all the way to the end, where you basically stop using AI because you've internalized everything and the skill came back to you.
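As a toy illustration of that slider (my own sketch; the "AI completion" here is just a smoothing placeholder), you can think of it as blending the user's raw stroke with a suggested stroke by an assistance level between 0 and 1:

```python
import numpy as np

def ai_complete_stroke(raw: np.ndarray) -> np.ndarray:
    """Placeholder for the model's suggestion; here it just smooths the raw points."""
    kernel = np.ones(5) / 5.0
    return np.stack(
        [np.convolve(raw[:, i], kernel, mode="same") for i in range(2)], axis=1
    )

def blended_stroke(raw: np.ndarray, assistance: float) -> np.ndarray:
    """assistance = 0.0 keeps the user's stroke; 1.0 fully defers to the suggestion."""
    suggestion = ai_complete_stroke(raw)
    return (1.0 - assistance) * raw + assistance * suggestion

# A noisy diagonal stroke of (x, y) points, half-assisted.
stroke = np.cumsum(np.random.randn(50, 2) * 0.1 + 0.5, axis=0)
result = blended_stroke(stroke, assistance=0.5)
```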
Similarly for music: yes, we can now generate full songs in one shot for utilitarian ends, but AI could also help in a different way, by helping you learn. And what interests me is that in music, most of the time you're using direct manipulation, in UI speak, of the instrument, right?
The impedance mismatch between pressing a piano key and hearing the expected sound is almost zero. But that lack of indirection is also a trade-off. So this is a theremin, and already we're seeing a little more indirect manipulation. So I was wondering, what if you used your fingers to create and manipulate a music spectrogram, right?
Obviously your fingers aren't fine enough, but if the AI has enough world-model knowledge to super-sample it for you, so to speak, maybe we'll end up with a new form of instrument where you gesture more like a puppeteer, indirectly, and create new kinds of music that analog music manipulation couldn't achieve.
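Here's a very rough sketch of that pipeline, assuming librosa is available and using plain bicubic upsampling as a stand-in for the learned super-sampling step:

```python
import numpy as np
import scipy.ndimage
import librosa

# A coarse "finger-painted" magnitude spectrogram: 16 frequency bands x 32 time steps.
coarse = np.random.rand(16, 32).astype(np.float32)

# Stand-in for the AI super-sampling step: naive bicubic upsampling to STFT resolution.
# (A learned model would fill in harmonics and texture instead of just interpolating.)
fine = scipy.ndimage.zoom(coarse, zoom=(257 / 16, 128 / 32), order=3)
fine = np.clip(fine, 0.0, None)  # magnitudes must be non-negative

# Invert the magnitude spectrogram to audio with Griffin-Lim (n_fft inferred as 512).
audio = librosa.griffinlim(fine, n_iter=32, hop_length=128)
```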
So here's another reason I'm giving these examples. My domain is mostly user interfaces nowadays. If you think about who's learning, you might end up with the conclusion that direct manipulation of user interfaces is actually about learning for yourself, akin to going to the gym for yourself, learning to draw for yourself, or learning an abstract instrument for yourself.
In other words, once the classic user interface of tapping this and tapping that gets increasingly automated away, once the utilitarian ends have been met, all that's left are the kinds of lifestyle user interfaces you use not because you're more efficient than the machine, but because you are the one trying to learn them, for whichever self-fulfillment reason.
And so in that regard, we might end up with more artisanal, quirky, niche interfaces for luxury, lifestyle, or other purposes as a second-order effect. The second category I want to talk about is the idea of widening the information bandwidth, which is a trick I use quite often.
So the other day I was looking at some new research results from Anthropic regarding sparse autoencoders, and tangentially there was this simple visualization of a cluster. Purely from a visual perspective, it kind of reminded me of the movie Arrival by Denis Villeneuve, where humanity learns an alien language that allows them to unlock their full potential.
And I thought, why not take it further and make it one language per person? Widen the whole information bandwidth. Up until this point, human language has been this somewhat standardized communication interface, and it's a very, very narrow-bandwidth one, a very lossy one. We learn relatively few languages, mostly standardized ones, and stuff our entire fuzzy ether of information into them, hoping that most of it isn't lost in translation.
Now, AI has basically solved translation. So why not go a step further and translate one English into another English? Say I'm arguing with someone, and I say I feel blue. This is really coming from my perspective, so it's unclear that the intent arrives at the other person intact.
Maybe for that particular listener, I should have translated I feel blue into I'm feeling purple, right? And what if I can just show it? What if my chat speech bubble is much more dynamic and much more nuanced because the AI understands the other person's aesthetic preferences? Right, what if things are fast enough that every sentence can be personalized into a dynamic art piece in 4D or something?
Way more nuanced and information-dense, and I can just hand it to the person in AR, right? Much denser than static emojis and a few basic curves, right? And what if you're communicating in AR, and this gets machine-translated into some kind of cloud around you for the receiver, a just-in-time, individual-specific translation mechanism free of the compromises of a one-size-fits-all, low-bandwidth text language, right?
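A minimal sketch of the "translate one English into another English" idea; the call_llm stub below is hypothetical and just stands in for whatever model you'd actually use:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to your language model of choice."""
    raise NotImplementedError

def rephrase_for_listener(message: str, listener_profile: str) -> str:
    """Rewrite a message so its intent survives the trip to one specific listener."""
    prompt = (
        "Rewrite the following message so that this particular listener receives "
        "the same intent and emotional tone, using their own vocabulary and style.\n"
        f"Listener profile: {listener_profile}\n"
        f"Message: {message}\n"
        "Rewritten message:"
    )
    return call_llm(prompt)

# e.g. rephrase_for_listener("I feel blue",
#                            "prefers vivid color metaphors, reads mood as hue")
```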
Maybe in 50 years, verbal conflict resolution ends up taking on the order of seconds instead of minutes or hours. So, some more examples: here's one for the iPhone, the hardware user interface, and the next one for the iPad. For canvas apps, the act of pressing the pencil against the tablet usually means drawing a line, but it's overloaded to also mean selection, moving, resizing, et cetera, right?
But the reality is of a much higher bandwidth. So for example, if someone multi-taps on the screen with a pencil, maybe right before that they said, "Why is this part red? Can we change it?" Or maybe they drew a stroke, and right before that they said, "Yeah, this probably goes there instead," right?
And they didn't feel like, you know, hunting for the lasso tool, selecting it, coming back and drawing a circle, long-tapping to hold the object, moving it, double-tapping the pencil back to the previous pen tool, doing all these acrobatics just because they wanted to move an item, right?
So if you use traditional design to categorize and overload the single stroke gesture, you'll inevitably end up with more confusing behaviors governed by an implicit rule set. Traditionally, your current stroke is conditioned on the currently selected tool state, the object under your pencil, and maybe the action one second before; if we then need to undo the stroke in favor of interpreting it as a tap, this gets very messy, right?
It takes very fine design and craftsmanship. But in this new era, that line shouldn't be conditioned only on the beginning of its Bézier path, right? It should be conditioned on the entire world. So the tap and stroke behavior should be as learned, as in machine-learned, as possible. Some people's short presses are sometimes just slightly too long, and they trigger the wrong gesture and all sorts of bad actions that a human observer would have corrected within a second. So why can't machine learning just do it, right? Locally, too.
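As a toy sketch of what "as learned as possible" could mean (the features and training examples here are entirely made up; a real system would condition on far richer context), even a tiny classifier can disambiguate tap versus stroke from context rather than from a fixed timing threshold:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up context features: [press duration (s), pencil travel (pt),
# seconds since last action, active tool id]. Labels are the intended gesture.
X = [
    [0.08,  1.0, 0.5, 0],   # quick tap with the pen tool
    [0.35, 80.0, 2.0, 0],   # long drag: a drawn stroke
    [0.30,  2.0, 0.2, 1],   # slightly-too-long press after switching tools: still a tap
    [0.50, 40.0, 5.0, 2],   # drag with the lasso tool active: a selection
]
y = ["tap", "stroke", "tap", "select"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# A press that's a bit too long but barely moves should still read as a tap.
print(clf.predict([[0.28, 1.5, 0.3, 0]]))  # -> ['tap'] with this toy data
```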
So the last thought process I like to use often is extrapolating a certain quantity or quality to the extreme, which causes all sorts of fun emergent behaviors you can try to guess, like the previously mentioned Conway's Game of Life, for example.
And then I can reason from first principles and see what kind of new things we can get from it. So if anyone's into programming languages, this is Smalltalk, a programming language and environment from the '70s. It's the grandfather of the original object-oriented programming, which inspired Objective-C and other languages.
One of its main characteristics is message passing, as in sending commands, maybe even to another remote Smalltalk object on a different computer somewhere else, over LAN or, later on, the internet. I'm going to spare you the details, but Alan Kay, one of its inventors, said the inspiration is basically, well, biological cells, right?
And each object is basically a full computer you can examine, poke into, and then do things with. Recursively, it might be a one-to-one mapping to a computer, or one computer might host many objects, and so on. He also said, somewhat more obscurely, that sending the message, sending a command to another computer, that's easy; finding a receiver, that's hard.
So each Smalltalk object can theoretically go off to the internet, smartly do stuff, and come back with an answer, like a little self-directed intelligent agent, if that sounds familiar to this audience. But Smalltalk had a big problem, which is that when an agent is as smart as it can be, it can also be arbitrarily resource-intensive.
So when each agent takes up that much resource, you only get to have a single- or double-digit number of them, right, just by arithmetic. So you miss out on an entire category of emergent behavior because you tried to be too smart at the lower level. In this case, an emergent behavior born from quantity and collaboration.
So on the other hand, look at this multi-layer perceptron. That's a graph. Interestingly, it kind of solves the receiver-discovery problem, because you can just make it fully connected or whatever. And because the weights are learned, as you propagate, some connections turn out to be more important than others, right?
The biggest difference between this and a swarm of agents is that the nodes are as dumb as they get. And when they're dumb and simple, you can have millions of them. And when that happens, you can leverage the emergent behavior of the aggregate and create a completely new medium.
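For a sense of just how dumb each node is, here's a minimal numpy sketch of a fully connected two-layer perceptron; every node is only a weighted sum plus a nonlinearity, and everything interesting comes from the learned weights connecting huge numbers of them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each "node" is just a weighted sum and a nonlinearity; the interesting behavior
# lives entirely in the learned weights connecting many of them.
W1, b1 = rng.normal(scale=0.01, size=(784, 256)), np.zeros(256)  # input -> hidden
W2, b2 = rng.normal(scale=0.01, size=(256, 10)), np.zeros(10)    # hidden -> output

def mlp_forward(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU: the node's entire "smarts"
    logits = hidden @ W2 + b2
    z = logits - logits.max()               # numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

probs = mlp_forward(rng.normal(size=784))   # 10 class probabilities summing to 1
```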
There are quite a few agent-focused talks in the domain of ML, so I'd like to take this opportunity to use this method to offer some interesting food for thought. For example, the more agents you have, the more you zoom out to care about the aggregate rather than the lower-level agents, right?
Just like people and civilization. And the more you zoom out, the less you actually care about each individual agent. So in an alternative reality, not this one, we invented a couple of smart agents that got sent to scour something on the internet called Wikipedia and came back with some snippets of information.
However, thankfully, in our reality, we sent billions of dumber nodes to read Wikipedia and aggregated all of them together to fit on a phone, coordinated by a single smart top-level process. So here's my last example, actually freshly picked. This is the Apple TV app's UI on macOS. Pretty decent looking.
I'd like you to pay attention to this part, circled in red, which is the More button here. When you click on it to see the full description, what do you expect? Well, where does the description expand to, right? It turns out that you get a very atypical Apple UI.
When you click on it, you get this, which is literally a big UITextView, right? Very un-Apple. It looks like an unfinished notepad. In fact, you can kind of select it and do things with it, which is weird. The thing is, this Apple TV Mac app is actually a Catalyst app, which, for those who don't do iOS development, means it's a direct port of their iOS app, here.
On iOS, if you tap the description and get a new view like this, things don't look too out of place. In fact, it's rather idiomatic. Now, you might say that the problem here is a lack of UI design, a lack of care, a lack of craftsmanship. But for the sake of making a point for this talk, I would like to offer the perspective that this might literally be a lack of more UI.
So what would the world look like if we extrapolated that quantity, if we raised the order of magnitude and had way more UI, two orders of magnitude more, right? What does that even mean? So let's start with a simple 12-column grid, right? We first list out all the discrete pieces of information we might want to show across this entire view, or maybe across the entire app, right?
Now that we have AI, at design time, not at runtime, we could generate thousands of layout permutations for a show's UI screens, right? We're not shipping these, just using AI to generate a bunch of potential candidates. Previously, this task wasn't achievable through traditional means unless you were in a particular niche, since we didn't have a way to pay attention to the semantic relationship between, say, a show's title and its box's size and position in relation to the other items, right?
You could still generate plain boxes through traditional heuristics and generative algorithms, but you'd have a hard time tagging each box with the right piece of information, for example. So after our first pass, we can use a scoring mechanism, either traditional heuristics or some fancy AI-driven aesthetic scoring, to eliminate undesirable layouts at data-generation time.
And of course, you'd involve the designer here too. This is done offline at design time, not at runtime. So we can use an algorithm that's as slow as we need, and the designer can take as much time as they need to patiently curate the subset, which is quite large.
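Here's a deliberately simplified sketch of that design-time loop, with made-up content blocks and a toy scoring heuristic; in practice the scorer is where the AI and the designer's taste come in:

```python
from itertools import product

BLOCKS = ["poster", "title", "description", "cast", "play_button"]
SPANS = [3, 4, 6, 8, 12]  # allowed widths on a 12-column grid

def score(layout: dict) -> float:
    """Toy aesthetic heuristic; a real scorer could be learned or designer-tuned."""
    s = 0.0
    s += 2.0 if layout["poster"] >= layout["title"] else -1.0  # poster should dominate
    s += 1.0 if layout["description"] >= 6 else -0.5           # keep text readable
    s -= 0.1 * abs(sum(layout.values()) - 24)                  # roughly two full rows
    return s

# Enumerate every assignment of a span to every block: 5^5 = 3125 candidate layouts.
candidates = [dict(zip(BLOCKS, spans)) for spans in product(SPANS, repeat=len(BLOCKS))]
ranked = sorted(candidates, key=score, reverse=True)

# Keep the top slice for the designer to curate by hand.
shortlist = ranked[:1000]
```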
So the key here is that you've generated not 10 of these, but a thousand, through smarter generative and semantic filtering techniques. We're raising the order of magnitude, right? You're not a designer making a single-digit number of designs, moving boxes yourself in Figma and waiting for your boss to go, "Can we move this box somewhere else instead? Just one more ad hoc design, please, I promise."
"It's just one more. It'll solve everything," right? And of course, you want to involve the designer at this particular stage too. So maybe at some point, you also decide to throw in a little bit of diffusion, again at rough-draft time, using some ControlNet or whatever, to generate a rough website mockup and give the boss more of an immersive feel, right?
To judge, like, whether this layout can work; it's not just boxes anymore, right? And now to app runtime, which is the part I'm personally interested in. Right now, LLM UI generation happens at writing time, and it generates like two or three variants. Then you pick one, maybe you ship that, and then it's a traditional web app.
But the thing is, if your bottleneck is the web part, even an AGI cannot help you make your JavaScript faster than C++, right? So we have to swap out some of these pieces with an actual neural net if we want to advance, whether on the web platform, for example, or any other platform, for that matter.
So at runtime, you have, for example, a quick decision tree to choose the right layout. For comparison, modern web development roughly has one single heuristic for selecting the right design, called media queries, and all it does is show or hide some items depending on the width of your window.
But this entire space could actually use some help from learned algorithms. So, for example, what if the user is onboarding? Why is that treated as a different concept, when all you do is hide or show some different boxes? It's just another set of boxes, right? What if the user is a super user? Maybe you progressively show them a different set of layouts, right?
Maybe they require different screens. And what if the user is in a different country, right? A different age? A different search query? To be clear, big companies like Uber and Facebook already do this on a daily basis, right? When you use the Uber app in India or China, it looks drastically different, right?
But it currently takes thousands of engineers of effort, right? For big companies. And they build an entire moat out of the fact that they have a few more designed UIs, plus the business logic, to be fair. And it's very brittle, right? You cannot foresee everything. And the algorithm is basically less controllable than even a simple decision tree or classifier.
So if a user is fuzzy searching, right, this might be a better example: "Hey, what movies did Denis Villeneuve make?" This goes into a decision tree and shows the curated layout. If the user instead says, "Hey, what movies did Denis Villeneuve make, and with whom?" then you show this other curated layout instead.
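As an illustrative sketch of that runtime step (the features, labels, and layout IDs here are all invented), a small learned classifier could map user context to one of the curated layouts instead of a hand-written media query:

```python
from sklearn.tree import DecisionTreeClassifier

# Invented context features: [window width (px), is onboarding, is power user,
# query mentions collaborators]. The label is a curated layout ID from design time.
X = [
    [1440, 0, 0, 0],   # desktop, casual browsing
    [ 390, 1, 0, 0],   # phone, first launch
    [1440, 0, 1, 1],   # desktop power user asking "and with whom?"
    [ 390, 0, 0, 1],   # phone, cast-centric query
]
y = ["layout_hero", "layout_onboarding", "layout_cast_grid", "layout_cast_list"]

layout_picker = DecisionTreeClassifier(max_depth=4).fit(X, y)

# At runtime, the app asks the classifier which curated layout to render.
print(layout_picker.predict([[1440, 0, 1, 1]]))  # -> ['layout_cast_grid']
```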
Or maybe you're asking a chatbot, in which case the layout is even more contextual, right? If you do a napkin calculation of the number of generated and curated UIs you'd ever need, it might actually be in the thousands, not the tens, right? Fortunately, a thousand can still be curated, thanks to AI.
So essentially, it's not an autoregressive problem. It's not a diffusion problem. It's a simple classification problem, because we have discrete categories here. So here you go: dynamic UIs. So let me summarize a little bit. Second-order effects are pretty unpredictable, and there are many ways to tame thinking about them.
If you think about these points, among others, then I think you'll be decently prepared when the time comes. And of course, you know, learn from history, and go do things. Don't forget that the best way to predict the future is to invent it. Thank you.