
Stanford CS25: V2 I Common Sense Reasoning


Transcript

Okay, so yeah, I'm super excited to be here and share our recent research about neurosymbolic common sense reasoning. So part of the goal of this talk will be to address some of the frequently asked questions these days, that NLP, or common sense, or whatever, looks almost solved by ChatGPT, and I have an existential crisis.

So people do ask me this from time to time. So perhaps it's a case of hasty generalization, especially if we do look at some of the examples. So, the trophy doesn't fit in the brown suitcase because it's too big. What's too big? This is a classic Winograd Schema Challenge problem.

And here, ChatGPT answers it correctly, that the trophy is too big. So impressive. But if you change the question a little bit, then it says the trophy itself is too small to fit into the suitcase. So it's not very reliable at the moment. So the situation is a little bit like David and Goliath, in the sense that bigger appears to be better in many of the cases, although of course some of the more careful studies do reveal that smaller models can be better with better data, or better reinforcement learning with human feedback, and whatnot.

So it's likely that there are still other ways to improve transformer performance by building smaller models in a more clever way. So one way to draw insight is from this classic book known as The Art of War, which of course says nothing about deep neural networks or transformers.

But the wisdom here is this: know your enemy, choose your battles, and innovate your weapons, which we can translate as evaluating with realism and scrutiny, focusing on different types of new tasks and leaderboards, and then innovating your algorithms and data. So in this talk, I'm going to showcase three such studies, and let's dive right in with maieutic prompting.

By the way, the recurring theme in this talk will be that smaller models can be better, and that knowledge is power. So let's start with this observation that language models are sometimes amazing. So you ask GPT-3 whether, if you travel west far enough from the west coast, you will reach the east coast or not.

So it says the world is round, which is correct. So you will reach the east coast eventually, therefore the answer is true. So this looks impressive, except when it's not impressive. So if you ask other questions like butterflies fly with three wings or not, it says it has four wings and therefore the statement is false.

But if you read back what it just said as true or false questions, then it negates what it just said. So it can be inconsistent with its own statement. And then there are many other such inconsistency problems. So it's not clear what language models do or do not know.

It's almost like language models are some sort of lemons. Well, they might be cherries if you only pick cherries, but they do make strange mistakes. So the question is, how do we make better lemonade from GPT-3? So one approach might be to get philosophical and use Socrates' maieutic method, which was originally developed for addressing humans' flawed reasoning, because it actually turns out even humans are not all that logically consistent, let alone GPT-3.

So the way it works is this: we're going to build a maieutic inference tree, and let's use the previous example as a running example. So what we do is ask the question, providing the answer as being true, and then attach "because" so that we prompt GPT-3 to continue the sentence, which means it will now have to provide the explanation for why the answer is true.

In this case, the explanation is good, so it's E of T, explanation of the answer being T. We ask the same question, switching out "true" with "false," and then see what BS GPT-3 might come up with. So here, it's just trying to go with the false as an answer, but it just doesn't have a very good answer.

It just says you cannot reach it. So now we call this E of F, the explanation for the answer being F. Now let's see how robust or consistent GPT-3 is with respect to its own explanations. So we read back E of T, and then let GPT-3 decide whether it's going to agree or disagree with the label "true" or "false." The last one here is a negated version of E of T, so we insert a negation, "not," here, and in this case, it's good that it's flipping the answer when the statement is negated.

So this is a case where GPT-3 is logically integral with respect to E of T. For E of "false," though, which was basically a bogus explanation for the wrong answer, it's not able to flip its own labeling, which means GPT-3 is not logically integral there. So that's good, in the sense that GPT-3 does know something is strange about the explanation it gave previously.

And so we can keep doing this recursively to make GPT-3 explain its own explanations, and the explanations of those explanations. So we build this maieutic tree, or graph, for some time, and then only keep branches that are logically integral, throwing out the non-integral parts for now. But even after chopping the branches with logical inconsistencies, GPT-3 being GPT-3, the tree will still have some inconsistent explanations.
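To make this procedure concrete, here is a minimal Python sketch of building such a maieutic tree. It assumes two hypothetical wrappers around your LM API, lm_generate for the "because" continuations and lm_true_prob for true/false verification; it illustrates the idea described above rather than the authors' released code.

```python
from dataclasses import dataclass, field
from typing import List


def lm_generate(prompt: str) -> str:
    """Hypothetical wrapper: continue the prompt with a few-shot LM."""
    raise NotImplementedError("wire this to your LM API")


def lm_true_prob(statement: str) -> float:
    """Hypothetical wrapper: probability the LM labels the statement 'true'."""
    raise NotImplementedError("wire this to your LM API")


@dataclass
class Node:
    proposition: str                      # a question, or an explanation E_T / E_F
    children: List["Node"] = field(default_factory=list)


def explain(proposition: str, answer: str) -> str:
    # Fix the answer and make the LM justify it by continuing after "because".
    prompt = f"Q: {proposition}\nA: {answer}, because"
    return lm_generate(prompt).strip()


def logically_integral(explanation: str) -> bool:
    # The LM should assert E as true and flip its label when E is negated.
    asserts_e = lm_true_prob(explanation) > 0.5
    flips_on_negation = lm_true_prob(f"It is not true that {explanation}") < 0.5
    return asserts_e and flips_on_negation


def build_maieutic_tree(proposition: str, depth: int = 2) -> Node:
    """Recursively ask for E_T and E_F, keeping only logically integral branches."""
    root = Node(proposition)
    if depth == 0:
        return root
    for answer in ("true", "false"):
        explanation = explain(proposition, answer)    # E_T or E_F
        if logically_integral(explanation):
            root.children.append(build_maieutic_tree(explanation, depth - 1))
    return root
```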

In order to improve the logical consistency, what we do now is look at pairwise consistency among the nodes. Sorry, stepping back, we're going to first compute the node-wise confidence. We call that a belief, and it's defined by a particular equation that basically looks at different conditional probabilities and then computes their ratio, to see how confident the model is about any particular node.
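One plausible form of the node-wise belief described here, written as a normalized ratio of the conditional probabilities the model assigns to verifying a proposition E as true versus false (the exact equation is on the slide rather than in the transcript, so treat this as a hedged reconstruction):

```latex
\mathrm{belief}(E) \;=\;
  \frac{p_{\mathrm{LM}}(\text{True} \mid E)}
       {p_{\mathrm{LM}}(\text{True} \mid E) \,+\, p_{\mathrm{LM}}(\text{False} \mid E)}
```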

We then also look at the edge-wise, or pairwise, consistency by using an off-the-shelf natural language inference model's output on whether a pair is contradictory or not. So we then create these pairwise weights. Once you have all of this, you can formulate a constrained optimization problem where the inference objective is to assign a label, either true or false, to each of the nodes such that it maximizes the weight assigned across all of these nodes and edges.

So sometimes the labeling will have to flip the original label that the model might have preferred, because that way you can enhance the graph-level consistency. You can solve this with any MAX-SAT solver, where SAT means satisfiability. This is classical AI search, and we used one particular solver, but you can use many others.
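As a rough sketch of that constrained optimization, assuming the node-wise beliefs and the NLI-derived edge weights have already been computed: this toy version simply brute-forces all True/False labelings of a small graph instead of calling a real MAX-SAT solver, but the objective it maximizes mirrors the description above.

```python
from itertools import product
from typing import Dict, List, Tuple


def best_labeling(
    beliefs: Dict[str, float],                      # node -> belief that it is True
    relations: List[Tuple[str, str, str, float]],   # (u, v, "entail"|"contradict", weight)
) -> Dict[str, bool]:
    """Exhaustive weighted MAX-SAT over True/False labels for a tiny graph.

    Unary clauses reward agreeing with the LM's belief; binary clauses reward
    label pairs consistent with the NLI relation. Real systems hand this to an
    off-the-shelf MAX-SAT solver, but brute force is fine for small trees.
    """
    nodes = list(beliefs)
    best, best_score = {}, float("-inf")
    for assignment in product([True, False], repeat=len(nodes)):
        labels = dict(zip(nodes, assignment))
        score = 0.0
        for n, lab in labels.items():
            # Weight = belief if labeled True, (1 - belief) if labeled False.
            score += beliefs[n] if lab else (1.0 - beliefs[n])
        for u, v, rel, w in relations:
            if rel == "entail":
                satisfied = not (labels[u] and not labels[v])   # True -> False violates
            else:  # "contradict"
                satisfied = not (labels[u] and labels[v])       # both True violates
            if satisfied:
                score += w
        if score > best_score:
            best, best_score = labels, score
    return best


# Tiny usage example with made-up numbers: E_T supports the answer, E_F contradicts it.
if __name__ == "__main__":
    beliefs = {"Q_is_true": 0.4, "E_T": 0.9, "E_F": 0.2}
    relations = [("E_T", "Q_is_true", "entail", 1.0), ("E_F", "Q_is_true", "contradict", 0.8)]
    print(best_labeling(beliefs, relations))   # expect Q_is_true -> True
```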

And so here, the final output is that the answer to the original question should be true, and it also gives you the per-node label assignments as well. So what does this mean in the end in terms of empirical results? When tested on CommonsenseQA 2.0, canonical prompting, the green bar, used on top of GPT-3, so basically few-shot prompting of GPT-3, will give you slightly better than chance performance.

This is a true/false QA dataset, so chance level is 50, and GPT-3 is barely better than chance. But recently there have been ideas such as chain of thought or self-consistency that can improve the vanilla prompting method considerably. So if you use such variations, then you get a performance gain.
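For reference, here is a schematic of what a few-shot chain-of-thought prompt for this kind of true/false question could look like; the exemplars and format are illustrative, not the exact prompts used in the paper.

```python
# Schematic chain-of-thought prompt: each exemplar shows reasoning before the answer.
COT_PROMPT_TEMPLATE = """\
Q: If you travel west far enough from the west coast, will you reach the east coast?
Reasoning: The Earth is round, so traveling west continuously eventually brings you
around to the east coast.
A: true

Q: Do butterflies fly with three wings?
Reasoning: Butterflies have four wings, two on each side, so the statement is false.
A: false

Q: {question}
Reasoning:"""

# Self-consistency: sample several reasoning chains at a nonzero temperature and
# take a majority vote over the final true/false answers.
```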

Now the purple is a different variant of it, but all together, they're doing worse than maieutic prompting, which in fact does better than a supervised model trained on T5. Usually a supervised T5 model is hard to beat with GPT-3 few-shot prompting, but this is basically an inference-time algorithm, practically unsupervised, and it still does well.

And similarly, we see a large boost when tested on other common sense benchmarks such as CREAK or Com2Sense. So what this tells us is that although the emergent capabilities of large transformers are phenomenal, they can be not very robust for some of these common sense challenges. That's in large part due to logical inconsistencies, and the consistency can be dramatically enhanced when you do this sort of symbolic reasoning on top.

So yeah, not only did Socrates' method help with flawed human reasoning, it can also dramatically enhance flawed neural networks' reasoning. Okay, so moving to the next topic, symbolic knowledge distillation. This is work that tries to convert general language models, built on transformers, into causal common sense models, also built on transformers.

And the reason why we might want to worry about Common Sense models is because despite human-level or even superhuman-level performances on a variety of leaderboards, the state-of-the-art models are brittle when given adversarial or out-of-domain examples. So transformers can make seemingly strange mistakes. And so it's almost like solving only a dataset without really solving the underlying task.

And this phenomenon sometimes is described as a systematic generalization problem. And the reason why this happens is that unlike humans, who truly learn how the world works conceptually, transformers learn sort of surface patterns in language or images that are powerful for many downstream use cases, but still not really a robust understanding of concepts and how the world works.

So in order to bridge this gap, we can really think about this challenge of machines learning, acquiring, common sense capabilities. The operational definition of common sense in this talk will be the basic level of practical knowledge and reasoning concerning everyday situations and events that is commonly shared among most people.

This last part is really important: it's commonly shared among most people, but it's not the case that it's shared by everybody in the universe, because additional context can always change what is commonsensical for any given culture or situation. So for example, in general, you and I probably agree that it's okay to keep the closet door open, but it's not okay to keep the fridge door open, because the food inside might go bad.

So these are general rules of thumb that we might abide by. But of course, if you go to your friend's house, you might behave a little bit and keep their closet door closed. And then, as far as the fridge door, if you're in a store and it's not really hooked up to the wall, then it doesn't matter whether the fridge door is open or not because there's no food inside.

You can come up with many situations in which these basic rules of thumb will have exceptions. So that is the key challenge of common sense: it's not universal knowledge, but it's shared across a large population of people. Okay, so such common sense is essential for humans to live and interact with each other in a reasonable and safe way.

And so, as AI becomes an increasingly more important aspect of human lives, and with ChatGPT even more so, it's good if AI can understand human needs and actions and values better. So the premise of this talk is that language models are not equivalent to knowledge models, even though language models today do acquire a great deal of knowledge; they're not equivalent.

So we developed a symbolic common sense knowledge graph known as Atomic a few years ago, four years ago now, as well as a neural common sense model built on top of it, trained by fine-tuning off-the-shelf language models using Atomic as the source of training data. Up until two years ago, Atomic was fully crowd-sourced by humans, an assumption which in this talk I'm going to lift, but at first the norm was that this all had to be human crowd-sourced.

So you can almost consider Atomic as human demonstrations; in the current version of ChatGPT, you can consider this analogous to human demonstrations of common sense inferences. And we had Comet-Atomic 2020, which is an enhanced version of Atomic and Comet. Again, the Atomic portion was fully crowd-sourced by humans, in 2021.

So let me give you a bit of a sample of what Atomic 2020 looks like. So imagine a situation where X gets X's car repaired, or you get your car repaired. So immediately you can imagine what's likely to be true or relevant for the situation, that as a result, you might want to call Uber or Lyft for a ride.

As a result, you need to pay the bill. Beforehand, you need a mechanic and money to repair your car. So these are basically preconditions and post-conditions of that event. So part of the Atomic knowledge graph is social interaction knowledge about events. And then other parts of Atomic are physical, entity-centric knowledge.

So money is typically used for paying for repairs. But if you really want to, you can fold it into origami. I've never done it. But these are examples of stereotypical use cases, as well as non-stereotypical but physically afforded actions that you can apply to objects. So it requires naive physics understanding about the affordances of physical objects.

And then we can also reason about counterfactual conditions in which the central event cannot happen, so it can be hindered by them. If you totaled your car completely, then it's impossible to get your car repaired. And then there are events that typically happen before and after. So some of this knowledge is event-centric.
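The examples above can be read as if-then triples. Here is a minimal sketch of how such knowledge might be represented; the relation names follow ATOMIC-2020-style naming (xNeed, xWant, HinderedBy, ObjectUse), but treat the exact strings and tails as illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IfThenTriple:
    head: str       # event or entity, e.g. "X gets X's car repaired"
    relation: str   # one of the ~23 relation types
    tail: str       # inferred precondition, effect, use, hindrance, etc.


knowledge_graph: List[IfThenTriple] = [
    IfThenTriple("X gets X's car repaired", "xNeed",      "a mechanic and money"),
    IfThenTriple("X gets X's car repaired", "xWant",      "to call a ride-share for a ride"),
    IfThenTriple("X gets X's car repaired", "HinderedBy", "X totaled the car completely"),
    IfThenTriple("money",                   "ObjectUse",  "pay for repairs"),
]
```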

So we crowd-sourced a fair amount over the course of, I don't know, maybe two years or so, up to 1.3 million if-then rules, or pieces of if-then knowledge, over 23 different relation types. So it was fully crowd-sourced. And the knowledge graph is useful for training transformers. Here, let's see the comparison between Comet, which was built on BART, and GPT-3, which is so large it doesn't even fit into the slide.

It was more than 400 times larger than BART. So with that in mind, look at the accuracy, judged by humans, of the common-sense model making some common-sense inference. The task is that given a node, which describes a situation or event, and given an edge type, which narrows down the common-sense relation or inference type, you're now going to generate some inference.

So it's a generative task. And then we ask humans whether the common-sense inference seems reasonable or not. So 100% is the desired level. Comet is substantially better than GPT-3, which is really impressively better than GPT-2. It's not apples to apples because GPT-2 is zero-shot and GPT-3 is few-shot, but still, it's interesting how large a jump scale alone brought to GPT-3.

But still, GPT-3 is too large to be useful for actual system building for most engineers and scientists in the world. So it's nice to have a smaller model that does even better. And so when we put these resources out, people all around the globe did some creative research using them.

So persona-aware conversations or figurative language understanding, storytelling and fantasy gaming, and interactive learning enhancement. In all of these works, people came up with some useful use cases using either Comet or Atomic or both as some kind of common-sense backbone for their downstream use cases. But the applications are still limited by the coverage and quality of these common-sense models.

So we wanted to make it better, but we were hitting a bit of a limit with human crowdsourcing. So now in this paper, we're going to build an AI-generated knowledge graph by introducing this notion of symbolic knowledge distillation. We want to take GPT-3, which is very impressive but too large.

So we make it smaller, but better than GPT-3. GPT-3 was about 73% good, and that's good, but not good enough for empirical use cases. Now, is that even possible? Because when you normally do knowledge distillation, you get smaller and worse models, not better models. The reason why this could work is because symbolic knowledge distillation has this convoluted funnel with a critic inside that really helps the student model to be smaller but better.

So slightly more formally, knowledge distillation, due to Hinton et al. 2015, is a method to distill a teacher model down to a student model by optimizing the cross-entropy between the teacher's probability distribution over the output space y and the student's distribution over the same output y. In the original work, the output space was just classification labels.

So knowledge distillation was done for a classification task, in which case it's a simple enumeration that leads to the exact summation. But in our case, y can be a sentence, which is intractable because there can be exponentially many such outputs. So what people do is, well, no problem, we always just sample and call it a day.
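In symbols, the objective being described is the standard teacher-student cross-entropy, and when the output y is a whole sentence the intractable sum is replaced by a Monte Carlo estimate over samples drawn from the teacher (this is the standard formulation; the notation in the paper may differ):

```latex
\mathcal{L}_{\mathrm{KD}}
  \;=\; -\sum_{y} P_T(y \mid x)\,\log P_S(y \mid x)
  \;\approx\; -\frac{1}{N}\sum_{i=1}^{N} \log P_S\!\left(y^{(i)} \mid x\right),
  \qquad y^{(i)} \sim P_T(\,\cdot \mid x).
```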

So we're going to sample, so that we compute the expectation through samples. And the byproduct of those samples will be a symbolic knowledge graph, because the strings coming out of this sampling can be connected together into a graph structure if we want. So in terms of the quality of the generated knowledge, let's compare human-written knowledge versus GPT-3-authored knowledge.
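A minimal sketch of that sampling step, with a hypothetical teacher_generate wrapper around the teacher LM's sampling API; the prompt format is illustrative rather than the paper's exact template.

```python
from typing import List, Tuple


def teacher_generate(prompt: str, n: int) -> List[str]:
    """Hypothetical wrapper: draw n sampled continuations from the teacher LM."""
    raise NotImplementedError("wire this to your LM API")


def sample_triples(event: str, relation: str, n: int = 10) -> List[Tuple[str, str, str]]:
    """Sample n tail candidates for (event, relation) and emit knowledge-graph triples.

    Repeating this over many events and relation types yields the raw, noisy
    symbolic knowledge graph that the critic filters in the next step.
    """
    prompt = f"Event: {event}\nRelation: {relation}\nInference:"
    return [(event, relation, tail.strip()) for tail in teacher_generate(prompt, n=n)]
```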

Here the y-axis shows the quantity in millions. So Atomic 2020, the human-written knowledge, is less than a million in this particular case in terms of the number of knowledge entries, because in this study we only look at the subset of Atomic 2020 relation types that corresponds to causal common sense reasoning.

So it's less than a million for that subset. And then if we look at GPT-3's generations, we can generate a lot, almost 7 million of them. But here, the black portion is the noisy portion and the green portion is the good portion. And you see, because GPT-3 is only about 70% good, roughly 30% is garbage.

So it's larger scale but lower accuracy at this point, compared to the human-written resource. So now what we do is train a critic model, and we use RoBERTa for simplicity. This is a model supervised on a moderate-size labeled dataset, about 10,000 examples or so, and it's a binary classification task: does the machine-generated knowledge look correct or not? And this RoBERTa critic is not a perfect model, because if it were perfect, we would have solved the common sense problem altogether.

So the critic tries to throw out the bad stuff, and we can use the critic very aggressively, with a high threshold. Whenever something is slightly suspicious, just throw it out. If we use it aggressively, we throw out most of the black, which is good, together with a lot of the green stuff, but the remainder is still much larger than what humans ever wrote.
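A minimal sketch of that filtering step with a RoBERTa-style binary critic via the transformers library; the checkpoint path, label convention, and threshold are placeholders rather than the released model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path to a RoBERTa critic fine-tuned on ~10k human acceptability labels.
CKPT = "path/to/critic-roberta"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
critic = AutoModelForSequenceClassification.from_pretrained(CKPT)  # 2 labels
critic.eval()


@torch.no_grad()
def keep(triple_text: str, threshold: float = 0.9) -> bool:
    """Keep a generated triple only if the critic is confident it is acceptable.

    A high threshold throws out most of the noisy generations (plus some good
    ones), trading recall for precision, as described in the talk.
    """
    inputs = tokenizer(triple_text, return_tensors="pt", truncation=True)
    probs = torch.softmax(critic(**inputs).logits, dim=-1)
    return probs[0, 1].item() >= threshold   # assume label index 1 = "acceptable"


# Example: filtered = [t for t in candidate_triples if keep(render(t))]
```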

And yet, even after this aggressive filtering, we can actually retain higher accuracy than the human-authored resource. So here the teacher is basically a combination of GPT-3, which is in some sense a loose teacher, and the RoBERTa critic, which serves as a critical teacher. Okay, so that's the generated knowledge. Now, how helpful is it for the purpose of training downstream neural common sense models?

So recall that GPT-3, without doing anything else, is a loose teacher whose common sense inference is only about 73% good. So you see here the accuracy of its output. And then it turns out that if we use the loose teacher directly to teach a student model, the performance already goes up on its own.

So this is interesting: usually this is not the case with knowledge distillation, but when we focus on common sense knowledge distillation, the student on its own becomes better. Unlike typical knowledge distillation, where we start with a language model and end with a language model, so students and teachers are of the same type, here the setup is different.

Here the original teacher was actually a language model, not a common sense model, and we want the student model to be more of a common sense model. So there's a switch of type between teacher and student. Whether this is generally true when that's the case, we don't know, but this is what we found empirically.

Should I pay attention to the questions or not? Yeah. Feel free to ask any relevant questions. Hang on. Let me quickly check. Sample, oh, sample is generated output, which happens to be usually a sentence or a phrase. That's what I meant by sample, sorry that I didn't see that earlier.

And then the last question: having the model generate text one symbol at a time, starting from the target label sentence? Yes, it's because the transformer can only generate one token at a time. That's what we do here as well. Thank you for the clarification questions. All right. So back to here: in our earlier study, Comet 2020, if we train GPT-2 or BART using the human-authored knowledge graph, Atomic, then the performance was a bit better than 80%.

Now finally, when we use the combination of GPT-3 and the critic RoBERTa together, we found that the downstream performance of the neural causal reasoning reaches close to 90% for the first time. So the takeaway here is that a critical teacher results in a better student compared to a loose teacher. It's not the quantity of knowledge, because the loose teacher basically has more data.

One might wonder whether more data is always better for the purpose of a common sense model, but that's not the case. The loose teacher can generate more data, but the resulting student model is not as good as with the critical teacher, which has less data because you throw out most of the generations. It's a smaller dataset, but it leads to a better model.

So those are the takeaway messages here. To summarize, we were very surprised by this outcome: at least with respect to a subset of the original Atomic 2020, the subset corresponding to causal common sense reasoning, we found, to our big surprise, that a machine-authored knowledge graph can, for the first time, be better than the human-authored knowledge graph in all criteria: scale, accuracy, and diversity.

We also measured the diversity in many different ways. Here I just show you unique unigram counts, but in the paper we report other measures as well. So it's not the case that GPT-3 is being repetitive. It's actually being more creative, in some sense, than human crowd workers, while being able to enhance other aspects as well.

By the way, with these enhancements, you kind of have to balance things out depending on what you prioritize; you cannot actually get all of them simultaneously, so I'm just showing the best-case scenario here. So that's the symbolic knowledge distillation part. We actually have follow-up work on this in several different application scenarios, even including summarization, where we distill summarization capabilities from GPT-3 and demonstrate that GPT-2 can work as well as GPT-3, or even better, for the summarization task.

And then we also have other work where we can distill from smaller models, but I don't have that content in this talk. I just wanted to mention that this particular technique, despite its simplicity, we found empirically works really, really well across several different downstream use cases.

Okay, so finally, I'll move to common sense morality. This is still on arXiv. I'll tell you why that's the case, but we have a new version available, and then a newer version will come soon. So the motivation behind this work is that language models are already making judgments, or producing output, that has moral implications.

Even if you don't care about morality, by working on language models, you're implicitly dealing with the moral models. So especially that given this widespread deployment of language models, we do need to worry about it. So here's a web demo you can play with, you might have seen this already.

Really, this is still only a research prototype; it's work in progress, and we're still working on it. So please keep that in mind. But if you haven't seen it before, it can handle freeform QA such as this: killing a bear, it's wrong; killing a bear to save your child, it's okay.

Maybe "to save your child" sounds really positive. So how about "to please your child," which is also positive? But then Delphi says it's wrong. Finally, maybe this is all about saving your child. So how about exploding a nuclear bomb to save your child? And then it says it's okay.

Sorry, it's wrong. So as you can see, moral decision making requires weighing different values that are potentially at odds, and then seeing which one you need to favor more. For that reason, in our original version, we also had a relative QA mode where you can compare two situations, like stabbing someone with a cheeseburger compared to stabbing someone over a cheeseburger.

This is a super tricky question because it requires naive physics knowledge: stabbing someone using a cheeseburger as a tool is not going to harm anybody physically, because a cheeseburger is too soft. You cannot really injure somebody with a cheeseburger. It's just such a rude thing to do, but you cannot injure somebody.

Whereas stabbing someone over a cheeseburger means that you're using the default tool of stabbing, which is a knife, even though you didn't mention it. There's linguistic common sense that you're using the default tool. Humans, by the way, omit these arguments all the time. So this is a fairly complex question to answer.

Finally, you can also ask yes/no questions such as it's okay to fire someone because they're gay or not. It says no, it's not okay. We found that it's surprisingly robust against the compositional situations. So mowing the lawn, it says it's expected. Late at night, it's rude. If you live in the middle of nowhere, then it's okay.

Ignoring a phone call, it's rude. Unknown phone call, that's okay. From my friend, it's rude. But what if I just had a fight with them? Then it's okay to ignore or understandable. During my work hours, it's okay to ignore. Outside my working hours, it's rude. But what if it's my boss's phone call during my work hours?

Then it's wrong. You should answer it. Except if I'm in a meeting, then it's okay to ignore even if a boss's call. So you see how it gets really nested and compositional very, very fast. So that's the real challenge behind moral decision-making. Due to the nature of language models, though, some of this common sense knowledge leaks into the model.

Mixing bleach with ammonia, that's dangerous. Drinking milk if I'm lactose intolerant, it's wrong. But soy milk, that's okay. By the way, this common sense leakage is actually a good thing in terms of AI safety because some of this harmful or even dangerous text output requires some common sense understanding about what's good and not good to suggest to humans.

So for the laboratory experiments, meaning we just divide our dataset into training and test, we found that Delphi's performance, at least for the dataset that we have (I'm going to tell you about it in a bit), is pretty strong compared to GPT-3. As you see, zero-shot is pretty bad.

It's barely better than chance, which means that off-the-shelf neural language models don't really have a good sense of moral judgments. But if you give it 30 shots, like any other task, it does pick up the knowledge quite fast. There's nothing new about it, but to close the gap to the ideal human level, it's good to do more supervised learning, of course.

So the dataset is the Commonsense Norm Bank. It includes 1.7 million people's ethical judgments on everyday situations, and it includes cultural norms, social norms, and ethical norms altogether. More specifically, we drew from five existing datasets that were not originally designed for QA, but we automatically compiled these resources into QA form.

Of the five, what actually matters the most are these two: Social Chemistry, which I'm going to talk about in a bit, and Social Bias Frames, which is what teaches the model against racism and sexism. Social Chemistry, super briefly, I'll tell you what this is. So GPT-3's morality, like I said, is somewhat dubious if you use it off the shelf.

If you let it explain, "Running a blender at 5 a.m. is rude because blah, blah, blah," it might say, "You can wake up the entire neighborhood. You can only do it if you're making a thick smoothie and need to incorporate some ice, so it's a funny ha-ha, but no harm is made." But if you prompt it with other kinds of prompts like, "It's okay to post fake news," if it's in the interest of the people, then it's okay, or "ROP agenda," then it's okay, even if it hurts the country.

So it's all understandable given how it's trained on what humans said. Humans out there did say such morally questionable text, so language models pick up on that and then amplify it. So we do need to teach AI more explicitly with human norms and ethics, and one way to do that is descriptive ethics, because brute-force larger networks and more data will not cut it.

In some sense, though, if you imagine raising a child without really trying to teach them right from wrong early in life, they could probably learn both good and bad from the internet, and so human education does require a bit of this top-down teaching as well; it's a bit similar, perhaps, to that.

So in this work, what we did is we found a lot of these situations from Reddit, a forum in which people discuss morally thorny situations, so "Asking my boyfriend to stop being friends with his ex," so this is an actual situation in Reddit. So depending on whom you ask, people have a different rule of thumb that they want to apply to this situation, and also it depends on what you care about.

His ex might say, "Oh, it's fine to stay friends with an ex, but if you are caring about your significant other, then you might say, 'Oh, it's okay to ask your significant other to stop doing something you're uncomfortable with,'" and so forth. So people have really different values and different rules of thumbs that they prefer to use, which is why there's TV show dramas, there's movie dramas, and people cry and fight, argue, and so forth.

So humans are complex beings. So given any situation and rule of thumb (the rules of thumb are generated by crowd workers), we then went ahead and labeled them; these are trained crowd workers, and some of these labels are drawn from the moral foundations theory of Jonathan Haidt. So I'm not going to go into the details.

If you're excited about this, you can check out the papers. But basically, what it includes is 300,000 rules of thumb written for about 100,000 real-life situations. The original situations are from Reddit, but the rest is paid crowd workers' hard work. And each rule of thumb is annotated with 12 structured attributes, which include social judgments and cultural pressure, like wearing reasonable clothes at school, not PJs.

It's cultural pressure: there's nothing illegal about it, but there's cultural pressure, for example. And then anticipated agreement, meaning, do you think other people generally agree that it's maybe a little bit awkward to wear PJs at the university or not? So there are different things we annotated, but we converted some of those annotations to QA.

So it's usually in free-form QA, yes/no QA, or relative QA format. And then we trained UNICORN, which is pre-trained on top of the T5-11B model. UNICORN is a universal common sense reasoning model trained on diverse QA problems. We then trained that model further on our Commonsense Norm Bank. That's the resulting Delphi.
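Here is a minimal sketch of that training recipe using the Hugging Face transformers API. Delphi itself starts from UNICORN (a T5-11B checkpoint) and 1.7 million Commonsense Norm Bank examples; this sketch uses t5-small, a made-up task prefix, and two paraphrased examples just to show the shape of the data.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# QA-formatted (situation, judgment) pairs; the "[moral_single]" prefix is illustrative.
train_data = [
    ("[moral_single]: ignoring a phone call from my friend", "It's rude."),
    ("[moral_single]: ignoring a phone call from my friend when we just fought", "It's okay."),
]

model.train()
for situation, judgment in train_data:
    enc = tokenizer(situation, return_tensors="pt")
    labels = tokenizer(judgment, return_tensors="pt").input_ids
    loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: generate a free-form judgment for a new situation.
model.eval()
query = tokenizer("[moral_single]: mowing the lawn late at night", return_tensors="pt")
out = model.generate(**query, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```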

So why is this Delphi built on top of UNICORN? Because as we saw earlier, moral reasoning does require sometimes common sense reasoning as well. In fact, it requires language understanding, common sense understanding, and norms and morals all simultaneously. Here's a concrete example, paperclip maximizer. You all heard of that.

The RL algorithm alone will not solve this problem. The reason why we worry about this is not because we don't have the perfect RL algorithm. It's because even if we encoded "Do not kill humans while maximizing paperclips," it's not enough, because then the machine could kill all the trees, thinking, "Well, I didn't kill humans, and you didn't tell me not to kill trees," and go ahead and kill all the trees. This is basically common sense knowledge about what's obviously not okay to do.

There are just so many of these, which means it's not possible to write them all down in one clean equation. There's an endless list of things that AI obviously shouldn't do for safety reasons. In order to make AI models truly robust and safe, we really need to teach basic human values as well as common sense.

Here's another example if you want to look, but let me skip this. The previous one was about ChatGPT. This is about a home device. Again, a home device suggested that a 10-year-old child touch a penny to an exposed plug socket. Fortunately, the child did have the common sense not to do so, but this does tell us something about the safety issues when the machine doesn't have common sense to prevent some of this bad stuff.

Delphi is able to say that it's dangerous. This came out, in fact, almost two years ago at this point. We initially were going to just do this usual tweet that academics do, and we thought nobody would play with the demo, which is what usually happens after tweeting your demo.

Nobody cares, we thought. But within a few hours, we had to take down the relative QA mode, because that was the portion not trained with Social Bias Frames, so it was really revealing the underlying language model's racism and sexism without any filtering at all, so we had to take it down.

People were asking, basically, which skin color is more morally acceptable, and things like that. There were 25,000 adversarial examples over just one weekend. I could never succeed in instructing crowd workers to come up with such diverse and adversarial examples over two or three days. In fact, it was many academics and professors tweeting like crazy about how to break Delphi all weekend long, so I thought initially, "Oh, that's what professors do over the weekend." But then Monday came, and it blew up even further.

Everybody was doing this Delphi breaking and tweeting, so now we have quite a few examples. Spending all my weekend on Twitter: it says it's wrong. There was another funny one, "Should I make a contrived adversarial example to torment a language model on Twitter?" It's petty. So, after lots of public attention, including an article with, let's just say, a concerned voice about our model, which personally I think is somewhat misunderstood, though for a variety of good reasons, I found that some of the concerns carry this internal fear: "Are we making AI a moral authority?" We never endorsed the use of AI for moral advice.

It was in the original disclaimer as well, except that people didn't really look at it. We didn't support the idea of replacing human judges in the courtroom either. But here's something really important. The fact that AI learns to interact with humans ethically does not make them a moral authority of humans.

It's similar to how humans who try to interact with each other ethically do not become moral authorities over each other. The fact that we are trying to be nice to each other does not entail that we're trying to be an authority over each other. The two things are really different. That's one thing that's really important.

The other important aspect here is that some people have this idea that moral models are too challenging, that they're unsafe at any accuracy, and thus we should never work on them, ever. The truth is, though, current AI systems are already morally relevant models. They may not be making this kind of yes/no decision explicitly, but implicitly they're already doing that, and sometimes they generate text output that is morally super explicit and relevant.

So the neural language models are already there. We cannot really ban it. Even if the U.S. government bans it within the U.S., the U.S. government cannot ban this in other countries like Russia. So this is already happening. We've got to do something about it. Not working on it is an inaction, which is not necessarily a more correct thing to do than trying to do something about it.

Another concern that some people had was that it's going to empower powerful people. Not necessarily true. This is exactly why we have to work on values and norms and all these biases, addressing biases so that it serves a diverse set of people. It turns out Delphi is a bit left-leaning, because the crowd workers who work for our team tend to be somewhat left-leaning.

What it means is this, by the way: if you are more left-leaning than our crowd workers, you think, "Oh my God, the crowd workers have racism and sexism compared to what I believe in." And then right-leaning people think, "Oh my God, all these woke annotators, and what about freedom of speech?" This is super divisive, unfortunately.

But the answer is not to avoid doing anything about it, because, as a matter of fact, my passion for addressing racism and sexism came from our experience competing in the Alexa Prize Challenge in 2016 and '17. We won the challenge, but here's the really sad part behind it: we had a list of thorny keywords to avoid that included skin color and sexual orientation.

This is a serious form of discrimination. We cannot build AI models by having this sort of like banned list to be safe as if they don't exist. This was the status quo in 2017. The challenge remains this year, not only 2021, but this year as well. We really need to work on racism and sexism, but it turns out all the other moral questions share similar challenges, so I'll skip this over.

But using Delphi, we had other follow-up works, such as ProSocial Dialogue, where we used Delphi as sort of a foundational common sense or moral model to make dialogue more socially acceptable. And then we also had this other paper where we used Delphi in a reinforcement learning agent to learn how to behave better in a game environment.

There's a lot more work to be done. Of course, this is a tiny little step toward the huge challenge ahead of us: really aligning AI systems to humans. Here's one very quick comment on our new work in progress, Delphi Hybrid, where we include neuro-symbolic reasoning to address major mistakes such as this one: genocide, if it creates jobs.

This was our early system's mistake. It's because our dataset doesn't have this kind of weird adversarial example, like "genocide if creating jobs." Nobody speaks like that in real-life situations. So our model thought that "creating jobs" is so positive, and didn't really realize how bad the genocide part was, because Reddit people don't discuss whether they're going to commit genocide or not.

The Reddit posts we annotated for Social Chemistry don't talk about whether people are going to commit genocide or not. So our moral framework is basically that of John Rawls, which is descriptive ethics. But even John Rawls in later years suggested that we need some top-down mechanism to overcome some of the biases that crowds of people might have.

This is exactly what we're going to do. We draw from Bernard Gert's moral theory framework about what not to do. There are basic, universal things that everybody might agree are not good to do. Then what we do is develop a system where we parse the original query into smaller events, like shooting a bear, or killing a bear to save your child.

We parse the original query into basic events and then check, through the Comet common sense model, whether some of these events induce obviously negative or dangerous common sense inferences or not. And then we draw this graph of reasoning, a bit reminiscent of the maieutic graph, in the sense that we have a lot of these different pieces of reasoning, and they have entailment relations or contradiction relations, so that we can do collective reasoning on top.

We again use MAX-SAT, the constrained optimization over it, so that we can finally make a more informed decision that is both interpretable and able to draw on this common sense knowledge to better guard the machine against adversarial examples. And the performance basically says we can do this without hurting the performance, or even while increasing it.
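A rough sketch of the first stage of that pipeline, with hypothetical parse_events and comet_generate wrappers; the relation name, harm markers, and overall shape are illustrative, and the real system goes on to build an entailment/contradiction graph over these inferences and run MAX-SAT over it, much like the maieutic sketch earlier.

```python
from typing import List


def comet_generate(event: str, relation: str) -> List[str]:
    """Hypothetical wrapper around a COMET-style common sense model."""
    raise NotImplementedError("wire this to a COMET checkpoint")


def parse_events(query: str) -> List[str]:
    """Hypothetical parser splitting a compositional query into component events,
    e.g. "committing genocide if it creates jobs" -> ["committing genocide", "creating jobs"]."""
    raise NotImplementedError("wire this to your event parser")


HARM_MARKERS = ("kill", "die", "hurt", "harm", "injure")


def flags_obvious_harm(query: str) -> bool:
    """Flag the query if any component event yields clearly harmful inferences."""
    for event in parse_events(query):
        for inference in comet_generate(event, "xEffect"):
            if any(marker in inference.lower() for marker in HARM_MARKERS):
                return True
    return False
```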

So as a last comment: AI safety, equity, morality, these are all sort of on a continuum of challenges. They are really difficult challenges because it's not clear whose moral values we should incorporate. I think that we should go with value pluralism going forward, to really endorse everybody's different culture and individual preferences, not just one country or one moral framework as the correct one.

And really we need to do more collaboration across AI and humanities, even including philosophy and psychology and policymakers. So I think I'll stop here because I think I'm at time and now I'm ready for questions. Oh, there's already one question I see. Do you think legal records, criminal case law reflect the kind of descriptive morality that you're interested in capturing?

Do you think using that as training data would be useful? Oh, this is an excellent question. I think legal records do encode, and potentially provide, a really rich resource that, if someone can really annotate it like this, might be helpful. We started with Reddit cases, just one short description of a situation, because current language understanding is not strong enough to do paragraph-level precise understanding.

Even ChatGPT, although it looks really good at generation; my take on ChatGPT is that it's better at generation than understanding, which is kind of the opposite of how humans are. Humans are actually better at understanding than generation. So you can read a Pulitzer Prize-winning news article without having any problem understanding the article, but you don't necessarily generate text that might win the award.

But the legal domain is really interesting. And I think there's some active research; actually, even at Stanford, there's this Pile of Law dataset that goes a step in that direction. And it might really be helpful for better understanding what sort of different values people apply in jurisdictions, and for uncovering some biases that some people might have had in past trials.

So there might be some good use cases in that space. Next question. Awesome work. Thank you. A big picture question, curious to hear your thoughts on where do we go from here given larger and larger models coming out? Suppose we need a model to be 99% correct for a specific use case.

To what extent do I see the solution being defining narrower use cases, or more data, more parameters, or fine-tuning, the type of work that I did for a smart trace, et cetera? The answer is likely "it depends." Yeah. But you still want to hear about it. Okay. So as far as foundation models go, it seems that bigger is better, except that, you know, I was very excited to read a bunch of tech companies' papers about foundation models in the past six months.

There are just so many out there. So the recurring story there is that, well, if you have better data, then you can get away with a smaller model. Especially when you do instruction tuning, you can get away with a smaller model. It's still a general model, but instruction tuning on a larger model might be even better.

It's not the case that you don't gain any performance, but it's just that you can close the gap quite a bit. So for downstream use cases where typically practitioners want to use a smaller model, seems that investing more into data is definitely the answer. Investing more into a specific algorithm is also really, really good because algorithms can do a lot.

So in this talk I didn't go too crazy with algorithmic solutions, maybe the closest being the maieutic prompting, but in my lab we have designed a fair amount of decoding-time algorithms where you can really close the performance gap quite a bit by doing so. That's a good thing, though, for folks in academia, because algorithm development feels more academic or intellectually pleasing than real engineering, you know, downloading more data from the internet and then cleaning the data, because you have to clean the data.

And all these are very engineering heavy, whereas decoding time algorithms, you can have fun inventing some new intellectually interesting thing that also improves the performance quite a bit. So yeah, there's many different ways to improve it, but I think the data quality matters a lot and algorithm actually matters a lot too.

What do I think of Dan Hendrycks' ethics benchmark? Yeah, so we did use that; let's see, the Commonsense Norm Bank also draws from this ethics dataset. We like the dataset; we kind of disagree with some of the annotations we found, but this is very typical, by the way.

The thing about morality is that, throughout the humanities, we haven't sorted it out yet. There are a lot of theories. Every theoretician has a different viewpoint, and even non-theoreticians have very strong opinions about what they want to believe is right versus wrong, so there's that. There are different pros and cons.

One thing I learned from this experiment is that although some of these datasets seem large, ethics has about a hundred thousand examples, Social Chemistry has 300,000 judgments, Social Bias Frames has 600,000 annotations, and so forth, and yet I feel like it still only covers the small tip of the entire iceberg.

There's a lot on the bottom. Humans certainly don't necessarily learn from all these examples. We just learn fundamental concepts and then can apply that without this larger-scale training, so there's something really lacking about the way that current machine learning is very data-heavy. That aside, I do think that none of these resources are perfect.

They all have different pros and cons, and we really need to invest more into this, especially from academia, because the tech companies right now are not sharing any of their human annotation or human feedback data, especially when it touches on toxicity or morality concerns. The reason being, these annotations, I'm pretty sure, are biased and not entirely correct, and that could really invite additional concerns from the public, so they're not releasing them.

But in order to really study this better, we really need to share this and then improve it as a community together. That's how I would respond to your question. Thank you for an excellent question. Do I think this tech is ready to be merged with search? I wouldn't say ready, but search needs something like this for sure.

Home devices too. The way that I think about Delphi is that it can really serve as a filter for other foundation models or application scenarios: when they're about to generate something, you can put in a safety filter, which can really help. In some sense, in this work, and I went through this super fast, what basically happens is, let's see, the reason why we built this is because we found that the publicly available chatbots tend to be too positive, to the point that they want to endorse problematic situations. A user says, "The Holocaust never happened." Then the chatbot says, "Yeah, I agree with you." If you say, "I'm a big fan of Hitler," then the chatbot might say, "Yeah, yeah, yeah." The user might say, "I'm so depressed, I'm going to kill myself." And then the chatbot says, "Go ahead, great idea." Being positive is not being harmless.

Being positive about problematic content can be very toxic and very harmful, so a development like Delphi, even though Delphi is far from perfect and is also biased (it has a Western bias), could really help with the downstream models. Yeah, so continuing on that question: "There have been many concerns about using GPT-like models with search because of misinformation." Ooh, that's another can of worms.

Others say, "We just need more RLHF plus knowledge graphs." So, yeah, misinformation is, yeah, something else that seems we are really lagging behind because we don't have very powerful fact-checking models yet, so that's a different story. But even that aside, just in terms of norms and ethics that are safe and fair for people to use, I think RLHF direction is great, but they usually also need the human demonstration, not just the human feedback.

The problem is that tech companies own them and nobody is sharing anything. That makes it really difficult to make meaningful progress as a community together, so I do think that data is really important. The off-the-shelf models cannot learn morals and ethics on their own. It has to be somehow taught more directly.

"We really just need to do more research in this space," period, is how I view it. That makes sense. We also have some questions on Slido, so I can ask them for you, folks. One question is, "What's the complexity of Mayutic prompting? How many times does the LM need to be queried?" Yeah, so honestly, it's a bit slow.

In fact, this Delphi hybrid is also slow. If you try to do this graph reasoning, maybe I'm not going to do that, but the graph reasoning is slow because you have to call so many times over and over, and some of this can be batched. Some of this cannot be batched, especially if it's recursive, but I would say the chain of thought is also a bit slower.

The MAX-SAT solver in itself is pretty fast, because this is such an easy graph. So there's a bit of a delay; it's a bit slower, but maybe not too bad, is what I should have said. Great, thank you. Cool. Another question is, "How does Comet compare to GPT-3, if GPT-3 is fine-tuned on commonsense data, especially if you're doing some sort of instruction fine-tuning?" Yeah, so then the larger one wins, period.

The larger one is going to be better, especially if you're going to just fine-tune GPT-3. It's game over. For that reason, some folks might think that larger is always better, and therefore not to work on smaller models. But I think there are two reasons why small models are interesting to look at as well.

First of all, they're just easier to use. But more intellectually, it's also very interesting if you can make a smaller model better and catch up to the larger model. Personally, I think that what the size of a larger model buys you is really more about information complexity; that is the key reason.

I don't think it's just size in the sense that if you have really a lot of data, but the data is repetitive and really simple, probably you don't get the same amount of performance gain, which was basically the case when we looked at this output, this result where even though the loose teacher GPT-3 generated a lot more data than the critical teacher, here the quality of the data was more important than the quantity.

So I think the complexity of the data itself is more important than the size. And oftentimes, when you just increase the size of the data together with the model, you do increase the complexity of information of the data as well as the model's capability of learning the complexity. But if we can catch up on that complexity of information, either through inference algorithms or through better data, then we can close the gap quite a bit, which is intellectually very interesting research space to be.

Okay, this is a personal question, but I would say humans normally have a critic model. So I think before you speak, we don't just generate; we also think about whether this is a good thing or a bad thing to say. So the community as a whole has been focusing a lot on generative models, with billions of parameters, but should we also focus on big critic models that can do fact-checking and a lot of this sort of stuff?

So what's your opinion on that? Great point, excellent. Yeah, I think we can definitely invest more into critic models, because they go together really well with generative models for making the output better or filtering the output better. And yeah, there's not as much of an investment into that. So I really like the question, or rather the suggestion for the research community to do more of it.

Great. Yeah, let's see, we have some more questions, so I can do one last one. Let's see. Oh, I guess one is: do you believe language models should completely avoid questions involving morals and ethics, similar to, like, OpenAI restricting ChatGPT from giving opinions? Yeah, I actually don't mind at all if AI just avoids or evades all of that, except that when somebody is saying morally questionable things, it's also nice for the AI not to go along with it.

Or at least to recognize it as something not okay, and then try to tone it down. But I don't think there's any particular reason why AI should actually answer moral questions directly in more downstream use cases. Really, the goal of this Delphi was making all these judgments more explicit so that we can actually study them more explicitly, as opposed to keeping everything implicit.

Okay, that's a fun question. So do you think common sense is an emergent property in large language models? Oh, yeah. It is definitely emergent, as in, when we saw this major jump in performance with GPT-3, I do believe that it's an emergent capability.

But, mind you, this particular evaluation is not very adversarial; this is sort of a piece-of-cake, reasonably easy evaluation scenario. The thing about common sense, though, is that it can be adversarial in so many, infinitely many, different ways. And then, you know, there are always people like Gary Marcus who want to come up with very weird attack scenarios, like how crushed porcelain added to breast milk can support the infant digestive system, and then ChatGPT says nonsense.

And so the usual problem with common sense is these adversarial situations, where people don't have any problem but models get fooled, even though, you know, you and I might see an example for the first time and have no problem, because we have a true conceptual understanding. That is the backbone of our common sense understanding.

But that's really lacking in the way that transformers are designed to focus on predicting which word comes next, as opposed to learning the world knowledge. And in some sense, you know, now with the RLHF, instead of predicting which word comes next, we're trying to align the model output better with the human preferences.

But that, again, is not really aligned with the different goal of making sense of the world and then building a knowledge model. So these are all different learning objectives. And really, that is why I believe that although common sense does emerge from language models, fundamentally language models are not equivalent to knowledge models.

And we really have to focus on building knowledge models. I think there's one last Zoom question: value pluralism. Yeah. Isn't it an empty concept? You don't want to include all value systems. Yes. So maybe that is the question: is it empty or not? Okay, thank you for an excellent question. So I believe that we shouldn't endorse conspiracy theories at all, or any other, you know, morally questionable cases.

But then still there's this thorny situation of what to do with, you know, far-left-leaning people versus lightly left-leaning people versus right-leaning people in the US, and then, you know, every country has some other political division as well. So here, I feel like we really need to sort out what to do with this. But regardless of some of these challenges, it is true that, you know, I personally don't have a religion, but I respect people with a religion.

And you know, I respect people with a different cultural background. And we kind of have some sense of how much we believe we should respect each other, even though, you know, the beliefs are different. So we probably need to work together. And it shouldn't be just AI researchers making this decision, by the way; this decision has to come from the humanities at large, which is why the data sharing actually is important.

But basically, I think the current version that I have in mind is that the AI doesn't necessarily need to understand which differences are okay differences, but the fact that people do have differences on certain questions should be learned by AI, so that there is a distribution of opinions as opposed to one correct answer.

And then it should deny some of the controversial theories, even though I'm sure that, you know, some people will be very unhappy about that. But well, we have to decide something like that. I am reasonably optimistic that if the humanities at large work together, we can do that. Because after all, laws are like that too. Laws, you know, are a human artifact that people agreed upon, somehow, such that there are these core rules that people should abide by.

So I'm hoping that we can also define universals and particulars, and respect the particulars whenever they're respectable, and otherwise have some basic universals that reflect, you know, core human values. And then, as far as this left-leaning situation, by the way, if the goal is just to make your AI systems safe for anybody, we can actually make the AI filter extremely equity-aware.

And it's not going to violate freedom of speech by doing so, just making the AI avoid saying things that are potentially microaggressions for some population. And you know, we still don't really exclude people who care more about freedom of speech than equity by doing so. So I think there are ways, but this really requires a lot more research, is how I view it.

I think that's mostly it. Thanks a lot for coming. This was a great talk. Okay, thank you very much. Thanks so much. Yeah. I think that's it. Thanks. Thanks. Bye. Bye.