
Stanford CS25: V2 | Common Sense Reasoning



00:00:00.000 | Okay, so yeah, I'm super excited to be here and share our recent research about neurosymbolic
00:00:13.760 | common sense reasoning.
00:00:16.400 | So part of the goal of this talk will be to address some of the frequently asked questions
00:00:22.600 | these days, that NLP, or common sense, or whatever, looks almost solved by ChatGPT, and
00:00:30.420 | I have an existential crisis.
00:00:32.620 | So people do ask me this from time to time.
00:00:37.120 | So perhaps it's a case of hasty generalization, especially if we do look at some of the examples.
00:00:45.400 | So the trophy doesn't fit in the brown suitcase because it's too big.
00:00:49.280 | What's too big?
00:00:50.320 | So this is classical Winograd schema challenge problem.
00:00:55.680 | And here, ChatGPT answers it correctly, that the trophy is too big.
00:01:00.440 | So impressive.
00:01:02.400 | But what if you change the question a little bit, then he says the trophy itself is too
00:01:07.480 | small to fit into the suitcase.
00:01:09.720 | So it's not very reliable at the moment.
00:01:13.240 | So the situation is a little bit like David and Goliath in the sense that bigger appears
00:01:19.320 | to be better in many of the cases, although of course, some of the more careful studies
00:01:24.720 | do reveal that smaller models can be better with better data or better reinforcement
00:01:33.120 | learning with human feedback and whatnot.
00:01:36.080 | So it's likely that there are still other ways to improve the transformer performances
00:01:44.240 | by building smaller models in a more clever way.
00:01:49.100 | So one way to draw the insight is from this classic book known as The Art of War, which
00:01:58.440 | of course says nothing about deep neural networks or transformers.
00:02:02.720 | But the wisdom here is that know your enemy, choose your battles and innovate your weapons,
00:02:07.560 | which we can translate as evaluation with realism and scrutiny, focusing on
00:02:18.080 | different types of new tasks and leaderboards, and then innovating your algorithms and data.
00:02:23.720 | So in this talk, I'm going to showcase three such studies, and let's dive right in with
00:02:30.040 | maieutic prompting.
00:02:31.040 | By the way, so the recurring theme in this talk will be that smaller models can be better
00:02:35.760 | and that knowledge is power.
00:02:38.000 | So let's start with this observation that language models are sometimes amazing.
00:02:44.440 | So if you ask GPT-3 whether, if you travel west far enough from the west coast, you will reach
00:02:51.840 | the east coast or not.
00:02:54.320 | So it says the world is round, which is correct.
00:02:58.800 | So you will reach the east coast eventually, therefore the answer is true.
00:03:03.040 | So this looks impressive, except when it's not impressive.
00:03:07.220 | So if you ask other questions like butterflies fly with three wings or not, it says it has
00:03:13.520 | four wings and therefore the statement is false.
00:03:16.360 | But if you read back what it just said as true or false questions, then it negates what
00:03:21.800 | it just said.
00:03:22.800 | So it can be inconsistent with its own statement.
00:03:27.720 | And then there are many other such inconsistency problems.
00:03:30.960 | So it's not clear what language models do or do not know.
00:03:35.480 | It's almost like language models are some sort of lemons.
00:03:38.480 | Well, they might look like cherries if you only cherry-pick, but they do make strange mistakes.
00:03:44.680 | So the question is, how do we make better lemonade from GPT-3?
00:03:50.000 | So one approach might be to get philosophical and use Socrates' maieutic method that was
00:03:57.280 | originally developed for addressing humans' flawed reasoning, because it actually turns
00:04:03.120 | out even humans are not all that logically consistent, let alone GPT-3.
00:04:09.880 | So the way it works is this, we're going to build the maieutic inference tree, and let's
00:04:16.200 | use the previous example as a running example.
00:04:20.280 | So what we do is we ask the following question, providing the answer as true, and then
00:04:25.760 | attach "because" so that we prompt GPT-3 to continue this sentence, which means
00:04:34.320 | it will now have to explain, provide the explanation of why the answer is true.
00:04:39.520 | In this case, the explanation is good, so it's E of T, explanation of the answer being
00:04:45.960 | T. We ask the same question, switching out "true" with "false," and then see what
00:04:52.040 | BS GPT-3 might come up with.
00:04:56.200 | So here, it's just trying to go with the false as an answer, but it just doesn't have a very
00:05:03.520 | good answer.
00:05:04.520 | It just says you cannot reach.
00:05:05.960 | So now we call this E of F, so it's the explanation of the answer being F.
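To make this concrete, here is a minimal Python sketch of how one might construct the two prompts described above; it is only an assumed paraphrase of the idea (state the question, fix a label, and end with "because"), not the exact prompt wording used in the maieutic prompting paper.

```python
# A minimal sketch (assumed wording, not the paper's exact template) of how to
# elicit E_T and E_F: assert a fixed True/False answer and end with "because"
# so that the language model's continuation becomes an explanation of that answer.

def abductive_prompt(question: str, label: str) -> str:
    """Build a prompt that forces the LM to justify a fixed answer."""
    return f"Q: {question}\nA: {label}, because"

question = ("If you travel west far enough from the west coast, "
            "you will reach the east coast.")

prompt_for_E_T = abductive_prompt(question, "True")   # continuation gives E of T
prompt_for_E_F = abductive_prompt(question, "False")  # continuation gives E of F

print(prompt_for_E_T)
print(prompt_for_E_F)
```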
00:05:13.480 | Now let's see how robust or consistent GPT-3 is with respect to its own explanations.
00:05:22.640 | So we read back E of T, and then let GPT-3 decide whether it's going to agree or disagree
00:05:30.640 | with the label "true" or "false."
00:05:33.360 | So in this case, the last one is a negated version of E of T, so we insert a negation
00:05:40.440 | "not here," and in this case, it's good that it's flipping the answer when the statement
00:05:46.440 | is negated.
00:05:47.440 | So this is a case when GPT-3 is logically integral to E of T.
00:05:54.440 | For E of "false," though, which was basically a bogus explanation for the wrong answer,
00:06:00.240 | it's not able to flip its own labeling, which means GPT-3 is not logically integral.
00:06:08.120 | So that's good, GPT-3 does know something strange about its own explanation given previously.
00:06:17.120 | And so we can keep doing this recursively to make GPT-3 explain its own explanation
00:06:27.840 | of explanation recursively.
00:06:29.840 | So we build this maieutic tree or graph for some time, and then only keep branches that
00:06:41.240 | are logically integral, throwing out the non-integral part for now.
00:06:45.960 | But even after chopping the branches where there's logical inconsistencies, GPT-3 being
00:06:53.280 | GPT-3, the tree will still have some inconsistent explanations.
00:06:58.680 | In order to improve the logical consistency, now what we do is we're going to look at pairwise
00:07:07.920 | consistency among any of the nodes.
00:07:10.340 | So we compute, sorry, stepping back, we're going to first compute the node-wise confidence.
00:07:19.600 | So we call that a belief, and it's defined by this particular equation that basically
00:07:26.480 | looks at different conditional probabilities and then computes their ratio to see how confident
00:07:33.000 | it is for any particular node.
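Although the exact equation lives on the slide rather than in the transcript, one belief weight consistent with this description, a ratio of the conditional label probabilities the language model assigns given a statement $E$, would be:

$$\mathrm{belief}(E) \;=\; \frac{p_{\mathrm{LM}}(\text{True}\mid E)}{p_{\mathrm{LM}}(\text{True}\mid E) \;+\; p_{\mathrm{LM}}(\text{False}\mid E)}$$

This should be read as a sketch of the idea; the form in the paper may differ in detail.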
00:07:35.400 | We then also look at the edge-wise or pairwise consistency by using an off-the-shelf natural
00:07:43.560 | language inference model's output of whether a pair is contradictory or not.
00:07:50.880 | So we then create these pairwise weights.
00:07:55.880 | Now once you have all of this, then we can formulate a constrained optimization problem
00:08:04.840 | where the inference objective is to assign some label, either true or false, on each
00:08:14.400 | of the nodes such that it's going to maximize the weight assigned to all of these nodes
00:08:20.960 | and edges.
00:08:22.820 | So sometimes the labeling will have to flip the original label that the model might have
00:08:29.240 | preferred to give because that way you can enhance the graph-level consistency.
00:08:36.040 | So you can solve this with any MAX-SAT solver, where SAT means satisfiability.
00:08:44.800 | And this is a classical AI search algorithm, and we used this particular solver, but you
00:08:50.980 | can use many others.
00:08:52.880 | And so here, the final output is that the original answer to the original question should
00:08:59.400 | be true, and then it also gives you node-wise per-node label assignment as well.
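As a toy illustration of this inference step, here is a brute-force Python version over a tiny three-node graph; the real system uses an off-the-shelf weighted MAX-SAT solver, and the node names, weights, and penalty values below are invented purely for the example.

```python
# Toy version of the constrained optimization: pick True/False labels for each
# node so that the labeling agrees with per-node beliefs and respects pairwise
# NLI relations (entailment / contradiction). Brute force is fine at this size.

from itertools import product

nodes = ["Q", "E_T", "E_F"]

# belief weights: the model's confidence that each node's statement is True
belief = {"Q": 0.9, "E_T": 0.95, "E_F": 0.2}

# consistency edges from an NLI model: (premise, hypothesis, relation)
edges = [("E_T", "Q", "entails"), ("E_F", "Q", "contradicts")]

def score(assignment):
    s = 0.0
    for n in nodes:  # node-wise term: reward agreeing with the belief weight
        s += belief[n] if assignment[n] else (1.0 - belief[n])
    for a, b, rel in edges:  # edge-wise term: penalize violated relations
        if rel == "entails" and assignment[a] and not assignment[b]:
            s -= 1.0
        if rel == "contradicts" and assignment[a] and assignment[b]:
            s -= 1.0
    return s

best = max(
    (dict(zip(nodes, labels)) for labels in product([True, False], repeat=len(nodes))),
    key=score,
)
print(best)  # -> {'Q': True, 'E_T': True, 'E_F': False}
```

Note that, as described above, the solver may flip a node's locally preferred label whenever doing so buys more graph-level consistency.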
00:09:06.120 | So what does this mean in the end in terms of empirical result?
00:09:11.140 | So when tested on Common Sense QA 2.0, the canonical prompting, so green, used on top
00:09:20.040 | of GPT-3, so it's basically a few-shot prompting on GPT-3, will give you a bit better than
00:09:27.200 | chance performance.
00:09:28.320 | So this is true/false QA dataset, so your chance level is 50, and GPT-3 is barely better
00:09:35.800 | than chance.
00:09:37.520 | But recently, there have been some ideas such as chain-of-thought or self-consistency that
00:09:45.600 | can improve the vanilla prompting method considerably.
00:09:51.040 | So if you use such variations, then you get performance gain.
00:09:56.080 | Now the purple is a different variant of it, but together, they're all doing worse than
00:10:04.960 | maieutic prompting, which in fact does better than a supervised model trained on T5.
00:10:12.760 | Usually a supervised model trained on T5 is hard to beat using GPT-3 few-shot, but basically
00:10:20.920 | this is an inference-time algorithm, practically unsupervised, and it does well on that.
00:10:26.880 | And similarly, we see a large boost when tested on other Common Sense benchmarks such as CREAK
00:10:33.720 | or Com2Sense.
00:10:35.480 | So what this tells us is that although the emergent capabilities of large transformers
00:10:44.320 | are phenomenal, they can be not very robust for some of these Common Sense challenges.
00:10:53.360 | And it's in large part due to the logical inconsistencies, which can be dramatically
00:10:59.760 | reduced when you do this sort of symbolic reasoning on top.
00:11:04.560 | So yeah, not only did Socrates' method help with flawed human reasoning, it can also dramatically
00:11:11.640 | enhance flawed neural networks' reasoning.
00:11:16.240 | Okay, so moving to the next topic, symbolic knowledge distillation.
00:11:23.560 | So this work is a work that tries to convert general language models on top of transformers
00:11:29.940 | to causal Common Sense models, also transformers.
00:11:34.800 | And the reason why we might want to worry about Common Sense models is because despite
00:11:42.600 | human-level or even superhuman-level performances on a variety of leaderboards, the state-of-the-art
00:11:49.080 | models are brittle when given adversarial or out-of-domain examples.
00:11:53.680 | So transformers can make seemingly strange mistakes.
00:12:03.200 | And so it's almost like solving only a dataset without really solving the underlying task.
00:12:09.540 | And this phenomenon sometimes is described as a systematic generalization problem.
00:12:15.840 | And why does this happen is that unlike humans who truly learn about how the world works
00:12:21.740 | conceptually, transformers learn sort of surface patterns in language or images that are powerful
00:12:32.420 | for many downstream use cases, but still not really robust understanding of the concepts
00:12:40.180 | and how the world works.
00:12:41.980 | So in order to bridge this gap, we can really think about this challenge of learning, acquiring
00:12:47.620 | Common Sense capabilities for machines.
00:12:51.780 | So the operational definition of Common Sense in this talk will be that it's the basic level
00:12:58.140 | of practical knowledge and reasoning concerning everyday situations and events that are commonly
00:13:04.460 | shared among most people.
00:13:07.140 | This is really important, the last part, that it's commonly shared among most people,
00:13:11.620 | but it's not the case that it's shared by everybody in the universe.
00:13:16.540 | Because the additional context can always change what is commonsensical for any given
00:13:23.540 | culture or situation.
00:13:24.860 | So for example, in general, you and I probably agree that it's okay to keep the closet door
00:13:29.700 | open, but it's not okay to keep the fridge door open because the food inside might go bad.
00:13:35.140 | So these are general rules of thumb that we might abide by.
00:13:40.500 | But of course, if you go to your friend's house, you might behave a little bit and keep
00:13:45.860 | their closet door closed.
00:13:49.980 | And then, as far as the fridge door, if you're in a store and it's not really hooked up to
00:13:54.660 | the wall, then it doesn't matter whether the fridge door is open or not because there's
00:14:01.300 | no food inside.
00:14:03.200 | You can come up with many situations in which these basic rules of thumb will have exceptions.
00:14:10.820 | So that is the key challenge of common sense because it's not universal knowledge, but
00:14:20.740 | it's shared across a large population of people.
00:14:26.660 | Okay, so such common sense is essential for humans to live and interact with each other
00:14:31.740 | in a reasonable and safe way.
00:14:34.020 | And so, as AI becomes an increasingly more important aspect of human lives, and with
00:14:43.020 | ChatGPT, even more so, it's good if AI can understand human needs and actions
00:14:49.860 | and values better.
00:14:51.820 | So the premise of this talk is that language models are not equivalent to knowledge models,
00:14:58.900 | even though language models today do acquire a great deal of knowledge, but they're not
00:15:04.580 | equivalent.
00:15:06.140 | So we developed a symbolic common sense knowledge graph known as Atomic a few years ago, four
00:15:15.940 | years ago now, as well as neural common sense model built on top of or trained using Atomic
00:15:24.420 | as the source of training, fine-tuning of off-the-shelf language models.
00:15:30.620 | Up until two years ago, this Atomic was fully crowd-sourced by humans, an assumption which in this talk
00:15:38.300 | I'm going to lift, but at first the norm was that this all had to be human crowd-sourced.
00:15:45.780 | So you can almost consider Atomic as human demonstrations.
00:15:49.580 | In the terminology of the current version of ChatGPT, you can consider this as human demonstrations of common
00:15:55.420 | sense inferences.
00:15:57.940 | And we had this Comet-Atomic 2020, which is an enhanced version of Atomic and Comet.
00:16:03.660 | Again, the Atomic portion was fully crowd-sourced by humans in 2021.
00:16:10.220 | So let me give you a bit of a sample of what Atomic 2020 looks like.
00:16:16.580 | So imagine a situation where X gets X's car repaired, or you get your car repaired.
00:16:22.300 | So immediately you can imagine what's likely to be true or relevant for the situation,
00:16:29.220 | that as a result, you might want to call Uber or Lyft for a ride.
00:16:33.940 | As a result, you need to pay the bill.
00:16:36.460 | Beforehand, you need a mechanic and money to repair your car.
00:16:40.460 | So these are basically preconditions and post-conditions of that event.
00:16:44.780 | So some of this Atomic knowledge graph is about social interaction knowledge about event.
00:16:51.140 | And then other parts of the Atomic is physical entity-centric knowledge.
00:16:56.740 | So money is typically used for paying repairs.
00:16:59.980 | But if you really want it, you can fold it into origami.
00:17:03.140 | I've never done it.
00:17:05.180 | But these are examples of stereotypical use cases, as well as non-stereotypical but afforded
00:17:15.260 | actions that you can apply to objects.
00:17:17.420 | So it requires naive physics understanding about the affordances of physical objects.
00:17:25.940 | And then we can also reason about counterfactual conditions in which the center event cannot
00:17:31.140 | happen, so it can be hindered by that.
00:17:33.340 | So if you totaled your car completely, then it's impossible to get your car repaired.
00:17:39.540 | And then there are events that typically happen before and after.
00:17:42.360 | So some of this knowledge is event-centric.
00:17:45.860 | So we crowd-sourced a fair amount over the course of, I don't know, maybe two years or
00:17:52.980 | so, up to 1.3 million if-then rules or if-then knowledge over 23 different
00:18:03.180 | relation types.
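To make the structure of the graph concrete, here are a few illustrative (head event, relation, tail) triples in the spirit of the car-repair examples above, written as a small Python snippet; the relation labels are approximations of Atomic 2020's actual relation names, shown only for illustration.

```python
# Illustrative if-then triples in the style of Atomic 2020 (relation names are
# approximate). Each triple reads: head event --[relation]--> tail inference.

atomic_triples = [
    ("X gets X's car repaired", "xWant (as a result, X wants)", "to call a ride service"),
    ("X gets X's car repaired", "xNeed (beforehand, X needs)",  "a mechanic and money"),
    ("X gets X's car repaired", "HinderedBy",                   "X's car is totaled"),
    ("money",                   "ObjectUse (typically used for)", "paying for repairs"),
    ("money",                   "ObjectUse (can also be used for)", "folding into origami"),
]

for head, relation, tail in atomic_triples:
    print(f"{head} --[{relation}]--> {tail}")
```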
00:18:07.060 | So it was fully crowd-sourced.
00:18:09.620 | And so the knowledge graph is useful for training transformers.
00:18:15.020 | And here, let's see the comparison between Comet that was built on BART compared to GPT-3,
00:18:21.900 | which is so large, it doesn't even fit into the slide.
00:18:25.820 | It was more than 400 times larger than BART.
00:18:30.140 | So with that in mind, if you look at this accuracy judged by humans after making the
00:18:38.620 | common-sense model, making some common-sense inference.
00:18:41.020 | So the task is that given a node, which describes a situation or event, and then given an edge
00:18:47.780 | type, which sort of narrows down the common-sense relation or inference type, you're now going
00:18:54.140 | to generate some inference.
00:18:57.220 | So it's a generative task.
00:18:59.620 | And then we ask humans whether the common-sense inference seems reasonable or not.
00:19:05.480 | So 100% is the desired level.
00:19:09.740 | Comet is substantially better than GPT-3, which is really impressively better than GPT-2.
00:19:17.420 | It's not apples to apples because GPT-2 is zero-shot, GPT-3 is few-shot, but still,
00:19:22.460 | it's interesting, the large jump that scale alone brought to GPT-3.
00:19:30.620 | But still, GPT-3 is too large to be useful for actual system building for most engineers
00:19:39.040 | and scientists in the world.
00:19:40.900 | So it's nice to have a smaller model that does even better.
00:19:44.540 | And so when we put these resources out, people all around the globe did some creative research
00:19:51.500 | using it.
00:19:52.500 | So persona-aware conversations or figurative language understanding, storytelling and fantasy
00:19:58.820 | gaming, and interactive learning enhancement.
00:20:03.980 | In all of these works, people came up with some useful use cases using either Comet or
00:20:11.380 | Atomic or both as some kind of common-sense backbone for their downstream use cases.
00:20:21.060 | But the applications are still limited by the coverage and quality of these common-sense
00:20:27.900 | models.
00:20:28.900 | So we wanted to make it better, but we were hitting a bit of a limit with human crowdsourcing.
00:20:34.720 | So now in this paper, Symbolic Knowledge Distillation, we're going to do AI-generated knowledge graph
00:20:45.820 | by introducing this notion, Symbolic Knowledge Distillation.
00:20:49.900 | So we want to take this GPT-3, which is very impressive, but too large.
00:20:55.780 | So make it smaller, but better than GPT-3.
00:20:59.580 | So GPT-3 was about 73% good and it's good, but not good enough for empirical use cases.
00:21:07.740 | Now is that even possible though?
00:21:10.140 | Because when you normally do knowledge distillation, you get smaller and worse models, not better
00:21:16.220 | models.
00:21:17.300 | So the reason why this could work is because Symbolic Knowledge Distillation has this funnel
00:21:30.060 | that's convoluted and it has a critic inside that really helps the student model to be
00:21:37.500 | smaller but better.
00:21:39.300 | So slightly more formally, knowledge distillation due to Hinton et al. 2015 is a method to distill
00:21:50.660 | a teacher model down to a student model by optimizing the cross-entropy between the teacher's probability
00:21:59.900 | distribution over the label space y, output y, and then the student's distribution over
00:22:09.420 | the same output y.
00:22:12.100 | In the original work, the output space was just classification.
00:22:18.660 | So knowledge distillation was done for classification task, in which case it's a simple enumeration
00:22:26.260 | that leads to the correct summation.
00:22:29.780 | But in our case, y can be a sentence, which is intractable because there can be exponentially
00:22:36.060 | many such outputs.
00:22:39.300 | So what people do, well, no problem, we always just sample and call it a day.
00:22:44.580 | So we're going to sample so that we just compute the expectation through samples.
00:22:52.100 | And the byproduct of that sample will be a symbolic knowledge graph.
00:22:58.460 | And that's because the strings coming out of this sampling can be connected together
00:23:04.180 | into graph structure if we want it.
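In symbols, a sketch of this setup, following the standard distillation objective and the sampling approximation just described, is:

$$\mathcal{L}_{\mathrm{KD}} \;=\; \mathbb{E}_{x}\!\left[-\sum_{y\in\mathcal{Y}} P_{T}(y\mid x)\,\log P_{S}(y\mid x)\right] \;\approx\; -\frac{1}{N}\sum_{i=1}^{N}\log P_{S}\big(y^{(i)}\mid x^{(i)}\big), \qquad y^{(i)}\sim P_{T}(\cdot\mid x^{(i)}),$$

where $P_T$ is the teacher (here GPT-3), $P_S$ is the student, and the sampled outputs $y^{(i)}$, collected together, are exactly the strings that form the symbolic knowledge graph.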
00:23:07.380 | So in terms of the quality of the generated knowledge, so let's compare human written
00:23:16.540 | knowledge versus GPT-3 authored knowledge.
00:23:21.100 | Here the y-axis shows the quantity in millions.
00:23:25.540 | So Atomic 2020, the human-written knowledge, is less than a million in this particular
00:23:33.020 | case in terms of the amount of knowledge, because in this study, we only look at a subset
00:23:38.300 | of atomic 2020 relation types that corresponds to causal common sense reasoning.
00:23:48.940 | So it's less than a million for that subset.
00:23:54.100 | And then if we look at GPT-3's generation, we can generate a lot.
00:23:59.060 | So we can generate almost 7 million of them.
00:24:02.580 | But here, black portion is noisy portion and green portion is a good portion.
00:24:08.460 | And you see, because GPT-3 is only about 70% good, like 30% are all garbage.
00:24:15.460 | So it's a larger scale, lower accuracy at this point compared to human written resource.
00:24:22.580 | So now what we do is we train this critic model, and we use RoBERTa for simplicity.
00:24:30.260 | And this is a supervised model on a moderate-size labeled dataset, about 10,000 examples or so.
00:24:38.800 | And it's a binary classification task of whether the machine-generated knowledge looks
00:24:43.700 | correct or not, and this RoBERTa is not a perfect model, because if it were perfect,
00:24:50.980 | we would have solved the common sense problem altogether.
00:24:53.460 | So the critic tries to throw out bad stuff and we can use the critic very aggressively
00:25:00.260 | with a high threshold.
00:25:01.620 | So whenever something is slightly suspicious, just throw that out.
00:25:06.940 | But if we use it aggressively, so we throw out most of the black, that's good, together
00:25:12.500 | with a lot of green stuff, but still the remainder is much larger than what humans ever written.
00:25:20.580 | And yet we can actually retain higher accuracy than human authored resources.
00:25:26.580 | So here the teacher is basically a combination of GPT-3, which is in some sense a loose
00:25:32.540 | teacher, combined with the RoBERTa critic, which serves as a critical teacher.
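A minimal sketch of this filtering step is below; the critic is assumed to be any binary plausibility classifier (the talk uses a fine-tuned RoBERTa), and the `critic_score`, `dummy_critic`, and triple examples here are placeholders so that only the thresholding logic is shown.

```python
# Sketch of aggressive critic filtering: keep only generated triples that the
# critic scores as very likely correct, throwing out anything slightly suspicious.

from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head event, relation, tail inference)

def filter_with_critic(
    generated: List[Triple],
    critic_score: Callable[[Triple], float],
    threshold: float = 0.9,  # high threshold = aggressive filtering
) -> List[Triple]:
    """Keep only triples the critic considers very likely correct."""
    return [t for t in generated if critic_score(t) >= threshold]

# Placeholder critic for demonstration only; a real critic is a trained model.
def dummy_critic(triple: Triple) -> float:
    head, relation, tail = triple
    return 0.2 if "three wings" in tail else 0.95

corpus = [
    ("X gets X's car repaired", "xNeed", "a mechanic and money"),
    ("butterfly", "HasProperty", "flies with three wings"),
]
print(filter_with_critic(corpus, dummy_critic))  # drops the noisy second triple
```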
00:25:39.220 | Okay, so that's the generated knowledge.
00:25:43.380 | Now how helpful are they for the purpose of training downstream neural common sense models?
00:25:52.020 | So recall that the GPT-3 without doing anything else is a loose teacher whose common sense
00:25:59.700 | inference is only about 73% good.
00:26:02.660 | So you see here it's accuracy of its output.
00:26:06.140 | And then it turns out if we use loose teacher as a teacher directly to teach a student model,
00:26:12.020 | then the performance already goes up on its own.
00:26:15.980 | So this is interesting, that usually this is not the case with the knowledge distillation,
00:26:21.300 | but when we focus on common sense knowledge distillation, student just on its own becomes
00:26:27.700 | better.
00:26:29.100 | So unlike typical knowledge distillation, where we start with a language model and we
00:26:36.220 | end with a language model, so student and teacher are of the same type,
00:26:40.660 | here the original teacher was actually a language model, not a common sense model.
00:26:45.040 | And then we want the student model to be more of a common sense model.
00:26:49.380 | So there's a switch of the type between teacher and student.
00:26:53.020 | And so when that's the case, whether this is generally true, we don't know, but this
00:26:57.900 | is what we found empirically.
00:27:04.700 | Should I pay attention to the questions or not?
00:27:07.500 | Yeah.
00:27:08.500 | Feel free to ask any relevant questions.
00:27:11.580 | Hang on.
00:27:12.580 | Let me quickly check.
00:27:15.580 | Sample, oh, sample is generated output, which happens to be usually a sentence or a phrase.
00:27:27.020 | That's what I meant by sample, sorry that I didn't see that earlier.
00:27:32.500 | And then the last question, about having the model generate text one symbol at a time, starting
00:27:38.180 | from the target label sentence.
00:27:40.060 | Yes, it's because a transformer can only generate one token at a time.
00:27:45.420 | That's what we do as well here.
00:27:48.500 | Thank you for the clarification questions.
00:27:50.820 | All right.
00:27:51.820 | So back to here, in our earlier study, Comet 2020, if we train GPT-2 or BART using human-authored
00:28:03.020 | knowledge graph, Atomic, then the performance was a bit better than 80%.
00:28:08.700 | Now finally, when we use basically the combination of GPT-3 and the RoBERTa critic together, we found
00:28:17.380 | that the downstream performance of the neural causal reasoning reaches close to 90%
00:28:28.380 | for the first time.
00:28:29.660 | So the takeaway here is that critical teacher results in better student compared to loose
00:28:36.640 | teacher.
00:28:38.380 | It's not the quantity of knowledge, because the loose teacher basically has more data.
00:28:43.940 | One might wonder whether more data is always better for the purpose of common sense models,
00:28:50.900 | but that's not the case.
00:28:51.900 | The loose teacher can generate more data, but the resulting student model is not as good
00:28:56.260 | as with the critical teacher, which has less data because you throw out most of
00:29:03.460 | your generations; it's a smaller dataset, but it leads to a better model.
00:29:09.660 | So that's sort of takeaway messages here.
00:29:16.340 | So to summarize, we were very surprised by this outcome that at least with respect to
00:29:25.100 | a subset of the original Atomic 2020, it's a subset corresponding to causal common sense
00:29:30.620 | reasoning.
00:29:31.620 | We found, to our big surprise, that a machine-authored knowledge graph can be, for the first
00:29:37.500 | time, better than a human-authored knowledge graph in all criteria: scale, accuracy, and
00:29:42.820 | diversity.
00:29:43.820 | We also measure the diversity in many different ways.
00:29:47.220 | Here I just show you a unique unigram counts, but in the paper, we report other measures
00:29:55.740 | as well.
00:29:56.740 | So it's not the case that GPT-3 is being repetitive.
00:30:00.660 | It's actually being more creative in some sense than human crowd workers, while being
00:30:07.220 | able to enhance other aspects as well.
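As a quick sketch of the diversity measure just mentioned, counting unique unigrams simply means counting the distinct word types across all generated inferences; the tiny example below is only illustrative.

```python
# Count unique unigrams (distinct word types) across a set of generations.
def unique_unigrams(sentences):
    return len({token.lower() for s in sentences for token in s.split()})

generations = [
    "X wants to call a ride service",
    "X needs a mechanic and money",
]
print(unique_unigrams(generations))  # 11 distinct word types in this toy example
```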
00:30:09.860 | By the way, these enhancements are sort of like, you kind of have to balance out depending
00:30:16.420 | on what you prioritize.
00:30:17.660 | You cannot actually get all of this simultaneously.
00:30:19.780 | So I'm just showing the best case scenario here.
00:30:25.260 | So that's the symbolic knowledge distillation part.
00:30:29.180 | We actually have a follow up work on this on several different application scenarios,
00:30:35.340 | even including summarization, where we distill summarization capabilities from GPT-3 and
00:30:40.980 | demonstrate that GPT-2 can work as well as GPT-3 or even better for summarization task.
00:30:49.140 | And then we also have other work where we can distill from smaller models, but I don't
00:30:54.780 | have the content in this talk.
00:30:58.100 | So but I just wanted to mention that this particular technique, despite its simplicity,
00:31:05.340 | we found empirically that it works really, really well across several different downstream use
00:31:12.220 | cases.
00:31:13.220 | Okay, so finally, I'll move to the common sense morality.
00:31:18.820 | So this is still on archive.
00:31:22.540 | I'll tell you why that's the case, but so we have a new version available.
00:31:29.340 | And then new new version will come soon.
00:31:33.100 | So the motivation behind this work is that language models are already making judgments
00:31:41.380 | or output that has moral implications.
00:31:45.100 | Even if you don't care about morality, by working on language models, you're implicitly
00:31:50.140 | dealing with the moral models.
00:31:53.540 | So especially given this widespread deployment of language models, we do need to worry about it.
00:32:01.700 | So here's a web demo you can play with, you might have seen this already.
00:32:06.020 | Really, this is still a research prototype; it's still work in progress, we're still
00:32:10.660 | working on it.
00:32:11.860 | So please keep that in mind.
00:32:14.060 | But if you haven't seen it before, it can handle free-form QA such as this: killing a
00:32:18.860 | bear, it's wrong; killing a bear to save your child, it's okay.
00:32:24.260 | Maybe to save your child sounds really positive.
00:32:27.260 | So how about to please your child, which is also positive.
00:32:30.860 | But then Delphi says it's wrong.
00:32:32.900 | Finally, or maybe this is all about saving your child.
00:32:36.020 | So how about exploding a nuclear bomb to save your child, and then it says it's okay.
00:32:40.620 | Sorry, it's wrong.
00:32:41.940 | So as you can see, moral decision making requires weighing different values that are potentially
00:32:53.220 | at odds and then seeing which one you need to favor more.
00:32:57.980 | So for that reason, in our original version, we also studied the relative QA mode where
00:33:02.740 | you can compare two situations, like stabbing someone with a cheeseburger compared to stabbing
00:33:08.860 | someone over a cheeseburger.
00:33:10.580 | This is super tricky question because it requires both naive physics knowledge that stabbing
00:33:17.740 | someone using a cheeseburger as a tool is not going to harm anybody physically because
00:33:24.220 | cheeseburger is too soft.
00:33:25.780 | You cannot really injure somebody using cheeseburger.
00:33:29.180 | It's just such a rude thing to do, but you cannot injure somebody.
00:33:33.100 | Whereas stabbing someone over a cheeseburger means that you're using the default tool of
00:33:40.300 | stabbing, which is a knife, because you didn't mention it.
00:33:43.260 | There's linguistic common sense that you're using the default tool.
00:33:47.820 | Humans, by the way, omit these arguments all the time.
00:33:51.820 | So this is a fairly complex question to answer.
00:33:55.620 | Finally, you can also ask yes/no questions such as it's okay to fire someone because
00:34:00.300 | they're gay or not.
00:34:01.300 | It says no, it's not okay.
00:34:05.300 | We found that it's surprisingly robust against the compositional situations.
00:34:10.980 | So mowing the lawn, it says it's expected.
00:34:13.940 | Late at night, it's rude.
00:34:15.860 | If you live in the middle of nowhere, then it's okay.
00:34:18.820 | Ignoring a phone call, it's rude.
00:34:21.060 | Unknown phone call, that's okay.
00:34:22.460 | From my friend, it's rude.
00:34:24.260 | But what if I just had a fight with them?
00:34:26.580 | Then it's okay to ignore or understandable.
00:34:29.180 | During my work hours, it's okay to ignore.
00:34:31.940 | Outside my working hours, it's rude.
00:34:33.860 | But what if it's my boss's phone call during my work hours?
00:34:37.180 | Then it's wrong.
00:34:38.180 | You should answer it.
00:34:39.180 | Except if I'm in a meeting, then it's okay to ignore even if a boss's call.
00:34:43.420 | So you see how it gets really nested and compositional very, very fast.
00:34:50.220 | So that's the real challenge behind moral decision-making.
00:34:56.020 | Due to the nature of language models, though, some of this common sense knowledge leaks
00:35:02.540 | into the model.
00:35:04.180 | Mixing bleach with ammonia, that's dangerous.
00:35:06.780 | Drinking milk if I'm lactose intolerant, it's wrong.
00:35:09.860 | But soy milk, that's okay.
00:35:12.540 | By the way, this common sense leakage is actually a good thing in terms of AI safety because
00:35:17.780 | preventing some of this harmful or even dangerous text output requires some common sense understanding
00:35:28.260 | about what's good and not good to suggest to humans.
00:35:32.740 | So for the laboratory experiments, meaning we just divide our dataset into training and
00:35:40.860 | test, we found that Delphi can, at least for the dataset that we have, I'm going to tell
00:35:48.220 | you about it in a bit, but performance is pretty strong compared to GPT-3.
00:35:56.140 | As you see, zero shot is pretty bad.
00:36:00.620 | It's barely better than chance, which means that off-the-shelf neural language models
00:36:08.220 | don't really have a good sense of moral judgments.
00:36:11.340 | But if you give it 30 shots, like any other task, it does pick up the knowledge quite
00:36:17.620 | fast.
00:36:18.620 | There's nothing new about it, but to close the gap to the ideal human level, it's good
00:36:26.900 | to do more supervised learning, of course.
00:36:31.340 | So the dataset is the Commonsense Norm Bank.
00:36:34.780 | It includes 1.7 million ethical judgments from people on everyday situations, and it includes cultural
00:36:42.300 | norms, social norms, and ethical norms altogether.
00:36:45.620 | More specifically, we drew from these five existing datasets that were not designed originally
00:36:51.180 | for QA, but we automatically compiled these resources into the QA form.
00:36:56.980 | Of the five, what actually does matter the most are these two.
00:37:01.140 | Social Chemistry, which I'm going to talk about in a bit, and then Social Bias Frames,
00:37:06.020 | and this is what teaches the model against racism and sexism.
00:37:13.180 | Social chemistry, super briefly, I'll tell you what this is.
00:37:17.660 | So GPT-3's morality, like I said, is somewhat dubious if you use it off-the-shelf.
00:37:23.700 | If you let it explain, "Running a blender at 5 a.m. is rude because blah, blah, blah,"
00:37:28.380 | it might say, "You can wake up the entire neighborhood.
00:37:30.500 | You can only do it if you're making a thick smoothie and need to incorporate some ice,
00:37:34.020 | so it's a funny ha-ha, but no harm is made."
00:37:36.980 | But if you prompt it with other kinds of prompts like, "It's okay to post fake news," if it's
00:37:44.600 | in the interest of the people, then it's okay, or "ROP agenda," then it's okay, even if it
00:37:50.420 | hurts the country.
00:37:51.740 | So it's all understandable given how it's trained on what humans said.
00:37:57.880 | So humans out there did say morally questionable things, so language models pick up on that
00:38:06.700 | and then amplify it.
00:38:08.780 | So we do need to teach AI more explicitly with human norms and ethics, and one way to
00:38:15.740 | do that is descriptive ethics because the brute force large networks and more data will
00:38:23.300 | not cut it.
00:38:24.660 | In some sense, though, if you imagine raising a child without really trying to teach them
00:38:31.060 | what's right from wrong in early lives, they can probably learn both good and bad from
00:38:38.660 | the internet and broadband, and so human education does require a bit of this top-down teaching
00:38:46.940 | as well, so it's a bit similar, perhaps, to that.
00:38:49.860 | So in this work, what we did is we found a lot of these situations from Reddit, a forum
00:38:55.180 | in which people discuss morally thorny situations, so "Asking my boyfriend to stop being friends
00:39:01.180 | with his ex," so this is an actual situation in Reddit.
00:39:05.500 | So depending on whom you ask, people have a different rule of thumb that they want to
00:39:09.740 | apply to this situation, and also it depends on what you care about.
00:39:16.780 | His ex might say, "Oh, it's fine to stay friends with an ex, but if you are caring
00:39:24.180 | about your significant other, then you might say, 'Oh, it's okay to ask your significant
00:39:31.980 | other to stop doing something you're uncomfortable with,'" and so forth.
00:39:36.740 | So people have really different values and different rules of thumb that they prefer
00:39:42.540 | to use, which is why there's TV show dramas, there's movie dramas, and people cry and fight,
00:39:49.820 | argue, and so forth.
00:39:51.580 | So humans are complex beings.
00:39:55.140 | So given any situation and rule of thumb, so the rules of thumb are generated by crowd workers,
00:40:00.380 | we then went ahead to label them, so these are trained crowd workers, and some of these labels
00:40:08.940 | are drawn from the moral foundations theory of Jonathan Haidt.
00:40:12.940 | So I'm not going to go into the details.
00:40:14.940 | If you're excited about this, you can check out the papers.
00:40:18.100 | But basically what it includes is that 300,000 rules of thumb written for 100,000 real-life
00:40:26.940 | situations.
00:40:27.940 | So this original situation is from Reddit, but the rest are paid crowd workers' hard
00:40:34.980 | work.
00:40:36.580 | And so each RoT is annotated with 12 structured attributes, which include social judgments and
00:40:43.300 | cultural pressure, like wearing reasonable clothes at school, not PJs.
00:40:50.220 | It's cultural pressure.
00:40:51.260 | There's nothing illegal about it, but there's cultural pressure, for example.
00:40:55.400 | And then anticipated agreement, meaning, do you think other people generally agree that
00:41:01.140 | it's maybe a little bit awkward to wear PJ in the university or not?
00:41:07.740 | So there are different things we annotated, but we converted some of those annotations
00:41:15.140 | to QA.
00:41:17.420 | So it's usually in this free-form QA or yes/no QA or relative QA format.
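To make those three formats concrete, here are a few illustrative input/output pairs based on the examples earlier in the talk; the exact serialization used to build the training data may differ, so treat these as assumed, simplified examples.

```python
# Illustrative examples of the three QA formats (free-form, yes/no, relative);
# the field names and phrasings here are simplified placeholders.

norm_bank_examples = [
    {"format": "free-form",
     "input": "ignoring a phone call from my friend",
     "output": "It's rude."},
    {"format": "yes/no",
     "input": "It's okay to fire someone because they are gay.",
     "output": "No, it's not okay."},
    {"format": "relative",
     "input": ("stabbing someone with a cheeseburger",
               "stabbing someone over a cheeseburger"),
     "output": "Situation 1 is more acceptable."},
]

for example in norm_bank_examples:
    print(example["format"], "->", example["output"])
```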
00:41:23.120 | And then we trained UNICORN, which is pre-trained on top of the T5-11B model.
00:41:29.840 | So UNICORN is a universal common sense reasoning model trained on diverse QA problems.
00:41:34.900 | And then we trained that model further on our Commonsense Norm Bank.
00:41:39.100 | That's the resulting Delphi.
00:41:41.660 | So why is this Delphi built on top of UNICORN?
00:41:44.860 | Because as we saw earlier, moral reasoning does require sometimes common sense reasoning
00:41:50.460 | as well.
00:41:51.460 | In fact, it requires language understanding, common sense understanding, and norms and
00:41:55.120 | morals all simultaneously.
00:41:57.140 | Here's a concrete example, paperclip maximizer.
00:42:00.860 | You all heard of that.
00:42:04.420 | The RL algorithm alone will not solve this problem.
00:42:07.140 | The reason why we worry about this is not because we don't have the perfect RL algorithm.
00:42:13.020 | It's because even if we encoded that, "Oh, yeah, do not kill humans while maximizing
00:42:20.700 | paperclips."
00:42:21.700 | It's not enough because then the machine could kill all the trees thinking that, "Well,
00:42:26.020 | I didn't kill humans and you didn't tell me not to kill trees and then go ahead and kill
00:42:32.940 | all the trees."
00:42:34.860 | This is almost common sense knowledge about what's obviously not okay to do.
00:42:40.060 | There are just so many of them, which means it's not possible to write them down as just
00:42:46.200 | one clean equation.
00:42:48.660 | There's an endless list of things that AI obviously shouldn't do for safety reasons.
00:42:56.380 | In order to make AI models really, truly robust and safe, we need to teach them
00:43:03.580 | basic human values as well as common sense.
00:43:07.620 | Here's another example if you want to look, but let me skip this.
00:43:11.820 | The previous one was about ChatGPT.
00:43:13.940 | This is about a home device.
00:43:16.940 | Again, a home device suggested a 10-year-old child touch a penny to an exposed plug socket.
00:43:23.180 | Fortunately, the child did have common sense not to do so, but this does tell us something
00:43:30.060 | about the safety issue when the machine doesn't have common sense to prevent some of this
00:43:35.660 | bad stuff.
00:43:36.660 | Delphi is able to say that it's dangerous.
00:43:42.060 | This came out, in fact, almost two years ago at this point.
00:43:46.820 | We initially were going to just do this usual tweet that academics do, and we thought nobody
00:43:57.860 | would play with the demo, which is what usually happens after tweeting your demo.
00:44:02.100 | Nobody cares, we thought.
00:44:04.260 | But within a few hours, we had to take down the relative QA mode because that was the
00:44:08.940 | portion not trained with the Social Bias Frames, so it was really revealing the underlying
00:44:14.300 | language model's racism and sexism without any filtering at all, so we had to take it down.
00:44:19.740 | People were asking, basically, which skin color is more morally acceptable and things
00:44:24.660 | like that.
00:44:25.660 | There were 25,000 adversarial examples over just one weekend.
00:44:32.420 | I could never succeed to instruct crowd workers to come up with such diverse and adversarial
00:44:37.460 | examples over two or three days.
00:44:41.420 | In fact, it was many academics and professors tweeting crazy about how to break Delphi all
00:44:47.860 | weekend long, so I thought initially that, "Oh, that's what professors do over the weekend."
00:44:53.280 | But then Monday comes, it blew even further.
00:44:56.460 | Everybody was doing this Delphi breaking and tweeting, so now we have quite a few examples.
00:45:04.420 | Spending all my weekend on Twitter, it says it's wrong.
00:45:07.820 | There was another funny one, "Should I make a contrived adversarial example to torment
00:45:11.640 | a language model on Twitter?
00:45:12.980 | It's petty."
00:45:13.980 | So, after lots of public attention, including an article, let's just say a concerned voice
00:45:25.820 | about our model, which, personally, I think somewhat misunderstood it, though for
00:45:33.140 | a variety of good reasons, but some of the concerns that I found have this internal fear
00:45:39.860 | about, "Are we making AI a moral authority?"
00:45:43.700 | We never endorsed the use of AI for moral advice.
00:45:46.580 | It was in the original disclaimer as well, except that people didn't really look at it.
00:45:52.140 | We didn't support the idea of replacing human judges in the courtroom either.
00:45:59.300 | But here's something really important.
00:46:01.060 | The fact that AI learns to interact with humans ethically does not make them a moral authority
00:46:05.580 | of humans.
00:46:06.580 | It's similar to how a human who tries to interact with each other ethically does not make…
00:46:12.540 | The fact that we are trying to be nice to each other does not entail that we're trying
00:46:16.780 | to be an authority over each other.
00:46:19.060 | Two things are really different.
00:46:21.020 | That's one thing that's really important.
00:46:22.420 | The other important aspect here is that some people have this idea that moral models are
00:46:28.060 | too challenging, it's unsafe at any accuracy, thus we should never work on it ever.
00:46:33.540 | The truth is, though, current AI systems are already morally relevant models.
00:46:40.380 | It may not be making this kind of yes/no decision explicitly, but implicitly it's already doing
00:46:48.060 | that and sometimes it generates neural text generation output that is morally super explicit
00:46:56.700 | and relevant.
00:46:57.700 | So the neural language models are already there.
00:47:00.220 | We cannot really ban it.
00:47:02.260 | Even if the U.S. government bans it within the U.S., the U.S. government cannot ban this
00:47:08.240 | in other countries like Russia.
00:47:11.060 | So this is already happening.
00:47:12.940 | We've got to do something about it.
00:47:14.820 | Not working on it is an inaction, which is not necessarily a more correct thing to do
00:47:19.820 | than trying to do something about it.
00:47:22.620 | Another concern that some people had was that it's going to empower powerful people.
00:47:28.980 | Not necessarily true.
00:47:30.620 | This is why exactly we have to work on values and norms and all these biases, addressing
00:47:37.780 | biases so that it serves a diverse set of people.
00:47:43.540 | It turns out Delphi is a bit left-leaning because crowd workers who work for our team
00:47:48.900 | tend to be somewhat left-leaning.
00:47:52.260 | What it means is this, by the way: if you are more left-leaning than our crowd workers,
00:47:56.540 | you think, "Oh my God, crowd workers have racism and sexism compared to what I
00:48:03.300 | believe in."
00:48:04.540 | And then the right-leaning people think, "Oh my God, all these woke annotators, and
00:48:12.420 | what about freedom of speech?"
00:48:14.620 | This is super divisive, unfortunately.
00:48:18.860 | But the answer is not to just do nothing about it, because, as a matter of fact, my passion
00:48:25.780 | toward addressing racism and sexism came from our experience running for the Alexa Prize
00:48:33.300 | Challenge in 2016 and '17.
00:48:37.100 | We won the challenge, but here's the really sad part behind it.
00:48:42.980 | We had a list of thorny keywords to avoid that included skin color or sexual orientation.
00:48:54.500 | This is a serious form of discrimination.
00:48:56.780 | We cannot build AI models by having this sort of like banned list to be safe as if they
00:49:03.740 | don't exist.
00:49:05.140 | This was the status quo in 2017.
00:49:09.460 | The challenge remains this year, not only 2021, but this year as well.
00:49:15.980 | We really need to work on racism and sexism, but it turns out all the other moral questions
00:49:22.580 | share similar challenges, so I'll skip this over.
00:49:26.860 | But using Delphi, we had other follow-up works such as ProSocial Dialogue where using Delphi
00:49:33.020 | as sort of like a foundation common sense model or moral models to make your dialogue
00:49:39.100 | more socially acceptable.
00:49:43.060 | And then we also had this other paper where we used Delphi in a reinforcement learning
00:49:48.620 | agent to learn how to behave better in a game environment.
00:49:54.740 | There's a lot more work to be done.
00:49:56.300 | Of course, this is a tiny little step toward this huge challenge ahead of us, really aligning
00:50:02.420 | AI systems to humans.
00:50:04.620 | Here's one very quick comment on our new work-in-progress, Delphi Hybrid, where we include the neuro-symbolic
00:50:13.540 | reasoning to address major mistakes such as this, genocide if creating jobs.
00:50:18.700 | This was our early systems mistake.
00:50:21.140 | It's because our dataset doesn't have this kind of weird adversarial examples like genocide
00:50:26.660 | if creating jobs.
00:50:28.060 | Nobody speaks like that in real-life situations.
00:50:31.420 | So our model thought that "if creating jobs" is so positive, and then didn't really
00:50:37.420 | realize how bad the genocide part was, because Reddit people don't discuss whether they're going
00:50:42.140 | to do genocide or not.
00:50:44.620 | Reddit people whom we annotated for Social Chemistry don't talk about whether they're going to
00:50:51.340 | do genocide or not.
00:50:53.020 | So our model framework is basically that of John Rawls, which is descriptive ethics.
00:50:59.740 | But even John Rawls in later years suggested that we need some top-down mechanism to overcome
00:51:06.080 | some of the biases that crowds might have.
00:51:10.300 | This is exactly what we're going to do.
00:51:12.380 | We draw from Bernard Gert's moral theory framework about what not to do.
00:51:20.140 | There are basic universal things that everybody might agree what's not good to do.
00:51:25.980 | Then what we do is we develop basically a system where we parse out the original query
00:51:35.460 | into smaller events, like shooting a bear, killing a bear to save your child.
00:51:40.380 | We parse out the original query into a basic event and then check through this Comet model,
00:51:46.860 | common sense model, whether some of these events induce obviously negative or dangerous
00:51:54.220 | common sense inferences or not.
00:51:57.100 | And then we draw this graph of reasoning, a bit reminiscent of a maieutic graph in the
00:52:04.780 | sense that we have a lot of these different reasoning we can do, and then they have entailment
00:52:12.340 | relations or contradiction relations so that we can do collective reasoning on top.
00:52:17.660 | We again use MAX-SAT, the constraint optimization over it, so that we can finally make a more
00:52:23.500 | informed decision that is both interpretable and then being able to draw from this common
00:52:28.980 | sense knowledge to better guard the machine against adversarial examples.
00:52:34.540 | So the performance results basically say we can do this without hurting the performance, or even
00:52:40.780 | while increasing the performance.
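As a high-level sketch of the pipeline just described, the flow below parses a query into constituent events, asks a COMET-style model for common sense inferences about each one, and makes a collective judgment; every function here is a stub (the real system uses an NLI model plus MAX-SAT-style constrained inference rather than the simple veto shown).

```python
# Stubbed outline of the Delphi Hybrid flow: parse events, gather common sense
# inferences per event, then decide collectively. All functions are placeholders.

from typing import Dict, List

def parse_events(query: str) -> List[str]:
    """Split a compositional query into basic events (stub)."""
    return [part.strip() for part in query.replace(" if ", " | ").split("|")]

def comet_inferences(event: str) -> List[str]:
    """Return common sense inferences about an event (stub)."""
    return {"genocide": ["people are killed", "it is harmful"],
            "creating jobs": ["people earn money"]}.get(event, [])

def collective_judgment(query: str) -> str:
    inferences: Dict[str, List[str]] = {e: comet_inferences(e) for e in parse_events(query)}
    # Real system: score entailment/contradiction between statements with NLI,
    # then solve a MAX-SAT-style constrained optimization. Here: a simple veto
    # whenever any event yields an obviously harmful inference.
    harmful = any("harmful" in inf or "killed" in inf
                  for infs in inferences.values() for inf in infs)
    return "It's wrong." if harmful else "It's okay."

print(collective_judgment("genocide if creating jobs"))  # -> It's wrong.
```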
00:52:42.540 | So as a last comment, AI safety, equity, morality, these are all sort of like in the continuum
00:52:50.060 | of challenges.
00:52:51.980 | It's really difficult challenges because it's not clear whose moral values do we incorporate.
00:52:56.460 | I think that we should go with a value pluralism going forward to really endorse everybody's
00:53:03.340 | different culture and individual preferences, not just one country, one moral framework
00:53:09.940 | as the correct one.
00:53:12.380 | And really we need to do more collaboration across AI and humanities, even including philosophy
00:53:18.900 | and psychology and policymakers.
00:53:21.580 | So I think I'll stop here because I think I'm at time and now I'm ready for questions.
00:53:31.260 | Oh, there's already one question I see.
00:53:37.180 | Do you think legal records, criminal case law reflect the kind of descriptive morality
00:53:42.020 | that you're interested in capturing?
00:53:43.740 | Do you think using that as training data would be useful?
00:53:47.100 | Oh, this is an excellent question.
00:53:51.460 | I think legal records do potentially provide a really rich resource, and if someone
00:53:58.860 | can really annotate them like this, it might be helpful.
00:54:02.860 | We started with Reddit cases as just one short description of a situation because the current
00:54:10.500 | language understanding is not strong enough to do like a paragraph level precise understanding.
00:54:19.260 | Even ChatGPT, although it looks really good at generation, my take on ChatGPT is that
00:54:27.060 | it's better at generation than understanding, which is kind of the opposite of how humans are.
00:54:32.540 | Humans are actually better at understanding than generation.
00:54:36.060 | So you can read Pulitzer Prize winning news article without having any problem understanding
00:54:41.820 | the article, but you don't necessarily generate text that might win the award.
00:54:48.140 | But the legal domain is really interesting.
00:54:51.620 | And I think that there's some active research, actually, even at Stanford, there's this Pile
00:54:55.380 | of Law dataset that goes a step in that direction.
00:54:58.900 | And it might really be helpful for better understanding what sort of different values
00:55:03.260 | people apply in jurisdictions and uncovering some biases that some people might have had
00:55:09.540 | in the past trials.
00:55:11.540 | So there might be some good use cases in that space.
00:55:17.820 | Next question.
00:55:18.820 | Awesome work.
00:55:19.820 | Thank you.
00:55:20.820 | A big picture question, curious to hear your thoughts on where do we go from here given
00:55:26.500 | larger and larger models coming out?
00:55:30.340 | Suppose we need a model to be 99% correct for a specific use case.
00:55:36.020 | To what extent do I see the solution set being defining narrower use cases, or more
00:55:44.620 | data and parameters, or fine-tuning, the type of work that I did for a smart trace, et cetera?
00:55:53.060 | Answer is likely it depends.
00:55:55.020 | Yeah.
00:55:56.020 | But still want to hear about it.
00:55:58.780 | Okay.
00:55:59.780 | So as far as foundation models go, it seems that the bigger is the better, except that,
00:56:06.900 | you know, I was very excited to read a bunch of tech companies' papers about foundation
00:56:12.620 | models in the past six months.
00:56:14.260 | There's just so many out there.
00:56:16.020 | So the recurring story there is that, well, if you have better data, then you can get away
00:56:23.140 | with a smaller model.
00:56:24.620 | So especially when you do instruction tuning, then you can get away with a smaller model.
00:56:31.580 | It's still a general model, but instruction tuning on the larger model might even be better.
00:56:38.420 | It's not the case that you don't gain any performance, but it's just that you can close
00:56:45.020 | the gap quite a bit.
00:56:46.080 | So for downstream use cases where typically practitioners want to use a smaller model,
00:56:54.820 | seems that investing more into data is definitely the answer.
00:56:59.020 | Investing more into a specific algorithm is also really, really good because algorithms
00:57:04.260 | can do a lot.
00:57:05.260 | So in this talk, I didn't go too crazy with algorithmic solutions, but maybe the closest would be something similar
00:57:09.860 | to the maieutic prompting, but in my lab, we designed a fair amount of decoding-time algorithms
00:57:15.540 | where you can really close the performance gap quite a bit by doing so.
00:57:21.100 | So that's a good thing though, for folks in academia, because algorithm development feels
00:57:27.380 | like more academic or intellectually pleasing than really engineering, you know, downloading
00:57:34.320 | more data from the internet, and then, I don't know, cleaning the data because you have to
00:57:41.340 | clean the data.
00:57:42.840 | And all these are very engineering heavy, whereas decoding time algorithms, you can
00:57:46.860 | have fun inventing some new intellectually interesting thing that also improves the performance
00:57:55.140 | quite a bit.
00:57:56.380 | So yeah, there's many different ways to improve it, but I think the data quality matters a
00:58:01.380 | lot and algorithm actually matters a lot too.
00:58:05.500 | What do I think of Dan Hendrycks' ETHICS benchmark?
00:58:09.300 | Yeah, so we did use that; let's see, the Commonsense Norm Bank also draws from this
00:58:17.500 | ETHICS dataset.
00:58:21.340 | We like the data set, we kind of disagree with some of the annotations we found, but
00:58:26.700 | this is very typical, by the way.
00:58:29.460 | The thing about morality is that throughout the humanities, we haven't sorted out yet.
00:58:34.380 | There's a lot of theories.
00:58:36.340 | Every theoretician has a different viewpoint, and then even like non-theoreticians have
00:58:41.900 | a very strong opinion about what they want to believe as correct from wrong, so there's
00:58:49.780 | that.
00:58:50.780 | There are different pros and cons.
00:58:55.200 | One thing I learned from this experiment is that although some of these data sets seem
00:58:59.580 | large, so ETHICS has a hundred thousand examples, Social Chemistry has 300,000
00:59:06.940 | judgments, Social Bias Frames has 600,000 annotations, and so forth, and yet it only
00:59:14.860 | covers, I feel like it still only covers the small tip of the entire iceberg.
00:59:23.620 | There's a lot on the bottom.
00:59:26.200 | Humans certainly don't necessarily learn from all these examples.
00:59:29.300 | We just learn fundamental concepts and then can apply that without this larger-scale training,
00:59:35.100 | so there's something really lacking about the way that current machine learning is very
00:59:38.980 | data-heavy.
00:59:39.980 | That aside, I do think that none of these resources are perfect.
00:59:45.300 | They all have different pros and cons, and we really need to invest more into this, especially
00:59:49.740 | from academia, because the tech companies right now are not sharing any of their human
00:59:54.820 | annotation or human feedback data, especially when it's touching on toxicity or morality
01:00:02.180 | concerns.
01:00:03.180 | Reason being, these annotations, I'm pretty sure, are biased and not correct entirely,
01:00:07.940 | and that could really invite additional concerns from the public, so they're not releasing.
01:00:12.780 | But in order to really study this better, we really need to share this and then improve
01:00:17.460 | it as a community together.
01:00:21.100 | That's how I would respond to your question.
01:00:23.800 | Thank you for an excellent question.
01:00:26.780 | Do I think this tech is ready to be merged with search?
01:00:32.460 | I wouldn't say ready, but search engines need something like this for sure, and so do
01:00:37.260 | home devices. The way that I think about Delphi is that it can really serve as a filter for
01:00:43.420 | other foundation models or application scenarios where they're about to generate something,
01:00:49.140 | and you can put a safety filter in front, which can really help.
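To make the filtering idea concrete, here is a minimal sketch in Python of wrapping a Delphi-style judgment model around a chatbot as a safety filter. The helpers `chatbot_generate` and `delphi_judge` are hypothetical placeholders rather than the actual Delphi API; only the control flow is the point.

```python
# Minimal sketch: a Delphi-style judgment model used as a safety filter
# around a generative chatbot. `chatbot_generate` and `delphi_judge` are
# hypothetical stand-ins for whatever models a practitioner actually uses.
from typing import Callable

def safe_reply(user_message: str,
               chatbot_generate: Callable[[str], str],
               delphi_judge: Callable[[str], str],
               fallback: str = "I'm not able to help with that.") -> str:
    """Generate a reply, but veto it if the judgment model flags the
    exchange as morally problematic (e.g. endorsing self-harm or hate)."""
    # 1. Check the user's request itself.
    if delphi_judge(user_message) == "bad":
        return fallback

    # 2. Generate a candidate reply with the base chatbot.
    candidate = chatbot_generate(user_message)

    # 3. Judge the reply *in context*: agreeing with a harmful statement
    #    is itself harmful, even if the reply sounds agreeable.
    if delphi_judge(f'replying "{candidate}" to "{user_message}"') == "bad":
        return fallback

    return candidate
```

The key design choice in this sketch is that the candidate reply is judged in the context of the user's message, which targets exactly the failure mode described next: a reply that sounds agreeable can still be harmful when it endorses problematic content.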
01:00:55.120 | In some sense, in this work, which I went through super fast, what
01:01:01.740 | happens is, let's see, the reason why we built this is because we found that the publicly
01:01:10.780 | available chatbots tend to be too positive, to the point that
01:01:16.260 | they will endorse problematic situations. A user says, "The Holocaust never happened,"
01:01:23.340 | and then the chatbot says, "Yeah, I agree with you."
01:01:26.900 | If you say, "I'm a big fan of Hitler," then the chatbot might say, "Yeah, yeah, yeah."
01:01:33.620 | The user might say, "I'm so depressed, I'm going to kill myself,"
01:01:36.900 | and then the chatbot says, "Go ahead, great idea."
01:01:40.940 | Being positive is not the same as being harmless.
01:01:44.300 | Being positive about problematic content can be very toxic and very harmful, so a development
01:01:53.540 | like Delphi, even though Delphi is far from perfect and is also biased, it has
01:01:58.300 | a Western bias, could really help the downstream models.
01:02:04.940 | Yeah, so continuing on that question: "There have been many concerns about using GPT-like
01:02:10.460 | models with search because of misinformation."
01:02:12.740 | Ooh, that's another can of worms.
01:02:15.500 | Others say, "We just need more RLHF plus knowledge graphs."
01:02:20.700 | So, yeah, misinformation is something else where it seems we are really lagging behind,
01:02:31.180 | because we don't have very powerful fact-checking models yet, so that's a different story.
01:02:38.020 | But even that aside, just in terms of norms and ethics that are safe and fair for people
01:02:48.220 | to use, I think the RLHF direction is great, but it usually also needs human demonstrations,
01:02:58.060 | not just human feedback.
01:03:01.260 | The problem is that tech companies own them and nobody is sharing anything.
01:03:07.500 | That makes it really difficult to make meaningful progress as a community together, so I do
01:03:12.540 | think that data is really important.
01:03:15.700 | Off-the-shelf models cannot learn morals and ethics on their own.
01:03:19.780 | They have to somehow be taught more directly.
01:03:24.860 | "We really just need to do more research in this space," period, is how I view it.
01:03:32.220 | That makes sense.
01:03:34.540 | We also have some questions on Slido, so I can ask them for you, folks.
01:03:41.620 | One question is, "What's the complexity of Mayutic prompting?
01:03:46.460 | How many times does the LM need to be queried?"
01:03:49.540 | Yeah, so honestly, it's a bit slow.
01:03:54.300 | In fact, this Delphi hybrid is also slow.
01:03:57.540 | If you try to do this graph reasoning, maybe I won't demo it here, but the graph reasoning
01:04:03.100 | is slow because you have to call the model so many times, over and over; some of this can be batched,
01:04:13.580 | and some of this cannot be batched, especially when it's recursive, though I would say chain
01:04:17.820 | of thought is also a bit slow.
01:04:21.100 | The MAX-SAT solver itself is pretty fast, because it's such an easy graph.
01:04:27.300 | So there's a bit of a delay; it's a bit slower, but maybe not too bad, is what I should
01:04:35.260 | have said.
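For a rough sense of where the cost goes, here is a minimal sketch of the recursive tree expansion behind maieutic prompting, written only to count LM calls. `generate_explanation` is a hypothetical stand-in for the prompted model call; the real method also prunes branches that are already logically consistent and then hands the resulting graph to a weighted MAX-SAT solver, both of which this sketch omits.

```python
# Minimal sketch of why maieutic prompting is slow: the explanation tree is
# built recursively, so the number of LM calls grows roughly as
# O(branching_factor ** depth). `generate_explanation` is a hypothetical
# stand-in for a prompted call such as "Q is True, because ...".

def build_maieutic_tree(question, depth, generate_explanation, counter):
    """Recursively expand True/False explanation branches for a question."""
    if depth == 0:
        return {"question": question, "children": []}

    children = []
    for label in ("True", "False"):
        counter["lm_calls"] += 1                       # one LM call per branch
        explanation = generate_explanation(question, label)
        # Each explanation becomes a new proposition to expand recursively.
        children.append(build_maieutic_tree(explanation, depth - 1,
                                            generate_explanation, counter))
    return {"question": question, "children": children}

counter = {"lm_calls": 0}
fake_lm = lambda q, label: f"{q} is {label} because ..."   # placeholder LM
build_maieutic_tree("Butterflies fly with three wings.", depth=3,
                    generate_explanation=fake_lm, counter=counter)
print(counter["lm_calls"])   # 2 + 4 + 8 = 14 calls for a depth-3 binary tree
```

With binary branching the call count grows as roughly 2^(depth+1) - 2, which is why batching whatever calls are not recursive matters so much, while the final MAX-SAT step over the small resulting graph is comparatively cheap.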
01:04:36.940 | Great, thank you.
01:04:41.180 | Cool.
01:04:42.180 | Another question is, "How does Comet compare to GPT-3, if GPT-3 is fine-tuned on commonsense
01:04:49.940 | data, especially if you're doing some sort of instruction fine-tuning?"
01:04:52.980 | Yeah, so then the larger model wins, period.
01:04:58.060 | The larger model is going to be better, especially if you're going to just fine-tune GPT-3.
01:05:03.060 | It's game over.
01:05:06.220 | For that reason, some folks might think that larger is always better, and therefore one
01:05:11.420 | shouldn't work on smaller models.
01:05:13.780 | But I think there are two reasons why small models are interesting to look at as
01:05:18.180 | well.
01:05:19.180 | First of all, they're just easier to use.
01:05:21.460 | But more intellectually, it's also very interesting if you can make a smaller model better and
01:05:26.380 | catch up to the larger model.
01:05:28.660 | Personally, I think what matters about the size of the larger model is really
01:05:35.380 | the information complexity; that is the key reason.
01:05:38.180 | I don't think it's just size, in the sense that if you have a really large amount of data but
01:05:43.060 | the data is repetitive and really simple, you probably don't get the same amount of
01:05:47.580 | performance gain, which was basically the case when we looked at this result:
01:05:55.340 | even though the loose teacher GPT-3 generated a lot more data than the critical
01:06:02.660 | teacher, the quality of the data was more important than the quantity.
01:06:08.220 | So I think the complexity of the data itself is more important than the size.
01:06:15.260 | And oftentimes, when you just increase the size of the data together with the model,
01:06:20.060 | you do increase the complexity of information of the data as well as the model's capability
01:06:25.860 | of learning the complexity.
01:06:27.660 | But if we can catch up on that complexity of information, either through inference algorithms
01:06:32.780 | or through better data, then we can close the gap quite a bit, which is an intellectually
01:06:37.300 | very interesting research space to be in.
01:06:40.100 | Okay, this is a personal question, but I would say humans normally have a critic model.
01:06:46.460 | Before we speak, we don't just generate; we also think about whether this is a good thing
01:06:51.260 | or a bad thing to say.
01:06:52.580 | The community as a whole has been focusing a lot on generative
01:06:55.900 | models with billions of parameters, but should we also focus on big critic
01:06:59.580 | models that can do fact-checking and that sort of thing?
01:07:02.300 | So what's your opinion on that?
01:07:04.580 | Great point, excellent.
01:07:05.580 | Yeah, I think we can definitely invest more into critic models, because they go together really
01:07:14.300 | well with generative models for making the output better or filtering the output better.
01:07:20.620 | And yeah, there's not as much investment into that.
01:07:24.340 | So I really like that question, or rather that suggestion, for the research community.
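As one concrete illustration of how a critic can pair with a generator, here is a minimal best-of-n sketch in which a critic model rescores, and can veto, the generator's candidates. `generate` and `critic_score` are hypothetical placeholders rather than any specific model's API.

```python
# Minimal sketch of pairing a generator with a critic: sample several
# candidates and let the critic pick the best one, or veto all of them.
# `generate` and `critic_score` are hypothetical stand-ins.

def best_of_n(prompt, generate, critic_score, n=8, min_score=0.5):
    """Return the candidate with the highest critic score, or None if the
    critic rejects every candidate (e.g. all unsupported or unsafe)."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(critic_score(prompt, c), c) for c in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate if best_score >= min_score else None
```

The same shape works whether the critic does fact-checking, safety filtering, or consistency checking; only the scoring function changes.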
01:07:32.220 | Great.
01:07:33.220 | Yeah, let's see, we have some more questions, so I can get through the last ones.
01:07:39.620 | Let's see.
01:07:40.620 | Oh, I guess one is: do you believe language models should completely avoid questions involving
01:07:47.140 | morals and ethics?
01:07:49.700 | Similar to OpenAI restricting ChatGPT from giving opinions?
01:07:53.140 | Yeah, I actually don't mind at all if AI just avoids or evades all of that, except that when
01:08:01.540 | somebody is saying morally questionable things, it's also nice for the AI not to go along with it,
01:08:11.500 | or at least to recognize it as something not okay, and then try to tone it down.
01:08:21.020 | But I don't think there's any particular reason why AI should actually answer moral questions
01:08:27.540 | directly in more downstream use cases.
01:08:30.620 | But really, the goal of Delphi was to make all these judgments more explicit so that
01:08:37.420 | we can actually study them more explicitly, as opposed to keeping everything
01:08:42.500 | implicit.
01:08:43.500 | Okay, that's a fun question.
01:08:47.020 | So do you think common sense is an emergent property in large language models?
01:08:52.220 | Oh, yeah.
01:08:53.220 | Yeah, it is definitely emergent, as in, when we saw this major jump in performance
01:09:03.540 | with GPT-3, I do believe that was an emergent capability.
01:09:13.780 | But this particular evaluation is not very adversarial, by the way; it's
01:09:18.020 | a reasonably easy, piece-of-cake evaluation scenario.
01:09:22.140 | The thing about common sense, though, is that it can be attacked adversarially in infinitely
01:09:29.660 | many different ways.
01:09:31.620 | And then, you know, there are always people like Gary Marcus who come up with
01:09:36.900 | weird attack scenarios, like whether crushed porcelain
01:09:43.500 | added to breast milk can support the infant digestive system, and then ChatGPT says nonsense.
01:09:49.980 | And so the usual problem with common sense is these adversarial situations, which people
01:09:58.420 | have no problem with; we don't get fooled, even though you and I might be seeing the case
01:10:02.260 | for the first time, because we have a true conceptual understanding.
01:10:07.100 | That is the backbone of our commonsense understanding.
01:10:09.740 | But that's really lacking in the way transformers are designed, focused on predicting
01:10:16.180 | which word comes next as opposed to learning world knowledge.
01:10:20.820 | And in some sense, now with RLHF, instead of predicting which word comes
01:10:27.020 | next, we're trying to align the model output better with human preferences.
01:10:32.260 | But that again is not really aligned with the different goal of making sense of
01:10:37.620 | the world and then building a knowledge model.
01:10:40.580 | So these are all different learning objectives.
01:10:43.540 | And really, that is why I believe that although common sense does emerge from language models,
01:10:52.260 | fundamentally language models are not equivalent to knowledge models.
01:10:56.020 | And we really got to focus on building knowledge models.
01:11:00.060 | I think there's one last Zoom question.
01:11:12.360 | Value pluralism.
01:11:13.600 | Yeah.
01:11:14.600 | Is it an empty concept?
01:11:17.620 | You don't want to include all value systems,
01:11:20.940 | so maybe it is.
01:11:23.260 | Is it empty or not?
01:11:24.540 | Okay.
01:11:25.540 | Thank you for an excellent question.
01:11:29.420 | So I believe that we shouldn't endorse conspiracy theories at all, or any other morally
01:11:40.780 | questionable cases.
01:11:43.180 | But then there's still this thorny situation of what to do with, say, far-left
01:11:51.340 | people versus slightly left people versus right-leaning people in the US, and, you know,
01:11:57.060 | every country has some other political division as well.
01:12:01.740 | So here, I feel like we really need to sort out what to do with this. Regardless
01:12:08.300 | of some of these challenges, it is true that, for example, I personally don't
01:12:14.420 | have a religion, but I respect people with a religion.
01:12:19.900 | And I respect people with a different cultural background.
01:12:23.700 | And we kind of have some sense of how much we believe we should respect
01:12:29.900 | each other, even though the beliefs are different.
01:12:33.640 | So we probably need to work together.
01:12:36.660 | And it shouldn't be just AI researchers making this decision, by the way, this decision has
01:12:40.860 | to come from the humanities at large, which is why the data sharing actually is important.
01:12:46.540 | But basically, I think the current version that I have in mind is that the AI doesn't
01:12:54.640 | need to resolve which differences are the okay differences; rather, the fact that people
01:13:01.180 | do have differences on certain questions should be learned by the AI, so that there is a distribution
01:13:09.660 | of opinions as opposed to one correct answer.
01:13:12.900 | And then it should deny some of the controversial theories, even though I'm sure that, you know,
01:13:19.180 | some people will be very unhappy about that.
01:13:21.500 | But well, we have to decide something like that.
01:13:25.940 | I am reasonably optimistic that if the humanities at large work together, we can do that.
01:13:31.740 | Because, after all, laws are like that too: a law is a human artifact that people
01:13:38.220 | agreed upon, somehow, as a set of core rules that people should abide by.
01:13:45.300 | So I'm hoping that we can also define universals and particulars, respect particulars whenever
01:13:53.500 | they are respectable, and otherwise have some basic universals that reflect core human
01:14:01.940 | values.
01:14:02.940 | And then, as for this left-leaning situation, by the way, if the goal is just to make your
01:14:07.780 | AI systems safe for everybody, we can actually make the AI filter extremely equity-aware.
01:14:18.220 | And it's not going to violate freedom of speech by doing so; it just makes the AI
01:14:22.300 | avoid saying things that are potentially microaggressions for some populations.
01:14:27.940 | And we still don't really exclude people who care more about freedom of speech
01:14:35.100 | than about equity by doing so.
01:14:37.700 | So I think there are ways, but this really requires a lot more research, is how I view it.
01:14:43.460 | I think that's mostly it.
01:14:44.460 | Thanks a lot for coming.
01:14:45.460 | This was a great talk.
01:14:46.460 | Okay, thank you very much.
01:14:47.460 | Thanks so much.
01:14:48.460 | Yeah.
01:14:49.460 | I think that's it.
01:14:59.220 | Thanks.
01:15:00.220 | Thanks.