
David Ferrucci: The Story of IBM Watson Winning in Jeopardy | AI Podcast Clips


Chapters

0:00 What is Jeopardy
6:00 The Story of IBM Watson
10:50 The Challenges
15:30 Web Search
21:30 Question Analysis
26:19 Performance
30:16 Success

Whisper Transcript

00:00:00.000 | - So one of the greatest accomplishments
00:00:05.000 | in the history of AI is Watson competing
00:00:10.440 | in the game of Jeopardy against humans.
00:00:13.420 | And you were a lead in that, a critical part of that.
00:00:18.520 | Let's start at the very basics.
00:00:20.200 | What is the game of Jeopardy?
00:00:22.440 | The game for us humans, human versus human.
00:00:25.440 | - Right, so it's to take a question and answer it.
00:00:30.440 | The game of Jeopardy, well--
00:00:34.320 | - It's just the opposite.
00:00:35.160 | - Actually, well, no, but it's not, right?
00:00:38.360 | It's really not, it's really to get a question and answer,
00:00:41.440 | but it's what we call a factoid question.
00:00:43.520 | So this notion of like, it really relates to some fact
00:00:46.440 | that few people would argue
00:00:48.880 | whether the facts are true or not.
00:00:50.160 | In fact, most people wouldn't.
00:00:51.160 | Jeopardy kind of counts on the idea
00:00:52.640 | that these statements have factual answers.
00:00:57.240 | And the idea is to, first of all,
00:01:01.600 | determine whether or not you know the answer,
00:01:03.360 | which is sort of an interesting twist.
00:01:05.680 | - So first of all, understand the question.
00:01:07.440 | - You have to understand the question.
00:01:08.440 | What is it asking?
00:01:09.460 | And that's a good point,
00:01:10.360 | because the questions are not asked directly, right?
00:01:14.040 | - They're all like, the way the questions are asked
00:01:16.480 | is nonlinear.
00:01:17.960 | It's like, it's a little bit witty,
00:01:20.280 | it's a little bit playful sometimes.
00:01:22.080 | It's a little bit tricky.
00:01:25.520 | - Yeah, they're asked in, exactly,
00:01:27.520 | in numerous witty, tricky ways.
00:01:30.200 | Exactly what they're asking is not obvious.
00:01:32.120 | It takes inexperienced humans a while to go,
00:01:34.960 | what is it even asking?
00:01:36.480 | And that's sort of an interesting realization
00:01:38.720 | that you have when somebody says,
00:01:39.720 | oh, Jeopardy is a question answering show.
00:01:42.040 | And then it's like, oh, I know a lot.
00:01:43.480 | And then you read it,
00:01:44.300 | and you're still trying to process the question,
00:01:46.560 | and the champions have answered and moved on.
00:01:48.720 | They're like three questions ahead of you
00:01:51.320 | by the time you've figured out what the question even meant.
00:01:53.640 | So there's definitely an ability there
00:01:56.040 | to just parse out what the question even is.
00:01:59.120 | So that was certainly challenging.
00:02:00.440 | It's interesting, historically, though,
00:02:01.820 | if you look back at the Jeopardy games much earlier.
00:02:05.800 | - Like '60s, '70s, that kind of thing.
00:02:07.760 | - The questions were much more direct.
00:02:09.780 | They weren't quite like that.
00:02:10.920 | They got sort of more and more interesting.
00:02:13.280 | The way they asked them,
00:02:14.280 | that sort of got more and more interesting,
00:02:16.040 | and subtle, and nuanced, and humorous, and witty over time,
00:02:20.400 | which really required the human
00:02:22.080 | to kind of make the right connections
00:02:23.800 | in figuring out what the question was even asking.
00:02:26.420 | So yeah, you have to figure out what the question's even asking.
00:02:29.500 | Then you have to determine whether or not
00:02:31.700 | you think you know the answer.
00:02:33.200 | And because you have to buzz in really quickly,
00:02:36.920 | you sort of have to make that determination
00:02:39.320 | as quickly as you possibly can.
00:02:40.760 | Otherwise, you lose the opportunity to buzz in.
00:02:42.960 | - Maybe even before you really know if you know the answer.
00:02:46.080 | - I think a lot of humans will assume
00:02:48.160 | and they'll look at it, process it very superficially.
00:02:52.560 | In other words, what's the topic, what are some keywords,
00:02:55.560 | and just say, do I know this area or not
00:02:58.240 | before they actually know the answer?
00:03:00.400 | Then they'll buzz in and think about it.
00:03:02.800 | So it's interesting what humans do.
00:03:04.240 | Now some people who know all things,
00:03:06.520 | like Ken Jennings or something,
00:03:08.040 | or the more recent Big Jeopardy player,
00:03:10.000 | I mean, they'll just buzz in.
00:03:12.000 | They'll just assume they know all about Jeopardy
00:03:13.680 | and they'll just buzz in.
00:03:15.480 | Watson, interestingly, didn't even come close
00:03:17.920 | to knowing all of Jeopardy, right?
00:03:19.640 | Watson really-- - Even at the peak,
00:03:21.360 | even at its best. - Yeah, so for example,
00:03:23.320 | I mean, we had this thing called recall,
00:03:25.520 | which is like how many of all the Jeopardy questions,
00:03:28.960 | how many could we even find the right answer for,
00:03:32.960 | like anywhere?
00:03:34.000 | Like could we come up with, if we had a big body of knowledge
00:03:37.640 | of something on the order of several terabytes,
00:03:39.360 | I mean, from a web scale, it was actually very small.
00:03:42.440 | But from a book scale,
00:03:43.920 | I was talking about millions of books, right?
00:03:45.880 | So we're talking millions of books,
00:03:47.880 | encyclopedias, dictionaries,
00:03:49.680 | so it's still a ton of information.
00:03:51.880 | And for only, I think it was, 85% of them
00:03:54.760 | was the answer anywhere to be found.
00:03:57.200 | So you're already down at that level
00:03:59.920 | just to get started, right?
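A minimal sketch of the recall measurement described here: of all the known Jeopardy question/answer pairs, for what fraction can the answer be found anywhere in the corpus? The names and the lookup interface are hypothetical stand-ins.

```python
def corpus_recall(qa_pairs, corpus_contains):
    """qa_pairs: list of (question, answer) strings from past games.
    corpus_contains: predicate testing whether a string occurs anywhere
    in the indexed body of knowledge (hypothetical interface)."""
    found = sum(1 for _question, answer in qa_pairs if corpus_contains(answer))
    return found / len(qa_pairs)

# Per the discussion above, Watson's corpus scored roughly 0.85 here.
```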
00:04:01.680 | So, and so it was important to get a very quick sense of,
00:04:06.680 | do you think you know the right answer to this question?
00:04:09.640 | So we had to compute that confidence
00:04:11.800 | as quickly as we possibly could.
00:04:13.900 | So in effect, we had to answer it
00:04:16.080 | and at least spend some time essentially answering it
00:04:21.080 | and then judging the confidence
00:04:23.320 | that our answer was right,
00:04:26.240 | and then deciding whether or not
00:04:27.680 | we were confident enough to buzz in.
00:04:29.600 | And that would depend on what else was going on in the game
00:04:31.520 | because there was a risk.
00:04:33.000 | So like, if you're really in a situation
00:04:34.680 | where I have to take a guess, I have very little to lose,
00:04:37.960 | then you'll buzz in with less confidence.
00:04:39.840 | - So that accounted for the financial standings
00:04:42.560 | of the different competitors.
00:04:43.900 | - Correct.
00:04:45.020 | How much of the game was left,
00:04:46.220 | how much time was left,
00:04:47.860 | where you were in the standings, things like that.
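A toy sketch of that risk-adjusted buzz decision; the threshold numbers and game-state fields below are invented for illustration and are not Watson's actual strategy.

```python
from dataclasses import dataclass

@dataclass
class GameState:
    my_score: int         # current winnings
    leader_score: int     # winnings of the leading player
    clues_remaining: int  # how much of the game is left

def should_buzz(confidence: float, state: GameState) -> bool:
    """Buzz when confidence clears a threshold that loosens as the
    risk of staying silent grows (illustrative numbers only)."""
    threshold = 0.50
    behind = state.leader_score > state.my_score
    if behind and state.clues_remaining <= 5:
        threshold = 0.30  # little to lose, so take a guess
    return confidence >= threshold
```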
00:04:50.340 | - How many hundreds of milliseconds
00:04:52.460 | are we talking about here?
00:04:53.500 | Do you have a sense of
00:04:55.580 | what the target is?
00:04:58.020 | - So, I mean, we targeted answering in under three seconds.
00:05:03.020 | And--
00:05:04.260 | - Buzzing in, so the decision to buzz in
00:05:07.980 | and then the actual answering,
00:05:09.560 | are those two different stages?
00:05:10.820 | - Yeah, they were two different things.
00:05:12.260 | In fact, we had multiple stages,
00:05:14.120 | whereas like we would say, let's estimate our confidence,
00:05:16.980 | which was sort of a shallow answering process.
00:05:20.660 | And then ultimately decide to buzz in,
00:05:23.440 | and then we may take another second or something
00:05:25.940 | to kind of go in there and do that.
00:05:30.500 | But by and large, we're saying like, we can't play the game.
00:05:33.540 | We can't even compete if we can't, on average,
00:05:37.220 | answer these questions in around three seconds or less.
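In outline, that staged pipeline under the three-second budget might look like the following, reusing `should_buzz` from the sketch above; `shallow_answer` and `refine_answer` are hypothetical placeholders for the shallow confidence-estimation pass and the deeper answering pass.

```python
import time

TIME_BUDGET = 3.0  # "answer these questions in around three seconds or less"

def play_clue(clue, game_state):
    start = time.monotonic()

    # Stage 1: shallow answering pass, just enough to estimate confidence.
    candidate, confidence = shallow_answer(clue)       # hypothetical

    # Decide whether to buzz in at all, given the game situation.
    if not should_buzz(confidence, game_state):
        return None  # stay silent and keep our money

    # Stage 2: having buzzed, spend the remaining second or so refining.
    remaining = TIME_BUDGET - (time.monotonic() - start)
    return refine_answer(clue, candidate, deadline=remaining)  # hypothetical
```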
00:05:39.980 | - So you stepped in,
00:05:41.340 | so there's these three humans playing a game,
00:05:44.940 | and you stepped in with the idea that IBM Watson
00:05:47.620 | would be one of, replace one of the humans
00:05:49.580 | and compete against two.
00:05:51.620 | Can you tell the story of Watson taking on this game?
00:05:56.380 | - Sure.
00:05:57.220 | - It seems exceptionally difficult.
00:05:58.340 | - Yeah, so the story was that it was coming up,
00:06:03.140 | I think, to the 10-year anniversary of Big Blue.
00:06:06.620 | Not Big Blue, Deep Blue.
00:06:08.420 | IBM wanted to do sort of another kind of really,
00:06:11.580 | fun challenge, public challenge,
00:06:13.820 | that can bring attention to IBM research
00:06:16.020 | and the kind of the cool stuff that we were doing.
00:06:18.260 | I had been working in AI at IBM for some time.
00:06:23.420 | I had a team doing what's called
00:06:26.140 | open domain factoid question answering,
00:06:28.260 | which is, we're not gonna tell you what the questions are,
00:06:30.660 | we're not even gonna tell you what they're about.
00:06:32.740 | Can you go off and get accurate answers to these questions?
00:06:36.500 | And it was an area of AI research that I was involved in.
00:06:41.100 | And so it was a very specific passion of mine.
00:06:43.980 | Language understanding had always been a passion of mine.
00:06:46.780 | One sort of narrow slice on whether or not
00:06:49.300 | you could do anything with language
00:06:50.660 | was this notion of open domain,
00:06:52.260 | meaning I could ask anything about anything,
00:06:54.260 | factoid, meaning it essentially had an answer,
00:06:57.540 | and being able to do that accurately and quickly.
00:07:00.620 | So that was a research area that my team had already been in.
00:07:03.660 | And so completely independently,
00:07:05.940 | several IBM executives, like, what are we gonna do?
00:07:08.700 | What's the next cool thing to do?
00:07:10.700 | And Ken Jennings was on his winning streak.
00:07:13.540 | This was like, whatever it was,
00:07:16.300 | 2004, I think.
00:07:18.420 | And someone thought, hey, that would be really cool
00:07:20.540 | if the computer can play Jeopardy.
00:07:23.580 | And so this was like in 2004,
00:07:25.380 | they were shopping this thing around.
00:07:27.660 | And everyone was telling the research execs, no way.
00:07:33.180 | Like, this is crazy.
00:07:34.780 | And we had some pretty senior people in the field
00:07:36.700 | and they're saying, no, this is crazy.
00:07:37.820 | And it would come across my desk and I was like,
00:07:39.820 | but that's kind of what I'm really interested in doing.
00:07:42.720 | But there was such this prevailing sense of,
00:07:46.380 | this is nuts, we're not gonna risk IBM's reputation on this,
00:07:49.020 | we're just not doing it.
00:07:49.860 | And this happened in 2004, it happened in 2005.
00:07:52.780 | At the end of 2006, it was coming around again.
00:07:57.780 | And I was doing
00:08:00.700 | the open domain question answering stuff,
00:08:02.700 | but I was coming off of a couple other projects.
00:08:05.540 | I had a lot more time to put into this.
00:08:07.660 | And I argued that it could be done.
00:08:09.780 | And I argued it would be crazy not to do this.
00:08:12.340 | - Can I, you can be honest at this point.
00:08:15.420 | So even though you argued for it,
00:08:17.180 | what's the confidence that you had yourself,
00:08:19.780 | privately, that this could be done?
00:08:22.260 | You just told the story,
00:08:25.220 | the way you tell stories to convince others.
00:08:27.340 | How confident were you?
00:08:28.540 | What was your estimation of the problem at that time?
00:08:32.260 | - So I thought it was possible.
00:08:33.900 | And a lot of people thought it was impossible.
00:08:35.900 | I thought it was possible.
00:08:37.460 | A reason why I thought it was possible
00:08:38.740 | was because I did some brief experimentation.
00:08:41.100 | I knew a lot about how we were approaching
00:08:43.060 | open domain factoid question answering.
00:08:45.540 | We've been doing it for some years.
00:08:47.220 | I looked at the Jeopardy stuff.
00:08:48.900 | I said, this is gonna be hard
00:08:50.460 | for a lot of the points that we mentioned earlier.
00:08:53.740 | Hard to interpret the question,
00:08:55.300 | hard to do it quickly enough,
00:08:58.500 | hard to compute an accurate confidence.
00:09:00.080 | None of this stuff had been done well enough before.
00:09:02.620 | But a lot of the technologies we were building
00:09:04.220 | were the kinds of technologies that should work.
00:09:07.040 | But more to the point, what was driving me was,
00:09:10.420 | I was in IBM Research.
00:09:12.380 | I was a senior leader in IBM Research.
00:09:14.460 | And this is the kind of stuff we were supposed to do.
00:09:16.660 | In other words, we were basically supposed to--
00:09:18.340 | - This is the moonshot.
00:09:19.260 | This is the--
00:09:20.100 | - I mean, we were supposed to take things and say,
00:09:21.480 | this is an active research area.
00:09:23.660 | It's our obligation to kind of,
00:09:27.100 | if we have the opportunity, to push it to the limits.
00:09:29.660 | And if it doesn't work,
00:09:31.060 | to understand more deeply why we can't do it.
00:09:34.360 | And so I was very committed to that notion,
00:09:37.020 | saying, folks, this is what we do.
00:09:39.660 | It's crazy not to do this.
00:09:41.820 | This is an active research area.
00:09:43.300 | We've been in this for years.
00:09:44.580 | Why wouldn't we take this grand challenge
00:09:47.020 | and push it as hard as we can?
00:09:50.300 | At the very least, we'd be able to come out and say,
00:09:52.780 | here's why this problem is way hard.
00:09:56.660 | Here's what we tried and here's how we failed.
00:09:58.260 | So I was very driven as a scientist from that perspective.
00:10:03.260 | And then I also argued,
00:10:06.220 | based on a feasibility study we did,
00:10:08.220 | why I thought it was hard but possible.
00:10:10.540 | And I showed examples of where it succeeded,
00:10:13.780 | where it failed, why it failed,
00:10:15.700 | and sort of a high-level architectural approach
00:10:17.740 | for why we should do it.
00:10:19.100 | But for the most part, at that point,
00:10:21.840 | the execs really were just looking
00:10:23.220 | for someone crazy enough to say yes.
00:10:25.460 | Because for several years at that point,
00:10:27.460 | everyone had said no.
00:10:29.220 | I'm not willing to risk my reputation
00:10:31.860 | and my career on this thing.
00:10:34.420 | - Clearly, you did not have such fears.
00:10:36.300 | - I did not.
00:10:37.580 | - So you dived right in, and yet,
00:10:40.820 | from what I understand,
00:10:42.420 | it was performing very poorly in the beginning.
00:10:45.940 | So what were the initial approaches and why did they fail?
00:10:49.380 | - Well, there were lots of hard aspects to it.
00:10:54.460 | I mean, one of the reasons why prior approaches
00:10:57.340 | that we had worked on in the past failed
00:11:01.980 | was because the questions were difficult to interpret.
00:11:06.980 | Like, what are you even asking for?
00:11:09.700 | Very often, if the question was very direct,
00:11:12.100 | like, what city?
00:11:13.740 | Even then, it could be tricky.
00:11:16.220 | But what city or what person,
00:11:21.220 | often when it would name it very clearly,
00:11:23.820 | you would know that.
00:11:25.580 | And if there were just a small set of them,
00:11:27.700 | in other words, we're gonna ask about these five types.
00:11:31.140 | Like, it's gonna be an answer,
00:11:33.180 | and the answer will be a city in this state
00:11:36.420 | or a city in this country.
00:11:37.380 | The answer will be a person of this type, right?
00:11:40.620 | Like an actor or whatever it is.
00:11:42.340 | But it turns out that in "Jeopardy!"
00:11:43.980 | there were like tens of thousands of these things.
00:11:47.220 | And it was a very, very long tail.
00:11:49.180 | Meaning, you know, it just went on and on.
00:11:52.100 | And so even if you focused on trying to encode the types
00:11:56.500 | at the very top, like there's five that were the most,
00:11:59.420 | let's say five of the most frequent,
00:12:01.180 | you still cover a very small percentage of the data.
00:12:03.740 | So you couldn't take that approach of saying,
00:12:06.740 | I'm just going to try to collect facts
00:12:09.380 | about these five or 10 types or 20 types
00:12:12.380 | or 50 types or whatever.
00:12:14.020 | So that was like one of the first things,
00:12:16.540 | like, what do you do about that?
00:12:17.820 | And so we came up with an approach toward that.
00:12:21.100 | And the approach looked promising.
00:12:23.100 | And we continued to improve our ability
00:12:25.380 | to handle that problem throughout the project.
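That long tail is easy to quantify: tally the answer types over a set of labeled past questions and measure how little of the data the most frequent types cover. A sketch, with hypothetical data:

```python
from collections import Counter

def top_k_type_coverage(answer_types, k):
    """answer_types: one labeled answer type per past question,
    e.g. ["city", "actor", "city", "novel", ...].
    Returns the fraction of questions covered by the k most frequent
    types; with tens of thousands of types in a long-tailed
    distribution, even a generous k covers only a small slice."""
    counts = Counter(answer_types)
    covered = sum(n for _type, n in counts.most_common(k))
    return covered / len(answer_types)
```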
00:12:29.140 | The other issue was that right from the outset,
00:12:32.020 | I said, we're not going to,
00:12:34.180 | I committed to doing this in three to five years.
00:12:37.220 | So we did it in four.
00:12:38.660 | So I got lucky.
00:12:40.540 | But one of the things that came with putting that
00:12:42.580 | stake in the ground was this:
00:12:45.300 | I knew how hard the language understanding problem was.
00:12:47.380 | I said, we're not going to actually understand language
00:12:51.260 | to solve this problem.
00:12:52.360 | We are not going to interpret the question
00:12:57.060 | and the domain of knowledge that the question refers to
00:12:59.780 | and reason over that to answer these questions.
00:13:02.020 | Obviously, we're not going to be doing that.
00:13:03.780 | At the same time, simple search wasn't good enough
00:13:07.820 | to confidently answer with a single correct answer.
00:13:12.580 | - First of all, that's like brilliant.
00:13:13.860 | That's such a great mix of innovation
00:13:15.740 | and practical engineering.
00:13:18.260 | So you're not trying to solve the general NLU problem.
00:13:21.420 | You're saying, let's solve this in any way possible.
00:13:24.900 | - Oh, yeah, no, I was committed to saying,
00:13:27.460 | look, we're just solving
00:13:28.300 | the open domain question answering problem.
00:13:30.620 | We're using Jeopardy as a driver for that.
00:13:33.220 | - Big benchmark.
00:13:34.060 | - Hard enough, big benchmark, exactly.
00:13:36.140 | And now we're--
00:13:37.820 | - How do we do it?
00:13:38.660 | - We could just like, whatever,
00:13:39.580 | like just figure out what works,
00:13:40.780 | because I want to be able to go back
00:13:41.980 | to the academic science community and say,
00:13:44.500 | here's what we tried, here's what worked,
00:13:46.420 | here's what didn't work.
00:13:47.780 | I don't want to go in and say,
00:13:49.900 | oh, I only have one technology,
00:13:51.620 | I have a hammer, I'm only going to use this.
00:13:53.020 | I'm going to do whatever it takes.
00:13:54.340 | I'm like, I'm going to think out of the box,
00:13:55.580 | I'm going to do whatever it takes.
00:13:56.940 | And there was another thing I believed.
00:14:00.180 | I believed that the fundamental NLP technologies
00:14:04.220 | and machine learning technologies would be adequate.
00:14:08.420 | And this was an issue of how do we enhance them?
00:14:11.580 | How do we integrate them?
00:14:13.260 | How do we advance them?
00:14:14.940 | So I had one researcher who came to me
00:14:16.820 | who had been working on question answering
00:14:18.260 | with me for a very long time,
00:14:19.760 | who had said, we're going to need Maxwell's equations
00:14:23.900 | for question answering.
00:14:25.300 | And I said, if we need some fundamental formula
00:14:28.340 | that breaks new ground in how we understand language,
00:14:31.460 | we're screwed.
00:14:32.700 | We're not going to get there from here.
00:14:34.020 | Like, my assumption is I'm not counting
00:14:39.300 | on some brand new invention.
00:14:41.980 | What I'm counting on is the ability to take everything
00:14:46.180 | that's been done before to figure out an architecture
00:14:49.860 | on how to integrate it well, and then see where it breaks
00:14:53.900 | and make the necessary advances we need to make
00:14:56.820 | until this thing works.
00:14:58.140 | - Yeah, push it hard to see where it breaks
00:15:00.060 | and then patch it up.
00:15:01.260 | I mean, that's how people change the world.
00:15:02.820 | I mean, that's the Elon Musk approach with rockets,
00:15:05.540 | SpaceX, that's the Henry Ford and so on.
00:15:08.420 | I love it.
00:15:09.260 | - And I happen to be, and in this case,
00:15:10.700 | I happen to be right, but like, we didn't know.
00:15:13.900 | But you kind of have to put a stake in it,
00:15:15.500 | how you're going to run the project.
00:15:16.980 | - So yeah, and backtracking to search.
00:15:19.940 | So if you were to do, what's the brute force solution?
00:15:23.900 | What would you search over?
00:15:25.700 | So you have a question, how would you search
00:15:28.780 | the possible space of answers?
00:15:30.900 | - Look, web search has come a long way, even since then.
00:15:34.420 | But at the time, like, you know, first of all,
00:15:37.420 | I mean, there were a couple of other constraints
00:15:38.860 | around the problem, which is interesting.
00:15:40.540 | So you couldn't go out to the web.
00:15:42.700 | You couldn't search the internet.
00:15:44.580 | In other words, the AI experiment was,
00:15:47.220 | we want a self-contained device.
00:15:50.060 | The device, if the device is as big as a room, fine,
00:15:52.540 | it's as big as a room,
00:15:56.300 | but we want a self-contained device.
00:15:57.580 | You're not going out to the internet,
00:15:58.820 | you don't have a lifeline to anything.
00:16:01.160 | So it had to kind of fit in a shoebox, if you will,
00:16:03.860 | or at least the size of a few refrigerators,
00:16:06.180 | whatever it might be.
00:16:07.620 | But also, you couldn't just get out there.
00:16:10.020 | You couldn't go off network, right?
00:16:12.660 | So there was that limitation.
00:16:14.500 | But the basic thing was to go do web search.
00:16:18.940 | Problem was, even when we went and did a web search,
00:16:22.500 | I don't remember exactly the numbers,
00:16:24.140 | but somewhere in the order of 65% of the time,
00:16:27.140 | the answer would be somewhere, you know,
00:16:29.880 | in the top 10 or 20 documents.
00:16:32.500 | So first of all, that's not even good enough
00:16:33.980 | to play Jeopardy.
00:16:35.860 | In other words, even if you could pull the,
00:16:37.740 | even if you could perfectly pull the answer
00:16:39.820 | out of the top 20 documents, top 10 documents,
00:16:42.500 | whatever it was, which we didn't know how to do,
00:16:44.820 | but even if you could do that,
00:16:46.500 | you couldn't answer
00:16:48.700 | unless you had enough confidence in it, right?
00:16:50.300 | So you'd have to pull out the right answer,
00:16:52.060 | you'd have to have confidence it was the right answer.
00:16:54.400 | And then you'd have to do that fast enough
00:16:56.020 | to now go buzz in, and you'd still only get 65% of them
00:16:59.580 | right, which doesn't even put you in the winner's circle.
00:17:02.220 | For the winner's circle, you have to be up over 70%,
00:17:04.620 | and you have to do it really quickly.
00:17:07.580 | But now the problem is, well,
00:17:09.660 | even if I had somewhere in the top 10 documents,
00:17:12.060 | how do I figure out where in the top 10 documents
00:17:14.540 | that answer is?
00:17:15.540 | And how do I compute a confidence
00:17:17.620 | of all the possible candidates?
00:17:19.300 | So it's not like I go in knowing the right answer
00:17:21.380 | and have to pick it.
00:17:22.200 | I don't know the right answer.
00:17:23.500 | I have a bunch of documents,
00:17:25.140 | somewhere in there is the right answer.
00:17:26.680 | How do I, as a machine, go out and figure out
00:17:28.640 | which one's right?
00:17:29.540 | And then how do I score it?
00:17:31.020 | So, and now how do I deal with the fact
00:17:34.860 | that I can't actually go out to the web?
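The 65% figure above is a recall-at-k measurement over retrieved documents; a sketch of how one might compute it, where the search interface is an assumption:

```python
def recall_at_k(qa_pairs, search, k=10):
    """Fraction of questions whose correct answer string appears in the
    text of the top-k retrieved documents. `search(question, limit)` is
    a hypothetical engine returning documents with a `.text` attribute."""
    hits = 0
    for question, answer in qa_pairs:
        documents = search(question, limit=k)
        if any(answer.lower() in doc.text.lower() for doc in documents):
            hits += 1
    return hits / len(qa_pairs)
```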
00:17:36.900 | - First of all, if you pause on that, just think about it.
00:17:39.580 | If you could go to the web,
00:17:41.740 | do you think that problem is solvable,
00:17:43.860 | if you just pause on it?
00:17:45.100 | Just thinking even beyond Jeopardy,
00:17:47.880 | do you think the problem of reading text
00:17:50.940 | to find where the answer is?
00:17:53.220 | - Well, we solved that in some definition of solved,
00:17:56.260 | given the Jeopardy challenge.
00:17:57.540 | - How did you do it for Jeopardy?
00:17:58.580 | So how do you take a body of work on a particular topic
00:18:02.820 | and extract the key pieces of information?
00:18:05.500 | - So now forgetting about the huge volumes
00:18:08.660 | that are on the web, right?
00:18:09.620 | So now we have to figure out,
00:18:10.820 | we did a lot of source research.
00:18:12.300 | In other words, what body of knowledge
00:18:15.300 | is gonna be small enough, but broad enough,
00:18:18.320 | to answer Jeopardy?
00:18:19.540 | And we ultimately did find the body of knowledge
00:18:21.500 | that did that.
00:18:22.340 | I mean, it included Wikipedia and a bunch of other stuff.
00:18:24.700 | - So like encyclopedia type of stuff.
00:18:26.300 | I don't know if you can speak to--
00:18:27.140 | - Encyclopedia, dictionaries,
00:18:28.100 | different types of semantic resources,
00:18:30.540 | like WordNet and other types of semantic resources.
00:18:32.780 | Like that, as well as like some web crawls.
00:18:35.660 | In other words, where we went out and took that content
00:18:38.660 | and then expanded it based on
00:18:42.100 | statistically producing seeds,
00:18:44.220 | using those seeds for other searches,
00:18:47.180 | and then expanding that.
00:18:48.340 | So using these like expansion techniques,
00:18:51.100 | we went out and found enough content
00:18:53.220 | and we're like, okay, this is good.
00:18:54.220 | And even up until the end,
00:18:56.580 | we had a thread of research,
00:18:57.980 | it was always trying to figure out
00:18:59.380 | what content could we efficiently include.
00:19:01.820 | - I mean, there's a lot of popular stuff,
00:19:03.020 | like, what is the Church Lady?
00:19:05.020 | Well, that was one of them, I think.
00:19:07.660 | I guess that's probably in an encyclopedia.
00:19:11.660 | So I guess-- - So that's an encyclopedia,
00:19:13.500 | but then we would take that stuff
00:19:15.660 | and we would go out and we would expand.
00:19:17.380 | In other words, we go find other content
00:19:19.740 | that wasn't in the core resources and expand it.
00:19:22.580 | You know, the amount of content,
00:19:23.860 | we grew it by an order of magnitude,
00:19:25.780 | but still, again, from a web scale perspective,
00:19:28.180 | this is very small amount of content.
00:19:30.140 | - It's very select.
00:19:30.980 | - And then we then took all that content,
00:19:32.700 | we pre-analyzed the crap out of it,
00:19:34.820 | meaning we parsed it,
00:19:38.100 | broke it down into all those individual words
00:19:40.300 | and we did syntactic and semantic parses on it,
00:19:44.220 | had computer algorithms that annotated it
00:19:46.580 | and we indexed that in a very rich and very fast index.
00:19:51.580 | So we have a relatively huge amount of,
00:19:54.820 | let's say the equivalent of, for the sake of argument,
00:19:56.980 | two to five million books.
00:19:58.580 | We've now analyzed all that, blowing up its size even more
00:20:01.420 | because now we have all this metadata
00:20:03.220 | and then we richly indexed all of that
00:20:05.260 | and kept it, by the way, in a giant in-memory cache.
00:20:08.540 | So Watson did not go to disk.
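A compressed sketch of that offline pre-analysis and in-memory indexing; the parser and annotator calls are hypothetical stand-ins for the analytics the team actually ran.

```python
def build_in_memory_index(documents):
    """Analyze every document once, offline, and keep the enriched index
    entirely in RAM so that answering never touches disk.
    parse_syntax / parse_semantics / annotate are hypothetical stand-ins."""
    index = {}  # term -> list of (doc_id, metadata) postings
    for doc_id, text in documents:
        metadata = {
            "syntactic": parse_syntax(text),    # hypothetical parser
            "semantic": parse_semantics(text),  # hypothetical parser
            "annotations": annotate(text),      # hypothetical annotators
        }
        for term in set(text.lower().split()):  # stand-in tokenizer
            index.setdefault(term, []).append((doc_id, metadata))
    return index
```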
00:20:11.580 | - So the infrastructure component there,
00:20:13.300 | if you could just speak to it, how tough it was,
00:20:15.460 | I mean, maybe this is 2008, 2009,
00:20:20.380 | you know, that's kind of a long time ago.
00:20:24.300 | - Right.
00:20:25.500 | - How hard is it to use multiple machines?
00:20:27.500 | Like how hard is the infrastructure component,
00:20:29.460 | the hardware component?
00:20:31.180 | - So we used IBM hardware.
00:20:33.420 | We had something like, I forget exactly,
00:20:35.660 | but close to 3000 cores completely connected.
00:20:40.340 | So you had a switch where, you know,
00:20:41.660 | every CPU was connected to every other CPU.
00:20:43.420 | - And they were sharing memory in some kind of way.
00:20:45.700 | - Large shared memory, right?
00:20:47.620 | And all this data was pre-analyzed
00:20:50.380 | and put into a very fast indexing structure
00:20:54.500 | that was all in memory.
00:20:57.940 | And then we took that question,
00:21:01.060 | we would analyze the question.
00:21:04.060 | So all the content was now pre-analyzed.
00:21:06.860 | So if I went and tried to find a piece of content,
00:21:10.460 | it would come back with all the metadata
00:21:12.220 | that we had pre-computed.
00:21:14.220 | - How do you shove that question?
00:21:16.580 | How do you connect the big stuff,
00:21:19.660 | the big knowledge base of the metadata
00:21:21.420 | that's indexed to the simple little witty,
00:21:24.660 | confusing question?
00:21:26.540 | - Right.
00:21:27.380 | So therein lies, you know, the Watson architecture, right?
00:21:30.900 | So we would take the question,
00:21:32.540 | we would analyze the question.
00:21:34.300 | So which means that we would parse it
00:21:36.620 | and interpret it a bunch of different ways.
00:21:38.340 | We'd try to figure out what is it asking about?
00:21:40.420 | So we had multiple strategies
00:21:44.020 | to kind of determine what was it asking for.
00:21:46.820 | That might be represented as a simple string,
00:21:49.060 | a character string,
00:21:51.020 | or something we would connect back
00:21:52.740 | to different semantic types
00:21:54.420 | that were from existing resources.
00:21:55.740 | So anyway, the bottom line is we would do a bunch
00:21:57.940 | of analysis on the question.
00:22:00.020 | And question analysis had to finish and had to finish fast.
00:22:03.820 | So we do the question analysis
00:22:04.940 | because then from the question analysis,
00:22:07.500 | we would now produce searches.
00:22:09.420 | So we would, and we had built,
00:22:12.300 | using open source search engines, we modified them.
00:22:15.700 | We had a number of different search engines we would use
00:22:18.620 | that had different characteristics.
00:22:20.340 | We went in there and engineered
00:22:22.140 | and modified those search engines,
00:22:24.140 | ultimately to now take our question analysis,
00:22:28.140 | produce multiple queries based on different interpretations
00:22:31.740 | of the question and fire out a whole bunch
00:22:34.060 | of searches in parallel.
00:22:36.100 | And they would come back with passages.
00:22:40.700 | So these were passage search algorithms.
00:22:42.620 | They would come back with passages.
00:22:44.460 | And so now let's say you had a thousand passages.
00:22:47.820 | Now for each passage, you parallelize again.
00:22:51.420 | So you went out and you parallelized the search.
00:22:55.860 | Each search would now come back
00:22:57.060 | with a whole bunch of passages.
00:22:59.340 | Maybe you had a total of a thousand
00:23:01.060 | or 5,000, whatever passages.
00:23:02.900 | For each passage now, you'd go and figure out
00:23:05.380 | whether or not there was a candidate,
00:23:06.740 | we'd call a candidate answer in there.
00:23:08.780 | So you had a whole bunch of other algorithms
00:23:11.740 | that would find candidate answers,
00:23:13.460 | possible answers to the question.
00:23:15.820 | And so you had these
00:23:17.820 | candidate answer generators,
00:23:19.740 | a whole bunch of those.
00:23:20.980 | So for every one of these components,
00:23:23.300 | the team was constantly doing research,
00:23:25.100 | coming up with better ways to generate search queries
00:23:27.500 | from the questions, better ways to analyze the question,
00:23:30.100 | better ways to generate candidates.
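Putting those stages together (question analysis, multiple query interpretations, parallel searches, passage retrieval, candidate generation), the fan-out can be skeletonized as below; every helper named here is a stand-in for a whole family of components.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_candidates(clue):
    # Analyze the question: parses, interpretations, answer-type guesses.
    analysis = analyze_question(clue)   # hypothetical
    queries = build_queries(analysis)   # hypothetical; one per interpretation

    # Fan out: fire all the searches in parallel; each returns passages.
    with ThreadPoolExecutor() as pool:
        passage_lists = list(pool.map(run_search, queries))  # hypothetical
    passages = [p for plist in passage_lists for p in plist]

    # Fan out again: run the candidate-answer generators over every
    # passage, yielding potentially thousands of candidate answers.
    candidates = []
    for passage in passages:
        candidates.extend(extract_candidates(passage, analysis))  # hypothetical
    return analysis, passages, candidates
```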
00:23:31.580 | - And speed, so "better" is accuracy and speed.
00:23:35.500 | - Correct, so right, speed and accuracy,
00:23:38.180 | for the most part, were separated.
00:23:40.580 | We handled that sort of in separate ways.
00:23:42.260 | Like, I focused purely on accuracy, and in accuracy,
00:23:45.220 | are we ultimately getting more questions right
00:23:46.940 | and producing more accurate confidences?
00:23:48.780 | And then a whole nother team
00:23:50.220 | that was constantly analyzing the workflow
00:23:52.460 | to find the bottlenecks,
00:23:53.860 | and then figuring out how to both parallelize
00:23:55.780 | and drive the algorithm speed.
00:23:58.100 | But anyway, so now think of it like,
00:23:59.980 | again, you have this big fan out now, right?
00:24:01.740 | Because you had multiple queries,
00:24:03.660 | now you have thousands of candidate answers.
00:24:06.980 | For each candidate answer, you're gonna score it.
00:24:12.420 | So you're gonna use all the data that was built up.
00:24:12.420 | You're gonna use the question analysis.
00:24:15.460 | You're gonna use how the query was generated.
00:24:17.580 | You're gonna use the passage itself.
00:24:19.820 | And you're gonna use the candidate answer
00:24:21.620 | that was generated.
00:24:22.900 | And you're gonna score that.
00:24:25.460 | So now we have a group of researchers
00:24:28.020 | coming up with scorers.
00:24:29.700 | There are hundreds of different scorers.
00:24:31.900 | So now you're getting a fan out of it again,
00:24:34.180 | from however many candidate answers you have,
00:24:36.980 | to all the different scorers.
00:24:38.860 | So if you have 200 different scorers
00:24:40.860 | and you have a thousand candidates,
00:24:42.020 | now you have 200,000 scores.
00:24:44.780 | And so now you gotta figure out,
00:24:46.700 | how do I now rank these answers
00:24:51.940 | based on the scores that came back?
00:24:54.060 | And I wanna rank them based on the likelihood
00:24:55.900 | that they're a correct answer to the question.
00:24:58.240 | So every scorer was its own research project.
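Sketching that scoring fan-out: each scorer examines a candidate in the context of the question analysis and its source passage, which with roughly 200 scorers and 1,000 candidates yields the 200,000 scores mentioned above (what a "scorer" means is unpacked just below).

```python
def score_candidates(candidates, analysis, scorers):
    """Apply every scorer to every candidate: ~200 scorers x ~1,000
    candidates gives ~200,000 scores. Each scorer is a function
    (candidate, analysis) -> float in [0, 1]."""
    return [[scorer(c, analysis) for scorer in scorers] for c in candidates]

def standardize(column):
    """Z-score one scorer's column; both raw and standardized scores
    were reportedly examined downstream."""
    mean = sum(column) / len(column)
    std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
    return [(x - mean) / std if std else 0.0 for x in column]
```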
00:25:00.980 | - What do you mean by scorer?
00:25:01.940 | So is that the annotation process
00:25:03.660 | of basically a human being saying that this answer
00:25:07.180 | has a quality of-- - Yeah, think of it
00:25:10.360 | this way: if you wanna think about
00:25:12.780 | what a human would be doing,
00:25:13.620 | a human would be looking at a possible answer.
00:25:16.620 | They'd take the candidate, say, Emily Dickinson,
00:25:20.440 | and they'd be reading the passage in which it occurred.
00:25:23.100 | They'd be looking at the question,
00:25:24.940 | and they'd be making a decision of how likely it is
00:25:27.940 | that Emily Dickinson, given this evidence in this passage,
00:25:31.840 | is the right answer to that question.
00:25:33.500 | - Got it, so that's the annotation task.
00:25:35.780 | That's the annotation task. - That's the scoring task.
00:25:38.380 | - But scoring implies zero to one kind of continuous--
00:25:40.880 | - That's right, you give it a zero to one score.
00:25:42.560 | - So it's not a binary-- - No, you give it a score.
00:25:45.560 | Give it a zero, yeah, exactly, a zero to one score.
00:25:47.960 | - So what humans do, give different scores,
00:25:50.080 | so you have to somehow normalize and all that kind of stuff
00:25:52.520 | that deal with all that complexity.
00:25:53.840 | - Depends on what your strategy is.
00:25:55.440 | - It could be relative, too.
00:25:57.640 | It could be-- - We actually looked
00:26:00.400 | at the raw scores as well as standardized scores,
00:26:02.320 | because humans are not involved in this.
00:26:04.560 | Humans are not involved. - Sorry, so I'm misunderstanding
00:26:06.920 | the process here.
00:26:08.260 | This is passages.
00:26:10.020 | Where is the ground truth coming from?
00:26:12.860 | - Ground truth is only the answers to the questions.
00:26:15.460 | - So it's end to end. - It's end to end.
00:26:18.580 | So I was always driving end to end performance.
00:26:21.900 | It was a very interesting engineering approach,
00:26:26.900 | and ultimately scientific and research approach,
00:26:29.620 | always driving end to end.
00:26:30.780 | Now, that's not to say we wouldn't make hypotheses
00:26:36.780 | that individual component performance was related
00:26:41.780 | in some way to end to end performance.
00:26:43.940 | Of course we would, because people would have to
00:26:46.600 | build individual components.
00:26:48.420 | But ultimately, to get your component
00:26:50.420 | integrated into the system, you had to show impact
00:26:53.060 | on end to end performance, question answering performance.
00:26:55.500 | - So there's many very smart people working on this,
00:26:57.940 | and they're basically trying to sell their ideas
00:27:01.080 | as a component that should be part of the system.
00:27:02.980 | - That's right, and they would do research
00:27:05.580 | on their component, and they would say things like,
00:27:08.860 | you know, I'm gonna improve this as a candidate generator,
00:27:12.660 | or I'm gonna improve this as a question score,
00:27:15.380 | or as a passage score, I'm gonna improve this,
00:27:18.620 | or as a parser, and I can improve it by 2%
00:27:23.500 | on its component metric, like a better parse,
00:27:26.300 | or a better candidate, or a better type estimation,
00:27:28.740 | or whatever it is.
00:27:29.780 | And then I would say, I need to understand
00:27:32.160 | how the improvement on that component metric
00:27:34.880 | is gonna affect the end to end performance.
00:27:37.280 | If you can't estimate that, and can't do experiments
00:27:40.080 | to demonstrate that, it doesn't get in.
00:27:42.940 | - That's like the best run AI project I've ever heard.
00:27:47.100 | That's awesome, okay.
00:27:48.600 | What breakthrough would you say,
00:27:51.380 | like, I'm sure there's a lot of day to day breakthroughs,
00:27:53.820 | but was there like a breakthrough
00:27:55.180 | that really helped improve performance?
00:27:57.440 | Like where people began to believe?
00:27:59.740 | Or is it just a gradual process?
00:28:02.060 | - Well, I think it was a gradual process,
00:28:04.060 | but one of the things that I think gave people confidence
00:28:08.500 | that we can get there was that,
00:28:11.140 | as we follow this procedure of different ideas,
00:28:16.140 | build different components,
00:28:18.700 | plug them into the architecture, run the system,
00:28:20.740 | see how we do, do the error analysis,
00:28:24.220 | start off new research projects to improve things,
00:28:27.660 | and the very important idea that
00:28:32.100 | the people doing individual component work
00:28:36.020 | did not have to deeply understand
00:28:39.180 | everything that was going on with every other component.
00:28:41.780 | And this is where we leverage machine learning
00:28:44.620 | in a very important way.
00:28:46.940 | So while individual components could be
00:28:49.020 | statistically driven machine learning components,
00:28:50.820 | some of them were heuristic,
00:28:52.300 | some of them were machine learning components,
00:28:54.160 | the system as a whole combined all the scores
00:28:57.660 | using machine learning.
00:29:00.100 | This was critical because that way you can divide and conquer
00:29:03.940 | so you can say, okay, you work on your candidate generator,
00:29:07.060 | or you work on this approach to answer scoring,
00:29:09.300 | you work on this approach to type scoring,
00:29:11.340 | you work on this approach to passage search
00:29:14.060 | or to passage selection and so forth.
00:29:15.900 | But when we just plug it in,
00:29:19.140 | and we had enough training data to say,
00:29:21.580 | now we can train and figure out how do we weigh
00:29:26.100 | all the scores relative to each other
00:29:28.900 | based on predicting the outcome,
00:29:31.460 | which is right or wrong on Jeopardy.
00:29:33.500 | And we had enough training data to do that.
00:29:36.340 | So this enabled people to work independently
00:29:40.180 | and to let the machine learning do the integration.
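In spirit, that final merge is a supervised model over scorer outputs trained against end-to-end right/wrong labels. A minimal stand-in using scikit-learn's logistic regression on synthetic data; the real model and features were far more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one row of scorer outputs per
# (question, candidate) pair, labeled 1 if that candidate was the
# correct Jeopardy answer and 0 otherwise.
rng = np.random.default_rng(0)
score_rows = rng.random((5000, 200))    # 200 scorer outputs per candidate
labels = rng.integers(0, 2, size=5000)  # stand-in right/wrong labels

model = LogisticRegression(max_iter=1000)
model.fit(score_rows, labels)  # learns how to weigh the scorers together

def rank_candidates(candidates, score_matrix):
    """Order candidates by learned probability of being correct; the top
    probability doubles as the confidence used in the buzz decision."""
    probabilities = model.predict_proba(score_matrix)[:, 1]
    return sorted(zip(candidates, probabilities), key=lambda cp: -cp[1])
```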
00:29:42.900 | - Beautiful, so yeah, the machine learning
00:29:44.700 | is doing the fusion,
00:29:45.900 | and then it's a human orchestrated ensemble
00:29:48.580 | of different approaches.
00:29:49.980 | That's great.
00:29:51.520 | Still impressive that you were able to get it done
00:29:55.140 | in a few years.
00:29:57.180 | That's not obvious to me that it's doable
00:30:00.020 | if I just put myself in that mindset.
00:30:02.940 | But when you look back at the Jeopardy challenge,
00:30:05.480 | again, when you're looking up at the stars,
00:30:09.860 | what are you most proud of?
00:30:11.340 | Just looking back at those days.
00:30:15.020 | - I'm most proud of my...
00:30:24.100 | My commitment and my team's commitment
00:30:30.500 | to be true to the science,
00:30:33.500 | to not be afraid to fail.
00:30:37.660 | - That's beautiful, because there's so much pressure,
00:30:41.180 | because it is a public event, it is a public show,
00:30:44.020 | and yet you were dedicated to the idea.
00:30:46.580 | - That's right.
00:30:50.100 | - Do you think it was a success?
00:30:52.740 | In the eyes of the world, it was a success.
00:30:54.940 | By your, I'm sure, exceptionally high standards,
00:30:59.300 | is there something you regret you would do differently?
00:31:03.300 | - It was a success.
00:31:07.460 | It was a success for our goal.
00:31:09.700 | Our goal was to build the most advanced
00:31:12.940 | open domain question answering system.
00:31:14.940 | We went back to the old problems
00:31:17.980 | that we used to try to solve,
00:31:19.540 | and we did dramatically better on all of them,
00:31:22.740 | as well as we beat Jeopardy.
00:31:25.700 | So we won at Jeopardy.
00:31:27.340 | So it was a success.
00:31:29.740 | I worry that the world would not understand it as a success
00:31:35.740 | because it came down to only one game.
00:31:38.300 | And I knew, statistically speaking,
00:31:40.020 | this can be a huge technical success,
00:31:41.860 | and we could still lose that one game.
00:31:43.460 | And that's a whole 'nother theme of the journey.
00:31:46.860 | But it was a success.
00:31:49.860 | It was not a success in natural language understanding,
00:31:53.220 | but that was not the goal.
00:31:54.520 | - Yeah, that was, but I would argue,
00:32:00.260 | I understand what you're saying in terms of the science,
00:32:03.740 | but I would argue that the inspiration of it, right,
00:32:07.260 | not a success in terms of solving
00:32:10.780 | natural language understanding,
00:32:12.380 | but it was a success of being an inspiration
00:32:15.900 | to future challenges.
00:32:17.500 | - Absolutely.
00:32:18.420 | - That drive future efforts.