David Ferrucci: The Story of IBM Watson Winning in Jeopardy | AI Podcast Clips
Chapters
0:00 What is Jeopardy
6:00 The Story of IBM Watson
10:50 The Challenges
15:30 Web Search
21:30 Question Analysis
26:19 Performance
30:16 Success
And you were a lead in that, a critical part of that.
- Right, so it's to take a question and answer it.
It's really not, it's really to get a question and answer,
So this notion of like, it really relates to some fact
determine whether or not you know the answer,
because the questions are not asked directly, right?
- They're all like, the way the questions are asked
And that's sort of an interesting realization
and you're still trying to process the question,
and the champions have answered and moved on.
They're like three questions ahead
by the time you figured out what the question even meant.
if you look back at the Jeopardy games much earlier.
and subtle, and nuanced, and humorous, and witty over time,
in figuring out what the question was even asking.
So yeah, you have to figure out what the question's even asking.
And because you have to buzz in really quickly,
Otherwise, you lose the opportunity to buzz in.
- Maybe even before you really know if you know the answer.
and they'll look at it, process it very superficially.
In other words, what's the topic, what are some keywords,
They'll just assume they know all about Jeopardy
Watson, interestingly, didn't even come close
which is like how many of all the Jeopardy questions,
how many could we even find the right answer for,
Like could we come up with, if we had a big body of knowledge
of something in the order of several terabytes,
I mean, from a web scale, it was actually very small.
I was talking about millions of books, right?
And so it was important to get a very quick sense of,
do you think you know the right answer to this question?
and at least spend some time essentially answering it
And that would depend on what else was going on in the game,
where I have to take a guess, I have very little to lose,
- So that was accounted for: the financial standings,
where you were in the standings, things like that.
- So, I mean, we targeted answering in under three seconds.
whereas like we would say, let's estimate our confidence,
which was sort of a shallow answering process.
and then we may take another second or something
But by and large, we're saying like, we can't play the game.
We can't even compete if we can't, on average,
answer these questions in around three seconds or less.
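A minimal sketch of the confidence-gated buzz decision described above; the thresholds, the score-deficit adjustment, and all names are invented for illustration, not Watson's actual logic:

```python
# Hypothetical confidence-gated buzz decision. Thresholds and the
# game-state adjustment are illustrative assumptions only.
def should_buzz(confidence: float,
                my_score: int,
                leader_score: int,
                base_threshold: float = 0.5) -> bool:
    """Buzz only if estimated confidence clears a threshold that
    loosens when trailing (little to lose) and tightens when ahead."""
    deficit = leader_score - my_score
    if deficit > 0:
        # Trailing: lower the bar, a wrong guess costs relatively little.
        threshold = base_threshold - min(0.2, deficit / 50000)
    else:
        # Leading: raise the bar to protect the lead.
        threshold = base_threshold + 0.1
    return confidence >= threshold

print(should_buzz(0.45, my_score=2000, leader_score=12000))  # True
```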
so there's these three humans playing a game,
and you stepped in with the idea that IBM Watson
Can you tell the story of Watson taking on this game?
- Yeah, so the story was that it was coming up,
I think, to the 10-year anniversary of Deep Blue.
IBM wanted to do sort of another kind of really,
and the kind of cool stuff that we were doing.
I had been working in AI at IBM for some time.
which is, we're not gonna tell you what the questions are,
we're not even gonna tell you what they're about.
Can you go off and get accurate answers to these questions?
And it was an area of AI research that I was involved in.
And so it was a very specific passion of mine.
Language understanding had always been a passion of mine.
factoid, meaning it essentially had an answer,
and being able to do that accurately and quickly.
So that was a research area that my team had already been in.
several IBM executives, like, what are we gonna do?
This was like, whatever it was, 2004, I think,
And someone thought, hey, that would be really cool
And everyone was telling the research execs, no way.
And we had some pretty senior people in the field
And it would come across my desk and I was like,
but that's kind of what I'm really interested in doing.
this is nuts, we're not gonna risk IBM's reputation on this,
And this happened in 2004, it happened in 2005.
At the end of 2006, it was coming around again.
I was doing the open domain question answering stuff,
but I was coming off of a couple other projects.
And I argued it would be crazy not to do this.
What was your estimation of the problem at that time?
And a lot of people thought it was impossible.
was because I did some brief experimentation.
for a lot of the points that we mentioned earlier.
None of this stuff had been done well enough before.
were the kinds of technologies that should work.
But more to the point, what was driving me was,
And this is the kind of stuff we were supposed to do.
In other words, we were basically supposed to--
- I mean, we were supposed to take things and say,
if we have the opportunity, to push it to the limits.
to understand more deeply why we can't do it.
At the very least, we'd be able to come out and say,
Here's what we tried and here's how we failed.
So I was very driven as a scientist from that perspective.
and sort of a high-level architectural approach
it was performing very poorly in the beginning.
So what were the initial approaches and why did they fail?
- Well, there were lots of hard aspects to it.
I mean, one of the reasons why prior approaches
was because the questions were difficult to interpret.
in other words, we're gonna ask about these five types.
The answer will be a person of this type, right?
there were like tens of thousands of these things.
And so even if you focused on trying to encode the types
at the very top, like the five that were the most frequent,
you still cover a very small percentage of the data.
So you couldn't take that approach of saying,
And so we came up with an approach toward that.
to handle that problem throughout the project.
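To make that long tail concrete, here is a toy computation with invented frequencies showing why covering even the most frequent answer types handles only a small slice of the questions:

```python
# Illustrative long tail of answer types: even the most frequent
# types cover only a small slice of all questions. Frequencies
# below are invented for the example.
from collections import Counter

lat_counts = Counter({"he": 300, "country": 250, "city": 220,
                      "man": 180, "film": 150})
lat_counts.update({f"rare_type_{i}": 1 for i in range(20000)})

total = sum(lat_counts.values())
top5 = sum(c for _, c in lat_counts.most_common(5))
print(f"Top 5 types cover {top5 / total:.1%} of questions")
# With tens of thousands of one-off types, the head of the
# distribution covers only a small percentage of the data.
```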
The other issue was that right from the outset,
I committed to doing this in three to five years.
But one of the things about putting that out there was,
and I knew how hard the language understanding problem was.
I said, we're not going to actually understand language
and the domain of knowledge that the question refers to
and reason over that to answer these questions.
At the same time, simple search wasn't good enough
to confidently answer with a single correct answer.
and practical engineering.
So you're not trying to solve the general NLU problem.
You're saying, let's solve this in any way possible.
And there was another thing I believed.
I believed that the fundamental NLP technologies
and machine learning technologies would be adequate.
And this was an issue of how do we enhance them?
who had said, we're going to need Maxwell's equations
And I said, if we need some fundamental formula
that breaks new ground in how we understand language,
Like, my assumption is I'm not counting
What I'm counting on is the ability to take everything
the field has done before to figure out an architecture
on how to integrate it well, and then see where it breaks
and make the necessary advances we need to make
I mean, that's the Elon Musk approach with rockets,
- I happened to be right, but, like, we didn't know.
So if you were to do that, what's the brute force solution?
- Look, web search has come a long way, even since then.
But at the time, like, you know, first of all,
I mean, there were a couple of other constraints
The device, if the device is as big as a room, fine,
it's as big as a room, but we want a self-contained device,
So it had to kind of fit in a shoebox, if you will,
But also, you couldn't just get out there.
You couldn't go off network, right?
But then we did, and the basic thing was, go do a web search.
Problem was, even when we went and did a web search,
somewhere in the order of 65% of the time,
out of the top 20 documents, top 10 documents,
whatever it was, which we didn't know how to do,
unless we had enough confidence in it, right?
you'd have to have confidence it was the right answer.
to now go buzz in, and you'd still only get 65% of them
right, which doesn't even put you in the winner's circle.
even if I had somewhere in the top 10 documents,
how do I figure out where in the top 10 documents
So it's not like I go in knowing the right answer
How do I, as a machine, go out and figure out
- First of all, if you pause on that, just think about it.
- Well, we solved that in some definition of solved,
So how do you take a body of work on a particular topic
And we ultimately did find the body of knowledge
I mean, it included Wikipedia and a bunch of other stuff.
like WordNet and other types of semantic resources.
In other words, where we went out and took that content
and then expanded it based on producing statistical,
that wasn't in the core resources and expand it.
but still, again, from a web scale perspective,
broke it down into all those individual words
and we did syntactic and semantic parses on it,
and we indexed that in a very rich and very fast index.
let's say the equivalent of, for the sake of argument,
We've now analyzed all that, blowing up its size even more
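A toy sketch of the ingestion step just described: analyze each passage and index the enriched record for fast retrieval. The split-based "analysis" here is a stand-in for real syntactic and semantic parsing; none of these names come from Watson itself:

```python
# Toy corpus ingestion: analyze each passage and store the enriched
# record in an index keyed by its terms. The "analyze" step is a
# stand-in for real syntactic/semantic parsing.
from collections import defaultdict

index = defaultdict(list)          # term -> list of passage ids
passages = {}                      # passage id -> enriched record

def analyze(text: str) -> dict:
    tokens = text.lower().split()  # stand-in for tokenization + parsing
    return {"text": text, "tokens": tokens}

def ingest(pid: int, text: str) -> None:
    record = analyze(text)
    passages[pid] = record
    for term in set(record["tokens"]):
        index[term].append(pid)

ingest(0, "Emily Dickinson wrote poems in Amherst")
print(index["dickinson"])          # -> [0]
```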
I mean, I know, maybe this is 2008, 2009,
Like how hard is the infrastructure component,
- But close to 3000 cores, completely connected.
- And they were sharing memory in some kind of way.
- So if I went and tried to find a piece of content,
So therein lies, you know, the Watson architecture, right?
We'd try to figure out, what is it asking about?
That might be represented as a simple string,
So anyway, the bottom line is we would do a bunch
And question analysis had to finish, and had to finish fast.
using open source search engines; we modified them.
We had a number of different search engines we would use
ultimately to now take our question analysis,
produce multiple queries based on different interpretations
And so now let's say you had a thousand passages.
So you went out and you parallelized the search.
For each passage now, you'd go and figure out
coming up with better ways to generate search queries
from the questions, better ways to analyze the question,
- And speed. So better means accuracy and speed.
Like, I focused purely on accuracy, end-to-end accuracy,
and then figuring out how to both parallelize
For each candidate answer, you're gonna score it.
So you're gonna use all the data that was built up.
You're gonna use how the query was generated.
from however many candidate answers you have,
And I wanna rank them based on the likelihood
that they're a correct answer to the question.
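Pulling those steps together, a heavily simplified sketch of the pipeline shape described here: analyze the question, generate multiple queries, search in parallel, extract candidate answers from passages, score, and rank. Every function body is a placeholder, not DeepQA's actual API:

```python
# Simplified sketch of the described pipeline shape. Every function
# body is a placeholder; only the flow mirrors the account:
# analyze -> multiple queries -> search -> candidates -> score -> rank.
from concurrent.futures import ThreadPoolExecutor

def analyze_question(q):           # topic, answer type, keywords...
    return {"keywords": q.lower().split()}

def generate_queries(analysis):    # multiple interpretations
    return [" ".join(analysis["keywords"])]

def search(query):                 # returns passages (placeholder)
    return [f"passage for: {query}"]

def extract_candidates(passage):   # candidate answers from a passage
    return [passage.split(":")[-1].strip()]

def score(candidate, analysis):    # one of many scorers, 0..1
    return 0.5

def answer(question):
    analysis = analyze_question(question)
    queries = generate_queries(analysis)
    with ThreadPoolExecutor() as pool:          # parallelize the search
        passage_lists = list(pool.map(search, queries))
    candidates = [c for ps in passage_lists
                    for p in ps
                    for c in extract_candidates(p)]
    ranked = sorted(candidates,
                    key=lambda c: score(c, analysis), reverse=True)
    return ranked[0] if ranked else None

print(answer("This poet wrote 'Because I could not stop for Death'"))
```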
So every scorer was its own research project.
of basically a human being saying that this answer
if you wanna think of it that way,
a human would be looking at a possible answer.
they'd be reading the passage in which that occurred.
and they'd be making a decision of how likely it is
that Emily Dickinson, given this evidence in this passage,
That's the annotation task.
- That's the scoring task.
- But scoring implies zero to one kind of continuous--
- That's right, you give it a zero to one score.
- So it's not a binary--
- No, you give it a score.
Yeah, exactly, a zero to one score.
so you have to somehow normalize and all that kind of stuff,
looking at the raw scores as well as standardized scores.
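If "standardized scores" is read as z-scores, a minimal sketch of standardizing one scorer's outputs would look like this (an assumption about what standardization meant here):

```python
# Minimal z-score standardization of one scorer's outputs, assuming
# "standardized scores" means something along these lines.
from statistics import mean, stdev

def standardize(scores):
    mu, sigma = mean(scores), stdev(scores)
    return [(s - mu) / sigma for s in scores]

raw = [0.10, 0.40, 0.35, 0.90]
print(standardize(raw))  # each score in std deviations from the mean
```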
Humans are not involved.
- Sorry, so I'm misunderstanding--
- Ground truth is only the answers to the questions.
So I was always driving end-to-end performance.
It was a very interesting engineering approach,
and ultimately scientific and research approach,
Now, that's not to say we wouldn't make hypotheses
that individual component performance was related
Of course we would, because people would have to
integrated into the system, you had to show impact
on end-to-end performance, question answering performance.
- So there's many very smart people working on this,
and they're basically trying to sell their ideas
as a component that should be part of the system.
on their component, and they would say things like,
you know, I'm gonna improve this as a candidate generator,
or I'm gonna improve this as a question scorer,
or as a passage scorer, I'm gonna improve this,
on its component metric, like a better parse,
or a better candidate, or a better type estimation,
If you can't estimate that, and can't do experiments
- That's like the best-run AI project I've ever heard of.
like, I'm sure there's a lot of day-to-day breakthroughs,
but one of the things that I think gave people confidence
as we followed this procedure of different ideas,
plug them into the architecture, run the system,
start off new research projects to improve things,
everything that was going on with every other component.
And this is where we leveraged machine learning
statistically driven machine learning components,
some of them were machine learning components,
the system as a whole combined all the scores
This was critical because that way you can divide and conquer,
so you can say, okay, you work on your candidate generator,
or you work on this approach to answer scoring,
now we can train and figure out how to weigh
and let the machine learning do the integration.
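A minimal sketch of that final combination step, assuming a simple logistic model over the scorers' outputs trained against question-level ground truth; purely illustrative, since the real merging and ranking machinery was far more elaborate:

```python
# Tiny logistic-regression combiner: learn weights over the scorers'
# outputs from ground-truth labels, then rank candidates by predicted
# probability of being correct. Purely illustrative.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(features, labels, lr=0.5, epochs=2000):
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

# Each row: outputs of several answer scorers for one candidate.
X = [[0.9, 0.8], [0.2, 0.1], [0.7, 0.9], [0.3, 0.4]]
y = [1, 0, 1, 0]        # 1 = candidate was the correct answer

w = train(X, y)
ranked = sorted(X, reverse=True,
                key=lambda x: sigmoid(sum(wi * xi for wi, xi in zip(w, x))))
print(ranked[0])        # feature vector of the top-ranked candidate
```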
Still impressive that you were able to get it done
But when you look back at the Jeopardy challenge,
- That's beautiful because there's so much pressure,
because it is a public event, it is a public show.
By your, I'm sure, exceptionally high standards,
is there something you regret you would do differently?
and we did dramatically better on all of them,
I worried that the world would not understand it as a success
And that's a whole 'nother theme of the journey.
It was not a success in natural language understanding,
- I understand what you're saying in terms of the science,
but I would argue that the inspiration of it, right,