
David Ferrucci: The Story of IBM Watson Winning in Jeopardy | AI Podcast Clips


Chapters

0:00 What is Jeopardy
6:00 The Story of IBM Watson
10:50 The Challenges
15:30 Web Search
21:30 Question Analysis
26:19 Performance
30:16 Success

Transcript

- So one of the greatest accomplishments in the history of AI is Watson competing in the game of Jeopardy against humans. And you were a lead in that, a critical part of that. Let's start at the very basics. What is the game of Jeopardy? The game for us humans, human versus human.

- Right, so it's to take a question and answer it. The game of Jeopardy, well-- - It's just the opposite. - Actually, well, no, it's not, right? It's really to get a question and answer it, but it's what we call a factoid question. It really relates to some fact that few people would argue about whether the fact is true or not.

In fact, most people wouldn't. Jeopardy kind of counts on the idea that these statements have factual answers. And the idea is to, first of all, determine whether or not you know the answer, which is sort of an interesting twist. - So first of all, understand the question. - You have to understand the question.

What is it asking? And that's a good point, because the questions are not asked directly, right? - They're all like, the way the questions are asked is nonlinear. It's like, it's a little bit witty, it's a little bit playful sometimes. It's a little bit tricky. - Yeah, they're asked in, exactly, in numerous witty, tricky ways.

Exactly what they're asking is not obvious. It takes inexperienced humans a while to go, what is it even asking? And that's sort of an interesting realization that you have when somebody says, oh, Jeopardy is a question answering show. And then it's like, oh, I know a lot. And then you read it, and you're still trying to process the question, and the champions have answered and moved on.

They're like three questions ahead by the time you've figured out what the question even meant. So there's definitely an ability there to just parse out what the question even is. So that was certainly challenging. It's interesting, historically, though, if you look back at the Jeopardy games much earlier.

- Like '60s, '70s, that kind of thing. - The questions were much more direct. They weren't quite like that. They got sort of more and more interesting. The way they asked them, that sort of got more and more interesting, and subtle, and nuanced, and humorous, and witty over time, which really required the human to kind of make the right connections in figuring out what the question was even asking.

So yeah, you have to figure out what the question is even asking. Then you have to determine whether or not you think you know the answer. And because you have to buzz in really quickly, you sort of have to make that determination as quickly as you possibly can. Otherwise, you lose the opportunity to buzz in.

- Maybe even before you really know if you know the answer. - I think a lot of humans will assume and they'll look at it, process it very superficially. In other words, what's the topic, what are some keywords, and just say, do I know this area or not before they actually know the answer?

Then they'll buzz in and think about it. So it's interesting what humans do. Now some people who know all things, like Ken Jennings or something, or the more recent Big Jeopardy player, I mean, they'll just buzz in. They'll just assume they know all about Jeopardy and they'll just buzz in.

Watson, interestingly, didn't even come close to knowing all of Jeopardy, right? Watson really-- - Even at the peak, even at its best. - Yeah, so for example, I mean, we had this thing called recall, which is like how many of all the Jeopardy questions, how many could we even find the right answer for, like anywhere?

Like, could we come up with it, if we had a big body of knowledge, something on the order of several terabytes? I mean, from a web scale it was actually very small, but from a book scale we're talking about millions of books, right? So, the equivalent of millions of books, encyclopedias, dictionaries, so it's still a ton of information.

And I think for only about 85% of the questions was the answer anywhere to be found. So you're already down at that level just to get started, right? And so it was important to get a very quick sense of, do you think you know the right answer to this question?
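As a side note, here is a minimal sketch of how that kind of recall, or answer-coverage, number could be measured. The function name and the simple containment check are purely illustrative, not Watson's actual code:

```python
# Illustrative sketch: estimate "recall" in this sense -- the fraction of
# questions whose correct answer exists anywhere in the collected content.
# `qa_pairs` and the substring check are toy stand-ins, not Watson's real data.

def corpus_recall(qa_pairs: list[tuple[str, str]], corpus_text: str) -> float:
    corpus_lower = corpus_text.lower()
    found = sum(1 for _question, answer in qa_pairs
                if answer.lower() in corpus_lower)
    return found / len(qa_pairs)

# If 85 out of 100 answers can be found at all, recall is 0.85 -- and that
# caps end-to-end accuracy before any question analysis or scoring happens.
```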

So we had to compute that confidence as quickly as we possibly could. So in effect, we had to answer it and at least spend some time essentially answering it and then judging the confidence that our answer was right, and then deciding whether or not we were confident enough to buzz in.

And that would depend on what else was going on in the game, because there was a risk. So like, if you're really in a situation where you have to take a guess and have very little to lose, then you'll buzz in with less confidence. - So that accounted for the financial standings of the different competitors.

- Correct. How much of the game was left, how much time was left, where you were in the standings, things like that. - How many hundreds of milliseconds are we talking about here? Do you have a sense of what the target is? - So, I mean, we targeted answering in under three seconds.

And-- - Buzzing in, so the decision to buzz in and then the actual answering, are those two different stages? - Yeah, they were two different things. In fact, we had multiple stages, whereas like we would say, let's estimate our confidence, which was sort of a shallow answering process. And then ultimately decide to buzz in, and then we may take another second or something to kind of go in there and do that.
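To make that two-stage flow concrete, here is a minimal sketch with invented thresholds and a hypothetical game-state adjustment; none of the numbers or function names come from Watson itself:

```python
# Illustrative two-stage buzz decision: a fast, shallow confidence estimate
# gates the buzz within the time budget, and a deeper pass refines the answer
# only after winning the buzz. All names and thresholds here are invented.

BASE_THRESHOLD = 0.50  # hypothetical confidence needed in a neutral game state

def buzz_threshold(score_deficit: float, clues_remaining: int) -> float:
    """Lower the bar when far behind with few clues left (accept more risk)."""
    desperation = (score_deficit / 1000.0) / max(clues_remaining, 1)
    return max(0.20, BASE_THRESHOLD - min(desperation, 0.30))

def decide_to_buzz(shallow_confidence: float,
                   score_deficit: float,
                   clues_remaining: int) -> bool:
    # Stage 1: this shallow estimate must finish within the ~3-second budget.
    return shallow_confidence >= buzz_threshold(score_deficit, clues_remaining)

# Stage 2 (only if the buzz is won): spend the extra second or so re-ranking
# candidates and committing to the top answer.
```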

But by and large, we're saying like, we can't play the game. We can't even compete if we can't, on average, answer these questions in around three seconds or less. - So you stepped in, so there's these three humans playing a game, and you stepped in with the idea that IBM Watson would be one of, replace one of the humans and compete against two.

Can you tell the story of Watson taking on this game? - Sure. - It seems exceptionally difficult. - Yeah, so the story was that it was coming up, I think, to the 10-year anniversary of Big Blue. Not Big Blue, Deep Blue. IBM wanted to do sort of another kind of really, fun challenge, public challenge, that can bring attention to IBM research and the kind of the cool stuff that we were doing.

I had been working in AI at IBM for some time. I had a team doing what's called open domain factoid question answering, which is, we're not gonna tell you what the questions are, we're not even gonna tell you what they're about. Can you go off and get accurate answers to these questions?

And it was an area of AI research that I was involved in. And so it was a very specific passion of mine. Language understanding had always been a passion of mine. One sort of narrow slice on whether or not you could do anything with language was this notion of open domain, meaning I could ask anything about anything, factoid, meaning it essentially had an answer, and being able to do that accurately and quickly.

So that was a research area that my team had already been in. And so completely independently, several IBM executives, like, what are we gonna do? What's the next cool thing to do? And Ken Jennings was on his winning streak. This was like, whatever it was, 2004, I think, was on his winning streak.

And someone thought, hey, that would be really cool if the computer can play Jeopardy. And so this was like in 2004, they were shopping this thing around. And everyone was telling the research execs, no way. Like, this is crazy. And we had some pretty senior people in the field and they're saying, no, this is crazy.

And it would come across my desk and I was like, but that's kind of what I'm really interested in doing. But there was such this prevailing sense of, this is nuts, we're not gonna risk IBM's reputation on this, we're just not doing it. And this happened in 2004, it happened in 2005.

At the end of 2006, it was coming around again. And I was coming off of a, I was doing the open domain question answering stuff, but I was coming off of a couple other projects. I had a lot more time to put into this. And I argued that it could be done.

And I argued it would be crazy not to do this. - You can be honest at this point. So even though you argued for it, what's the confidence that you had yourself, privately, that this could be done? You just told the story of how you tell stories to convince others.

How confident were you? What was your estimation of the problem at that time? - So I thought it was possible. And a lot of people thought it was impossible. I thought it was possible. A reason why I thought it was possible was because I did some brief experimentation. I knew a lot about how we were approaching open domain factoid question answering.

We've been doing it for some years. I looked at the Jeopardy stuff. I said, this is gonna be hard for a lot of the points that we mentioned earlier. Hard to interpret the question, hard to do it quickly enough, hard to compute an accurate confidence. None of this stuff had been done well enough before.

But a lot of the technologies we're building were the kinds of technologies that should work. But more to the point, what was driving me was, I was in IBM Research. I was a senior leader in IBM Research. And this is the kind of stuff we were supposed to do.

In other words, we were basically supposed to-- - This is the moonshot. This is the-- - I mean, we were supposed to take things and say, this is an active research area. It's our obligation to kind of, if we have the opportunity, to push it to the limits. And if it doesn't work, to understand more deeply why we can't do it.

And so I was very committed to that notion, saying, folks, this is what we do. It's crazy not to do this. This is an active research area. We've been in this for years. Why wouldn't we take this grand challenge and push it as hard as we can? At the very least, we'd be able to come out and say, here's why this problem is way hard.

Here's what we tried and here's how we failed. So I was very driven as a scientist from that perspective. And then I also argued, based on a feasibility study we did, why I thought it was hard but possible. And I showed examples of where it succeeded, where it failed, why it failed, and sort of a high-level architectural approach for why we should do it.

But for the most part, at that point, the execs really were just looking for someone crazy enough to say yes. Because for several years at that point, everyone had said no. I'm not willing to risk my reputation and my career on this thing. - Clearly, you did not have such fears.

- I did not. - So you dived right in, and yet, from what I understand, it was performing very poorly in the beginning. So what were the initial approaches and why did they fail? - Well, there were lots of hard aspects to it. I mean, one of the reasons why prior approaches that we had worked on in the past failed was because the questions were difficult to interpret.

Like, what are you even asking for? Very often, even if the question was very direct, like, what city, it could be tricky. But with what city or what person, often it would name the type very clearly, and you would know that. And if there were just a small set of them, in other words, we're gonna ask about these five types.

Like, it's gonna be an answer, and the answer will be a city in this state or a city in this country. The answer will be a person of this type, right? Like an actor or whatever it is. But it turns out that in "Jeopardy!" there were like tens of thousands of these things.

And it was a very, very long tail. Meaning, you know, it just went on and on. And so even if you focused on trying to encode the types at the very top, let's say the five most frequent, you'd still cover a very small percentage of the data.

So you couldn't take that approach of saying, I'm just going to try to collect facts about these five or 10 types or 20 types or 50 types or whatever. So that was like one of the first things, like, what do you do about that? And so we came up with an approach toward that.

And the approach looked promising. And we continued to improve our ability to handle that problem throughout the project. The other issue was that right from the outset, I said, we're not going to, I committed to doing this in three to five years. So we did it in four. So I got lucky.

But one of the things about putting that stake in the ground was that I knew how hard the language understanding problem was. I said, we're not going to actually understand language to solve this problem. We are not going to interpret the question and the domain of knowledge that the question refers to, and reason over that, to answer these questions.

Obviously, we're not going to be doing that. At the same time, simple search wasn't good enough to confidently answer with a single correct answer. - First of all, that's brilliant. That's such a great mix of innovation and practical engineering, three, four years. So you're not trying to solve the general NLU problem.

You're saying, let's solve this in any way possible. - Oh, yeah, no, I was committed to saying, look, we're just solving the open domain question answering problem. We're using Jeopardy as a driver for that. - Big benchmark. - Hard enough, big benchmark, exactly. And now we're-- - How do we do it?

- We could just like, whatever, like just figure out what works, because I want to be able to go back to the academic science community and say, here's what we tried, here's what worked, here's what didn't work. I don't want to go in and say, oh, I only have one technology, I have a hammer, I'm only going to use this.

I'm going to do whatever it takes. I'm like, I'm going to think out of the box, I'm going to do whatever it takes. One, and I also, there was another thing I believed. I believed that the fundamental NLP technologies and machine learning technologies would be adequate. And this was an issue of how do we enhance them?

How do we integrate them? How do we advance them? So I had one researcher who came to me who had been working on question answering with me for a very long time, who had said, we're going to need Maxwell's equations for question answering. And I said, if we need some fundamental formula that breaks new ground in how we understand language, we're screwed.

We're not going to get there from here. Like, my assumption is I'm not counting on some brand new invention. What I'm counting on is the ability to take everything that had been done before, figure out an architecture for how to integrate it well, and then see where it breaks and make the necessary advances we need to make until this thing works.

- Yeah, push it hard to see where it breaks and then patch it up. I mean, that's how people change the world. I mean, that's the Elon Musk approach with rockets, SpaceX, that's the Henry Ford and so on. I love it. - And I happen to be, and in this case, I happen to be right, but like, we didn't know.

But you kind of have to put a stake in the ground for how you're going to run the project. - So yeah, backtracking to search. If you were to do the brute force solution, what would you search over? So you have a question, how would you search the possible space of answers?

- Look, web search has come a long way, even since then. But at the time, like, you know, first of all, I mean, there were a couple of other constraints around the problem, which is interesting. So you couldn't go out to the web. You couldn't search the internet. In other words, the AI experiment was, we want a self-contained device.

The device, if the device is as big as a room, fine, it's as big as a room, but we want a self-contained device. You're not going out to the internet, you don't have a lifeline to anything. So it had to kind of fit in a shoebox, if you will, or at least the size of a few refrigerators, whatever it might be.

But also, you couldn't go off network, right? So there was that limitation. But the baseline thing was to go do a web search. The problem was, even when we went and did a web search, I don't remember exactly the numbers, but somewhere on the order of 65% of the time, the answer would be somewhere, you know, in the top 10 or 20 documents.
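For illustration, a baseline like that could be measured with a recall-at-k check along these lines; the `search` function is a hypothetical stand-in, not the actual evaluation code:

```python
# Illustrative recall@k for a search baseline: how often the correct answer
# string shows up anywhere in the top-k retrieved documents.
from typing import Callable

def recall_at_k(qa_pairs: list[tuple[str, str]],
                search: Callable[[str, int], list[str]],
                k: int = 10) -> float:
    hits = 0
    for question, answer in qa_pairs:
        top_docs = search(question, k)  # top-k document texts from some engine
        if any(answer.lower() in doc.lower() for doc in top_docs):
            hits += 1
    return hits / len(qa_pairs)

# A number like 65% here bounds what any downstream answer extraction can do,
# no matter how good the extractor is.
```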

So first of all, that's not even good enough to play Jeopardy. In other words, even if you could perfectly pull the answer out of the top 20 documents, top 10 documents, whatever it was, which we didn't know how to do, and even if you knew it was right, it wouldn't matter unless we had enough confidence in it, right?

So you'd have to pull out the right answer, you'd have to have confidence it was the right answer. And then you'd have to do that fast enough to now go buzz in, and you'd still only get 65% of them right, which doesn't even put you in the winner's circle.

To be in the winner's circle, you have to be up over 70%, and you have to do it really quickly. But now the problem is, well, even if the answer is somewhere in the top 10 documents, how do I figure out where in the top 10 documents that answer is?

And how do I compute a confidence of all the possible candidates? So it's not like I go in knowing the right answer and have to pick it. I don't know the right answer. I have a bunch of documents, somewhere in there is the right answer. How do I, as a machine, go out and figure out which one's right?

And then how do I score it? And now, how do I deal with the fact that I can't actually go out to the web? - First of all, if you pause on that, just think about it. If you could go to the web, do you think that problem is solvable?

Just thinking even beyond Jeopardy, do you think the problem of reading text to find where the answer is, is solvable? - Well, we solved that, in some definition of solved, given the Jeopardy challenge. - How did you do it for Jeopardy? So how do you take a body of work on a particular topic and extract the key pieces of information?

- So now forgetting about the huge volumes that are on the web, right? So now we have to figure out, we did a lot of source research. In other words, what body of knowledge is gonna be small enough, but broad enough, to answer Jeopardy? And we ultimately did find the body of knowledge that did that.

I mean, it included Wikipedia and a bunch of other stuff. - So like encyclopedia type of stuff. I don't know if you can speak to-- - Encyclopedias, dictionaries, different types of semantic resources, like WordNet and other types of semantic resources, as well as some web crawls.

In other words, we went out and took that content and then expanded it by statistically producing seeds, using those seeds for other searches, and then expanding that. So using these expansion techniques, we went out and found enough content and we're like, okay, this is good.

And even up until the end, we had a thread of research, it was always trying to figure out what content could we efficiently include. - I mean, there's a lot of popular, like what is the church lady? Well, I think was one of the, like what? I guess that's probably an encyclopedia.

So I guess-- - So that's in an encyclopedia, but then we would take that stuff and we would go out and expand it. In other words, we'd go find other content that wasn't in the core resources and expand it. You know, we grew the amount of content by an order of magnitude, but still, again, from a web scale perspective, this is a very small amount of content.

- It's very select. - And then we took all that content, we pre-analyzed the crap out of it, meaning we parsed it, broke it down into all those individual words, and we did syntactic and semantic parses on it, had computer algorithms that annotated it, and we indexed that in a very rich and very fast index.

So we have a relatively huge amount of content, let's say the equivalent of, for the sake of argument, two to five million books. We've now analyzed all that, blowing up its size even more because now we have all this metadata, and then we richly indexed all of that and put it in a giant in-memory cache.

So Watson did not go to disk. - So the infrastructure component there, if you could just speak to it, how tough was it? I mean, I know, maybe this is 2008, 2009, you know, that's kind of a long time ago. - Right. - How hard is it to use multiple machines?

Like how hard is the infrastructure component, the hardware component? - So we used IBM hardware. We had something like, I forget exactly, but close to 3000 cores completely connected. So you had a switch where, you know, every CPU was connected to every other CPU. - And they were sharing memory in some kind of way.

- Large shared memory, right? And all this data was pre-analyzed and put into a very fast indexing structure that was all in memory. And then we took that question, we would analyze the question. So all the content was now pre-analyzed. So if I went and tried to find a piece of content, it would come back with all the metadata that we had pre-computed.
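As a very rough sketch of that pre-analysis and in-memory indexing idea, assuming a toy tokenizer and a plain inverted index as stand-ins for Watson's much richer annotators and index:

```python
# Illustrative offline pre-analysis: annotate every passage once, keep the
# metadata with the text in an in-memory inverted index, so query-time lookups
# never re-parse anything and never touch disk.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class AnalyzedPassage:
    text: str
    tokens: list[str]
    annotations: dict = field(default_factory=dict)  # e.g. parse info, entity tags

def analyze(text: str) -> AnalyzedPassage:
    tokens = text.lower().split()  # stand-in for real tokenization and parsing
    return AnalyzedPassage(text, tokens, {"length": len(tokens)})

def build_index(passages: list[str]) -> dict[str, list[AnalyzedPassage]]:
    index: dict[str, list[AnalyzedPassage]] = defaultdict(list)
    for passage in passages:
        analyzed = analyze(passage)
        for token in set(analyzed.tokens):  # token -> passages containing it
            index[token].append(analyzed)
    return index  # held entirely in memory

# At question time, index["dickinson"] would return matching passages with
# their pre-computed metadata already attached.
```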

- How do you shove that question? How do you connect the big stuff, the big knowledge base of the metadata that's indexed to the simple little witty, confusing question? - Right. So therein lies, you know, the Watson architecture, right? So we would take the question, we would analyze the question.

Which means that we would parse it and interpret it a bunch of different ways. We'd try to figure out, what is it asking about? We had multiple strategies to kind of determine what it was asking for. That might be represented as a simple string, a character string, or something we would connect back to different semantic types from existing resources.

So anyway, the bottom line is we would do a bunch of analysis on the question. And question analysis had to finish, and had to finish fast. We'd do the question analysis because from the question analysis we would now produce searches. And we built those using open source search engines that we modified.

We had a number of different search engines we would use that had different characteristics. We went in there and engineered and modified those search engines, ultimately to now take our question analysis, produce multiple queries based on different interpretations of the question and fire out a whole bunch of searches in parallel.

And they would come back with passages. So these were passage search algorithms. They would come back with passages. And so now let's say you had a thousand passages. Now for each passage, you parallelize again. So you went out and you parallelized the search, and each search would now come back with a whole bunch of passages.

Maybe you had a total of a thousand or 5,000, whatever passages. For each passage now, you'd go and figure out whether or not there was a candidate, we'd call a candidate answer in there. So you had a whole bunch of other algorithms that would find candidate answers, possible answers to the question.
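Here is a minimal sketch of that fan-out, with toy stand-ins for question analysis, passage search, and candidate generation; the real components were far more sophisticated and numerous:

```python
# Illustrative fan-out: one question -> several query interpretations ->
# parallel passage searches -> candidate answers pulled from each passage.
# All functions below are simplified stand-ins for Watson's components.
from concurrent.futures import ThreadPoolExecutor

def analyze_question(question: str) -> list[str]:
    # Stand-in for question analysis: produce multiple query interpretations.
    words = question.replace("?", "").split()
    return [" ".join(words), " ".join(words[-3:])]  # full query + keyword query

def passage_search(query: str) -> list[str]:
    # Stand-in for a passage-search engine over the pre-analyzed corpus.
    return [f"passage matching '{query}' #{i}" for i in range(3)]

def generate_candidates(passage: str) -> list[str]:
    # Stand-in candidate generator: extract possible answers from a passage.
    return [passage.split()[-1]]

def candidates_for(question: str) -> list[str]:
    queries = analyze_question(question)
    with ThreadPoolExecutor() as pool:  # searches run in parallel
        passage_lists = list(pool.map(passage_search, queries))
    passages = [p for plist in passage_lists for p in plist]
    with ThreadPoolExecutor() as pool:  # candidate generation fans out again
        candidate_lists = list(pool.map(generate_candidates, passages))
    return [c for clist in candidate_lists for c in clist]
```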

And so you had candidate answer generators, a whole bunch of those. So for every one of these components, the team was constantly doing research, coming up with better ways to generate search queries from the questions, better ways to analyze the question, better ways to generate candidates. - And speed, so better means accuracy and speed.

- Correct, so right, speed and accuracy, for the most part, were separated. We handled those sort of in separate ways. Like, I focused purely on accuracy, on end-to-end accuracy: are we ultimately getting more questions right and producing more accurate confidences? And then there was a whole other team that was constantly analyzing the workflow to find the bottlenecks, and then figuring out how to both parallelize and drive up the algorithm speed.

But anyway, so now think of it like, again, you have this big fan out now, right? Because you had multiple queries, now you have thousands of candidate answers. For each candidate answer, you're gonna score it. So you're gonna use all the data that built up. You're gonna use the question analysis.

You're gonna use how the query was generated. You're gonna use the passage itself. And you're gonna use the candidate answer that was generated. And you're gonna score that. So now we have a group of researchers coming up with scorers. There are hundreds of different scorers. So now you're getting a fan out of it again, from however many candidate answers you have, to all the different scorers.

So if you have 200 different scorers and you have a thousand candidates, now you have 200,000 scores. And so now you gotta figure out, how do I now rank these answers based on the scores that came back? And I wanna rank them based on the likelihood that they're a correct answer to the question.
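To make the scoring fan-out and ranking concrete, a toy sketch along these lines; the two scorers and the naive averaging are placeholders for the hundreds of scorers and the learned combination described later:

```python
# Illustrative scoring fan-out: every candidate answer gets scored by every
# scorer, producing candidates x scorers scores (e.g. 1,000 x 200 = 200,000),
# and candidates are then ranked by an estimate of how likely each is correct.
# The two scorers and the naive averaging below are toy placeholders.

def scorer_length(candidate: str, passage: str, question: str) -> float:
    return min(len(candidate) / 20.0, 1.0)  # toy feature

def scorer_overlap(candidate: str, passage: str, question: str) -> float:
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)  # toy evidence overlap

SCORERS = [scorer_length, scorer_overlap]  # Watson had hundreds of these

def rank_candidates(candidates: list[tuple[str, str]], question: str):
    """candidates: (candidate_answer, supporting_passage) pairs."""
    scored = []
    for cand, passage in candidates:
        scores = [scorer(cand, passage, question) for scorer in SCORERS]
        combined = sum(scores) / len(scores)  # stand-in for the learned combiner
        scored.append((cand, scores, combined))
    return sorted(scored, key=lambda item: item[2], reverse=True)  # best first
```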

So every scorer was its own research project. - What do you mean by scorer? So is that the annotation process of basically a human being saying that this answer has a quality of-- - Yeah, if you wanna think about what a human would be doing, a human would be looking at a possible answer.

They'd be reading, say, Emily Dickinson, they'd be reading the passage in which that answer occurred. They'd be looking at the question, and they'd be making a decision of how likely it is that Emily Dickinson, given this evidence in this passage, is the right answer to that question. - Got it, so that's the annotation task.

That's the annotation task. - That's the scoring task. - But scoring implies zero to one kind of continuous-- - That's right, you give it a zero to one score. - So it's not a binary-- - No, you give it a score. Give it a zero, yeah, exactly, a zero to one score.

- So different humans would give different scores, so you have to somehow normalize and deal with all that complexity. - Depends on what your strategy is. - It could be relative, too. It could be-- - We actually looked at the raw scores as well as standardized scores, because humans are not involved in this.

Humans are not involved. - Sorry, so I'm misunderstanding the process here. This is passages. Where is the ground truth coming from? - Ground truth is only the answers to the questions. - So it's end to end. - It's end to end. So I was always driving end to end performance.

It was a very interesting engineering approach, and ultimately scientific and research approach, always driving end to end. Now, that's not to say we wouldn't make hypotheses that individual component performance was related in some way to end to end performance. Of course we would, because people would have to build individual components.

But ultimately, to get your component integrated into the system, you had to show impact on end to end performance, question answering performance. - So there's many very smart people working on this, and they're basically trying to sell their ideas as a component that should be part of the system.

- That's right, and they would do research on their component, and they would say things like, you know, I'm gonna improve this as a candidate generator, or I'm gonna improve this as a question scorer, or as a passage scorer, or as a parser, and I can improve it by 2% on its component metric, like a better parse, or a better candidate, or a better type estimation, or whatever it is.

And then I would say, I need to understand how the improvement on that component metric is gonna affect the end to end performance. If you can't estimate that, and can't do experiments to demonstrate that, it doesn't get in. - That's like the best run AI project I've ever heard.

That's awesome, okay. What breakthrough would you say, like, I'm sure there's a lot of day to day breakthroughs, but was there like a breakthrough that really helped improve performance? Like where people began to believe? Or is it just a gradual process? - Well, I think it was a gradual process, but one of the things that I think gave people confidence that we can get there was that, as we follow this procedure of different ideas, build different components, plug them into the architecture, run the system, see how we do, do the error analysis, start off new research projects to improve things, and the very important idea that the individual component work did not have to deeply understand everything that was going on with every other component.

And this is where we leverage machine learning in a very important way. So while individual components could be statistically driven machine learning components, some of them were heuristic, some of them were machine learning components, the system as a whole combined all the scores using machine learning. This was critical because that way you can divide and conquer so you can say, okay, you work on your candidate generator, or you work on this approach to answer scoring, you work on this approach to type scoring, you work on this approach to passage search or to passage selection and so forth.

But then we'd just plug it in, and we had enough training data to say, now we can train and figure out how to weigh all the scores relative to each other, based on predicting the outcome, which is right or wrong on Jeopardy. And we had enough training data to do that.
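A minimal sketch of that kind of learned score combination, assuming scikit-learn's logistic regression and synthetic data as stand-ins for Watson's actual training setup:

```python
# Illustrative score combination: treat each candidate's vector of scorer
# outputs as features and learn weights that predict right/wrong, which also
# serves as the system's confidence. scikit-learn is a stand-in here; Watson's
# actual training regime was more elaborate (phases, standardization, etc.).
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per candidate answer, one column per scorer's output.
# y: 1 if that candidate was the correct answer to its question, else 0.
rng = np.random.default_rng(0)
X = rng.random((1000, 200))                # e.g. 1,000 candidates x 200 scorers (synthetic)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # synthetic labels for the sketch

combiner = LogisticRegression(max_iter=1000).fit(X, y)

def confidence(score_vector: np.ndarray) -> float:
    # Probability the candidate is correct, used both to rank candidates
    # and to decide whether the system is confident enough to buzz in.
    return float(combiner.predict_proba(score_vector.reshape(1, -1))[0, 1])
```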

So this enabled people to work independently and to let the machine learning do the integration. - Beautiful, so yeah, the machine learning is doing the fusion, and then it's a human orchestrated ensemble of different approaches. That's great. Still impressive that you were able to get it done in a few years.

That's not obvious to me that it's doable if I just put myself in that mindset. But when you look back at the Jeopardy challenge, again, when you're looking up at the stars, what are you most proud of? Just looking back at those days. - I'm most proud of my...

My commitment and my team's commitment to be true to the science, to not be afraid to fail. - That's beautiful because there's so much pressure because it is a public event, it is a public show, that you were dedicated to the idea. - That's right. - Do you think it was a success?

In the eyes of the world, it was a success. By your, I'm sure, exceptionally high standards, is there something you regret you would do differently? - It was a success. It was a success for our goal. Our goal was to build the most advanced open domain question answering system.

We went back to the old problems that we used to try to solve, and we did dramatically better on all of them, as well as we beat Jeopardy. So we won at Jeopardy. So it was a success. I worry that the world would not understand it as a success because it came down to only one game.

And I knew, statistically speaking, this can be a huge technical success, and we could still lose that one game. And that's a whole 'nother theme of the journey. But it was a success. It was not a success in natural language understanding, but that was not the goal. - Yeah, that was, but I would argue, I understand what you're saying in terms of the science, but I would argue that the inspiration of it, right, not a success in terms of solving natural language understanding, but it was a success of being an inspiration to future challenges.

- Absolutely. - That drive future efforts. (upbeat music)