
Increasing data science productivity; founders of spaCy & Prodigy


Chapters

0:00 Introduction
3:55 Syntax Positive
12:32 Syntax
14:35 The Term Sense2Vec
16:19 Using Sense2Vec in spaCy
17:21 Parsing Algorithm
17:44 Transition Based Parsing
22:24 Splitting Tokens
24:44 Learning to Merge
26:56 User Experience
29:16 spaCy vs Stanford
30:54 End-to-end systems
32:48 Long-range dependencies
33:46 Language variation
35:09 Coreference resolution
36:49 How to use spaCy
37:42 Language generation
45:44 Binary decision
52:51 Recipes
55:26 Example

Whisper Transcript

00:00:00.000 | OK, so yeah, this is our first time in the Bay Area,
00:00:03.960 | so it's nice to meet you all.
00:00:05.600 | And thanks for coming.
00:00:07.160 | Not so much to notice.
00:00:09.200 | So I'll start by just giving a quick introduction of us
00:00:11.640 | and some of the things that we're doing
00:00:13.840 | before I start with the main content of the talk, which
00:00:16.360 | is about this open source library that we developed,
00:00:19.960 | spaCy, for natural language processing.
00:00:22.440 | So the other things that we develop as well at Explosion
00:00:25.400 | AI is the machine learning library behind spaCy, Thinc,
00:00:31.000 | which allows us to avoid depending on other libraries
00:00:33.520 | and keep control of everything and make sure that everything
00:00:36.140 | is easy to install.
00:00:38.080 | We also have an annotation tool that we
00:00:40.080 | develop alongside spaCy, Prodigy, which is what
00:00:42.440 | Ines will be talking about.
00:00:44.320 | And we're also preparing a data store of other pre-trained
00:00:46.840 | models for more specific languages and use cases
00:00:49.960 | and things that people will be able to use that basically
00:00:53.680 | will extend the capabilities of the software
00:00:56.440 | for more specific use cases.
00:00:59.640 | So to give you a quick introduction to Ines and me,
00:01:02.840 | which is basically all of Explosion AI,
00:01:05.640 | so I've been working on natural language processing
00:01:07.880 | for pretty much my whole career.
00:01:09.520 | I started doing this after doing a PhD in computer science.
00:01:13.960 | I started off in linguistics and then kind of moved
00:01:15.960 | across to computational linguistics.
00:01:18.880 | And then around 2014, I saw that these technologies
00:01:22.160 | were getting increasingly viable.
00:01:23.960 | And I was also at the point in my career
00:01:26.000 | where I was supposed to start writing grant proposals, which
00:01:28.960 | didn't really agree with me.
00:01:30.120 | So I decided to leave and I saw that there
00:01:32.000 | was a gap in the capabilities available for something that
00:01:35.040 | actually translated the research systems
00:01:37.040 | to something that was more practically focused.
00:01:39.920 | And then soon after I moved to Berlin to do this,
00:01:42.640 | I met Ines.
00:01:44.000 | And we've been working together since on these things.
00:01:46.360 | And I think we kind of have a nice complementarity of things.
00:01:49.920 | She is the lead developer of our annotation tool Prodigy
00:01:54.200 | and has also been working on spaCy pretty much
00:01:56.880 | since the first release.
00:01:59.360 | So I included this slide, which we normally actually
00:02:02.320 | give this when we talk to companies specifically.
00:02:04.480 | But I think that it's a good thing to include
00:02:06.560 | to give you a bit of this is what
00:02:08.960 | we tell people about what we do and how we make money
00:02:11.640 | and how the company works.
00:02:13.080 | And I think that this is a very valid question
00:02:14.920 | that people would have about an open source library.
00:02:17.000 | It's like, well, why are you doing this and how
00:02:19.200 | does it fit into the rest of your projects and plans?
00:02:22.640 | So the Explain It Like I'm 5 version, which I guess
00:02:25.760 | is also the Explain It Like I'm Senior Management version,
00:02:28.800 | is we give an analogy.
00:02:30.280 | It's kind of like a boutique kitchen.
00:02:31.880 | So the free recipes we publish online,
00:02:34.680 | you can see, is kind of like the open source software.
00:02:37.120 | So that's spaCy, Thinc, et cetera.
00:02:39.600 | At the start of the company, especially,
00:02:41.360 | we were doing consulting, which I'm
00:02:43.160 | happy to say we've been able to wind down over the last six
00:02:46.400 | months and focus on our products.
00:02:48.360 | And then we also focus on a line of kitchen gadgets, which
00:02:51.800 | is things like Prodigy.
00:02:53.160 | These are these downloadable tools
00:02:54.880 | to use alongside the open source software.
00:02:56.960 | And soon we'll have this sort of premium ingredients, which
00:02:59.460 | are the pre-trained models.
00:03:01.240 | So the thing that we don't do here
00:03:02.880 | is enterprise support, which I guess
00:03:04.680 | is probably the most common way that people
00:03:07.840 | fund open source software or imagine
00:03:09.720 | that they'll fund open source software with a business model.
00:03:12.600 | And we really don't like this because we want our software
00:03:16.040 | to be as easy to use as possible and as transparent as possible
00:03:18.960 | and the documentation to be good.
00:03:20.720 | So I think it's kind of weird to have this thing where you have
00:03:24.360 | explicitly a plan that we're going to make our free stuff
00:03:26.960 | as good as possible.
00:03:28.240 | And then we're going to have this service
00:03:30.440 | that we hope people pay us lots of money for,
00:03:33.440 | but we hope nobody uses.
00:03:35.120 | And that's kind of weird, right?
00:03:36.520 | It's kind of weird to have a company where you
00:03:39.140 | hope that your paid offering is really poor value to people.
00:03:41.720 | And so we don't think that that's a good way to do it.
00:03:44.440 | And so instead, we have the downloadable tools,
00:03:47.560 | I think, is a good way to--
00:03:49.800 | we have something which works alongside spaCy
00:03:51.840 | and I think is useful to people who use spaCy as well.
00:03:56.840 | OK, so onto the sort of main content of the talk
00:04:01.680 | and the bit that I'll be talking about.
00:04:04.600 | So I'm going to talk to you about the syntactic parser
00:04:08.080 | within spaCy, the natural language processing library
00:04:10.920 | that we use.
00:04:12.040 | And so before I do it, so this is kind of what
00:04:16.040 | it looks like as sort of visualized as an output.
00:04:19.800 | So it's this sort of tree-based structure
00:04:22.240 | that gives you the syntactic relationships between words.
00:04:26.760 | So the way that you should read this here
00:04:29.240 | is that the arrow pointing from this word to this word
00:04:33.640 | means that Apple is a child of looking in the tree.
00:04:37.520 | And it's a child with this relationship and such.
00:04:39.960 | In other words, Apple is the subject of looking.
00:04:42.480 | And is is an auxiliary verb attached to looking,
00:04:45.560 | and then at is a prepositional phrase attached to looking.
00:04:49.280 | So these sorts of relationships tell you
00:04:51.480 | about the syntactic structure of the sentence
00:04:53.200 | and basically help you get at the who did what to whom
00:04:55.880 | sort of relationships in the sentence
00:04:57.720 | and also to extract phrases and things.
00:04:59.760 | So for instance, here, to make the thing more easy to read,
00:05:02.840 | we've merged UK startup, which is a sort of basic noun phrase
00:05:06.560 | into one unit.
00:05:08.800 | And you can find these sorts of phrases
00:05:10.520 | more easily given the syntactic structure.
00:05:13.520 | And just above here, we've got an example
00:05:16.120 | of what the code looks like to actually get
00:05:18.880 | the syntactic structure or navigate the tree.
00:05:21.600 | In spaCy, you just get this NLP object after loading the models.
00:05:26.240 | And you just use that as a function
00:05:27.720 | that you feed text through, or pipe texts through if you've
00:05:30.320 | got a sequence of texts.
00:05:32.240 | And given that, you get a document object, which you can
00:05:35.880 | just use as an iterable. And from the tokens,
00:05:38.840 | you get attributes that you can use to navigate the tree.
00:05:42.040 | So for instance, here, the dependency relationship
00:05:44.680 | is just a dot depth.
00:05:47.000 | By default, that's an integer key, integer ID,
00:05:50.040 | because everything's kind of coded to an integer
00:05:52.120 | for easy and efficient processing.
00:05:55.320 | But then you can get the text value with an underscore
00:05:57.840 | as well.
00:05:58.520 | And then you can navigate up the tree with dot head.
00:06:01.000 | And then you can look at the left and right children
00:06:03.160 | of the tree as well.
00:06:04.280 | So we try to have a rich API that makes it easy
00:06:06.200 | to use these dependency relationships.
00:06:09.640 | So that just getting dependency parses, obviously,
00:06:13.920 | just the first step, you want to actually use it in some way.
00:06:16.400 | And that's why we have this API to make that easy.
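As a rough sketch of the API being described here (assuming the small English model en_core_web_sm is installed, which is not stated in the talk):

```python
import spacy

# Load a pre-trained pipeline; en_core_web_sm is the usual small English model
# (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc:
    # .dep is the integer ID; .dep_ is the human-readable label;
    # .head, .lefts and .rights navigate the tree.
    print(token.text, token.dep_, token.head.text,
          [t.text for t in token.lefts], [t.text for t in token.rights])

# For a sequence of texts, pipe them through in a batch.
for doc in nlp.pipe(["First document.", "Second document."]):
    print([(t.text, t.dep_) for t in doc])
```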
00:06:20.600 | So the question that always comes up with this,
00:06:23.360 | and I think this is a very interesting thing
00:06:25.600 | for the field in general, is what's the point of parsing?
00:06:28.760 | What is this actually good for in terms of applications?
00:06:31.920 | So Yoav Goldberg is a very prominent parsing researcher.
00:06:35.360 | And this is kind of the stuff that he's
00:06:37.240 | studied for most of his career.
00:06:38.640 | And he's one of the more well-known parsing people.
00:06:42.000 | And so it's interesting to see him and other people reflect
00:06:44.880 | on this and say that he finds it fascinating
00:06:46.800 | that even though we have so many best papers in NLP,
00:06:50.040 | so it's kind of a high prestige thing to study parsing.
00:06:53.600 | But it seems like syntax is hardly
00:06:56.080 | used in practice in most of these applications.
00:06:59.720 | So the question is, why is this?
00:07:03.120 | Because parsing is based on trees and structured predictions,
00:07:05.840 | it's kind of fun to study.
00:07:06.720 | And there's all these deep algorithmic questions.
00:07:08.720 | Is it just kind of this catnip to researchers?
00:07:11.800 | And does it have this kind of over prominence in the field?
00:07:17.360 | Or is it that there is something deeper about this
00:07:22.360 | and we should really continue studying this?
00:07:25.520 | Well, I can go either way on this.
00:07:29.840 | And so this slide shows you the case for parsing.
00:07:32.680 | And then I'll kind of have a counterpoint in a second.
00:07:35.440 | So I think that the most important case for parsing
00:07:38.920 | is that there's a sort of deep truth to the fact
00:07:41.240 | that sentences are tree-structured.
00:07:42.960 | They just are, right?
00:07:45.400 | The syntactic structure of sentences is recursive.
00:07:49.400 | And that means that you can have arbitrarily long gaps
00:07:52.320 | between two words which are related.
00:07:54.480 | So for instance, if you have a relationship between,
00:07:59.760 | say, a subject and a verb like syntax is,
00:08:03.840 | whether the subject of that verb is plural or singular
00:08:08.960 | is going to change the form of the verb.
00:08:11.080 | And that dependency between them can be arbitrarily long
00:08:13.880 | because you can have this nested structure.
00:08:16.840 | But it can't be arbitrarily long in tree space
00:08:18.920 | because the relationship between them
00:08:23.360 | will always be the subject and the verb,
00:08:25.120 | like sort of next to each other in the tree.
00:08:28.000 | So you can see how, for some of these things,
00:08:30.280 | it should be more efficient to think about it
00:08:32.320 | or model it as a tree.
00:08:34.080 | And the tree should tell you things
00:08:35.600 | that you otherwise would have to infer
00:08:38.880 | from an enormous amount of data.
00:08:41.000 | It should be more efficient in this way.
00:08:43.040 | So we can say, OK, in theory, this should be important.
00:08:46.440 | And it should be something that we
00:08:47.920 | study based on this knowledge about how sentences
00:08:50.840 | are structured.
00:08:53.440 | So then the counterpoint to this is, all right,
00:08:56.080 | so sentences are tree-structured.
00:08:57.640 | And that's a truth about sentences.
00:08:59.600 | But it's also true that they're written and read in order.
00:09:02.480 | So if you read a sentence, you do read it from left to right,
00:09:05.920 | or in English anyway, basically from start to finish,
00:09:08.600 | or you hear a sentence from start to finish.
00:09:10.840 | And this really puts a sort of bounding on the linear complexity
00:09:14.760 | that you will empirically see.
00:09:16.920 | Because when somebody wrote this sentence,
00:09:19.200 | yes, they could have an arbitrarily long dependency.
00:09:21.960 | But they expect that that would mean
00:09:24.000 | that their audience listening to it
00:09:25.440 | will have to wait arbitrarily long between some word
00:09:28.880 | and the thing that it attaches to.
00:09:30.560 | And that's kind of not very nice.
00:09:32.160 | So empirically, it's not very surprising to see
00:09:35.720 | that most dependencies are, in fact, short.
00:09:38.440 | And there's a lot of arguments that the options that
00:09:42.240 | are kind of provided to grammars are sort of arranged
00:09:45.160 | that you're able to keep your dependencies short.
00:09:48.040 | Like that's sort of some of the reasons
00:09:49.760 | you have options for how to move things around in sentences
00:09:52.240 | to make nice reading orders.
00:09:53.800 | Because you want short dependencies.
00:09:56.160 | So this means that if most dependencies are short,
00:09:58.880 | then processing text as, say, chunks of words of one or two
00:10:02.720 | at a time kind of gives you a pretty similar view.
00:10:06.040 | Most of the time, you don't get something that's
00:10:08.240 | so dramatically different if you look at a tree
00:10:11.160 | instead of looking at chunks of three or four word sentences.
00:10:14.400 | So this is kind of a counterpoint that
00:10:16.240 | says maybe even though the sentences are, in fact,
00:09:20.280 | tree-structured, maybe it's not that crucially useful.
00:10:23.760 | So I think that the part that makes this particularly
00:10:27.080 | rewarding to look at syntax or particularly useful
00:10:30.320 | to provide syntactic structures in a library like spaCy
00:10:33.400 | is that they're application independent.
00:10:35.480 | So the syntactic structure of the sentence
00:10:38.320 | doesn't depend on what you hope to do with the sentence
00:10:40.600 | or how you hope to process it.
00:10:42.120 | And that's something that's quite different from other labels
00:10:44.520 | or other information that we can attach to the sentence.
00:10:47.400 | If you're doing something like a sentiment analysis,
00:10:49.680 | there's no truth about the sentiment of a sentence
00:10:53.080 | that's independent of what you're hoping to process.
00:10:55.760 | That's not a thing that's in the text itself.
00:10:58.200 | It's a lens that you want to take on it based
00:11:01.160 | on how you want to process it.
00:11:02.480 | So whether you consider some review
00:11:05.800 | to be positive or negative depends on your application.
00:11:10.760 | It's not necessarily in the text itself.
00:11:12.680 | Because what counts as positive or negative?
00:11:15.080 | What's the labeling scheme?
00:11:16.160 | What's the rating scheme?
00:11:17.920 | Or exactly what are they talking about?
00:11:20.800 | Well, the taxonomy that you have will
00:11:23.040 | depend on what you're hoping to process with.
00:11:26.160 | Those things aren't in the language.
00:11:28.160 | But details about the syntactic structure are in the language.
00:11:31.600 | They're things which are just part
00:11:33.160 | of the structure of the language.
00:11:36.000 | And that means that we can provide these things,
00:11:38.600 | learn it once, and give it to many people.
00:11:40.520 | And I think that that's very valuable and useful
00:11:42.640 | and different from other types of annotations
00:11:44.800 | that we could calculate and attach.
00:11:46.480 | And that's why spaCy provides pre-trained models for syntax,
00:11:49.360 | but doesn't provide pre-trained models for something
00:11:51.400 | like sentiment.
00:11:52.080 | Because we know how to give you a syntactic analysis that's
00:11:56.760 | as useful as it may be, or maybe not,
00:11:59.480 | depending on whether that actually solves your problems.
00:12:02.200 | But at least it's sort of true and generalizable,
00:12:04.640 | whereas we don't know what categorization scheme you
00:12:08.640 | want to classify your text in.
00:12:09.880 | So we can't give you a pre-trained model
00:12:11.560 | that does that, because that's your own problem.
00:12:14.920 | So we try to basically give you these things, which
00:12:17.560 | are annotation layers, which do generalize in this way.
00:12:19.920 | And that means that there has to be
00:12:21.680 | a sort of linguistic truth to them.
00:12:23.240 | And that means that looking at things
00:12:25.080 | like the semantic roles, or sentence structure,
00:12:27.200 | or sentence divisions are things that we can do.
00:12:30.320 | And that's why we are interested in this.
00:12:33.680 | So the other thing about syntactic structures
00:12:36.720 | and whether they're useful or not is that in English,
00:12:40.440 | not using syntax is pretty powerful,
00:12:42.400 | because English orthography happens to cut things up
00:12:46.120 | into pretty convenient units.
00:12:48.360 | They're not optimal units, but they're still pretty nice
00:12:52.120 | in a way that doesn't really hold true
00:12:54.080 | across a lot of other languages.
00:12:55.920 | So in the bottom right here, we have
00:12:57.520 | Japanese, which usually isn't segmented into words.
00:13:01.320 | You can't just cut that up trivially with white space
00:13:04.640 | and get something that you can feed into a search engine,
00:13:07.000 | or get something that you can feed forward
00:13:08.720 | into a topic model.
00:13:09.800 | You have to do some extra work.
00:13:11.480 | And the extra work that you do there really
00:13:13.280 | should consider syntactic structure.
00:13:15.280 | You can use a technology that only makes linear decisions,
00:13:18.920 | but the truth about what counts as a word or not
00:13:21.880 | is very entangled with the syntactic structure.
00:13:23.880 | And so there's real value in doing it jointly
00:13:25.880 | with syntactic parsing.
00:13:27.880 | For other languages, you have kind of the opposite problem.
00:13:30.560 | So we have here a German word, and this
00:13:37.080 | is the German word for income tax return.
00:13:39.240 | Now, whether or not you want that to be sort of one unit
00:13:43.600 | will depend on what you're looking for.
00:13:45.200 | For many applications, actually, the English phrase
00:13:47.960 | is too short.
00:13:49.200 | And the domain object, the thing that you
00:13:51.480 | want to be looking for and having a single node
00:13:55.360 | in your knowledge graph for, would actually
00:13:57.120 | be income tax return.
00:13:58.040 | That's pretty awesome.
00:13:59.800 | But in other applications, maybe you just
00:14:01.600 | want to look for tax.
00:14:02.640 | And so in those cases, the German word will be too large
00:14:05.800 | and your data will be too sparse.
00:14:07.480 | So there's sort of different aspects to this.
00:14:12.000 | In the bottom left here, we have an example of Hebrew.
00:14:16.160 | And like Arabic and a couple of other languages like this,
00:14:21.960 | there's no vowels in the text.
00:14:23.560 | And the words tend to have all sorts of attachments
00:14:27.200 | that are difficult to segment off.
00:14:29.680 | So there, again, you have difficult segmentation problems
00:14:32.400 | that are all tangled up with the syntactic processing.
00:14:35.920 | OK, so going forward to sort of an example of what we can do
00:14:42.240 | if we recognize non-whitespace-looking words
00:14:46.320 | and feed them into some of the other processing stuff
00:14:50.800 | that we have.
00:14:51.720 | So this is a demo that we prepared a couple of years
00:14:55.080 | ago for an approach that we call the term Sense2Vec.
00:15:00.560 | So all this is is basically processing text
00:15:03.440 | using natural language processing tools, in this case
00:15:05.720 | specifically spaCy, in order to recognize these concepts that
00:15:09.800 | are longer than one word.
00:15:11.040 | So specifically here, we looked for base noun phrases and also
00:15:15.120 | named entities.
00:15:16.000 | And we just merged those into one token
00:15:17.640 | before feeding the text forward into a word2vec
00:15:21.720 | implementation, which gives you these semantic relationships.
00:15:25.360 | And this lets you search for and find similarities
00:15:29.600 | between phrases which are much longer than one word.
00:15:32.800 | And as soon as you do this, you find,
00:15:35.000 | ah, the things which I'm searching for
00:15:37.000 | are much more specific in meaning.
00:15:38.400 | I'm not looking for one meaning of learning
00:15:41.880 | or one meaning of processing, which
00:15:43.280 | doesn't tend to be so useful or interesting.
00:15:45.120 | Instead, you can find things related to natural language
00:15:48.720 | processing.
00:15:49.320 | And then you see, ah, machine learning, computer vision,
00:15:51.520 | et cetera.
00:15:52.080 | These are real results that came out of the thing
00:15:55.480 | as soon as you did this division.
00:15:58.520 | And so we can do this for other languages as well.
00:16:02.360 | So if we were hoping to use word2vec for a language
00:16:06.800 | like Chinese, you really want to be processing it into words
00:16:10.080 | before you do that.
00:16:11.280 | Or if you're going to do this for a language like Finnish,
00:16:14.280 | you really want to cut off the morphological suffixes
00:16:17.200 | before you do this.
00:16:20.520 | So incidentally, Ines has cleaned up
00:16:24.440 | sense2vec recently.
00:16:26.120 | So you can actually use this as a handy component
00:16:29.160 | within spaCy.
00:16:30.640 | So you can load up a standard model and then
00:16:35.240 | add a component that gives you these sense2vec senses.
00:16:38.480 | So you can just say, all right, the token at position three
00:16:42.720 | would be natural language processing,
00:16:44.280 | because it does the merging for you.
00:16:46.000 | And then you can also look up the similarity.
00:16:47.880 | So it's now much easier to actually use the pre-trained
00:16:52.280 | model and use that approach within spaCy.
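From memory of the sense2vec package's documented usage (the exact setup has changed between versions and the vector path is a placeholder, so treat this as an approximate sketch rather than the speaker's exact code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Attach the sense2vec pipeline component and load pre-trained sense vectors;
# "/path/to/s2v_vectors" stands in for whichever vector package you downloaded.
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_vectors")

doc = nlp("A sentence about natural language processing.")
span = doc[3:6]  # "natural language processing"
if span._.in_s2v:
    print(span._.s2v_most_similar(3))  # e.g. machine learning, computer vision, ...
```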
00:16:56.520 | Incidentally, we have this concept of an extension
00:16:59.200 | attribute in spaCy so that you can kind of attach your own
00:17:02.720 | things to the tokens so that you can basically
00:17:05.880 | attach your own little markups or processing things.
00:17:09.160 | So this underscore object is kind of a free space
00:17:13.600 | that you can attach attributes to, which ends up
00:17:16.360 | being quite convenient.
00:17:17.640 | It's a lot more convenient than trying to subclass something
00:17:20.800 | or something.
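The extension mechanism itself looks roughly like this, reusing an nlp object loaded as in the earlier sketch; the attribute name is made up purely for illustration:

```python
from spacy.tokens import Token

# Register a custom attribute on all tokens; "is_tech_term" is a
# hypothetical name chosen just for this example.
Token.set_extension("is_tech_term", default=False)

doc = nlp("We work on natural language processing.")
doc[3]._.is_tech_term = True          # write to the underscore space
print([t.text for t in doc if t._.is_tech_term])
```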
00:17:23.320 | So for the rest of the talk, I'll
00:17:26.720 | give you a little bit of a pretty brief overview
00:17:30.040 | of the parsing algorithm and then explain
00:17:33.120 | how we're going to-- how we're modifying the parsing algorithm
00:17:35.600 | to work with languages other than English
00:17:38.840 | so that we can basically broaden out the support of spaCy
00:17:41.680 | to these other languages.
00:17:43.760 | So what we see here is a completed parse.
00:17:49.960 | And I'm going to talk you through the steps
00:17:53.640 | or the decision points that the parser is going
00:17:55.400 | to make to derive this structure.
00:17:58.160 | And so the kind of key--
00:18:01.440 | I think to keep in mind-- or the key aspect of the solution
00:18:06.720 | is that it's going to read the sentence from left to right
00:18:09.560 | and maintain some state.
00:18:10.920 | And then it's going to have a fixed inventory of actions
00:18:14.640 | that it has to choose between to manipulate the current parse
00:18:18.320 | state to build up the arcs.
00:18:19.920 | And this type of approach, which is called
00:18:22.160 | transition-based parsing, I find deeply satisfying
00:18:25.080 | because it's linear in time because you only
00:18:30.600 | make so many decisions per word.
00:18:32.480 | And I do think that it makes a lot of sense
00:18:34.440 | to take algorithms which process language incrementally.
00:18:38.200 | I think that that's deeply satisfying and correct
00:18:41.640 | in a way that a lot of other approaches to parsing aren't.
00:18:44.520 | And it's also a very flexible approach.
00:18:46.240 | So we can do joint modeling and have it output
00:18:49.320 | all sorts of other structures as well as the parse tree.
00:18:52.480 | And that's actually what we're going to do.
00:18:54.140 | So already in spaCy, we have joint prediction
00:18:57.680 | of the sentence boundaries and the parse tree.
00:18:59.680 | And what we're going to do is extend
00:19:01.480 | this to this joint prediction of word boundaries as well.
00:19:04.840 | OK, so here's how the decision process of building the tree
00:19:09.960 | works.
00:19:10.680 | So we start off with an initial state.
00:19:13.000 | And so for ease of notation or ease of readability,
00:19:16.860 | we're notating the first word of the buffer.
00:19:20.600 | So the first word that's being focused on
00:19:23.240 | as this beam of highlighting.
00:19:26.000 | And then the other element of the state is a stack.
00:19:30.440 | And so as the first action that we do,
00:19:33.920 | we have an action that can advance the buffer one
00:19:36.760 | and put the word that was previously at the start
00:19:39.040 | of the buffer onto the stack.
00:19:40.440 | So here's what that shift move is going to look like.
00:19:43.520 | So here we have Google on the stack, which we write up here.
00:19:48.200 | And the first word of the buffer is reader.
00:19:50.920 | And so then another action that we can take
00:19:53.680 | is to form a dependency arc between the word that's
00:19:56.600 | on top of the stack and the first word of the buffer.
00:19:59.000 | So in this case, we want to attach Google
00:20:02.960 | as a child of reader.
00:20:04.320 | So we have an action that does that.
00:20:05.840 | And because we're building a tree,
00:20:07.920 | when we make an arc to Google, we
00:20:11.080 | know that we can pop it from the stack because it's a tree.
00:20:15.440 | It only can have one head.
00:20:17.360 | It can only have one attachment point.
00:20:21.680 | It's a different type of graph.
00:20:23.520 | And so that means that we can do that and keep moving forward.
00:20:27.240 | So here's what that looks like.
00:20:28.520 | We add an arc and pop Google from the stack.
00:20:31.920 | So now we make the next move.
00:20:34.240 | Clearly, we've got no words on the stack.
00:20:36.000 | So we should put reader on the stack so that we can continue.
00:20:39.560 | Now we're at was.
00:20:41.120 | And now we want to decide whether we should make an arc
00:20:43.840 | directly between was and reader.
00:20:45.840 | In this case, no, we want to attach was to canceled.
00:20:48.600 | So we're going to move was onto the stack
00:20:51.000 | and move forward onto canceled.
00:20:52.600 | So then here we do want this arc between canceled and was.
00:20:57.960 | So we do another left arc.
00:20:59.600 | And so we basically continue here.
00:21:01.720 | So we're sort of stepping back a bit and thinking about this.
00:21:06.440 | We've got a fixed inventory of actions.
00:21:08.520 | And so long as we can predict the right sequence
00:21:11.880 | of those actions, we can derive the correct pass.
00:21:14.120 | So that's how the machine learning model
00:21:15.720 | is going to work here.
00:21:16.920 | The machine learning model is going
00:21:18.280 | to be a classifier that predicts, given some state,
00:21:21.120 | what to do next.
00:21:22.840 | And you can sort of imagine that we can have other actions
00:21:26.520 | instead if we wanted to predict other aspects of the structure.
00:21:29.720 | So in the case of spaCy, we have an action
00:21:31.640 | that inserts a sentence boundary.
00:21:33.840 | So it just says, all right, given the words currently
00:21:36.440 | on the stack, you have to make actions
00:21:38.760 | that can clear the stack.
00:21:40.440 | But you're not allowed to push the next token
00:21:42.720 | until your stack is clear.
00:21:44.000 | And that means that there's going
00:21:45.320 | to be a sentence boundary there.
00:21:48.240 | And we could have other actions as well.
00:21:51.240 | There's been work to jointly predict part of speech tags
00:21:54.920 | at the same time as you're parsing.
00:21:56.240 | Or you can do semantics at the same time as you do syntax.
00:21:59.920 | And so you can code up all sorts of structures into this.
00:22:02.640 | And you're going to read the sentence left to right.
00:22:05.000 | And you're going to output some meaning structure attached
00:22:07.800 | to it.
00:22:08.360 | And as I said, I find this a satisfying way
00:22:10.960 | to do natural language understanding.
00:22:13.760 | Because it does involve reading the sentence
00:22:16.360 | and adding an interpretation incrementally.
00:22:20.000 | OK, so that's what this looks like as we proceed through.
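To make the state machine concrete, here is a toy sketch of the kind of transition system being described. It is deliberately simplified (unlabeled arcs, no sentence-boundary action, no learned model), so it is illustrative rather than spaCy's actual implementation:

```python
def parse(words, predict_action):
    """Toy transition-based parser. predict_action stands in for the
    classifier that, given the current state, chooses the next move."""
    stack = []
    buffer = list(range(len(words)))   # token indices, read left to right
    arcs = []                          # (head, child) pairs

    while buffer or len(stack) > 1:
        action = predict_action(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            # advance the buffer, pushing its first word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and stack and buffer:
            # word on top of the stack becomes a child of the buffer front
            arcs.append((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            # word on top of the stack becomes a child of the word below it
            child = stack.pop()
            arcs.append((stack[-1], child))
        else:
            break
    return arcs
```

Because each word is shifted once and popped at most once, the number of actions is linear in the sentence length, which is the linear-time property mentioned above.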
00:22:25.320 | So all right, so how are we going
00:22:27.080 | to do this splitting up or merging of other things?
00:22:31.160 | Well, it's actually not that complicated,
00:22:33.240 | given this transition-based framework.
00:22:35.080 | So already, you can kind of see that in order
00:22:38.600 | to merge tokens, all we really have to do
00:22:40.880 | is we've got those tokens.
00:22:42.280 | And if we wanted Google Reader to be one token,
00:22:45.080 | we just have to have some special dependency label, which
00:22:47.360 | we are going to have in the tree.
00:22:49.400 | And so, obviously, we call the label subtoken.
00:22:51.840 | And then all we have to do is say, all right,
00:22:55.280 | at the end of parsing, we're going
00:22:56.540 | to consider that as one token.
00:22:58.560 | So the step from going through something like this
00:23:03.040 | and labeling a language like Chinese is actually super
00:23:06.040 | simple.
00:23:06.540 | We just have to prepare the training data
00:23:08.240 | so that the tokens are individual characters.
00:23:11.000 | And then we can say, all right, things
00:23:14.180 | which should be one word should have this sort of structure
00:23:18.000 | with this label.
00:23:18.960 | And then if the parser decides that those things are attached
00:23:22.920 | together, then at the end of it, you just merge them up.
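A hedged sketch of that post-processing step using spaCy's retokenizer; the "subtoken" label is the convention described in the talk rather than a built-in, and this assumes each subtoken attaches to a following head:

```python
# Collect spans of tokens connected by the "subtoken" dependency label,
# then merge each span into a single token after parsing.
spans = []
for token in doc:
    if token.dep_ == "subtoken" and token.head.i > token.i:
        if spans and token.i <= spans[-1][1]:
            # this token continues the previous subtoken chain
            spans[-1][1] = max(spans[-1][1], token.head.i)
        else:
            spans.append([token.i, token.head.i])

with doc.retokenize() as retokenizer:
    for start, end in spans:
        retokenizer.merge(doc[start:end + 1])
```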
00:23:26.640 | The splitting tokens is more complicated,
00:23:29.140 | because you have to have some universal actions that
00:23:31.900 | manipulates the strings.
00:23:33.080 | So I'm still sort of working on the implementation of this
00:23:35.600 | in a way that's sort of clean and tidy.
00:23:39.200 | But I actually think that this will
00:23:40.700 | be useful for a lot of English texts
00:23:42.200 | as well, because if you have English texts that's
00:23:46.000 | sort of misspelled, a lot of the time,
00:23:48.200 | things which should be two tokens get merged into one.
00:23:50.720 | So it's is a particularly common and frustrating one of this,
00:23:55.240 | because the verb is, it should be its own token.
00:23:59.280 | But if you have it's as ITS, which is also
00:24:01.680 | a common word in English, you need
00:24:04.840 | to figure out that you have to have two parser actions, two
00:24:07.680 | parser states for that.
00:24:09.240 | And in general, you could have a statistical model
00:24:12.000 | that reads the sentence beforehand.
00:24:13.760 | But that statistical model that is
00:24:15.600 | going to read the sentence and process it
00:24:17.520 | is going to end up taking on work
00:24:19.200 | and doing jobs of figuring out the syntactic structure
00:24:21.880 | of the sentence in order to make those decisions.
00:24:24.000 | And that's why I think doing these things jointly
00:24:26.000 | is kind of satisfying, because instead
00:24:28.320 | of learning that information in one level of representation
00:24:31.480 | and throwing it away, only to build up
00:24:33.840 | the same information in the next pass of the pipeline,
00:24:37.440 | you can do it all at once.
00:24:38.720 | And so I think that the joint incremental approaches,
00:24:41.760 | I think, are very satisfying and good.
00:24:44.880 | So where are we at the moment?
00:24:46.880 | So I've implemented the learning to merge side
00:24:52.440 | of things, which involves figuring out
00:24:55.960 | better alignments between the gold standard tokenization
00:24:59.040 | and the output of the tokenizer.
00:25:01.440 | And that's allowed me to complete the experiments
00:25:04.040 | for Chinese, Vietnamese, and Japanese
00:25:07.520 | on the Conference on Natural Language Learning (CoNLL) 2017
00:25:11.120 | benchmark, which was a sort of bake-off of these parsing
00:25:13.760 | models, which was conducted last year.
00:25:16.560 | Now, in that benchmark, the team from Stanford
00:25:20.200 | did extremely well, compared to everybody else in the field.
00:25:24.400 | They were some two or three percentage points better.
00:25:28.400 | And so at the moment, we were ranking
00:25:30.800 | kind of at the top of what was the second place pack.
00:25:33.240 | So most of the languages were coming sort of
00:25:35.280 | underneath the Stanford system, but with significantly better
00:25:40.000 | efficiency and with sort of this end-to-end process.
00:25:42.960 | And in particular, we're doing better than Stanford
00:25:45.760 | on these languages like Chinese, Vietnamese, and Japanese,
00:25:48.120 | because the Stanford system did have this disadvantage of using
00:25:51.240 | the sort of preprocessed text.
00:25:52.680 | They didn't do the whole task.
00:25:53.880 | They wanted to just use the provided preprocessed texts
00:25:59.480 | so that they could focus on the parsing algorithm.
00:26:01.680 | And that meant that they did have this error propagation
00:26:03.880 | problem.
00:26:04.440 | If the inputs are incorrect because the preprocessed
00:26:07.360 | segmenter is incorrect, then they
00:26:11.560 | have a big disadvantage on these languages.
00:26:13.680 | So satisfyingly, the sort of doing all at once
00:26:18.000 | and entangling all of these representations,
00:26:21.200 | it does have this advantage.
00:26:22.320 | And we're seeing that in the results
00:26:24.160 | that we have for those languages.
00:26:26.320 | And the other thing that's satisfying
00:26:27.680 | is this joint modeling approach of deciding the segmentation
00:26:31.600 | at the same time as deciding the parse structure is consistently
00:26:35.040 | better than the pipeline approach in our experiments.
00:26:37.520 | So basically, we're getting a sort of 1% to 3% improvement
00:26:42.840 | from this, which is about the same size
00:26:44.840 | as we're getting from using the neural network model instead
00:26:47.400 | of the linear model.
00:26:48.280 | So I've found this also quite satisfying,
00:26:50.400 | that the sort of conceptually neat solution
00:26:53.560 | is also working well in practice.
00:26:57.200 | So where does this go, and what do we
00:26:58.840 | hope to deliver from this?
00:27:00.480 | Yes, that would probably be it.
00:27:06.360 | How am I for time?
00:27:11.440 | OK, well, this is the last slide, so wrapping up.
00:27:16.080 | OK, so what we want to do is we want
00:27:18.840 | to deliver a sort of workflow or user experience
00:27:21.680 | where it's very easy to start with the pre-trained models
00:27:24.360 | for different languages and broad application areas.
00:27:28.220 | And we want to make sure that they
00:27:29.640 | have the same representation across languages.
00:27:31.680 | So you get the same parse scheme, which the folks behind
00:27:37.640 | the Universal Dependencies project have been
00:27:39.360 | working hard on and basically now have a pretty satisfying
00:27:41.400 | solution.
00:27:42.440 | And so if you're processing text from different languages,
00:27:44.720 | it should be easy to find, say, subject-verb relationships
00:27:47.240 | or direct-object relationships.
00:27:48.920 | And that should work across basically any language
00:27:51.400 | so that you can use these parse trees
00:27:53.600 | and basically have a level of abstraction
00:27:55.600 | from which language the text is in.
00:27:59.400 | And then given this, you should be
00:28:01.840 | able to do pretty powerful rule-based matching
00:28:03.840 | from the parse tree and other annotations to provide it.
00:28:07.480 | So it should be pretty easy to find information,
00:28:11.280 | even without knowing much about the language and reuse
00:28:13.600 | rules across language.
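Because the labels come from the Universal Dependencies scheme, a rule like "find the subject of each verb" can be written once and reused; a small sketch (nsubj is the UD label for nominal subjects, and the output depends on the model):

```python
def subject_verb_pairs(doc):
    # Works the same way whichever language's model produced the parse,
    # as long as it uses Universal Dependencies labels.
    return [(tok.text, tok.head.text) for tok in doc if tok.dep_ == "nsubj"]

print(subject_verb_pairs(nlp("Apple is looking at buying a U.K. startup")))
# e.g. [('Apple', 'looking')]
```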
00:28:15.600 | And then if the syntactic model and the entity models
00:28:21.360 | that we provide aren't accurate enough,
00:28:23.800 | the library should support easy updating of those,
00:28:26.400 | including learning new vocabulary items
00:28:28.600 | without you taking particular effort.
00:28:32.160 | And overall, we sort of want to emphasize
00:28:34.520 | a workflow of rapid iteration and data annotation.
00:28:37.800 | So the concept of this is that we
00:28:40.000 | should be able to provide things which
00:28:42.080 | give a broad-based understanding of language,
00:28:45.800 | but that still ends up with a need for the knowledge
00:28:50.400 | specific to your domain and training
00:28:51.920 | and evaluation data specific to your problems.
00:28:55.200 | And we want to make sure that it's easy to connect the two up
00:29:00.200 | and go the extra step.
00:29:00.200 | Start from a basic understanding of language
00:29:02.120 | and move forward to building the specific applications, which
00:29:05.680 | is-- now Ines will be talking about that aspect
00:29:09.600 | of the sort of intended package.
00:29:12.240 | [LAUGHTER]
00:29:18.680 | Yeah.
00:29:19.680 | [APPLAUSE]
00:29:21.600 | Right.
00:29:29.480 | So to--
00:29:30.040 | [INAUDIBLE]
00:29:32.040 | Yes, certainly.
00:29:34.080 | So Dilip asked what the sort of overall difference
00:29:37.200 | or main-- most important difference is between spaCy's
00:29:39.400 | parsing algorithm and Stanford's parsing algorithm.
00:29:42.160 | So amongst other things, the sort
00:29:43.680 | of most fundamental difference is that Stanford's system
00:29:48.960 | is a graph-based parser.
00:29:50.520 | So this is O(n) squared, or maybe O(n) cubed,
00:29:55.720 | in length of the sentence.
00:29:57.120 | So you're unable to use this type of parsing algorithm
00:30:00.920 | for joint segmentation and parsing.
00:30:03.520 | You have to have a pre-segmented text, which
00:30:05.880 | is why it has this disadvantage on languages, or text which
00:30:11.920 | is more difficult to segment into sentences.
00:30:14.200 | So in Spacey, we want to make sure
00:30:15.560 | that we basically only use linear time algorithms.
00:30:19.720 | And that's why we only take this transition-based approach.
00:30:25.640 | Other reasons sort of why they get such a good result.
00:30:29.800 | Other people have done graph-based models,
00:30:31.880 | and they're not nearly as accurate.
00:30:34.080 | So I hope to meet the Stanford team in the next couple of days
00:30:38.160 | and shake out the details of why this system is so accurate.
00:30:42.280 | Because, actually, it is quite surprising.
00:30:44.680 | I've read their papers several times,
00:30:46.180 | and I can't get the sort of one key insight that
00:30:48.920 | means that their system performs so well.
00:30:51.040 | It's interesting.
00:30:56.080 | [INAUDIBLE]
00:31:02.200 | So I think that--
00:31:04.540 | Right, yes, certainly.
00:31:07.320 | So the question, which is a very good one
00:31:10.440 | that many people have been thinking about,
00:31:12.800 | is to what extent can end-to-end systems, which maybe
00:31:17.320 | learn things about syntax, but learn them latently
00:31:19.680 | and don't have an explicit syntactic representation
00:31:22.200 | internally, replace the need for this type
00:31:25.180 | of syntactic processing?
00:31:27.320 | So I would say that for any application where
00:31:30.680 | there's sufficient text, currently the best approach
00:31:34.440 | or the state-of-the-art approach doesn't use a parser.
00:31:37.480 | And actually, this includes translation and other things
00:31:40.240 | where you would kind of expect that having
00:31:42.320 | an explicit syntactic layer would help.
00:31:44.200 | If there's enough text, it seems that going straight
00:31:46.320 | to the end-to-end representation tends to be better.
00:31:49.320 | However, that does involve having a lot of text.
00:31:51.680 | And for most applications, creating that much training
00:31:55.200 | data, especially initially when you're prototyping,
00:31:58.280 | tends not to be such a viable solution.
00:32:00.400 | So the way that I see it is that the parsing stuff
00:32:04.560 | is a great scaffolding.
00:32:05.960 | And it's a very practical thing to have in your toolbox,
00:32:09.000 | especially when you're trying to figure out
00:32:10.660 | how to model the problem.
00:32:12.000 | So because otherwise, you end up in this chicken and egg
00:32:14.080 | situation of, well, we need lots of data
00:32:16.580 | to make our model work well.
00:32:18.680 | And otherwise, it just doesn't really get off the ground.
00:32:21.040 | But then how do we even know that we're
00:32:22.800 | collecting the right data for the right model
00:32:24.480 | until we have that data collected
00:32:25.840 | and we can see the accuracy?
00:32:27.680 | So if you can take sort of smaller steps
00:32:30.120 | using these sort of rule-based scaffolding and bootstrapping
00:32:32.920 | approaches, I think you have a much more powerful and practical
00:32:36.200 | set of tools.
00:32:37.240 | And then finally, once you have a system
00:32:39.480 | that you know you want to eke out every percent,
00:32:42.080 | maybe you end up collecting enough data
00:32:43.620 | that you don't need a parser in your solution explicitly.
00:32:47.560 | [INAUDIBLE]
00:32:49.360 | So Dilip has pointed to a paper that recently showed
00:32:58.800 | that, you know, BiLSTM models don't necessarily
00:33:02.480 | learn long-range dependencies.
00:33:04.320 | I think that that's probably true.
00:33:06.560 | But as somebody who's worked on parsing for a lot of my career,
00:33:11.800 | I try to remind myself not to cherry pick results.
00:33:14.680 | And even if I do find a paper that shows that parsing
00:33:17.680 | works on something, well, the overall trend
00:33:19.720 | is that BLSTM models which don't use parsing work well.
00:33:24.400 | And the fact is that long range dependencies are kind of rare.
00:33:27.440 | So that's basically why it's important
00:33:34.920 | to be asking, well, what are these things good for,
00:33:38.200 | and not say, oh, everything should be using parsing.
00:33:41.060 | Because it's true that not everything should.
00:33:43.240 | [INAUDIBLE]
00:33:45.240 | So the question is, if we look at other aspects of language
00:33:57.280 | variation instead of just, say, the segmentation and things,
00:34:01.560 | how does the incremental model perform?
00:34:03.480 | So specifically, how does it perform
00:34:05.200 | in free word order languages, perhaps ones
00:34:08.680 | with longer range or crossing dependencies?
00:34:12.240 | So Stanford, actually, their paper
00:34:14.340 | had excellent analysis about a lot of these questions.
00:34:16.540 | And so they showed that their model, which
00:34:19.980 | is much less sensitive to whether the trees are
00:34:22.220 | projective, does relatively well in those languages.
00:34:26.580 | So for our preliminary results, we
00:34:33.840 | do fine on German and pretty well on Russian.
00:34:37.660 | We still suck at Finnish.
00:34:39.240 | And I think there's a bug in Korean.
00:34:43.220 | It's at like 50%.
00:34:45.600 | So it's a mixed bag.
00:34:47.960 | But I would say that there's some problems
00:34:50.320 | to solve about the projectivity.
00:34:52.400 | The way that I'm doing this is a little bit crude at the moment.
00:34:54.820 | So in general, there is a disadvantage
00:35:00.520 | that we take from the incremental approach in this.
00:35:02.920 | And there's a lot of clever solutions
00:35:05.160 | that I'm looking into for this.
00:35:07.040 | So yeah.
00:35:08.760 | [INAUDIBLE]
00:35:12.760 | So there's a pretty good extension package
00:35:15.640 | for coreference resolution that has taken some of the pressure
00:35:19.560 | off us to support it internally.
00:35:21.720 | We do think that coreference resolution is something
00:35:23.920 | that does belong in the library, because it's
00:35:25.720 | something that does have that property of being
00:35:27.960 | a language internal thing.
00:35:29.200 | I think that there's a truth about whether that he or she
00:35:31.660 | refers to that noun that doesn't depend on the application.
00:35:34.800 | It's just a true fact about that sentence.
00:35:36.760 | So we're very interested in being
00:35:38.200 | able to give you that piece of annotation.
00:35:40.000 | I wouldn't quite say the same thing about the sentiment.
00:35:43.720 | I don't quite know--
00:35:45.720 | I haven't been convinced by any schema of sentiment
00:35:48.320 | that is sufficiently independent of what you're trying to do
00:35:51.520 | that we could provide it.
00:35:52.680 | Instead, what we do provide you is a text categorization
00:35:54.940 | library.
00:35:55.640 | And the text categorization model that we have
00:35:58.120 | is only one of many that you might build.
00:36:01.700 | And it's not best for every application.
00:36:04.040 | But it does do pretty well for short text.
00:36:07.680 | And I think that on many sentiment benchmarks,
00:36:11.880 | it performs quite well.
00:36:13.360 | It's a lot slower than some other ways
00:36:16.520 | that you could do sentiment.
00:36:17.800 | So it depends on what type of text you're trying to process
00:36:20.800 | and that sort of thing.
00:36:22.240 | [INAUDIBLE]
00:36:24.960 | Well, oh, yes.
00:36:25.640 | So explicitly, the coreference resolution package
00:36:31.560 | that you should use is called MuralCoref.
00:36:33.760 | So MuralCoref.
00:36:34.660 | [INAUDIBLE]
00:36:36.240 | Yeah.
00:36:37.240 | Yeah, and it's built on PyTorch.
00:36:39.640 | It's overall pretty good.
00:36:40.720 | You can train it yourself.
00:36:42.120 | Yeah, well, PyTorch is the machine learning layer.
00:36:45.340 | But yes, it's built on spaCy.
00:36:46.600 | So yes.
00:36:47.080 | So for German, I think it's pretty easy.
00:37:00.520 | I've been using the word vectors trained by fastText.
00:37:03.920 | And you can basically just plug that
00:37:06.880 | in so there's one command to convert that
00:37:10.040 | into a spaCy vocab object and load it up.
00:37:14.000 | We're trying to provide pre-trained models which
00:37:15.880 | don't depend on pre-trained word vectors
00:37:18.720 | so that you can bring your own.
00:37:19.880 | Because otherwise, there's this conflict
00:37:21.520 | of the model's been trained to expect some word vectors.
00:37:25.120 | And then if you sub your own in, it's
00:37:26.760 | going to get different input representations.
00:37:30.240 | But yeah, training or bringing your own vectors
00:37:34.120 | is designed to be pretty easy.
00:37:35.920 | And if it's not, I apologize if there's bugs,
00:37:38.440 | and we'll try to fix them.
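As a rough sketch of bringing your own vectors in (the file name is a placeholder, and newer spaCy versions also ship a command-line converter for this, e.g. spacy init vectors in v3):

```python
import numpy
import spacy

nlp = spacy.blank("de")  # start from a blank German pipeline

# Read a plain-text fastText .vec file and add each vector to the vocab.
# The path is a placeholder for whichever vectors you downloaded.
with open("fasttext.de.300.vec", encoding="utf8") as f:
    f.readline()  # first line is "<number of vectors> <dimensions>"
    for line in f:
        pieces = line.rstrip().split(" ")
        word = pieces[0]
        vector = numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)
```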
00:37:40.140 | So the question is, after parsing and interpreting,
00:37:50.240 | do we have an interlingual representation that
00:37:52.520 | can then be used to generate another language?
00:37:55.440 | The answer is probably not.
00:37:56.760 | I mean, we don't have generation capabilities in spacey.
00:38:00.600 | People have worked on this sort of thing.
00:38:02.840 | But in general, having an explicit interlingual
00:38:07.160 | tends to perform less well than more brute force
00:38:10.080 | statistical approaches to syntax.
00:38:12.080 | And I think the reason does sort of make sense
00:38:14.880 | that the languages are pretty different in the way
00:38:18.920 | that they phrase things and the way
00:38:20.120 | that they model the world in lots of ways.
00:38:21.800 | And so getting a translation that's remotely
00:38:24.680 | idiomatic out of that sort of interlingual representation
00:38:26.960 | is pretty tough.
00:38:28.920 | And then there's another argument
00:38:30.360 | that you're solving a subproblem that's
00:38:33.320 | harder than the direct translation approach, which
00:38:37.280 | I'm not sure whether I buy that argument or not.
00:38:39.280 | But it's a common one that people use.
00:38:40.840 | OK, so should we move forward to the next talk?
00:38:48.480 | Thanks.
00:38:50.480 | [APPLAUSE]
00:38:52.440 | So yeah, we started out by hearing a lot about the more
00:39:04.840 | theoretical side of things.
00:39:06.040 | And I'm actually going to talk about how we collect and build
00:39:10.360 | training data for all these great models we can now build.
00:39:12.680 | And the nice thing about machine learning
00:39:15.760 | is that, well, we can now train a system
00:39:17.800 | by just showing examples of what we want.
00:39:20.000 | And that's great.
00:39:20.680 | But the problem is, of course, we need those examples.
00:39:23.560 | And even if you're like, oh, I got this all figured out.
00:39:26.080 | Are you using this amazing unsupervised method
00:39:28.520 | where my system just infers the categories from the data
00:39:32.760 | and I never need to label any data?
00:39:34.760 | That's pretty nice, but you still
00:39:36.240 | need some way of evaluating your system.
00:39:38.160 | So we pretty much always need some form of annotations.
00:39:42.800 | And now the question is, well, why do we even care about this?
00:39:47.280 | Why do we care about whether this is efficient,
00:39:50.960 | whether this works or not?
00:39:53.120 | The thing is, the big problem is that we actually,
00:39:56.920 | with many things in data science and machine learning,
00:39:59.600 | we need to try out things before we know whether they work.
00:40:02.720 | Or we often don't know whether an idea is going
00:40:04.720 | to work before we try it.
00:40:05.840 | So we need to expect to do annotation lots of times
00:40:09.720 | and start off from scratch.
00:40:12.080 | Start all over again if we fucked up our label scheme.
00:40:15.920 | Try something else.
00:40:16.720 | We need to do this lots of times, so it needs to work.
00:40:19.440 | And similarly, especially if you're
00:40:23.040 | working in a company in a team where you really
00:40:26.680 | want to use your model to find something out,
00:40:29.560 | ideally the person building the model
00:40:31.480 | should be involved in that process.
00:40:33.760 | And also, we always say good annotation teams are small.
00:40:36.800 | A lot of people don't understand this.
00:40:38.340 | There's a lot of movement towards,
00:40:42.120 | oh, let's crowdsource this, get hundreds of volunteers,
00:40:44.640 | and we always have to remind, especially companies,
00:40:47.600 | that well, look at the big corpora
00:40:50.040 | that we use to train models.
00:40:51.920 | The good ones were produced by very few people,
00:40:54.120 | and there's a reason for that.
00:40:57.000 | More people doesn't always mean better results, actually
00:40:59.520 | quite the opposite.
00:41:00.240 | So how great would it be if actually the developer
00:41:03.920 | of the model could be involved in labeling the data?
00:41:08.600 | And of course, we also have the problem of the specialist
00:41:11.680 | knowledge, especially in industries where this matters.
00:41:16.920 | You might want to have a medical professional give some feedback
00:41:20.680 | on the labels, or actually really label your data,
00:41:23.480 | or maybe a finance expert.
00:41:26.000 | And yeah, those people usually have limited time.
00:41:28.080 | If you get an hour of their time,
00:41:29.520 | you want to use it more efficiently,
00:41:31.400 | and you don't want to bore them to death,
00:41:33.800 | or actually find the one person who has nothing else to do,
00:41:36.800 | because their knowledge is probably not
00:41:39.260 | as valuable as other experts' knowledge.
00:41:43.560 | And yeah, another big problem, since you want humans,
00:41:49.640 | is that humans are actually--
00:41:52.040 | humans kind of suck.
00:41:53.920 | We're not that efficient at a lot of things.
00:41:57.880 | So for example, we really have problems
00:41:59.960 | performing boring, unstructured tasks,
00:42:02.580 | especially things that require multiple steps and multiple
00:42:04.920 | things we need to get right.
00:42:06.240 | We can't remember stuff.
00:42:08.640 | We're bad at consistency and getting stuff right.
00:42:12.640 | So fortunately, computers are really good at that stuff.
00:42:16.880 | And in fact, it's probably also the main reason
00:42:19.180 | we built computers.
00:42:20.840 | So there's really no need to waste the human's time
00:42:25.480 | by making them do stuff that they're
00:42:26.760 | going to do badly anyways.
00:42:28.160 | And instead, we want our annotation tooling
00:42:31.040 | to be as automated as possible.
00:42:33.760 | Or in general, we want to automate as much as possible,
00:42:36.240 | and really have the human focus on the stuff
00:42:37.920 | that the human is good at, and we really need that input.
00:42:40.720 | And that's usually context, ambiguity, stuff
00:42:43.920 | like we can look at a sentence, and most of us
00:42:46.080 | will be able to understand a figure of speech
00:42:48.160 | immediately without thinking twice about it.
00:42:50.120 | That's the stuff that's really, really hard for a computer.
00:42:53.320 | Also, put differently, humans are good at precision.
00:42:56.520 | Computers are good at recall.
00:42:58.720 | So the thing is, yeah, what I'm saying here,
00:43:01.880 | it sounds a bit like "floss and eat your veggies."
00:43:06.040 | Yeah, we probably will have had some experience
00:43:08.600 | with labeling data.
00:43:09.560 | And normally, yeah, we also gave this talk
00:43:13.080 | to a crowd of more data science focused industry professionals.
00:43:19.320 | And actually, you'd be surprised how many companies we talked
00:43:24.440 | to, also very large companies, very actually
00:43:26.560 | technologically sophisticated companies,
00:43:28.760 | that mostly use Excel spreadsheets for everything.
00:43:31.880 | And it's not inherently bad, but they
00:43:35.020 | are very obvious problems with Excel spreadsheets.
00:43:37.800 | And there's definitely a lot of room for improvement.
00:43:39.960 | So once people figure this out and realize
00:43:42.800 | that maybe they could do something better,
00:43:44.800 | or it's just terrible, like we don't want to do this,
00:43:47.120 | the next move is normally, let's move this all out
00:43:49.480 | to Mechanical Turk or some other crowd-sourced platform.
00:43:53.440 | And yeah, Mechanical Turk, the Amazon cloud of human labor.
00:44:00.600 | And so, yeah, people do that.
00:44:01.960 | And then I was also surprised that their results are not
00:44:04.320 | very good.
00:44:04.960 | And the problem is, yeah, OK.
00:44:07.280 | So you have some guy do it for $5 an hour,
00:44:10.560 | get the data back, train your model doesn't work.
00:44:13.520 | And actually, it's very difficult to then retroactively
00:44:17.720 | find out what the problem was.
00:44:18.880 | Maybe your label scheme was bad.
00:44:21.160 | Maybe your idea was bad.
00:44:23.200 | Maybe the data was bad.
00:44:24.280 | Maybe you didn't write your annotation manual properly.
00:44:28.280 | Maybe-- actually, yeah, another nice thing.
00:44:30.320 | Maybe you paid too much, because if you
00:44:32.680 | pay too much on Mechanical Turk, you attract all the bad actors.
00:44:35.600 | So you kind of have to stick to about half the minimum wage.
00:44:40.200 | So that could have been a problem.
00:44:42.040 | Maybe your model was bad.
00:44:43.160 | Your training code was bad.
00:44:44.240 | It's very, very difficult to find that out.
00:44:46.640 | And also, you realize that, well, it's not really just
00:44:49.440 | the cheap click work.
00:44:53.000 | You need more than that.
00:44:53.000 | So then, yeah, what most people conclude from this
00:44:56.160 | is, fuck this labeling in general.
00:44:59.840 | I don't want to do this anymore.
00:45:01.240 | Let's just find some unsupervised method
00:45:03.880 | and not bother with this.
00:45:05.080 | And that's actually-- yeah, also, the conversation
00:45:09.120 | I had recently where we talked to a larger media company,
00:45:11.840 | and they'd done exactly that.
00:45:13.040 | And now they have a few hundred clusters.
00:45:15.320 | And it's really great.
00:45:16.240 | They have really great clusters.
00:45:17.560 | But now, their problem is that they
00:45:19.920 | have no idea what these clusters are.
00:45:21.840 | So they now need to label their clusters.
00:45:23.840 | And now, they're kind of back in the beginning.
00:45:25.960 | And I think what we see from this
00:45:28.040 | is that the labeled data itself, the fact
00:45:30.560 | that we need labeled data, that's an opportunity.
00:45:32.560 | That is not the problem.
00:45:33.560 | The problem is how we do it.
00:45:35.960 | And yeah, so we've been thinking about this a lot.
00:45:41.520 | At least, from our point of view,
00:45:43.480 | there are a lot of things we could do better.
00:45:46.320 | So one of the things, really, to work against this problem
00:45:50.400 | that we have caused by us being human
00:45:53.400 | is that we need to break down these very complex things we're
00:45:58.240 | asking the humans into smaller, simpler questions.
00:46:02.120 | And ideally, these should be binary decisions.
00:46:04.840 | So we can have a much better annotation speed
00:46:07.240 | because we can move through the things faster.
00:46:09.240 | And we can also measure the reliability much easier
00:46:13.080 | than if we ask people open questions.
00:46:14.960 | Because we can actually say, OK, do our annotators agree?
00:46:17.420 | Do they not agree?
00:46:18.680 | Because that's, in the end, very important
00:46:20.720 | to find out whether we've collected data the right way.
00:46:23.760 | And the binary thing itself, it sounds a bit radical.
00:46:28.560 | But actually, if you think about it, pretty much
00:46:31.600 | any task can be broken down into a sequence of binary,
00:46:35.960 | yes-or-no decisions.
00:46:38.400 | It might mean that we have to accept that, OK,
00:46:40.640 | if we're annotating a sentence for entities,
00:46:43.120 | we won't actually end up with gold-standard data
00:46:46.960 | for this sentence.
00:46:48.040 | We might actually end up with only partially annotated data.
00:46:51.000 | And we have to deal with that.
00:46:52.880 | But still, we're actually able to use our human's time
00:46:56.360 | more efficiently, which is often much more important.
00:47:00.120 | So a lot of examples I'm going to show you now
00:47:05.200 | from using our annotation tool Prodigy, which, yeah,
00:47:08.920 | we started building as an internal tool.
00:47:11.540 | But we very, very quickly realized
00:47:12.880 | that, OK, this is really something pretty much every
00:47:15.400 | company we talk to, most users we talk to,
00:47:18.080 | this was always something that kept coming up.
00:47:20.440 | So we thought, OK, what if we really combine all these ideas
00:47:24.760 | we already have, and how to train a model,
00:47:27.080 | actually use the technology we're working with within the tool,
00:47:31.200 | and also use the insights we have from user experience,
00:47:36.240 | and how to get humans to do stuff most efficiently,
00:47:41.480 | how to get humans excited, actually,
00:47:42.880 | even the whole idea of gamification,
00:47:45.160 | how to get humans to really stick to doing something,
00:47:50.920 | and put this all into one tool, and that's Prodigy.
00:47:54.520 | And so here, we see some examples of those tasks,
00:48:00.760 | and how we can present things in a more binary way.
00:48:03.760 | So in the top left, we have an entity task.
00:48:08.440 | So here, this comes from Reddit, and we're
00:48:11.440 | labeling whether something is a product or not.
00:48:14.000 | And what we did here is we load in a spaCy model,
00:48:17.760 | asked the model to label the products,
00:48:20.400 | and then we look at them and say yes or no.
00:48:23.520 | Or we can also use a mode where we can then actually
00:48:27.240 | click on this, remove this, label something else.
00:48:31.360 | But still, you see, OK, we don't have to do this
00:48:34.040 | in an Excel spreadsheet.
00:48:35.000 | We actually get one question, we look at this,
00:48:37.120 | and pretty much immediately, we can say yes or no.
00:48:42.400 | The same here, on the right, they were using--
00:48:44.800 | I think this is actually a real example using the YOLO2 model
00:48:49.160 | with the default categories.
00:48:51.160 | And we have an image of a skateboard.
00:48:53.640 | We could say, is this a skateboard, yes or no?
00:48:57.000 | And yeah, immediately, have our annotations here.
00:49:01.320 | And even this one in the corner, even
00:49:03.360 | if we're not able to really break it down
00:49:06.000 | into a true binary task, we can still
00:49:07.640 | make it more efficient and easier for a human to answer.
00:49:12.640 | Because here, with keyboard shortcuts,
00:49:14.680 | you can still do maybe two, three seconds per annotation
00:49:19.200 | and you have an answer.
00:49:20.800 | Or we say, hey, it's actually so fast,
00:49:23.000 | if we can get to one second, we might as well
00:49:26.400 | label our entire corpus twice, positive, negative,
00:49:30.840 | other labels we want to do, and just move through it quicker.
00:49:36.360 | And yeah, to give you some background on why did we do
00:49:40.840 | this, what do we think Prodigy should achieve,
00:49:46.120 | we really think that, OK, we want
00:49:48.560 | to be able to make annotation so efficient that data scientists
00:49:51.760 | can do it themselves.
00:49:52.600 | Or here, what we call data scientists
00:49:55.000 | can also be researchers and people working with the data,
00:49:58.000 | people training the model.
00:50:01.440 | Yeah, reading it like that, it still doesn't sound like fun.
00:50:03.880 | But the idea is, we could really make a process that's
00:50:07.880 | efficient that you actually really want to do this
00:50:10.040 | because you don't have to depend on anyone else.
00:50:12.640 | You can just get the job done and see
00:50:15.040 | whether your idea works or not.
00:50:16.200 | And the same-- yeah, and this also
00:50:18.600 | means you can iterate faster.
00:50:20.040 | We're very used to, OK, you iterate on your code,
00:50:22.040 | but you can actually iterate on your code and your data.
00:50:24.240 | You try something out, doesn't work, try something else.
00:50:28.320 | Maybe see, OK, is it going to work
00:50:29.960 | if I collect more annotations?
00:50:31.600 | You can all try this out.
00:50:33.400 | And we also want to waste as little time as possible
00:50:38.160 | and use what the model already knows
00:50:41.280 | and have the human correct its predictions instead of just
00:50:43.800 | having a human do everything from scratch.
00:50:45.760 | And as a library itself, we really
00:50:49.040 | want Prodigy to fit into the Python ecosystem.
00:50:52.600 | We want it to be customizable, extensible in Python.
00:50:56.440 | You can write scripts for it.
00:50:57.800 | And we also-- it was a very conscious decision
00:51:00.960 | not to make it a SaaS tool, because we think data privacy
00:51:03.800 | is important.
00:51:05.520 | You shouldn't have to send your text to our servers
00:51:07.960 | for no reason.
00:51:09.040 | And we also think you shouldn't be locked in.
00:51:11.520 | Like, you should get a JSON format
00:51:13.080 | out that you can use to train your models however you like,
00:51:15.560 | and not our random format that you can then
00:51:18.920 | download from our servers.
00:51:20.960 | So that's where we're going with Prodigy.
00:51:23.000 | And here's a very simple illustration
00:51:27.680 | of how the app looks. At the center are recipes,
00:51:31.560 | which are very simple
00:51:32.440 | Python scripts that orchestrate the whole thing.
00:51:35.120 | You have a REST API that communicates with the web app
00:51:38.360 | naturally so you can see things on the screen.
00:51:42.800 | You have your data that's coming in, which is text images.
00:51:46.840 | And you can have an optional model state
00:51:49.480 | that's updated in a loop, if you want that.
00:51:52.200 | And then the model communicates with the recipe.
00:51:58.440 | As the user annotates, it's updated in a loop
00:52:02.640 | and can suggest more annotations that are more compatible
00:52:08.040 | with the annotator's recent decisions.
00:52:10.200 | And yeah, there's a database and a command line interface
00:52:13.160 | so you can actually use it efficiently
00:52:16.360 | and don't have to worry about these aspects.
00:52:19.160 | So here, can you see?
00:52:20.200 | Yeah, in the corner we have a simple example
00:52:23.560 | of a recipe function, which really is just a Python
00:52:27.280 | function.
00:52:28.320 | You load your data in and then you return this dictionary
00:52:31.120 | of components, for example, an ID of the data set,
00:52:34.680 | how to store your data, a stream of examples.
00:52:36.880 | You can pass in callbacks to update your model,
00:52:40.480 | things to execute before the thing starts.
00:52:43.080 | So the idea is really, OK, if you need to load something in,
00:52:46.760 | if you can write that in Python, you can do it in Prodigy.
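To make that concrete, here is a minimal sketch of what such a recipe function can look like. The recipe name, source file, and label are placeholders, and the dictionary keys follow Prodigy's documented recipe API, which may differ slightly across versions.

```python
# Minimal sketch of a Prodigy recipe: a plain Python function that returns a
# dictionary of components. "classify-fruit", FRUIT and the source file are
# placeholder names, not anything from the talk.
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("classify-fruit")
def classify_fruit(dataset, source):
    # Stream of annotation tasks: one dict per example, each a binary question.
    stream = ({"text": eg["text"], "label": "FRUIT"} for eg in JSONL(source))
    return {
        "dataset": dataset,           # ID of the dataset the answers go into
        "view_id": "classification",  # which annotation interface to render
        "stream": stream,             # the examples to annotate
        # optional keys: an "update" callback to train a model in the loop,
        # "on_load" / "on_exit" hooks to run before and after the session
    }
```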
00:52:50.640 | And you can also-- we provide a bunch of pre-built-in recipes
00:52:58.560 | for different tasks with some ideas of how
00:53:01.400 | we think it could work, like named entity recognition.
00:53:04.760 | For example, you can use the model,
00:53:06.800 | correct its predictions.
00:53:07.880 | You can use the model, say yes or no, to things.
00:53:11.000 | You can use it for dependency parsing and look at an arc
00:53:15.960 | and annotate that.
00:53:16.880 | We have recipes that use word vectors
00:53:20.120 | to build terminology lists, text classification.
00:53:22.880 | So there's also a lot that you can mix and match creatively.
00:53:26.520 | For example, you have the multiple choice example
00:53:30.640 | that's not really tied to any machine learning task,
00:53:34.280 | but it fits pretty much into any of these workflows
00:53:37.560 | that you might be doing.
00:53:38.920 | And of course, the evaluation is also
00:53:40.640 | something we think is very, very important
00:53:42.360 | and is often neglected, especially in more industry use
00:53:47.480 | cases.
00:53:49.320 | But we think there's actually-- A/B evaluation is actually
00:53:51.520 | a very powerful way of testing whether your output is really
00:53:57.720 | what you want it to be.
00:54:00.920 | And so here we see an example of how
00:54:05.560 | you can chain different workflows together,
00:54:07.960 | all using models, word vectors, things you already
00:54:10.600 | have in order to get where you want to get to faster.
00:54:14.400 | So here, a simple example, we want to label fruit.
00:54:20.200 | It's kind of a stupid example because it's that--
00:54:22.800 | I can't think of many use cases where you actually
00:54:25.520 | want to do that, but it makes a great illustration here.
00:54:30.160 | So yeah, we start off, we say, OK, we want fruit.
00:54:33.000 | What are fruit?
00:54:33.760 | We have some examples, apple, pear, banana.
00:54:36.040 | That's what we can think of.
00:54:37.120 | And we also have word vectors that we can use that will easily
00:54:42.320 | give us more terms that are similar to these three fruit
00:54:46.680 | terms that we came up with.
00:54:48.400 | And then we can use this terminology list
00:54:50.840 | that we collected by just saying yes or no to what we've gotten
00:54:54.560 | out of the word vectors, look at those in our data,
00:54:58.240 | and then say whether apples in this context is a fruit or not.
00:55:04.160 | Because we're not just labeling all fruit terms as a fruit
00:55:11.560 | entity, because it could be apple, the company.
00:55:14.200 | But we get to look at it, and it's much more efficient
00:55:16.320 | than if you ask the human to sit through and highlight
00:55:19.440 | every instance of fruit nouns in your text.
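As a rough sketch of that word-vector step (not the exact recipe used in the talk), you can average the seed-term vectors and ask the vocabulary for its nearest neighbours, then review each suggestion with a yes/no decision. This assumes a spaCy model with word vectors, such as en_core_web_lg, is installed.

```python
# Rough sketch of bootstrapping a terminology list from word vectors.
# Assumes a spaCy model with vectors (e.g. en_core_web_lg); Prodigy's terms
# recipes wrap a similar idea in a review interface.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
seeds = ["apple", "pear", "banana"]

# Average the seed vectors into a single query vector.
query = np.mean([nlp.vocab[w].vector for w in seeds], axis=0)

# Find the most similar entries in the model's vector table.
keys, _, scores = nlp.vocab.vectors.most_similar(
    np.asarray([query], dtype="float32"), n=20
)
candidates = [nlp.vocab.strings[int(k)] for k in keys[0]]
print(candidates)  # each candidate then gets a simple accept/reject decision
```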
00:55:26.680 | And so this also leads to one of our main aspects
00:55:34.000 | of the tool, workflows that we're especially proud of
00:55:36.360 | and that we think really can make a difference, which is we
00:55:39.280 | can actually start by telling the computer more abstract
00:55:43.280 | rules of what we're looking for and then annotating
00:55:45.560 | the exceptions instead of really starting from scratch.
00:55:48.920 | Or we can even use the technology
00:55:51.360 | we're working with to build these semi-automatically using
00:55:55.000 | word vectors, using other cool things that we can now do.
00:55:58.040 | And then, of course, also specifically
00:56:00.920 | look at those examples that the statistical model we
00:56:04.920 | want to train is most uncertain about.
00:56:07.000 | So we try to avoid the predictions
00:56:10.440 | where we can be pretty sure that they're correct
00:56:12.920 | and actually really ask the human first about the stuff
00:56:17.760 | that's 50/50 and where really the human feedback makes
00:56:21.760 | most of the difference.
00:56:23.560 | And so here's a quick example.
00:56:25.600 | Let's say, OK, we want to label locations.
00:56:30.560 | We start off with one city, San Francisco.
00:56:33.560 | And then we look at what else is similar to that term.
00:56:36.840 | So these are actually real suggestions
00:56:38.840 | from that Sense2Vec model that Matt showed earlier.
00:56:42.080 | And as you can see, the nice thing
00:56:44.640 | is we're using word vectors.
00:56:45.920 | We're not using a dictionary.
00:56:47.120 | So we are going to annotate California and maybe
00:56:49.680 | University of San Francisco.
00:56:51.000 | But we're not going to annotate California rolls
00:56:53.840 | because we're in a vector space and we
00:56:55.720 | know that what we're actually looking for
00:56:57.440 | is at least similar to the real meaning of the word.
00:57:00.480 | And a lot of these are super trivial to answer.
00:57:03.200 | So we can accept them, we can reject them,
00:57:05.240 | or we can ignore them because this is a bit too ambiguous
00:57:08.840 | and we don't actually want that in our list
00:57:11.040 | because it can mean too many things.
00:57:12.920 | And then from here, we can actually
00:57:15.400 | create a pattern that uses spaCy's attributes,
00:57:20.080 | or in this case, the lowercase form of the token, and GPE,
00:57:26.840 | which stands for geopolitical entities or anything
00:57:30.040 | with a government.
00:57:31.320 | And that's what we're trying to label.
00:57:32.760 | So we can easily build up these rules very quickly,
00:57:35.960 | very automated, and then we have a bunch of locations
00:57:40.080 | that we can then match in our text.
00:57:41.920 | So here, it found a mention of Virginia,
00:57:45.680 | which we can then accept.
00:57:47.520 | So that's a very, very simple example of this.
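For illustration, patterns like that can be written out as one JSON object per line, using spaCy token attributes such as the lowercase form. The file name and entries below are just examples, not the ones from the talk.

```python
# Sketch of writing a small match-patterns file (one JSON object per line)
# built from the lowercase form of each token. File name and entries are
# placeholders.
import json

patterns = [
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
    {"label": "GPE", "pattern": [{"lower": "virginia"}]},
    {"label": "GPE", "pattern": [{"lower": "bay"}, {"lower": "area"}]},
]

with open("gpe_patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")
```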
00:57:49.920 | But of course, this also works for slightly more complex
00:57:54.640 | constructs where we can really take advantage
00:57:57.340 | of the syntactic structure.
00:57:59.040 | So here, this was a finance example.
00:58:01.880 | So what we're trying to do is we want
00:58:03.400 | to extract information about executive compensation.
00:58:07.840 | So yeah, some executive receives some amount of money
00:58:11.880 | in stock, for example, like this one.
00:58:14.040 | And this is a pretty difficult task.
00:58:17.240 | But also, the idea here is we have this theory
00:58:20.480 | that maybe if we could train a model, a text classification
00:58:23.160 | model, to predict whether a sentence is
00:58:26.520 | about executive compensation or not,
00:58:29.120 | we can then very, very easily use what we already
00:58:32.720 | know about the text to extract, let's say, the first person
00:58:35.680 | entity.
00:58:36.600 | We extract the amount of money, put that in our database.
00:58:39.480 | And we've actually-- yeah, we found a good solution
00:58:42.660 | for an otherwise very, very complex task.
00:58:45.640 | So for this, this is just an idea.
00:58:48.920 | We haven't tried this in detail, but one possible pattern
00:58:52.400 | using token attributes we have available
00:58:55.680 | would be let's try and look for an entity type person,
00:59:00.960 | followed by a token with a lemma receive.
00:59:05.240 | So received, receives, receiving, and followed by a token
00:59:09.920 | with the entity type money.
00:59:11.680 | And let's just look at what this pulls up.
00:59:15.040 | That's an idea.
00:59:15.520 | I mean, there are plenty of other possible patterns
00:59:19.580 | you can come up with.
00:59:20.740 | And the nice thing is we're actually
00:59:22.280 | going to be looking at them again in context.
00:59:24.320 | So they don't have to be perfect.
00:59:25.920 | And even actually, in fact, even if it pulls up random stuff
00:59:29.320 | that you realize is totally not what you want,
00:59:32.960 | this is also very important.
00:59:34.280 | Because you won't only be collecting annotations
00:59:37.580 | for the things you know are definitely right.
00:59:39.720 | You're also collecting annotations
00:59:41.960 | for the things that are very, very similar or look very,
00:59:44.420 | very similar to what you're looking for but are actually
00:59:46.680 | not what you're looking for.
00:59:47.920 | And that's probably just as important
00:59:50.840 | as the positive examples.
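The idea sketched above can be tried directly with spaCy's rule-based Matcher. This assumes a recent spaCy version and an installed English pipeline, and the three parts are rarely adjacent in real text, so treat it as a starting point rather than the actual pattern from the talk.

```python
# Sketch of the "PERSON receives MONEY" idea with spaCy's rule-based Matcher.
# Assumes a recent spaCy version and an installed English pipeline; in real
# text the three parts are rarely adjacent, so this is only a starting point.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"ENT_TYPE": "PERSON"},  # a token inside a PERSON entity
    {"LEMMA": "receive"},    # received / receives / receiving
    {"ENT_TYPE": "MONEY"},   # a token inside a MONEY entity
]
matcher.add("EXEC_COMP", [pattern])

doc = nlp("Smith received $10 million in stock awards last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```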
00:59:54.240 | So yeah, the moral of the story is what we're saying
00:59:57.600 | is we're very used to iterating on our code as programmers.
01:00:02.720 | But you should really be iterating on both.
01:00:04.800 | The data is just as important.
01:00:07.120 | So as we see here, OK, that's the normal type of programming.
01:00:10.840 | You have a runtime program.
01:00:13.600 | You work on the source code.
01:00:15.200 | You compile it, get your runtime program.
01:00:17.000 | You don't like something about your program.
01:00:19.120 | You go back, change the source code, compile it, and so on.
01:00:21.520 | That's a pretty standard workflow.
01:00:23.640 | And in machine learning, we don't
01:00:26.400 | have a runtime program in that sense.
01:00:28.120 | We have a runtime model.
01:00:29.640 | So the part we should really be thinking about and working on
01:00:33.400 | is the training data.
01:00:35.000 | Instead, most focus is currently on the training algorithm.
01:00:39.400 | And if you use that analogy, that's
01:00:42.560 | very similar to going and tweaking your compiler
01:00:45.760 | if you're not happy with your runtime program.
01:00:48.640 | You can do that, but of course, you probably go back and edit
01:00:52.400 | your source code.
01:00:53.920 | I think this is actually a pretty good example.
01:00:56.400 | It's pretty accurate.
01:00:59.120 | There are only so many training algorithms,
01:01:01.040 | but what really makes a difference is your data.
01:01:03.360 | So if you have a good way and a fast way of iterating
01:01:06.480 | on that data, and you're able to really master
01:01:11.840 | this part of the problem, you'll also
01:01:13.520 | get to try more things quickly.
01:01:16.720 | As we know, most ideas don't actually work.
01:01:19.800 | It's always one of these things that's kind of misrepresented.
01:01:22.480 | A lot of people have this idea, ooh,
01:01:24.280 | you're doing all these amazing AI things,
01:01:27.000 | and everything just works.
01:01:28.080 | It's like, kind of doesn't.
01:01:29.280 | Nothing works.
01:01:30.520 | And sometimes things work.
01:01:32.920 | And you really want to find the things that actually work.
01:01:35.600 | And for that, you need to try them.
01:01:37.560 | And so it also means if you can actually
01:01:41.560 | figure out what works before you try it and invest in it,
01:01:45.440 | you can actually be more successful overall
01:01:47.200 | because you're not going to waste your time on the things that
01:01:51.280 | might fail, and can instead scale up the things that actually
01:01:55.600 | turn out to work in the first place.
01:01:57.000 | And one thing that's also very important to us
01:01:59.200 | is you can really build custom solutions.
01:02:01.800 | You can build solutions that fit exactly to your use case,
01:02:05.920 | and you'll own them.
01:02:08.480 | If you collect your own data, you'll keep that forever,
01:02:10.680 | and nobody can lock you in.
01:02:11.760 | You're not just consuming some API,
01:02:13.920 | and if that API shuts down, you can start again from scratch.
01:02:18.200 | You have your data, no matter what
01:02:20.000 | other cool things we can do at some point in the future,
01:02:22.760 | you can always go back to your labeled data
01:02:25.720 | and really build your own systems.
01:02:29.360 | And we believe that this is really something
01:02:31.200 | that's very important in the future of the technology.
01:02:34.680 | That's also a reason why we think
01:02:36.200 | AI development in general in companies
01:02:37.920 | should be done in-house.
01:02:40.520 | And, yeah, we're hoping that we can keep providing useful tools
01:02:44.640 | that will make this easier.
01:02:48.280 | Yeah.
01:02:50.120 | [APPLAUSE]
01:02:52.600 | So the question is, yeah, Jeremy thinks
01:03:08.800 | we write very good software, even though we're only two people,
01:03:11.760 | and how are we doing that?
01:03:12.920 | Yeah, that's a very good question.
01:03:14.280 | I mean, we do get this a lot.
01:03:18.000 | I mean, I think it's--
01:03:19.760 | I don't even know where this idea comes from that, like, yeah,
01:03:22.960 | you can scale things up.
01:03:24.240 | Like, I don't know, scaling things up makes things better.
01:03:28.120 | Because I do think, yeah, actually,
01:03:30.120 | the more people you get involved,
01:03:31.480 | you sometimes-- it actually can have a very negative impact
01:03:35.040 | on the quality of the software you produce.
01:03:38.560 | In our case, it's just, OK, it just works.
01:03:40.320 | Like, I also don't like this idea of, oh,
01:03:43.360 | if everyone can do exactly the same thing if they just
01:03:45.600 | work hard, even though people like thinking of it that way.
01:03:48.200 | It's just, OK, in our case, we have a good combination
01:03:51.480 | of things that we like to do, things
01:03:53.560 | that we happen to be good at, and it just works together.
01:03:57.560 | So I guess we are lucky in that way,
01:04:00.440 | but we also cut out a lot of bullshit,
01:04:02.760 | like the amount of meetings we don't take,
01:04:05.080 | the amount of events we don't go to.
01:04:08.000 | I mean, yeah, it's kind of ironic saying that, speaking
01:04:11.600 | at an event, but I really don't normally go to many events.
01:04:17.200 | We don't take coffee dates with random people
01:04:19.720 | we barely know.
01:04:22.800 | Yeah, we mostly, we really just like to write software.
01:04:26.720 | And yeah, we've had some good ideas in the past.
01:04:35.600 | Thanks for making this cool.
01:04:38.080 | I wish I had it two years ago.
01:04:40.560 | Have you done any experiments to see
01:04:42.440 | if there's actually biases and [INAUDIBLE]
01:04:45.400 | to show them your model examples versus just
01:04:48.840 | how many do you think [INAUDIBLE]
01:04:51.320 | you don't look at any trade-offs [INAUDIBLE]
01:04:54.440 | I mean, also, the question is, if we've
01:04:59.440 | done any experiments where we compare the binary decisions
01:05:04.440 | and whether it influences the annotators
01:05:06.720 | versus really doing everything from scratch.
01:05:08.960 | So we haven't done experiments specifically
01:05:11.880 | focusing on the bias because that's, in some sense,
01:05:15.960 | that's difficult because we're looking at the output.
01:05:18.720 | We're looking at, does it improve accuracy?
01:05:21.520 | We've done experiments of manual annotation
01:05:24.160 | versus binary annotation, but also mostly focused
01:05:28.480 | on our own tooling because we think it's kind of useless.
01:05:31.980 | Like, yeah, we can present you a study
01:05:33.480 | where we said, oh, we did stuff in an Excel spreadsheet
01:05:35.760 | and then we did stuff in Prodigy and it was much better.
01:05:38.320 | So it's really mostly focused around our own tooling
01:05:41.560 | and we did find that--
01:05:44.040 | well, it depends on the task you're doing.
01:05:46.600 | That's the other thing.
01:05:47.920 | I feel like giving these answers sounds unsatisfying
01:05:50.400 | because I'm always saying, well, it depends on your data.
01:05:53.120 | But that's also the whole point of it
01:05:56.040 | because we're doing this because your data is different
01:05:59.640 | and there's no one size fits all solution.
01:06:04.440 | But essentially, so we found what--
01:06:06.680 | binary annotation works especially well
01:06:08.280 | if you already have a pre-trained model that
01:06:10.320 | predicts something, ideally also something that's
01:06:13.000 | not completely terrible.
01:06:16.120 | Otherwise, the pattern approach does work very well
01:06:18.840 | on very specific domains.
01:06:22.600 | Like, we did one example of where we labeled drug names
01:06:26.480 | on Reddit, like on r/opiates, which was a pretty good--
01:06:29.920 | this was a pretty good data source
01:06:31.280 | because it's a very specific topic.
01:06:33.400 | And also, it's a subreddit that's very on topic
01:06:35.960 | because people who go on Reddit to discuss opiate use,
01:06:44.080 | usually are very dedicated to talking about this one topic.
01:06:47.080 | So it was a good, interesting data source.
01:06:48.920 | And so what we wanted to do is we labeled drug names,
01:06:54.200 | drugs, and pharmaceuticals in order to, for example,
01:06:59.040 | have a better tool set to really analyze
01:07:01.120 | the content of this subreddit and see how it develops
01:07:04.360 | over time anyway.
01:07:05.400 | So there we found the pattern-based approach
01:07:08.200 | worked very, very well because we have very specific terms.
01:07:11.400 | We can use word vectors to bootstrap these.
01:07:14.600 | Especially also, we can include spelling mistakes and stuff,
01:07:17.840 | which was very interesting.
01:07:18.960 | Like, we can really build up good word lists,
01:07:21.120 | find them in the text, confirm them, and get to pretty decent
01:07:24.120 | accuracy.
01:07:24.960 | I would expect this to work a little less well--
01:07:28.560 | the cold start problem-- on a much more ambiguous domain.
01:07:31.880 | And there, you're probably better off to say, OK,
01:07:33.880 | we're labeling by hand.
01:07:35.480 | But even there, that's something I haven't really
01:07:37.400 | shown in detail here.
01:07:38.200 | But we also have a manual interface
01:07:40.360 | where you highlight.
01:07:42.080 | But what we do there is we use the tokenizer
01:07:44.840 | to pre-segment the text.
01:07:46.560 | So you don't have to sit there and pixel
01:07:48.640 | perfect, like, highlight, and then, ah, shit,
01:07:51.000 | now I've got the white space in.
01:07:52.440 | Let's start again.
01:07:53.480 | So that's another thing we're doing.
01:07:55.440 | You can be much lazier in highlighting.
01:08:00.160 | And also, there, get more efficiency out of it.
01:08:03.840 | And still use a simpler interface.
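For example, spaCy's tokenizer on its own is enough for that pre-segmentation step; a blank pipeline has no statistical models but still splits the text into the tokens a manual interface can snap to. A small sketch, not Prodigy's internal code:

```python
# Small sketch of pre-segmenting text with spaCy's tokenizer so highlighting
# can snap to token boundaries (not Prodigy's internal code).
import spacy

nlp = spacy.blank("en")  # tokenizer only, no statistical models needed
doc = nlp("San Francisco isn't cheap.")
print([token.text for token in doc])
# ['San', 'Francisco', 'is', "n't", 'cheap', '.']
```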
01:08:06.760 | Yeah?
01:08:08.280 | So you mentioned about the [INAUDIBLE]
01:08:12.440 | Yeah.
01:08:13.440 | [INAUDIBLE]
01:08:35.120 | So the question is, first, you gave an example
01:08:41.000 | of annotating patient data, which is obviously very
01:08:45.240 | problematic because doctors are not always very specific
01:08:48.080 | in what they fill in.
01:08:49.040 | And then, in the end, this was how did they enrich that with--
01:08:52.600 | So what they did is they got foundation of the [INAUDIBLE]
01:08:57.960 | Yeah.
01:08:59.960 | Yeah, so basically, OK, the question
01:09:01.320 | is whether we have some experience in the medical field
01:09:04.960 | mixing this.
01:09:06.920 | The answer is, well, we haven't personally done this.
01:09:09.040 | But we do have quite a few companies in that domain,
01:09:13.880 | also because the tool itself is quite appealing
01:09:17.120 | because you can run it in your own compliant environment,
01:09:20.960 | you know, that data privacy aspect.
01:09:23.880 | But it's interesting to explore.
01:09:27.120 | That's maybe also where, OK, having the professionals--
01:09:29.680 | getting the medical professionals more involved
01:09:31.520 | might make sense, which normally is very difficult.
01:09:34.720 | You don't want a doctor to do all the work themselves.
01:09:37.640 | But if you can find some way to distill that and then ask
01:09:40.720 | the doctor, OK, you wrote this here, does that mean--
01:09:44.520 | you wrote x, does that mean y?
01:09:46.160 | And the doctor says, yep.
01:09:47.360 | Or the doctor says, nah.
01:09:49.080 | If you can try this out and extract some information,
01:09:53.400 | well, that could be one idea to solve that, for example.
01:09:56.480 | Yeah, I can definitely see that.
01:09:59.480 | [INAUDIBLE]
01:10:07.920 | You can.
01:10:08.600 | Like right now, we don't have a built-in logic for that,
01:10:13.320 | although we are working on--
01:10:15.080 | oh, sorry, I forgot to repeat the question--
01:10:18.080 | inter-annotator agreement, if you can calculate that
01:10:21.040 | and incorporate that into your model.
01:10:23.120 | So we're actually working on an extension
01:10:25.120 | for Prodigy, which is much more specifically
01:10:27.040 | for managing multiple annotators.
01:10:28.840 | Because the tool here, we really designed specifically
01:10:32.240 | as a developer tool first and then scaling it up a second.
01:10:36.800 | But since you have the binary feedback,
01:10:38.880 | and if you have an idea, if you have an algorithm you want to use
01:10:41.640 | and you know what you want, you can already
01:10:44.400 | do that fairly easily because you can download
01:10:46.960 | all the data as JSON.
01:10:48.640 | You have a key that's answer, which is either
01:10:50.720 | accept, reject, or ignore.
01:10:53.200 | You can attach your own arbitrary data like a user ID.
01:10:57.160 | And then it's fairly trivial to write your own function that
01:11:00.240 | really takes all of this, reads it in, computes something,
01:11:03.760 | and then uses this later on.
01:11:06.200 | So that's definitely possible.
01:11:07.360 | But this is also something we're really interested in exploring
01:11:11.400 | and working on.
01:11:12.160 | And the binary interface is great,
01:11:14.320 | but the trainer kind of is great, but yeah.
01:11:17.400 | Yeah, so we see binary.
01:11:18.840 | That's a big advantage of the binary interface
01:11:21.480 | is that there are only two options.
01:11:25.800 | You filter out the ignored ones, and then you
01:11:28.800 | can really answer that question.
01:11:30.720 | Yeah.
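As a rough sketch of that post-hoc computation: each exported record has an "answer" field (accept, reject, or ignore) plus whatever custom fields you attached, such as an annotator ID, so a simple per-example agreement score is only a few lines of Python. Everything beyond the "text" and "answer" fields here is an assumption.

```python
# Rough sketch of computing per-example agreement from exported annotations.
# "text" and "answer" are standard fields; grouping by text and comparing
# answers across annotators is an illustration, not a built-in Prodigy feature.
import json
from collections import defaultdict

def simple_agreement(path):
    answers = defaultdict(list)
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg["answer"] == "ignore":
                continue
            answers[eg["text"]].append(eg["answer"])
    # Fraction of multiply-annotated examples where all answers agree.
    multi = [len(set(a)) == 1 for a in answers.values() if len(a) > 1]
    return sum(multi) / len(multi) if multi else None

print(simple_agreement("annotations.jsonl"))
```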
01:11:31.880 | [INAUDIBLE]
01:11:32.840 | Yeah, well, you can design--
01:11:43.480 | so the question was--
01:11:47.280 | one interface I showed, which was the sentiment
01:11:49.280 | one with the multiple selections.
01:11:51.600 | This is not binary.
01:11:52.520 | That's true.
01:11:53.400 | And actually, it's also something we usually
01:11:55.120 | tell our users avoid this as much as possible, if you can.
01:11:59.680 | And in some cases, you might still want that.
01:12:01.880 | Or we say, look, a lot of people still
01:12:05.000 | think of surveys when they think of annotating data.
01:12:07.000 | And I get where this is coming from,
01:12:09.960 | but I think if you can leave that sort of mindset
01:12:12.600 | and really open up a bit and think of other creative ways,
01:12:15.000 | you could get more out of this.
01:12:16.380 | If you want to re-engineer a survey,
01:12:18.400 | maybe you want to use a survey tool.
01:12:21.120 | So for example, if I were doing this with those four options,
01:12:24.640 | I would say, OK, we have all texts.
01:12:27.360 | The annotator sees every text four times and says,
01:12:30.320 | is this happy or is this not happy?
01:12:32.720 | And because you can get to one second for annotation,
01:12:35.040 | that's very fast.
01:12:36.400 | Like, even if you have thousands of examples,
01:12:38.360 | you can do this in a day yourself.
01:12:40.680 | And so that's how we would probably solve this.
01:12:42.880 | And it also means you get every example four times.
01:12:46.520 | And for each text, you know, is it sad?
01:12:48.600 | Is it happy?
01:12:49.360 | Is it neutral?
01:12:50.200 | Is it something else?
01:12:51.320 | You have much more data.
01:12:52.600 | But not everyone wants this.
01:12:54.040 | Some people really want to build that survey.
01:12:56.400 | And we let them.
01:12:57.600 | But yeah.
01:12:59.920 | Yeah.
01:13:00.920 | [INAUDIBLE]
01:13:01.920 | So the question is, if you're doing the same example
01:13:15.200 | multiple times, whether it slows down the annotation or not.
01:13:18.520 | Well, actually, I mean, it's difficult to say
01:13:21.000 | because it depends.
01:13:22.240 | But I've actually found that even if you do the bare maths,
01:13:26.160 | it can easily be much faster.
01:13:27.800 | Because if you say, OK, 1,000 examples.
01:13:31.400 | And normally, if you really have to think
01:13:33.520 | about five different concepts that are maybe not even fully
01:13:36.120 | related, that just every tiny bit of friction
01:13:38.920 | you put between a human and the interface or the decision
01:13:41.800 | can very significantly slow down the process.
01:13:44.000 | So you think about, oh, is this happy?
01:13:45.560 | Or is this sad?
01:13:46.560 | Or is this about sports?
01:13:47.800 | Or is this about horses?
01:13:49.840 | And just this can easily add like 10 seconds
01:13:52.720 | to each question.
01:13:54.320 | So if you do the whole thing three times at one second,
01:13:59.680 | you're still faster than you would have been
01:14:02.480 | if you'd added this friction.
01:14:04.520 | And the other part is just a human error.
01:14:07.840 | If you have to think too much, you're much more likely to fuck
01:14:11.160 | it up and do it badly.
01:14:12.160 | And then that's also something you want to avoid.
01:14:15.440 | [INAUDIBLE]
01:14:15.920 | But the active learning helps a lot here as well.
01:14:18.240 | So if you have your labels, it's pretty confident
01:14:20.560 | that the economic labels don't apply.
01:14:22.280 | And so you just don't have to learn something.
01:14:24.440 | Yeah, to repeat this, the active learning also
01:14:28.080 | makes a difference here.
01:14:29.680 | Because you could actually--
01:14:32.280 | yeah, you could pre-select the ones
01:14:34.160 | that really make a difference to annotate
01:14:36.080 | and don't have to really go through every single one that
01:14:39.920 | is not as important as some of the other ones
01:14:42.680 | that you really care for.
01:14:44.880 | Yeah, do you have any experience working with tasks like that
01:14:48.560 | or how you sort of [INAUDIBLE]
01:14:50.520 | Yeah, so the question is, yeah, what
01:14:52.480 | about tasks that need a lot of context,
01:14:54.080 | like the whole medical history or just a whole document.
01:14:57.760 | So we have-- and whether we have experience with that.
01:15:00.800 | So in general, we do say, if your task requires so much
01:15:05.640 | context that you can't fit this into the prodigy interface,
01:15:08.520 | then it doesn't mean that you can't train a model on that.
01:15:11.160 | But for most of the tasks that users most commonly want to do,
01:15:14.440 | this is often also an indicator that it's very, very difficult
01:15:16.800 | to actually teach your model that if you're
01:15:19.080 | doing named entity recognition or even text classification
01:15:23.080 | and you need a lot of context and all the context
01:15:26.320 | is equally as important, that's often an indicator
01:15:29.080 | that that might not work so well.
01:15:31.360 | So for example, text classification,
01:15:32.800 | we say, OK, we start off by selecting one sentence
01:15:35.880 | from the whole document.
01:15:37.280 | And then instead of you annotating the whole document,
01:15:41.360 | you say, OK, this is the most important sentence.
01:15:44.320 | Does this label apply or not?
01:15:46.040 | So there are some tricks we use to get around this problem
01:15:51.720 | because, yeah, we also think that, OK, it's
01:15:56.280 | important to get this across and frame it in that way
01:15:59.360 | because, yeah, if you need two pages on your screen,
01:16:03.320 | it's not efficient at all.
01:16:06.520 | And also likely, you can do all that work,
01:16:08.800 | but your model won't learn that because your model needs
01:16:12.000 | local context as well, at least, for the tasks that we are--
01:16:15.880 | I don't know if you have anything to add to that.
01:16:18.880 | Yeah, OK.
01:16:20.880 | Often, it's important to take into account the models
01:16:23.960 | really that are available.
01:16:25.440 | [INAUDIBLE]
01:16:25.920 | Yeah, so the suggestion was, OK, having some tools,
01:16:33.560 | some process that goes along with the software that
01:16:37.000 | helps people break this down.
01:16:38.640 | Yeah, we've actually been thinking about this a lot
01:16:40.640 | because we do realize the tool is quite new,
01:16:43.120 | and we're introducing a lot of new concepts at once,
01:16:45.160 | and also some best practices where we think,
01:16:47.200 | ah, that's how you should do it, or you could try this.
01:16:49.960 | And we are also realizing that there's
01:16:52.200 | no real satisfying one-size-fits-all answer.
01:16:55.160 | That's another problem.
01:16:56.440 | Everyone's use case is different,
01:16:57.800 | so right now what we're doing is we have a support form
01:17:00.440 | for Prodigy where we answer people's questions.
01:17:03.040 | And actually, a lot of users share
01:17:04.840 | what they're working on, asking for tips.
01:17:08.320 | We kind of talk about it.
01:17:09.640 | Other users come in and are like, oh, I actually
01:17:11.640 | try to do this type of legal annotation,
01:17:14.800 | and here's what worked for me, and have this sort of exchange
01:17:17.840 | around it to figure out, OK, what works.
01:17:21.080 | Because, yeah, it's just like I think
01:17:23.320 | in machine learning, deep learning,
01:17:24.820 | a lot of the best practices are still evolving,
01:17:28.560 | and it's very, very specific.
01:17:32.160 | So it's definitely-- yeah, we're open for suggestion there
01:17:35.360 | as well, but we're still in the process of really coming up
01:17:39.200 | with a good set of best practices and ideas.
01:17:42.000 | The question is whether-- yeah, we
01:17:51.360 | have any plans to sell models like medical models?
01:17:54.240 | Yes, as part of what Matt mentioned
01:17:55.920 | in the very introduction, we are definitely
01:17:58.360 | planning on having more of a models--
01:18:02.400 | like an online store for very, very specific models.
01:18:05.120 | So medical-- that's a very, very interesting domain.
01:18:09.320 | And if so, we really want to have it specific,
01:18:11.800 | like medical texts in French or Chinese,
01:18:16.440 | and really go in that direction.
01:18:17.700 | Because we believe that, OK, pre-trained models
01:18:20.120 | are very valuable, and even if you do medical texts,
01:18:22.800 | you can start off with a pre-trained model,
01:18:25.200 | then you can use a tool like Prodigy or something else
01:18:27.400 | to really fine tune it on your very, very specific context,
01:18:31.120 | have word vectors in it that already fit to your domain,
01:18:34.720 | and maybe up those as well.
01:18:35.880 | We think that this is a very future-proof way of working
01:18:39.720 | with these technologies.
01:18:41.320 | Yeah?
01:18:43.800 | Yeah?
01:18:44.280 | [INAUDIBLE]
01:18:47.160 | So currently-- so a question is the text classification model
01:18:50.040 | we're using in Prodigy.
01:18:51.600 | More info-- more details on that.
01:18:53.160 | So what we're using is Spacey's text classification model.
01:18:56.360 | That's what's built in.
01:18:58.520 | But I think actually this question is pretty good,
01:19:00.520 | because what's important to note is that Prodigy itself
01:19:04.080 | comes with a few built-in recipes that are basically
01:19:07.560 | ideas for, OK, how you could train a text classifier.
01:19:10.120 | You could use Spacey.
01:19:11.320 | But it's definitely not tied to those.
01:19:13.520 | The idea-- the tool itself is really the scaffolding
01:19:15.680 | around it.
01:19:16.200 | So if you say, hey, I wrote my own model using PyTorch,
01:19:19.720 | and I would like to train this, all you need to do
01:19:22.360 | is you need to have one function that takes examples
01:19:25.000 | and updates your model.
01:19:26.160 | And you need to have one function that takes raw texts
01:19:29.480 | and outputs the score for each text.
01:19:32.360 | And then you provide that to Prodigy.
01:19:35.120 | And then you can use the same active learning mechanism
01:19:39.720 | as you would use with a built-in model.
01:19:42.160 | So the idea is really the models we ship
01:19:45.200 | are just a suggestion or an idea you can use to try it out.
01:19:48.840 | But ultimately, we also hope that people in the future
01:19:51.800 | will transition to just plugging in their own model
01:19:54.840 | and just using the scaffolding around it to do that.
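A sketch of what that wiring can look like, with a stand-in model object: one function scores incoming texts, one callback updates the model from answered examples, and a sorter keeps the most uncertain predictions at the front. DummyModel, the recipe name, and the label are placeholders; prefer_uncertain is one of Prodigy's documented sorters.

```python
# Sketch of plugging a custom model into a Prodigy recipe: one function scores
# raw texts, one callback updates the model from answered examples.
# DummyModel, the recipe name and the label are placeholders.
import prodigy
from prodigy.components.sorters import prefer_uncertain

class DummyModel:
    """Stand-in for your own model (PyTorch, scikit-learn, anything)."""
    def score(self, text):
        return 0.5
    def update(self, examples):
        pass

def scored_stream(texts, model):
    for text in texts:
        yield (model.score(text), {"text": text, "label": "RELEVANT"})

@prodigy.recipe("custom-textcat")
def custom_textcat(dataset, source):
    model = DummyModel()
    texts = (line.strip() for line in open(source, encoding="utf8"))
    return {
        "dataset": dataset,
        "view_id": "classification",
        # The sorter keeps the most uncertain predictions at the front.
        "stream": prefer_uncertain(scored_stream(texts, model)),
        # Called with batches of answered examples as the user annotates.
        "update": lambda answers: model.update(answers),
    }
```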
01:19:59.400 | But we definitely don't want to lock anyone in and say,
01:20:01.680 | oh, you have to use spaCy, especially for NER and stuff
01:20:05.120 | and other things.
01:20:06.240 | We think spaCy is pretty good.
01:20:08.080 | But if you don't want to do that for other use cases, especially
01:20:11.040 | text classification, we think that a lot of cases--
01:20:13.800 | well, you might want to use scikit-learn or Vowpal Wabbit.
01:20:18.080 | Yeah, or what a great name.
01:20:21.880 | Yeah, or basically something completely custom.
01:20:25.440 | Yeah.
01:20:25.920 | [INAUDIBLE]
01:20:30.640 | So the question is, active learning part,
01:20:32.600 | whether this is built on the underlying model--
01:20:34.640 | [INAUDIBLE]
01:20:35.440 | Oh, yeah.
01:20:35.920 | [INAUDIBLE]
01:20:42.960 | So the question is, active learning versus no active
01:20:45.720 | learning, how well this works.
01:20:47.080 | First also, maybe as a general introduction,
01:20:49.320 | so what we're doing for most of these samples
01:20:51.320 | is we use a basic uncertainty sampling.
01:20:53.680 | That's what we found works best.
01:20:55.600 | But we also know there are lots of other ways
01:20:57.520 | you could be solving that.
01:20:58.520 | So in the end, how we implement this
01:21:02.040 | is we have a simple function that takes a stream
01:21:04.280 | and outputs a sorted stream based on the assigned
01:21:08.120 | scores and the model in the loop.
01:21:10.440 | So how you wire this up, again, is also up to you.
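Conceptually, the sorting step is as simple as the toy version below: given (score, example) pairs, put the ones closest to 0.5 first. The built-in sorters work on generators so that millions of texts never have to be held in memory at once; this sketch materializes a list only to show the idea.

```python
# Toy illustration of uncertainty sampling: examples whose scores are closest
# to 0.5 come first. Real sorters operate on generators instead of lists.
def sort_by_uncertainty(scored_examples):
    pairs = list(scored_examples)                    # (score, example) tuples
    pairs.sort(key=lambda pair: abs(pair[0] - 0.5))  # most uncertain first
    return [example for _, example in pairs]

stream = [
    (0.95, {"text": "clearly positive"}),
    (0.52, {"text": "hard to say"}),
    (0.08, {"text": "clearly negative"}),
]
print(sort_by_uncertainty(stream))  # "hard to say" comes first
```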
01:21:13.320 | And yeah, to answer the part about what works best,
01:21:19.000 | in general, in our kind of framework, where really,
01:21:22.160 | you see one sentence at a time.
01:21:23.480 | And often, you start off with a model not knowing very much.
01:21:27.120 | The active learning component, basically,
01:21:29.200 | resorting the stream is actually very crucial.
01:21:31.360 | Because otherwise, if you start from scratch,
01:21:34.040 | have very few examples, you'll be
01:21:35.760 | annotating all kinds of random predictions
01:21:37.480 | for a very, very long time
01:21:41.640 | if you just annotate your stream in order.
01:21:44.080 | There's very little-- you need some kind of guidance
01:21:46.560 | that tells you, OK, what to work on next, especially
01:21:48.640 | if you feed in millions of texts.
01:21:50.520 | You need to sort them.
01:21:51.880 | You need to pre-select them based on something.
01:21:54.400 | And this could be the model's predictions.
01:21:56.280 | This could be something else.
01:21:57.160 | This could be the keywords or the patterns.
01:21:59.440 | But without that, yeah, it's very, very difficult.
01:22:04.480 | And that's kind of what we're trying to solve with a tool.
01:22:07.680 | Thank you so much, Innes and Matthew.
01:22:18.400 | I've got to say, anybody who's using fast AI,
01:22:23.760 | any time you've used fast AI NLP or fastAI.txt,
01:22:27.360 | you've called the spaCy tokenize function.
01:22:30.520 | You're using spaCy behind the scenes.
01:22:32.960 | And the reason you're using spaCy
01:22:34.560 | is because I tried every damn tokenizer I could find.
01:22:38.680 | And spaCy's was so much better than everything else.
01:22:42.120 | And then the kind of story of fast AI's development
01:22:44.560 | is that over time, I get sick of all the shitty parts
01:22:47.400 | of every third-party library I find.
01:22:49.040 | And I gradually rewrite them myself.
01:22:50.560 | And the fact that I haven't rewritten spaCy or attempted
01:22:52.920 | to is because I actually think it's
01:22:55.320 | one of those rare pieces of software
01:22:56.760 | that doesn't suck at all.
01:22:58.520 | It's actually really good.
01:23:01.320 | And it's got good documentation.
01:23:02.960 | And it's got a good install story and so forth.
01:23:06.400 | And I haven't used Prodigy, but just the fact
01:23:09.040 | that these guys are working on.
01:23:11.040 | I recognize the importance of active learning
01:23:13.120 | and the importance of combining human plus machine.
01:23:16.120 | What's in that rare category of people, in my opinion,
01:23:18.720 | are actually working on what's one of the most
01:23:20.400 | important problems today.
01:23:22.440 | So thank you both so much for coming
01:23:25.480 | and for this fantastic talk.
01:23:27.280 | And I look forward to seeing what you do next.
01:23:29.320 | Thank you.
01:23:29.800 | [APPLAUSE]