
Increasing data science productivity; founders of spaCy & Prodigy


Chapters

0:00 Introduction
3:55 Syntax Positive
12:32 Syntax
14:35 The Term Sense2Vec
16:19 Using Sense2Vec in spaCy
17:21 Parsing Algorithm
17:44 Transition Based Parsing
22:24 Splitting Tokens
24:44 Learning to Merge
26:56 User Experience
29:16 spaCy vs Stanford
30:54 End-to-end systems
32:48 Long-range dependencies
33:46 Language variation
35:09 Coreference resolution
36:49 How to use spaCy
37:42 Language generation
45:44 Binary decision
52:51 Recipes
55:26 Example

Whisper Transcript

00:00:00.000 | OK, so yeah, this is our first time in the Bay Area,
00:00:03.960 | so it's nice to meet you all.
00:00:05.600 | And thanks for coming.
00:00:07.160 | Not so much to notice.
00:00:09.200 | So I'll start by just giving a quick introduction of us
00:00:11.640 | and some of the things that we're doing
00:00:13.840 | before I start with the main content of the talk, which
00:00:16.360 | is about this open source library that we developed,
00:00:19.960 | spaCy, for natural language processing.
00:00:22.440 | So the other things that we develop as well at Explosion
00:00:25.400 | AI is the machine learning library behind spaCy, Thinc,
00:00:31.000 | which allows us to avoid depending on other libraries
00:00:33.520 | and keep control of everything and make sure that everything
00:00:36.140 | is easy to install.
00:00:38.080 | We also have an annotation tool that we
00:00:40.080 | develop alongside spaCy, Prodigy, which is what
00:00:42.440 | Ines will be talking about.
00:00:44.320 | And we're also preparing a data store of other pre-trained
00:00:46.840 | models for more specific languages and use cases
00:00:49.960 | and things that people will be able to use that basically
00:00:53.680 | will extend the capabilities of the software
00:00:56.440 | for more specific use cases.
00:00:59.640 | So to give you a quick introduction to Ines and me,
00:01:02.840 | which is basically all of Explosion AI,
00:01:05.640 | so I've been working on natural language processing
00:01:07.880 | for pretty much my whole career.
00:01:09.520 | I started doing this after doing a PhD in computer science.
00:01:13.960 | I started off in linguistics and then kind of moved
00:01:15.960 | across to computational linguistics.
00:01:18.880 | And then around 2014, I saw that these technologies
00:01:22.160 | were getting increasingly viable.
00:01:23.960 | And I was also at the point in my career
00:01:26.000 | where I was supposed to start writing grant proposals, which
00:01:28.960 | didn't really agree with me.
00:01:30.120 | So I decided to leave and I saw that there
00:01:32.000 | was a gap in the capabilities available for something that
00:01:35.040 | actually translated the research systems
00:01:37.040 | to something that was more practically focused.
00:01:39.920 | And then soon after I moved to Berlin to do this,
00:01:42.640 | I met Ines.
00:01:44.000 | And we've been working together since on these things.
00:01:46.360 | And I think we kind of have a nice complementarity of things.
00:01:49.920 | She is the lead developer of our annotation tool Prodigy
00:01:54.200 | and has also been working on spaCy pretty much
00:01:56.880 | since the first release.
00:01:59.360 | So I included this slide, which we normally actually
00:02:02.320 | give this when we talk to companies specifically.
00:02:04.480 | But I think that it's a good thing to include
00:02:06.560 | to give you a bit of this is what
00:02:08.960 | we tell people about what we do and how we make money
00:02:11.640 | and how the company works.
00:02:13.080 | And I think that this is a very valid question
00:02:14.920 | that people would have about an open source library.
00:02:17.000 | It's like, well, why are you doing this and how
00:02:19.200 | does it fit into the rest of your projects and plans?
00:02:22.640 | So the Explain It Like I'm 5 version, which I guess
00:02:25.760 | is also the Explain It Like I'm Senior Management version,
00:02:28.800 | is we give an analogy.
00:02:30.280 | It's kind of like a boutique kitchen.
00:02:31.880 | So the free recipes we publish online,
00:02:34.680 | you can see, is kind of like the open source software.
00:02:37.120 | So that's spaCy, Thinc, et cetera.
00:02:39.600 | At the start of the company, especially,
00:02:41.360 | we were doing consulting, which I'm
00:02:43.160 | happy to say we've been able to wind down over the last six
00:02:46.400 | months and focus on our products.
00:02:48.360 | And then we also focus on a line of kitchen gadgets, which
00:02:51.800 | is things like Prodigy.
00:02:53.160 | These are these downloadable tools
00:02:54.880 | to use alongside the open source software.
00:02:56.960 | And soon we'll have this sort of premium ingredients, which
00:02:59.460 | are the pre-trained models.
00:03:01.240 | So the thing that we don't do here
00:03:02.880 | is enterprise support, which I guess
00:03:04.680 | is probably the most common way that people
00:03:07.840 | fund open source software or imagine
00:03:09.720 | that they'll fund open source software with a business model.
00:03:12.600 | And we really don't like this because we want our software
00:03:16.040 | to be as easy to use as possible and as transparent as possible
00:03:18.960 | and the documentation to be good.
00:03:20.720 | So I think it's kind of weird to have this thing where you have
00:03:24.360 | explicitly a plan that we're going to make our free stuff
00:03:26.960 | as good as possible.
00:03:28.240 | And then we're going to have this service
00:03:30.440 | that we hope people pay us lots of money for,
00:03:33.440 | but we hope nobody uses.
00:03:35.120 | And that's kind of weird, right?
00:03:36.520 | It's kind of weird to have a company where you
00:03:39.140 | hope that your paid offering is really poor value to people.
00:03:41.720 | And so we don't think that that's a good way to do it.
00:03:44.440 | And so instead, we have the downloadable tools,
00:03:47.560 | I think, is a good way to--
00:03:49.800 | we have something which works alongside spaCy
00:03:51.840 | and I think is useful to people who use spaCy as well.
00:03:56.840 | OK, so onto the sort of main content of the talk
00:04:01.680 | and the bit that I'll be talking about.
00:04:04.600 | So I'm going to talk to you about the syntactic parser
00:04:08.080 | within spaCy, the natural language processing library
00:04:10.920 | that we use.
00:04:12.040 | And so before I do it, so this is kind of what
00:04:16.040 | it looks like as sort of visualized as an output.
00:04:19.800 | So it's this sort of tree-based structure
00:04:22.240 | that gives you the syntactic relationships between words.
00:04:26.760 | So the way that you should read this here
00:04:29.240 | is that the arrow pointing from this word to this word
00:04:33.640 | means that Apple is a child of looking in the tree.
00:04:37.520 | And it's a child with this relationship and such.
00:04:39.960 | In other words, Apple is the subject of looking.
00:04:42.480 | And is is an auxiliary verb attached to looking,
00:04:45.560 | and then at is a prepositional phrase attached to looking.
00:04:49.280 | So these sorts of relationships tell you
00:04:51.480 | about the syntactic structure of the sentence
00:04:53.200 | and basically help you get at the who did what to whom
00:04:55.880 | sort of relationships in the sentence
00:04:57.720 | and also to extract phrases and things.
00:04:59.760 | So for instance, here, to make the thing more easy to read,
00:05:02.840 | we've merged UK startup, which is a sort of basic noun phrase
00:05:06.560 | into one unit.
00:05:08.800 | And you can find these sorts of phrases
00:05:10.520 | more easily given the syntactic structure.
00:05:13.520 | And just above here, we've got an example
00:05:16.120 | of what the code looks like to actually get
00:05:18.880 | the syntactic structure or navigate the tree.
00:05:21.600 | In spaCy, you just get this NLP object after loading the models.
00:05:26.240 | And you just use that as a function
00:05:27.720 | that you feed text through, or pipe texts through if you've
00:05:30.320 | got a sequence of texts.
00:05:32.240 | And given that, you get a document object, which you can
00:05:35.880 | just use as an iterable. And from the tokens,
00:05:38.840 | you get attributes that you can use to navigate the tree.
00:05:42.040 | So for instance, here, the dependency relationship
00:05:44.680 | is just a dot depth.
00:05:47.000 | By default, that's an integer key, integer ID,
00:05:50.040 | because everything's kind of coded to an integer
00:05:52.120 | for easy and efficient processing.
00:05:55.320 | But then you can get the text value with an underscore
00:05:57.840 | as well.
00:05:58.520 | And then you can navigate up the tree with dot head.
00:06:01.000 | And then you can look at the left and right children
00:06:03.160 | of the tree as well.
00:06:04.280 | So we try to have a rich API that makes it easy
00:06:06.200 | to use these dependency relationships.
00:06:09.640 | So that just getting dependency parses, obviously,
00:06:13.920 | just the first step, you want to actually use it in some way.
00:06:16.400 | And that's why we have this API to make that easy.
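As a rough sketch of the API being described here (assuming the small English model en_core_web_sm is installed, which is not stated in the talk):

```python
import spacy

# Load a pre-trained pipeline; en_core_web_sm is the usual small English model
# (install with: python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

for token in doc:
    # .dep is the integer ID; .dep_ is the human-readable label;
    # .head, .lefts and .rights navigate the tree.
    print(token.text, token.dep_, token.head.text,
          [t.text for t in token.lefts], [t.text for t in token.rights])

# For a sequence of texts, pipe them through in a batch.
for doc in nlp.pipe(["First document.", "Second document."]):
    print([(t.text, t.dep_) for t in doc])
```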
00:06:20.600 | So the question that always comes up with this,
00:06:23.360 | and I think this is a very interesting thing
00:06:25.600 | for the field in general, is what's the point of parsing?
00:06:28.760 | What is this actually good for in terms of applications?
00:06:31.920 | So Yoav Goldberg is a very prominent parsing researcher.
00:06:35.360 | And this is kind of the stuff that he's
00:06:37.240 | studied for most of his career.
00:06:38.640 | And he's one of the more well-known parsing people.
00:06:42.000 | And so it's interesting to see him and other people reflect
00:06:44.880 | on this and say that he finds it fascinating
00:06:46.800 | that even though we have so many best papers in NLP,
00:06:50.040 | so it's kind of a high prestige thing to study parsing.
00:06:53.600 | But it seems like syntax is hardly
00:06:56.080 | used in practice in most of these applications.
00:06:59.720 | So the question is, why is this?
00:07:03.120 | Because parsing is based on trees and structured predictions,
00:07:05.840 | it's kind of fun to study.
00:07:06.720 | And there's all these deep algorithmic questions.
00:07:08.720 | Is it just kind of this catnip to researchers?
00:07:11.800 | And does it have this kind of over prominence in the field?
00:07:17.360 | Or is it that there is something deeper about this
00:07:22.360 | and we should really continue studying this?
00:07:25.520 | Well, I can go either way on this.
00:07:29.840 | And so this slide shows you the case for parsing.
00:07:32.680 | And then I'll kind of have a counterpoint in a second.
00:07:35.440 | So I think that the most important case for parsing
00:07:38.920 | is that there's a sort of deep truth to the fact
00:07:41.240 | that sentences are tree-structured.
00:07:42.960 | They just are, right?
00:07:45.400 | The syntactic structure of sentences is recursive.
00:07:49.400 | And that means that you can have arbitrarily long gaps
00:07:52.320 | between two words which are related.
00:07:54.480 | So for instance, if you have a relationship between,
00:07:59.760 | say, a subject and a verb like syntax is,
00:08:03.840 | whether the subject of that verb is plural or singular
00:08:08.960 | is going to change the form of the verb.
00:08:11.080 | And that dependency between them can be arbitrarily long
00:08:13.880 | because you can have this nested structure.
00:08:16.840 | But it can't be arbitrarily long in tree space
00:08:18.920 | because the relationship between them
00:08:23.360 | will always be the subject and the verb,
00:08:25.120 | like sort of next to each other in the tree.
00:08:28.000 | So you can see how, for some of these things,
00:08:30.280 | it should be more efficient to think about it
00:08:32.320 | or model it as a tree.
00:08:34.080 | And the tree should tell you things
00:08:35.600 | that you otherwise would have to infer
00:08:38.880 | from an enormous amount of data.
00:08:41.000 | It should be more efficient in this way.
00:08:43.040 | So we can say, OK, in theory, this should be important.
00:08:46.440 | And it should be something that we
00:08:47.920 | study based on this knowledge about how sentences
00:08:50.840 | are structured.
00:08:53.440 | So then the counterpoint to this is, all right,
00:08:56.080 | so sentences are tree-structured.
00:08:57.640 | And that's a truth about sentences.
00:08:59.600 | But it's also true that they're written and read in order.
00:09:02.480 | So if you read a sentence, you do read it from left to right,
00:09:05.920 | or in English anyway, basically from start to finish,
00:09:08.600 | or you hear a sentence from start to finish.
00:09:10.840 | And this really puts a sort of bounding on the linear complexity
00:09:14.760 | that you will empirically see.
00:09:16.920 | Because when somebody wrote this sentence,
00:09:19.200 | yes, they could have an arbitrarily long dependency.
00:09:21.960 | But they expect that that would mean
00:09:24.000 | that their audience listening to it
00:09:25.440 | will have to wait arbitrarily long between some word
00:09:28.880 | and the thing that it attaches to.
00:09:30.560 | And that's kind of not very nice.
00:09:32.160 | So empirically, it's not very surprising to see
00:09:35.720 | that most dependencies are, in fact, short.
00:09:38.440 | And there's a lot of arguments that the options that
00:09:42.240 | are kind of provided to grammars are sort of arranged
00:09:45.160 | that you're able to keep your dependencies short.
00:09:48.040 | Like that's sort of some of the reasons
00:09:49.760 | you have options for how to move things around in sentences
00:09:52.240 | to make nice reading orders.
00:09:53.800 | Because you want short dependencies.
00:09:56.160 | So this means that if most dependencies are short,
00:09:58.880 | then processing text as, say, chunks of words of one or two
00:10:02.720 | at a time kind of gives you a pretty similar view.
00:10:06.040 | Most of the time, you don't get something that's
00:10:08.240 | so dramatically different if you look at a tree
00:10:11.160 | instead of looking at chunks of three or four word sentences.
00:10:14.400 | So this is kind of a counterpoint that
00:10:16.240 | says maybe even though the sentences are, in fact,
00:09:20.280 | tree-structured, maybe it's not that crucially useful.
00:10:23.760 | So I think that the part that makes this particularly
00:10:27.080 | rewarding to look at syntax or particularly useful
00:10:30.320 | to provide syntactic structures in a library like spaCy
00:10:33.400 | is that they're application independent.
00:10:35.480 | So the syntactic structure of the sentence
00:10:38.320 | doesn't depend on what you hope to do with the sentence
00:10:40.600 | or how you hope to process it.
00:10:42.120 | And that's something that's quite different from other labels
00:10:44.520 | or other information that we can attach to the sentence.
00:10:47.400 | If you're doing something like a sentiment analysis,
00:10:49.680 | there's no truth about the sentiment of a sentence
00:10:53.080 | that's independent of what you're hoping to process.
00:10:55.760 | That's not a thing that's in the text itself.
00:10:58.200 | It's a lens that you want to take on it based
00:11:01.160 | on how you want to process it.
00:11:02.480 | So whether you consider some review
00:11:05.800 | to be positive or negative depends on your application.
00:11:10.760 | It's not necessarily in the text itself.
00:11:12.680 | Because what counts as positive or negative?
00:11:15.080 | What's the labeling scheme?
00:11:16.160 | What's the rating scheme?
00:11:17.920 | Or exactly what are they talking about?
00:11:20.800 | Well, the taxonomy that you have will
00:11:23.040 | depend on what you're hoping to process with.
00:11:26.160 | Those things aren't in the language.
00:11:28.160 | But details about the syntactic structure are in the language.
00:11:31.600 | They're things which are just part
00:11:33.160 | of the structure of the language.
00:11:36.000 | And that means that we can provide these things,
00:11:38.600 | learn it once, and give it to many people.
00:11:40.520 | And I think that that's very valuable and useful
00:11:42.640 | and different from other types of annotations
00:11:44.800 | that we could calculate and attach.
00:11:46.480 | And that's why spaCy provides pre-trained models for syntax,
00:11:49.360 | but doesn't provide pre-trained models for something
00:11:51.400 | like sentiment.
00:11:52.080 | Because we know how to give you a syntactic analysis that's
00:11:56.760 | as useful as it may be, or maybe not,
00:11:59.480 | depending on whether that actually solves your problems.
00:12:02.200 | But at least it's sort of true and generalizable,
00:12:04.640 | whereas we don't know what categorization scheme you
00:12:08.640 | want to classify your text in.
00:12:09.880 | So we can't give you a pre-trained model
00:12:11.560 | that does that, because that's your own problem.
00:12:14.920 | So we try to basically give you these things, which
00:12:17.560 | are annotation layers, which do generalize in this way.
00:12:19.920 | And that means that there has to be
00:12:21.680 | a sort of linguistic truth to them.
00:12:23.240 | And that means that looking at things
00:12:25.080 | like the semantic roles, or sentence structure,
00:12:27.200 | or sentence divisions are things that we can do.
00:12:30.320 | And that's why we are interested in this.
00:12:33.680 | So the other thing about syntactic structures
00:12:36.720 | and whether they're useful or not is that in English,
00:12:40.440 | not using syntax is pretty powerful,
00:12:42.400 | because English orthography happens to cut things up
00:12:46.120 | into pretty convenient units.
00:12:48.360 | They're not optimal units, but they're still pretty nice
00:12:52.120 | in a way that doesn't really hold true
00:12:54.080 | across a lot of other languages.
00:12:55.920 | So in the bottom right here, we have
00:12:57.520 | Japanese, which usually isn't segmented into words.
00:13:01.320 | You can't just cut that up trivially with white space
00:13:04.640 | and get something that you can feed into a search engine,
00:13:07.000 | or get something that you can feed forward
00:13:08.720 | into a topic model.
00:13:09.800 | You have to do some extra work.
00:13:11.480 | And the extra work that you do there really
00:13:13.280 | should consider syntactic structure.
00:13:15.280 | You can use a technology that only makes linear decisions,
00:13:18.920 | but the truth about what counts as a word or not
00:13:21.880 | is very entangled with the syntactic structure.
00:13:23.880 | And so there's real value in doing it jointly
00:13:25.880 | with syntactic parsing.
00:13:27.880 | For other languages, you have kind of the opposite problem.
00:13:30.560 | So we have here a German word, and this
00:13:37.080 | is the German word for income tax return.
00:13:39.240 | Now, whether or not you want that to be sort of one unit
00:13:43.600 | will depend on what you're looking for.
00:13:45.200 | For many applications, actually, the English phrase
00:13:47.960 | is too short.
00:13:49.200 | And the domain object, the thing that you
00:13:51.480 | want to be looking for and having a single node
00:13:55.360 | in your knowledge graph for, would actually
00:13:57.120 | be income tax return.
00:13:58.040 | That's pretty awesome.
00:13:59.800 | But in other applications, maybe you just
00:14:01.600 | want to look for tax.
00:14:02.640 | And so in those cases, the German word will be too large
00:14:05.800 | and your data will be too sparse.
00:14:07.480 | So there's sort of different aspects to this.
00:14:12.000 | In the bottom left here, we have an example of Hebrew.
00:14:16.160 | And like Arabic and a couple of other languages like this,
00:14:21.960 | there's no vowels in the text.
00:14:23.560 | And the words tend to have all sorts of attachments
00:14:27.200 | that are difficult to segment off.
00:14:29.680 | So there, again, you have difficult segmentation problems
00:14:32.400 | that are all tangled up with the syntactic processing.
00:14:35.920 | OK, so going forward to sort of an example of what we can do
00:14:42.240 | if we recognize non-whitespace-looking words
00:14:46.320 | and feed them into some of the other processing stuff
00:14:50.800 | that we have.
00:14:51.720 | So this is a demo that we prepared a couple of years
00:14:55.080 | ago for an approach that we call the term Sense2Vec.
00:15:00.560 | So all this is is basically processing text
00:15:03.440 | using natural language processing tools, in this case
00:15:05.720 | specifically spaCy, in order to recognize these concepts that
00:15:09.800 | are longer than one word.
00:15:11.040 | So specifically here, we looked for base noun phrases and also
00:15:15.120 | named entities.
00:15:16.000 | And we just merged those into one token
00:15:17.640 | before feeding the text forward into a word2vec
00:15:21.720 | implementation, which gives you these semantic relationships.
00:15:25.360 | And this lets you search for and find similarities
00:15:29.600 | between phrases which are much longer than one word.
00:15:32.800 | And as soon as you do this, you find,
00:15:35.000 | ah, the things which I'm searching for
00:15:37.000 | are much more specific in meaning.
00:15:38.400 | I'm not looking for one meaning of learning
00:15:41.880 | or one meaning of processing, which
00:15:43.280 | doesn't tend to be so useful or interesting.
00:15:45.120 | Instead, you can find things related to natural language
00:15:48.720 | processing.
00:15:49.320 | And then you see, ah, machine learning, computer vision,
00:15:51.520 | et cetera.
00:15:52.080 | These are real results that came out of the thing
00:15:55.480 | as soon as you did this division.
00:15:58.520 | And so we can do this for other languages as well.
00:16:02.360 | So if we were hoping to use word2vec for a language
00:16:06.800 | like Chinese, you really want to be processing it into words
00:16:10.080 | before you do that.
00:16:11.280 | Or if you're going to do this for a language like Finnish,
00:16:14.280 | you really want to cut off the morphological suffixes
00:16:17.200 | before you do this.
00:16:20.520 | So incidentally, Ines has cleaned up
00:16:24.440 | sense2vec recently.
00:16:26.120 | So you can actually use this as a handy component
00:16:29.160 | within spaCy.
00:16:30.640 | So you can load up a standard model and then
00:16:35.240 | add a component that gives you these sense2vec senses.
00:16:38.480 | So you can just say, all right, the token at position three
00:16:42.720 | would be natural language processing,
00:16:44.280 | because it does the merging for you.
00:16:46.000 | And then you can also look up the similarity.
00:16:47.880 | So it's now much easier to actually use the pre-trained
00:16:52.280 | model and use that approach within spaCy.
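From memory of the sense2vec package's documented usage (the exact setup has changed between versions and the vector path is a placeholder, so treat this as an approximate sketch rather than the speaker's exact code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Attach the sense2vec pipeline component and load pre-trained sense vectors;
# "/path/to/s2v_vectors" stands in for whichever vector package you downloaded.
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_vectors")

doc = nlp("A sentence about natural language processing.")
span = doc[3:6]  # "natural language processing"
if span._.in_s2v:
    print(span._.s2v_most_similar(3))  # e.g. machine learning, computer vision, ...
```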
00:16:56.520 | Incidentally, we have this concept of an extension
00:16:59.200 | attribute in spaCy so that you can kind of attach your own
00:17:02.720 | things to the tokens so that you can basically
00:17:05.880 | attach your own little markups or processing things.
00:17:09.160 | So this underscore object is kind of a free space
00:17:13.600 | that you can attach attributes to, which ends up
00:17:16.360 | being quite convenient.
00:17:17.640 | It's a lot more convenient than trying to subclass something
00:17:20.800 | or something.
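The extension mechanism itself looks roughly like this, reusing an nlp object loaded as in the earlier sketch; the attribute name is made up purely for illustration:

```python
from spacy.tokens import Token

# Register a custom attribute on all tokens; "is_tech_term" is a
# hypothetical name chosen just for this example.
Token.set_extension("is_tech_term", default=False)

doc = nlp("We work on natural language processing.")
doc[3]._.is_tech_term = True          # write to the underscore space
print([t.text for t in doc if t._.is_tech_term])
```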
00:17:23.320 | So for the rest of the talk, I'll
00:17:26.720 | give you a little bit of a pretty brief overview
00:17:30.040 | of the parsing algorithm and then explain
00:17:33.120 | how we're going to-- how we're modifying the parsing algorithm
00:17:35.600 | to work with languages other than English
00:17:38.840 | so that we can basically broaden out the support of spaCy
00:17:41.680 | to these other languages.
00:17:43.760 | So what we see here is a completed parse.
00:17:49.960 | And I'm going to talk you through the steps
00:17:53.640 | or the decision points that the parser is going
00:17:55.400 | to make to derive this structure.
00:17:58.160 | And so the kind of key--
00:18:01.440 | I think to keep in mind-- or the key aspect of the solution
00:18:06.720 | is that it's going to read the sentence from left to right
00:18:09.560 | and maintain some state.
00:18:10.920 | And then it's going to have a fixed inventory of actions
00:18:14.640 | that it has to choose between to manipulate the current parse
00:18:18.320 | state to build up the arcs.
00:18:19.920 | And this type of approach, which is called
00:18:22.160 | transition-based parsing, I find deeply satisfying
00:18:25.080 | because it's linear in time because you only
00:18:30.600 | make so many decisions per word.
00:18:32.480 | And I do think that it makes a lot of sense
00:18:34.440 | to take algorithms which process language incrementally.
00:18:38.200 | I think that that's deeply satisfying and correct
00:18:41.640 | in a way that a lot of other approaches to parsing aren't.
00:18:44.520 | And it's also a very flexible approach.
00:18:46.240 | So we can do joint modeling and have it output
00:18:49.320 | all sorts of other structures as well as the parse tree.
00:18:52.480 | And that's actually what we're going to do.
00:18:54.140 | So already in spaCy, we have joint prediction
00:18:57.680 | of the sentence boundaries and the parse tree.
00:18:59.680 | And what we're going to do is extend
00:19:01.480 | this to this joint prediction of word boundaries as well.
00:19:04.840 | OK, so here's how the decision process of building the tree
00:19:09.960 | works.
00:19:10.680 | So we start off with an initial state.
00:19:13.000 | And so for ease of notation or ease of readability,
00:19:16.860 | we're notating the first word of the buffer.
00:19:20.600 | So the first word that's being focused on
00:19:23.240 | as this beam of highlighting.
00:19:26.000 | And then the other element of the state is a stack.
00:19:30.440 | And so as the first action that we do,
00:19:33.920 | we have an action that can advance the buffer one
00:19:36.760 | and put the word that was previously at the start
00:19:39.040 | of the buffer onto the stack.
00:19:40.440 | So here's what that shift move is going to look like.
00:19:43.520 | So here we have Google on the stack, which we write up here.
00:19:48.200 | And the first word of the buffer is reader.
00:19:50.920 | And so then another action that we can take
00:19:53.680 | is to form a dependency arc between the word that's
00:19:56.600 | on top of the stack and the first word of the buffer.
00:19:59.000 | So in this case, we want to attach Google
00:20:02.960 | as a child of reader.
00:20:04.320 | So we have an action that does that.
00:20:05.840 | And because we're building a tree,
00:20:07.920 | when we make an arc to Google, we
00:20:11.080 | know that we can pop it from the stack because it's a tree.
00:20:15.440 | It only can have one head.
00:20:17.360 | It can only have one attachment point.
00:20:21.680 | It's a different type of graph.
00:20:23.520 | And so that means that we can do that and keep moving forward.
00:20:27.240 | So here's what that looks like.
00:20:28.520 | We add an arc and pop Google from the stack.
00:20:31.920 | So now we make the next move.
00:20:34.240 | Clearly, we've got no words on the stack.
00:20:36.000 | So we should put reader on the stack so that we can continue.
00:20:39.560 | Now we're at was.
00:20:41.120 | And now we want to decide whether we should make an arc
00:20:43.840 | directly between was and reader.
00:20:45.840 | In this case, no, we want to attach was to canceled.
00:20:48.600 | So we're going to move was onto the stack
00:20:51.000 | and move forward onto canceled.
00:20:52.600 | So then here we do want this arc between canceled and was.
00:20:57.960 | So we do another left arc.
00:20:59.600 | And so we basically continue here.
00:21:01.720 | So we're sort of stepping back a bit and thinking about this.
00:21:06.440 | We've got a fixed inventory of actions.
00:21:08.520 | And so long as we can predict the right sequence
00:21:11.880 | of those actions, we can derive the correct pass.
00:21:14.120 | So that's how the machine learning model
00:21:15.720 | is going to work here.
00:21:16.920 | The machine learning model is going
00:21:18.280 | to be a classifier that predicts, given some state,
00:21:21.120 | what to do next.
00:21:22.840 | And you can sort of imagine that we can have other actions
00:21:26.520 | instead if we wanted to predict other aspects of the structure.
00:21:29.720 | So in the case of spaCy, we have an action
00:21:31.640 | that inserts a sentence boundary.
00:21:33.840 | So it just says, all right, given the words currently
00:21:36.440 | on the stack, you have to make actions
00:21:38.760 | that can clear the stack.
00:21:40.440 | But you're not allowed to push the next token
00:21:42.720 | until your stack is clear.
00:21:44.000 | And that means that there's going
00:21:45.320 | to be a sentence boundary there.
00:21:48.240 | And we could have other actions as well.
00:21:51.240 | There's been work to jointly predict part of speech tags
00:21:54.920 | at the same time as you're parsing.
00:21:56.240 | Or you can do semantics at the same time as you do syntax.
00:21:59.920 | And so you can code up all sorts of structures into this.
00:22:02.640 | And you're going to read the sentence left to right.
00:22:05.000 | And you're going to output some meaning structure attached
00:22:07.800 | to it.
00:22:08.360 | And as I said, I find this a satisfying way
00:22:10.960 | to do natural language understanding.
00:22:13.760 | Because it does involve reading the sentence
00:22:16.360 | and adding an interpretation incrementally.
00:22:20.000 | OK, so that's what this looks like as we proceed through.
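To make the state machine concrete, here is a toy sketch of the kind of transition system being described. It is deliberately simplified (unlabeled arcs, no sentence-boundary action, no learned model), so it is illustrative rather than spaCy's actual implementation:

```python
def parse(words, predict_action):
    """Toy transition-based parser. predict_action stands in for the
    classifier that, given the current state, chooses the next move."""
    stack = []
    buffer = list(range(len(words)))   # token indices, read left to right
    arcs = []                          # (head, child) pairs

    while buffer or len(stack) > 1:
        action = predict_action(stack, buffer, arcs)
        if action == "SHIFT" and buffer:
            # advance the buffer, pushing its first word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC" and stack and buffer:
            # word on top of the stack becomes a child of the buffer front
            arcs.append((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            # word on top of the stack becomes a child of the word below it
            child = stack.pop()
            arcs.append((stack[-1], child))
        else:
            break
    return arcs
```

Because each word is shifted once and popped at most once, the number of actions is linear in the sentence length, which is the linear-time property mentioned above.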
00:22:25.320 | So all right, so how are we going
00:22:27.080 | to do this splitting up or merging of other things?
00:22:31.160 | Well, it's actually not that complicated,
00:22:33.240 | given this transition-based framework.
00:22:35.080 | So already, you can kind of see that in order
00:22:38.600 | to merge tokens, all we really have to do
00:22:40.880 | is we've got those tokens.
00:22:42.280 | And if we wanted Google Reader to be one token,
00:22:45.080 | we just have to have some special dependency label, which
00:22:47.360 | we are going to have in the tree.
00:22:49.400 | And so, obviously, we call the label subtoken.
00:22:51.840 | And then all we have to do is say, all right,
00:22:55.280 | at the end of parsing, we're going
00:22:56.540 | to consider that as one token.
00:22:58.560 | So the step from going through something like this
00:23:03.040 | and labeling a language like Chinese is actually super
00:23:06.040 | simple.
00:23:06.540 | We just have to prepare the training data
00:23:08.240 | so that the tokens are individual characters.
00:23:11.000 | And then we can say, all right, things
00:23:14.180 | which should be one word should have this sort of structure
00:23:18.000 | with this label.
00:23:18.960 | And then if the parser decides that those things are attached
00:23:22.920 | together, then at the end of it, you just merge them up.
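A hedged sketch of that post-processing step using spaCy's retokenizer; the "subtoken" label is the convention described in the talk rather than a built-in, and this assumes each subtoken attaches to a following head:

```python
# Collect spans of tokens connected by the "subtoken" dependency label,
# then merge each span into a single token after parsing.
spans = []
for token in doc:
    if token.dep_ == "subtoken" and token.head.i > token.i:
        if spans and token.i <= spans[-1][1]:
            # this token continues the previous subtoken chain
            spans[-1][1] = max(spans[-1][1], token.head.i)
        else:
            spans.append([token.i, token.head.i])

with doc.retokenize() as retokenizer:
    for start, end in spans:
        retokenizer.merge(doc[start:end + 1])
```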
00:23:26.640 | The splitting tokens is more complicated,
00:23:29.140 | because you have to have some universal actions that
00:23:31.900 | manipulates the strings.
00:23:33.080 | So I'm still sort of working on the implementation of this
00:23:35.600 | in a way that's sort of clean and tidy.
00:23:39.200 | But I actually think that this will
00:23:40.700 | be useful for a lot of English texts
00:23:42.200 | as well, because if you have English texts that's
00:23:46.000 | sort of misspelled, a lot of the time,
00:23:48.200 | things which should be two tokens get merged into one.
00:23:50.720 | So it's is a particularly common and frustrating one of this,
00:23:55.240 | because the verb is, it should be its own token.
00:23:59.280 | But if you have it's as ITS, which is also
00:24:01.680 | a common word in English, you need
00:24:04.840 | to figure out that you have to have two parser actions, two
00:24:07.680 | parser states for that.
00:24:09.240 | And in general, you could have a statistical model
00:24:12.000 | that reads the sentence beforehand.
00:24:13.760 | But that statistical model that is
00:24:15.600 | going to read the sentence and process it
00:24:17.520 | is going to end up taking on work
00:24:19.200 | and doing jobs of figuring out the syntactic structure
00:24:21.880 | of the sentence in order to make those decisions.
00:24:24.000 | And that's why I think doing these things jointly
00:24:26.000 | is kind of satisfying, because instead
00:24:28.320 | of learning that information in one level of representation
00:24:31.480 | and throwing it away, only to build up
00:24:33.840 | the same information in the next pass of the pipeline,
00:24:37.440 | you can do it all at once.
00:24:38.720 | And so I think that the joint incremental approaches,
00:24:41.760 | I think, are very satisfying and good.
00:24:44.880 | So where are we at the moment?
00:24:46.880 | So I've implemented the learning to merge side
00:24:52.440 | of things, which involves figuring out
00:24:55.960 | better alignments between the gold standard tokenization
00:24:59.040 | and the output of the tokenizer.
00:25:01.440 | And that's allowed me to complete the experiments
00:25:04.040 | for Chinese, Vietnamese, and Japanese
00:25:07.520 | on the Conference on Natural Language Learning (CoNLL) 2017
00:25:11.120 | benchmark, which was a sort of bake-off of these parsing
00:25:13.760 | models, which was conducted last year.
00:25:16.560 | Now, in that benchmark, the team from Stanford
00:25:20.200 | did extremely well, compared to everybody else in the field.
00:25:24.400 | They were some two or three percentage points better.
00:25:28.400 | And so at the moment, we were ranking
00:25:30.800 | kind of at the top of what was the second place pack.
00:25:33.240 | So most of the languages were coming sort of
00:25:35.280 | underneath the Stanford system, but with significantly better
00:25:40.000 | efficiency and with sort of this end-to-end process.
00:25:42.960 | And in particular, we're doing better than Stanford
00:25:45.760 | on these languages like Chinese, Vietnamese, and Japanese,
00:25:48.120 | because the Stanford system did have this disadvantage of using
00:25:51.240 | the sort of preprocessed text.
00:25:52.680 | They didn't do the whole task.
00:25:53.880 | They wanted to just use the provided preprocessed texts
00:25:59.480 | so that they could focus on the parsing algorithm.
00:26:01.680 | And that meant that they did have this error propagation
00:26:03.880 | problem.
00:26:04.440 | If the inputs are incorrect because the preprocessed
00:26:07.360 | segmenter is incorrect, then they
00:26:11.560 | have a big disadvantage on these languages.
00:26:13.680 | So satisfyingly, the sort of doing all at once
00:26:18.000 | and entangling all of these representations,
00:26:21.200 | it does have this advantage.
00:26:22.320 | And we're seeing that in the results
00:26:24.160 | that we have for those languages.
00:26:26.320 | And the other thing that's satisfying
00:26:27.680 | is this joint modeling approach of deciding the segmentation
00:26:31.600 | at the same time as deciding the parse structure is consistently
00:26:35.040 | better than the pipeline approach in our experiments.
00:26:37.520 | So basically, we're getting a sort of 1% to 3% improvement
00:26:42.840 | from this, which is about the same size
00:26:44.840 | as we're getting from using the neural network model instead
00:26:47.400 | of the linear model.
00:26:48.280 | So I've found this also quite satisfying,
00:26:50.400 | that the sort of conceptually neat solution
00:26:53.560 | is also working well in practice.
00:26:57.200 | So where does this go, and what do we
00:26:58.840 | hope to deliver from this?
00:27:00.480 | Yes, that would probably be it.
00:27:06.360 | How am I for time?
00:27:11.440 | OK, well, this is the last slide, so wrapping up.
00:27:16.080 | OK, so what we want to do is we want
00:27:18.840 | to deliver a sort of workflow or user experience
00:27:21.680 | where it's very easy to start with the pre-trained models
00:27:24.360 | for different languages and broad application areas.
00:27:28.220 | And we want to make sure that they
00:27:29.640 | have the same representation across languages.
00:27:31.680 | So you get the same parse scheme, which the folks behind
00:27:37.640 | the Universal Dependencies project have been
00:27:39.360 | working hard on and basically now have a pretty satisfying
00:27:41.400 | solution.
00:27:42.440 | And so if you're processing text from different languages,
00:27:44.720 | it should be easy to find, say, subject-verb relationships
00:27:47.240 | or direct-object relationships.
00:27:48.920 | And that should work across basically any language
00:27:51.400 | so that you can use these parse trees
00:27:53.600 | and basically have a level of abstraction
00:27:55.600 | from which language the text is in.
00:27:59.400 | And then given this, you should be
00:28:01.840 | able to do pretty powerful rule-based matching
00:28:03.840 | from the parse tree and other annotations to provide it.
00:28:07.480 | So it should be pretty easy to find information,
00:28:11.280 | even without knowing much about the language and reuse
00:28:13.600 | rules across language.
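Because the labels come from the Universal Dependencies scheme, a rule like "find the subject of each verb" can be written once and reused; a small sketch (nsubj is the UD label for nominal subjects, and the output depends on the model):

```python
def subject_verb_pairs(doc):
    # Works the same way whichever language's model produced the parse,
    # as long as it uses Universal Dependencies labels.
    return [(tok.text, tok.head.text) for tok in doc if tok.dep_ == "nsubj"]

print(subject_verb_pairs(nlp("Apple is looking at buying a U.K. startup")))
# e.g. [('Apple', 'looking')]
```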
00:28:15.600 | And then if the syntactic model and the entity models
00:28:21.360 | that we provide aren't accurate enough,
00:28:23.800 | the library should support easy updating of those,
00:28:26.400 | including learning new vocabulary items
00:28:28.600 | without you taking particular effort.
00:28:32.160 | And overall, we sort of want to emphasize
00:28:34.520 | a workflow of rapid iteration and data annotation.
00:28:37.800 | So the concept of this is that we
00:28:40.000 | should be able to provide things which
00:28:42.080 | give a broad-based understanding of language,
00:28:45.800 | but that still ends up with a need for the knowledge
00:28:50.400 | specific to your domain and training
00:28:51.920 | and evaluation data specific to your problems.
00:28:55.200 | And we want to make sure that it's easy to connect the two up
00:29:00.200 | and go the extra step.
00:29:00.200 | Start from a basic understanding of language
00:29:02.120 | and move forward to building the specific applications, which
00:29:05.680 | is-- now Ines will be talking about that aspect
00:29:09.600 | of the sort of intended package.
00:29:12.240 | [LAUGHTER]
00:29:18.680 | Yeah.
00:29:19.680 | [APPLAUSE]
00:29:21.600 | Right.
00:29:29.480 | So to--
00:29:30.040 | [INAUDIBLE]
00:29:32.040 | Yes, certainly.
00:29:34.080 | So Dilip asked what the sort of overall difference
00:29:37.200 | or main-- most important difference is between spaCy's
00:29:39.400 | parsing algorithm and Stanford's parsing algorithm.
00:29:42.160 | So amongst other things, the sort
00:29:43.680 | of most fundamental difference is that Stanford's system
00:29:48.960 | is a graph-based parser.
00:29:50.520 | So this is O(n) squared, or maybe O(n) cubed,
00:29:55.720 | in length of the sentence.
00:29:57.120 | So you're unable to use this type of parsing algorithm
00:30:00.920 | for joint segmentation and parsing.
00:30:03.520 | You have to have a pre-segmented text, which
00:30:05.880 | is why it has this disadvantage on languages, or text which
00:30:11.920 | is more difficult to segment into sentences.
00:30:14.200 | So in Spacey, we want to make sure
00:30:15.560 | that we basically only use linear time algorithms.
00:30:19.720 | And that's why we only take this transition-based approach.
00:30:25.640 | Other reasons sort of why they get such a good result.
00:30:29.800 | Other people have done graph-based models,
00:30:31.880 | and they're not nearly as accurate.
00:30:34.080 | So I hope to meet the Stanford team in the next couple of days
00:30:38.160 | and shake out the details of why this system is so accurate.
00:30:42.280 | Because, actually, it is quite surprising.
00:30:44.680 | I've read their papers several times,
00:30:46.180 | and I can't get the sort of one key insight that
00:30:48.920 | means that their system performs so well.
00:30:51.040 | It's interesting.
00:30:56.080 | [INAUDIBLE]
00:31:02.200 | So I think that--
00:31:04.540 | Right, yes, certainly.
00:31:07.320 | So the question, which is a very good one
00:31:10.440 | that many people have been thinking about,
00:31:12.800 | is to what extent can end-to-end systems, which maybe
00:31:17.320 | learn things about syntax, but learn them latently
00:31:19.680 | and don't have an explicit syntactic representation
00:31:22.200 | internally, replace the need for this type
00:31:25.180 | of syntactic processing?
00:31:27.320 | So I would say that for any application where
00:31:30.680 | there's sufficient text, currently the best approach
00:31:34.440 | or the state-of-the-art approach doesn't use a parser.
00:31:37.480 | And actually, this includes translation and other things
00:31:40.240 | where you would kind of expect that having
00:31:42.320 | an explicit syntactic layer would help.
00:31:44.200 | If there's enough text, it seems that going straight
00:31:46.320 | to the end-to-end representation tends to be better.
00:31:49.320 | However, that does involve having a lot of text.
00:31:51.680 | And for most applications, creating that much training
00:31:55.200 | data, especially initially when you're prototyping,
00:31:58.280 | tends not to be such a viable solution.
00:32:00.400 | So the way that I see it is that the parsing stuff
00:32:04.560 | is a great scaffolding.
00:32:05.960 | And it's a very practical thing to have in your toolbox,
00:32:09.000 | especially when you're trying to figure out
00:32:10.660 | how to model the problem.
00:32:12.000 | So because otherwise, you end up in this chicken and egg
00:32:14.080 | situation of, well, we need lots of data
00:32:16.580 | to make our model work well.
00:32:18.680 | And otherwise, it just doesn't really get off the ground.
00:32:21.040 | But then how do we even know that we're
00:32:22.800 | collecting the right data for the right model
00:32:24.480 | until we have that data collected
00:32:25.840 | and we can see the accuracy?
00:32:27.680 | So if you can take sort of smaller steps
00:32:30.120 | using these sort of rule-based scaffolding and bootstrapping
00:32:32.920 | approaches, I think you have a much more powerful and practical
00:32:36.200 | set of tools.
00:32:37.240 | And then finally, once you have a system
00:32:39.480 | that you know you want to eke out every percent,
00:32:42.080 | maybe you end up collecting enough data
00:32:43.620 | that you don't need a parser in your solution explicitly.
00:32:47.560 | [INAUDIBLE]
00:32:49.360 | So Dilip has pointed to a paper that recently showed
00:32:58.800 | that, you know, BiLSTM models don't necessarily
00:33:02.480 | learn long-range dependencies.
00:33:04.320 | I think that that's probably true.
00:33:06.560 | But as somebody who's worked on parsing for a lot of my career,
00:33:11.800 | I try to remind myself not to cherry pick results.
00:33:14.680 | And even if I do find a paper that shows that parsing
00:33:17.680 | works on something, well, the overall trend
00:33:19.720 | is that BLSTM models which don't use parsing work well.
00:33:24.400 | And the fact is that long range dependencies are kind of rare.
00:33:27.440 | So that's basically why it's important
00:33:34.920 | to be asking, well, what are these things good for,
00:33:38.200 | and not say, oh, everything should be using parsing.
00:33:41.060 | Because it's true that not everything should.
00:33:43.240 | [INAUDIBLE]
00:33:45.240 | So the question is, if we look at other aspects of language
00:33:57.280 | variation instead of just, say, the segmentation and things,
00:34:01.560 | how does the incremental model perform?
00:34:03.480 | So specifically, how does it perform
00:34:05.200 | in free word order languages, perhaps ones
00:34:08.680 | with longer range or crossing dependencies?
00:34:12.240 | So Stanford, actually, their paper
00:34:14.340 | had excellent analysis about a lot of these questions.
00:34:16.540 | And so they showed that their model, which
00:34:19.980 | is much less sensitive to whether the trees are
00:34:22.220 | projective, does relatively well in those languages.
00:34:26.580 | So for our preliminary results, we
00:34:33.840 | do fine on German and pretty well on Russian.
00:34:37.660 | We still suck at Finnish.
00:34:39.240 | And I think there's a bug in Korean.
00:34:43.220 | It's at like 50%.
00:34:45.600 | So it's a mixed bag.
00:34:47.960 | But I would say that there's some problems
00:34:50.320 | to solve about the projectivity.
00:34:52.400 | The way that I'm doing this is a little bit crude at the moment.
00:34:54.820 | So in general, there is a disadvantage
00:35:00.520 | that we take from the incremental approach in this.
00:35:02.920 | And there's a lot of clever solutions
00:35:05.160 | that I'm looking into for this.
00:35:07.040 | So yeah.
00:35:08.760 | [INAUDIBLE]
00:35:12.760 | So there's a pretty good extension package
00:35:15.640 | for coreference resolution that has taken some of the pressure
00:35:19.560 | off us to support it internally.
00:35:21.720 | We do think that coreference resolution is something
00:35:23.920 | that does belong in the library, because it's
00:35:25.720 | something that does have that property of being
00:35:27.960 | a language internal thing.
00:35:29.200 | I think that there's a truth about whether that he or she
00:35:31.660 | refers to that noun that doesn't depend on the application.
00:35:34.800 | It's just a true fact about that sentence.
00:35:36.760 | So we're very interested in being
00:35:38.200 | able to give you that piece of annotation.
00:35:40.000 | I wouldn't quite say the same thing about the sentiment.
00:35:43.720 | I don't quite know--
00:35:45.720 | I haven't been convinced by any schema of sentiment
00:35:48.320 | that is sufficiently independent of what you're trying to do
00:35:51.520 | that we could provide it.
00:35:52.680 | Instead, what we do provide you is a text categorization
00:35:54.940 | library.
00:35:55.640 | And the text categorization model that we have
00:35:58.120 | is only one of many that you might build.
00:36:01.700 | And it's not best for every application.
00:36:04.040 | But it does do pretty well for short text.
00:36:07.680 | And I think that on many sentiment benchmarks,
00:36:11.880 | it performs quite well.
00:36:13.360 | It's a lot slower than some other ways
00:36:16.520 | that you could do sentiment.
00:36:17.800 | So it depends on what type of text you're trying to process
00:36:20.800 | and that sort of thing.
00:36:22.240 | [INAUDIBLE]
00:36:24.960 | Well, oh, yes.
00:36:25.640 | So explicitly, the coreference resolution package
00:36:31.560 | that you should use is called MuralCoref.
00:36:33.760 | So MuralCoref.
00:36:34.660 | [INAUDIBLE]
00:36:36.240 | Yeah.
00:36:37.240 | Yeah, and it's built on PyTorch.
00:36:39.640 | It's overall pretty good.
00:36:40.720 | You can train it yourself.
00:36:42.120 | Yeah, well, PyTorch is the machine learning layer.
00:36:45.340 | But yes, it's built on spaCy.
00:36:46.600 | So yes.
00:36:47.080 | So for German, I think it's pretty easy.
00:37:00.520 | I've been using the word vectors trained by fastText.
00:37:03.920 | And you can basically just plug that
00:37:06.880 | in so there's one command to convert that
00:37:10.040 | into a spaCy vocab object and load it up.
00:37:14.000 | We're trying to provide pre-trained models which
00:37:15.880 | don't depend on pre-trained word vectors
00:37:18.720 | so that you can bring your own.
00:37:19.880 | Because otherwise, there's this conflict
00:37:21.520 | of the model's been trained to expect some word vectors.
00:37:25.120 | And then if you sub your own in, it's
00:37:26.760 | going to get different input representations.
00:37:30.240 | But yeah, training or bringing your own vectors
00:37:34.120 | is designed to be pretty easy.
00:37:35.920 | And if it's not, I apologize if there's bugs,
00:37:38.440 | and we'll try to fix them.
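As a rough sketch of bringing your own vectors in (the file name is a placeholder, and newer spaCy versions also ship a command-line converter for this, e.g. spacy init vectors in v3):

```python
import numpy
import spacy

nlp = spacy.blank("de")  # start from a blank German pipeline

# Read a plain-text fastText .vec file and add each vector to the vocab.
# The path is a placeholder for whichever vectors you downloaded.
with open("fasttext.de.300.vec", encoding="utf8") as f:
    f.readline()  # first line is "<number of vectors> <dimensions>"
    for line in f:
        pieces = line.rstrip().split(" ")
        word = pieces[0]
        vector = numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)
```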
00:37:40.140 | So the question is, after parsing and interpreting,
00:37:50.240 | do we have an interlingual representation that
00:37:52.520 | can then be used to generate another language?
00:37:55.440 | The answer is probably not.
00:37:56.760 | I mean, we don't have generation capabilities in spacey.
00:38:00.600 | People have worked on this sort of thing.
00:38:02.840 | But in general, having an explicit interlingual
00:38:07.160 | tends to perform less well than more brute force
00:38:10.080 | statistical approaches to syntax.
00:38:12.080 | And I think the reason does sort of make sense
00:38:14.880 | that the languages are pretty different in the way
00:38:18.920 | that they phrase things and the way
00:38:20.120 | that they model the world in lots of ways.
00:38:21.800 | And so getting a translation that's remotely
00:38:24.680 | idiomatic out of that sort of interlingual representation
00:38:26.960 | is pretty tough.
00:38:28.920 | And then there's another argument
00:38:30.360 | that you're solving a subproblem that's
00:38:33.320 | harder than the direct translation approach, which
00:38:37.280 | I'm not sure whether I buy that argument or not.
00:38:39.280 | But it's a common one that people use.
00:38:40.840 | OK, so should we move forward to the next talk?
00:38:48.480 | Thanks.
00:38:50.480 | [APPLAUSE]
00:38:52.440 | So yeah, we started out by hearing a lot about the more
00:39:04.840 | theoretical side of things.
00:39:06.040 | And I'm actually going to talk about how we collect and build
00:39:10.360 | training data for all these great models we can now build.
00:39:12.680 | And the nice thing about machine learning
00:39:15.760 | is that, well, we can now train a system
00:39:17.800 | by just showing examples of what we want.
00:39:20.000 | And that's great.
00:39:20.680 | But the problem is, of course, we need those examples.
00:39:23.560 | And even if you're like, oh, I got this all figured out.
00:39:26.080 | Are you using this amazing unsupervised method
00:39:28.520 | where my system just infers the categories from the data
00:39:32.760 | and I never need to label any data?
00:39:34.760 | That's pretty nice, but you still
00:39:36.240 | need some way of evaluating your system.
00:39:38.160 | So we pretty much always need some form of annotations.
00:39:42.800 | And now the question is, well, why do we even care about this?
00:39:47.280 | Why do we care about whether this is efficient,
00:39:50.960 | whether this works or not?
00:39:53.120 | The thing is, the big problem is that we actually,
00:39:56.920 | with many things in data science and machine learning,
00:39:59.600 | we need to try out things before we know whether they work.
00:40:02.720 | Or we often don't know whether an idea is going
00:40:04.720 | to work before we try it.
00:40:05.840 | So we need to expect to do annotation lots of times
00:40:09.720 | and start off from scratch.
00:40:12.080 | Start all over again if we fucked up our label scheme.
00:40:15.920 | Try something else.
00:40:16.720 | We need to do this lots of times, so it needs to work.
00:40:19.440 | And similarly, especially if you're
00:40:23.040 | working in a company in a team where you really
00:40:26.680 | want to use your model to find something out,
00:40:29.560 | ideally the person building the model
00:40:31.480 | should be involved in that process.
00:40:33.760 | And also, we always say good annotation teams are small.
00:40:36.800 | A lot of people don't understand this.
00:40:38.340 | There's a lot of movement towards,
00:40:42.120 | oh, let's crowdsource this, get hundreds of volunteers,
00:40:44.640 | and we always have to remind, especially companies,
00:40:47.600 | that well, look at the big corpora
00:40:50.040 | that we use to train models.
00:40:51.920 | The good ones were produced by very few people,
00:40:54.120 | and there's a reason for that.
00:40:57.000 | More people doesn't always mean better results, actually
00:40:59.520 | quite the opposite.
00:41:00.240 | So how great would it be if actually the developer
00:41:03.920 | of the model could be involved in labeling the data?
00:41:08.600 | And of course, we also have the problem of the specialist
00:41:11.680 | knowledge, especially in industries where this matters.
00:41:16.920 | You might want to have a medical professional give some feedback
00:41:20.680 | on the labels, or actually really label your data,
00:41:23.480 | or maybe a finance expert.
00:41:26.000 | And yeah, those people usually have limited time.
00:41:28.080 | If you get an hour of their time,
00:41:29.520 | you want to use it more efficiently,
00:41:31.400 | and you don't want to bore them to death,
00:41:33.800 | or actually find the one person who has nothing else to do,
00:41:36.800 | because their knowledge is probably not
00:41:39.260 | as valuable as other experts' knowledge.
00:41:43.560 | And yeah, another big problem, since you want humans,
00:41:49.640 | is that humans are actually--
00:41:52.040 | humans kind of suck.
00:41:53.920 | We're not that efficient at a lot of things.
00:41:57.880 | So for example, we really have problems
00:41:59.960 | performing boring, unstructured tasks,
00:42:02.580 | especially things that require multiple steps and multiple
00:42:04.920 | things we need to get right.
00:42:06.240 | We can't remember stuff.
00:42:08.640 | We're bad at consistency and getting stuff right.
00:42:12.640 | So fortunately, computers are really good at that stuff.
00:42:16.880 | And in fact, it's probably also the main reason
00:42:19.180 | we built computers.
00:42:20.840 | So there's really no need to waste the human's time
00:42:25.480 | by making them do stuff that they're
00:42:26.760 | going to do badly anyways.
00:42:28.160 | And instead, we want our annotation tooling
00:42:31.040 | to be as automated as possible.
00:42:33.760 | Or in general, we want to automate as much as possible,
00:42:36.240 | and really have the human focus on the stuff
00:42:37.920 | that the human is good at, and we really need that input.
00:42:40.720 | And that's usually context, ambiguity, stuff
00:42:43.920 | like we can look at a sentence, and most of us
00:42:46.080 | will be able to understand a figure of speech
00:42:48.160 | immediately without thinking twice about it.
00:42:50.120 | That's the stuff that's really, really hard for a computer.
00:42:53.320 | Also, put differently, humans are good at precision.
00:42:56.520 | Computers are good at recall.
00:42:58.720 | So the thing is, yeah, what I'm saying here,
00:43:01.880 | it sounds a bit like "floss and eat your veggies."
00:43:06.040 | Yeah, we probably will have had some experience
00:43:08.600 | with labeling data.
00:43:09.560 | And normally, yeah, we also gave this talk
00:43:13.080 | to a crowd of more data science focused industry professionals.
00:43:19.320 | And actually, you'd be surprised how many companies we talked
00:43:24.440 | to, also very large companies, very actually
00:43:26.560 | technologically sophisticated companies,
00:43:28.760 | that mostly use Excel spreadsheets for everything.
00:43:31.880 | And it's not inherently bad, but they
00:43:35.020 | are very obvious problems with Excel spreadsheets.
00:43:37.800 | And there's definitely a lot of room for improvement.
00:43:39.960 | So once people figure this out and realize
00:43:42.800 | that maybe they could do something better,
00:43:44.800 | or it's just terrible, like we don't want to do this,
00:43:47.120 | the next move is normally, let's move this all out
00:43:49.480 | to Mechanical Turk or some other crowd-sourced platform.
00:43:53.440 | And yeah, Mechanical Turk, the Amazon cloud of human labor.
00:44:00.600 | And so, yeah, people do that.
00:44:01.960 | And then I was also surprised that their results are not
00:44:04.320 | very good.
00:44:04.960 | And the problem is, yeah, OK.
00:44:07.280 | So you have some guy do it for $5 an hour,
00:44:10.560 | get the data back, train your model doesn't work.
00:44:13.520 | And actually, it's very difficult to then retroactively
00:44:17.720 | find out what the problem was.
00:44:18.880 | Maybe your label scheme was bad.
00:44:21.160 | Maybe your idea was bad.
00:44:23.200 | Maybe the data was bad.
00:44:24.280 | Maybe you didn't write your annotation manual properly.
00:44:28.280 | Maybe-- actually, yeah, another nice thing.
00:44:30.320 | Maybe you paid too much, because if you
00:44:32.680 | pay too much on Mechanical Turk, you attract all the bad actors.
00:44:35.600 | So you kind of have to stick to about half the minimum wage.
00:44:40.200 | So that could have been a problem.
00:44:42.040 | Maybe your model was bad.
00:44:43.160 | Your training code was bad.
00:44:44.240 | It's very, very difficult to find that out.
00:44:46.640 | And also, you realize that, well, it's not really just
00:44:49.440 | the cheap click work.
00:44:53.000 | You need more than that.
00:44:53.000 | So then, yeah, what most people conclude from this
00:44:56.160 | is, fuck this labeling in general.
00:44:59.840 | I don't want to do this anymore.
00:45:01.240 | Let's just find some unsupervised method
00:45:03.880 | and not bother with this.
00:45:05.080 | And that's actually-- yeah, also, the conversation
00:45:09.120 | I had recently where we talked to a larger media company,
00:45:11.840 | and they'd done exactly that.
00:45:13.040 | And now they have a few hundred clusters.
00:45:15.320 | And it's really great.
00:45:16.240 | They have really great clusters.
00:45:17.560 | But now, their problem is that they
00:45:19.920 | have no idea what these clusters are.
00:45:21.840 | So they now need to label their clusters.
00:45:23.840 | And now, they're kind of back in the beginning.
00:45:25.960 | And I think what we see from this
00:45:28.040 | is that the labeled data itself, the fact
00:45:30.560 | that we need labeled data, that's an opportunity.
00:45:32.560 | That is not the problem.
00:45:33.560 | The problem is how we do it.
00:45:35.960 | And yeah, so we've been thinking about this a lot.
00:45:41.520 | At least, from our point of view,
00:45:43.480 | there are a lot of things we could do better.
00:45:46.320 | So one of the things, really, to work against this problem
00:45:50.400 | that we have caused by us being human
00:45:53.400 | is that we need to break down these very complex things we're
00:45:58.240 | asking the humans into smaller, simpler questions.
00:46:02.120 | And ideally, these should be binary decisions.
00:46:04.840 | So we can have a much better annotation speed
00:46:07.240 | because we can move through the things faster.
00:46:09.240 | And we can also measure the reliability much easier
00:46:13.080 | than if we ask people open questions.
00:46:14.960 | Because we can actually say, OK, do our annotators agree?
00:46:17.420 | Do they not agree?
00:46:18.680 | Because that's, in the end, very important
00:46:20.720 | to find out whether we've collected data the right way.
00:46:23.760 | And the binary thing itself, it sounds a bit radical.
00:46:28.560 | But actually, if you think about it, pretty much
00:46:31.600 | any task can be broken down into a sequence of binary,
00:46:35.960 | yes-or-no decisions.
00:46:38.400 | It might mean that we have to accept that, OK,
00:46:40.640 | if we're annotating a sentence for entities,
00:46:43.120 | we won't actually end up with gold-standard data
00:46:46.960 | for this sentence.
00:46:48.040 | We might actually end up with only partially annotated data.
00:46:51.000 | And we have to deal with that.
00:46:52.880 | But still, we're actually able to use our human's time
00:46:56.360 | more efficiently, which is often much more important.
00:47:00.120 | So a lot of examples I'm going to show you now
00:47:05.200 | from using our annotation tool Prodigy, which, yeah,
00:47:08.920 | we started building as an internal tool.
00:47:11.540 | But we very, very quickly realized
00:47:12.880 | that, OK, this is really something pretty much every
00:47:15.400 | company we talk to, most users we talk to,
00:47:18.080 | this was always something that kept coming up.
00:47:20.440 | So we thought, OK, what if we really combine all these ideas
00:47:24.760 | we already have, and how to train a model,
00:47:27.080 | actually use the technology we're working with within the tool,
00:47:31.200 | and also use the insights we have from user experience,
00:47:36.240 | and how to get humans to do stuff most efficiently,
00:47:41.480 | how to get humans excited, actually,
00:47:42.880 | even the whole idea of gamification,
00:47:45.160 | how to get humans to really stick to doing something,
00:47:50.920 | and put this all into one tool, and that's Prodigy.
00:47:54.520 | And so here, we see some examples of those tasks,
00:48:00.760 | and how we can present things in a more binary way.
00:48:03.760 | So in the top left, we have an entity task.
00:48:08.440 | So here, this comes from Reddit, and we're
00:48:11.440 | labeling whether something is a product or not.
00:48:14.000 | And what we did here is we load in a spaCy model,
00:48:17.760 | asked the model to label the products,
00:48:20.400 | and then we look at them and say yes or no.
00:48:23.520 | Or we can also use a mode where we can then actually
00:48:27.240 | click on this, remove this, label something else.
00:48:31.360 | But still, you see, OK, we don't have to do this
00:48:34.040 | in an Excel spreadsheet.
00:48:35.000 | We actually get one question, we look at this,
00:48:37.120 | and pretty much immediately, we can say yes or no.
00:48:42.400 | The same here, on the right, they were using--
00:48:44.800 | I think this is actually a real example using the YOLO2 model
00:48:49.160 | with the default categories.
00:48:51.160 | And we have an image of a skateboard.
00:48:53.640 | We could say, is this a skateboard, yes or no?
00:48:57.000 | And yeah, immediately, have our annotations here.
00:49:01.320 | And even this one in the corner, even
00:49:03.360 | if we're not able to really break it down
00:49:06.000 | into a true binary task, we can still
00:49:07.640 | make it more efficient and easier for a human to answer.
00:49:12.640 | Because here, with keyboard shortcuts,
00:49:14.680 | you can still do maybe two, three seconds per annotation
00:49:19.200 | and you have an answer.
00:49:20.800 | Or we say, hey, it's actually so fast,
00:49:23.000 | if we can get to one second, we might as well
00:49:26.400 | label our entire corpus twice, positive, negative,
00:49:30.840 | other labels we want to do, and just move through it quicker.
00:49:36.360 | And yeah, to give you some background on why did we do
00:49:40.840 | this, what do we think Prodigy should achieve,
00:49:46.120 | we really think that, OK, we want
00:49:48.560 | to be able to make annotation so efficient that data scientists
00:49:51.760 | can do it themselves.
00:49:52.600 | Or here, what we call data scientists
00:49:55.000 | can also be researchers and people working with the data,
00:49:58.000 | people training the model.
00:50:01.440 | Yeah, reading it like that, it still doesn't sound like fun.
00:50:03.880 | But the idea is, we could really make a process that's
00:50:07.880 | efficient that you actually really want to do this
00:50:10.040 | because you don't have to depend on anyone else.
00:50:12.640 | You can just get the job done and see
00:50:15.040 | whether your idea works or not.
00:50:16.200 | And the same-- yeah, and this also
00:50:18.600 | means you can iterate faster.
00:50:20.040 | We're very used to, OK, you iterate on your code,
00:50:22.040 | but you can actually iterate on your code and your data.
00:50:24.240 | You try something out, doesn't work, try something else.
00:50:28.320 | Maybe see, OK, is it going to work
00:50:29.960 | if I collect more annotations?
00:50:31.600 | You can all try this out.
00:50:33.400 | And we also want to waste as little time as possible
00:50:38.160 | and use what the model already knows
00:50:41.280 | and have the human correct its predictions instead of just
00:50:43.800 | having a human do everything from scratch.
00:50:45.760 | And as a library itself, we really
00:50:49.040 | want Prodigy to fit into the Python ecosystem.
00:50:52.600 | We want it to be customizable, extensible in Python.
00:50:56.440 | You can write scripts for it.
00:50:57.800 | And we also-- it was a very conscious decision
00:51:00.960 | not to make it a SaaS tool, because we think data privacy
00:51:03.800 | is important.
00:51:05.520 | You shouldn't have to send your text to our servers
00:51:07.960 | for no reason.
00:51:09.040 | And we also think you shouldn't be locked in.
00:51:11.520 | Like, you should get a JSON format
00:51:13.080 | out that you can use to train your models however you like,
00:51:15.560 | and not our random format that you can then
00:51:18.920 | download from our servers.
00:51:20.960 | So that's where we're going with Prodigy.
00:51:23.000 | And here's a very simple illustration
00:51:27.680 | of how the app looks. At the center are recipes,
00:51:31.560 | which are very simple
00:51:32.440 | Python scripts that orchestrate the whole thing.
00:51:35.120 | You have a REST API that communicates with the web app
00:51:38.360 | naturally so you can see things on the screen.
00:51:42.800 | You have your data that's coming in, which is text images.
00:51:46.840 | And you can have an optional model state
00:51:49.480 | that's updated in a loop, if you want that.
00:51:52.200 | And then the model communicates with the recipe.
00:51:58.440 | As the user annotates, it's updated in a loop
00:52:02.640 | and can suggest more annotations that are more compatible
00:52:08.040 | with the annotator's recent decisions.
00:52:10.200 | And yeah, there's a database and a command line interface
00:52:13.160 | so you can actually use it efficiently
00:52:16.360 | and don't have to worry about these aspects.
00:52:19.160 | So here, can you see?
00:52:20.200 | Yeah, in the corner we have a simple example
00:52:23.560 | of a recipe function, which really is just a Python
00:52:27.280 | function.
00:52:28.320 | You load your data in and then you return this dictionary
00:52:31.120 | of components, for example, an ID of the data set,
00:52:34.680 | how to store your data, a stream of examples.
00:52:36.880 | You can pass in callbacks to update your model,
00:52:40.480 | things to execute before the thing starts.
00:52:43.080 | So the idea is really, OK, if you need to load something in,
00:52:46.760 | if you can write that in Python, you can do it in Prodigy.
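To make that concrete, here is a minimal sketch of what such a recipe function can look like. The recipe name, source file, and label are placeholders, and the dictionary keys follow Prodigy's documented recipe API, which may differ slightly across versions.

```python
# Minimal sketch of a Prodigy recipe: a plain Python function that returns a
# dictionary of components. "classify-fruit", FRUIT and the source file are
# placeholder names, not anything from the talk.
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("classify-fruit")
def classify_fruit(dataset, source):
    # Stream of annotation tasks: one dict per example, each a binary question.
    stream = ({"text": eg["text"], "label": "FRUIT"} for eg in JSONL(source))
    return {
        "dataset": dataset,           # ID of the dataset the answers go into
        "view_id": "classification",  # which annotation interface to render
        "stream": stream,             # the examples to annotate
        # optional keys: an "update" callback to train a model in the loop,
        # "on_load" / "on_exit" hooks to run before and after the session
    }
```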
00:52:50.640 | And you can also-- we provide a bunch of pre-built-in recipes
00:52:58.560 | for different tasks with some ideas of how
00:53:01.400 | we think it could work, like named entity recognition.
00:53:04.760 | For example, you can use the model,
00:53:06.800 | correct its predictions.
00:53:07.880 | You can use the model, say yes or no, to things.
00:53:11.000 | You can use it for dependency parsing and look at an arc
00:53:15.960 | and annotate that.
00:53:16.880 | We have recipes that use word vectors
00:53:20.120 | to build terminology lists, text classification.
00:53:22.880 | So there's also a lot that you can mix and match creatively.
00:53:26.520 | For example, you have the multiple choice example
00:53:30.640 | that's not really tied to any machine learning task,
00:53:34.280 | but it fits pretty much into any of these workflows
00:53:37.560 | that you might be doing.
00:53:38.920 | And of course, the evaluation is also
00:53:40.640 | something we think is very, very important
00:53:42.360 | and is often neglected, especially in more industry use
00:53:47.480 | cases.
00:53:49.320 | But we think there's actually-- A/B evaluation is actually
00:53:51.520 | a very powerful way of testing whether your output is really
00:53:57.720 | what you want it to be.
00:54:00.920 | And so here we see an example of how
00:54:05.560 | you can chain different workflows together,
00:54:07.960 | all using models, word vectors, things you already
00:54:10.600 | have in order to get where you want to get to faster.
00:54:14.400 | So here, a simple example, we want to label fruit.
00:54:20.200 | It's kind of a stupid example because it's that--
00:54:22.800 | I can't think of many use cases where you actually
00:54:25.520 | want to do that, but it makes a great illustration here.
00:54:30.160 | So yeah, we start off, we say, OK, we want fruit.
00:54:33.000 | What are fruit?
00:54:33.760 | We have some examples, apple, pear, banana.
00:54:36.040 | That's what we can think of.
00:54:37.120 | And we also have word vectors that we can use that will easily
00:54:42.320 | give us more terms that are similar to these three fruit
00:54:46.680 | terms that we came up with.
00:54:48.400 | And then we can use this terminology list
00:54:50.840 | that we collected by just saying yes or no to what we've gotten
00:54:54.560 | out of the word vectors, look at those in our data,
00:54:58.240 | and then say whether apples in this context is a fruit or not.
00:55:04.160 | Because we're not just labeling all fruit terms as a fruit
00:55:11.560 | entity, because it could be apple, the company.
00:55:14.200 | But we get to look at it, and it's much more efficient
00:55:16.320 | than if you ask the human to sit through and highlight
00:55:19.440 | every instance of fruit nouns in your text.
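As a rough sketch of that word-vector step (not the exact recipe used in the talk), you can average the seed-term vectors and ask the vocabulary for its nearest neighbours, then review each suggestion with a yes/no decision. This assumes a spaCy model with word vectors, such as en_core_web_lg, is installed.

```python
# Rough sketch of bootstrapping a terminology list from word vectors.
# Assumes a spaCy model with vectors (e.g. en_core_web_lg); Prodigy's terms
# recipes wrap a similar idea in a review interface.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")
seeds = ["apple", "pear", "banana"]

# Average the seed vectors into a single query vector.
query = np.mean([nlp.vocab[w].vector for w in seeds], axis=0)

# Find the most similar entries in the model's vector table.
keys, _, scores = nlp.vocab.vectors.most_similar(
    np.asarray([query], dtype="float32"), n=20
)
candidates = [nlp.vocab.strings[int(k)] for k in keys[0]]
print(candidates)  # each candidate then gets a simple accept/reject decision
```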
00:55:26.680 | And so this also leads to one of our main aspects
00:55:34.000 | of the tool, workflows that we're especially proud of
00:55:36.360 | and that we think really can make a difference, which is we
00:55:39.280 | can actually start by telling the computer more abstract
00:55:43.280 | rules of what we're looking for and then annotating
00:55:45.560 | the exceptions instead of really starting from scratch.
00:55:48.920 | Or we can even use the technology
00:55:51.360 | we're working with to build these semi-automatically using
00:55:55.000 | word vectors, using other cool things that we can now do.
00:55:58.040 | And then, of course, also specifically
00:56:00.920 | look at those examples that the statistical model we
00:56:04.920 | want to train is most uncertain about.
00:56:07.000 | So we try to avoid the predictions
00:56:10.440 | where we can be pretty sure that they're correct
00:56:12.920 | and actually really ask the human first about the stuff
00:56:17.760 | that's 50/50 and where really the human feedback makes
00:56:21.760 | most of the difference.
00:56:23.560 | And so here's a quick example.
00:56:25.600 | Let's say, OK, we want to label locations.
00:56:30.560 | We start off with one city, San Francisco.
00:56:33.560 | And then we look at what else is similar to that term.
00:56:36.840 | So these are actually real suggestions
00:56:38.840 | from that Sense2Vec model that Matt showed earlier.
00:56:42.080 | And as you can see, the nice thing
00:56:44.640 | is we're using word vectors.
00:56:45.920 | We're not using a dictionary.
00:56:47.120 | So we are going to annotate California and maybe
00:56:49.680 | University of San Francisco.
00:56:51.000 | But we're not going to annotate California rolls
00:56:53.840 | because we're in a vector space and we
00:56:55.720 | know that what we're actually looking for
00:56:57.440 | is at least similar to the real meaning of the word.
00:57:00.480 | And a lot of these are super trivial to answer.
00:57:03.200 | So we can accept them, we can reject them,
00:57:05.240 | or we can ignore them because this is a bit too ambiguous
00:57:08.840 | and we don't actually want that in our list
00:57:11.040 | because it can mean too many things.
00:57:12.920 | And then from here, we can actually
00:57:15.400 | create a pattern that uses spaCy's attributes,
00:57:20.080 | or in this case, the lowercase form of the token, and GPE,
00:57:26.840 | which stands for geopolitical entities or anything
00:57:30.040 | with a government.
00:57:31.320 | And that's what we're trying to label.
00:57:32.760 | So we can easily build up these rules very quickly,
00:57:35.960 | very automated, and then we have a bunch of locations
00:57:40.080 | that we can then match in our text.
00:57:41.920 | So here, it found a mention of Virginia,
00:57:45.680 | which we can then accept.
00:57:47.520 | So that's a very, very simple example of this.
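For illustration, patterns like that can be written out as one JSON object per line, using spaCy token attributes such as the lowercase form. The file name and entries below are just examples, not the ones from the talk.

```python
# Sketch of writing a small match-patterns file (one JSON object per line)
# built from the lowercase form of each token. File name and entries are
# placeholders.
import json

patterns = [
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
    {"label": "GPE", "pattern": [{"lower": "virginia"}]},
    {"label": "GPE", "pattern": [{"lower": "bay"}, {"lower": "area"}]},
]

with open("gpe_patterns.jsonl", "w", encoding="utf8") as f:
    for entry in patterns:
        f.write(json.dumps(entry) + "\n")
```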
00:57:49.920 | But of course, this also works for slightly more complex
00:57:54.640 | constructs where we can really take advantage
00:57:57.340 | of the syntactic structure.
00:57:59.040 | So here, this was a finance example.
00:58:01.880 | So what we're trying to do is we want
00:58:03.400 | to extract information about executive compensation.
00:58:07.840 | So yeah, some executive receives some amount of money
00:58:11.880 | in stock, for example, like this one.
00:58:14.040 | And this is a pretty difficult task.
00:58:17.240 | But also, the idea here is we have this theory
00:58:20.480 | that maybe if we could train a model, a text classification
00:58:23.160 | model, to predict whether a sentence is
00:58:26.520 | about executive compensation or not,
00:58:29.120 | we can then very, very easily use what we already
00:58:32.720 | know about the text to extract, let's say, the first person
00:58:35.680 | entity.
00:58:36.600 | We extract the amount of money, put that in our database.
00:58:39.480 | And we've actually-- yeah, we found a good solution
00:58:42.660 | for an otherwise very, very complex task.
00:58:45.640 | So for this, this is just an idea.
00:58:48.920 | We haven't tried this in detail, but one possible pattern
00:58:52.400 | using token attributes we have available
00:58:55.680 | would be let's try and look for an entity type person,
00:59:00.960 | followed by a token with a lemma receive.
00:59:05.240 | So received, receives, receiving, and followed by a token
00:59:09.920 | with the entity type money.
00:59:11.680 | And let's just look at what this pulls up.
00:59:15.040 | That's an idea.
00:59:15.520 | I mean, there are plenty of other possible patterns
00:59:19.580 | you can come up with.
00:59:20.740 | And the nice thing is we're actually
00:59:22.280 | going to be looking at them again in context.
00:59:24.320 | So they don't have to be perfect.
00:59:25.920 | And even actually, in fact, even if it pulls up random stuff
00:59:29.320 | that you realize is totally not what you want,
00:59:32.960 | this is also very important.
00:59:34.280 | Because you won't only be collecting annotations
00:59:37.580 | for the things you know are definitely right.
00:59:39.720 | You're also collecting annotations
00:59:41.960 | for the things that are very, very similar or look very,
00:59:44.420 | very similar to what you're looking for but are actually
00:59:46.680 | not what you're looking for.
00:59:47.920 | And that's probably just as important
00:59:50.840 | as the positive examples.
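The idea sketched above can be tried directly with spaCy's rule-based Matcher. This assumes a recent spaCy version and an installed English pipeline, and the three parts are rarely adjacent in real text, so treat it as a starting point rather than the actual pattern from the talk.

```python
# Sketch of the "PERSON receives MONEY" idea with spaCy's rule-based Matcher.
# Assumes a recent spaCy version and an installed English pipeline; in real
# text the three parts are rarely adjacent, so this is only a starting point.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [
    {"ENT_TYPE": "PERSON"},  # a token inside a PERSON entity
    {"LEMMA": "receive"},    # received / receives / receiving
    {"ENT_TYPE": "MONEY"},   # a token inside a MONEY entity
]
matcher.add("EXEC_COMP", [pattern])

doc = nlp("Smith received $10 million in stock awards last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```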
00:59:54.240 | So yeah, the moral of the story is what we're saying
00:59:57.600 | is we're very used to iterating on our code as programmers.
01:00:02.720 | But you should really be iterating on both.
01:00:04.800 | The data is just as important.
01:00:07.120 | So as we see here, OK, that's the normal type of programming.
01:00:10.840 | You have a runtime program.
01:00:13.600 | You work on the source code.
01:00:15.200 | You compile it, get your runtime program.
01:00:17.000 | You don't like something about your program.
01:00:19.120 | You go back, change the source code, compile it, and so on.
01:00:21.520 | That's a pretty standard workflow.
01:00:23.640 | And in machine learning, we don't
01:00:26.400 | have a runtime program in that sense.
01:00:28.120 | We have a runtime model.
01:00:29.640 | So the part we should really be thinking about and working on
01:00:33.400 | is the training data.
01:00:35.000 | Instead, most focus is currently on the training algorithm.
01:00:39.400 | And if you use that analogy, that's
01:00:42.560 | very similar to going and tweaking your compiler
01:00:45.760 | if you're not happy with your runtime program.
01:00:48.640 | You can do that, but of course, you probably go back and edit
01:00:52.400 | your source code.
01:00:53.920 | I think this is actually a pretty good example.
01:00:56.400 | It's pretty accurate.
01:00:59.120 | There are only so many training algorithms,
01:01:01.040 | but what really makes a difference is your data.
01:01:03.360 | So if you have a good way and a fast way of iterating
01:01:06.480 | on that data, and you're able to really master
01:01:11.840 | this part of the problem, you'll also
01:01:13.520 | get to try more things quickly.
01:01:16.720 | As we know, most ideas don't actually work.
01:01:19.800 | It's always one of these things that's kind of misrepresented.
01:01:22.480 | A lot of people have this idea, ooh,
01:01:24.280 | you're doing all these amazing AI things,
01:01:27.000 | and everything just works.
01:01:28.080 | It's like, kind of doesn't.
01:01:29.280 | Nothing works.
01:01:30.520 | And sometimes things work.
01:01:32.920 | And you really want to find the things that actually work.
01:01:35.600 | And for that, you need to try them.
01:01:37.560 | And so it also means if you can actually
01:01:41.560 | figure out what works before you try it and invest in it,
01:01:45.440 | you can actually be more successful overall
01:01:47.200 | because you're not going to waste your time on the things that
01:01:51.280 | might fail, and can instead scale up the things that actually
01:01:55.600 | turn out to work in the first place.
01:01:57.000 | And one thing that's also very important to us
01:01:59.200 | is you can really build custom solutions.
01:02:01.800 | You can build solutions that fit exactly to your use case,
01:02:05.920 | and you'll own them.
01:02:08.480 | If you collect your own data, you'll keep that forever,
01:02:10.680 | and nobody can lock you in.
01:02:11.760 | You're not just consuming some API,
01:02:13.920 | and if that API shuts down, you can start again from scratch.
01:02:18.200 | You have your data, no matter what
01:02:20.000 | other cool things we can do at some point in the future,
01:02:22.760 | you can always go back to your labeled data
01:02:25.720 | and really build your own systems.
01:02:29.360 | And we believe that this is really something
01:02:31.200 | that's very important in the future of the technology.
01:02:34.680 | That's also a reason why we think
01:02:36.200 | AI development in general in companies
01:02:37.920 | should be done in-house.
01:02:40.520 | And, yeah, we're hoping that we can keep providing useful tools
01:02:44.640 | that will make this easier.
01:02:48.280 | Yeah.
01:02:50.120 | [APPLAUSE]
01:02:52.600 | So the question is, yeah, Jeremy thinks
01:03:08.800 | we write very good software, even though we're only two people,
01:03:11.760 | and how are we doing that?
01:03:12.920 | Yeah, that's a very good question.
01:03:14.280 | I mean, we do get this a lot.
01:03:18.000 | I mean, I think it's--
01:03:19.760 | I don't even know where this idea comes from that, like, yeah,
01:03:22.960 | you can scale things up.
01:03:24.240 | Like, I don't know, scaling things up makes things better.
01:03:28.120 | Because I do think, yeah, actually,
01:03:30.120 | the more people you get involved,
01:03:31.480 | you sometimes-- it actually can have a very negative impact
01:03:35.040 | on the quality of the software you produce.
01:03:38.560 | In our case, it's just, OK, it just works.
01:03:40.320 | Like, I also don't like this idea of, oh,
01:03:43.360 | if everyone can do exactly the same thing if they just
01:03:45.600 | work hard, even though people like thinking of it that way.
01:03:48.200 | It's just, OK, in our case, we have a good combination
01:03:51.480 | of things that we like to do, things
01:03:53.560 | that we happen to be good at, and it just works together.
01:03:57.560 | So I guess we are lucky in that way,
01:04:00.440 | but we also cut out a lot of bullshit,
01:04:02.760 | like the amount of meetings we don't take,
01:04:05.080 | the amount of events we don't go to.
01:04:08.000 | I mean, yeah, it's kind of ironic saying that, speaking
01:04:11.600 | at an event, but I really don't normally go to many events.
01:04:17.200 | We don't take coffee dates with random people
01:04:19.720 | we barely know.
01:04:22.800 | Yeah, we mostly, we really just like to write software.
01:04:26.720 | And yeah, we've had some good ideas in the past.
01:04:35.600 | Thanks for making this cool.
01:04:38.080 | I wish I had it two years ago.
01:04:40.560 | Have you done any experiments to see
01:04:42.440 | if there's actually biases and [INAUDIBLE]
01:04:45.400 | to show them your model examples versus just
01:04:48.840 | how many do you think [INAUDIBLE]
01:04:51.320 | you don't look at any trade-offs [INAUDIBLE]
01:04:54.440 | I mean, also, the question is, if we've
01:04:59.440 | done any experiments where we compare the binary decisions
01:05:04.440 | and whether it influences the annotators
01:05:06.720 | versus really doing everything from scratch.
01:05:08.960 | So we haven't done experiments specifically
01:05:11.880 | focusing on the bias because that's, in some sense,
01:05:15.960 | that's difficult because we're looking at the output.
01:05:18.720 | We're looking at, does it improve accuracy?
01:05:21.520 | We've done experiments of manual annotation
01:05:24.160 | versus binary annotation, but also mostly focused
01:05:28.480 | on our own tooling because we think it's kind of useless.
01:05:31.980 | Like, yeah, we can present you a study
01:05:33.480 | where we said, oh, we did stuff in an Excel spreadsheet
01:05:35.760 | and then we did stuff in Prodigy and it was much better.
01:05:38.320 | So it's really mostly focused around our own tooling
01:05:41.560 | and we did find that--
01:05:44.040 | well, it depends on the task you're doing.
01:05:46.600 | That's the other thing.
01:05:47.920 | I feel like giving these answers sounds unsatisfying
01:05:50.400 | because I'm always saying, well, it depends on your data.
01:05:53.120 | But that's also the whole point of it
01:05:56.040 | because we're doing this because your data is different
01:05:59.640 | and there's no one size fits all solution.
01:06:04.440 | But essentially, so we found what--
01:06:06.680 | binary annotation works especially well
01:06:08.280 | if you already have a pre-trained model that
01:06:10.320 | predicts something, ideally also something that's
01:06:13.000 | not completely terrible.
01:06:16.120 | Otherwise, the pattern approach does work very well
01:06:18.840 | on very specific domains.
01:06:22.600 | Like, we did one example of where we labeled drug names
01:06:26.480 | on Reddit, like on r/opiates, which was a pretty good--
01:06:29.920 | this was a pretty good data source
01:06:31.280 | because it's a very specific topic.
01:06:33.400 | And also, it's a subreddit that's very on topic
01:06:35.960 | because people who go on Reddit to discuss opiate use,
01:06:44.080 | usually are very dedicated to talking about this one topic.
01:06:47.080 | So it was a good, interesting data source.
01:06:48.920 | And so what we wanted to do is we labeled drug names,
01:06:54.200 | drugs, and pharmaceuticals in order to, for example,
01:06:59.040 | have a better tool set to really analyze
01:07:01.120 | the content of this subreddit and see how it develops
01:07:04.360 | over time anyway.
01:07:05.400 | So there we found the pattern-based approach
01:07:08.200 | worked very, very well because we have very specific terms.
01:07:11.400 | We can use word vectors to bootstrap these.
01:07:14.600 | Especially also, we can include spelling mistakes and stuff,
01:07:17.840 | which was very interesting.
01:07:18.960 | Like, we can really build up good word lists,
01:07:21.120 | find them in the text, confirm them, and get to pretty decent
01:07:24.120 | accuracy.
01:07:24.960 | I would expect this to work a little less well--
01:07:28.560 | the cold start problem-- on a much more ambiguous domain.
01:07:31.880 | And there, you're probably better off to say, OK,
01:07:33.880 | we're labeling by hand.
01:07:35.480 | But even there, that's something I haven't really
01:07:37.400 | shown in detail here.
01:07:38.200 | But we also have a manual interface
01:07:40.360 | where you highlight.
01:07:42.080 | But what we do there is we use the tokenizer
01:07:44.840 | to pre-segment the text.
01:07:46.560 | So you don't have to sit there and pixel
01:07:48.640 | perfect, like, highlight, and then, ah, shit,
01:07:51.000 | now I've got the white space in.
01:07:52.440 | Let's start again.
01:07:53.480 | So that's another thing we're doing.
01:07:55.440 | You can be much lazier in highlighting.
01:08:00.160 | And also, there, get more efficiency out of it.
01:08:03.840 | And still use a simpler interface.
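For example, spaCy's tokenizer on its own is enough for that pre-segmentation step; a blank pipeline has no statistical models but still splits the text into the tokens a manual interface can snap to. A small sketch, not Prodigy's internal code:

```python
# Small sketch of pre-segmenting text with spaCy's tokenizer so highlighting
# can snap to token boundaries (not Prodigy's internal code).
import spacy

nlp = spacy.blank("en")  # tokenizer only, no statistical models needed
doc = nlp("San Francisco isn't cheap.")
print([token.text for token in doc])
# ['San', 'Francisco', 'is', "n't", 'cheap', '.']
```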
01:08:06.760 | Yeah?
01:08:08.280 | So you mentioned about the [INAUDIBLE]
01:08:12.440 | Yeah.
01:08:13.440 | [INAUDIBLE]
01:08:35.120 | So the question is, first, you gave an example
01:08:41.000 | of annotating patient data, which is obviously very
01:08:45.240 | problematic because doctors are not always very specific
01:08:48.080 | in what they fill in.
01:08:49.040 | And then, in the end, this was how did they enrich that with--
01:08:52.600 | So what they did is they got foundation of the [INAUDIBLE]
01:08:57.960 | Yeah.
01:08:59.960 | Yeah, so basically, OK, the question
01:09:01.320 | is whether we have some experience in the medical field
01:09:04.960 | mixing this.
01:09:06.920 | The answer is, well, we haven't personally done this.
01:09:09.040 | But we do have quite a few companies in that domain,
01:09:13.880 | also because the tool itself is quite appealing
01:09:17.120 | because you can run it in your own compliant environment,
01:09:20.960 | you know, that data privacy aspect.
01:09:23.880 | But it's interesting to explore.
01:09:27.120 | That's maybe also where, OK, having the professionals--
01:09:29.680 | getting the medical professionals more involved
01:09:31.520 | might make sense, which normally is very difficult.
01:09:34.720 | You don't want a doctor to do all the work themselves.
01:09:37.640 | But if you can find some way to distill that and then ask
01:09:40.720 | the doctor, OK, you wrote this here, does that mean--
01:09:44.520 | you wrote x, does that mean y?
01:09:46.160 | And the doctor says, yep.
01:09:47.360 | Or the doctor says, nah.
01:09:49.080 | If you can try this out and extract some information,
01:09:53.400 | well, that could be one idea to solve that, for example.
01:09:56.480 | Yeah, I can definitely see that.
01:09:59.480 | [INAUDIBLE]
01:10:07.920 | You can.
01:10:08.600 | Like right now, we don't have a built-in logic for that,
01:10:13.320 | although we are working on--
01:10:15.080 | oh, sorry, I forgot to repeat the question--
01:10:18.080 | inter-annotator agreement, if you can calculate that
01:10:21.040 | and incorporate that into your model.
01:10:23.120 | So we're actually working on an extension
01:10:25.120 | for Prodigy, which is much more specifically
01:10:27.040 | for managing multiple annotators.
01:10:28.840 | Because the tool here, we really designed specifically
01:10:32.240 | as a developer tool first and then scaling it up a second.
01:10:36.800 | But since you have the binary feedback,
01:10:38.880 | and if you have an idea, if you have an algorithm you want to use
01:10:41.640 | and you know what you want, you can already
01:10:44.400 | do that fairly easily because you can download
01:10:46.960 | all the data as JSON.
01:10:48.640 | You have a key that's answer, which is either
01:10:50.720 | accept, reject, or ignore.
01:10:53.200 | You can attach your own arbitrary data like a user ID.
01:10:57.160 | And then it's fairly trivial to write your own function that
01:11:00.240 | really takes all of this, reads it in, computes something,
01:11:03.760 | and then uses this later on.
01:11:06.200 | So that's definitely possible.
01:11:07.360 | But this is also something we're really interested in exploring
01:11:11.400 | and working on.
01:11:12.160 | And the binary interface is great,
01:11:14.320 | but the trainer kind of is great, but yeah.
01:11:17.400 | Yeah, so we see binary.
01:11:18.840 | That's a big advantage of the binary interface
01:11:21.480 | is that there are only two options.
01:11:25.800 | You filter out the ignored ones, and then you
01:11:28.800 | can really answer that question.
01:11:30.720 | Yeah.
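As a rough sketch of that post-hoc computation: each exported record has an "answer" field (accept, reject, or ignore) plus whatever custom fields you attached, such as an annotator ID, so a simple per-example agreement score is only a few lines of Python. Everything beyond the "text" and "answer" fields here is an assumption.

```python
# Rough sketch of computing per-example agreement from exported annotations.
# "text" and "answer" are standard fields; grouping by text and comparing
# answers across annotators is an illustration, not a built-in Prodigy feature.
import json
from collections import defaultdict

def simple_agreement(path):
    answers = defaultdict(list)
    with open(path, encoding="utf8") as f:
        for line in f:
            eg = json.loads(line)
            if eg["answer"] == "ignore":
                continue
            answers[eg["text"]].append(eg["answer"])
    # Fraction of multiply-annotated examples where all answers agree.
    multi = [len(set(a)) == 1 for a in answers.values() if len(a) > 1]
    return sum(multi) / len(multi) if multi else None

print(simple_agreement("annotations.jsonl"))
```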
01:11:31.880 | [INAUDIBLE]
01:11:32.840 | Yeah, well, you can design--
01:11:43.480 | so the question was--
01:11:47.280 | one interface I showed, which was the sentiment
01:11:49.280 | one with the multiple selections.
01:11:51.600 | This is not binary.
01:11:52.520 | That's true.
01:11:53.400 | And actually, it's also something we usually
01:11:55.120 | tell our users avoid this as much as possible, if you can.
01:11:59.680 | And in some cases, you might still want that.
01:12:01.880 | Or we say, look, a lot of people still
01:12:05.000 | think of surveys when they think of annotating data.
01:12:07.000 | And I get where this is coming from,
01:12:09.960 | but I think if you can leave that sort of mindset
01:12:12.600 | and really open up a bit and think of other creative ways,
01:12:15.000 | you could get more out of this.
01:12:16.380 | If you want to re-engineer a survey,
01:12:18.400 | maybe you want to use a survey tool.
01:12:21.120 | So for example, if I were doing this with those four options,
01:12:24.640 | I would say, OK, we have all texts.
01:12:27.360 | The annotator sees every text four times and says,
01:12:30.320 | is this happy or is this not happy?
01:12:32.720 | And because you can get to one second for annotation,
01:12:35.040 | that's very fast.
01:12:36.400 | Like, even if you have thousands of examples,
01:12:38.360 | you can do this in a day yourself.
01:12:40.680 | And so that's how we would probably solve this.
01:12:42.880 | And it also means you get every example four times.
01:12:46.520 | And for each text, you know, is it sad?
01:12:48.600 | Is it happy?
01:12:49.360 | Is it neutral?
01:12:50.200 | Is it something else?
01:12:51.320 | You have much more data.
01:12:52.600 | But not everyone wants this.
01:12:54.040 | Some people really want to build that survey.
01:12:56.400 | And we let them.
01:12:57.600 | But yeah.
01:12:59.920 | Yeah.
01:13:00.920 | [INAUDIBLE]
01:13:01.920 | So the question is, if you're doing the same example
01:13:15.200 | multiple times, whether it slows down the annotation or not.
01:13:18.520 | Well, actually, I mean, it's difficult to say
01:13:21.000 | because it depends.
01:13:22.240 | But I've actually found that even if you do the bare maths,
01:13:26.160 | it can easily be much faster.
01:13:27.800 | Because if you say, OK, 1,000 examples.
01:13:31.400 | And normally, if you really have to think
01:13:33.520 | about five different concepts that are maybe not even fully
01:13:36.120 | related, that just every tiny bit of friction
01:13:38.920 | you put between a human and the interface or the decision
01:13:41.800 | can very significantly slow down the process.
01:13:44.000 | So you think about, oh, is this happy?
01:13:45.560 | Or is this sad?
01:13:46.560 | Or is this about sports?
01:13:47.800 | Or is this about horses?
01:13:49.840 | And just this can easily add like 10 seconds
01:13:52.720 | to each question.
01:13:54.320 | So if you do the whole thing three times at one second,
01:13:59.680 | you're still faster than you would have been
01:14:02.480 | if you'd added this friction.
01:14:04.520 | And the other part is just a human error.
01:14:07.840 | If you have to think too much, you're much more likely to fuck
01:14:11.160 | it up and do it badly.
01:14:12.160 | And then that's also something you want to avoid.
01:14:15.440 | [INAUDIBLE]
01:14:15.920 | But the active learning helps a lot here as well.
01:14:18.240 | So if you have your labels, it's pretty confident
01:14:20.560 | that the economic labels don't apply.
01:14:22.280 | And so you just don't have to learn something.
01:14:24.440 | Yeah, to repeat this, the active learning also
01:14:28.080 | makes a difference here.
01:14:29.680 | Because you could actually--
01:14:32.280 | yeah, you could pre-select the ones
01:14:34.160 | that really make a difference to annotate
01:14:36.080 | and don't have to really go through every single one that
01:14:39.920 | is not as important as some of the other ones
01:14:42.680 | that you really care for.
01:14:44.880 | Yeah, do you have any experience working with tasks like that
01:14:48.560 | or how you sort of [INAUDIBLE]
01:14:50.520 | Yeah, so the question is, yeah, what
01:14:52.480 | about tasks that need a lot of context,
01:14:54.080 | like the whole medical history or just a whole document.
01:14:57.760 | So we have-- and whether we have experience with that.
01:15:00.800 | So in general, we do say, if your task requires so much
01:15:05.640 | context that you can't fit this into the prodigy interface,
01:15:08.520 | then it doesn't mean that you can't train a model on that.
01:15:11.160 | But for most of the tasks that users most commonly want to do,
01:15:14.440 | this is often also an indicator that it's very, very difficult
01:15:16.800 | to actually teach your model that if you're
01:15:19.080 | doing named entity recognition or even text classification
01:15:23.080 | and you need a lot of context and all the context
01:15:26.320 | is equally as important, that's often an indicator
01:15:29.080 | that that might not work so well.
01:15:31.360 | So for example, text classification,
01:15:32.800 | we say, OK, we start off by selecting one sentence
01:15:35.880 | from the whole document.
01:15:37.280 | And then instead of you annotating the whole document,
01:15:41.360 | you say, OK, this is the most important sentence.
01:15:44.320 | Does this label apply or not?
01:15:46.040 | So there are some tricks we use to get around this problem
01:15:51.720 | because, yeah, we also think that, OK, it's
01:15:56.280 | important to get this across and frame it in that way
01:15:59.360 | because, yeah, if you need two pages on your screen,
01:16:03.320 | it's not efficient at all.
01:16:06.520 | And also likely, you can do all that work,
01:16:08.800 | but your model won't learn that because your model needs
01:16:12.000 | local context as well, at least, for the tasks that we are--
01:16:15.880 | I don't know if you have anything to add to that.
01:16:18.880 | Yeah, OK.
01:16:20.880 | Often, it's important to take into account the models
01:16:23.960 | really that are available.
01:16:25.440 | [INAUDIBLE]
01:16:25.920 | Yeah, so the suggestion was, OK, having some tools,
01:16:33.560 | some process that goes along with the software that
01:16:37.000 | helps people break this down.
01:16:38.640 | Yeah, we've actually been thinking about this a lot
01:16:40.640 | because we do realize the tool is quite new,
01:16:43.120 | and we're introducing a lot of new concepts at once,
01:16:45.160 | and also some best practices where we think,
01:16:47.200 | ah, that's how you should do it, or you could try this.
01:16:49.960 | And we are also realizing that there's
01:16:52.200 | no real satisfying one-size-fits-all answer.
01:16:55.160 | That's another problem.
01:16:56.440 | Everyone's use case is different,
01:16:57.800 | so right now what we're doing is we have a support form
01:17:00.440 | for Prodigy where we answer people's questions.
01:17:03.040 | And actually, a lot of users share
01:17:04.840 | what they're working on, asking for tips.
01:17:08.320 | We kind of talk about it.
01:17:09.640 | Other users come in and are like, oh, I actually
01:17:11.640 | try to do this type of legal annotation,
01:17:14.800 | and here's what worked for me, and have this sort of exchange
01:17:17.840 | around it to figure out, OK, what works.
01:17:21.080 | Because, yeah, it's just like I think
01:17:23.320 | in machine learning, deep learning,
01:17:24.820 | a lot of the best practices are still evolving,
01:17:28.560 | and it's very, very specific.
01:17:32.160 | So it's definitely-- yeah, we're open for suggestion there
01:17:35.360 | as well, but we're still in the process of really coming up
01:17:39.200 | with a good set of best practices and ideas.
01:17:42.000 | The question is whether-- yeah, we
01:17:51.360 | have any plans to sell models like medical models?
01:17:54.240 | Yes, as part of what Matt mentioned
01:17:55.920 | in the very introduction, we are definitely
01:17:58.360 | planning on having more of a models--
01:18:02.400 | like an online store for very, very specific models.
01:18:05.120 | So medical-- that's a very, very interesting domain.
01:18:09.320 | And if so, we really want to have it specific,
01:18:11.800 | like medical texts in French or Chinese,
01:18:16.440 | and really go in that direction.
01:18:17.700 | Because we believe that, OK, pre-trained models
01:18:20.120 | are very valuable, and even if you do medical texts,
01:18:22.800 | you can start off with a pre-trained model,
01:18:25.200 | then you can use a tool like Prodigy or something else
01:18:27.400 | to really fine tune it on your very, very specific context,
01:18:31.120 | have word vectors in it that already fit to your domain,
01:18:34.720 | and maybe up those as well.
01:18:35.880 | We think that this is a very future-proof way of working
01:18:39.720 | with these technologies.
01:18:41.320 | Yeah?
01:18:43.800 | Yeah?
01:18:44.280 | [INAUDIBLE]
01:18:47.160 | So currently-- so a question is the text classification model
01:18:50.040 | we're using in Prodigy.
01:18:51.600 | More info-- more details on that.
01:18:53.160 | So what we're using is Spacey's text classification model.
01:18:56.360 | That's what's built in.
01:18:58.520 | But I think actually this question is pretty good,
01:19:00.520 | because what's important to note is that Prodigy itself
01:19:04.080 | comes with a few built-in recipes that are basically
01:19:07.560 | ideas for, OK, how you could train a text classifier.
01:19:10.120 | You could use Spacey.
01:19:11.320 | But it's definitely not tied to those.
01:19:13.520 | The idea-- the tool itself is really the scaffolding
01:19:15.680 | around it.
01:19:16.200 | So if you say, hey, I wrote my own model using PyTorch,
01:19:19.720 | and I would like to train this, all you need to do
01:19:22.360 | is you need to have one function that takes examples
01:19:25.000 | and updates your model.
01:19:26.160 | And you need to have one function that takes raw texts
01:19:29.480 | and outputs the score for each text.
01:19:32.360 | And then you provide that to Prodigy.
01:19:35.120 | And then you can use the same active learning mechanism
01:19:39.720 | as you would use with a built-in model.
01:19:42.160 | So the idea is really the models we ship
01:19:45.200 | are just a suggestion or an idea you can use to try it out.
01:19:48.840 | But ultimately, we also hope that people in the future
01:19:51.800 | will transition to just plugging in their own model
01:19:54.840 | and just using the scaffolding around it to do that.
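A sketch of what that wiring can look like, with a stand-in model object: one function scores incoming texts, one callback updates the model from answered examples, and a sorter keeps the most uncertain predictions at the front. DummyModel, the recipe name, and the label are placeholders; prefer_uncertain is one of Prodigy's documented sorters.

```python
# Sketch of plugging a custom model into a Prodigy recipe: one function scores
# raw texts, one callback updates the model from answered examples.
# DummyModel, the recipe name and the label are placeholders.
import prodigy
from prodigy.components.sorters import prefer_uncertain

class DummyModel:
    """Stand-in for your own model (PyTorch, scikit-learn, anything)."""
    def score(self, text):
        return 0.5
    def update(self, examples):
        pass

def scored_stream(texts, model):
    for text in texts:
        yield (model.score(text), {"text": text, "label": "RELEVANT"})

@prodigy.recipe("custom-textcat")
def custom_textcat(dataset, source):
    model = DummyModel()
    texts = (line.strip() for line in open(source, encoding="utf8"))
    return {
        "dataset": dataset,
        "view_id": "classification",
        # The sorter keeps the most uncertain predictions at the front.
        "stream": prefer_uncertain(scored_stream(texts, model)),
        # Called with batches of answered examples as the user annotates.
        "update": lambda answers: model.update(answers),
    }
```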
01:19:59.400 | But we definitely don't want to lock anyone in and say,
01:20:01.680 | oh, you have to use spaCy, especially for NER and stuff
01:20:05.120 | and other things.
01:20:06.240 | We think spaCy is pretty good.
01:20:08.080 | But if you don't want to do that for other use cases, especially
01:20:11.040 | text classification, we think that a lot of cases--
01:20:13.800 | well, you might want to use scikit-learn or Vowpal Wabbit.
01:20:18.080 | Yeah, or what a great name.
01:20:21.880 | Yeah, or basically something completely custom.
01:20:25.440 | Yeah.
01:20:25.920 | [INAUDIBLE]
01:20:30.640 | So the question is, active learning part,
01:20:32.600 | whether this is built on the underlying model--
01:20:34.640 | [INAUDIBLE]
01:20:35.440 | Oh, yeah.
01:20:35.920 | [INAUDIBLE]
01:20:42.960 | So the question is, active learning versus no active
01:20:45.720 | learning, how well this works.
01:20:47.080 | First also, maybe as a general introduction,
01:20:49.320 | so what we're doing for most of these samples
01:20:51.320 | is we use a basic uncertainty sampling.
01:20:53.680 | That's what we found works best.
01:20:55.600 | But we also know there are lots of other ways
01:20:57.520 | you could be solving that.
01:20:58.520 | So in the end, how we implement this
01:21:02.040 | is we have a simple function that takes a stream
01:21:04.280 | and outputs a sorted stream based on the assigned
01:21:08.120 | scores and the model in the loop.
01:21:10.440 | So how you wire this up, again, is also up to you.
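Conceptually, the sorting step is as simple as the toy version below: given (score, example) pairs, put the ones closest to 0.5 first. The built-in sorters work on generators so that millions of texts never have to be held in memory at once; this sketch materializes a list only to show the idea.

```python
# Toy illustration of uncertainty sampling: examples whose scores are closest
# to 0.5 come first. Real sorters operate on generators instead of lists.
def sort_by_uncertainty(scored_examples):
    pairs = list(scored_examples)                    # (score, example) tuples
    pairs.sort(key=lambda pair: abs(pair[0] - 0.5))  # most uncertain first
    return [example for _, example in pairs]

stream = [
    (0.95, {"text": "clearly positive"}),
    (0.52, {"text": "hard to say"}),
    (0.08, {"text": "clearly negative"}),
]
print(sort_by_uncertainty(stream))  # "hard to say" comes first
```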
01:21:13.320 | And yeah, to answer the part about what works best,
01:21:19.000 | in general, in our kind of framework, where really,
01:21:22.160 | you see one sentence at a time.
01:21:23.480 | And often, you start off with a model not knowing very much.
01:21:27.120 | The active learning component, basically,
01:21:29.200 | resorting the stream is actually very crucial.
01:21:31.360 | Because otherwise, if you start from scratch,
01:21:34.040 | have very few examples, you'll be
01:21:35.760 | annotating all kinds of random predictions
01:21:37.480 | for a very, very long time
01:21:41.640 | if you just annotate your stream in order.
01:21:44.080 | There's very little-- you need some kind of guidance
01:21:46.560 | that tells you, OK, what to work on next, especially
01:21:48.640 | if you feed in millions of texts.
01:21:50.520 | You need to sort them.
01:21:51.880 | You need to pre-select them based on something.
01:21:54.400 | And this could be the model's predictions.
01:21:56.280 | This could be something else.
01:21:57.160 | This could be the keywords or the patterns.
01:21:59.440 | But without that, yeah, it's very, very difficult.
01:22:04.480 | And that's kind of what we're trying to solve with a tool.
01:22:07.680 | Thank you so much, Innes and Matthew.
01:22:18.400 | I've got to say, anybody who's using fast AI,
01:22:23.760 | any time you've used fast AI NLP or fastAI.txt,
01:22:27.360 | you've called the spaCy tokenize function.
01:22:30.520 | You're using spaCy behind the scenes.
01:22:32.960 | And the reason you're using spaCy
01:22:34.560 | is because I tried every damn tokenizer I could find.
01:22:38.680 | And spaCy's was so much better than everything else.
01:22:42.120 | And then the kind of story of fast AI's development
01:22:44.560 | is that over time, I get sick of all the shitty parts
01:22:47.400 | of every third-party library I find.
01:22:49.040 | And I gradually rewrite them myself.
01:22:50.560 | And the fact that I haven't rewritten spaCy or attempted
01:22:52.920 | to is because I actually think it's
01:22:55.320 | one of those rare pieces of software
01:22:56.760 | that doesn't suck at all.
01:22:58.520 | It's actually really good.
01:23:01.320 | And it's got good documentation.
01:23:02.960 | And it's got a good install story and so forth.
01:23:06.400 | And I haven't used Prodigy, but just the fact
01:23:09.040 | that these guys are working on.
01:23:11.040 | I recognize the importance of active learning
01:23:13.120 | and the importance of combining human plus machine.
01:23:16.120 | What's in that rare category of people, in my opinion,
01:23:18.720 | are actually working on what's one of the most
01:23:20.400 | important problems today.
01:23:22.440 | So thank you both so much for coming
01:23:25.480 | and for this fantastic talk.
01:23:27.280 | And I look forward to seeing what you do next.
01:23:29.320 | Thank you.
01:23:29.800 | [APPLAUSE]