Increasing data science productivity; founders of spaCy & Prodigy
Chapters
0:00 Introduction
3:55 Syntactic Parsing
12:32 Syntax
14:35 The term Sense2Vec
16:19 Using Sense2Vec in spaCy
17:21 Parsing Algorithm
17:44 Transition Based Parsing
22:24 Splitting Tokens
24:44 Learning to Merge
26:56 User Experience
29:16 spaCy vs Stanford
30:54 End-to-end systems
32:48 Long-range dependencies
33:46 Language variation
35:09 Coreference resolution
36:49 How to use spaCy
37:42 Language generation
45:44 Binary decision
52:51 Recipes
55:26 Example
OK, so yeah, this is our first time in the Bay Area, 00:00:09.200 |
So I'll start by just giving a quick introduction of us 00:00:13.840 |
before I start with the main content of the talk, which 00:00:16.360 |
is about this open source library that we developed 00:00:22.440 |
So the other things that we develop as well at Explosion 00:00:25.400 |
AI: there's the machine learning library behind spaCy, Thinc, 00:00:31.000 |
which allows us to avoid depending on other libraries 00:00:33.520 |
and keep control of everything and make sure that everything 00:00:40.080 |
is developed alongside spaCy; and Prodigy, which is what 00:00:44.320 |
And we're also preparing a data store of other pre-trained 00:00:46.840 |
models for more specific languages and use cases 00:00:49.960 |
and things that people will be able to use that basically 00:00:59.640 |
So to give you a quick introduction to Ines and me, 00:01:05.640 |
so I've been working on natural language processing 00:01:09.520 |
I started doing this after doing a PhD in computer science. 00:01:13.960 |
I started off in linguistics and then kind of moved 00:01:18.880 |
And then around 2014, I saw that these technologies 00:01:26.000 |
where I was supposed to start writing grant proposals, which 00:01:32.000 |
was a gap in the capabilities available for something that 00:01:37.040 |
to something that was more practically focused. 00:01:39.920 |
And then soon after I moved to Berlin to do this, 00:01:44.000 |
And we've been working together since on these things. 00:01:46.360 |
And I think we kind of have a nice complementarity of things. 00:01:49.920 |
She is the lead developer of our annotation tool Prodigy 00:01:54.200 |
and has also been working on spaCy pretty much 00:01:59.360 |
So I included this slide, which we normally actually 00:02:02.320 |
give this when we talk to companies specifically. 00:02:04.480 |
But I think that it's a good thing to include 00:02:08.960 |
we tell people about what we do and how we make money 00:02:13.080 |
And I think that this is a very valid question 00:02:14.920 |
that people would have about an open source library. 00:02:17.000 |
It's like, well, why are you doing this and how 00:02:19.200 |
does it fit into the rest of your projects and plans? 00:02:22.640 |
So the Explain It Like I'm 5 version, which I guess 00:02:25.760 |
is also the Explain It Like I'm Senior Management version, 00:02:34.680 |
you can see, is kind of like the open source software. 00:02:43.160 |
happy to say we've been able to wind down over the last six 00:02:48.360 |
And then we also focus on a line of kitchen gadgets, which 00:02:56.960 |
And soon we'll have this sort of premium ingredients, which 00:03:09.720 |
that they'll fund open source software with a business model. 00:03:12.600 |
And we really don't like this because we want our software 00:03:16.040 |
to be as easy to use as possible and as transparent as possible 00:03:20.720 |
So I think it's kind of weird to have this thing where you have 00:03:24.360 |
explicitly a plan that we're going to make our free stuff 00:03:30.440 |
that we hope people pay us lots of money for, 00:03:36.520 |
It's kind of weird to have a company that you 00:03:39.140 |
hope that your paid offering is really poor value to people. 00:03:41.720 |
And so we don't think that that's a good way to do it. 00:03:44.440 |
And so instead, we have the downloadable tools, 00:03:49.800 |
we have something which works alongside spaCy 00:03:51.840 |
and I think is useful to people who use spaCy as well. 00:03:56.840 |
OK, so onto the sort of main content of the talk 00:04:04.600 |
So I'm going to talk to you about the syntactic parser 00:04:08.080 |
within spaCy, the natural language processing library 00:04:12.040 |
And so before I do it, so this is kind of what 00:04:16.040 |
it looks like, sort of visualized as an output. 00:04:22.240 |
that gives you the syntactic relationships between words. 00:04:29.240 |
is that the arrow pointing from this word to this word 00:04:33.640 |
means that Apple is a child of looking in the tree. 00:04:37.520 |
And it's a child with this relationship and such. 00:04:39.960 |
In other words, Apple is the subject of looking. 00:04:42.480 |
And is is an auxiliary verb attached to looking, 00:04:45.560 |
and then at is a prepositional phrase attached to looking. 00:04:51.480 |
about the syntactic structure of the sentence 00:04:53.200 |
and basically help you get at the who did what to whom 00:04:59.760 |
So for instance, here, to make the thing more easy to read, 00:05:02.840 |
we've merged UK startup, which is a sort of basic noun phrase 00:05:10.520 |
more easily from given the syntactic structure. 00:05:18.880 |
the syntactic structure or navigate the tree. 00:05:21.600 |
In spaCy, you just get this nlp object after loading the models, 00:05:27.720 |
that you feed text, or pipe text, through if you've 00:05:32.240 |
And given that, you get a document object, which you can 00:05:35.880 |
just use as an iterable. And from the tokens, 00:05:38.840 |
you get attributes that you can use to navigate the tree. 00:05:42.040 |
So for instance, here, the dependency relationship 00:05:47.000 |
By default, that's an integer key, integer ID, 00:05:50.040 |
because everything's kind of coded to an integer 00:05:55.320 |
But then you can get the text value with an underscore 00:05:58.520 |
And then you can navigate up the tree with dot head. 00:06:01.000 |
And then you can look at the left and right children 00:06:04.280 |
So we try to have a rich API that makes it easy 00:06:09.640 |
So just getting dependency parses is, obviously, 00:06:13.920 |
just the first step; you want to actually use it in some way. 00:06:16.400 |
And that's why we have this API to make that easy. 00:06:20.600 |
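
As a minimal sketch of the API just described (the model name is an assumption; any pre-trained English model would do), navigating the parse looks something like this:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed: a pre-trained English model
    doc = nlp("Apple is looking at buying a UK startup.")

    for token in doc:
        # token.dep is the integer ID; token.dep_ is the text label
        print(token.text, token.dep_, token.head.text,
              [child.text for child in token.lefts],
              [child.text for child in token.rights])
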
So the question that always comes up with this, 00:06:25.600 |
for the field in general, is what's the point of parsing? 00:06:28.760 |
What is this actually good for in terms of applications? 00:06:31.920 |
So Yoav Goldberg is a very prominent parsing researcher. 00:06:38.640 |
And he's one of the more well-known parsing people. 00:06:42.000 |
And so it's interesting to see him and other people reflect 00:06:46.800 |
that even though we have so many best papers in NLP, 00:06:50.040 |
so it's kind of a high prestige thing to study parsing. 00:06:56.080 |
used in practice in most of these applications. 00:07:03.120 |
Because parsing is based on trees and structured predictions, 00:07:06.720 |
And there's all these deep algorithmic questions. 00:07:08.720 |
Is it just kind of this catnip to researchers? 00:07:11.800 |
And does it have this kind of over-prominence in the field? 00:07:17.360 |
Or is it that there is something deeper about this 00:07:29.840 |
And so this slide shows you the case for parsing. 00:07:32.680 |
And then I'll kind of have a counterpoint in a second. 00:07:35.440 |
So I think that the most important case for parsing 00:07:38.920 |
is that there's a sort of deep truth to the fact 00:07:45.400 |
The syntactic structure of sentences is recursive. 00:07:49.400 |
And that means that you can have arbitrarily long gaps 00:07:54.480 |
So for instance, if you have a relationship between, 00:08:03.840 |
whether the subject of that verb is plural or singular 00:08:11.080 |
And that dependency between them can be arbitrarily long 00:08:16.840 |
But it can't be arbitrarily long in tree space 00:08:28.000 |
So you can see how, for some of these things, 00:08:30.280 |
it should be more efficient to think about it 00:08:43.040 |
So we can say, OK, in theory, this should be important. 00:08:47.920 |
study based on this knowledge about how sentences 00:08:53.440 |
So then the counterpoint to this is, all right, 00:08:59.600 |
But it's also true that they're written and read in order. 00:09:02.480 |
So if you read a sentence, you do read it from left to right, 00:09:05.920 |
or in English anyway, basically from start to finish, 00:09:10.840 |
And this really puts a sort of linear bound on the complexity 00:09:19.200 |
yes, they could have an arbitrarily long dependency. 00:09:25.440 |
will have to wait arbitrarily long between some word 00:09:32.160 |
So empirically, it's not very surprising to see 00:09:38.440 |
And there's a lot of arguments that the options that 00:09:42.240 |
are kind of provided to grammars are sort of arranged 00:09:45.160 |
that you're able to keep your dependencies short. 00:09:49.760 |
you have options for how to move things around in sentences 00:09:56.160 |
So this means that if most dependencies are short, 00:09:58.880 |
then processing text as, say, chunks of words of one or two 00:10:02.720 |
at a time kind of gives you a pretty similar view. 00:10:06.040 |
Most of the time, you don't get something that's 00:10:08.240 |
so dramatically different if you look at a tree 00:10:11.160 |
instead of looking at chunks of three or four words at a time. 00:10:16.240 |
says maybe even though the sentences are, in fact, 00:10:20.280 |
restructured, maybe it's not that crucially useful. 00:10:23.760 |
So I think that the part that makes this particularly 00:10:27.080 |
rewarding to look at syntax or particularly useful 00:10:30.320 |
to provide syntactic structures in a library like spaCy 00:10:38.320 |
doesn't depend on what you hope to do with the sentence 00:10:42.120 |
And that's something that's quite different from other labels 00:10:44.520 |
or other information that we can attach to the sentence. 00:10:47.400 |
If you're doing something like a sentiment analysis, 00:10:49.680 |
there's no truth about the sentiment of a sentence 00:10:53.080 |
that's independent of what you're hoping to process. 00:10:55.760 |
That's not a thing that's in the text itself. 00:10:58.200 |
It's a lens that you want to take on it based 00:11:05.800 |
to be positive or negative depends on your application. 00:11:23.040 |
depend on what you're hoping to process with. 00:11:28.160 |
But details about the syntactic structure are in the language. 00:11:36.000 |
And that means that we can provide these things, 00:11:40.520 |
And I think that that's very valuable and useful 00:11:42.640 |
and different from other types of annotations 00:11:46.480 |
And that's why spaCy provides pre-trained models for syntax, 00:11:49.360 |
but doesn't provide pre-trained models for something 00:11:52.080 |
Because we know how to give you a syntactic analysis that's 00:11:59.480 |
depending on whether that actually solves your problems. 00:12:02.200 |
But at least it's sort of true and generalizable, 00:12:04.640 |
whereas we don't know what categorization scheme you 00:12:11.560 |
that does that, because that's your own problem. 00:12:14.920 |
So we try to basically give you these things, which 00:12:17.560 |
are annotation layers, which do generalize in this way. 00:12:25.080 |
like the semantic roles, or sentence structure, 00:12:27.200 |
or sentence divisions are things that we can do. 00:12:33.680 |
So the other thing about syntactic structures 00:12:36.720 |
and whether they're useful or not is that in English, 00:12:42.400 |
because English orthography happens to cut things up 00:12:48.360 |
They're not optimal units, but they're still pretty nice 00:12:57.520 |
Japanese, which usually isn't segmented into words. 00:13:01.320 |
You can't just cut that up trivially with white space 00:13:04.640 |
and get something that you can feed into a search engine, 00:13:15.280 |
You can use a technology that only makes linear decisions, 00:13:18.920 |
but the truth about what counts as a word or not 00:13:21.880 |
is very entangled with the syntactic structure. 00:13:23.880 |
And so there's real value in doing it jointly 00:13:27.880 |
For other languages, you have kind of the opposite problem. 00:13:39.240 |
Now, whether or not you want that to be sort of one unit 00:13:45.200 |
For many applications, actually, the English phrase 00:13:51.480 |
want to be looking for and having a single node 00:14:02.640 |
And so in those cases, the German word will be too large 00:14:07.480 |
So there's sort of different aspects to this. 00:14:12.000 |
In the bottom left here, we have an example of Hebrew. 00:14:16.160 |
And like Arabic and a couple of other languages like this, 00:14:23.560 |
And the words tend to have all sorts of attachments 00:14:29.680 |
So there, again, you have difficult segmentation problems 00:14:32.400 |
that are all tangled up with the syntactic processing. 00:14:35.920 |
OK, so going forward to sort of an example of what we can do 00:14:46.320 |
and feed them into some of the other processing stuff 00:14:51.720 |
So this is a demo that we prepared a couple of years 00:14:55.080 |
ago for an approach that we call Sense2Vec. 00:15:03.440 |
using natural language processing tools, in this case 00:15:05.720 |
specifically spaCy, in order to recognize these concepts that 00:15:11.040 |
So specifically here, we looked for base noun phrases and also 00:15:17.640 |
before feeding the text forward into a word2vec 00:15:21.720 |
implementation, which gives you these semantic relationships. 00:15:25.360 |
And this lets you search for and find similarities 00:15:29.600 |
between phrases which are much longer than one word. 00:15:45.120 |
Instead, you can find things related to natural language 00:15:49.320 |
And then you see, ah, machine learning, computer vision, 00:15:52.080 |
These are real results that came out of the thing 00:15:58.520 |
And so we can do this for other languages as well. 00:16:02.360 |
So if we were hoping to use word2vec for a language 00:16:06.800 |
like Chinese, you really want to be processing it into words 00:16:11.280 |
Or if you're going to do this for a language like Finnish, 00:16:14.280 |
you really want to cut off the morphological suffixes 00:16:26.120 |
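
As a rough sketch of that preprocessing idea, using the retokenization API of recent spaCy versions: merge base noun phrases into single tokens before handing the text to a word2vec implementation.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model
    doc = nlp("Natural language processing is closely related to machine learning.")

    # Merge each base noun phrase into a single token
    with doc.retokenize() as retokenizer:
        for chunk in list(doc.noun_chunks):
            retokenizer.merge(chunk)

    # These units can now be fed to any word2vec implementation
    print([token.text.replace(" ", "_") for token in doc])
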
So you can actually use this as a handy component 00:16:35.240 |
add a component that gives you these Sense2Vec senses. 00:16:38.480 |
So you can just say, all right, the token for three 00:16:46.000 |
And then you can also look up the similarity. 00:16:47.880 |
So it's now much easier to actually use the pre-trained 00:16:56.520 |
Incidentally, we have this concept of an extension 00:16:59.200 |
attribute in Spacey so that you can kind of attach your own 00:17:02.720 |
things to the tokens so that you can basically 00:17:05.880 |
attach your own little markups or processing things. 00:17:09.160 |
So this underscore object is kind of a free space 00:17:13.600 |
that you can attach attributes to, which ends up 00:17:17.640 |
It's a lot more convenient than trying to subclass something 00:17:26.720 |
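
A minimal sketch of that extension-attribute API (the attribute name here is made up for illustration; plugins like the sense2vec component hang their attributes off the same `._.` namespace):

    import spacy
    from spacy.tokens import Token

    nlp = spacy.load("en_core_web_sm")  # assumed model

    # Register a custom attribute in the free `._.` namespace,
    # where it can't collide with spaCy's built-in attributes
    Token.set_extension("is_fruit", default=False)

    doc = nlp("I like apples")
    doc[2]._.is_fruit = True
    print(doc[2]._.is_fruit)
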
give you a little bit of a pretty brief overview 00:17:33.120 |
of how we're modifying the parsing algorithm 00:17:38.840 |
so that we can basically broaden out the support of spaCy 00:17:53.640 |
or the decision points that the parser is going 00:18:01.440 |
I think to keep in mind-- or the key aspect of the solution 00:18:06.720 |
is that it's going to read the sentence from left to right 00:18:10.920 |
And then it's going to have a fixed inventory of actions 00:18:14.640 |
that it has to choose between to manipulate the current parse 00:18:22.160 |
transition-based parsing, I find deeply satisfying 00:18:34.440 |
to take algorithms which process language incrementally. 00:18:38.200 |
I think that that's deeply satisfying and correct 00:18:41.640 |
in a way that a lot of other approaches to parsing aren't. 00:18:46.240 |
So we can do joint modeling and have it output 00:18:49.320 |
all sorts of other structures as well as the parse tree. 00:18:54.140 |
So already in Spacey, we have joint prediction 00:18:57.680 |
of the sentence boundaries in the parse tree. 00:19:01.480 |
this to this joint prediction of word boundaries as well. 00:19:04.840 |
OK, so here's how the decision process of building the tree 00:19:13.000 |
And so for ease of notation or ease of readability, 00:19:26.000 |
And then the other element of the state is a stack. 00:19:33.920 |
we have an action that can advance the buffer one 00:19:36.760 |
and put the word that was previously at the start 00:19:40.440 |
So here's what that shift move is going to look like. 00:19:43.520 |
So here we have Google on the stack, which we write up here. 00:19:53.680 |
is to form a dependency arc between the word that's 00:19:56.600 |
on top of the stack and the first word of the buffer. 00:20:11.080 |
know that we can pop it from the stack because it's a tree. 00:20:23.520 |
And so that means that we can do that and keep moving forward. 00:20:36.000 |
So we should put reader on the stack so that we can continue. 00:20:41.120 |
And now we want to decide whether we should make an arc 00:20:45.840 |
In this case, no, we want to attach was to canceled. 00:20:52.600 |
So then here we do want this arc between canceled and was. 00:21:01.720 |
So we're sort of stepping back a bit and thinking about this. 00:21:08.520 |
And so long as we can predict the right sequence 00:21:11.880 |
of those actions, we can derive the correct parse. 00:21:18.280 |
to be a classifier that predicts, given some state, 00:21:22.840 |
And you can sort of imagine that we can have other actions 00:21:26.520 |
instead if we wanted to predict other aspects of the structure. 00:21:33.840 |
So it just says, all right, given the words currently 00:21:40.440 |
But you're not allowed to push the next token 00:21:51.240 |
There's been work to jointly predict part of speech tags 00:21:56.240 |
Or you can do semantics at the same time as you do syntax. 00:21:59.920 |
And so you can code up all sorts of structures into this. 00:22:02.640 |
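
To make the action inventory concrete, here is a highly simplified sketch of that loop; the action names and the stand-in classifier are illustrative, not spaCy's actual implementation:

    def parse(words, predict_action):
        # predict_action stands in for a trained classifier over parser states
        stack, buffer, arcs = [], list(range(len(words))), []
        while buffer or len(stack) > 1:
            action = predict_action(stack, buffer)
            if action == "SHIFT" and buffer:
                # advance the buffer, pushing its first word onto the stack
                stack.append(buffer.pop(0))
            elif action == "LEFT-ARC" and stack and buffer:
                # top of stack becomes a child of the first buffer word; it can
                # be popped, because as a tree node it now has its head
                arcs.append((buffer[0], stack.pop()))
            elif action == "RIGHT-ARC" and len(stack) >= 2:
                # top of stack attaches to the word beneath it
                arcs.append((stack[-2], stack.pop()))
            else:
                break
        return arcs  # (head, child) index pairs

    # e.g. a trivial policy: shift while you can, then reduce to the left
    print(parse(["Google", "Reader", "was", "canceled"],
                lambda stack, buffer: "SHIFT" if buffer else "RIGHT-ARC"))
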
And you're going to read the sentence left to right. 00:22:05.000 |
And you're going to output some meaning structure attached 00:22:20.000 |
OK, so that's what this looks like as we proceed through. 00:22:27.080 |
to do this splitting up or merging of other things? 00:22:35.080 |
So already, you can kind of see that in order 00:22:42.280 |
And if we wanted Google Reader to be one token, 00:22:45.080 |
we just have to have some special dependency label, which 00:22:51.840 |
And then all we have to do is say, all right, 00:22:58.560 |
So the step from going through something like this 00:23:03.040 |
and labeling a language like Chinese is actually super 00:23:08.240 |
so that the tokens are individual characters. 00:23:14.180 |
which should be one word should have this sort of structure 00:23:18.960 |
And then if the parser decides that those things are attached 00:23:22.920 |
together, then at the end of it, you just merge them up. 00:23:29.140 |
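
The merge at the end can be pictured with the retokenizer in recent spaCy versions (here the span is chosen by hand rather than predicted by the parser):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model
    doc = nlp("Google Reader was canceled.")

    # Pretend the parser joined "Google" and "Reader" with a special
    # subtoken label; at the end, that span is merged into one token
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])

    print([token.text for token in doc])  # ['Google Reader', 'was', 'canceled', '.']
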
because you have to have some universal actions that 00:23:33.080 |
So I'm still sort of working on the implementation of this 00:23:42.200 |
as well, because if you have English text that's 00:23:48.200 |
things which should be two tokens get merged into one. 00:23:50.720 |
So "it's" is a particularly common and frustrating one of these, 00:23:55.240 |
because the verb "is" should be its own token. 00:24:04.840 |
to figure out that you have to have two parser actions, two 00:24:09.240 |
And in general, you could have a statistical model 00:24:19.200 |
and doing jobs of figuring out the syntactic structure 00:24:21.880 |
of the sentence in order to make those decisions. 00:24:24.000 |
And that's why I think doing these things jointly 00:24:28.320 |
of learning that information in one level of representation 00:24:33.840 |
the same information in the next pass of the pipeline, 00:24:38.720 |
And so I think that the joint incremental approaches, 00:24:46.880 |
So I've implemented the learning to merge side 00:24:55.960 |
better alignments between the gold standard tokenization 00:25:01.440 |
And that's allowed me to complete the experiments 00:25:07.520 |
of the Conference on Computational Natural Language Learning (CoNLL) 2017 00:25:11.120 |
benchmark, which was a sort of bake-off of these parsing 00:25:16.560 |
Now, in that benchmark, the team from Stanford 00:25:20.200 |
did extremely well, compared to everybody else in the field. 00:25:24.400 |
They were some two or three percentage points better. 00:25:30.800 |
kind of at the top of what was the second place pack. 00:25:35.280 |
underneath the Stanford system, but with significantly better 00:25:40.000 |
efficiency and with sort of this end-to-end process. 00:25:42.960 |
And in particular, we're doing better than Stanford 00:25:45.760 |
on these languages like Chinese, Vietnamese, and Japanese, 00:25:48.120 |
because the Stanford system did have this disadvantage of using 00:25:53.880 |
They wanted to just use the provided preprocessed texts 00:25:59.480 |
so that they could focus on the parsing algorithm. 00:26:01.680 |
And that meant that they did have this error propagation 00:26:04.440 |
If the inputs are incorrect because the preprocessed 00:26:13.680 |
So, satisfyingly, the sort of do-it-all-at-once approach, 00:26:27.680 |
this joint modeling approach of deciding the segmentation 00:26:31.600 |
at the same time as deciding the parse structure, is consistently 00:26:35.040 |
better than the pipeline approach in our experiments. 00:26:37.520 |
So basically, we're getting a sort of 1% to 3% improvement 00:26:44.840 |
as we're getting from using the neural network model instead 00:27:11.440 |
OK, well, this is the last slide, so wrapping up. 00:27:18.840 |
to deliver a sort of workflow or user experience 00:27:21.680 |
where it's very easy to start with the pre-trained models 00:27:24.360 |
for different languages and broad application areas. 00:27:29.640 |
have the same representation across languages. 00:27:31.680 |
So you get the same parse scheme, for which the folks 00:27:39.360 |
have a pretty satisfying solution from Universal Dependencies. 00:27:42.440 |
And so if you're processing text from different languages, 00:27:44.720 |
it should be easy to find, say, subject-verb relationships 00:27:48.920 |
And that should work across basically any language 00:28:01.840 |
able to do pretty powerful rule-based matching 00:28:03.840 |
from the parse tree and other annotations to provide it. 00:28:07.480 |
So it should be pretty easy to find information, 00:28:11.280 |
even without knowing much about the language and reuse 00:28:15.600 |
And then if the syntactic model and the entity models 00:28:23.800 |
the library should support easy updating of those, 00:28:28.600 |
without it taking particular effort from you. 00:28:34.520 |
a workflow of rapid iteration and data annotation. 00:28:42.080 |
give a broad-based understanding of language, 00:28:45.800 |
but that still ends up with a need for the knowledge 00:28:51.920 |
guide and evaluation data specific to your problems. 00:28:55.200 |
And we want to make sure that it's easy to connect the two up 00:29:02.120 |
and move forward to building the specific applications, which 00:29:05.680 |
is-- now Ines will be talking about that aspect 00:29:34.080 |
So Dilip asked what the sort of overall difference 00:29:37.200 |
or main, most important difference between spaCy's 00:29:39.400 |
parsing algorithm and Stanford's parsing algorithm is. 00:29:43.680 |
The sort of most fundamental difference is that Stanford's system 00:29:50.520 |
So this is O(n²), or maybe O(n³), 00:29:57.120 |
So you're unable to use this type of parsing algorithm 00:30:05.880 |
is why it has this disadvantage on languages, or text which 00:30:15.560 |
that we basically only use linear time algorithms. 00:30:19.720 |
And that's why we only take this transition-based approach. 00:30:25.640 |
Other reasons sort of why they get such a good result. 00:30:34.080 |
So I hope to meet the Stanford team in the next couple of days 00:30:38.160 |
and shake out the details of why this system is so accurate. 00:30:46.180 |
and I can't get the sort of one key insight that 00:31:12.800 |
is to what extent can end-to-end systems, which maybe 00:31:17.320 |
learn things about syntax, but learn them latently 00:31:19.680 |
and don't have an explicit syntactic representation 00:31:27.320 |
So I would say that for any application where 00:31:30.680 |
there's sufficient text, currently the best approach 00:31:34.440 |
or the state-of-the-art approach doesn't use a parser. 00:31:37.480 |
And actually, this includes translation and other things 00:31:44.200 |
If there's enough text, it seems that going straight 00:31:46.320 |
to the end-to-end representation tends to be better. 00:31:49.320 |
However, that does involve having a lot of text. 00:31:51.680 |
And for most applications, creating that much training 00:31:55.200 |
data, especially initially when you're prototyping, 00:32:00.400 |
So the way that I see it is that the parsing stuff 00:32:05.960 |
And it's a very practical thing to have in your toolbox, 00:32:12.000 |
So because otherwise, you end up in this chicken and egg 00:32:18.680 |
And otherwise, it just doesn't really get off the ground. 00:32:22.800 |
collecting the right data for the right model 00:32:30.120 |
using these sort of rule-based scaffolding and bootstrapping 00:32:32.920 |
approaches, I think you have a much more powerful and practical 00:32:39.480 |
that you know you want to eke out every percent, 00:32:43.620 |
that you don't need a parser in your solution explicitly. 00:32:49.360 |
So Dilip has pointed to a paper that recently showed 00:32:58.800 |
that, you know, BiLSTM models don't necessarily 00:33:06.560 |
But as somebody who's worked on parsing for a lot of my career, 00:33:11.800 |
I try to remind myself not to cherry pick results. 00:33:14.680 |
And even if I do find a paper that shows that parsing 00:33:19.720 |
is that BiLSTM models which don't use parsing work well. 00:33:24.400 |
And the fact is that long-range dependencies are kind of rare. 00:33:34.920 |
to be asking, well, what are these things good for, 00:33:38.200 |
and not say, oh, everything should be using parsing. 00:33:41.060 |
Because it's true that not everything should. 00:33:45.240 |
So the question is, if we look at other aspects of language 00:33:57.280 |
variation instead of just, say, the segmentation and things, 00:34:14.340 |
had excellent analysis about a lot of these questions. 00:34:19.980 |
is much less sensitive to whether the trees are 00:34:22.220 |
projective; they do do relatively well in those languages. 00:34:33.840 |
do fine on German and pretty well on Russian. 00:34:52.400 |
The way that I'm doing this is a little bit crude at the moment. 00:35:00.520 |
that we take from the incremental approach in this. 00:35:15.640 |
for coreference resolution that has taken some of the pressure 00:35:21.720 |
We do think that coreference resolution is something 00:35:23.920 |
that does belong in the library, because it's 00:35:25.720 |
something that does have that property of being 00:35:29.200 |
I think that there's a truth about whether that he or she 00:35:31.660 |
belongs to that noun that doesn't depend on the application. 00:35:40.000 |
I wouldn't quite say the same thing about the sentiment. 00:35:45.720 |
I haven't been convinced by any schema of sentiment 00:35:48.320 |
that is sufficiently independent of what you're trying to do 00:35:52.680 |
Instead, what we do provide you is a text categorization 00:35:55.640 |
And the text categorization model that we have 00:36:07.680 |
And I think that on many sentiment benchmarks, 00:36:17.800 |
So it depends on what type of text you're trying to process 00:36:25.640 |
So explicitly, the coreference resolution package 00:36:42.120 |
Yeah, well, PyTorch is the machine learning layer. 00:37:00.520 |
I've been using the word vectors trained by fastText. 00:37:14.000 |
We're trying to provide pre-trained models which 00:37:21.520 |
of the model's been trained to expect some word vectors. 00:37:26.760 |
going to get different input representations. 00:37:30.240 |
But yeah, training or bringing your own vectors 00:37:35.920 |
And if it's not, I apologize if there's bugs, 00:37:40.140 |
So the question is, after parsing and interpreting, 00:37:50.240 |
do we have an interlingual representation that 00:37:52.520 |
can then be used to generate another language? 00:37:56.760 |
I mean, we don't have generation capabilities in spaCy. 00:38:02.840 |
But in general, having an explicit interlingual 00:38:07.160 |
tends to perform less well than more brute force 00:38:12.080 |
And I think the reason does sort of make sense 00:38:14.880 |
that the languages are pretty different in the way 00:38:24.680 |
idiomatic out of that sort of interlingual representation 00:38:33.320 |
harder than the direct translation approach, which 00:38:37.280 |
I'm not sure whether I buy that argument or not. 00:38:40.840 |
OK, so should we move forward to the next talk? 00:38:52.440 |
So yeah, we started out by hearing a lot about the more 00:39:06.040 |
And I'm actually going to talk about how we collect and build 00:39:10.360 |
training data for all these great models we can now build. 00:39:20.680 |
But the problem is, of course, we need those examples. 00:39:23.560 |
And even if you're like, oh, I got this all figured out. 00:39:26.080 |
Are you using this amazing unsupervised method 00:39:28.520 |
where my system just infers the categories from the data 00:39:38.160 |
So we pretty much always need some form of annotations. 00:39:42.800 |
And now the question is, well, why do we even care about this? 00:39:47.280 |
Why do we care about whether this is efficient, 00:39:53.120 |
The thing is, the big problem is that we actually, 00:39:56.920 |
with many things in data science and machine learning, 00:39:59.600 |
we need to try out things before we know whether they work. 00:40:02.720 |
Or we often don't know whether an idea is going 00:40:05.840 |
So we need to expect to do annotation lots of times 00:40:12.080 |
Start all over again if we fucked up our label scheme. 00:40:16.720 |
We need to do this lots of times, so it needs to work. 00:40:23.040 |
working in a company in a team where you really 00:40:26.680 |
want to use your model to find something out, 00:40:33.760 |
And also, we always say good annotation teams are small. 00:40:42.120 |
oh, let's crowdsource this, get hundreds of volunteers, 00:40:44.640 |
and we always have to remind people, especially companies, 00:40:51.920 |
The good ones were produced by very few people, 00:40:57.000 |
More people doesn't always mean better results, actually 00:41:00.240 |
So how great would it be if actually the developer 00:41:03.920 |
of the model could be involved in labeling the data? 00:41:08.600 |
And of course, we also have the problem of the specialist 00:41:11.680 |
knowledge, especially in industries where this matters. 00:41:16.920 |
You might want to have a medical professional give some feedback 00:41:20.680 |
on the labels, or actually really label your data, 00:41:26.000 |
And yeah, those people usually have limited time. 00:41:33.800 |
or actually find the one person who has nothing else to do, 00:41:43.560 |
And yeah, another big problem, since you want humans, 00:42:02.580 |
especially things that require multiple steps and multiple 00:42:08.640 |
We're bad at consistency and getting stuff right. 00:42:12.640 |
So fortunately, computers are really good at that stuff. 00:42:16.880 |
And in fact, it's probably also the main reason 00:42:20.840 |
So there's really no need to waste the human's time 00:42:33.760 |
Or in general, we want to automate as much as possible, 00:42:37.920 |
that the human is good at, and we really need that input. 00:42:43.920 |
like we can look at a sentence, and most of us 00:42:46.080 |
will be able to understand a figure of speech 00:42:50.120 |
That's the stuff that's really, really hard for a computer. 00:42:53.320 |
Also, put differently, humans are good at precision. 00:43:01.880 |
it sounds a bit like "floss and eat your veggies." 00:43:06.040 |
Yeah, we probably will have had some experience 00:43:13.080 |
to a crowd of more data science focused industry professionals. 00:43:19.320 |
And actually, you'd be surprised how many companies we talked 00:43:28.760 |
that mostly use Excel spreadsheets for everything. 00:43:35.020 |
are very obvious problems with Excel spreadsheets. 00:43:37.800 |
And there's definitely a lot of room for improvement. 00:43:44.800 |
or it's just terrible, like we don't want to do this, 00:43:47.120 |
the next move is normally, let's move this all out 00:43:49.480 |
to Mechanical Turk or some other crowd-sourced platform. 00:43:53.440 |
And yeah, Mechanical Turk, the Amazon cloud of human labor. 00:44:01.960 |
And then I was also surprised that their results are not 00:44:10.560 |
great. You get the data back, train your model, it doesn't work. 00:44:13.520 |
And actually, it's very difficult to then retroactively 00:44:24.280 |
Maybe you didn't write your annotation manual properly. 00:44:32.680 |
And if you pay too much on Mechanical Turk, you attract all the bad actors. 00:44:35.600 |
So you kind of have to stick to the half minimum wage. 00:44:46.640 |
And also, you realize that, well, it's not really just 00:44:53.000 |
So then, yeah, what most people conclude from this 00:45:05.080 |
And that's actually-- yeah, also, the conversation 00:45:09.120 |
I had recently where we talked to a larger media company, 00:45:23.840 |
And now, they're kind of back in the beginning. 00:45:30.560 |
that we need labeled data, that's an opportunity. 00:45:30.560 |
And yeah, so we've been thinking about this a lot. 00:45:43.480 |
there are a lot of things we could do better. 00:45:46.320 |
So one of the things, really, to work against this problem 00:45:53.400 |
is that we need to break down these very complex things we're 00:45:58.240 |
asking the humans into smaller, simpler questions. 00:46:02.120 |
And ideally, these should be binary decisions. 00:46:04.840 |
So we can have a much better annotation speed 00:46:07.240 |
because we can move through the things faster. 00:46:09.240 |
And we can also measure the reliability much easier 00:46:14.960 |
Because we can actually say, OK, do our annotators agree? 00:46:20.720 |
to find out whether we've collected data the right way. 00:46:23.760 |
And the binary thing itself, it sounds a bit radical. 00:46:28.560 |
But actually, if you think about it, most, or pretty much 00:46:31.600 |
any task, can be broken down into a sequence of binary 00:46:38.400 |
It might mean that we have to accept that, OK, 00:46:43.120 |
we won't actually end up with a gold standard data 00:46:48.040 |
We might actually end up with only partially annotated data. 00:46:52.880 |
But still, we're actually able to use our human's time 00:46:56.360 |
more efficiently, which is often much more important. 00:47:00.120 |
So a lot of examples I'm going to show you now 00:47:05.200 |
from using our annotation tool Prodigy, which, yeah, 00:47:12.880 |
that, OK, this is really something pretty much every 00:47:18.080 |
this was always something that kept coming up. 00:47:20.440 |
So we thought, OK, what if we really combine all these ideas 00:47:27.080 |
actually use the technology we're working with within the tool, 00:47:31.200 |
and also use the insights we have from user experience, 00:47:36.240 |
and how to get humans to do stuff most efficiently, 00:47:45.160 |
how to get humans to really stick to doing something, 00:47:50.920 |
and put this all into one tool, and that's Prodigy. 00:47:54.520 |
And so here, we see some examples of those tasks, 00:48:00.760 |
and how we can present things in a more binary way. 00:48:11.440 |
labeling whether something is a product or not. 00:48:14.000 |
And what we did here is we load in a spaCy model, 00:48:23.520 |
Or we can also use a mode where we can then actually 00:48:27.240 |
click on this, remove this, label something else. 00:48:31.360 |
But still, you see, OK, we don't have to do this 00:48:35.000 |
We actually get one question, we look at this, 00:48:37.120 |
and pretty much immediately, we can say yes or no. 00:48:42.400 |
The same here, on the right, they were using-- 00:48:44.800 |
I think this is actually a real example using the YOLOv2 model 00:48:53.640 |
We could say, is this a skateboard, yes or no? 00:48:57.000 |
And yeah, immediately, have our annotations here. 00:49:07.640 |
make it more efficient and easier for a human to answer. 00:49:14.680 |
you can still do maybe two, three seconds per annotation 00:49:23.000 |
if we can get to one second, we might as well 00:49:26.400 |
label our entire corpus twice, positive, negative, 00:49:30.840 |
other labels we want to do, and just move through it quicker. 00:49:36.360 |
And yeah, to give you some background on why did we do 00:49:40.840 |
this, what do we think Prodigy should achieve, 00:49:48.560 |
to be able to make annotation so efficient that data scientists 00:49:55.000 |
can also be researchers and people working with the data, 00:50:01.440 |
Yeah, reading it like that, it still doesn't sound like fun. 00:50:03.880 |
But the idea is, we could really make a process that's 00:50:07.880 |
efficient that you actually really want to do this 00:50:10.040 |
because you don't have to depend on anyone else. 00:50:20.040 |
We're very used to, OK, you iterate on your code, 00:50:22.040 |
but you can actually iterate on your code and your data. 00:50:24.240 |
You try something out, doesn't work, try something else. 00:50:33.400 |
And we also want to waste as little time as possible 00:50:41.280 |
and have the human correct its predictions instead of just 00:50:49.040 |
want Prodigy to fit into the Python ecosystem. 00:50:52.600 |
We want it to be customizable, extensible in Python. 00:50:57.800 |
And we also-- it was a very conscious decision 00:51:00.960 |
not to make it a SaaS tool, because we think data privacy 00:51:05.520 |
You shouldn't have to send your text to our servers 00:51:09.040 |
And we also think you shouldn't be locked in. 00:51:13.080 |
out that you can use to train your models however you like, 00:51:27.680 |
of how the app looks. At the center are recipes, 00:51:32.440 |
Python scripts that orchestrate the whole thing. 00:51:35.120 |
You have a REST API that communicates with the web app 00:51:38.360 |
naturally so you can see things on the screen. 00:51:42.800 |
You have your data that's coming in, which is text or images. 00:51:52.200 |
And then the model then communicates with a recipe. 00:51:58.440 |
You can, as the user annotates, it's updated in a loop 00:52:02.640 |
and can suggest more annotations that are more compatible 00:52:10.200 |
And yeah, there's a database and a command line interface 00:52:23.560 |
of a recipe function, which really is just a Python function. 00:52:28.320 |
You load your data in and then you return this dictionary 00:52:31.120 |
of components, for example, an ID of the data set, 00:52:34.680 |
how to store your data, a stream of examples. 00:52:36.880 |
You can pass in callbacks to update your model, 00:52:43.080 |
So the idea is really, OK, if you need to load something in, 00:52:46.760 |
if you can write that in Python, you can do it in Prodigy. 00:52:50.640 |
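
A minimal sketch of that recipe shape; the JSONL loader and the classification interface are just one plausible configuration:

    import prodigy
    from prodigy.components.loaders import JSONL

    @prodigy.recipe("my-recipe")
    def my_recipe(dataset, source):
        stream = JSONL(source)  # load your data in

        def update(answers):
            pass  # e.g. update a model with the newly answered examples

        return {
            "dataset": dataset,           # where annotations are stored
            "stream": stream,             # examples to show in the web app
            "update": update,             # callback invoked with new answers
            "view_id": "classification",  # which annotation interface to use
        }
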
And you can also-- we provide a bunch of pre-built-in recipes 00:53:01.400 |
we think it could work, like named entity recognition. 00:53:07.880 |
You can use the model, say yes or no, to things. 00:53:11.000 |
You can use it for dependency parsing and look at an arc 00:53:20.120 |
to build terminology lists, text classification. 00:53:22.880 |
So there's also a lot that you can mix and match creatively. 00:53:26.520 |
For example, you have the multiple choice example 00:53:30.640 |
that's not really tied to any machine learning task, 00:53:34.280 |
but it fits pretty much into any of these workflows 00:53:42.360 |
and is often neglected, especially in more industry use cases. 00:53:49.320 |
But we think there's actually-- A/B evaluation is actually 00:53:51.520 |
a very powerful way of testing whether your output is really 00:54:07.960 |
all using models, word vectors, things you already 00:54:10.600 |
have in order to get where you want to get to faster. 00:54:14.400 |
So here, a simple example, we want to label fruit. 00:54:20.200 |
It's kind of a stupid example because it's that-- 00:54:22.800 |
I can't think of many use cases where you actually 00:54:25.520 |
want to do that, but it makes a great illustration here. 00:54:30.160 |
So yeah, we start off, we say, OK, we want fruit. 00:54:37.120 |
And we also have word vectors that we can use that will easily 00:54:42.320 |
give us more terms that are similar to these three fruit 00:54:50.840 |
that we collected by just saying yes or no to what we've gotten 00:54:54.560 |
out of the word vectors, look at those in our data, 00:54:58.240 |
and then say whether apples in this context is a fruit or not. 00:55:04.160 |
Because we're not just labeling all fruit terms as a fruit 00:55:11.560 |
entity, because it could be apple, the company. 00:55:14.200 |
But we get to look at it, and it's much more efficient 00:55:16.320 |
than if you ask the human to sit through and highlight 00:55:26.680 |
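
A rough sketch of that bootstrapping step using only spaCy's word vectors (the vectors model is an assumption; Prodigy's terms recipes wrap this kind of loop in the binary interface):

    import spacy

    nlp = spacy.load("en_core_web_lg")  # assumed: a model with word vectors
    seeds = [nlp.vocab[w] for w in ("apple", "pear", "banana")]

    # Rank vocabulary entries by average similarity to the seed terms,
    # then ask a human to accept or reject the top suggestions
    candidates = [w for w in nlp.vocab
                  if w.has_vector and w.is_lower and w.is_alpha]
    candidates.sort(key=lambda w: sum(w.similarity(s) for s in seeds),
                    reverse=True)
    print([w.text for w in candidates[:20]])
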
And so this also leads to one of our main aspects 00:55:34.000 |
of the tool, workflows that we're especially proud of 00:55:36.360 |
and that we think really can make a difference, which is we 00:55:39.280 |
can actually start by telling the computer more abstract 00:55:43.280 |
rules of what we're looking for and then annotating 00:55:45.560 |
the exceptions instead of really starting from scratch. 00:55:51.360 |
we're working with to build these semi-automatically using 00:55:55.000 |
word vectors, using other cool things that we can now do. 00:56:00.920 |
look at those examples that the statistical model we 00:56:10.440 |
where we can be pretty sure that they're correct 00:56:12.920 |
and actually really ask the human first about the stuff 00:56:17.760 |
that's 50/50 and where really the human feedback makes 00:56:33.560 |
And then we look at what else is similar to that term. 00:56:38.840 |
from that Sense2Vec model that Matt showed earlier. 00:56:47.120 |
So we're going to annotate California, and maybe 00:56:51.000 |
But we're not going to annotate California rolls 00:56:57.440 |
is at least similar to the real meaning of the word. 00:57:00.480 |
And a lot of these are super trivial to answer. 00:57:05.240 |
or we can ignore them because this is a bit too ambiguous 00:57:15.400 |
create a pattern that uses spaCy's attributes, 00:57:20.080 |
or in this case, the lower case form of the token and GPE 00:57:26.840 |
that stands for geopolitical entities or anything 00:57:32.760 |
So we can easily build up these rules very quickly, 00:57:35.960 |
very automated, and then we have a bunch of locations 00:57:47.520 |
So that's a very, very simple example of this. 00:57:49.920 |
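
A sketch of what such a pattern can look like, here as a spaCy Matcher rule (spaCy v2-style signature); in Prodigy's JSONL patterns file the equivalent would be {"label": "GPE", "pattern": [{"lower": "california"}]}:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")  # assumed model
    matcher = Matcher(nlp.vocab)
    # Match on the lowercase form of the token
    matcher.add("GPE_TERMS", None, [{"LOWER": "california"}])

    doc = nlp("I live in California.")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)
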
But of course, this also works for slightly more complex 00:57:54.640 |
constructs where we can really take advantage 00:58:03.400 |
to extract information about executive compensation. 00:58:07.840 |
So yeah, some executive receives some amount of money 00:58:17.240 |
But also, the idea here is we have this theory 00:58:20.480 |
that maybe if we could train a model, a text classification 00:58:29.120 |
we can then very, very easily use what we already 00:58:32.720 |
know about the text to extract, let's say, the first person 00:58:36.600 |
We extract the amount of money, put that in our database. 00:58:39.480 |
And we've actually-- yeah, we found a good solution 00:58:48.920 |
We haven't tried this in detail, but one possible pattern 00:58:55.680 |
would be let's try and look for an entity type person, 00:59:05.240 |
So received, receives, receiving, and followed by a token 00:59:15.520 |
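
Assuming the pattern ends with a money entity, as the surrounding discussion suggests, a sketch of the token pattern might be:

    # Untested sketch: a PERSON entity, any form of "receive" matched
    # via its lemma, then a MONEY entity (usable with the Matcher above)
    pattern = [
        {"ENT_TYPE": "PERSON", "OP": "+"},  # one or more person tokens
        {"LEMMA": "receive"},               # received / receives / receiving
        {"ENT_TYPE": "MONEY", "OP": "+"},   # one or more money tokens
    ]
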
I mean, there are plenty of other possible patterns 00:59:22.280 |
going to be looking at them again in context. 00:59:25.920 |
And even actually, in fact, even if it pulls up random stuff 00:59:29.320 |
that you realize is totally not what you want, 00:59:34.280 |
Because you won't only be collecting annotations 00:59:37.580 |
for the things you know are definitely right. 00:59:41.960 |
for the things that are very, very similar or look very, 00:59:44.420 |
very similar to what you're looking for but are actually 00:59:54.240 |
So yeah, the moral of the story is what we're saying 00:59:57.600 |
is we're very used to iterating on our code as programmers. 01:00:07.120 |
So as we see here, OK, that's the normal type of programming. 01:00:19.120 |
You go back, change the source code, compile it, and so on. 01:00:29.640 |
So the part we should really be thinking about and working on 01:00:35.000 |
Instead, most focus is currently on the training algorithm. 01:00:42.560 |
very similar to going and tweaking your compiler 01:00:45.760 |
if you're not happy with your runtime program. 01:00:48.640 |
You can do that, but of course, you probably go back and edit 01:00:53.920 |
I think this is actually a pretty good example. 01:01:01.040 |
but what really makes a difference is your data. 01:01:03.360 |
So if you have a good way and a fast way of iterating 01:01:06.480 |
on that data, and you're able to really master 01:01:19.800 |
It's always one of these things that's kind of misrepresented. 01:01:32.920 |
And you really want to find the things that actually work. 01:01:41.560 |
figure out what works before you try it and invest in it, 01:01:47.200 |
because you're not going to waste your time on the things that 01:01:51.280 |
might fail and more scale things up that actually weren't even 01:01:57.000 |
And one thing that's also very important to us 01:02:01.800 |
You can build solutions that fit exactly to your use case, 01:02:08.480 |
If you collect your own data, you'll keep that forever, 01:02:13.920 |
and if that API shuts down, you can start again from scratch. 01:02:20.000 |
other cool things we can do at some point in the future, 01:02:31.200 |
that's very important in the future of the technology. 01:02:40.520 |
And, yeah, we're hoping that we can keep providing useful tools 01:03:08.800 |
we write very good software, even though we're only two people, 01:03:19.760 |
I don't even know where this idea comes from that, like, yeah, 01:03:24.240 |
Like, I don't know, scaling things up makes things better. 01:03:31.480 |
you sometimes-- it actually can have a very negative impact 01:03:43.360 |
if everyone can do exactly the same thing if they just 01:03:45.600 |
work hard, even though people like thinking of it that way. 01:03:48.200 |
It's just, OK, in our case, we have a good combination 01:03:53.560 |
that we happen to be good at, and it just works together. 01:04:08.000 |
I mean, yeah, it's kind of ironic saying that, speaking 01:04:11.600 |
at an event, but I really don't normally go to many events. 01:04:17.200 |
We don't take coffee dates with random people. 01:04:22.800 |
Yeah, we mostly, we really just like to write software. 01:04:26.720 |
And yeah, we've had some good ideas in the past. 01:04:59.440 |
done any experiments where we compare the binary decisions 01:05:11.880 |
focusing on the bias because that's, in some sense, 01:05:15.960 |
that's difficult because we're looking at the output. 01:05:24.160 |
versus binary annotation, but also mostly focused 01:05:28.480 |
on our own tooling because we think it's kind of useless. 01:05:33.480 |
where we said, oh, we did stuff in an Excel spreadsheet 01:05:35.760 |
and then we did stuff in Prodigy and it was much better. 01:05:38.320 |
So it's really mostly focused around our own tooling 01:05:47.920 |
I feel like giving these answers sounds unsatisfying 01:05:50.400 |
because I'm always saying, well, it depends on your data. 01:05:56.040 |
because we're doing this because your data is different 01:06:10.320 |
predicts something, ideally also something that's 01:06:16.120 |
Otherwise, the pattern approach does work very well 01:06:22.600 |
Like, we did one example of where we labeled drug names 01:06:26.480 |
on Reddit, like on r/Opiates, which was a pretty good-- 01:06:33.400 |
And also, it's a subreddit that's very on topic 01:06:35.960 |
because people who go on Reddit to discuss opiate use, 01:06:44.080 |
usually are very dedicated to talking about this one topic. 01:06:48.920 |
And so what we wanted to do is we labeled drug names, 01:06:54.200 |
drugs, and pharmaceuticals in order to, for example, 01:07:01.120 |
the content of this subreddit and see how it develops 01:07:08.200 |
worked very, very well because we have very specific terms. 01:07:14.600 |
Especially also, we can include spelling mistakes and stuff, 01:07:18.960 |
Like, we can really build up good word lists, 01:07:21.120 |
find them in the text, confirm them, and get to pretty decent 01:07:24.960 |
I would expect this to work a little less well, 01:07:28.560 |
for the cold start problem, on a much more ambiguous domain. 01:07:31.880 |
And there, you're probably better off to say, OK, 01:07:35.480 |
But even there, that's something I haven't really 01:07:48.640 |
perfect, like, highlight, and then, ah, shit, 01:08:00.160 |
And also, there, get more efficiency out of it. 01:08:35.120 |
So the question is, first, you gave an example 01:08:41.000 |
of annotating patient data, which is obviously very 01:08:45.240 |
problematic because doctors are not always very specific 01:08:49.040 |
And then, in the end, this was how did they enrich that with-- 01:08:52.600 |
So what they did is they got foundation of the [INAUDIBLE] 01:09:01.320 |
is whether we have some experience in the medical field 01:09:06.920 |
The answer is, well, we haven't personally done this. 01:09:09.040 |
But we do have quite a few companies in that domain, 01:09:13.880 |
also because the tool itself is quite appealing 01:09:17.120 |
because you can run it in your own compliant environment, 01:09:27.120 |
That's maybe also where, OK, having the professionals-- 01:09:29.680 |
getting the medical professionals more involved 01:09:31.520 |
might make sense, which normally is very difficult. 01:09:34.720 |
You don't want a doctor to do all the work themselves. 01:09:37.640 |
But if you can find some way to distill that and then ask 01:09:40.720 |
the doctor, OK, you wrote this here, does that mean-- 01:09:49.080 |
If you can try this out and extract some information, 01:09:53.400 |
well, that could be one idea to solve that, for example. 01:10:08.600 |
Like right now, we don't have a built-in logic for that, 01:10:18.080 |
inter-annotator agreement, if you can calculate that 01:10:28.840 |
Because the tool here, we really designed specifically 01:10:32.240 |
as a developer tool first and then scaling it up a second. 01:10:38.880 |
and if you have an idea, if you have an algorithm you want to use 01:10:44.400 |
do that fairly easily because you can download 01:10:48.640 |
You have a key that's answer, which is either 01:10:53.200 |
You can attach your own arbitrary data like a user ID. 01:10:57.160 |
And then it's fairly trivial to write your own function that 01:11:00.240 |
really takes all of this, reads it in, computes something, 01:11:07.360 |
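
For instance, a rough sketch of such a function, assuming you attached a custom "annotator" field to each task and grouping answers on the same text by Prodigy's input hash:

    import json
    from collections import defaultdict

    def percent_agreement(path):
        # Group each annotator's accept/reject answer by input example
        answers_by_input = defaultdict(dict)
        with open(path) as f:
            for line in f:
                eg = json.loads(line)
                answers_by_input[eg["_input_hash"]][eg["annotator"]] = eg["answer"]
        pairs = agreed = 0
        for answers in answers_by_input.values():
            labels = list(answers.values())
            for i in range(len(labels)):
                for j in range(i + 1, len(labels)):
                    pairs += 1
                    agreed += labels[i] == labels[j]
        return agreed / pairs if pairs else None
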
But this is also something we're really interested in exploring 01:11:18.840 |
That's a big advantage of the binary interface 01:11:25.800 |
You filter out the ignored ones, and then you 01:11:47.280 |
one interface I showed, which was the sentiment 01:11:55.120 |
tell our users avoid this as much as possible, if you can. 01:11:59.680 |
And in some cases, you might still want that. 01:12:05.000 |
think of surveys when they think of annotating data. 01:12:09.960 |
but I think if you can leave that sort of mindset 01:12:12.600 |
and really open up a bit and think of other creative ways, 01:12:21.120 |
So for example, if I were doing this with those four options, 01:12:27.360 |
The annotator sees every text four times and says, 01:12:32.720 |
And because you can get to one second for annotation, 01:12:36.400 |
Like, even if you have thousands of examples, 01:12:40.680 |
And so that's how we would probably solve this. 01:12:42.880 |
And it also means you get every example four times. 01:12:54.040 |
Some people really want to build that survey. 01:13:01.920 |
So the question is, if you're doing the same example 01:13:15.200 |
multiple times, whether it slows down the annotation or not. 01:13:18.520 |
Well, actually, I mean, it's difficult to say 01:13:22.240 |
But I've actually found that even if you do the bare maths, 01:13:33.520 |
about five different concepts that are maybe not even fully 01:13:36.120 |
related, that just every tiny bit of friction 01:13:38.920 |
you put between a human and the interface or the decision 01:13:41.800 |
can very significantly slow down the process. 01:13:49.840 |
And just this thing that can easily add like 10 seconds 01:13:54.320 |
So if you do the whole thing three times at one second, 01:14:07.840 |
If you have to think too much, you're much more likely to fuck 01:14:12.160 |
And then that's also something you want to avoid. 01:14:15.920 |
But the active learning helps a lot here as well. 01:14:18.240 |
So if you have your labels, it's pretty confident 01:14:22.280 |
And so you just don't have to learn something. 01:14:24.440 |
Yeah, to repeat this, the active learning also 01:14:36.080 |
and don't have to really go through every single one that 01:14:39.920 |
is not as important as some of the other ones 01:14:44.880 |
Yeah, do you have any experience working with tasks like that 01:14:54.080 |
like the whole medical history or just a whole document. 01:14:57.760 |
So we have-- and whether we have experience with that. 01:15:00.800 |
So in general, we do say, if your task requires so much 01:15:05.640 |
context that you can't fit this into the prodigy interface, 01:15:08.520 |
then it doesn't mean that you can't train a model on that. 01:15:11.160 |
But for most of the tasks that users most commonly want to do, 01:15:14.440 |
this is often also an indicator that it's very, very difficult 01:15:19.080 |
doing named entity recognition or even text classification 01:15:23.080 |
and you need a lot of context and all the context 01:15:26.320 |
is equally as important, that's often an indicator 01:15:32.800 |
we say, OK, we start off by selecting one sentence 01:15:37.280 |
And then instead of you annotating the whole document, 01:15:41.360 |
you say, OK, this is the most important sentence. 01:15:46.040 |
So there are some tricks we use to get around this problem 01:15:56.280 |
important to get this across and frame it in that way 01:15:59.360 |
because, yeah, if you need two pages on your screen, 01:16:08.800 |
but your model won't learn that because your model needs 01:16:12.000 |
local context as well, at least, for the tasks that we are-- 01:16:15.880 |
I don't know if you have anything to add to that. 01:16:20.880 |
Often, it's important to take into account the models 01:16:25.920 |
Yeah, so the suggestion was, OK, having some tools, 01:16:33.560 |
some process that goes along with the software that 01:16:38.640 |
Yeah, we've actually been thinking about this a lot 01:16:43.120 |
and we're introducing a lot of new concepts at once, 01:16:47.200 |
ah, that's how you should do it, or you could try this. 01:16:57.800 |
so right now what we're doing is we have a support form 01:17:00.440 |
for Prodigy where we answer people's questions. 01:17:09.640 |
Other users come in and are like, oh, I actually 01:17:14.800 |
and here's what worked for me, and have this sort of exchange 01:17:24.820 |
a lot of the best practices are still evolving, 01:17:32.160 |
So it's definitely-- yeah, we're open for suggestion there 01:17:35.360 |
as well, but we're still in the process of really coming up 01:17:51.360 |
have any plans to sell models like medical models? 01:18:02.400 |
like an online store for very, very specific models. 01:18:05.120 |
So medical-- that's a very, very interesting domain. 01:18:09.320 |
And if so, we really want to have it specific, 01:18:17.700 |
Because we believe that, OK, pre-trained models 01:18:20.120 |
are very valuable, and even if you do medical texts, 01:18:25.200 |
then you can use a tool like Prodigy or something else 01:18:27.400 |
to really fine tune it on your very, very specific context, 01:18:31.120 |
have word vectors in it that already fit to your domain, 01:18:35.880 |
We think that this is a very future-proof way of working 01:18:47.160 |
So currently-- so a question is the text classification model 01:18:53.160 |
So what we're using is Spacey's text classification model. 01:18:58.520 |
But I think actually this question is pretty good, 01:19:00.520 |
because what's important to note is that Prodigy itself 01:19:04.080 |
comes with a few built-in recipes that are basically 01:19:07.560 |
ideas for, OK, how you could train a text classifier. 01:19:13.520 |
The idea-- the tool itself is really the scaffolding 01:19:16.200 |
So if you say, hey, I wrote my own model using PyTorch, 01:19:19.720 |
and I would like to train this, all you need to do 01:19:22.360 |
is you need to have one function that takes examples 01:19:26.160 |
And you need to have one function that takes raw texts 01:19:35.120 |
And then you can use the same active learning mechanism 01:19:45.200 |
are just a suggestion or an idea you can use to try it out. 01:19:48.840 |
But ultimately, we also hope that people in the future 01:19:51.800 |
will transition to just plugging in their own model 01:19:54.840 |
and just using the scaffolding around it to do that. 01:19:59.400 |
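
A rough sketch of those two functions; my_model here is a hypothetical stand-in for your own, say, PyTorch model:

    def predict(stream):
        # Score raw examples so the most useful ones can be shown first
        for eg in stream:
            score = my_model.score(eg["text"])  # hypothetical model call
            eg["score"] = score
            yield score, eg

    def update(answers):
        # Each answer carries an "answer" key: "accept", "reject" or "ignore"
        examples = [(eg["text"], eg["answer"] == "accept") for eg in answers]
        my_model.train_step(examples)  # hypothetical training step
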
But we definitely don't want to lock anyone in and say, 01:20:01.680 |
oh, you have to use spaCy, especially for NER and stuff 01:20:08.080 |
But if you don't want to do that for other use cases, especially 01:20:11.040 |
text classification, we think that a lot of cases-- 01:20:13.800 |
well, you might want to use scikit-learn or Vowpal Wabbit. 01:20:21.880 |
Yeah, or basically something completely custom. 01:20:32.600 |
whether this is built on the underlying model-- 01:20:42.960 |
So the question is, active learning versus no active 01:20:49.320 |
so what we're doing for most of these samples 01:20:55.600 |
But we also know there are lots of other ways 01:21:02.040 |
is we have a simple function that takes a stream 01:21:04.280 |
and outputs a sorted stream based on the assigned scores. 01:21:10.440 |
So how you wire this up, again, is also up to you. 01:21:13.320 |
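
For example, with Prodigy's built-in sorters, and reusing the predict sketch from above, the wiring can be as small as:

    from prodigy.components.sorters import prefer_uncertain

    # Re-order a scored stream so the examples the model is least sure
    # about (scores closest to 0.5) come up for annotation first
    stream = prefer_uncertain(predict(stream))
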
And yeah, to answer the part about what works best, 01:21:19.000 |
in general, in our kind of framework, where really, 01:21:23.480 |
And often, you start off with a model not knowing very much. 01:21:29.200 |
resorting the stream is actually very crucial. 01:21:31.360 |
Because otherwise, if you start from scratch, 01:21:44.080 |
There's very little-- you need some kind of guidance 01:21:46.560 |
that tells you, OK, what to work on next, especially 01:21:51.880 |
You need to pre-select them based on something. 01:21:59.440 |
But without that, yeah, it's very, very difficult. 01:22:04.480 |
And that's kind of what we're trying to solve with a tool. 01:22:18.400 |
I've got to say, anybody who's using fastai, 01:22:23.760 |
any time you've used fastai NLP or fastai.text, 01:22:34.560 |
is because I tried every damn tokenizer I could find. 01:22:38.680 |
And spaCy's was so much better than everything else. 01:22:42.120 |
And then the kind of story of fastai's development 01:22:44.560 |
is that over time, I get sick of all the shitty parts 01:22:50.560 |
And the fact that I haven't rewritten spaCy or attempted 01:23:02.960 |
And it's got a good install story and so forth. 01:23:06.400 |
And I haven't used Prodigy, but just the fact 01:23:11.040 |
I recognize the importance of active learning 01:23:13.120 |
and the importance of combining human plus machine. 01:23:16.120 |
What's in that rare category of people, in my opinion, 01:23:18.720 |
are actually working on what's one of the most 01:23:27.280 |
And I look forward to seeing what you do next.