
Increasing data science productivity; founders of spaCy & Prodigy


Chapters

0:00 Introduction
3:55 Syntax Positive
12:32 Syntax
14:35 The Term Sense2Vec
16:19 Using Sense2Vec in spaCy
17:21 Parsing Algorithm
17:44 Transition-Based Parsing
22:24 Splitting Tokens
24:44 Learning to Merge
26:56 User Experience
29:16 spaCy vs. Stanford
30:54 End-to-end systems
32:48 Long-range dependencies
33:46 Language variation
35:09 Coreference resolution
36:49 How to use spaCy
37:42 Language generation
45:44 Binary decisions
52:51 Recipes
55:26 Example

Transcript

OK, so yeah, this is our first time in the Bay Area, so it's nice to meet you all, and thanks for coming. So I'll start by just giving a quick introduction of us and some of the things that we're doing before I start with the main content of the talk, which is about this open source library that we developed, spaCy, for natural language processing.

So among the other things that we develop at Explosion AI is the machine learning library behind spaCy, Thinc, which allows us to avoid depending on other libraries, keep control of everything, and make sure that everything is easy to install. We also have an annotation tool that we develop alongside spaCy, Prodigy, which is what Ines will be talking about.

And we're also preparing a data store of other pre-trained models for more specific languages and use cases, things that people will be able to use that will basically extend the capabilities of the software for more specific use cases. So to give you a quick introduction to Ines and me, which is basically all of Explosion AI: I've been working on natural language processing for pretty much my whole career.

I started doing this after doing a PhD in computer science. I started off in linguistics and then kind of moved across to computational linguistics. And then around 2014, I saw that these technologies were getting increasingly viable. And I was also at the point in my career where I was supposed to start writing grant proposals, which didn't really agree with me.

So I decided to leave, and I saw that there was a gap in the capabilities available, for something that actually translated the research systems into something more practically focused. And then soon after I moved to Berlin to do this, I met Ines, and we've been working together since on these things.

And I think we have a nice complementarity of skills. She is the lead developer of our annotation tool Prodigy and has also been working on spaCy pretty much since the first release. So I included this slide, which we normally give when we talk to companies specifically.

But I think that it's a good thing to include to give you a bit of this is what we tell people about what we do and how we make money and how the company works. And I think that this is a very valid question that people would have about an open source library.

It's like, well, why are you doing this and how does it fit into the rest of your projects and plans? So the Explain It Like I'm 5 version, which I guess is also the Explain It Like I'm Senior Management version, is we give an analogy. It's kind of like a boutique kitchen.

So the free recipes we publish online, you can see, are kind of like the open source software. So that's spaCy, Thinc, et cetera. At the start of the company, especially, we were doing consulting, which I'm happy to say we've been able to wind down over the last six months to focus on our products.

And then we also have a line of kitchen gadgets, which is things like Prodigy. These are the downloadable tools to use alongside the open source software. And soon we'll have the sort of premium ingredients, which are the pre-trained models. So the thing that we don't do here is enterprise support, which is probably the most common way that people fund open source software, or imagine that they'll fund open source software, as a business model.

And we really don't like this because we want our software to be as easy to use as possible and as transparent as possible and the documentation to be good. So I think it's kind of weird to have this thing where you have explicitly a plan that we're going to make our free stuff as good as possible.

And then we're going to have this service that we hope people pay us lots of money for, but we hope nobody uses. And that's kind of weird, right? It's kind of weird to have a company that you hope that your paid offering is really poor value to people. And so we don't think that that's a good way to do it.

And so instead, we have the downloadable tools, which I think is a good way to do it: we have something which works alongside spaCy and I think is useful to people who use spaCy as well. OK, so onto the main content of the talk and the bit that I'll be talking about.

So I'm going to talk to you about the syntactic parser within spaCy, the natural language processing library that we develop. And before I get into it, this is kind of what it looks like, visualized as an output. So it's this sort of tree-based structure that gives you the syntactic relationships between words.

So the way that you should read this here is that the arrow pointing from this word to this word means that Apple is a child of looking in the tree, and it's a child with this relationship label. In other words, Apple is the subject of looking. And is is an auxiliary verb attached to looking, and then at is a preposition attached to looking.

So these sorts of relationships tell you about the syntactic structure of the sentence and basically help you get at the who-did-what-to-whom relationships in the sentence, and also to extract phrases and things. So for instance, here, to make the thing easier to read, we've merged UK startup, which is a basic noun phrase, into one unit.

And you can find these sorts of phrases more easily given the syntactic structure. And just above here, we've got an example of what the code looks like to actually get the syntactic structure or navigate the tree. In spaCy, you just get this nlp object after loading the models.

And you just use that as a function that you feed text through, or nlp.pipe if you've got a sequence of texts. And given that, you get a Doc object, which you can just use as an iterable. And from the tokens, you get attributes that you can use to navigate the tree.

So for instance, here, the dependency relationship is just .dep. By default, that's an integer ID, because everything's coded to an integer for easy and efficient processing. But then you can get the text value with an underscore as well, .dep_. And then you can navigate up the tree with .head.

And then you can look at the left and right children in the tree as well. So we try to have a rich API that makes it easy to use these dependency relationships. Because just getting dependency parses is obviously just the first step; you want to actually use it in some way.
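As a rough sketch of that API in code (a minimal example; it assumes the standard English model en_core_web_sm has been downloaded, and the example text is arbitrary):

```python
import spacy

# Assumes the small English pipeline has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc:
    # .dep is an integer ID; .dep_ is the human-readable label.
    # .head walks up the tree; .children / .lefts / .rights walk down.
    print(token.text, token.dep_, token.head.text,
          [child.text for child in token.children])

# For a sequence of texts, nlp.pipe streams Doc objects efficiently.
for doc in nlp.pipe(["First text here.", "Second text here."]):
    print([(t.text, t.dep_) for t in doc])
```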

And that's why we have this API to make that easy. So the question that always comes up with this, and I think this is a very interesting thing for the field in general, is what's the point of parsing? What is this actually good for in terms of applications? So Yoav Goldberg is a very prominent parsing researcher.

And this is kind of the stuff that he's studied for most of his career, and he's one of the more well-known parsing people. And so it's interesting to see him and other people reflect on this and say that he finds it fascinating that even though parsing wins so many best paper awards in NLP, so it's kind of a high-prestige thing to study,

But it seems like syntax is hardly used in practice in most of these applications. So the question is, why is this? Because parsing is based on trees and structured predictions, it's kind of fun to study. And there's all these deep algorithmic questions. Is it just kind of this catnip to researchers?

And does it have this kind of over prominence in the field? Or is it that there is something deeper about this and we should really continue studying this? Well, I can go either way on this. And so this slide shows you the case for parsing. And then I'll kind of have a counterpoint in a second.

So I think that the most important case for parsing is that there's a sort of deep truth to the fact that sentences are tree-structured. They just are, right? The syntactic structure of sentences is recursive. And that means that you can have arbitrarily long gaps between two words which are related.

So for instance, if you have a relationship between, say, a subject and a verb like syntax is, whether the subject of that verb is plural or singular is going to change the form of the verb. And that dependency between them can be arbitrarily long because you can have this nested structure.

But it can't be arbitrarily long in tree space because the relationship between them will always be the subject and the verb, like sort of next to each other in the tree. So you can see how, for some of these things, it should be more efficient to think about it or model it as a tree.

And the tree should tell you things that you otherwise would have to infer from an enormous amount of data. It should be more efficient in this way. So we can say, OK, in theory, this should be important. And it should be something that we study based on this knowledge about how sentences are structured.

So then the counterpoint to this is, all right, so sentences are tree-structured. And that's a truth about sentences. But it's also true that they're written and read in order. So if you read a sentence, you do read it from left to right, or in English anyway, basically from start to finish, or you hear a sentence from start to finish.

And this really puts a sort of bound on the dependency lengths that you will empirically see. Because when somebody wrote this sentence, yes, they could have an arbitrarily long dependency. But they expect that that would mean that their audience listening to it will have to wait arbitrarily long between some word and the thing that it attaches to.

And that's kind of not very nice. So empirically, it's not very surprising to see that most dependencies are, in fact, short. And there are a lot of arguments that the options that grammars provide are sort of arranged so that you're able to keep your dependencies short. That's some of the reason you have options for how to move things around in sentences, to make nice reading orders.

Because you want short dependencies. So this means that if most dependencies are short, then processing text as, say, chunks of one or two words at a time gives you a pretty similar view. Most of the time, you don't get something that's so dramatically different if you look at a tree instead of looking at chunks of three or four words.

So this is kind of a counterpoint that says maybe even though sentences are, in fact, tree-structured, maybe it's not that crucially useful. So I think that the part that makes it particularly rewarding to look at syntax, or particularly useful to provide syntactic structures in a library like spaCy, is that they're application independent.

So the syntactic structure of the sentence doesn't depend on what you hope to do with the sentence or how you hope to process it. And that's something that's quite different from other labels or other information that we can attach to the sentence. If you're doing something like a sentiment analysis, there's no truth about the sentiment of a sentence that's independent of what you're hoping to process.

That's not a thing that's in the text itself. It's a lens that you want to take on it based on how you want to process it. So whether you consider some review to be positive or negative depends on your application. It's not necessarily in the text itself. Because what counts as positive or negative?

What's the labeling scheme? What's the rating scheme? Or exactly what are they talking about? Well, the taxonomy that you have will depend on what you're hoping to do. Those things aren't in the language. But details about the syntactic structure are in the language. They're things which are just part of the structure of the language itself.

And that means that we can provide these things: learn it once, and give it to many people. And I think that that's very valuable and useful and different from other types of annotations that we could calculate and attach. And that's why spaCy provides pre-trained models for syntax, but doesn't provide pre-trained models for something like sentiment.

Because we know how to give you a syntactic analysis that's as useful as it may be, or maybe not, depending on whether that actually solves your problems. But at least it's sort of true and generalizable, whereas we don't know what categorization scheme you want to classify your text in.

So we can't give you a pre-trained model that does that, because that's your own problem. So we try to basically give you these things, which are annotation layers, which do generalize in this way. And that means that there has to be a sort of linguistic truth to them. And that means that looking at things like the semantic roles, or sentence structure, or sentence divisions are things that we can do.

And that's why we are interested in this. So the other thing about syntactic structures and whether they're useful or not is that in English, not using syntax is pretty powerful, because English orthography happens to cut things up into pretty convenient units. They're not optimal units, but they're still pretty nice in a way that doesn't really hold true across a lot of other languages.

So in the bottom right here, we have Japanese, which usually isn't segmented into words. You can't just cut that up trivially with white space and get something that you can feed into a search engine, or get something that you can feed forward into a topic model. You have to do some extra work.

And the extra work that you do there really should consider syntactic structure. You can use a technology that only makes linear decisions, but the truth about what counts as a word or not is very entangled with the syntactic structure. And so there's real value in doing it jointly with syntactic parsing.

For other languages, you have kind of the opposite problem. So we have here a German word, and this is the German word for income tax return. Now, whether or not you want that to be sort of one unit will depend on what you're looking for. For many applications, actually, the English phrase is too short.

And the domain object, the thing that you want to be looking for and having a single node in your knowledge graph for, would actually be income tax return. That's pretty awesome. But in other applications, maybe you just want to look for tax. And so in those cases, the German word will be too large and your data will be too sparse.

So there's sort of different aspects to this. In the bottom left here, we have an example of Hebrew. And like Arabic and a couple of other languages like this, there's no vowels in the text. And the words tend to have all sorts of attachments that are difficult to segment off.

So there, again, you have difficult segmentation problems that are all tangled up with the syntactic processing. OK, so moving on to an example of what we can do if we recognize units that aren't just whitespace-delimited words and feed them into some of the other processing stuff that we have. So this is a demo that we prepared a couple of years ago for an approach that we call Sense2Vec.

So all this is, is basically processing text using natural language processing tools, in this case specifically spaCy, in order to recognize these concepts that are longer than one word. So specifically here, we looked for base noun phrases and also named entities. And we just merged those into one token before feeding the text forward into a word2vec implementation, which gives you these semantic relationships.
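A minimal sketch of that preprocessing step, using spaCy's retokenizer to merge base noun phrases and entities into single tokens before writing out a corpus for whatever word2vec implementation you use (the model name, example text, and output format here are just illustrative):

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")  # any pipeline with a parser and NER

def merge_phrases(doc):
    # Merge base noun phrases and named entities into single tokens.
    # filter_spans drops overlapping spans so the retokenizer doesn't complain.
    spans = filter_spans(list(doc.noun_chunks) + list(doc.ents))
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

texts = ["Natural language processing is closely related to machine learning."]
with open("corpus.txt", "w", encoding="utf8") as f:
    for doc in nlp.pipe(texts):
        doc = merge_phrases(doc)
        # Join multi-word units with underscores so a word2vec tool sees one token.
        f.write(" ".join(t.text.replace(" ", "_") for t in doc) + "\n")
```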

And this lets you search for and find similarities between phrases which are much longer than one word. And as soon as you do this, you find, ah, the things which I'm searching for are much more specific in meaning. I'm not looking for one meaning of learning or one meaning of processing, which doesn't tend to be so useful or interesting.

Instead, you can find things related to natural language processing. And then you see, ah, machine learning, computer vision, et cetera. These are real results that came out of the thing as soon as you did this division. And so we can do this for other languages as well. So if we were hoping to use word2vec for a language like Chinese, you really want to be processing it into words before you do that.

Or if you're going to do this for a language like Finnish, you really want to cut off the morphological suffixes before you do this. OK. So incidentally, Ines has cleaned up the Sense2Vec code recently, so you can actually use this as a handy component within spaCy. So you can load up a standard model and then add a component that gives you these Sense2Vec senses.

So you can just say, all right, the token at position three would be natural language processing, because it does the merging for you. And then you can also look up the similarity. So it's now much easier to actually use the pre-trained model and use that approach within spaCy. Incidentally, we have this concept of an extension attribute in spaCy, so that you can kind of attach your own things to the tokens, your own little markups or processing results.
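To make the extension-attribute idea concrete, here is a minimal sketch; the set_extension API is spaCy's, but the attribute name and the fruit logic are made up for illustration:

```python
import spacy
from spacy.tokens import Token

# Register a custom attribute once; it then lives under token._.
Token.set_extension("is_fruit", default=False)

nlp = spacy.load("en_core_web_sm")
doc = nlp("I had an apple and a banana for lunch.")

for token in doc:
    if token.lower_ in {"apple", "pear", "banana"}:
        token._.is_fruit = True

print([(token.text, token._.is_fruit) for token in doc])
```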

So this underscore object is kind of a free space that you can attach attributes to, which ends up being quite convenient. It's a lot more convenient than trying to subclass something. OK. So for the rest of the talk, I'll give you a pretty brief overview of the parsing algorithm and then explain how we're modifying the parsing algorithm to work with languages other than English, so that we can basically broaden out the support of spaCy to these other languages.

So what we see here is a completed parse. And I'm going to talk you through the steps, or the decision points, that the parser is going to make to derive this structure. And the key aspect of the solution to keep in mind is that it's going to read the sentence from left to right and maintain some state.

And then it's going to have a fixed inventory of actions that it has to choose between to manipulate the current parse state to build up the arcs. And this type of approach, which is called transition-based parsing, I find deeply satisfying because it's linear in time because you only make so many decisions per word.

And I do think that it makes a lot of sense to take algorithms which process language incrementally. I think that that's deeply satisfying and correct in a way that a lot of other approaches to parsing aren't. And it's also a very flexible approach. So we can do joint modeling and have it output all sorts of other structures as well as the parse tree.

And that's actually what we're going to do. So already in spaCy, we have joint prediction of the sentence boundaries and the parse tree. And what we're going to do is extend this to joint prediction of word boundaries as well. OK, so here's how the decision process of building the tree works.

So we start off with an initial state. For ease of notation, or ease of readability, we're marking the first word of the buffer, the word that's currently being focused on, with this beam of highlighting. And then the other element of the state is a stack. And so as the first action that we do, we have an action that can advance the buffer by one and put the word that was previously at the start of the buffer onto the stack.

So here's what that shift move is going to look like. So here we have Google on the stack, which we write up here. And the first word of the buffer is reader. And so then another action that we can take is to form a dependency arc between the word that's on top of the stack and the first word of the buffer.

So in this case, we want to attach Google as a child of reader. So we have an action that does that. And because we're building a tree, when we make an arc to Google, we know that we can pop it from the stack because it's a tree. It only can have one head.

It can only have one attachment point, unlike a more general graph. And so that means that we can do that and keep moving forward. So here's what that looks like: we add an arc and pop Google from the stack. So now we make the next move. Clearly, we've got no words on the stack.

So we should put reader on the stack so that we can continue. Now we're at was. And now we want to decide whether we should make an arc directly between was and reader. In this case, no, we want to attach was to canceled. So we're going to move was onto the stack and move forward onto canceled.

So then here we do want this arc between canceled and was, so we do another left arc. And we basically continue like this. So stepping back a bit and thinking about this: we've got a fixed inventory of actions, and so long as we can predict the right sequence of those actions, we can derive the correct parse.

So that's how the machine learning model is going to work here. The machine learning model is going to be a classifier that predicts, given some state, what to do next. And you can sort of imagine that we can have other actions instead if we wanted to predict other aspects of the structure.

So in the case of spaCy, we have an action that inserts a sentence boundary. So it just says, all right, given the words currently on the stack, you have to make actions that can clear the stack. But you're not allowed to push the next token until your stack is clear.

And that means that there's going to be a sentence boundary there. And we could have other actions as well. There's been work to jointly predict part of speech tags at the same time as you're parsing. Or you can do semantics at the same time as you do syntax. And so you can code up all sorts of structures into this.

And you're going to read the sentence left to right. And you're going to output some meaning structure attached to it. And as I said, I find this a satisfying way to do natural language understanding. Because it does involve reading the sentence and adding an interpretation incrementally. OK, so that's what this looks like as we proceed through.
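To make that concrete, here is a toy sketch of such a transition system in plain Python. This is written from the description above, not spaCy's actual implementation; a trained classifier would normally choose the actions, but here the action sequence for the Google Reader example is supplied by hand:

```python
# Toy transition system: a stack, a buffer, and a set of arcs.
def parse(words, actions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for action in actions:
        if action == "SHIFT":
            # Advance the buffer by one, pushing its first word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The word on top of the stack becomes a child of the buffer front,
            # and gets popped, since a tree node can only have one head.
            child = stack.pop()
            arcs.append((buffer[0], child))   # (head, child)
        elif action == "RIGHT-ARC":
            # The buffer front becomes a child of the stack top.
            child = buffer.pop(0)
            arcs.append((stack[-1], child))
            stack.append(child)
    return arcs

words = ["Google", "Reader", "was", "cancelled"]
actions = ["SHIFT", "LEFT-ARC",           # Google becomes a child of Reader
           "SHIFT", "SHIFT", "LEFT-ARC",  # was becomes a child of cancelled
           "LEFT-ARC", "SHIFT"]           # Reader becomes a child of cancelled
print(parse(words, actions))              # [(1, 0), (3, 2), (3, 1)] as (head, child) indices
```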

So all right, so how are we going to do this splitting up or merging of other things? Well, it's actually not that complicated, given this transition-based framework. So already, you can kind of see that in order to merge tokens, all we really have to do is we've got those tokens.

And if we wanted Google Reader to be one token, we just have to have some special dependency label in the tree; so, obviously, we call the label subtoken. And then all we have to do is say, all right, at the end of parsing, we're going to consider that as one token.

So the step from going through something like this and labeling a language like Chinese is actually super simple. We just have to prepare the training data so that the tokens are individual characters. And then we can say, all right, things which should be one word should have this sort of structure with this label.

And then if the parser decides that those things are attached together, then at the end of it, you just merge them up. Splitting tokens is more complicated, because you have to have some actions that manipulate the strings. So I'm still working on the implementation of this in a way that's clean and tidy.
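Going back to the merge case for a moment, a rough sketch of that post-parse merge step might look like the following. The subtoken label is the one described above; the run-detection logic and retokenizer usage here are illustrative rather than spaCy's shipped behaviour:

```python
def merge_subtokens(doc, label="subtoken"):
    # Collect runs of tokens whose arcs carry the 'subtoken' relation,
    # plus the following token (treated here as the head of the run),
    # and merge each run into a single token after parsing.
    spans, start = [], None
    for i, token in enumerate(doc):
        if token.dep_ == label:
            if start is None:
                start = i
        elif start is not None:
            spans.append(doc[start:i + 1])
            start = None
    if start is not None:
        spans.append(doc[start:len(doc)])
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc
```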

But I actually think that this splitting will be useful for a lot of English text as well, because if you have English text that's sort of misspelled, a lot of the time things which should be two tokens get merged into one. So "it's" is a particularly common and frustrating case of this, because the verb "is" should be its own token.

But if you have "it's" written as "its", which is also a common word in English, you need to figure out that you have to have two parser actions, two parser states, for that. And in general, you could have a statistical model that reads the sentence beforehand. But that statistical model that is going to read the sentence and process it is going to end up taking on work and doing the job of figuring out the syntactic structure of the sentence in order to make those decisions.

And that's why I think doing these things jointly is kind of satisfying, because instead of learning that information in one level of representation and throwing it away, only to build up the same information in the next pass of the pipeline, you can do it all at once. And so the joint incremental approaches, I think, are very satisfying and good.

OK. So where are we at the moment? So I've implemented the learning-to-merge side of things, which involves figuring out better alignments between the gold-standard tokenization and the output of the tokenizer. And that's allowed me to complete the experiments for Chinese, Vietnamese, and Japanese on the CoNLL 2017 benchmark (the Conference on Computational Natural Language Learning shared task), which was a sort of bake-off of these parsing models conducted last year.

Now, in that benchmark, the team from Stanford did extremely well compared to everybody else in the field. They were some two or three percentage points better. And so at the moment, we're ranking kind of at the top of what was the second-place pack. So on most of the languages we're coming in sort of underneath the Stanford system, but with significantly better efficiency and with this end-to-end process.

And in particular, we're doing better than Stanford on these languages like Chinese, Vietnamese, and Japanese, because the Stanford system did have this disadvantage of using the sort of preprocessed text. They didn't do the whole task. They wanted to just use the provided preprocessed texts so that they could focus on the parsing algorithm.

And that meant that they did have this error propagation problem. If the inputs are incorrect because the preprocessed segmenter is incorrect, then they have a big disadvantage on these languages. So satisfyingly, the sort of doing all at once and entangling all of these representations, it does have this advantage.

And we're seeing that in the results that we have for those languages. And the other thing that's satisfying is this joint modeling approach of deciding the segmentation at the same time as deciding the parse structure is consistently better than the pipeline approach in our experiments. So basically, we're getting a sort of 1% to 3% improvement from this, which is about the same size as we're getting from using the neural network model instead of the linear model.

So I've found this also quite satisfying, that the sort of conceptually neat solution is also working well in practice. So where does this go, and what do we hope to deliver from this? Yes, that would probably be it. How am I for time? OK, well, this is the last slide, so wrapping up.

OK, so what we want to do is deliver a sort of workflow or user experience where it's very easy to start with the pre-trained models for different languages and broad application areas. And we want to make sure that they have the same representation across languages, so you get the same parse scheme, which the Universal Dependencies folks have been working hard on and basically now have a pretty satisfying solution for.

And so if you're processing text from different languages, it should be easy to find, say, subject-verb relationships or direct-object relationships. And that should work across basically any language so that you can use these parse trees and basically have a level of abstraction from which language the text is in.

And then given this, you should be able to do pretty powerful rule-based matching over the parse tree and the other annotations we provide. So it should be pretty easy to find information, even without knowing much about the language, and to reuse rules across languages. And then if the syntactic models and the entity models that we provide aren't accurate enough, the library should support easy updating of those, including learning new vocabulary items, without you taking particular effort.
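Because the labels follow Universal Dependencies, a rule like "find the subject and its verb" can be written once and reused. A minimal sketch (the model names are just examples of pipelines you might have installed, and it assumes the pipelines use UD-style labels):

```python
import spacy

def subject_verb_pairs(doc):
    # "nsubj" is the Universal Dependencies label for nominal subjects,
    # so the same rule applies to any pipeline trained with UD labels.
    return [(token.text, token.head.text)
            for token in doc if token.dep_ == "nsubj"]

nlp_en = spacy.load("en_core_web_sm")
print(subject_verb_pairs(nlp_en("The cat chased the dog.")))  # [('cat', 'chased')]

# The same function would work unchanged with, say, a German pipeline,
# provided that pipeline also uses UD-style labels:
# nlp_de = spacy.load("de_core_news_sm")
# print(subject_verb_pairs(nlp_de("Die Katze jagt den Hund.")))
```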

And overall, we want to emphasize a workflow of rapid iteration and data annotation. So the concept is that we should be able to provide things which give a broad-based understanding of language, but there's still a need for knowledge specific to your domain and for training and evaluation data specific to your problems.

And we want to make sure that it's easy to connect the two up and go the extra mile: start from a basic understanding of language and move forward to building the specific applications. And Ines will be talking about that aspect of the sort of intended package. Yeah. Right.

So, yes, certainly. Dilip asked what the sort of overall difference, or most important difference, is between spaCy's parsing algorithm and Stanford's parsing algorithm. So amongst other things, the most fundamental difference is that Stanford's system is a graph-based parser. So this is O(n²), or maybe O(n³), in the length of the sentence.

So you're unable to use this type of parsing algorithm for joint segmentation and parsing. You have to have pre-segmented text, which is why it has this disadvantage on languages, or text, which are more difficult to segment into sentences. So in spaCy, we want to make sure that we basically only use linear-time algorithms.

And that's why we only take this transition-based approach. As for other reasons why they get such a good result: other people have done graph-based models, and they're not nearly as accurate. So I hope to meet the Stanford team in the next couple of days and shake out the details of why their system is so accurate.

Because, actually, it is quite surprising. I've read their papers several times, and I can't get at the sort of one key insight that means that their system performs so well. It's interesting. Right, yes, certainly. So the question, which is a very good one that many people have been thinking about, is: to what extent can end-to-end systems, which maybe learn things about syntax, but learn them latently and don't have an explicit syntactic representation internally, replace the need for this type of syntactic processing?

So I would say that for any application where there's sufficient text, currently the best approach or the state-of-the-art approach doesn't use a parser. And actually, this includes translation and other things where you would kind of expect that having an explicit syntactic layer would help. If there's enough text, it seems that going straight to the end-to-end representation tends to be better.

However, that does involve having a lot of text. And for most applications, creating that much training data, especially initially when you're prototyping, tends not to be such a viable solution. So the way that I see it is that the parsing stuff is a great scaffolding. And it's a very practical thing to have in your toolbox, especially when you're trying to figure out how to model the problem.

So because otherwise, you end up in this chicken and egg situation of, well, we need lots of data to make our model work well. And otherwise, it just doesn't really get off the ground. But then how do we even know that we're collecting the right data for the right model until we have that data collected and we can see the accuracy?

So if you can take smaller steps using these sort of rule-based scaffolding and bootstrapping approaches, I think you have a much more powerful and practical set of tools. And then finally, once you have a system where you know you want to eke out every percent, maybe you end up collecting enough data that you don't need a parser in your solution explicitly.

So Dilip has pointed to a paper that recently showed that, you know, BiLSTM models don't necessarily learn long-range dependencies. I think that that's probably true. But as somebody who's worked on parsing for a lot of my career, I try to remind myself not to cherry-pick results.

And even if I do find a paper that shows that parsing works on something, well, the overall trend is that BiLSTM models which don't use parsing work well. And the fact is that long-range dependencies are kind of rare. So that's basically why it's important to be asking, well, what are these things good for, and not say, oh, everything should be using parsing.

Because it's true that not everything should. So the question is, if we look at other aspects of language variation instead of just, say, the segmentation and things, how does the incremental model perform? So specifically, how does it perform in free word order languages, perhaps ones with longer range or crossing dependencies?

So Stanford, actually, their paper had excellent analysis of a lot of these questions. And they showed that their model, which is much less sensitive to whether the trees are projective, does do relatively well on those languages. So for our preliminary results, we do fine on German and pretty well on Russian.

We still suck at Finnish. And I think there's a bug in Korean. It's at like 50%. So it's a mixed bag. But I would say that there's some problems to solve about the projectivity. The way that I'm doing this is a little bit crude at the moment. So in general, there is a disadvantage that we take from the incremental approach in this.

And there's a lot of clever solutions that I'm looking into for this. So yeah. So there's a pretty good extension package for coreference resolution that has taken some of the pressure off us to support it internally. We do think that coreference resolution is something that does belong in the library, because it's something that does have that property of being a language internal thing.

I think that there's a truth about whether that he or she belongs to that noun that doesn't depend on the application. It's just a true fact about that sentence. So we're very interested in being able to give you that piece of annotation. I wouldn't quite say the same thing about the sentiment.

I don't quite know-- I haven't been convinced by any schema of sentiment that is sufficiently independent of what you're trying to do that we could provide it. Instead, what we do provide you is a text categorization library. And the text categorization model that we have is only one of many that you might build.

And it's not best for every application. But it does do pretty well for short text. And I think that on many sentiment benchmarks, it performs quite well. It's a lot slower than some other ways that you could do sentiment. So it depends on what type of text you're trying to process and that sort of thing.

Well, oh, yes. So explicitly, the coreference resolution package that you should use is called NeuralCoref. So, NeuralCoref. Yeah. Yeah, and it's built on PyTorch. It's overall pretty good. You can train it yourself. Yeah, well, PyTorch is the machine learning layer, but yes, it's built on spaCy. So yes. So for German, I think it's pretty easy.

I've been using the word vectors trained by fastText. And you can basically just plug those in; there's one command to convert them into a spaCy vocab object and load it up. We're trying to provide pre-trained models which don't depend on pre-trained word vectors, so that you can bring your own.

Because otherwise, there's this conflict of the model's been trained to expect some word vectors. And then if you sub your own in, it's going to get different input representations. But yeah, training or bringing your own vectors is designed to be pretty easy. And if it's not, I apologize if there's bugs, and we'll try to fix them.

So. So the question is, after parsing and interpreting, do we have an interlingua representation that can then be used to generate another language? The answer is probably not. I mean, we don't have generation capabilities in spaCy. People have worked on this sort of thing. But in general, having an explicit interlingua tends to perform less well than more brute-force statistical approaches.

And I think the reason does sort of make sense that the languages are pretty different in the way that they phrase things and the way that they model the world in lots of ways. And so getting a translation that's remotely idiomatic out of that sort of interlingual representation is pretty tough.

And then there's another argument that you're solving a subproblem that's harder than the direct translation approach, which I'm not sure whether I buy that argument or not. But it's a common one that people use. OK, so should we move forward to the next talk? Thanks. So yeah, we started out by hearing a lot about the more theoretical side of things.

And I'm actually going to talk about how we collect and build training data for all these great models we can now build. And the nice thing about machine learning is that, well, we can now train a system by just showing examples of what we want. And that's great. But the problem is, of course, we need those examples.

And even if you're like, oh, I've got this all figured out, I'm using this amazing unsupervised method where my system just infers the categories from the data and I never need to label any data: that's pretty nice, but you still need some way of evaluating your system. So we pretty much always need some form of annotations.

And now the question is, well, why do we even care about this? Why do we care about whether this is efficient, whether this works or not? The thing is, the big problem is that we actually, with many things in data science and machine learning, we need to try out things before we know whether they work.

Or we often don't know whether an idea is going to work before we try it. So we need to expect to do annotation lots of times and start off from scratch. Start all over again if we fucked up our label scheme. Try something else. We need to do this lots of times, so it needs to work.

And similarly, especially if you're working in a company in a team where you really want to use your model to find something out, ideally the person building the model should be involved in that process. And also, we always say good annotation teams are small. A lot of people don't understand this.

There's a lot of movement towards, oh, let's crowdsource this, get hundreds of volunteers, and we always have to remind, especially companies, that well, look at the big corpora that we use to train models. The good ones were produced by very few people, and there's a reason for that. More people doesn't always mean better results, actually quite the opposite.

So how great would it be if actually the developer of the model could be involved in labeling the data? And of course, we also have the problem of the specialist knowledge, especially in industries where this matters. You might want to have a medical professional give some feedback on the labels, or actually really label your data, or maybe a finance expert.

And yeah, those people usually have limited time. If you get an hour off their time, you want to use it more efficiently, and you don't want to bore them to death, or actually find the one person who has nothing else to do, because their knowledge is probably not as valuable as other experts' knowledge.

And yeah, another big problem, since you want humans, is that humans are actually-- humans kind of suck. We're not that efficient at a lot of things. So for example, we really have problems performing boring, unstructured tasks, especially things that require multiple steps and multiple things we need to get right.

We can't remember stuff. We're bad at consistency and getting stuff right. So fortunately, computers are really good at that stuff. And in fact, it's probably also the main reason we built computers. So there's really no need to waste the human's time by making them do stuff that they're going to do badly anyways.

And instead, we want our annotation tooling to be as automated as possible. Or in general, we want to automate as much as possible, and really have the human focus on the stuff that the human is good at, and we really need that input. And that's usually context, ambiguity, stuff like we can look at a sentence, and most of us will be able to understand a figure of speech immediately without thinking twice about it.

That's the stuff that's really, really hard for a computer. Also, put differently, humans are good at precision. Computers are good at recall. So the thing is, yeah, what I'm saying here, it sounds a bit like "floss and eat your veggies". Yeah, we've probably all had some experience with labeling data.

And normally, yeah, we also give this talk to a crowd of more data-science-focused industry professionals. And actually, you'd be surprised how many companies we talk to, also very large companies, very technologically sophisticated companies, that mostly use Excel spreadsheets for everything. And it's not inherently bad, but there are very obvious problems with Excel spreadsheets.

And there's definitely a lot of room for improvement. So once people figure this out and realize that maybe they could do something better, or it's just terrible, like we don't want to do this, the next move is normally, let's move this all out to Mechanical Turk or some other crowd-sourced platform.

And yeah, Mechanical Turk, the Amazon cloud of human labor. And so, yeah, people do that. And then they're surprised that the results are not very good. And the problem is, yeah, OK: so you have some guy do it for $5 an hour, get the data back, train your model, and it doesn't work.

And actually, it's very difficult to then retroactively find out what the problem was. Maybe your label scheme was bad. Maybe your idea was bad. Maybe the data was bad. Maybe you didn't write your annotation manual properly. Maybe-- actually, yeah, another nice thing. Maybe you paid too much, because if you pay too much on Mechanical Turk, you attract all the bad actors.

So you kind of have to stick to the half minimum wage. So that could have been a problem. Maybe your model was bad. Your training code was bad. It's very, very difficult to find that out. And also, you realize that, well, it's not really just cheap click work that you need; there's more to it than that.

You need to do it more. So then, yeah, what most people conclude from this is, fuck this labeling in general. I don't want to do this anymore. Let's just find some unsupervised method and not bother with this. And that's actually-- yeah, also, the conversation I had recently where we talked to a larger media company, and they'd done exactly that.

And now they have a few hundred clusters. And it's really great. They have really great clusters. But now, their problem is that they have no idea what these clusters are. So they now need to label their clusters. And now, they're kind of back in the beginning. And I think what we see from this is that the label data itself, the fact that we need label data, that's an opportunity.

That is not the problem. The problem is how we do it. And yeah, so we've been thinking about this a lot. At least, from our point of view, there are a lot of things we could do better. So one of the things, really, to work against this problem that we have caused by us being human is that we need to break down these very complex things we're asking the humans into smaller, simpler questions.

And ideally, these should be binary decisions. So we can have a much better annotation speed because we can move through the things faster. And we can also measure the reliability much easier than if we ask people open questions. Because we can actually say, OK, do our annotators agree? Do they not agree?

Because that's, in the end, very important to find out whether we've collected data the right way. And the binary thing itself, it sounds a bit radical. But actually, if you think about it, most, or pretty much any task, can be broken down into a sequence of binary decisions, like, yes or no, decisions.

It might mean that we have to accept that, OK, if we're annotating a sentence for entities, we won't actually end up with gold-standard data for this sentence. We might actually end up with only partially annotated data, and we have to deal with that. But still, we're actually able to use our human's time more efficiently, which is often much more important.

So a lot of examples I'm going to show you now from using our annotation tool Prodigy, which, yeah, we started building as an internal tool. But we very, very quickly realized that, OK, this is really something pretty much every company we talk to, most users we talk to, this was always something that kept coming up.

So we thought, OK, what if we really combine all these ideas we already have, and how to train a model, actually use the technology we're working with within the tool, and also use the insights we have from user experience, and how to get humans to do stuff most efficiently, how to get humans excited, actually, even the whole idea of gamification, how to get humans to really stick to doing something, and put this all into one tool, and that's Prodigy.

And so here, we see some examples of those tasks, and how we can present things in a more binary way. So in the top left, we have an entity task. So here, this comes from Reddit, and we're labeling whether something is a product or not. And what we did here is we load in a spaCy model, ask the model to label the products, and then we look at them and say yes or no.

Or we can also use a mode where we can then actually click on this, remove this, label something else. But still, you see, OK, we don't have to do this in an Excel spreadsheet. We actually get one question, we look at this, and pretty much immediately, we can say yes or no.

The same here, on the right; I think this is actually a real example using the YOLOv2 model with the default categories. And we have an image of a skateboard. We could say, is this a skateboard, yes or no? And yeah, we immediately have our annotations here. And even this one in the corner: even if we're not able to really break it down into a true binary task, we can still make it more efficient and easier for a human to answer.

Because here, with keyboard shortcuts, you can still do maybe two, three seconds per annotation and you have an answer. Or we say, hey, it's actually so fast, if we can get to one second, we might as well label our entire corpus twice, positive, negative, other labels we want to do, and just move through it quicker.
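For reference, the underlying annotation task is just a simple JSON record, roughly like the following; the field names reflect Prodigy's documented format, but treat the details as approximate:

```python
# One binary NER task, roughly as Prodigy would present and store it.
task = {
    "text": "I just bought the new Surface Pro.",
    "spans": [{"start": 22, "end": 33, "label": "PRODUCT"}],
}

# After the annotator presses accept, reject, or ignore, the stored example
# carries that decision alongside the original task.
annotated = dict(task, answer="accept")
```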

And yeah, to give you some background on why did we do this, what do we think Prodigy should achieve, we really think that, OK, we want to be able to make annotation so efficient that data scientists can do it themselves. Or here, what we call data scientists can also be researchers and people working with the data, people training the model.

Yeah, reading it like that, it still doesn't sound like fun. But the idea is, we could really make a process that's efficient that you actually really want to do this because you don't have to depend on anyone else. You can just get the job done and see whether your idea works or not.

And the same-- yeah, and this also means you can iterate faster. We're very used to, OK, you iterate on your code, but you can actually iterate on your code and your data. You try something out, doesn't work, try something else. Maybe see, OK, is it going to work if I collect more annotations?

You can all try this out. And we also want to waste as little time as possible and use what the model already knows and have the human correct its predictions instead of just having a human do everything from scratch. And as a library itself, we really want Prodigy to fit into the Python ecosystem.

We want it to be customizable, extensible in Python. You can write scripts for it. And we also-- it was a very conscious decision not to make it a SaaS tool, because we think data privacy is important. You shouldn't have to send your text to our servers for no reason.

And we also think you shouldn't be locked in. Like, you should get a JSON format out that you can use to train your models however you like, and not some random format of ours that you can then only download from our servers. So that's where we're going with Prodigy. And here's a very simple illustration of how the app looks.

At the center are recipes, which are very simple Python scripts that orchestrate the whole thing. You have a REST API that communicates with the web app, naturally, so you can see things on the screen. You have your data that's coming in, which is text or images. And you can have an optional model state that's updated in a loop, if you want that.

And the model then communicates with the recipe. As the user annotates, the model is updated in a loop and can suggest more annotations that are more compatible with the annotator's recent decisions. And yeah, there's a database and a command line interface, so you can actually use it efficiently and don't have to worry about these aspects.

So here, can you see? Yeah, in the corner we have a simple example of a recipe function, which really is just a Python function. You load your data in and then you return this dictionary of components, for example, an ID of the data set, how to store your data, a stream of examples.
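As a rough sketch of what such a recipe function looks like, based on the structure just described (the decorator and loader follow Prodigy's documented API, but the recipe name and details here are made up and approximate):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("product-ner")          # recipe name is arbitrary
def product_ner(dataset, source):
    # Stream of examples to annotate, loaded from a JSONL file on disk.
    stream = JSONL(source)
    return {
        "dataset": dataset,   # where the annotations get stored
        "stream": stream,     # the examples to show in the web app
        "view_id": "text",    # which annotation interface to render
    }
```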

You can pass in callbacks to update your model, things to execute before the thing starts. So the idea is really, OK, if you need to load something in, if you can write that in Python, you can do it in Prodigy. And you can also-- we provide a bunch of pre-built-in recipes for different tasks with some ideas of how we think it could work, like named entity recognition.

For example, you can use the model, correct its predictions. You can use the model, say yes or no, to things. You can use it for dependency parsing and look at an arc and annotate that. We have recipes that use word vectors to build terminology lists, text classification. So there's also a lot that you can mix and match creatively.

For example, you have the multiple choice example that's not really tied to any machine learning task, but it fits pretty much into any of these workflows that you might be doing. And of course, the evaluation is also something we think is very, very important and is often neglected, especially in more industry use cases.

But we think A/B evaluation is actually a very powerful way of testing whether your output is really what you want it to be. And so here we see an example of how you can chain different workflows together, all using models, word vectors, things you already have, in order to get where you want to get to faster.

So here, a simple example: we want to label fruit. It's kind of a stupid example, because I can't think of many use cases where you'd actually want to do that, but it makes a great illustration here. So yeah, we start off, we say, OK, we want fruit.

What are fruit? We have some examples, apple, pear, banana. That's what we can think of. And we also have word vectors that we can use that will easily give us more terms that are similar to these three fruit terms that we came up with. And then we can use this terminology list that we collected by just saying yes or no to what we've gotten out of the word vectors, look at those in our data, and then say whether apples in this context is a fruit or not.

Because we're not just labeling all fruit terms as a fruit entity, because it could be apple, the company. But we get to look at it, and it's much more efficient than if you ask the human to sit through and highlight every instance of fruit nouns in your text. And so this also leads to one of our main aspects of the tool, workflows that we're especially proud of and that we think really can make a difference, which is we can actually start by telling the computer more abstract rules of what we're looking for and then annotating the exceptions instead of really starting from scratch.

Or we can even use the technology we're working with to build these semi-automatically using word vectors, using other cool things that we can now do. And then, of course, also specifically look at those examples that the statistical model we want to train is most uncertain about. So we try to avoid the predictions where we can be pretty sure that they're correct and actually really ask the human first about the stuff that's 50/50 and where really the human feedback makes most of the difference.

And so here's a quick example. Let's say, OK, we want to label locations. We start off with one city, San Francisco. And then we look at what else is similar to that term. So these are actually real suggestions from that Sense2Vec model that Matt showed earlier. And as you can see, the nice thing is we're using word vectors.

We're not using a dictionary. So we are going to get suggestions like California and maybe University of San Francisco, but we're not going to get California rolls, because we're in a vector space, and we know that what we're actually looking for is at least similar to the real meaning of the word.

And a lot of these are super trivial to answer. So we can accept them, we can reject them, or we can ignore them, because this one is a bit too ambiguous and we don't actually want it in our list, because it can mean too many things. And then from here, we can actually create a pattern that uses spaCy's token attributes, in this case the lowercase form of the token, and the label GPE, which stands for geopolitical entity: countries, cities, anything with a government.

And that's what we're trying to label. So we can easily build up these rules very quickly, in a very automated way, and then we have a bunch of locations that we can then match in our text. So here, it found a mention of Virginia, which we can then accept. So that's a very, very simple example of this.
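The pattern itself is just a token-attribute description, roughly like this (written as Python dicts in the spirit of Prodigy's pattern files; the exact attribute-key casing can differ between Prodigy and spaCy's Matcher):

```python
# Match patterns built from the terminology list: a GPE label,
# matched on the lowercase form of one or more tokens.
patterns = [
    {"label": "GPE", "pattern": [{"lower": "virginia"}]},
    {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
]
```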

But of course, this also works for slightly more complex constructs where we can really take advantage of the syntactic structure. So here, this was a finance example. So what we're trying to do is we want to extract information about executive compensation. So yeah, some executive receives some amount of money in stock, for example, like this one.

And this is a pretty difficult task. But also, the idea here is we have this theory that maybe if we could train a model, a text classification model, to predict whether a sentence is about executive compensation or not, we can then very, very easily use what we already know about the text to extract, let's say, the first person entity.

We extract the amount of money, put that in our database. And we've actually, yeah, we found a good solution for an otherwise very, very complex task. So for this, this is just an idea, we haven't tried this in detail, but one possible pattern using the token attributes we have available would be: let's look for an entity of type PERSON, followed by a token with the lemma receive.

So that covers received, receives, receiving, followed by a token with the entity type MONEY. And let's just look at what this pulls up. That's an idea. I mean, there are plenty of other possible patterns you can come up with. And the nice thing is we're actually going to be looking at them again in context.

So they don't have to be perfect. And even actually, in fact, even if it pulls up random stuff that you realize is totally not what you want, this is also very important. Because you won't only be collecting annotations for the things you know are definitely right. You're also collecting annotations for the things that are very, very similar or look very, very similar to what you're looking for but are actually not what you're looking for.
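Expressed with spaCy's rule-based Matcher (using the v3-style Matcher.add signature), that candidate pattern would look roughly like this; whether it matches anything depends on the NER predictions, and in Prodigy the same pattern would go into a patterns file instead:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")   # assumes a pipeline with NER
matcher = Matcher(nlp.vocab)

# A PERSON entity token, a token whose lemma is "receive", then a MONEY entity token.
pattern = [{"ENT_TYPE": "PERSON"}, {"LEMMA": "receive"}, {"ENT_TYPE": "MONEY"}]
matcher.add("COMPENSATION", [pattern])

doc = nlp("Smith received $5 million in stock last year.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```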

And that's probably just as important as the positive examples. So yeah, the moral of the story is what we're saying is we're very used to iterating on our code as programmers. But you should really be doing both. The data is just as important. So as we see here, OK, that's the normal type of programming.

You have a runtime program. You work on the source code. You compile it, get your runtime program. You don't like something about your program. You go back, change the source code, compile it, and so on. That's a pretty standard workflow. And in machine learning, we don't have a runtime program in that sense.

We have a runtime model. So the part we should really be thinking about and working on is the training data. Instead, most focus is currently on the training algorithm. And if you use that analogy, that's very similar to going and tweaking your compiler if you're not happy with your runtime program.

You can do that, but of course, you'd normally go back and edit your source code instead. I think this is actually a pretty good analogy. It's pretty accurate. There are only so many training algorithms, but what really makes a difference is your data. So if you have a good, fast way of iterating on that data, and you're able to really master this part of the problem, you'll also get to try more things quickly.

As we know, most ideas don't actually work. It's always one of these things that's kind of misrepresented. A lot of people have this idea, ooh, you're doing all these amazing AI things, and everything just works. It kind of doesn't. Most things don't work, some things do, and you really want to find the things that actually work.

And for that, you need to try them. And it also means that if you can figure out what works before you invest heavily in it, you can be more successful overall, because you're not going to waste your time scaling up things that were never going to work in the first place.

And one thing that's also very important to us is that you can really build custom solutions. You can build solutions that fit exactly to your use case, and you'll own them. If you collect your own data, you keep it forever, and nobody can lock you in. You're not just consuming some API where, if that API shuts down, you have to start again from scratch.

You have your data, no matter what other cool things we can do at some point in the future, you can always go back to your labeled data and really build your own systems. And we believe that this is really something that's very important in the future of the technology.

That's also a reason why we think AI development in general should be done in-house at companies. And, yeah, we're hoping that we can keep providing useful tools that will make this easier. Yeah. So the question is: Jeremy thinks we write very good software, even though we're only two people, so how are we doing that?

Yeah, that's a very good question. I mean, we do get this a lot. I don't even know where this idea comes from that scaling things up makes things better. Because I do think that the more people you get involved, it can actually have a very negative impact on the quality of the software you produce.

In our case, it's just, OK, it just works. I also don't like this idea of, oh, everyone can do exactly the same thing if they just work hard, even though people like thinking of it that way. It's just, in our case, we have a good combination of things that we like to do, things that we happen to be good at, and it just works together.

So I guess we are lucky in that way, but we also cut out a lot of bullshit, like the amount of meetings we don't take, the amount of events we don't go to. I mean, yeah, it's kind of ironic saying that, speaking at an event, but I really don't normally go to many events.

We don't take coffee dates with random people we barely know. We mostly just really like to write software. And yeah, we've had some good ideas in the past. Thanks for making this cool, I wish I had it two years ago. Have you done any experiments to see if there are actually biases when you show annotators the model's examples, versus having them annotate from scratch? I mean, so the question is whether we've done any experiments comparing the binary decisions, and whether they influence the annotators, versus really doing everything from scratch.

So we haven't done experiments specifically focusing on the bias, because in some sense that's difficult: we're looking at the output, we're looking at whether it improves accuracy. We've done experiments of manual annotation versus binary annotation, but mostly focused on our own tooling, because we think comparisons against other tools are kind of useless.

Like, yeah, we can present you a study where we said, oh, we did stuff in an Excel spreadsheet and then we did stuff in Prodigy and it was much better. So it's really mostly focused around our own tooling and we did find that-- well, it depends on the task you're doing.

That's the other thing. I feel like giving these answers sounds unsatisfying because I'm always saying, well, it depends on your data. But that's also the whole point of it, because we're doing this because your data is different and there's no one-size-fits-all solution. But essentially, we found that binary annotation works especially well if you already have a pre-trained model that predicts something, ideally something that's not completely terrible.

Otherwise, the pattern approach works very well on very specific domains. Like, we did one example where we labeled drug names on Reddit, on r/opiates, which was a pretty good data source because it's a very specific topic. And also, it's a subreddit that's very on topic, because people who go on Reddit to discuss opiate use are usually very dedicated to talking about this one topic.

So it was a good, interesting data source. And what we wanted to do is label drug names, drugs, and pharmaceuticals in order to, for example, have a better toolset to really analyze the content of this subreddit and see how it develops over time. Anyway, there we found the pattern-based approach worked very, very well because we have very specific terms.

We can use word vectors to bootstrap these. Especially, we can also include spelling mistakes and so on, which was very interesting. Like, we can really build up good word lists, find them in the text, confirm them, and get to pretty decent accuracy. I would expect this to work a little less well, because of the cold start problem, on a much more ambiguous domain.

And there, you're probably better off saying, OK, we're labeling by hand. But even there, and that's something I haven't really shown in detail here, we also have a manual interface where you highlight. But what we do there is we use the tokenizer to pre-segment the text. So you don't have to sit there and highlight pixel-perfect, and then, ah, shit, now I've got the whitespace in.

Let's start again. So that's another thing we're doing. You can be much lazier in highlighting, still get more efficiency out of it, and use a simpler interface. Yeah? So the question is, first, you gave an example of annotating patient data, which is obviously very problematic because doctors are not always very specific in what they fill in.

And then, in the end, the question was how they enriched that data. Yeah, so basically, OK, the question is whether we have some experience with this kind of thing in the medical field. The answer is, well, we haven't personally done this.

But we do have quite a few companies in that domain, also because the tool itself is quite appealing there: you can run it in your own compliant environment, which covers the data privacy aspect. But it's interesting to explore. That's maybe also where getting the medical professionals more involved might make sense, which normally is very difficult.

You don't want a doctor to do all the work themselves. But if you can find some way to distill that and then ask the doctor, OK, you wrote this here, does that mean-- you wrote x, does that mean y? And the doctor says, yep. Or the doctor says, nah.

If you can try this out and extract some information, that could be one idea to solve that, for example. Yeah, I can definitely see that. You can. Right now, we don't have built-in logic for that, although we are working on it-- oh, sorry, I forgot to repeat the question: inter-annotator agreement, whether you can calculate that and incorporate it into your model.

So we're actually working on an extension for Prodigy, which is much more specifically for managing multiple annotators. Because the tool here, we really designed it as a developer tool first and as something to scale up second. But since you have the binary feedback, if you have an algorithm you want to use and you know what you want, you can already do that fairly easily, because you can download all the data as JSON.

You have a key that's answer, which is either accept, reject, or ignore. You can attach your own arbitrary data, like a user ID. And then it's fairly trivial to write your own function that takes all of this, reads it in, computes something, and then uses it later on.
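For example, here is a minimal sketch of that kind of post-processing, assuming the annotations were exported as JSONL and a hypothetical user_id field was attached to each task. The file name and every field other than answer are illustrative, not a documented export format.

```python
import json
from collections import defaultdict

# Group each annotator's accept/reject decision by the text they saw.
decisions = defaultdict(dict)
with open("annotations.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg["answer"] == "ignore":
            continue  # ignored tasks don't count either way
        decisions[eg["text"]][eg["user_id"]] = eg["answer"]

# Crude agreement measure: share of multiply-annotated texts with full agreement.
multi = [v for v in decisions.values() if len(v) > 1]
agreed = sum(1 for v in multi if len(set(v.values())) == 1)
print(f"{agreed}/{len(multi)} multiply-annotated texts fully agreed")
```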

So that's definitely possible. But this is also something we're really interested in exploring and working on. And yeah, that's a big advantage of the binary interface: there are only two options.

You filter out the ignored ones, and then you can really answer that question. Yeah, well-- so the question was about one interface I showed, the sentiment one with the multiple selections. That's not binary, that's true. And actually, it's also something we usually tell our users to avoid as much as possible, if you can.

And in some cases, you might still want that. Or we say, look, a lot of people still think of surveys when they think of annotating data. And I get where this is coming from, but I think if you can leave that sort of mindset and really open up a bit and think of other creative ways, you could get more out of this.

If you want to re-engineer a survey, maybe you want to use a survey tool. So for example, if I were doing this with those four options, I would say, OK, we take all the texts, the annotator sees every text four times, once per label, and just answers: is this happy, or is this not happy?

And because you can get to about one second per annotation, that's very fast. Even if you have thousands of examples, you can do this in a day yourself. And so that's how we would probably solve this. And it also means you see every example four times, so for each text you know: is it sad?

Is it happy? Is it neutral? Is it something else? You have much more data. But not everyone wants this. Some people really want to build that survey. And we let them. But yeah. Yeah. So the question is, if you're doing the same example multiple times, whether it slows down the annotation or not.

Well, actually, it's difficult to say because it depends. But I've found that even if you do the bare maths, it can easily be much faster. Because say you have 1,000 examples: if you really have to think about five different concepts that are maybe not even fully related, then every tiny bit of friction you put between the human and the interface, or the decision, can very significantly slow down the process.

So you think about, oh, is this happy? Or is this sad? Or is this about sports? Or is this about horses? And just that can easily add like 10 seconds to each question. So if you do the whole thing three times at one second each, you're still faster than you would have been if you'd added that friction.
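As a rough back-of-the-envelope version of that argument, using only the illustrative numbers mentioned above (1,000 texts, four labels, about one second per binary decision, roughly ten seconds per multi-concept question):

```python
n_texts = 1000
labels = 4                 # one quick binary pass per label
secs_per_binary = 1        # low-friction accept/reject decision
secs_per_multi = 10        # the "10 seconds of friction" per multi-concept question

binary_total = n_texts * labels * secs_per_binary  # 4,000 s, roughly 67 minutes
survey_total = n_texts * secs_per_multi            # 10,000 s, roughly 167 minutes
print(binary_total, survey_total)
```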

And the other part is just human error. If you have to think too much, you're much more likely to fuck it up and do it badly. And that's also something you want to avoid. But the active learning helps a lot here as well: if the model is already pretty confident that certain labels don't apply,

then you just don't have to annotate those. Yeah, to repeat this, the active learning also makes a difference here. Because you can pre-select the ones that really make a difference to annotate, and you don't have to go through every single example that matters less than the ones you really care about.

Yeah, do you have any experience working with tasks like that? So the question is, what about tasks that need a lot of context, like a whole medical history or a whole document, and whether we have experience with that.

So in general, we do say: if your task requires so much context that you can't fit it into the Prodigy interface, that doesn't mean you can't train a model on it. But for most of the tasks that users commonly want to do, like named entity recognition or even text classification, if you need a lot of context and all of the context is equally important, that's often an indicator that the task might not work so well.

So for example, for text classification, we say, OK, we start off by selecting one sentence from the whole document. And then instead of you annotating the whole document, you say: OK, this is the most important sentence, does this label apply or not? So there are some tricks we use to get around this problem. And we also think it's important to get this across and frame it that way, because if you need two pages on your screen, it's not efficient at all.

And also, you can do all that work, but your model likely won't learn it, because your model needs local context as well, at least for the tasks we're dealing with here. I don't know if you have anything to add to that. Yeah, OK. Often, it's important to take into account the models that are actually available.

Yeah, so the suggestion was, OK, having some tools, some process that goes along with the software that helps people break this down. Yeah, we've actually been thinking about this a lot because we do realize the tool is quite new, and we're introducing a lot of new concepts at once, and also some best practices where we think, ah, that's how you should do it, or you could try this.

And we are also realizing that there's no real satisfying one-size-fits-all answer. That's another problem: everyone's use case is different. So right now what we're doing is we have a support forum for Prodigy where we answer people's questions. And actually, a lot of users share what they're working on, asking for tips.

We kind of talk about it. Other users come in and are like, oh, I actually tried to do this type of legal annotation, and here's what worked for me, and we have this sort of exchange around it to figure out what works. Because, yeah, just like in machine learning and deep learning generally, a lot of the best practices are still evolving, and it's very, very specific.

So it's definitely-- yeah, we're open to suggestions there as well, but we're still in the process of really coming up with a good set of best practices and ideas. The question is whether we have any plans to sell models, like medical models. Yes, as part of what Matt mentioned in the very introduction, we are definitely planning on having more of an online store for very, very specific models.

So medical, that's a very, very interesting domain. And if so, we really want to have it be specific, like medical texts in French or Chinese, and really go in that direction. Because we believe that pre-trained models are very valuable, and even if you do medical texts, you can start off with a pre-trained model, then use a tool like Prodigy or something else to fine-tune it on your very, very specific context, have word vectors in it that already fit your domain, and maybe update those as well.

We think that this is a very future-proof way of working with these technologies. Yeah? So the question is about the text classification model we're using in Prodigy, more details on that. What we're using is spaCy's text classification model; that's what's built in. But I think this question is pretty good, because what's important to note is that Prodigy itself comes with a few built-in recipes that are basically ideas for how you could train a text classifier.

You could use spaCy, but it's definitely not tied to that. The tool itself is really the scaffolding around it. So if you say, hey, I wrote my own model using PyTorch and I would like to train this, all you need to do is have one function that takes examples and updates your model.

And you need one function that takes raw texts and outputs a score for each text. Then you provide those to Prodigy, and you can use the same active learning mechanism as you would with a built-in model. So the idea is really that the models we ship are just a suggestion, an idea you can use to try it out.
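To make the shape of that interface concrete, here is a minimal sketch of the two callbacks described above, built around a placeholder model. This is not Prodigy's exact recipe API; the names and the dummy scoring logic are purely illustrative.

```python
from typing import Iterable, Iterator, Tuple

class MyModel:
    """Stand-in for your own classifier, e.g. something written in PyTorch."""

    def score(self, text: str) -> float:
        # Dummy score; a real model would run inference here.
        return min(1.0, len(text) / 100.0)

    def train_on(self, text: str, accepted: bool) -> None:
        # A real model would compute a loss and update its weights here.
        pass

model = MyModel()

def update(examples: Iterable[dict]) -> None:
    """Take annotated examples (with an 'answer' key) and update the model."""
    for eg in examples:
        if eg["answer"] in ("accept", "reject"):
            model.train_on(eg["text"], eg["answer"] == "accept")

def predict(texts: Iterable[str]) -> Iterator[Tuple[float, dict]]:
    """Take raw texts and yield (score, example) pairs for the stream."""
    for text in texts:
        yield model.score(text), {"text": text}
```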

But ultimately, we also hope that people in the future will transition to just plugging in their own model and just using the scaffolding around it to do that. But we definitely don't want to lock anyone in and say, oh, you have to use spaCy, especially for NER and stuff and other things.

We think spaCy is pretty good. But if you don't want to do that, for other use cases, especially text classification, in a lot of cases you might want to use scikit-learn or Vowpal Wabbit. Yeah, what a great name. Or basically something completely custom. Yeah. So the question is about the active learning part, whether this is built on the underlying model-- oh, yeah.

So the question is, active learning versus no active learning, how well this works. First, maybe as a general introduction: what we're doing for most of these examples is basic uncertainty sampling. That's what we found works best. But we also know there are lots of other ways you could be solving that.

So in the end, how we implement this is we have a simple function that takes a stream and outputs a sorted stream based on the assigned scores from the model in the loop. How you wire this up, again, is also up to you. And to answer the part about what works best: in general, in our kind of framework, you really see one sentence at a time.

And often, you start off with a model that doesn't know very much. So the active learning component, basically re-sorting the stream, is actually very crucial. Because otherwise, if you start from scratch with very few examples and just annotate your stream in order, you'll be annotating all kinds of random predictions for a very, very long time.
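A minimal sketch of what such a re-sorting step could look like, assuming a score function that returns a probability between 0 and 1. This is just the simplest possible version of uncertainty sampling, not Prodigy's actual sorter, and it materialises the whole stream instead of sorting within a moving window.

```python
from typing import Callable, Iterable, Iterator

def sort_by_uncertainty(
    stream: Iterable[dict],
    score: Callable[[str], float],
) -> Iterator[dict]:
    """Yield the examples the model is least sure about first (score near 0.5)."""
    scored = [(abs(score(eg["text"]) - 0.5), eg) for eg in stream]
    for _, eg in sorted(scored, key=lambda pair: pair[0]):
        yield eg
```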

You need some kind of guidance that tells you, OK, what to work on next, especially if you feed in millions of texts. You need to sort them, you need to pre-select them based on something. And this could be the model's predictions, this could be something else.

This could be the keywords or the patterns. But without that, yeah, it's very, very difficult. And that's what we're trying to solve with the tool. Thank you so much, Innes and Matthew. I've got to say, anybody who's using fastai, any time you've used fastai NLP or fastai.text, you've called the spaCy tokenizer.

You're using spaCy behind the scenes. And the reason you're using spaCy is because I tried every damn tokenizer I could find, and spaCy's was so much better than everything else. And then the kind of story of fastai's development is that over time, I get sick of all the shitty parts of every third-party library I find.

And I gradually rewrite them myself. And the fact that I haven't rewritten spaCy or attempted to is because I actually think it's one of those rare pieces of software that doesn't suck at all. It's actually really good. And it's got good documentation, and it's got a good install story and so forth.

And I haven't used Prodigy, but just the fact that these guys are working on it-- I recognize the importance of active learning and the importance of combining human plus machine. They're in that rare category of people who, in my opinion, are actually working on one of the most important problems today.

So thank you both so much for coming and for this fantastic talk. And I look forward to seeing what you do next. Thank you. (audience applauds)