Stanford XCS224U: NLU | Information Retrieval, Part 4: Neural IR | Spring 2023
Chapters
0:00 Intro
0:21 Cross-encoders
4:53 Shared loss function
8:38 Soft alignment with ColBERT
10:01 ColBERT as a reranker
12:18 Beyond reranking for ColBERT
13:21 Centroid-based ranking
14:29 ColBERT latency analysis
16:04 Additional ColBERT optimizations
16:59 SPLADE
19:51 Additional recent developments
20:29 Multidimensional benchmarking
This is part 4 in our series on information retrieval. We come to the heart of it: neural information retrieval. This is the class of models that has done so much to bring NLP and IR back together again and open new doors for both of those fields. I think you should imagine that the name of the game is to take a pre-trained BERT model and fine-tune it for information retrieval.

Let's begin with cross-encoders, which are conceptually the simplest approach that you could take.
For cross-encoders, we're going to concatenate the query text and the document text together into one single text, and then use representations in that BERT model as the basis for IR fine-tuning. In a bit more detail, we'd process the query and the document together, then probably take the final output state above the class token, and fine-tune the model against our information retrieval objective. That will be incredibly semantically expressive, because we have all of these interesting interactions between query and document in this mode.
In a bit more detail, in the background here, I'm imagining that we have a dataset of triples, where we have a query, one positive document for that query, and some number, one or more, of negative documents for that query. The basis for scoring is as I described it before: concatenate the query and the document and process that text, then retrieve the final output state above the class token that's given here, and feed that through a dense layer that is used for scoring. The loss function for the model is typically the negative log likelihood of the positive passage: the numerator is the score of the positive passage according to our scoring function, and the denominator is that positive passage score again, summed together with the scores for all the negative passages.
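To make that concrete, here is a rough sketch of the cross-encoder scorer and that loss, written with a Hugging Face-style BERT encoder. The model name, the tiny example triple, and the batch construction are just illustrative placeholders, not code from the course.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class CrossEncoderScorer(nn.Module):
    """Concatenate query and document, encode with BERT, and score from
    the final output state above the [CLS] token via a dense layer."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, enc):
        cls_state = self.encoder(**enc).last_hidden_state[:, 0]  # [CLS] output state
        return self.score(cls_state).squeeze(-1)                 # one score per pair

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = CrossEncoderScorer()

# One training example: a query, one positive passage, one negative passage.
query = "when did the transformers cartoon series come out?"
passages = ["the animated transformers was released in august 1986",  # positive
            "the transformer architecture was introduced in 2017"]    # negative
enc = tokenizer([query] * len(passages), passages,
                padding=True, truncation=True, return_tensors="pt")
scores = model(enc)

# Negative log likelihood of the positive passage (index 0).
loss = -torch.log_softmax(scores, dim=0)[0]
```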
Let's step back. This will be incredibly semantically rich, because we're using the BERT model to jointly encode the query and the documents, so we have all these rich token-level interactions. But this won't scale, because we need to encode every document at query time. In principle, this means that if we have a billion documents, we need to do a billion forward passes with the BERT model, one for every document with respect to our query, and that simply isn't feasible. So although there's something conceptually right about this approach, we need something more scalable, and that is what DPR, the dense passage retriever, offers.
In this mode, we're going to separately encode queries and documents. On the left here, I've got our query encoded with a BERT-like model, and I've grayed out everything except the final output state above the class token. On the right, we encode the document with its own encoder, and again we just need that final output state above the class token. Then we're going to do scoring as a dot product of those two vectors. As before, we have a dataset consisting of those triples: a query, one positive document, and one or more negative documents.
Now, our core comparison function is what I've called Sim here. The basis for Sim is that we encode our query using our query encoder and get the final output state, we encode our document with the document encoder and get its final output state, and Sim is the dot product of those two vectors. The loss is again the negative log likelihood of the positive passage: the numerator is the Sim value for the positive passage, and the denominator is that value summed with the Sim values for all of the negative passages. This approach is highly scalable, but it's very limited in terms of its query-document interactions.
The core of the scalability is that we can now encode all of our documents offline, ahead of time, and store one single vector associated with each one of those documents. At query time, we just encode the query, get that one representation above the class token, and do a fast dot product with all of our documents. The cost is that we have lost all of those token-level interactions: queries and documents meet only through those single vector representations, and we might worry that that results in a loss of expressivity for the model.
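Here is a minimal sketch of that bi-encoder setup, again with Hugging Face-style encoders. The shared base model name and the inline document encoding are assumptions made for brevity; in a real system the document vectors would be computed offline and indexed.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BiEncoder(nn.Module):
    """Separate query and document encoders; each text is reduced to its
    single [CLS] output vector, and Sim is a dot product."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.query_encoder = AutoModel.from_pretrained(model_name)
        self.doc_encoder = AutoModel.from_pretrained(model_name)

    @staticmethod
    def encode(encoder, enc):
        return encoder(**enc).last_hidden_state[:, 0]   # [CLS] vector

    def forward(self, query_enc, doc_enc):
        q = self.encode(self.query_encoder, query_enc)  # [1, dim]
        d = self.encode(self.doc_encoder, doc_enc)      # [n_docs, dim]
        return q @ d.T                                  # [1, n_docs] Sim scores

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BiEncoder()

docs = ["the animated transformers was released in august 1986",   # positive
        "the transformer architecture was introduced in 2017"]     # negative
doc_enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
query_enc = tokenizer(["when did the transformers cartoon series come out?"],
                      return_tensors="pt")

scores = model(query_enc, doc_enc).squeeze(0)          # one Sim score per document
loss = -torch.log_softmax(scores, dim=0)[0]            # positive passage at index 0
```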
Before moving on to some additional models in this space, let me dwell on the loss function for a moment. The loss function for both of the models that I presented is the negative log likelihood of the positive passage. Here's how I presented it for the cross-encoder, and here's how I presented it for DPR. You can now see that there's a general form of this, where the only thing that varies is the comparison function used to score query-document pairs. If you want to explore a new architecture, the way that might play out is that you've simply swapped in a new comparison function, and everything else about how you're setting up and training the model stays the same.
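In symbols, writing $\mathrm{Comp}$ for whichever comparison function the model defines (the dense-layer score for the cross-encoder, the dot-product Sim for DPR), a sketch of that shared loss for a query $q$ with positive passage $d^{+}$ and negatives $d^{-}_{1}, \dots, d^{-}_{n}$ is:

$$
\mathcal{L} \;=\; -\log \frac{\exp\bigl(\mathrm{Comp}(q, d^{+})\bigr)}{\exp\bigl(\mathrm{Comp}(q, d^{+})\bigr) \;+\; \sum_{i=1}^{n} \exp\bigl(\mathrm{Comp}(q, d^{-}_{i})\bigr)}
$$

Swapping in a new $\mathrm{Comp}$ gives you a new retrieval model while the training objective stays fixed.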
That brings us to ColBERT, which was developed by Omar Khattab together with Matei Zaharia, who's my longtime collaborator and co-advises Omar with me. Omar would want me to point out for you that ColBERT stands for contextualized late interaction with BERT. That's an homage to the late night talk show host Stephen Colbert, who has late night contextual interactions with his guests. But you are also free to pronounce this ColBERT if that's easier for you. Here's how the model works. On the left, we encode the query with BERT, and I've grayed out all the states except the final ones, because those are the only ones we'll need. Similarly, we encode the document, again with BERT, and here again, the only states we need are the output states.
Then the basis for ColBERT scoring is a matrix of similarity scores between query tokens and document tokens, again as represented by these final output states. For each query token, we get the value of its maximum similarity over the document tokens, and it is the sum of those MaxSim values that is the basis for the model. As before, we have a dataset consisting of those triples. The loss is the negative log likelihood of the positive passage, and here is the MaxSim scoring function in full detail: for each query token, we take its maximum similarity with some document token, and we sum all those MaxSim values together. The result is a highly scalable form of soft alignment between tokens in the query and tokens in the document.
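A minimal sketch of that MaxSim computation is below. It leaves out details from the ColBERT paper such as the linear projection to small, normalized token vectors, and the random "embeddings" are placeholders for the real output states.

```python
import torch

def maxsim_score(query_vecs: torch.Tensor, doc_vecs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction.

    query_vecs: [n_query_tokens, dim] output states for the query
    doc_vecs:   [n_doc_tokens, dim] output states for the document

    For every query token, take the maximum similarity over document tokens,
    then sum those MaxSim values to get the query-document score.
    """
    sim_matrix = query_vecs @ doc_vecs.T          # [n_query_tokens, n_doc_tokens]
    return sim_matrix.max(dim=1).values.sum()

# Toy example with random stand-in token vectors.
torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(40, 128), dim=-1)
print(maxsim_score(q, d))
```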
Let me unpack that. It's highly scalable because, as with DPR, we can encode and store all of our documents ahead of time. The difference is that we now store a sequence of output vectors to represent each document. At query time, we encode the query, get its output states, and then perform a bunch of very fast MaxSim comparisons for scoring. At the same time, we have retained token-level interactions between query and document. It's just that they now happen only on the output states, not throughout the entire network as with the cross-encoder. That was too expensive, and this looks like a nice compromise. Indeed, it has proven to be an extremely powerful and effective IR mechanism.
One thing I really like about ColBERT is that it restores some level of term matching between queries and documents, except now we get to do that in a semantically very rich space. Here's an example. The query is: when did the Transformers cartoon series come out? The document is: the animated Transformers was released in August 1986. Unsurprisingly, Transformers in the query matches Transformers in the document. But we also have a very strong MaxSim match between cartoon in the query and animated in the document. That's a connection that only neural models like ColBERT can make without extra effort. Similarly, for come out in the context of the query, we have a strong match to released in the document, as well as to the two parts of the date expression, August 1986. Here I've shown the top two MaxSim values to show that we're getting intuitive soft alignments. That makes the model somewhat interpretable, and it also reveals to us why this is such an effective retrieval mechanism: because it can make all of these deep associations.
I thought I would pause here and just talk a little bit about what it takes to have these neural models and then turn them into something that could be effective as a deployed search technology. Because the issue in the background here is that we have gained semantic expressiveness, but it comes at a price: we need to do forward inference in BERT models, which can be expensive, prohibitively so if we have very tight latency restrictions.
The standard move is to use a model like ColBERT as a reranker on top of a cheaper first-stage retriever. Here's how this would play out for ColBERT. We index our documents ahead of time; for ColBERT, that index essentially consists of token-level representations. We use BM25 for the expensive first phase, where we need to do brute-force search over our entire index of documents, and then we use ColBERT only to rescore the top k documents that BM25 returns. That second phase can be incredibly powerful and add a lot of value, as a result of the fact that ColBERT and models like it are so good at doing retrieval in this context. One nice thing about this, though, is that we can control the reranking depth k. If k is small, we'll do very little processing with ColBERT; if k is larger, we'll use ColBERT more often, and we can calibrate that against other constraints that we're operating under.
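As a sketch of that control flow, here is the two-phase pipeline with the reranking depth exposed as a parameter. The bm25_scores and colbert_score functions are hypothetical stand-ins for whatever first-stage scorer and ColBERT implementation you are actually using.

```python
from typing import Callable, List, Tuple

def rerank_pipeline(
    query: str,
    docs: List[str],
    bm25_scores: Callable[[str, List[str]], List[float]],  # cheap first-stage scorer
    colbert_score: Callable[[str, str], float],            # expensive MaxSim scorer
    k: int = 100,                                          # reranking depth
) -> List[Tuple[str, float]]:
    """Score everything with BM25, then rescore only the top-k with ColBERT."""
    first_stage = sorted(zip(docs, bm25_scores(query, docs)),
                         key=lambda pair: pair[1], reverse=True)
    candidates = [doc for doc, _ in first_stage[:k]]
    reranked = [(doc, colbert_score(query, doc)) for doc in candidates]
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

Shrinking k trades accuracy for latency, which is exactly the calibration knob described above.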
The one concern you might have, maybe as a purist, is that you now have two retrieval mechanisms in play: BM25, which does the initial retrieval, and ColBERT, which performs the reranking function. We might hope for a more integrated solution.
In fact, ColBERT can do the full retrieval on its own. Now, the primary thing will be that we'll have an index containing all of the token-level vectors for all of our documents. We encode the query as a sequence of vectors, and then, for each vector in that query representation, we retrieve the p most similar token vectors from the index and then travel through them to their associated documents. That gives us a small candidate set of documents that we then score in full with ColBERT. The first step is fast, because all we're doing is a bunch of similarity calculations over token vectors. Again, we have a lot of control over how much we actually use the full ColBERT model at step 2 here, and we can calibrate that against the other constraints that we're operating under.
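Here is a toy sketch of that two-step flow, using a brute-force token index in plain PyTorch. A real deployment would use an approximate nearest-neighbor library and trained ColBERT encoders; the random vectors and the parameter names here are illustrative assumptions.

```python
import torch

def candidate_docs(query_vecs, token_index, token_to_doc, p=4):
    """Step 1: for each query vector, find the p most similar token vectors
    in the index and collect the documents they belong to."""
    sims = query_vecs @ token_index.T                 # [n_query_tokens, n_index_tokens]
    top_tokens = sims.topk(p, dim=1).indices.flatten()
    return {token_to_doc[i] for i in top_tokens.tolist()}

def score_candidates(query_vecs, doc_vecs_by_id, candidates):
    """Step 2: full MaxSim scoring, but only over the candidate documents."""
    scores = {}
    for doc_id in candidates:
        sims = query_vecs @ doc_vecs_by_id[doc_id].T
        scores[doc_id] = sims.max(dim=1).values.sum().item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy index: 3 documents, each stored as a bag of token vectors.
torch.manual_seed(0)
doc_vecs_by_id = {i: torch.nn.functional.normalize(torch.randn(20, 128), dim=-1)
                  for i in range(3)}
token_index = torch.cat(list(doc_vecs_by_id.values()))      # [60, 128]
token_to_doc = [doc_id for doc_id, vecs in doc_vecs_by_id.items()
                for _ in range(vecs.shape[0])]

query_vecs = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
cands = candidate_docs(query_vecs, token_index, token_to_doc, p=4)
print(score_candidates(query_vecs, doc_vecs_by_id, cands))
```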
The way we can do even better is with centroid-based ranking. The idea is that we cluster the token-level vectors in our index and keep track of the centroid for each cluster. Now, given a query that we encode again as a sequence of vectors, instead of having to search over this entire index, we search over a potentially very small number of centroids, and use the clusters attached to the best centroids to generate our candidate documents.
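A sketch of that centroid shortcut, assuming the index token vectors have already been clustered offline (the clustering itself is omitted, and n_probe is an illustrative parameter name):

```python
import torch

def centroid_candidates(query_vecs, centroids, cluster_members, n_probe=2):
    """For each query vector, find the closest n_probe centroids and return
    the union of the token ids stored in those clusters."""
    sims = query_vecs @ centroids.T                  # [n_query_tokens, n_centroids]
    top_clusters = sims.topk(n_probe, dim=1).indices.flatten().tolist()
    candidate_token_ids = set()
    for c in top_clusters:
        candidate_token_ids.update(cluster_members[c])
    return candidate_token_ids
```

Only those candidate tokens, and the documents they map back to, need to be looked up and scored in full, rather than the entire token index.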
I thought I would just mention a little bit of the work we've done on understanding and reducing ColBERT's latency. This comes from the paper that we called PLAID. The finding there was that, despite all the hard work that I just described for you, latencies were still higher than we wanted, whereas you might hope you could get this down to something suitable for interactive search. One surprising thing for ColBERT is that only a small part of the latency comes from the candidate generation that's used with the centroids that I described before. The bulk of the work is being done when we have to look up the stored token vectors and decompress them. That relates to a point that I haven't mentioned before: the ColBERT index can get very large, because we store a vector for every token in the collection. But we find that we can make those vectors relatively low resolution, compressing them quite aggressively, because all they need to do is represent individual tokens.
What the team did is a lot of careful work to reduce the amount of lookup and decompression that the ColBERT model was doing. They trade that off a little bit against leaning more heavily on the centroids. But they did successfully remove almost all the overhead that was coming from those lookups and the corresponding decompression that we had to do, and they got the latency all the way down to 58 milliseconds. I regard this as absolutely an amazing achievement, and it came not so much from changing the model itself, but rather from thinking about issues like latency and how they impact the deployability of systems like this. There's lots more room for innovation in this space, and I would exhort you all to think about how you could contribute to making systems not only more accurate, but also more efficient along this and other dimensions.
There's one more model that I wanted to mention, because I think it is incredibly powerful and competitive, and it also offers a different perspective on how to use neural representations in this space. This is SPLADE. At the bottom here I have the text that we're processing, and I'm trying to be agnostic about whether this is a query or a document, because we do both of those with the same kind of calculations. The core shift in perspective here is that now we're going to do scoring not directly between query and document representations, but rather with respect to our entire vocabulary. Here I have a small vocabulary of just seven items, but of course you could have tens of thousands of items. That's important for SPLADE, because we're going to have representations with the dimensionality of that vocabulary. We still encode the text with a model like BERT, but now the scoring is with respect to tokens in the sequence that we're processing and all of our vocabulary items. I won't walk through the SPLADE function itself in detail. You should think of it as a bunch of neural layers that help you represent all of these comparisons, followed by a sparsification of the scores that we get out of that.
The essential insight is that with this SPLADE function, we're going to get a score for every vocabulary item with respect to the sequence that we have processed. You should think of this orange thing as a vector with the same dimensionality as our vocabulary, giving what are in effect importance scores for our sequence with respect to everything in that vocabulary. Again, we do that for queries and for documents, and then the similarity function that's at the heart of all of these models is defined over those representations: it is a dot product between the SPLADE representation for the query and the SPLADE representation for the document. These are big, long, sparse vectors, and we take their dot product. The loss is our usual negative log likelihood plus, importantly, a regularization term that leads to sparse, balanced scores, which I think is an important modification given how large these vocabulary-sized representations are.
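As a rough sketch of the representation step, the recipe commonly described for SPLADE pools log(1 + ReLU(logits)) from a masked-language-model head over the sequence positions; the pooling choice and the regularization weights vary across SPLADE versions, and the untrained base model here is purely for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in, not a trained SPLADE model

def splade_vector(text: str) -> torch.Tensor:
    """Map a text to a |vocab|-dimensional sparse importance vector."""
    enc = tokenizer(text, return_tensors="pt")
    logits = mlm(**enc).logits.squeeze(0)         # [n_tokens, vocab_size]
    weights = torch.log1p(torch.relu(logits))     # sparsity-friendly activation
    return weights.max(dim=0).values              # pool over sequence positions

query_vec = splade_vector("when did the transformers cartoon series come out?")
doc_vec = splade_vector("the animated transformers was released in august 1986")
score = torch.dot(query_vec, doc_vec)             # dot product of two sparse-ish vocab vectors
```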
SPLADE performs very well, and I love this perspective, where we're now reaching even further back toward classical IR, where the vocabulary and term matching are so important. But again, it's all happening in this very rich neural space that contextual models like BERT give us.
This slide lists some additional recent developments in neural IR. I'm not going to go through it in detail, but I think the list does point to a general set of directions for the field: pushing these models to be both more effective and more efficient. That can happen with things like distillation, clever ways of training the models, and setting up new objectives for them, in the service of not just accuracy, but also efficiency for these systems. It is a tremendously exciting and active area of research for the field.
To wrap up, let me return to the thing that I emphasized so much when we talked about IR metrics, which is that there is more at stake here than just accuracy. This table summarizes some controlled experiments that we did in this paper, trying to get a sense for the system requirements, latency, costs, and accuracy for a variety of systems. There's no simple way to navigate this table, but one thing that jumps out is that BM25 is the only system you even get to run with this tiny little compute budget. If you are absolutely compute constrained or cost constrained, it may effectively be your only choice.
But assuming you can have more heavy-duty hardware, you might think about trade-offs within the space of possible ColBERT setups, and this is illuminating because, for example, these two models are pretty close in accuracy but very far apart in terms of cost and latency. You might think: I can sacrifice this amount of accuracy in order to get this much in terms of reduced latency and cost. Here's another comparison, this time involving a BT-SPLADE-large system, which costs a fraction of what the ColBERT model costs. Now, the ColBERT model is much more accurate, but you might decide the difference is not worth the price given the other considerations that are in play.
Here's another comparison, between two BT-SPLADE-large configurations. For a modest reduction in latency that comes from running on a GPU, I have to pay a whole lot more money for the same accuracy. You start to see that it's very unlikely that you'd be able to justify using a GPU with BT-SPLADE-large when it buys only a modest latency reduction but a huge ballooning in the overall cost that you pay.
I think there are lots of other comparisons like this that we can make, and we're going to talk later in the course about how we might design a leaderboard that takes account of all of these different pressures.
IR is a wonderful playground for thinking about such trade-offs.