
Stanford XCS224U: NLU I Information Retrieval, Part 4: Neural IR I Spring 2023


Chapters

0:00 Intro
0:21 Cross-encoders
4:53 Shared loss function: the negative log-likelihood of the positive passage
8:38 Soft alignment with ColBERT
10:01 ColBERT as a reranker
12:18 Beyond reranking for ColBERT
13:21 Centroid-based ranking
14:29 ColBERT latency analysis
16:04 Additional ColBERT optimizations
16:59 SPLADE
19:51 Additional recent developments
20:29 Multidimensional benchmarking

Whisper Transcript

00:00:00.000 | Welcome back everyone.
00:00:06.160 | This is part 4 in our series on information retrieval.
00:00:09.240 | We come to the heart of it, neural information retrieval.
00:00:11.840 | This is the class of models that has done so much to bring NLP and IR
00:00:16.200 | back together again and open new doors for both of those fields.
00:00:20.480 | In the background throughout the screencast,
00:00:23.040 | I think you should imagine that the name of the game is to take
00:00:25.800 | a pre-trained BERT model and fine-tune it for information retrieval.
00:00:31.320 | In that context, cross-encoders are
00:00:34.240 | conceptually the simplest approach that you could take.
00:00:37.400 | For cross-encoders, we're going to concatenate
00:00:40.720 | the query text and the document text together into one single text,
00:00:44.920 | process that text with BERT,
00:00:47.020 | and then use representations in that BERT model as the basis for IR fine-tuning.
00:00:52.720 | In a bit more detail, we'd process the query and
00:00:55.040 | the document and then probably take the final output state above the class token,
00:01:00.080 | add some task specific parameters on top,
00:01:03.120 | and fine-tune the model against our information retrieval objective.
00:01:07.440 | That will be incredibly semantically expressive because we have all of
00:01:11.080 | these interesting interactions between query and document in this mode.
00:01:15.680 | In a bit more detail, in the background here,
00:01:17.760 | I'm imagining that we have a dataset of triples where we have a query,
00:01:21.800 | one positive document for that query, and some number,
00:01:25.280 | one or more negative documents for that query.
00:01:28.920 | The basis for scoring is as I described it before,
00:01:32.360 | we're going to take our BERT encoder,
00:01:34.480 | concatenate the query and the document and process that text,
00:01:37.680 | and then retrieve the final output state above the class token that's given here,
00:01:42.400 | and that's fed through a dense layer that is used for scoring.
00:01:46.760 | Then the loss function for the model is typically
00:01:49.820 | the negative log likelihood of the positive passage.
00:01:52.840 | In the numerator here, we have our score for
00:01:55.400 | the positive passage according to our scoring function,
00:01:58.360 | and the denominator is that positive passage score again,
00:02:01.440 | summed together with the total for all the negative passages.
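
To make that concrete, here is a minimal sketch of cross-encoder scoring and this loss, assuming the Hugging Face transformers library; the model name and helper functions are illustrative placeholders, not code from the course.

```python
# Minimal cross-encoder sketch (illustrative; names are placeholders).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)  # task-specific head on top of BERT

def rep(query, docs):
    # Concatenate the query with each document: [CLS] query [SEP] doc [SEP]
    batch = tokenizer([query] * len(docs), docs,
                      padding=True, truncation=True, return_tensors="pt")
    cls_state = encoder(**batch).last_hidden_state[:, 0]  # final output state above [CLS]
    return scorer(cls_state).squeeze(-1)                  # one score per document

def loss(query, docs):
    # docs[0] is the positive passage; the rest are negatives.
    scores = rep(query, docs)
    return -torch.log_softmax(scores, dim=-1)[0]          # NLL of the positive passage
```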
00:02:05.760 | Let's step back. This will be incredibly semantically rich,
00:02:09.560 | but it simply won't scale.
00:02:11.400 | The richness comes from us using
00:02:13.360 | the BERT model to jointly encode the query and the documents.
00:02:16.760 | We have all these rich token level interactions,
00:02:19.700 | but that is the model's downfall.
00:02:21.760 | This won't scale because we need to encode every document at query time.
00:02:26.860 | In principle, this means that if we have a billion documents,
00:02:29.780 | we need to do a billion forward passes with the BERT model,
00:02:34.100 | one for every document with respect to our query,
00:02:37.100 | get all those scores,
00:02:38.300 | and then make decisions on that basis,
00:02:40.220 | and that will be simply infeasible.
00:02:42.940 | Although there's something conceptually right about this approach,
00:02:46.180 | it's simply intractable for modern search.
00:02:50.380 | DPR can be seen as a model
00:02:52.820 | that's at the other end of the spectrum.
00:02:54.700 | This stands for dense passage retriever.
00:02:57.080 | In this mode, we're going to separately encode queries and documents.
00:03:01.340 | On the left here, I've got our query encoded with a BERT-like model,
00:03:04.940 | and I've grayed out all of the states
00:03:07.160 | except the final output state above the class token.
00:03:09.940 | That's the only one that we'll really need.
00:03:12.100 | I separately encode the document, and again,
00:03:14.500 | we just need that final output state above the class token.
00:03:18.260 | Then we're going to do scoring as a dot product of those two vectors.
00:03:22.780 | In a bit more detail, again,
00:03:24.220 | we have a dataset consisting of those triples for our query,
00:03:27.620 | one positive document, and one or more negative documents.
00:03:31.500 | Now, our core comparison function is what I've called sim here.
00:03:35.820 | The basis for sim is that we encode our query using
00:03:38.700 | our query encoder and get the final output state,
00:03:41.780 | and we get the dot product of that with
00:03:43.900 | the encoding for our document again focused
00:03:46.460 | on the output state above the class token.
00:03:49.220 | The loss is as before,
00:03:50.980 | this is the negative log likelihood of the positive passage.
00:03:54.100 | The positive score is up here,
00:03:55.960 | and then it's used again down here, summed together
00:03:58.420 | with the scores for all of the negative passages.
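
As a companion sketch, here is what the DPR-style sim function and the same loss might look like, again assuming Hugging Face transformers with placeholder model names.

```python
# Minimal DPR-style sketch (illustrative). Queries and documents are encoded
# separately; sim is the dot product of the two [CLS] output states.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
doc_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]       # [CLS] vectors, shape (n, d)

def sim(query, docs):
    q = encode(query_encoder, [query])                    # (1, d)
    d = encode(doc_encoder, docs)                         # (n, d); can be precomputed offline
    return (q @ d.T).squeeze(0)                           # dot-product score per document

def loss(query, docs):
    # docs[0] is the positive passage; the rest are negatives.
    return -torch.log_softmax(sim(query, docs), dim=-1)[0]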
00:04:01.540 | This will be highly scalable,
00:04:04.220 | but it's very limited in terms of its query document interactions.
00:04:08.100 | Let's unpack that a bit.
00:04:09.460 | The core of the scalability is that we can now
00:04:12.580 | encode all of our documents offline ahead of time,
00:04:15.460 | and indeed, we only need to store
00:04:18.460 | one single vector associated with each one of those documents.
00:04:22.180 | Then at query time,
00:04:23.460 | we just encode the query,
00:04:25.020 | get that one representation above the class token,
00:04:28.180 | and do a fast dot product with all of our documents.
00:04:31.100 | It's highly scalable in that sense too.
00:04:33.460 | But at the same time,
00:04:34.860 | we have lost all of those token level interactions
00:04:37.660 | we had with the cross encoder.
00:04:39.980 | Now, we have to hope that all of
00:04:41.580 | the information about the query and
00:04:43.180 | the document is summarized in
00:04:45.340 | those single vector representations and we might
00:04:48.460 | worry that that results in a loss of expressivity for the model.
00:04:53.100 | Before moving on to some additional models in this space,
00:04:56.340 | I thought I would just pause here and point
00:04:58.540 | out that we have a little bit of modularity.
00:05:01.300 | The loss function for both of the models that I presented
00:05:04.540 | is the negative log likelihood of the positive passage.
00:05:07.780 | Here's how I presented it for the cross encoder,
00:05:10.100 | and the core of that is this rep function.
00:05:12.500 | Here's how I presented it for DPR,
00:05:15.020 | where the core of it is the sim function.
00:05:17.300 | You can now see that there's a general form of this,
00:05:20.260 | where we just have some comparison function
00:05:22.460 | and everything else remains the same.
00:05:24.620 | This is freeing because if you developed
00:05:27.420 | variants of DPR or cross encoders,
00:05:30.500 | the way that might play out is that you've simply
00:05:32.740 | adjusted the comparison function here,
00:05:35.260 | and everything else about how you're setting up
00:05:37.340 | models and optimizing them could
00:05:38.900 | potentially stay the same.
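
In code, that modularity might look like the following sketch: a single loss that accepts any comparison function (rep, sim, or the MaxSim function introduced below) returning one score per candidate document. This illustrates the general pattern rather than reproducing the course's code.

```python
import torch

def nll_of_positive(comparison_fn, query, docs):
    # docs[0] is the positive passage; the rest are negatives.
    scores = comparison_fn(query, docs)          # any of rep, sim, maxsim, ...
    return -torch.log_softmax(scores, dim=-1)[0]
```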
00:05:41.460 | Let's move to ColBERT.
00:05:44.420 | This model is near and dear to me.
00:05:46.340 | ColBERT was developed by Omar Khattab,
00:05:48.300 | who is my student, along with Matei Zaharia,
00:05:51.100 | who's my longtime collaborator and co-advises Omar with me.
00:05:55.340 | Omar would want me to point out for you that ColBERT
00:05:58.700 | stands for contextualized late interaction with BERT.
00:06:02.300 | That's an homage to the late night talk show host,
00:06:05.100 | Stephen Colbert, who has
00:06:06.860 | late night contextual interactions with his guests.
00:06:10.420 | But you are also free to pronounce this ColBERT
00:06:13.420 | because obviously the BERT in
00:06:15.220 | that name is the famous BERT model.
00:06:18.300 | Here's how ColBERT works.
00:06:20.220 | First, we encode queries using BERT.
00:06:22.540 | I've drawn this on its side for
00:06:24.140 | reasons that will become clear when I
00:06:25.540 | show you my full diagram,
00:06:26.780 | but it's just a BERT encoding of the query.
00:06:29.140 | I've grayed out all the states except the final ones
00:06:31.980 | because the only states we need
00:06:34.060 | are the output states from this model.
00:06:37.020 | Similarly, we encode the document again with BERT,
00:06:40.660 | and here again, the only states we need are the output states.
00:06:44.860 | Then the basis for ColBERT scoring is a matrix of
00:06:48.380 | similarity scores between query tokens and document tokens,
00:06:52.580 | again, as represented by these final output layers.
00:06:55.540 | We get scores, and in fact,
00:06:57.140 | we get a full grid of these scores.
00:06:59.740 | Then the basis for scoring is
00:07:01.940 | a MaxSim comparison: for every query token,
00:07:05.820 | we get the value of the maximum similarity over the document tokens,
00:07:10.260 | and we sum those together to get
00:07:12.260 | the MaxSim value that is the basis for the model.
00:07:15.500 | In a bit more detail, again,
00:07:17.260 | we have a dataset consisting of those triples.
00:07:20.220 | The loss is the negative log likelihood of the positive passage,
00:07:23.860 | but now MaxSim is the basis,
00:07:25.620 | and here is the MaxSim scoring function in full detail.
00:07:29.460 | But again, the essence of this is that for
00:07:31.380 | each query token, we get the maximum similarity with
00:07:34.500 | some document token and sum all those MaxSim values together.
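
Here is a minimal sketch of that MaxSim computation, assuming we already have the (typically L2-normalized) token-level output states for a query and a document; the random tensors simply stand in for BERT outputs.

```python
import torch
import torch.nn.functional as F

def maxsim(Q, D):
    # Q: (num_query_tokens, dim), D: (num_doc_tokens, dim)
    sim_matrix = Q @ D.T                         # grid of token-level similarity scores
    return sim_matrix.max(dim=1).values.sum()    # max over doc tokens, summed over query tokens

# Stand-in example: random unit vectors in place of real BERT output states.
Q = F.normalize(torch.randn(8, 128), dim=-1)
D = F.normalize(torch.randn(80, 128), dim=-1)
score = maxsim(Q, D)
```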
00:07:39.180 | This will be highly scalable,
00:07:41.780 | but it has late contextual interactions
00:07:44.380 | between tokens in the query and tokens in the document.
00:07:47.460 | Let me unpack that. It's highly scalable because as with DPR,
00:07:51.420 | we can store all of our documents ahead of time.
00:07:54.660 | We just need to store this set
00:07:57.580 | of output vectors here to represent documents.
00:08:00.660 | At query time, we encode the query and get the output states,
00:08:04.100 | and then perform a bunch of very fast MaxSim comparisons for scoring.
00:08:09.020 | But it's also semantically rich.
00:08:11.180 | We have retained some of the advantages of
00:08:13.340 | the cross encoder because we do have
00:08:15.780 | token level interactions between query and document.
00:08:18.780 | It's just that they happen only on the output states,
00:08:22.340 | whereas the cross encoder allowed them to
00:08:24.020 | happen at every layer in the BERT model.
00:08:26.820 | That was too expensive and this looks like a nice compromise.
00:08:30.700 | ColBERT has indeed proven to be
00:08:32.900 | an extremely powerful and effective IR mechanism.
00:08:37.700 | One thing I really like about ColBERT is that it
00:08:41.420 | brings in an older insight from IR,
00:08:44.180 | which is that essentially we want to do
00:08:45.860 | some level of term matching between queries and documents.
00:08:49.300 | Except now, since this is a neural model,
00:08:51.620 | we get to do that in a semantically very rich space.
00:08:55.060 | Let me show you that by way of an example.
00:08:56.940 | Here I have the query,
00:08:58.100 | when did the Transformers cartoon series come out?
00:09:01.260 | We have the document,
00:09:02.500 | the animated Transformers was released in August 1986.
00:09:06.740 | I'm going to show you some MaxSim values.
00:09:09.660 | The largest score is between Transformers in
00:09:12.380 | the query and Transformers in the document.
00:09:14.940 | That makes good sense.
00:09:16.660 | But we also have a very strong MaxSim match between
00:09:19.860 | cartoon in the query and animated in the document.
00:09:23.300 | That's a very semantic connection that
00:09:25.460 | only neural models like ColBERT can make without extra effort.
00:09:28.900 | Similarly, for come out in the context of the query,
00:09:32.860 | we have a strong match to released in the document.
00:09:36.020 | Then for when in the query that matches to
00:09:38.420 | the two parts of the date expression, August 1986.
00:09:41.620 | Here I've shown the top two MaxSim values to show that we're
00:09:44.780 | really getting a semantic connection
00:09:46.420 | to that full unit in the document.
00:09:48.820 | This thing makes the model highly
00:09:51.620 | interpretable and also reveals to us why this is
00:09:54.300 | such an effective retrieval mechanism
00:09:57.340 | because it can make all of these deep associations.
00:10:01.100 | Before moving on to SPLADE,
00:10:03.660 | the final model that I wanted to talk about,
00:10:05.580 | I thought I would pause here and just talk a little bit with
00:10:08.420 | you about how you take ColBERT or any of
00:10:11.420 | these neural models and then turn them into something that
00:10:14.220 | could be effective as a deployed search technology.
00:10:18.540 | Because in the background here is that we have
00:10:21.060 | semantic expressiveness but it comes at a price,
00:10:24.300 | we need to do forward inference in BERT models,
00:10:27.300 | and that can be very expensive,
00:10:29.700 | prohibitively so if we have very tight latency restrictions.
00:10:33.860 | The question for us is,
00:10:35.420 | can we overcome those limitations
00:10:37.780 | and make this a practical solution?
00:10:40.580 | One easy thing to do to make this practical
00:10:44.180 | is to employ these models as re-rankers.
00:10:46.860 | Here's how this would play out for ColBERT.
00:10:49.780 | For ColBERT, remember we have an index that
00:10:52.060 | essentially consists of token level representations.
00:10:55.620 | Those are each associated with documents.
00:10:58.260 | Given an index structure like this,
00:11:00.380 | a simple thing to do would be to take
00:11:01.940 | our query and encode it as a bunch of tokens,
00:11:04.780 | get the top K documents for that query using
00:11:08.020 | a fast term-based model like BM25,
00:11:11.180 | and then use ColBERT only at stage 2
00:11:13.900 | to re-rank the top K documents there.
00:11:17.060 | We use BM25 for the expensive first phase where we need to do
00:11:21.420 | brute force search over our entire index of documents,
00:11:25.100 | and a model like ColBERT comes in only at
00:11:27.500 | phase 2 to do re-ranking.
00:11:30.140 | It sounds like a small thing,
00:11:31.740 | but in fact the re-ranking that happens in
00:11:33.700 | that second phase can be incredibly powerful and add
00:11:36.780 | a lot of value as a result of the fact that ColBERT and
00:11:39.860 | models like it are so good at doing retrieval in this context,
00:11:44.700 | but they're expensive.
00:11:46.140 | One nice thing about this though is that we can control
00:11:48.940 | our costs because if we set K very low,
00:11:51.700 | we'll do very little processing with ColBERT.
00:11:54.180 | If we set K high,
00:11:55.420 | we'll use ColBERT more often, and we can calibrate
00:11:58.220 | that against other constraints that we're operating under.
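
A sketch of that two-stage setup is below, assuming the rank_bm25 package for stage 1 and a ColBERT-style maxsim scorer like the one sketched earlier; encode_query and encode_doc are hypothetical encoders producing token-level vectors.

```python
from rank_bm25 import BM25Okapi

def retrieve_then_rerank(query, corpus, encode_query, encode_doc, maxsim, k=100):
    # Stage 1: cheap term-based retrieval over the full corpus with BM25.
    bm25 = BM25Okapi([doc.split() for doc in corpus])
    candidates = bm25.get_top_n(query.split(), corpus, n=k)
    # Stage 2: expensive neural scoring, but only over the top-k candidates.
    Q = encode_query(query)
    scored = [(float(maxsim(Q, encode_doc(doc))), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Setting k directly controls how much work the neural model does, which is the calibration knob described above.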
00:12:02.100 | This is a perfectly reasonable solution.
00:12:04.940 | The one concern you might have maybe as a purist,
00:12:08.060 | is that you now have two retrieval mechanisms in play,
00:12:10.780 | BM25 which does a lot of the work,
00:12:13.220 | and ColBERT which performs the re-ranking function.
00:12:16.220 | We might hope for a more integrated solution.
00:12:19.140 | Could we get beyond re-ranking for ColBERT?
00:12:22.100 | I think the answer is yes.
00:12:23.820 | We're going to make a slight adjustment
00:12:25.740 | to how we set up the index.
00:12:27.060 | Now, the primary thing will be that we'll have
00:12:29.220 | these token level vectors which of course,
00:12:32.180 | as before, associate with documents.
00:12:35.220 | Now, when a query comes in,
00:12:37.300 | we encode that into a sequence of vectors,
00:12:40.420 | and then for each vector in that query representation,
00:12:44.020 | we retrieve the P most similar token vectors,
00:12:47.420 | and then travel through them to their associated documents.
00:12:50.860 | Then the only ColBERT work that we do is
00:12:53.940 | scoring the potentially small set
00:12:55.780 | of documents that we end up with in phase 2.
00:12:57.860 | Because in phase 1,
00:12:59.340 | all we're doing is a bunch of similarity calculations
00:13:02.420 | between vector representations.
00:13:04.860 | Again, we have a lot of control over how much we
00:13:08.100 | actually use the full ColBERT model at step 2 here,
00:13:11.220 | and therefore we can calibrate against
00:13:13.220 | other constraints that we're operating under.
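
Here is a sketch of that candidate-generation step, assuming a FAISS index over all document token vectors and a token_to_doc mapping from each stored token vector back to its document; the names are illustrative.

```python
import numpy as np
import faiss

def candidate_docs(query_vecs, token_index, token_to_doc, p=32):
    # query_vecs: (num_query_tokens, dim) array of query token representations.
    q = np.ascontiguousarray(query_vecs, dtype="float32")
    _, ids = token_index.search(q, p)   # p most similar stored token vectors per query token
    # Travel from retrieved token vectors to their documents; these candidates
    # are then fully scored with MaxSim.
    return {token_to_doc[i] for i in ids.ravel() if i >= 0}
```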
00:13:16.180 | This is certainly workable,
00:13:18.220 | but we can probably do even better.
00:13:20.860 | The way we can do even better is with centroid-based ranking.
00:13:24.780 | This begins from the insight that
00:13:26.580 | this index that we've constructed
00:13:28.340 | here will have a lot of semantic structure,
00:13:31.260 | and we can capture that by clustering
00:13:33.620 | the token level vectors that represent
00:13:35.660 | our documents into clusters,
00:13:38.460 | and then taking their centroids to be
00:13:40.540 | representative summaries of those clusters.
00:13:43.460 | We can use those as the basis for search.
00:13:47.060 | Now, given a query that we encode again as a sequence of vectors,
00:13:51.220 | for each one of those vectors,
00:13:52.780 | we retrieve the closest centroids,
00:13:55.460 | and then travel from them to
00:13:57.540 | similar document tokens and
00:13:59.860 | then from them to similar documents.
00:14:02.100 | Then again, we use ColBERT,
00:14:03.620 | the full model only at step 3 here.
00:14:06.340 | All these other comparisons are
00:14:07.900 | just fast similarity comparisons.
00:14:10.940 | This gives us huge gains because
00:14:12.980 | instead of having to search over this entire index,
00:14:15.580 | we search over a potentially very small number of
00:14:18.940 | centroid representations and use those as
00:14:21.780 | the basis for getting down to
00:14:23.740 | a small set of documents that we're
00:14:25.340 | going to score completely with ColBERT.
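
A sketch of centroid-based candidate generation under the same assumptions: cluster the stored token vectors offline (here with FAISS k-means), keep a mapping from each cluster to the documents whose tokens it contains, and search only the centroids at query time.

```python
import numpy as np
import faiss

def build_centroids(token_vecs, n_clusters=1024):
    # Offline: cluster all document token vectors and keep the centroids.
    kmeans = faiss.Kmeans(token_vecs.shape[1], n_clusters, niter=20)
    kmeans.train(np.ascontiguousarray(token_vecs, dtype="float32"))
    return kmeans.centroids                      # (n_clusters, dim)

def centroid_candidates(query_vecs, centroids, cluster_to_docs, n_probe=4):
    index = faiss.IndexFlatIP(centroids.shape[1])
    index.add(np.ascontiguousarray(centroids, dtype="float32"))
    _, cids = index.search(np.ascontiguousarray(query_vecs, dtype="float32"), n_probe)
    docs = set()
    for cid in cids.ravel():                     # nearest centroids per query token
        docs |= cluster_to_docs[cid]             # documents whose tokens fall in that cluster
    return docs                                  # then fully scored with MaxSim
```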
00:14:28.420 | That's a bunch of the work that we've done.
00:14:32.260 | I thought I would just mention a little bit of
00:14:34.540 | the work that we've done specifically to
00:14:36.380 | address latency concerns for ColBERT.
00:14:38.660 | This comes from the paper that we called PLAID.
00:14:41.460 | It begins from the observation that
00:14:43.580 | despite all the hard work that I just described for you,
00:14:46.260 | the latency for the ColBERT model was still
00:14:49.020 | prohibitively high at 287 milliseconds.
00:14:52.780 | Whereas you might hope you could get this down to
00:14:54.700 | around 50 milliseconds for
00:14:56.300 | a feasible deployable solution at a minimum.
00:15:00.380 | This chart here is showing you
00:15:02.420 | where the work actually happens.
00:15:04.100 | One surprising thing for ColBERT is that only a small part of
00:15:08.300 | the overall time there is actually spent on
00:15:10.860 | the core modeling steps of
00:15:12.380 | representing examples and doing scoring.
00:15:15.860 | In fact, only a small part is even
00:15:18.660 | spent on the centroids that I described before.
00:15:21.180 | The bulk of the work is being done when we have to
00:15:24.180 | look things up in this giant index,
00:15:26.820 | and also when we do decompression.
00:15:29.140 | That's a point that I haven't mentioned before,
00:15:31.060 | but the essence of this is that
00:15:33.140 | the ColBERT index can get very large because we
00:15:36.220 | need to store token level representations.
00:15:39.180 | But we find that we can make them relatively low resolution,
00:15:43.140 | even two-bit representations, because
00:15:46.060 | all they need to do is represent individual tokens.
00:15:49.580 | But that does mean that we would like to
00:15:51.460 | decompress them at some point to
00:15:53.300 | get back to their full semantic richness.
00:15:55.700 | We found that that step of unpacking them
00:15:58.900 | was also expensive.
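
To give a feel for the kind of compression being described, here is a simplified sketch of storing a token vector as a centroid id plus a low-bit quantized residual; this illustrates the general idea only, not PLAID's or ColBERTv2's exact scheme.

```python
import numpy as np

def compress(vec, centroids, bits=2):
    cid = int(np.argmax(centroids @ vec))            # nearest centroid by inner product
    residual = vec - centroids[cid]
    lo, hi = residual.min(), residual.max()
    levels = 2 ** bits - 1
    codes = np.round((residual - lo) / (hi - lo + 1e-9) * levels).astype(np.uint8)
    return cid, codes, (lo, hi)                      # tiny footprint per token vector

def decompress(cid, codes, bounds, centroids, bits=2):
    lo, hi = bounds
    residual = codes.astype(np.float32) / (2 ** bits - 1) * (hi - lo) + lo
    return centroids[cid] + residual                 # approximate original token vector
```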
00:16:01.020 | What the team did is do a lot of work to reduce
00:16:04.380 | the amount of heavy-duty lookup and
00:16:06.580 | decompression that the ColBERT model was doing.
00:16:09.260 | They trade that off a little bit against using
00:16:11.740 | more centroids as part of
00:16:13.260 | that initial search phase that I described.
00:16:16.100 | But they did successfully remove almost all the overhead that was
00:16:19.740 | coming from these large data structures and
00:16:21.940 | the corresponding decompression that we had to do,
00:16:24.900 | and they got the latency all the way down to 58 milliseconds.
00:16:28.860 | I regard this as absolutely an amazing achievement.
00:16:32.740 | I think it shows you how much
00:16:34.540 | innovative work can happen in this space,
00:16:36.300 | not focused on hill climbing on accuracy,
00:16:39.420 | but rather thinking about issues like latency and how they
00:16:42.380 | impact the deployability of systems like this.
00:16:46.180 | There's lots more room for innovation in this space.
00:16:49.540 | I would exhort you all to think about how you could
00:16:51.740 | contribute to making systems not only more accurate,
00:16:54.700 | but also more efficient along this and other dimensions.
00:16:59.020 | There's one more model that I wanted to mention because I think
00:17:02.740 | this is incredibly powerful and competitive and also
00:17:05.980 | offers yet again another perspective on
00:17:08.900 | how to use neural representations in this space.
00:17:11.700 | This model is SPLADE.
00:17:14.100 | Here's how SPLADE works.
00:17:15.820 | I've got at the bottom here
00:17:17.180 | our encoding mechanism for sequences,
00:17:19.940 | and I'm trying to be agnostic about whether this is
00:17:22.220 | a query sequence or a document sequence
00:17:24.540 | because we do both of those with the same kind of calculations.
00:17:29.500 | Just imagine we're processing some text.
00:17:32.100 | The core shift in perspective here is that now we're going to do
00:17:35.820 | scoring with respect not to some other text,
00:17:38.860 | but rather with respect to our entire vocabulary.
00:17:42.460 | Here I have a small vocabulary of just seven items,
00:17:45.780 | but of course you could have tens of thousands of items.
00:17:48.660 | That's important for SPLADE because we're going to have
00:17:51.060 | very sparse representations by comparison
00:17:53.820 | with cross-encoders, DPR, and ColBERT.
00:17:57.300 | Here's how this works.
00:17:58.980 | We're going to form, as with ColBERT,
00:18:00.940 | a matrix of scores,
00:18:02.220 | but now the scoring is with respect to tokens in
00:18:05.060 | the sequence that we're processing and all of our vocabulary items.
00:18:09.300 | The scoring function for that is detailed.
00:18:11.860 | I've depicted it here.
00:18:12.980 | You should think of it as a bunch of neural layers
00:18:16.060 | that help you represent all of these comparisons.
00:18:19.420 | You do all of that work,
00:18:21.020 | and then the SPLADE scoring function is
00:18:23.260 | the sparsification of the scores that we get out of that.
00:18:27.060 | That's depicted here.
00:18:28.420 | The essential insight is that with this SPLADE function,
00:18:32.060 | we're going to get a score for every vocabulary item
00:18:35.100 | with respect to the sequence that we have processed.
00:18:37.940 | That's what's depicted in orange here.
00:18:40.260 | You should think of this orange thing as a vector with
00:18:43.980 | the same dimensionality as our vocabulary giving what are
00:18:47.540 | probably very sparse scores for
00:18:50.220 | our sequence with respect to everything in that vocabulary.
00:18:54.020 | Again, we do that for queries and for documents,
00:18:56.940 | and then the similarity function that's at the heart of all of
00:18:59.660 | these models is now SIMSPLADE,
00:19:02.420 | which is a dot product between the SPLADE representation for
00:19:05.900 | the query and the SPLADE representation for the document.
00:19:09.820 | These are big long sparse vectors and we take
00:19:12.340 | the dot product of them for scoring.
00:19:14.900 | The loss is our usual negative log likelihood plus importantly,
00:19:19.780 | a regularization term that leads to sparse, balanced scores,
00:19:23.860 | which I think is an important modification given how
00:19:26.580 | different the SPLADE representations are
00:19:28.820 | compared to the others we've discussed.
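
Here is a sketch of SPLADE-style encoding and scoring, assuming a BERT masked-language-model head from Hugging Face transformers; the model name and pooling choice are placeholders (pooling varies across SPLADE versions), and the sparsity regularizer used in training is omitted.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_encode(text):
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    logits = mlm(**batch).logits                     # (1, num_tokens, vocab_size)
    weights = torch.log1p(torch.relu(logits))        # sparsify the token-vocabulary scores
    return weights.max(dim=1).values.squeeze(0)      # one score per vocabulary item

def sim_splade(query, doc):
    # Dot product of two long, mostly sparse vocabulary-sized vectors.
    return splade_encode(query) @ splade_encode(doc)

# Training would use the usual negative log-likelihood of the positive passage
# plus a regularization term encouraging sparse, balanced scores (omitted here).
```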
00:19:31.660 | But this is an incredibly powerful model,
00:19:33.900 | and I love this perspective where we're now even further back
00:19:37.980 | to original IR insights about how
00:19:40.540 | vocabulary and term matching are so important.
00:19:43.220 | But again, it's happening in this very rich neural space
00:19:47.140 | that's defined by this grid of scores.
00:19:50.420 | I'm not going to go through this slide in detail,
00:19:53.740 | but I couldn't resist mentioning
00:19:55.260 | a bunch of other recent developments.
00:19:57.580 | They are biased toward ColBERT
00:19:59.660 | because I'm biased toward ColBERT.
00:20:01.580 | But I think the list does point to a general set of directions
00:20:06.100 | around making systems more efficient
00:20:08.260 | and also making them more multilingual.
00:20:10.580 | That can happen with things like distillation and
00:20:13.740 | also innovative ways of training
00:20:15.580 | the models and setting up new objectives for them,
00:20:18.460 | while balancing lots of considerations,
00:20:20.820 | not just accuracy, but also efficiency for these systems.
00:20:25.100 | Tremendously exciting and active area of research for the field.
00:20:29.260 | To round out that point,
00:20:31.340 | I thought I would return to
00:20:33.460 | the thing that I emphasized so much when we talked about IR metrics,
00:20:37.420 | which is that there is more at stake here than just accuracy.
00:20:41.980 | This is from a series of
00:20:43.860 | controlled experiments that we did in this paper,
00:20:46.540 | trying to get a sense for the system requirements,
00:20:49.740 | latency, costs, and accuracy for a variety of systems.
00:20:54.500 | There's no simple way to navigate this table,
00:20:56.780 | so let me just highlight a few things.
00:20:58.860 | First, BM25 is the only model that we could
00:21:02.660 | even get to run with this tiny little compute budget.
00:21:06.140 | If you are absolutely compute constrained or cost constrained,
00:21:10.380 | you might be forced to choose BM25.
00:21:13.140 | It's a reasonably effective model.
00:21:15.700 | But assuming you can have more heavy-duty hardware,
00:21:18.820 | you might think about trade-offs within the space of
00:21:21.180 | possible ColBERT setups, and this is illuminating because,
00:21:25.140 | for example, these two models are pretty close in accuracy,
00:21:28.740 | but very far apart in terms of cost and latency.
00:21:32.340 | You might think I can sacrifice this amount of accuracy here
00:21:38.420 | to do this much in terms of reduced latency and cost.
00:21:42.140 | Here's another such comparison.
00:21:43.980 | ColBERT small has a latency of 206,
00:21:47.460 | BT-SPLADE large has a latency of 46,
00:21:50.580 | and costs a fraction of what the ColBERT model costs.
00:21:54.020 | Now, the ColBERT model is much more accurate,
00:21:56.740 | but maybe this is an affordable drop here
00:21:59.500 | given the other considerations that are in play.
00:22:02.660 | Here's another comparison between two BT-SPLADE large models.
00:22:06.580 | For a modest reduction in latency that comes from running on a GPU,
00:22:11.820 | I have to pay a whole lot more money for the same accuracy.
00:22:16.860 | For example, you start to see that it's very unlikely that you'd be able to
00:22:20.580 | justify using a GPU with BT-SPLADE large when it's only a modest latency reduction,
00:22:27.100 | but a huge ballooning in the overall cost that you pay.
00:22:30.980 | I think there are lots of other comparisons like this that we can make,
00:22:34.700 | and we're going to talk later in the course about how we might
00:22:37.860 | systemize some of these observations into
00:22:41.140 | a leaderboard that takes account of all of these different pressures.
00:22:44.940 | IR is a wonderful playground for thinking about such trade-offs.