transcription.json
[
{
"speaker": "Speaker A",
"content": "Hey, everyone."
},
{
"speaker": "Speaker B",
"content": "Thank you so much for watching another episode of the weaviate podcast. Today we're diving into the Arctic Embed text embedding model series from Snowflake, with a particular focus on the recently released Arctic embed 2.0 Multilingual text Embedding model. We'll also touch on the recently released weaviate embedding service is one way to access these models and of course, easily integrate them with weaviate. So to kick it off from the Snowflake team, I'm super excited to welcome Luke Merrick and Martin Yu. Thank you both so much for joining the podcast."
},
{
"speaker": "Speaker A",
"content": "Thank you. It's great to be here, Connor. We both have listened to the podcast as fans, and it's kind of a special treat to get to be here with you. Meeting the man himself, Connor Shorten."
},
{
"speaker": "Speaker C",
"content": "Yeah, thanks for having us."
},
{
"speaker": "Speaker B",
"content": "Thanks for that Luke. Awesome. And from the Weaviate team, I'm super excited to welcome Charles Pierce. Charles is the second Weaviate podcast guest helping get this podcast off the ground. Now he's back at weaviate as the head of the weaviate Labs team. Charles, thank you so much for joining."
},
{
"speaker": "Speaker D",
"content": "It's good to be back. Conor, a pleasure as always, and I'm really, really excited for this conversation. So, yeah, I think it will. Think it's going to be a great one."
},
{
"speaker": "Speaker A",
"content": "Awesome."
},
{
"speaker": "Speaker B",
"content": "So, Luke and Martin, could you kick us off? I'd really love to know. This kind of the intro of the Arctic embedding models, just sort of like how the project began and all these things."
},
{
"speaker": "Speaker A",
"content": "Yeah, I mean, so I can kick that one off. I was there maybe a little bit earlier in the. In the process than Martin. So Snowflake acquired Neva, which was a search company, and that brought like a kind of a lot of search folks and search expertise and search ambitions into Snowflake about a year and a half ago, as part of that or around that time, this, this new product, which has now become Cortex Search, was created for Snowflake. And this is kind of a managed search solution. Um, it's. It's kind of a different target audience than I think a lot of the weaviate product really resonates with because it's kind of. Not a lot of. You're not meant to understand how it works."
},
{
"speaker": "Speaker A",
"content": "It's not the open source part, but along the way we, we did a lot of search benchmarking and we're creating this product and we're thinking about, you know, how can I provide a better, higher Quality search product. How, how do you even measure that quality? And as we were doing these early benchmarks, we're finding out which part of the stack is really driving quality. And it turned out, you know, we could, we could tweak some stuff around like vocabulary expansion or synonyms or all these things on the keyword search. They have an impact. You can fiddle with the way you fuse scores when you're using like a re ranking neural network model or switching which neural network model you're using for re ranking. But actually, you know, in our initial experiments, at least from our current, our original stack, the embedding models were by far the strongest signal we had on impacting quality. And there were new ones coming out which we were benchmarking and seeing, oh, these ones are better. And so we were kind of planning, oh, can we switch to this model? You know, is the licensing right? Is the, is the branding right? You know, Snowflake unfortunately has to guard both its brand and meet all these license requirements. And so that we sort of, this idea came up of like, well, it's an open source, you know, publication, big paper trail, maybe we can just implement one of these. And so that was the first step was just could we reproduce some of the existing prior work. The benefit that we had was as kind of an engineering team trying to actually power products. We were able to focus a lot more on the details of the implementation and less about the novelty or other kind of science publication focused context. And that actually ended up making it so that when we went forward, both this data assets we had from niva, a lot of these careful evaluations, we kind of went out with our 1.0 models, kind of went out strong. So that's, that's, you know, then that kicked off kind of a trajectory of like, okay, well now we're onto something. So that's been, I guess the history from, from very early on we kind of realized like where Snowflake is really trying to make its money is more this managed service. And so, you know, if we're doing this great work, we're benefiting so much from open source, open source pre trained models. Start the model. You know, as our initial point, we're doing all sorts of, you know, literature review behavior and you know, keeping up with everyone else's great work out there in the open. Why don't we also participate in this? And so that's kind of where the, you know, and that became arctic embed the, you know, open source models, which are the same models we serve inside of Snowflake and that actually has a nice benefit too because now the community really trusts us and feels less locked in and has all these other benefits too if they're using our managed product. And also it has to benefit that everyone who is not the right product for can also benefit from these models for retrieval. And we're seeing some of that. We're very excited with what we've is doing with these models, for example."
},
{
"speaker": "Speaker B",
"content": "Awesome. I think, yeah, I think we have a similar focus. I love that whole emphasis on kind of reproducing the training recipe and then how that leads into all these research insights and as later in the podcast we'll dive into all these cool things you've discovered. But I think maybe I want to pass it to Charles to kind of keep discussing this. Like, yeah, like I think weaviate. We're also kind of coming into a similar strategy of trying to move up the. Move up and around the set stack sort of to help more with these managed search solutions. Charles, I mean, to put you on top of it."
},
{
"speaker": "Speaker D",
"content": "Yeah, no, no thanks Hunter. I think, yeah, like in our own way it's something similar but also kind of different in the sense that, you know, Web8 had this existing modules integration kind of system for four years now and it's really popular. And one thing we see is that our users love those integrations. But Also maybe some third party APIs have limitations and usually those limitations are based off of things like rate limits and kind of just functional things like, oh, I have to bring a new API key into my setup. So one of the questions we asked is how can we make this process easier? So we kind of had this idea to just spin her up, spin up our own embedding service that we can kind of ensure is able to like scale with with our users and what. And what they need. So that's where that came from. And then, you know, our initial set of models that we landed on was Arctic Embed, just because I felt like those models really gave the best bang for buck, you know, if that's if you want to how you want to kind of see it in terms of parameter count to performance as well as all the nice little features on top like mrl, like Matryushka representation learning and quantization and really, really good kind of recall kind of maintenance even as you're reducing the dimensions down. So it just, it felt very kind of how you described it, Luke, in terms of your use case. The model feels very production oriented and you can kind of. It kind of shows that the Team uses it internally because it has all of these nice features and like you actually did, you, you said something there which was like the community trust and that, that like links. That really kind of struck me because I've been looking at MTAB a lot recently and, and the benchmark and I feel like I've had some conversations where I feel like there, there might be some doubts about certain models there in terms of how good is that score and reflective are those average scores. Usually if I see a model with 7 billion parameters, I have a little bit of skepticism of how good that might be in production. And I think when I hear teams like your own saying that they're using these models in production, it creates that level of trust. I like to say know your provider as well as just the model and its score because if you know the provider is using that model that you know they're not going to use a model that for one reason or the other might be artificially doing better on like an average score, but actually in a production scenario, then it might actually turn out there might have been some like eval data leakage or something like that."
},
{
"speaker": "Speaker A",
"content": "Oh man, I have, I have so many hot takes regarding all of these topics. I'm happy to share a couple, a couple thoughts that I think might be interesting to folks. Listening is, you know, the MTAB benchmark is an incredibly useful tool. You know, and the beer benchmark, it kind of evolved out of. We focus very heavily on retrieval. So I know it's kind of unfortunate a lot of people kind of average in all these other tasks, even if their use case is retrieval. It kind of makes it harder for us to have that conversation. But it is really useful to see these other dimensions. And you know, actually we like when we go, I mean we don't like when our ranking drops a ton by averaging. All these other tasks we're not focused on not training for, but we like being able to see, you know, knowing that this model for some sort of, you know, we trained a linear classifier on top is maybe not as fit for purpose without further fine tuning than, you know, another option off the shelf. It's, you know, it's useful to get that knowledge and from like a scientific perspective especially, but it's also, you know, if your goal is to get to the top of the leaderboard, I mean the, the, the most terrible thing you could do is just train on the evaluation set. Nobody is stopping you. It's public data. But also the original purpose of the beer benchmark was to have a bunch of different domains and prove that a lot of these papers that neural networks were showing really promising results in. And Martin, I think you're more of a research than expert than I, so you can probably correct me or give some more color here. It was not meant to be something oh, I trained on each one of these data sets and now I can solve these data sets. It was how much can I generalize across different things? And now what you see is unfortunately all these data sets like are very, have good training sets. You know, you have Ms. Marco and Natural Questions and Hotpot qa and I think even these other ones that have very small training sets, people are still using them. Maybe not because it makes the model better in generality, but, but it definitely is better on what they're measuring. And it's, you know, it's hard to know is including the, you know, Phi QA small training set improving my model outside of Phi QA evaluation, but it certainly gives you a higher score. And so then when for those of us who are kind of scrolling the results, it's like it's difficult to know exactly. Certainly the models that are 10 points higher than others are probably onto something. But within the ones that are one or two points away from one another also it's kind of sometimes apples to oranges because somebody has written a data set dependent prompt that's been very well tuned. They haven't fine tuned on that data set necessarily. Maybe they haven't trained on its case, but they've done some amount of fitting that model to that data set that requires more work in practice than just simply wiring up the model as an API or something. You're going to have to create a dataset dependence, you know, use. And so those two paradigms being compared on the same benchmark is kind of a, you know, interesting situation. And for folks who are just trying to get one number just quickly scroll it's unfortunately, you know, a little bit more nuanced than that. But we love, I mean we've, we've honestly we've made a lot of benchmarks of our own. 
We use a lot of benchmarks of our own but we keep coming back to you know, all these beer data sets, semtep retrieval because it is really a great span of different domains and it is very useful tool. So you know, nothing, nothing here is to besmirch that as a great and useful tool. But you know, any tool is only as good as you use it, you know, properly and for its purpose. And that's kind of an interesting case There, Martin, did you have anything to add there? I know you have. Martin's got like, you know, a bajillion years of experience in the, in the academic domain as well. And so he's seen the, you know, the creation of these things."
},
{
"speaker": "Speaker C",
"content": "I mean, I mean I think we all get to the like the clay versus miracle sort of l tracing topic on the like the multilingual part. So get to that."
},
{
"speaker": "Speaker B",
"content": "Oh yeah. Awesome. Yeah. So, so many cool things here. Yeah. Are we really measuring generalization with these benchmarks? I think that topic is always around, especially with the language models and many models as well. And yeah, I think the beer data set has been awesome. I've seen the FIKA data set versus like Psydox and all these different things and miracle data set that you use in the 2.0 paper. I think that's another amazing paper. There's also latte. But anyway, so yeah, so I'd love to just kind of tour through all these things. Luke, in your opening you mentioned like this recipe, all the things that go into training and embedding model. And so I think it would be really awesome if in this podcast we could just kind of like take that apart and I think make it more accessible to people would be something that could be really cool. So I think maybe the first thing, the thing that most people think is super unaccessible would be pre training models broadly. So could we kick it off with your experience with pre training embedding models?"
},
{
"speaker": "Speaker A",
"content": "Yeah, I mean we came into it, we were reading a lot of prior works And I think E5 was the work out of Microsoft Research Asia which kind of put this new paradigm where it's not just pre training token level tasks to get kind of this token level knowledge, world knowledge. We actually want to do a contrastive objective for very large scale, large number of examples leveraging web scale data. And so that paper was I believe, one of the first to really clearly show and market and leverage this fact that if you use large scale data in putting on your information retrieval contrast of learning hat it can be very powerful. Because before that a lot of. And even after it, I mean you have examples, especially with these 7 billion parameter models, folks are actually not doing as much large scale training. I believe that the E5 misrel paper was out of the same group as the E5 original paper and they're showing a very different way works too. And I think that's, it's very cool. It's very cool to see all these researchers playing in this space and how much Work is now going into the field of embedding models and in the open too where we can all collaborate. But what we have found is that especially you know, as Charles, you were mentioning with these production minded workloads where it really, you know, 20 times smaller model is actually sometimes 20 times better. From your latency and throughput and cost perspective, maybe we don't want to, you know, take the approach which might be a little bit less data for the training side. We want to take the approach that's going to be better for our customers and users. So we're looking at know how much can we pack in scaling up the data so that we don't have to scale up the model parameters to get the same level of quality. And this leads us kind of to this approach of looking at the prior art. A lot of web scale associations. Basically these models learn to associate some question with some answer. And so you want data sets that reflect that. Oftentimes people are using the title of an article compared with the body of the article. But one of the things our insight has been is not just scaling up to very large scale, which is I believe kind of the idea that if I really put forward, we go to the common crawl, we get a bunch of this data. But also remembering that the data you're using really should reflect the behavior you want the model to be gaining. And you see this happening in all sorts of fields. You now see the DCLM data comp competition and methodologies. You have red pajama, you have fineweb. In fact we have fineweb. I was just talking to the fineweb folks at Neurips here at their poster and they're talking about, well actually embedding models turn out to be a better base model for training filters. It kind of all everything's feeding itself. Embedding models are actually becoming used in data filtering exercises as well. We use them for filtering our embedding data. But getting this quality of when this query goes with this result that actually is really showing the type of association you want the model to have to to perform well at actual search. And also when you're saying okay, all these other queries don't go, or all these other results don't go with your query, which is what we do in pre training. We do this by just taking very large batch size. 
We do a lot of effort engineering wise to use a lot of GPUs, not necessarily to make training go even faster and faster, but because it enables us to actually fit more negative examples in the same batch and do this contrast where we're contrasting the association between these two go together. And then this question doesn't go with all of these other documents. You kind of get more signal in your. Basically your label is better the larger your batch size is. That's another thing that we didn't discover that but we've definitely replicated Results published by E5 and BGE and showing larger batch sizes at least up to a point scaling from 4,000 to 8,000 to 16,000. We've seen this kind of creeping up in terms of quality. So those are the things we kind of think a lot about quality and train the same type of objectives people were training before. But rather than using, you know, handpicked negative examples, we just kind of trust on scale. We also don't always just trust on skill. We, we've put out some work again of our, our dick embed paper was, I mean maybe the first one to quickly show or clearly show with a. And call out with a. Of ablation study how much, you know, sampling each batch of contrasted training from its own data set could improve. And so we find that, you know, there are things that can be done and it's an interesting research frontier to how much better can you make the negatives? We've, we've played around with it. We've tried clustering, we've tried other tricks and we have another preprint of kind of exploring the early stages from this clustering approach. But anyhow, getting at the high level. And you mentioned accessible Connor, I think the tried and true ideas are can you use web scale data but filter it down or process it or you know, synthetically generate better questions. These are, you know, whatever technique you use is not so important as the principle is. We want both large in scale and we also want, you know, the actual behavior that we're desiring to do demonstrated. So we want to contrast not just between completely irrelevant negatives and a very relevant positive. We want to have hard negatives showing up in our batch, make the batch big enough that there's an interesting contrast being shown and make sure that the positive actually is a positive and get rid of things like web crawl failures. Nobody wants to teach the model to associate every question with the phrase lorem Ipsum blah blah blah. But that happens all the time in this web scale data. You have all of these associations between some title and some placeholder text because not, not everyone's done on the Internet, you know, when you scraped it or when the Common Crawl foundation scraped it or whoever you're using for it and NEVA and all these other data sets that we have access to as well. So yeah, that's kind of been our philosophy. And the actual mechanism is very similar to, you know, for anyone familiar with training, a contrast of the trained, you know, model is very similar in terms of the mathematics and all that. It's just we kind of scale up, we'll shrink, we shrink the sequence length if we have to, to scale up and, and that's, you know, how we, how we perform here."
},
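As a rough illustration of the large-batch contrastive objective described above (each query is contrasted against its paired document, with every other document in the batch acting as a negative), here is a minimal PyTorch-style sketch. The function name, tensor shapes, and temperature value are illustrative assumptions, not the Arctic Embed training code.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.02):
    """InfoNCE-style loss with in-batch negatives.

    query_emb, doc_emb: [batch_size, dim] L2-normalized embeddings, where
    row i of doc_emb is the positive document for row i of query_emb.
    """
    # Score every query against every document in the batch.
    logits = query_emb @ doc_emb.T / temperature        # [B, B]
    # The correct (positive) document for query i sits on the diagonal.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# The larger the batch, the more in-batch negatives each query sees per step,
# which is the motivation for pushing batch sizes from 4k toward 32k.
```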
{
"speaker": "Speaker B",
"content": "Yeah, amazing. Well, I learned, I picked up so many details from reading your papers and in preparation for this podcast and it was so fun just hearing all that your tour you just went through touched all these awesome topics. So I have a few questions, maybe ask 1 and then pass it back to Charles. I wanted to kind of touching on this theme of accessibility really quick before maybe coming back into this source stratification, synthetic queries, hard negative, all that cool stuff. I wanted to just kind of zone in on, on the cost of this. So I understand that you scale up to batch size of 32,000, you know, and then the multiple to the 768 and then you do this data parallel and you have say 32h100. And so I was doing a little bit of like napkin math with Modal's pricing and so my math kind of came to. This would cost about $2,000 to do this kind of pre training at that particular size. Is that like ballpark what that kind of stage costs?"
},
{
"speaker": "Speaker A",
"content": "Yeah, that's a great question in terms of cost because you know, the, the way that a lot of players like Snowflake work is, you know, we actually signed three year agreements on H1 hundreds when, when they were, you know, restricted and other stuff like that. I actually don't know how much I am supposed to publicly talk about. But, but the way a lot of companies like Snowflake work is, is they, they, you know, work very closely with, you know, providers on a, on a much more like large scale level. So where a lot of folks who are thinking about doing these types of research are, you know, in the space of. Okay, and let me rent this by the hour. I have a, you know, a sense of like what the per hour pricing is. If you're in an organization that has, you know, compute resources and is doing all sorts of other things with GPUs, it's a little bit different where you can kind of. Or if you're with an academic institution that uses Slurm or something like that, it's a different mental model. Fortunately, a lot of people are doing a lot of really large scale work with very expensive GPUs. And if you're, you know, around them and, and you know, you can queue a job up that runs between their jobs, it, it doesn't really have the same cost to the overall organization. And so fortunately that's kind of a paradigm. A lot of us, we've, we've had periods in time where we're very much, you know, thoughtful about the type of ablation studies or experiments we're doing. We've had times where it sort of doesn't even make sense to think about optimizing our code or something because certain resources free up for a couple days. You know, the kind of the human cadence and the machine cadence line up, you know, differently with other projects. But I think, I think you're right. I mean H1 hundreds are often about, you know, a dollar to an hour per one. So you're looking at, you know, 30, $50 an hour cost for kind of the full set of machines that we would use for one of these longer jobs. And so, and if you're using these for, you know, multiple days, it does get into that type of price range. Conor, I believe your math is pretty much on, on point, but yeah, it is, it is something that we think about this when we're working with the open source community. Right. Because folks really want to read the report and train the model. They don't want to just oh, I guess I can fine tune it. It is more accessible. And I would also say like you can get so close, you know, we scale up because machines are sitting here, all that. But I think a lot of our work has shown you don't need 7 billion parameters, you don't need, you know, you don't need a billion question answer pairs. If you look at the Arctic 2 technical report, you know, a lot of Martin's really good recent work, which maybe you should speak to that Martin, you know, a question to you is like, how much do you think you can get away with with a shorter pre training run? Where you know, is there, is there? Look, if you were to make Arctic to, you know, cheap edition, you know, how good do you feel like? I think, I think it would be pretty good. I don't know. What do you think?"
},
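For readers following along with the napkin math, here is a hedged back-of-the-envelope sketch in Python. The rates and run length are assumptions chosen to roughly match the figures mentioned above, not actual contract pricing.

```python
# Illustrative cost estimate for a multi-day, multi-GPU pre-training run.
num_gpus = 32              # e.g. 32 H100s in data parallel (assumed)
usd_per_gpu_hour = 1.25    # assumed rental rate per H100, in the range discussed
run_hours = 48             # assumed ~2-day pre-training run

total_cost = num_gpus * usd_per_gpu_hour * run_hours
print(f"~${total_cost:,.0f}")   # ~$1,920 -- in the ballpark of the $2,000 estimate
```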
{
"speaker": "Speaker C",
"content": "Yeah, I think on languages that we target, I think longer training actually brings a little bit improvement. But for languages that we didn't train for, actually prolonged training actually hurts the language that we didn't target like Chinese, Japanese, so there's a trade off there. Yeah. But I think we can probably cut pre training in half and still get pretty decent results. That's my feeling."
},
{
"speaker": "Speaker A",
"content": "Yeah, there's a great plot and it was all Martin's work in this in the Arctic 2.0 Technical Report, which is up on Arxiv, kind of showing over training what the downstream. If you fine tune that checkpoint and you evaluate it, you evaluate on each language of Miracle, it's very interesting to see that what Martin was talking about is some of these other languages actually lose Miracle score. And so if you just care about the getting the high score in Miracle, the best checkpoint to fine tune and ship is actually as Arctic EMIT 2.0 is probably not the final one, which we used in terms of just miracle score because that includes a bunch of these other languages where there might have been a slight bit of negative transfer between training on a lot of these heavily European languages. Again, this is a kind of product focus. Snowflake has a lot of multilingual customers in Europe and so we wanted to make sure we really treated a handful of languages very well. Those are also where we had good data. And then it turns out that that may have a negative impact on say, Indonesian search result quality on the Miracle benchmark. Now another good question is, okay, Miracle comes with its own training set. Is that general? You know, it's, it's always a, you should measure on your own problem type situation. But if for those out there who want to do research in the space, we, our team has already given up on chasing the number one spot and trying to unseat NV retriever research project that are doing very interesting work trying to push the frontier of what's the most high score you can get on MTAB retrieval. Because we're more interested in, you know, something that's useful and something that's maybe even simpler. Don't need, you know, dataset, dependent prompt or all this stuff. And, and so, you know, if you're willing to also take that same mindset as us, you can also extend that mindset further and cut, cut your training budget and you know, probably for $200 of pre training, you know, on many fewer, you know, eight. We do a lot of experiments just, you know, with one machine because it's easier and you can fit eight H1 hundreds on that. You can rent that for, you know, $10 an hour from a lot of providers now or getting towards that. And it's only going to get cheaper, it's only going to be easier and, and and then you know, we're also seeing data quality is more important than scale. So there's probably stuff we should be doing. Our DICT3 might be trained less. Who knows."
},
{
"speaker": "Speaker B",
"content": "Yeah, also. Well, yeah, so I asked that question not to kind of like you know, probe particularly, like particularly how much money did you spend on these models just to kind of highlight this trend and like, you know, the cost of pre training is definitely seems to be going down and then especially with the betting models and yeah, I think there's all sorts of interesting things. So Charles, I want to pass it to you if you, if you want to ask a question on the pre training before I do have, I have a couple more but I don't want to hog the."
},
{
"speaker": "Speaker D",
"content": "Yeah, thank you Connor. Yeah, I have tons of questions but I think there's kind of like two things that kind of came to mind. The first is around pre training and kind of like what are specifically what are like the similarities and then dis similarities between let's say pre training, pre training embedding model versus an LLM. Because in, in LLMs for example you have this kind of pre training stage that is very typical but typically training is like multi stage. And then you follow that up with kind of a reduced, like a reduced training stage where it might be like or HLF or something like that where it's very task specific. And when we were talking about pre training with contrasted learning for embedding models, you kind of trust the negatives. You trust that the negatives kind of like naturally emerge and that the kind of density of the data will kind of prove that out. But do embedding pipelines, even for just generalist based models benefit from multi stage training where you, you do introduce hard negatives later on in the training pipeline?"
},
{
"speaker": "Speaker A",
"content": "Yeah. Wow, that's a great question. Covering a bunch of stages of the training. I can say from my perspective it's a little different of a perspective on the thing because we're looking at it from not so much comparing to large language model training as comparing to other ways that folks have trained embedding models. And you know, before E5 had really popularized this multi stage process, there was a big focus actually on getting the right, a lot of papers doing distillation or teacher or you know, using models that aren't embedding models or systems that aren't embedding to teach embedding with very like small sets of data that have been, you know, labeled by a re ranking model and this large batch size, large data scale wasn't thing there and it's incredibly helpful and it's maybe the most helpful part of the training and we still definitely have a fine tuning stage and we've also done some experiments with the language model pre training. You can even take a randomly initialized BERT model, train it just for text embedding with no mass language modeling objective and you can get good vectors out. You don't get great vectors out, you're not going to be scoring up on the MTAB leaderboard. But it's interesting that actually it feels like the closer you get to the final stage of really shaping that embedding space to make fine grained relevance judgments that get good. You know, if you're measuring an NDCG at 10, that's you know, a positionally aware, you know, kind of ranking aware metric as well as just recall and retrieval from, you know, a giant haystack, find your needle, push it towards the top. These later stages especially become more important and this is fine tuning and data quality. Negative quality is something that, you know, is maybe the most important thing we drive towards. So I don't know if that was maybe. Charles, I think you asked a couple very interesting questions in there, but I don't know, hopefully that touched on a little bit of it from."
},
{
"speaker": "Speaker D",
"content": "Yeah, no, I think that that definitely kind of gets down to. Yeah. What the core question was and the thinking I had around it. And just another, and correct me if I'm wrong on this from how I understood the kind of the report in the paper, but with, with the Arctic embed series in particular, was MRL part of that initial pre training stage which is maybe counter to some models out there, the MRL is introduced just kind of in a, in a fine tuning stage. So because I know, let's say with Arctic 2.0 large you have 98% recall at 256 dimensions versus 768 and that's substantially, that's a substantial kind of maintenance of kind of performance with such a kind of, you know, a 3x reduction in dimensionality. So was that, was that that, was that a feature of the training?"
},
{
"speaker": "Speaker A",
"content": "I think so. We've done. I mean Martin, you can, if you have deep insight or deep conviction, please shout out. I'm not as confident that I can tell you. I mean, I think the interesting part of MRL is it working in the first place is sort of, it's not always the most intuitive or it feels kind of wonderful and interesting. But things that are wonderful and interesting, often you don't necessarily speak with as much conviction about. Exactly. I know how that works because otherwise you wouldn't have found it wonderful that it works in the first place. So that disclaimer aside, I think we do feel that fine tuning some behaviors into models doesn't always work as well as kind of making that from day one. The behavior we've tried throughout the experiments of months of training embedding models we've done things like try different loss functions but try just on the fine tuning. It's much easier to make a change than just fine tuning and tweak it and run with it. And adding MRL to the fine tuning definitely worked well from a perspective of it's a lot better than not adding MRL and fine tuning and just chunking your embedding vector down. But what we found with some experiments is we weren't able to necessarily see a worse full dimensionality score from adding more MRL earlier and we were able to see a better cut down score. And this lines up with. We've talked you all I think on the weba podcast have had Aditya, the lead author on MRL paper on and talked about it. I think he is pointed out at times that you know in his opinion or experience as well like including this objective in your pre training step can lead to less like quality degradation on, on that or, or just be a better, a better result. And I think it sort of intuitively makes sense where you have this. You're suddenly introducing a negative transfer where the, the model is being taught to kind of spread the information out across the whole vector for this very large training thing. When it's seeing all these different examples and then only on a smaller subset of its training you're trying to, to push it to do a different behavior and it may not be that easy to push it at that point. It would be I think a very interesting paper or research topic to go in and really like revisit. You know, I mean I would just love to get more of a theory of why MRL is working or what MRL is really doing. I think, I think it's, it's kind of nice that everyone's using it and it's very useful tool but it'd be cool to see a little more principle about that. Um, but yeah we, we definitely found empirically that it was not harmful to add it to our earlier stages of training. Um and, and you know luckily because we're not you know we're, we're prioritizing easy to maintain code bases and engineering and kind of product minded like it Wasn't a big lift for us to do that. And I do think that that is a, is a benefit. And that is my guess from reading things like, you know, the Google Gecko paper as to why some of these other models. A lot of what we do is trying to reproduce why other people have gotten worse results. We did a lot of this with 2.0 Martin. You can talk. Martin tried so many things and I worked with him and we scratched our heads for a long time. Why are there so many two model families with multilingual models? Because as we started adding more multilingual data to our training, our English was just fine and our English recalls retrieval was not suffering. And so we asked these questions of why does Google gecko only retain 94%? 
And that's actually one of the better numbers compared to, you know, some of these other things on MTAP retrieval. Like is it because they have a slightly larger full dimension? Is it because they train differently? And I don't know, they didn't publish a, they didn't do a version that got 98%, a version that got 94 and say we chose 94 because X. So we kind of have to scratch our heads on that. But yeah, we definitely. I think one of the things is like we care a lot about 256 dimension embedding vectors because we find that again like 300 million parameter model instead of a 7 billion parameter model. That 20x reduction is by far the one that most people want to make in practice. And similarly a 4x smaller vector that we're getting like with our Arctic 2.0 large from 1024 to 256 with 98% the quality. That's a trade off that a lot of people will want to make in practice. In fact, people are doing much more quality degrading things like fully binarizing vectors or, or using product quantization, which tends to have, especially on NDCG at 10 metrics, a little bit more degradation in some of our experiments. You know, if folks are already doing this to reach the scale they want, you know, then that's a signal that maybe we should be focused on, on prioritizing the quality at efficiency rather than just quality at no at no matter what the cost is. And so that's kind of. Maybe it's because we just didn't downweight it. There could be, I mean some of the things we did try experimenting with but have kind of stopped doing as many ablation studies on is like how much better would your full dimensionality performance be if you didn't train with mrl, the, the kind of stated gospel is that it has no negative effect to introduce MRO loss. And in a lot of training runs it has been seen that way. But especially earlier on, like our initial model we hadn't added compression, wasn't on our roadmap in the very first 1.0 model to be released. Going to to 1.5, a couple experiments in there. I did notice it's like is it noise in the eval or is it actually that, you know, the final model is not as good as when I turned this setting off and reran the training and it could be, you know, nuance of how we train it, et cetera. But that's something that I mean maybe other folks have done more looking at and maybe they're actually making a different trade off where they're not accepting any degradation in their full quality. Maybe they're chasing the leaderboard score a little harder. You know, it's, it's, it depends on what you're trying to do."
},
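A minimal sketch of how a Matryoshka-style (MRL) objective can be layered onto the same contrastive loss during pre-training, as discussed above: the loss is computed on nested prefixes of the embedding and averaged, so truncated vectors remain useful on their own. The prefix sizes and temperature are assumptions for illustration, not the exact Arctic Embed configuration.

```python
import torch
import torch.nn.functional as F

def matryoshka_contrastive_loss(query_emb, doc_emb, dims=(256, 512, 1024),
                                temperature=0.02):
    """Average the in-batch contrastive loss over nested embedding prefixes."""
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    total = 0.0
    for d in dims:
        # Re-normalize each truncated prefix before computing similarities.
        q = F.normalize(query_emb[:, :d], dim=-1)
        c = F.normalize(doc_emb[:, :d], dim=-1)
        logits = q @ c.T / temperature
        total = total + F.cross_entropy(logits, labels)
    return total / len(dims)
```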
{
"speaker": "Speaker C",
"content": "Yeah. I might add that during development of Arctic impact 2.0 there was one stage where we only had about 30% of the pre training we ended up doing and then MRL doing fine tuning and then we actually observed the MRL quality is not that good and then everybody went into panic mode for a week and then actually we end up training pre training more and then the MRL got better. So if you're looking for a reason to do more pre training steps, I guess that's another reason."
},
{
"speaker": "Speaker A",
"content": "Yeah, that's actually a really interesting experiment that we should probably do because we have, we have the two training approach is to look at how much quality degradation you experience when you do almost no pre training or a lot of pre training or more and more pre training. Because I think the point, the kind of nuance part in highlighting is we do a very long pre training step with lots of examples and that's different than doing, you know, adding MRL for all of that is different from adding MRL for a pre, a full pre training step that is shorter. Right. The longer you, you've, you know, trained the model with mrl. And you know another thing that would be interesting is, is the mass language modeling objective that gets trained. We, we use a lot of off the shelf models because folks are doing lots of great work to make them great. And, and you know, we frankly don't feel like we have as much to add there at the moment. But those models are trained with MRL and it would be Interesting. If somebody did the mass language modeling objective, you could do the same thing, right? Use the first 256 dimension of that final vector when you're doing your predicting is this masked or not operation? And I believe that might have another benefit for the compressibility of these models because nobody using BERT is thinking, oh man, I want the final contextualized token embeddings to be really small. But, but if they did, you know, because you know that down this, down the line you're going to train a pooling of those to, to have this effect. You know, that might be something that could shake out interestingly and it kind of aligns with you do enough pre training, you don't even need the MLM model to do decent at retrieval and get, you know, close to some of the early five numbers with our, you know, much improved final later pipeline. So yeah, interesting free, free paper ideas like crazy are coming out here in this discussion. If you're in this space and you want to, you know, rent some GPUs and spend a few hundred dollars trying to run some pre training. Highly recommend checking out what it looks like if you do MRL during pre training, but different lengths of pre training. Very interesting topic."
},
{
"speaker": "Speaker B",
"content": "Yeah, I'd say it kind of maybe like in a less targeted way kind of like dropout or stochastic path dropout kind of have like an analog or these mixture of experts models that kind of have a similar analog. And Charles, something you said at the weaviate embedding service webinar yesterday that I really liked was about thinking about it like resolutions, I think about like embedding resolutions and I don't know about you guys, maybe this will resonate as well. But when I'm training a model, maybe it's an auto encoder and embedding model. I just never can decide what like what am I going to make the latent dimension is going to be 128 to 56 is kind of like just, you know, a toss up. And so this way of balancing it all at the same time with this MRL objective is amazing. Charles, maybe I'd love to ask you more about, you know, asking you about that embedding resolution analogy and how alleviate like, you know, because we have these different granularities for your vectors and we know how many vectors you have, we can quantize it like we can like compute the optimal size for you to say fit it in memory, any of these kind of things."
},
{
"speaker": "Speaker D",
"content": "Yeah, I, I think it kind of. Well first like the, the Resolution, resolution analogy just chimes better with my brain and that like I like to think of like the full embedding dimension as like 4K and you know, like with like the models that go all the way down, that's like 144p. But the magic is still there to mean that like it's especially with pre training that effectively the dimensions kind of learn to organize themselves where they like all contribute to the end prediction but at different resolutions. But yeah, that's just how my brain thinks about that. But really what to answer your question, Connor, what I think is just very important about MRL is that with these embedding models, they're not being used in a vacuum. People aren't just producing embeddings, being like, wow, look at the results. You know, they're, they're being used in downstream systems, they're being used in information retrieval databases. So when you've got something like with weaviate that has a HNSW vector cache where all your vectors need to be stored in memory and you've got a billion objects to search over, well then a 3x reduction in that cache size is very, very, very, very significant from a business, like a business and operational point of view. So yeah, just to see people focusing on MRL to me is really important and like, you know, I'm only sure like techniques are going to improve over the next couple of years and kind of maintaining as much quality like you said, Luke, you know, the papers that might emerge in this as to why this is happening might even be more, more interesting. Like, you know, why are we even training at full, full dimensions and all these types of things?"
},
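To make the operational point concrete, here is a rough memory estimate for an in-memory vector cache like the one described above. The object count is an illustrative assumption.

```python
# Approximate RAM needed to hold float32 vectors in memory.
num_vectors = 1_000_000_000      # e.g. a billion objects (assumed)
bytes_per_float = 4

for dims in (768, 256):
    gib = num_vectors * dims * bytes_per_float / 2**30
    print(f"{dims} dims: ~{gib:,.0f} GiB")
# 768 dims: ~2,861 GiB vs 256 dims: ~954 GiB -- roughly the 3x saving discussed.
```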
{
"speaker": "Speaker A",
"content": "So yeah, yeah, it feels like somebody who just a PhD in information theory should be able to come in here and go, of course that works, you know, and, and, and why, you know, and, and I just, I want that person to kind of come into my, into my field and, and, and lay it all out. Because if, I mean we do some of these experiments where we try binarization loss or we try MRL and we try truncation and then suddenly you get to too few bytes and no matter which approach you're doing with quantization or binarization or all this, you start to get degradation and it's like, well of course, like from. If you had zero bits of information, there's no way you could sort through all these vectors. We know that the zero point is very obvious. And empirically, as you get up as many as you want, 4,000 dimensions, float 32 precision we're getting good retrieval, some point between the two. It makes sense that it would degrade, but how it degrades and what the actual amount of information you need. I think the resolution, imagining TV with four pixels, you can't get a lot out of it. But 144p is enough to kind of get a sense of what some of the major objects are. You can't get the fine details. But if you're trying to classify what's playing, you know, you don't need that many that that high resolution to kind of get the gist of it. And we see this, our recall numbers don't degrade as much as our like NDCG at 10 numbers. And that's showing us that like the right answers are still shuffling up towards the top. It just seems more of like a noisy. If you're, if you're imagining that your embedding vector is giving a total ranking over, you know, all of it, because that's what it is doing. Especially if you do a not indexed, but just a flat operation. More conceptually simple, sometimes you're getting a total ordering of which documents are most relevant to this query. The compression can almost be thought of as adding noise the same way that like losing resolution TV sort of looks similar to adding noise to the tv. It's very special type of noise. But in that kind of intuitive visual sense, it makes sense empirically that we see, okay, you know, the right answer is still getting into the top thousand. It's just not always getting into the top ten. You know, it's like there's some kind of shuffling. But the scale of the noise, the scale of it is at the point where you can still tell what's on the TV when you're watching 144p and you, you can still tell that, you know, this topic is in the top 0.01 percentile of relevant items, even if you're not sure if it's very top or middle of the top. And so the challenge, I think that you were mentioning, Charles, is that you still need to keep things in memory if you want good retrieval performance. Because unless you know which one you want to do the full precision at, you still need it there warm, ready to pull it in order to serve your query. And so there's this catch 22 of even if you're using the most sophisticated HNSW implementations and you're tuning the heck out of your retrieval system at some point, you still need that, that full vector and you need it that you don't Know which full vector you want or which you know, thing to pull. And so you, you still want to keep a lot of stuff in memory. And that's where it becomes so valuable to do this compression. Even compression levels where you, you know, maybe you do have, for retrieval performance, you have multiple cascading levels of highly compressed. 
You, you get to the, you know, 0.1% of your data really fast and then you exhaustively scan that. Some, some types of indexing are like this too. But you know, when you're doing this type of work, you can, you can play all you want, but at some point you're still paying your server, your server provider for the RAM up to a point. And so if you can get quality good enough, that's been kind of, our philosophy is like you can take, if you were just willing to accept 1 or 2% degradation, you already accepted that going from 7 billion parameters to 100 million, you're going to accept it again going from, you know, 3,000 dimensions down to 200. Maybe, you know, maybe we can just save us so much money and headache and all of that. And, and then when you're in that playground, then you can just focus on improving qual and it's actually very fast to iterate. You know, your evals go four times faster sometimes because 1 4th dimensionality, you know, your training goes 20 times faster, you know, or for the same data, you know, maybe not for the same quality scaling laws and all that, but it's kind of an interesting, interesting philosophy where once you embrace that, it can actually help some of the science as well."
},
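A minimal sketch of the cascading pattern described here: a cheap first pass over truncated (compressed) vectors to get a small shortlist, then exhaustive rescoring of that shortlist with the full-precision vectors. This is a brute-force NumPy version purely for illustration; real systems would pair this with an ANN index such as HNSW.

```python
import numpy as np

def search_with_rescore(query, full_vecs, truncated_dims=256,
                        shortlist=1000, k=10):
    """Two-stage retrieval: coarse scan on truncated vectors, exact rescore on a shortlist.

    query: [dim], full_vecs: [N, dim]; vectors assumed L2-normalized.
    """
    # Stage 1: cheap scores using only the first `truncated_dims` dimensions.
    # (With Matryoshka-style embeddings the prefix is meaningful on its own;
    # in practice you may re-normalize the truncated vectors.)
    coarse = full_vecs[:, :truncated_dims] @ query[:truncated_dims]
    candidates = np.argpartition(-coarse, shortlist)[:shortlist]
    # Stage 2: exact full-dimension dot products on the small candidate set only.
    exact = full_vecs[candidates] @ query
    return candidates[np.argsort(-exact)[:k]]
```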
{
"speaker": "Speaker B",
"content": "Yeah. Wow, so many cool nuggets in there. I think with web we had a lot of luck or like success by rescoring from disk with quantized vectors and yes, doing exactly as you're saying with these. Matrioshka. Yeah, so, so interesting. So we are pretty deep in the podcast, but I did, I did want to ask just one more pre training question. So sorry to be on this one for a bit. I think because we've touched on MRL so much, maybe we can shave a bit off of the architecture discussion. But I, I really wanted to dive further into source stratification. And Luke, you've also published a paper that for our listeners will be linked to the description about clustering your data to do this source stratification. I think this is just such a powerful idea because yeah, maybe if you just want to take the mic on. Yeah, like what does this mean to have, you know, make sure you're in batch of negatives all come from the same source. And then maybe how we can do this in an unsupervised way going forward."
},
{
"speaker": "Speaker A",
"content": "Yeah, I mean, it's interesting. I, I started into this kind of on an intuitive direction of like what, what would be better? I was writing a data loader. I'm trying to decide do I, do I load data across sources? Because currently they're in, you know, parquet files or wherever I'm storing them kind of by source. My data library is organized now. I'm trying to create the data format for loading it. And I was thinking, I was like, what would be a better batch? Well, maybe I should try both or maybe I should do the one that's intuitively better. Both intuition and empirical result aligned that if you have one data set that's from Wikipedia, your questions are all fact based questions about Wikipedia and you have some data set on a script from Reddit or something. I think as an example, E5 use a lot, you know, Reddit relationship advice topics are not as useful as negatives to questions about, you know, European history found on, on Reddit as on Wikipedia, sorry as other Wikipedia articles that might have more European history topics more prevalently. And so you kind of get this for free. You know, this is, you know, you are aware of this fine tuning step when you're doing this, or I am. And it intuitively makes sense that this is sort of bringing the intuition of that, that you know, maybe a lot of these negatives especially what we see with training too is when you turn the stratification on, the very first steps of training don't improve. So it does seem that like the model does need to learn to, you know, disambiguate relationship advice from European history at some point. But it might take two or three batches and then it's got it, it's kind of got that part. And so from that point on, those negative examples are just kind of worthless. And you can see it in the math. If the score of you know that you're negative is so much lower than the score of your positive and the loss function, it will contribute nothing to the loss, nothing to the gradient. Your, your training will literally from a kind of a kind of theoretical point of view. I believe the ance paper out of Microsoft Research has, has shown this kind of phenomenon elegantly in the math. It doesn't work. And so this, both the math and the intuition and the topicality makes sense. Jimmy Lin's group has a great paper, TAS B that my coworker showed to me After I was doing all of this clustering work on the pre training and they do clustering by query on a more fine tuning style training before this E5 paradigm of multi stage with large pre training happened, get very good results on efficiency and kind of show that your in batch negatives benefit a lot when you have topical similarity. Now of course it's trading off of, you know, you're pushing that entire gradient step really around one topic and you know, at some point you have to think about when you're doing these batches of like 32,000 examples, do you really want it all to be one very narrow topic or is that going to maybe make your model kind of walk around topic spaced in very different ways or you know, overfit. Step by step the intuition starts to raise other questions and my intuition may not be your intuition. So the questions you raise or the concerns you have may be different. Empirically I tried to take things further than what's published in the archive and it's not as easy to go down to the like perfectly optimized pre training batches. It starts to feel like, you know, clustering might be sort of on the end of the sweet spot. 
Batch stratification very helpful, very easy to implement clustering harder to implement, a little bit more benefit, not seeing nearly as much. And then at some point I've had experiments that absolutely sucked, which when I look at some of the data it's either oftentimes you can start to hit these false negative problems because if you have 10 million examples and you're only taking 10,000 and maybe there are a couple false negatives in the 10 million that for this query you haven't labeled them as positive, but they really are also relevant. If you batch the most relevant examples in with that positive, you take the 10,000, you're going to get all the false negatives in that batch and then you're going to train the model with very noisy label where you're saying okay, disambiguate this positive from this positive, this positive, this should be good. This positive you should actually give a low score to. The model's going to fit some weird thing that disambiguates those two positives from one another. Not the fact that relevant, it's not going to be learning relevance anymore. So you have to fight this. We fight this a lot in our fine tuning data preparation and cleaning and but, but we know that it works in pre training from the stratification trick from this clustering and there's sort of this like in My head sometimes going to sleep at night, I close my eyes and I just imagine what it would be like to have pre training that has really good negatives, just like our fine tuning and how it wouldn't cost $2,000 to train this thing, it would cost 20 cents. You know, it would just, it would just fit and everyone would laugh about all the 7 billion parameter model training jobs we've been running. Forgetting that the data sucks. But you know, that's my, that's my personal like hot take and you know, wish and I have to make it a reality to prove it to the world. But that's kind of where my thinking is at. I think I'm more of an evangelist for, you know, negative quality than most in this space. But yeah."
},
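A minimal sketch of the source-stratified batching trick described above, where each batch is drawn from a single data source so that in-batch negatives stay topically close to the query. The data layout and field names are illustrative assumptions, not the actual Arctic Embed data loader.

```python
import random
from collections import defaultdict

def stratified_batches(examples, batch_size, seed=0):
    """Yield batches in which every example comes from the same source.

    `examples` is a list of dicts with at least a "source" key
    (e.g. "wikipedia", "reddit"). Keeping each batch single-source makes
    the in-batch negatives topically closer, which is the effect the
    conversation above attributes to batch stratification.
    """
    rng = random.Random(seed)
    by_source = defaultdict(list)
    for ex in examples:
        by_source[ex["source"]].append(ex)

    batches = []
    for source, items in by_source.items():
        rng.shuffle(items)
        # Drop the ragged tail so every batch is full-size.
        for i in range(0, len(items) - batch_size + 1, batch_size):
            batches.append(items[i:i + batch_size])

    rng.shuffle(batches)  # interleave sources across training steps
    return batches

# Toy usage: two sources, batch size 2.
data = [
    {"source": "wikipedia", "query": "Who unified Germany in 1871?"},
    {"source": "wikipedia", "query": "When did the French Revolution start?"},
    {"source": "reddit",    "query": "How do I handle a roommate conflict?"},
    {"source": "reddit",    "query": "Is it rude to decline a wedding invite?"},
]
for batch in stratified_batches(data, batch_size=2):
    print([ex["query"] for ex in batch])
```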
{
"speaker": "Speaker B",
"content": "Yeah. Well, I'd love to continue on that with this. Yeah. Hard negative mining. It seems like that's the secret sauce of fine tuning. I think you already kind of touched."
},
{
"speaker": "Speaker A",
"content": "On it a lot."
},
{
"speaker": "Speaker B",
"content": "But I kind of, from reading your papers from Arctic Embed one to Arctic Embed two, it sounds like you originally had this strong conviction of curriculum learning and doing some kind of curriculum of the hard negatives. And then it sounds like maybe you, you know, abandoned it and are less sold on that idea. Yeah. So maybe if we could just, you know, open the topic about hard negative mining. Is that the current secret sauce to fine tuning after. So, so we've kind of graduated from this pre training stage. We're mostly just using in bash negatives I think in A. And now we have like more careful curation of the negatives. Is that a correct understanding and yeah, maybe if you don't mind just."
},
{
"speaker": "Speaker A",
"content": "Yeah. Martin, you want to explain fine tuning? I think you probably have an elegant way of putting it."
},
{
"speaker": "Speaker C",
"content": "Yeah, sure. So I think the main highlight in RP 2.0 fine tuning is that we want to use a more prominent teacher. We found that like using like Stella 1.5 B compared to English GTE seems to provide a good value and then also throwing out very high scoring negatives which are probably true. Sorry, false negative turned out to be pretty good. But this curriculum learning idea, we find out that perhaps it's due to we are not able to find a consistent measurement of hardness. That's why this curriculum thing is not working as good as just random modeling. But yeah, I think curriculum learning is one of the negative results we're sort of reporting in this iteration."
},
{
"speaker": "Speaker A",
"content": "Yeah, it's also I think both of these, the incredibly positive result we saw in 1.0 and the fact that that curriculum learning really helped 1.5 empirically validated that data set. There was one very good. And I think this happens too, in the space. There's a particular version of Ms. Marco where those negatives that came, you know, pre mined were like the best version of fine tuning for like years of research. And people tried to mine their own negatives, do other stuff, and somehow they just got it right. And it was, it was not even understood. I mean, talking to Daniel Campus, who was on the Ms. Marco project at Microsoft, will say, like, we don't even know. Like we, we tried to replicate with their own same pipelines, just this magic data set. You know, we just. It's almost like, you know where it came from. If you're internal, but you don't know how to make it again, that can kind of happen. And I think what we had is we had this great curriculum experiment that from our 1.0 fine tuning with the worst teacher was worse in other ways than our 2.0 improvements. The curriculum really had a positive impact. Got us another half point of MTAB score on retrieval for all the models we switched it to. But then I went back and I was talking to my coworker Gaurav, who did all this great work, and it improved our fine tuning even further in our 2.0. And I said, hey, Garv, like, wouldn't it be nice to have a plot showing just how much better the curriculum is for 2.0 that we've been using? And he was like, luke, I have bad news for you. Random is working just as well. And I'm like, he's like, what should we do? And I'm like, publish it. Isn't that interesting? Like, we should put it out there so, you know, it's one of those things. The way we mine negatives is or mined negatives is a little different. We're using some of these tricks that are more proportional rather than absolute distance in dot product score land. A lot of these ideas came out of the Envy Retriever paper, which again cited our paper for focusing on fine tuning. There's a lot of great interaction between these different groups. But it's kind of this idea, I have this idea that you can't just take these heuristics or intuitions as gospel. If you don't quite understand exactly why switching from a, an absolute margin between positive and negatives, your threshold for what's a false positive to proportional to the top score, then maybe you want to be a little bit more careful and run these experiments again to see if that curriculum trick you were doing really translate to this new parameterization. It could be the curriculum learning does work, but just in the reparameterized negative mining you need a little bit different measure. And gosh, we have three different measures. We tried and didn't see an improvement in the curriculum on those experiments. But you know, some of these training runs, it wasn't like it was worse. It was just that the, the benefit we had seen before, maybe we got the same benefit from switching or parameterization and they overlapped. Maybe we can't parameterize the same way the curriculum using the kind of proportional gap that we're using in V2, who knows? But it's always, it's always kind of an interesting and humbling reminder that you can think you're very smart and you can think you've been doing this for months full time, and yet like your intuitions don't always meet when they hit the road. I mean we see this in deep learning all the time. 
People, we have all these like ceremonies we do with our different functions or tricks. And it turns out that you can turn it off now that it was some interaction between a weird architecture choice that's gone out of fashion for five years that you needed this extra treatment. And you know, you see this with neural network initializations or you know, different norms affecting different ways that people architect thing and things. And you know, parallel transformers, pre norm or post norm, it's different than not parallel transformers. And you know, if you don't empirically measure it or get really deep into theory about it, you know, you'll be surprised. Yeah. Was that, did we. Was there another topic, Charles? I don't remember. Were you asking something else about the fine tuning step?"
},
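The switch Luke describes, from an absolute margin below the positive's score to a threshold proportional to it, can be stated in a few lines. This is a hedged sketch: the 0.05 margin and 0.95 ratio are illustrative placeholder values, not Arctic Embed's or NV-Retriever's actual settings.

```python
def suspect_false_negative_absolute(neg_score, pos_score, margin=0.05):
    """Absolute-margin rule: a mined negative whose teacher score is within
    `margin` of the positive's score is treated as a likely false negative."""
    return neg_score >= pos_score - margin

def suspect_false_negative_proportional(neg_score, pos_score, ratio=0.95):
    """Proportional rule: a mined negative scoring above `ratio` times the
    positive's score is treated as a likely false negative, independent of
    the absolute score scale the teacher model happens to produce."""
    return neg_score >= ratio * pos_score

# Toy comparison: the same 0.04 gap flips decisions once scores are smaller.
for pos, neg in [(0.90, 0.86), (0.40, 0.36)]:
    print(
        f"pos={pos:.2f} neg={neg:.2f} ->",
        "absolute:", suspect_false_negative_absolute(neg, pos),
        "| proportional:", suspect_false_negative_proportional(neg, pos),
    )
```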
{
"speaker": "Speaker D",
"content": "Yeah, I think that again kind of covered it, which is like the importance of that negative mining and how much I think you said you changed your teacher model for 2.0, was it?"
},
{
"speaker": "Speaker A",
"content": "Yeah. There's a delightful little comparison we made. The comparison in our paper is not even as big of a gap. So NB Retriever published a technical report for, for the team at Nvidia that did their kind of research retrieval model. And that model did a lot of work at looking at negative mining and their fine tuning and teaching with an ensemble of different labeling functions, different embedding models used to select what might be the hardest, not actually positive examples to contrast the positive example against in that fine tuning data. And so we looked at some of their ideas and did replication size on some of it. I don't think we ever felt like going to the, you know, we have this like pragmatism, right? Like operational simplicity of just one teacher model is very nice. And you know, their ensembling score improvement, not so great. Even in their paper trying to showcase this, it's there, it's impressive, it's scientifically interesting. Pragmatically, the gap between a bad model and a better model seemed more interesting to us. So we, we looked at moving, you know, to spending a little bit more compute and storage space on embedding all of our fine tuning data. Fine tuning is not that big anyhow, so you can, you can spare a little bit less expense on the pre processing steps. And we found that it was dramatic improvement in the negatives you get. And that kind of goes to show too like the data isn't done being squeezed for quality. And these improvements are the same improvements you get from like a 10 order of magnitude or 1 order of magnitude 10x scaling law benefit. And so we would much rather play in the space of scaling up our teacher model on the fine tuning data and get a better quality data set. I'll get off my anti scaling soapbox at some point on this conversation, but there's just so much to do to make the data better, the training process better. And yeah, we found that some of these top models that come out, there's almost this bootstrapping effect where if you use a model that is now better at retrieval to score all of the documents in, in a, in a fine tuning data set as relevance against one query and then select okay, the document that is labeled as relevant, we definitely want to keep that label, but we want to like think about okay, what should be the other ones we contrast against in that batch? You can't contrast against the whole data set, that's too big. So we're going to select, you know, maybe 10 examples. We want 10 which are very hard. So we want 10 that confuse even some of the best models into giving a high relevance score. But we don't want, you know, we also trust the best models to maybe be right. So if the best model says that actually a different one is more relevant than the labeled one, we want to think, okay, we may not want to, we may still want to use that example. It may still be relevant, but it may actually be less relevant than the other one or may not be more relevant than the one that got scored higher. So for consistency's sake, let's just not even use that other one as a negative. 
And this kind of flies in the face of some of the earlier work on like ance which is trying to is kind of assuming your data is perfectly labeled and then taking the hardest negatives, the hardest ones to the model that's being trained, we're actually finding that well, maybe we trust some of these pre trained models that the best model we have might be able to tell us things about even the data labeling quality that are actually more helpful to training. And so that's kind of this idea that has been explored multiple times different teams, Nvidia's research team, our team and as we refine the recipe we get much better results. The versions of fine tuning data sets, we have higher quality and it would be cool actually at some point to follow on this work and maybe there are plenty of open fine tuning data sets like MsMarco we could replicate some of these things on and maybe even publish new versions of for other folks to just. Okay, you don't even have to do the do the filtering. You can just get better performance with these better fine tuning data sets. That would be a cool contribution that we thought about. Now that we've finished 2.0, maybe we can do a little bit more past the tech report in these spaces. No promises, but it is on our minds as well sharing this more super cool."
},
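A rough sketch of the mining recipe described above: score candidates with a teacher model's embeddings, drop candidates the teacher rates at or above a fraction of the labeled positive's score (the consistency filter), and keep the hardest survivors. The function name, the 0.95 ratio, and k=10 are assumptions for illustration, not the exact Arctic Embed 2.0 pipeline.

```python
import numpy as np

def mine_hard_negatives(query_vec, pos_vec, cand_vecs, cand_ids, k=10, ratio=0.95):
    """Select up to k hard negatives for one (query, positive) pair.

    All vectors are teacher-model embeddings (assumed L2-normalised, so the
    dot product acts as cosine similarity). Candidates the teacher scores at
    or above `ratio` * positive_score are discarded as likely false
    negatives; the highest-scoring survivors are kept as hard negatives.
    """
    pos_score = float(query_vec @ pos_vec)
    cand_scores = cand_vecs @ query_vec              # (num_candidates,)

    keep = cand_scores < ratio * pos_score           # consistency filter
    kept_ids = np.asarray(cand_ids)[keep]
    kept_scores = cand_scores[keep]

    order = np.argsort(-kept_scores)[:k]             # hardest first
    return list(kept_ids[order])

# Toy example with random unit vectors standing in for teacher embeddings.
rng = np.random.default_rng(0)
def unit(v):
    return v / np.linalg.norm(v)

q = unit(rng.normal(size=128))
p = unit(q + 0.1 * rng.normal(size=128))             # a positive near the query
cands = np.stack([unit(rng.normal(size=128)) for _ in range(1000)])
print(mine_hard_negatives(q, p, cands, cand_ids=list(range(1000))))
```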
{
"speaker": "Speaker B",
"content": "So yeah, it sounds like really interesting opportunities for the fine tuning dataset. I had a question about this teacher model. So does the teacher model embed the entire fine tuning dataset and then you store that in like a vector index and then maybe that's opportunity for weaviate to contribute to the kind of story?"
},
{
"speaker": "Speaker A",
"content": "Yes, I mean I think it's exactly that. One of the things when you mentioned weaviate contributing, I mean one of the benefits that we have is when we're doing these training or data processing runs, we've got these GPU machines around and sometimes it makes sense to GPU accelerate these like batch lookups and retrieval because it's more throughput. Like we're trying to consistency filter the entirety of, you know like some of these data sets, I don't know, NQ I think is in the tens or hundreds of thousand somewhere in there. The training examples is an example that folks have used. I mean Ms. Marco is another one that's popular and all these data sets, like we have ones that are even bigger that we look at for fine tuning. And so you don't, you know, you don't want to go one at a time and do this like fast retrieval. Even though the retrieval can be very fast. It's actually more of a batch and throughput oriented thing, which for some of these like graphical indexes actually may be kind of an adversarial case of like, well, actually I want to take these 10 completely different queries and look up the most things that are relevant to all 10 of them at once. That's great. If you're doing a big linear scan, it's just a matrix multiply and that's happening on GP really fast. For some of the clever, clever, you know, approximate retrieval tricks, you're, you're taking different paths, you're like walking the graph in different directions. And if you're trying to do that on the same cpu, it's like cache locality and other things. It might actually be not a great tool. And practically a GPU tends to go fast enough. So it is an opportunity for web to contribute. And it is something that we've thought about. We have a pretty. Sometimes we just use raw Pytorch primitive operations to do matrix multiply and top K operations. And so it's not the most cost performant way to do some of this stuff, but it's also fairly cheap for a lot of the scale of these data sets. We don't go that crazy large for fine tuning compared to what others use with open data sets like Ms. Marker and nq. And that's something we look at. But yeah, you're exactly right when you were saying. Yeah. Are you just embedding the entire data set and then doing retrieval over all of the queries? Yes, it's this interesting thing. It's actually very simple. You just, you, you are going to retrieve from your data set all of the relevant items and then you're going to use only items that are at least somewhat relevant for training because you assume that your model is now good enough by the time it reaches the fine tuning stage to mark as so much less relevant all the clearly not relevant examples that, that would not contribute to the loss. You can kind of think of it and in fact some papers have tried to think of this as are we just approximating doing the full batch contrastive objective of if we could do a batch size of a million and our data set's a million examples and fine tuning, you'd have a million by a million matrix of queries as rows and documents as columns or something. And if you could compute all of these relevant scores and do your loss function on them, how different would that be from simply taking the most relevant ones as your docs and doing that piece by piece? And once your model gets good enough, like theoretically, if a model would assign zero relevance between say the 90th percentile and below relevant documents. 
Then you would have no mathematical difference dropping all of those from the batch when training that particular example. And so that's kind of what the fine tuning tries to do. We actually, rather than in pre training, you have a one to one ratio of your embedding queries. You embed 10 queries, you embed 10 documents. In fine tuning, it's often an order of magnitude off. We're sampling at least 5 or 10 or sometimes, sometimes people do with one, sometimes people do with like a hundred. But often it makes sense to sample a bunch of good negatives to kind of make sure the contrast is showing more. You know, it's more then, then you feel more confident it's a general purpose label that you're saying, okay, this one is relevant. This one that contains the same words but talks about them in different ways is not relevant to the question. That's the kind of nuance you want to teach the model. And when you get that then your ranking scores, your NDCG tend to go up substantially. Sometimes the recall, especially like 100, we don't see nearly as much improvement either between fine tuning or from pre training to fine tuning or fine tuning with one data set to a fine tuning with a better data set. Those, those things we, but we do see that, you know, these embedding models can actually be pretty darn good rankers now as if you're, you know, recycling their knowledge into better labels and retraining. And this bootstrapping process is very, very exciting to us because we get to use the same datasets, the same models, the same training and just get better every iteration."
},
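Luke mentions using raw PyTorch primitives for this throughput-oriented lookup. A minimal sketch of that pattern might look like the following; the shapes, chunk size, and normalization are assumptions for illustration, not Snowflake's actual tooling.

```python
import torch

@torch.no_grad()
def brute_force_topk(query_embs, doc_embs, k=100, doc_chunk=100_000):
    """Exact top-k retrieval for a batch of queries via matrix multiply.

    query_embs: (num_queries, dim) tensor, assumed L2-normalised
    doc_embs:   (num_docs, dim) tensor, assumed L2-normalised
    Returns (scores, indices), each of shape (num_queries, k).
    Chunking over documents bounds peak memory; on a GPU this is just a
    sequence of large matmuls plus a running top-k merge.
    """
    device = query_embs.device
    best_scores = torch.full((query_embs.size(0), k), float("-inf"), device=device)
    best_idx = torch.zeros((query_embs.size(0), k), dtype=torch.long, device=device)

    for start in range(0, doc_embs.size(0), doc_chunk):
        chunk = doc_embs[start:start + doc_chunk].to(device)
        scores = query_embs @ chunk.T                      # (Q, chunk)
        merged_scores = torch.cat([best_scores, scores], dim=1)
        merged_idx = torch.cat(
            [best_idx,
             torch.arange(start, start + chunk.size(0), device=device)
                  .expand(query_embs.size(0), -1)],
            dim=1,
        )
        best_scores, pos = merged_scores.topk(k, dim=1)    # keep running top-k
        best_idx = merged_idx.gather(1, pos)

    return best_scores, best_idx

# Toy usage with random vectors standing in for embeddings.
q = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
d = torch.nn.functional.normalize(torch.randn(50_000, 256), dim=1)
scores, idx = brute_force_topk(q, d, k=10)
print(idx.shape)  # torch.Size([8, 10])
```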
{
"speaker": "Speaker D",
"content": "This is kind of maybe a bit of a segue off of this, but something I think Connor and myself have just been thinking about lately is with, with fine tuning and it's probably more in the domain adaptation kind of corner of, of fine tuning where you know, what is the value of like synthetic query generation given, given a data set? And like, particularly why I think about it is I look at, I look at a lot of these training data sets and they mix, you know, kind of QA question answering data sets and then keyword, keyword to keyword to document kind of questions. And to me those are very different distributions and you know, it's a different formulation. So yeah, I was just wondering if the team had had any kind of like experiments with, with synthetic query generation and how like the benefits of it. But it's, it's a, it's a more. It's a costlier process, I guess, and it takes, it takes longer and yeah, that's kind of just another little corner of fine tuning that I think is pretty interesting."
},
{
"speaker": "Speaker A",
"content": "Yeah, well, I mean we have the benefit at Snowflake. We have other teams building. Can you do LLMs? They would love us to test this out some of the, you know, oh, we're just deploying a new model. Can you go run a bigquery on it? So actually there's sort of a weird incentive to sometimes actually do some of the synthetic data generation because it's a great way to test out the other products that people are building and stuff. So we have looked at this. I am not the world's best prompt engineer. I think Martin has had more success than I have with coaxing the LLMs to give him useful tools. Maybe not. It's helpful for, especially for eval. If you have a data set and you don't have any queries for it. But you know, these are the kind of documents someone wants to search. It gives you at least some vibe. Check to embed the documents and then embed something that sort of seems like questions for it. But if you actually look at what you get when you as a bad prompt Engineer, prompt an LLM to generate questions. For example, if you go on GPT4 or something, you chatgpt and you just paste like, hey, here's a document. Ask me a question about this document. Oftentimes it'll be like, okay, in the second sentence, what color was the thing it was talking about? And it's like, okay, that's not a query, that's not retrieval, that's actually document question answering. And so you have to be careful that you're actually constructing queries that are retrieval oriented or other, other ways. You get like these other weird examples which are like, you know, it takes the sentence, the exact sentence out of the document and turns it into a question. And it's like, well, if you. That, that, you know, sometimes people have that, sometimes they're like, okay, I know there's a document called like Q3 Sprockets Report and I want Q3 Sprockets Report. It's navigational. That's. I want that document to show up in the top. I know it exists in the corpus. And I just, you know, people use Google for this all the time. They like literally go into Bing and they type in google.com as a search query because they don't understand, you know, this happens. But we, we want to make sure when we're training that like, oftentimes those aren't the hard queries. And so that's not the most interesting example. And you know, those, if you're evaluating on that, you might be overconfident that your model's doing well in that domain. And if you're training on that, you might not be getting your model that much better at the things that are actually tough. So that's kind of been one of those things where we found like filtering or tweaking existing data, you know, synthetic data that's heavily grounded in existing data sets or here, ask a, here's a question, ask another question might be a better prompt than just here's my data set, go for it. And so, yeah, we have examples, I think, of how we've done some of this contextualized generation of queries and even negative documents and stuff. And there are other, like the Google Gecko paper has some interesting methodology, biologists who've looked at and tried to reproduce, but we've actually found mixed results in our different experiments trying to use synthetic data just because it really is like tricky to make it right. It can be trivially easy or not an actual search query. 
And if you crack the code with your prompting or your strategy, please do share that with the open source community because it's so useful to be able to do domain adaptation and generate evaluation data sets and all this. It's miraculous. I mean, the amount of work that information retrieval folks have spent human, human labor constructing these things for both training and evaluation. We would love to automate more of that for sure and be able to expand to new domains."
},
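The failure mode described here, reading-comprehension questions instead of search queries, is usually attacked by constraining the prompt. A hypothetical sketch follows; the wording and template are assumptions for illustration, not prompts used for Arctic Embed.

```python
# A hypothetical prompt template for retrieval-oriented synthetic queries.
# The wording is an illustrative assumption, not a production prompt.
PROMPT_TEMPLATE = """You are generating training data for a search engine.

Write ONE short search query that a user who has NOT read the document
below might type into a search box, and for which this document would be
a highly relevant result.

Rules:
- Do not refer to the document itself ("this document", "the passage",
  "the second sentence", ...).
- Do not copy a sentence verbatim and turn it into a question.
- Prefer natural keyword or question phrasing a real user would type.

Document:
{document}

Search query:"""

def build_query_prompt(document: str) -> str:
    """Fill the template with one document; the result is sent to an LLM."""
    return PROMPT_TEMPLATE.format(document=document)

print(build_query_prompt(
    "Snowflake's Arctic Embed 2.0 is a multilingual text embedding model "
    "released in late 2024."
))
```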
{
"speaker": "Speaker D",
"content": "Yeah, it's a cool point though, really, in what you're saying, because again, what I really like about how you kind of, you and the team have described the process is everything's about kind of not just quality, but operational efficiency as well. Like doing things like the smart way so to add, add like an element, I guess, to the pipeline that could potentially be kind of stochastic or a little bit random at times or like it doesn't chime with that, like the dependable, dependable pipeline that you have. So, but that being said, yeah, you know, if there were, if there are ways to contextualize it the right way and to kind of zero in on a good strategy for getting high quality outputs, it would be valuable for sure."
},
{
"speaker": "Speaker A",
"content": "And we're excited. I mean there are papers coming out in this space. I think folks in academia and beyond are pushing the boundaries on good techniques for it. So it's an Area we definitely try to reproduce and do experiments with. I'll say that for, you know, for this Arctic 2.0 work. Like, we, you know, it's not like we're throwing out any of the great work that went in Arctic 1.0 and was published in that release. But I think most of, you know, what we talk about is what we mostly focused on, and that's kind of this quality via filtering and distilling or better using teacher signals as well as expanding to other languages and validating all those steps is a big part of the work as well."
},
{
"speaker": "Speaker B",
"content": "Yeah. Well, if I could get one more question on synthetic. I love synthetic data generation. Giles knows I love this synthetic query stuff. And so even, you know, when we're publishing this podcast, the day before publishing our podcast with Morningstar and Morningstar, they do this thing where they write a synthetic query for each of their document chunks and then embed that because it helps with that distribution alignment of going question to question. And then very naturally it's like, you know, you have these synthetic questions now you can kind of benchmark your system. A lot of WE users, they start with their corpus. They don't have the questions. But Luke and Martin, as I've been studying your papers and throughout this conversation, I'm actually graduating from, you know, I've been interested in synthetic queries forever, but now I'm kind of thinking about synthetic hard negatives because you have all this problem about filtering the negatives. It's like, what if you use an LLM to kind of transform the negatives to make it like this perfect scale of negatives? And that's. My interests shift throughout our conversation, which I always love it when that happens. But what do you think about that idea?"
},
{
"speaker": "Speaker A",
"content": "I'll say I haven't as carefully done the reimplementation reproducibility work around the Google Gecko method. But my recollection from that paper that is now several months old and maybe not top of my reading stack in terms of rereading, is that they do a certain amount of this and they do have a lot of clever tricks for it. They show that that helped their training substantially in their. In their paper, or at least make the claim they feel, the authors feel the conviction. We have experimented with similar things. We have found that, like, maybe it's partially because we have access to a lot of really high quality data already, that it's maybe a higher bar for us to add improvement on that. We do some amount of experimentation with both synthetic queries and documents. And I think Both are very fruitful, but it's interesting to think about. I guess I like to approach this from the perspective of what is the objective that you're trying to meet. And I think it's almost. To me, the more interesting thing is the problem that sometimes, sometimes you don't have good negatives for a particular query. You have, you have a really interesting here's a question and here's what answers it. But that one is so easy because nothing else in my corpus is close that I can't actually train a model with this. Maybe if I could find some other interesting examples that I could make my training curriculum better and make the training example, the example being not just a positive pair, but a positive pair plus contrasting negatives better. And so that's something that is a very fascinating problem. Synthetically generating documents is one part of it. You could also synthetically modify the query to be closer to other positives. There's all sorts of ways you could play with it. You could try to ingest other corpuses, take some of your pre training data and just mine that for hard negatives, for fine tuning. There are all sorts of different ways you could play with it. But I think you're really hitting on this very powerful problem which is, you know, not every, not the distribution of relevance, but, you know, true relevance, whatever that is. What, what you would think Platonic ideal of relevance may not be consistent between examples and training and, and making it stronger, making a better contrast might, might make your training much better. And so that problem for sure is interesting. The examples of, of ways to approach it with synthetic document generation have been compelling to us. We've, we've played around with stuff. I really liked the gecko paper. I thought they did a really good job explaining all the cool things they tried. I really appreciated that open science from, from that team and, and yet I don't know if I have the conviction that like practically like, if I were to go code it up right now, I could just like push Arctic Embed to higher heights. Like it's one of those things that it's tricky to get right."
},
{
"speaker": "Speaker B",
"content": "Yeah, it sounds hard. Like if I ask, maybe if I'm asking like, how does reciprocal rank fusion work? And I'm searching within weaviate's blogs and it's like, you know, these other distracting hard negatives, they just have nothing to do with reciprocal rank fusion. And how do I. Right, yeah, such an interesting topic. So now I'd like to kind of transition topics a bit into the multilingual aspect. This is Something we see a lot from embeddings where they focus on not only just, you know, best embedding model, but also supporting multilingual. And when I gave Cloud your paper, the one of Cloud's favorite takeaways was Chinese text actually boosted English performance. And so I think, you know, the language models like that nugget from your paper. So what's been your experience with this, you know, multilingual aspect of it?"
},
{
"speaker": "Speaker A",
"content": "This is all Martin and we brought him onto the team. Not, I mean, because he's a great talent in the field, but also we really appreciate that he has a strong background working with different languages in natural language processing, meeting the intersection with information retrieval. And so he's by far the expert. He's first paper for a very good reason on this 2.0 report. And you know, I think Martin should definitely be the one to, to shape this conversation. And it's also, I mean, yeah, if you. It's also just important to think about, like it's not a mistake. Like the reason, the reason this went so well is also we have the expert here. So with all that, you know, that's my introduction to Martin. Why don't you take us away, explain the work."
},
{
"speaker": "Speaker C",
"content": "Yeah, thanks. So I basically joined the team in the end of July and Arctic 2.0 is sort of my big project. And in the beginning, this is not Even called Arctic 2.0, it's just called multilingual Arctic Embed. I think the motivation of this work is driven by real workload because customers actually ask for multilingual retrieval capability on codec search and we really want to focus on them. And if I were to say one thing on this project, one important thing is this conviction of not overfitting to a certain benchmark. So in the multilingual space, there are this benchmark called Miracle. It's multilingual information retrieval across a continuum of languages. And the problem with this benchmark is that it's purely based on multilingual Wikipedia. And as we know, Wikipedia is very easy to overfit on. You cannot find a language node that has not been trained on Wikipedia. All three stages. Math language modeling, pre training and then fine tuning. It's very easy to overfit. So we are very focused on try to not overfit on basically miracle and not do leaderboard chasing. So one important step we do in this project is to bring on datasets that we believe no model are trained on, that is the CLAY data set. And actually we find out that all of these open source embedding models that open source multilingual embedding models actually seem to overfit pretty hard on Miracle, whereas the actually fail pretty hard on this new data. New clay dataset we find. But that's why we are really confident that we are more generalizable to new data to real customer workload."
},
{
"speaker": "Speaker B",
"content": "Yeah, really close to that cleft dataset. Oh, sorry, sorry. Look at mineca."
},
{
"speaker": "Speaker A",
"content": "Yeah. And this is yet another example of. Martin kind of has worked with us before. And so that was kind of. Martin was like, oh yeah, let's license this one. It's been useful in the past and focus on a lot of languages also that we were most focused on these kind of European languages, Western European languages in particular, where a lot of customers are asking when we get their customer request, hey, I'd love to try Cortex search, but is it going to work on French and you're okay, maybe I should focus on French when we're training this. That kind of came out of some of that expertise. And so it was fun for me to learn about this from Martin's past work and current work. Now it's interesting to see, you know, I would also, I would say like what we. I think we did a pretty good job in our report of not saying like all these open source language models definitely overfit to Miracle, but rather that we observe they. They're very competitive in the ranking of overall models including closed source models on Miracle. If you take the numbers published by the Google Gecko paper from Miracle, which seem kind of surprisingly low, you know, but that you actually see a very large gap where all these models are performing much better on NDC10 at miracle average across 18 languages. These open source ones that are 100 million parameters are just crushing Google Gecko but falling, you know, five points behind on the MTAB English retrieval benchmark in some cases or two or three points. And so it's kind of an. It was an interesting. We were starting to see this and Martin was like let's test Clap Nuts or Clay as a. I think the more French way to pronounce it. It's kind of an interesting experience and like it was almost a human experience for me. It's like bump into the right people and try this. But I think it also calls for like we really need more especially it'd be nice if they're open and not, you know, requiring a license benchmarks for. For non English retrieval that are good across, you know, languages. And it's hard because, you know, a lot of people just like Miracles, like We want to spend all the languages, get a good pulse check of like 18 different languages. But for a lot of people it's like their entire use case sits in one language. It might not be English, but it's not like they're not doing 18 different languages of retrieval at the same time. And so, you know, from the academics, it's kind of like there's this like oh, English or like everything. And in practice and you see this, I think like Gina AI I think has like a German model, they have German customers and it makes sense. You know, we, we kind of need to think multilingual doesn't mean like omnilingual necessarily. It means like I'm going to want to support more than one use case and maybe not completely sacrifice and be like a master of none. You don't want the model that's good at everything, but there's a better model for what you care about that's actually not the most useful model out there. So anyhow, I think that was kind of the interesting background of this client. I said, and we do show the performance of some of these other closed models that probably had a lot more training resources data wise than just Miracle or some of these other ones because the fine tuning data especially there's not as much high quality data out there in the public. Those other ones are substantially outperforming on clay compared to multilingual E5 or other ones. 
And if you read these open papers that have so kindly published their data mixes, it sorts of paint this picture that maybe we might be slanted a little too much towards Wikipedia. And then also if you're tuning parameters or tuning your data mix, if all you're measuring on is Miracle, you're going to get yourself to a bad place. And so when we were tuning, it was very nice to start using other evaluations as well. And we actually focused a lot on keeping our English scores high, which I think also helped us fight overfitting. Because if you overfit to Spanish Wikipedia, your English language retrieval also goes down. And there's tons of great English language retrieval benchmarks in Beer, mtab, et cetera."
},
{
"speaker": "Speaker D",
"content": "That to me actually was one of the really interesting things about this new model. And I even with how Martin, how you described the project started off as multilingual Arctic embed, but actually as a result, because you kept English in the mix, it's actually fine. But it's kind of counterintuitive to what the general strategy has been up to this point, which is you pick one model for English and you pick another one for everything else. I think that's a real standout with 2.0 for me is that you can actually, it looks like you can kind of have both. You can have both."
},
{
"speaker": "Speaker A",
"content": "I mean Martin, I don't know if you felt confident. I think we were not sure we could have both. Especially you look at all these things, you go, okay, I guess OpenAI only has one model and I think they might do okay on a multilingual. We don't really know. But everyone else has got two models and all the open source ones and they publish benchmarks and they're all bad at English compared to their English model they did before. Beforehand."
},
{
"speaker": "Speaker C",
"content": "Yeah."
},
{
"speaker": "Speaker A",
"content": "Are we going to make it? You know we have these internal docs and planning and all that and you know, we were like, it would be a cool feature to have just one model. It's certainly better for users. But I was, I was fully expecting to see a maybe where we could shrink the gap. Like what I was hoping for is we'd have, you know, MTAB English might be 55 for our baseline with training, you know, just English data and mixing it multilingual, we'd be able to get 53 or 52 instead of where some others have like an even bigger gap. And Martin, what were your expectations like can you remember in July about this?"
},
{
"speaker": "Speaker C",
"content": "Yeah, I mean, yeah, my previous experience has not been involved in training like three stages of multilingual model. It's actually pre training. So actually my focus on this project is actually pre training. Yeah. So I don't have any expectation. But my expectation wasn't that we are going to keep the English score. Yeah. And then actually during the model development we are gradually finding out that our model after just pre training are substantially outperforming other models after just pre training, before fine tuning. So that's where we start to think, okay, maybe now this is pretty good. For some reason that's unclear to us as well. We have some assumptions."
},
{
"speaker": "Speaker A",
"content": "Yeah, it's the big mystery is like why did it work? So why did it just work? It's kind of the theme of this."
},
{
"speaker": "Speaker D",
"content": "It's kind of the theme. It's kind of nice that the theme with deep learning from years ago is still kind of sticking around which is just kind of like, like at times, you know, you, you turn some dial somewhere along the system where you change an ingredient and it. Something things are still just kind of a little bit mysterious."
},
{
"speaker": "Speaker A",
"content": "For sure."
},
{
"speaker": "Speaker B",
"content": "Yeah. Awesome. So such a, such an interesting topic around this multilingual. The different aspects of It. So the one kind of anchoring question I really wanted to come to, I think we already kind of started touching on it with the Matrioshka representation learning was. And, and Luke, earlier you mentioned about, you know, cortex and search and looking at all the different knobs you can turn. So I just really love to understand further how you think about the decision to train these say single vector embedding models versus say maybe the multi vector Colbert models or maybe say the splayed sparse models or say, you know, cross encoders, rewrikers and I guess just like how you feel about like the different kinds of models you can train for search."
},
{
"speaker": "Speaker A",
"content": "Yeah, that's a great question. I mean I think it's also, it's interesting to take a bit of a historical lens and see that in the beginning people were just taking Bert and pooling the outputs, maybe doing a little whitening like they did with bca. But there was no training retrieval model even. It was just like can we do dense vectors instead of these sparse ones that we get from keyword stuff? And finding results are like not better than BM25 or other lexical retrieval techniques, but interesting and how far we've come and how many different approaches. But then also there's this, there does seem to be everyone's kind of like you have vector databases. Like weaviate has put a lot of code towards supporting a single vector retrieval use case. And I think it's interesting to remember that that's not the only way to do it. And in five, 10 years, who knows what the best techniques will be. I think the kind of geometric elegance of putting all of your documents on a hypersphere and just kind of going around and finding which section is what you're looking for. Sometimes I imagine a globe and the query is literally spinning the globe around and then I find all my documents there and it's kind of this beautiful. I mean obviously a globe is three space watch dial and a globe are not truly what's happening here in embedding dimensions, but it's kind of, you know, it also we see these degenerate cases where a query wants to ask essentially two questions. Or you know, maybe somebody is feeding this retrieval system into a generative system and they're like, okay, compare Apple and Google's earnings in Q2 2023. Well, okay, that's really two questions from your data bank, right? There's going to be like a report on Google's earnings and a report on Apple's earnings. Maybe somebody compared the two Webster's benefits from this. Oftentimes you ask like a comparison question and some Reddit thread comes up where somebody's comparing the two things. It's like okay, that's, that is truly the most relevant thing. But if there is not, you want to gracefully split to, you know, and maybe you're comparing two very distant topics like Apple and Google and finance are not a great example because they're similar companies and similar comparison technique. But like you know, draw like Compare you know, 7th century industrialization of agriculture to 17th century, you know, musical advances. And if that's a query that you somehow want to do like those search topics are going to be, you know, not close to each other and hard to reason about. Like what system, what would the ideal system look like putting those topics next to each other in vector space so you could retrieve them. No. Right. Like it's completely dependent on the query that those things should be close to one another. And so what you really want is more than one probe of your, of your organization of knowledge library your vector space. And that's one thing that having something that's not just simply a vector similarity operation can be very helpful for. But I think there's also a lot of, I mean it goes back super long in the retrieval literature to like you know that the papers that you read are typed up and like scanned in and you can't find, you know and it's like before the advent of personal computing people were hand labeling retrieval systems and evaluating and they had ideas about topics. 
Like topics is a very fundamental idea that like we all know humans there are topics in language and you can sort of sort things out and that's that really lines well with the single vector case in a way that like splayed for vocabulary expansion somewhat gets at right. It's topically relevant words can help you kind of fix your, your lexical index. We looked at splayed splits, definitely helps. But when you add, when you get a neural network doing the dense vector, suddenly the marginal benefit of adding splayed doesn't always show up. That was kind of before we started Arctic. We did looked at a bunch of things. We looked at okay, let's you know, spend more money on re ranking. Let's earn you know time and spend more budget there. Let's you know, let's try splating things. Let's all this. I don't think we look too much at Colbert because it's, it's a big lift to implement you know, these multi vector solutions. And frankly I think that one Vector per token is. Is seems a little bit too much, right? Like I think in topics, when I'm trying to retrieve something, I don't think most queries I'm doing need word or token level interaction there at the retrieval. I think I would like to allow for more interesting interaction between my query representation, be it a vector or collection of vectors in my document one than a dot product. I would like for the ability to have multiple different directionalities of it. But you know, I feel that like, personally, the Colbert model is more interesting to me as a. This shows how much better things can be when you have these multiple topics and multiple things. Generalization can be better when you split to very narrow topics that are almost word level. Suddenly relevance between these words are general across domains in a way that sometimes the collection of words as a single idea isn't. And so it makes sense to me that Colbert generalizes really well and performs really well, especially when you have these weird nuanced retrieval things. But from a practical system, when you want to do scalable retrieval, it feels like maybe the wrong tool and trying to cram multiple very large vectors. We did so much work to go from a kilobyte to 100 bytes, right, per vector. And then we see these things of like, okay, now I have 100 vectors, gosh, maybe you get them down below 100 bytes each. But like this, we just tried to squeeze half an order of magnitude out. Now you want to introduce two more. It just doesn't feel like pragmatically the right tool for some of these retrieval things. But I think the world would totally jump on board if you showed an even bigger improvement, the retrieval quality. But unfortunately it's one of those things if you want to, you know, be 10 times more expensive to run or 10 times less scalable, you have to also be 10 times better to convince people in a lot of cases or you're just. You're fit to a different use case. Cortex search is not the product that can focus so narrowly on this, you know, will never go over a million documents use cases and, and you know, and there is some really cool work with Cold Bear V2 to, to scale it up. But yeah, I think like splate is such a cool technique because it's even more efficient, you know, in some, in some ways, but not necessarily a really cool technique because it's using neural networks to the max and I think text embeddings. We feel like we can really unleash the training and really try to do all of our deep learning on that with some limitation of the single vector representation. 
But it's a very nice trade off in this middle ground going to multi vector or re ranking models. Hypothetically for some retrieval jobs or small enough data set you could just take modicize re ranker style model cross attending tokens from your queries and documents just stream through all of your data. GPUs are getting fast. You know, some of these small models do pretty well. There's, there's so much to explore all over the cost spectrum and all that. But I think where we found like we continue to find places to drive market improvement either with features around compressibility and multilinguality that, you know, solve more use cases or around just actual quality like we've done with this fine tuning work. The models are just better now. You know, our multilingual model is better than our previous English model. And you know, as long as we continue to find that, it's sort of hard to justify us. You know, Martin and I jumping to other topics to actually try it out, love reading about it, really root for the people doing that work. And I really, you know, I think that's one of the great things about being able to publish these models openly, being able to publish papers openly and communicate. Like I think a lot of people see MTAB as like I want to get number one. And like anytime someone else jumps up that's a bad thing because I'm farther from being number one. But I think it's a much more happy place to be is like I want the world to be, you know, getting better search. I think it's so cool when you feed, you know, our paper into Claude and you get really cool results. And just because I didn't build Claude doesn't mean that's not cool. And I would love to learn more about how Claude works or what training mix they did. I wish Anthropic would publish more. I understand why they don't, you know, but, but that's I think the mentality there. And so I'm really rooting for all these, especially if you guys have paper, send them over. We love learning about all sorts of different ways people are doing search and all that."
},
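To make the storage and scoring trade-off Luke describes concrete, here is a toy comparison of single-vector dot-product scoring against ColBERT-style MaxSim over per-token vectors. The dimensions and token counts are illustrative assumptions, not either system's real configuration.

```python
import torch

def single_vector_score(query_vec, doc_vec):
    """One vector per query and per document: relevance is a single dot product."""
    return float(query_vec @ doc_vec)

def maxsim_score(query_token_vecs, doc_token_vecs):
    """ColBERT-style late interaction: for every query token, take its best
    match among the document tokens, then sum those maxima."""
    sim = query_token_vecs @ doc_token_vecs.T        # (q_tokens, d_tokens)
    return float(sim.max(dim=1).values.sum())

dim = 128
q_tokens, d_tokens = 8, 200

# Storage comparison for one document (float32, no compression):
print("single-vector doc bytes:", dim * 4)           # 512 bytes
print("multi-vector  doc bytes:", d_tokens * dim * 4)  # 102,400 bytes

q_single = torch.nn.functional.normalize(torch.randn(dim), dim=0)
d_single = torch.nn.functional.normalize(torch.randn(dim), dim=0)
q_multi = torch.nn.functional.normalize(torch.randn(q_tokens, dim), dim=1)
d_multi = torch.nn.functional.normalize(torch.randn(d_tokens, dim), dim=1)

print("single-vector score:", single_vector_score(q_single, d_single))
print("MaxSim score:       ", maxsim_score(q_multi, d_multi))
```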
{
"speaker": "Speaker B",
"content": "Amazing. Well, Luke, Martin, Charles, thank you so much for joining the podcast. I mean this is such a marathon of search topics. I've learned so much about all these things and I'm really excited to kind of watch it back and continue our work with these embedding models. Thank you so much."
},
{
"speaker": "Speaker A",
"content": "Thank you Connor and Martin, thank you."
},
{
"speaker": "Speaker D",
"content": "Real fascinating."
},
{
"speaker": "Speaker A",
"content": "Thanks."
}
]