Insights from Snorkel AI running Azure AI Infrastructure: Humza Iqbal and Lachlan Ainley

00:00:00.000 |
Hey everyone, thanks so much for coming. I'm Humza. I'm an applied research scientist at 00:00:17.520 |
Snorkel on the computer vision team working on fine-tuning, you know, foundation models 00:00:23.100 |
for enterprise use cases. Thanks, Humza. I like the ears. Thank you. Why don't you start 00:00:33.340 |
out by telling the folks a little bit about Snorkel and what you do there. It's an interesting 00:00:39.520 |
name, any relation to like AI for scuba diving or anything like that? Well, hot tubs actually 00:00:44.920 |
because snorkel.com, once I joined the company, I learned it's a hot tub company, but no, actually 00:00:50.880 |
we don't do anything for scuba diving or hot tubbing. And to give a little bit of context, 00:00:56.660 |
the main problem that we're trying to solve is like, you know, data development for the 00:01:00.560 |
enterprise. So, one key thing that I kind of want to take note of is the fact that, you 00:01:06.840 |
know, out of the box, LLMs rarely meet enterprise quality, latency, and cost requirements. You 00:01:12.880 |
know, to give some context, our customers are like Fortune 500 companies like banks, insurance 00:01:17.620 |
companies, places like that. And for them to deploy their models, they really need them 00:01:21.940 |
to be very reliable and very accurate. And off-the-shelf models like Claude or GPT-4 or Gemini 00:01:29.240 |
may get you part of the way there, but they don't have that, you know, final mile that really, 00:01:34.080 |
you know, says, yes, we can like deploy these completely. So, what we focus on is developing 00:01:40.040 |
data to fine-tune these models to get them there. 00:01:43.500 |
So, Humza, this makes sense, but why is it hard? What's the challenge in addressing this issue? 00:01:51.660 |
Yeah. So, data development is fundamentally challenging, and there's a few key reasons 00:01:58.040 |
for that. One is that, you know, RAG is just a starting point. You know, you guys may have 00:02:02.520 |
heard of, like, using RAG to hook up, you know, enterprise knowledge databases 00:02:07.440 |
to these models and get them to help with things, you know, that these models were not pre-trained 00:02:12.540 |
on. And I don't want to RAG on it. It's great and all, but it's just a starting point. You 00:02:17.540 |
know, it won't get you all the way there. And, you know, quality in your data is absolutely 00:02:22.660 |
key. And finding and maintaining the right data is critical because, you know, a lot of 00:02:29.280 |
times, you know, the common instruction tuning datasets may be very big, but they don't contain 00:02:34.460 |
exactly the information that you need. Like, if you're training specifically on, I don't 00:02:38.680 |
know, a specific type of bank policy or a specific type of policy for certain industries, you need 00:02:44.600 |
that very key slice of information to be able to really improve these models. 00:02:50.840 |
So I think folks have a good understanding of what you guys do and why you do it. Why don't 00:02:58.400 |
you talk a little bit about what Snorkel is and what you're famous for besides the interesting name? 00:03:03.600 |
Yeah. So Snorkel pioneered data development for LLMs and we're trusted by, you know, many different 00:03:11.280 |
companies. We've worked with lots of companies in, like, you know, the Fortune 500 and all 00:03:15.300 |
that. We were spun up out of the Stanford AI lab quite a while ago and have a lot of, like, 00:03:20.520 |
you know, experience in data development because it's key to, you know, 00:03:24.580 |
many aspects of ML, and we've published many papers in a lot of hot fields. 00:03:31.520 |
Okay, nice. I think we're going to switch context here for a bit and talk a little bit about the 00:03:40.560 |
specific research projects you guys are focused on. You guys have a research-first culture. 00:03:47.920 |
Yeah. Why don't you explain a little bit about that and the projects? 00:03:51.940 |
Yeah, thanks, Lachie. So first I kind of want to talk a little bit about our research focus 00:03:58.140 |
overall. So really, the core question we try to answer is how can enterprises best 00:04:03.580 |
develop their data for custom AI models? You know, there are a few different directions 00:04:09.020 |
we want to pursue overall. One is keeping SMEs in the loop while maximizing the value of their 00:04:14.140 |
time. You know, because again, like, for a lot of these industries we need the subject 00:04:18.580 |
matter experts that know the key details in order to be able to, you know, provide feedback 00:04:24.820 |
to models and help them be able to improve. The second is, you know, make data development 00:04:30.420 |
programmatic, scalable, and auditable. Because, you know, while we do need SMEs, at the same time 00:04:36.900 |
we also need to make things scalable in a way that solely manual intervention isn't. So it's 00:04:42.520 |
really being able to combine those two things together that makes this important. And 00:04:48.280 |
the third is continuous evaluation with domain-specific dynamic benchmarks. You know, I'm sure you all 00:04:54.460 |
have seen things like LMSys or whatnot, and it's pretty good to see, you know, a general understanding 00:05:00.020 |
of where a lot of these LLMs fall in terms of their ability to do things. But for specific 00:05:05.240 |
industries, you need specific benchmarks to say, how good is it at this? Like, you know, a bank 00:05:11.320 |
isn't going to care about how well these LLMs do at, say, grade school math, right? 00:05:17.080 |
So there's that. And I want to go into a little bit more detail and talk about 00:05:22.840 |
some active research projects we're working on right now. One is fine-grained evaluation and looking at, 00:05:29.080 |
you know, where evaluation for these models is broken. One particular area is, you know, in long context 00:05:34.840 |
models. You know, you guys may have seen things like the needle in the haystack test, where you take a bunch of, like, 00:05:40.840 |
Paul Graham essays and insert some sentence and see how well it can find that. But, you know, 00:05:46.600 |
one thing we found is that, again, that doesn't necessarily give a proper sense of how these models 00:05:52.360 |
handle long context in other domains. And, you know, really, again, breaking everything down domain by 00:05:57.480 |
domain is super critical. So, you know, figuring out how we can improve long context overall. 00:06:04.600 |
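Since the needle-in-the-haystack test comes up here, a toy sketch of how such a probe is typically constructed may help. This is a generic illustration, not Snorkel's evaluation code, and `query_model` is a hypothetical stand-in for whatever LLM client you use.

```python
# Toy needle-in-a-haystack probe: bury one known fact in a long filler context
# and check whether the model can retrieve it at different depths.
def build_haystack(filler_docs, needle, depth_fraction):
    """Concatenate filler text and insert the needle at a relative depth (0.0-1.0)."""
    haystack = " ".join(filler_docs)
    pos = int(len(haystack) * depth_fraction)
    return haystack[:pos] + " " + needle + " " + haystack[pos:]

def run_probe(query_model, filler_docs, needle, question, answer, depths):
    """query_model is a placeholder callable: prompt string in, completion string out."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_docs, needle, depth)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        results[depth] = answer.lower() in query_model(prompt).lower()
    return results  # e.g. {0.1: True, 0.5: True, 0.9: False}
```

The point Humza makes is that passing a probe like this on, say, Paul Graham essays does not guarantee the same retrieval quality on domain-specific enterprise documents, which is why the fillers and needles need to come from the target domain.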
Another key area is enterprise alignment. You know, making sure that these LLMs comply with, you know, 00:06:10.200 |
company goals, regulations, and all that. You know, we don't want our LLMs to be committing any 00:06:15.160 |
career-limiting moves while outputting text. And another area, which is particularly near and dear to 00:06:21.800 |
me because I work on it actively, is multimodal alignment. We find that, you know, these models trained 00:06:27.080 |
on public data, you know, underperform in, like, you know, specific domains. And one area we're working 00:06:32.440 |
on is using these large vision language models or LVLMs to be able to generate synthetic data without 00:06:38.280 |
manual annotation to be able to train, you know, downstream models. So, kind of being able to 00:06:43.320 |
really have this flywheel of going from, like, specific data generation in the loop to model training 00:06:49.320 |
I think it's something that we're very excited about. 00:06:53.400 |
That's great, Humza. We're excited that a lot of these projects are happening on 00:06:59.640 |
Azure's AI infrastructure, obviously, as well. I think if you forward the slides a little bit, 00:07:08.040 |
I think, you know, you went through an experience with Azure getting on board and running these projects. 00:07:15.240 |
I think people are really interested to maybe understand what are the best practices you had 00:07:20.200 |
working with our infrastructure, some of the pitfalls and the benefits as well. 00:07:26.920 |
You know, this one's my slide. I think the way we think about infrastructure that's supporting this wave 00:07:35.880 |
of AI, it's really about optimising it in every sense possible for the different AI applications and uses. 00:07:46.680 |
You know, we look at everything from our Azure data centres. We have over 300 worldwide. 00:07:51.720 |
The CPU or the host, so combining our virtual machines with the right CPUs and offering the right throughput. 00:08:01.160 |
The accelerator, we use a diversity of accelerators from AMD, NVIDIA and our own first-party silicon, 00:08:09.960 |
as well with the Maia chip. We have topologies that, you know, optimise that I/O between the different 00:08:17.080 |
layers and obviously the networking throughput as well. So it's really about making sure that we can take 00:08:25.160 |
the best of breed at what we do at a super computing scale and deliver that back to the customers so we have 00:08:31.240 |
a real cycle around, you know, learning from working with organisations like OpenAI, Mistral and others that 00:08:39.800 |
have trained their models on Azure's infrastructure and then being able to democratise that and deliver it back to 00:08:44.920 |
customers as well. And so Humza, with that, why don't you share a little bit more about what exactly you guys did on Azure? 00:08:55.720 |
Yeah, so first we'll actually talk a little bit about how we do distributed training in general with Azure. 00:09:02.120 |
So we have a stack, you know, and so on the ML framework side, you know, we use PyTorch, you know, 00:09:08.520 |
pretty standard framework. We also use a library called Horovod, which handles multi-node communication. 00:09:14.920 |
So if we have like multiple Azure VMs, how do they communicate with each other? It allows for faster 00:09:20.840 |
communication across nodes. And on the underlying, you know, hardware layer, you know, we use a bunch of Azure 00:09:27.160 |
VMs that could be like A100s or H100s. We connect to them and use Horovod to 00:09:34.040 |
connect to them and send gradients through for distributed training and they all read and write 00:09:38.200 |
to a single network file system or NFS. You can basically think of an NFS as being a shared file 00:09:43.880 |
system that every machine has access to as if it were a local file system, which makes it very seamless to 00:09:50.200 |
read data or write checkpoints for models. So that's kind of our overall infra stack. 00:09:56.520 |
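For readers who want to see roughly what that stack looks like in code, here is a minimal sketch of PyTorch training wrapped with Horovod and checkpointing to a shared NFS mount. It is an illustrative example, not Snorkel's actual training code; the model, dataset, and `/mnt/nfs/checkpoints` path are placeholders.

```python
# Minimal PyTorch + Horovod sketch with checkpoints written to a shared NFS mount.
import torch
import torch.nn as nn
import horovod.torch as hvd
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

hvd.init()                               # one process per GPU across the Azure VMs
torch.cuda.set_device(hvd.local_rank())  # pin each process to its local GPU

# Hypothetical dataset and model, stand-ins for the real workload.
dataset = TensorDataset(torch.randn(4096, 512), torch.randint(0, 2, (4096,)))
sampler = DistributedSampler(dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

model = nn.Linear(512, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4 * hvd.size())

# Horovod wraps the optimizer so gradients are averaged across nodes via allreduce.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(3):
    sampler.set_epoch(epoch)
    for x, y in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:  # only one worker writes the checkpoint to the shared NFS path
        torch.save(model.state_dict(), f"/mnt/nfs/checkpoints/epoch_{epoch}.pt")
```

A job like this would typically be launched with something like `horovodrun -np 16 -H node1:8,node2:8 python train.py`, so every GPU reads data from and writes checkpoints to the same NFS path as if it were local.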
And what about running on Azure? You guys had some specific 00:10:05.160 |
workloads that you were covering. Yeah. So we've run a number of projects 00:10:10.360 |
on Azure and we've run them on different sizes from like, you know, one node to dozens of nodes. So, 00:10:16.440 |
you know, we've run DPO-aligned models with a bunch of instruction-response preference data sets. 00:10:22.520 |
We've run, you know, other preference optimization techniques, and the things that 00:10:28.840 |
have used the most compute have been large-scale distributed training jobs for multimodal training 00:10:33.800 |
and inference, you know, with like dozens of GPUs. So, yeah, these are the kinds of workloads that we've run. 00:10:40.760 |
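As a point of reference for the DPO work mentioned here, below is a minimal sketch of the standard DPO objective computed from per-sequence log-probabilities under the policy and a frozen reference model. This is the published formulation of the loss, not Snorkel's specific training code; the tensors are assumed to be summed log-probs over each response.

```python
# Direct Preference Optimization (DPO) loss from per-sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are the log-ratios of policy to reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In a preference data set, each example contributes a prompt plus a chosen and a rejected response; the four log-prob tensors above come from scoring both responses under the trainable policy and the frozen reference.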
And I think you had some lessons learned throughout as well. You know, I think it's not 00:10:54.360 |
always smooth sailing with these types of jobs. So any best practices, traps for young players 00:11:00.760 |
out there in terms of your experience? Yeah, absolutely. So there's a number of key architectural 00:11:09.160 |
considerations to keep in mind. One is having enough nodes to support, you know, your ideal batch size. 00:11:15.560 |
So on the CV side, you know, when I was training, let's say, CLIP-based models, 00:11:20.120 |
one thing I learned is that we wanted enough nodes to have a certain batch size, 00:11:25.560 |
but we also didn't want too many such that we would either fall into the trap of having 00:11:30.200 |
too large a batch size or under utilizing whichever nodes we were using. So getting that balance right was 00:11:36.040 |
pretty important. Another thing is networking bottlenecks. We want to make sure all of our 00:11:40.760 |
data and nodes are close together. Like, you know, imagine if, for example, you had a bunch of 00:11:45.880 |
your data in, like, US West, and maybe you had your nodes in, you know, Asia or something 00:11:52.600 |
like that, right? You know, that's like a simple example, but bottom line is networking communication is 00:11:57.640 |
pretty important. And you want to make sure that, you know, when you're sending this data across, 00:12:01.800 |
that's not going to be a bottleneck, because if you're training, one thing that will 00:12:05.560 |
happen is when you send gradients to different copies of the model, that could be a bottleneck 00:12:10.680 |
there. Another bottleneck could be data reading, um, because recall that, you know, while your model 00:12:15.960 |
is training and you're doing your forward and backward propagations, um, asynchronously, you're 00:12:20.200 |
loading in data to be fed to the model. Um, and so this is where the NFS read speed is absolutely 00:12:27.640 |
critical. Um, if your NFS is not reading in your data fast enough, then you could be bottlenecked 00:12:32.680 |
waiting for data to be processed and your model isn't actually crunching. Um, and you know, one key, key 00:12:39.160 |
takeaway for both of these is make sure your GPU utilization is good. NVIDIA SMI is your best friend here. 00:12:44.920 |
If you see your GPU utilization being low, you know, don't be afraid to look into why and, you know, 00:12:50.200 |
you can, you know, do different things to debug. Like if it's multi node, for example, then, you know, 00:12:55.320 |
you can test networking and stuff like that. If it's on a single node, that likely means it's a data 00:12:59.640 |
loading issue. So there's lots of different ways and tools that you can use to step into these things. 00:13:05.160 |
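To make the nvidia-smi advice concrete, here is a small illustrative helper (not a Snorkel tool) that polls GPU utilization and memory during a run, so you can spot idle GPUs that point to data-loading or networking bottlenecks.

```python
# Poll nvidia-smi periodically and print per-GPU utilization and memory usage.
import subprocess
import time

def log_gpu_utilization(interval_s: int = 10, iterations: int = 6) -> None:
    query = ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader,nounits"]
    for _ in range(iterations):
        out = subprocess.run(query, capture_output=True, text=True).stdout
        for line in out.strip().splitlines():
            idx, util, mem = [field.strip() for field in line.split(",")]
            print(f"GPU {idx}: {util}% utilization, {mem} MiB used")
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_utilization()
```

Consistently low utilization on a single node usually points to the data loader or NFS read speed; low utilization only in multi-node runs points more toward inter-node networking.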
And, you know, don't underestimate the basics of like reliability, flexibility and manageability. 00:13:10.360 |
So, you know, one thing that we really cared about as a team is flexibility: 00:13:15.080 |
when we're running training experiments, right, 00:13:19.560 |
sometimes we needed all the nodes, sometimes we needed very few. And, you know, being 00:13:24.760 |
able to work with instances that gave us that flexibility over a long period of time is very 00:13:28.920 |
important. You know, when we were shopping around, some cloud providers only let us use compute for a fixed 00:13:34.760 |
amount of time, like maybe, say, a month or two months. And, you know, as a trade-off, we'd have 00:13:39.400 |
a bunch of compute, but that didn't really work for us because we weren't in it for training a model for 00:13:44.520 |
some fixed amount of time. We wanted something where we could go on and off for a longer period of time. 00:13:49.720 |
That's great insight, Humza. I think also we talked about some of the advantages of using Azure, which 00:14:01.480 |
would be great if you could shamelessly plug that for Azure as well. 00:14:04.760 |
Yeah, happy to. So, you know, one was availability. You know, the Azure VMs were dedicated and allowed 00:14:11.240 |
us to adjust our capacity on demand. The reliability was pretty good. It was consistently dependable with, 00:14:17.080 |
like, no real issues. You know, NFS throughput was also quite good. You know, again, right, like, you know, 00:14:23.960 |
if your NFS is bad, then that means that, you know, you're not reading in data fast enough. Or, 00:14:28.440 |
for example, being able to dynamically change the size of your NFS. If, let's say, for example, 00:14:33.240 |
you need more or less capacity. If you need more because you suddenly have more data than you realize 00:14:38.200 |
you had before, then you need to be able to tell it that. At the same time, if you realize that, you know, 00:14:42.760 |
your NFS is over-provisioned, you don't want to be, you know, overpaying in Azure bills. Though I'm sure 00:14:48.360 |
Lachie wouldn't mind that. And, you know, the ease of use is very important: clear documentation 00:14:55.160 |
and straightforward process. You know, like, as the guy that set up Azure for my team, I really didn't like it if 00:15:00.840 |
other people needed to bug me. And thankfully, once I got things working, it just worked and I did not need to be paged. 00:15:07.160 |
So, that is very important. That's really good to hear, Humza. 00:15:11.960 |
Yeah, I think, like, you know, this is fantastic. And I think you had some specific data points. You guys 00:15:21.400 |
recently went through a process to go from the A100s through to the H100s or the VMs leveraging those. 00:15:29.480 |
Can you share a bit of insight around what you observed with that experience? 00:15:33.560 |
Yeah. So, the key takeaway is that H100s are really good. One thing we wanted to do when we were 00:15:41.160 |
doing this was do a cost analysis and see, okay, for a given number of H100s and a given number of 00:15:46.520 |
A100s that cost the same amount, what kind of training and inference are we getting? And so, 00:15:50.280 |
here you can see we're comparing two H100s to four A100s because that's what works out about the same 00:15:56.120 |
cost-wise. And we're doing better on both training and inference, which means that we're doing better 00:16:01.400 |
per dollar just by switching here. And, you know, there are a couple of key points I really want to 00:16:07.720 |
emphasize here. One is that, you know, it's really nice when you just have a very simple plug-and-play 00:16:14.520 |
change that works. You know, there's a lot of ongoing work to optimize, you know, 00:16:18.920 |
especially like things like inference, for example, with your KV caches, your partial KV caches, your 00:16:23.400 |
speculative decodings. And it's really nice to be able to say, hey, let's just do something simple 00:16:28.200 |
and have it work. And the second is that, with this faster inference in particular, we can go through 00:16:34.200 |
more synthetic data, get higher end-model accuracy, and it just enables a flywheel of faster iteration, 00:16:39.000 |
which is super critical for being able to, like, do more development. 00:16:42.680 |
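The comparison described here boils down to throughput per dollar at roughly equal cost (two H100s versus four A100s). A back-of-the-envelope sketch of that calculation is below; every number in it is a hypothetical placeholder, not a figure from Snorkel's benchmark or Azure's price list.

```python
# Back-of-the-envelope throughput-per-dollar comparison (hypothetical numbers only).
def throughput_per_dollar(samples_per_sec: float, num_gpus: int, price_per_gpu_hour: float) -> float:
    return samples_per_sec / (num_gpus * price_per_gpu_hour)

# Hypothetical: 2x H100 vs 4x A100 at roughly equal total hourly cost.
h100 = throughput_per_dollar(samples_per_sec=900, num_gpus=2, price_per_gpu_hour=5.0)
a100 = throughput_per_dollar(samples_per_sec=600, num_gpus=4, price_per_gpu_hour=2.5)
print(f"H100 config: {h100:.1f} samples/sec per $/hr")
print(f"A100 config: {a100:.1f} samples/sec per $/hr")
```

If the two configurations really do cost about the same per hour, whichever one pushes more samples through per second wins on a per-dollar basis, which is the sense in which the plug-and-play H100 switch paid off.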
Yeah, I mean, you can see from the numbers, the performance is there. And I'm assuming that, 00:16:50.440 |
internally, just that ability to do more with fewer GPUs has been a really great benefit for you guys as well. 00:16:57.960 |
So your question is, why is the inference taking longer than the training time? 00:17:18.440 |
I think it was because we were doing the larger batch size, and it just happened to 00:17:24.840 |
work out that way, that for whatever batch size we were doing, it just wound up taking longer, 00:17:29.880 |
because, yeah, with the training, you typically need a smaller batch size, because you have to 00:17:33.320 |
put more things in memory for backprop, and somehow it just wound up working that way. 00:17:51.720 |
No, because I know that when we were comparing training and inference across the hardware, 00:18:01.000 |
those were kept fixed, like whatever inference batch size we were using for the H100 was the same as for the A100, 00:18:14.040 |
yeah, so speaking about what's next, I think this is a good segue, so with the next slide, 00:18:25.560 |
just from our point of view with Azure, we are continuing to adopt new silicon, and we spoke about optimizing at every layer of the stack, 00:18:36.360 |
and I think there's the addition of our own AI accelerator, the Maia chip, which is used for our own internal workloads across Microsoft 365. 00:18:44.520 |
But we've just announced the AMD MI300X with its high-bandwidth memory, we've got the NVIDIA A100s, the H100s, and we'll be adopting Blackwell. 00:18:56.440 |
What's really exciting is the pace of innovation in silicon has never been like this before. 00:19:04.040 |
We're talking about NVIDIA doing almost two releases a year. 00:19:07.800 |
Previously, you know, I think it might have been one every two years or something like that. 00:19:14.040 |
It's really amazing to see this growth in the silicon, how far we're getting, and what Humza just shared, 00:19:21.880 |
the ability to do more and more with less on the infrastructure as well. 00:19:26.440 |
And so I think, you know, certainly as far as Azure and our AI infrastructure strategy, 00:19:33.320 |
it's to continue to adopt these new evolutions and really make sure the right GPUs are used in the right places for the right workloads. 00:19:43.000 |
Humza, what about for Snorkel AI, what's next? 00:19:45.640 |
Yeah, so to give you all a sneak peek into what we're working on actively at Snorkel, 00:19:50.680 |
we have a number of directions. You know, one is, you know, that 00:19:54.840 |
better data leads to better gen AI and gets better prototypes to production. 00:19:58.520 |
So we want to explore new ways to programmatically utilize preference signals for data synthesis and curation. 00:20:03.880 |
We also want to develop scalable SME entry points for data development using rationales and custom taxonomies. 00:20:11.640 |
And finally, we want, you know, better multimodal retrieval algorithms. 00:20:14.840 |
And we want to evaluate those on domain specific data sets to scale up the retrieval models we have. 00:20:20.920 |
And we're, of course, excited to do all these things on Azure AI infra. 00:20:28.680 |
Well, thank you so much for presenting and speaking on behalf of Microsoft.