Hey everyone, thanks so much for coming. I'm Hamza. I'm an applied research scientist at Snorkel on the computer vision team, working on fine-tuning foundation models for enterprise use cases. Thanks, Hamza. I like the ears. Thank you. Why don't you start out by telling the folks a little bit about Snorkel and what you do there.
It's an interesting name. Any relation to AI for scuba diving or anything like that? Well, hot tubs, actually. Once I joined the company, I learned that snorkel.com is a hot tub company. But no, we don't do anything for scuba diving or hot tubbing. To give a little bit of context, the main problem that we're trying to solve is data development for the enterprise.
So, one key thing I want to call out is that, out of the box, LLMs rarely meet enterprise quality, latency, and cost requirements. To give some context, our customers are Fortune 500 companies like banks, insurance companies, places like that.
And for them to deploy their models, they really need them to be very reliable and very accurate. Off-the-shelf models like Claude or GPT-4 or Gemini may get you part of the way there, but they don't cover that final mile that lets you say, yes, we can deploy this with confidence.
So, what we focus on is developing data to fine-tune these models to get them there. So, Hamza, this makes sense, but why is it hard? What's the challenge in addressing this issue? Yeah. So, data development is fundamentally challenging, and there are a few key reasons for that. One is that RAG is just a starting point.
You may have heard of using RAG to hook up enterprise knowledge bases to these models so they can surface information they weren't pre-trained on. And I don't want to RAG on it. It's great and all, but it's just a starting point.
It won't get you all the way there. Quality in your data is absolutely key, and finding and maintaining the right data is critical, because a lot of the time the common instruction-tuning datasets may be very big, but they don't contain exactly the information you need.
If you're training specifically on, say, a specific type of bank policy, or a specific type of policy for a certain industry, you need that very particular slice of information to really improve these models. So I think folks have a good understanding of what you guys do and why you do it.
Why don't you talk a little bit about what Snorkel is and what you're famous for, besides the interesting name? Yeah. So Snorkel pioneered data development for LLMs, and we're trusted by many different companies. We've worked with lots of companies in the Fortune 500 and beyond.
We were spun out of the Stanford AI Lab quite a while ago, and we have deep experience in data development, because it's key to many aspects of ML. We've published many papers in a lot of hot fields like prompting, RAG, architectures, and so on.
Okay, nice. I think we're going to switch context here for a bit and talk a little about the specific research projects you guys are focused on. You guys have a research-first culture. Yeah. Why don't you explain a little bit about that and the projects? Yeah, thanks, Lachie.
So first I want to talk a little bit about our research focus overall. The core question we try to answer is: how can enterprises best develop their data for custom AI models? There are a few different directions we want to pursue.
One is keeping SMEs in the loop while maximizing the value of their time. Because, again, for a lot of these industries we need the subject matter experts who know the key details in order to provide feedback to models and help them improve.
The second is making data development programmatic, scalable, and auditable. Because while we do need SMEs, we also need to make things scalable in a way that purely manual work isn't. It's really the combination of those two things that makes this work.
And the third is continuous evaluation with domain-specific dynamic benchmarks. I'm sure you've all seen things like LMSYS, and it's pretty good for a general understanding of where a lot of these LLMs fall in terms of their capabilities. But for specific industries, you need specific benchmarks to say, how good is it at this?
A bank isn't going to care about how well these LLMs do at, say, grade school math, right? So there's that. And I want to go into a little more detail and talk about some active research projects we're working on right now. One is fine-grained evaluation: looking at where evaluation for these models is broken.
One particular area is long-context models. You may have seen things like the needle-in-the-haystack test, where you take a bunch of Paul Graham essays, insert some sentence, and see how well the model can find it. But one thing we found is that that doesn't necessarily give a proper sense of how these models handle long context in other domains.
Really, breaking everything down domain by domain is super critical, so we're figuring out how we can improve long context overall. Another key area is enterprise alignment: making sure that these LLMs comply with company goals, regulations, and all that. We don't want our LLMs committing any career-limiting moves while outputting text.
And another area, which is particularly near and dear to me because I work on it actively, is multimodal alignment. We find that these models trained on public data underperform in specific domains. One area we're working on is using large vision-language models, or LVLMs, to generate synthetic data without manual annotation that can then be used to train downstream models.
So being able to have this flywheel of going from specific data generation in the loop to model training is something that we're very excited about. That's great, Hamza. We're excited that a lot of these projects are happening on Azure's AI infrastructure, obviously, as well.
If you forward the slides a little bit: you went through an experience with Azure, getting on board and running these projects. I think people are really interested to understand the best practices you found working with our infrastructure, some of the pitfalls, and the benefits as well.
You know, this one's my slide. The way we think about the infrastructure supporting this wave of AI, it's really about optimising it in every sense possible for the different AI applications and use cases. We look at everything from our Azure data centres, and we have over 300 worldwide.
The CPU or the host, combining our virtual machines with the right CPUs and offering the right throughput. The accelerator, where we use a diversity of accelerators from AMD, NVIDIA and our own first-party silicon as well, with the Maia chip. We have topologies that optimise the I/O between the different layers, and obviously the networking throughput as well.
So it's really about making sure that we can take the best of breed of what we do at supercomputing scale and deliver that back to customers. We have a real cycle of learning from working with organisations like OpenAI, Mistral and others that have trained their models on Azure's infrastructure, and then being able to democratise that and deliver it back to customers as well.
And so, Hamza, with that, why don't you share a little bit more about what exactly you guys did on Azure? Yeah, so first we'll talk a little bit about how we do distributed training in general on Azure. We have a stack. On the ML framework side we use PyTorch, a pretty standard framework.
We also use a library called Horovod, which handles multi-node communication. So if we have multiple Azure VMs, how do they communicate with each other? Horovod allows for fast communication across nodes. On the underlying hardware layer, we use a bunch of Azure VMs, which could be A100s or H100s.
We use Horovod to connect them and send gradients through for distributed training, and they all read and write to a single network file system, or NFS. You can basically think of an NFS as a shared file system that every machine has access to as if it were a local file system, which makes it very seamless to read data or write checkpoints for models.
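To make that concrete, here is a minimal sketch of what that PyTorch-plus-Horovod pattern can look like. The model, dataset, learning rate, and the /mnt/nfs checkpoint path are all illustrative placeholders, not Snorkel's actual code.

```python
# Minimal sketch of the PyTorch + Horovod pattern described above; the model,
# dataset, learning rate, and the /mnt/nfs path are illustrative placeholders.
# Launched with something like: horovodrun -np <total_gpus> python train.py
import torch
import horovod.torch as hvd

hvd.init()                                    # one process per GPU, across all VMs
torch.cuda.set_device(hvd.local_rank())       # pin this process to its local GPU

model = torch.nn.Linear(1024, 10).cuda()      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Horovod wraps the optimizer so gradients are all-reduced across workers
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# Make sure every worker starts from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

dataset = torch.utils.data.TensorDataset(torch.randn(4096, 1024),
                                         torch.randint(0, 10, (4096,)))
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)                  # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                       # all-reduce of gradients kicks off here
        optimizer.step()
    if hvd.rank() == 0:                       # every VM sees the same shared NFS mount
        torch.save(model.state_dict(), f"/mnt/nfs/checkpoints/epoch_{epoch}.pt")
```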
So that's our overall infra stack. And what about running on Azure? You guys had some specific workloads that you were covering. Yeah. So we've run a number of projects on Azure, at different sizes, from one node to dozens of nodes.
We've run DPO to align models with a bunch of instruction-response preference datasets, and we've run other preference optimization techniques. The things that have used the most compute have been large-scale distributed training jobs for multimodal training and inference, with dozens of GPUs.
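For readers who haven't seen DPO before, here is a minimal sketch of the objective in plain PyTorch; the per-response log-probability tensors are assumed inputs and the beta value is just an example, not a detail from the talk.

```python
# Minimal sketch of the DPO objective. Inputs are assumed to be summed
# per-response token log-probs; beta is an example value.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument: 1-D tensor of log p(response | prompt), summed over tokens,
    under the trainable policy or the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probs for a batch of 8 preference pairs
logps = torch.randn(8)
print(dpo_loss(logps - 0.1, logps - 0.5, logps, logps))
```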
So, yeah, those are the kinds of workloads we've run. And I think you had some lessons learned along the way as well. It's not always smooth sailing with these types of jobs. Any best practices, or traps for young players out there, in terms of your experience?
Yeah, absolutely. There are a number of key architectural considerations to keep in mind. One is having enough nodes to support your ideal batch size. On the CV side, when I was training, say, CLIP-based models, one thing I learned is that we wanted enough nodes to reach a certain batch size, but not so many that we would either fall into the trap of having too large a batch size or underutilize the nodes we were using.
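As a rough illustration of that balancing act, here is a tiny back-of-the-envelope helper; the GPUs-per-node count and batch sizes are hypothetical, not numbers from the talk.

```python
# Back-of-the-envelope helper for the node-count vs. batch-size balance above.
# The GPUs-per-node count and batch sizes are hypothetical, not from the talk.
def nodes_for_batch(target_global_batch, per_gpu_batch, gpus_per_node=8):
    per_node = per_gpu_batch * gpus_per_node
    nodes = -(-target_global_batch // per_node)   # ceiling division
    return nodes, nodes * per_node                # nodes needed, batch you actually get

nodes, achieved = nodes_for_batch(target_global_batch=4096, per_gpu_batch=64)
print(nodes, achieved)  # 8 nodes -> global batch of 4096 in this example
```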
So getting that balance right was pretty important. Another thing is networking bottlenecks. We want to make sure all of our data and nodes are close together. Imagine, for example, that you had a bunch of your data in US West and your nodes in Asia, or something like that, right?
That's a simple example, but the bottom line is that network communication is pretty important, and you want to make sure that sending data across isn't going to be a bottleneck. Because when you're training, one thing that happens is that you send gradients to the different copies of the model, and that can become a bottleneck.
Another bottleneck can be data reading, because recall that while your model is training and you're doing your forward and backward passes, you're asynchronously loading in data to be fed to the model. This is where the NFS read speed is absolutely critical. If your NFS is not reading in your data fast enough, then you can be bottlenecked waiting for data to be processed while your model isn't actually crunching.
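One common way to keep the GPUs fed is sketched below, assuming a PyTorch DataLoader reading from the shared NFS mount; the dataset is a stand-in, and the worker and prefetch settings are example values to tune rather than prescriptions.

```python
# Sketch of keeping GPUs fed while reading from NFS: overlap data loading with
# compute via worker processes, prefetching, and pinned memory.
import torch

class NFSDataset(torch.utils.data.Dataset):
    """Placeholder for a dataset whose samples live on the shared NFS mount."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        # In practice this would read a sample file from the NFS mount
        return torch.randn(1024), idx % 10

loader = torch.utils.data.DataLoader(
    NFSDataset(),
    batch_size=64,
    num_workers=8,            # parallel reader processes hide NFS latency
    prefetch_factor=4,        # each worker keeps batches queued ahead of the GPU
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid re-spawning workers every epoch
)
```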
And one key takeaway for both of these is: make sure your GPU utilization is good. nvidia-smi is your best friend here. If you see your GPU utilization being low, don't be afraid to look into why, and there are different things you can do to debug.
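For instance, a quick polling loop in the spirit of that advice might look like this; the 50% threshold, the five-second interval, and the loop length are arbitrary example values.

```python
# Quick polling loop in the spirit of "nvidia-smi is your best friend": flag
# GPUs that look starved. The 50% threshold and 5 s interval are arbitrary.
import subprocess
import time

def gpu_utilization():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True)
    return [int(line) for line in out.strip().splitlines()]

for _ in range(10):
    starved = [i for i, u in enumerate(gpu_utilization()) if u < 50]
    if starved:
        print(f"GPUs {starved} below 50% utilization -- check data loading or networking")
    time.sleep(5)
```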
If it's multi-node, for example, you can test the networking and things like that. If it's on a single node, that likely means it's a data loading issue. So there are lots of different ways and tools you can use to dig into these things. And don't underestimate the basics of reliability, flexibility, and manageability.
One thing we really cared about as a team was flexibility: when we were running training experiments, sometimes we needed all the nodes, and sometimes we needed very few.
And being able to work with instances that gave us that flexibility over a long period of time was very important. When we were shopping around, some cloud providers would only let us use compute for a fixed amount of time, maybe a month or two.
As a trade-off we'd get a bunch of compute, but that didn't really work for us, because we weren't training one model for some fixed window. We wanted something where we could ramp on and off over a longer period of time. That's great insight, Hamza.
I think we also talked about some of the advantages of using Azure, which would be great if you could shamelessly plug for Azure as well. Yeah, happy to. One was availability. The Azure VMs were dedicated and allowed us to adjust our capacity on demand.
The reliability was also good; it was consistently dependable with no real issues. NFS throughput was quite good too. Again, if your NFS is slow, you're not reading in data fast enough. Another advantage was being able to dynamically change the size of your NFS
if, say, you need more or less capacity. If you need more because you suddenly have more data than you realized, you need to be able to scale it up. At the same time, if you realize your NFS is over-provisioned, you don't want to be overpaying on Azure bills.
Though I'm sure Lachie wouldn't mind that. And ease of use is very important: clear documentation and a straightforward process. As the guy who set up Azure for my team, I really didn't like it when other people needed to bug me. Thankfully, once I got things working, it just worked and I did not need to be paged.
So that is very important. That's really good to hear, Hamza. Yeah, I think this is fantastic. And I think you had some specific data points. You guys recently went through a process of going from the A100s to the H100s, or the VMs leveraging those.
Can you share a bit of insight around what you observed with that experience? Yeah. So the key takeaway is that H100s are really good. One thing we wanted to do was a cost analysis: okay, for a given number of H100s and a given number of A100s that cost the same amount, what kind of training and inference throughput are we getting?
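As a rough illustration of that kind of cost-normalized comparison, here is a small sketch; the throughputs and per-GPU-hour prices are made-up placeholders, and only the arithmetic mirrors the analysis described.

```python
# Sketch of a cost-normalized throughput comparison. The throughputs and
# per-GPU-hour prices below are made-up placeholders, not numbers from the talk.
def samples_per_dollar(samples_per_sec, num_gpus, dollars_per_gpu_hour):
    return samples_per_sec * 3600.0 / (num_gpus * dollars_per_gpu_hour)

# Hypothetical comparison: 2x H100 vs 4x A100 at roughly equal total hourly cost
h100 = samples_per_dollar(samples_per_sec=900, num_gpus=2, dollars_per_gpu_hour=4.0)
a100 = samples_per_dollar(samples_per_sec=700, num_gpus=4, dollars_per_gpu_hour=2.0)
print(f"H100 samples per dollar: {h100:.0f}, A100 samples per dollar: {a100:.0f}")
```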
And so here you can see we're comparing two H100s to four A100s, because that works out to be about the same cost-wise. We're doing better on both training and inference, which means we're doing better per dollar just by switching. And there are a couple of key points I really want to emphasize here.
One is that it's really nice when you have a simple plug-and-play change that just works. There's a lot of ongoing work to optimize things, especially inference, for example with your KV caches, your partial KV caches, your speculative decoding.
And it's really nice to be able to say, hey, let's just do something simple and have it work. The second is that, with this faster inference in particular, we can generate more synthetic data and get higher end-model accuracy, and that enables a flywheel of faster iteration, which is super critical for being able to do more development.
Yeah, I mean, you can see from the numbers, the performance is there. And I'm assuming that, internally, just that ability to do more with fewer GPUs has been a really great benefit for you guys. So your question is, why is inference taking longer than training? I think it was because we were using a larger batch size, and it just happened to work out that way: for whatever batch size we were using, inference wound up taking longer. With training you typically need a smaller batch size, because you have to keep more things in memory for backprop, and it just wound up working out that way.
Do you want the mic? No. I know that when we were comparing training and inference across the hardware, those were kept fixed: whatever inference batch size we were using for the H100 was the same as for the A100, so that wasn't a factor. Yeah, so speaking about what's next, I think this is a good segue. With the next slide, just from our point of view with Azure, we spoke about optimising at every layer of the stack, and there's the addition of our own AI accelerator, which is used for our internal workloads across Microsoft 365.
We've also just announced the AMD MI300X with its high-bandwidth memory, we've got the NVIDIA A100s and H100s, and we'll be adopting Blackwell. What's really exciting is that the pace of innovation in silicon has never been like this before. We're talking about NVIDIA doing almost two releases a year, where previously I think it might have been one every two years or something like that.
It's really amazing to see this growth in silicon, how far we're getting, and, as Hamza just shared, the ability to do more and more with less on the infrastructure as well. So certainly, as far as Azure and our AI infrastructure strategy goes, it's to continue to adopt these new evolutions and really make sure the right GPUs are used in the right places for the right workloads.
Hamza, what about for Snorkel AI, what's next? Yeah, so to give you all a sneak peek into what we're actively working on at Snorkel, we have a number of directions. One is the thesis that better data leads to better gen AI, and getting better prototypes to production.
We want to explore new ways to programmatically utilize preference signals for data synthesis and curation. We also want to develop scalable SME entry points for data development using rationales and custom taxonomies. And finally, we want better multimodal retrieval algorithms, and we want to evaluate those on domain-specific datasets to scale up the retrieval models we have.
And we're, of course, excited to do all of this on Azure AI infra. Fantastic, Hamza. Well, thank you so much for presenting and speaking on behalf of Microsoft. We really appreciate it. Thank you. Thank you guys so much. Thank you.