Insights from Snorkel AI running Azure AI Infrastructure: Humza Iqbal and Lachlan Ainley


00:00:00.000 | Hey everyone, thanks so much for coming. I'm Humza. I'm an applied research scientist at
00:00:17.520 | Snorkel on the computer vision team, working on fine-tuning foundation models
00:00:23.100 | for enterprise use cases. Thanks, Humza. I like the ears. Thank you. Why don't you start
00:00:33.340 | out by telling the folks a little bit about Snorkel and what you do there? It's an interesting
00:00:39.520 | name, any relation to AI for scuba diving or anything like that? Well, hot tubs, actually,
00:00:44.920 | because snorkel.com, I learned once I joined the company, is a hot tub company. But no,
00:00:50.880 | we don't do anything for scuba diving or hot tubbing. To give a little bit of context,
00:00:56.660 | the main problem that we're trying to solve is data development for the
00:01:00.560 | enterprise. One key thing I want to call out is that,
00:01:06.840 | out of the box, LLMs rarely meet enterprise quality, latency, and cost requirements.
00:01:12.880 | To give some context, our customers are Fortune 500 companies: banks, insurance
00:01:17.620 | companies, places like that. For them to deploy their models, they really need them
00:01:21.940 | to be very reliable and very accurate. Off-the-shelf models like Claude or GPT-4 or Gemini
00:01:29.240 | may get you part of the way there, but they don't cover that final mile that
00:01:34.080 | lets you say, yes, we can deploy these completely. So what we focus on is developing
00:01:40.040 | data to fine-tune these models to get them there.
00:01:43.500 | So, Humza, this makes sense, but why is it hard? What's the challenge in addressing this issue?
00:01:51.660 | Yeah. So, data development is fundamentally challenging, and there are a few key reasons
00:01:58.040 | for that. One is that RAG is just a starting point. You may have
00:02:02.520 | heard of using RAG to hook up enterprise knowledge databases
00:02:07.440 | to these models so they can surface information they weren't pre-trained
00:02:12.540 | on. And I don't want to RAG on it. It's great and all, but it's just a starting point.
00:02:17.540 | It won't get you all the way there. Quality in your data is absolutely
00:02:22.660 | key, and finding and maintaining the right data is critical, because a lot of
00:02:29.280 | the time the common instruction-tuning datasets may be very big, but they don't contain
00:02:34.460 | exactly the information that you need. If you're training specifically on, say,
00:02:38.680 | a specific type of bank policy or a specific type of policy for certain industries, you need
00:02:44.600 | that very key slice of information to be able to really improve these models.
00:02:50.840 | So I think folks have a good understanding of what you guys do and why you do it. Why don't
00:02:58.400 | you talk a little bit about what Snorkel is and what you're famous for besides the interesting
00:03:03.240 | name?
00:03:03.600 | Yeah. So Snorkel pioneered data development for LLMs, and we're trusted by many different
00:03:11.280 | companies. We've worked with lots of companies in the Fortune 500 and
00:03:15.300 | beyond. We were spun out of the Stanford AI Lab quite a while ago and have deep
00:03:20.520 | experience in data development, because it's key to
00:03:24.580 | many aspects of ML, and we've published many papers in a lot of hot fields like
00:03:29.580 | prompting, RAG, architectures, and so on.
00:03:31.520 | Okay, nice. I think we're going to switch context here for a bit and talk a little bit about the
00:03:40.560 | specific research projects you guys are focused on. You guys have a research-first culture.
00:03:47.920 | Yeah. Why don't you explain a little bit about that and the projects?
00:03:51.940 | Yeah, thanks, Lachie. So first I want to talk a little bit about our research focus
00:03:58.140 | overall. The core question we try to answer is: how can enterprises best
00:04:03.580 | develop their data for custom AI models? There are a few different directions
00:04:09.020 | we want to pursue. One is keeping SMEs in the loop while maximizing the value of their
00:04:14.140 | time, because for a lot of these industries we need the subject
00:04:18.580 | matter experts that know the key details in order to provide feedback
00:04:24.820 | to models and help them improve. The second is making data development
00:04:30.420 | programmatic, scalable, and auditable. While we do need SMEs, at the same time
00:04:36.900 | we also need to make things scalable in a way that solely manual intervention isn't. So it's
00:04:42.520 | really being able to combine those two things that makes this important. And
00:04:48.280 | the third is continuous evaluation with domain-specific, dynamic benchmarks. I'm sure you all
00:04:54.460 | have seen things like LMSYS, and it's pretty good for getting a general understanding
00:05:00.020 | of where a lot of these LLMs fall in terms of their ability to do things. But for specific
00:05:05.240 | industries, you need specific benchmarks to say, how good is it at this? A bank
00:05:11.320 | isn't going to care about how well these LLMs do at, say, grade school math, right?
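To make the "programmatic, scalable, and auditable" idea concrete, here is a minimal sketch in the style Snorkel is known for, using the open-source Snorkel library's labeling functions. The rules and data are toy placeholders of our own; this is an illustration of the general approach, not necessarily how the Snorkel AI platform implements it.

```python
# Illustrative programmatic labeling with the open-source Snorkel library:
# encode SME heuristics as labeling functions, then combine their noisy votes
# into training labels with a label model. The rules and data are toy examples.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, OTHER, COMPLAINT = -1, 0, 1

@labeling_function()
def lf_mentions_refund(x):
    return COMPLAINT if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_mentions_thanks(x):
    return OTHER if "thank" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "I want a refund for this charge",
    "Thanks for the quick help!",
    "Please escalate my refund request",
]})

lfs = [lf_mentions_refund, lf_mentions_thanks]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)   # matrix of LF votes

# The label model learns LF accuracies to produce denoised training labels.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=0)
probs = label_model.predict_proba(L=L_train)            # probabilistic labels
```

The point is that SME knowledge is captured as reviewable code, which is what makes the process auditable and scalable compared with purely manual labeling.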
00:05:17.080 | So there's that. And I want to go into a little bit more detail and talk about
00:05:22.840 | some active research projects we're working on right now. One is fine-grained evaluation: looking at
00:05:29.080 | where evaluation for these models is broken. One particular area is long-context
00:05:34.840 | models. You may have seen things like the needle-in-the-haystack test, where you take a bunch of
00:05:40.840 | Paul Graham essays, insert some sentence, and see how well the model can find it. But
00:05:46.600 | one thing we found is that that doesn't necessarily give a proper sense of how these models
00:05:52.360 | handle long context in other domains. Really, breaking everything down domain by
00:05:57.480 | domain is super critical. So we're figuring out how we can improve long context overall.
00:06:04.600 | Another key area is enterprise alignment: making sure that these LLMs comply with
00:06:10.200 | company goals, regulations, and all that. We don't want our LLMs to be committing any
00:06:15.160 | career-limiting moves while outputting text. And another area, which is particularly near and dear to
00:06:21.800 | me because I work on it actively, is multimodal alignment. We find that these models trained
00:06:27.080 | on public data underperform in specific domains. One area we're working
00:06:32.440 | on is using large vision-language models, or LVLMs, to generate synthetic data without
00:06:38.280 | manual annotation and use it to train downstream models. So being able to
00:06:43.320 | really have this flywheel going from specific data generation in the loop to model training
00:06:49.320 | is, I think, something that we're very excited about.
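To illustrate the needle-in-a-haystack style long-context check Humza mentions, here is a small hypothetical sketch; `query_model`, the filler paragraphs, and the needle sentence are placeholders, not Snorkel's evaluation harness.

```python
# Hypothetical needle-in-a-haystack probe: bury a "needle" sentence at a chosen
# depth inside a long filler context and check whether the model can retrieve it.
# query_model() is a placeholder for whatever LLM API you are evaluating.

def build_haystack(filler_paragraphs: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative position `depth` (0.0 = start, 1.0 = end)."""
    idx = int(len(filler_paragraphs) * depth)
    parts = filler_paragraphs[:idx] + [needle] + filler_paragraphs[idx:]
    return "\n\n".join(parts)

def run_probe(query_model, filler_paragraphs, needle, question, answer, depths):
    """Return, per insertion depth, whether the model surfaced the answer."""
    results = {}
    for depth in depths:
        context = build_haystack(filler_paragraphs, needle, depth)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        response = query_model(prompt)
        results[depth] = answer.lower() in response.lower()
    return results  # e.g. {0.0: True, 0.5: False, 1.0: True}
```

Swapping the generic filler for documents from the target domain (policies, filings, clinical notes) is exactly the domain-by-domain breakdown he is pointing at.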
00:06:53.400 | That's great, Humza. We're excited that a lot of these projects are happening on
00:06:59.640 | Azure's AI infrastructure, obviously, as well. If you forward the slides a little bit:
00:07:08.040 | you went through an experience with Azure, getting on board and running these projects,
00:07:15.240 | and I think people are really interested to understand the best practices you had
00:07:20.200 | working with our infrastructure, some of the pitfalls, and the benefits as well.
00:07:26.920 | This one's my slide. The way we think about infrastructure that's supporting this wave
00:07:35.880 | of AI, it's really about optimising it in every sense possible for the different AI applications and uses.
00:07:46.680 | We look at everything from our Azure data centres, and we have over 300 worldwide,
00:07:51.720 | to the CPU or the host, combining our virtual machines with the right CPUs and offering the right throughput,
00:08:01.160 | to the accelerator, where we use a diversity of accelerators from AMD, NVIDIA and our own first-party silicon
00:08:09.960 | as well with the Maia chip. We have topologies that optimise the I/O between the different
00:08:17.080 | layers, and obviously the networking throughput as well. So it's really about making sure that we can take
00:08:25.160 | the best of breed of what we do at a supercomputing scale and deliver that back to customers, so we have
00:08:31.240 | a real cycle of learning from working with organisations like OpenAI, Mistral and others that
00:08:39.800 | have trained their models on Azure's infrastructure, and then being able to democratise that and deliver it back to
00:08:44.920 | customers as well. And so, Humza, with that, why don't you share a little bit more about what exactly you guys did on Azure?
00:08:55.720 | Yeah, so first we'll actually talk a little bit about how we do distributed training in general with Azure.
00:09:02.120 | So we have a stack. On the ML framework side we use PyTorch,
00:09:08.520 | a pretty standard framework. We also use a library called Horovod, which handles multi-node communication.
00:09:14.920 | So if we have multiple Azure VMs, how do they communicate with each other? It allows for faster
00:09:20.840 | communication across nodes. And on the underlying hardware layer, we use a bunch of Azure
00:09:27.160 | VMs, which could be A100s or H100s. We use Horovod to
00:09:34.040 | connect to them and send gradients through for distributed training, and they all read and write
00:09:38.200 | to a single network file system, or NFS. You can basically think of an NFS as a shared file
00:09:43.880 | system that every machine has access to as if it were a local file system, which makes it very seamless to
00:09:50.200 | read data or write checkpoints for models. So that's kind of our overall
00:09:56.520 | infra stack. And what about running on Azure? You guys had some specific
00:10:05.160 | workloads that you were covering? Yeah. So we've run a number of projects
00:10:10.360 | on Azure, at different sizes, from one node to dozens of nodes.
00:10:16.440 | We've used DPO to align models with a bunch of instruction-response preference datasets,
00:10:22.520 | and we've run other preference optimization techniques as well. The things that
00:10:28.840 | have used the most compute have been large-scale distributed training jobs for multimodal training
00:10:33.800 | and inference, with dozens of GPUs. So, yeah, these are the kinds of workloads that we've run.
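To make the PyTorch + Horovod + shared-NFS setup described above concrete, here is a minimal data-parallel training sketch. The model, data, and /mnt/nfs paths are stand-ins of our own; this illustrates the pattern, not Snorkel's actual training code.

```python
# Sketch of multi-node data-parallel training with PyTorch + Horovod, reading data
# and writing checkpoints on a shared NFS mount. Launch with something like:
#   horovodrun -np 16 -H node1:8,node2:8 python train.py
import torch
import torch.nn.functional as F
import horovod.torch as hvd

hvd.init()                                    # one process per GPU
torch.cuda.set_device(hvd.local_rank())       # pin each process to its local GPU

# Stand-in model and dataset; in practice these would be read from the NFS mount.
model = torch.nn.Linear(512, 2).cuda()
features, labels = torch.randn(10_000, 512), torch.randint(0, 2, (10_000,))
dataset = torch.utils.data.TensorDataset(features, labels)

# Shard the data so each worker sees a distinct slice; the global batch size is
# the per-GPU batch size times hvd.size().
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank())
loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler,
                                     num_workers=8, pin_memory=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4 * hvd.size())
# Horovod wraps the optimizer so gradients are averaged across all nodes.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(3):
    sampler.set_epoch(epoch)
    for x, y in loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x.cuda()), y.cuda())
        loss.backward()                       # gradient allreduce happens here
        optimizer.step()
    if hvd.rank() == 0:                       # only rank 0 writes to the shared NFS
        torch.save(model.state_dict(), f"/mnt/nfs/checkpoints/epoch_{epoch}.pt")
```

PyTorch's native DistributedDataParallel or DeepSpeed would slot into the same place; Horovod is simply the communication library the team describes using here.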
00:10:40.760 | And I think you had some lessons learned throughout as well. It's not
00:10:54.360 | always smooth sailing with these types of jobs. So any best practices, traps for young players
00:11:00.760 | out there in terms of your experience? Yeah, absolutely. So there are a number of key architectural
00:11:09.160 | considerations to keep in mind. One is having enough nodes to support your ideal batch size.
00:11:15.560 | On the CV side, when I was training, say, CLIP-based models,
00:11:20.120 | one thing I learned is that we wanted enough nodes to reach a certain batch size,
00:11:25.560 | but we also didn't want so many that we would either fall into the trap of having
00:11:30.200 | too large a batch size or underutilizing the nodes we were using. So getting that balance right was
00:11:36.040 | pretty important. Another thing is networking bottlenecks. We want to make sure all of our
00:11:40.760 | data and nodes are close together. Imagine, for example, if you had a bunch of
00:11:45.880 | your data in US West, and maybe you had your nodes in Asia or something
00:11:52.600 | like that. That's a simple example, but the bottom line is networking communication is
00:11:57.640 | pretty important. You want to make sure that when you're sending data across,
00:12:01.800 | that's not going to be a bottleneck, because if you're training, one thing that will
00:12:05.560 | happen is when you send gradients to different copies of the model, that could be a bottleneck
00:12:10.680 | there. Another bottleneck could be data reading, because recall that while your model
00:12:15.960 | is training and you're doing your forward and backward propagation, you're asynchronously
00:12:20.200 | loading in data to be fed to the model. This is where the NFS read speed is absolutely
00:12:27.640 | critical. If your NFS is not reading in your data fast enough, then you could be bottlenecked
00:12:32.680 | waiting for data to be loaded while your model isn't actually crunching. One key
00:12:39.160 | takeaway for both of these is to make sure your GPU utilization is good. nvidia-smi is your best friend here.
00:12:44.920 | If you see your GPU utilization being low, don't be afraid to look into why, and
00:12:50.200 | you can do different things to debug. If it's multi-node, for example, then
00:12:55.320 | you can test networking and things like that. If it's on a single node, that likely means it's a data
00:12:59.640 | loading issue. So there are lots of different ways and tools that you can use to step into these things.
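As a concrete companion to the nvidia-smi advice, here is a small hypothetical monitoring sketch that polls per-GPU utilization and flags GPUs that sit idle; sustained dips on a single node usually point to the data-loading bottleneck described above, while dips across nodes point at networking.

```python
# Poll nvidia-smi for per-GPU utilization and flag GPUs that are sitting idle,
# a common symptom of a data-loading or networking bottleneck during training.
import subprocess
import time

def gpu_utilization() -> list[int]:
    """Current utilization percentage for each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return [int(v) for v in out.stdout.split()]

def watch(threshold: int = 60, interval_s: float = 5.0, samples: int = 12) -> None:
    """Warn if any GPU averages below `threshold`% utilization over the window."""
    history = []
    for _ in range(samples):
        history.append(gpu_utilization())
        time.sleep(interval_s)
    for gpu_id, readings in enumerate(zip(*history)):
        avg = sum(readings) / len(readings)
        if avg < threshold:
            print(f"GPU {gpu_id}: average {avg:.0f}% utilization; check DataLoader "
                  f"num_workers, NFS read throughput, or inter-node networking")

if __name__ == "__main__":
    watch()
```

Tools like nvtop or a profiler give the same signal; the point is simply to notice when the GPUs are waiting rather than crunching.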
00:13:05.160 | And don't underestimate the basics of reliability, flexibility and manageability.
00:13:10.360 | One thing that we really cared about as a team is that
00:13:15.080 | when we were running training experiments,
00:13:19.560 | sometimes we needed all the nodes and sometimes we needed very few. Being
00:13:24.760 | able to work with instances that gave us that flexibility over a long period of time is very
00:13:28.920 | important. When we were shopping around, some cloud providers only let us use compute for a fixed
00:13:34.760 | amount of time, maybe, say, a month or two. As a trade-off, we'd have
00:13:39.400 | a bunch of compute, but that didn't really work for us, because we weren't in it to train a model for
00:13:44.520 | some fixed amount of time. We wanted something where we could go on and off over a longer period of time.
00:13:49.720 | That's great insight, Humza. I think we also talked about some of the advantages of using Azure; it
00:14:01.480 | would be great if you could shamelessly plug those for Azure as well.
00:14:04.760 | Yeah, happy to. So, one was availability. The Azure VMs were dedicated and allowed
00:14:11.240 | us to adjust our capacity on demand. The reliability was pretty good; it was consistently dependable with
00:14:17.080 | no real issues. NFS throughput was also quite good. Again,
00:14:23.960 | if your NFS is bad, that means you're not reading in data fast enough. Another example is
00:14:28.440 | being able to dynamically change the size of your NFS if, let's say,
00:14:33.240 | you need more or less capacity. If you need more because you suddenly have more data than you realized
00:14:38.200 | you had before, then you need to be able to tell it that. At the same time, if you realize that
00:14:42.760 | your NFS is over-provisioned, you don't want to be overpaying in Azure bills. Though I'm sure
00:14:48.360 | Lachie wouldn't mind that. And the ease of use is very important: clear documentation
00:14:55.160 | and a straightforward process. As the guy that set up Azure for my team, I really didn't like it if
00:15:00.840 | other people needed to bug me. And thankfully, once I got things working, it just worked and I did not need to be paged.
00:15:07.160 | So that is very important. That's really good to hear, Humza.
00:15:11.960 | Yeah, I think this is fantastic. And I think you had some specific data points. You guys
00:15:21.400 | recently went through a process of moving from the A100s to the H100s, or the VMs leveraging those.
00:15:29.480 | Can you share a bit of insight around what you observed with that experience?
00:15:33.560 | Yeah. So, the key takeaway is that H100s are really good. One thing we wanted to do when we were
00:15:41.160 | doing this was a cost analysis: okay, for a given number of H100s and a given number of
00:15:46.520 | A100s that cost the same amount, what kind of training and inference performance are we getting? So
00:15:50.280 | here you can see we're comparing two H100s to four A100s, because that works out about the same
00:15:56.120 | cost-wise. And we're doing better on both training and inference, which means that we're doing better
00:16:01.400 | per dollar just by switching. There are a couple of key points I really want to
00:16:07.720 | emphasize here. One is that it's really nice when you just have a very simple plug-and-play
00:16:14.520 | change that works. There's a lot of ongoing work to optimize,
00:16:18.920 | especially things like inference, for example, with your KV caches, your partial KV caches, your
00:16:23.400 | speculative decoding. And it's really nice to be able to say, hey, let's just do something simple
00:16:28.200 | and have it work. The second is that, with this faster inference in particular, we can go through
00:16:34.200 | more synthetic data and get higher end-model accuracy, and it enables a flywheel of faster iteration,
00:16:39.000 | which is super critical for being able to do more development.
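The slide's actual figures are not captured in the transcript, so here is a hypothetical sketch of how a per-dollar comparison like the one described (two H100s versus four A100s at roughly equal cost) can be set up; the throughput and price numbers below are made up for illustration.

```python
# Hypothetical cost-normalized comparison of two GPU configurations.
# All throughput and price numbers are illustrative placeholders, not slide data.

def samples_per_dollar(samples_per_sec: float, cost_per_hour: float) -> float:
    """Samples processed per dollar of compute spend."""
    return samples_per_sec * 3600 / cost_per_hour

configs = {
    "2x H100": {"train_sps": 220.0, "infer_sps": 900.0, "cost_per_hour": 8.0},
    "4x A100": {"train_sps": 180.0, "infer_sps": 700.0, "cost_per_hour": 8.0},
}

for name, c in configs.items():
    train = samples_per_dollar(c["train_sps"], c["cost_per_hour"])
    infer = samples_per_dollar(c["infer_sps"], c["cost_per_hour"])
    print(f"{name}: train {train:,.0f} samples/$, inference {infer:,.0f} samples/$")
```

Because the two configurations are priced about the same, any raw throughput advantage shows up directly as better throughput per dollar, which is the comparison being made on the slide.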
00:16:42.680 | Yeah, I mean, you can see from the numbers, the performance is there. And I'm assuming that,
00:16:50.440 | internally, just that ability to do more with fewer GPUs has been a really great benefit for you guys. [Audience question, off-mic.]
00:16:57.960 | So your question is, why is the inference taking longer than the training time?
00:17:18.440 | I think it was because we were doing a larger batch size, and it just happened to
00:17:24.840 | work out that way: for whatever batch size we were doing, it just wound up taking longer.
00:17:29.880 | Because, yeah, with training, you typically need a smaller batch size, because you have
00:17:33.320 | to put more things in memory for backprop, and it just wound up working that way.
00:17:38.440 | Do you want the mic? [Audience follow-up, off-mic.]
00:17:51.720 | No, because I know that when we were comparing training and inference across the hardware,
00:18:01.000 | those were kept fixed: whatever inference batch size we were using for the H100 was the same as the A100,
00:18:07.560 | so that wasn't a factor.
00:18:14.040 | Yeah, so speaking about what's next, I think this is a good segue. So, with the next slide,
00:18:25.560 | just from our point of view with Azure, we spoke about optimizing at every layer of the stack, and we keep adopting new silicon.
00:18:36.360 | There's the addition of our own AI accelerator, which is used for our internal workloads across Microsoft 365.
00:18:44.520 | We've just announced the AMD MI300X with the high-bandwidth memory, we've got the NVIDIA A100s, the H100s, and we'll be adopting Blackwell.
00:18:56.440 | What's really exciting is the pace of innovation in silicon has never been like this before.
00:19:04.040 | We're talking about NVIDIA doing almost two releases a year.
00:19:07.800 | Previously, I think it might have been one every two years or something like that.
00:19:14.040 | It's really amazing to see this growth in the silicon, how far we're getting, and, as Humza just shared,
00:19:21.880 | the ability to do more and more with less on the infrastructure as well.
00:19:26.440 | And so, certainly as far as Azure and our AI infrastructure strategy goes,
00:19:33.320 | it's to continue to adopt these new evolutions and really make sure the right GPUs are used in the right places for the right workloads.
00:19:43.000 | Humza, what about for Snorkel AI, what's next?
00:19:45.640 | Yeah, so to give you all a sneak peek into what we're working on actively at Snorkel,
00:19:50.680 | we have a number of directions. One is to show that
00:19:54.840 | better data leads to better gen AI and gets better prototypes to production.
00:19:58.520 | So we want to explore new ways to programmatically utilize preference signals for data synthesis and curation.
00:20:03.880 | We also want to develop scalable SME entry points for data development using rationales and custom taxonomies.
00:20:11.640 | And finally, we want better multimodal retrieval algorithms,
00:20:14.840 | and we want to evaluate those on domain-specific datasets to scale up the retrieval models we have.
00:20:20.920 | And we're, of course, excited to do all these things on Azure AI infra.
00:20:25.160 | Fantastic, Humza.
00:20:28.680 | Well, thank you so much for presenting and speaking. On behalf of Microsoft,
00:20:34.120 | we really appreciate it.
00:20:35.800 | Thank you.
00:20:36.440 | Thank you guys so much.
00:20:37.560 | Thank you.