All right, well, hi everyone! Thanks so much for joining us today. Today I get the great opportunity of introducing SambaNova and some of our capabilities around reaching over a thousand tokens per second with Llama 3. I'm going to spend a little bit of time getting you oriented around SambaNova, some of the capabilities that we provide as an AI platform, and also some of the underlying technologies that make accomplishments like a thousand tokens per second possible.
Before I start jumping into the content, I want to take the opportunity to introduce some of my colleagues really quick. Joining me today I have Pedro Milan, who is a principal AI engineer; he's going to be leading our workshop component and will be here as we're getting hands-on with the technology.
I also have Varun Krishna, who is a senior principal AI solutions engineer, joining me as well. And I'm Michelle Maturne, our director of solutions engineering; I serve our global customer base at SambaNova. Before jumping into our content, I just want to cover what we're going to be talking about today.
We're going to start off with a little bit of housekeeping and talk about the prerequisites, since we are going to be getting hands-on today, and also introduce our Discord channel, which is where we're going to be communicating with one another and sharing some content and information that you're going to need, such as files, API keys, and things like that.
I'm going to talk about SambaNova to get you oriented around who we are and how we're achieving 1,000 tokens per second. Then I'm going to pass it off to Pedro; he's going to talk you through our workshop today and how we're going to get hands-on, get you oriented around how to get started, and we're actually going to go through a live build and support you along the way.
We'll also spend some time on questions, in case you have any questions about SambaNova, our technology, or anything in the hands-on component. Before we get started, I want to talk a little bit about prerequisites. First of all, we are going to be using our laptops, so hopefully you have them today, and we're also going to need internet access to get through our workshop.
We're going to be working in a Python environment, so hopefully we have Python set up and ready, and we're also going to need to install some packages through pip. We are going to be working in Discord, so if you don't mind, I'm going to give everybody a second to get on Discord and join our channel.
So I'll just give everyone a quick moment to get set up. Once you're set up, maybe give me a thumbs up so I know. Question? The password is TimeToBuild, capital T, capital T, capital B, and it's the general network, not the one that says Speaker. Okay, so just to repeat: the network is AI Engineer, and the password is TimeToBuild, each word capitalized and concatenated. Good? Were you able to join? Okay, awesome, cool. Just want to make sure there are no problems. So we'll give everyone a couple of minutes to get Wi-Fi access. All right. In case you haven't had a chance to set up, maybe just take a picture of this really quick.
We'll also come back to it before we kick off the hands-on component. But I want to spend a little bit of time just getting you oriented around us as SambaNova. So SambaNova: we are a full-stack AI platform, and we've existed since 2017. We were founded out of Stanford University.
Two of our co-founders, Kunle Olukotun and Chris Ré, are actually Stanford professors. Both of them, along with Rodrigo Liang, bring a unique perspective to our founding team, including previous startups that were acquired or had various exits, building other AI startups that are pretty well known in the industry such as Snorkel and Together AI, and a real depth of experience around building out hardware and chips.
We are building the full stack from the ground up. That means we build our own chip, which I'll go into a little bit later, but also everything up through the system level and the software layer. We're on our fourth-generation chip, and we have built an entire stack that allows you to fine-tune models, pre-train models, and also deploy those models with really high-performance inference.
We have raised over a billion dollars in funding from well-known names like BlackRock, Google Ventures, Intel, and GIC. So we are really well established to solve the challenge of building and deploying AI hardware. So what exactly are we targeting, and what exactly are we trying to solve for?
Our customer base comes from a wide variety of enterprises and also government organizations. So we're really aiming to deliver the enterprise-grade AI capabilities that companies and governments require to deliver unique, differentiated capabilities and also things like sovereign AI. Our underlying platform delivers the means to actually operate at the scale of a trillion-plus parameters.
And we're doing that through delivering full stack capabilities. When I say full stack, many folks have probably utilized many of these technologies on this slide. We're not necessarily trying to compete with every single layer involved here. But what we are trying to do is ease the process of getting started and ease the journey along the way.
And so instead of having to make a decision at every single one of these layers, we are actually integrating things into a very seamless experience: from deciding on what chip is going to work with what compute, what compute and chip are going to work with what operating system, and what operating system works with which models.
You don't have to actually make each of these decisions and know that they have to integrate with one another. We actually create this very seamless experience along the way where everything kind of orchestrates and works very nicely. And what we're doing at the end of this is actually delivering the capability to not only fine tune, but deliver really, really fast inference capabilities.
So we're going to demo a little bit of this later, but recently we released a demo that you can actually go try live, and we'll do so later, called Samba-1 Turbo. This is exceeding world records around inference speed, especially when it comes to Llama 3. And you can see that through some of the metrics that were recently published by Artificial Analysis.
Artificial Analysis did a benchmarking exercise to measure inference throughput, in tokens per second, across various hardware providers. And what you can see is that, as a platform, we are far exceeding the throughput capabilities of some of the other providers out there.
And so I want to talk a little bit about how we are enabling such speed. When it comes to the underlying technology, many of us have experienced some of these trends in the industry. Many of us got started with what you see on the right-hand side of the slide, which is the large monolithic model.
This is the likes of OpenAI, for example, or Gemini or Claude. And many of us started our LLM journey, or our generative AI journey, using some of these technologies. But along the way, many other capabilities in the open source community started to pop up, specifically these smaller models.
And these smaller models allowed us to do things like fine-tuning and actually adapting some of these models to our enterprise data and our enterprise requirements. And so that really started to take off in the industry. And we started to see different pros and cons associated with each of these approaches.
When it came to large monolithic models, when we actually started to put them into practice in enterprise applications, one of the reasons many of us leaned into them is the broad capabilities that the likes of OpenAI brought, and also the ease of integrating OpenAI into the actual platform itself.
It's super easy to manage, and it's also trained on the internet's data, so it can handle a lot of different things. But when it came to actual enterprise applications, enterprises have unique requirements for delivering on the use cases and challenges they're trying to solve.
For example, enterprises have unique, proprietary data that they oftentimes, in fact most of the time, segregate from the internet. And oftentimes they have spent the last 10 years trying to aggregate that data into the likes of data lakes and other kinds of centralized capabilities.
And so now how do you actually transform that into AI capabilities that you can leverage? That was very difficult when it came to large monolithic models. The other challenge we saw is that many started to become very concerned about security when it came to OpenAI. They wanted to preserve data privacy.
They wanted to also own and use the model as a differentiator for themselves. And then, as you start to see more and more adoption, the cost just starts to skyrocket. OpenAI and a lot of these closed-source models charge a per-token rate. And so as you start to use LLMs more and more heavily, the cost just keeps going up and up.
And it's really, really hard to control. On the other hand, when it came to adopting the smaller expert models, or the smaller open-source models like Llama 3 8B that we'll talk about later, we were able to address some of the enterprise accuracy concerns by pre-training and fine-tuning these models to adapt them to enterprise requirements.
We were doing this at a smaller scale to address the specific capabilities and tasks we needed to solve for in the enterprise; these models weren't trying to handle a broad set of tasks or carry a broad set of general knowledge. Thus, manageability became a little bit challenging, because now we have all of these micro models that we have to orchestrate and have work together.
But we were able to solve for some things like security, model ownership, data privacy, and data ownership. But again, because we had so many of these and we had to fine-tune each one, the cost also became very challenging. So, what are we trying to solve for through Samba-1?
We're trying to bring the best of both of these paradigms together, to deliver the capabilities of each in a very simple way. And the way that we actually deliver this is through four core capabilities. First of all, we take all of those expert models behind the scenes. So, let's just say we fine-tuned a model for our legal purposes.
We fine-tuned a model for our HR purposes. We fine-tuned a model for coding capabilities. Each of those has different groups, tasks, and use cases that are going to consume it. But we want to really ease the experience of integrating those into the application. So, we put all of those behind a secure single endpoint.
And so, you only have to interface with one endpoint to gain access to all these various models. Now, we need to determine how we're going to actually use or consume those various models behind the scenes. So, now we need capabilities around orchestration. So, one of the other capabilities we're delivering as a part of this is routing.
So, we're delivering the ability to determine, based off of an incoming prompt, what the best-suited expert behind the scenes is to solve that prompt. And we're doing so through a router. We're also bringing the means of dynamically fine-tuning. Every expert is going to have a different cadence at which fine-tuning makes sense.
So, maybe your finance model gets adjusted annually when your policies are updated. But maybe your coding model, because you're pushing code so regularly, needs to get updated on a quarterly basis. And so, you want to actually be able to schedule your fine-tuning jobs and adjust and swap these models at the rate at which it makes sense to actually retrain these models.
And lastly, you have a bunch of models behind the scenes. Not every application or group is going to, or should, be able to access each of those models. So, now you need to figure out a way to manage the access controls for these. So, what we're also delivering as a part of this capability is model-level RBAC.
So, you can actually determine this application or this person or this group of people should be able to access this set of models. And it allows you a ton of efficiency from a computation and management operation standpoint with the security and fine-grained control that you need to actually manage access to these different models and data underlying these models.
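To make the single-endpoint idea concrete, here is a deliberately toy sketch of routing plus model-level access control. Everything in it, the expert names, the ACL table, and the keyword-based route function, is hypothetical and purely illustrative; Samba-1's actual router is a learned component behind the platform's API, not keyword matching. The point is only that the caller talks to one endpoint, and expert selection and permissions are resolved behind it.

```python
# Conceptual sketch only: one endpoint, many experts, with model-level access control.
# All names and the routing logic here are hypothetical, not SambaNova's actual API.

EXPERTS = {"legal": "legal-expert-v1", "hr": "hr-expert-v1", "coding": "code-expert-v1"}

# Model-level access control: which callers may reach which experts.
ACL = {"legal-app": {"legal"}, "eng-team": {"coding", "hr"}}

def route(prompt: str) -> str:
    """Toy router: in Samba-1 this is a learned router, not keyword matching."""
    lowered = prompt.lower()
    if "contract" in lowered or "clause" in lowered:
        return "legal"
    if "vacation" in lowered or "policy" in lowered:
        return "hr"
    return "coding"

def call_single_endpoint(caller: str, prompt: str) -> str:
    expert = route(prompt)
    if expert not in ACL.get(caller, set()):
        raise PermissionError(f"{caller} may not access the {expert} expert")
    # One endpoint, many experts: the selected expert is just a parameter.
    return f"[{EXPERTS[expert]}] response to: {prompt}"

print(call_single_endpoint("eng-team", "Write a function to parse JSON"))
```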
So, within Samba-1, we have two ways of delivering this. We have something called a flexible CoE that allows you to determine exactly what models lie under the hood. And we also have a pre-composed version of Samba-1, our composition of experts. So, our pre-configured, or pre-composed, version of this model has 92 underlying experts.
And when I say experts, I'm really referring to a specific model that can bring different capabilities associated with it. And within those 92 experts, we have a broad range of languages that are covered within those models, a broad range of domains, and a diverse set of tasks that are very, very relevant to the enterprise.
All of these models are supported by seven different foundation model architectures, including Llama 2, Llama 3, Mistral, Falcon, and BLOOM, and even some multimodal capabilities such as LLaVA, CLIP, and DePlot, which are starting to support some of the multimodal trends that are coming. And of these 92 models, we are actually partnering with organizations to create and contribute models back to the open source community.
So, out of the 92 experts, 12 of them are actually ones that we've either developed ourselves or co-developed with organizations out there, including some of the language capabilities that we've delivered, such as models that can support languages like Thai, Japanese, or Hungarian. And through that, we've developed a lot of experience in how to actually adapt models to different languages.
We've also created a model for text-to-SQL and delivered really, really good results with it. And lastly, we have contributed back to BloomChat, the second largest open source model, which also brings a lot of multilingual capabilities. So, organizations, as they're constructing these compositions of experts, can essentially add as many expert models as they need.
As I mentioned before, while we have that pre-configured composition, this is really intended for you to be able to construct exactly what you need in terms of models under the hood. So, you can add as many as you need. So, our end goal with this is to really be able to bring the capabilities that enterprises need to handle the diverse set of use cases and capabilities they need to solve their problems.
And so, one of the things that we've created along the way to measure ourselves against this is an enterprise-grade AI benchmarking set. And this is really tailored to understand our capabilities against the best-in-industry models across the enterprise-specific tasks and domains that are needed: things like information extraction, which is where many, many enterprises are starting, but also a broad set of capabilities like text-to-SQL, coding, and function calling. And we're measuring ourselves against GPT-3.5 Turbo and GPT-4.
And what we can see is that, along the way, we're meeting or exceeding the capabilities that OpenAI is bringing. But alongside the capabilities, you also need to orchestrate these models. I talked a little bit about this before, but one of the deliverables of Samba-1 from a product standpoint is routing capabilities, so that you can actually take a prompt, determine what the best-suited expert is, and then route to that.
There's also scenarios where you may need to do something outside of routing. You may actually just want to directly call a specific model. Or, in the case that is popping up really, really regularly now with agentic AI, you may need to do model chaining. So that's another capability that we're bringing as a part of the product suite for Samba-1.
And so how are we uniquely set up to deliver the composition of experts capability and also the really fast inference speed that we're going to see shortly? As I mentioned earlier, we are a full-stack AI platform, but we also build our own chip. We call our chip an RDU, or Reconfigurable Dataflow Unit.
So instead of a GPU, we refer to our chip as an RDU. And our current version of that chip is called SN40L. And what is unique about SN40L is actually the memory structure of this chip. So our chip supports a three-tiered memory architecture. So we have our on-chip memory, and then we have our high bandwidth memory, and then we have a huge memory capacity in DDR.
And this really allows us to store a ton of models. And have those models be swapped in and out of various tiers within the memory to achieve really strong performance and also really, really efficient compute utilization. So what does this look like? As I mentioned before, we can store up to five trillion parameters on DDR.
That's like storing three to four OpenAI-scale models on a single chip. And then, as we need to actually execute those models, they can move up the memory stack. And this allows us to, one, take into consideration what models need to be used at what time.
And then, two, do so in a really performant way because the network connectivity between these three layers is really tight. So just to kind of go into some of the specs, our on-chip SRAM has four gigs, our high bandwidth memory has 512 gigs, and our DDR has up to six terabytes.
So lots of memory to work with. And when we think about how this compares to what you experience in the GPU world: if you want to host a really large number of models on a GPU, you're basically limited to the memory that the GPU has on board.
But with us, because we have the DDR component, we're able to store a huge number of models in a single system and keep it coupled with the other memory tiers. With GPUs, if you wanted to host 500 different models, you're going to have to spread those models across various systems.
And you're going to have to basically call various systems to actually access each of those models. With us, it's going to be one underlying system because we're able to store, again, up to five trillion parameters. So I'm going to hand it over to Pedro. What we're going to do next is actually get the chance to get hands-on.
Thanks, Rochelle, for the presentation. So what we will be doing next is a demo of our Llama 3 endpoint on Samba-1 Turbo. And after this, we'll move to the hands-on portion of the workshop. So if you want to try out our Llama 3 endpoint, you can go to our website, sambanova.ai.
And then click on Samba-1 Turbo. So this is where you can access our chat interface. And here you have options to select various models: Llama 3 8B and 70B, our CoE, Mistral, and even some of our in-house models that we've trained ourselves. So yeah, we'll do a demo of Llama 3 8B, and I'll ask it the following question.
So, you know, create a 3-day workout schedule for an intermediate fitness level. Let me actually redo it again. You can see it gets an instant response. And then for the performance metrics, you can see the insights here. So a few things to note. Of course, we have a pretty high throughput, which is 1,000 tokens per second.
But that's not the end of the story. We also have a pretty small time to first token, which is basically the input inference time, of 0.09 seconds. And we also have a pretty small end-to-end total inference time of 0.65 seconds. So with our full-stack platform, we're able to achieve high throughput and very small inference time.
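For intuition on how those three numbers relate, here is a rough calculation. It assumes the throughput figure is measured over the generation phase after the first token, which is my assumption rather than something stated in the demo.

```python
# Rough relationship between the three numbers on the insights panel, assuming
# the throughput is measured over the generation phase only (after the first token).
ttft_s = 0.09        # time to first token
total_s = 0.65       # end-to-end inference time
throughput = 1000    # tokens per second

approx_output_tokens = (total_s - ttft_s) * throughput
print(f"~{approx_output_tokens:.0f} output tokens generated")  # roughly 560 tokens
```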
And if you just want to do a comparison, let's say, with ChatGPT, for instance, if you ask the same question, just copy it. Yeah, you can clearly see the difference in speed. And the other cool thing which we built, so we have this real-time option where, as you write your prompt, you can instantly see the model's response.
And as you change the prompt, you can also see the change in the response. So let's say, I don't know, hi, I want to write an email about blah, blah, blah, right? So the point here is that, you know, with this real-time option, you know, you can do real-time chatting.
And this can be helpful, for instance, if you're drafting emails or even if you want to do some real-time prompt engineering, let's say. So yeah, that's it for Samba-1 Turbo. I can give maybe a few minutes for folks to try it out before we move on to the hands-on.
So, again, you go to our website, sambanova.ai, and then you click on Samba-1 Turbo. You can do it from your laptop or from your cell phone. Yes? Another question: can we tweak the generation parameters? Yes, yes, yeah. Let me see how to do it on this UI.
Okay, I think from this UI, it seems it is fixed. But in the hands-on, once you're calling our endpoint with the API key, we will be changing some of the configs as well. Yeah, any other questions? Yeah, were you able to try it out? Okay, great. So I think we can move on to the hands-on portion.
So what I'll do first is just introduce what I'll be talking about in this part, and then we'll dive in to the hands-on. So, yeah, we prepared two exercises for you today. So the first one is a basic example to get you started. So here, I'll be showing how you can load environment variables, set up the SambaNova API key, initialize the LLM, and do a simple inference call in Python.
And the second example is a more practical one. We will be building and deploying a Q&A system with RAG for enterprise search using our platform. And we will also be using other libraries and packages like LangChain, various data loaders, the E5-large-v2 embedding model, the ChromaDB vector store, and, of course, the Llama 3 endpoint, which runs at a speed of 1,000 tokens per second.
So, yeah, let's start with the basic example. And if you want to follow along, you can go to Google, search for AI Starter Kit SambaNova, and then click on this link, or you can just type this URL. So this is our AI Starter Kit repo; we have a collection of open source examples of gen AI apps.
And, yeah, once you get there, you can go to workshops, AI engineer 2024, basic examples. So I'll go over the readme and then do a live demo of this exercise, and then I'll give you some time to try it out. So, yeah, first, you'll need to clone this repo here.
And then you'll need to create a .env file in the repo root directory. This is where we will be specifying the SambaStudio API key. Yeah, so let me show you how this is done. So I've already cloned the repo.
And then, at this level, this is where you will need to create the .env file. So, in this case, you'll have to do touch .env. And then you can add this information here. So for the first hands-on, that's the only thing that you'll need.
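For reference, the .env file ends up looking something like the sketch below. The variable names here are illustrative placeholders, not necessarily the exact names the starter kit expects; use the names from the repo's readme and the values shared in Discord.

```
# .env -- placeholder names and values only; copy the real ones from the readme and Discord
SAMBASTUDIO_BASE_URL="https://<your-sambastudio-host>"
SAMBASTUDIO_PROJECT_ID="<project-id>"
SAMBASTUDIO_ENDPOINT_ID="<endpoint-id>"
SAMBASTUDIO_API_KEY="<api-key>"
```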
And you can access or copy those from our Discord channel. So, yeah, if you go to Discord events, yeah, so you can copy either of these keys. So we have two dedicated endpoints for this workshop. All right, and once we finish setting up the .env, the third step is basically, you know, installing the packages.
So for this one, you can do it either with Conda or with a Python virtual environment. First, we will go to the basic examples folder, so just cd into the repo path. Then you can create a Conda environment; I would recommend using Python 3.10. And then you activate your Conda environment.
And here we name it basic_x. And then you just install the requirements with pip install -r requirements.txt. And then you can use this line to link the kernel to your notebook. Okay. So if I go to my terminal, we'll go to workshop, AI engineer, basic examples.
Okay, so this is where you can create the Conda environment. I've already done it beforehand, so I'm just going to activate the environment. And this is the requirements file; we only have a few packages that are needed. And that's it for the installation. Once this is done, you should be able to open the notebook. Again, you can do it from the terminal, which will route you to a browser, or through VS Code. If you want to do it through the terminal, you just write jupyter notebook and then the name of the notebook.
So it's going to be the example-with-SambaStudio notebook (.ipynb). I'm actually going to do the demo via VS Code, just so I can show you each step. So I already have this set up. Okay, so we are in the basic examples folder and then the example with SambaStudio.
And again, if you're doing it with VS Code, just make sure that you have the kernel set up. So this is a pretty basic script. Let me just restart it. So, yeah, we will be loading the libraries. We actually have a wrapper with LangChain.
So this is where we will be initializing and calling our endpoints. The second step is to load the environment variables. These are actually the values which we added in the .env file. And then we will initialize the LLM. So, yeah, we will be using the SambaStudio wrapper.
And we set the SambaStudio API key. And then we specify the model configs. So I had a question earlier about the model configs; this is where we can set them up and play with them. I think most of you are familiar with these configs, like do_sample.
If you set it to false, this is basically a deterministic output. If you set it to true, then it becomes probabilistic. You can change the temperature and also the max tokens to generate. And also, this is basically our CoE endpoint, right? So we have one endpoint through which you can actually call different models.
And in this case, we set the expert to Meta-Llama-3-8B-Instruct. Okay, so I'll run this cell. And yeah, we now have our model loaded, and we're good to go.
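As a minimal sketch of what that initialization cell can look like, assuming the SambaStudio LLM wrapper from langchain_community and these config key names (the starter kit ships its own wrapper, so the import path and parameter names in the actual notebook may differ):

```python
from dotenv import load_dotenv
from langchain_community.llms.sambanova import SambaStudio  # assumed import path

load_dotenv()  # reads the SambaStudio endpoint / API key variables from the .env file

llm = SambaStudio(
    model_kwargs={
        "do_sample": False,                            # False -> deterministic output
        "temperature": 0.01,
        "max_tokens_to_generate": 512,
        "select_expert": "Meta-Llama-3-8B-Instruct",   # CoE endpoint: pick the expert by name
    }
)
```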
So I'll first show you how you can do an inference call using a simple invoke method in LangChain. Just write llm.invoke and then add your prompt. The prompt here is: what is the capital of France? And what you'll notice is that it gives the right answer, but it also gives you other stuff which you didn't ask for. And this is a common thing with open source models, because when you send a prompt, you need to include the special tags, right?
So in the case of Llama 3, you can get them from the Meta model card. So you'll have to actually add these special tags, or tokens, in the prompt. In particular, you have to let the LLM know that this is the beginning of text, and this is where you will insert the user query, right?
And then let the LLM know where it needs to answer. Once you have the special tags inserted, you should be able to get the right response. So this is the first way to do the inference call. The other way is to do it via LCEL in LangChain.
So basically, you can use LCEL to connect a prompt template with the LLM and an output parser. And LangChain has various templates that you can use. So we are asking the same prompt; the main difference here is that we are adding the country as a placeholder. And then when you prompt the model, you can actually specify the value of this country, right?
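Putting the two approaches together, here is a sketch along those lines, reusing the llm object from the initialization sketch above. The special tokens follow Meta's Llama 3 model card, and the imports assume current langchain_core paths.

```python
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# 1) Direct invoke, wrapping the query in Llama 3's special tokens (per Meta's model card).
query = "What is the capital of France?"
tagged_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    f"{query}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(llm.invoke(tagged_prompt))

# 2) The same question via an LCEL chain: prompt template | LLM | output parser,
#    with the country as a placeholder filled in at invoke time.
template = PromptTemplate.from_template(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is the capital of {country}?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
chain = template | llm | StrOutputParser()
print(chain.invoke({"country": "France"}))
```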
Yeah, so that's it for the basic example. And as I said, this is just to get you started. So we can spend 10 minutes for you to try it out. Me, Varun, and Rochelle will be here to help. And then you can move on to the second exercise.
Yeah, I'll move around in case people have questions. Do you have the right endpoint set? Can you open your .env? Well, actually, it's working, right? I don't know. Can you try the E4, the other one? Oh, yeah. Actually, the one here is outdated; the one in the GitHub readme is outdated.
So, yeah, you need to use the one from Discord. Because this repo is public, everyone can actually see it; sorry about that. You can use either one. You have it working, right? Yeah. And, basically, for this workshop, on the CoE endpoint we've only activated Llama 3.
But if you were to activate the whole CoE behind the same endpoint, then you could just switch between models. And also, are there other benchmarks for Llama 3 in terms of tokens per second? You mean performance? Yeah.
So, I mean, there are the GPU numbers, right, which I think Rochelle presented in the slide deck. So we have, I think, an 8 to 14x speed-up over GPUs. And there are also other startups with high inference speeds, basically like Groq. But one difference we have with Groq is that they run with 576 chips.
We only do it with 16 chips, basically. And we actually preserve the full precision; they do it with reduced precision, you know? It's weird to get a response that quickly, right? Yeah. Were you able to get it working, or are you still trying?
Yeah, that's fine. Let me know if you have questions. Okay, nice. But this is simple if you want to try it out. Okay, so no need for the migration. Which one are you using? Let's try both; try the two that are in here, the P2 and the E4.
Then maybe it's your internet connection, I don't know. No, it's not. Can you restart the notebook? You did that, okay. Let's try the next part. I wonder if it's the authentication; this should be faster. But anyway, let's continue.
I mean, this one should take about three seconds. That's it. But this is the one without the special tags, right? So the response is longer. This one should be quick. Oh, I see. Yeah.
No, it's basically waiting for the whole generation to complete and then outputting the answer, you know? So the same question... Do you have questions? Were you able to try it out? Okay, let me know if you have questions.
Sure, thanks. Yeah, so the first one, that's the prompt template. The same question has been wrapped into the prompt template, and the prompt template is basically a template that uses those special tags. LangChain fills in the specific template, and you need to put the instruction to the model inside that template; that's how the model knows how to respond. If you don't include any of the tags, the model just kind of gives you extra output.
We only have a small amount of time to try this with other prompts, so next we should move to the second exercise, or maybe just give it a little more time. Yeah, so you can get it from the Llama 3 model card: if you go to Google and search for the Meta Llama 3 model card, you'll find it there. Okay, that will bring up the whole thing. I think we can move to the second one.
Yeah, since everyone tried it out. Okay, cool. I think many of you were able to try out this simple exercise, so we'll move now to the second one. As I said earlier, this is going to be a Q&A system with RAG.
And if you want to follow along, again from the same repo, go to workshops, AI engineer 2024, and then the EKR RAG folder. And like the previous exercise, I'll go through the readme, do a live demo of the installation and the run setup, and then give you some time to try it out.
And the app here, we have two versions of it: one with a Jupyter notebook and the other one with Streamlit, which is UI-based. And before I jump into the hands-on, I just wanted to give a brief overview of what RAG is. I'm sure many of you already know this concept, but just for completeness.
So yeah, RAG is a technique that we can use to supplement an LLM with additional information from various sources to improve the model's response. And RAG is very helpful if you want to use an off-the-shelf LLM to ask questions beyond its training data, or if you want to give the LLM access to up-to-date information without retraining it.
Also, you know, RAG can help reduce hallucinations in some context. And a typical RAG workflow consists of the following steps. So we first have document loading and parsing. So this is where we can use a data loader to actually load the data into a digital text that we can edit and format.
And various data loaders are available depending on the extension of the file you're using, such as PDF, text, or PowerPoint. After this, we have a splitting step, where we will be splitting the document into smaller chunks. And the chunk size and the overlap, all of these are hyperparameters.
And the next step is vectorization. This is where we will be using an embedding model like E5-large-v2 to map each chunk to a numerical vector, and we can store the vectors, along with the content and the metadata, in a vector store like FAISS or ChromaDB.
And today we will be using chroma DB, which is open source. And again, the whole goal of this embedding is that it allows us to do like semantic similarity and semantic search. And in the retrieval step, this is where we ask the question, which is going to also be embedded into the vector space.
And then we have a retriever, which is going to retrieve the closest chunk vectors to the query vector according to some similarity metric. And we can also add a re-ranker, which can re-rank the retrieved chunks per relevance and also remove some of the unnecessary chunks. And the last step is basically Q&A generation.
So this is where you provide the LLM with the query and the final retrieved chunks to get the grounded response. And I'd also like to point out that in this exercise, we will be using third-party tools for document loading, splitting, and storage. For the embedding model, you can either run it on CPU or on our hardware, and we'll be doing both to show the differences. For the LLM part, this is going to be done on our hardware.
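Before the live walkthrough, here is a condensed sketch of that whole workflow using generic LangChain components. The starter kit wraps these steps in its own classes (for example document_retrieval.py), so the class names, defaults, and file paths below are assumptions for illustration only; the embeddings here run on CPU via Hugging Face, whereas in the workshop they can also run on SambaNova hardware.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms.sambanova import SambaStudio  # endpoint details come from the .env
from langchain.chains import RetrievalQA

# 0) LLM served on SambaNova hardware, as in the basic example.
llm = SambaStudio(model_kwargs={"select_expert": "Meta-Llama-3-8B-Instruct"})

# 1) Load and parse the PDF (path is illustrative).
docs = PyPDFLoader("data/tmp/sn40l_paper.pdf").load()

# 2) Split into overlapping chunks; chunk size and overlap are hyperparameters.
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=240)
chunks = splitter.split_documents(docs)

# 3) Embed each chunk and index it in a Chroma vector store (CPU embeddings here).
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-large-v2")
vectordb = Chroma.from_documents(chunks, embeddings)

# 4) Retrieve the top-k closest chunks and 5) generate a grounded answer.
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa_chain.invoke({"query": "What is a monolithic model?"}))
```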
So yeah, that's it for the overview, and I think we can move on to the readme. Yeah, so we first clone the repo. If you've done the other exercise, then you don't have to do this step.
So the second thing, after this, is setting up the environment variables. For this one, we will be using the SambaStudio API for the LLM, and we will also be using an embedding API as well. Both are available on Discord. Yeah, let me show you in the terminal.
Yeah, and actually, I would also recommend deactivating your previous environment. Okay, so we go to the EKR RAG folder. Actually, for the .env file, this one you'll have to put at this level, the AI Starter Kit root. Okay, so for this one, we will need the embedding endpoint and API key and also the SambaStudio endpoint and API key.
All right, then we go back to the EKR RAG folder, and then we are ready to go with the installation. So here, we will be needing more packages. First, Tesseract; this is our OCR data extractor. So let's say you're using a Mac, just run brew install tesseract. And this one, you can do outside your local environment.
It should take a few minutes to install. And you also need Poppler; if you don't have it already, you can just do brew install poppler. And then, yeah, we will need to set up our virtual environment. Since this is a more complicated exercise, I would just recommend using the default option.
So the Python environment with Python 3.10. If you don't have Python 3.10, let's say on a Mac, you can install it using this command here, and then you can add the path to your shell config, like .bashrc or .zshrc, using this command here. And then you can just source your shell file.
Okay, and then we will go to the repo if you haven't done this already, and create your Python environment. So, again, if you did this step here, then your laptop should recognize Python 3.10; then it's python3.10 -m venv and the name of the environment. You activate that environment, and then you run the install script.
So this should take, I would say, five minutes if you have good internet. And once this is done, you also need to install ipykernel and link your kernel to your notebook. When we tested this on different laptops, some folks were also sometimes having NLTK and SSL certificate issues, so you might also need to run this script here.
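I can't reproduce the repo's exact script here, but the commonly used workaround for that NLTK/SSL issue on macOS looks like the sketch below; check the repo's readme for the actual script it ships.

```python
# Common workaround for NLTK downloads failing with SSL certificate errors on macOS.
# This is not necessarily the exact script in the repo; it just disables certificate
# verification for the NLTK download and fetches the resources unstructured needs.
import ssl
import nltk

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass  # older Pythons without this attribute already behave this way
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
```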
Yeah, so let's activate the Python environment. And this is the requirements file; as you can see, we have more packages here. And then this is the file which you may need to also run as well. And once this is set up, that's all you need to run.
That's fine, I can do it. Okay, the notebook. And as I said earlier, we have the app in a notebook and in a Streamlit app. For the notebook, again, you can open it from the terminal or from VS Code. So you'll go to the EKR RAG folder, then notebooks, and then the RAG LCEL notebook (.ipynb), and this is going to be our main script, which actually uses functions and models from other files.
In particular, we will be using document_retrieval.py and also some files from the vector DB module, which I'll explain in more detail. All right, so let's go maybe first over the structure of the notebook. We first import the libraries and set the required paths. So you don't have to do anything at this point, but just know that the kit directory is the absolute path for your EKR RAG kit.
And then the repo directory is the absolute path for the AI Starter Kit. So let's run this cell. And then for the document loading and splitting, we added various data loaders, like PyPDF and unstructured. And which data loader you want, you can set up in the config file, which is here.
So yeah, I'm just going to do a test with PyPDF for now, but we can switch to other data loaders afterwards. And for the experiment, I will be using the SN40L paper. This is an arXiv paper which we recently submitted, and it contains information about the stack, the hardware, and our CoE.
I can show you the paper as well. We also have it in GitHub, but you can also upload your own PDF. And you'll have to put it under data, then temp, and this is the paper that I'll be using for the demo. All right, and let's go back to our VS Code.
So this is where we will be using PyPDF to actually load the content into a list. And then we are using the recursive splitter from LangChain. So what I'll do first is go through the notebook, and then we can go into the functions in more detail if you are interested.
Yeah, so let's run this step. Yeah, and in the config, so I set the chunk size to 1200 and then the chunk overlap to 240. But again, you can change those configs. So for this 15-page PDF, we end up getting 89 chunks. The next step is the vectorization and storage.
So this is where we will be using the embedding model to map each chunk to an embedding vector. And as I said earlier, we can actually run the embedding model either on CPU or on RDU; RDU is basically our AI chip. If you want to do it on RDU, then you'll have to go to the config.yaml file and set the type to SambaStudio.
And then batch size: you can have it set to either 1 or 32. So 32 means that we are actually processing 32 chunks at the same time. And this is a standalone endpoint, so the CoE flag is set to false. And I'll show later, if you want to run the embedding endpoint on your laptop, how you can change the configs for that.
All right, and then, yeah, let's run the vectorization. And after this, we're actually storing or indexing the embedding vectors into the ChromaDB vector store. I think this should take around 10 seconds. Let me see what's happening. Yeah, always good to restart the notebooks. I'll just go over the steps again.
Okay. Yeah, so it took about four seconds to embed the whole thing. And this is where we will initialize our QA chain. Again, we have different wrappers and classes, which I can go over in detail afterwards. But for now, let's just execute the cell. Yeah, now you're ready to go.
Ask a question. And then what happens is right through this QA chain, the question gets embedded to the vector space. We retrieve the top K chunks. So in this experiment, yeah, we have this set to three since I won't be using a Reranker. But I can also show you how to use the Reranker.
And then the question and the retrieved chunks are provided as context for the LLM to get the answer. So, yeah: what is a monolithic model? And you can see the response is instantaneous. So that's it for this part. Let's maybe try asking it a slightly more complicated question.
So if I open the PDF again. I think, yeah, there was a table in the PDF. Yeah, like this is a table, you know, showing operation intensity versus fusion level. So, yeah, let's see if PyPDF is able to retrieve some of the information from this table here. So I already have the questions prepared.
Let's try asking those. Yeah, so we got the response, right? So 4.10.4, basically. And again, if you end up having more complicated tables in your PDF, in that case I would recommend switching to the unstructured data loader. And for this, all you have to do is go to the config file and change PyPDF to unstructured.
So, yeah, that's it for the second exercise. I can go into more details about each function if you're interested, or have you try it out first, and then you can maybe come back and then go over the functions. Yeah, so do you want to maybe try it out first, I guess?
Okay, great. Thank you.