
So I'm curious. You said you're a fresh grad. What does that mean? I graduated from NYU in May. Ah, okay, okay. Awesome. Congratulations, first of all. Thank you. Yeah, so for AI in action, it's normally just like anything that's interesting. So if I were you, I wouldn't feel too much pressure to perform.
As long as you're talking about interesting stuff that's in the AI space, I'm sure we'll have a good time. Do you find dynamic constitution through RLIF interesting? Yes. So the transition from, like, RLHF to, I guess, RLAIF, to now what they're calling RLVR, which are all separate things.
But like that's like the buzzword of the day or the week has been interesting to me. And verifiable rewards is now kind of like taking on this thing where people are contributing to like a community effort for verifiable rewards. That's super, that whole space is super interesting to me.
So I think this should be a good talk or maybe like a good presentation is like kind of like my thoughts off the bat. But I have not dug deep enough into, or I've done the research on reinforcement learning and like how you can use verifiable rewards to make models better for like low cost.
But I have not done it myself. So that's something I haven't really, really explored the way I need to, but it is, I do find it very, very interesting, that whole space. Yeah. Long story short, this is my attempt to create my own Jarvis from the Iron Man movies.
and I started simple. I just wanted like, okay. Do we have to wait or like, should I just begin? I get, I mean, being that it's recording, we probably could begin. We normally have a few more people come in. Yeah. We have some people doing it now. Give it to like 3:05, like two more minutes and then we can get started.
Yeah. But that's my take on it. Like, I was applying to jobs and the process got a bit too difficult, so I decided that I'd much rather just use my skills. Why do I need to grind LeetCode anymore? So I just started simple. I wanted to have dynamic gaming banter, more or less having in-game characters talk to me instead of just to the game, or I interact with the game, more or less.
There was this idea in the Latent Space Discord where people say that, basically, we don't have dynamic NPCs, right? We have pre-scripted NPCs, which have a set amount of responses. So I just wanted to start simple, and that's where I began.
Yeah, that's pretty cool. That's a cool idea. I like the idea already. Zach says, I'm doing Claude as coach while we're on a couch-to-5K program and the training of the mind and body. Yeah, that's pretty cool. Yeah, I haven't really gone the Jarvis route, but, like, it's not that... I've got my own version of this that I'm exploring.
Like, this is an interesting space because like the models have gotten cheap enough and powerful enough that like wrapping something around it to help, like, you know, whether it's, I need to figure some stuff out and I want to like bounce some ideas against these different personas and have this kind of conversational space or it's like, I just need someone to be like, yeah, it's fine that you're blocked.
Go walk for five minutes that like part of my brain that I've like, I've had in the past, but it's like easy to lose track of and then you just like stall out on stuff when you could easily just go touch grass, come back and then get back to it.
And I need that little external reminder system and like, I'm just using Claude for it because like 20 bucks a month in a web chat with the right project context. Like it works, but I've been wanting to like evolve into something a little bit more like structured. So I'm, I'm curious what you've got to share with the group.
Yeah, it's 3:05 now, so we probably can go ahead, and it's recording, so we could probably go ahead and get started. I'm curious about the voice part of the Jarvis thing, but yeah, I think we're clear to get started now. Yes, just a minute. So basically, as I said, my idea began with something that I wanted to email Gabe Newell, the head of Valve (Steam), about. I just wanted to start with gaming banter, and it started with Banter Packs: basically, integrate an LLM and get to a system where you have a prompt-engineered character talking back to you through the game. I had everything mapped out. I started figuring out the in-game mechanics, as in what DLLs needed to be tracked so that the local LLM could respond. Initially, it was just text. Once I had the MVP ready for text, I tried to overlay it over the most recent Alien game that was launched. I took it and pushed it over the Alien game's trailer, and it worked beautifully. And then I injected it into the game. I could figure out what DLLs I needed to track to just see what events were happening. And then, once the characters started moving, or their lips started moving, my LLM started responding through text.
Now, I used Whisper plus Piper for the entire conversation to be tracked. So whenever I responded, I muted the in-game characters and had the system respond to me instead. This is the Banter Packs aspect of it. I had a real-time streaming overlay through OBS. OBS tracked my entire audio in and out and offered it to the local LLM through Ollama.
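As a rough sketch of the Whisper-plus-Piper-plus-Ollama loop being described (not the actual Banter Packs code), assuming the openai-whisper and ollama Python packages and a Piper voice model on disk, with illustrative names throughout:

```python
# Rough sketch of the Whisper -> local LLM -> Piper loop described above.
# Assumes `pip install openai-whisper ollama` and a Piper binary plus a voice
# model on disk. Paths, model tags, and the persona prompt are illustrative,
# not the actual Banter Packs code.
import subprocess
import whisper
import ollama

stt = whisper.load_model("base")  # speech-to-text model
PERSONA = "You are Arthur Morgan. Reply with one short, in-character line."

def banter_turn(wav_path: str) -> str:
    # 1. Transcribe what was just captured from the OBS audio feed.
    heard = stt.transcribe(wav_path)["text"]

    # 2. Ask the local model served by Ollama for an in-character reply.
    reply = ollama.chat(
        model="gemma3:1b",
        messages=[{"role": "system", "content": PERSONA},
                  {"role": "user", "content": heard}],
    )["message"]["content"]

    # 3. Speak the reply with Piper (text on stdin, wav file out).
    subprocess.run(
        ["piper", "--model", "en_US-amy-medium.onnx", "--output_file", "reply.wav"],
        input=reply.encode(), check=True,
    )
    return reply
```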
That was where it initially started. Now, here is where the problems began. The first problem, while it was still Banter Packs, is that there's a lot of latency. Whenever a local LLM is responding, your TTFT (time to first token) goes to about 2.5 seconds, and that's not good. If you're playing games, you need something that can respond instantly.
So efficiency was a big problem. And sometimes the LLMs either hallucinated or lost track of the personality they were supposed to emulate, so I got generic responses instead of specific responses. If I was playing Red Dead Redemption 2, instead of Arthur, I'd get some version of a Marvel Rivals character responding to me.
That was not good enough for me. So I started digging deeper into it, and the first thing was, I moved on to Banter Hearts. I took out the LLM engine and I created an entire MLOps platform. In that MLOps platform, I started at the most basic of layers.
This is where the research got interesting. The initial findings told me that if I used partial offload, I could get better TTFT for tinier prompts, and using full offload, I could get better TTFT and context loading, but only if I kept it locked. The quantization basically depended on the game, so it would understand what I was talking about and just load data based on that. And optimizing the throughput is what gave me the 10x improvements. All of these reports are in the link that I have put into the chat. And hold on, let me go further. So all of this, the news protocol, is how I publicly publish what is happening in the system, because Banter Packs and Banter Hearts are both private repositories. The news protocol takes data from Banter Packs and Banter Hearts and publishes it into Banter Blogs, which is the public deployment, a public site that tracks what progress I have made so far, everything from the benchmarks to the development process.
And in the research phase, this is what I found. With the quantization for Llama 3 1B, I had around 76.59 tokens per second, but with Gemma 3 I could get it up to 102. The optimal configuration came out to be a full offload with a context limit of 4096 tokens and a temperature of 0.4.
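A minimal sketch of the kind of offload/context/temperature sweep behind those numbers, using the Ollama Python client; the model tag and option values are examples, not the exact TR-108 harness:

```python
# Minimal sketch of an offload/context/temperature sweep via the Ollama client.
# Option values and the model tag are examples, not the exact TR-108 harness.
import time
import ollama

CONFIGS = [
    {"num_gpu": 999, "num_ctx": 4096, "temperature": 0.4},  # full offload
    {"num_gpu": 20,  "num_ctx": 2048, "temperature": 0.4},  # partial offload
]

def measure(model: str, options: dict, prompt: str = "Explain CUDA streams briefly."):
    start = time.perf_counter()
    ttft, chunks = None, 0
    for _ in ollama.chat(model=model,
                         messages=[{"role": "user", "content": prompt}],
                         options=options, stream=True):
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first streamed chunk
        chunks += 1
    total = time.perf_counter() - start
    # Streamed chunks are used as a rough proxy for tokens here.
    return {"ttft_s": round(ttft or total, 3), "tok_per_s": round(chunks / total, 2)}

for cfg in CONFIGS:
    print(cfg, measure("gemma3:1b", cfg))
```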
So that was a general victory there. Next, well, this is where it derailed completely, because I went off to a competition. There was a hackathon on October 4th, and there were a lot of people there, like an OpenAI startup-division head, a lot of founders. And they wanted us to create some product out of an agentic workflow. We were assigned some amount of free API calls. And I sat in that room working with around 150 other people, and I saw hundreds of dollars just flying out for MVPs which may or may not ever be used further. All of these unique ideas were just flying out, and they were being used in any possible way.
Now, this is where I also found out that hundreds of GBs of data were also being read. So people were already incurring hundreds of dollars in costs in terms of API calls, and they were also losing data; their data was being read wherever it was going. Nobody had any idea where it was passing off to. I was talking to people, and yeah, that happened. And I came back from that hackathon with an idea: I wanted to implement my own local agents. And that's where the Jarvis idea started, because I already had a working voice system.
So I could just deploy my agents by having a conversation with my local orchestrator, and I just started working on the agent workflows. The initial finding was that basically whatever I found in TR-108 was irrelevant in TR-109, because any sort of TTFT improvement I had failed in multi-turn tasks. To get around that, I had to adjust things for the GPU: I had to take the full offload and reduce it to a partial offload, and the context size had to be reduced and the temperature adjusted. Upon doing that, my agents got faster and had a slightly better throughput improvement, but the task got done with 100% accuracy. So that worked out for me. The task was basically for benchmarking purposes: take whatever Triton, CUDA, and Nsight benchmarks I had run, ingest all of that data, draw inferences from them, and publish a report.
The template of the report is on my GitHub, it's in the chat, so you'll find the reports TR-108 and TR-109 there. And the total process ran about three hours or so. And then I started building further down the line into concurrency. Single agents deployed locally: 100%. So now, how much could I push the system?
So I went into, I needed multi-turn agents working concurrently and together on a task. The task remains the same, but one agent was the ingester and another was the analyzer, and they were coordinating. And I found out that with baseline Ollama models deployed, I could get mixed accuracy. The speedup was tangible, but efficiency was mixed. Heterogeneous meaning, if I had a specific tuning at the CUDA core level, heterogeneous adjustments for a specific task, so if the ingester has a specific heat, sorry, a kernel adjustment, it wouldn't work if the agents weren't homogeneous. The LLM models offered, sorry, the GPU did not offer full offload, and that's why we needed homogeneous. And I found out that I could have a 2x speedup with two agents deployed simultaneously, with a 99.3% efficiency boost.
This gigabyte of RAM is crazy. You've mentioned a couple of times that you posted a link in the chat.
You're not talking about the, the, the zoom chat, right? It is in the zoom chat. Yes. Is it Gaben.ai? Is that what Zach posted? No. Hold on. I'll have to check, but I'll, but I'll just finish this. Yeah, yeah. So once I figured this out, then this is where like the findings were pushed out.
If I had heterogeneous systems, there was a lot of memory contention. So if I wanted a full offload, the idea was that I had to avoid any sort of memory contention, and that's where I had to select what sort of partial offload was needed. I'm sorry for this, but I was playing around with Claude and this is what it generated; I apologize. But the finding was that a GPU offload of 80 layers was the minimum threshold for stable two-agent execution. You could execute more, but it depends on what your VRAM is and what sort of quantization you use. And of course, depending on the context size, there's no linear graph.
The total findings are very interesting. When I ran both concurrently, single agents could have a specific TTFT and a specific token output, but you realize that you can have a full offload only if you adjust the context size to a 4K-token minimum. For a concurrent multi-agent system you had to have that, and only then do you end up with proper efficiency there. And this is not the limit. Beyond two agents, it depends, as I said, on your VRAM. If you have a proper system with around 50 GB of VRAM or whatever, you'll be able to run at least 10 agents concurrently without any contention.
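A rough sketch of the ingester/analyzer concurrency experiment, with two agents hitting the same local model at once via Ollama's AsyncClient; the 80-layer offload mirrors the stability threshold mentioned above, everything else (prompts, model tag, option values) is illustrative:

```python
# Rough sketch of two agents sharing one GPU through Ollama's AsyncClient.
# The 80-layer offload mirrors the stability threshold mentioned above;
# prompts, model tag, and other values are illustrative.
import asyncio
from ollama import AsyncClient

AGENT_OPTS = {"num_gpu": 80, "num_ctx": 4096, "temperature": 0.2}

async def run_agent(name: str, prompt: str) -> str:
    resp = await AsyncClient().chat(
        model="gemma3:1b",
        messages=[{"role": "user", "content": prompt}],
        options=AGENT_OPTS,
    )
    return f"[{name}] {resp['message']['content'][:120]}..."

async def main():
    # Both agents run concurrently; contention shows up as TTFT regressions
    # when num_gpu / num_ctx are pushed too high for the available VRAM.
    results = await asyncio.gather(
        run_agent("ingester", "Extract the key metrics from this benchmark log: ..."),
        run_agent("analyzer", "Draft a TR-style findings section from those metrics."),
    )
    print("\n".join(results))

asyncio.run(main())
```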
And fully local, no need for internet, as long as you have all the data on the system. So what we found out, basically: the total tokens per second was 102 for Gemma 3. Workflow optimization was pretty much domain-specific, so if your task changes, your optimizations will change. For that, I have been working on an optimization agent which detects the task, and that's where the entire RLAIF pipeline that I was talking about comes in. And multi-agent concurrency could achieve a theoretical 2x boost. Thank you. That's about Chimera. I could discuss the RLAIF pipeline further; hold on.
But before we go to that, I'll go through the reports once. So first, I was using Llama and Gemma, and I used multiple quantizations for them, everything, I think, from the 270 million to the 1 billion to the 3 billion parameter variants. And the general finding was basically that you could get a 1.165 ms minimum TTFT with memory usage at 5.3 GB.
And these are the parameters. That's basically Llama versus Gemma. I was going to test more, but I found RLAIF more interesting. The next phase of this is basically, what I've done is I'm using API calls from Claude, API calls from Gemini, and API calls from OpenAI to optimize the performance of my local Gemma. So if I have a task, Gemma produces an output, and the three big names debate over the quality of the output and whether it's good enough. If it's not, then they retweak the output, and this retweaking process is fed into an RLAIF database, so that when I further train the local Gemma model and I have to use RLAIF, this is the database that will help design the PPO or the DPO, though I haven't quite decided how to do that.
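A hedged sketch of that debate-and-store loop: the local model drafts, frontier "judges" approve or revise, and the (draft, revision) pair goes into a JSONL file for a later DPO/PPO pass. The judge callables and the schema are placeholders for the real Claude/Gemini/OpenAI calls, not the actual pipeline:

```python
# Hedged sketch of the debate-and-store loop: local model drafts, judges
# approve or revise, and the pair is logged for later preference training.
# Judge callables stand in for the real Claude/Gemini/OpenAI API calls.
import json
import ollama

def local_answer(task: str) -> str:
    resp = ollama.chat(model="gemma3:1b",
                       messages=[{"role": "user", "content": task}])
    return resp["message"]["content"]

def debate_and_log(task: str, judges: list, path: str = "rlaif_dataset.jsonl") -> dict:
    draft = local_answer(task)
    # Each judge returns {"ok": bool, "revision": str}.
    verdicts = [judge(task, draft) for judge in judges]
    if all(v["ok"] for v in verdicts):
        record = {"prompt": task, "chosen": draft, "rejected": None}
    else:
        # Keep the first proposed revision as "chosen" and the local draft as
        # "rejected", which is the shape DPO-style preference training expects.
        best = next(v["revision"] for v in verdicts if not v["ok"])
        record = {"prompt": task, "chosen": best, "rejected": draft}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```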
This report, like all of these reports, is about 100 pages, so if you want to deep dive, it's thoroughly available there. And I'm open to questions. Yeah. Do you have enough GPU power to share your screen during Zoom and show us this in action? That's my first question.
Can like you show us? No, I have, I have hit contention issues. Like I, I, if I am training, if I am running it out, what ends up happening is it straight up crashes. Like if I have a zoom working, hold on, let me check. Okay. Just a sec.
Could you fit it in a Colab, maybe? It could work, I don't know. Colab is on the cloud, and I have run out of free credits, basically. Okay, okay. I've got to look up the VRAM on that. Yeah, this is really cool.
I think just from a research perspective, it's really cool. We had some questions early in the chat. If you said you wanted to field questions, but yeah, really, really interesting research that you've done here. I like it because we were curious what machine specs you need. You were talking about running Ollama and doing gaming at the same time.
And I guess that you need at least a, or in your case, you, you've used a 4080 to do this, to pull this off. Um, and, uh, Cam L says, which params got you 10 X throughput? I think this is on an earlier slide. You were talking about 10.
Yes. So for that, hold on, I'll go sequentially. No, not here. Hold on. So, reports. Ollama. I apologize, I'm not used to this. So for the deep dive, these are the parameters where I found, my initial TTFT was 5.51 seconds. Sorry, my initial throughput was around 7.9 tokens per second aggregate, and with the 4-bit quantization I had an increase to 78, and that was the 10x. These are the tunings that I used; it's all in the reports. Okay. And then one of the other things, you mentioned a couple of times posting links.
I'm assuming this GitHub, uh, repository is, is what you were talking about when you were talking about posting links. Uh, the link was for this report, like, um, this entire doc folder. This is what is being posted in the chat. So anything that you want to test, play it out.
It's right there. Like any questions you have related to it. So, so in theory, I could go to this GitHub repository, follow your steps and reproduce this. Like, let's say I have a, a, a 4080 at home and I can reproduce these steps and also have like a Jarvis on my computer is what you're saying.
Yes. You'll need my Chimera optimization, you'll need the Banterhearts repository, but you can replicate all of this end to end. Got it. So you'll need this repo. It's not public, it's private. Oh, Banterhearts. Okay, okay, I see. This repo and this repo are both private. The public ones are Banterblogs and the Chimera multi-agent repo.
And that's primarily research, not necessarily the agent itself. Banterhearts has the agents themselves. Yeah. So hold on, these are the agents that I ran, the entire reports, end-to-end performance. Then there's multi-agent, the same comprehensive test results, phases. I have been selective in uploading all the research data, because that was a lot of stuff.
This entire repo is about 200K lines, maybe more; I'll have to check. Do you have a release plan for some of the... I got here a little late, so I don't have full context, but it looked like you did some CUDA optimizations and kernel rewrites and stuff. Do you have a release plan? Yeah. Or, you did that?
I did, I did do optimization work there, which is in the kernel deep dives. Just a sec, going back to that. Okay, hold on, I'll have it on here. So this is what we found out.
I could not do Q8_0, and that's why, well, this is testing around, this is just playing around. But for optimization: Torch optimization had a baseline, this is for the attention kernels, and for matmul this is what I found out. And after kernel fusion, my actual latency went down around 50x, from 6.9 to 0.07, for my orchestrator.
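The kernels in the reports are custom, but as an illustrative stand-in for the kind of fusion being described, here is a tiny attention-plus-matmul block timed eager versus torch.compile; shapes, module names, and timings are made up, not the TR numbers:

```python
# Illustrative stand-in for the kind of kernel fusion described: a small
# attention + matmul block timed eager versus torch.compile. Not the actual
# orchestrator kernels; shapes and timings are made up.
import time
import torch
import torch.nn.functional as F

class TinyOrchestratorBlock(torch.nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, dim * 3)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.scaled_dot_product_attention(q, k, v)  # fused attention kernel
        return self.out(torch.relu(attn))               # matmul + activation

def bench(model, x, iters: int = 50) -> float:
    for _ in range(5):           # warmup (triggers compilation for the compiled model)
        model(x)
    start = time.perf_counter()  # on GPU, add torch.cuda.synchronize() around timing
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters

x = torch.randn(8, 128, 256)
eager = TinyOrchestratorBlock()
fused = torch.compile(eager)     # let Inductor fuse elementwise ops around the matmuls
print("eager s/iter:", bench(eager, x))
print("fused s/iter:", bench(fused, x))
```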
So this is all for my orchestrator: kernel optimization. I cannot have resource contention, right? The orchestrator for the agents needs to respond instantly so that my multi-agents can just go off and do their tasks.
Yeah. I think Yikes was also asking, and this is part of a larger question that I have, but do you have a release plan for the CUDA optimizations that you made? The kernel changes?
I do, but all of this I have done over, like, the past 30 days.
This optimization thing started, I think, yeah, see, just over 15 days ago. So I do have a plan on eventually releasing a one-click optimization, straight up end to end, from downloader to everything. But currently I'm just exploring ideas as to how I can get the entire RLAIF pipeline working, so that you can have everything optimized for your own personal game.
Yeah, that sounds very, very interesting. The possibility of being able to optimize task-specific CUDA kernels, I think, is underexplored. So this looks like a fun spot to be playing around in.
Yeah. Yeah. I'm really curious. Um, like there's, of course the frontier models are like all the, you know, big giant GPU stuff, but I, I'm personally excited about like smaller models that can run on prosumer hardware. Uh, I'm curious why you picked Gemma three and not other smaller models.
I know Gemma 3 came out a little while ago, but in AI terms it's ancient, because it was a while ago. But I'm curious; it looked like you did compare it to a couple of other models, or was it just, like, Gemma 3 was...?
Gemma 3 and Llama. And the thing is, Gemma 3 had a lot of documentation already, actually. People had already performed things and done things with it. So if I do everything with the latest and greatest, I do not have a comparative analysis; I do not have points of reference as to why things are failing. If I don't use models for which I have historical data, it could just be that people already optimized them originally and my system is doing nothing.
If it makes sense to you. So yeah. Yeah. I see on hugging face that there's like 218 fine tunes for Gemma three. Yeah. So like that makes sense why you would be like, all right, this is where a lot of other people have gone. Let me not like blaze a new trail.
That's kind of like, follow what other people have done, and I'll have more to compare against. Hmm. It offers me benchmarks to compare against, whether I'm doing MLOps properly. So once I understand what the ground reality is, what people are doing, and what their understanding is, then I can debug at a level where I have more understanding.
Now, I can, in a nutshell, optimize any sort of frontier model, or the latest models. But again, that would mean it cannot simply be me doing it; it could be the person who launched it, who has already done it. So what am I doing better?
So this is what I'm doing better. Are there any of the newer models that have caught your eye that you're, you're interested in, especially from the Chinese labs that have the smaller models? I had, uh, recently I read this research paper by Samsung. I think it was the 7 billion model that could perform as well as a 40 billion parameter model.
I wanted to test it, but my recent work, like, this is all the bottom-up side. The other optimization technique that I have written a document on is the top-down one. Okay, this could get interesting, hold on. So in this, I am now going for a top-down approach.
Yeah, the other interesting thing potentially to look into, or to follow up on Flo's point, or Zach's point: Gemma actually is a pretty solid pick for this, in my opinion, largely because I think Google has been a lot more oriented around optimizing the Gemma architecture for small models, mobile especially, because they're trying to integrate it directly into the Pixels. So Gemma 3n, I think, would be really interesting, to see if any of these optimizations translate over to the 3n, because if you can get more performance out of a 3n, I think you could do a lot on the edge, which I think this particular architecture would be really, really useful for.
So, yeah, that's what my idea was: basically, all of these models are generalized. So this is where I started the RLIF. If they're generalized, then I can make them special-purpose models. So there was this constitution, as I said; the RLIF thing, that's what it was for.
So yeah, the models are tiny, but they have a lot of hallucinations and the accuracy goes down, because ultimately LLMs are pattern matchers. They will only pattern match as much as they have in memory. So I would much rather replace that with task-specific memory. Like, if you are coding, if you're coding with Claude, you're not going to code in 50 languages. You're going to code in five specific languages, and you need a model that's only specialized for those five specific languages. Mm-hmm. And in those five specific languages,
Codex is very strong at something. Claude is very strong at something. Gemini is very strong at something. So your end to end stack has five languages. It has specializations and it has specific needs. And that's where this entire RLIF pipeline comes in. Your local model is being trained by these big names to be optimized for your task.
The constitution for this local model is stored locally. So none of your personalized, proprietary code is leaked out, because when you are running agents through your entire code bases, running LLMs through your entire code bases, you otherwise have no idea what is being stored where, no idea what sort of learning is happening, and what sort of prompt-injection-based data leakage is happening.
So if you sandbox it onto a local system, you can test end to end and you have no issues. So, based on consensus, and I've also optimized this, my current idea: we could have the RLAIF debate running endlessly. Instead of that, what I ended up doing is weight-adjusted models. So Claude gets a higher weight for coding tasks, Gemini gets a higher weight for life tasks and general-purpose tasks, OpenAI has a higher weight. Based off that, that's how the debate works. And what I've done is I have budgeted those debates, so everything has a limit. Those API calls have a specific cost, and I have capped those API calls at $5. So within $5 they must reach a consensus and generate an output which is, like, good for me, good for this task. And that's it; that's the output that I need. I don't need to go back, and I don't need to check on Codex, I don't need to check on Claude, I don't need to check on Gemini. So, I think today we had that Google launch, I'm not sure, but Google is launching Gemini 3, right? So I can just plug that in right here and use that for my own optimization.
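A rough sketch of the budgeted, weight-adjusted debate just described. The weights, the $5 cap, and the call_model/est_cost helpers are placeholders for the real API calls, not the actual Chimera implementation:

```python
# Rough sketch of a budgeted, weight-adjusted debate. Weights, prices, and the
# call_model/est_cost helpers are placeholders, not the actual Chimera code.
TASK_WEIGHTS = {
    "coding": {"claude": 0.5, "gemini": 0.2, "openai": 0.3},
    "life":   {"claude": 0.2, "gemini": 0.5, "openai": 0.3},
}
BUDGET_USD = 5.00

def run_debate(task_type: str, question: str, call_model, est_cost) -> str:
    weights = TASK_WEIGHTS[task_type]
    spent, best, best_score = 0.0, None, float("-inf")
    while spent < BUDGET_USD:
        proposals = {}
        for name, weight in weights.items():
            answer, score = call_model(name, question)   # each model proposes and self-scores
            spent += est_cost(name, answer)
            proposals[name] = (weight * score, answer)
        _, (weighted, answer) = max(proposals.items(), key=lambda kv: kv[1][0])
        if weighted > best_score:
            best_score, best = weighted, answer
        else:
            break   # no improvement this round: treat it as consensus reached
    return best
```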
Yeah, no, that's really interesting. So we have frontier models with different specializations, and then, if I'm reading you correctly, we have an RLAIF pipeline that is monitoring, like, per task. And then for each task we're going to have the various frontier models at different weights, depending on the task, be more heavily weighted during the RLAIF process. Is that kind of the idea? Yeah. Sweet. Nice. That's really cool. So this is like the top-down thing.
So while I am optimizing bottom-up, I found out that I have a 2x boost. That's enough, that's good for me for now. So what I needed to do is see if I can optimize from the top down. If my local model is performing at least as well as Claude, not even current Claude, I'd say 3.5, then I've achieved the task that I need. I have two agents running concurrently, fully local, doing my task for me, without me having to pay a single cent, except obviously the hardware costs. And they are obviously going to explore private data, so they are going to be sandboxed into an environment where they cannot send data out, and everything remains fully local.
I guess also the cost of electricity. And as, you know, it's getting colder going into the winter months, your room is going to be nice and toasty from that GPU.
Yeah, true, true, pretty much. Right. Yes. This is not public. I can post a GitHub Gist somewhere, but this is a proposal for now and I've not fully explored it; I started this two days ago. I have a debate currently running, but I ran out of API calls, so I'm waiting. But yeah, that's the idea that I have. Everything is fully modular, so you can plug in any model you want. Sometimes what ends up happening is that 3.5 is better at some tasks than Opus 4.1, so why not use 3.5? It's more stable. 100%. I think someone posted that Codex is giving a streaming error.
Why not? Then we can just switch to GPT-5. What's your validation path, or your consensus algorithm, kind of, set up here? If I'm tracking correctly, we have RLAIF that's happening through some kind of debate format, which is weighted depending on the task, and then there's some kind of termination point that's either $5 or, more or less, an "I like it, this works for me" kind of thing. Are you driving the model during the RLAIF process so you can get to the validation state and then lock it in, or what's kind of the process?
So, right, let's see. Pairwise comparison, okay. Proposal that beats all. Oh, there's this Condorcet; I'm not familiar with Condorcet or Borda. I'm presuming those are just scoring. Yes. Yeah. And it's like LLM-as-judge, kind of preference-based stuff.
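A minimal sketch of the Borda-style scoring mentioned here, with task-dependent judge weights; the names, rankings, and weights are purely illustrative of the mechanism:

```python
# Minimal sketch of weighted Borda scoring: each judge ranks the candidate
# answers, ranks become points, and points are scaled by judge weights.
def weighted_borda(rankings: dict, weights: dict) -> str:
    """rankings: judge -> list of candidates ordered best-first."""
    scores = {}
    for judge, order in rankings.items():
        n = len(order)
        for position, candidate in enumerate(order):
            # Best-ranked candidate gets n-1 points, the next n-2, and so on,
            # scaled by how much this judge is trusted for the current task.
            scores[candidate] = scores.get(candidate, 0.0) + weights[judge] * (n - 1 - position)
    return max(scores, key=scores.get)

# Example: three judges ranking four candidate answers for a coding task.
rankings = {
    "claude": ["local", "claude", "gemini", "openai"],
    "gemini": ["claude", "local", "openai", "gemini"],
    "openai": ["local", "gemini", "claude", "openai"],
}
print(weighted_borda(rankings, {"claude": 0.5, "gemini": 0.2, "openai": 0.3}))
```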
Yes. Nice. Yeah, that part's really cool. Really cool. But I was going to ask, can you get into, I have some questions that I think will actually help you in whatever journey you're on now, but I'm curious: Banter Packs versus Banter Hearts versus Banter Blogs. Why did you name them that way? What's the significance of the naming?
There's no significance. I was being dumb, honestly. Yeah, there's no significance.
I think what I'm hearing is vibes. Yeah.
Yeah. The entire ecosystem is called Chimera, but I started off just because I wanted to, and it's gotten pretty far. The model optimization is whichever one has the best banter, bro.
Yeah. I mean, normally, if you see, we had GPT-3 being robotic and people didn't enjoy it, so they personalized it. GPT adapts to your vocal cues, or sorry, to the things that you say, and it speaks in your tone. That's what I'm using. So this is pretty much personalization on a whole new level, because not everybody is going to be the same type of dev, not everybody is going to have the same type of life tasks, not everybody is going to have the same specific requirements. Some people are going to be hyper-fixated on having deterministic outputs, and some people are good with just arguing through ideas and exploring them, and the constitution remembers that.
So yeah, is this the daily driver for you now? Like, are you using this tuned model for all sorts of daily stuff, or are you still sort of tuning it and finding new angles? Or is it useful in certain contexts but not others, because this size can be kind of hit or miss? I'm curious how it feels in use day-to-day.
Day-to-day I enjoy it a lot, honestly. While I'm chilling, while I'm just thinking through ideas, I go back and forth through it. Like, hitting this idea of RLAIF weighting: as someone else said, I had no idea if that entire algorithm could be used, so I explored it through all three of these frontier models together and I found out that this worked for me. This was the best way to handle weights, 100%. So why not just debate? Normally we have ideas and we have to Google multiple things. Instead of Googling, I'd much rather have the LLM Google it for me and find specific answers.
So we're using weighted voting and ranked choice, is that right? So are the larger models doing, are they deciding, what decision are they making here? Are they actually doing weight updates?
Not weight updates. They are creating an RLAIF dataset. Normally what ends up happening is a question is generated, say, what is quantum physics? The local model generates an answer, OpenAI generates an answer, Gemini generates an answer, Claude generates an answer. We enter a debate: there are four responses, which is the best, and why is it the best? Once the debate phase ends, they hit a consensus. I look over the answer, and if I say the answer is good, it's stored away into the constitution, or rather the RLAIF dataset, which will eventually be part of my dynamic constitution. And it keeps on updating until we hit the repeat part. So if there's enough data, I can train my local model to generate answers similar to Claude for coding, similar to Gemini for daily tasks, stuff like that.
Okay, and then where does the kernel-level optimization stuff fit into that picture? I see dataset, and then I see test/train, I think.
Yeah, so normally this is where it starts getting fun.
I created a dataset where my "what is quantum physics", so that's the size of the prompt, and the dataset is going to be generated, so it's going to go onto the GPU. Some tasks do not require a full GPU offload; some tasks do require a full GPU offload. Now the orchestrator decides, based on the three reports that I have, TR-108 through TR-110, plus whatever further research I find. It decides how much GPU allocation needs to happen, in 70 nanoseconds, sorry, 70 microseconds, and it offloads onto the GPU, so your LLMs respond as fast as they can on your current hardware. So top-down, bottom-up.
Okay, so there's an orchestrator deciding how much GPU to allocate, and that's for each task, and the thing that's being allocated is GPU memory for the small model. And the small model is being, let's see, is the small model actually getting weight updates in here? I know I'm losing it.
I think it's not getting weight updates, it's not getting anything, it's just generating answers for now. It will be further optimized later. So currently it's just responding to my questions.
Okay, but the idea being, there is a GPU scheduler, it has some level of access to this constitution-based kind of golden dataset, and since it just has that, it's like, infer this and do the scheduling kind of thing.
Correct.
Nice, I like that, actually. You just let go and get to the point where you understand the interesting steps up until here, and then you're like, figure it out. I think that's, yeah.
I mean, hardware is going to change. I started with a 4080. The 5090 is already out, the 6090 is going to be out by next year, and I might end up with a 5090, I might end up with an H100. If I optimize for each GPU, I'm going to end up a bit confused. I can't do that, so why would I let that happen? I would much rather benchmark whatever GPU I have on hand and then optimize my LLM base for that GPU. So any personal GPU that you have, you can optimize for it: one-shot optimization, one-shot LLM deployment, one-shot responses. That's the reason the RLAIF dataset is not integrated into Banter Hearts or Chimera, because this will be your ground truth for your future models. You see, if you switch devices, if you go from a MacBook M3 to a MacBook M5 Pro Max or whatever, the optimizations are going to be benchmarked differently, but ultimately the preferences that you have for responses need to be the golden standard for the local model that you deploy. So if you deploy, like I am deploying Gemma 3, or you deploy GPT-OSS 120B, both of them are going to have different answers, but your preferences are yours, your answers are yours. So that GPT-OSS 120B needs to know that your life preferences are this, your coding preferences are this, your tasks are this, this is what you eat, this is what you enjoy. And that's the RLAIF end of it, the top-down version of it. And the bottom-up version is basically optimization of GPT-OSS 120B on your local hardware, so further quantization will obviously make the model tinier based on your specific tasks. But that's eventually. Yeah.
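A hedged sketch of the scheduler idea described a moment ago, where only the prompt size (plus thresholds benchmarked in the TR reports) picks the Ollama offload options before the call ever hits the model; the thresholds here are invented stand-ins, not the actual TR-108/TR-110 values:

```python
# Hedged sketch of a prompt-size-driven GPU scheduler. Thresholds are invented
# stand-ins for the benchmarked TR-108/TR-110 values.
def schedule(prompt: str) -> dict:
    tokens = max(1, len(prompt) // 4)      # cheap estimate, roughly 4 chars per token
    if tokens <= 64:
        return {"num_gpu": 20, "num_ctx": 2048}   # tiny prompt: partial offload loads faster
    if tokens <= 1024:
        return {"num_gpu": 80, "num_ctx": 4096}   # mid-size: the stable two-agent threshold
    return {"num_gpu": 999, "num_ctx": 8192}      # long prompt: full offload, bigger context

# The orchestrator would pass this straight into the local call, for example:
# ollama.chat(model="gemma3:1b", messages=[...], options=schedule(user_prompt))
```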
Yeah, and I'm wondering, I feel like there's a way to do GPU scheduling alongside model architecture optimizations, right? Because if you have enough VRAM to apply this to a Gemma right now, then you'd want to explore the effectiveness of different quants or different model architectures, like, you know, Kimi with their heavy-experts, few-heads kind of strategy. And then, rather than getting that one to approve, you could get five to approve, and the five all have sort of different architectures: you've got a Llama, you've got a Gemma, et cetera, et cetera. It could be very...
Correct, a lot of possibility there. So yes, that's the reason why Banter Packs, or this infrastructure architecture, has full ELK observability with Prometheus, Grafana, and Jaeger for tracing what's going wrong when the agents are deployed. So anything an agent does is tracked, and any API calls that happen are tracked. And that's why the MLOps platform also has Nsight monitoring plus Prometheus-Grafana monitoring; I didn't go for the ELK there. So for the more, like, pro-coded people, it's already there.
Yeah. I would presume that the GPU scheduler, that level of orchestration, if that's getting logged alongside the responses, then you should be able to build a good, in addition to your full output dataset, you could, you know, stick some probes into the CUDA graph or whatever, and then see what other metrics you could, whatever levels you could start making optimizations at. It gets a little fuzzy there, but yeah.
That's the reason I have Redis for initial logging and Postgres for golden logging, like the ground-truth logging. So Postgres handles the actual full data logging and Redis is immediately responding to the cache.
Coming up on the 45th minute, if anyone else in the audience has some questions too, just a call for questions. I was going to also ask you about the TTS pipeline. Have you played around with the voices and things of that nature, like as Jarvis is responding to you?
Yeah, so basically my initial take was based on orchestrating over the API, but I just went with the Whisper-Piper pipeline. It works for me for now, because this Whisper-Piper pipeline is pretty simple, I don't need to wire up a lot of things, and it does not require a lot of resources. I was going to try Kitten TTS. Have you heard of Kitten TTS? Kitten TTS is a 25 MB TTS pipeline.
No, I haven't.
It was launched last month. It's a somewhat popular startup on Y Combinator. Let me, I want to post that link, so I'll take some time. Actually, if you have any more questions, I'll take some time and find that. Let's see, Kitten, yeah.
I would be interested. So you have Whisper for your STT; what's your TTS? Currently Piper? Oh, Piper, okay.
Yeah, the Whisper-Piper pipeline. I did not heavily dive into the entire thing; I had initial success on latency, so I didn't go fully into that level of optimization. That front-end aspect will come later, once I have the entire back end mapped out, and the back end goes straight down to the silicon, so it just gets complicated.
Yeah, I can imagine. Have you looked into, or you might have already covered it, I'm not sure, I got here a little late, but multi-modality? Have you tried throwing a Moondream in there and seeing what happens, kind of thing?
I have a plan.
Like, this current orchestration has four, sorry, four, how do I put this, four repos. The fifth repo is going to be generating comics, more or less. So these discussions that happen, just make them funny and turn them into comics. And then once I have availability for Sora 2, you'll be able to hear Gemini plus Claude plus GPT discussing it out loud and yelling at my local model to be better.
That sounds like fun.
Yeah, it's a good thing. Eventually hardware will be good enough, so I could just, I'd presume it would be like a five millisecond lag, but even with a five millisecond lag I could just observe it live, like a live animation of my constitution debate happening.
An anime episode about what your GPU kernel is currently doing while you're working.
I mean, yes, more or less. It's entertaining. I like the idea very personally, just having an interactive environment where I could chime in, because obviously I have the voice pipeline ready. So if I say something, and this STT/TTS pipeline is being read by all these models, I'm actually interacting with anime versions of these guys.
Ironically arguing with this kernel because the model's hallucinating again.
Yes. I love it. That's going to be a party.
Well, and I think there's something to that too in terms of real optimization, right? Because, especially if you get to hear them argue, like, I can do a hyperparameter sweep or whatever, but if we're showing the scheduler this entire anime-episode debate process and being like, okay, now allocate the VRAM, it would be really hard to articulate that in terms of a GPU config. Whatever that thing is sort of inferring from this blob of data that it's got to work with, I think, could be real good.
Actually, all it has to infer from is what question I have asked and what feedback I'm giving. The model allocation, the system, is generating some words, so the TTFT is being allocated; the GPU allocation is just for the model generating words. And whatever prompt I generate, whatever question I ask, that's what it has to observe, nothing else. This discussion is happening on an entirely different thing, but as long as it tracks that my question has 15 tokens, and 15 tokens requires this amount of model resource allocation, partial or full offload, what else needs to be done? The scheduler doesn't need to do anything else, or it only needs to be dumb enough to simply do this is what you need to do, this is where it needs to be done. Eventually, when systems get smart enough, I'll have a smart agent decide, but we are not there yet. Hardware-wise, we are not there yet.
Yeah. Okay, so the VRAM allocation is going to be at sequence level, or like batch level, kind of token load?
Only sequence level, pre-allocated. To put it simply, all of us work out, right? So the best understanding of this would be: you know how much pressure you need to put into your muscles to lift a 10 kg dumbbell or a 10 kg bar, but you have to think about your form, whether you're going to squat with it or bench press with it. The pressure doesn't change, the muscle allocation doesn't change; all that changes is how you move your muscles. So this RLAIF is the thinking part, and the muscle allocation is the GPU scheduler part. Both are independent but related.
B, did you have a question? I saw you turned your camera on. I don't know if you had a question specifically for Sahil, but I have a couple of questions if not.
No, it was by mistake, sorry.
Oh, okay, no worries. I just wanted to, you know, make sure we include you. Sahil, is this your first, I guess, shot, or not shot, but attempt at getting exposure for these ideas and your research? Like this specifically, the Jarvis thing?
Yes. It was very random. If you see, the first commit that I have is, quite honestly, extremely stupid.
Do you have any, like, the labs, I don't know what your dream job is, because that's kind of how we started the conversation, but have any labs reached out for you to do this type of research, or AI research, ML research in general, based on your degree or anything that you're interested in doing?
I have applied, but with the recent H-1B debacle my interviews got put on hold. So I just decided, why wait? I'd much rather work.
No, for sure, for sure. So, okay, ideally, what is your dream job? Do you have any in mind?
I mean, eventually I might end up as a founder, given the scope of my capacity, but this was my first commit.
Got it, okay. And I'm asking these types of questions because I think platforms like this, I don't know if you know, but all the recordings go on YouTube, and there's obviously, like, the Latent Space, or let me not say obviously, because I'm not sure if everyone here is familiar, but swyx hosts the Latent Space podcast, which is a huge platform with a lot of subscribers, and this recording actually goes on Latent Space TV, which is kind of an associated YouTube channel that gets a couple hundred views per video. And I'm asking these questions because I'm also curious: do you have a video, a demo, of you using Jarvis in action?
I do plan on creating one, but this RLAIF pipeline got really interesting, so I'm currently working on that.
Yeah, okay, I understand.
I think I had a Tauri UI test run. It is giving me issues, but I'll figure that out.
Yeah, for sure. I just think this is really interesting research and obviously could open you up to lots of job opportunities, in my opinion, especially in this space. So I just think posting videos like this, or demos like this, and work like this, research like this, on these types of communities could get you access to all kinds of opportunities. So I just wanted to say that while we have a couple minutes left and there's still room for other people to ask questions. I just wanted to ask those types of questions, I guess, number one, for the recording that's going to go on YouTube, but number two, just kind of for my own curiosity. It seems like we have some chats related.
As the Web3 guy in the chat, I can certainly think of like five or six ways to monetize this immediately if one wanted to. But I think a better question, perhaps, that should be included in the recording is: for the experiments that you're doing and the kind of research that you're looking into, are there any resource shortages? So you've got a 4080, you're kind of compute constrained. Is the constraint here breeding creativity, or are there things where it's like, man, I really wish I had X or Y to test this on, kind of thing? Or both?
It's actually both, because the thing is, it's an interesting space if you're deploying this locally and you can have a live system. The eventual pathway that I have is to have a fully integrated house. If you've seen any of the Iron Man movies, you have an AR/VR system that you can interact with and do whatever you want, and if you have a GPU powerful enough to handle that, the system already enables it.
And yeah, the world models are just getting better.
Yeah. Like, more people are working on model optimization; I'm creating a platform that can plug in any model and optimize it for your local system.
Yeah, I have a very similar-vein project, except I went for the big guys. I was like, okay, let's assume that we don't have any compute constraints, and then we have a coding task: what's the best selection of coding models, the best selection of coding LoRAs, and then the best amount of kernel-level optimizations that we can make to each pipeline, kind of thing. But the sort of online constitution-building of the golden dataset is a very interesting approach. That's kind of cool that it works, sort of at any level. Have you tried scaling it up at all, I guess?
I have tried, but my system decided to BSOD, so I just decided to stop. You see, these are the kinds of optimizations that run; if you're looking at the screen, it's everything end to end. So that is a pipeline that is ready for enterprise, basically: just deploy it over any sort of local multi-GPU system, optimize it, optimize the LLM to run over that local GPU system, and see how it eventually ends up. Everything from here, pretty much. Checkpointing, yeah.
Yeah, I think there's a company that I know that's actively looking for people to give compute to. I'll see if I can grab the link, because I think it would be cool to try this on, like, 8x H100 and just see how far it rips.
Yes. It does, like, you start to see, right, the idea is really interesting, and the more I dive into it, the more interesting it gets. That's why everything has to be well organized.
But yeah, more or less, this is kind of a full-circle moment for me, because one of the reasons why I got into the AI space a couple years ago is I saw a guy on YouTube trying to build Jarvis, but it was nowhere near this advanced. So it's funny to see something like this kind of come to fruition. I'm also curious, I saw you had GitHub, I should probably go look at your profile, but are you on Twitter as well?
No, I'm not on Twitter. I did open an account, and I should get active on it, but I'm working on this. I basically spend at least 10 to 12 hours a day just running through this and seeing where it goes, like yesterday, pretty much. And these tests also run a lot, so I don't spend too much time on social media. More or less, I have an automated pipeline, so what ends up happening is the CI/CD for Banter Packs publishes into Banter Blogs, which publishes onto my LinkedIn, but that's about it. And to maintain the codebase I have a full CI/CD pipeline.
Yeah, I've got 16 gigabytes of VRAM at home. I would totally, like, do a demo video for this, because I just think I can see it going viral in the spaces that are interested in things like that.
And we have like two minutes left, so we probably have to ask the wrapping-up questions. But yes, there's a subreddit called LocalLLaMA, and then an X/Twitter community called LocalLLaMA, that would eat stuff like this up; they absolutely love running on GPUs at home, and that's the primary focus of those communities. So I think they would love stuff like this.
Speaking of communities, if you guys don't have any other parting questions, or Sahil, if you have a request or any questions for us, or any calls to the public that may see this video on YouTube, if you have any closing comments?
Honestly, anyone who wants to tweak around with this and have fun discussing it, please contact me. Let's have fun.
Perfect, perfect. And for anyone else left on the call, Yikes, if you don't have anything to close: basically, what Kball normally does at this time is say, you know, on a weekly basis we meet on Fridays at this time and try to discuss interesting ideas in the AI space. So if you have any interesting ideas and would love to sign up for any of the following weeks, please let us know. We typically coordinate in Discord, so the AI in Action builders channel is the Discord where we get together, and you can talk to our AI in Action bot, which will help you schedule an upcoming week. And we're looking for people to speak. I think we have somebody maybe two weeks from now, but other than that, we have an open slot next week. So if you're watching this and it's not next Friday, which is the, what, 24th, I think, come to latent.space/p/community and then come to the Discord and tell us that you want to present something.
The other addendum that I'll make is for Sahil, and anybody else who is experimenting at a level where they need GPUs: I just dropped a link into the AI in Action channel from a company called Illoon, and they are actively trying to give away compute right now. So there's a form to fill out if you click the link that is found in that Discord, and perhaps in the show notes, we'll see.
Cool. It's a minute over, so we'll wrap it up. Thank you for presenting. This is really cool, very interesting, and when you said RL, I