
Building Jarvis: A Deep Dive into Local LLM Optimization & Multi-Agent AI [AI in Action - Oct 17]


Chapters

0:00 Introduction to "AI in Action" and RLIF Discussion
1:32 The Jarvis Dream: Dynamic In-Game Banter
2:50 Community Contributions and Personalized AI Assistants
4:05 Starting the Presentation: Banter Packs and Initial Gaming AI
5:53 Latency and Hallucination Problems with Local LLMs
6:35 Transition to Banter Hearts and MLOps Platform
7:24 Muse Protocol and Public Deployment (Banter Blocks)
8:05 Quantization Benchmarks (Llama 3.1B vs. Gemma 3)
8:31 Hackathon Experience and the Need for Local Agents
9:43 The Genesis of the Jarvis Idea and Agent Workflows
10:00 Multi-Turn Task Challenges and GPU Adjustments
10:34 Benchmarking Multi-Agent Systems
11:10 Concurrency with Multi-Agent Systems and Homogeneous LLMs
12:17 Achieving 2x Speedup with Concurrent Agents
12:40 Memory Contention in Heterogeneous Systems
13:16 GPU Layer Thresholds for Stable Agent Execution
14:29 Workflow Optimization and the RLIF Pipeline
15:13 Quantization Testing and Llama vs. Gemma Findings
15:40 RLIF: Frontier Models Debating Local Model Output
16:47 Challenges with GPU Power for Live Demos (RTX 4080)
17:52 Machine Specs and Throughput Optimizations
19:00 Reproducing the Research and Private Repositories
20:30 Kernel-Level Optimizations (Triton and Torch)
21:53 Release Plan for CUDA Optimizations
22:57 Why Gemma
24:55 Eye-Catching Newer Models (Samsung 7B Model)
25:52 Gemma Architecture Optimization for Small Models and Edge Devices
26:36 Special-Purpose Models and Task-Specific Memory
27:40 Data Privacy and Local Agent Sandboxing
28:22 Weighted Debate and Budgeted API Calls in RLIF
29:35 RLIF Pipeline: Weighted Frontier Model Consensus for Local Model Training
30:50 Cost of Local AI (Electricity)
31:24 Fully Modular System and Model Stability
31:45 Validation Path: Consensus Algorithm and LLM as Judge
33:00 Naming Conventions: Banter Packs, Hearts, Blogs (No Significance)
33:40 Personalization and Adapting to Vocal Cues
34:22 Daily Use and Iterative RLIF Exploration
35:36 Weighted Voting and Ranked Choice in RLIF
36:51 Kernel Optimization Integration with RLIF Data
37:50 Orchestrator for GPU Allocation based on Task
38:54 Hardware Agnostic Optimization Philosophy
39:49 RLIF Data Set as Ground Truth for Future Models
41:07 GPU Scheduling with Model Architecture Optimizations
42:00 Observability with ELK, Prometheus, Grafana, and Jaeger
43:16 Redis for Initial Logging, Postgres for Golden Logging
43:50 TTS Pipeline: Whisper and Piper
44:07 KittenTTS: A 25MB TTS Pipeline
45:21 Multimodality and Future Plans (Comics, Sora 2)
46:8 Live Animation of Constitution Debate
47:8 Optimizing for User Feedback over GPU Config
48:45 VRAM Allocation for Sequence Level Token Load
49:16 Analogy: GPU Scheduler as Muscle Allocation, RLIF as Thinking
50:18 First Attempt at Public Exposure for Research
50:50 Dream Job: Founder, Muse Capacity
51:36 YouTube Visibility and Demo Video Plans
52:30 Resource Shortages: Constraint Breeding Creativity
53:49 Fully Integrated House with AR/VR
54:20 Platform for Plug-in and Optimize Any Model
54:33 Comparison to Other Coding Model Optimization Projects
55:00 Online Constitution Building and Scalability
55:20 Enterprise Deployment and Checkpointing
55:56 Illum: Company Offering Free Compute
56:28 Full Circle Moment: Jarvis Inspiration
56:47 Social Media Presence (LinkedIn, Automated CI/CD for Blogs)
57:38 Demo Video Suggestion for Local Llama Community
58:10 Call to Action: Collaborate on the Project
58:30 Latent Space AI in Action Community
59:36 Illum Compute Offer

Whisper Transcript

00:00:00.000 | So I'm curious. You said you're a fresh grad. What does that mean?
00:00:04.620 | I graduated from NYU in May.
00:00:08.320 | Ah, okay, okay. Awesome. Congratulations, first of all.
00:00:11.420 | Thank you.
00:00:11.740 | Yeah, so for AI in action, it's normally just like anything that's interesting.
00:00:17.580 | So if I were you, I wouldn't feel too much pressure to perform.
00:00:21.080 | As long as you're talking about interesting stuff that's in the AI space, I'm sure we'll have a good time.
00:00:26.700 | Do you find dynamic constitution through RLIF interesting?
00:00:30.300 | Yes. So the transition from like RLHF to, I guess, RLAIF, to now, like, they're calling it RLVR, which are all separate things.
00:00:43.960 | But like that's like the buzzword of the day or the week has been interesting to me.
00:00:48.320 | And verifiable rewards is now kind of like taking on this thing where people are contributing to like
00:00:55.460 | a community effort for verifiable rewards.
00:00:59.560 | That's super, that whole space is super interesting to me.
00:01:01.960 | So I think this should be a good talk or maybe like a good presentation is like kind of like my thoughts off the bat.
00:01:08.560 | But I have not dug deep enough into, or I've done the research on reinforcement learning and like how you can use verifiable rewards to make models better for like low cost.
00:01:21.080 | But I have not done it myself.
00:01:22.580 | So that's something I haven't really, really explored the way I need to, but it is, I do find it very, very interesting, that whole space.
00:01:29.880 | Yeah. Long story short, this is my attempt to create my own Jarvis from the Iron Man movies.
00:01:37.560 | and I started simple. I just wanted like, okay. Do we have to wait or like, should I just begin?
00:01:45.060 | I get, I mean, being that it's recording, we probably could begin. We normally have a few more people come in. Yeah. We have some people doing it now.
00:01:52.860 | Give it to like 3:05, like two more minutes and then we can get started.
00:01:57.160 | Yeah. But that's my take on it. I just want, like, I was applying to jobs and the process got a bit too difficult.
00:02:06.160 | So I decided that I'd much rather just use my skills. Why do I need to grind lead code anymore?
00:02:12.160 | So I just started simple. I wanted to have dynamic gaming banter, more or less like in-game characters talk to me instead of just to the game or I interact with the game, more or less.
00:02:26.560 | This was this, there was this idea in the latent space discord where people say that basically more or less, we get stuff. We don't have dynamic NPCs, right? We have a pre like NPCs, which have a set amount of responses.
00:02:44.460 | So I just wanted to start simple and that's where I began.
00:02:48.460 | Yeah, that's pretty cool. That's a cool idea. I like the idea already.
00:02:53.160 | Zach says, I'm doing Claude as coach while we're on a couch to 5K program and the training of the minor body.
00:02:59.260 | Yeah, that's pretty cool.
00:03:00.360 | Yeah, I haven't really gone the like Jarvis route, but like, it's not, not that like, I, I've got my own version of this that I'm exploring.
00:03:09.460 | Like, this is an interesting space because like the models have gotten cheap enough and powerful enough that like wrapping something around it to help, like, you know, whether it's, I need to figure some stuff out and I want to like bounce some ideas against
00:03:22.560 | these different personas and have this kind of conversational space or it's like, I just need someone to be like, yeah, it's fine that you're blocked.
00:03:30.760 | Go walk for five minutes that like part of my brain that I've like, I've had in the past, but it's like easy to lose track of and then you just like stall out on stuff when you could easily just go touch grass, come back and then get back to it.
00:03:43.460 | And I need that little external reminder system and like, I'm just using Claude for it because like 20 bucks a month in a web chat with the right project context.
00:03:51.660 | Like it works, but I've been wanting to like evolve into something a little bit more like structured.
00:03:57.860 | So I'm, I'm curious what you've got to share with the group.
00:03:59.860 | Yeah.
00:03:59.860 | It's 3:05 now.
00:03:59.860 | So we probably can go ahead and, and it's recording.
00:04:00.860 | So we could probably go ahead and get started.
00:04:02.860 | I'm curious about the voice, the voice part of the Jarvis thing, but, but yeah, I think we're clear to get started now.
00:04:13.860 | Yes, just a minute.
00:04:22.060 | So basically, like, as I said, my idea began with something that I wanted to email Gabe Newell, the owner of Steam, and I just wanted to start with gaming banter and it started with Banter Packs, basically integrate an LLM and get to a system where you have
00:04:42.060 | A prompt engineered character talking back to you through the game.
00:04:47.260 | I had everything mapped in.
00:04:49.260 | I started figuring out like the in game mechanics as to what DLLs needed to be tracked so that the local LLM could respond.
00:04:58.260 | Initially, it was just text.
00:05:00.260 | Once I had the MVP ready for text, I tried to overlay it over the most recent alien game.
00:05:06.260 | It was launched.
00:05:07.260 | I took it and I pushed it over the alien games trailer and it worked beautifully.
00:05:13.460 | And then I injected it into the game.
00:05:15.460 | I could figure out what DLLs I needed to track to just see what events were happening.
00:05:18.460 | And then, like once the characters started moving or their lips started moving, my LLM started responding through text.
00:05:27.660 | Now I used whisper plus piper for the entire conversation to be tracked.
00:05:31.660 | So whenever I responded, instead of I muted the in game characters and I had the system respond to me.
00:05:38.660 | This is the banter packs aspect of it.
00:05:40.860 | I had a real-time streaming overlay through OBS.
00:05:42.860 | OBS tracked my entire in and out and offered it to the local LLM through OLAMA.
00:05:48.860 | That was where it initially started.
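As a rough illustration of the loop just described (game or mic audio in, Whisper for speech-to-text, a local model served by Ollama for the in-character reply, Piper for text-to-speech), a minimal sketch might look like the following. The model tag, voice file, and character prompt are placeholders, not the project's actual values, and it assumes the openai-whisper package, a local Ollama server, and the Piper CLI with a downloaded voice.

```python
# Minimal sketch of the Banter Packs loop: audio -> Whisper (STT)
# -> local LLM via Ollama's HTTP API -> Piper (TTS).
import subprocess
import requests
import whisper  # pip install openai-whisper

OLLAMA_URL = "http://localhost:11434/api/generate"
CHARACTER = "You are an in-game companion. Reply in one short, in-character line."

stt = whisper.load_model("base")  # small model keeps STT latency low

def banter_once(wav_in: str, wav_out: str, model: str = "gemma3:1b") -> str:
    heard = stt.transcribe(wav_in)["text"]               # speech -> text
    reply = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": f"{CHARACTER}\nPlayer said: {heard}\nReply:",
        "stream": False,
    }, timeout=60).json()["response"]
    # text -> speech via the Piper CLI (assumes a downloaded voice .onnx)
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_out],
        input=reply.encode(), check=True,
    )
    return reply

if __name__ == "__main__":
    print(banter_once("player_line.wav", "npc_reply.wav"))
```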
00:05:52.860 | Now, here is where the problems began.
00:05:55.060 | The first problem with Banter Packs is that there's a lot of latency.
00:06:01.060 | Whenever there's a local LLM responding, your TTFT goes to about 2.5 seconds and that's not good.
00:06:08.060 | Like if you're playing games, you needed something that could instantly respond.
00:06:12.060 | So efficiency was a big problem.
00:06:14.060 | And sometimes the LLM's either hallucinated or lost track of the personality they were supposed to emulate.
00:06:20.260 | So I had generic responses instead of specific responses.
00:06:23.260 | So if I was playing Red Dead Redemption 2, instead of Arthur, I'd get some version of Marvel Rivals character responding to me.
00:06:34.260 | That was not good enough for me.
00:06:36.260 | So I started digging deeper into the aspect.
00:06:39.260 | And the first thing that I found was I moved on to Banter Hearts.
00:06:43.260 | I took out the LLM engine, I created an entire MLOps platform.
00:06:47.260 | In that MLOps platform, I started at the basic of layers.
00:06:52.460 | This is where the research got interesting.
00:06:54.460 | So initial findings told me that if I could use partial offload, I could get better TTFT for tinier prompts.
00:07:04.460 | And using full offload, I could get better TTFT and context loading only if I keep it locked.
00:07:12.460 | The quantization was basically depending on game.
00:07:16.460 | So it would understand what I was talking and just load data based on that.
00:07:21.660 | And optimizing the throughput is what gave me 10x improvements.
00:07:25.660 | All of these reports are in the link that I have put into the chat.
00:07:31.660 | And hold on, let me go on further.
00:07:34.860 | So all of this, the Muse Protocol, is how I publicly publish what is happening in the system.
00:07:46.060 | Because Banter Packs and Banter Hearts both are private repositories.
00:07:50.060 | So Muse Protocol takes data from Banter Packs and Banter Hearts and publishes it into Banter Blocks, which is the public deployment of a public site, which tracks what progress I have made so far.
00:08:00.860 | Everything from the benchmarks to the development process.
00:08:04.860 | And in the research phase, this is what I found.
00:08:06.860 | So if I use the quantization for Llama 3.1B, I had around 76.59 tokens per second.
00:08:14.060 | But with Gemma 3, I could get it up to 102.
00:08:19.260 | Optimal configuration came out to be a full offload with a context limit of 4096 and a temperature of 0.4.
00:08:27.260 | So that was a general victory there.
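For reference, the "optimal configuration" described here can be expressed as Ollama generate options: full GPU offload, a 4096-token context window, and temperature 0.4. This is only a sketch; the model tag and the num_gpu value used to request full offload are assumptions to be tuned for your own card.

```python
# Hedged sketch: sending the benchmark's winning options to Ollama.
import requests

def generate(prompt: str, model: str = "gemma3:1b") -> str:
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 4096,     # context limit from the benchmark
            "temperature": 0.4,  # temperature from the benchmark
            "num_gpu": 999,      # large value asks for all layers (full offload)
        },
    }, timeout=120)
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(generate("Say hello in five words."))
```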
00:08:29.260 | Next, I start then.
00:08:31.260 | This is where it derailed completely because I went off to a competition.
00:08:36.460 | There was a hackathon on 4th October.
00:08:38.460 | And there were a lot of like OpenAI startup head of division, a lot of founders were there.
00:08:49.260 | And they wanted us to create some product out of an agentic workflow.
00:08:53.260 | We were assigned some amount of free API calls.
00:09:00.460 | And I sat in that room working with around 150 other people, and I saw like hundreds of dollars just flying out for MVPs, which may or may not be further used.
00:09:11.260 | All of this, all of these unique ideas were just flying out, and they were being used in any possible way.
00:09:19.260 | Now, this is where I also found out that hundreds of GBs of data was also being read.
00:09:25.260 | So anyone, if possible, they were already incurring hundreds of dollars in terms of API calls, costs in terms of API calls.
00:09:33.260 | They were also losing data.
00:09:35.260 | Their data was being read wherever it was going.
00:09:37.260 | Nobody had any idea where it was passing off to.
00:09:39.260 | I was talking to people and yeah, that happened.
00:09:42.260 | And I came back from that hackathon.
00:09:44.260 | I had an idea that I wanted to implement my local agents.
00:09:48.260 | And that's where the Jarvis idea started that I could use my already, I had a working voice system.
00:09:53.260 | So I could just deploy my agents through having a conversation with my local orchestrator.
00:09:58.260 | And I just started working on the agent workflows.
00:10:02.260 | Initial findings were that basically whatever I found in TR 108 was irrelevant in TR 109.
00:10:08.260 | because any sort of improvement in TTFT that I had failed in multi-turn tasks.
00:10:15.260 | So to fix that, I had to make adjustments for the GPU.
00:10:20.260 | I had to take the full offload and reduce it to a partial offload.
00:10:24.260 | And the context size had to be reduced and the temperature had to be adjusted for that.
00:10:28.260 | So upon doing that, my agents got faster and had a slightly better throughput improvement.
00:10:35.260 | But the task got done with 100% accuracy.
00:10:37.260 | So that worked out for me.
00:10:38.260 | The task was basically for like for the benchmarking purposes.
00:10:41.260 | The task was to take whatever Triton, CUDA, and Nsight benchmarks I had run, ingest all of that data, draw inferences from them, and publish a report.
00:10:55.260 | The template of the report is on my GitHub.
00:10:58.260 | It's in the chat.
00:11:00.260 | So you'll find the report TR 108 there, TR 109 there.
00:11:04.260 | And the total process ran about three hours or so.
00:11:09.260 | And then I started building further down the line into concurrency.
00:11:14.260 | Single agents deployed locally 100%.
00:11:17.260 | So now how much could I push the system?
00:11:20.260 | So I went into like, I needed multi-turn agents working concurrently and together onto a task.
00:11:26.260 | So the task remains same, but they had to, one agent was the ingester and another was the analyzer.
00:11:32.260 | And they were coordinating.
00:11:34.260 | And I found out that if I had baseline Ollama models deployed over LLM, I could get mixed accuracy.
00:11:41.260 | The speedup was tangible, but efficiency was mixed.
00:11:44.260 | Heterogeneous meaning like if I had a specific tuning at the CUDA core level, heterogeneous adjustments due to any sort of specific task.
00:11:55.260 | So if the ingester has a specific heat map, sorry, heat or a kernel adjustment, it wouldn't work if they weren't homogeneous.
00:12:04.260 | The LLM models offered, sorry, the GPU did not offer full offload.
00:12:09.260 | So that's why we needed homogeneous.
00:12:11.260 | And I found out that I could have a 2X speedup with two more agents deployed simultaneously with a 99.3% efficiency boost.
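A minimal way to picture the two-agent concurrency experiment is below: an "ingester" and an "analyzer" hitting the same local Ollama server at the same time, with one homogeneous model tag for both, and the wall-clock time for the pair measured. In the real workflow the analyzer consumes the ingester's output; here both are simply fired concurrently to show how the speedup would be measured. Model tag and prompts are illustrative.

```python
# Sketch of measuring concurrent two-agent execution against local Ollama.
import asyncio
import time
import requests

URL = "http://localhost:11434/api/generate"
MODEL = "gemma3:1b"  # homogeneous model for both agents

def ask(role: str, prompt: str) -> str:
    body = {"model": MODEL, "prompt": f"You are the {role} agent.\n{prompt}",
            "stream": False, "options": {"num_ctx": 4096}}
    return requests.post(URL, json=body, timeout=300).json()["response"]

async def run_concurrently() -> None:
    t0 = time.perf_counter()
    ingested, analysis = await asyncio.gather(
        asyncio.to_thread(ask, "ingester", "Summarise the raw benchmark data."),
        asyncio.to_thread(ask, "analyzer", "Draw inferences from the summary."),
    )
    print(f"both agents finished in {time.perf_counter() - t0:.1f}s")
    print(ingested[:80], "...")
    print(analysis[:80], "...")

if __name__ == "__main__":
    asyncio.run(run_concurrently())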
00:12:19.260 | This gigabyte of RAM is crazy.
00:12:22.260 | You've mentioned, you've mentioned a couple of times that you posted a link in the chat.
00:12:26.260 | You're not talking about the, the, the zoom chat, right?
00:12:29.260 | It is in the zoom chat.
00:12:31.260 | Is it Gaben.ai?
00:12:32.260 | Is that what Zach posted?
00:12:35.260 | Hold on.
00:12:36.260 | I'll have to check, but I'll, but I'll just finish this.
00:12:40.260 | Yeah, yeah.
00:12:41.260 | So once I figured this out, then this is where like the findings were pushed out.
00:12:47.260 | Like there was a, if I had heterogeneous systems, there was a lot of memory contention.
00:12:52.260 | So if I wanted a full offload, the ID, I had to avoid any sort of memory contentions.
00:12:58.260 | So that's where I had to select what sort of partial offload was needed.
00:13:02.260 | I'm sorry for this, but I was playing around with the cloud and this is what it generated.
00:13:07.260 | I apologize.
00:13:09.260 | I apologize, but the contention finding was that GPU layers equal to 80 was the minimum threshold
00:13:13.260 | for stable two-agent execution.
00:13:15.260 | You could execute more, but it depends on what your VRAM is and what sort of quantization you use.
00:13:22.260 | And of course, normally there's a, depending on the context size, there's no linear graph.
00:13:28.260 | The graph is like, the total findings are very interesting.
00:13:34.260 | when I ran both concurrently, single agents could have a specific TTFT and a specific token output.
00:13:41.260 | But there is a, like you realize that you can have a full offload only if you adjust the context
00:13:47.260 | size to be a 4K minimum.
00:13:49.260 | But for a multi-concurrent multi-agent system, you had to have that.
00:13:53.260 | And then only then you end up with proper linear, proper efficiency there.
00:13:59.260 | And this is not it.
00:14:05.260 | But beyond two agents, it depends on your, as I said, it depends on your VRAM.
00:14:09.260 | And if you have like a proper system with around 50 GB or whatever VRAM,
00:14:14.260 | you'll be able to run at least 10 agents concurrently without any contention.
00:14:19.260 | And fully local, no need for internet as long as you have all the data on the system.
00:14:28.260 | So what we found out is basically that the total tokens per second was 102 for Gemma 3.
00:14:35.260 | Workflow optimization was pretty much domain specific.
00:14:39.260 | So if your task changes, your optimizations will change.
00:14:42.260 | For that, I have been working on optimization agent, which detects the task.
00:14:46.260 | And that's where the entire RLAIF pipeline that I was talking about comes in.
00:14:49.260 | And multi-agent concurrency could achieve a theoretical 2x boost.
00:14:57.260 | Thank you.
00:14:58.260 | That's about like about Chimera.
00:15:00.260 | I could discuss further.
00:15:02.260 | Hold on.
00:15:03.260 | On the RLAIF pipeline.
00:15:05.260 | But before we go to that, I'll go through the reports once.
00:15:08.260 | So for the first things, I was using Llama 3 and Gemma 3 and I used multiple quantizations for them.
00:15:18.260 | Everything I think from 270 million to 1 billion quantization to 3 billion quantization.
00:15:23.260 | And general findings were basically, if there was one, you could have 1.165 millisecond min TTFT with memory usage at 5.3 GB.
00:15:36.260 | And these are the parameters.
00:15:39.260 | That's basically Llama versus Gemma.
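A sweep like the one described (TTFT and decode tokens/sec across model sizes and quantizations served by Ollama) could be scripted roughly as follows. The model tags are placeholders for whichever 270M / 1B / 3B variants you have pulled locally; the timing fields come from Ollama's streaming response.

```python
# Rough sketch of a TTFT / tokens-per-second sweep over Ollama model tags.
import json
import time
import requests

TAGS = ["gemma3:270m", "gemma3:1b", "llama3.2:3b"]  # assumed tags
PROMPT = "Explain kernel fusion in two sentences."

def bench(tag: str) -> tuple[float, float]:
    t0 = time.perf_counter()
    ttft = None
    with requests.post("http://localhost:11434/api/generate",
                       json={"model": tag, "prompt": PROMPT, "stream": True},
                       stream=True, timeout=600) as r:
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if ttft is None:
                ttft = time.perf_counter() - t0        # first token back
            if chunk.get("done"):
                # eval_duration is reported in nanoseconds
                tps = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)
                return ttft, tps
    raise RuntimeError("stream ended without a done chunk")

if __name__ == "__main__":
    for tag in TAGS:
        ttft, tps = bench(tag)
        print(f"{tag:14s}  TTFT {ttft*1000:7.1f} ms   {tps:6.1f} tok/s")
```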
00:15:42.260 | I was going to test more, but I found RLAIF more interesting.
00:15:45.260 | Like the next phase of this is basically what I've done is I'm using API calls from Claude.
00:15:51.260 | I'm using API calls from Gemini.
00:15:54.260 | I'm using API calls from OpenAI to optimize the performance of my local Gemma.
00:15:59.260 | So if I have a task, Gemma produces an output and the 3 big names, they debate over the quality of the output and whether it's good enough.
00:16:10.260 | If it's not, then they offer, they retweak the output and then this retweaking process is fed into an RLAIF database.
00:16:19.260 | So that when I am further training the local Gemma model and I have to use RLAIF, this is the database that will help design the PPO or the DPO.
00:16:29.260 | Though I haven't quite decided how to do that.
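One turn of that loop could look like the sketch below: the local model drafts an answer, a frontier model critiques and rewrites it, and the pair is appended to a JSONL preference dataset (prompt / chosen / rejected) that a later DPO or PPO run could consume. Only one reviewer is shown here, whereas the pipeline described uses three; the model names and file path are assumptions, not the project's actual configuration.

```python
# Hedged sketch of one RLAIF turn: local draft -> frontier rewrite -> dataset row.
import json
import requests
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

def local_answer(prompt: str) -> str:
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": "gemma3:1b", "prompt": prompt,
                            "stream": False}, timeout=120)
    return r.json()["response"]

def frontier_rewrite(prompt: str, draft: str) -> str:
    msg = (f"Question: {prompt}\nLocal model draft: {draft}\n"
           "Rewrite the draft so it is correct and concise.")
    out = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": msg}],
    )
    return out.choices[0].message.content

def rlaif_turn(prompt: str, dataset_path: str = "rlaif_dataset.jsonl") -> None:
    draft = local_answer(prompt)
    improved = frontier_rewrite(prompt, draft)
    with open(dataset_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt,
                            "chosen": improved,
                            "rejected": draft}) + "\n")

if __name__ == "__main__":
    rlaif_turn("What is quantum physics?")
```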
00:16:32.260 | This report is about like all of these reports are about 100 pages.
00:16:43.260 | So if you want to deep dive, it's thoroughly available there.
00:16:46.260 | And I'm open to questions.
00:16:49.260 | Yeah.
00:16:50.260 | Do you have enough GPU power to share your screen during zoom and show us this in action is my first question.
00:16:57.260 | Can like you show us?
00:16:58.260 | No, I have, I have hit contention issues.
00:17:02.260 | Like I, I, if I am training, if I am running it out, what ends up happening is it straight up crashes.
00:17:08.260 | Like if I have a zoom working, hold on, let me check.
00:17:11.260 | Okay.
00:17:12.260 | Just a sec.
00:17:13.260 | You fit it in a Colab maybe?
00:17:17.260 | It could work.
00:17:18.260 | I don't know.
00:17:19.260 | I got here.
00:17:20.260 | Colab is on cloud.
00:17:21.260 | And I have, I have run out of three credits basically.
00:17:26.260 | Okay.
00:17:27.260 | Okay.
00:17:28.260 | Okay.
00:17:29.260 | Okay.
00:17:29.260 | I got to look up the VRAM on that.
00:17:30.260 | Yeah.
00:17:31.260 | This is, this is really cool.
00:17:32.260 | I think just from a research perspective, it's really cool.
00:17:34.260 | We had some questions early in the chat.
00:17:35.260 | If you said you wanted to field questions, but yeah, really, really interesting research that you've done here.
00:17:43.260 | I like it because we were curious what machine specs you need.
00:17:56.260 | You were talking about running Ollama and doing gaming at the same time.
00:17:59.260 | And I guess that you need at least a, or in your case, you, you've used a 4080 to do this, to pull this off.
00:18:05.260 | Um, and, uh, Cam L says, which params got you 10 X throughput?
00:18:12.260 | I think this is on an earlier slide.
00:18:13.260 | You were talking about 10.
00:18:15.260 | Uh, so for that, hold on.
00:18:16.260 | Uh, I'll go sequentially.
00:18:18.260 | No, not here.
00:18:19.260 | Hold on.
00:18:20.260 | So reports.
00:18:22.260 | Ollama.
00:18:23.260 | I apologize.
00:18:27.260 | I'm not used to this.
00:18:28.260 | So for the deep dive, um, these are the parameters where I found, uh, my initial, uh,
00:18:34.260 | initial, uh, teleport was 5.5, one second TTFT.
00:18:37.260 | Uh, sorry.
00:18:38.260 | Initial throughput was around 7.9 tokens per second aggregate.
00:18:43.260 | And for quantization Q4_0, I, I had a 78 increase and that was 10 X.
00:18:51.260 | Um, these are the tunings that I used.
00:18:56.260 | It's all in the reports.
00:19:00.260 | Okay.
00:19:01.260 | And then one of the other things, uh, you mentioned a couple of times posting links.
00:19:03.260 | I'm assuming this GitHub, uh, repository is, is what you were talking about when you were
00:19:08.260 | talking about posting links.
00:19:09.260 | Uh, the link was for this report, like, um, this entire doc folder.
00:19:14.260 | This is what is being posted in the chat.
00:19:18.260 | So anything that you want to test, play it out.
00:19:21.260 | It's right there.
00:19:22.260 | Like any questions you have related to it.
00:19:24.260 | So, so in theory, I could go to this GitHub repository, follow your steps and reproduce this.
00:19:31.260 | Like, let's say I have a, a, a 4080 at home and I can reproduce these steps and also have
00:19:35.260 | like a Jarvis on my computer is what you're saying.
00:19:38.260 | Uh, you'll need my, uh, Chimera optimization.
00:19:40.260 | You'll need the Banter Hearts repository, but you can end to end replicate all of this.
00:19:45.260 | Got it.
00:19:47.260 | So like you'll need this repo.
00:19:49.260 | It's not public.
00:19:50.260 | It's private.
00:19:51.260 | Oh, Banter Hearts.
00:19:52.260 | Okay.
00:19:53.260 | Okay.
00:19:54.260 | I see.
00:19:55.260 | This and this repo both are private.
00:19:57.260 | Uh, the public ones are Banter Blogs and Chimera multi-agent.
00:20:00.260 | And, and that's primarily research, not necessarily, uh, the agent itself.
00:20:05.260 | Uh, Banter Hearts, uh, has the agents itself.
00:20:08.260 | Yeah.
00:20:09.260 | So hold on.
00:20:10.260 | So these are the agents that I ran, the entire reports end to end performance.
00:20:18.260 | Um, then there's multi-agent, same comprehensive test results, phases.
00:20:24.260 | I have been selective in uploading all the research, like the data up because that was a lot of stuff.
00:20:34.260 | Like this entire repo is about 200 K lines, maybe more.
00:20:40.260 | I'll have to check.
00:20:41.260 | Do you get a release plan for the, some of the, I, I got here a little late, so I don't have full context, but it looked like you did some like, uh, CUDA optimizations and, uh, kernel rewrites and stuff.
00:20:52.260 | Do you, you got a release plan.
00:20:53.260 | Yeah.
00:20:54.260 | Or you did that.
00:20:55.260 | Um, I did, uh, I did Triton-based optimization, uh, which is in the kernel deep dives.
00:21:01.260 | Just a sec.
00:21:02.260 | Going back to that.
00:21:03.260 | Going back to that.
00:21:04.260 | Okay.
00:21:05.260 | Okay.
00:21:06.260 | Hold on.
00:21:07.260 | I'll have it on here.
00:21:12.260 | So this is what we found out.
00:21:15.260 | Uh, I could not do Q8_0 and that's why I could not be like, this is testing around.
00:21:19.260 | This is just playing around.
00:21:20.260 | But for optimization, torch optimization had a basic, this attention for attention kernels.
00:21:25.260 | Uh, for Matmul, uh, this is what I found out.
00:21:27.260 | And after kernel fusion, my actual, uh, latency went down around 50X, from 6.9 to 0.07, for my orchestrator.
00:21:37.260 | So this is all for my orchestrator.
00:21:39.260 | Kernel optimization.
00:21:40.260 | Like I, I cannot have resource contention, right?
00:21:43.260 | The orchestration, uh, the orchestrator for the agents needs to respond instantly so that my multi-agents could just go off and do their tasks.
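The actual Triton kernels live in the private repo, so the snippet below is only an illustration of the same idea: replacing a naive attention (separate matmul, softmax, and matmul launches) with PyTorch's fused scaled-dot-product-attention kernel and letting torch.compile fuse the surrounding elementwise work, then timing both. The tensor shapes and iteration counts are arbitrary.

```python
# Illustration of kernel fusion for an attention block (not the talk's kernels).
import time
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # separate matmul -> softmax -> matmul launches
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

@torch.compile  # lets PyTorch fuse the surrounding elementwise work
def fused_attention(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)  # single fused kernel

def bench(fn, *args, iters=10):
    fn(*args)  # warm-up (and compile, for the fused path)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

if __name__ == "__main__":
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    q, k, v = (torch.randn(4, 8, 512, 64, device=dev) for _ in range(3))
    print("naive attention:", bench(naive_attention, q, k, v))
    print("fused attention:", bench(fused_attention, q, k, v))
```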
00:21:50.260 | Yeah.
00:21:51.260 | I think he was, uh, Yikes was also asking, do you have a room?
00:21:55.260 | This is a part of a larger question than I have, but like, do you have a real, a release plan for the CUDA optimizations that you made?
00:22:02.260 | The, the kernel changes.
00:22:03.260 | I do, but, uh, this is like all of this I have done in over like the past 30 days.
00:22:10.260 | Uh, this optimization thing started, I think it started.
00:22:13.260 | Yeah.
00:22:14.260 | See, it was just over 15 days ago.
00:22:17.260 | So I do have a plan on eventually, uh, like releasing a one-click optimization, straight up end to end, uh, from downloader to everything.
00:22:27.260 | But, uh, currently I'm just exploring ideas as to how I can have the entire RLIF, AIF pipelines working out.
00:22:35.260 | Like, so that you can have everything optimized for your own personal game.
00:22:39.260 | Yeah, I know that sounds, um, very, very interesting.
00:22:43.260 | Uh, yeah, the, the possibility of being able to optimize task-specific, uh, CUDA kernels, um, I think is underexplored.
00:22:53.260 | So this looks like a fun, fun kind of spot to be playing around.
00:22:58.260 | Yeah.
00:22:59.260 | Yeah.
00:23:00.260 | I'm really curious.
00:23:01.260 | Um, like there's, of course the frontier models are like all the, you know, big giant GPU stuff, but I, I'm personally excited about like smaller models that can run on prosumer hardware.
00:23:12.260 | Uh, I'm curious why you picked Gemma three and not other smaller models.
00:23:17.260 | I know Gemma three came out a little while ago, but like in AI terms, it's, it's ancient.
00:23:21.260 | Um, cause it was a while ago, but I'm, I'm curious, like, it looked like you did compare it to a couple other models or was it just like Gemma three was.
00:23:29.260 | Gemma three Lama and then for like, the thing is, uh, Gemma three had a lot of documentation already.
00:23:34.260 | Actually.
00:23:35.260 | So people already had performed things and did things with it.
00:23:38.260 | So if I do everything with the latest and the greatest, I do not have a comparative analysis.
00:23:43.260 | I do not have points of reference as to why things are failing.
00:23:46.260 | So all of this could be, if I don't have models for which I have historical data, it could just be, uh, people having already optimized it originally and my system doing nothing.
00:24:00.260 | If it makes sense to you.
00:24:03.260 | So yeah.
00:24:04.260 | Yeah.
00:24:05.260 | I see on hugging face that there's like 218 fine tunes for Gemma three.
00:24:08.260 | Yeah.
00:24:09.260 | So like that makes sense why you would be like, all right, this is where a lot of other people have gone.
00:24:14.260 | Let me not like blaze a new trail.
00:24:15.260 | That's kind of like follow what other people have done.
00:24:16.260 | I'll have more to compare against.
00:24:18.260 | It offers me benchmarks to compare against, whether I'm doing MLOps properly.
00:24:23.260 | So once I understand that this is what the ground reality is, this is what people are doing, and this is what their understanding is, then I can debug on a level where I have more understanding.
00:24:37.260 | Now, I can optimize, in a nutshell, any sort of frontier model or, like, the latest models.
00:24:44.260 | But again, that would mean that I, uh, it cannot be simply me doing it.
00:24:49.260 | It could be the person who launched it, who has already done it.
00:24:51.260 | So what am I doing better?
00:24:53.260 | So this is what I'm doing better.
00:24:54.260 | Are there any of the newer models that have caught your eye that you're, you're interested in, especially from the Chinese labs that have the smaller models?
00:25:06.260 | I had, uh, recently I read this research paper by Samsung.
00:25:10.260 | I think it was the 7 billion model that could perform as well as a 40 billion parameter model.
00:25:15.260 | I wanted to, I wanted to test it, but, um, like my recent, like, this is all the bottoms up.
00:25:24.260 | The other optimization technique that I have been, like, I have written a document on is a TD.
00:25:29.260 | Okay.
00:25:30.260 | This could get interesting.
00:25:31.260 | Hold on.
00:25:32.260 | So, uh, in this, like, I am going now for a top-down approach.
00:25:44.260 | Yeah, the other, uh, interesting thing potentially to look into or to, to follow up on flows point, um, or, or, um, Zach's point, uh, Gemma actually is actually a pretty solid pick for this, in my opinion, largely because I think.
00:26:06.260 | The Gemma architecture, Google has, um, been a lot more oriented around optimizing that for small models, mobile especially, because they're trying to integrate it directly into the Pixels.
00:26:18.260 | Um, so 3n, I think it would be really interesting to see if any of these translate over to the 3n, because if you can get more performance out of a 3n, I think you could do a lot on the edge, which I think this particular architecture would be really, really useful for.
00:26:34.260 | So, yeah, that's what my idea was basically all of the models are generalized.
00:26:40.260 | So this is where I started the RLIF.
00:26:42.260 | Like if I have, uh, if they're generalized, then I can make them special purpose models.
00:26:47.260 | So the, uh, there was this content, as I said, uh, the RLIF thing, that's what it was for.
00:26:53.260 | So yeah, models are tiny, but they have a lot of hallucinations and that accuracy goes down.
00:26:57.260 | Oh, being, uh, ultimately LLMs are pattern matchers.
00:27:01.260 | So they will only pattern match as much as they have in memory.
00:27:04.260 | So I would much rather replace it with tasks specific memory.
00:27:09.260 | Like, uh, if you are coding, like you're coding with Claude, you're not going to code in 50 languages.
00:27:16.260 | You're going to code in five specific languages and you need a model.
00:27:19.260 | that's only special for those five specific languages.
00:27:22.260 | Mm-hmm .
00:27:23.260 | And, uh, in those five, uh, five special languages.
00:27:27.260 | Codex is very strong at something.
00:27:29.260 | Claude is very strong at something.
00:27:31.260 | Gemini is very strong at something.
00:27:33.260 | So your end to end stack has five languages.
00:27:37.260 | It has specializations and it has specific needs.
00:27:40.260 | And that's where this entire RLIF pipeline comes in.
00:27:43.260 | Your local model is being trained by these big names to be optimized for your task.
00:27:48.260 | The constitution for this local model is stored locally.
00:27:52.260 | So nothing, uh, none of your personalized, uh, proprietary code is leaked out because you are running agents through your entire code bases.
00:28:03.260 | You are running LLMs through your entire code bases.
00:28:06.260 | You have no idea what is being stored where.
00:28:09.260 | You have no, you have no idea what sort of learning is happening and what sort of data in the prompt injection based leakage is happening.
00:28:17.260 | So if you sandbox it onto a local system, you can test end to end and you have no issues.
00:28:23.260 | So based on consensus, and I've also optimized like the proper, like my current, uh, this idea, we could have RLIF running for n iterations.
00:28:35.260 | Instead of that, what I ended up doing is I have a weight adjusted models.
00:28:39.260 | So, uh, Claude gets, uh, a higher weight for coding tasks.
00:28:42.260 | Gemini gets a higher weight for life tasks and general purpose tasks.
00:28:46.260 | Open AI has a higher weight.
00:28:48.260 | So based off that, that's like, uh, how the debate works.
00:28:52.260 | And, uh, what I've done is I have budgeted those debates.
00:28:55.260 | So every, uh, everything has a limit.
00:28:57.260 | So if those API calls have a specific cost, uh, I have sandboxed those API calls to $5.
00:29:02.260 | So within $5, they must reach a consensus and they must generate an output which is, like, good for me.
00:29:10.260 | This is good for this task.
00:29:11.260 | And that's it.
00:29:12.260 | That's the output that I need.
00:29:13.260 | I don't need to go back.
00:29:14.260 | I need, I don't need to go back and I don't need to check on codex.
00:29:17.260 | I don't need to check on, um, Claude or I don't need to check on Gemini.
00:29:21.260 | So if Gemini, I think today we had that Google launch.
00:29:24.260 | I'm not sure, but Google, Google is launching Gemini three, right?
00:29:28.260 | So I can just plug that in right here and I can use that for my own optimization.
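The budgeted, weighted debate described here could be structured roughly as below: each frontier reviewer carries a task-dependent weight, every round of critique spends from a hard $5 cap, and the debate stops at consensus or when the budget runs out. The reviewer functions are stand-ins for real Claude / Gemini / OpenAI calls, and the weights, per-call costs, and threshold are illustrative, not the project's numbers.

```python
# Sketch of a budget-capped, weight-adjusted reviewer debate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Reviewer:
    name: str
    cost_per_call: float                # rough $ per critique
    score: Callable[[str, str], float]  # (task, answer) -> 0..1

WEIGHTS = {  # task type -> reviewer weights used when aggregating scores
    "coding": {"claude": 0.5, "openai": 0.3, "gemini": 0.2},
    "life":   {"claude": 0.2, "openai": 0.3, "gemini": 0.5},
}

def debate(task_type: str, task: str, answer: str,
           reviewers: list[Reviewer], budget: float = 5.0,
           threshold: float = 0.8) -> tuple[bool, float]:
    spent = 0.0
    while spent + sum(r.cost_per_call for r in reviewers) <= budget:
        weighted = 0.0
        for r in reviewers:
            weighted += WEIGHTS[task_type][r.name] * r.score(task, answer)
            spent += r.cost_per_call
        if weighted >= threshold:  # consensus: the answer is good enough
            return True, spent
        # a real pipeline would also let the reviewers rewrite `answer` here
    return False, spent

if __name__ == "__main__":
    stub = lambda task, answer: 0.85  # placeholder for an actual API call
    reviewers = [Reviewer("claude", 0.40, stub),
                 Reviewer("openai", 0.30, stub),
                 Reviewer("gemini", 0.25, stub)]
    ok, cost = debate("coding", "Write a regex for emails", "draft", reviewers)
    print("consensus:", ok, "spent:", round(cost, 2))
```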
00:29:34.260 | Yeah, no, that's really interesting in that.
00:29:36.260 | So we have, we have frontier models with different specializations.
00:29:39.260 | And then if I'm reading you correctly, then we have an RLAIF pipeline that is monitoring like per task.
00:29:47.260 | And then for each task, we're going to have the various frontier models at different weights, depending on the task, be doing more, be more heavily weighted during the RL AIF process.
00:29:58.260 | Is that kind of the idea?
00:29:59.260 | Yeah.
00:30:00.260 | Yeah.
00:30:01.260 | Sweet.
00:30:02.260 | Nice.
00:30:03.260 | That's really cool.
00:30:04.260 | So this is like the top down thing.
00:30:05.260 | Uh, so while I, while I am optimizing bottoms up, I found out that I have two X boosts.
00:30:10.260 | That's enough.
00:30:11.260 | Like that's good.
00:30:12.260 | It's good for me for now.
00:30:13.260 | So what I needed to do is if I can optimize from top down.
00:30:16.260 | If my, uh, local model is performing at least as good as Claude currently, not even as good as Claude.
00:30:23.260 | Uh, I can say Claude 3, say 3.5.
00:30:25.260 | I've achieved the task that I need.
00:30:27.260 | I, I have two, um, agents running concurrently fully local without me having to pay a single cent, except obviously the hardware costs doing my task for me.
00:30:38.260 | And they are, they are obviously going to explore private data.
00:30:41.260 | They are going to be sandbox into an environment where they cannot like send out data.
00:30:45.260 | Uh, and everything remains fully local.
00:30:50.260 | I guess also, uh, cost of electricity.
00:30:52.260 | And, and as the, you know, we're getting colder going into the winter months, your room is going to be nice and toasty from that GPU.
00:30:58.260 | Yeah.
00:30:59.260 | True, true.
00:31:00.260 | Pretty much.
00:31:01.260 | Right.
00:31:03.260 | Uh, this is not public.
00:31:05.260 | Like, uh, I, I can post a GitHub Gist somewhere, but like this is a proposal for now and I've not fully explored this.
00:31:13.260 | I started this two days ago.
00:31:14.260 | I have a current debate working out, but I ran a, ran out of API calls.
00:31:17.260 | So I'm waiting.
00:31:18.260 | But yeah, that's the idea that I have.
00:31:19.260 | Everything is fully modular.
00:31:20.260 | So you can plug in any model you want.
00:31:21.260 | So sometimes what ends up happening is 3.5 is better at some tasks than 4.1 Opus.
00:31:27.260 | So why not use 3.5?
00:31:28.260 | They're more stable.
00:31:29.260 | 100%.
00:31:30.260 | Uh, I think someone posted that, um, codex is giving a streaming error.
00:31:35.260 | Why not?
00:31:36.260 | Then we can just switch to a GPT five.
00:31:48.260 | What's your, uh, validation path or like your consensus algorithm, um, kind of set up here.
00:31:54.260 | If I'm, if I'm, so if I'm tracking correctly, we have, uh, RLI AIF that's happening through some
00:32:00.260 | kind of debate format, which is weighted depending on the task.
00:32:03.260 | And then there's some kind of termination point.
00:32:06.260 | That's either $5 or like, more or less, like, I like it.
00:32:09.260 | This works for me kind of thing.
00:32:10.260 | Are you driving the model during the RL AIF process?
00:32:15.260 | So you can get to the validation state and then lock it in, or, uh, what's kind of the process?
00:32:22.260 | So right.
00:32:23.260 | Let's see.
00:32:27.260 | Uh, apply test comparison.
00:32:29.260 | Oh, pairwise comparison.
00:32:30.260 | Okay.
00:32:31.260 | Okay.
00:32:32.260 | Proposal that beats all.
00:32:32.260 | Oh, there's this Condorcet.
00:32:33.260 | I'm not familiar with Condorcet, or Borda.
00:32:34.260 | I'm presuming that those are just like a scoring scoring.
00:32:36.260 | Yeah.
00:32:37.260 | And it's like LLM-as-judge, kind of like preference, kind of preference-based stuff.
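For readers unfamiliar with the terms that came up here, a weighted Borda count is one concrete way to turn ranked LLM-as-judge preferences into a single consensus pick (a Condorcet check would instead look for the candidate that wins every pairwise comparison). The judges, rankings, and weights below are stubs standing in for real judge calls; this is an illustration of the voting scheme, not the project's implementation.

```python
# Sketch: weighted Borda-count consensus over ranked candidate answers.
from collections import defaultdict

def borda_consensus(rankings: dict[str, list[str]],
                    judge_weights: dict[str, float]) -> str:
    """rankings: judge -> candidate answers ordered best-first."""
    points: dict[str, float] = defaultdict(float)
    for judge, order in rankings.items():
        n = len(order)
        for pos, cand in enumerate(order):
            points[cand] += judge_weights[judge] * (n - 1 - pos)  # Borda score
    return max(points, key=points.get)

if __name__ == "__main__":
    rankings = {  # stub LLM-as-judge outputs, best first
        "claude": ["claude", "openai", "local-gemma", "gemini"],
        "gemini": ["gemini", "claude", "local-gemma", "openai"],
        "openai": ["claude", "local-gemma", "openai", "gemini"],
    }
    weights = {"claude": 0.5, "gemini": 0.2, "openai": 0.3}
    print("consensus pick:", borda_consensus(rankings, weights))
```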
00:32:39.260 | Nice.
00:32:40.260 | Maybe.
00:32:41.260 | Yeah.
00:32:42.260 | That part's really cool.
00:32:43.260 | Yeah.
00:32:44.260 | Yeah.
00:32:45.260 | Yeah.
00:32:45.260 | Really cool.
00:32:46.260 | But I was going to ask, can you, can you get into, I have like some, some, I think the questions
00:32:51.260 | that'll actually help you in whatever journey you're on now, but I'm curious, like banter packs
00:32:55.260 | versus banter hearts, banter blogs.
00:32:57.260 | Why did you name them that way?
00:32:58.260 | Like what's the significance of the naming?
00:32:59.260 | There's no significance.
00:33:00.260 | I was being, I was being dumb.
00:33:01.260 | Honestly.
00:33:02.260 | Yeah.
00:33:08.260 | I just, I think this is what I'm hearing vibes.
00:33:09.260 | Yeah.
00:33:10.260 | Yeah.
00:33:25.260 | I had, like, the entire ecosystem is called Chimera, but I started off just because I wanted to
00:33:32.600 | and it's gotten pretty far the model optimization is whichever one has the best banters bro
00:33:39.720 | yeah yeah I mean normally if you see uh we had a GPT-3 being robotic and people didn't enjoy it
00:33:50.560 | so they personalized it uh GPT uh adapts to your vocal cues or sorry the things that you say
00:33:57.000 | and it speaks in your tone that's what I'm using
00:34:01.020 | so this this is pretty much like personalization on a whole new level because not everybody is going
00:34:09.260 | to be the same type of depth not everybody is going to have the same type of life tasks
00:34:12.540 | not everybody is going to have like specific requirements some people are going to be hyper
00:34:19.240 | fixated on having deterministic outputs and some people are good with just arguing through ideas
00:34:24.880 | and just exploring them and the constitution remembers that so this is yeah is this the daily driver for
00:34:34.880 | you now like are you using this tuned model for all sorts of daily stuff or are you still sort of like
00:34:40.780 | tuning it and finding new angles or is it like is it useful in certain contexts but not others because
00:34:45.760 | this size can be kind of hit or miss so I'm curious how like in use day-to-day it feels
00:34:51.400 | day-to-day I enjoy it a lot honestly like I while I'm chilling uh while I'm just thinking through
00:34:59.280 | ideas I go through a back and forth through, like, hitting this idea, RLAIF and its weighting
00:35:04.480 | weightage I had no as someone else said the entire algorithm I had no idea if that could be used
00:35:11.260 | so I explored so I explored through all of these three uh frontier models together and I found
00:35:16.900 | out that this worked for me like this was the best way to handle weights 100% so why not just debate
00:35:22.900 | like uh normally we have ideas and we need we need uh we have to google multiple things so instead of
00:35:28.540 | googling I I'd much rather have the lm google it for me and find like specific answers
00:35:33.520 | so we're using we're using weighted voting and ranked choice is that so are these are the larger models
00:35:41.620 | doing are they deciding like what what what decision are they making here like are they actually doing
00:35:48.820 | weight updates uh not weight updates they are creating an RLAIF data set so normally what ends
00:35:57.340 | up happening is uh question is generated now what is quantum physics local model generates an answer
00:36:03.160 | OpenAI generates an answer uh Gemini generates an answer Claude generates an answer we enter a debate
00:36:10.300 | there are four responses uh and which is the best why is it the best uh once the debate phase ends uh
00:36:18.040 | uh they hit a consensus and that answer is stored into uh I look over the answer if I say that the
00:36:24.400 | answer is good it's stored away into the constitution or rather the RLAIF data set which will
00:36:29.320 | eventually be part of my dynamic constitution but yes it's totally and it keeps on updating until uh
00:36:35.720 | like we hit the repeat part so if there's enough data I can train my local model to generate answers
00:36:44.280 | similar to Claude for coding similar to Gemini for daily tasks stuff like that okay and then where does
00:36:52.320 | the um the like kernel level optimization stuff come in fit into that picture I see data set and then I see
00:36:59.520 | I see test train I think yeah so normally this is where it starts uh getting fun I have I created a data set where my
00:37:08.040 | what is quantum uh quantum physics so that's the size size of the prompt and the the data set is going to
00:37:14.400 | be generated so it's going to go on to the GPU so some tasks do not require a full GPU offload some tasks do
00:37:21.240 | require a full GPU offload now it decides the orchestrator decides based of the three reports
00:37:26.200 | that I have tr10 108 through tr110 and plus whatever further research that I find it decides that this
00:37:33.000 | is how much GPU allocation needs to happen in uh 70 nanoseconds sorry 70 microseconds and it offloads
00:37:39.640 | on to the GPU so your LLMs respond as fast as they can on your current hardware
00:37:47.080 | so top-down, bottoms-up
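A toy version of that scheduling decision is sketched below: the orchestrator looks only at the incoming prompt's size, picks partial vs. full offload, and hands the chosen num_gpu / num_ctx options to Ollama. The token threshold, layer counts, and the crude whitespace tokenizer are all assumptions, not the tuned values from the TR reports.

```python
# Toy orchestrator: prompt size -> Ollama offload options -> generate.
import requests

def schedule(prompt: str) -> dict:
    """Pick offload options from nothing but the prompt's size."""
    n_tokens = len(prompt.split())  # crude stand-in for a real tokenizer
    if n_tokens <= 32:
        # tiny prompt: partial offload gave the better TTFT in the reports
        return {"num_gpu": 80, "num_ctx": 2048}
    # heavier context / multi-agent style work: full offload with a 4K window
    return {"num_gpu": 999, "num_ctx": 4096}

def run(prompt: str, model: str = "gemma3:1b") -> str:
    opts = schedule(prompt)
    r = requests.post("http://localhost:11434/api/generate",
                      json={"model": model, "prompt": prompt,
                            "stream": False, "options": opts},
                      timeout=300)
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    print(run("What is quantum physics?")[:200])
```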
00:37:51.080 | let's okay so there's an orchestrator deciding how much GPU to allocate and that's for each
00:38:01.000 | task and the thing that's being allocated is GPU memory for the small model and the small model is
00:38:09.880 | being is being let's see small is small model being is small model actually getting weight updates in
00:38:19.400 | here I know I'm losing it I think it's not getting weight update it's not getting anything it's just
00:38:23.880 | generating answers for now okay it will be further optimized later
00:38:29.080 | so currently it's just working out like it's just responding to my questions
00:38:32.200 | okay but there's so the idea being like there is a GPU scheduler it has some level of access to this
00:38:39.800 | um this constitution based kind of um uh uh golden data set and then since it just has that it's like
00:38:49.960 | infer this and do the scheduling kind of thing correct nice I like that actually the the um you just let
00:38:58.440 | go and like like get to the point where you like you you understand the interesting steps up until
00:39:04.040 | here and then you're like figure it out I think that's yeah I mean uh hardware is going to change
00:39:11.080 | I started with a 4080, the 5090 is already out, the 6090 is going to be out by next year
00:39:17.080 | and I might end up with a 500, I might end up with an H100, if I optimize for each GPU I uh I'm going to end up
00:39:24.440 | a bit of confused I can't do that so why would I let that happen I would much rather have I would much
00:39:29.640 | rather benchmark whatever GPU I have on hand and then optimize my LLM base for that GPU so any personal
00:39:35.960 | GPU that you have you can optimize for it one shot optimization one shot LLM deployment one shot responses
00:39:44.920 | so you don't that's the reason the RL AIF uh data set is not integrated into the banter hearts or chimera
00:39:52.520 | because this will be your ground truth for your future models you see like if you switch devices
00:40:01.480 | if you switch devices you if you go from a macbook m3 and if you go into a macbook m5 pro max whatever
00:40:09.080 | so the optimizations are going to be benchmarked differently but ultimately what preferences that
00:40:15.960 | you have for responses need to be the golden standard for the local model that you deploy so if you deploy
00:40:22.040 | like I am deploying Gemma 3 you deploy GPT-OSS 120B both of them are going to have different answers
00:40:29.240 | but your preferences are yours your answers are yours so that GPT-OSS 120B needs to know that
00:40:37.080 | your life preferences are this your coding preferences are this your tasks are this this is what you eat
00:40:43.480 | this is what you enjoy and that's like the RL AIF end of it the top down version of it and the bottom
00:40:51.240 | up version is basically optimization, optimization of GPT-OSS 120B on your local hardware
00:40:58.120 | so further quantization will obviously make the model tinier based on your specific tasks but that's
00:41:05.480 | eventually yeah and I'm wondering you could I feel like there's a way to be able to do like gpu scheduling
00:41:14.280 | alongside uh model architecture optimizations right because like if you're if you have enough g if you
00:41:20.440 | have enough vram to apply this to a gemma right now then you want to explore the effectiveness of
00:41:26.600 | different quants or different model architectures like uh like uh you know Kimi with uh with kind of their
00:41:34.440 | their heavy-experts, few-heads kind of strategy and then you could see sort of uh uh you
00:41:43.240 | could end up rather than getting that one to approve you could get five to approve and then the five have
00:41:48.280 | all have sort of different architectures you've got a llama you got a gemma you got etc etc it could be
00:41:52.440 | very um correct a lot of possibility there so yes that's the reason why banter packs or uh like this
00:42:01.240 | infrastructure architecture has a full uh ELK observability with Prometheus uh Grafana and uh
00:42:07.320 | Jaeger for tracing like what's going wrong if the agents are deployed so anything that agent does is
00:42:14.040 | tracked any sort of API calls happen they are tracked and that's why the MLOps platform also has a Prometheus plus
00:42:19.800 | Grafana uh not Prometheus Grafana, Nsight uh monitoring plus uh Prometheus Grafana monitoring
00:42:26.360 | i didn't go for the ELK there so for the more like pro coder people it's already there
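As a tiny taste of the kind of hook behind that observability stack, the snippet below exposes per-request TTFT and decode-throughput metrics so a Prometheus server could scrape them and Grafana could chart them. Metric names, the scrape port, and the simulated workload are assumptions; the stack described also adds ELK for logs, Nsight for GPU profiling, and Jaeger for traces.

```python
# Sketch: exposing TTFT and throughput metrics for Prometheus to scrape.
import random
import time
from prometheus_client import Gauge, Histogram, start_http_server

TTFT = Histogram("agent_ttft_seconds", "Time to first token per request")
TOKENS_PER_SEC = Gauge("agent_tokens_per_second", "Decode throughput")

def handle_request() -> None:
    with TTFT.time():                            # records elapsed time on exit
        time.sleep(random.uniform(0.05, 0.3))    # stand-in for prefill work
    TOKENS_PER_SEC.set(random.uniform(60, 110))  # stand-in for measured decode rate

if __name__ == "__main__":
    start_http_server(9100)  # metrics served at http://localhost:9100/metrics
    while True:
        handle_request()
        time.sleep(1)
```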
00:42:39.400 | yeah i would are you i would presume that the gpu scheduler that level of orchestration if that's
00:42:48.600 | getting logged alongside the responses then you should be able to build a a a good in addition
00:42:55.880 | to your to your full output data set you could um you know you stick some probes into the into the
00:43:02.760 | cuda graph or whatever um and then the kind of see what other what other um metrics you could you could
00:43:11.080 | whatever levels you could you could start making optimizations at it gets a little fuzzy there but
00:43:15.320 | yeah that's the reason like that's why i have Redis for uh initial uh logging and Postgres for uh
00:43:21.960 | golden logging like the truth ground truth logging so postgres handles the actual entire data logging
00:43:27.240 | and Redis is like immediately responding to the cache coming up on the 45th minute uh if anyone
00:43:35.080 | else in the audience has some questions too like um you know just just a call for questions i was going
00:43:39.800 | to also ask you about the tts uh pipeline like have you played around with the voices and things of that
00:43:45.640 | nature like as jarvis is responding to you yeah so basically uh my initial take was based uh orchestrating
00:43:53.640 | over the api but i just went the whisper piper pipeline it works for me for now but because this
00:43:59.400 | whisper piper pipeline just is pretty simple i don't need to wire up a lot of things and it does not
00:44:05.320 | require a lot of resources i was going to try KittenTTS have you heard of KittenTTS KittenTTS is a
00:44:11.000 | 25 mb uh like tts pipeline no i haven't it was launched uh last month uh it's a very popular uh
00:44:21.000 | not very popular but somewhat popular startup on y combinator um let me i'll i want to post that link
00:44:28.840 | so i'll i'll take some time yeah actually if you have any more questions i'll take some time and find
00:44:32.360 | that in person let's see um kitchen yeah uh i would be interested so you have you have whisper for
00:44:39.800 | your tts what's your stt currently piper oh piper okay yeah whisper piper pipeline yeah
00:44:46.920 | i did not uh heavily dive into the entire like i had initial uh success at ms-level latency
00:45:00.360 | so then i didn't go fully into this level of optimization like that uh front end aspect would
00:45:06.440 | come later once i have the entire back end mapped out and the back end is going straight down to the
00:45:12.760 | silicon so it just gets complicated yeah yeah i can imagine um have you have you uh uh uh have you
00:45:21.320 | looked into or you might have already covered it i'm not sure i got uh it was a little late but uh
00:45:25.800 | uh multi-modality have you tried like throwing a moon dream in there and seeing what happens kind of
00:45:29.640 | thing i have a plan like the this current orchestration has four uh entire things sorry four
00:45:36.600 | how do i put this four repos the fifth repo is going to be uh generating
00:45:44.520 | comics more or less so this discussions that happen just make them funny and make them into comics
00:45:54.040 | and then once uh sora 2 i have the availability for sora 2
00:45:58.120 | you'll be able to hear gemini plus claude plus gpt discussing it out loud and yelling at my local
00:46:05.080 | model to be better that sounds like fun yeah i know it's it's a good thing yeah
00:46:12.360 | eventually hardware will be good enough so i could just uh have at least i'd presume it would be like a
00:46:21.000 | five ms lag but even with five ms lag i could just observe it live animating like a live animation of
00:46:29.240 | my constitution debate happening anime episode about what your gpt your gpu kernel is currently doing
00:46:36.520 | while you're working i mean yes more or less it's entertaining like i think i like the idea very
00:46:42.040 | personally like just uh having like an interactive environment where i could just chime in and because
00:46:47.800 | obviously i have the uh voice pipeline ready so if i say something and this stt-tts pipeline is being read
00:46:54.440 | by uh all these models so i'm actually interacting with anime versions of these guys ironically arguing
00:47:02.280 | with this kernel because the model's hallucinating again yes i love it that's very that's gonna be
00:47:08.760 | that's gonna be a party well and i think there's something to that too in terms of in terms of real
00:47:14.200 | optimization right because like the kinds of especially if you get to hear them argue like
00:47:19.880 | you know i can do a hyper parameter sweep or whatever but if if we're if we're showing the scheduler
00:47:27.800 | this entire like anime episode debate process and being like okay now allocate the vram um
00:47:34.680 | it would be really hard to articulate that in terms of a gpu config um whatever that thing is sort of
00:47:42.600 | like inferring from this sort of blob of data that it's got to work with um i think could be real good
00:47:48.520 | could you could actually all it has to infer from is what question i have asked and what feedback i'm giving
00:47:56.600 | so the model itself is like the model allocation is generating like the system is generating some words
00:48:01.960 | so the ttft is being allocated the gpu allocation is just for the model generating words and the prompt
00:48:07.320 | engine like whatever prompt i generate whatever question i asked that's what it has to observe nothing
00:48:11.400 | else this discussion is happening over like on an entirely different thing but uh as long as it tracks that
00:48:20.440 | my question has 15 tokens and 15 tokens requires this amount of model uh resource allocation partial or
00:48:28.520 | full offload what else needs to be done because the the scheduler doesn't need to do anything else
00:48:34.840 | it only needs to be dumb enough to simply like this is what you need to do this is where it needs to be done
00:48:39.480 | eventually when systems get smart enough i'll have a smart a smart agent decide
00:48:49.320 | but we are not there yet hardware wise we are not there yet yeah the um okay so the the vram allocation
00:48:57.560 | is going to be for sequence level or like batch level kind of um uh token token load okay okay
00:49:05.960 | and then i'll put only sequence level pre-allocate it
00:49:16.600 | to put it simply uh all of us work out right so the best understanding of this would be uh you
00:49:25.000 | know how much pressure that you need to put into your muscles to lift a 10 kg dumbbell or a 10 kg bar
00:49:32.440 | but you have to think about your form when you're whether you're going to squat with it whether you're
00:49:39.880 | going to bench press with it but the pressure doesn't change the muscle allocation doesn't change
00:49:45.320 | all that changes is how you move your muscles so this RLAIF is the thinking part and the muscle
00:49:52.520 | allocation is the gpu scheduler part both are independent but related
00:49:58.120 | uh b did you have a question i said you turned your camera on i don't know if you had a question
00:50:08.360 | specifically for sahil um but i have a i have a couple of questions uh if not
00:50:15.320 | no it was by mistake sorry yeah oh okay okay no worries no worries no worries i just wanted to you
00:50:21.240 | know make sure we include. um sahil is this your first like uh i guess shot, or not shot, but
00:50:28.440 | like attempt at getting exposure to these ideas in in your research uh like this specifically the jarvis
00:50:34.280 | thing yes like it was very random, if you see the first comment that i have, it's quite
00:50:44.840 | honestly extremely stupid. do you have any, like, the labs, i don't know what
00:50:53.960 | your dream job is because that's kind of how we started the conversation but have any labs reached out
00:50:58.840 | for you to do this type of research or ai research, ml research in general, uh based on your degree
00:51:05.160 | or anything that you're interested in doing? i have applied but with the recent uh h1b debacle my
00:51:12.520 | interviews got put on hold so i just decided why wait, i'd much rather work. no for sure for sure, so
00:51:19.640 | okay ideally what is your what is your dream job like do you have any uh uh in mind i mean eventually i
00:51:26.920 | might end up as a founder given the scope of my capacity at muse but this this was my first comment
00:51:34.520 | got it okay yeah i and and i'm asking like these type of questions because i think platforms like
00:51:42.120 | this i don't know if you know but all the recordings go on youtube uh and like there's obviously like the
00:51:47.000 | latent space or let me not say obviously because i'm not sure if everyone here is familiar but swyx hosts the
00:51:51.800 | latent space podcast uh which is like a huge platform uh with a lot of subscribers and this uh recording
00:51:58.440 | actually goes on like latent space tv which is kind of like a an associated youtube channel but but gets
00:52:04.440 | a couple hundred views per video um and i'm asking these questions because i'm also curious like do you
00:52:09.880 | have a video uh of you, a demo of you like using jarvis in action, uh i do plan on creating one but this
00:52:18.920 | rlif pipeline got really interesting so i'm currently working on that yeah okay okay i understand i think
00:52:24.040 | i had a uh tauri ui test run yeah it is giving me issues but i'll figure that out yeah for sure yeah i just
00:52:31.880 | think this is really interesting research and obviously like uh could open you up to lots of job opportunities in
00:52:37.400 | my opinion uh especially in this space so i just think posting videos like this or demos like this
00:52:44.280 | and work like this research like this on on these type of communities um could get you access to you
00:52:49.480 | know all kinds of um opportunities so i just wanted to say that like like while we have a couple minutes
00:52:54.680 | left and there's still room for other people to ask questions i just want to ask those uh type of
00:52:58.600 | questions uh i guess number one for the recording uh that's gonna go on youtube but but number two just
00:53:03.240 | kind of for my own curiosity. um it seems like we have some related chats, as the web3 guy in
00:53:10.520 | the chat i can certainly think of like five or six ways to monetize this like immediately if
00:53:16.280 | one wanted to, but i think a better question perhaps that should be included in the recording
00:53:22.520 | is for the experiments that you're doing and the kind of like research that you're looking into
00:53:26.920 | um are there any resource shortages that you're facing? so like you got a 4080, you're kind of compute
00:53:34.360 | constrained, is the constraint here breeding creativity or are there things where
00:53:42.520 | it's like man i really wish i had x or y to test this on kind of thing, or both? it's actually both because
00:53:50.280 | the thing is uh it's an interesting space if you're deploying this locally and you can have a live
00:53:56.760 | system like the eventual pathway that i have is to have a fully integrated house so if you've seen any
00:54:02.040 | of the iron man movies, for sure, you have an ar/vr system that you can interact with and do whatever you
00:54:08.200 | want, and if you have a gpu powerful enough to handle that, the system already enables it
00:54:16.840 | and yeah the world models are are just getting better um yeah yeah like more people are working
00:54:24.280 | on model optimization i'm creating a platform that can plug in any model and optimize it for your local
00:54:30.600 | system. yeah i have a very like kind of similar-vein project, except i went a lot
00:54:38.200 | uh, i went for the big guys, i was like okay let's assume that we don't have any compute
00:54:44.040 | constraints and then we have a coding task, what's the best selection of coding models
00:54:50.600 | best selection of coding loras and then best amount of like kernel level optimizations that we can make
00:54:55.720 | to each pipeline kind of thing. um but the sort of online constitution
00:55:09.320 | building of the golden data set is a very interesting approach, that's kind of cool
00:55:14.120 | that it works at sort of any level. have you tried scaling it up at all i guess
00:55:19.160 | i have tried but my system decided to bsod then so i just decided to stop like you see these are the
00:55:28.920 | kind of optimizations that run, like if you are looking at the screen it's everything end to end
00:55:35.880 | so that is a pipeline that is ready for enterprise, basically just deploy it over any sort of local
00:55:41.960 | multi gpu system, optimize the llm to run over that local gpu system and see how it
00:55:48.680 | eventually ends up everything from here pretty much checkpointing yeah
00:55:54.040 | yeah i think there's a company that i know that's like actively looking for people to give
00:56:01.400 | compute to, i'll see if i can grab the link
00:56:04.200 | because i think it would be cool to try this on like 8x h100 and just see see how far it rips yes
00:56:12.440 | yeah, it does. like you start to see, right, the idea is really interesting and the more i dive into it
00:56:21.720 | the more interesting it gets, that's why everything has to be well organized but yeah more or less
00:56:30.600 | this is kind of like a full circle moment for me because like one of the reasons why i got into the
00:56:33.880 | ai space uh a couple years ago is i saw a guy on youtube trying to build jarvis but it was nowhere
00:56:38.840 | near this advanced um so it's it's funny to see something like this kind of like come to fruition
00:56:45.480 | i'm also curious are you on are you on i saw you had github uh i should probably go look at your profile
00:56:49.880 | but are you on twitter as well? no i'm not on twitter, i did open an account and i should probably get
00:56:55.720 | active on it, but i'm working on this, so like basically i spend at least 10 to 12 hours a day
00:57:01.960 | just running through this and seeing where it goes like yesterday yeah pretty much and like these tests
00:57:09.400 | also run a lot so i don't spend too much time on social media. more or less i have an automated pipeline
00:57:15.400 | so what ends up happening is uh like the ci/cd for this banter packs is this here, so this
00:57:28.680 | publishes into banter blogs, which publishes onto my linkedin, but that's about it, and to maintain
00:57:37.480 | them and to maintain the code base i have like a full ci/cd pipeline. yeah i got 16 gigabytes of vram
00:57:43.720 | at home i would totally do a demo video for this and like uh because i just think i can see it going
00:57:48.920 | viral in the spaces that are interested in things like that and and we have like two minutes left uh so
00:57:53.080 | we probably have to ask the wrapping up questions, um but yes there's like a subreddit called local llama
00:57:59.480 | and then an x (twitter) community called local llama that would eat stuff like this up, like they
00:58:04.360 | absolutely love running on gpus at home and that's that's the primary focus of those communities so i
00:58:09.560 | think they would love uh stuff like this uh speaking of communities if you guys don't have any other
00:58:14.040 | parting questions, or sahil if you have a request or any questions for us, uh or any calls
00:58:19.880 | to the to the public that may see this video on youtube if you have any like closing comments honestly
00:58:26.440 | anyone who wants to tweak around with this and have fun discussing this please contact me let's have fun
00:58:32.920 | perfect, perfect. um and if anyone else is left on the call, uh yaks, if you don't have anything to close
00:58:38.680 | basically uh what kball normally does at this time is say like you know on a weekly basis we meet
00:58:45.240 | on fridays at this time and kind of like try and uh discuss interesting ideas in the ai space so if you
00:58:51.240 | have any interesting ideas and would love to sign up uh for any of the following weeks uh please let us know
00:58:56.440 | we typically coordinate in discord, so the uh ai in action builders channel is the discord uh where we
00:59:02.360 | kind of get together and you can talk to our ai in action bot which will help you schedule for an
00:59:06.360 | upcoming week. um and we're looking for people to speak, i think next we have somebody maybe two
00:59:12.360 | weeks from now, but other than that yeah we have an open slot next week, so if you're watching
00:59:19.640 | this and it's uh and it's not next friday, which is the what, 24th i think, yeah come to
00:59:28.040 | latent.space/p/community and then come to the discord and tell us that you want to present something uh the
00:59:33.800 | other addendum that i'll make is for sahil and then anybody else who is experimenting on a level where they need
00:59:39.960 | gpus i just dropped a link into the ai in action channel um from a company called illum and they
00:59:46.200 | are actively trying to give away compute right now so there's a form to fill out um if you click the
00:59:51.640 | link that is found in that discord and perhaps in the show notes we'll see cool it's uh a minute over
00:59:59.960 | so we'll wrap it up thank you for presenting this is really cool like very interesting and when you said rl i