LLMs: A Hackers Guide

So this is just meant to be relatively casual. I didn't really prep for this. This is just a template I threw together to get through a bunch of stuff. But a lot of it is just questions that I've been getting over the last couple of years, about AI, and now about prompting and everything else.

Because a lot of the work that we now do with Gen AI kind of requires that you think about things differently, right? So this is just a collection of all of those things, and as we go through them, we'll see. Just tiny bit about me, I run a company called Grey Wing here in Singapore.

We're focused on commercial shipping. We do a ton of AI work, we do a ton of data work. When LLMs came up, we started looking at them as NLP effectively on steroids. Because suddenly all of the English language, people's messages, emails, were all accessible, so we started working there.

Out of that work came that initial launch, which was a comms automation tool, that's doing pretty well. Another one was an assistant of our own. Far before OpenAI had tool usage or any of those things, we were doing charts and a bunch of things. Then came sort of RAG.

We do a ton of work with multi-modal RAG, which is visual RAG. So being able to process sort of complex information, how to visually index it, and sort of how to use that to answer mission critical questions. So that is, for example, a DRAM data sheet from Samsung. So they were testing the product.

And in shipping, all of these data sets are really important, and the importance of the output is also very high. So this is just to say, we've done quite a bit of work across different parts of AI, and this is just everything we've learned. Also, I'll put up little QR codes just in lieu of links.

The slides are gonna be up later. So if you see something and you're like, I wanna know a little bit more, there's gonna be a QR code. So this is, in the interest of hacking, everything today is gonna be about going from, let's say, 0 to 0.1, right? Where do you start?

How do you start? So this is just an open source project that I recently released. I'll use that as an example later on, how to go from script to project to release. This is effectively a tool to generate docs. So we had a docs problem, everybody's got a docs problem.

But what we did have is a ton of meetings. We'd have tons of meetings, repeated meetings about the same things, explaining the same things. So now we have Whisper, we have all of these things accessible. Can we just make docs out of things we've explained ten times before? So this was launched two days ago.

About 50 or 60 projects, I don't know, have already made docs using it. Cuz it turns out, for people, it's easy to talk. It's very hard to write docs, right? So that's just an example. So the first kind of thing I wanna talk about really, and this is, it took me a long time to discover and it made a huge difference, is just the iterative loop, right?

And when I say iterative loop, I mean what is your process to build stuff, right? So long time ago, if you were coding that long ago, we had write compile run, right? We basically would write code, it would take a long time to compile, people still working on Xcode would have that same problem today.

We'd run it, we'd go back, write, compile, and run. And then we had sort of interpreted languages come along, and then we now have the REPL, which is effectively the read-eval-print loop, right, which is you write code, it runs instantly, you keep changing it, you keep changing it, something happens.

And then a bunch of guys came along. I don't fully agree with these guys, but they had a good point. You have test driven development, test build, test, that kind of thing. But I think AI really, because the way these models work is not deterministic, and prompting can feel kind of code-like, and working with prompts can feel kind of code-like, but they're really the furthest thing.

So they really need new patterns. So the pattern that me and my team, and a lot of other people that I know, have fallen into is what I'm calling CPLN, so we'll see what that means. So the first one's just chat, right? Whatever your problem is, whatever you're trying to do, just chat with models, make more and more and more examples, and just keep changing prompts, keep changing what you're doing, right?

A lot of people, and myself included, get into this habit, because we have that habit from coding, and sort of building things, of doing it once, and it sort of works. And then forever, you're iterating on that particular prompt. You make very small changes, you keep fixing things, and you keep fixing things.

But I think, really, where you should be spending most of your time, really, is just changing and finding new approaches to solve things. Cuz that makes a huge, huge difference. And that's unheard of in code, right? You wouldn't write something with the intention of rewriting it seven times before you got to the end, right?

You'd write something, something'd be broken, you'd fix that broken thing, you'd fix the next broken thing, and then you'd be done. So the most important thing is just chat. And I'm still surprised, and I still talk to the people on my team, me included, have this problem where we just don't play.

We don't chat enough, really, right? Once we get to a system that sort of gets to like 40%, we're like, okay, we're now going to production. We're just, this is almost good. And so that's a big problem, right? The next one's just take whatever you've learned, go to the playground.

There's a lot of tools, and some of them are really good. But in most cases, 90% of what you want is still just in the playground. Everyone's got a playground. All you're really looking for is the ability to retroactively edit prompts and conversation histories. Some people call it surfing the latent space and sort of make changes, right?

So this is where you'd spend maybe 20% of your time. Once you've got that working, right, let's say you've got one of it working. The next, in most cases, and this is just examples from Lamentis, is loop, right? Add more data, add more test cases, a lot more. See how solid your hypotheses were when you started, right?
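As a rough illustration of the loop step, here is a minimal sketch of running one prompt over a growing list of test cases and reporting a pass rate. Everything named here is an assumption for illustration: `fake_model` is a hypothetical stand-in for a real model call, and the prompt, cases, and substring check are made up, not taken from the talk.

```python
from typing import Callable

def run_cases(model: Callable[[str], str],
              prompt_template: str,
              cases: list[tuple[str, str]]) -> float:
    """Run every (input, expected) case through the model and return the pass rate."""
    passed = 0
    for text, expected in cases:
        output = model(prompt_template.format(text=text))
        ok = expected.lower() in output.lower()  # crude check; swap in your own
        passed += ok
        if not ok:
            print(f"FAIL: {text!r} -> {output!r} (wanted {expected!r})")
    return passed / len(cases)

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "urgent" if "asap" in prompt.lower() else "routine"

PROMPT = "Classify this message as urgent or routine: {text}"
CASES = [
    ("Please send the crew list ASAP", "urgent"),
    ("Monthly report attached", "routine"),
    ("Engine failure, need help now", "urgent"),  # exposes a weak hypothesis
]

rate = run_cases(fake_model, PROMPT, CASES)
print(f"pass rate: {rate:.0%}")
```

Adding a case like the third one is the point of the step: it shows where the original hypothesis was weaker than it looked from a single working example.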

And always reset if it doesn't work. Once you're done with that, right, nest, right? Once you're done with that, you've got a general sense of the approach you want to take, 99.999% of the time. And I almost dared a few people to do it, and I so far haven't seen a single prompt or a single approach where it couldn't be nested.

And by that I mean effectively break the prompt, break the work you're doing into smaller and smaller and smaller subsegments. We'll go into it later, right? But if you're not going to accept a 700 line code file as good, you shouldn't accept a 100 line prompt as good, right?

Or a 50 line prompt as good. It could always be made simpler, it could always be broken down. So really, you just want to keep doing that, right? And once you've gotten this far, and luck would have it, if you go to production and you've got users, you've got subtasks now.
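One way to picture nesting, as a sketch only: instead of a single long prompt, each sub-task gets its own short prompt and its own function. The function names, prompts, and `fake_model` below are hypothetical and are not taken from the project mentioned in the talk.

```python
from typing import Callable

Model = Callable[[str], str]

def extract_topics(model: Model, transcript: str) -> list[str]:
    """Sub-task 1: a short prompt that only lists topics."""
    reply = model(f"List the topics, one per line:\n{transcript}")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def write_section(model: Model, topic: str, transcript: str) -> str:
    """Sub-task 2: a short prompt that writes one paragraph for one topic."""
    return model(f"Write one paragraph about '{topic}' using only this transcript:\n{transcript}")

def make_doc(model: Model, transcript: str) -> str:
    """Compose the small steps; each one can be tested or swapped on its own."""
    topics = extract_topics(model, transcript)
    sections = [f"## {t}\n{write_section(model, t, transcript)}" for t in topics]
    return "\n\n".join(sections)

def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    if prompt.startswith("List the topics"):
        return "Setup\nDeployment\n"
    return "(paragraph text)"

doc = make_doc(fake_model, "we talked about setup, then deployment")
print(doc)
```

Because each function does one thing, a failure points at one small prompt rather than at a 100 line one.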

And they can go through the exact same loop, right? You run into a problem, you've got a new customer with new kind of data, new problems, new things, you want to go back to the original loop. So this is kind of where you want to be spending, or where I find the best division of your time being, right?

This entire blue segment is just try new approaches, right? Because these models, they've been around for about a year, but they are so new. And we are still finding new ways to use them. That you might try something, you might be the first person in the world to have tried it.

You might genuinely be the first person to have thought of that particular way of solving a problem with a model, right? So I really can't emphasize that enough. And you probably want to spend about 20% of your time tuning the prompts. Almost everything usually is a prompt issue. Because I'm presuming that the people you work with, and the things you're building, that you guys are good at coding.

If you're not, there's tons of ways to get better at it. Computers are really good at coding. So in most cases, it's your prompt, right? If it's not your prompt, it's your input. It's the data you're providing. Ideally, see if you can change the size and shape of that data.

And in most cases, that fixes your problem. So a couple of do's and don'ts. The first one, and I still have this problem. Although, it's wonderful to know that someone in this room has solved my biggest problem, which is diarization, which is awesome. But really, just use all modalities, right?

I think a lot of people kind of forgot that when we got ChatGPT, and in short order, we also got audio, we got vision, right? And we got speech to text, and all of these different modalities. And even just the input modality of text, you can transform it into so many things, right?

You can take text and transform that into code to get a more structured representation. You can get structured data, you can do language transformations. So use all of the tools that you've got, right? So let's, yeah. So speech, for example, here's where you'd use each one. Speech is verbose.

If you've got anything dealing with users, we love to talk, right? This entire talk is probably gonna be, I don't know, I'm hoping not, like about 8,000 words, right? If you ask me to type out 8,000 words, it would take me far longer. I would be far less likely to do it, and I'd probably tell you no, right?

If you present the users with a text box, they'll give you five words. If you ask them to just press a button and talk, they'll give you 200 words, right? And these models, the things that we work with, they love context. The more context you can provide, the better.

Vision is insanely useful, right? There's a lot of relationships that you can capture with a picture that you can't with text. Like we know this, right? It's 1,000 words. Anytime you write as a person, you wanna put pictures in for the same reason, right? So all of that can be captured, and now we're getting smarter and smarter and smarter models that can understand that information.

You can use it as really expensive OCR if you want to, we do in some places. But in a lot of cases, it's also far more dense, right? Even if the diagram on the top right, top left, that's an actual diagram that we use, by the way, were to be represented in text, that would be far more token heavy than that picture, right, can encode.

Code is awesome for structure, both for input and output. Like you almost always wanna be using structure both on the input and the output, right? Use structured output whenever possible. Structure your input whenever possible, right? Almost everything humans ever touch usually has some structure, right? Like when I talk, my talk has a structure.

When you write a paragraph, there's a topic sentence. Everything humans ever do usually has structure. And if you're leaving it out, if you're not extracting it, it's a lot harder to control. Yeah, this is just stuff that we use, right? So we use TypeScript and Zod to build type specs, and that makes it so much easier to steer these models.

We use SQL when we wanna express something as a search query. Even if we never run that SQL, it helps the model think, it helps the AI system sort of better guide these things. Yeah, same thing here, use structured output as often as you can, far easier to guide.

It's also far less prone to hallucinations, because you've got a type spec on the inside. And structured output usually constrains the output that's coming out of it, so you see far fewer hallucinations with structured output, right? And I can talk about that more if we have time at the end, but it usually has to do with token probabilities and the output set.
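The talk mentions TypeScript and Zod for type specs. As a rough equivalent using only the Python standard library, here is a sketch of checking a model's JSON output against a small spec before using it. The `Ticket` shape and the allowed priority values are invented for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_PRIORITIES = {"low", "medium", "high"}  # assumed value set

@dataclass
class Ticket:
    title: str
    priority: str
    tags: list[str]

def parse_ticket(raw: str) -> Ticket:
    """Parse model output as JSON and reject anything outside the spec."""
    data = json.loads(raw)  # raises a ValueError subclass on invalid JSON
    if not isinstance(data, dict) or set(data) != {"title", "priority", "tags"}:
        raise ValueError("expected exactly the fields title, priority, tags")
    if not isinstance(data["title"], str):
        raise ValueError("title must be a string")
    if not isinstance(data["priority"], str) or data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(ALLOWED_PRIORITIES)}")
    if not isinstance(data["tags"], list) or not all(isinstance(t, str) for t in data["tags"]):
        raise ValueError("tags must be a list of strings")
    return Ticket(**data)

ticket = parse_ticket('{"title": "Late invoice", "priority": "high", "tags": ["finance"]}')
print(ticket)
```

A rejected output can be retried or logged, which is usually easier to handle than free text that is subtly wrong.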

The same thing is, again, use as much as you can, cuz you got this massive model for free, right? Kind of, right? Commoditized down, and you got this massive model that had 2 trillion, 3 trillion tokens thrown in about human information into it, right? Use that as much as you can, lean into it, right?

There's a lot of libraries, let's say projects that I've either consulted or advised with, where they're inventing their own DSLs. They're inventing their own languages to express what they want. When ideally, if they expressed it as a super set of something that existed, say TypeScript, Python, English, Hindi, whatever's in there, you'd get a lot more benefit out of that.

Cool, so this is a bunch of don'ts. None of these are hard rules, but they're general rules of thumb, especially when you start out, right? In AI, I mean, this is a meme at this point, but we are still very, very early, right? This is not very early days of development or very early days of design.

Like, if you wanted to get into design, and you wanted to be a good painter or a good designer, you wouldn't use DALL-E, right? You wouldn't add an abstraction between you and the thing. You would learn how to paint, right? Because you want that knowledge. You actually want that harder knowledge of how these things work, how they behave, what the actual nature of these things are.

The more abstractions and toolkits and libraries you put between yourself and the model, when you're developing, the less you learn, right? Some of them, honestly, are really good. But that's also a problem, because they're really good, and they have this little circle of things that they do really well.

And very quickly, if you're lucky, somewhat slower, you'll want to step out of it, and then it's just a wasteland, right? If you've ever built something with WordPress or Squarespace and then just wanted to do one thing that it didn't do, you know what I'm talking about, right? That's impossible, everything will fight you.

So ideally, don't add abstractions. I know it can be, especially people with a coding background, kind of sometimes I've seen want to distance themselves from prompting, distance themselves from the non-deterministic nature of these things. Bad instinct, don't look away from it. The next one is also, I know we've got credits to OpenAI, but everyone wants to give you free money.

Everyone wants to give you free credits these days if you're a provider. It's too much investor money in this space. Don't stick to one model, right? They're all very different. They were all kind of similar when they came out, because everyone was working with the same information set. But things have diverged massively.

They're all practically different people, right? It's almost like if you gave some work to someone on your team and they couldn't do it, you wouldn't go, this is undoable. You'd probably give it to someone else, right? Same thing, work with different models. They're all very, very differently trained. There's even different personalities in there.

This one is kind of easy to keep track of, right? Basically, have a general rule of thumb that your outputs are not gonna be that much bigger than your inputs, in most cases. Again, rule of thumb: if they are, that's not gonna end up well, right? If you're looking to generate, let's say, 20 paragraphs of an article from five words of input, you're usually just gonna get very generic, not-so-good output, right?

So try and keep those ratios relatively the same if you can, right? Cool, some smaller FAQs, because these questions get asked a lot, right? So agents, a lot of people have asked me about agents. The simple answer there is anything with looping and termination is usually considered an agent, right?
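The "looping and termination" definition can be sketched in a few lines. This is an illustration under assumptions: `fake_step` is a hypothetical stand-in for a model call, and the "DONE" prefix is an invented stop convention, not a standard.

```python
from typing import Callable

def run_agent(step: Callable[[list[str]], str],
              task: str,
              max_steps: int = 10) -> list[str]:
    """Loop on the same step until it signals it is done, or a step cap is hit."""
    history = [task]
    for _ in range(max_steps):
        reply = step(history)
        history.append(reply)
        if reply.startswith("DONE"):  # the system decides when to stop
            break
    return history

def fake_step(history: list[str]) -> str:
    # Hypothetical stand-in for a model call: works three times, then stops.
    remaining = 3 - (len(history) - 1)
    return "DONE" if remaining <= 0 else f"working, {remaining} left"

history = run_agent(fake_step, "summarize the inbox")
print(history)
```

The `max_steps` cap is a design choice worth keeping: it bounds cost when the termination signal never arrives.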

So anytime you've got a system and it basically loops on the same prompt or some set of prompts, and it basically has the ability to continue execution and then decide when it wants to stop, that's usually an agent. This one is really helpful, right? When you run into problems or when you start working on a project, or you're just looking for a project to work on, it's useful to know what capabilities just got added to the tool set, right?

With Gen AI. These are four of the biggest ones, right? The first one's just plain NLP. If you've done NLP or anything close to it, it just got way better, right? We can classify documents, we can classify information all sorts of ways, we can label them, and we can do all sorts of things with them that previously NLP really couldn't do.

The second one's filtering and extraction, right? So you can pull information out, right? And the next one's sort of transformation. So anytime you've got RAG, summarization, that's a transformation, right? If you're doing code generation, a lot of cases, that's transformation. If you're doing translation, that's transformation, right? So oftentimes it's useful to look at your problem, right, in an industry or your problem set in front of you, or you're just looking for ideas.

If you look for one of these four things, if you look for one of these four classes, it's an easier way to structure, maybe that's where you wanna go, instead of where to put things. The final one, and I think some people are using it, but I've seen that use case sort of go down for some reason.

It's just general purpose generation, right? You want it to write things no one's ever written before. You want it to make things up. So some resources, I'm not gonna be talking about prompting, not gonna be talking about RAG. These are just some articles. These are my articles. If you don't like me, the top of it has people that I respect that are far smarter than me, so click the links and go there and read those.

Cool, the next one, and this might be the final one, is debugging, right? I don't think I've heard that many people talk about. I mean, among people who work with AI, this is a massive conversation, right? How do you debug? Because the sort of curse and sort of the benefit that we got with modern AI things is that it's very easy to build a demo.

It's very easy to get to something that sort of works, but it's very hard to debug things when they go wrong, right? That's almost, again, new paradigm. So what is happening to you, right? If nothing works, right, always go down to the prompt level. And if you can't, then get rid of your abstractions and work up from there, right?

Try a different model. Try going up a level of intelligence and see if it fixes it. That should tell you where your problems are. Or try going down a level of intelligence and see what happens. The next one is transform the input. In most cases, it's your input that's the issue.

Either it's too verbose, it's not the right transformation, it's not structured the right way. So any transformations you can do on the input is gonna make a massive difference, right? And finally, if you're not doing this already, add more structure to the output, right? More structure is gonna help you point out where your problems are.

More structure is gonna tell you, sort of expose some of the big issues there. Okay, so this doesn't usually happen to people. This usually does, right? Is it's kind of working? It's kind of working, and I can spend another two weeks on it, and it'll get a bit further down the line of kind of working.

But it's not working, necessarily, right? So again, I'm gonna go back to data. In most cases, you wanna find out what separates your offending data, which is where it doesn't work, from the stuff that does work, right? Try all sorts of transformations. One of those is gonna point to some sort of difference between the stuff that works and the stuff that doesn't, right?

If you do, that's a prompt, right? More validation's always gonna help. And then we saw the classification before, right? If you're trying to do more than one of those things inside the same system, inside the same, with the same model, usually separate it out, right? And it makes a huge difference.

Finally, yeah, just classify your errors. Most errors I've seen sort of fall into these three issues. You've either got app level issues in terms of how that data's being fed in and fed out, and how models are orchestrated once things get too large. Or you get factuality issues, right?

It's just making things up that don't exist, or it's just giving you information that it really shouldn't, or pulling out the wrong information. It's a factuality issue. The third one's just instruction following. Is it just not listening to the specific instructions that you're giving it, right? And this is at the model level, but it happens at the meta level as well.

Even if you're working with, say, three models and 300 prompts, all of these things still apply. Okay, so what do you do, right? The first one is, whatever you're doing, right? Whatever you're doing as far as prompting and working with models go, you're almost always too verbose. Because in most cases, it's English, and once we start adding things, they kind of work.

So you get to this sort of Pareto level of, it works, but it just doesn't. It's almost how humans behave. Cut them down, there's usually space to cut them down, cut them again. The lower your task complexity per prompt, or per task, or per function, the better, right? The easier it is to debug, the easier it is for you to have things with defined blast radiuses, where if something goes wrong, you can swap it out and fix it.

Otherwise, if something goes wrong some day, you're gonna have a problem. So, how much time have we got left? Okay, ten minutes, perfect. So this is just an example of that particular project that I mentioned at the beginning, right? So it started with just a specific issue, honestly, it wasn't even me.

It was Hibi, who's actually here, who had a transcript for me. And she was like, okay, can we make docs out of this, right? Or I think it came partly from that. So there was a lot of talking. There was a lot of trying to figure out what we can pull out, what it understood out of the transcript.

You're trying to look for understanding. You're trying to see if this can even be done. You're just testing very high level hypothesis, right? Some of the things I tested were sort of trying to pull out structure directly. Some of the other ones were trying to classify that data before pulling out structure.

You learn just a lot about what it is. You figure out where you wanna put the transcript, whether chunking is a valid strategy. All of that you can learn from just talking, right? The next one is talk, but then start changing things, right? Now you start adding steps. Now you start adding structure.

You start getting information out. And once you're done with that, the entire thing, and this actually worked, was just this one script, right? Really, I mean, you don't have to read that. It's actually in the repo. It's just this one script, right? And really all it did was just loop twice over everything, and then break it down into sections and use different models to write different things, right, so there's one model, you know, that's generating the structure.

There's another model that's actually doing the long form writing. And then the final one is just breaking it down into smaller and smaller and smaller functions. So if you look in the repo, it's still not that big, right? But there's a lot more state management.

There's a lot of self-healing. There's a lot of correction. All of that stuff can go in after, like you've proven the thesis. Cool, actually, I'm ahead of time. I didn't think I would be. So the final thing, and I will say this, is a lot of people I speak to are still very concerned about cost, right?

I don't know how many of you guys watched the NVIDIA keynote that happened a couple of days ago, but long story short, everything you're using now is gonna get at least 10x, if not 50x cheaper, in very short order, right? It's gonna get 10x, if not 50x faster, in very short order.

So what would you build if you were building for, say, six months from now, or what would you make if you just presumed that today, right? And it's a different way of working with these things. If something costs ten bucks, that's a different system than if it costs one cent, right?

If something takes an hour, that's different from if it takes six minutes, right, so I would say this is a valid presumption to make, right, when you're building something. Is what more can you do if you just presume that about the future? Immediate future, right? Because we still haven't even gotten hardware level optimizations.

That's what NVIDIA's doing now. That's a 10x. Memory level optimizations, again, still coming up. That's a 10x. Quantization, that's probably another 10x. So all of these things are almost being done now. And they're very comparatively easy, engineering-wise. It's just incremental optimization to get there. But cool, that's everything. Feel free to find me after, or just reach out on Twitter, I'm happy to help.

>> Do you have some questions? >> Sure, yeah. >> Yeah, I think I have a question. We got like, what do you think about long context window model and embedding model? >> Okay, so long context is tough, right? Because I might say something where I don't know what I'm talking about.

That said, this has been my question as well. The problem with context windows is our algorithm for attention is quadratic. What I mean by that is, to get twice as much context out of something, you've got to spend four times the amount of memory and compute. We still have that curse.
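The "twice the context, four times the cost" arithmetic can be checked with a small sketch. It assumes plain self-attention, where every token is scored against every other token, and ignores any optimizations a real system might apply.

```python
def attention_score_entries(context_tokens: int) -> int:
    """Number of token-to-token scores in plain self-attention: n * n."""
    return context_tokens * context_tokens

base = attention_score_entries(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    entries = attention_score_entries(n)
    # Each doubling of n multiplies the entry count by four.
    print(f"{n:>6} tokens -> {entries:>12,} scores ({entries // base}x the 1,000-token cost)")
```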

There's no way to, we still don't know a good way to get around it, right? So what that means effectively is to get really long context windows, you have to cheat. You effectively have to say, okay, I'm going to have something before I run the model that's going to kind of figure out which part of the context to actually pay attention to.

So you don't actually get the full context window, right? You kind of do, but if you take the full context window and you're trying to use every single token in it to compute an answer, it's not going to work. So that is still very much a problem that could be solved.

That's one of those open problems. I think it's an open problem that could be solved tonight by someone that's working somewhere or ten years from now, we just don't know, right? >> You've mentioned a bunch about transforming the input. How do you go about doing that? Do you use AI to transform, is your input global?

>> In most cases, yes, you're going to be using AI to transform it. But there's also just a ton of structured stuff you can do, right, very easily. Like most documents, let's say I've got a PDF, or I've got, let's say these slides, or I've got one of my documents is in Markdown.

There's a ton of structure in there you can just grep for, right? Because I can very quickly figure out what the sections are. I can very easily separate by sentences. That's all stuff that you can do today, right? So even just knowing that that's got 300 sentences in it, that's a transformation of the input that is valuable, super valuable, right?

Because we can make assumptions already, right? If I give you a document that someone's written, I can presume that the title is probably the highest compressed information in there, right, that is a good enough thing. I can presume that the first section will have some sort of intro of what the thing is, right?
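As a sketch of the kind of structure you can pull out without any AI, here is a small function that takes a Markdown document and returns its title, section headings, and sentence count. The regexes are deliberately naive and the sample document is invented.

```python
import re

def describe_markdown(text: str) -> dict:
    """Extract cheap structural facts from a Markdown document."""
    headings = re.findall(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", text, flags=re.MULTILINE)
    # Drop heading lines, then split the remaining prose on sentence punctuation.
    body = re.sub(r"^#{1,6}[ \t]+.*$", "", text, flags=re.MULTILINE)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    return {
        "title": headings[0][1] if headings else None,  # first heading as title
        "sections": [name for _, name in headings[1:]],
        "sentence_count": len(sentences),
    }

DOC = """# Crew Change Guide

This guide explains crew changes. It is short.

## Planning

Start early. Check visas!
"""

info = describe_markdown(DOC)
print(info)
```

Facts like these can go into the prompt alongside, or instead of, the raw text.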

Those are all transformations, but yes, usually you use AI. >> I had a question. So how do you think about >> I actually haven't used Devin. I just have not had the time, but I've had people tell me that it's good. Look, coding is going to be where these models make just a massive, massive difference, right?

I already use Cursor, which can understand just a massive amount of context and sort of work. It has been six months since I wrote any code that wasn't at least partially AI generated, so it's just going to keep getting bigger and bigger and bigger. That said, I will say the time that most devs that I know and most companies that I know spend is in business logic, maintenance, and sort of really trying to transform customer input to really massive systems with a ton of legacy code, like we're a long way away from that, right?

What I mean is it's getting easier and easier for you to spin up a more and more and more complex project from scratch, right? But the massive dev work that sort of sits, kind of sits past that, right? That still hasn't been touched. >> >> The efforts to do something there, because that's where the money is in a lot of ways, because that's where most enterprises are, right?

If you look at SAP, or you look at most of these guys, have not so far borne active fruit. I know most of the companies in that space, they're still having trouble getting it to work with very large code bases, right? Like, let's say anything above a 50-person company that's existed for more than three years.

That code base, so far, AI hasn't been able to touch, right? >> I had a question. So I recently read a, I wouldn't say read the paper, I read the abstract, right? So where it was like, I think from Amazon or from somewhere, or Netflix perhaps, that getting cosine similarities between embeddings, it's not really a good measure for getting the meaning of things, right?

And this is a preface for my question. And also when we do vector searches and just try to pull relevant information, I don't know, it feels like it doesn't work. I'm trying to figure out what am I doing wrong, how to do it better. I watched Jerry Liu's talk from LlamaIndex, it was an 18-minute talk or something.

It's a very nice talk, but it just kind of flew over. So what's your recommendation? >> I think the problem here is embeddings are sort of fuzzy search on steroids. If you're using them for anything more, I think even today you have a problem, right, because a couple of things.

One, these are really tiny models, comparatively, right? Big brain, small brain, tiny brain, these are really tiny models. In most cases, they don't have a good understanding of the underlying text. That's why long context embeddings never made sense, right? The longer the context, it just doesn't really make sense.

Not to mention, in most cases, that's a transformation of the input, right? What Hebe was saying, that's a transformation of the input, is you're transforming it. Well, you're transforming it into a space where it's a lot harder for you to work with it, right? You're transforming it to a set of numbers.

And now the only thing you have is cosine similarity. You can have a bias matrix, you can push that math a little bit more. But because that model is unknown to you, the model's workings are unknown to you, those are forever gonna be a bunch of numbers, right? In some insanely high dimensional space.

So there's not a lot to do there, right? What is becoming very possible now, that I see a lot of companies switching to, is just use the whole brain, use the LLM, right? Whatever you're using embeddings for, you can use an LLM, right? It's just more expensive, right? In most cases, you can use an LLM for that.

Like let's say you're using, I'll give you the most brute force example of this. Let's say using embeddings to take 100,000 items and see which ones are similar or which ones closest to your query. You can take an LLM, run it through every single one of those documents and ask, hey, is this close, is this close, is this close?

And you'll get an answer, right? That is not a good way to do it, do not do it this way. But you see what I mean, right? So they are kind of, you can substitute one for the other just a little bit. I think embeddings have a place, right?

But they should always be the last step in your pipeline. You should cut down the search space as much as possible with structured search, transformations, BM25, there's a bunch of stuff you can do, right? You should never be searching your whole search space with embeddings, right? You should always be searching some reduced search space where, hey, last 20 things, and I know these are relevant because keywords.

I know these are relevant because location. I know these are relevant because an LLM told me after transformation, whatever. Now I can embed, that's fine, right? But if you embed at the beginning, in most cases, it just doesn't work at scale. >> So it's more like to get the results and then sort it.

Is that where embeddings come in? >> More like to get the results, and yes, kind of to sort it, but kind of also to identify useful parts of those results. Let's say the results you got were pages, but you want sentences, right? You wanna know which part of it is heat map wise the most important.

You can use embeddings for that, right? All right, thank you so much. >> Yeah. >>
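The "embeddings as the last step" advice from the answer above can be sketched as a two-stage search: cut the candidates down with a cheap keyword filter first, then rank only the survivors by vector similarity. This is a toy under stated assumptions: the keyword overlap is a simplification and not BM25, and `toy_embed` is a hypothetical word-count stand-in for a real embedding model.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_filter(docs: list[str], query: str, keep: int = 20) -> list[str]:
    """Stage 1: keep only docs that share at least one keyword with the query."""
    q = set(tokens(query))
    scored = [(len(q & set(tokens(d))), d) for d in docs]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    return [d for _, d in hits[:keep]]

def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for an embedding model: plain word counts.
    return Counter(tokens(text))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(docs: list[str], query: str, keep: int = 20, top: int = 3) -> list[str]:
    """Stage 2: rank only the reduced candidate set by similarity."""
    candidates = keyword_filter(docs, query, keep)
    qv = toy_embed(query)
    return sorted(candidates, key=lambda d: cosine(qv, toy_embed(d)), reverse=True)[:top]

DOCS = [
    "Port fees in Singapore rose this year",
    "Crew change rules for Singapore port",
    "Recipe for chicken rice",
    "Singapore port crew change checklist and visa rules",
]
results = search(DOCS, "crew change Singapore port")
print(results)
```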
