
High Agency Pydantic over VC Backed Frameworks — with Jason Liu of Instructor


Chapters

00:00 Introductions
02:50 Early experiments with Generative AI at StitchFix
09:39 Design philosophy behind the Instructor library
13:17 JSON Mode vs Function Calling
14:43 Single vs parallel function calling
16:28 How many functions is too many?
20:40 How to evaluate function calling
24:01 What is Instructor good for?
26:41 The Evolution from Looping to Workflow in AI Engineering
31:58 State of the AI Engineering Stack
33:40 Why Instructor isn't VC backed
37:08 Advice on Pursuing Open Source Projects and Consulting
42:59 The Concept of High Agency and Its Importance
51:06 Prompts as Code and the Structure of AI Inputs and Outputs
53:06 The Emergence of AI Engineering as a Distinct Field

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.080 | Welcome to the Latent Space Podcast.
00:00:02.960 | This is Alessio, partner and CTO in Residence at Decibel Partners.
00:00:06.640 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:09.520 | Hello.
00:00:10.020 | We're back in the remote studio with Jason Liu from Instructor.
00:00:13.360 | Welcome, Jason.
00:00:14.120 | Hey there.
00:00:14.620 | Thanks for having me.
00:00:16.440 | Jason, you are extremely famous.
00:00:18.440 | So I don't know what I'm going to do introducing you.
00:00:20.720 | But you're one of the Waterloo clan.
00:00:24.220 | There's a small cadre of you that's just completely
00:00:26.520 | dominating machine learning.
00:00:28.600 | Actually, can you list Waterloo alums
00:00:30.520 | that you know are just dominating and crushing it
00:00:33.480 | right now?
00:00:34.600 | So John from Rysana is doing his Inversion models.
00:00:42.320 | I know there's Clive Chen.
00:00:45.040 | Clive Chen from Waterloo.
00:00:46.120 | He was one of the kids where I-- when I started the data science
00:00:49.640 | club, he was one of the guys who was joining in and just
00:00:51.880 | hanging out in the room.
00:00:52.920 | And then he was at Tesla, working with Karpathy.
00:00:56.160 | Now he's at OpenAI.
00:00:58.280 | He's in my climbing club.
00:01:00.640 | Oh, hell yeah.
00:01:01.440 | Yeah.
00:01:02.160 | I haven't seen him in like six years now.
00:01:05.200 | To get in the social scene in San Francisco,
00:01:07.320 | you have to climb.
00:01:08.720 | So yeah, both in career and in rocks.
00:01:12.000 | Yeah, I mean, a lot of good problem solving there.
00:01:14.120 | But oh man, I feel like now that you put me on the spot,
00:01:16.840 | I don't know.
00:01:17.840 | It's OK.
00:01:19.760 | Yeah, that was a riff.
00:01:20.920 | OK, but anyway, so you started a data science club at Waterloo.
00:01:23.760 | We can talk about that.
00:01:24.840 | But then also spent five years at Stitch Fix as an MLDE.
00:01:29.000 | You pioneered the use of OpenAI's LLMs
00:01:31.400 | to increase stylist efficiency.
00:01:33.440 | So you must have been a very, very early user.
00:01:35.360 | This was pretty early on.
00:01:37.840 | Yeah, I mean, this was like GPT-3.
00:01:42.200 | OK, so we actually were using transformers at Stitch Fix
00:01:46.600 | before the GPT-3 model.
00:01:47.600 | So we were just using transformers in recommendation
00:01:49.640 | systems.
00:01:50.140 | At that time, I was very skeptical of transformers.
00:01:53.680 | I was like, why do we need all this infrastructure?
00:01:55.760 | We can just use matrix factorization.
00:01:57.520 | When GPT-2 came out, I fine-tuned my own GPT-2
00:02:00.480 | to write rap lyrics.
00:02:01.600 | And I was like, OK, this is cute.
00:02:03.240 | OK, I got to go back to my real job.
00:02:05.320 | Who cares if I can write a rap lyric?
00:02:08.400 | When GPT-3 Instruct came out, again, I
00:02:11.480 | was very much like, why are we using a POST request
00:02:15.720 | to review every comment a person leaves?
00:02:17.920 | We can just use classical models.
00:02:19.840 | So I was very against language models for the longest time.
00:02:23.800 | And then when ChatGPT came out, I basically
00:02:25.680 | just wrote a long apology letter to everyone at the company.
00:02:29.000 | I was like, hey, guys, I was very dismissive
00:02:31.560 | of some of this technology.
00:02:32.680 | I didn't think it would scale well.
00:02:34.120 | And I am wrong.
00:02:35.540 | This is incredible.
00:02:36.920 | And I immediately just transitioned
00:02:38.400 | to go from computer vision recommendation systems to LLMs.
00:02:42.240 | But funny enough, now that we have RAG,
00:02:44.120 | we're kind of going back to recommendation systems.
00:02:46.440 | Yeah, speaking of that, I think Alessio's going to bring up--
00:02:49.020 | I was going to say, we had Bryan Bishop from Hex
00:02:51.520 | on the podcast.
00:02:52.800 | Did you overlap at Stitch Fix?
00:02:54.400 | Yeah, yeah, he was one of my main users
00:02:56.760 | of the recommendation framework that I
00:02:58.480 | had built out at Stitch Fix.
00:02:59.840 | Yeah, we talked a lot about RecSys, so it makes sense.
00:03:02.880 | So I actually, now I have adopted that line,
00:03:05.920 | that RAG is RecSys.
00:03:08.080 | And if you're trying to reinvent new concepts,
00:03:10.760 | you should study RecSys first, because you're
00:03:13.120 | going to independently reinvent a lot of concepts.
00:03:15.160 | So your system was called Flight.
00:03:16.640 | It's a recommendation framework with over 80% adoption,
00:03:19.100 | servicing 350 million requests every day.
00:03:22.300 | Wasn't there something existing at Stitch Fix?
00:03:24.220 | Like, why did you have to write one from scratch?
00:03:26.460 | No, so I think because at Stitch Fix, a lot of the machine
00:03:29.980 | learning engineers and data scientists
00:03:31.580 | were writing production code, sort of every team's systems
00:03:34.620 | were very bespoke.
00:03:36.140 | It's like, this team only needs to do real-time recommendations
00:03:39.860 | with small data, so they just have a fast API
00:03:42.260 | app with some pandas code.
00:03:44.300 | This other team has to do a lot more data,
00:03:46.160 | so they have some kind of Spark job
00:03:47.700 | that does some batch ETL that does a recommendation, right?
00:03:51.320 | And so what happens is each team writes their code differently,
00:03:54.020 | and I have to come in and refactor their code.
00:03:56.340 | And I was like, oh, man, I'm refactoring
00:03:58.380 | four different code bases four different times.
00:04:01.660 | Wouldn't it be better if all the code quality was my fault?
00:04:04.580 | All right, well, let me just write this framework,
00:04:06.660 | force everyone else to use it, and now one person
00:04:09.180 | can maintain five different systems rather than five teams
00:04:12.740 | having their own bespoke system.
00:04:14.860 | And so it was really a need of just standardizing everything.
00:04:18.200 | And then once you do that, you can do observability
00:04:22.000 | across the entire pipeline and make large, sweeping
00:04:25.760 | improvements in this infrastructure.
00:04:28.120 | If we notice that something is slow,
00:04:30.160 | we can detect it on the operator layer.
00:04:33.320 | Just, hey, hey, this team, you guys are doing this operation.
00:04:35.960 | It's lowering our latency by 30%.
00:04:38.640 | If you just optimize your Python code here,
00:04:42.440 | we can probably make an extra million dollars.
00:04:44.360 | Like, jump on a call and figure this out.
00:04:46.580 | And then a lot of it was just doing all this observability
00:04:49.880 | work to figure out what the heck is going on
00:04:52.100 | and optimize this system from not only just a code
00:04:54.360 | perspective, but just harassing the org and saying,
00:04:58.600 | we need to add caching here.
00:05:00.120 | We're doing duplicated work here.
00:05:01.800 | Let's go clean up the systems.
00:05:03.920 | Yeah.
00:05:05.400 | One more system that I'm interested in finding out
00:05:07.720 | more about is your similarity search system
00:05:10.040 | using CLIP and GPT-3 embeddings in FAISS,
00:05:13.760 | which you said drove over $50 million in annual revenue.
00:05:17.240 | So of course, they all gave all that to you, right?
00:05:19.560 | No, no.
00:05:20.240 | I mean, it's not going up and down.
00:05:22.200 | But I got a little bit, so I'm pretty happy about that.
00:05:25.640 | But there, that was when we were fine-tuning ResNets
00:05:31.480 | to do image classification.
00:05:33.360 | And so a lot of it was, given an image,
00:05:35.840 | if we could predict the different attributes we have
00:05:38.000 | in our merchandising, and we can predict the text embeddings
00:05:41.280 | of the comments, then we can build an image vector or image
00:05:46.840 | embedding that can capture both descriptions of the clothing
00:05:49.920 | and sales of the clothing.
00:05:51.600 | And then we would use these additional vectors
00:05:53.520 | to augment our recommendation system.
00:05:56.360 | And so with this, the recommendation system
00:05:59.000 | really was just around, what are similar items?
00:06:02.000 | What are complementary items?
00:06:03.240 | What are items that you would wear in a single outfit?
00:06:05.960 | And being able to say, on a product page,
00:06:08.360 | let me show you 15, 20 more things.
00:06:10.720 | And then what we found was like, hey, when you turn that on,
00:06:13.260 | you make a bunch of money.
00:06:14.120 | Yeah.
00:06:14.640 | OK, so you didn't actually use GPT-3 embeddings.
00:06:17.320 | You fine-tuned your own.
00:06:19.160 | Because I was surprised that GPT-3 worked off the shelf.
00:06:21.520 | OK, OK.
00:06:23.400 | Because at this point, we would have 3 million pieces
00:06:26.560 | of inventory over a billion interactions
00:06:28.920 | between users and clothes.
00:06:32.240 | Any kind of fine-tuning would definitely
00:06:34.080 | outperform some off-the-shelf model.
00:06:38.400 | Cool.
00:06:39.400 | I'm about to move on from Stitch Fix.
00:06:41.240 | But any other fun stories from the Stitch Fix days
00:06:44.000 | that you want to cover?
00:06:45.840 | No, I think that's basically it.
00:06:47.160 | I mean, the biggest one, really, was the fact
00:06:49.000 | that, I think, for just four years,
00:06:50.560 | I was so bearish on language models and just NLP in general.
00:06:54.080 | I was like, oh, none of this really works.
00:06:56.520 | Why would I spend time focusing on this?
00:06:58.320 | I've got to go do the thing that makes money--
00:07:00.440 | recommendations, bounding boxes, image classification.
00:07:03.640 | Yeah.
00:07:04.160 | And now I'm prompting an image model.
00:07:07.240 | I was like, oh, man, I was wrong.
00:07:09.880 | I think-- OK, so my Stitch Fix question would be,
00:07:14.240 | I think you have a bit of a drip.
00:07:15.840 | And I don't.
00:07:16.720 | My primary wardrobe is free startup conference t-shirts.
00:07:21.480 | Should more technology brothers be using Stitch Fix?
00:07:26.040 | Or what's your fashion advice?
00:07:28.120 | Oh, man, I mean, I'm not a user of Stitch Fix, right?
00:07:31.160 | It's like, I enjoy going out and touching things and putting
00:07:35.480 | things on and trying them on, right?
00:07:37.360 | I think Stitch Fix is a place where you kind of go
00:07:39.440 | because you want the work offloaded.
00:07:42.080 | Whereas I really love the clothing
00:07:44.840 | I buy where I have to--
00:07:46.880 | when I land in Japan, I'm doing a 45-minute walk up a giant hill
00:07:50.480 | to find this weird denim shop.
00:07:52.480 | That's the stuff that really excites me.
00:07:54.520 | But I think the bigger thing that's really captured
00:07:56.840 | is this idea that narrative matters a lot to human beings.
00:08:03.280 | And I think the recommendation system,
00:08:05.160 | that's really hard to capture.
00:08:07.240 | It's easy to sell--
00:08:08.200 | it's easy to use AI to sell a $20 shirt.
00:08:10.680 | But it's really hard for AI to sell a $500 shirt.
00:08:14.200 | But people are buying $500 shirts, you know what I mean?
00:08:16.600 | There's definitely something that we can't really
00:08:19.200 | capture just yet that we probably will figure out
00:08:21.640 | how to in the future.
00:08:24.120 | Well, it'll probably-- I'll put in JSON,
00:08:26.440 | which is what we're going to turn to next.
00:08:28.640 | So then you went on a sabbatical to South Park Commons
00:08:31.760 | in New York, which is unusual because it's usually--
00:08:34.600 | Yeah, so basically in 2020, really, I
00:08:39.800 | was just enjoying working a lot.
00:08:41.820 | And so I was just building a lot of stuff.
00:08:43.600 | This is where we were making the tens of millions of dollars
00:08:46.360 | doing stuff.
00:08:47.640 | And then I had a hand injury, and so I really
00:08:49.520 | couldn't code anymore for about a year, two years.
00:08:52.840 | And so I kind of took half of it as medical leave.
00:08:55.640 | The other half, I became more of a tech lead,
00:08:57.600 | just making sure the systems or lights were on.
00:09:01.320 | And then when I went to New York,
00:09:03.920 | I spent some time there and kind of just wound down
00:09:06.400 | the tech work, did some pottery, did some jiu jitsu.
00:09:09.560 | And after ChatGPT came out, I was like, oh, I clearly
00:09:14.960 | need to figure out what is going on here because something
00:09:17.720 | feels very magical, and I don't understand it.
00:09:20.120 | So I spent basically five months just prompting and playing
00:09:23.200 | around with stuff.
00:09:24.600 | And then afterwards, it was just my startup friends
00:09:26.960 | going like, hey, Jason, my investors
00:09:29.800 | want us to have an AI strategy.
00:09:31.760 | Can you help us out?
00:09:33.120 | And it just snowballed more and more
00:09:35.680 | until I was making this my full-time job.
00:09:38.680 | And you had YouTube University and a journaling app,
00:09:42.640 | a bunch of other explorations.
00:09:44.440 | But it seems like the most productive
00:09:47.280 | or the most best-known thing that came out of your time
00:09:50.360 | there was Instructor.
00:09:51.720 | Yeah, written on the bullet train in Japan.
00:09:54.960 | Well, tell us the origin story.
00:09:57.080 | Yeah, I mean, I think at some point,
00:10:00.240 | tools like Guardrails and Marvin came out,
00:10:03.240 | those are kind of tools that use XML and Pydantic
00:10:06.000 | to get structured data out.
00:10:07.560 | But they really were doing things sort of in the prompt.
00:10:10.720 | And these are built with sort of the Instruct models in mind.
00:10:14.080 | And I really-- like, I'd already done that in the past.
00:10:17.160 | At Stitch Fix, one of the things we did
00:10:18.800 | was we would take a request note and turn that into a JSON
00:10:22.720 | object that we would use to send it to our search engine, right?
00:10:26.960 | So if you said, like, I wanted skinny jeans that
00:10:29.440 | were this size, that would turn into JSON
00:10:31.720 | that we would send to our internal search APIs.
00:10:34.360 | It always felt kind of gross.
00:10:35.960 | A lot of it is just, like, you read the JSON,
00:10:37.840 | you parse it, you make sure the names are strings
00:10:40.000 | and ages are numbers, and you do all this messy stuff.
00:10:43.520 | But when Function Calling came out,
00:10:45.480 | it was very much sort of a new way of doing things.
00:10:48.480 | Function Calling lets you define the schema
00:10:50.800 | separate from the data and the instructions.
00:10:54.440 | And what this meant was you can kind
00:10:57.160 | of have a lot more complex schemas
00:10:59.000 | and just map them in Pydantic.
00:11:01.200 | And then you can just keep those very separate.
00:11:03.160 | And then once you add, like, methods,
00:11:04.700 | you can add validators and all that kind of stuff.
00:11:07.060 | The one thing I really had with a lot of these libraries,
00:11:09.520 | though, was it was doing a lot of the string formatting
00:11:11.960 | themselves, which was fine when it was the instruction tune
00:11:15.520 | models.
00:11:16.000 | You just have a string.
00:11:17.480 | But when you have these new chat models,
00:11:19.840 | you have these chat messages.
00:11:21.320 | And I just didn't really feel like taking
00:11:24.680 | that access away from the developer
00:11:26.560 | was much of a benefit to them.
00:11:30.480 | And so I just said, let me write the most simple SDK
00:11:34.240 | around the OpenAI SDK, simple wrapper on the SDK,
00:11:39.120 | just handle the response model a bit,
00:11:41.240 | and kind of think of myself more like requests
00:11:44.880 | than an actual framework that people can use.
00:11:46.680 | And so the goal is, hey, this is something
00:11:48.360 | that you can use to build your own framework.
00:11:50.360 | But let me just do all the boring stuff
00:11:51.980 | that nobody really wants to do.
00:11:53.600 | People want to build their own frameworks.
00:11:55.360 | People don't want to build JSON parsing.
00:11:59.640 | And the retrying and all that other stuff.
00:12:02.080 | Yeah.
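A rough sketch of the pattern being described, assuming the current Instructor package and the OpenAI Python SDK; the model name and the response model's fields are illustrative, not from the episode:

```python
# Minimal sketch: patch the OpenAI client with Instructor so a chat
# completion returns a validated Pydantic object instead of a raw string.
# Model name and fields are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class UserDetail(BaseModel):
    name: str
    age: int


client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",            # any chat model
    response_model=UserDetail,      # the one extra argument Instructor adds
    messages=[{"role": "user", "content": "Jason is 30 years old."}],
)
print(user.name, user.age)          # typed attributes, autocomplete in the IDE
```

Everything else here is still just the vendor SDK, which is the "requests, not a framework" posture discussed next.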
00:12:03.200 | Yeah, we had a little bit of this discussion before the show.
00:12:05.960 | But that design principle of going for being requests
00:12:09.320 | rather than being Django, what inspires you there?
00:12:16.320 | This has come from a lot of prior pain.
00:12:18.280 | Are there other open source projects
00:12:20.760 | that kind of inspired your philosophy here?
00:12:23.040 | Yeah, I mean, I think it would be requests.
00:12:25.000 | I think it is just the obvious thing you install.
00:12:29.280 | If you were going to go make HTTP requests in Python,
00:12:33.660 | you would obviously import requests.
00:12:35.260 | Maybe if you want to do more async work,
00:12:36.920 | there's future tools.
00:12:38.320 | But you don't really even think about installing it.
00:12:40.960 | And when you do install it, you don't think of it
00:12:42.960 | as like, oh, this is a requests app.
00:12:46.640 | No, this is just Python.
00:12:48.360 | The bigger question is, a lot of people
00:12:50.360 | ask questions like, oh, why isn't requests
00:12:52.360 | in the standard library?
00:12:54.700 | That's how I want my library to feel.
00:12:56.360 | It's like, oh, if you're going to use the LLM SDKs,
00:12:59.640 | you're obviously going to install Instructor.
00:13:01.780 | And then I think the second question would be, oh,
00:13:03.880 | how come Instructor doesn't just go into OpenAI,
00:13:06.240 | go into Anthropic?
00:13:07.400 | If that's the conversation we're having,
00:13:09.200 | that's where I feel like I've succeeded.
00:13:11.360 | Yeah, it's so standard, you may as well
00:13:14.480 | just have it in the base libraries.
00:13:16.960 | And the shape of the request has stayed the same.
00:13:20.000 | But initially, function calling was maybe
00:13:22.320 | equal structure outputs for a lot of people.
00:13:24.560 | I think now the models also support JSON mode
00:13:28.280 | and some of these things.
00:13:29.320 | And "return JSON or my grandma is going to die."
00:13:33.060 | All of that stuff is maybe still to be decided.
00:13:35.400 | How have you seen that evolution?
00:13:37.320 | Maybe what's the meta game today?
00:13:39.320 | Should people just forget about function calling for structure
00:13:42.200 | outputs?
00:13:42.720 | Or when is structure output, like JSON mode,
00:13:46.080 | the best versus not?
00:13:47.360 | We'd love to get any thoughts, given
00:13:48.860 | that you do this every day.
00:13:50.160 | Yeah, I would almost say these are
00:13:51.880 | like different implementations of--
00:13:54.080 | the real thing we care about is the fact
00:13:55.720 | that now we have typed responses from language models.
00:13:58.200 | And because we have that type response,
00:13:59.960 | my IDE is a little bit happier.
00:14:01.400 | I get autocomplete.
00:14:02.880 | If I'm using the response wrong, there's
00:14:04.580 | a little red squiggly line.
00:14:05.920 | Those are the things I care about.
00:14:07.560 | In terms of whether or not JSON mode is better,
00:14:09.720 | I usually think it's almost worse
00:14:12.160 | unless you want to spend less money on the prompt tokens
00:14:15.500 | that the function call represents.
00:14:18.880 | Primarily because with JSON mode,
00:14:20.300 | you don't actually specify the schema.
00:14:23.280 | So sure, json.loads works.
00:14:24.840 | But really, I care a lot more than just the fact
00:14:26.800 | that it is JSON.
00:14:28.880 | I think function calling gives you a tool to specify the fact
00:14:31.960 | that, OK, this is a list of objects that I want.
00:14:34.200 | And each object has a name or an age.
00:14:36.160 | And I want the age to be above 0.
00:14:37.780 | And I want to make sure it's parsed correctly.
00:14:41.040 | That's where function calling really shines.
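A hedged sketch of the kind of constraint being described, assuming Pydantic v2 field constraints; the field names and the bound are illustrative:

```python
# "A list of objects, each with a name and an age above 0": the schema
# carries the constraint, and parsing fails loudly if the model breaks it.
from typing import List
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int = Field(gt=0)  # reject zero or negative ages at parse time


class People(BaseModel):
    people: List[Person]


# Passed as a response model, this schema is sent via function calling
# and the output is validated against it before you ever see it.
```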
00:14:43.800 | Any thoughts on single versus parallel function calling?
00:14:48.480 | When I first started-- so I did a presentation
00:14:50.640 | at our AI in Action Discord channel,
00:14:54.200 | and obviously, showcased Instructor.
00:14:57.680 | One of the big things that we had before
00:14:59.580 | with single function calling is like when
00:15:01.240 | you're trying to extract lists, you
00:15:03.040 | have to make these funky properties that
00:15:05.240 | are lists to then actually return all the objects.
00:15:08.840 | How do you see the hack being put on the developer's plate
00:15:13.880 | versus more of the stuff just getting better in the model?
00:15:17.280 | And I know you tweeted recently about Anthropic, for example,
00:15:21.200 | some lists are not lists, they're strings.
00:15:22.960 | And there's all of these discrepancies.
00:15:25.720 | I almost would prefer it if it was always
00:15:28.120 | a single function call.
00:15:29.120 | But obviously, there is the agents workflows
00:15:31.400 | that Instructor doesn't really support that well,
00:15:34.120 | but are things that ought to be done.
00:15:36.520 | You could define, I think, maybe like 50 or 60
00:15:40.320 | different functions in a single API call.
00:15:43.200 | And if it was like get the weather, or turn the lights on,
00:15:45.720 | or do something else, it makes a lot of sense
00:15:47.060 | to have these parallel function calls.
00:15:48.840 | But in terms of an extraction workflow,
00:15:50.520 | I definitely think it's probably more helpful to have
00:15:53.560 | everything be a single schema.
00:15:56.480 | Just because you can specify relationships
00:15:58.520 | between these entities that you can't do in parallel function
00:16:01.800 | calling, you can have a single chain of thought
00:16:06.840 | before you generate a list of results.
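A sketch of that single-schema extraction idea, with one shared chain-of-thought field and entities that can reference each other; the field names are illustrative assumptions:

```python
# One function call, one schema: reason once, then emit a list of
# entities that can point at each other, which parallel tool calls
# cannot express. Field names are illustrative.
from typing import List
from pydantic import BaseModel, Field


class Entity(BaseModel):
    id: int
    name: str
    depends_on: List[int] = Field(
        default_factory=list,
        description="ids of other entities this one depends on",
    )


class Extraction(BaseModel):
    chain_of_thought: str = Field(
        description="think about the text before listing entities"
    )
    entities: List[Entity]
```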
00:16:09.680 | There's small API differences, right?
00:16:11.960 | Where, yeah, if it's for parallel function calling,
00:16:15.840 | if you do one, again, I really care about how the SDK looks.
00:16:21.120 | And so it's, OK, do I always return a list of functions,
00:16:23.640 | or do you just want to have the actual object back out?
00:16:26.080 | You want to have autocomplete over that object.
00:16:28.200 | What's the cap for how many function
00:16:31.320 | definitions you can put in where it still works well?
00:16:34.040 | Do you have any sense on that?
00:16:35.640 | I mean, for the most part, I haven't really
00:16:37.440 | had a need to do anything that's more than six or seven
00:16:40.560 | different functions.
00:16:41.400 | I think in the documentation, they support way more.
00:16:44.200 | But I don't even know if there's any good evals
00:16:46.880 | that have over two dozen function calls.
00:16:50.840 | I think if you run into issues where
00:16:53.320 | you have 20, or 50, or 60 function calls,
00:16:56.060 | I think you're much better having those specifications
00:16:58.760 | saved in a vector database, and then have them be retrieved.
00:17:02.700 | So if there are 30 tools, you should basically
00:17:04.620 | be ranking them, and then using the top K
00:17:07.660 | to do selection a little bit better,
00:17:10.220 | rather than just shoving 60 functions into a single API.
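A sketch of that retrieve-then-select idea: embed the stored tool specifications, rank them against the user query, and only pass the top K into the API call. The embedding model choice and the OpenAI-style tool spec layout are assumptions here:

```python
# Rank stored tool specs by cosine similarity to the query and keep the
# top K, instead of sending all 60 definitions on every request.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def top_k_tools(query: str, tool_specs: list[dict], k: int = 5) -> list[dict]:
    # assumes OpenAI-style specs: {"type": "function", "function": {...}}
    descriptions = [t["function"]["description"] for t in tool_specs]
    tool_vecs = embed(descriptions)
    query_vec = embed([query])[0]
    sims = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [tool_specs[i] for i in np.argsort(-sims)[:k]]  # pass these as `tools`
```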
00:17:13.860 | Yeah.
00:17:14.740 | Well, I mean, so I think this is relevant now,
00:17:17.060 | because previously, I think context limits prevented you
00:17:20.260 | from having more than a dozen tools anyway.
00:17:24.060 | And now that we have a million token context windows,
00:17:28.380 | Cloud recently, with their new function calling release,
00:17:30.820 | said they can handle over 250 tools, which is insane to me.
00:17:34.980 | That's a lot.
00:17:37.740 | I would say, you're saying you don't think
00:17:40.180 | there's many people doing that.
00:17:41.620 | I think anyone with a sort of agent-like platform where
00:17:44.300 | you have a bunch of connectors, they
00:17:46.700 | wouldn't run into that problem.
00:17:48.100 | Probably, you're right that they should use a vector database
00:17:50.640 | and kind of rag their tools.
00:17:53.260 | I know Zapier has like a few thousand,
00:17:54.860 | like 8,000, 9,000 connectors that obviously
00:17:57.700 | don't fit anywhere.
00:17:59.060 | So yeah, I mean, I think that would
00:18:00.940 | be it, unless you need some kind of intelligence
00:18:03.300 | that chains things together, which is, I think,
00:18:05.420 | what Alessio is coming back to.
00:18:07.780 | There is this trend about parallel function calling.
00:18:10.540 | I don't know what I think about that.
00:18:12.160 | Anthropic's version was--
00:18:14.060 | I think they used multiple tools in sequence,
00:18:16.300 | but they're not in parallel.
00:18:18.100 | I haven't explored this at all.
00:18:19.420 | I'm just throwing this open to you
00:18:20.940 | as to what do you think about all these things.
00:18:22.940 | You know, do we assume that all function calls
00:18:25.140 | could happen in any order?
00:18:26.940 | I think there's a lot of--
00:18:29.500 | in which case, we either can assume that,
00:18:32.140 | or we can assume that things need to happen in some kind
00:18:34.540 | of sequence as a DAG.
00:18:35.780 | But if it's a DAG, really, that's just one JSON object
00:18:38.420 | that is the entire DAG, rather than going, OK,
00:18:41.140 | the order of the function that I return don't matter.
00:18:44.240 | That's just-- that's definitely just not true in practice.
00:18:47.420 | If I have a thing that's like, turn the lights on,
00:18:49.500 | unplug the power, and then turn the toaster on or something,
00:18:52.020 | the order clearly matters.
00:18:55.140 | And it's unclear how well you can
00:18:57.380 | describe the importance of that reasoning to a language model
00:19:01.740 | I mean, I'm sure you can do it with good enough prompting,
00:19:04.380 | but I just haven't had any use cases where the function
00:19:07.300 | sequence really matters.
00:19:08.900 | Yeah.
00:19:09.500 | To me, the most interesting thing
00:19:10.860 | is the models are better at picking
00:19:13.980 | than your ranking is, usually.
00:19:16.020 | Like, I'm incubating a company around system integration.
00:19:19.500 | And for example, with one system,
00:19:21.500 | there are like 780 endpoints.
00:19:23.900 | And if you actually try and do vector similarity,
00:19:26.780 | it's not that good, because the people that wrote the specs
00:19:29.300 | didn't have in mind making them semantically apart.
00:19:32.980 | They're kind of like, oh, create this, create this, create this.
00:19:35.740 | Versus when you give it to a model, and you put--
00:19:38.020 | like in Opus, you put them all, it's
00:19:39.940 | quite good at picking which ones you should actually run.
00:19:43.300 | And I'm curious to see if the model providers actually
00:19:46.020 | care about some of those workflows,
00:19:47.900 | or if the agent companies are actually
00:19:49.820 | going to build very good rankers to kind of fill that gap.
00:19:54.340 | Yeah, my money is on the rankers,
00:19:55.940 | because you can do those so easily.
00:19:58.340 | You could just say, well, given the embeddings of my search
00:20:01.500 | query and the embeddings of the description,
00:20:04.740 | I can just train XGBoost and just make sure
00:20:06.700 | that I have very high MRR, which is mean reciprocal rank.
00:20:10.620 | And so the only objective is to make sure
00:20:13.080 | that the tools you use are in the top-K filter.
00:20:17.020 | That feels super straightforward,
00:20:18.620 | and you don't have to actually figure out
00:20:19.740 | how to fine tune a language model
00:20:21.120 | to do tool selection anymore.
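The evaluation side of that is simple too. A sketch of mean reciprocal rank over a small labeled set, where `rank_tools` is whatever ranker you train (XGBoost over embedding features, or plain cosine similarity); the signature is an assumption for illustration:

```python
# MRR: for each labeled query, average 1 / (rank of the correct tool).
# The only objective is that the right tool lands near the top.
from typing import Callable, List, Tuple


def mean_reciprocal_rank(
    examples: List[Tuple[str, str]],          # (query, correct tool name)
    rank_tools: Callable[[str], List[str]],   # tool names, best first
) -> float:
    total = 0.0
    for query, correct in examples:
        ranking = rank_tools(query)
        if correct in ranking:
            total += 1.0 / (ranking.index(correct) + 1)
    return total / len(examples)
```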
00:20:23.540 | Yeah, I definitely think that's the case.
00:20:25.260 | Because for the most part, I imagine
00:20:27.500 | you either have less than three tools or more than 1,000.
00:20:32.620 | I don't know what kind of companies say, oh, thank God,
00:20:34.940 | we only have like 185 tools.
00:20:37.900 | And this works perfectly, right?
00:20:40.180 | That's right.
00:20:41.340 | And before we maybe move on just from this,
00:20:44.420 | it was interesting to me you retweeted this thing
00:20:46.580 | about Anthropic function calling, and it
00:20:48.460 | was Joshua Brown retweeting some benchmark that's like,
00:20:52.500 | oh my God, Anthropic function calling, so good.
00:20:55.500 | And then you retweeted it, and then you tweeted later,
00:20:58.020 | and it's like, it's actually not that good.
00:21:00.380 | What's your flow for like, how do you actually
00:21:03.060 | test these things?
00:21:03.860 | Because obviously, the benchmarks are lying, right?
00:21:06.260 | Because the benchmark says it's good, and you said it's bad,
00:21:08.780 | and I trust you more than the benchmark.
00:21:10.820 | How do you think about that, and then how
00:21:12.500 | do you evolve it over time?
00:21:14.740 | Yeah, it's mostly just client data.
00:21:17.780 | I think when-- I actually have been
00:21:19.620 | mostly busy with enough client work
00:21:21.340 | that I haven't been able to reproduce public benchmarks,
00:21:23.860 | and so I can't even share some of the results of Anthropic.
00:21:26.620 | But I would just say, in production,
00:21:28.820 | we have some pretty interesting schemas,
00:21:31.660 | where it's iteratively building lists, where we're
00:21:35.180 | doing updates of lists, like we're doing in-place updates,
00:21:38.580 | so upserts and inserts.
00:21:40.660 | And in those situations, we're like, oh, yeah,
00:21:42.580 | we have a bunch of different parsing errors.
00:21:44.380 | Numbers are being returned as strings.
00:21:46.020 | We were expecting lists of objects,
00:21:47.620 | but we're getting strings that are like the strings of JSON.
00:21:51.700 | So we had to call JSON parse on individual elements.
00:21:57.420 | Overall, I'm super happy with the Anthropic models
00:22:00.580 | compared to the OpenAI models.
00:22:01.820 | Like, Sonnet is very cost-effective.
00:22:04.020 | Haiku is-- in function calling, it's actually better.
00:22:08.140 | But I think we just had to file down the edges a little bit,
00:22:10.900 | where our tests pass, but then when we actually
00:22:13.660 | deploy to production, we get half a percent of traffic
00:22:17.940 | having issues, where if you ask for JSON,
00:22:20.500 | it'll still try to talk to you.
00:22:22.380 | Or if you use function calling, we'll have a parse error.
00:22:25.340 | And so I think these are things that are definitely
00:22:27.460 | going to be things that are fixed in the upcoming weeks.
00:22:30.780 | But in terms of the reasoning capabilities, man,
00:22:34.220 | it's hard to beat 70% cost reduction,
00:22:38.300 | especially when you're building consumer applications.
00:22:41.100 | If you're building something for consultants or private equity,
00:22:43.380 | you're charging $400.
00:22:44.500 | It doesn't really matter if it's $1 or $2.
00:22:47.340 | But for consumer apps, it makes products viable.
00:22:51.140 | If you can go from GPT-4 to Sonnet, you
00:22:53.180 | might actually be able to price it better.
00:22:55.660 | I had this chart about the ELO versus the cost
00:22:59.260 | of all the models.
00:23:00.700 | And you could put trend graphs on each of those things
00:23:05.620 | about higher ELO equals higher cost, except for Haiku.
00:23:08.620 | Haiku kind of just broke the trend lines, or the iso-Elos,
00:23:11.900 | if you want to call it that.
00:23:15.460 | Cool.
00:23:16.180 | Before we go too far into your opinions
00:23:18.900 | on just the overall ecosystem, I want
00:23:21.220 | to make sure that we map out the surface area of Instructor.
00:23:23.940 | I would say that most people would
00:23:25.820 | be familiar with Instructor from your talks,
00:23:28.180 | and your tweets, and all that.
00:23:30.260 | You had the number one talk from the AI Engineer Summit.
00:23:34.140 | Two Lius, Jason Liu and Jerry Liu.
00:23:36.300 | Yeah, yeah, yeah.
00:23:38.720 | Start with a J and then a Liu to do well.
00:23:42.600 | But yeah, until I actually went through your cookbook,
00:23:45.520 | I didn't realize the surface area.
00:23:47.640 | How would you categorize the use cases?
00:23:50.760 | You have LLM self-critique.
00:23:53.520 | You have knowledge graphs in here.
00:23:55.040 | You have PII data sanitization.
00:23:57.760 | How do you characterize the people?
00:23:59.260 | What is the surface area of Instructor?
00:24:01.440 | Yeah, so this is the part that feels crazy.
00:24:03.720 | Because really, the difference is LLMs give you strings,
00:24:06.720 | and Instructor gives you data structures.
00:24:08.780 | And once you get data structures again,
00:24:10.360 | you can do every LeetCode problem you ever thought of.
00:24:14.160 | And so I think there's a couple of really common applications.
00:24:16.960 | The first one, obviously, is extracting structured data.
00:24:20.200 | This is just be, OK, well, I want
00:24:22.200 | to put in an image of a receipt.
00:24:24.080 | I want to give back out a list of checkout items
00:24:26.560 | with a price, and a fee, and a coupon code, or whatever.
00:24:30.080 | That's one application.
00:24:31.640 | Another application really is around extracting graphs out.
00:24:36.560 | So one of the things we found out about these language models
00:24:38.640 | is that not only can you define nodes,
00:24:40.680 | it's really good at figuring out what are nodes
00:24:43.000 | and what are edges.
00:24:44.560 | And so we have a bunch of examples where not only
00:24:48.160 | do I extract that this happens after that, but also, OK,
00:24:52.600 | these two are dependencies of another task.
00:24:55.280 | And you can do extracting complex entities
00:24:58.280 | that have relationships.
00:24:59.720 | Given a story, for example, you could extract relationships
00:25:02.340 | of families across different characters.
00:25:04.480 | This can all be done by defining a graph.
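A sketch of what "defining a graph" can mean here: a nodes-and-edges response model that the language model fills in. Class and field names are illustrative, not a fixed Instructor API:

```python
# Nodes and edges as a response model: "this happens after that",
# "these two depend on another task", family relationships, and so on.
from typing import List
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str


class Edge(BaseModel):
    source: int
    target: int
    relationship: str = Field(
        description="e.g. 'happens after', 'depends on', 'parent of'"
    )


class KnowledgeGraph(BaseModel):
    nodes: List[Node]
    edges: List[Edge]
```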
00:25:07.200 | And then the last really big application really
00:25:09.600 | is just around query understanding.
00:25:12.320 | The idea is that any API call has some schema.
00:25:16.240 | And if you can define that schema ahead of time,
00:25:18.200 | you can use a language model to resolve a request
00:25:20.720 | into a much more complex request, one
00:25:24.200 | that an embedding could not do.
00:25:25.680 | So for example, I have a really popular post called,
00:25:28.120 | like, "RAG Is More Than Embeddings."
00:25:29.920 | And effectively, if I have a question like this,
00:25:32.200 | what was the latest thing that happened this week?
00:25:35.200 | That embeds to nothing.
00:25:38.400 | But really, that query should just
00:25:40.080 | be select all data where the date time is between today
00:25:43.680 | and today minus seven days.
00:25:47.480 | What if I said, how did my writing
00:25:50.160 | change between this month and last month?
00:25:52.120 | Again, embeddings would do nothing.
00:25:55.600 | But really, if you could do a group by over the month
00:25:58.000 | and a summarize, then you could, again,
00:26:00.080 | do something much more interesting.
00:26:01.840 | And so this really just calls out the fact
00:26:03.560 | that embeddings really is kind of like the lowest
00:26:05.800 | hanging fruit.
00:26:06.640 | And using something like Instructor
00:26:08.220 | can really help produce a data structure.
00:26:11.220 | And then you can just use your computer science
00:26:13.220 | to reason about this data structure.
00:26:14.720 | Maybe you say, OK, well, I'm going
00:26:16.200 | to produce a graph where I want to group by each month
00:26:19.000 | and then summarize them jointly.
00:26:20.800 | You can do that if you know how to define this data structure.
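A sketch of that query-understanding schema: resolve the question into date ranges, group-bys, and a rewritten query rather than a raw embedding lookup. Field names and enum values are illustrative assumptions:

```python
# "What happened this week?" becomes a date filter; "how did my writing
# change between this month and last month?" becomes a group-by-month
# plus summarize. Plain embeddings cannot express either.
from datetime import date
from enum import Enum
from typing import Optional
from pydantic import BaseModel


class GroupBy(str, Enum):
    none = "none"
    week = "week"
    month = "month"


class SearchQuery(BaseModel):
    rewritten_query: str
    start_date: Optional[date] = None   # e.g. today minus seven days
    end_date: Optional[date] = None
    group_by: GroupBy = GroupBy.none
    summarize: bool = False
```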
00:26:24.240 | In that part, you kind of run up against the LangChains of the world
00:26:28.080 | that used to have that.
00:26:30.480 | They still do have the self-querying,
00:26:32.360 | I think they used to call it, when we had
00:26:34.720 | Harrison on in our episode.
00:26:36.480 | How do you see yourself interacting
00:26:38.120 | with the other, I guess, LLM frameworks in the ecosystem?
00:26:42.000 | - Yeah, I mean, if they use Instructor,
00:26:43.880 | I think that's totally cool.
00:26:45.040 | I think because it's just, again, it's just Python.
00:26:48.160 | It's asking, oh, how does Django interact with requests?
00:26:51.320 | Well, you just might make a request.get in a Django app.
00:26:56.560 | But no one would say, oh, I went off of Django
00:26:59.920 | because I'm using requests now.
00:27:01.840 | That should, ideally, be the wrong comparison.
00:27:05.440 | In terms of especially the agent workflows,
00:27:07.680 | I think the real goal for me is to go down the LLM compiler
00:27:12.080 | route, which is instead of doing a React-type reasoning loop,
00:27:18.160 | I think my belief is that we should be using workflows.
00:27:23.560 | If we do this, then we always have
00:27:25.280 | a request and a complete workflow.
00:27:26.920 | We can fine-tune a model that has a better workflow.
00:27:29.320 | Whereas it's hard to think about how do you
00:27:31.160 | fine-tune a better React loop.
00:27:33.600 | Do you want to always train it to have less looping?
00:27:36.920 | In which case, you want it to get the right answer
00:27:38.960 | the first time, in which case, it
00:27:40.360 | was a workflow to begin with.
00:27:42.800 | - Can you define workflow?
00:27:44.240 | Because I think, obviously, I used
00:27:46.160 | to work at a workflow company, but I'm not sure
00:27:48.280 | this is a well-defined framework for everybody.
00:27:49.880 | - I'm thinking workflow in terms of the Prefect or Zapier
00:27:53.280 | sense of workflow.
00:27:54.240 | I want to build a DAG.
00:27:55.400 | I want you to tell me what the nodes and edges are.
00:27:57.600 | And then maybe the edges are also put in with AI.
00:28:03.040 | But the idea is that I want to be
00:28:04.480 | able to present you the entire plan
00:28:06.200 | and then ask you to fix things as I execute it,
00:28:09.600 | rather than going, hey, I couldn't parse the JSON,
00:28:12.560 | so I'm going to try again.
00:28:13.840 | I couldn't parse the JSON, I'm going to try again.
00:28:15.640 | And then next thing you know, you spent $2 on OpenAI credits.
00:28:20.040 | Whereas with the plan, you can just
00:28:21.600 | say, oh, the edge between node x and y does not run.
00:28:27.840 | Let me just iteratively try to fix that component.
00:28:30.720 | Once that's fixed, go on to the next component.
00:28:33.800 | And obviously, you can get into a world where,
00:28:36.320 | if you have enough examples of the nodes x and y,
00:28:39.240 | maybe you can use a vector database
00:28:41.080 | to find a good few-shot examples.
00:28:43.280 | You can do a lot if you break down
00:28:45.320 | the problem into that workflow and execute in that workflow,
00:28:49.600 | rather than looping and hoping the reasoning is good enough
00:28:52.280 | to generate the correct output.
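A sketch of the workflow-over-looping idea: have the model emit the whole plan as a DAG of tasks, then execute and repair individual nodes instead of re-running an open-ended reasoning loop. The task fields and helper method are illustrative:

```python
# Plan first, then execute: each task names its dependencies, so a failed
# edge can be retried in isolation rather than restarting the whole loop.
from typing import List, Set
from pydantic import BaseModel, Field


class Task(BaseModel):
    id: int
    description: str
    depends_on: List[int] = Field(default_factory=list)


class Plan(BaseModel):
    tasks: List[Task]

    def runnable(self, done: Set[int]) -> List[Task]:
        # tasks whose dependencies are all finished can run next
        return [
            t for t in self.tasks
            if t.id not in done and all(d in done for d in t.depends_on)
        ]
```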
00:28:55.120 | Yeah, I would say I've been hammering on Devin a lot.
00:28:59.200 | I got access a couple of weeks ago.
00:29:01.680 | And obviously, for simple tasks, it does well.
00:29:06.120 | For the complicated, more than 10, 20-hour tasks,
00:29:10.800 | I can see it--
00:29:11.520 | That's a crazy comparison.
00:29:13.020 | We used to talk about three, four loops.
00:29:16.920 | Only once it gets to hour tasks, it's hard.
00:29:20.040 | Yeah.
00:29:21.000 | Less than an hour, there's nothing.
00:29:24.360 | That's crazy.
00:29:25.600 | I mean, I don't know.
00:29:26.520 | Yeah, OK, maybe my goalposts have shifted.
00:29:29.000 | I don't know.
00:29:30.400 | That's incredible.
00:29:32.520 | I'm like sub-one-minute executions.
00:29:34.760 | The fact that you're talking about 10 hours is incredible.
00:29:37.680 | I think it's a spectrum.
00:29:39.120 | I actually-- I really, really--
00:29:40.680 | I think I'm going to say this every single time
00:29:42.600 | I bring up Devin.
00:29:43.480 | Let's not reward them for taking longer to do things.
00:29:45.880 | Do you know what I mean?
00:29:46.880 | Like, that's a metric that is easily abusable.
00:29:51.280 | Sure.
00:29:51.800 | Yeah.
00:29:52.280 | You can run a game.
00:29:53.800 | Yeah, but all I'm saying is you can monotonically
00:29:56.400 | increase the success probability over an hour.
00:30:00.960 | That's winning to me.
00:30:02.000 | Obviously, if you run an hour and you've made no progress--
00:30:04.880 | like, I think when we were in Auto-GPT land,
00:30:07.440 | there was that one example where I wanted it to buy me
00:30:10.600 | a bicycle.
00:30:11.160 | And overnight, I spent $7 on credits,
00:30:13.400 | and I never found the bicycle.
00:30:14.920 | Yeah, yeah.
00:30:16.160 | I wonder if you'll be able to purchase a bicycle.
00:30:18.760 | Because it actually can do things in the real world,
00:30:21.160 | it just needs to suspend to you for auth and stuff.
00:30:24.200 | But the point I was trying to make
00:30:26.020 | was that I can see it turning plans.
00:30:28.280 | Like, when it gets on--
00:30:29.520 | I think one of the agents' loopholes,
00:30:32.560 | or one of the things that is a real barrier for agents
00:30:34.840 | is LLMs really like to get stuck into a lane.
00:30:37.840 | And what you're talking about, what I've seen Devin do
00:30:42.040 | is it gets stuck in a lane, and it will just
00:30:43.960 | kind of change plans based on the performance of the plan
00:30:47.680 | itself.
00:30:49.960 | And it's kind of cool.
00:30:51.280 | Yeah, I feel like we've gone too much in the looping route.
00:30:53.840 | And I think a lot of more plans and DAGs and data structures
00:30:56.880 | are probably going to come back to help fill in some holes.
00:30:59.720 | Yeah.
00:31:00.240 | What's the interface to that?
00:31:02.600 | Do you see it's like an existing state machine kind of thing
00:31:06.360 | that connects to the LLMs, the traditional DAG player?
00:31:10.680 | So do you think we need something new for AI DAGs?
00:31:15.200 | Yeah, I mean, I think that the hard part is
00:31:17.320 | going to be describing visually the fact
00:31:19.640 | that this DAG can also change over time,
00:31:22.320 | and it should still be allowed to be fuzzy, right?
00:31:27.240 | I think in mathematics, we have plate diagrams, and Markov chain
00:31:30.560 | diagrams, and recurrence states, and all that.
00:31:32.840 | Some of that might come into this workflow world.
00:31:35.040 | But to be honest, I'm not too sure.
00:31:36.920 | I think right now, the first steps
00:31:39.160 | are just how do we take this DAG idea
00:31:41.680 | and break it down to modular components
00:31:43.720 | that we can prompt better, have few-shot examples for,
00:31:47.280 | and ultimately fine-tune against.
00:31:49.880 | But in terms of even the UI, it's
00:31:51.240 | hard to say what we'll likely win.
00:31:53.480 | I think people like Prefect and Zapier
00:31:55.600 | have a pretty good shot at doing a good job.
00:31:57.720 | Yeah.
00:31:58.320 | So you seem to use Prefect a lot.
00:31:59.800 | Actually, I used to work at a Prefect competitor, Temporal.
00:32:02.160 | And I'm also very familiar with Dagster.
00:32:06.480 | What else would you call out as particularly interesting
00:32:09.200 | in the AI engineering stack?
00:32:12.280 | Man, I almost use nothing.
00:32:15.120 | I just use Cursor and PyTests.
00:32:19.160 | Oh, OK.
00:32:20.720 | I think that's basically it.
00:32:22.440 | A lot of the observability companies have--
00:32:25.520 | the more observability companies I've tried,
00:32:28.400 | the more I just use Postgres.
00:32:30.920 | Really?
00:32:32.160 | Postgres for observability?
00:32:34.600 | But the issue, really, is the fact
00:32:36.160 | that these observability companies isn't actually
00:32:38.920 | doing observability for the system.
00:32:40.520 | It's just doing the LLM thing.
00:32:42.640 | I still end up using Datadog or Sentry to do latency.
00:32:48.440 | And so I just have those systems handle it.
00:32:50.400 | And then the prompt-in, prompt-out latency token costs,
00:32:54.360 | I just put that in a Postgres table now.
00:32:56.320 | So you don't need 20 funded startups building LLM ops?
00:33:01.480 | Yeah, but I'm also an old, tired guy.
00:33:04.200 | Because of my background, I was like, yeah,
00:33:09.320 | the Python stuff I'll write myself.
00:33:10.800 | But I will also just use Vercel happily.
00:33:14.640 | Because I'm just not familiar with that world of tooling.
00:33:19.280 | Whereas I think I spent three good years building
00:33:22.520 | observability tools for recommendation systems.
00:33:24.760 | And I was like, oh, compared to that,
00:33:27.720 | Instructor is just one call.
00:33:29.600 | I just have to put time start, time end,
00:33:31.760 | and then count the prompt token.
00:33:34.040 | Because I'm not doing a very complex looping behavior.
00:33:36.280 | I'm doing mostly workflows and extraction.
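A sketch of that "just put it in Postgres" habit: record start time, end time, and token counts per call, and leave system-level latency and errors to Datadog or Sentry as described. The table and column names are assumptions:

```python
# One insert per LLM call: model, token counts, and wall-clock latency.
# Table and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app")  # reuse the application's database


def log_llm_call(model: str, prompt_tokens: int, completion_tokens: int,
                 started: float, finished: float) -> None:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO llm_calls
                (model, prompt_tokens, completion_tokens, latency_ms, created_at)
            VALUES (%s, %s, %s, %s, now())
            """,
            (model, prompt_tokens, completion_tokens,
             int((finished - started) * 1000)),
        )
```

In practice you would wrap each completion with a monotonic clock before and after and read the token counts off the response's usage field.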
00:33:40.440 | Yeah, I mean, while we're on this topic,
00:33:42.520 | we'll just kind of get this out of the way.
00:33:44.360 | You famously have decided to not be a venture-backed company.
00:33:48.360 | You want to do the consulting route.
00:33:51.320 | The obvious route for someone as successful as Instructor
00:33:53.880 | is like, oh, here's hosted Instructor with all tooling.
00:33:57.360 | And you just said you had a whole bunch of experience
00:33:59.640 | building observability tooling.
00:34:01.120 | You have the perfect background to do this, and you're not.
00:34:04.080 | Yeah, isn't that sick?
00:34:05.760 | I think that's sick.
00:34:06.600 | I know.
00:34:07.080 | I mean, I know why, because you want to go free dive.
00:34:09.440 | But--
00:34:09.920 | Yeah, well, yeah, because I think there's two things.
00:34:13.880 | One, it's like, if I tell myself I want to build requests,
00:34:17.160 | requests is not a venture-backed startup.
00:34:19.760 | I mean, one could argue whether or not Postman is.
00:34:22.400 | But I think for the most part, having worked so much,
00:34:25.960 | I'm kind of, like, I am more interested in looking
00:34:32.160 | at how systems are being applied and just having access
00:34:35.160 | to the most interesting data.
00:34:36.360 | And I think I can do that more through a consulting business
00:34:38.800 | where I can come in and go, oh, you
00:34:40.840 | want to build perfect memory.
00:34:42.040 | You want to build an agent.
00:34:43.120 | You want to build, like, automations over construction
00:34:45.420 | or, like, insurance and the supply chain.
00:34:47.400 | Or you want to handle, like, writing, like,
00:34:50.640 | private equity, like, mergers and acquisitions
00:34:52.920 | reports based off of user interviews.
00:34:54.840 | Those things are super fun.
00:34:56.840 | Whereas, like, maintaining the library, I think,
00:34:59.360 | is mostly just kind of, like, a utility
00:35:01.320 | that I try to keep up, especially because if it's not
00:35:04.120 | venture-backed, I have no reason to sort of go
00:35:06.720 | down the route of, like, trying to get 1,000 integrations.
00:35:10.160 | Like, in my mind, I just go, oh, OK, 98% of the people
00:35:13.620 | use OpenAI.
00:35:14.400 | I'll support that.
00:35:15.200 | And if someone contributes another, like, platform,
00:35:17.760 | that's great.
00:35:18.520 | I'll merge it in.
00:35:19.800 | But yeah, I mean, you only added Anthropic support, like,
00:35:22.520 | this year.
00:35:23.920 | Yeah, yeah, yeah.
00:35:24.840 | The thing, a lot of it was just, like,
00:35:26.840 | you couldn't even get an API key until, like, this year, right?
00:35:29.160 | Yeah, that's true, that's true.
00:35:30.480 | And so, OK, if I had added it, like, last year, I would have
00:35:33.120 | been, like, doubling the code base to service,
00:35:35.600 | you know, half a percent of all downloads.
00:35:38.000 | Do you think the market share will shift a lot now
00:35:40.040 | that Anthropic has, like, a very, very competitive offering?
00:35:43.880 | I think it's still hard to get API access.
00:35:48.240 | I don't know if it's fully GA now, if it's GA,
00:35:50.600 | if you can get commercial access really easily.
00:35:54.240 | I don't know.
00:35:54.800 | I got commercial access after, like, two weeks of reaching out to their sales team.
00:35:57.880 | OK, yeah, so two weeks.
00:35:58.920 | Yeah, there's a call list here.
00:36:00.520 | And then anytime you run into rate limits,
00:36:02.640 | just, like, ping one of the Anthropic staff members.
00:36:05.480 | Then maybe we need to, like, cut that part out
00:36:07.160 | so I don't need to, like, you know, read false news.
00:36:09.280 | But it's a common question.
00:36:10.880 | Surely, just from the price perspective,
00:36:12.560 | it's going to make a lot of sense.
00:36:14.800 | Like, if you are a business, you should totally
00:36:18.400 | consider, like, Sonnet, right?
00:36:21.400 | Like, the cost savings is just going to justify it
00:36:24.320 | if you actually are doing things at volume.
00:36:26.360 | And yeah, I think their SDK is, like, pretty good.
00:36:29.880 | But back to the instructor thing,
00:36:31.280 | I just don't think it's a billion-dollar company.
00:36:33.600 | And I think if I raise money, the first question is going to be, like,
00:36:35.880 | how are you going to get a billion-dollar company?
00:36:37.120 | And I would just go, like, man, like,
00:36:38.840 | if I make a million dollars as a consultant, I'm super happy.
00:36:41.560 | I'm, like, more than ecstatic.
00:36:43.080 | I can have, like, a small staff of, like, three people.
00:36:46.000 | Like, it's fun.
00:36:47.720 | And I think a lot of my happiest founder friends
00:36:49.680 | are those who, like, raised the tiniest seed round,
00:36:52.080 | became profitable, they're making, like, 60,
00:36:56.440 | 70,000 MRR.
00:36:58.840 | And they're, like, we don't even need to raise the seed round.
00:37:00.680 | Like, let's just keep it, like, between me and my co-founder,
00:37:03.600 | we'll go traveling, and it'll be a great time.
00:37:05.960 | I think it's a lot of fun.
00:37:07.520 | - I repeat that as a seed investor in the company.
00:37:10.680 | I think that's, like, one of the things that people get wrong sometimes,
00:37:14.000 | and I see this a lot.
00:37:15.960 | They have an insight into, like, some new tech,
00:37:18.640 | like, say LLM, say AI, and they build some open source stuff,
00:37:21.840 | and it's like, I should just raise money and do this.
00:37:24.200 | And I tell people a lot, it's like, look, you can make a lot more money
00:37:27.440 | doing something else than doing a startup.
00:37:29.000 | Like, most people that do a company
00:37:30.720 | could make a lot more money just working somewhere else
00:37:33.360 | than doing the company itself.
00:37:34.720 | Do you have any advice for folks
00:37:37.200 | that are maybe in a similar situation?
00:37:38.640 | They're trying to decide, oh, should I stay in my, like, high-paid fang job
00:37:42.640 | and just tweet this on the side and do this on GitHub?
00:37:45.680 | Should I be a consultant?
00:37:47.160 | Like, being a consultant seems like a lot of work.
00:37:49.440 | It's like, you got to talk to all these people, you know?
00:37:52.760 | - There's a lot to unpack,
00:37:54.480 | because I think the open source thing is just like,
00:37:56.000 | well, I'm just doing it for, like, purely for fun,
00:37:58.720 | and I'm doing it because I think I'm right.
00:38:00.840 | But part of being right
00:38:02.800 | is the fact that it's not a venture-backed startup.
00:38:05.520 | Like, I think I'm right because this is all you need.
00:38:10.040 | Right? Like, you know.
00:38:12.760 | So I think a part of it is just, like, part of the philosophy
00:38:15.680 | is the fact that all you need is a very sharp blade
00:38:17.920 | to sort of do your work,
00:38:19.320 | and you don't actually need to build, like, a big enterprise.
00:38:22.240 | So that's one thing.
00:38:23.200 | I think the other thing, too, that I've been thinking around,
00:38:25.760 | just because I have a lot of friends at Google
00:38:26.960 | that want to leave right now,
00:38:28.880 | it's like, man, like, what we lack is not money or, like, skill.
00:38:32.640 | Like, what we lack is courage.
00:38:34.520 | Like, you just have to do this, the hard thing,
00:38:38.040 | and you have to do it scared anyways, right?
00:38:40.160 | In terms of, like, whether or not you do want to do a founder,
00:38:41.960 | I think that's just a matter of, like, optionality.
00:38:44.040 | But I definitely recognize that the, like, expected value of being a founder
00:38:51.320 | is still quite low.
00:38:53.000 | - It is. - Right.
00:38:54.640 | Like, I know as many founder breakups
00:38:58.680 | and as I know friends who raised a seed round this year.
00:39:03.120 | Right? And, like, that is, like, the reality.
00:39:04.760 | And, like, you know, even from my perspective,
00:39:08.760 | it's been tough where it's like, oh, man, like,
00:39:11.080 | a lot of incubators want you to have co-founders.
00:39:12.880 | Now you spend half the time, like, fundraising
00:39:15.000 | and then trying to, like, meet co-founders
00:39:16.920 | and find co-founders rather than building the thing.
00:39:20.040 | And I was like, man, like, this is a lot of stuff,
00:39:23.840 | a lot of time spent out doing things I'm not really good at.
00:39:28.720 | I think, I do think there's a rising trend in solo founding.
00:39:32.560 | You know, I am a solo.
00:39:34.240 | I think that something like 30% of, like,
00:39:37.280 | I think, I forget what the exact stat is,
00:39:39.080 | something like 30% of startups that make it to, like,
00:39:41.160 | Series B or something actually are solo founders.
00:39:44.240 | So I think, I feel like this must-have co-founder idea
00:39:48.000 | mostly comes from YC and most, everyone else copies it.
00:39:52.080 | And then, yeah, you, like,
00:39:53.720 | plenty of companies break up over co-founder breakups.
00:39:56.080 | - Yeah, and I bet it would be, like,
00:39:57.360 | I wonder how much of it is the people
00:39:59.000 | who don't have that much, like,
00:40:00.560 | and I hope this is not a diss to anybody,
00:40:03.240 | but it's like, you sort of,
00:40:04.440 | you go through the incubator route
00:40:05.840 | because you don't have, like, the social equity
00:40:07.560 | you would need to just sort of, like,
00:40:09.000 | send an email to Sequoia and be, like,
00:40:10.800 | "Hey, I'm going on this ride.
00:40:13.960 | "Do you want a ticket on the rocket ship?"
00:40:15.680 | Right, like, that's very hard to sell.
00:40:17.200 | Like, if I was to raise money, like, that's kind of,
00:40:19.720 | like, my message if I was to raise money is, like,
00:40:21.960 | "You've seen my Twitter.
00:40:23.080 | "My life is sick.
00:40:24.360 | "I've decided to make it much worse by being a founder
00:40:27.120 | "because this is something I have to do.
00:40:29.560 | "So do you want to come along?
00:40:31.040 | "Otherwise, I'm gonna fund it myself."
00:40:33.160 | Like, if I can't say that, like, I don't need the money
00:40:35.440 | 'cause, like, I can, like, handle payroll
00:40:37.880 | and, like, hire an intern and get an assistant.
00:40:39.560 | Like, that's all fine.
00:40:41.040 | But, like, what I don't want to do, it's, like,
00:40:44.400 | I really don't want to go back to meta.
00:40:46.080 | I want to, like, get two years
00:40:47.800 | to, like, try to find a problem we're solving.
00:40:50.680 | That feels like a bad time.
00:40:51.840 | - Yeah.
00:40:52.680 | Jason is like, "I wear a YSL jacket
00:40:54.400 | "on stage at AI Engineer Summit.
00:40:56.080 | "I don't need your accelerator money."
00:40:58.560 | - And boots.
00:40:59.680 | You don't forget the boots.
00:41:00.640 | - That's true, that's true.
00:41:01.480 | - You have really good boots, really good boots.
00:41:04.080 | But I think that is a part of it, right?
00:41:06.840 | I think it is just, like, optionality.
00:41:08.120 | And also, just, like, I'm a lot older now.
00:41:10.320 | I think 22-year-old Jason
00:41:11.720 | would have been probably too scared,
00:41:13.360 | and now I'm, like, too wise.
00:41:15.200 | But I think it's a matter of, like,
00:41:17.080 | oh, if you raise money,
00:41:18.000 | you have to have a plan of spending it.
00:41:19.640 | And I'm just not that creative
00:41:21.200 | with spending that much money.
00:41:24.080 | - Yeah.
00:41:24.920 | I mean, to be clear,
00:41:25.760 | you just celebrated your 30th birthday.
00:41:26.840 | Happy birthday.
00:41:27.680 | - Yeah, it's awesome.
00:41:28.880 | I'm going to Mexico next weekend.
00:41:31.320 | - You know, a lot older is relative
00:41:32.680 | to some of the folks listening.
00:41:34.320 | (laughing)
00:41:35.960 | - Staying on the career tips,
00:41:38.560 | I think Swyx had a great post
00:41:40.400 | about are you too old to get into AI?
00:41:42.600 | I saw one of your tweets in January '23.
00:41:45.840 | You applied to, like, Figma, Notion, Cohere, Anthropic,
00:41:48.760 | and all of them rejected you
00:41:49.600 | because you didn't have enough LLM experience.
00:41:52.600 | I think at that time,
00:41:53.440 | it would be easy for a lot of people to say,
00:41:55.000 | oh, I kind of missed the boat, you know?
00:41:57.360 | I'm too late, not going to make it, you know?
00:42:01.200 | Any advice for people that feel like that, you know?
00:42:04.640 | - Yeah, I mean,
00:42:05.600 | like, the biggest learning here
00:42:07.560 | is actually from a lot of folks in jiu-jitsu.
00:42:09.600 | They're like, oh, man,
00:42:10.720 | is it too late to start jiu-jitsu?
00:42:11.960 | Like, oh, I'll join jiu-jitsu once I get in more shape.
00:42:16.120 | Right?
00:42:18.080 | It's like, there's a lot of, like, excuses.
00:42:19.840 | And then you say, oh, like, why should I start now?
00:42:21.640 | I'll be, like, 45 by the time I'm any good.
00:42:23.680 | And it's like, well, you'll be 45 anyways.
00:42:25.800 | Like, time is passing.
00:42:28.800 | Like, if you don't start now, you start tomorrow.
00:42:30.480 | You're just, like, one more day behind.
00:42:32.640 | And if you're, like, if you're worried about being behind,
00:42:34.440 | like, today is, like,
00:42:35.560 | the soonest you can start.
00:42:39.560 | Right?
00:42:40.400 | And so you got to recognize that,
00:42:41.240 | like, maybe you just don't want it, and that's fine too.
00:42:44.560 | Like, if you wanted it, you would have started.
00:42:46.880 | Like, you know.
00:42:48.200 | I think a lot of these people, again,
00:42:50.520 | probably think of things on a too short time horizon.
00:42:54.560 | But again, you know, you're going to be old anyways.
00:42:57.640 | You may as well just start now.
00:42:58.840 | - You know, one more thing on,
00:42:59.840 | I guess, the career advice slash sort of blogging.
00:43:04.840 | You always go viral for this post that you wrote
00:43:07.840 | on advice to young people and the lies you tell yourself.
00:43:10.040 | - Oh, yeah, yeah, yeah.
00:43:11.080 | - You said that you were writing it for your sister.
00:43:12.840 | Like, why is that?
00:43:13.680 | - Yeah, yeah, yeah.
00:43:14.520 | Yeah, she was, like, bummed out about, like, you know,
00:43:16.880 | going to college and, like, stressing about jobs.
00:43:19.040 | And I was like,
00:43:19.880 | oh, and I really want to hear, okay.
00:43:24.160 | And I just kind of, like, texted through the whole thing.
00:43:25.960 | It's crazy.
00:43:26.800 | It's got, like, 50,000 views.
00:43:28.080 | I'm like, I don't mind.
00:43:29.760 | - I mean, your average tweet has more.
00:43:32.800 | - But that thing is, like, you know,
00:43:36.760 | a 30-minute read now.
00:43:38.400 | - Yeah, yeah.
00:43:39.280 | So there's lots of stuff here, which I agree with.
00:43:41.080 | You know, I also occasionally indulge
00:43:43.480 | in the sort of life reflection phase.
00:43:46.400 | There's the how to be lucky.
00:43:48.080 | There's the how to have higher agency.
00:43:51.280 | I feel like the agency thing is always making a,
00:43:53.720 | is always a trend in SF or just in tech circles.
00:43:57.880 | - How do you define having high agency?
00:44:00.120 | - Yeah, I mean, I'm almost, like,
00:44:01.760 | past the high agency phase now.
00:44:03.520 | Now my biggest concern is, like,
00:44:05.440 | okay, the agency is just, like, the norm of the vector.
00:44:08.120 | What also matters is the direction, right?
00:44:11.440 | It's, like, how pure is the shot?
00:44:13.800 | Yeah, I mean, I think agency is just a matter
00:44:15.680 | of, like, having courage and doing the thing.
00:44:17.240 | That's scary, right?
00:44:18.960 | Like, you know, if you want to go rock climbing,
00:44:21.080 | it's, like, do you decide you want to go rock climbing,
00:44:24.160 | and then you show up to the gym,
00:44:25.040 | you rent some shoes, and you just fall 40 times?
00:44:26.880 | Or do you go, like, oh, like,
00:44:28.520 | I'm actually more intelligent.
00:44:29.720 | Let me go research the kind of shoes that I want.
00:44:32.120 | Okay, like, there's flatter shoes and more inclined shoes.
00:44:35.280 | Like, which one should I get?
00:44:36.320 | Okay, let me go order the shoes on Amazon.
00:44:38.920 | I'll come back in three days.
00:44:40.120 | Like, oh, it's a little bit too tight.
00:44:41.320 | Maybe it's too aggressive.
00:44:42.440 | I'm only a beginner.
00:44:43.280 | Let me go change.
00:44:44.800 | No, I think the higher agency person just, like,
00:44:46.680 | goes and, like, falls down 20 times, right?
00:44:48.920 | Yeah, I think the higher agency person
00:44:51.320 | is more focused on, like, process metrics
00:44:54.520 | versus outcome metrics, right?
00:44:57.880 | Like, from pottery, like, one thing I learned was
00:45:00.280 | if you want to be good at pottery,
00:45:01.280 | you shouldn't count, like,
00:45:02.120 | the number of cups or bowls you make.
00:45:04.320 | You should just weigh the amount of clay you use, right?
00:45:08.360 | Like, the successful person says,
00:45:09.560 | oh, I went through 1,000 pounds of clay,
00:45:11.360 | 100 pounds of clay, right?
00:45:13.360 | The less agency person's like, oh, I made six cups,
00:45:15.360 | and then after I made six cups,
00:45:17.360 | like, there's not really, what do you do next?
00:45:20.080 | No, just pounds of clay, pounds of clay.
00:45:22.800 | Same with the work here, right?
00:45:23.640 | It's like, oh, you just got to write the tweets,
00:45:25.200 | like, make the commits, contribute open source,
00:45:27.280 | like, write the documentation.
00:45:29.200 | There's no real outcome, it's just a process,
00:45:30.840 | and if you love that process,
00:45:31.840 | you just get really good at the thing you're doing.
00:45:34.160 | - Yeah, so just to push back on this,
00:45:36.120 | 'cause obviously I mostly agree,
00:45:38.800 | how would you design performance review systems?
00:45:41.440 | (laughing)
00:45:43.600 | Because you were effectively saying
00:45:45.960 | we can count lines of code for developers, right?
00:45:47.960 | Like, did you put out--
00:45:48.960 | - No, I don't think that would be the actual,
00:45:50.640 | like, I think if you make that an outcome,
00:45:52.360 | like, I can just expand a for loop, right?
00:45:54.520 | I think, okay, so for performance review,
00:45:57.000 | this is interesting because I've mostly thought of it
00:45:59.600 | from the perspective of science and not engineering.
00:46:02.920 | Like, I've been running a lot of engineering stand-ups,
00:46:06.220 | primarily because there's not really
00:46:07.400 | that many machine learning folks.
00:46:09.840 | Like, the process outcome is like experiments and ideas,
00:46:14.240 | right, like, if you think about outcomes,
00:46:15.480 | what you might want to think about an outcome is,
00:46:16.960 | oh, I want to improve the revenue or whatnot,
00:46:19.400 | but that's really hard.
00:46:21.000 | But if you're someone who is going out like,
00:46:22.640 | okay, like this week,
00:46:23.880 | I want to come up with like three or four experiments,
00:46:25.760 | I might move the needle.
00:46:26.600 | Okay, nothing worked.
00:46:27.600 | To them, they might think, oh, nothing worked, like, I suck.
00:46:30.920 | But to me, it's like, wow,
00:46:31.760 | you've closed off all these other possible avenues
00:46:34.480 | for, like, research.
00:46:36.520 | Like, you're gonna get to the place
00:46:37.800 | that you're gonna figure out that direction really soon,
00:46:40.720 | right, like, there's no way you'd try 30 different things
00:46:43.080 | and none of them work.
00:46:43.920 | Usually, like, you know, 10 of them work,
00:46:46.160 | five of them work really well,
00:46:47.320 | two of them work really, really well,
00:46:48.600 | and one thing was, like, you know,
00:46:51.200 | the nail on the head.
00:46:53.240 | So agency lets you sort of capture
00:46:55.200 | the volume of experiments.
00:46:56.680 | And, like, experience lets you figure out, like,
00:46:58.520 | oh, that other half, it's not worth doing, right?
00:47:01.800 | Like, I think experience is gonna go,
00:47:03.800 | half these prompting papers don't make any sense,
00:47:05.760 | just use a chain of thought and just, you know,
00:47:07.440 | use a for loop.
00:47:08.320 | But that's kind of, that's basically it, right?
00:47:12.000 | It's like, usually performance for me is around, like,
00:47:13.760 | how many experiments are you running?
00:47:16.000 | Like, how often are you trying?
00:47:18.320 | - Yeah.
00:47:19.480 | - When do you give up on an experiment?
00:47:21.200 | Because at Stitch Fix, you kind of give up
00:47:23.000 | on language models, I guess, in a way,
00:47:24.880 | and as a tool to use.
00:47:27.000 | And then maybe the tools got better.
00:47:29.080 | They got better before, you know,
00:47:30.840 | you were kind of like, you were right at the time
00:47:32.840 | and then the tool improved.
00:47:34.080 | I think there are similar paths in my engineering career
00:47:37.640 | where I try one approach and at the time it doesn't work
00:47:39.920 | and then the thing changes,
00:47:41.320 | but then I kind of soured on that approach
00:47:43.120 | and I don't go back to it soon enough.
00:47:45.360 | - I see.
00:47:46.200 | What do you think about that loop?
00:47:48.400 | - So usually when I, like, when I'm coaching folks
00:47:51.080 | and they say, like, oh, these things don't work,
00:47:52.800 | I'm not going to pursue them in the future.
00:47:54.120 | Like, one of the big things, like, hey,
00:47:55.480 | the negative result is a result
00:47:56.960 | and this is something worth documenting.
00:47:58.200 | Like, this isn't academia.
00:47:59.240 | Like, if it's negative, you don't just, like, not publish it.
00:48:02.440 | But then, like, what do you actually write down?
00:48:03.640 | Like, what you should write down is, like,
00:48:04.760 | here are the conditions.
00:48:06.320 | This is the inputs and the outputs
00:48:07.600 | we tried the experiment on.
00:48:09.760 | And then one thing that's really valuable
00:48:11.840 | is basically writing down under what conditions
00:48:14.720 | would I revisit these experiments, right?
00:48:18.000 | It's like, these things don't work
00:48:19.400 | because of what we had at the time.
00:48:21.520 | If someone is reading this two years from now,
00:48:23.440 | under what conditions will we try again?
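[Editor's note: a minimal sketch of the kind of experiment record described above; the field names and example values are hypothetical illustrations, not a schema prescribed in the episode.]

```python
# Hypothetical "negative result is still a result" log entry.
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    inputs: str                      # what data / prompts went in
    outputs: str                     # what came out, even if it was bad
    conditions: str                  # model, data, and constraints at the time
    revisit_when: list[str] = field(default_factory=list)  # when to try again


record = ExperimentRecord(
    hypothesis="Fine-tuned GPT-2 can write usable rap lyrics",
    inputs="a small scraped lyrics corpus",
    outputs="mostly incoherent verses",
    conditions="2019-era models, low-quality training data",
    revisit_when=[
        "a model that needs no task-specific training data",
        "a much larger, cleaner lyrics dataset",
    ],
)
print(record)
```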
00:48:25.640 | That's really hard, but again, that's like another,
00:48:28.000 | that's like another skill you kind of learn, right?
00:48:30.320 | It's like, you do go back and you do experiments
00:48:32.360 | and you figure out why it works now.
00:48:34.600 | I think a lot of it here is just, like, scaling worked.
00:48:37.880 | - Yeah.
00:48:39.760 | - Right, like, take the rap lyrics,
00:48:42.000 | you know, like, that was because I did not have
00:48:44.880 | high enough quality data.
00:48:46.640 | If we phase shift and say, okay,
00:48:48.480 | you don't even need training data.
00:48:49.680 | So, oh, great, then it might just work.
00:48:51.920 | - Yeah.
00:48:52.760 | - Different domain.
00:48:53.600 | - Do you have any, anything in your list
00:48:56.120 | that is like, it doesn't work now,
00:48:57.520 | but I want to try it again later?
00:48:58.840 | Something that people should, maybe keep in mind,
00:49:01.040 | you know, people always like, AGI when?
00:49:03.240 | You know, when are you going to know the AGI is here?
00:49:05.120 | Maybe it's less than that,
00:49:05.960 | but any stuff that you tried recently that didn't work
00:49:08.960 | that you think will get there?
00:49:11.080 | - I mean, I think, like, the personal assistants
00:49:14.000 | and the writing I've shown to myself
00:49:15.880 | is just not good enough yet.
00:49:17.400 | So, I hired a writer and I hired a personal assistant.
00:49:22.320 | So, now I'm going to basically, like,
00:49:23.600 | work with these people until I figure out, like,
00:49:25.800 | what I can actually, like, automate
00:49:27.120 | and what are, like, the reproducible steps, right?
00:49:30.000 | But, like, I think the experiment for me is, like,
00:49:31.880 | I'm going to go, like, pay a person, like,
00:49:33.520 | $1,000 a month to, like, help me improve my life
00:49:35.920 | and then let me, sort of, get them to help me figure out,
00:49:38.360 | like, what are the components
00:49:39.360 | and how do I actually modularize something
00:49:41.040 | to get it to work?
00:49:42.480 | 'Cause it's not just, like, OAuth, Gmail, Calendar,
00:49:46.000 | and, like, Notion.
00:49:46.880 | It's a little bit more complicated than that,
00:49:48.200 | but we just don't know what that is yet.
00:49:49.560 | Or those are two, sort of, systems that,
00:49:51.800 | I wish GPT-4 or Opus was actually good enough
00:49:54.160 | to just write me an essay,
00:49:55.160 | but most of the essays are still pretty bad.
00:49:57.640 | - Yeah, I would say, you know,
00:49:59.160 | on the personal assistant side,
00:50:00.760 | Lindy is probably the one I've seen the most.
00:50:04.360 | Flo was a speaker at the summit.
00:50:06.680 | I don't know if you've checked it out
00:50:07.840 | or any other, sort of, agents, assistant startup.
00:50:11.040 | - Not recently.
00:50:11.880 | I haven't tried Lindy.
00:50:12.720 | They were, like, behind,
00:50:13.560 | they were not GA last time I was considering it.
00:50:15.720 | - Yeah, yeah, they're not GA.
00:50:16.560 | - But a lot of it now, it's, like,
00:50:17.520 | oh, like, really what I want you to do is, like,
00:50:19.560 | take a look at all of my meetings
00:50:21.080 | and, like, write, like, a really good
00:50:23.440 | weekly summary email for my clients.
00:50:26.200 | Remind them that I'm, like, you know,
00:50:27.600 | thinking of them and, like, working for them.
00:50:30.040 | Right?
00:50:30.880 | Or it's, like, I want you to notice that, like,
00:50:32.760 | my Mondays were way, like, way too packed
00:50:35.520 | and, like, block out more time
00:50:36.800 | and also, like, email the people
00:50:38.960 | to do the reschedule
00:50:40.560 | and then try to opt in to move them around.
00:50:42.240 | And then I want you to say,
00:50:43.080 | oh, Jason should have, like, a 15-minute prep break
00:50:45.920 | after four back-to-back meetings.
00:50:48.520 | Those are things that, like,
00:50:50.240 | now I know I can prompt them in,
00:50:51.800 | but can it do it well?
00:50:53.000 | Like, before, I didn't even know
00:50:54.040 | that's what I wanted to prompt for.
00:50:55.320 | It was, like, defragging a calendar
00:50:57.840 | and adding breaks so I can, like, eat lunch.
00:51:01.160 | Right?
00:51:02.240 | - Yeah, that's the AGI test.
00:51:04.160 | - Yeah, exactly.
00:51:05.400 | Compassion, right?
00:51:06.800 | - I think one thing that, yeah,
00:51:07.920 | we didn't touch on it before,
00:51:09.040 | but I think was interesting.
00:51:10.920 | You had this tweet a while ago
00:51:12.200 | about prompts should be code.
00:51:14.640 | And then there were a lot of companies
00:51:17.200 | trying to build prompt engineering tooling,
00:51:19.440 | kind of trying to turn the prompt
00:51:21.080 | into a more structured thing.
00:51:23.240 | What's your thought today?
00:51:24.520 | Like, you know, now you want to turn the thinking
00:51:26.920 | into DAGs, like, do prompts should still be code?
00:51:29.480 | Like, any updated ideas?
00:51:31.920 | - Nah, it's the same thing, right?
00:51:34.040 | I think, like, you know,
00:51:35.200 | with Instructor, it is very much, like,
00:51:36.640 | the output model is defined as a code object.
00:51:41.640 | That code object is sent to the LLM
00:51:43.720 | and in return, you get a data structure.
00:51:46.400 | So the outputs of these models,
00:51:47.800 | I think, should also be code,
00:51:49.240 | like, code objects.
00:51:50.440 | And the inputs, somewhat, should be code objects.
00:51:52.440 | But I think the one thing that Instructor tries to do
00:51:54.680 | is separate instruction, data,
00:51:57.040 | and the types of the output.
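[Editor's note: for readers who haven't seen the pattern, here is a minimal sketch of "the output model is a code object," assuming Instructor's from_openai entry point and an OpenAI-compatible client; the model name and fields are illustrative, not from the episode.]

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class MeetingSummary(BaseModel):
    """The output type is a code object; its fields are the contract with the LLM."""
    title: str
    action_items: list[str] = Field(description="Concrete follow-ups, one per item")


client = instructor.from_openai(OpenAI())

# Instruction, data, and output type stay separate: the instruction lives in the
# system message, the data in the user message, the type in response_model.
summary = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    response_model=MeetingSummary,
    messages=[
        {"role": "system", "content": "Summarize the meeting transcript."},
        {"role": "user", "content": "…transcript text goes here…"},
    ],
)
print(summary.model_dump())  # a validated data structure, not raw JSON
```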
00:51:58.840 | And beyond that, I really just think that, you know,
00:52:04.040 | most of it should be still, like,
00:52:06.040 | managed pretty closely to the developer.
00:52:08.440 | Like, so much of it is changing
00:52:10.040 | that if you give control of these systems away too early,
00:52:13.720 | you end up, ultimately, wanting them back.
00:52:16.400 | Like, many companies I know that I reach out are ones
00:52:18.600 | where, like, oh, we're going off of the frameworks
00:52:20.280 | because now that we know what the business outcomes
00:52:22.240 | we're trying to optimize for,
00:52:24.240 | these frameworks don't work.
00:52:25.560 | Yeah, 'cause, like, we do RAG,
00:52:27.760 | but we want to do RAG to, like, sell you supplements
00:52:31.960 | or to have you, like, schedule the fitness appointment.
00:52:35.000 | And, like, the prompts are kind of too baked into the systems
00:52:37.880 | to really pull them back out
00:52:38.960 | and, like, start doing upselling or something.
00:52:41.600 | It's really funny, but a lot of it ends up being, like,
00:52:43.800 | once you understand the business outcomes,
00:52:46.120 | you care way more about the prompt, right?
00:52:49.160 | - Actually, this is fun.
00:52:50.400 | So we were trying, in our prep for this call,
00:52:52.280 | we were trying to say, like,
00:52:53.120 | what can you, as an independent person, say
00:52:55.240 | that maybe me and Alessio cannot say
00:52:57.120 | or, you know, someone who works at a company can say?
00:53:00.040 | What do you think is the market share of the frameworks?
00:53:03.680 | The Lanchain, the Llama Index, the everything else.
00:53:06.240 | - Oh, massive.
00:53:07.520 | 'Cause not everyone wants to care about the code.
00:53:10.160 | - Yeah. - Right?
00:53:11.320 | It's like, I think that's a different question
00:53:14.520 | to, like, what is the business model
00:53:16.560 | and are they going to be, like,
00:53:17.400 | massively profitable businesses, right?
00:53:19.360 | Like, making hundreds of millions of dollars,
00:53:21.600 | that feels, like, so straightforward, right?
00:53:24.120 | 'Cause not everyone is a prompt engineer.
00:53:25.560 | Like, there's so much productivity to be captured
00:53:28.520 | in, like, back-office automations, right?
00:53:33.520 | It's not because they care about the prompts,
00:53:36.240 | that they care about managing these things.
00:53:39.200 | - Yeah, but those are not sort of low-code experiences,
00:53:41.400 | you know?
00:53:42.480 | - Yeah, I think the bigger challenge is, like,
00:53:45.640 | okay, $100 million, probably pretty easy.
00:53:49.760 | It's just time and effort.
00:53:50.800 | And they have both, like, the manpower
00:53:53.160 | and the money to sort of solve those problems.
00:53:57.280 | I think it's just like, again, if you go the VC route,
00:53:59.760 | then it's like, you're talking about billions
00:54:01.160 | and that's really the goal.
00:54:03.240 | That stuff, for me, it's, like, pretty unclear.
00:54:08.240 | - Okay. - But again,
00:54:09.200 | that is to say that, like,
00:54:10.040 | I sort of am building things for developers
00:54:11.720 | who want to use Instructor to build their own tooling.
00:54:14.880 | But in terms of the amount of developers
00:54:16.800 | there are in the world
00:54:17.640 | versus, like, downstream consumers of these things
00:54:19.760 | or even just, like, you know,
00:54:21.960 | think of how many companies will use, like,
00:54:24.680 | the Adobes and the IBMs, right?
00:54:26.400 | Because they want something that's fully managed
00:54:28.400 | and they want something that they know will work.
00:54:30.840 | And if the incremental 10% requires you
00:54:33.160 | to hire another team of 20 people,
00:54:34.680 | you might not want to do it.
00:54:36.320 | And I think that kind of organization is really good
00:54:38.440 | for those bigger companies.
00:54:40.840 | - And I just want to capture your thoughts
00:54:42.240 | on one more thing, which is,
00:54:43.080 | you said you wanted most of the prompts
00:54:44.920 | to stay close to the developer.
00:54:46.780 | I wouldn't, and Hamel Husain wrote this, like,
00:54:51.720 | post which I really love called, like,
00:54:53.520 | "FU, show me the prompt."
00:54:55.240 | I think it cites you in one
00:54:57.240 | part of the blog post.
00:54:58.480 | And I think DSPy is kind of, like,
00:55:00.120 | the complete antithesis of that,
00:55:02.480 | which is, I think, interesting.
00:55:03.760 | 'Cause I also hold the strong view
00:55:05.840 | that AI is a better prompt engineer than you are.
00:55:08.320 | And I don't know how to square that.
00:55:10.920 | I'm wondering if you have thoughts.
00:55:13.680 | - I think something like DSPy can work
00:55:17.440 | because there are, like,
00:55:19.480 | very short-term metrics to measure success.
00:55:25.440 | Right?
00:55:26.280 | It is, like, did you find the PII?
00:55:28.760 | Or, like, did you write the multi-hop question
00:55:31.480 | the correct way?
00:55:32.360 | But in these, like, workflows that I've been managing,
00:55:37.360 | a lot of it is, like, are we minimizing,
00:55:40.440 | like, minimizing churn and maximizing retention?
00:55:43.200 | Like, that's not, like, it's not really, like,
00:55:47.400 | an Optuna-like training loop, right?
00:55:51.160 | Like, those things are much more harder to capture.
00:55:52.800 | So we don't actually have those metrics for that, right?
00:55:55.840 | And obviously, we can figure out, like,
00:55:56.880 | okay, is the summary good?
00:55:58.120 | But then, like, how do you measure
00:55:59.320 | the quality of the summary, right?
00:56:01.920 | It's, like, that feedback loop,
00:56:05.040 | it ends up being a lot longer.
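[Editor's note: a toy illustration of the contrast being drawn here, between a crisp short-horizon metric you can optimize against and a business metric with a long, fuzzy feedback loop; the function names and numbers are made up.]

```python
def pii_recall(labeled_pii: set[str], predicted_pii: set[str]) -> float:
    """Short feedback loop: every example has a label, so an optimizer can score itself."""
    if not labeled_pii:
        return 1.0
    return len(labeled_pii & predicted_pii) / len(labeled_pii)


def churn_delta() -> float:
    """Long feedback loop: churn shows up weeks later, per cohort, confounded by
    everything else, so there is no per-prompt label to optimize against."""
    raise NotImplementedError("measure this with product analytics, not a training loop")


print(pii_recall({"john@example.com", "555-1234"}, {"john@example.com"}))  # 0.5
```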
00:56:06.440 | And then, again, when something changes,
00:56:07.720 | it's really hard to make sure that it works
00:56:09.480 | across these, like, newer models,
00:56:11.000 | or, again, like, changes to work for the current process.
00:56:16.000 | Like, when we migrate from, like, Anthropic to OpenAI,
00:56:19.160 | like, there are just a ton of changes
00:56:22.320 | that are, like, infrastructure related,
00:56:23.480 | not necessarily around the prompt itself.
00:56:26.280 | - Any other AI engineering startups
00:56:28.320 | that you think should not exist before we wrap up?
00:56:31.440 | - No, I mean, oh, my gosh.
00:56:33.040 | I mean, a lot of it, again, is just, like,
00:56:34.720 | every time I talk to investors, it's like,
00:56:36.400 | how does this make a billion dollars?
00:56:38.280 | Like, it doesn't.
00:56:39.640 | I'm gonna go back to just, like,
00:56:41.320 | tweeting and holding my breath underwater.
00:56:43.440 | Yeah, like, I don't really pay attention too much
00:56:45.520 | to most of this.
00:56:47.360 | Like, most of the stuff I'm doing
00:56:48.560 | is around, like, the consumer layer, right?
00:56:51.440 | Like, it's not in the consumer layer,
00:56:52.840 | but, like, the consumer of, like, LLM calls.
00:56:55.960 | I think people just wanna move really fast
00:56:57.400 | and they're willing to pick these vendors,
00:56:58.640 | but it's, like, I don't really know
00:57:01.800 | if anything has really, like, blown me out the water.
00:57:04.320 | Like, I only trust myself,
00:57:05.640 | but that's also a function of, like,
00:57:07.120 | just being an old man.
00:57:08.480 | Like, I think, you know,
00:57:09.760 | many companies are definitely very happy
00:57:11.640 | with using most of these tools anyways,
00:57:14.400 | but I definitely think I occupy, like,
00:57:18.960 | a very small space in the AI engineering ecosystem.
00:57:22.440 | - Yeah, I would say one of the challenges here,
00:57:25.280 | you know, you talk about dealing in the consumer
00:57:28.880 | of LLM's space.
00:57:31.920 | I think that's what AI engineering
00:57:33.320 | differs from ML engineering,
00:57:34.840 | and I think a constant disconnect
00:57:37.960 | or cognitive dissonance in this field,
00:57:41.240 | in the AI engineers that have sprung up,
00:57:43.920 | is that they're not as good as the ML engineers.
00:57:45.760 | They're not as qualified.
00:57:47.680 | I think that, you know,
00:57:48.920 | you are someone who has credibility in the MLE space,
00:57:51.560 | and you are also, you know,
00:57:54.360 | a very authoritative figure in the AIE space,
00:57:57.080 | and-- - Authoritative?
00:57:58.800 | - I think so.
00:57:59.640 | And, you know, I think you've built
00:58:01.640 | the de facto leading library.
00:58:03.240 | I think yours, I think Instructor should be
00:58:04.920 | part of the standard lib,
00:58:06.120 | even though I try to not use it.
00:58:07.400 | Like, I also try to figure out that,
00:58:09.960 | I basically also end up rebuilding Instructor, right?
00:58:12.240 | Like, that's a lot of the back and forth
00:58:15.400 | that we had over the past two days.
00:58:16.920 | (laughing)
00:58:18.080 | But like, yeah, like,
00:58:19.160 | I think that's a fundamental thing
00:58:21.080 | that we're trying to figure out.
00:58:21.920 | Like, there's a very small supply of MLEs.
00:58:24.480 | They're not, like, not everyone's gonna have
00:58:26.880 | that experience that you had,
00:58:28.920 | but the global demand for AI
00:58:31.200 | is going to far outstrip the existing MLEs.
00:58:34.000 | So what do we do?
00:58:34.840 | Do we force everyone to go through
00:58:36.080 | the standard MLE curriculum,
00:58:38.160 | or do we make a new one?
00:58:39.840 | - I've got some takes.
00:58:41.200 | - Go.
00:58:42.040 | - I think a lot of these app layer startups
00:58:44.400 | should not be hiring MLEs,
00:58:46.120 | 'cause they end up churning.
00:58:47.520 | - Yeah, they want to work at OpenAI.
00:58:50.080 | (laughing)
00:58:50.920 | 'Cause they're just like, "Hey guys,
00:58:52.240 | I joined and you have no data,
00:58:54.200 | and like, all I did this week was like,
00:58:56.440 | fix some TypeScript build errors,
00:58:58.320 | and like, figure out why we don't have any tests,
00:59:02.440 | and like, what is this framework X and Y?
00:59:04.840 | Like, how come, like, what am I,
00:59:07.000 | like, what are, like, how do you measure success?
00:59:08.720 | What are your biggest outcomes?
00:59:09.840 | Oh, no, okay, let's not focus on that?
00:59:11.560 | Great, I'll focus on like, these TypeScript build errors."
00:59:14.280 | (laughing)
00:59:15.360 | And then you're just like, "What am I doing?"
00:59:16.840 | And then you kind of sort of feel really frustrated.
00:59:18.920 | And I already recognize that,
00:59:21.720 | because I've made offers to machine learning engineers,
00:59:25.480 | they've joined, and they've left in like, two months.
00:59:28.240 | And the response is like,
00:59:30.520 | "Yeah, I think I'm going to join a research lab."
00:59:32.320 | So I think it's not even that,
00:59:33.600 | like, I don't even think you should be hiring these MLEs.
00:59:35.880 | On the other hand, what I also see a lot of,
00:59:38.600 | is the really motivated engineer
00:59:41.440 | that's doing more engineering,
00:59:42.840 | is not being allowed to actually like,
00:59:44.640 | fully pursue the AI engineering.
00:59:46.200 | So they're the guy who built a demo, it got traction,
00:59:49.400 | now it's working, but they're still being pulled back
00:59:51.600 | to figure out like,
00:59:53.000 | why Google Calendar integrations are not working,
00:59:55.240 | or like, how to make sure that like,
00:59:57.360 | you know, the button is loading on the page.
00:59:59.680 | And so I'm sort of like, in a very interesting position
01:00:02.720 | where the companies want to hire an MLE,
01:00:05.160 | they don't need to hire,
01:00:06.520 | but they won't let the excited people
01:00:08.080 | who've caught the AI engineering bug
01:00:09.680 | go do that work more full time.
01:00:13.000 | - This is something I'm literally wrestling with,
01:00:14.600 | like, this week, as I just wrote something about it.
01:00:17.560 | This is one of the things
01:00:18.400 | I'm probably gonna be recommending in the future,
01:00:19.640 | is really thinking about like,
01:00:21.120 | where is the talent coming from?
01:00:22.280 | How much of it is internal?
01:00:23.400 | And do you really need to hire someone
01:00:25.120 | who's like, writing PyTorch code?
01:00:27.680 | - Yeah, exactly.
01:00:29.280 | Most of the time you're not,
00:59:30.120 | you're gonna need someone to write Instructor code.
01:00:32.640 | - And you're just like, yeah, you're making this like,
01:00:36.200 | and like, I feel goofy all the time, just like, prompting.
01:00:38.840 | It's like, oh man, I wish I just had a target data set
01:00:41.280 | that I could like, train a model against.
01:00:42.720 | - Yes.
01:00:43.560 | - And I can just say it's right or wrong.
00:59:45.240 | - Yeah, so, you know, I guess what Latent Space is,
01:00:48.240 | what the AI Engineering World's Fair is,
01:00:50.360 | is that we're trying to create
01:00:51.840 | and elevate this industry of AI engineers,
01:00:54.360 | where it's legitimate to actually
01:00:56.200 | take these motivated software engineers
01:00:58.600 | who wanna build more in AI and do creative things in AI,
01:01:01.200 | to actually say, you have the blessing,
01:01:03.040 | and this is a legitimate sub-specialty
01:01:05.640 | of software engineering.
01:01:07.120 | - Yeah, I think there's gonna be a mix of that,
01:01:09.080 | product engineering.
01:01:10.400 | I think a lot more data science is gonna come in
01:01:12.240 | versus machine learning engineering,
01:01:13.880 | 'cause a lot of it now is just quantifying,
01:01:16.640 | like, what does the business actually want as an outcome?
01:01:20.200 | Right, the outcome is not a RAG app.
01:01:22.600 | - Yeah.
01:01:23.440 | - The outcome is like, reduced churn,
01:01:25.280 | or something like that,
01:01:26.120 | but people need to figure out what that actually is,
01:01:27.600 | and how to measure it.
01:01:28.800 | - Yeah, yeah, all the data engineering tools still apply,
01:01:32.800 | BI layers, semantic layers, whatever.
01:01:35.200 | - Yeah.
01:01:36.920 | - Cool. - We'll see.
01:01:38.160 | - We'll have you back again for the World's Fair.
01:01:41.520 | We don't know what you're gonna talk about,
01:01:44.080 | but I'm sure it's gonna be amazing.
01:01:46.160 | You're a very--
01:01:47.000 | - The title is written.
01:01:47.840 | It's just, "Pydantic is still all you need."
01:01:50.200 | (laughing)
01:01:52.320 | - I'm worried about having too many all-you-need titles,
01:01:54.880 | because that's obviously very trendy.
01:01:57.280 | So, yeah, you have one of them,
01:01:58.760 | but I need to keep a lid on, like,
01:02:00.880 | everyone saying their thing is all you need.
01:02:03.320 | But yeah, we'll figure it out.
01:02:04.680 | - Pydantic is not my thing.
01:02:05.760 | It's someone else's thing.
01:02:06.600 | - Yeah, yeah, yeah.
01:02:07.440 | - I think that's why it works.
01:02:08.280 | - Yeah, it's true.
01:02:10.200 | - Cool, well, it was a real pleasure to have you on.
01:02:12.880 | - Of course.
01:02:13.720 | - Everyone should go follow you on Twitter
01:02:15.440 | and check out Instructor.
01:02:16.880 | There's also Instructor JS, I think,
01:02:18.440 | which I'm very happy to see.
01:02:20.440 | And what else?
01:02:21.840 | - useinstructor.com.
01:02:23.880 | - Anything else to plug?
01:02:25.080 | - useinstructor.com.
01:02:27.240 | We got a domain name now.
01:02:28.440 | - Nice, nice, awesome.
01:02:30.200 | Cool. - Cool.
01:02:31.240 | Thanks, Tristan.
01:02:32.880 | - Thanks.
01:02:34.200 | (upbeat music)
01:02:55.100 | (gentle music)