
High Agency Pydantic over VC Backed Frameworks — with Jason Liu of Instructor


Chapters

00:00 Introductions
02:50 Early experiments with Generative AI at StitchFix
09:39 Design philosophy behind the Instructor library
13:17 JSON Mode vs Function Calling
14:43 Single vs parallel function calling
16:28 How many functions is too many?
20:40 How to evaluate function calling
24:01 What is Instructor good for?
26:41 The Evolution from Looping to Workflow in AI Engineering
31:58 State of the AI Engineering Stack
33:40 Why Instructor isn't VC backed
37:08 Advice on Pursuing Open Source Projects and Consulting
42:59 The Concept of High Agency and Its Importance
51:06 Prompts as Code and the Structure of AI Inputs and Outputs
53:06 The Emergence of AI Engineering as a Distinct Field

Whisper Transcript

00:00:00.000 | Hey, everyone.
00:00:01.080 | Welcome to the Latent Space Podcast.
00:00:02.960 | This is Alessio, partner and CTO in Residence at Decibel Partners.
00:00:06.640 | And I'm joined by my co-host, Swyx, founder of Smol AI.
00:00:09.520 | Hello.
00:00:10.020 | We're back in the remote studio with Jason Liu from Instructor.
00:00:13.360 | Welcome, Jason.
00:00:14.120 | Hey there.
00:00:14.620 | Thanks for having me.
00:00:16.440 | Jason, you are extremely famous.
00:00:18.440 | So I don't know what I'm going to do introducing you.
00:00:20.720 | But you're one of the Waterloo clan.
00:00:24.220 | There's a small cadre of you that's just completely
00:00:26.520 | dominating machine learning.
00:00:28.600 | Actually, can you list Waterloo alums
00:00:30.520 | that you know are just dominating and crushing it
00:00:33.480 | right now?
00:00:34.600 | So John from Rysana is doing his Inversion models.
00:00:42.320 | I know there's Clive Chen.
00:00:45.040 | Clive Chen from Waterloo.
00:00:46.120 | He was one of the kids where I-- when I started the data science
00:00:49.640 | club, he was one of the guys who was joining in and just
00:00:51.880 | hanging out in the room.
00:00:52.920 | And then he was at Tesla, working with Karpathy.
00:00:56.160 | Now he's at OpenAI.
00:00:58.280 | He's in my climbing club.
00:01:00.640 | Oh, hell yeah.
00:01:01.440 | Yeah.
00:01:02.160 | I haven't seen him in like six years now.
00:01:05.200 | To get in the social scene in San Francisco,
00:01:07.320 | you have to climb.
00:01:08.720 | So yeah, both in career and in rocks.
00:01:12.000 | Yeah, I mean, a lot of good problem solving there.
00:01:14.120 | But oh man, I feel like now that you put me on the spot,
00:01:16.840 | I don't know.
00:01:17.840 | It's OK.
00:01:19.760 | Yeah, that was a riff.
00:01:20.920 | OK, but anyway, so you started a data science club at Waterloo.
00:01:23.760 | We can talk about that.
00:01:24.840 | But then also spent five years at Stitch Fix as an MLDE.
00:01:29.000 | You pioneered the use of OpenAI's LLMs
00:01:31.400 | to increase stylist efficiency.
00:01:33.440 | So you must have been a very, very early user.
00:01:35.360 | This was pretty early on.
00:01:37.840 | Yeah, I mean, this was like GPT-3.
00:01:42.200 | OK, so we actually were using transformers at Stitch Fix
00:01:46.600 | before the GPT-3 model.
00:01:47.600 | So we were just using transformers in recommendation
00:01:49.640 | systems.
00:01:50.140 | At that time, I was very skeptical of transformers.
00:01:53.680 | I was like, why do we need all this infrastructure?
00:01:55.760 | We can just use matrix factorization.
00:01:57.520 | When GPT-2 came out, I fine-tuned my own GPT-2
00:02:00.480 | to write rap lyrics.
00:02:01.600 | And I was like, OK, this is cute.
00:02:03.240 | OK, I got to go back to my real job.
00:02:05.320 | Who cares if I can write a rap lyric?
00:02:08.400 | When GPT-3 Instruct came out, again, I
00:02:11.480 | was very much like, why are we using a POST request
00:02:15.720 | to review every comment a person leaves?
00:02:17.920 | We can just use classical models.
00:02:19.840 | So I was very against language models for the longest time.
00:02:23.800 | And then when ChatGPT came out, I basically
00:02:25.680 | just wrote a long apology letter to everyone at the company.
00:02:29.000 | I was like, hey, guys, I was very dismissive
00:02:31.560 | of some of this technology.
00:02:32.680 | I didn't think it would scale well.
00:02:34.120 | And I am wrong.
00:02:35.540 | This is incredible.
00:02:36.920 | And I immediately just transitioned
00:02:38.400 | to go from computer vision recommendation systems to LLMs.
00:02:42.240 | But funny enough, now that we have RAG,
00:02:44.120 | we're kind of going back to recommendation systems.
00:02:46.440 | Yeah, speaking of that, I think Alessio's going to bring up--
00:02:49.020 | I was going to say, we had Bryan Bishop from Hex
00:02:51.520 | on the podcast.
00:02:52.800 | Did you overlap at Stitch Fix?
00:02:54.400 | Yeah, yeah, he was one of my main users
00:02:56.760 | of the recommendation framework that I
00:02:58.480 | had built out at Stitch Fix.
00:02:59.840 | Yeah, we talked a lot about RecSys, so it makes sense.
00:03:02.880 | So I actually, now I have adopted that line,
00:03:05.920 | that RAG is RecSys.
00:03:08.080 | And if you're trying to reinvent new concepts,
00:03:10.760 | you should study RecSys first, because you're
00:03:13.120 | going to independently reinvent a lot of concepts.
00:03:15.160 | So your system was called Flight.
00:03:16.640 | It's a recommendation framework with over 80% adoption,
00:03:19.100 | servicing 350 million requests every day.
00:03:22.300 | Wasn't there something existing at Stitch Fix?
00:03:24.220 | Like, why did you have to write one from scratch?
00:03:26.460 | No, so I think because at Stitch Fix, a lot of the machine
00:03:29.980 | learning engineers and data scientists
00:03:31.580 | were writing production code, sort of every team's systems
00:03:34.620 | were very bespoke.
00:03:36.140 | It's like, this team only needs to do real-time recommendations
00:03:39.860 | with small data, so they just have a fast API
00:03:42.260 | app with some pandas code.
00:03:44.300 | This other team has to do a lot more data,
00:03:46.160 | so they have some kind of Spark job
00:03:47.700 | that does some batch ETL that does a recommendation, right?
00:03:51.320 | And so what happens is each team writes their code differently,
00:03:54.020 | and I have to come in and refactor their code.
00:03:56.340 | And I was like, oh, man, I'm refactoring
00:03:58.380 | four different code bases four different times.
00:04:01.660 | Wouldn't it be better if all the code quality was my fault?
00:04:04.580 | All right, well, let me just write this framework,
00:04:06.660 | force everyone else to use it, and now one person
00:04:09.180 | can maintain five different systems rather than five teams
00:04:12.740 | having their own bespoke system.
00:04:14.860 | And so it was really a need of just standardizing everything.
00:04:18.200 | And then once you do that, you can do observability
00:04:22.000 | across the entire pipeline and make large, sweeping
00:04:25.760 | improvements in this infrastructure.
00:04:28.120 | If we notice that something is slow,
00:04:30.160 | we can detect it on the operator layer.
00:04:33.320 | Just, hey, hey, this team, you guys are doing this operation.
00:04:35.960 | It's lowering our latency by 30%.
00:04:38.640 | If you just optimize your Python code here,
00:04:42.440 | we can probably make an extra million dollars.
00:04:44.360 | Like, jump on a call and figure this out.
00:04:46.580 | And then a lot of it was just doing all this observability
00:04:49.880 | work to figure out what the heck is going on
00:04:52.100 | and optimize this system from not only just a code
00:04:54.360 | perspective, but just harassing the org and saying,
00:04:58.600 | we need to add caching here.
00:05:00.120 | We're doing duplicated work here.
00:05:01.800 | Let's go clean up the systems.
00:05:03.920 | Yeah.
00:05:05.400 | One more system that I'm interested in finding out
00:05:07.720 | more about is your similarity search system
00:05:10.040 | using CLIP and GPT-3 embeddings in FAISS,
00:05:13.760 | which you said drove over $50 million in annual revenue.
00:05:17.240 | So of course, they all gave all that to you, right?
00:05:19.560 | No, no.
00:05:20.240 | I mean, it's not going up and down.
00:05:22.200 | But I got a little bit, so I'm pretty happy about that.
00:05:25.640 | But there, that was when we were fine-tuning ResNets
00:05:31.480 | to do image classification.
00:05:33.360 | And so a lot of it was, given an image,
00:05:35.840 | if we could predict the different attributes we have
00:05:38.000 | in our merchandising, and we can predict the text embeddings
00:05:41.280 | of the comments, then we can build an image vector or image
00:05:46.840 | embedding that can capture both descriptions of the clothing
00:05:49.920 | and sales of the clothing.
00:05:51.600 | And then we would use these additional vectors
00:05:53.520 | to augment our recommendation system.
00:05:56.360 | And so with this, the recommendation system
00:05:59.000 | really was just around, what are similar items?
00:06:02.000 | What are complementary items?
00:06:03.240 | What are items that you would wear in a single outfit?
00:06:05.960 | And being able to say, on a product page,
00:06:08.360 | let me show you 15, 20 more things.
00:06:10.720 | And then what we found was like, hey, when you turn that on,
00:06:13.260 | you make a bunch of money.
00:06:14.120 | Yeah.
00:06:14.640 | OK, so you didn't actually use GPT-3 embeddings.
00:06:17.320 | You fine-tuned your own.
00:06:19.160 | Because I was surprised that GPT-3 worked off the shelf.
00:06:21.520 | OK, OK.
00:06:23.400 | Because at this point, we would have 3 million pieces
00:06:26.560 | of inventory over a billion interactions
00:06:28.920 | between users and clothes.
00:06:32.240 | Any kind of fine-tuning would definitely
00:06:34.080 | outperform some off-the-shelf model.
00:06:38.400 | Cool.
00:06:39.400 | I'm about to move on from Stitch Fix.
00:06:41.240 | But any other fun stories from the Stitch Fix days
00:06:44.000 | that you want to cover?
00:06:45.840 | No, I think that's basically it.
00:06:47.160 | I mean, the biggest one, really, was the fact
00:06:49.000 | that, I think, for just four years,
00:06:50.560 | I was so bearish on language models and just NLP in general.
00:06:54.080 | I was like, oh, none of this really works.
00:06:56.520 | Why would I spend time focusing on this?
00:06:58.320 | I've got to go do the thing that makes money--
00:07:00.440 | recommendations, bounding boxes, image classification.
00:07:03.640 | Yeah.
00:07:04.160 | And now I'm prompting an image model.
00:07:07.240 | I was like, oh, man, I was wrong.
00:07:09.880 | I think-- OK, so my Stitch Fix question would be,
00:07:14.240 | I think you have a bit of a drip.
00:07:15.840 | And I don't.
00:07:16.720 | My primary wardrobe is free startup conference t-shirts.
00:07:21.480 | Should more technology brothers be using Stitch Fix?
00:07:26.040 | Or what's your fashion advice?
00:07:28.120 | Oh, man, I mean, I'm not a user of Stitch Fix, right?
00:07:31.160 | It's like, I enjoy going out and touching things and putting
00:07:35.480 | things on and trying them on, right?
00:07:37.360 | I think Stitch Fix is a place where you kind of go
00:07:39.440 | because you want the work offloaded.
00:07:42.080 | Whereas I really love the clothing
00:07:44.840 | I buy where I have to--
00:07:46.880 | when I land in Japan, I'm doing a 45-minute walk up a giant hill
00:07:50.480 | to find this weird denim shop.
00:07:52.480 | That's the stuff that really excites me.
00:07:54.520 | But I think the bigger thing that's really captured
00:07:56.840 | is this idea that narrative matters a lot to human beings.
00:08:03.280 | And I think the recommendation system,
00:08:05.160 | that's really hard to capture.
00:08:07.240 | It's easy to sell--
00:08:08.200 | it's easy to use AI to sell a $20 shirt.
00:08:10.680 | But it's really hard for AI to sell a $500 shirt.
00:08:14.200 | But people are buying $500 shirts, you know what I mean?
00:08:16.600 | There's definitely something that we can't really
00:08:19.200 | capture just yet that we probably will figure out
00:08:21.640 | how to in the future.
00:08:24.120 | Well, it'll probably-- I'll put in JSON,
00:08:26.440 | which is what we're going to turn to next.
00:08:28.640 | So then you went on a sabbatical to South Park Commons
00:08:31.760 | in New York, which is unusual because it's usually--
00:08:34.600 | Yeah, so basically in 2020, really, I
00:08:39.800 | was just enjoying working a lot.
00:08:41.820 | And so I was just building a lot of stuff.
00:08:43.600 | This is where we were making the tens of millions of dollars
00:08:46.360 | doing stuff.
00:08:47.640 | And then I had a hand injury, and so I really
00:08:49.520 | couldn't code anymore for about a year, two years.
00:08:52.840 | And so I kind of took half of it as medical leave.
00:08:55.640 | The other half, I became more of a tech lead,
00:08:57.600 | just making sure the systems or lights were on.
00:09:01.320 | And then when I went to New York,
00:09:03.920 | I spent some time there and kind of just wound down
00:09:06.400 | the tech work, did some pottery, did some jiu jitsu.
00:09:09.560 | And after ChatGPT came out, I was like, oh, I clearly
00:09:14.960 | need to figure out what is going on here because something
00:09:17.720 | feels very magical, and I don't understand it.
00:09:20.120 | So I spent basically five months just prompting and playing
00:09:23.200 | around with stuff.
00:09:24.600 | And then afterwards, it was just my startup friends
00:09:26.960 | going like, hey, Jason, my investors
00:09:29.800 | want us to have an AI strategy.
00:09:31.760 | Can you help us out?
00:09:33.120 | And it just snowballed more and more
00:09:35.680 | until I was making this my full-time job.
00:09:38.680 | And you had YouTube University and a journaling app,
00:09:42.640 | a bunch of other explorations.
00:09:44.440 | But it seems like the most productive
00:09:47.280 | or the most best-known thing that came out of your time
00:09:50.360 | there was Instructor.
00:09:51.720 | Yeah, written on the bullet train in Japan.
00:09:54.960 | Well, tell us the origin story.
00:09:57.080 | Yeah, I mean, I think at some point,
00:10:00.240 | tools like Guardrails and Marvin came out,
00:10:03.240 | those are kind of tools that use XML and Pydantic
00:10:06.000 | to get structured data out.
00:10:07.560 | But they really were doing things sort of in the prompt.
00:10:10.720 | And these are built with sort of the Instruct models in mind.
00:10:14.080 | And I really-- like, I'd already done that in the past.
00:10:17.160 | At Stitch Fix, one of the things we did
00:10:18.800 | was we would take a request note and turn that into a JSON
00:10:22.720 | object that we would use to send it to our search engine, right?
00:10:26.960 | So if you said, like, I wanted skinny jeans that
00:10:29.440 | were this size, that would turn into JSON
00:10:31.720 | that we would send to our internal search APIs.
00:10:34.360 | It always felt kind of gross.
00:10:35.960 | A lot of it is just, like, you read the JSON,
00:10:37.840 | you parse it, you make sure the names are strings
00:10:40.000 | and ages are numbers, and you do all this messy stuff.
00:10:43.520 | But when Function Calling came out,
00:10:45.480 | it was very much sort of a new way of doing things.
00:10:48.480 | Function Calling lets you define the schema
00:10:50.800 | separate from the data and the instructions.
00:10:54.440 | And what this meant was you can kind
00:10:57.160 | of have a lot more complex schemas
00:10:59.000 | and just map them in Pydantic.
00:11:01.200 | And then you can just keep those very separate.
00:11:03.160 | And then once you add, like, methods,
00:11:04.700 | you can add validators and all that kind of stuff.
00:11:07.060 | The one thing I really had with a lot of these libraries,
00:11:09.520 | though, was it was doing a lot of the string formatting
00:11:11.960 | themselves, which was fine when it was the instruction tune
00:11:15.520 | models.
00:11:16.000 | You just have a string.
00:11:17.480 | But when you have these new chat models,
00:11:19.840 | you have these chat messages.
00:11:21.320 | And I just didn't really feel like taking
00:11:24.680 | that access away from the developer
00:11:26.560 | was much of a benefit to them.
00:11:30.480 | And so I just said, let me write the most simple SDK
00:11:34.240 | around the OpenAI SDK, simple wrapper on the SDK,
00:11:39.120 | just handle the response model a bit,
00:11:41.240 | and kind of think of myself more like requests
00:11:44.880 | than an actual framework that people can use.
00:11:46.680 | And so the goal is, hey, this is something
00:11:48.360 | that you can use to build your own framework.
00:11:50.360 | But let me just do all the boring stuff
00:11:51.980 | that nobody really wants to do.
00:11:53.600 | People want to build their own frameworks.
00:11:55.360 | People don't want to build JSON parsing.
00:11:59.640 | And the retrying and all that other stuff.
00:12:02.080 | Yeah.
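A rough sketch of the pattern being described, assuming the current Instructor package and the OpenAI Python SDK; the model name and the response model's fields are illustrative, not from the episode:

```python
# Minimal sketch: patch the OpenAI client with Instructor so a chat
# completion returns a validated Pydantic object instead of a raw string.
# Model name and fields are illustrative assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel


class UserDetail(BaseModel):
    name: str
    age: int


client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",            # any chat model
    response_model=UserDetail,      # the one extra argument Instructor adds
    messages=[{"role": "user", "content": "Jason is 30 years old."}],
)
print(user.name, user.age)          # typed attributes, autocomplete in the IDE
```

Everything else here is still just the vendor SDK, which is the "requests, not a framework" posture discussed next.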
00:12:03.200 | Yeah, we had a little bit of this discussion before the show.
00:12:05.960 | But that design principle of going for being requests
00:12:09.320 | rather than being Django, what inspires you there?
00:12:16.320 | This has come from a lot of prior pain.
00:12:18.280 | Are there other open source projects
00:12:20.760 | that kind of inspired your philosophy here?
00:12:23.040 | Yeah, I mean, I think it would be requests.
00:12:25.000 | I think it is just the obvious thing you install.
00:12:29.280 | If you were going to go make HTTP requests in Python,
00:12:33.660 | you would obviously import requests.
00:12:35.260 | Maybe if you want to do more async work,
00:12:36.920 | there's future tools.
00:12:38.320 | But you don't really even think about installing it.
00:12:40.960 | And when you do install it, you don't think of it
00:12:42.960 | as like, oh, this is a requests app.
00:12:46.640 | No, this is just Python.
00:12:48.360 | The bigger question is, a lot of people
00:12:50.360 | ask questions like, oh, why isn't requests
00:12:52.360 | in the standard library?
00:12:54.700 | That's how I want my library to feel.
00:12:56.360 | It's like, oh, if you're going to use the LLM SDKs,
00:12:59.640 | you're obviously going to install Instructor.
00:13:01.780 | And then I think the second question would be, oh,
00:13:03.880 | how come Instructor doesn't just go into OpenAI,
00:13:06.240 | go into Anthropic?
00:13:07.400 | If that's the conversation we're having,
00:13:09.200 | that's where I feel like I've succeeded.
00:13:11.360 | Yeah, it's so standard, you may as well
00:13:14.480 | just have it in the base libraries.
00:13:16.960 | And the shape of the request has stayed the same.
00:13:20.000 | But initially, function calling was maybe
00:13:22.320 | equal structure outputs for a lot of people.
00:13:24.560 | I think now the models also support JSON mode
00:13:28.280 | and some of these things.
00:13:29.320 | And "return JSON or my grandma is going to die."
00:13:33.060 | All of that stuff is maybe still to be decided.
00:13:35.400 | How have you seen that evolution?
00:13:37.320 | Maybe what's the meta game today?
00:13:39.320 | Should people just forget about function calling for structure
00:13:42.200 | outputs?
00:13:42.720 | Or when is structure output, like JSON mode,
00:13:46.080 | the best versus not?
00:13:47.360 | We'd love to get any thoughts, given
00:13:48.860 | that you do this every day.
00:13:50.160 | Yeah, I would almost say these are
00:13:51.880 | like different implementations of--
00:13:54.080 | the real thing we care about is the fact
00:13:55.720 | that now we have typed responses from language models.
00:13:58.200 | And because we have that type response,
00:13:59.960 | my IDE is a little bit happier.
00:14:01.400 | I get autocomplete.
00:14:02.880 | If I'm using the response wrong, there's
00:14:04.580 | a little red squiggly line.
00:14:05.920 | Those are the things I care about.
00:14:07.560 | In terms of whether or not JSON mode is better,
00:14:09.720 | I usually think it's almost worse
00:14:12.160 | unless you want to spend less money on the prompt tokens
00:14:15.500 | that the function call represents.
00:14:18.880 | Primarily because with JSON mode,
00:14:20.300 | you don't actually specify the schema.
00:14:23.280 | So sure, json.loads works.
00:14:24.840 | But really, I care a lot more than just the fact
00:14:26.800 | that it is JSON.
00:14:28.880 | I think function calling gives you a tool to specify the fact
00:14:31.960 | that, OK, this is a list of objects that I want.
00:14:34.200 | And each object has a name or an age.
00:14:36.160 | And I want the age to be above 0.
00:14:37.780 | And I want to make sure it's parsed correctly.
00:14:41.040 | That's where function calling really shines.
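A hedged sketch of the kind of constraint being described, assuming Pydantic v2 field constraints; the field names and the bound are illustrative:

```python
# "A list of objects, each with a name and an age above 0": the schema
# carries the constraint, and parsing fails loudly if the model breaks it.
from typing import List
from pydantic import BaseModel, Field


class Person(BaseModel):
    name: str
    age: int = Field(gt=0)  # reject zero or negative ages at parse time


class People(BaseModel):
    people: List[Person]


# Passed as a response model, this schema is sent via function calling
# and the output is validated against it before you ever see it.
```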
00:14:43.800 | Any thoughts on single versus parallel function calling?
00:14:48.480 | When I first started-- so I did a presentation
00:14:50.640 | at our AI in Action Discord channel,
00:14:54.200 | and obviously, showcased Instructor.
00:14:57.680 | One of the big things that we had before
00:14:59.580 | with single function calling is like when
00:15:01.240 | you're trying to extract lists, you
00:15:03.040 | have to make these funky properties that
00:15:05.240 | are lists to then actually return all the objects.
00:15:08.840 | How do you see the hack being put on the developer's plate
00:15:13.880 | versus more of the stuff just getting better in the model?
00:15:17.280 | And I know you tweeted recently about Anthropic, for example,
00:15:21.200 | some lists are not lists, they're strings.
00:15:22.960 | And there's all of these discrepancies.
00:15:25.720 | I almost would prefer it if it was always
00:15:28.120 | a single function call.
00:15:29.120 | But obviously, there is the agents workflows
00:15:31.400 | that Instructor doesn't really support that well,
00:15:34.120 | but are things that ought to be done.
00:15:36.520 | You could define, I think, maybe like 50 or 60
00:15:40.320 | different functions in a single API call.
00:15:43.200 | And if it was like get the weather, or turn the lights on,
00:15:45.720 | or do something else, it makes a lot of sense
00:15:47.060 | to have these parallel function calls.
00:15:48.840 | But in terms of an extraction workflow,
00:15:50.520 | I definitely think it's probably more helpful to have
00:15:53.560 | everything be a single schema.
00:15:56.480 | Just because you can specify relationships
00:15:58.520 | between these entities that you can't do in parallel function
00:16:01.800 | calling, you can have a single chain of thought
00:16:06.840 | before you generate a list of results.
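A sketch of that single-schema extraction idea, with one shared chain-of-thought field and entities that can reference each other; the field names are illustrative assumptions:

```python
# One function call, one schema: reason once, then emit a list of
# entities that can point at each other, which parallel tool calls
# cannot express. Field names are illustrative.
from typing import List
from pydantic import BaseModel, Field


class Entity(BaseModel):
    id: int
    name: str
    depends_on: List[int] = Field(
        default_factory=list,
        description="ids of other entities this one depends on",
    )


class Extraction(BaseModel):
    chain_of_thought: str = Field(
        description="think about the text before listing entities"
    )
    entities: List[Entity]
```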
00:16:09.680 | There's small API differences, right?
00:16:11.960 | Where, yeah, if it's for parallel function calling,
00:16:15.840 | if you do one, again, I really care about how the SDK looks.
00:16:21.120 | And so it's, OK, do I always return a list of functions,
00:16:23.640 | or do you just want to have the actual object back out?
00:16:26.080 | You want to have autocomplete over that object.
00:16:28.200 | What's the cap for how many function
00:16:31.320 | definitions you can put in where it still works well?
00:16:34.040 | Do you have any sense on that?
00:16:35.640 | I mean, for the most part, I haven't really
00:16:37.440 | had a need to do anything that's more than six or seven
00:16:40.560 | different functions.
00:16:41.400 | I think in the documentation, they support way more.
00:16:44.200 | But I don't even know if there's any good evals
00:16:46.880 | that have over two dozen function calls.
00:16:50.840 | I think if you run into issues where
00:16:53.320 | you have 20, or 50, or 60 function calls,
00:16:56.060 | I think you're much better having those specifications
00:16:58.760 | saved in a vector database, and then have them be retrieved.
00:17:02.700 | So if there are 30 tools, you should basically
00:17:04.620 | be ranking them, and then using the top K
00:17:07.660 | to do selection a little bit better,
00:17:10.220 | rather than just shoving 60 functions into a single API.
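A sketch of that retrieve-then-select idea: embed the stored tool specifications, rank them against the user query, and only pass the top K into the API call. The embedding model choice and the OpenAI-style tool spec layout are assumptions here:

```python
# Rank stored tool specs by cosine similarity to the query and keep the
# top K, instead of sending all 60 definitions on every request.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])


def top_k_tools(query: str, tool_specs: list[dict], k: int = 5) -> list[dict]:
    # assumes OpenAI-style specs: {"type": "function", "function": {...}}
    descriptions = [t["function"]["description"] for t in tool_specs]
    tool_vecs = embed(descriptions)
    query_vec = embed([query])[0]
    sims = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return [tool_specs[i] for i in np.argsort(-sims)[:k]]  # pass these as `tools`
```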
00:17:13.860 | Yeah.
00:17:14.740 | Well, I mean, so I think this is relevant now,
00:17:17.060 | because previously, I think context limits prevented you
00:17:20.260 | from having more than a dozen tools anyway.
00:17:24.060 | And now that we have a million token context windows,
00:17:28.380 | Cloud recently, with their new function calling release,
00:17:30.820 | said they can handle over 250 tools, which is insane to me.
00:17:34.980 | That's a lot.
00:17:37.740 | I would say, you're saying you don't think
00:17:40.180 | there's many people doing that.
00:17:41.620 | I think anyone with a sort of agent-like platform where
00:17:44.300 | you have a bunch of connectors, they
00:17:46.700 | wouldn't run into that problem.
00:17:48.100 | Probably, you're right that they should use a vector database
00:17:50.640 | and kind of rag their tools.
00:17:53.260 | I know Zapier has like a few thousand,
00:17:54.860 | like 8,000, 9,000 connectors that obviously
00:17:57.700 | don't fit anywhere.
00:17:59.060 | So yeah, I mean, I think that would
00:18:00.940 | be it, unless you need some kind of intelligence
00:18:03.300 | that chains things together, which is, I think,
00:18:05.420 | what Alessio is coming back to.
00:18:07.780 | There is this trend about parallel function calling.
00:18:10.540 | I don't know what I think about that.
00:18:12.160 | Anthropic's version was--
00:18:14.060 | I think they used multiple tools in sequence,
00:18:16.300 | but they're not in parallel.
00:18:18.100 | I haven't explored this at all.
00:18:19.420 | I'm just throwing this open to you
00:18:20.940 | as to what do you think about all these things.
00:18:22.940 | You know, do we assume that all function calls
00:18:25.140 | could happen in any order?
00:18:26.940 | I think there's a lot of--
00:18:29.500 | in which case, we either can assume that,
00:18:32.140 | or we can assume that things need to happen in some kind
00:18:34.540 | of sequence as a DAG.
00:18:35.780 | But if it's a DAG, really, that's just one JSON object
00:18:38.420 | that is the entire DAG, rather than going, OK,
00:18:41.140 | the order of the function that I return don't matter.
00:18:44.240 | That's just-- that's definitely just not true in practice.
00:18:47.420 | If I have a thing that's like, turn the lights on,
00:18:49.500 | unplug the power, and then turn the toaster on or something,
00:18:52.020 | the order clearly matters.
00:18:55.140 | And it's unclear how well you can
00:18:57.380 | describe the importance of that reasoning to a language model
00:19:01.740 | I mean, I'm sure you can do it with good enough prompting,
00:19:04.380 | but I just haven't had any use cases where the function
00:19:07.300 | sequence really matters.
00:19:08.900 | Yeah.
00:19:09.500 | To me, the most interesting thing
00:19:10.860 | is the models are better at picking
00:19:13.980 | than your ranking is, usually.
00:19:16.020 | Like, I'm incubating a company around system integration.
00:19:19.500 | And for example, with one system,
00:19:21.500 | there are like 780 endpoints.
00:19:23.900 | And if you actually try and do vector similarity,
00:19:26.780 | it's not that good, because the people that wrote the specs
00:19:29.300 | didn't have in mind making them semantically apart.
00:19:32.980 | They're kind of like, oh, create this, create this, create this.
00:19:35.740 | Versus when you give it to a model, and you put--
00:19:38.020 | like in Opus, you put them all, it's
00:19:39.940 | quite good at picking which ones you should actually run.
00:19:43.300 | And I'm curious to see if the model providers actually
00:19:46.020 | care about some of those workflows,
00:19:47.900 | or if the agent companies are actually
00:19:49.820 | going to build very good rankers to kind of fill that gap.
00:19:54.340 | Yeah, my money is on the rankers,
00:19:55.940 | because you can do those so easily.
00:19:58.340 | You could just say, well, given the embeddings of my search
00:20:01.500 | query and the embeddings of the description,
00:20:04.740 | I can just train XGBoost and just make sure
00:20:06.700 | that I have very high MRR, which is mean reciprocal rank.
00:20:10.620 | And so the only objective is to make sure
00:20:13.080 | that the tools you use are in the top-K filter.
00:20:17.020 | That feels super straightforward,
00:20:18.620 | and you don't have to actually figure out
00:20:19.740 | how to fine tune a language model
00:20:21.120 | to do tool selection anymore.
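The evaluation side of that is simple too. A sketch of mean reciprocal rank over a small labeled set, where `rank_tools` is whatever ranker you train (XGBoost over embedding features, or plain cosine similarity); the signature is an assumption for illustration:

```python
# MRR: for each labeled query, average 1 / (rank of the correct tool).
# The only objective is that the right tool lands near the top.
from typing import Callable, List, Tuple


def mean_reciprocal_rank(
    examples: List[Tuple[str, str]],          # (query, correct tool name)
    rank_tools: Callable[[str], List[str]],   # tool names, best first
) -> float:
    total = 0.0
    for query, correct in examples:
        ranking = rank_tools(query)
        if correct in ranking:
            total += 1.0 / (ranking.index(correct) + 1)
    return total / len(examples)
```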
00:20:23.540 | Yeah, I definitely think that's the case.
00:20:25.260 | Because for the most part, I imagine
00:20:27.500 | you either have less than three tools or more than 1,000.
00:20:32.620 | I don't know what kind of companies say, oh, thank God,
00:20:34.940 | we only have like 185 tools.
00:20:37.900 | And this works perfectly, right?
00:20:40.180 | That's right.
00:20:41.340 | And before we maybe move on just from this,
00:20:44.420 | it was interesting to me you retweeted this thing
00:20:46.580 | about Anthropic function calling, and it
00:20:48.460 | was Joshua Brown retweeting some benchmark that's like,
00:20:52.500 | oh my God, Anthropic function calling, so good.
00:20:55.500 | And then you retweeted it, and then you tweeted later,
00:20:58.020 | and it's like, it's actually not that good.
00:21:00.380 | What's your flow for like, how do you actually
00:21:03.060 | test these things?
00:21:03.860 | Because obviously, the benchmarks are lying, right?
00:21:06.260 | Because the benchmark says it's good, and you said it's bad,
00:21:08.780 | and I trust you more than the benchmark.
00:21:10.820 | How do you think about that, and then how
00:21:12.500 | do you evolve it over time?
00:21:14.740 | Yeah, it's mostly just client data.
00:21:17.780 | I think when-- I actually have been
00:21:19.620 | mostly busy with enough client work
00:21:21.340 | that I haven't been able to reproduce public benchmarks,
00:21:23.860 | and so I can't even share some of the results of Anthropic.
00:21:26.620 | But I would just say, in production,
00:21:28.820 | we have some pretty interesting schemas,
00:21:31.660 | where it's iteratively building lists, where we're
00:21:35.180 | doing updates of lists, like we're doing in-place updates,
00:21:38.580 | so upserts and inserts.
00:21:40.660 | And in those situations, we're like, oh, yeah,
00:21:42.580 | we have a bunch of different parsing errors.
00:21:44.380 | Numbers are being returned as strings.
00:21:46.020 | We were expecting lists of objects,
00:21:47.620 | but we're getting strings that are like the strings of JSON.
00:21:51.700 | So we had to call JSON parse on individual elements.
00:21:57.420 | Overall, I'm super happy with the Anthropic models
00:22:00.580 | compared to the OpenAI models.
00:22:01.820 | Like, Sonnet is very cost-effective.
00:22:04.020 | Haiku is-- in function calling, it's actually better.
00:22:08.140 | But I think we just had to file down the edges a little bit,
00:22:10.900 | where our tests pass, but then when we actually
00:22:13.660 | deploy to production, we get half a percent of traffic
00:22:17.940 | having issues, where if you ask for JSON,
00:22:20.500 | it'll still try to talk to you.
00:22:22.380 | Or if you use function calling, we'll have a parse error.
00:22:25.340 | And so I think these are things that are definitely
00:22:27.460 | going to be things that are fixed in the upcoming weeks.
00:22:30.780 | But in terms of the reasoning capabilities, man,
00:22:34.220 | it's hard to beat 70% cost reduction,
00:22:38.300 | especially when you're building consumer applications.
00:22:41.100 | If you're building something for consultants or private equity,
00:22:43.380 | you're charging $400.
00:22:44.500 | It doesn't really matter if it's $1 or $2.
00:22:47.340 | But for consumer apps, it makes products viable.
00:22:51.140 | If you can go from GPT-4 to Sonnet, you
00:22:53.180 | might actually be able to price it better.
00:22:55.660 | I had this chart about the ELO versus the cost
00:22:59.260 | of all the models.
00:23:00.700 | And you could put trend graphs on each of those things
00:23:05.620 | about higher ELO equals higher cost, except for Haiku.
00:23:08.620 | Haiku kind of just broke the trend lines, or the iso-Elos,
00:23:11.900 | if you want to call it that.
00:23:15.460 | Cool.
00:23:16.180 | Before we go too far into your opinions
00:23:18.900 | on just the overall ecosystem, I want
00:23:21.220 | to make sure that we map out the surface area of Instructor.
00:23:23.940 | I would say that most people would
00:23:25.820 | be familiar with Instructor from your talks,
00:23:28.180 | and your tweets, and all that.
00:23:30.260 | You had the number one talk from the AI Engineer Summit.
00:23:34.140 | Two Lius, Jason Liu and Jerry Liu.
00:23:36.300 | Yeah, yeah, yeah.
00:23:38.720 | Start with a J and then a Liu to do well.
00:23:42.600 | But yeah, until I actually went through your cookbook,
00:23:45.520 | I didn't realize the surface area.
00:23:47.640 | How would you categorize the use cases?
00:23:50.760 | You have LLM self-critique.
00:23:53.520 | You have knowledge graphs in here.
00:23:55.040 | You have PII data sanitization.
00:23:57.760 | How do you characterize the people?
00:23:59.260 | What is the surface area of Instructor?
00:24:01.440 | Yeah, so this is the part that feels crazy.
00:24:03.720 | Because really, the difference is LLMs give you strings,
00:24:06.720 | and Instructor gives you data structures.
00:24:08.780 | And once you get data structures again,
00:24:10.360 | you can do every LeetCode problem you ever thought of.
00:24:14.160 | And so I think there's a couple of really common applications.
00:24:16.960 | The first one, obviously, is extracting structured data.
00:24:20.200 | This is just be, OK, well, I want
00:24:22.200 | to put in an image of a receipt.
00:24:24.080 | I want to give back out a list of checkout items
00:24:26.560 | with a price, and a fee, and a coupon code, or whatever.
00:24:30.080 | That's one application.
00:24:31.640 | Another application really is around extracting graphs out.
00:24:36.560 | So one of the things we found out about these language models
00:24:38.640 | is that not only can you define nodes,
00:24:40.680 | it's really good at figuring out what are nodes
00:24:43.000 | and what are edges.
00:24:44.560 | And so we have a bunch of examples where not only
00:24:48.160 | do I extract that this happens after that, but also, OK,
00:24:52.600 | these two are dependencies of another task.
00:24:55.280 | And you can do extracting complex entities
00:24:58.280 | that have relationships.
00:24:59.720 | Given a story, for example, you could extract relationships
00:25:02.340 | of families across different characters.
00:25:04.480 | This can all be done by defining a graph.
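A sketch of what "defining a graph" can mean here: a nodes-and-edges response model that the language model fills in. Class and field names are illustrative, not a fixed Instructor API:

```python
# Nodes and edges as a response model: "this happens after that",
# "these two depend on another task", family relationships, and so on.
from typing import List
from pydantic import BaseModel, Field


class Node(BaseModel):
    id: int
    label: str


class Edge(BaseModel):
    source: int
    target: int
    relationship: str = Field(
        description="e.g. 'happens after', 'depends on', 'parent of'"
    )


class KnowledgeGraph(BaseModel):
    nodes: List[Node]
    edges: List[Edge]
```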
00:25:07.200 | And then the last really big application really
00:25:09.600 | is just around query understanding.
00:25:12.320 | The idea is that any API call has some schema.
00:25:16.240 | And if you can define that schema ahead of time,
00:25:18.200 | you can use a language model to resolve a request
00:25:20.720 | into a much more complex request, one
00:25:24.200 | that an embedding could not do.
00:25:25.680 | So for example, I have a really popular post called,
00:25:28.120 | like, "RAG Is More Than Embeddings."
00:25:29.920 | And effectively, if I have a question like this,
00:25:32.200 | what was the latest thing that happened this week?
00:25:35.200 | That embeds to nothing.
00:25:38.400 | But really, that query should just
00:25:40.080 | be select all data where the date time is between today
00:25:43.680 | and today minus seven days.
00:25:47.480 | What if I said, how did my writing
00:25:50.160 | change between this month and last month?
00:25:52.120 | Again, embeddings would do nothing.
00:25:55.600 | But really, if you could do a group by over the month
00:25:58.000 | and a summarize, then you could, again,
00:26:00.080 | do something much more interesting.
00:26:01.840 | And so this really just calls out the fact
00:26:03.560 | that embeddings really is kind of like the lowest
00:26:05.800 | hanging fruit.
00:26:06.640 | And using something like Instructor
00:26:08.220 | can really help produce a data structure.
00:26:11.220 | And then you can just use your computer science
00:26:13.220 | to reason about this data structure.
00:26:14.720 | Maybe you say, OK, well, I'm going
00:26:16.200 | to produce a graph where I want to group by each month
00:26:19.000 | and then summarize them jointly.
00:26:20.800 | You can do that if you know how to define this data structure.
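A sketch of that query-understanding schema: resolve the question into date ranges, group-bys, and a rewritten query rather than a raw embedding lookup. Field names and enum values are illustrative assumptions:

```python
# "What happened this week?" becomes a date filter; "how did my writing
# change between this month and last month?" becomes a group-by-month
# plus summarize. Plain embeddings cannot express either.
from datetime import date
from enum import Enum
from typing import Optional
from pydantic import BaseModel


class GroupBy(str, Enum):
    none = "none"
    week = "week"
    month = "month"


class SearchQuery(BaseModel):
    rewritten_query: str
    start_date: Optional[date] = None   # e.g. today minus seven days
    end_date: Optional[date] = None
    group_by: GroupBy = GroupBy.none
    summarize: bool = False
```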
00:26:24.240 | In that part, you kind of run up against the LangChains of the world
00:26:28.080 | that used to have that.
00:26:30.480 | They still do have the self-querying,
00:26:32.360 | I think they used to call it, when we had
00:26:34.720 | Harrison on in our episode.
00:26:36.480 | How do you see yourself interacting
00:26:38.120 | with the other, I guess, LLM frameworks in the ecosystem?
00:26:42.000 | - Yeah, I mean, if they use Instructor,
00:26:43.880 | I think that's totally cool.
00:26:45.040 | I think because it's just, again, it's just Python.
00:26:48.160 | It's asking, oh, how does Django interact with requests?
00:26:51.320 | Well, you just might make a request.get in a Django app.
00:26:56.560 | But no one would say, oh, I went off of Django
00:26:59.920 | because I'm using requests now.
00:27:01.840 | That should, ideally, be the wrong comparison.
00:27:05.440 | In terms of especially the agent workflows,
00:27:07.680 | I think the real goal for me is to go down the LLM compiler
00:27:12.080 | route, which is instead of doing a React-type reasoning loop,
00:27:18.160 | I think my belief is that we should be using workflows.
00:27:23.560 | If we do this, then we always have
00:27:25.280 | a request and a complete workflow.
00:27:26.920 | We can fine-tune a model that has a better workflow.
00:27:29.320 | Whereas it's hard to think about how do you
00:27:31.160 | fine-tune a better React loop.
00:27:33.600 | Do you want to always train it to have less looping?
00:27:36.920 | In which case, you want it to get the right answer
00:27:38.960 | the first time, in which case, it
00:27:40.360 | was a workflow to begin with.
00:27:42.800 | - Can you define workflow?
00:27:44.240 | Because I think, obviously, I used
00:27:46.160 | to work at a workflow company, but I'm not sure
00:27:48.280 | this is a well-defined framework for everybody.
00:27:49.880 | - I'm thinking workflow in terms of the Prefect or Zapier
00:27:53.280 | sense of workflow.
00:27:54.240 | I want to build a DAG.
00:27:55.400 | I want you to tell me what the nodes and edges are.
00:27:57.600 | And then maybe the edges are also put in with AI.
00:28:03.040 | But the idea is that I want to be
00:28:04.480 | able to present you the entire plan
00:28:06.200 | and then ask you to fix things as I execute it,
00:28:09.600 | rather than going, hey, I couldn't parse the JSON,
00:28:12.560 | so I'm going to try again.
00:28:13.840 | I couldn't parse the JSON, I'm going to try again.
00:28:15.640 | And then next thing you know, you spent $2 on OpenAI credits.
00:28:20.040 | Whereas with the plan, you can just
00:28:21.600 | say, oh, the edge between node x and y does not run.
00:28:27.840 | Let me just iteratively try to fix that component.
00:28:30.720 | Once that's fixed, go on to the next component.
00:28:33.800 | And obviously, you can get into a world where,
00:28:36.320 | if you have enough examples of the nodes x and y,
00:28:39.240 | maybe you can use a vector database
00:28:41.080 | to find a good few-shot examples.
00:28:43.280 | You can do a lot if you break down
00:28:45.320 | the problem into that workflow and execute in that workflow,
00:28:49.600 | rather than looping and hoping the reasoning is good enough
00:28:52.280 | to generate the correct output.
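A sketch of the workflow-over-looping idea: have the model emit the whole plan as a DAG of tasks, then execute and repair individual nodes instead of re-running an open-ended reasoning loop. The task fields and helper method are illustrative:

```python
# Plan first, then execute: each task names its dependencies, so a failed
# edge can be retried in isolation rather than restarting the whole loop.
from typing import List, Set
from pydantic import BaseModel, Field


class Task(BaseModel):
    id: int
    description: str
    depends_on: List[int] = Field(default_factory=list)


class Plan(BaseModel):
    tasks: List[Task]

    def runnable(self, done: Set[int]) -> List[Task]:
        # tasks whose dependencies are all finished can run next
        return [
            t for t in self.tasks
            if t.id not in done and all(d in done for d in t.depends_on)
        ]
```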
00:28:55.120 | Yeah, I would say I've been hammering on Devin a lot.
00:28:59.200 | I got access a couple of weeks ago.
00:29:01.680 | And obviously, for simple tasks, it does well.
00:29:06.120 | For the complicated, more than 10, 20-hour tasks,
00:29:10.800 | I can see it--
00:29:11.520 | That's a crazy comparison.
00:29:13.020 | We used to talk about three, four loops.
00:29:16.920 | Only once it gets to hour tasks, it's hard.
00:29:20.040 | Yeah.
00:29:21.000 | Less than an hour, there's nothing.
00:29:24.360 | That's crazy.
00:29:25.600 | I mean, I don't know.
00:29:26.520 | Yeah, OK, maybe my goalposts have shifted.
00:29:29.000 | I don't know.
00:29:30.400 | That's incredible.
00:29:32.520 | I'm like sub-one-minute executions.
00:29:34.760 | The fact that you're talking about 10 hours is incredible.
00:29:37.680 | I think it's a spectrum.
00:29:39.120 | I actually-- I really, really--
00:29:40.680 | I think I'm going to say this every single time
00:29:42.600 | I bring up Devin.
00:29:43.480 | Let's not reward them for taking longer to do things.
00:29:45.880 | Do you know what I mean?
00:29:46.880 | Like, that's a metric that is easily abusable.
00:29:51.280 | Sure.
00:29:51.800 | Yeah.
00:29:52.280 | You can run a game.
00:29:53.800 | Yeah, but all I'm saying is you can monotonically
00:29:56.400 | increase the success probability over an hour.
00:30:00.960 | That's winning to me.
00:30:02.000 | Obviously, if you run an hour and you've made no progress--
00:30:04.880 | like, I think when we were in Auto-GPT land,
00:30:07.440 | there was that one example where I wanted it to buy me
00:30:10.600 | a bicycle.
00:30:11.160 | And overnight, I spent $7 on credits,
00:30:13.400 | and I never found the bicycle.
00:30:14.920 | Yeah, yeah.
00:30:16.160 | I wonder if you'll be able to purchase a bicycle.
00:30:18.760 | Because it actually can do things in the real world,
00:30:21.160 | it just needs to suspend to you for auth and stuff.
00:30:24.200 | But the point I was trying to make
00:30:26.020 | was that I can see it turning plans.
00:30:28.280 | Like, when it gets on--
00:30:29.520 | I think one of the agents' loopholes,
00:30:32.560 | or one of the things that is a real barrier for agents
00:30:34.840 | is LLMs really like to get stuck into a lane.
00:30:37.840 | And what you're talking about, what I've seen Devin do
00:30:42.040 | is it gets stuck in a lane, and it will just
00:30:43.960 | kind of change plans based on the performance of the plan
00:30:47.680 | itself.
00:30:49.960 | And it's kind of cool.
00:30:51.280 | Yeah, I feel like we've gone too much in the looping route.
00:30:53.840 | And I think a lot of more plans and DAGs and data structures
00:30:56.880 | are probably going to come back to help fill in some holes.
00:30:59.720 | Yeah.
00:31:00.240 | What's the interface to that?
00:31:02.600 | Do you see it's like an existing state machine kind of thing
00:31:06.360 | that connects to the LLMs, the traditional DAG player?
00:31:10.680 | So do you think we need something new for AI DAGs?
00:31:15.200 | Yeah, I mean, I think that the hard part is
00:31:17.320 | going to be describing visually the fact
00:31:19.640 | that this DAG can also change over time,
00:31:22.320 | and it should still be allowed to be fuzzy, right?
00:31:27.240 | I think in mathematics, we have plate diagrams, and Markov chain
00:31:30.560 | diagrams, and recurrence states, and all that.
00:31:32.840 | Some of that might come into this workflow world.
00:31:35.040 | But to be honest, I'm not too sure.
00:31:36.920 | I think right now, the first steps
00:31:39.160 | are just how do we take this DAG idea
00:31:41.680 | and break it down to modular components
00:31:43.720 | that we can prompt better, have few-shot examples for,
00:31:47.280 | and ultimately fine-tune against.
00:31:49.880 | But in terms of even the UI, it's
00:31:51.240 | hard to say what we'll likely win.
00:31:53.480 | I think people like Prefect and Zapier
00:31:55.600 | have a pretty good shot at doing a good job.
00:31:57.720 | Yeah.
00:31:58.320 | So you seem to use Prefect a lot.
00:31:59.800 | Actually, I used to work at a Prefect competitor, Temporal.
00:32:02.160 | And I'm also very familiar with Dagster.
00:32:06.480 | What else would you call out as particularly interesting
00:32:09.200 | in the AI engineering stack?
00:32:12.280 | Man, I almost use nothing.
00:32:15.120 | I just use Cursor and PyTests.
00:32:19.160 | Oh, OK.
00:32:20.720 | I think that's basically it.
00:32:22.440 | A lot of the observability companies have--
00:32:25.520 | the more observability companies I've tried,
00:32:28.400 | the more I just use Postgres.
00:32:30.920 | Really?
00:32:32.160 | Postgres for observability?
00:32:34.600 | But the issue, really, is the fact
00:32:36.160 | that these observability companies isn't actually
00:32:38.920 | doing observability for the system.
00:32:40.520 | It's just doing the LLM thing.
00:32:42.640 | I still end up using Datadog or Sentry to do latency.
00:32:48.440 | And so I just have those systems handle it.
00:32:50.400 | And then the prompt-in, prompt-out latency token costs,
00:32:54.360 | I just put that in a Postgres table now.
00:32:56.320 | So you don't need 20 funded startups building LLM ops?
00:33:01.480 | Yeah, but I'm also an old, tired guy.
00:33:04.200 | Because of my background, I was like, yeah,
00:33:09.320 | the Python stuff I'll write myself.
00:33:10.800 | But I will also just use Vercel happily.
00:33:14.640 | Because I'm just not familiar with that world of tooling.
00:33:19.280 | Whereas I think I spent three good years building
00:33:22.520 | observability tools for recommendation systems.
00:33:24.760 | And I was like, oh, compared to that,
00:33:27.720 | Instructor is just one call.
00:33:29.600 | I just have to put time start, time end,
00:33:31.760 | and then count the prompt token.
00:33:34.040 | Because I'm not doing a very complex looping behavior.
00:33:36.280 | I'm doing mostly workflows and extraction.
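A sketch of that "just put it in Postgres" habit: record start time, end time, and token counts per call, and leave system-level latency and errors to Datadog or Sentry as described. The table and column names are assumptions:

```python
# One insert per LLM call: model, token counts, and wall-clock latency.
# Table and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app")  # reuse the application's database


def log_llm_call(model: str, prompt_tokens: int, completion_tokens: int,
                 started: float, finished: float) -> None:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO llm_calls
                (model, prompt_tokens, completion_tokens, latency_ms, created_at)
            VALUES (%s, %s, %s, %s, now())
            """,
            (model, prompt_tokens, completion_tokens,
             int((finished - started) * 1000)),
        )
```

In practice you would wrap each completion with a monotonic clock before and after and read the token counts off the response's usage field.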
00:33:40.440 | Yeah, I mean, while we're on this topic,
00:33:42.520 | we'll just kind of get this out of the way.
00:33:44.360 | You famously have decided to not be a venture-backed company.
00:33:48.360 | You want to do the consulting route.
00:33:51.320 | The obvious route for someone as successful as Instructor
00:33:53.880 | is like, oh, here's hosted Instructor with all tooling.
00:33:57.360 | And you just said you had a whole bunch of experience
00:33:59.640 | building observability tooling.
00:34:01.120 | You have the perfect background to do this, and you're not.
00:34:04.080 | Yeah, isn't that sick?
00:34:05.760 | I think that's sick.
00:34:06.600 | I know.
00:34:07.080 | I mean, I know why, because you want to go free dive.
00:34:09.440 | But--
00:34:09.920 | Yeah, well, yeah, because I think there's two things.
00:34:13.880 | One, it's like, if I tell myself I want to build requests,
00:34:17.160 | requests is not a venture-backed startup.
00:34:19.760 | I mean, one could argue whether or not Postman is.
00:34:22.400 | But I think for the most part, having worked so much,
00:34:25.960 | I'm kind of, like, I am more interested in looking
00:34:32.160 | at how systems are being applied and just having access
00:34:35.160 | to the most interesting data.
00:34:36.360 | And I think I can do that more through a consulting business
00:34:38.800 | where I can come in and go, oh, you
00:34:40.840 | want to build perfect memory.
00:34:42.040 | You want to build an agent.
00:34:43.120 | You want to build, like, automations over construction
00:34:45.420 | or, like, insurance and the supply chain.
00:34:47.400 | Or you want to handle, like, writing, like,
00:34:50.640 | private equity, like, mergers and acquisitions
00:34:52.920 | reports based off of user interviews.
00:34:54.840 | Those things are super fun.
00:34:56.840 | Whereas, like, maintaining the library, I think,
00:34:59.360 | is mostly just kind of, like, a utility
00:35:01.320 | that I try to keep up, especially because if it's not
00:35:04.120 | venture-backed, I have no reason to sort of go
00:35:06.720 | down the route of, like, trying to get 1,000 integrations.
00:35:10.160 | Like, in my mind, I just go, oh, OK, 98% of the people
00:35:13.620 | use OpenAI.
00:35:14.400 | I'll support that.
00:35:15.200 | And if someone contributes another, like, platform,
00:35:17.760 | that's great.
00:35:18.520 | I'll merge it in.
00:35:19.800 | But yeah, I mean, you only added Anthropic support, like,
00:35:22.520 | this year.
00:35:23.920 | Yeah, yeah, yeah.
00:35:24.840 | The thing, a lot of it was just, like,
00:35:26.840 | you couldn't even get an API key until, like, this year, right?
00:35:29.160 | Yeah, that's true, that's true.
00:35:30.480 | And so, OK, if I had added it, like, last year, I would have
00:35:33.120 | been, like, doubling the code base to service,
00:35:35.600 | you know, half a percent of all downloads.
00:35:38.000 | Do you think the market share will shift a lot now
00:35:40.040 | that Anthropic has, like, a very, very competitive offering?
00:35:43.880 | I think it's still hard to get API access.
00:35:48.240 | I don't know if it's fully GA now, if it's GA,
00:35:50.600 | if you can get commercial access really easily.
00:35:54.240 | I don't know.
00:35:54.800 | I got commercial access after, like, two weeks of reaching out to their sales team.
00:35:57.880 | OK, yeah, so two weeks.
00:35:58.920 | Yeah, there's a call list here.
00:36:00.520 | And then anytime you run into rate limits,
00:36:02.640 | just, like, ping one of the Anthropic staff members.
00:36:05.480 | Then maybe we need to, like, cut that part out
00:36:07.160 | so I don't need to, like, you know, read false news.
00:36:09.280 | But it's a common question.
00:36:10.880 | Surely, just from the price perspective,
00:36:12.560 | it's going to make a lot of sense.
00:36:14.800 | Like, if you are a business, you should totally
00:36:18.400 | consider, like, Sonnet, right?
00:36:21.400 | Like, the cost savings is just going to justify it
00:36:24.320 | if you actually are doing things at volume.
00:36:26.360 | And yeah, I think their SDK is, like, pretty good.
00:36:29.880 | But back to the instructor thing,
00:36:31.280 | I just don't think it's a billion-dollar company.
00:36:33.600 | And I think if I raise money, the first question is going to be, like,
00:36:35.880 | how are you going to get a billion-dollar company?
00:36:37.120 | And I would just go, like, man, like,
00:36:38.840 | if I make a million dollars as a consultant, I'm super happy.
00:36:41.560 | I'm, like, more than ecstatic.
00:36:43.080 | I can have, like, a small staff of, like, three people.
00:36:46.000 | Like, it's fun.
00:36:47.720 | And I think a lot of my happiest founder friends
00:36:49.680 | are those who, like, raised the tiniest seed round,
00:36:52.080 | became profitable, they're making, like, 60,
00:36:56.440 | 70,000 MRR.
00:36:58.840 | And they're, like, we don't even need to raise the seed round.
00:37:00.680 | Like, let's just keep it, like, between me and my co-founder,
00:37:03.600 | we'll go traveling, and it'll be a great time.
00:37:05.960 | I think it's a lot of fun.
00:37:07.520 | - I repeat that as a seed investor in the company.
00:37:10.680 | I think that's, like, one of the things that people get wrong sometimes,
00:37:14.000 | and I see this a lot.
00:37:15.960 | They have an insight into, like, some new tech,
00:37:18.640 | like, say LLM, say AI, and they build some open source stuff,
00:37:21.840 | and it's like, I should just raise money and do this.
00:37:24.200 | And I tell people a lot, it's like, look, you can make a lot more money
00:37:27.440 | doing something else than doing a startup.
00:37:29.000 | Like, most people that do a company
00:37:30.720 | could make a lot more money just working somewhere else
00:37:33.360 | than doing the company itself.
00:37:34.720 | Do you have any advice for folks
00:37:37.200 | that are maybe in a similar situation?
00:37:38.640 | They're trying to decide, oh, should I stay in my, like, high-paid fang job
00:37:42.640 | and just tweet this on the side and do this on GitHub?
00:37:45.680 | Should I be a consultant?
00:37:47.160 | Like, being a consultant seems like a lot of work.
00:37:49.440 | It's like, you got to talk to all these people, you know?
00:37:52.760 | - There's a lot to unpack,
00:37:54.480 | because I think the open source thing is just like,
00:37:56.000 | well, I'm just doing it for, like, purely for fun,
00:37:58.720 | and I'm doing it because I think I'm right.
00:38:00.840 | But part of being right
00:38:02.800 | is the fact that it's not a venture-backed startup.
00:38:05.520 | Like, I think I'm right because this is all you need.
00:38:10.040 | Right? Like, you know.
00:38:12.760 | So I think a part of it is just, like, part of the philosophy
00:38:15.680 | is the fact that all you need is a very sharp blade
00:38:17.920 | to sort of do your work,
00:38:19.320 | and you don't actually need to build, like, a big enterprise.
00:38:22.240 | So that's one thing.
00:38:23.200 | I think the other thing, too, that I've been thinking around,
00:38:25.760 | just because I have a lot of friends at Google
00:38:26.960 | that want to leave right now,
00:38:28.880 | it's like, man, like, what we lack is not money or, like, skill.
00:38:32.640 | Like, what we lack is courage.
00:38:34.520 | Like, you just have to do this, the hard thing,
00:38:38.040 | and you have to do it scared anyways, right?
00:38:40.160 | In terms of, like, whether or not you do want to do a founder,
00:38:41.960 | I think that's just a matter of, like, optionality.
00:38:44.040 | But I definitely recognize that the, like, expected value of being a founder
00:38:51.320 | is still quite low.
00:38:53.000 | - It is. - Right.
00:38:54.640 | Like, I know as many founder breakups
00:38:58.680 | and as I know friends who raised a seed round this year.
00:39:03.120 | Right? And, like, that is, like, the reality.
00:39:04.760 | And, like, you know, even from my perspective,
00:39:08.760 | it's been tough where it's like, oh, man, like,
00:39:11.080 | a lot of incubators want you to have co-founders.
00:39:12.880 | Now you spend half the time, like, fundraising
00:39:15.000 | and then trying to, like, meet co-founders
00:39:16.920 | and find co-founders rather than building the thing.
00:39:20.040 | And I was like, man, like, this is a lot of stuff,
00:39:23.840 | a lot of time spent out doing things I'm not really good at.
00:39:28.720 | I think, I do think there's a rising trend in solo founding.
00:39:32.560 | You know, I am a solo.
00:39:34.240 | I think that something like 30% of, like,
00:39:37.280 | I think, I forget what the exact stat is,
00:39:39.080 | something like 30% of startups that make it to, like,
00:39:41.160 | Series B or something actually are solo founders.
00:39:44.240 | So I think, I feel like this must-have co-founder idea
00:39:48.000 | mostly comes from YC and most, everyone else copies it.
00:39:52.080 | And then, yeah, you, like,
00:39:53.720 | plenty of companies break up over co-founder breakups.
00:39:56.080 | - Yeah, and I bet it would be, like,
00:39:57.360 | I wonder how much of it is the people
00:39:59.000 | who don't have that much, like,
00:40:00.560 | and I hope this is not a diss to anybody,
00:40:03.240 | but it's like, you sort of,
00:40:04.440 | you go through the incubator route
00:40:05.840 | because you don't have, like, the social equity
00:40:07.560 | you would need to just sort of, like,
00:40:09.000 | send an email to Sequoia and be, like,
00:40:10.800 | "Hey, I'm going on this ride.
00:40:13.960 | "Do you want a ticket on the rocket ship?"
00:40:15.680 | Right, like, that's very hard to sell.
00:40:17.200 | Like, if I was to raise money, like, that's kind of,
00:40:19.720 | like, my message if I was to raise money is, like,
00:40:21.960 | "You've seen my Twitter.
00:40:23.080 | "My life is sick.
00:40:24.360 | "I've decided to make it much worse by being a founder
00:40:27.120 | "because this is something I have to do.
00:40:29.560 | "So do you want to come along?
00:40:31.040 | "Otherwise, I'm gonna fund it myself."
00:40:33.160 | Like, if I can't say that, like, I don't need the money
00:40:35.440 | 'cause, like, I can, like, handle payroll
00:40:37.880 | and, like, hire an intern and get an assistant.
00:40:39.560 | Like, that's all fine.
00:40:41.040 | But, like, what I don't want to do, it's, like,
00:40:44.400 | I really don't want to go back to meta.
00:40:46.080 | I want to, like, get two years
00:40:47.800 | to, like, try to find a problem we're solving.
00:40:50.680 | That feels like a bad time.
00:40:51.840 | - Yeah.
00:40:52.680 | Jason is like, "I wear a YSL jacket
00:40:54.400 | "on stage at AI Engineer Summit.
00:40:56.080 | "I don't need your accelerator money."
00:40:58.560 | - And boots.
00:40:59.680 | You don't forget the boots.
00:41:00.640 | - That's true, that's true.
00:41:01.480 | - You have really good boots, really good boots.
00:41:04.080 | But I think that is a part of it, right?
00:41:06.840 | I think it is just, like, optionality.
00:41:08.120 | And also, just, like, I'm a lot older now.
00:41:10.320 | I think 22-year-old Jason
00:41:11.720 | would have been probably too scared,
00:41:13.360 | and now I'm, like, too wise.
00:41:15.200 | But I think it's a matter of, like,
00:41:17.080 | oh, if you raise money,
00:41:18.000 | you have to have a plan of spending it.
00:41:19.640 | And I'm just not that creative
00:41:21.200 | with spending that much money.
00:41:24.080 | - Yeah.
00:41:24.920 | I mean, to be clear,
00:41:25.760 | you just celebrated your 30th birthday.
00:41:26.840 | Happy birthday.
00:41:27.680 | - Yeah, it's awesome.
00:41:28.880 | I'm going to Mexico next weekend.
00:41:31.320 | - You know, a lot older is relative
00:41:32.680 | to some of the folks listening.
00:41:34.320 | (laughing)
00:41:35.960 | - Staying on the career tips,
00:41:38.560 | I think Swyx had a great post
00:41:40.400 | about are you too old to get into AI?
00:41:42.600 | I saw one of your tweets in January '23.
00:41:45.840 | You applied to, like, Figma, Notion, Cohere, Anthropic,
00:41:48.760 | and all of them rejected you
00:41:49.600 | because you didn't have enough LLM experience.
00:41:52.600 | I think at that time,
00:41:53.440 | it would be easy for a lot of people to say,
00:41:55.000 | oh, I kind of missed the boat, you know?
00:41:57.360 | I'm too late, not going to make it, you know?
00:42:01.200 | Any advice for people that feel like that, you know?
00:42:04.640 | - Yeah, I mean,
00:42:05.600 | like, the biggest learning here
00:42:07.560 | is actually from a lot of folks in jiu-jitsu.
00:42:09.600 | They're like, oh, man,
00:42:10.720 | is it too late to start jiu-jitsu?
00:42:11.960 | Like, oh, I'll join jiu-jitsu once I get in more shape.
00:42:16.120 | Right?
00:42:18.080 | It's like, there's a lot of, like, excuses.
00:42:19.840 | And then you say, oh, like, why should I start now?
00:42:21.640 | I'll be, like, 45 by the time I'm any good.
00:42:23.680 | And it's like, well, you'll be 45 anyways.
00:42:25.800 | Like, time is passing.
00:42:28.800 | Like, if you don't start now, you start tomorrow.
00:42:30.480 | You're just, like, one more day behind.
00:42:32.640 | And if you're, like, if you're worried about being behind,
00:42:34.440 | like, today is, like,
00:42:35.560 | the soonest you can start.
00:42:39.560 | Right?
00:42:40.400 | And so you got to recognize that,
00:42:41.240 | like, maybe you just don't want it, and that's fine too.
00:42:44.560 | Like, if you wanted it, you would have started.
00:42:46.880 | Like, you know.
00:42:48.200 | I think a lot of these people, again,
00:42:50.520 | probably think of things on a too short time horizon.
00:42:54.560 | But again, you know, you're going to be old anyways.
00:42:57.640 | You may as well just start now.
00:42:58.840 | - You know, one more thing on,
00:42:59.840 | I guess, the career advice slash sort of blogging.
00:43:04.840 | You always go viral for this post that you wrote
00:43:07.840 | on advice to young people and the lies you tell yourself.
00:43:10.040 | - Oh, yeah, yeah, yeah.
00:43:11.080 | - You said that you were writing it for your sister.
00:43:12.840 | Like, why is that?
00:43:13.680 | - Yeah, yeah, yeah.
00:43:14.520 | Yeah, she was, like, bummed out about, like, you know,
00:43:16.880 | going to college and, like, stressing about jobs.
00:43:19.040 | And I was like,
00:43:19.880 | oh, and I really want to hear, okay.
00:43:24.160 | And I just kind of, like, texted through the whole thing.
00:43:25.960 | It's crazy.
00:43:26.800 | It's got, like, 50,000 views.
00:43:28.080 | I'm like, I don't mind.
00:43:29.760 | - I mean, your average tweet has more.
00:43:32.800 | - But that thing is, like, you know,
00:43:36.760 | a 30-minute read now.
00:43:38.400 | - Yeah, yeah.
00:43:39.280 | So there's lots of stuff here, which I agree with.
00:43:41.080 | You know, I also occasionally indulge
00:43:43.480 | in the sort of life reflection phase.
00:43:46.400 | There's the how to be lucky.
00:43:48.080 | There's the how to have higher agency.
00:43:51.280 | I feel like the agency thing is always making a,
00:43:53.720 | is always a trend in SF or just in tech circles.
00:43:57.880 | - How do you define having high agency?
00:44:00.120 | - Yeah, I mean, I'm almost, like,
00:44:01.760 | past the high agency phase now.
00:44:03.520 | Now my biggest concern is, like,
00:44:05.440 | okay, the agency is just, like, the norm of the vector.
00:44:08.120 | What also matters is the direction, right?
00:44:11.440 | It's, like, how pure is the shot?
00:44:13.800 | Yeah, I mean, I think agency is just a matter
00:44:15.680 | of, like, having courage and doing the thing.
00:44:17.240 | That's scary, right?
00:44:18.960 | Like, you know, if you want to go rock climbing,
00:44:21.080 | it's, like, do you decide you want to go rock climbing,
00:44:24.160 | and then you show up to the gym,
00:44:25.040 | you rent some shoes, and you just fall 40 times?
00:44:26.880 | Or do you go, like, oh, like,
00:44:28.520 | I'm actually more intelligent.
00:44:29.720 | Let me go research the kind of shoes that I want.
00:44:32.120 | Okay, like, there's flatter shoes and more inclined shoes.
00:44:35.280 | Like, which one should I get?
00:44:36.320 | Okay, let me go order the shoes on Amazon.
00:44:38.920 | I'll come back in three days.
00:44:40.120 | Like, oh, it's a little bit too tight.
00:44:41.320 | Maybe it's too aggressive.
00:44:42.440 | I'm only a beginner.
00:44:43.280 | Let me go change.
00:44:44.800 | No, I think the higher agency person just, like,
00:44:46.680 | goes and, like, falls down 20 times, right?
00:44:48.920 | Yeah, I think the higher agency person
00:44:51.320 | is more focused on, like, process metrics
00:44:54.520 | versus outcome metrics, right?
00:44:57.880 | Like, from pottery, like, one thing I learned was
00:45:00.280 | if you want to be good at pottery,
00:45:01.280 | you shouldn't count, like,
00:45:02.120 | the number of cups or bowls you make.
00:45:04.320 | You should just weigh the amount of clay you use, right?
00:45:08.360 | Like, the successful person says,
00:45:09.560 | oh, I went through 1,000 pounds of clay,
00:45:11.360 | 100 pounds of clay, right?
00:45:13.360 | The less agency person's like, oh, I made six cups,
00:45:15.360 | and then after I made six cups,
00:45:17.360 | like, there's not really, what do you do next?
00:45:20.080 | No, just pounds of clay, pounds of clay.
00:45:22.800 | Same with the work here, right?
00:45:23.640 | It's like, oh, you just got to write the tweets,
00:45:25.200 | like, make the commits, contribute open source,
00:45:27.280 | like, write the documentation.
00:45:29.200 | There's no real outcome, it's just a process,
00:45:30.840 | and if you love that process,
00:45:31.840 | you just get really good at the thing you're doing.
00:45:34.160 | - Yeah, so just to push back on this,
00:45:36.120 | 'cause obviously I mostly agree,
00:45:38.800 | how would you design performance review systems?
00:45:41.440 | (laughing)
00:45:43.600 | Because you were effectively saying
00:45:45.960 | we can count lines of code for developers, right?
00:45:47.960 | Like, did you put out--
00:45:48.960 | - No, I don't think that would be the actual,
00:45:50.640 | like, I think if you make that an outcome,
00:45:52.360 | like, I can just expand a for loop, right?
00:45:54.520 | I think, okay, so for performance review,
00:45:57.000 | this is interesting because I've mostly thought of it
00:45:59.600 | from the perspective of science and not engineering.
00:46:02.920 | Like, I've been running a lot of engineering stand-ups,
00:46:06.220 | primarily because there's not really
00:46:07.400 | that many machine learning folks.
00:46:09.840 | Like, the process outcome is like experiments and ideas,
00:46:14.240 | right, like, if you think about outcomes,
00:46:15.480 | what you might want to think about an outcome is,
00:46:16.960 | oh, I want to improve the revenue or whatnot,
00:46:19.400 | but that's really hard.
00:46:21.000 | But if you're someone who is going out like,
00:46:22.640 | okay, like this week,
00:46:23.880 | I want to come up with like three or four experiments,
00:46:25.760 | I might move the needle.
00:46:26.600 | Okay, nothing worked.
00:46:27.600 | To them, they might think, oh, nothing worked, like, I suck.
00:46:30.920 | But to me, it's like, wow,
00:46:31.760 | you've closed off all these other possible avenues
00:46:34.480 | for, like, research.
00:46:36.520 | Like, you're gonna get to the place
00:46:37.800 | that you're gonna figure out that direction really soon,
00:46:40.720 | right, like, there's no way you'd try 30 different things
00:46:43.080 | and none of them work.
00:46:43.920 | Usually, like, you know, 10 of them work,
00:46:46.160 | five of them work really well,
00:46:47.320 | two of them work really, really well,
00:46:48.600 | and one thing was, like, you know,
00:46:51.200 | the nail on the head.
00:46:53.240 | So agency lets you sort of capture
00:46:55.200 | the volume of experiments.
00:46:56.680 | And, like, experience lets you figure out, like,
00:46:58.520 | oh, that other half, it's not worth doing, right?
00:47:01.800 | Like, I think experience is gonna go,
00:47:03.800 | half these prompting papers don't make any sense,
00:47:05.760 | just use a chain of thought and just, you know,
00:47:07.440 | use a for loop.
00:47:08.320 | But that's kind of, that's basically it, right?
00:47:12.000 | It's like, usually performance for me is around, like,
00:47:13.760 | how many experiments are you running?
00:47:16.000 | Like, how often are you trying?
00:47:18.320 | - Yeah.
00:47:19.480 | - When do you give up on an experiment?
00:47:21.200 | Because at Stitch Fix, you kind of give up
00:47:23.000 | on language models, I guess, in a way,
00:47:24.880 | and as a tool to use.
00:47:27.000 | And then maybe the tools got better.
00:47:29.080 | They got better before, you know,
00:47:30.840 | you were kind of like, you were right at the time
00:47:32.840 | and then the tool improved.
00:47:34.080 | I think there are similar paths in my engineering career
00:47:37.640 | where I try one approach and at the time it doesn't work
00:47:39.920 | and then the thing changes,
00:47:41.320 | but then I kind of soured on that approach
00:47:43.120 | and I don't go back to it soon enough.
00:47:45.360 | - I see.
00:47:46.200 | What do you think about that loop?
00:47:48.400 | - So usually when I, like, when I'm coaching folks
00:47:51.080 | and they say, like, oh, these things don't work,
00:47:52.800 | I'm not going to pursue them in the future.
00:47:54.120 | Like, one of the big things, like, hey,
00:47:55.480 | the negative result is a result
00:47:56.960 | and this is something worth documenting.
00:47:58.200 | Like, this isn't academia.
00:47:59.240 | Like, if it's negative, you don't just, like, not publish it.
00:48:02.440 | But then, like, what do you actually write down?
00:48:03.640 | Like, what you should write down is, like,
00:48:04.760 | here are the conditions.
00:48:06.320 | This is the inputs and the outputs
00:48:07.600 | we tried the experiment on.
00:48:09.760 | And then one thing that's really valuable
00:48:11.840 | is basically writing down under what conditions
00:48:14.720 | would I revisit these experiments, right?
00:48:18.000 | It's like, these things don't work
00:48:19.400 | because of what we had at the time.
00:48:21.520 | If someone is reading this two years from now,
00:48:23.440 | under what conditions will we try again?
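[Editor's note: a minimal sketch of the kind of experiment record described above; the field names and example values are hypothetical illustrations, not a schema prescribed in the episode.]

```python
# Hypothetical "negative result is still a result" log entry.
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    inputs: str                      # what data / prompts went in
    outputs: str                     # what came out, even if it was bad
    conditions: str                  # model, data, and constraints at the time
    revisit_when: list[str] = field(default_factory=list)  # when to try again


record = ExperimentRecord(
    hypothesis="Fine-tuned GPT-2 can write usable rap lyrics",
    inputs="a small scraped lyrics corpus",
    outputs="mostly incoherent verses",
    conditions="2019-era models, low-quality training data",
    revisit_when=[
        "a model that needs no task-specific training data",
        "a much larger, cleaner lyrics dataset",
    ],
)
print(record)
```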
00:48:25.640 | That's really hard, but again, that's like another,
00:48:28.000 | that's like another skill you kind of learn, right?
00:48:30.320 | It's like, you do go back and you do experiments
00:48:32.360 | and you figure out why it works now.
00:48:34.600 | I think a lot of it here is just, like, scaling worked.
00:48:37.880 | - Yeah.
00:48:39.760 | - Right, like, take the rap lyrics,
00:48:42.000 | you know, like, that was because I did not have
00:48:44.880 | high enough quality data.
00:48:46.640 | If we phase shift and say, okay,
00:48:48.480 | you don't even need training data.
00:48:49.680 | So, oh, great, then it might just work.
00:48:51.920 | - Yeah.
00:48:52.760 | - Different domain.
00:48:53.600 | - Do you have any, anything in your list
00:48:56.120 | that is like, it doesn't work now,
00:48:57.520 | but I want to try it again later?
00:48:58.840 | Something that people should, maybe keep in mind,
00:49:01.040 | you know, people always like, AGI when?
00:49:03.240 | You know, when are you going to know the AGI is here?
00:49:05.120 | Maybe it's less than that,
00:49:05.960 | but any stuff that you tried recently that didn't work
00:49:08.960 | that you think will get there?
00:49:11.080 | - I mean, I think, like, the personal assistants
00:49:14.000 | and the writing I've shown to myself
00:49:15.880 | is just not good enough yet.
00:49:17.400 | So, I hired a writer and I hired a personal assistant.
00:49:22.320 | So, now I'm going to basically, like,
00:49:23.600 | work with these people until I figure out, like,
00:49:25.800 | what I can actually, like, automate
00:49:27.120 | and what are, like, the reproducible steps, right?
00:49:30.000 | But, like, I think the experiment for me is, like,
00:49:31.880 | I'm going to go, like, pay a person, like,
00:49:33.520 | $1,000 a month to, like, help me improve my life
00:49:35.920 | and then let me, sort of, get them to help me figure out,
00:49:38.360 | like, what are the components
00:49:39.360 | and how do I actually modularize something
00:49:41.040 | to get it to work?
00:49:42.480 | 'Cause it's not just, like, OAuth, Gmail, Calendar,
00:49:46.000 | and, like, Notion.
00:49:46.880 | It's a little bit more complicated than that,
00:49:48.200 | but we just don't know what that is yet.
00:49:49.560 | Or those are two, sort of, systems that,
00:49:51.800 | I wish GPT-4 or Opus was actually good enough
00:49:54.160 | to just write me an essay,
00:49:55.160 | but most of the essays are still pretty bad.
00:49:57.640 | - Yeah, I would say, you know,
00:49:59.160 | on the personal assistant side,
00:50:00.760 | Lindy is probably the one I've seen the most.
00:50:04.360 | Flo was a speaker at the summit.
00:50:06.680 | I don't know if you've checked it out
00:50:07.840 | or any other, sort of, agents, assistant startup.
00:50:11.040 | - Not recently.
00:50:11.880 | I haven't tried Lindy.
00:50:12.720 | They were, like, behind,
00:50:13.560 | they were not GA last time I was considering it.
00:50:15.720 | - Yeah, yeah, they're not GA.
00:50:16.560 | - But a lot of it now, it's, like,
00:50:17.520 | oh, like, really what I want you to do is, like,
00:50:19.560 | take a look at all of my meetings
00:50:21.080 | and, like, write, like, a really good
00:50:23.440 | weekly summary email for my clients.
00:50:26.200 | Remind them that I'm, like, you know,
00:50:27.600 | thinking of them and, like, working for them.
00:50:30.040 | Right?
00:50:30.880 | Or it's, like, I want you to notice that, like,
00:50:32.760 | my Mondays were way, like, way too packed
00:50:35.520 | and, like, block out more time
00:50:36.800 | and also, like, email the people
00:50:38.960 | to do the reschedule
00:50:40.560 | and then try to opt in to move them around.
00:50:42.240 | And then I want you to say,
00:50:43.080 | oh, Jason should have, like, a 15-minute prep break
00:50:45.920 | after four back-to-back meetings.
00:50:48.520 | Those are things that, like,
00:50:50.240 | now I know I can prompt them in,
00:50:51.800 | but can it do it well?
00:50:53.000 | Like, before, I didn't even know
00:50:54.040 | that's what I wanted to prompt for.
00:50:55.320 | It was, like, defragging a calendar
00:50:57.840 | and adding breaks so I can, like, eat lunch.
00:51:01.160 | Right?
00:51:02.240 | - Yeah, that's the AGI test.
00:51:04.160 | - Yeah, exactly.
00:51:05.400 | Compassion, right?
00:51:06.800 | - I think one thing that, yeah,
00:51:07.920 | we didn't touch on it before,
00:51:09.040 | but I think was interesting.
00:51:10.920 | You had this tweet a while ago
00:51:12.200 | about prompts should be code.
00:51:14.640 | And then there were a lot of companies
00:51:17.200 | trying to build prompt engineering tooling,
00:51:19.440 | kind of trying to turn the prompt
00:51:21.080 | into a more structured thing.
00:51:23.240 | What's your thought today?
00:51:24.520 | Like, you know, now you want to turn the thinking
00:51:26.920 | into DAGs, like, do prompts should still be code?
00:51:29.480 | Like, any updated ideas?
00:51:31.920 | - Nah, it's the same thing, right?
00:51:34.040 | I think, like, you know,
00:51:35.200 | with Instructor, it is very much, like,
00:51:36.640 | the output model is defined as a code object.
00:51:41.640 | That code object is sent to the LLM
00:51:43.720 | and in return, you get a data structure.
00:51:46.400 | So the outputs of these models,
00:51:47.800 | I think, should also be code,
00:51:49.240 | like, code objects.
00:51:50.440 | And the inputs, somewhat, should be code objects.
00:51:52.440 | But I think the one thing that Instructor tries to do
00:51:54.680 | is separate instruction, data,
00:51:57.040 | and the types of the output.
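[Editor's note: for readers who haven't seen the pattern, here is a minimal sketch of "the output model is a code object," assuming Instructor's from_openai entry point and an OpenAI-compatible client; the model name and fields are illustrative, not from the episode.]

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class MeetingSummary(BaseModel):
    """The output type is a code object; its fields are the contract with the LLM."""
    title: str
    action_items: list[str] = Field(description="Concrete follow-ups, one per item")


client = instructor.from_openai(OpenAI())

# Instruction, data, and output type stay separate: the instruction lives in the
# system message, the data in the user message, the type in response_model.
summary = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    response_model=MeetingSummary,
    messages=[
        {"role": "system", "content": "Summarize the meeting transcript."},
        {"role": "user", "content": "…transcript text goes here…"},
    ],
)
print(summary.model_dump())  # a validated data structure, not raw JSON
```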
00:51:58.840 | And beyond that, I really just think that, you know,
00:52:04.040 | most of it should be still, like,
00:52:06.040 | managed pretty closely to the developer.
00:52:08.440 | Like, so much of it is changing
00:52:10.040 | that if you give control of these systems away too early,
00:52:13.720 | you end up, ultimately, wanting them back.
00:52:16.400 | Like, many companies I know that I reach out are ones
00:52:18.600 | where, like, oh, we're going off of the frameworks
00:52:20.280 | because now that we know what the business outcomes
00:52:22.240 | we're trying to optimize for,
00:52:24.240 | these frameworks don't work.
00:52:25.560 | Yeah, 'cause, like, we do RAG,
00:52:27.760 | but we want to do RAG to, like, sell you supplements
00:52:31.960 | or to have you, like, schedule the fitness appointment.
00:52:35.000 | And, like, the prompts are kind of too baked into the systems
00:52:37.880 | to really pull them back out
00:52:38.960 | and, like, start doing upselling or something.
00:52:41.600 | It's really funny, but a lot of it ends up being, like,
00:52:43.800 | once you understand the business outcomes,
00:52:46.120 | you care way more about the prompt, right?
00:52:49.160 | - Actually, this is fun.
00:52:50.400 | So we were trying, in our prep for this call,
00:52:52.280 | we were trying to say, like,
00:52:53.120 | what can you, as an independent person, say
00:52:55.240 | that maybe me and Alessio cannot say
00:52:57.120 | or, you know, someone who works at a company can say?
00:53:00.040 | What do you think is the market share of the frameworks?
00:53:03.680 | The Lanchain, the Llama Index, the everything else.
00:53:06.240 | - Oh, massive.
00:53:07.520 | 'Cause not everyone wants to care about the code.
00:53:10.160 | - Yeah. - Right?
00:53:11.320 | It's like, I think that's a different question
00:53:14.520 | to, like, what is the business model
00:53:16.560 | and are they going to be, like,
00:53:17.400 | massively profitable businesses, right?
00:53:19.360 | Like, making hundreds of millions of dollars,
00:53:21.600 | that feels, like, so straightforward, right?
00:53:24.120 | 'Cause not everyone is a prompt engineer.
00:53:25.560 | Like, there's so much productivity to be captured
00:53:28.520 | in, like, back-office automations, right?
00:53:33.520 | It's not because they care about the prompts,
00:53:36.240 | that they care about managing these things.
00:53:39.200 | - Yeah, but those are not sort of low-code experiences,
00:53:41.400 | you know?
00:53:42.480 | - Yeah, I think the bigger challenge is, like,
00:53:45.640 | okay, $100 million, probably pretty easy.
00:53:49.760 | It's just time and effort.
00:53:50.800 | And they have both, like, the manpower
00:53:53.160 | and the money to sort of solve those problems.
00:53:57.280 | I think it's just like, again, if you go the VC route,
00:53:59.760 | then it's like, you're talking about billions
00:54:01.160 | and that's really the goal.
00:54:03.240 | That stuff, for me, it's, like, pretty unclear.
00:54:08.240 | - Okay. - But again,
00:54:09.200 | that is to say that, like,
00:54:10.040 | I sort of am building things for developers
00:54:11.720 | who want to use Instructor to build their own tooling.
00:54:14.880 | But in terms of the amount of developers
00:54:16.800 | there are in the world
00:54:17.640 | versus, like, downstream consumers of these things
00:54:19.760 | or even just, like, you know,
00:54:21.960 | think of how many companies will use, like,
00:54:24.680 | the Adobes and the IBMs, right?
00:54:26.400 | Because they want something that's fully managed
00:54:28.400 | and they want something that they know will work.
00:54:30.840 | And if the incremental 10% requires you
00:54:33.160 | to hire another team of 20 people,
00:54:34.680 | you might not want to do it.
00:54:36.320 | And I think that kind of organization is really good
00:54:38.440 | for those bigger companies.
00:54:40.840 | - And I just want to capture your thoughts
00:54:42.240 | on one more thing, which is,
00:54:43.080 | you said you wanted most of the prompts
00:54:44.920 | to stay close to the developer.
00:54:46.780 | I wouldn't, and Hamel Husain wrote this, like,
00:54:51.720 | post which I really love called, like,
00:54:53.520 | "FU, show me the prompt."
00:54:55.240 | I think it cites you in one
00:54:57.240 | part of the blog post.
00:54:58.480 | And I think DSPy is kind of, like,
00:55:00.120 | the complete antithesis of that,
00:55:02.480 | which is, I think, interesting.
00:55:03.760 | 'Cause I also hold the strong view
00:55:05.840 | that AI is a better prompt engineer than you are.
00:55:08.320 | And I don't know how to square that.
00:55:10.920 | I'm wondering if you have thoughts.
00:55:13.680 | - I think something like DSPy can work
00:55:17.440 | because there are, like,
00:55:19.480 | very short-term metrics to measure success.
00:55:25.440 | Right?
00:55:26.280 | It is, like, did you find the PII?
00:55:28.760 | Or, like, did you write the multi-hop question
00:55:31.480 | the correct way?
00:55:32.360 | But in these, like, workflows that I've been managing,
00:55:37.360 | a lot of it is, like, are we minimizing,
00:55:40.440 | like, minimizing churn and maximizing retention?
00:55:43.200 | Like, that's not, like, it's not really, like,
00:55:47.400 | an Optuna-like training loop, right?
00:55:51.160 | Like, those things are much more harder to capture.
00:55:52.800 | So we don't actually have those metrics for that, right?
00:55:55.840 | And obviously, we can figure out, like,
00:55:56.880 | okay, is the summary good?
00:55:58.120 | But then, like, how do you measure
00:55:59.320 | the quality of the summary, right?
00:56:01.920 | It's, like, that feedback loop,
00:56:05.040 | it ends up being a lot longer.
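[Editor's note: a toy illustration of the contrast being drawn here, between a crisp short-horizon metric you can optimize against and a business metric with a long, fuzzy feedback loop; the function names and numbers are made up.]

```python
def pii_recall(labeled_pii: set[str], predicted_pii: set[str]) -> float:
    """Short feedback loop: every example has a label, so an optimizer can score itself."""
    if not labeled_pii:
        return 1.0
    return len(labeled_pii & predicted_pii) / len(labeled_pii)


def churn_delta() -> float:
    """Long feedback loop: churn shows up weeks later, per cohort, confounded by
    everything else, so there is no per-prompt label to optimize against."""
    raise NotImplementedError("measure this with product analytics, not a training loop")


print(pii_recall({"john@example.com", "555-1234"}, {"john@example.com"}))  # 0.5
```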
00:56:06.440 | And then, again, when something changes,
00:56:07.720 | it's really hard to make sure that it works
00:56:09.480 | across these, like, newer models,
00:56:11.000 | or, again, like, changes to work for the current process.
00:56:16.000 | Like, when we migrate from, like, Anthropic to OpenAI,
00:56:19.160 | like, there are just a ton of changes
00:56:22.320 | that are, like, infrastructure related,
00:56:23.480 | not necessarily around the prompt itself.
00:56:26.280 | - Any other AI engineering startups
00:56:28.320 | that you think should not exist before we wrap up?
00:56:31.440 | - No, I mean, oh, my gosh.
00:56:33.040 | I mean, a lot of it, again, is just, like,
00:56:34.720 | every time I talk to investors, it's like,
00:56:36.400 | how does this make a billion dollars?
00:56:38.280 | Like, it doesn't.
00:56:39.640 | I'm gonna go back to just, like,
00:56:41.320 | tweeting and holding my breath underwater.
00:56:43.440 | Yeah, like, I don't really pay attention too much
00:56:45.520 | to most of this.
00:56:47.360 | Like, most of the stuff I'm doing
00:56:48.560 | is around, like, the consumer layer, right?
00:56:51.440 | Like, it's not in the consumer layer,
00:56:52.840 | but, like, the consumer of, like, LLM calls.
00:56:55.960 | I think people just wanna move really fast
00:56:57.400 | and they're willing to pick these vendors,
00:56:58.640 | but it's, like, I don't really know
00:57:01.800 | if anything has really, like, blown me out the water.
00:57:04.320 | Like, I only trust myself,
00:57:05.640 | but that's also a function of, like,
00:57:07.120 | just being an old man.
00:57:08.480 | Like, I think, you know,
00:57:09.760 | many companies are definitely very happy
00:57:11.640 | with using most of these tools anyways,
00:57:14.400 | but I definitely think I occupy, like,
00:57:18.960 | a very small space in the AI engineering ecosystem.
00:57:22.440 | - Yeah, I would say one of the challenges here,
00:57:25.280 | you know, you talk about dealing in the consumer
00:57:28.880 | of LLM's space.
00:57:31.920 | I think that's what AI engineering
00:57:33.320 | differs from ML engineering,
00:57:34.840 | and I think a constant disconnect
00:57:37.960 | or cognitive dissonance in this field,
00:57:41.240 | in the AI engineers that have sprung up,
00:57:43.920 | is that they're not as good as the ML engineers.
00:57:45.760 | They're not as qualified.
00:57:47.680 | I think that, you know,
00:57:48.920 | you are someone who has credibility in the MLE space,
00:57:51.560 | and you are also, you know,
00:57:54.360 | a very authoritative figure in the AIE space,
00:57:57.080 | and-- - Authoritative?
00:57:58.800 | - I think so.
00:57:59.640 | And, you know, I think you've built
00:58:01.640 | the de facto leading library.
00:58:03.240 | I think yours, I think Instructor should be
00:58:04.920 | part of the standard lib,
00:58:06.120 | even though I try to not use it.
00:58:07.400 | Like, I also try to figure out that,
00:58:09.960 | I basically also end up rebuilding Instructor, right?
00:58:12.240 | Like, that's a lot of the back and forth
00:58:15.400 | that we had over the past two days.
00:58:16.920 | (laughing)
00:58:18.080 | But like, yeah, like,
00:58:19.160 | I think that's a fundamental thing
00:58:21.080 | that we're trying to figure out.
00:58:21.920 | Like, there's a very small supply of MLEs.
00:58:24.480 | They're not, like, not everyone's gonna have
00:58:26.880 | that experience that you had,
00:58:28.920 | but the global demand for AI
00:58:31.200 | is going to far outstrip the existing MLEs.
00:58:34.000 | So what do we do?
00:58:34.840 | Do we force everyone to go through
00:58:36.080 | the standard MLE curriculum,
00:58:38.160 | or do we make a new one?
00:58:39.840 | - I've got some takes.
00:58:41.200 | - Go.
00:58:42.040 | - I think a lot of these app layer startups
00:58:44.400 | should not be hiring MLEs,
00:58:46.120 | 'cause they end up churning.
00:58:47.520 | - Yeah, they want to work at OpenAI.
00:58:50.080 | (laughing)
00:58:50.920 | 'Cause they're just like, "Hey guys,
00:58:52.240 | I joined and you have no data,
00:58:54.200 | and like, all I did this week was like,
00:58:56.440 | fix some TypeScript build errors,
00:58:58.320 | and like, figure out why we don't have any tests,
00:59:02.440 | and like, what is this framework X and Y?
00:59:04.840 | Like, how come, like, what am I,
00:59:07.000 | like, what are, like, how do you measure success?
00:59:08.720 | What are your biggest outcomes?
00:59:09.840 | Oh, no, okay, let's not focus on that?
00:59:11.560 | Great, I'll focus on like, these TypeScript build errors."
00:59:14.280 | (laughing)
00:59:15.360 | And then you're just like, "What am I doing?"
00:59:16.840 | And then you kind of sort of feel really frustrated.
00:59:18.920 | And I already recognize that,
00:59:21.720 | because I've made offers to machine learning engineers,
00:59:25.480 | they've joined, and they've left in like, two months.
00:59:28.240 | And the response is like,
00:59:30.520 | "Yeah, I think I'm going to join a research lab."
00:59:32.320 | So I think it's not even that,
00:59:33.600 | like, I don't even think you should be hiring these MLEs.
00:59:35.880 | On the other hand, what I also see a lot of,
00:59:38.600 | is the really motivated engineer
00:59:41.440 | that's doing more engineering,
00:59:42.840 | is not being allowed to actually like,
00:59:44.640 | fully pursue the AI engineering.
00:59:46.200 | So they're the guy who built a demo, it got traction,
00:59:49.400 | now it's working, but they're still being pulled back
00:59:51.600 | to figure out like,
00:59:53.000 | why Google Calendar integrations are not working,
00:59:55.240 | or like, how to make sure that like,
00:59:57.360 | you know, the button is loading on the page.
00:59:59.680 | And so I'm sort of like, in a very interesting position
01:00:02.720 | where the companies want to hire an MLE,
01:00:05.160 | they don't need to hire,
01:00:06.520 | but they won't let the excited people
01:00:08.080 | who've caught the AI engineering bug
01:00:09.680 | go do that work more full time.
01:00:13.000 | - This is something I'm literally wrestling with,
01:00:14.600 | like, this week, as I just wrote something about it.
01:00:17.560 | This is one of the things
01:00:18.400 | I'm probably gonna be recommending in the future,
01:00:19.640 | is really thinking about like,
01:00:21.120 | where is the talent coming from?
01:00:22.280 | How much of it is internal?
01:00:23.400 | And do you really need to hire someone
01:00:25.120 | who's like, writing PyTorch code?
01:00:27.680 | - Yeah, exactly.
01:00:29.280 | Most of the time you're not,
00:59:30.120 | you're gonna need someone to write Instructor code.
01:00:32.640 | - And you're just like, yeah, you're making this like,
01:00:36.200 | and like, I feel goofy all the time, just like, prompting.
01:00:38.840 | It's like, oh man, I wish I just had a target data set
01:00:41.280 | that I could like, train a model against.
01:00:42.720 | - Yes.
01:00:43.560 | - And I can just say it's right or wrong.
00:59:45.240 | - Yeah, so, you know, I guess what Latent Space is,
01:00:48.240 | what the AI Engineering World's Fair is,
01:00:50.360 | is that we're trying to create
01:00:51.840 | and elevate this industry of AI engineers,
01:00:54.360 | where it's legitimate to actually
01:00:56.200 | take these motivated software engineers
01:00:58.600 | who wanna build more in AI and do creative things in AI,
01:01:01.200 | to actually say, you have the blessing,
01:01:03.040 | and this is a legitimate sub-specialty
01:01:05.640 | of software engineering.
01:01:07.120 | - Yeah, I think there's gonna be a mix of that,
01:01:09.080 | product engineering.
01:01:10.400 | I think a lot more data science is gonna come in
01:01:12.240 | versus machine learning engineering,
01:01:13.880 | 'cause a lot of it now is just quantifying,
01:01:16.640 | like, what does the business actually want as an outcome?
01:01:20.200 | Right, the outcome is not a RAG app.
01:01:22.600 | - Yeah.
01:01:23.440 | - The outcome is like, reduced churn,
01:01:25.280 | or something like that,
01:01:26.120 | but people need to figure out what that actually is,
01:01:27.600 | and how to measure it.
01:01:28.800 | - Yeah, yeah, all the data engineering tools still apply,
01:01:32.800 | BI layers, semantic layers, whatever.
01:01:35.200 | - Yeah.
01:01:36.920 | - Cool. - We'll see.
01:01:38.160 | - We'll have you back again for the World's Fair.
01:01:41.520 | We don't know what you're gonna talk about,
01:01:44.080 | but I'm sure it's gonna be amazing.
01:01:46.160 | You're a very--
01:01:47.000 | - The title is written.
01:01:47.840 | It's just, "Pydantic is still all you need."
01:01:50.200 | (laughing)
01:01:52.320 | - I'm worried about having too many all-you-need titles,
01:01:54.880 | because that's obviously very trendy.
01:01:57.280 | So, yeah, you have one of them,
01:01:58.760 | but I need to keep a lid on, like,
01:02:00.880 | everyone saying their thing is all you need.
01:02:03.320 | But yeah, we'll figure it out.
01:02:04.680 | - Pydantic is not my thing.
01:02:05.760 | It's someone else's thing.
01:02:06.600 | - Yeah, yeah, yeah.
01:02:07.440 | - I think that's why it works.
01:02:08.280 | - Yeah, it's true.
01:02:10.200 | - Cool, well, it was a real pleasure to have you on.
01:02:12.880 | - Of course.
01:02:13.720 | - Everyone should go follow you on Twitter
01:02:15.440 | and check out Instructor.
01:02:16.880 | There's also Instructor JS, I think,
01:02:18.440 | which I'm very happy to see.
01:02:20.440 | And what else?
01:02:21.840 | - useinstructor.com.
01:02:23.880 | - Anything else to plug?
01:02:25.080 | - useinstructor.com.
01:02:27.240 | We got a domain name now.
01:02:28.440 | - Nice, nice, awesome.
01:02:30.200 | Cool. - Cool.
01:02:31.240 | Thanks, Tristan.
01:02:32.880 | - Thanks.
01:02:34.200 | (upbeat music)
01:02:55.100 | (gentle music)