Which Jobs Can Be Replaced Today: Fryderyk Wiatrowski and Peter Albert

00:00:00.000 | My name is Fredrik, co-founder of Zeta Labs and me and Peter, my co-founder, today will

00:00:17.220 | speak about the job replacement, the future of job replacement and how we see agents in there.

00:00:22.620 | I will start with a vision and then Peter will tell you a bit more how you can contribute to

00:00:28.920 | that by building agents on your own.

00:00:31.700 | Cool, so let me start with asking you a question.

00:00:37.380 | If you could hire a reliable autonomous browser agent today, let's assume that it's fully reliable

00:00:43.300 | and very fast.

00:00:45.240 | What tasks would you want to have replaced?

00:00:48.300 | I need three tasks.

00:00:51.360 | Foreign ones.

00:00:54.140 | Travel organization.

00:00:55.140 | Yeah.

00:00:56.140 | Travel organization, yeah.

00:00:57.820 | Expenses.

00:00:58.820 | Expenses, that's cool.

00:00:59.820 | Yes.

00:01:00.820 | Calendar.

00:01:01.820 | Calendar.

00:01:02.820 | I love the calendar one.

00:01:05.140 | Cool.

00:01:06.140 | I think if you have a reliable agent and you wanted to behave as your employee, our claim

00:01:20.600 | is that sometimes you don't want to prompt the agent.

00:01:22.920 | Just like when you hire a good employee to your company, you spend some time on the onboarding,

00:01:29.840 | initial training.

00:01:31.380 | But after that, if it's a good employee, it's very independent, you don't need to ask them

00:01:38.700 | to do things, right?

00:01:40.680 | So you can expect them to come up with the tasks on their own and you probably shouldn't

00:01:45.080 | expect them to ask you every day what should be done.

00:01:49.100 | You don't want to overlook everything that's being done and you want the employee to come

00:01:55.360 | up with the tasks on their own.

00:01:56.940 | Ideally, you just give them the vision and they will do the rest.

00:02:03.740 | The question is how we can implement this into agents.

00:02:08.840 | Probably, this is not the right UI for this.

00:02:14.200 | In this UI, you just prompt the agent to do things for you, one off.

00:02:17.200 | I think it's great for demos and for showing the capabilities of the agents, like demo drives,

00:02:22.960 | but I don't think this is what the future looks like.

00:02:28.000 | So the question we can ask ourselves is how to embed agents into our workflows.

00:02:33.360 | I think a good place to start is asking ourselves what are the high leverage activities that

00:02:40.460 | we want to preserve in our daily life and what's noise.

00:02:45.000 | I will give you an example.

00:02:47.720 | In a founder's role, a very high leverage activity is hiring.

00:02:52.960 | Of course, if I hire a team of ten people, they can do the job for me ten times better than

00:02:57.540 | me because I hire smarter people and it's a huge leverage.

00:03:01.700 | However, this huge leverage is surrounded by a lot of noise.

00:03:05.640 | For example, setting up meetings or searching for them on LinkedIn.

00:03:09.180 | So I think the vision that we can keep in mind when building agents and thinking about how we can embed them into our workflows is how to distill the high leverage activities and preserve them for ourselves while outsourcing the low leverage things for agents.

00:03:25.780 | One solution right now is just like if you are, I don't know, a big company and a big founder, you can hire an executive assistant that will do the meetings for you and set up those things.

00:03:38.580 | You probably need to give them a big salary like a 50-70k or something.

00:03:44.540 | But initially, you always start with a problem.

00:03:47.160 | It's not like you look for an executive assistant.

00:03:49.380 | You start with a problem, hey, I need to set up meetings and I don't have a very specific need

00:03:55.400 | for all the surrounding skills that they have.

00:03:58.240 | So instead of hiring, how about we just hire agents to do just this one thing?

00:04:03.500 | I think if we find a way to implement agents in this way, that would be truly revolutionary.

00:04:10.000 | So for example, here, you know, if we build a simple agent today, if you use our agent, Jace, go to Jace.ai and use our agent,

00:04:17.260 | you can just CC Jace into your emails and it will do the meeting setup for you.

00:04:20.460 | Super simple, doesn't need to interact with browsers, tools, anything, needs to know your calendar and needs to have access to your availability, that's it.

00:04:28.280 | So as you can see, I see Jace, Jace is replying to an investor interested in Zeta Labs and setting up a meeting for me, knowing my availability.

00:04:37.780 | That's sweet, but you know, most of our tasks require integrations and tool access, right?

00:04:45.280 | So I think in order to think about how we can enable those integrations, we can distinguish two modes of human work.

00:04:57.780 | One is reactive and the other one is proactive.

00:05:03.780 | A great example of a reactive job is a customer support.

00:05:08.280 | What happens in customer support is that when you get an email, for example, asking for a refund, you do a very well-defined task.

00:05:16.280 | So what you do is you go to your logging system or whatever, you see whether someone used the app or not, and based on that, you can give them a refund or not.

00:05:27.280 | And that's pretty simple.

00:05:29.780 | You need to attach Stripe, the agent can do the refund and everything.

00:05:33.280 | And then there are those proactive jobs, like for example, a founder's job, where it's super difficult to describe what needs to be done by a simple set of rules, right?

00:05:42.780 | However, we still think that in every job, even in founder's job, there is the reactive layer that still creates the noise and ideally would be handled for us.

00:05:54.780 | So let's focus on the reactive part.

00:05:58.280 | We think that the reactive jobs will go first.

00:06:01.280 | The beauty of the reactive things is that once you set the rules for the agent, once you do the initial training, you don't need to prompt the agents anymore.

00:06:12.280 | They will just follow the rules and do everything for you.

00:06:18.280 | But once the complexity comes in, it can be more difficult.

00:06:22.780 | Of course, agents as of today can perform reliably, I think, hundreds, up to a thousand steps, but we can't really trust them yet.

00:06:31.780 | So we have to be careful.

00:06:33.780 | Okay, so we talked about the reactive bit and the proactive bit.

00:06:40.280 | I think the first step to implementing the reactive job replacement by agents is to create a pool of triggers, the triggers that cause the reactions of agents.

00:06:51.280 | Once you have the pool and the agents know the rules by which they should pick up tasks and perform the actions, the agents then, like the pool being, for example, slack messages, emails, or phone calls, the agents can pick up the tasks, suggest solutions for you,

00:07:10.780 | do some upfront work in the browser, and then just show you, hey, do you want me to perform this action, and then you can just approve.

00:07:17.280 | So we went from prompting agents to do things, to manage our calendars, to agents doing this on their own.

00:07:28.280 | I think, I think we all know that everything is a reaction.

00:07:32.780 | In particular, like even founders job is about reacting to things like macro market movements, things that are rather unusual,

00:07:44.780 | and there are not many templates to being a founder as opposed to, for example, being in customer support.

00:07:51.780 | However, you know, the more descriptive our rule set is, the more proactive jobs we will be able to replace.

00:07:58.280 | And as we extend the rule set for reacting to the triggers, I think we will go closer and closer to replacing the proactive jobs.

00:08:10.280 | So, assume that we have built the meta aggregator of all the triggers that is being, you know, macro movements of the market, as well as, you know, emails, phone calls, slack messages,

00:08:21.280 | linear issues, whatever, whatever triggers our actions.

00:08:24.780 | I think that continuous improvement of the foundation models and their cognitive abilities going up will allow us to have a very complex rule set.

00:08:35.780 | Because those models will be able to reason just like humans in terms of performing their jobs and deciding on what's next.

00:08:42.780 | But the question is how to build this aggregator.

00:08:46.780 | And I think a very simple way to start would be, for example, in linear or in slack and then agents just picking up tasks and performing them.

00:08:54.280 | But then once we build the aggregator, the next step is to make the agents act.

00:09:02.280 | And, you know, browsers, as opposed to APIs, allow us for very generic actions.

00:09:07.780 | APIs are very often limited and not really well implemented.

00:09:11.780 | So, if we can make the browser agents work, can we fully replace humans at their jobs?

00:09:19.280 | I will pass now to Peter.

00:09:22.280 | And Peter will tell you how you can implement browser agents today to reliably perform your daily jobs and what are the challenges on the way.

00:09:33.280 | So, I'm Peter.

00:09:34.780 | Hi, I'm Peter.

00:09:35.780 | I previously worked on the number two models at Meta.

00:09:38.780 | And I'll, yeah, like Frederick said, I'll give you a bit more actionable insights on how to actually build agents and kind of what the steps are you need to go through there.

00:09:47.780 | So, I think, like, if you want to build any kind of part in your system and especially for agents,

00:09:54.780 | you go through a few different steps based on the complexity of your task and how much performance you want to have.

00:10:01.280 | And basically, this is kind of separate for every single system you have.

00:10:05.280 | Usually, you can start up with prompting and once things get more complex, you can add cognitive architectures.

00:10:11.280 | You can add fine-tuning and reinforcement learning.

00:10:14.280 | But I would only, I would kind of go through each of the steps separately because each of these kind of reduces your iteration speed a lot.

00:10:21.280 | So, prompting cognitive architectures, you can do things within hours.

00:10:24.280 | And fine-tuning reinforcement learning is like more like a week to month, monthly projects.

00:10:28.280 | And any kind of changes you want to make basically slow everything down a lot.

00:10:32.780 | But if you kind of need to go on a certain task to really high performance, you kind of have to go also through steps.

00:10:38.780 | So, first steps are some general ideas, like about how to improve your prompting.

00:10:45.780 | So, usually, it's a good idea to kind of rewrite your prompts with language models itself

00:10:49.780 | because you kind of use the low-perplexity text that you give to a model.

00:10:54.280 | So, also, you can have your own misunderstandings that you had initially, you will kind of be taken out.

00:10:59.280 | We'll also try to use like XML syntax.

00:11:02.280 | Like in Fabric kind of started with this for the prompts.

00:11:04.780 | But I think it's useful for any model just to separate instructions from content.

00:11:09.780 | Also, some useful mindset is to always try to match the fine-tuning and pre-train distribution of your language model.

00:11:18.280 | So, if you use GPT-4, you kind of have to think about what did OpenAI probably fine-tune your models with.

00:11:23.780 | And also, what is kind of in the web in there.

00:11:26.780 | So, for example, you should probably prefer JSON or XML or Markdown unified format when trying to output text

00:11:34.780 | just because it's more frequent in pre-train distribution.

00:11:38.780 | So, if you shouldn't probably introduce your own format or your kind of own syntax to do things.

00:11:42.780 | You want to do things just like the most likely to appear in the web.

00:11:49.280 | Yeah, it's also similar for what you put in the system prompt versus user prompt.

00:11:53.280 | How you, if you decide to split up things into multiple messages or just one.

00:11:57.280 | Just have to think about like what probably OpenAI got in their fine-tuning data set.

00:12:01.280 | What most people do.

00:12:02.780 | And this is usually a good way to do, to stay within the distribution of the fine-tuning.

00:12:07.280 | And will give you better performance.

00:12:09.280 | Also, you should kind of try to think of how to minimize computation at each token.

00:12:15.780 | So, if you have like a complex task, one thing to keep in mind that you, if you want to output,

00:12:21.780 | for example, mappings, like element ID 5 or 4, you usually want to prefer text values.

00:12:27.780 | So, if your model can output text, it's better than specific numbers.

00:12:30.780 | And because basically, you're skipping one mapping step.

00:12:34.780 | So, you should try to let the model only reason about one thing at a time.

00:12:39.280 | It's also relevant, like if you do classification, for example.

00:12:42.280 | Like it's better to first classify things into broader categories and smaller ones.

00:12:45.280 | Instead of directly going to the smallest one.

00:12:47.280 | And if you do this, it will just improve performance in general.

00:12:50.280 | I would also like, if you have some idea about how to think about a problem.

00:12:54.280 | I would try to give you, if you do a chain of thought, I wouldn't just say, let's think

00:12:59.780 | step by step.

00:13:00.780 | But instead, you should kind of think about how you would approach a problem.

00:13:03.780 | And specifically, add these things in your chain of thought.

00:13:06.780 | So, for example, first thinking about summarizing the problem you have.

00:13:10.780 | And go on going for all the steps.

00:13:12.780 | A few more ideas is that you should think about everything that you put in a prompt is basically

00:13:19.280 | noise.

00:13:20.280 | And some valuable content in there.

00:13:22.280 | So, if you can reduce the noise as much as possible and only have the most relevant context

00:13:27.280 | in there, the model doesn't need to find out what is noise and what is real context.

00:13:31.280 | For future examples, some really useful tips is that you shouldn't probably, like, if you

00:13:37.780 | have really long inputs and really long future examples, you don't have to show the full ones.

00:13:41.780 | You just only focus on the few most important parts of them.

00:13:45.780 | And the model is still usually capture kind of how things work.

00:13:50.780 | And also, like, even once you start prompting, you want to already set up some maybe like 10,

00:13:55.780 | 20 examples to kind of always run each prompt through.

00:13:58.780 | Additionally, two will have, like, some better end-to-end emails as well.

00:14:02.280 | Yeah.

00:14:03.280 | So, once you kind of exhausted all the gains you kind of get with prompting, you can try

00:14:07.880 | to split up tasks into multiple pieces.

00:14:10.280 | So, basically, you can think about how much cognitive load you put on a model with one prompt.

00:14:15.280 | And if you can split things up further.

00:14:16.780 | And also, if you have some preconceived notion about how the problem should work, you can kind

00:14:20.780 | of improve this.

00:14:21.780 | So, a few ideas about this is kind of what you add explicit state tracking of, like, if you

00:14:26.280 | have a longer task, you basically keep track of the state that the task currently is in.

00:14:30.780 | This basically increases the quality of context you kind of give in your model.

00:14:35.380 | Here's some ideas, for example, of using kind of planning.

00:14:38.380 | You kind of generate plans.

00:14:39.780 | Afterwards, modify them.

00:14:40.780 | Re-plan.

00:14:41.780 | You can also have the verification steps afterwards.

00:14:43.780 | Or you kind of take notes in a scratch pad for intermediate work.

00:14:52.280 | But once you kind of split up things in multiple pieces, when latency becomes more of an issue.

00:14:55.880 | So, there are often good ways to kind of parallelize the work while still kind of getting good performance.

00:15:02.280 | Also, for all of these kind of agents, it's really important to think about really natural interfaces

00:15:06.880 | for them to use tools.

00:15:09.980 | So, for example, we found that if you want to update state, like, for example, some notes

00:15:15.980 | you have, it's often good to address them, like, with key value updates.

00:15:18.980 | Because, basically, you're kind of mimicking the Python syntax.

00:15:22.380 | You can update some dictionary.

00:15:24.380 | And this kind of allows you to target elements really well.

00:15:26.480 | Because the first model only needs to think about the key.

00:15:28.980 | And then, afterwards, you can think about what to update and not do both things at the same time.

00:15:34.080 | You can get even better performance usually with most models if you do full rewrites of what you kind of want to update.

00:15:39.080 | Because this gives the model even more time to think about things.

00:15:42.080 | But when you add more latency.

00:15:44.080 | So, I think that's, for example, why you see in cursor, like, all the text rewritten.

00:15:48.080 | Because currently, both best models are not that good at generating, like, small divs.

00:15:53.080 | Also, you should try to avoid any recursive nest structures if you can.

00:15:58.080 | This will just, like, the more nested it is, it will just be more out of distribution

00:16:03.180 | and make it more complex for a model.

00:16:05.180 | Yeah, also, you can kind of use some reasoning templates.

00:16:09.180 | Like, if you know how the problems are structured, you should also kind of reflect this in a prompt.

00:16:12.680 | If you deal with images, then one good idea is that you try to move the reasoning into your text.

00:16:19.180 | So, you first describe the key points in your image.

00:16:21.780 | Like, you let model output the text form what the image contains.

00:16:25.780 | And then, afterwards, reason about this.

00:16:27.180 | Just because models have been trained on trillions of tokens in text.

00:16:30.280 | And most of these, like, image-language pairs are usually much less data.

00:16:35.280 | So, if you do reasoning about images, it usually performs worse than if you first convert it into text.

00:16:39.280 | And then, afterwards, into more text.

00:16:43.280 | You should also think about how to design your components to correct for one part.

00:16:46.280 | if you have one part in the system that creates an error that another handles it.

00:16:51.380 | And the more cognitive components you kind of add, the more prettiness you also add to the system.

00:16:55.380 | So, usually, it's a good idea to kind of keep it to a minimum.

00:16:58.380 | But, on the other hand, like, in more components you add to kind of more cognitive load.

00:17:02.480 | You split into smaller pieces.

00:17:04.480 | So, the model can have less load on this.

00:17:07.480 | The next stage, if you still don't get enough performance out of your model, or if you want to get lower cost,

00:17:15.580 | you can kind of start for fine-tuning.

00:17:17.580 | And a simple way to kind of collect data, no matter what your use case is, is by simulating, basically, real interactions in your app.

00:17:24.580 | And you do this by creating templates about how a human-- like, it's basically different roles of a human.

00:17:32.680 | And then, you instruct one language models to act as a human, and basically, the rest of your system acts as normal.

00:17:38.680 | And it's basically synthetic data that you can kind of use to fine-tune on.

00:17:42.680 | And for this, like, prompt diversity and difficulty is really key.

00:17:45.680 | So, I think one of the core issues of, like, Alpaca models, like, in the beginning, the first kind of fine-tuned models that went out there,

00:17:52.680 | was that the prompt difficulty was way too low.

00:17:54.980 | Basically, models learn much better, the more difficult your prompt is.

00:17:58.280 | Even if your task is not that difficult, I would try to add more and more conditions and more and more complexity,

00:18:02.980 | because then a model has more things to learn than just for a single question.

00:18:05.780 | Yeah, also, another thing you can kind of try is, like, if you have multiple steps in your pipeline, you can--

00:18:13.380 | in the end, if you do fine-tune, you can distill to skip some of them.

00:18:16.580 | You basically have, like, some initial inputs and some final output, and you can directly distill a model to get these first inputs

00:18:22.280 | and auto-generate the final outputs.

00:18:23.880 | And this can increase a lot of-- decrease your latency, but will kind of decrease performance a bit.

00:18:29.080 | Then the next step that's really easy is that you kind of filter your data.

00:18:37.080 | So this kind of gives you-- like, usually with just fine-tuning, you can get to, like, GVD4 performance on your specific task,

00:18:42.880 | but you can get much further if you simply do some rejection sampling or filtering of your data.

00:18:47.380 | And for this, an easy way is, if you don't have execution feedback in some way,

00:18:50.980 | it's that you basically use language model judges.

00:18:54.180 | You'll just judge your output of your model, or, like, of a larger system, or even the final output of your system.

00:18:59.180 | And then, basically, you'll filter out whole swaths of your data that you know probably didn't work as well.

00:19:04.680 | Even if your judge is not perfect, this will kind of increase your performance a lot.

00:19:08.180 | And finally, so the last step that you can kind of approach is, like, reinforcement learning.

00:19:17.280 | So this even allows you to kind of optimize multiple steps in your system.

00:19:20.280 | So it's especially important for agents.

00:19:22.880 | And good ways to get a signal for this is execution feedback over, like we said,

00:19:27.180 | these language model judges of different parts of your system.

00:19:29.780 | But I would usually consider reinforcement learning a kind of a final step when other methods don't work,

00:19:34.780 | because it makes it kind of difficult to move to different models.

00:19:38.080 | And also, there's a lot of setup costs you have to do.

00:19:41.880 | Yeah, so I think that's it.

00:19:43.880 | Awesome.

00:19:44.880 | Thanks.

00:19:45.880 | Thank you.

00:19:56.980 | Thank you.