Back to Index

How To Hire AI Engineers (ft. James Brady and Adam Wiggins of Elicit)


Chapters

0:00 Intros
5:25 Defining the Hiring Process
8:42 Defensive AI Engineering as a chaotic medium
10:26 Tech Choices for Defensive AI Engineering
14:04 How do you Interview for Defensive AI Engineering
19:25 Does Model Shadowing Work?
22:29 Is it too early to standardize Tech stacks?
32:02 Capabilities: Offensive AI Engineering
37:24 AI Engineering Required Knowledge
40:13 ML First Mindset
45:13 AI Engineers and Creativity
47:51 Inside of Me There Are Two Wolves
49:58 Sourcing AI Engineers
58:45 Parting Thoughts

Transcript

Okay, so welcome to the Latent Space Podcast. This is another remote episode that we're recording. Actually, this is the first one that we're doing around a guest post. And I'm very honored to have two of the authors of the post with me, James and Adam from Elicit. Welcome, James.

Welcome, Adam. Thank you. Great to be here. Hey there. Hey. Okay, so I think I will do this kind of in order. I think James, you're sort of the primary author. So James, you are head of engineering at Elicit. You also were VP Eng at Teespring and Spring as well.

And you also, you know, you have a long history in sort of engineering. How did you, you know, find your way into something like Elicit where, you know, you are basically a traditional sort of VP Eng, VP technology type person moving into more of an AI role? Yeah, that's right.

It definitely was something of a sideways move, if not a left turn. So the story there was I'd been doing, as you said, VP technology, CTO type stuff for around about 15 years or so. And noticed that there was this crazy explosion of capability and interesting stuff happening within AI and ML and language models, that kind of thing.

I guess this was in 2019 or so, and decided that I needed to get involved. You know, this is a kind of generational shift. Spent maybe a year or so trying to get up to speed on the state of the art, reading papers, reading books, practicing things, that kind of stuff.

Was going to found a startup actually in the space of interpretability and transparency. And through that met Andreas, who has obviously been on the podcast before, asked him to be an advisor for my startup. And he countered with, "Maybe you'd like to come and run the engineering team at Elicit," which it turns out was a much better idea.

And yeah, I kind of quickly changed in that direction. So I think some of the stuff that we're going to be talking about today is how actually a lot of the work when you're building applications with AI and ML looks and smells and feels much more like conventional software engineering with a few key differences, rather than really deep ML stuff.

And I think that's one of the reasons why I was able to transfer the skills over from one place to the other. Yeah, I definitely agree with that. I do often say that I think AI engineering is about 90% software engineering with the 10% of really strong, really differentiated AI engineering.

And obviously that number might change over time. I want to also welcome Adam onto my podcast, because you welcomed me onto your podcast two years ago. And I'm really, really glad for that. That was a fun episode. You famously founded Heroku. You just wrapped up a few years working on Muse.

And now you describe yourself as a journalist, internal journalist working on Elicit. Yeah, well, I'm kind of a little bit in a wandering phase here and trying to take this time in between ventures to see what's out there in the world. And some of my wandering took me to the Elicit team and found that they were some of the folks who were doing the most interesting, really deep work in terms of taking the capabilities of language models and applying them to what I feel like are really important problems.

So in this case, science and literature search and that sort of thing. It fits into my general interest in tools and productivity software. I think of it as a tool for thought in many ways, but a tool for science, obviously, if we can accelerate that discovery of new medicines and things like that, that's just so powerful.

But to me, it's kind of also an opportunity to learn at the feet of some real masters in this space, people who have been working on it since before it was cool, if you want to put it that way. So for me, the last couple of months have been this crash course.

And why I sometimes describe myself as an internal journalist is I'm helping to write some posts, including supporting James in this article here we're doing for Latent Space, where I'm just bringing my writing skill and that sort of thing to bear on their very deep domain expertise around language models and applying them to the real world and kind of surface that in a way that's accessible, legible, that sort of thing.

And so the great benefit to me is I get to learn this stuff in a way that I don't think I would have, or haven't, just kind of tinkering with my own side projects. Yeah, totally. I forgot to mention that you also run Ink & Switch, which is one of the leading research labs, in my mind, in the tools-for-thought and productivity space, whatever people want to call it, or maybe the future of programming even, a little bit of that as well.

I think you guys definitely started the local-first wave. I think you just held the first conference on it. I don't know if you were personally involved. Yeah, I was one of the co-organizers, along with a few other folks, of Local First Conf here in Berlin. Huge success from my point of view.

Local First, obviously, a whole other topic we can talk about on another day. I think there actually is a lot more, what would you call it, handshake emoji between language models and the local first data model. And that was part of the topic of the conference here. But yeah, topic for another day.

Not necessarily. I mean, if I can grab your thoughts at the end on local first and AI, we can talk about that. I featured, I selected as one of my keynotes, Justine Tunney, who works on Llamafile at Mozilla, because I think there's a lot of people interested in that stuff.

But we can focus on the headline topic, just to not bury the lede, which is that we're talking about how to hire AI engineers. This is something that I've been looking for a credible source on for months. People keep asking me for my opinions. I don't feel qualified to give an opinion, given that I only have so much engineering experience.

And it's not like I've defined a hiring process that I'm super happy with, even though I've worked with a number of AI engineers. I'll just leave it open to you, James. How was your process of defining your hiring roles? Yeah. So I think the first thing to say is that we've effectively been hiring for this kind of a role since before you coined the term and tried to kind of build this understanding of what it was, which is not a bad thing.

It was a concept that was coming to the fore and effectively needed a name, which is what you did. So the reason I mentioned that is I think it was something that we kind of backed into, if you will. We didn't sit down and come up with a brand new role from scratch.

This is a completely novel set of responsibilities and skills that this person would need. However, it is a kind of particular blend of different skills and attitudes and curiosities, interests, which I think makes sense to kind of bundle together. So in the post, the three things that we say are most important for a highly effective AI engineer are, first of all, conventional software engineering skills, which is kind of a given, but definitely worth mentioning.

The second thing is a curiosity and enthusiasm for machine learning and maybe in particular language models. That's certainly true in our case. And then the third thing is to do with basically a fault-first mindset, being able to build systems that can handle things going wrong in some sense. And yeah, I think the kind of middle point, the curiosity about ML and language models is probably fairly self-evident.

They're going to be working with and prompting and dealing with the responses from these models. So that's clearly relevant. The last point though, maybe takes the most explaining to do with this fault-first mindset and the ability to build resilient systems. The reason that is so important is because compared to normal APIs where normal, think of something like a Stripe API or a search API or something like this, conventional search API, the latency when you're working with language models is wild.

Like you can get 10X variation. I mean, I was looking at the stats before the podcast, actually: we do often, in fact, see a 10X variation in the P90 latency over the course of half an hour or an hour when we're prompting these models, which is way higher than if you're working with a more conventionally backed API.

And the responses that you get, the actual content of the responses are naturally unpredictable as well. They come back with different formats. Maybe you're expecting JSON. It's not quite JSON. You have to handle this stuff. And also the semantics of the messages are unpredictable too, which is a good thing.

Like this is one of the things that you're looking for from these language models, but it all adds up to needing to build a resilient, reliable, solid-feeling system on top of this fundamentally, well, certainly currently, fundamentally shaky foundation. The models do not behave in the way that you would like them to, and the ability to structure the code around them such that it does give the user this warm, assuring, snappy, solid feeling is really what we're driving for there.
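To make the "maybe you're expecting JSON, it's not quite JSON" point concrete, here is a minimal, hypothetical sketch of the kind of defensive parsing this implies. It is not Elicit's code, just one way to coerce almost-JSON model output into something structured:

```python
import json
import re
from typing import Any, Optional


def coerce_model_json(raw: str) -> Optional[dict[str, Any]]:
    """Best-effort parsing of model output that was asked to be JSON.

    Models often wrap JSON in markdown fences, add commentary around it,
    or return something that is almost-but-not-quite valid.
    """
    # 1. Happy path: the output is already valid JSON.
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass

    candidates = []
    # 2. Contents of a markdown code fence, if the model added one.
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", raw, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))
    # 3. Otherwise, the first {...} block anywhere in the text.
    braces = re.search(r"\{.*\}", raw, re.DOTALL)
    if braces:
        candidates.append(braces.group(0))

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            continue

    # 4. Give up: the caller decides whether to retry, fall back, or surface an error.
    return None
```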

Yeah, I think, sorry, go ahead. Go ahead. What really struck me as we dug in on the content for this article was that third point there.

The language models as this kind of chaotic medium, this dragon, this wild horse you're riding and trying to guide in the direction that is going to be useful and reliable to users. Because I think so much of software engineering is about making things not only high performance and snappy, but really just making it stable, reliable, predictable, which is literally the opposite of what you get from the language models.

And yet, yeah, the output is so useful. And indeed, some of their creativity, if you want to call it that, is precisely their value. And so you need to work with this medium. And I guess the nuance, or the thing that came out of Elicit's experience that I thought was so interesting, is that quite a lot of working with that draws on things that come from distributed systems engineering.

But the AI engineers, as we're defining them or labeling them on the Elicit team, are really application developers. You're building things for end users. You're thinking about, okay, I need to populate this interface with some response to user input that's useful to the tasks they're trying to do.

But you have this thing, this medium that you're working with, that in some ways you need to apply some of this chaos engineering, distributed systems engineering, which typically those people with those engineering skills are not kind of the application level developers with the product mindset or whatever. They're more deep in the guts of a system.

And so those skills and knowledge do exist throughout the engineering discipline, but putting them together into one person, that feels like sort of a unique thing. And working with the folks on the Elicit team who have those skills, I'm quite struck by that unique blend. I haven't really seen that before in my 30-year career in technology.

Yeah, that's fascinating. I like the reference to chaos engineering. I have some appreciation. I think when you had me on your podcast, I was still working at Temporal, and that was like a nice framework. If you live within Temporal's boundaries, you can pretend that all those faults don't exist, and you can code in a sort of very fault-tolerant way.

What are you guys' solutions around this, actually? I think you're emphasizing having the mindset, but maybe naming some technologies would help. Not saying that you have to adopt these technologies, but they're just quick vectors into what you're talking about when you're talking about distributed systems. That's such a big, chunky term.

Are we talking Kubernetes? I suspect we're not. We're talking something else now. Yeah, that's right. It's more the application level rather than at the infrastructure level, at least the way that it works for us. So there's nothing kind of radically novel here. It is more a careful application of existing concepts.

So the kinds of tools that we reach for to handle these kinds of slightly chaotic objects that Adam was just talking about are retries, and fallbacks, and timeouts, and careful error handling. Yeah, the standard stuff, really. We also rely heavily on parallelization, because these language models are not innately very snappy, and there's just a lot of I/O going back and forth.
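None of this is exotic, but to ground it: here is a minimal sketch, assuming an async Python backend, of the retry, timeout, and parallelization plumbing being described. The model call itself is abstracted into a callable rather than tied to any particular provider SDK, and this is illustrative rather than Elicit's actual code:

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    timeout_s: float = 30.0,
    max_attempts: int = 3,
) -> T:
    """Wrap a single model call with a hard timeout and exponential backoff."""
    last_error: Exception | None = None
    for attempt in range(max_attempts):
        try:
            # Latency is wildly variable, so every call gets a deadline.
            return await asyncio.wait_for(call(), timeout=timeout_s)
        except Exception as err:  # in real code, narrow to timeouts and transient API errors
            last_error = err
            if attempt < max_attempts - 1:
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
    raise RuntimeError("model call failed after retries") from last_error


async def map_in_parallel(
    calls: list[Callable[[], Awaitable[T]]],
    *,
    max_concurrency: int = 8,
) -> list[T]:
    """Run many slow model calls concurrently, with a cap on concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(call: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await call_with_retries(call)

    return await asyncio.gather(*(run_one(c) for c in calls))
```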

All these things I'm talking about, when I was in the earlier stages of my career, were the difficult parts that more senior software engineers would be better at: careful error handling, and concurrency, and fallbacks, and distributed systems, and eventual consistency, and all this kind of stuff.

And as Adam was saying, the kind of person that is deep in the guts of some kind of distributed system, a really high-scale back-end kind of a problem, would probably naturally have these kinds of skills. But you'll need them on day one if you're building an ML-powered app, even if it hasn't got massive scale.

I think one thing that I would mention that we do-- yeah, maybe two related things, actually. The first is we're big fans of strong typing. We share the types all the way from the back-end Python code to the front-end in TypeScript, and find that-- I mean, we'd probably be doing this anyway, but it really helps one reason about the shapes of the data that are going back and forth. And that's really important when you can't rely upon-- you're going to have to coerce the data that you get back from the ML if you want it to be structured, basically speaking.

And the second thing which is related is we use checked exceptions inside our Python code base, which means that we can use the type system to make sure we are handling, properly handling, all of the various things that could be going wrong, all the different exceptions that could be getting raised.

Checked exceptions are not really particularly popular, actually. There aren't many people who are big fans of them. But for our particular use case, to really make sure that we haven't just forgotten to handle a particular type of error, we have found them useful for forcing us to think about all the different edge cases that could come up.
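Python has no native checked exceptions, so Elicit's mechanism is presumably a library or convention the transcript doesn't name. As a hedged illustration only, here is one common way to get a similar guarantee with a Result-style return type that the type checker can inspect; all the names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar, Union

T = TypeVar("T")
E = TypeVar("E", bound=Exception)


@dataclass
class Ok(Generic[T]):
    value: T


@dataclass
class Err(Generic[E]):
    error: E


Result = Union[Ok[T], Err[E]]


class ModelTimeout(Exception):
    """The provider did not respond within our deadline."""


class MalformedOutput(Exception):
    """The model responded, but not in the shape we asked for."""


def summarize(paper_text: str) -> Result[str, Union[ModelTimeout, MalformedOutput]]:
    """Hypothetical model-backed function whose failure modes live in its signature.

    The real version would call a language model; here we just simulate it.
    """
    if not paper_text.strip():
        return Err(MalformedOutput("empty input"))
    return Ok(f"Summary of: {paper_text[:40]}")


def handle(paper_text: str) -> str:
    # Each declared failure mode gets an explicit branch, so "we forgot to
    # handle this error" becomes visible in review and to the type checker.
    match summarize(paper_text):
        case Ok(value=summary):
            return summary
        case Err(error=ModelTimeout()):
            return "The model is slow right now; please retry."
        case Err(error=MalformedOutput()):
            return "We couldn't parse the model's answer."
        case _:
            raise AssertionError("unhandled result type")
```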

That's fascinating. Just a quick note on technology: how do you share types from Python to TypeScript? Do you use GraphQL? Do you use something else? We don't use GraphQL. So we've got the types defined in Python, that's the source of truth, and we go from the OpenAPI spec, and there's a tool that we can use to generate TypeScript types from those OpenAPI definitions.
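For readers who want the concrete shape of that pipeline: James doesn't name the framework or the generator, so the following is an assumption-laden sketch (FastAPI plus an off-the-shelf OpenAPI-to-TypeScript tool), not a description of Elicit's actual setup:

```python
# Hypothetical illustration: a framework like FastAPI derives an OpenAPI spec
# from Pydantic models, and a front-end codegen step turns that spec into
# TypeScript types. The exact stack and tool at Elicit are not named here.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PaperSummary(BaseModel):
    """Single Python source of truth for this response shape."""
    title: str
    summary: str
    citations: list[str]


@app.get("/papers/{paper_id}/summary", response_model=PaperSummary)
async def get_summary(paper_id: str) -> PaperSummary:
    # In the real app this would be backed by model calls; stubbed here.
    return PaperSummary(title="Example", summary="...", citations=[])

# The spec is then served at /openapi.json, and a build step such as
#   npx openapi-typescript http://localhost:8000/openapi.json -o src/api-types.ts
# keeps the front-end TypeScript types in lockstep with the Python definitions.
```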

Okay, cool. Sorry for diving into that rabbit hole a little bit. I always like to spell out technologies for people to dig their teeth into. One thing I'll mention quickly is that a lot of the stuff that you mentioned is typically not part of the normal interview loop. It's actually really hard to interview for, because this is the stuff that you polish out as you go into production.

Coding interviews are typically about the happy path. How do we do that? How do we look for a defensive, fault-first mindset? Because you can write defensive code all day long and not add functionality to your application. Yeah, it's a great question, and I think that's exactly true. Normally, the interview is about the happy path, and then there's maybe a box-checking exercise at the end where the candidate says, "Of course, in reality, I would handle the edge cases," or something like this.

That, unfortunately, isn't quite good enough when the happy path is very, very narrow, and there's lots of weirdness on either side. Basically speaking, it's just a case of foregrounding those kind of concerns through the interview process. There's no magic to it. We talk about this in the post that we're going to be putting up on LatentSpace, but there's two main technical exercises that we do through our interview process for this role.

The first is more coding-focused, and the second is more system design-y, whiteboarding a potential solution. Without giving too much away, in the coding exercise, you do need to think about edge cases. You do need to think about errors. How best to put this? Yeah, the exercise consists of adding features and fixing bugs inside the code base.

In both of those cases, because of the way that we set the application and the interview up, it does demand that you think about something other than the happy path. But you're thinking along the right lines: the prompt is, how do we get the candidate thinking outside of the normal sweet spot, the smoothly paved path?

In terms of the system design interview, that's a little easier to prompt this fault-first mindset, because it's very easy in that situation just to say, let's imagine that this node dies. How does the app still work? Let's imagine that this network is going super slow. Let's imagine that, I don't know, you run out of capacity in this database that you've sketched out here.

How do you handle that sort of stuff? So, in both cases, they're not firmly anchored to and built specifically around language models and the ways language models can go wrong, but we do exercise the same muscles of thinking defensively and foregrounding the edge cases, basically. Yeah, any comment there? Yeah, I guess I wanted to mention too, James, earlier there you mentioned retries, and this is something I've seen some interesting debates about internally. First of all, retries can be costly, right?

In general, this medium, in addition to having this incredibly high variance and response rate and being non-deterministic, is actually quite expensive. And so, in many cases, doing a retry when you get a fail does make sense, but actually that has an impact on cost. And so, there is some sense to which, at least I've seen the AI engineers on our team worry about that.

They worry about, okay, how do we give the best user experience, but balance that against what the infrastructure is going to cost our company, which I think is, again, an interesting mix of, yeah, again, it's a little bit the distributed system mindset, but it's also a product perspective and you're thinking about the end user experience, but also the bottom line for the business.

You're bringing together a lot of qualities there. And there's also the fallback case, which is kind of a related or adjacent one. I think there was also a discussion on that internally where, I think it maybe was search, there was something recently where one of the frontline search providers was having some, yeah, slowness and outages, and we had a fallback, but essentially that meant that for a while people, especially new users who come in and don't know the difference, were getting worse results for their search.

And so, then you have this debate about, okay, there's sort of what is correct to do from an engineering perspective, but then there's also what actually is the best result for the user. Is giving them a kind of a worse answer to their search result better, or is it better to kind of give them an error and be like, yeah, sorry, it's not working right at the moment, try later.

Both are obviously non-optimal, but this is the kind of thing I think that you run into, or the kind of thing we need to grapple with, a lot more than you would with other kinds of mediums. Yeah, that's a really good example. I think it brings to the fore the two different things that you could be optimizing for: uptime and a response at all costs on one end of the spectrum, and then, effectively, fragility, but if you do get a response, it's the best response we can come up with, at the other end of the spectrum.

And where you want to land there kind of depends on, well, it certainly depends on the app, obviously depends on the user. I think it depends on the feature within the app as well. So in the search case that you mentioned there, in retrospect, we probably didn't want to have the fallback.

And we've actually just recently, on Monday, changed that to show an error message rather than giving people a kind of degraded experience. In other situations, we could use, for example, a large language model from provider B rather than provider A, and get something which is within a few percentage points on performance.

And that's just a really different situation. Yeah, like any interesting question, the answer is it depends. I do hear a lot of people suggesting, let's call this model shadowing, as a defensive technique, which is if OpenAI happens to be down, which happens more often than people think, then you fall back to Anthropic or something.

How realistic is that? Don't you have to develop completely different prompts for different models, and won't the performance of your application suffer for whatever reason? It maybe calls differently, or it's not maintained in the same way. I think that people raise this idea of fallbacks to models, but I don't see it practiced very much.

Yeah, it is. You definitely need to have a different prompt if you want to stay within a few percentage points of degradation, like I said before. And that's certainly a cost of fallbacks and backups and things like this: it's really easy for them to go stale and kind of flake out on you because they're off the beaten track.

And in our particular case inside of Elicit, we do have fallbacks for a number of crucial functions where it's going to be very obvious if something has gone wrong, but we don't have fallbacks in all cases. It really depends on a task-to-task basis throughout the app, so I can't give you a single simple rule of thumb for, in this case, do this, and in the other, do that.

But yeah, it's a little bit easier now that the APIs between the Anthropic models and OpenAI are more similar than they used to be, so we don't have two totally separate code paths with different protocols, like wire protocols, to speak, which makes things easier. But you're right, you do need to have different prompts if you want to have similar performance across the providers.
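To sketch what that kind of fallback looks like in practice: this is a hypothetical outline, with the provider calls abstracted rather than any real SDK, and not Elicit's code. The key point is that each provider carries its own separately tuned prompt.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class ProviderConfig:
    name: str
    # Each provider gets its own prompt, tuned separately: a prompt that works
    # well on one model family often degrades on another, as noted above.
    prompt_template: str
    # The actual API call is abstracted away; plug in the SDK of your choice.
    complete: Callable[[str], Awaitable[str]]


async def answer_with_fallback(question: str, providers: list[ProviderConfig]) -> str:
    """Try providers in order, each with its own prompt, until one succeeds."""
    failures: list[str] = []
    for provider in providers:
        prompt = provider.prompt_template.format(question=question)
        try:
            return await asyncio.wait_for(provider.complete(prompt), timeout=30.0)
        except Exception as err:  # narrow to timeouts / API errors in real code
            failures.append(f"{provider.name}: {err}")
    raise RuntimeError(f"all providers failed: {failures}")
```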

I'll also note, just observing again as a relative newcomer here, I was surprised, impressed, I'm not sure what the word is for it, at the blend of different backends that the team is using. The product presents as kind of one single interface, but there are actually several dozen main paths.

There's like, for example, the search versus a data extraction of a certain type versus chat with papers versus, and each one of these, you know, the team has worked very hard to pick the right model for the job and craft the prompt there, but also is constantly testing new ones.

So a new one comes out, either from the big providers or, in some cases, our own models running on essentially our own infrastructure, and sometimes that's more about cost or performance. But the point is that switching very fluidly between them, and very quickly, because this field is moving so fast and there are new ones to choose from all the time, is part of the day-to-day, I would say. So it isn't a case of there being one main model that's been kind of the same for a year, with a fallback that's got cobwebs on it; it's more that which model and which prompt we use is changing weekly. And so I think it's quite reasonable to have a fallback that you can expect might work.

I'm curious, because you guys have had experience working at both, you know, Elicit, which is a smaller operation, and larger companies. A lot of companies are looking at this with a certain amount of trepidation because, you know, it's very chaotic. When you have one engineering team where everyone knows everyone else's names and, like, you know, they meet constantly in Slack and know what's going on, it's easier to sync on technology choices.

When you have 100 teams, all shipping AI products, and all making their own independent tech choices, it can be very hard to control. One solution I'm hearing from the Salesforces of the world and the Walmarts of the world is that they are creating their own AI gateway, right? An internal AI gateway.

This is the one model hub that controls all the things, and has all standards. Is that a feasible thing? Is that something that you would want? Is that something you have and you're working towards? What are your thoughts on this stuff? Like, centralization of control, or like an AI platform internally?

Yeah, I think certainly for larger organizations, and organizations that are doing things which maybe are running into HIPAA compliance or other regulatory requirements like that, it could make a lot of sense. Yeah. I think the TLDR for something like Elicit is, we are small enough, as you indicated, and need to have full control over all the levers available, and switch between different models, and different prompts, and whatnot.

As Adam was just saying, that kind of thing wouldn't work for us. But yeah, I've spoken with and advised a couple of companies that are trying to sell into that kind of a space, or at a larger stage, and it does seem to make a lot of sense for them.

So, for example, if you're trying to sell to a large enterprise, and they cannot have any data leaving the EU, then you need to be really careful about someone just accidentally putting in the US-East-1 GPT-4 endpoint, or something like this. If you're... Do you want to think of a more specific example there?

Yeah. I think the... I'd be interested in understanding better what the specific problem is that they're looking to solve with that, whether it is to do with data security, or centralization of billing, or if they have a kind of suite of prompts, or something like this, that people can choose from, so they don't need to reinvent the wheel again and again.

I wouldn't be able to say without understanding the problems and their proposed solutions, you know, which kind of situations that'd be better or worse fit for. But yeah, for Elicit, where really the secret sauce, if there is a secret sauce, is which models we're using, how we're using them, how we're combining them, how we're thinking about the user problem, how we're thinking about all these pieces coming together.

You really need to have all of the affordances available to you to be able to experiment with things and iterate rapidly. And generally speaking, whenever you put these kind of layers of abstraction, and control, and generalization in there, that gets in the way. So for us, it would not work.

Do you feel like there's always a tendency to want to reach for standardization and abstractions pretty early in a new technology cycle? There's something comforting there, or you feel like you can see them, or whatever. I feel like there's some of that discussion around LangChain right now. But yeah, this is not only so early, but also moving so fast.

I think it's tough to ask for that. That's not the space we're in. But yeah, the larger the organization, the more your default is to want to reach for that. It's a sort of comfort. Yeah, that's interesting. I find it interesting that you would say that, being a founder of Heroku, which was one of the first platforms as a service and more or less standardized what that early development experience should look like.

And I think basically people are feeling the differences between calling various model lab APIs and having an actual AI platform where all their development needs are thought of for them. I define this in my AI engineer post as well. The model labs just see their job ending at serving models, and that's about it.

But actually, the responsibility of the AI engineer is to fill in a lot of the gaps beyond that. Yeah, that's true. A huge part of the exercise with Heroku was inspired by Rails, which itself was one of the first frameworks to standardize the CRUD app with a SQL database, and people had been building apps like that for many, many years.

I had built many apps. I had made my own kind of templates based on that. I think others had done it too. And Rails came along at the right moment, where we had been doing it long enough that you could see the patterns, and then you could say, look, let's extract those into a framework that's going to make it easier to build not only for the experts, but also for people who are relatively new, because the best practices are encoded into that framework, in model-view-controller, to take one example.

But then, yeah, once you see that, and once you experience the power of a framework, and again, it's so comforting, and you develop faster, and it's easier to onboard new people to it because you have these standards and this consistency, then folks want that for something new that's evolving.

Now, here I'm thinking maybe if you fast forward a little to, for example, when React came on the scene a decade ago or whatever, and then, okay, we need to do state management, what's that? And then there's a new library every six months. Okay, this is the one, this is the gold standard.

And then six months later, that's deprecated. Because, of course, it's evolving. You need to figure it out. The tacit knowledge and the experience of putting it into practice and seeing what those real needs are, are critical. And so it is really about finding the right time to say, yes, we can generalize, we can make standards and abstractions, whether it's for a company, for an open source library, or for a whole class of apps. And it's very much more of a judgment call, or just a sense of taste or experience, to be able to say, yeah, we're at the right point, we can standardize this.

But at least in my view, again, I'm so new to this world compared to you both, but my sense is, yeah, it's still the Wild West. That's what makes it so exciting, and it feels kind of too early for too much in the way of standardized abstractions. Not that it's not interesting to try, but you can't necessarily get there in the same way Rails did until you've got that decade of experience, or whatever it is, of building different classes of apps with that technology.

Yeah, it's interesting to think about what is going to stay more static and what is expected to change over the coming five years, let's say, which, when I think about it through an ML lens, seems like an incredibly long time. And if you just said five years, it doesn't seem that long.

I think that kind of talks to part of the problem here is that things that are moving are moving incredibly quickly. I would expect, this is my hot take rather than some kind of official carefully thought out position, but my hot take would be something like, you'll be able to get to good quality apps without doing really careful prompt engineering.

I don't think that prompt engineering is going to be a kind of durable differential skill that people will hold. I do think that the way that you set up the ML problem to kind of ask the right questions, if you see what I mean, rather than the specific phrasing of exactly how you're doing chain of thought or few shot or something in the prompt, I think the way that you set it up is probably going to remain to be trickier for longer.

And I think some of the operational challenges that we've been talking about of wild variations in latency and handling the... I mean, one way to think about these models is the first lesson that you learn when you're an engineer, software engineer, is that you need to sanitize user input, right?

I think it was the top OWASP security threat for a while. You have to sanitize and validate user input. And we got used to that. And it kind of feels like this is the shell around the app and then everything else inside you're kind of in control of, and you can grasp and you can debug, et cetera.

And what we've effectively done, through some kind of weird rearguard action, is we've now got these slightly chaotic things. I think of them more as complex adaptive systems, which are related but a bit different, and definitely have some of the same dynamics. We've injected these into the foundations of the app.

And you kind of now need to think with this defensive mindset downwards as well as upwards, if you see what I mean. So I think it will take a while for us to truly wrap our heads around that. Also, these kinds of problems, you have to handle things being unreliable and slow sometimes and whatever else, even if it doesn't happen very often, there isn't some kind of industry-wide accepted way of handling that at massive scale.

There are definitely patterns and anti-patterns and tools and whatnot, but it's not like this is a solved problem. So I would expect that it's not going to be easily solved at the ML scale either. Yeah, excellent. In the terminology of the stuff that I've written in the past, I describe this inversion of architecture as LLM-at-the-core versus code-at-the-core.

We're very used to code at the core. Actually, we can scale that very well. When we build LLM core apps, we have to realize that the central part of our app that's orchestrating things is actually prone to prompt injections and non-determinism and all that good stuff. I did want to move the conversation a little bit from the sort of defensive side of things to the more offensive or the fun side of things, capabilities side of things, because that is the other part of the job description that we kind of skimmed over.

So I'll repeat what you said earlier. You want people to have a genuine curiosity and enthusiasm for the capabilities of language models. We're recording this the day after Anthropic just dropped Claude 3.5. I was wondering, maybe this is a good exercise: how do people have curiosity and enthusiasm for the capabilities of language models when, for example, the research paper for Claude 3.5 is four pages?

There's not much. Yeah. Well, maybe that's not a bad thing, actually, in this particular case. So yeah, if you really want to know exactly how the sausage was made, that hasn't been possible for a few years now, in fact, for these new models. But from our perspective, when we're building Elicit, what we primarily care about is what can these models do?

How do they perform on the tasks that we already have set up and the evaluations we have in mind? And then, on a slightly more expansive note, what kinds of new capabilities do they seem to have that we can elicit, no pun intended, from the models? For example, well, there are very obvious ones like multimodality.

There wasn't that, and then there was that. Or it could be something a bit more subtle, like it seems to be getting better at reasoning, or it seems to be getting better at metacognition, or it seems to be getting better at marking its own work and giving calibrated confidence estimates, things like this.

Yeah, there's plenty to be excited about there. It's just that, yeah, there's rightly or wrongly been this shift over the last few years to not give all the details. No, but from application development perspective, every time there's a new model released, there's a flow of activity in our Slack, and we try to figure out what it can do, what it can't do, run our evaluation frameworks.

And yeah, it's always an exciting, happy day. Yeah, from my perspective, what I'm seeing from the folks on the team is, first of all, just awareness of the new stuff that's coming out. So that's an enthusiasm for the space and following along. And then being able to very quickly, and partially that's having Slack to do this in, map that to: okay, what does this do for our specific case?

And the simple version of that is, let's run the evaluation framework, of which Elicit has quite a comprehensive one. I'm actually working on an article on that right now, which I'm very excited about, because it's a very interesting world of things. But basically you can just try the new model in the evaluation framework and run it.

It has a whole slew of benchmarks, which include not just accuracy and confidence, but also things like performance, cost, and so on. And all of these things may trade off against each other. Maybe it's very slightly worse, but it's way faster and way cheaper, so actually it might be a net win, for example. Or it's way more accurate, but that comes at the cost of being slower and more expensive.
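Elicit's evaluation framework is only described at a high level here, so as a hedged, toy illustration of the trade-off bookkeeping Adam mentions (accuracy versus latency versus cost), and not their actual harness:

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    expected: str  # gold answer for a simple exact-match check


@dataclass
class EvalReport:
    model_name: str
    accuracy: float
    mean_latency_s: float
    total_cost_usd: float


def evaluate(
    model_name: str,
    complete: Callable[[str], str],  # wraps whatever SDK or gateway you use
    cost_per_call_usd: float,        # crude per-call cost model, for the sketch
    cases: list[EvalCase],
) -> EvalReport:
    """Score one model on accuracy, latency, and cost over a fixed case set."""
    correct = 0
    latencies: list[float] = []
    for case in cases:
        start = time.monotonic()
        answer = complete(case.prompt)
        latencies.append(time.monotonic() - start)
        if answer.strip().lower() == case.expected.strip().lower():
            correct += 1
    return EvalReport(
        model_name=model_name,
        accuracy=correct / len(cases),
        mean_latency_s=mean(latencies),
        total_cost_usd=cost_per_call_usd * len(cases),
    )

# Usage: run the same cases against the incumbent model and a new candidate,
# then weigh the accuracy / latency / cost trade-offs side by side.
```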

And so now you need to think about those trade-offs. And so to me, coming back to the qualities of an AI engineer, especially when you're trying to hire for them, it is very much an application developer in the sense of a product mindset of what are our users or our customers trying to do?

What problem do they need solved? Or what does our product solve for them? And how does the capabilities of a particular model potentially solve that better for them than what exists today? And by the way, what exists today is becoming an increasingly gigantic cornucopia of things, right? And so you say, okay, this new model has these capabilities, therefore the simple version of that is plug it into our existing evaluations and just look at that and see if it seems like it's better for a straight out swap out.

But when you talk about, for example, you have multimodal capability and then you say, okay, wait a minute, actually maybe there's a new feature or a whole new way we could be using it, not just a simple model swap out, but actually a different thing we could do that we couldn't do before that would have been too slow or too inaccurate or something like that, that now we do have the capability to do.

So I think of that as being a kind of core skill. I don't even know if I want to call it a skill. Maybe it's even like an attitude or a perspective, which is a desire to both be excited about the new technology, the new models and things as they come along, but also holding in the mind, what does our product do?

Who is our user? And how can we connect the capabilities of this technology to how we're helping people in whatever it is our product does? Yeah. I'm just looking at one of our internal Slack channels where we talk about things like new model releases and that kind of thing.

And it is notable, looking through these, the kinds of things that people are excited about and not. It isn't, I don't know, "the context window is much larger" or "look at how many parameters it has" or something like this. It's always framed in terms of maybe this could be applied to that part of Elicit, or maybe this would open up this new possibility for Elicit.

And as Adam was saying, yeah, I don't think it's really a novel or separate skill. It's the kind of attitude I would like all engineers to have at a company our stage, actually, and maybe more generally even, which is not just kind of getting nerd-sniped by some technology number or fancy metric or something, but asking how this is actually going to be applicable to the thing which matters in the end.

How is this going to help users? How is this going to help move things forward strategically? That kind of thing. Yeah, applying what you know, I think is the key here. Getting hands on as well. I would recommend a few resources for people listening along. The first is Elicit's ML reading list, which I found so delightful after talking with Andreas about it.

It looks like that's part of your onboarding. We've actually set up an asynchronous paper club inside of my Discord for people following along with that reading list. I love that you separate things out into tiers one, two, and three, and that gives people a factored-cognition way of looking into the corpus, right?

Yes, the corpus of things to know is growing and the water is slowly rising as far as what a bar for a competent AI engineer is, but I think having some structured thought as to what are the big ones that everyone must know, I think is key. It's something I haven't really defined for people, and I'm glad that Elicit actually has something out there that people can refer to.

I wouldn't necessarily make it required for the job interview, maybe, but it'd be interesting to see what would be a red flag if some AI engineer didn't know it. I don't know if we would go so far as to call something required knowledge, or you're not part of the cool kids club, but there increasingly is something like that, right?

Not knowing what context is is a black mark in my opinion, right? Yeah, I think it does connect back to what we were saying before of this genuine curiosity about ML. Well, maybe it's actually that combined with something else which is really important, which is a self-starting bias towards action kind of a mindset, which again- Everybody needs.

Exactly, yeah. Everyone needs that, so if you put those two together, or if I'm truly curious about this and I'm going to figure out how to make things happen, then you end up with people reading reading lists, reading papers, doing side projects, this kind of thing. So it isn't something that we explicitly include.

We don't have an ML-focused interview for the AI engineer role at all, actually. It doesn't really seem helpful. The skills which we are checking for, as I mentioned before, this fault-first mindset and conventional software engineering kind of thing, it's point one and point three on the list that we talked about.

In terms of checking for ML curiosity and how familiar they are with these concepts, that's more through conversational interviews and culture-fit types of things. We want them to have a take on what Elicit is doing, certainly as they progress through the interview process. They don't need to be completely up to date on everything we've ever done on day zero, although that's always nice when it happens.

But for them to really engage with it, ask interesting questions, and be kind of brought into our view on how we want ML to proceed, I think that is really important and that would reveal that they have this kind of interest, this ML curiosity. There's a second aspect to that.

I don't know if now's the right time to talk about it, which is I do think that an ML-first approach to building software is something of a different mindset. I could describe that a bit now if that seems good, but up to you. So yeah, I think when I joined Elicit, this was the biggest adjustment that I had to make personally.

So as I said before, I'd been effectively building conventional software stuff for 15 years or so, something like this, well for longer actually, but professionally for like 15 years, and had a lot of pattern matching built into my brain and kind of muscle memory for if you see this kind of a problem, then you do that kind of a thing.

And I had to unlearn quite a lot of that when joining Elicit because we truly are ML-first and try to use ML to the fullest. And some of the things that that means is this relinquishing of control almost. At some point, you are calling into this fairly opaque black box thing and hoping it does the right thing, and dealing with the stuff that it sends back to you.

And that's just very different if you're interacting with, again, APIs and databases, that kind of a thing. You can't just keep on debugging. At some point, you hit this obscure wall. And I think the second part to this is, the pattern I was used to is that the external parts of the app are where most of the messiness is, not necessarily in terms of code, but in terms of degrees of freedom almost.

If the user can and will do anything at any point, and they'll put all sorts of wonky stuff inside of text inputs, and they'll click buttons you didn't expect them to click, and all this kind of thing. But then by the time you're down into your SQL queries, for example, as long as you've done your input validation, things are pretty well defined.

And that, as we said before, is not really the case. When you're working with language models, there is this kind of intrinsic uncertainty when you get down to the kernel, down to the core. Even beyond that, all that stuff is somewhat defensive, and these are things to be wary of to some degree.

The flip side of that, the really kind of positive part of taking an ML-first mindset when you're building applications, is that once you get comfortable taking your hands off the wheel at a certain point, and relinquishing control, letting go, really kind of unexpected, powerful things can happen if you lean on the capabilities of the model without trying to overly constrain and slice and dice problems to the point where you're not really wringing out the most capability from the model that you might.

So, I was trying to think of examples of this earlier, and one that came to mind was we were working really early, just after I joined Elicit, we were working on something where we wanted to generate text and include citations embedded within it. So, it'd have a claim, and then, you know, square brackets, one, in superscript, something like this.

And every fiber in my being was screaming that we should have some way of kind of forcing this to happen, or structured output, such that we could guarantee that this citation was always going to be present later on, you know, that the indication of a footnote would actually match up with the footnote itself. And I kind of went into this symbolic, "I need full control" kind of mindset.

And it was notable that Andreas, who's our CEO and, again, has been on the podcast, was the opposite. He was just kind of, "Give it a couple of examples, and it'll probably be fine, and then we can figure it out with a regular expression at the end." It really did not sit well with me, to be honest.

I was like, "But it could say anything. It could literally say anything." And I don't know about just using a regex to sort of handle this. This is an important feature of the app. But, you know, that's my first kind of starkest introduction to this ML-first mindset, I suppose, which Andreas has been cultivating for much longer than me, much longer than most.

Yeah, there might be some surprises of stuff you get back from the model, but you can also... it's about finding the sweet spot, I suppose, where you don't want to give a completely open-ended prompt to the model and expect it to do exactly the right thing. You can ask it too much, and it gets confused, and starts repeating itself, or goes around in loops, or just goes off in a random direction, or something like this.

But you can also over-constrain the model and not really make the most of the capabilities. And I think that is a mindset adjustment that most people who are coming into AI engineering afresh would need to make of giving up control and expecting that there's going to be a little bit of extra pain and defensive stuff on the tail end.

But the benefits that you get as a result are really striking. That was a brilliant start. The ML-first mindset, I think, is something that I struggle with as well, because the errors, when they do happen, are bad. They will hallucinate, and your systems will not catch it sometimes if you don't have a large enough sample set.

I'll leave it open to you, Adam. What else do you think about when you think about curiosity and exploring capabilities? Are there reliable ways to get people to push themselves on capabilities? Because I think a lot of times we have this implicit overconfidence, maybe, where we think we know what a thing is when actually we don't.

And we need to keep a more open mind. And I think you do a particularly good job of always having an open mind. And I want to get that out of more engineers that I talk to, but I struggle sometimes. And I can scratch that question if nothing comes to mind.

Yeah. I suppose being an engineer is, at its heart, this sort of contradiction: on one hand, the systematic, almost very literal desire to control exactly what James described, to understand everything, model it in your mind, precision, systematizing. But fundamentally, it is a creative endeavor. At least I got into creating with computers because I saw them as a canvas for creativity, for making great things, and as a medium for making things that are so multidimensional that it goes beyond any medium humanity's ever had for creating things.

So I think or hope that a lot of engineers are drawn to it partially because you need both of those. You need that systematic, controlling side, and then the creative, open-ended, almost like artistic side. And I think it is exactly the same here. In fact, if anything, I feel like there's a theme running through everything James has said here, which is, in many ways, what we're looking for in an AI engineer is not really all that fundamentally different from other, call it conventional engineering or other types of engineering, but working with this strange new medium that has these different qualities.

But in the end, a lot of the things are an amalgamation of past engineering skills. And I think that mix of curiosity, artistic, open-ended, what can we do with this, with a desire to systematize, control, make reliable, make repeatable, is the mix you need. And trying to find that balance, I think, is probably where it's at.

Fundamentally, I think people who are getting into this field, to work on this, is because they're excited by the promise and the potential of the technology. So to not have that kind of creative, open-ended, curiosity side would be surprising. Why do it otherwise? So I think that blend is always what you're looking for broadly.

But here, now we're just scoping it to this new world of language models. And I think the fault-first mindset and the ML curiosity attitude could be somewhat in tension, right? Because, for example, the stereotypical version of someone that is great at building fault-tolerant systems has probably been doing it for a decade or two.

They've been principal engineer at some massive scale technology company. And that kind of a person might be less able to turn on a dime and relinquish control and be creative and take on this different mindset. Whereas someone who's very early in their career is much more able to do that kind of exploration and follow their curiosity kind of a thing.

And they might be a little bit less practiced in how to serve terabytes of traffic every day, obviously. Yeah, the stereotype that comes to mind for me with those two you just described is the principal engineer, fault-tolerance, handle unpredictable, is kind of grumpy and always skeptical of anything new and it's probably not going to work and that sort of thing.

Whereas the fresh-faced person early in their career, maybe more application-focused, is always thinking about the happy path and the optimistic case: "Oh, don't worry about the edge case. That probably won't happen. I don't write code with bugs," I don't know, whatever, like this. But you really need both together, I think.

Both of those attitudes or personalities, if that's even the right way to put it, together in one person is, I think, what's-- Yeah, and I think people can come from either end of the spectrum, to be clear. Not all grizzled principal engineers are the way that I described, thankfully. Some probably are.

And not all junior engineers are allergic to writing careful software or unable and unexcited to pick that up. Yeah, it could be someone that's in the middle of the career and naturally has a bit of both, could be someone at either end and just wants to round out their skill set and lean into the thing that they're a bit weaker on.

Any of the above would work well for us. Okay, lovely. We've covered a fair amount of like-- Actually, I think we've accidentally defined AI engineering along the way as well, because you kind of have to do that in order to hire and interview for people. The last piece I wanted to offer to our audience is sourcing.

A very underappreciated part, because people just tend to rely on recruiters and assume that the candidates fall from the sky. But I think the two of you have had plenty of experience with really good sourcing, and I just want to leave some time open for what does AI engineer sourcing look like?

Is it being very loud on Twitter? Well, I mean, that definitely helps. I am really quiet on Twitter, unfortunately, but a lot of my teammates are much more effective on that front, which is deeply appreciated. I think in terms of-- Maybe I'll focus a little bit more on active/outbound, if you will, rather than the kind of marketing/branding type of work that Adam's been really effective with us on.

The kinds of things that I'm looking for are certainly side projects. It's really easy still. We're early enough in this process that people can still do interesting work pretty much at the cutting edge, not in terms of training whole models, of course, but in terms of doing AI engineering.

You can very much build interesting apps that have interesting ideas and work well just using a basic OpenAI API key. People sharing that kind of stuff on Twitter is always really interesting, or in Discords or Slacks, things like this. In terms of the kind of caricature of the grizzled principal engineer kind of a person, it's notable.

I've spoken with a bunch of people coming from that kind of perspective. They're fairly easy to find. They tend to be on LinkedIn. They tend to be really obvious on LinkedIn because they're maybe a bit more senior. They've got a ton of connections. They're probably expected to post thought leadership kinds of things on LinkedIn.

Everyone's favorite. Some of those people are interested in picking up new skills and jumping into ML and large language models. Sometimes it's obvious from a profile. Sometimes you just need to reach out and introduce yourself and say, "Hey, this is what we're doing. We think we could use your skills." A bunch of them will bite your hand off, actually, because it is such an interesting area.

That's how we've found success at sourcing on the kind of more experienced end of the spectrum. I think on the less experienced end of the spectrum, having lots of hooks in the ocean seems to be a good strategy if I think about what's worked for us. It tends to be much harder to find those people because they have less of an online presence in terms of active outbound.

Things like blog posts, things like hot takes on Twitter, things like challenges that we might have, those are the kind of vectors through which you can find these keen, full of energy, less experienced people and bring them towards you. Adam, do you have anything? You're pretty good on Twitter compared to me, at least.

What's your take on, yeah, the kind of more like bring stuff out there and have people come towards you for this kind of a role? Yeah, I do typically think of sourcing as being the one-two punch of one, raise the beacon. Let the world know that you are working on interesting problems and you're expanding your team and maybe there's a place for someone like them on that team.

That could come in a variety of forms, whether it's going to a job fair and having a booth. Obviously, it's job descriptions posted to your site. It's things like, in some cases, yeah, blog posts about stuff you're working on, releasing open source, anything that goes out into the world so people find out about what you're doing, not at the very surface level of here's what the product is and, I don't know, we have a couple of job descriptions on the site, but a layer deeper of here's what it actually looks like to work on the sort of things we're working on.

So, I think that's one piece of it, and then the other piece of it, as you said, is the outbound. I think it's not enough to just do that, especially when you're small. I think it changes a lot when you're a bigger company with a strong brand, or if the product you're working on is more in a technical space, and so, therefore, maybe among your customers there are actually the sorts of people that you might like to have work for you.

I don't know, if you're GitHub, then probably all of your users and the people you want to hire are among your user base, which is a nice combination, but for most products that's not going to be the case. So then the outbound is a big piece of it. Part of that is, as you said, getting out into the world, whether it's going to meetups, going to conferences, or being on Twitter, and just genuinely being out there, being part of the field, having conversations with people, seeing people who are doing interesting things, and making connections with them. Hopefully not in a transactional way where you're always just sniffing around for who's available to hire, but because you like this work, you want to be part of the field, and you want to follow along with people who are doing interesting things. Then, by the way, you will discover when they post, "Oh, I'm wrapping up my job here and thinking about the next thing," and that's a good time to ping them and be like, "Oh, cool.

Actually, we have maybe some things that you might be interested in here on the team," and that kind of outbound. But I think it also pairs well. It's not just that you need both, it's that they reinforce each other. So, if someone has seen, for example, the open source project you've released and they're like, "Oh, that's cool," and they briefly look at your company, and then you follow each other on Twitter or whatever, and then they post, "Hey, I'm thinking about my next thing," and you write them, they already have some context of, "Oh, I liked that project you did, and I have some ambient awareness of what you're doing.

Yeah, let's have a conversation. This isn't totally cold." So, I think those two together are important. The other footnote I would put on the specifics: that's general sourcing advice for any kind of role, but for AI engineering specifically, at this stage you're not always looking for professional experience with language models.

It's just too early. So, it's totally fine that someone's professional experience is with conventional engineering skills, as long as the interest, the curiosity, that sort of thing comes through in side projects, hackathons, blog posts, whatever it is. Yeah, absolutely. A lot of people are asking me for San Francisco AI engineers, because there's this sort of wave of reaction against the remote mindset, which I know you guys probably differ in opinion on, but a lot of people are trying to, you know, go back to the office.

And so, my only answer for people is: just find them at the hackathons. The most self-driven, motivated people who can work on things quickly and ship fast are already in hackathons, so just go through the list of winners. And then, self-interestedly, you know, if, for example, someone's hosting an AI conference from June 25th to June 27th in San Francisco, you might want to show up there and see who might be available.

And that is true. It's not something I want to advertise to the employers or the people who come, but a lot of people change jobs at conferences. This is a known thing. Yeah, of course. But I think it's the same as engaging on Twitter, engaging in open source, attending conferences.

100%, this is a great way both to find new opportunities if you're a job seeker and to find people for your team if you're a hiring manager, but if you come at it too network-y and transactional, that's just gross for everyone. Hopefully, we're all people that got into this work largely because we love it, and it's nice to connect with other people who have the same skills and struggle with the same problems in their work. You make genuine connections, you learn from each other, and from that can come, well, not quite a side effect, but an effect nonetheless: pairing together people who are looking for opportunities with people who have interesting problems to work on.

Yeah, totally. Yeah, the most important part of employer branding: have a great mission, have great teammates. If you can show that off in whatever way you can, you'll be starting off on the right foot. On that note, we have been really successful with hiring a number of people from targeted job boards, maybe that's the right way of saying it. So not some kind of generic indeed.com or something, not to trash them, but something that's a bit more tied to your mission, tied to what you're doing, something which is really relevant and which is going to cut down the search space for what you're looking at and what the candidate's looking at. So we're definitely affiliated with the safety, effective altruism kind of movement.

We've gone to a few EA Globals and have hired people effectively through the 80,000 Hours list as well. That's not the only reason why people would want to join Elicit, but as an example: if you're interested in AI safety or, you know, whatever your take is on this stuff, then there's probably something, a Substack, a podcast, a mailing list, a job board, something which lets you zoom in on the particular take that you agree with.

You brought this up, so I have to ask: what is the state of EA post-SBF? I don't know if I'm the right person to answer that, I don't know if I'm the spokesman for it. Yeah, I mean, look, it's still going on. There's definitely a period of reflection and licking of wounds and thinking about how this happened.

There have been a few conversations with people really senior in EA talking about how it was a super difficult time from a personal perspective, and what is this even all about, and is what I've done a good thing. You know, quite a sobering moment for everyone, I think.

But yeah, you know, it's definitely still going. The EA Forum is active, and we have people from Elicit going to EA Global. Yeah, if anything, from a personal perspective, I hope that it helps us spot blowhards and charlatans more easily, and avoid whatever the circumstances were that got us into the situation with SBF and the unfortunate fallout from that.

If it makes us a bit more able to spot that happening, then all for the better. Excellent. Cool, I will leave it there. Any last comments about just hiring in general? Advice to other technology leaders in AI? You know, one thing I'm trying to do for my conference as well is to create a forum for technology leaders to share thoughts, right?

Like what's an interesting trend? What's an interesting open problem? What should people contact you on if they're working on something interesting? Yeah, a couple of thoughts here. So firstly, when I think back to how I was when I was in my early 20s, when I was at when I was at college, or university, the purity and capabilities and just kind of general put togetherness of people at that age now is strikingly different to where I was then.

And I think this is not because I was especially immature or something when I was young; I hear the same thing echoed in other people around my age. So the takeaway from that is finding a way of presenting yourself to, identifying, and bringing in really high-capability young people into your organization.

I mean, it's always been true, but I think it's even more true now. They're kind of more professional, more capable, more committed, more driven, and have more of a sense of what they're all about than certainly I did 20 years ago. So that's the first thing.

I think the second thing is in terms of the interview process. This is somewhat of a general take, but it definitely applies to AI engineer roles, and I think even more so. I really have a strong dislike and distaste for interview questions that are arbitrary and strip away all the context from what it really is to do the work.

We try to make the interview process at Elicit a simulation of working together. The only people that we go into an interview process with are pretty obviously extraordinary, really, really capable. They must have done something for them to have moved into the proper interview process. It is a check on technical capability in the ways that we've described, but it's at least as much them sizing us up.

Is this something which is worth my time? Is it something that I'm going to really be able to dedicate myself to? So we want to be able to show them: this is really what it's like working at Elicit. These are the people you're going to work with. These are the kinds of tasks that you're going to be doing.

This is the sort of environment that we work in. These are the tools we use. All that kind of stuff is really, really important for the candidate experience, but it also gives us a ton more signal about, you know, what is it actually like to work with this person?

Not just, can they do really well on some kind of LeetCode-style problem? I think the reason this bears particularly on the AI engineer role is that it is something of an emerging category, if you will. So there isn't a very well-established playbook; nobody's written the book yet.

Maybe this is the beginning of us writing the book on how to get hired as an AI engineer, but that book doesn't exist at the moment. Yeah, you know, it's an empirical job as much as any other kind of software engineering. It's less about book learning and more about being able to apply it in a real-world situation.

So let's make the interview as close to a real-world situation as possible. Adam, any last thoughts? I think you're muted. I think it'd be hard to follow that or add much to what James said. I do co-sign a lot of that. Yeah, I think this is a really great overview of the state of hiring AI engineers.

And honestly, of what AI engineering even is. When I was thinking about this as an industrial movement, it was very much about the labor market, actually, and the economic forces that give rise to a role like this: the incentives of the model labs, as well as the supply and demand of engineers and the interest level of companies and engineers in working on these problems.

So I definitely see you guys as pioneers. Thank you so much for putting together this piece, which is something I've been seeking for a long time. You even shared your job description, your reading list and your interview loop. So if anyone's looking to hire AI engineers, I expect this to be the definitive piece and definitive podcast covering it.

So thank you so much for taking the time to cover it with me. It was fun. Thanks. Yeah, thanks a lot. Really enjoyed the conversation. And I appreciate you naming something which we all had in our heads, but couldn't put a label on. It was going to be named anyway.

So actually, I never personally say that I coined the term, because I'm sure someone else used it before me. All I did was write a popular piece on it. All right. I'm happy to help, because I know it contributed to job creation at a bunch of companies I respect, and to how people find each other, which is my whole goal here.

So, yeah, thanks for helping me do this.