Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

Hey everyone, I'm presenting Storyteller, an app for generating short audio stories for preschool kids. Storyteller is implemented using TypeScript and ModelFusion, an AI orchestration library that I've been developing. It generates audio stories that are about two minutes long, and all it needs is a voice input. Here is an example of the kind of story it generates, to give you an idea.

One day while they were playing, Benny noticed something strange. The forest wasn't as vibrant as before. The leaves were turning brown and the animals seemed less cheerful. Worried, Benny asked his friends what was wrong. "Friends, why do the trees look so sad, and why are you all so quiet today?" "Benny, the forest is in trouble. The trees are dying and we don't know what to do."
How does this work? Let's dive into the details of the Storyteller application. Storyteller is a client-server application. The client is written using React, and the server is a custom Fastify implementation. The main challenges were responsiveness, meaning getting results to the user as quickly as possible, quality, and consistency.

When you start Storyteller, it's just a small screen with a record topic button. Once you press it, it starts recording; when you release, the audio gets sent to the server as a buffer, and there we transcribe it. For transcription, I'm using OpenAI Whisper. It is really quick for a short topic, 1.5 seconds. Once the transcription becomes available, an event goes back to the client. The client-server communication works through an event stream: server-sent events that are pushed back to the client. The event arrives on the client, the React state updates, and the screen re-renders, so the user knows something is going on.
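To make that flow concrete, here is a minimal sketch of the server-sent-events plumbing, assuming a hypothetical in-memory event queue per story session; the route paths, event names, and payload shapes are illustrative, not the actual Storyteller code.

```ts
import Fastify from "fastify";

// Hypothetical per-session event queues; the real implementation may differ.
type StoryEvent = { type: string; data: unknown };
const sessions = new Map<string, StoryEvent[]>();

// Called by the processing steps (transcription, title, image, audio parts)
// to push an event toward the client.
function publish(sessionId: string, event: StoryEvent) {
  sessions.get(sessionId)?.push(event);
}

const server = Fastify();

// SSE endpoint: the client keeps this request open and receives events as
// they become available.
server.get("/events/:id", (request, reply) => {
  const { id } = request.params as { id: string };
  reply.hijack(); // take over the raw response for manual streaming
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // Naive polling of the session queue, purely for illustration.
  const timer = setInterval(() => {
    const queue = sessions.get(id) ?? [];
    while (queue.length > 0) {
      reply.raw.write(`data: ${JSON.stringify(queue.shift())}\n\n`);
    }
  }, 100);
  request.raw.on("close", () => clearInterval(timer));
});
```

On the client, an EventSource subscription can feed the React state, along these lines:

```ts
// Client side (browser): subscribe and update React state per event type.
function subscribeToStory(sessionId: string, setTopic: (topic: string) => void) {
  const source = new EventSource(`/events/${sessionId}`);
  source.onmessage = (message) => {
    const event = JSON.parse(message.data) as { type: string; data: unknown };
    if (event.type === "transcribed-topic") {
      setTopic(event.data as string); // e.g. a React useState setter
    }
  };
  return () => source.close(); // cleanup for a React useEffect
}
```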
In parallel, I start generating the story outline. For this, I use GPT-3.5 Turbo Instruct, which I found to be very fast: it can generate a story outline in about 4 seconds. Once we have the outline, we can start a bunch of other tasks in parallel: generating the title, generating the image, and generating and narrating the audio story. I'll go through those one by one now.
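The fan-out after the outline can be sketched as plain promise handling. The helper names below are hypothetical stand-ins for the steps described in the talk, not the actual Storyteller functions.

```ts
// Hypothetical helpers standing in for the steps described in the talk.
type Publish = (event: { type: string; data: unknown }) => void;
declare function generateOutline(topic: string): Promise<string>;
declare function generateTitle(outline: string): Promise<string>;
declare function createStoryImage(story: string, publish: Publish): Promise<void>;
declare function generateAndNarrateAudioStory(story: string, publish: Publish): Promise<void>;

// Orchestration sketch: generate the outline first, then fan out.
// Each task publishes its own SSE event when it has a result, so the client
// can render partial results instead of waiting for everything.
async function runStoryPipeline(topic: string, publish: Publish) {
  const outline = await generateOutline(topic); // ~4s with the fast instruct model

  // Fire the remaining tasks concurrently; none of them blocks the others.
  await Promise.allSettled([
    generateTitle(outline).then((title) => publish({ type: "title", data: title })),
    createStoryImage(outline, publish),             // publishes an image-path event itself
    generateAndNarrateAudioStory(outline, publish), // streams audio-part events
  ]);
}
```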
First, the title is generated. For this, OpenAI GPT-3.5 Turbo Instruct is used again, giving a really quick result. Once the title is available, it is sent to the client as an event and rendered there.
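As a rough illustration, a ModelFusion text generation call for the title could look something like the sketch below. The call shape and option names differ between ModelFusion versions, and the prompt text is made up, so treat this as an approximation rather than the library's exact API.

```ts
import { generateText, openai } from "modelfusion";

// Approximate ModelFusion usage (signature may vary between versions):
// a single completion call against the fast instruct model.
async function generateStoryTitle(storyOutline: string): Promise<string> {
  return generateText({
    model: openai.CompletionTextGenerator({
      model: "gpt-3.5-turbo-instruct",
      temperature: 0.7,
    }),
    prompt: `Write a short, child-friendly title for this story:\n\n${storyOutline}\n\nTitle:`,
  });
}
```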
In parallel, the image generation runs. First, there needs to be a prompt to actually generate the image, and here consistency is important. So we pass the whole story into a GPT-4 prompt that extracts relevant, representative keywords for an image from the story. That image prompt is passed into Stability AI's Stable Diffusion XL, where an image is generated. The generated image is stored as a virtual file on the server, and then an event with the path to the file is sent to the client. The client can then retrieve the image through a regular URL request as part of an image tag, and it shows up in the UI.
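The "virtual file" handoff can be sketched with an in-memory buffer store and a plain Fastify route; the store, route path, and the two model helpers below are assumptions for illustration, not the actual Storyteller implementation.

```ts
import Fastify from "fastify";

// Hypothetical in-memory "virtual file" store: path -> binary content.
const virtualFiles = new Map<string, Buffer>();

const server = Fastify();

// Serve a stored file by path, e.g. GET /files/story-123/image.png
server.get("/files/*", (request, reply) => {
  const path = (request.params as { "*": string })["*"];
  const file = virtualFiles.get(path);
  if (!file) return reply.code(404).send();
  return reply.type("image/png").send(file);
});

// Hypothetical helpers for the two-step image pipeline described above.
declare function generateImagePrompt(story: string): Promise<string>; // GPT-4 keyword extraction
declare function renderImage(prompt: string): Promise<Buffer>;        // SDXL call

async function createStoryImage(story: string, publish: (event: { type: string; data: unknown }) => void) {
  const prompt = await generateImagePrompt(story);
  const png = await renderImage(prompt);
  const path = `story-${Date.now()}/image.png`;       // illustrative path scheme
  virtualFiles.set(path, png);                        // kept in memory, not on disk
  publish({ type: "image", data: `/files/${path}` }); // client points <img src> at this URL
}
```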
Generating the full audio story is the most time-consuming piece of the puzzle. Here we have a complex prompt that takes in the story, creates a structure with dialogue and speakers, and extends the story. We use GPT-4 here with a low temperature to retain the story. The problem is that it takes one and a half minutes, which is unacceptably long for an interactive client.
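The structure can be thought of as a typed list of narration and dialogue parts. A Zod schema along these lines (the field names are guesses, not the actual Storyteller schema) is the kind of thing the model is asked to fill in:

```ts
import { z } from "zod";

// Guessed shape of the structured audio story: an ordered list of parts,
// each either narration or a line of dialogue by a named speaker.
const storyPartSchema = z.object({
  type: z.enum(["narration", "dialogue"]),
  speaker: z.string(), // "narrator" for narration parts
  content: z.string(),
});

const audioStorySchema = z.object({
  parts: z.array(storyPartSchema),
});

type AudioStory = z.infer<typeof audioStorySchema>;
```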
So how can this be solved? The key idea is streaming the structure. That's a little bit more difficult than just streaming characters token by token. We need to repeatedly parse the partial structure and then determine whether there is a new passage that we can actually narrate and synthesize speech for. ModelFusion takes care of the partial parsing and returns an iterable over fragments of partially parsed results, but the application needs to decide what to do with them. Here, we determine which story part is finished so we can actually narrate it. So we narrate each story part as it is finished.
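However the partial results are produced, the application-side logic can look roughly like the sketch below. One simple completion heuristic (an assumption, not necessarily what Storyteller does) is to treat every part before the last one in a partial parse as finished, and to flush the remainder when the stream ends; `narratePart` is a hypothetical helper.

```ts
type StoryPart = { type: "narration" | "dialogue"; speaker: string; content: string };
type PartialAudioStory = { parts?: Partial<StoryPart>[] };

declare function narratePart(part: StoryPart): Promise<void>; // hypothetical speech step

// Consume an async iterable of partially parsed story structures and narrate
// each part as soon as it can no longer change.
async function narrateAsPartsComplete(partials: AsyncIterable<PartialAudioStory>) {
  let narratedCount = 0;
  let latest: PartialAudioStory = {};

  for await (const partial of partials) {
    latest = partial;
    const parts = partial.parts ?? [];
    // Everything before the last element is assumed complete.
    while (narratedCount < parts.length - 1) {
      await narratePart(parts[narratedCount] as StoryPart);
      narratedCount++;
    }
  }

  // The stream is done: the final part is now complete as well.
  const parts = latest.parts ?? [];
  while (narratedCount < parts.length) {
    await narratePart(parts[narratedCount] as StoryPart);
    narratedCount++;
  }
}
```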
For each story part, we need to determine which voice to use to narrate it. The narrator has a predefined voice, and for all the speakers that already have voices, we can immediately proceed. However, when there is a new speaker, we need to figure out which voice to give it. The first step is to generate a voice description for the speaker: a GPT-3.5 Turbo prompt gives us a structured result with a gender and a voice description. We then use that for retrieval: we have embedded all the voices based on their descriptions beforehand, and we can now retrieve them filtered by gender. Then a voice is selected, making sure no two speakers get the same voice.
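The voice-assignment step can be sketched as a small retrieval problem over pre-embedded voice descriptions. The `Voice` shape, the `embed` helper, and the use of cosine similarity below are assumptions for illustration, not the actual Storyteller implementation.

```ts
// Hypothetical voice catalog entry: provider + id, plus a pre-computed
// embedding of the voice's textual description.
type Voice = {
  provider: "lmnt" | "elevenlabs";
  voiceId: string;
  gender: "male" | "female";
  descriptionEmbedding: number[];
};

declare function embed(text: string): Promise<number[]>; // hypothetical embedding call

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pick the best-matching unused voice for a new speaker: filter by gender,
// rank by similarity to the generated voice description, and skip voices
// that are already assigned to another speaker.
async function selectVoice(
  voices: Voice[],
  gender: Voice["gender"],
  voiceDescription: string,
  usedVoiceIds: Set<string>
): Promise<Voice> {
  const query = await embed(voiceDescription);
  const candidates = voices
    .filter((v) => v.gender === gender && !usedVoiceIds.has(v.voiceId))
    .sort(
      (a, b) =>
        cosineSimilarity(query, b.descriptionEmbedding) -
        cosineSimilarity(query, a.descriptionEmbedding)
    );
  if (candidates.length === 0) throw new Error("No unused voice available for this gender");
  return candidates[0];
}
```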
Finally, we can generate the audio. For speech synthesis, Lmnt and ElevenLabs are supported; based on the voice that has been chosen, one of those providers is picked and the audio is synthesized. Similar to the images, we generate an audio file, store it virtually on the server, and then send the path to the client, which reconstructs the URL and retrieves it as a media element. Once the first audio is completed, the client can start playing. While you're listening, the server continues to generate more and more parts in the background.
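On the client, sequential playback of the arriving parts can be sketched with a simple queue over HTMLAudioElement; the event shape is again an assumption.

```ts
// Hypothetical client-side audio queue: play story parts in order as their
// paths arrive over the event stream, while later parts are still being generated.
const audioQueue: string[] = [];
let playing = false;

function enqueueAudio(path: string) {
  audioQueue.push(path);
  if (!playing) playNext();
}

function playNext() {
  const path = audioQueue.shift();
  if (!path) {
    playing = false;
    return;
  }
  playing = true;
  const audio = new Audio(path); // browser fetches it via a regular URL request
  audio.addEventListener("ended", playNext);
  void audio.play();
}

// Wire it to the same parsed event stream used for title and image updates;
// the "audio-part" event type is illustrative.
function handleStoryEvent(event: { type: string; data?: unknown }) {
  if (event.type === "audio-part") enqueueAudio(event.data as string);
}
```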
And that's it. Let's recap how the main challenge of responsiveness is addressed here. We have a loading state with multiple parts that are updated as more results become available. We use streaming and parallel processing in the backend to make results available as quickly as possible, and you can start listening while the processing is still going on. And finally, models are chosen such that the processing time for each generation, say the story outline, is minimized.
Cool, I hope you enjoyed my talk. Thank you for listening. If you want to find out more, you can find Storyteller and ModelFusion on GitHub at github.com/lgrammel/storyteller and github.com/lgrammel/modelfusion.