Storyteller: Building Multi-modal Apps with TS & ModelFusion - Lars Grammel, PhD

Hey everyone, I'm presenting Storyteller, an app for generating short audio stories for preschool kids. Storyteller is implemented using TypeScript and ModelFusion, an AI orchestration library that I've been developing. It generates audio stories that are about two minutes long, and all it needs is a voice input. Here is an example of the kind of story it generates, to give you an idea.

One day while they were playing, Benny noticed something strange. The forest wasn't as vibrant as before. The leaves were turning brown and the animals seemed less cheerful. Worried, Benny asked his friends what was wrong. "Friends, why do the trees look so sad, and why are you all so quiet today?" "Benny, the forest is in trouble. The trees are dying and we don't know what to do."
How does this work? Let's dive into the details of the Storyteller application. Storyteller is a client-server application. The client is written using React, and the server is a custom Fastify implementation. The main challenges were responsiveness, meaning getting results to the user as quickly as possible, quality, and consistency.

When you start Storyteller, it's just a small screen with a record topic button. Once you press it, it starts recording; when you release, the audio gets sent to the server as a buffer, and there we transcribe it. For transcription, I'm using OpenAI Whisper. It is really quick for a short topic, 1.5 seconds. Once the transcription becomes available, an event goes back to the client. The client-server communication works through an event stream: server-sent events that are pushed back to the client. The event arrives on the client, the React state updates, and the screen re-renders, so the user knows something is going on.
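To make that flow concrete, here is a minimal sketch of the server-sent-events plumbing, assuming a hypothetical in-memory event queue per story session; the route paths, event names, and payload shapes are illustrative, not the actual Storyteller code.

```ts
import Fastify from "fastify";

// Hypothetical per-session event queues; the real implementation may differ.
type StoryEvent = { type: string; data: unknown };
const sessions = new Map<string, StoryEvent[]>();

// Called by the processing steps (transcription, title, image, audio parts)
// to push an event toward the client.
function publish(sessionId: string, event: StoryEvent) {
  sessions.get(sessionId)?.push(event);
}

const server = Fastify();

// SSE endpoint: the client keeps this request open and receives events as
// they become available.
server.get("/events/:id", (request, reply) => {
  const { id } = request.params as { id: string };
  reply.hijack(); // take over the raw response for manual streaming
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });
  // Naive polling of the session queue, purely for illustration.
  const timer = setInterval(() => {
    const queue = sessions.get(id) ?? [];
    while (queue.length > 0) {
      reply.raw.write(`data: ${JSON.stringify(queue.shift())}\n\n`);
    }
  }, 100);
  request.raw.on("close", () => clearInterval(timer));
});
```

On the client, an EventSource subscription can feed the React state, along these lines:

```ts
// Client side (browser): subscribe and update React state per event type.
function subscribeToStory(sessionId: string, setTopic: (topic: string) => void) {
  const source = new EventSource(`/events/${sessionId}`);
  source.onmessage = (message) => {
    const event = JSON.parse(message.data) as { type: string; data: unknown };
    if (event.type === "transcribed-topic") {
      setTopic(event.data as string); // e.g. a React useState setter
    }
  };
  return () => source.close(); // cleanup for a React useEffect
}
```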
In parallel, I start generating the story outline. For this, I use GPT-3.5 Turbo Instruct, which I found to be very fast: it can generate a story outline in about 4 seconds. Once we have the outline, we can start a bunch of other tasks in parallel: generating the title, generating the image, and generating and narrating the audio story. I'll go through those one by one now.
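The fan-out after the outline can be sketched as plain promise handling. The helper names below are hypothetical stand-ins for the steps described in the talk, not the actual Storyteller functions.

```ts
// Hypothetical helpers standing in for the steps described in the talk.
type Publish = (event: { type: string; data: unknown }) => void;
declare function generateOutline(topic: string): Promise<string>;
declare function generateTitle(outline: string): Promise<string>;
declare function createStoryImage(story: string, publish: Publish): Promise<void>;
declare function generateAndNarrateAudioStory(story: string, publish: Publish): Promise<void>;

// Orchestration sketch: generate the outline first, then fan out.
// Each task publishes its own SSE event when it has a result, so the client
// can render partial results instead of waiting for everything.
async function runStoryPipeline(topic: string, publish: Publish) {
  const outline = await generateOutline(topic); // ~4s with the fast instruct model

  // Fire the remaining tasks concurrently; none of them blocks the others.
  await Promise.allSettled([
    generateTitle(outline).then((title) => publish({ type: "title", data: title })),
    createStoryImage(outline, publish),             // publishes an image-path event itself
    generateAndNarrateAudioStory(outline, publish), // streams audio-part events
  ]);
}
```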
First, the title is generated. For this, OpenAI GPT-3.5 Turbo Instruct is used again, giving a really quick result. Once the title is available, it is sent to the client as an event and rendered there.
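As a rough illustration, a ModelFusion text generation call for the title could look something like the sketch below. The call shape and option names differ between ModelFusion versions, and the prompt text is made up, so treat this as an approximation rather than the library's exact API.

```ts
import { generateText, openai } from "modelfusion";

// Approximate ModelFusion usage (signature may vary between versions):
// a single completion call against the fast instruct model.
async function generateStoryTitle(storyOutline: string): Promise<string> {
  return generateText({
    model: openai.CompletionTextGenerator({
      model: "gpt-3.5-turbo-instruct",
      temperature: 0.7,
    }),
    prompt: `Write a short, child-friendly title for this story:\n\n${storyOutline}\n\nTitle:`,
  });
}
```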
In parallel, the image generation runs. First, there needs to be a prompt to actually generate the image, and here consistency is important. So we pass the whole story into a GPT-4 prompt that extracts relevant, representative keywords for an image from the story. That image prompt is passed into Stability AI's Stable Diffusion XL, where an image is generated. The generated image is stored as a virtual file on the server, and then an event with the path to the file is sent to the client. The client can then retrieve the image through a regular URL request as part of an image tag, and it shows up in the UI.
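The "virtual file" handoff can be sketched with an in-memory buffer store and a plain Fastify route; the store, route path, and the two model helpers below are assumptions for illustration, not the actual Storyteller implementation.

```ts
import Fastify from "fastify";

// Hypothetical in-memory "virtual file" store: path -> binary content.
const virtualFiles = new Map<string, Buffer>();

const server = Fastify();

// Serve a stored file by path, e.g. GET /files/story-123/image.png
server.get("/files/*", (request, reply) => {
  const path = (request.params as { "*": string })["*"];
  const file = virtualFiles.get(path);
  if (!file) return reply.code(404).send();
  return reply.type("image/png").send(file);
});

// Hypothetical helpers for the two-step image pipeline described above.
declare function generateImagePrompt(story: string): Promise<string>; // GPT-4 keyword extraction
declare function renderImage(prompt: string): Promise<Buffer>;        // SDXL call

async function createStoryImage(story: string, publish: (event: { type: string; data: unknown }) => void) {
  const prompt = await generateImagePrompt(story);
  const png = await renderImage(prompt);
  const path = `story-${Date.now()}/image.png`;       // illustrative path scheme
  virtualFiles.set(path, png);                        // kept in memory, not on disk
  publish({ type: "image", data: `/files/${path}` }); // client points <img src> at this URL
}
```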
Generating the full audio story is the most time-consuming piece of the puzzle. Here we have a complex prompt that takes in the story, creates a structure with dialogue and speakers, and extends the story. We use GPT-4 here with a low temperature to retain the story. The problem is that it takes one and a half minutes, which is unacceptably long for an interactive client.
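The structure can be thought of as a typed list of narration and dialogue parts. A Zod schema along these lines (the field names are guesses, not the actual Storyteller schema) is the kind of thing the model is asked to fill in:

```ts
import { z } from "zod";

// Guessed shape of the structured audio story: an ordered list of parts,
// each either narration or a line of dialogue by a named speaker.
const storyPartSchema = z.object({
  type: z.enum(["narration", "dialogue"]),
  speaker: z.string(), // "narrator" for narration parts
  content: z.string(),
});

const audioStorySchema = z.object({
  parts: z.array(storyPartSchema),
});

type AudioStory = z.infer<typeof audioStorySchema>;
```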
So how can this be solved? The key idea is streaming the structure. That's a little bit more difficult than just streaming characters token by token. We need to repeatedly parse the partial structure and then determine whether there is a new passage that we can actually narrate and synthesize speech for. ModelFusion takes care of the partial parsing and returns an iterable over fragments of partially parsed results, but the application needs to decide what to do with them. Here, we determine which story part is finished so we can actually narrate it. So we narrate each story part as it is finished.
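However the partial results are produced, the application-side logic can look roughly like the sketch below. One simple completion heuristic (an assumption, not necessarily what Storyteller does) is to treat every part before the last one in a partial parse as finished, and to flush the remainder when the stream ends; `narratePart` is a hypothetical helper.

```ts
type StoryPart = { type: "narration" | "dialogue"; speaker: string; content: string };
type PartialAudioStory = { parts?: Partial<StoryPart>[] };

declare function narratePart(part: StoryPart): Promise<void>; // hypothetical speech step

// Consume an async iterable of partially parsed story structures and narrate
// each part as soon as it can no longer change.
async function narrateAsPartsComplete(partials: AsyncIterable<PartialAudioStory>) {
  let narratedCount = 0;
  let latest: PartialAudioStory = {};

  for await (const partial of partials) {
    latest = partial;
    const parts = partial.parts ?? [];
    // Everything before the last element is assumed complete.
    while (narratedCount < parts.length - 1) {
      await narratePart(parts[narratedCount] as StoryPart);
      narratedCount++;
    }
  }

  // The stream is done: the final part is now complete as well.
  const parts = latest.parts ?? [];
  while (narratedCount < parts.length) {
    await narratePart(parts[narratedCount] as StoryPart);
    narratedCount++;
  }
}
```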
For each story part, we need to determine which voice to use to narrate it. The narrator has a predefined voice, and for all the speakers that already have voices, we can immediately proceed. However, when there is a new speaker, we need to figure out which voice to give it. The first step is to generate a voice description for the speaker: a GPT-3.5 Turbo prompt gives us a structured result with a gender and a voice description. We then use that for retrieval: we have embedded all the voices based on their descriptions beforehand, and we can now retrieve them filtered by gender. Then a voice is selected, making sure no two speakers get the same voice.
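The voice-assignment step can be sketched as a small retrieval problem over pre-embedded voice descriptions. The `Voice` shape, the `embed` helper, and the use of cosine similarity below are assumptions for illustration, not the actual Storyteller implementation.

```ts
// Hypothetical voice catalog entry: provider + id, plus a pre-computed
// embedding of the voice's textual description.
type Voice = {
  provider: "lmnt" | "elevenlabs";
  voiceId: string;
  gender: "male" | "female";
  descriptionEmbedding: number[];
};

declare function embed(text: string): Promise<number[]>; // hypothetical embedding call

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Pick the best-matching unused voice for a new speaker: filter by gender,
// rank by similarity to the generated voice description, and skip voices
// that are already assigned to another speaker.
async function selectVoice(
  voices: Voice[],
  gender: Voice["gender"],
  voiceDescription: string,
  usedVoiceIds: Set<string>
): Promise<Voice> {
  const query = await embed(voiceDescription);
  const candidates = voices
    .filter((v) => v.gender === gender && !usedVoiceIds.has(v.voiceId))
    .sort(
      (a, b) =>
        cosineSimilarity(query, b.descriptionEmbedding) -
        cosineSimilarity(query, a.descriptionEmbedding)
    );
  if (candidates.length === 0) throw new Error("No unused voice available for this gender");
  return candidates[0];
}
```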
Finally, we can generate the audio. For speech synthesis, Lmnt and ElevenLabs are supported; based on the voice that has been chosen, one of those providers is picked and the audio is synthesized. Similar to the images, we generate an audio file, store it virtually on the server, and then send the path to the client, which reconstructs the URL and retrieves it as a media element. Once the first audio is completed, the client can start playing. While you're listening, the server continues to generate more and more parts in the background.
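On the client, sequential playback of the arriving parts can be sketched with a simple queue over HTMLAudioElement; the event shape is again an assumption.

```ts
// Hypothetical client-side audio queue: play story parts in order as their
// paths arrive over the event stream, while later parts are still being generated.
const audioQueue: string[] = [];
let playing = false;

function enqueueAudio(path: string) {
  audioQueue.push(path);
  if (!playing) playNext();
}

function playNext() {
  const path = audioQueue.shift();
  if (!path) {
    playing = false;
    return;
  }
  playing = true;
  const audio = new Audio(path); // browser fetches it via a regular URL request
  audio.addEventListener("ended", playNext);
  void audio.play();
}

// Wire it to the same parsed event stream used for title and image updates;
// the "audio-part" event type is illustrative.
function handleStoryEvent(event: { type: string; data?: unknown }) {
  if (event.type === "audio-part") enqueueAudio(event.data as string);
}
```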
And that's it. Let's recap how the main challenge of responsiveness is addressed here. We have a loading state with multiple parts that are updated as more results become available. We use streaming and parallel processing in the backend to make results available as quickly as possible, and you can start listening while the processing is still going on. And finally, models are chosen such that the processing time for each generation, say the story outline, is minimized.
Cool, I hope you enjoyed my talk. Thank you for listening. If you want to find out more, you can find Storyteller and ModelFusion on GitHub at github.com/lgrammel/storyteller and github.com/lgrammel/modelfusion.