Hey everyone, I'm presenting Storyteller, an app for generating short audio stories for preschool kids. Storyteller is implemented in TypeScript using ModelFusion, an AI orchestration library that I've been developing. It generates audio stories that are about two minutes long, and all it needs is a voice input. Here is an example of the kind of story it generates, to give you an idea.
One day while they were playing, Benny noticed something strange. The forest wasn't as vibrant as before. The leaves were turning brown and the animals seemed less cheerful. Worried, Benny asked his friends what was wrong. Friends, why do the trees look so sad and why are you all so quiet today?
Benny, the forest is in trouble. The trees are dying and we don't know what to do.

How does this work? Let's dive into the details of the Storyteller application. Storyteller is a client-server application: the client is written in React, and the server is a custom Fastify implementation. The main challenges were responsiveness (getting results to the user as quickly as possible), quality, and consistency.
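To make the setup concrete, here is a minimal sketch of what the server side could look like: a Fastify instance with a route that accepts the recorded topic and kicks off the pipeline. The route name, the response shape, and the handler body are my assumptions for illustration, not the actual Storyteller code.

```ts
import Fastify from "fastify";

const server = Fastify();

// The client posts the recorded topic audio here; the handler starts the
// generation pipeline and reports progress via server-sent events (see below).
server.post("/generate-story", async (request, reply) => {
  // ... transcribe the audio, then generate outline, title, image, narration ...
  return { id: "story-1" }; // hypothetical story id the client subscribes to
});

await server.listen({ port: 3001 });
```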
So when you start Storyteller, it's just a small screen with a record-topic button. When you press it, it starts recording; when you release it, the audio is sent to the server as a buffer, and there we transcribe it. For transcription, I'm using OpenAI Whisper, which is really quick for a short topic: about 1.5 seconds.
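As a sketch of that transcription step, using the OpenAI Node SDK directly rather than the app's actual ModelFusion wrapper (the buffer handling and function name are assumptions):

```ts
import OpenAI, { toFile } from "openai";

const openai = new OpenAI();

// Transcribe the recorded topic buffer with Whisper.
async function transcribeTopic(audio: Buffer): Promise<string> {
  const transcription = await openai.audio.transcriptions.create({
    model: "whisper-1",
    file: await toFile(audio, "topic.mp3"),
  });
  return transcription.text;
}
```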
Once the transcription is available, an event goes back to the client. The client-server communication works through an event stream: server-sent events that are pushed back to the client. When an event arrives, the React state updates and the screen re-renders, so the user knows something is going on.
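On the client, this could look roughly like the hook below: it subscribes to a server-sent event stream and appends incoming events to React state. The endpoint and the event shape are hypothetical.

```ts
import { useEffect, useState } from "react";

type StoryEvent =
  | { type: "transcribed-topic"; topic: string }
  | { type: "title"; title: string }
  | { type: "image"; path: string }
  | { type: "audio-part"; index: number; path: string };

function useStoryEvents(storyId: string) {
  const [events, setEvents] = useState<StoryEvent[]>([]);

  useEffect(() => {
    // Subscribe to the server's event stream for this story.
    const source = new EventSource(`/events/${storyId}`);
    source.onmessage = (message) => {
      const event = JSON.parse(message.data) as StoryEvent;
      setEvents((previous) => [...previous, event]);
    };
    return () => source.close();
  }, [storyId]);

  return events;
}
```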
In parallel, I start generating the story outline. For this, I use GPT-3.5 Turbo Instruct, which I found to be very fast: it can generate a story outline in about 4 seconds. Once we have the outline, we can start a bunch of other tasks in parallel: generating the title, generating the image, and generating and narrating the audio story all happen at the same time.
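A sketch of the outline step and the fan-out, again with the OpenAI SDK instead of the ModelFusion calls; the prompt and the downstream helper names are hypothetical.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical helpers for the downstream tasks.
declare function generateTitle(outline: string): Promise<string>;
declare function generateImage(outline: string): Promise<string>;
declare function generateAndNarrateStory(outline: string): Promise<void>;

async function generateOutline(topic: string): Promise<string> {
  const completion = await openai.completions.create({
    model: "gpt-3.5-turbo-instruct",
    prompt: `Write a short outline for a preschool audio story about: ${topic}`,
    max_tokens: 400,
  });
  return completion.choices[0].text;
}

async function processTopic(topic: string) {
  const outline = await generateOutline(topic);

  // Title, image, and narrated story only depend on the outline,
  // so they can run in parallel.
  await Promise.all([
    generateTitle(outline),
    generateImage(outline),
    generateAndNarrateStory(outline),
  ]);
}
```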
I'll go through those one by one now. First, the title is generated. For this, OpenAI GPT-3.5 Turbo Instruct is used again, giving a really quick result. Once the title is available, it is sent to the client as an event and rendered there. In parallel, the image generation runs.
First, there needs to be a prompt to actually generate the image, and here consistency is important. So we pass the whole story into a GPT-4 prompt that extracts representative keywords for an image from the story. That image prompt is then passed into Stability AI's Stable Diffusion XL, where the image is generated.
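A sketch of that two-step image path; the extraction prompt is made up, and callStableDiffusionXL stands in for the actual Stability AI call.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical wrapper around the Stability AI SDXL endpoint.
declare function callStableDiffusionXL(prompt: string): Promise<Buffer>;

async function generateStoryImage(story: string): Promise<Buffer> {
  // Ask GPT-4 for keywords that represent the whole story, so the single
  // image stays consistent with the content.
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {
        role: "user",
        content:
          "Extract a comma-separated list of visual keywords that represent " +
          `this children's story for a single illustration:\n\n${story}`,
      },
    ],
  });

  const imagePrompt = response.choices[0].message.content ?? "";
  return callStableDiffusionXL(imagePrompt);
}
```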
The generated image is stored as a virtual file on the server, and an event is sent to the client with the path to the file. The client can then retrieve the image through a regular URL request as part of an image tag.
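The virtual file idea could be as simple as an in-memory map plus a Fastify route, reusing the server instance from the earlier sketch; the route layout is an assumption.

```ts
// In-memory "virtual files": generated assets keyed by name.
const virtualFiles = new Map<string, { data: Buffer; contentType: string }>();

function storeVirtualFile(name: string, data: Buffer, contentType: string) {
  virtualFiles.set(name, { data, contentType });
}

// "server" is the Fastify instance from the earlier sketch. The client fetches
// assets through a normal URL, e.g. <img src="/files/cover.png" />.
server.get("/files/:name", async (request, reply) => {
  const { name } = request.params as { name: string };
  const file = virtualFiles.get(name);
  if (!file) {
    return reply.status(404).send();
  }
  reply.header("Content-Type", file.contentType);
  return reply.send(file.data);
});
```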
And it shows up in the UI. Generating the full audio story is the most time-consuming piece of the puzzle. Here, we have a complex prompt that takes in the story, extends it, and creates a structure with dialogue and speakers. We use GPT-4 with a low temperature to stay faithful to the story.
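That structure could be described with a Zod schema like the one below, which is also what makes partial parsing useful later; the field names are my assumption, not the app's actual schema.

```ts
import { z } from "zod";

// One narrated passage: either narration or a line of dialogue by a speaker.
const storyPartSchema = z.object({
  type: z.enum(["narration", "dialogue"]),
  speaker: z.string(), // "narrator" or a character name
  content: z.string(),
});

const structuredStorySchema = z.object({
  parts: z.array(storyPartSchema),
});

type StoryPart = z.infer<typeof storyPartSchema>;
type StructuredStory = z.infer<typeof structuredStorySchema>;
```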
The problem is that it takes one and a half minutes, which is unacceptably long for an interactive client. So how can this be solved? The key idea is streaming the structure. That's a little more difficult than just streaming characters token by token: we need to repeatedly parse the partial structure and then determine whether there is a new passage that we can actually narrate and synthesize speech for.
ModelFusion takes care of the partial parsing and returns an iterable over fragments of partially parsed results, but the application needs to decide what to do with them. Here, we determine when a story part is finished so that we can narrate it, and we narrate each story part as soon as it is finished.
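The application-side logic could look like this sketch: it consumes an async iterable of partially parsed stories (the kind of thing ModelFusion's structure streaming provides) and yields a part only once a later part has appeared, or the stream has ended, so the yielded part can no longer change. The types are simplified stand-ins.

```ts
type StoryPart = { type: "narration" | "dialogue"; speaker: string; content: string };
type PartialStory = { parts?: StoryPart[] };

async function* finishedStoryParts(
  partialStories: AsyncIterable<PartialStory>
): AsyncGenerator<StoryPart> {
  let emittedCount = 0;
  let latestParts: StoryPart[] = [];

  for await (const partialStory of partialStories) {
    latestParts = partialStory.parts ?? [];

    // Every part except the last one can no longer change,
    // so it is safe to narrate it now.
    while (emittedCount < latestParts.length - 1) {
      yield latestParts[emittedCount++];
    }
  }

  // The stream has ended, so the final part is complete as well.
  while (emittedCount < latestParts.length) {
    yield latestParts[emittedCount++];
  }
}
```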
For each story part, we need to determine which voice to use to narrate it. The narrator has a predefined voice, and for all speakers that already have a voice, we can proceed immediately. However, when there's a new speaker, we need to figure out which voice to give it.
The first step is to generate a voice description for the speaker. A GPT-3.5 Turbo prompt gives us a structured result with a gender and a voice description. We then use that for retrieval: we have embedded all the voices beforehand based on their descriptions, so we can now retrieve the closest matches, filtered by gender.
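A sketch of the retrieval part, assuming the voices were embedded beforehand with the same embedding model; the voice record shape and the similarity helper are illustrative.

```ts
import OpenAI from "openai";

const openai = new OpenAI();

type Voice = {
  id: string;
  provider: "lmnt" | "elevenlabs";
  gender: "female" | "male";
  description: string;
  embedding: number[]; // precomputed from the description with embed()
};

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: text,
  });
  return response.data[0].embedding;
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Retrieve the voice whose description is closest to the generated speaker
// description, restricted to the requested gender. Assumes at least one
// candidate voice exists for that gender.
async function retrieveVoice(
  voices: Voice[],
  gender: Voice["gender"],
  speakerDescription: string
): Promise<Voice> {
  const query = await embed(speakerDescription);
  const candidates = voices.filter((voice) => voice.gender === gender);

  return candidates.reduce((best, voice) =>
    cosineSimilarity(voice.embedding, query) >
    cosineSimilarity(best.embedding, query)
      ? voice
      : best
  );
}
```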
Then a voice is selected, making sure no two speakers end up with the same voice, and finally we can generate the audio. For speech synthesis, LMNT and ElevenLabs are supported; one of those providers is picked based on the voice that has been chosen, and the audio is synthesized. Similar to the images, we generate an audio file, store it as a virtual file on the server, and send the path to the client, which reconstructs the URL and retrieves it as a media element.
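Putting voice assignment and synthesis together, a sketch could look like this; the provider call and the helpers from the earlier sketches are declared as hypothetical stand-ins.

```ts
type StoryPart = { speaker: string; content: string };
type Voice = { id: string; provider: "lmnt" | "elevenlabs" };

// Hypothetical stand-ins for the steps described above.
declare function retrieveVoiceForSpeaker(speaker: string, candidates: Voice[]): Promise<Voice>;
declare function synthesizeSpeech(voice: Voice, text: string): Promise<Buffer>;
declare function storeVirtualFile(name: string, data: Buffer, contentType: string): void;

const voiceBySpeaker = new Map<string, Voice>();

async function narratePart(part: StoryPart, allVoices: Voice[], index: number) {
  let voice = voiceBySpeaker.get(part.speaker);

  if (!voice) {
    // Only consider voices that no other speaker is using yet.
    const taken = new Set([...voiceBySpeaker.values()].map((v) => v.id));
    const candidates = allVoices.filter((v) => !taken.has(v.id));
    voice = await retrieveVoiceForSpeaker(part.speaker, candidates);
    voiceBySpeaker.set(part.speaker, voice);
  }

  const audio = await synthesizeSpeech(voice, part.content);
  const name = `story-part-${index}.mp3`;
  storeVirtualFile(name, audio, "audio/mpeg");
  return name; // sent to the client as part of an "audio-part" event
}
```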
Once the first audio is completed, the client can start playing. While you're listening, the server continues to generate more and more parts in the background. And that's it. So let's recap how the main challenge of responsiveness is addressed here. We have a loading state with multiple parts that are updated as more results become available.
We use streaming and parallel processing in the backend to make results available as quickly as possible, and you can start listening while the processing is still going on; a small client-side playback sketch follows below. Finally, the models are chosen so that the processing time for each generation step, say the story outline, is minimized.
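As an illustration of that last point about listening during processing, sequential playback on the client could be as simple as a small queue around an HTMLAudioElement; this is a browser-side sketch, not the app's actual player.

```ts
const audioQueue: string[] = [];
let isPlaying = false;

// Called whenever an "audio-part" event with a file path arrives.
function enqueueAudioPart(path: string) {
  audioQueue.push(path);
  if (!isPlaying) {
    void playNext();
  }
}

async function playNext() {
  const path = audioQueue.shift();
  if (!path) {
    isPlaying = false;
    return;
  }
  isPlaying = true;
  const audio = new Audio(path);
  audio.onended = () => void playNext();
  await audio.play();
}
```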
I hope you enjoyed my talk. Thank you for listening. If you want to find out more, you can find Storyteller and ModelFusion on GitHub at github.com/lgrammel/storyteller and github.com/lgrammel/modelfusion.