
Wav2Lip (generate talking avatar videos) - Paper reading and explanation


Transcript

Hello guys, welcome to my review of "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild", which is a very long title to say that, basically, we are talking about a model that allows you to generate lip-synced videos from arbitrary videos and arbitrary audio.

Let's see an example directly from the official website of the paper. In the example proposed here, we can see that there is a video without any audio, and the person is not talking. Then we have an audio clip of three seconds; let's listen to it: "I'll go around, wait for my call." Then we press this "sync this pair" button, and the output video appears here.

I already generated it before, so we can play it: "I'll go around, wait for my call." As you can see, the results are remarkable. The person was not talking at all before, and now his lips have been automatically generated, and their movement actually matches what he is saying. Honestly, I don't see any big difference from what I would expect from a real recording. Let's go into the details of how this all works. The system can be used for many applications, for example dubbing videos into multiple languages, educational videos, or anything else you like. Now let's look at the architecture of the model. We can see that we have two streams.

One is an audio stream and one is a video stream. They are both downsampled using convolutional neural networks, combined together, and then upsampled again, with skip connections from the video stream. This is the generator, so we are talking about a GAN network (actually more or less a GAN; we will see why it is different). Then the generated frames, a sequence of frames, are compared with the ground truth, and this gives the reconstruction loss on the image. The authors claim that this reconstruction loss alone is not enough to generate a good image, and it is basically the technique used by previous models, before Wav2Lip was introduced.
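
To make this two-stream design concrete, here is a minimal sketch of such an encoder-decoder generator in PyTorch. This is my own simplified illustration, not the authors' code: the names (TwoStreamGenerator, conv_block), the number of layers, and the channel and input sizes are assumptions, and the real Wav2Lip generator is deeper, with residual blocks.

```python
# Minimal sketch of a Wav2Lip-style two-stream generator (not the authors' code).
# Layer counts, channel sizes and input shapes are illustrative assumptions.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=2):
    # Downsampling convolution block used by both encoders.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TwoStreamGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Audio stream: encodes a mel-spectrogram chunk into a single feature vector.
        self.audio_encoder = nn.Sequential(
            conv_block(1, 32), conv_block(32, 64), conv_block(64, 128),
            nn.AdaptiveAvgPool2d(1),            # -> (B, 128, 1, 1)
        )
        # Video stream: encodes the face frames, keeping skip features per level.
        self.face_enc1 = conv_block(3, 32)      # 96 -> 48
        self.face_enc2 = conv_block(32, 64)     # 48 -> 24
        self.face_enc3 = conv_block(64, 128)    # 24 -> 12
        # Decoder: upsamples the fused features back to an image,
        # concatenating skip connections from the video stream.
        self.dec3 = nn.ConvTranspose2d(128 + 128, 128, 4, stride=2, padding=1)  # 12 -> 24
        self.dec2 = nn.ConvTranspose2d(128 + 64, 64, 4, stride=2, padding=1)    # 24 -> 48
        self.dec1 = nn.ConvTranspose2d(64 + 32, 32, 4, stride=2, padding=1)     # 48 -> 96
        self.out = nn.Sequential(nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, face, mel):
        # face: (B, 3, 96, 96) input frames, mel: (B, 1, 80, 16) audio chunk.
        s1 = self.face_enc1(face)
        s2 = self.face_enc2(s1)
        s3 = self.face_enc3(s2)
        a = self.audio_encoder(mel)                       # (B, 128, 1, 1)
        a = a.expand(-1, -1, s3.size(2), s3.size(3))      # broadcast over the spatial grid
        x = torch.relu(self.dec3(torch.cat([s3, a], dim=1)))
        x = torch.relu(self.dec2(torch.cat([x, s2], dim=1)))
        x = torch.relu(self.dec1(torch.cat([x, s1], dim=1)))
        return self.out(x)                                # generated face frame

# Example: a batch of 2 face crops with matching audio chunks.
g = TwoStreamGenerator()
fake = g(torch.randn(2, 3, 96, 96), torch.randn(2, 1, 80, 16))
print(fake.shape)  # torch.Size([2, 3, 96, 96])
```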

As the authors explain in Section 3.1, the pixel-level reconstruction loss is a weak judge of lip-sync. Why? Because the model tries to generate the image and make it look like the original; however, it does not concentrate on the lip area specifically, which is one of the most important things we want to judge in this model, right?

They say that the lip area actually corresponds to less than four percent of the total reconstruction loss, so we should find a way to concentrate on that region to generate a better lip area, of course while preserving the rest of the original image: we don't want the background to change, and we don't want the pose of the person to change, and so on.
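
As a toy illustration of that point (my own, not from the paper), the snippet below measures what fraction of a pixel-wise L1 reconstruction loss comes from a lip crop; the crop coordinates and image size are made-up placeholders.

```python
# Toy illustration (not from the paper): how small a share of a pixel-wise L1
# reconstruction loss the lip region can account for.
import torch

def lip_loss_fraction(generated, target, lip_box):
    """lip_box = (top, bottom, left, right): an assumed crop around the lips."""
    per_pixel = (generated - target).abs()       # L1 error at every pixel
    t, b, l, r = lip_box
    lip_error = per_pixel[..., t:b, l:r].sum()   # error inside the lip crop
    return (lip_error / per_pixel.sum()).item()

# With uniform random errors on a 96x96 face and a 24x40 lip crop, the fraction is
# simply the crop's share of the area (about 10%); on real faces the authors report
# that the lip region contributes less than 4% of the reconstruction loss.
gen, tgt = torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96)
print(lip_loss_fraction(gen, tgt, lip_box=(60, 84, 28, 68)))
```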

So what the authors do is introduce SyncNet. SyncNet, a model that was introduced previously, allows you to check how well a video and an audio track are synced together and, if they are not synced, by how much they are out of sync. The authors call it a lip-sync expert. They retrain SyncNet from scratch with a few small variations.

For example, the original SyncNet was trained on black-and-white images, while they now use color images; secondly, they change the loss function to one based on cosine similarity. The loss function of the generator is then a combination of the L1 reconstruction loss, the GAN loss, and the sync loss; we can find it in equation 6. So L_total, the total loss of the generator, is a combination of these three losses, with weights that decide how much emphasis to give to each one. The system has been trained using the Adam optimizer; these are the parameters.
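
As a rough sketch of how the expert's cosine-similarity loss and the combined objective in equation 6 fit together, here is how I would write them down. This is my paraphrase, not the authors' code: the embeddings are placeholders, and the default values of the weights s_w and s_g are my recollection of the paper's settings.

```python
# Paraphrase of the sync loss and the combined generator objective (Eq. 6),
# not the authors' code; embeddings and default weights are assumptions.
import torch
import torch.nn.functional as F

def sync_probability(video_emb, audio_emb, eps=1e-8):
    # Cosine similarity between the face-window and audio-chunk embeddings,
    # clamped into (0, 1] so it can be read as a probability of being in sync.
    cos = F.cosine_similarity(video_emb, audio_emb, dim=-1)
    return cos.clamp(min=eps, max=1.0)

def expert_sync_loss(video_emb, audio_emb):
    # Binary cross-entropy with the "in sync" label, i.e. the mean of -log(P_sync).
    return -torch.log(sync_probability(video_emb, audio_emb)).mean()

def generator_loss(recon_l1, sync, gan, s_w=0.03, s_g=0.07):
    # Weighted sum of the L1 reconstruction loss, the sync-expert loss, and the
    # adversarial (GAN) loss, in the spirit of the paper's equation 6.
    return (1.0 - s_w - s_g) * recon_l1 + s_w * sync + s_g * gan

# Example with random embeddings and dummy loss values.
v, a = torch.randn(4, 512), torch.randn(4, 512)
total = generator_loss(torch.tensor(0.31), expert_sync_loss(v, a), torch.tensor(0.9))
print(total)
```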

But let's check the results now. The authors compare the current model with previous models on three different test sets. The first one is Dubbed: we have a video and a dubbed audio track, taken from the internet I guess, where the audio and the video are not in sync, and they try to sync them using Wav2Lip and also the two baseline models, Speech2Vid and LipGAN.

We can see that, according to human evaluators (these are all evaluations made by humans), Wav2Lip is actually preferred; the evaluation method is described in Section 4.4.2. Secondly, we have a Random test set, with random videos paired with random audio, which Wav2Lip is used to sync. Finally, we have TTS, in which the audio is generated by a text-to-speech system. As we can see, overall Wav2Lip performs much better.

We will see some examples later. As the authors write: "Finally, it is worth noting that our lip-synced videos are preferred over existing methods or even the actual unsynced videos over 90% of the time." So this means the visual quality is also not bad. Here are some examples: the red frames are from the previous models, and we can see that the quality of the face of the German chancellor is not so good there, but with Wav2Lip the reconstructed image is quite good. The authors actually train two models, one with the GAN loss and one without it, and we can see that the version without the GAN performs better on some metrics and a little worse on some others. So I think this model has a lot of potential for generating talking avatars, dubbing videos, or generating educational videos.

Maybe in the future we won't need to record the same video three times in multiple languages; we can just generate it once and let the AI do the rest. Thank you for listening.