Wav2Lip (generate talking avatar videos) - Paper reading and explanation
00:00:00.000 |
Hello guys, welcome to my review of "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild". 00:00:08.480 |
Basically, we are talking about a model that can allow you to generate lip-synced videos 00:00:14.720 |
Using arbitrary videos and arbitrary audio. Let's see an example directly from the official website of the paper. 00:00:21.920 |
Here, in the example they propose, we can see that there is a video without any audio. 00:00:31.120 |
Then we have a three-second audio clip. Let's listen to it. 00:00:37.680 |
Then we press this "Sync this pair" button and get the output video, which is here. I already generated it before, so we can play it. 00:00:49.280 |
As you can see, the results are remarkable. We can see that the person 00:00:53.520 |
before was not talking at all, and now his lip movement has been automatically generated, 00:00:59.140 |
and the movement of his lips actually matches what he's saying. 00:01:03.680 |
I don't see any big difference from what I would expect from a real talking person. 00:01:09.600 |
Let's go into the details of how this all works 00:01:15.840 |
The system can be used for many applications, for example for dubbing videos into multiple languages, 00:01:23.360 |
educational videos, or anything else that you like. 00:01:26.080 |
Now let's go to the architecture of the model. 00:01:31.120 |
We can see that we have two streams. One is an audio stream 00:01:34.560 |
and one is a video stream. They are both downsampled using convolutional neural networks, 00:01:41.600 |
combined together, and then upsampled again with skip connections from the video stream. 00:01:47.680 |
And this is the generator, so we are talking about a GAN network, 00:01:52.300 |
actually more or less a GAN; we will see why it's different. 00:01:55.420 |
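To make the two-stream idea concrete, here is a minimal PyTorch sketch of how such a generator could be wired together. The layer sizes, channel counts, and module names are my own simplifications for illustration, not the ones from the official Wav2Lip code.

```python
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    """Minimal sketch: an audio encoder and a face encoder downsample their inputs,
    features are fused at the bottleneck, and a decoder with skip connections
    upsamples back to a face image."""
    def __init__(self):
        super().__init__()
        # Audio stream: downsamples a mel-spectrogram chunk into an embedding
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Video stream: downsamples the face frames (here 6 input channels as a toy choice)
        self.face_enc1 = nn.Sequential(nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU())
        self.face_enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: upsamples back to an image, reusing encoder activations as skip connections
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32 + 32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, face, mel):
        f1 = self.face_enc1(face)                     # skip connection 1
        f2 = self.face_enc2(f1)                       # bottleneck video features
        a = self.audio_encoder(mel)                   # (B, 64, 1, 1) audio embedding
        a = a.expand(-1, -1, f2.size(2), f2.size(3))  # broadcast over spatial positions
        x = self.dec1(torch.cat([f2, a], dim=1))      # fuse audio and video features
        return self.dec2(torch.cat([x, f1], dim=1))   # final face prediction in [0, 1]

# Toy shapes: 96x96 face crops and an 80-bin mel-spectrogram chunk
out = TwoStreamGenerator()(torch.randn(2, 6, 96, 96), torch.randn(2, 1, 80, 16))
print(out.shape)  # torch.Size([2, 3, 96, 96])
```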
And then the generated frames, so we have a sequence of frames, 00:02:03.580 |
are compared with the ground truth, and this is the reconstruction loss of the image. 00:02:08.860 |
Actually, the authors claim that this reconstruction loss is not enough to generate a good image, 00:02:18.380 |
and it is basically also the technique used by the models that came before Wav2Lip. 00:02:35.740 |
We can see that the pixel-level reconstruction loss is a weak judge of lip-sync. Why? Because 00:02:39.820 |
the model tries to generate the image and make it look like the original; 00:02:45.740 |
however, the model doesn't concentrate only on the lip area, which is 00:02:50.140 |
one of the most important things that we want to judge in this model, right? 00:02:55.260 |
But they say that the lip area actually corresponds to less than four percent of the total reconstruction loss, 00:03:00.540 |
so we should find a way to concentrate on that 00:03:05.340 |
to generate a better lip area, of course while preserving the original image. 00:03:10.380 |
So we don't want the background to change, we don't want the pose of the person to change, etc. 00:03:15.020 |
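To get a feeling for the "less than four percent" argument, here is a toy computation. The 96x96 crop size and the mouth-region coordinates are made up for illustration, but they show that a small region can only ever contribute a small share of a frame-wide L1 loss.

```python
import torch

# Toy example: frame-wide L1 reconstruction loss vs. the part coming from the mouth region.
# The mouth rectangle below is a hypothetical region covering roughly 4% of the frame.
pred   = torch.rand(3, 96, 96)
target = torch.rand(3, 96, 96)

per_pixel  = (pred - target).abs()
total_loss = per_pixel.sum()
mouth_loss = per_pixel[:, 76:92, 36:60].sum()   # 16 x 24 pixels out of 96 x 96

print(f"lip-area share of the L1 loss: {mouth_loss / total_loss:.1%}")
# With random images this share simply tracks the area fraction (about 4% here), which is
# the paper's point: the pixel loss barely "sees" the region that matters most for lip-sync.
```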
So what the authors do is introduce SyncNet. SyncNet is a model that 00:03:23.820 |
allows you to check how much a video and an audio track are synced together, 00:03:28.620 |
and, if they are not synced, by how much they are out of sync. 00:03:39.740 |
They use it with small modifications. For example, the original SyncNet was trained using black-and-white images; 00:03:45.900 |
now they use color images, and secondly, they change the loss function to cosine similarity. 00:03:51.280 |
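Here is a minimal sketch of what such a cosine-similarity sync loss can look like; the embedding networks themselves are left out (the real expert discriminator encodes a window of consecutive lower-half face frames and the matching mel-spectrogram chunk with stacks of 2D convolutions).

```python
import torch
import torch.nn.functional as F

def sync_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Treat the cosine similarity between the video and audio embeddings as the
    probability that the pair is in sync, and penalise its negative log."""
    # video_emb, audio_emb: (batch, dim) outputs of the two SyncNet branches
    p_sync = F.cosine_similarity(video_emb, audio_emb, dim=1).clamp(min=1e-6, max=1.0)
    return -torch.log(p_sync).mean()

# Toy usage with random embeddings standing in for the SyncNet encoders
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
print(sync_loss(video_emb, audio_emb))
```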
So the loss function of the generator 00:03:56.560 |
is a combination of the L1 reconstruction loss, the GAN loss, and the sync loss. 00:04:03.500 |
We can find it here, in equation number six. 00:04:08.060 |
So L total is the total loss of the generator, which is a combination of this loss, this loss, and this loss, 00:04:15.100 |
and there are some weights to choose how much emphasis to give to each loss. 00:04:21.020 |
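From memory of the paper, the total generator loss referred to here has roughly this form, where $s_w$ and $s_g$ weight the sync and adversarial terms; treat the exact notation as approximate and check equation six in the paper itself:

```latex
L_{total} = (1 - s_w - s_g)\, L_{recon} + s_w\, E_{sync} + s_g\, L_{gen}
```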
The system has been trained using an Adam optimizer; these are the parameters. 00:04:32.780 |
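To tie the pieces together, here is one hypothetical generator update with Adam. The networks are trivial stand-ins, the sync term is a constant placeholder, and the loss weights and learning rate are made-up values, not the parameters reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Trivial stand-ins for the real networks, just to show how Adam drives the combined loss
generator     = nn.Conv2d(6, 3, 3, padding=1)
discriminator = nn.Sequential(nn.Flatten(), nn.Linear(3 * 96 * 96, 1), nn.Sigmoid())

s_w, s_g = 0.1, 0.1                                            # placeholder loss weights
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)  # placeholder learning rate

faces   = torch.randn(2, 6, 96, 96)   # reference + masked target frames (toy data)
targets = torch.rand(2, 3, 96, 96)    # ground-truth frames (toy data)

optimizer.zero_grad()
generated = generator(faces)
l_recon = F.l1_loss(generated, targets)                        # pixel reconstruction term
e_sync  = torch.tensor(0.5)                                    # stand-in for the SyncNet term
l_gen   = -torch.log(discriminator(generated) + 1e-6).mean()   # adversarial term

total = (1 - s_w - s_g) * l_recon + s_w * e_sync + s_g * l_gen
total.backward()
optimizer.step()
print(total.item())
```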
The authors compare the current model with previous models using three different datasets. 00:04:40.300 |
The first one is Dubbed: so we have a video and dubbed audio taken from the internet, I guess, 00:04:49.340 |
where the audio and the video are not in sync, 00:04:52.220 |
and they try to sync them using Wav2Lip and also the two baseline models, Speech2Vid and LipGAN. 00:04:57.980 |
We can see that, according to human evaluators (so these are all 00:05:03.180 |
evaluations made by humans), Wav2Lip is actually preferred. The method of evaluation is written here. 00:05:14.700 |
Secondly, we have a Random dataset, that is, a 00:05:18.060 |
dataset of random videos paired with random audios, and Wav2Lip is asked to 00:05:23.820 |
sync them. And finally we have TTS, in which the audio is generated by a TTS system. 00:05:31.100 |
As we can see, overall Wav2Lip performs much better. We will see some examples later. 00:05:42.220 |
We can see here that the authors write: finally, it is worth noting that our lip-synced videos are preferred over existing methods 00:05:49.020 |
or even the actual unsynced videos over 90% of the time, so it means that the visual quality is also not bad. 00:05:58.540 |
Here are some examples. For example, we can see that the red 00:06:01.420 |
frames are from the previous models, and we can see that the quality of the face here 00:06:11.500 |
is not so good, but here, with the SyncNet-based 00:06:16.140 |
Wav2Lip, the reconstructed image is quite good. 00:06:21.660 |
The authors actually train two models, one with the 00:06:27.180 |
GAN loss and one without the GAN loss, and we can see that 00:06:30.540 |
the one without the GAN performs better on some metrics and a little worse on some other metrics. 00:06:37.820 |
So actually I think this model has a lot of potential for generating talking avatars, dubbing videos, or generating educational videos. 00:06:46.860 |
Maybe in the future we won't need to record the same video three times in multiple languages; 00:06:51.820 |
we can just record it once and let the AI do the rest. Thank you for listening