Wav2Lip (generate talking avatar videos) - Paper reading and explanation


00:00:00.000 | Hello guys, welcome to my review of "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild",
00:00:06.400 | which is a very long title to say that
00:00:08.480 | basically, we are talking about a model that allows you to generate lip-synced videos
00:00:14.720 | using arbitrary videos and arbitrary audio. Let's see an example directly from the official website of the paper.
00:00:21.920 | Here, in the example they propose, we can see that there is a video without any audio,
00:00:29.120 | and the person here is not talking.
00:00:31.120 | Then we have an audio of three seconds. Let's listen to it
00:00:34.640 | I'll go around wait for my call
00:00:37.680 | Then we press this "Sync this pair" button, and we can play the output video, which is here; I already generated it before.
00:00:45.760 | I'll go around wait for my call
00:00:49.280 | As you can see, the results are remarkable. We can see that the person
00:00:53.520 | before was not talking at all, and now his lips have been automatically generated,
00:00:59.140 | and the movement of his lips actually matches what he is saying.
00:01:03.680 | I don't see any big difference, actually, from what I would expect from a real person talking.
00:01:09.600 | Let's go into the details of how this all works
00:01:15.840 | The system can be used for many applications, for example for dubbing videos in multiple languages,
00:01:23.360 | educational videos, or anything else that you like.
00:01:26.080 | And let's go to the architecture of the model.
00:01:31.120 | We can see that we have two streams. One is an audio stream
00:01:34.560 | and one is a video stream. They are both downsampled using convolutional neural networks,
00:01:41.600 | combined together and then upsampled again, with skip connections from the video stream.
00:01:47.680 | And this is the generator. So we are talking about a GAN network,
00:01:52.300 | actually more or less a GAN; we will see why it's different.
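
(As a rough sketch of this two-stream idea, not the paper's actual network: the layer counts, channel sizes and input shapes below are simplified assumptions. It only shows a face encoder and an audio encoder downsampling their inputs, the two features being fused, and the result being upsampled again with skip connections from the video stream.)

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Face (video) encoder: downsample the face crop with strided convolutions.
        self.face_enc1 = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU())   # 96 -> 48
        self.face_enc2 = nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU())  # 48 -> 24
        # Audio encoder: downsample a mel-spectrogram chunk to a single feature vector.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: upsample back to image size, with skip connections from the face encoder.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, 2, 1), nn.ReLU())  # 24 -> 48
        self.dec2 = nn.ConvTranspose2d(32 + 32, 3, 4, 2, 1)                              # 48 -> 96

    def forward(self, face, mel):
        f1 = self.face_enc1(face)                      # skip-connection source
        f2 = self.face_enc2(f1)
        a = self.audio_enc(mel)                        # (B, 64, 1, 1) audio embedding
        a = a.expand(-1, -1, f2.size(2), f2.size(3))   # broadcast over spatial positions
        x = torch.cat([f2, a], dim=1)                  # fuse audio and video features
        x = self.dec1(x)
        x = torch.cat([x, f1], dim=1)                  # skip connection from the video stream
        return torch.sigmoid(self.dec2(x))             # reconstructed face frame

gen = TinyGenerator()
out = gen(torch.rand(1, 3, 96, 96), torch.rand(1, 1, 80, 16))
print(out.shape)  # torch.Size([1, 3, 96, 96])
```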
00:01:55.420 | And then the generated frames, so we have a sequence of frames,
00:02:03.580 | are compared with the ground truth, and this is the reconstruction loss of the image.
00:02:08.860 | Actually, the authors claim that the reconstruction loss alone is not enough to generate a good image,
00:02:18.380 | and this is basically the technique used by the previous models, so before Wav2Lip was
00:02:24.540 | introduced.
00:02:28.860 | Because, as the authors claim here in section 3.1,
00:02:35.740 | the pixel-level reconstruction loss is a weak judge of lip-sync. Why? Because
00:02:39.820 | the model tries to generate the image and make it look like the original;
00:02:45.740 | however, the model doesn't concentrate on the lip area only, which is
00:02:50.140 | one of the most important things that we want to judge in this model, right?
00:02:55.260 | But they say that the lip area actually corresponds to less than four percent of the total reconstruction loss.
00:03:00.540 | So we should find a way to concentrate on that,
00:03:05.340 | to generate a better lip area, of course while preserving the original image.
00:03:10.380 | So we don't want the background to change, we don't want the pose of the person to change, etc.
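
(To make the "less than four percent" point concrete, here is a small illustrative computation; the 96x96 crop size and the lip-box coordinates are made-up assumptions chosen so the box covers roughly 3-4% of the pixels. With a plain L1 reconstruction loss, a region that small contributes a correspondingly small share of the total loss.)

```python
import torch

# Hypothetical generated and ground-truth face crops (sizes are illustrative).
gen = torch.rand(1, 3, 96, 96)
gt = torch.rand(1, 3, 96, 96)

# Per-pixel absolute error (L1) over the whole frame.
pixel_err = (gen - gt).abs()

# Rough "lip region" mask: a small box in the lower middle of the face crop,
# covering about 3.5% of the pixels (16 x 20 out of 96 x 96).
mask = torch.zeros_like(pixel_err)
mask[:, :, 68:84, 38:58] = 1.0

lip_share = (pixel_err * mask).sum() / pixel_err.sum()
print(f"share of the L1 loss coming from the lip box: {lip_share.item():.2%}")
```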
00:03:15.020 | So what the authors do is introduce a SyncNet. SyncNet is a model
00:03:19.420 | that was introduced previously and
00:03:23.820 | allows you to check how much a video and an audio track are synced together,
00:03:28.620 | and if they are not synced, by how much they are out of sync.
00:03:31.740 | The authors call it a lip-sync expert.
00:03:34.940 | They retrain the SyncNet from the ground up
00:03:39.740 | with small variations. For example, the original SyncNet was trained on black-and-white images;
00:03:45.900 | now they use color images, and secondly, they change the loss function to one based on cosine similarity.
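
(A minimal sketch of that sync-loss idea, assuming the cosine similarity between a face-window embedding and an audio-window embedding is used as an "in sync" probability and trained with binary cross-entropy; the encoders are omitted and the embedding size is a hypothetical placeholder.)

```python
import torch
import torch.nn.functional as F

def sync_loss(v, s, labels):
    # v, s: (B, D) face and audio embeddings; labels: (B,) with 1 = in sync, 0 = off sync.
    # Cosine similarity is treated as a probability, so it is clamped into (0, 1];
    # this assumes non-negative embeddings (e.g. after a ReLU).
    p_sync = F.cosine_similarity(v, s, dim=1).clamp(min=1e-7, max=1.0)
    return F.binary_cross_entropy(p_sync, labels)

v = torch.rand(4, 512)          # face-window embeddings (hypothetical size)
s = torch.rand(4, 512)          # audio-window embeddings
labels = torch.tensor([1., 0., 1., 0.])
print(sync_loss(v, s, labels))
```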
00:03:51.280 | So the loss function of the generator
00:03:56.560 | is a combination of the L1 reconstruction loss, the GAN loss and the sync loss.
00:04:03.500 | We can find it here in equation number six.
00:04:08.060 | So actually L_total is the total loss of the generator, which is a combination of this loss, this loss and this loss,
00:04:15.100 | and there are some weights to choose how much emphasis to give to each loss.
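
(Reconstructed from this description, the total generator loss in equation 6 should look roughly like the following, where s_w and s_g are the weights that decide how much emphasis the sync and GAN terms get; take the notation as a hedged sketch rather than a verbatim copy of the paper.)

```latex
L_{\text{total}} = (1 - s_w - s_g)\, L_{\text{recon}} + s_w\, E_{\text{sync}} + s_g\, L_{\text{gen}}
```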
00:04:21.020 | The system has been trained using an Adam optimizer; these are the parameters.
00:04:27.520 | But let's go and check the results.
00:04:32.780 | The authors compare the current model with the previous models using three different datasets.
00:04:40.300 | The first one is Dubbed: so we have a video and dubbed audio, taken from the internet, I guess,
00:04:49.340 | where the audio and the video are not in sync,
00:04:52.220 | and they try to sync them using Wav2Lip and also the two baseline models, Speech2Vid and LipGAN.
00:04:57.980 | We can see that, according to human evaluators (so these are all
00:05:03.180 | evaluations made by humans), Wav2Lip is actually preferred. The method of evaluation is written here in
00:05:10.220 | 4.4.2.
00:05:14.700 | Secondly, we have a Random dataset, that is, a
00:05:18.060 | dataset of random videos paired with random audios, and Wav2Lip is used to
00:05:23.820 | sync them. And finally we have TTS, in which the audio is generated by a TTS system.
00:05:31.100 | As we can see, overall Wav2Lip is performing much better. We will see some examples later,
00:05:36.540 | and according to human evaluators.
00:05:42.220 | We see here that the authors write that, finally, it is worth noting that their lip-synced videos are preferred over existing methods,
00:05:49.020 | or even over the actual unsynced videos, over 90% of the time, so it means that the visual quality is also not bad.
00:05:58.540 | Here are some examples. For example, we can see the red
00:06:01.420 | frames are from the previous models, and we can see that the quality of the face here, of
00:06:08.860 | the German chancellor,
00:06:11.500 | is not so good, but here, with
00:06:16.140 | Wav2Lip, the reconstructed image is quite good.
00:06:21.660 | The authors actually train two models, one with the
00:06:27.180 | GAN loss and one without the GAN loss, and we can see that the one
00:06:30.540 | without the GAN performs better on some metrics and a little worse on some other metrics.
00:06:37.820 | And so actually I think this model has a lot of potential for generating talking avatars, for dubbing videos, or for generating educational videos.
00:06:46.860 | Maybe in the future we won't need to record the same video three times in multiple languages;
00:06:51.820 | we can just record it once and let the AI do the rest. Thank you for listening.