Wav2Lip (generate talking avatar videos) - Paper reading and explanation


00:00:00.000 | Hello guys, welcome to my review of "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild",
00:00:06.400 | which is a very long title to say that
00:00:08.480 | basically, we are talking about a model that allows you to generate lip-synced videos
00:00:14.720 | using arbitrary videos and arbitrary audio. Let's see an example directly from the official website of the paper.
00:00:21.920 | Here, in the example they propose, we can see that there is a video without any audio,
00:00:29.120 | and the person here is not talking.
00:00:31.120 | Then we have an audio of three seconds. Let's listen to it
00:00:34.640 | I'll go around wait for my call
00:00:37.680 | Then we press this "Sync this pair" button, and we can play the output video, which is here; I already generated it before.
00:00:45.760 | I'll go around wait for my call
00:00:49.280 | As you can see, the results are remarkable. We can see that the person
00:00:53.520 | before was not talking at all, and now his lips have been automatically generated,
00:00:59.140 | and the movement of his lips actually matches what he is saying.
00:01:03.680 | I don't see any big difference, actually, from what I would expect from a real person talking.
00:01:09.600 | Let's go into the details of how this all works
00:01:15.840 | The system can be used for many applications, for example for dubbing videos in multiple languages,
00:01:23.360 | educational videos, or anything else that you like.
00:01:26.080 | And let's go to the architecture of the model.
00:01:31.120 | We can see that we have two streams. One is an audio stream
00:01:34.560 | and one is a video stream. They are both downsampled using convolutional neural networks,
00:01:41.600 | combined together and then upsampled again, with skip connections from the video stream.
00:01:47.680 | And this is the generator. So we are talking about a GAN network,
00:01:52.300 | actually more or less a GAN; we will see why it's different.
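
(As a rough sketch of this two-stream idea, not the paper's actual network: the layer counts, channel sizes and input shapes below are simplified assumptions. It only shows a face encoder and an audio encoder downsampling their inputs, the two features being fused, and the result being upsampled again with skip connections from the video stream.)

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Face (video) encoder: downsample the face crop with strided convolutions.
        self.face_enc1 = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU())   # 96 -> 48
        self.face_enc2 = nn.Sequential(nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU())  # 48 -> 24
        # Audio encoder: downsample a mel-spectrogram chunk to a single feature vector.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Decoder: upsample back to image size, with skip connections from the face encoder.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 4, 2, 1), nn.ReLU())  # 24 -> 48
        self.dec2 = nn.ConvTranspose2d(32 + 32, 3, 4, 2, 1)                              # 48 -> 96

    def forward(self, face, mel):
        f1 = self.face_enc1(face)                      # skip-connection source
        f2 = self.face_enc2(f1)
        a = self.audio_enc(mel)                        # (B, 64, 1, 1) audio embedding
        a = a.expand(-1, -1, f2.size(2), f2.size(3))   # broadcast over spatial positions
        x = torch.cat([f2, a], dim=1)                  # fuse audio and video features
        x = self.dec1(x)
        x = torch.cat([x, f1], dim=1)                  # skip connection from the video stream
        return torch.sigmoid(self.dec2(x))             # reconstructed face frame

gen = TinyGenerator()
out = gen(torch.rand(1, 3, 96, 96), torch.rand(1, 1, 80, 16))
print(out.shape)  # torch.Size([1, 3, 96, 96])
```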
00:01:55.420 | And then the generated frames, so we have a sequence of frames,
00:02:03.580 | are compared with the ground truth, and this is the reconstruction loss of the image.
00:02:08.860 | Actually, the authors claim that the reconstruction loss alone is not enough to generate a good image,
00:02:18.380 | and this is basically the technique used by the previous models, so before Wav2Lip was
00:02:24.540 | introduced.
00:02:28.860 | Because, as the authors claim here in section 3.1,
00:02:35.740 | the pixel-level reconstruction loss is a weak judge of lip-sync. Why? Because
00:02:39.820 | the model tries to generate the image and make it look like the original;
00:02:45.740 | however, the model doesn't concentrate on the lip area only, which is
00:02:50.140 | one of the most important things that we want to judge in this model, right?
00:02:55.260 | But they say that the lip area actually corresponds to less than four percent of the total reconstruction loss.
00:03:00.540 | So we should find a way to concentrate on that,
00:03:05.340 | to generate a better lip area, of course while preserving the original image.
00:03:10.380 | So we don't want the background to change, we don't want the pose of the person to change, etc.
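
(To make the "less than four percent" point concrete, here is a small illustrative computation; the 96x96 crop size and the lip-box coordinates are made-up assumptions chosen so the box covers roughly 3-4% of the pixels. With a plain L1 reconstruction loss, a region that small contributes a correspondingly small share of the total loss.)

```python
import torch

# Hypothetical generated and ground-truth face crops (sizes are illustrative).
gen = torch.rand(1, 3, 96, 96)
gt = torch.rand(1, 3, 96, 96)

# Per-pixel absolute error (L1) over the whole frame.
pixel_err = (gen - gt).abs()

# Rough "lip region" mask: a small box in the lower middle of the face crop,
# covering about 3.5% of the pixels (16 x 20 out of 96 x 96).
mask = torch.zeros_like(pixel_err)
mask[:, :, 68:84, 38:58] = 1.0

lip_share = (pixel_err * mask).sum() / pixel_err.sum()
print(f"share of the L1 loss coming from the lip box: {lip_share.item():.2%}")
```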
00:03:15.020 | So what the authors do is introduce a SyncNet. SyncNet is a model
00:03:19.420 | that was introduced previously and
00:03:23.820 | allows you to check how much a video and an audio track are synced together,
00:03:28.620 | and if they are not synced, by how much they are out of sync.
00:03:31.740 | The authors call it a lip-sync expert.
00:03:34.940 | They retrain the SyncNet from the ground up
00:03:39.740 | with small variations. For example, the original SyncNet was trained on black-and-white images;
00:03:45.900 | now they use color images, and secondly, they change the loss function to one based on cosine similarity.
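
(A minimal sketch of that sync-loss idea, assuming the cosine similarity between a face-window embedding and an audio-window embedding is used as an "in sync" probability and trained with binary cross-entropy; the encoders are omitted and the embedding size is a hypothetical placeholder.)

```python
import torch
import torch.nn.functional as F

def sync_loss(v, s, labels):
    # v, s: (B, D) face and audio embeddings; labels: (B,) with 1 = in sync, 0 = off sync.
    # Cosine similarity is treated as a probability, so it is clamped into (0, 1];
    # this assumes non-negative embeddings (e.g. after a ReLU).
    p_sync = F.cosine_similarity(v, s, dim=1).clamp(min=1e-7, max=1.0)
    return F.binary_cross_entropy(p_sync, labels)

v = torch.rand(4, 512)          # face-window embeddings (hypothetical size)
s = torch.rand(4, 512)          # audio-window embeddings
labels = torch.tensor([1., 0., 1., 0.])
print(sync_loss(v, s, labels))
```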
00:03:51.280 | So the loss function of the generator
00:03:56.560 | is a combination of the L1 reconstruction loss, the GAN loss and the sync loss.
00:04:03.500 | We can find it here in equation number six.
00:04:08.060 | So actually L_total is the total loss of the generator, which is a combination of this loss, this loss and this loss,
00:04:15.100 | and there are some weights to choose how much emphasis to give to each loss.
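
(Reconstructed from this description, the total generator loss in equation 6 should look roughly like the following, where s_w and s_g are the weights that decide how much emphasis the sync and GAN terms get; take the notation as a hedged sketch rather than a verbatim copy of the paper.)

```latex
L_{\text{total}} = (1 - s_w - s_g)\, L_{\text{recon}} + s_w\, E_{\text{sync}} + s_g\, L_{\text{gen}}
```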
00:04:21.020 | The system has been trained using an Adam optimizer; these are the parameters.
00:04:27.520 | But let's go and check the results.
00:04:32.780 | The authors compare the current model with the previous models using three different datasets.
00:04:40.300 | The first one is Dubbed: so we have a video and dubbed audio, taken from the internet, I guess,
00:04:49.340 | where the audio and the video are not in sync,
00:04:52.220 | and they try to sync them using Wav2Lip and also the two baseline models, Speech2Vid and LipGAN.
00:04:57.980 | We can see that, according to human evaluators (so these are all
00:05:03.180 | evaluations made by humans), Wav2Lip is actually preferred. The method of evaluation is written here in
00:05:10.220 | 4.4.2.
00:05:14.700 | Secondly, we have a Random dataset, that is, a
00:05:18.060 | dataset of random videos paired with random audios, and Wav2Lip is used to
00:05:23.820 | sync them. And finally we have TTS, in which the audio is generated by a TTS system.
00:05:31.100 | As we can see, overall Wav2Lip is performing much better. We will see some examples later,
00:05:36.540 | and according to human evaluators.
00:05:42.220 | We see here that the authors write that, finally, it is worth noting that their lip-synced videos are preferred over existing methods,
00:05:49.020 | or even over the actual unsynced videos, over 90% of the time, so it means that the visual quality is also not bad.
00:05:58.540 | Here are some examples. For example, we can see the red
00:06:01.420 | frames are from the previous models, and we can see that the quality of the face here, of
00:06:08.860 | the German chancellor,
00:06:11.500 | is not so good, but here, with
00:06:16.140 | Wav2Lip, the reconstructed image is quite good.
00:06:21.660 | The authors actually train two models, one with the
00:06:27.180 | GAN loss and one without the GAN loss, and we can see that the one
00:06:30.540 | without the GAN performs better on some metrics and a little worse on some other metrics.
00:06:37.820 | And so actually I think this model has a lot of potential for generating talking avatars, for dubbing videos, or for generating educational videos.
00:06:46.860 | Maybe in the future we won't need to record the same video three times in multiple languages;
00:06:51.820 | we can just record it once and let the AI do the rest. Thank you for listening.