This paper, written by Chaofeng Chen, Xiaoming Li, Lingbo Yang, Xianhui Lin, Lei Zhang, and Kwan-Yee K. Wong, was published at CVPR 2021. The authors propose a progressive semantic-aware style transformation network, called PSFR-GAN, for blind face restoration. They utilize multi-scale low-resolution (LR) face images and their semantic segmentation maps to recover high-resolution (HR) face images through semantic-aware style transformation. Furthermore, they introduce a semantic-aware style loss that computes the style loss of each semantic region individually to improve the restoration of face textures.
The main contributions of the paper are:
- They propose a novel multi-scale progressive framework for blind face restoration, i.e., PSFR-GAN, which makes better use of multi-scale inputs both pixel-wise and semantic-wise.
- They introduce a semantic-aware style loss to improve the restoration of face textures in each semantic region and reduce the occurrence of artifacts.
- Their model generalizes better to real LR face images than state-of-the-art methods.
- They introduce a pre-trained face parsing network (FPN) to generate the segmentation map of the LR face image, so the model can produce an HR face image given only an LR face image. This makes the model more practical in real-world cases, where segmentation maps of LR face images are rarely available.
The core of PSFR-GAN is the generator network; we reproduce the architecture figure from the paper here. The generator network starts with a constant input feature map and progressively upsamples it, conditioning each scale on the LR face image and its segmentation map. Each style transformation block learns spatially-adaptive modulation parameters (a scale and a shift) from the LR image and its segmentation map and applies them to the feature map.
We reimplement PSFR-GAN and publish the code at github.com/sidneyrachel/psfrgan-reimplementation. We implement the code using the PyTorch library and refer to the original code available at github.com/chaofengc/PSFRGAN.
The proposed architecture is divided into three main networks: Face Parsing Network (FPN), generator, and discriminator. In the following sections, we will explain the implementation of each network. Finally, we will explain how PSFR-GAN is constructed from these three networks.
FPN produces a semantic segmentation map of a face image that labels each region with a different color, e.g., left eye, right eye, nose, upper lip, bottom lip, neck, etc. PSFR-GAN uses a pre-trained model of this network to produce a semantic segmentation map of a low-resolution (LR) face image at both training and inference time. Below is the detailed architecture of FPN.
It consists of a type-1 convolutional layer (the layer types are explained later), four encoder residual blocks that downsample the image, ten body residual blocks, and four decoder residual blocks that upsample the image. There is also a residual connection that sums the output of the last encoder residual block with the output of the last body residual block. Finally, two type-1 convolutional layers, i.e., the type-1 image convolutional layer and the type-1 mask convolutional layer, produce a high-resolution (HR) face image and a semantic segmentation map for the corresponding LR face image, respectively. The label below each layer shows how the number of channels changes between layers, in the format "<previous number of channels> to <next number of channels>".
FPN operates in a multi-task setting, producing the segmentation map and an HR face image at the same time for the corresponding LR face image. However, PSFR-GAN only uses the semantic segmentation map for the subsequent prediction; the generation of the HR face image acts as supervision for predicting a better semantic segmentation map.
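To make the data flow concrete, here is a high-level sketch of this FPN structure. The channel width, number of parsing classes, and kernel sizes are assumptions, and the `encoder`, `body`, and `decoder` arguments stand for the residual blocks detailed below.

```python
import torch
import torch.nn as nn


class FPNSkeleton(nn.Module):
    """Sketch of the face parsing network: input conv, 4 downsampling encoder
    blocks, 10 body blocks, 4 upsampling decoder blocks, and two output heads
    (image and mask). Widths and class count are assumptions."""

    def __init__(self, encoder, body, decoder, width=64, num_classes=19):
        super().__init__()
        self.stem = nn.Conv2d(3, width, kernel_size=3, padding=1)
        self.encoder = encoder      # 4 downsampling residual blocks
        self.body = body            # 10 body residual blocks
        self.decoder = decoder      # 4 upsampling residual blocks
        self.to_image = nn.Conv2d(width, 3, kernel_size=3, padding=1)
        self.to_mask = nn.Conv2d(width, num_classes, kernel_size=3, padding=1)

    def forward(self, lr_image):
        x = self.stem(lr_image)
        enc = self.encoder(x)
        # Residual connection: sum of the last encoder output and the last body output.
        feat = self.decoder(enc + self.body(enc))
        return self.to_image(feat), self.to_mask(feat)
```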
Each residual block in FPN consists of 3 convolutional layers with different types according to which part of the network it belongs to, i.e., encoder residual block, body residual block, or decoder residual block. The detailed architecture of the residual block is shown below. The input goes to convolutional layer 1 and convolutional layer 2. From convolutional layer 2, it is propagated to convolutional layer 3. Finally, we take the summation of the output from convolutional layer 1 and convolutional layer 3 as the final output of the residual block.
The table below shows the type for each convolutional layer for each kind of residual block.
convolutional layer | encoder residual block | body residual block | decoder residual block |
---|---|---|---|
conv. layer 1 | type-1 | type-1 | type-2 |
conv. layer 2 | type-3 | type-3 | type-4 |
conv. layer 3 | type-5 | type-5 | type-5 |
Finally, we show the detailed components of each type of convolutional layer. The type-4 convolutional layer has the most complete architecture, consisting of 2x interpolation, 2D reflection padding, 2D convolution, batch normalization, and leaky ReLU. The type-1, type-2, type-3, and type-5 layers are subsets of type-4.
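Below is a minimal sketch of the type-4 layer and of the residual-block wiring described above. Kernel size, negative slope, and channel handling are assumptions; the other layer types would simply drop the components they do not include.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Type4ConvLayer(nn.Module):
    """Sketch of the 'type-4' layer: 2x interpolation, reflection padding,
    2D convolution, batch norm, leaky ReLU (kernel size and slope assumed)."""

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pad = nn.ReflectionPad2d(kernel_size // 2)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size)
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode='nearest')  # 2x upsampling
        x = self.conv(self.pad(x))
        return F.leaky_relu(self.norm(x), negative_slope=0.2)


class FPNResidualBlock(nn.Module):
    """Sketch of the residual-block wiring: the input goes through conv layer 1
    (shortcut) and through conv layers 2 and 3 (main branch); the two paths are
    summed. conv1/conv2/conv3 would be instantiated with the types listed in the
    table above, depending on whether the block is an encoder, body, or decoder block."""

    def __init__(self, conv1, conv2, conv3):
        super().__init__()
        self.conv1, self.conv2, self.conv3 = conv1, conv2, conv3

    def forward(self, x):
        return self.conv1(x) + self.conv3(self.conv2(x))
```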
The generator network generates the super-resolution (SR) face image given the LR face image and the segmentation map previously produced by FPN. Below is the detailed architecture of the generator network. It starts with a constant input feature map that is progressively upsampled to the output resolution. The first part of the network consists of a 2D convolution and a style transformation block, which produce the feature map at the smallest scale.
The detailed output shapes of the layers in the second part, i.e., upsampling, 2D convolution, and style transformation block for each repetition, can be seen below.
upsampling | 2D convolution | style transformation block | output |
---|---|---|---|
We will look deeper at the style transformation block since it is the main point of the proposed framework. The style transformation block takes the LR+segmap image, i.e., the LR face image concatenated with its segmentation map, and processes it to update the input feature map.
The image below shows the detailed architecture of the SPADE normalization. The normalization interpolates the LR+segmap image to the same height and width as the input feature map, predicts a spatially varying scale and shift from it, and uses them to modulate the normalized feature map.
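A minimal sketch of this SPADE-style normalization follows, assuming a small shared convolutional head predicts the per-pixel scale and shift from the resized LR+segmap input. The hidden width, kernel sizes, and choice of parameter-free normalization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SPADENorm(nn.Module):
    """Sketch of SPADE normalization conditioned on the LR+segmap image."""

    def __init__(self, feature_channels, cond_channels, hidden=128):
        super().__init__()
        # Parameter-free normalization of the feature map (instance norm assumed).
        self.norm = nn.InstanceNorm2d(feature_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.to_gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, feat, lr_segmap):
        # Resize the LR+segmap input to the spatial size of the feature map.
        cond = F.interpolate(lr_segmap, size=feat.shape[2:], mode='nearest')
        shared = self.shared(cond)
        gamma, beta = self.to_gamma(shared), self.to_beta(shared)
        # Normalize the feature map, then modulate it with the predicted scale and shift.
        return self.norm(feat) * (1 + gamma) + beta
```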
We use a discriminator network to produce discriminator results for two inputs: the HR image concatenated with the ground-truth semantic segmentation map (the HR+segmap image), and the SR image produced by the generator concatenated with the same ground-truth segmentation map (the SR+segmap image). These discriminator results are used to calculate the losses of PSFR-GAN, which we explain in detail in the following sections.
The image below shows the detailed architecture of the discriminator network. The network has three discriminators, each with four main convolutional layers. The output of each discriminator is kept as part of the final result. After feeding the input to one discriminator, the network downsamples it using 2D average pooling before passing it to the next discriminator.
Each N-layer discriminator consists of a type-1 convolutional layer, four type-2 convolutional layers, and a final 2D convolution.
The type-1 convolutional layer consists of 2D reflection padding, 2D convolution, and leaky ReLU. Meanwhile, the type-2 convolutional layer consists of 2D reflection padding, 2D convolution, 2D average pooling (for downsampling), instance normalization, and leaky ReLU.
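A sketch of this multi-scale arrangement is shown below. Here `make_discriminator` is a hypothetical factory that builds one N-layer discriminator as described above, and the pooling parameters are assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleDiscriminator(nn.Module):
    """Sketch of the three-discriminator setup with average-pool downsampling."""

    def __init__(self, make_discriminator, num_discriminators=3):
        super().__init__()
        self.discriminators = nn.ModuleList(
            [make_discriminator() for _ in range(num_discriminators)])
        self.downsample = nn.AvgPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # x is the HR+segmap or SR+segmap input (image concatenated with its map).
        outputs = []
        for disc in self.discriminators:
            outputs.append(disc(x))
            # Downsample before passing the input to the next discriminator.
            x = self.downsample(x)
        return outputs
```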
Finally, PSFR-GAN unifies the three networks, i.e., FPN, generator, and discriminator. Given an LR face image, PSFR-GAN first uses the pre-trained FPN to generate its semantic segmentation map. Given the LR face image and its segmentation map, the generator network generates the SR face image. After that, the discriminator network produces the discriminator results for the concatenation of the SR face image and its segmentation map (the SR+segmap image) and for the concatenation of the HR face image and its segmentation map (the HR+segmap image). Note that the discriminator is used only during training, where the ground-truth segmentation map is available in the training data. The discriminator results are used to calculate several of the losses in this framework, i.e., the feature matching loss, the generator loss, and the discriminator loss.
There are two groups of losses. The first group consists of the semantic-aware style loss, the pixel loss, the feature matching loss, and the generator loss. The loss of the first group can be written as

$$\mathcal{L}_{\text{total}} = \lambda_{\text{SS}}\mathcal{L}_{\text{SS}} + \lambda_{\text{pix}}\mathcal{L}_{\text{pix}} + \lambda_{\text{FM}}\mathcal{L}_{\text{FM}} + \lambda_{\text{G}}\mathcal{L}_{\text{G}},$$

where the $\lambda$'s are weights that balance the contributions of the individual terms.
Meanwhile, the second group consists of the discriminator loss only. PSFR-GAN is trained by minimizing the two groups of losses alternately.
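A minimal sketch of this alternating scheme is shown below, assuming a data loader that yields (LR image, HR image, HR segmentation map) batches and hypothetical helpers `generator_losses` and `discriminator_loss` that compute the two loss groups described in the following sections (including the discriminator forward passes).

```python
import torch


def train_epoch(fpn, generator, optimizer_g, optimizer_d, loader,
                generator_losses, discriminator_loss):
    """Sketch of one epoch of alternating optimization."""
    for lr_img, hr_img, hr_mask in loader:
        # The frozen pre-trained FPN returns a restored image and a parsing map;
        # only the parsing map of the LR input is used.
        with torch.no_grad():
            _, lr_mask = fpn(lr_img)
        sr_img = generator(lr_img, lr_mask)

        # First group: semantic-aware style + pixel + feature matching + generator losses.
        optimizer_g.zero_grad()
        generator_losses(sr_img, hr_img, hr_mask).backward()
        optimizer_g.step()

        # Second group: discriminator (hinge) loss; SR is detached so only D updates.
        optimizer_d.zero_grad()
        discriminator_loss(sr_img.detach(), hr_img, hr_mask).backward()
        optimizer_d.step()
```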
To calculate the semantic-aware style (SS) loss, we first extract features from the predicted SR face image and the ground-truth HR face image using VGG19. We use layers 0-2 of VGG19 to build the first module, layers 3-7 for the second module, layers 8-16 for the third module, layers 17-25 for the fourth module, and layers 26-34 for the fifth module. Feeding an image through these modules yields five features (one from each module) per image. For each image, we only use the 3rd-5th features to calculate the loss; we denote them as $F_k(\cdot)$ with $k \in \{3, 4, 5\}$.
The loss is formulated as

$$\mathcal{L}_{\text{SS}} = \sum_{k=3}^{5} \sum_{s=1}^{S} \left\| G\big(F_k(I_{SR}) \odot M_s\big) - G\big(F_k(I_{HR}) \odot M_s\big) \right\|_1,$$

where $G(\cdot)$ denotes the Gram matrix, $M_s$ is the binary mask of semantic region $s$ (resized to the feature resolution), $I_{SR}$ is the predicted SR face image, and $I_{HR}$ is the ground-truth HR face image.
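The sketch below shows how the SS loss might be computed under the definitions above. The VGG slicing follows the layer ranges listed earlier; the Gram-matrix normalization and the helper names are our own assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision


def build_vgg_modules():
    """Split torchvision's VGG19 feature extractor into the five modules
    described above (layers 0-2, 3-7, 8-16, 17-25, 26-34)."""
    vgg = torchvision.models.vgg19(
        weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1).features.eval()
    slices = [(0, 3), (3, 8), (8, 17), (17, 26), (26, 35)]
    return [torch.nn.Sequential(*[vgg[i] for i in range(a, b)]) for a, b in slices]


def gram_matrix(feat):
    """Gram matrix of a (B, C, H, W) feature map; the normalization is an assumption."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


def semantic_style_loss(vgg_modules, sr_img, hr_img, masks):
    """SS loss sketch: Gram-matrix differences of the 3rd-5th VGG features,
    accumulated per semantic region. `masks` is a one-hot (B, S, H, W) map."""
    loss = 0.0
    f_sr, f_hr = sr_img, hr_img
    for k, module in enumerate(vgg_modules):
        f_sr, f_hr = module(f_sr), module(f_hr)
        if k < 2:  # only the 3rd-5th features (k = 2, 3, 4) contribute
            continue
        m = F.interpolate(masks, size=f_sr.shape[2:], mode='nearest')
        for s in range(m.shape[1]):
            region = m[:, s:s + 1]  # (B, 1, h, w), broadcast over channels
            loss = loss + F.l1_loss(gram_matrix(f_sr * region),
                                    gram_matrix(f_hr * region))
    return loss
```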
The paper combines the pixel loss and the feature matching loss to form the reconstruction loss. We calculate the pixel loss between the SR face image and the HR face image using `torch.nn.L1Loss()`, i.e.,

$$\mathcal{L}_{\text{pix}} = \left\| I_{SR} - I_{HR} \right\|_1.$$
We calculate the feature matching loss between the SR+segmap image and the HR+segmap image by feeding both into the discriminator network, which consists of 3 discriminators with 4 layers each. We denote the features extracted from the SR+segmap image by layer $i$ of discriminator $d$ as $F^{(d,i)}_{SR}$, and those from the HR+segmap image as $F^{(d,i)}_{HR}$. The feature matching loss averages the L1 distances between these features:

$$\mathcal{L}_{\text{FM}} = \frac{1}{3}\sum_{d=1}^{3} \frac{1}{4}\sum_{i=1}^{4} \left\| F^{(d,i)}_{SR} - F^{(d,i)}_{HR} \right\|_1.$$
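A minimal sketch of this computation follows, assuming the discriminator forward pass returns, for each of the three discriminators, the list of its four intermediate feature maps; the exact averaging scheme is our assumption.

```python
import torch.nn.functional as F


def feature_matching_loss(sr_features, hr_features):
    """sr_features / hr_features: list (one entry per discriminator) of lists of
    intermediate feature maps from the SR+segmap and HR+segmap inputs."""
    loss = 0.0
    for sr_feats_d, hr_feats_d in zip(sr_features, hr_features):
        for f_sr, f_hr in zip(sr_feats_d, hr_feats_d):
            # HR features serve as fixed targets, so they are detached.
            loss = loss + F.l1_loss(f_sr, f_hr.detach()) / len(sr_feats_d)
    return loss / len(sr_features)
```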
We calculate the generator loss of the SR+segmap image by feeding the image into the discriminator network, which consists of 3 discriminators. For each discriminator's output, we take the negative mean (the hinge loss for the generator) and average the result over the three discriminators:

$$\mathcal{L}_{\text{G}} = -\frac{1}{3}\sum_{d=1}^{3} \operatorname{mean}\big(D_d(I_{SR}, M)\big),$$

where $D_d$ is the $d$-th discriminator and $M$ is the ground-truth segmentation map.
We calculate the discriminator loss by feeding the SR+segmap image and the HR+segmap image into the discriminator network, which consists of 3 discriminators. For each discriminator, we apply the hinge loss: the term for the SR+segmap image is the mean of $\mathrm{ReLU}\big(1 + D_d(I_{SR}, M)\big)$ and the term for the HR+segmap image is the mean of $\mathrm{ReLU}\big(1 - D_d(I_{HR}, M)\big)$. We then average over the three discriminators:

$$\mathcal{L}_{\text{D}} = \frac{1}{3}\sum_{d=1}^{3} \Big[ \operatorname{mean}\big(\mathrm{ReLU}(1 + D_d(I_{SR}, M))\big) + \operatorname{mean}\big(\mathrm{ReLU}(1 - D_d(I_{HR}, M))\big) \Big].$$
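Under the hinge formulation described above, the generator and discriminator losses could be sketched as follows, where `sr_outputs` and `hr_outputs` are the raw outputs of the three discriminators for the SR+segmap and HR+segmap inputs.

```python
import torch.nn.functional as F


def generator_hinge_loss(sr_outputs):
    """Negative mean of each discriminator's output on the SR+segmap input,
    averaged over the three discriminators."""
    return sum(-out.mean() for out in sr_outputs) / len(sr_outputs)


def discriminator_hinge_loss(sr_outputs, hr_outputs):
    """Hinge loss: ReLU(1 + D(fake)) for SR and ReLU(1 - D(real)) for HR,
    averaged over the three discriminators."""
    loss = 0.0
    for out_sr, out_hr in zip(sr_outputs, hr_outputs):
        loss = loss + F.relu(1 + out_sr).mean() + F.relu(1 - out_hr).mean()
    return loss / len(sr_outputs)
```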
We gather the training dataset from FFHQ, which is available at github.com/NVlabs/ffhq-dataset. We download all 70,000 images at their original $1024 \times 1024$ resolution.
We first apply random grayscale with a probability of 0.3:
```python
from imgaug import augmenters

# Convert the image to grayscale with probability 0.3.
gray_aug = augmenters.Sometimes(0.3, augmenters.Grayscale(alpha=1.0))
```
We downsample the images to $512 \times 512$ to obtain the HR ground-truth images.
After that, we synthesize the LR inputs: with probability 0.5 we apply one of Gaussian, average, median, or motion blur; we resize the image to a random size between 32 and 256 using a random interpolation method; with probability 0.2 we apply additive Gaussian noise; with probability 0.7 we apply JPEG compression; and finally we resize the image back to $512 \times 512$:
```python
import imgaug
from imgaug import augmenters
import numpy as np

# Random target size for the intermediate downscaling step.
scale_size = np.random.randint(32, 256)
high_res_size = 512

# Degradation pipeline that turns a 512x512 HR image into its LR counterpart.
lr_degradation = augmenters.Sequential([
    # Randomly blur the image with probability 0.5.
    augmenters.Sometimes(0.5, augmenters.OneOf([
        augmenters.GaussianBlur((3, 15)),
        augmenters.AverageBlur((3, 15)),
        augmenters.MedianBlur((3, 15)),
        augmenters.MotionBlur((5, 25))
    ])),
    # Downscale to a random size with a random interpolation method.
    augmenters.Resize(scale_size, interpolation=imgaug.ALL),
    # Add Gaussian noise with probability 0.2.
    augmenters.Sometimes(0.2, augmenters.AdditiveGaussianNoise(loc=0, scale=(0.0, 25.5), per_channel=0.5)),
    # Apply JPEG compression with probability 0.7.
    augmenters.Sometimes(0.7, augmenters.JpegCompression(compression=(10, 65))),
    # Resize back to the HR resolution so the LR input matches the HR target size.
    augmenters.Resize(high_res_size)
])
```
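As an illustration of how these augmenters might be applied (using the `gray_aug` and `lr_degradation` names from the snippets above; the random array is just a stand-in for a real HR face):

```python
import numpy as np

# Stand-in for a 512x512 HR face image loaded as an (H, W, 3) uint8 array.
hr_image = np.random.randint(0, 256, size=(512, 512, 3), dtype=np.uint8)

lr_image = gray_aug(image=hr_image)        # optional grayscale augmentation
lr_image = lr_degradation(image=lr_image)  # synthetic degradation to build the LR input
```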
To generate the segmentation maps of the LQ and HQ face images, we use the pre-trained model provided by the original authors, parse_multi_iter_90000.pth. Due to limited training resources, we train the model for only 20 epochs with a batch size of 4. The resulting models are provided under trained_models. The script to train the model can be found at `train.py`; its config can be changed in `config/base.json` and `config/psfrgan/train.json`.
We gather the test dataset from CelebA-HQ, which is available at github.com/tkarras/progressive_growing_of_gans. We download the images from Google Drive using rclone. We choose 2,800 images and downsample them with `downsample_psfrgan.py`; its config can be changed in `config/downsample.json`.
We use the model epoch_20_net_gen.pth to generate super-resolution (SR) face images from the LR face images. We measure Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS), Structural Similarity Index Measure (SSIM), and Fréchet Inception Distance (FID) between the predicted SR face images and the ground-truth HR face images. The original paper does not mention which libraries the authors use to evaluate the model, so we use the implementations provided by torchmetrics to measure PSNR, LPIPS, and SSIM. For measuring FID, we use the library available at github.com/mseitzer/pytorch-fid.
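As a rough sketch of how these metrics might be computed with torchmetrics (class names as in recent torchmetrics versions; the data range and preprocessing are assumptions):

```python
import torch
from torchmetrics import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity


def evaluate_batch(sr_batch, hr_batch):
    """sr_batch / hr_batch: (B, 3, H, W) tensors scaled to [0, 1]."""
    psnr = PeakSignalNoiseRatio(data_range=1.0)
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type='alex', normalize=True)
    return {
        'psnr': psnr(sr_batch, hr_batch).item(),
        'ssim': ssim(sr_batch, hr_batch).item(),
        'lpips': lpips(sr_batch, hr_batch).item(),
    }
```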
Note that we do not use the same dataset and metric implementations as the original paper because they are not publicly available. Therefore, we do not compare the performance of our model (`ours_epoch20`) with the numbers reported in the original paper. Instead, we compare our model with the model provided in the original repository, psfrgan_epoch15_net_G.pth, which we denote as `original_epoch15`. However, we are unsure whether this is the same model reported in the paper. Note that `ours_epoch20` is trained using 35,000 images from FFHQ, whereas `original_epoch15` is trained using all 70,000 FFHQ images.
model | PSNR$\uparrow$ | SSIM$\uparrow$ | LPIPS$\downarrow$ | FID$\downarrow$ |
---|---|---|---|---|
original_epoch15 | 24.8226 | 0.6729 | 0.3347 | 12.0026 |
ours_epoch20 | 24.5472 | 0.6648 | 0.3290 | 11.9687 |
We provide some examples of images generated by each model.
code | HR face image | LR face image | ours_epoch20 | original_epoch15 |
---|---|---|---|---|
02911.jpg | ||||
03277.jpg | ||||
04957.jpg | ||||
29819.jpg ||||
We can see from the examples that our model produces relatively sharper images, e.g., `04957.jpg` and `29819.jpg`. However, the model provided in the original repository produces relatively smoother images with fewer artifacts, especially in the lip area of `03277.jpg`.
The script to predict the test dataset is available at `test_psfrgan.py`. The config for this script can be changed in `config/base.json` and `config/psfrgan/test.json`. Meanwhile, the script to calculate the metrics of the predicted images, i.e., PSNR, LPIPS, and SSIM, is available at `evaluate_psfrgan.py`; its config can be changed in `config/evaluate.json`. To calculate the FID score, we simply run `python -m pytorch_fid <path/to/sr-folder> <path/to/hr-folder>` in the terminal.
You can experiment with the model using your own real LR face images. The script to preprocess them is available at `preprocess_psfrgan.py`; its config can be changed in `config/preprocess.json`. The script produces cropped and aligned versions of the original LR faces, which are ready to be enhanced by the model.
C. Chen, X. Li, L. Yang, X. Lin, L. Zhang and K. -Y. K. Wong, "Progressive Semantic-Aware Style Transformation for Blind Face Restoration," 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11891-11900, doi: 10.1109/CVPR46437.2021.01172.
T. Karras, S. Laine and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4396-4405, doi: 10.1109/CVPR.2019.00453.
T. Karras, T. Aila, S. Laine and J. Lehtinen, "Progressive Growing of GANs for Improved Quality, Stability, and Variation," arXiv preprint arXiv:1710.10196, 2017.
M. Seitzer, "pytorch-fid: FID Score for PyTorch," version 0.2.1, 2020. https://github.com/mseitzer/pytorch-fid.