Fast Face-swap Using Convolutional Neural Networks

We consider the problem of face swapping in images, where an input identity is transformed into a target identity while preserving pose, facial expression, and lighting. To perform this mapping, we use convolutional neural networks trained to capture the appearance of the target identity from an unstructured collection of his/her photographs. This approach is enabled by framing the face swapping problem in terms of style transfer, where the goal is to render an image in the style of another one. Building on recent advances in this area, we devise a new loss function that enables the network to produce highly photorealistic results. By combining neural networks with simple pre- and post-processing steps, we aim to make face swapping work in real time with no input from the user.


Introduction and related work
Face replacement or face swapping is relevant in many scenarios including the provision of privacy, appearance transfiguration in portraits, video compositing, and other creative applications. The exact formulation of this problem varies depending on the application, with some goals easier to achieve than others.
Bitouk et al. [2], for example, automatically substituted an input face by another face selected from a large database of images based on the similarity of appearance and pose. The method replaces the eyes, nose, and mouth of the face and further makes color and illumination adjustments in order to blend the two faces. This design has two major limitations which we address in this paper: there is no control over the output identity and the expression of the input face is altered.
A more difficult problem was addressed by Dale et al. [4]. Their work focused on the replacement of faces in videos, where video footage of two subjects performing similar roles is available. Compared to static images, sequential data poses extra difficulties of temporal alignment, tracking facial performance and ensuring temporal consistency of the resulting footage. The resulting system is complex and still requires a substantial amount of time and user guidance.
One notable approach to the related problem of puppeteering (that is, controlling the expression of one face with another face) was presented by Suwajanakorn et al. [29]. The core idea is to build a 3D model of both the input and the replacement face from a large number of images. Consequently, it only works well where a few hundred images are available and cannot be applied to single images.
The abovementioned approaches are based on complex multistage systems combining algorithms for face reconstruction, tracking, alignment and image compositing. These systems achieve convincing results which are sometimes indistinguishable from real photographs. However, none of these fully addresses the problem which we introduce below.
Problem outline: We consider the case where, given a single input image of any person A, we would like to replace his/her identity with that of another person B, while keeping the input pose, facial expression, gaze direction, hairstyle and lighting intact. An example is given in Figure 1, where the original identity (Figure 1a) was altered with little or no change to the other factors (Figure 1b).
We propose a novel solution which is inspired by recent progress in artistic style transfer [7,14], where the goal is to render the semantic content of one image in the style of another image. The foundational work of Gatys et al. [7] defines the concepts of content and style as functions in the feature space of convolutional neural networks trained for object recognition. Stylization is carried out using a rather slow and memory-consuming optimization process. It gradually changes pixel values of an image until its content and style statistics match those from a given content image and a given style image, respectively.
Figure 2: A schematic illustration of our approach (panel labels: input, alignment, realignment, stitching). After aligning the input face to a reference image, a convolutional neural network is used to modify it. Afterwards, the generated face is realigned and combined with the input image by using a segmentation mask. The top row shows facial keypoints used to define the affine transformations of the alignment and realignment steps, and the skin segmentation mask used for stitching.
An alternative to the expensive optimization approach was proposed by Ulyanov et al. [31] and Johnson et al. [9]. They trained feed-forward neural networks to transform any image into its stylized version, thus moving costly computations to the training stage of the network. At test time, stylization requires a single forward pass through the network, which can be done in real time. The price of this improvement is that a separate network has to be trained per style.
While achieving remarkable results on transferring the style of many artworks, the neural style transfer method is less suited for photorealistic transfer. The reason appears to be that the Gram matrices used to represent the style do not capture enough information about the spatial layout of the image. This introduces unnatural distortions which go unnoticed in artistic images but not in real images. Li and Wand [14] alleviated this problem by replacing the correlation-based style loss with a patch-based loss that better preserves local structures. Their results were the first to suggest that photorealistic and controlled modifications of photographs of faces may be possible using style transfer techniques. However, this direction was left fairly unexplored and, like the work of Gatys et al. [7], the approach depended on expensive optimization. Later applications of the patch-based loss to feed-forward neural networks only explored texture synthesis and artistic style transfer [15].
This paper builds on the work of Li and Wand [14]: we present a feed-forward neural network which achieves high levels of photorealism in generated face-swapped images. The key component is that our method, unlike previous approaches to style transfer, uses a multi-image style loss, thus approximating a manifold describing a style rather than using a single reference point. We furthermore extend the loss function to explicitly match lighting conditions between images. Notably, the trained networks allow us to perform face swapping in, or near, real time. The main requirement for our method is to have a collection of images from the target (replacement) identity. For well photographed people whose images are available on the Internet, this collection can be easily obtained.
Since our approach to face replacement is rather unique, the results look different from those obtained with more classical computer vision techniques [2,4,10] or using image editing software (compare Figures 1b and 1c). While it is difficult to compete with an artist specializing in this task, our results suggest that achieving human-level performance may be possible with a fast and automated approach.

Method
Having an image of person A, we would like to transform his/her identity into person B's identity while keeping head pose and expression as well as lighting conditions intact. In terms of style transfer, we think of input image A's pose and expression as the content, and input image B's identity as the style. Light is dealt with in a separate way introduced below.
Following Ulyanov et al. [31] and Johnson et al. [9], we use a convolutional neural network parameterized by weights W to transform the content image x, i.e. input image A, into the output image $\hat{x} = f_W(x)$. Unlike previous work, we assume that we are given not one but a set of style images which we denote by $Y = \{y_1, \ldots, y_N\}$. These images describe the identity which we would like to match and are only used during training of the network.
Our system has two additional components performing face alignment and background/hair/skin segmentation. We assume that all images (content and style) are aligned to a frontal-view reference face. This is achieved using an affine transformation, which aligns 68 facial keypoints from a given image to the reference keypoints. Facial keypoints were extracted using dlib [11]. Segmentation is used to restore the background and hair of the input image x, which are currently not preserved by our transformation network. We used a seamless cloning technique [23] available in OpenCV [20] to stitch the background and the resulting face-swapped image. While fast and relatively accurate methods for segmentation exist, including some based on neural networks [1,19,22], we assume for simplicity that a segmentation mask is given and focus on the remaining problems. An overview of the system is given in Figure 2.
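The alignment and realignment steps can be illustrated with a short sketch. It assumes a least-squares fit of a 2 × 3 affine matrix to keypoint correspondences, which is one common choice; the text does not state which estimator was used. The function names and the synthetic keypoints are hypothetical, and in practice the 68 points would come from a detector such as dlib [11].

```python
import numpy as np


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix A with A @ [x, y, 1]^T close to dst."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    # Homogeneous source coordinates, shape (n, 3).
    src_h = np.hstack([src, np.ones((src.shape[0], 1))])
    # Solve src_h @ X = dst in the least-squares sense; X has shape (3, 2).
    X, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return X.T


def apply_affine(A, pts):
    """Apply a 2x3 affine matrix to an (n, 2) array of points."""
    pts = np.asarray(pts, dtype=float)
    return pts @ A[:, :2].T + A[:, 2]


def invert_affine(A):
    """Inverse of a 2x3 affine matrix, as needed for the realignment step."""
    M = np.vstack([A, [0.0, 0.0, 1.0]])
    return np.linalg.inv(M)[:2, :]


# Synthetic example: 68 hypothetical reference keypoints in a 128x128 frame.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 128.0, size=(68, 2))
true_A = np.array([[0.9, -0.2, 5.0], [0.2, 0.9, -3.0]])
detected = apply_affine(invert_affine(true_A), reference)  # keypoints in the input image

A = estimate_affine(detected, reference)                   # input -> reference
aligned = apply_affine(A, detected)                        # alignment
restored = apply_affine(invert_affine(A), aligned)         # realignment
print(np.abs(aligned - reference).max(), np.abs(restored - detected).max())
```

The same matrix would be used to warp the image itself (for example with an image-warping routine), and its inverse to map the generated face back into the input frame.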
In the following we will describe the architecture of the transformation network and the loss functions used for its training.

Transformation network
The architecture of our transformation network is based on the architecture of Ulyanov et al. [31] and is shown in Figure 3. It is a multiscale architecture with branches operating on different downsampled versions of the input image x. Each such branch has blocks of zero-padded convolutions followed by linear rectification. Branches are combined via nearest-neighbor upsampling by a factor of two and concatenation along the channel axis. The last branch of the network ends with a 1 × 1 convolution and 3 color channels.
The network in Figure 3, which is designed for 128×128 inputs, has 1M parameters. For larger inputs, e.g. 256×256 or 512 × 512, it is straightforward to infer the architecture of the extra branches. The network output is obtained only from the branch with the highest resolution.
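How two branches are combined can be sketched in a few lines of NumPy. This only illustrates nearest-neighbor upsampling by a factor of two followed by channel-wise concatenation; the convolutional blocks are omitted, and the function names, the channel counts and the order of concatenation are assumptions made for illustration.

```python
import numpy as np


def upsample_nearest_2x(features):
    """Nearest-neighbor upsampling by two; features has shape (C, H, W)."""
    return features.repeat(2, axis=1).repeat(2, axis=2)


def merge_branches(low_res, high_res):
    """Upsample the coarser branch and concatenate along the channel axis."""
    up = upsample_nearest_2x(low_res)
    if up.shape[1:] != high_res.shape[1:]:
        raise ValueError("spatial sizes must match after upsampling")
    return np.concatenate([up, high_res], axis=0)


# Miniature example: a 2-channel 2x2 branch merged with a 3-channel 4x4 branch.
coarse = np.arange(8, dtype=float).reshape(2, 2, 2)
fine = np.zeros((3, 4, 4))
merged = merge_branches(coarse, fine)
print(merged.shape)  # (5, 4, 4)
```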
We found it convenient to first train the network on 128 × 128 inputs, and then use it as a starting point for the network operating on larger images. In this way, we can achieve higher resolutions without the need to retrain the whole model. However, we are constrained by the availability of high-quality image data for training the model.

Loss functions
For every input image x, we aim to generate an $\hat{x}$ which jointly minimizes the following content and style losses. These losses are defined in the feature space of the normalised version of the 19-layer VGG network [7,27]. We will denote the VGG representation of x on layer l as $\Phi_l(x)$. Here we assume that x and every style image y are aligned to a reference face. All images have the dimensionality of 3 × H × W.
Content loss: For the lth layer of the VGG network, the content loss is given by [7]:
$$L_{content}(\hat{x}, x, l) = \frac{1}{|\Phi_l(x)|} \left\| \Phi_l(\hat{x}) - \Phi_l(x) \right\|_2^2, \qquad (1)$$
where $|\Phi_l(x)| = C_l H_l W_l$ is the dimensionality of $\Phi_l(x)$, which has shape $C_l \times H_l \times W_l$. In general, the content loss can be computed over multiple layers of the network, so that the overall content loss would be:
$$L_{content}(\hat{x}, x) = \sum_l L_{content}(\hat{x}, x, l). \qquad (2)$$
Style loss: Our loss function is inspired by the patch-based loss of Li and Wand [14]. Following their notation, let $\Psi(\Phi_l(x))$ denote the list of all patches generated by looping over $H_l \times W_l$ possible locations in $\Phi_l(x)$ and extracting a square k × k neighbourhood around each point. This results in $M = (H_l - k + 1) \times (W_l - k + 1)$ possible patches. For every such patch from $\hat{x}$ we find the best matching patch among patches extracted from Y and minimize the distance between them. As an error metric we used the cosine distance $d_c$:
$$d_c(u, v) = 1 - \frac{u^\top v}{\|u\| \, \|v\|}, \qquad (3)$$
$$L_{style}(\hat{x}, Y, l) = \frac{1}{M} \sum_{i=1}^{M} d_c\left( \Psi_i(\Phi_l(\hat{x})), \Psi_i(\Phi_l(y_{NN(i)})) \right), \qquad (4)$$
where NN(i) selects for each patch a corresponding style image. Unlike Li and Wand [14], who used a single style image y and selected a patch among all possible patches $\Psi(\Phi_l(y))$, we only search for patches in the same location i, but across multiple style images:
$$NN(i) = \arg\min_{j = 1, \ldots, N_{best}} d_c\left( \Psi_i(\Phi_l(\hat{x})), \Psi_i(\Phi_l(y_j)) \right). \qquad (5)$$
We found that taking only the $N_{best} < N$ best-matching style images into account worked better; these are selected by sorting the style images according to the Euclidean distance between their facial landmarks and the landmarks of the input image x. In this way every training image has a customized set of style images, namely those with similar pose and expression.
Figure 4: The lighting network is a siamese network trained to maximize the distance between images with different lighting conditions (inputs A and C) and to minimize this distance for pairs with equal illumination (inputs A and B). The distance is defined as an L2 norm in the feature space of the fully connected layer. All input images are aligned to the same reference face as for the inputs to the transformation network.
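The content and style terms for a single layer can be sketched in NumPy as follows. This is a minimal illustration rather than the training code: it assumes feature maps are already extracted as arrays of shape (C, H, W), normalizes the content term by the number of feature elements, and covers only the patch size k = 1, where each spatial location is one patch. All function names are hypothetical.

```python
import numpy as np


def content_loss(feat_out, feat_in):
    """Mean squared difference between two feature maps of shape (C, H, W)."""
    return float(np.mean((feat_out - feat_in) ** 2))


def select_style_subset(input_landmarks, style_landmarks, n_best):
    """Indices of the n_best style images with landmarks closest to the input.

    input_landmarks: (68, 2); style_landmarks: (N, 68, 2).
    """
    diffs = style_landmarks - input_landmarks[None]
    dists = np.sqrt((diffs ** 2).sum(axis=(1, 2)))  # Euclidean distance per image
    return np.argsort(dists)[:n_best]


def style_loss_k1(feat_out, style_feats, eps=1e-8):
    """Patch-based style loss for k = 1.

    feat_out: (C, H, W); style_feats: (N, C, H, W).
    Each location is compared with the same location in every style image,
    and the smallest cosine distance is kept before averaging over locations.
    """
    out = feat_out / (np.linalg.norm(feat_out, axis=0, keepdims=True) + eps)
    sty = style_feats / (np.linalg.norm(style_feats, axis=1, keepdims=True) + eps)
    cos_sim = (sty * out[None]).sum(axis=1)  # (N, H, W)
    d_c = 1.0 - cos_sim                      # cosine distance
    return float(d_c.min(axis=0).mean())


# Toy usage with random arrays standing in for VGG features.
rng = np.random.default_rng(0)
generated = rng.normal(size=(8, 4, 4))
styles = rng.normal(size=(5, 8, 4, 4))
print(content_loss(generated, generated), style_loss_k1(generated, styles))
```

Restricting the search to the same location across style images keeps the cost linear in the number of style images, in contrast to searching over all patch positions.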
Similar to Equation 2, we can compute style loss over multiple layers of the VGG.
Light loss: Unfortunately, the lighting conditions of the content image x are not preserved in the generated image $\hat{x}$ when only using the above-mentioned losses defined in the VGG's feature space. We address this problem by introducing an extra term to our objective which penalizes changes in illumination. To define the illumination penalty, we exploited the idea of using a feature space of a pretrained network in the same way as we used VGG for the style and content. Such an approach would work if the feature space represented differences in lighting conditions. The VGG network is not appropriate for this task since it was trained for classifying objects, where illumination information is not particularly relevant.
To get the desirable property of lighting sensitivity, we constructed a small siamese convolutional neural network [3]. It was trained to discriminate between pairs of images with either equal or different illumination conditions. Pairs of images always had equal pose. We used the Extended Yale Face Database B [8], which contains grayscale portraits of subjects under 9 poses and 64 lighting conditions. The architecture of the lighting network is shown in Figure 4. We will denote the feature representation of x in the last layer of the lighting network as $\Gamma(x)$ and introduce the following loss function, which tries to prevent generated images $\hat{x}$ from having different illumination conditions than those from the content image x:
$$L_{light}(\hat{x}, x) = \frac{1}{|\Gamma(x)|} \left\| \Gamma(\hat{x}) - \Gamma(x) \right\|_2^2. \qquad (6)$$
Both $\hat{x}$ and x are single-channel luminance images.
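Total variation regularization: Following the work of Johnson et al. [9] and others, we used regularization to encourage spatial smoothness:
$$L_{TV}(\hat{x}) = \sum_{i,j} \left[ (\hat{x}_{i,j+1} - \hat{x}_{i,j})^2 + (\hat{x}_{i+1,j} - \hat{x}_{i,j})^2 \right]. \qquad (7)$$
The final loss function is a weighted combination of the described losses:
$$L(\hat{x}, x, Y) = L_{content}(\hat{x}, x) + \alpha L_{style}(\hat{x}, Y) + \beta L_{light}(\hat{x}, x) + \gamma L_{TV}(\hat{x}). \qquad (8)$$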
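The smoothness term and the weighted combination can be sketched as follows. A common squared-difference form of total variation is assumed here, with the content term taken at unit weight and the style, light and smoothness terms weighted by α, β and γ. The function names and the example loss values are hypothetical.

```python
import numpy as np


def total_variation(img):
    """Sum of squared differences between neighbouring pixels; img is (C, H, W)."""
    dh = img[:, :, 1:] - img[:, :, :-1]  # horizontal neighbours
    dv = img[:, 1:, :] - img[:, :-1, :]  # vertical neighbours
    return float((dh ** 2).sum() + (dv ** 2).sum())


def total_loss(l_content, l_style, l_light, l_tv, alpha, beta, gamma):
    """Weighted sum of the four terms, with the content term at unit weight."""
    return l_content + alpha * l_style + beta * l_light + gamma * l_tv


# Toy usage: a random image and made-up values for the other terms.
rng = np.random.default_rng(0)
image = rng.uniform(size=(3, 8, 8))
loss = total_loss(l_content=1.2, l_style=0.4, l_light=3.0e21,
                  l_tv=total_variation(image), alpha=20.0, beta=1e-22, gamma=0.3)
print(loss)
```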

CageNet and SwiftNet
Technical details: We trained a transformation network to perform the face swapping with Nicolas Cage, of whom we collected about 60 photos from the Internet with different poses and facial expressions. To further increase the number of style images, every image was horizontally flipped. As a source of content images for training we used the CelebA dataset [18], which contains over 200,000 images of celebrities. Training of the network was performed in two stages. Firstly, the network described in Section 2.1 was trained to process 128 × 128 images. It minimized the objective function given by Equation 8, where $L_{light}$ was computed using a lighting network also trained on 128 × 128 inputs. In Equation 8, we used $\beta = 10^{-22}$ to make the lighting loss $L_{light}$ comparable to content and style losses. For the total variation loss, we chose $\gamma = 0.3$. Training the transformation network with Adam [12] for 10K iterations with a batch size of 16 took 2.5 hours on a Tesla M40 GPU (Theano [30] and Lasagne [6] implementation). Weights were initialized orthogonally [26]. The learning rate was decreased from 0.001 to 0.0001 over the course of the training following a manual learning rate schedule.
With regard to the specifics of style transfer, we used the following settings. Style losses and content loss were computed using VGG layers {relu3_1, relu4_1} and {relu4_2} respectively. For the style loss, we used a patch size of k = 1. During training, each input image was matched to a set of $N_{best}$ style images, where $N_{best}$ was equal to 16. The style weight α in the total objective function (Equation 8) was the most crucial parameter to tune. Starting from α = 0 and gradually increasing it to α = 20 yielded the best results in our experiments.
Having trained a model for 128 × 128 inputs and outputs, we added an extra branch for processing 256 × 256 images. The additional branch was optimized while keeping the rest of the network fixed. The training protocol for this network was identical to the one described above, except the style weight α was increased to 80 and we used the lighting network trained on 256 × 256 inputs. The transformation network takes 12 hours to train and has about 2M parameters, of which half are trained during the second stage.
Results: Figure 5b shows the final results of our face swapping method applied to a selection of images in Figure 5a. The raw outputs of the neural network are given in Figure 5c. We find that the neural network is able to introduce noticeable changes to the appearance of a face while keeping head pose, facial expression and lighting intact. Notably, it significantly alters the appearance of the nose, eyes, eyebrows, lips, and wrinkles in the faces, while keeping gaze direction and still producing a plausible image. However, coarser features such as the overall head shape are mostly unaltered by our approach, which in some cases diminishes the effect of a perceived change in identity. One can notice that when target and input identities have different skin colors, the resulting face has an average skin tone. This is partly due to the seamless cloning of the swapped image with the background, and to a certain extent due to the transformation network. The latter fuses the colors because its loss function is based on the VGG network, which is color sensitive.
To test how our results generalize to other identities, we trained the same transformation network using approximately 60 images of Taylor Swift. We find that results of similar quality can be achieved with the same hyperparameters (Figure 5b). Figure 6 shows the effect of the lighting loss in the total objective function. When no such loss is included, images generated with CageNet have flat lighting and lack shadows.
While the generated faces often clearly look like the target identity, it is in some cases difficult to recognize the person because features of the input identity remain in the output image. They could be completely eliminated by increasing the weight of the style loss. However, this comes at the cost of ignoring the input's facial expression as shown in Figure 7, which we do not consider to be a desirable behaviour since it changes the underlying emotional interpretation of the image. Indeed, the ability to transfer expressions distinguishes our approach from other methods operating on a single image input. To make the comparison clear, we implemented a simple face swapping method which performs the same steps as in Figure 2, except for the application of the transformation network. This step was replaced by selecting an image from the style set whose facial landmarks are closest to those from the input image. The results are shown in Figure 8. While the baseline method trivially produces sharp looking faces, it alters expressions and gaze direction, and the faces generally blend in worse with the rest of the image.
In the following, we explore a few failure cases of our approach. We noticed that our network works better for frontal views than for profile views. In Figure 9 we see that as we progress from the side view to the frontal view, the face becomes more recognizable as Nicolas Cage. This may be caused by an imbalance in the datasets. Both our training set (CelebA) and the set of style images included a lot more frontal views than profile views due to the prevalence of these images on the Internet. Figure 9 also illustrates the failure of the illumination transfer where the network amplifies the sidelights. The reason might be the prevalence of images with harsh illumination conditions in the training dataset of the lighting network. Figure 10 demonstrates other examples which are currently not handled well by our approach. In particular, occluding objects such as glasses are removed by the network and can lead to artefacts.
Speed and Memory: A feed-forward pass through the transformation network takes 40 ms for a 256 × 256 input image on a GTX Titan X GPU. For the results presented in this paper, we manually segmented images into skin and background regions. However, a simple network we trained for automatic segmentation [25] can produce reasonable masks in about 5 ms. Approximately the same amount of CPU time (i7-5500U) is needed for image alignment. While we used dlib [11] for facial keypoints detection, much faster methods exist which can run in less than 0.1 ms [24]. Seamless cloning using OpenCV takes 35 ms on average.
At test time, style images do not have to be supplied to the network, so the memory consumption is low.
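As a rough illustration of what these timings imply, the reported per-stage figures can be added up. The sketch assumes that the stages run strictly one after another, that alignment costs about 5 ms, and that keypoint detection is negligible; none of this is a measured end-to-end result.

```python
# Per-stage timings in milliseconds, taken from the figures reported above.
# The alignment value (about 5 ms) and the sequential execution are assumptions.
stage_ms = {
    "transformation network (GPU)": 40.0,
    "segmentation": 5.0,
    "alignment (CPU)": 5.0,
    "seamless cloning": 35.0,
}

total_ms = sum(stage_ms.values())
fps = 1000.0 / total_ms
print(f"{total_ms:.0f} ms per frame, about {fps:.1f} frames per second")
```

Under these assumptions the network and the seamless cloning step dominate the per-frame cost.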

Discussion and future work
By the nature of style transfer, it is not feasible to evaluate our results quantitatively based on the values of the loss function [7]. Therefore, our analysis was limited to subjective evaluation only. The departure of our approach from conventional practices in face swapping makes it difficult to perform a fair comparison to prior works. Methods that solely manipulate images [2,10] are capable of producing very crisp images, but they are not able to transfer facial poses and expressions accurately given a limited number of photographs from the target identity. More complex approaches, on the other hand, require many images from the person we want to replace [4,29].
Compared to previous style transfer results, our method achieves high levels of photorealism. However, the results can still be improved in multiple ways. Firstly, the quality of generated results depends on the collection of style images. Face replacement of a frontal view typically results in better quality compared to profile views. This is likely due to a greater number of frontal view portraits found on the Internet. Other sources of problems are uncommon facial expressions and harsh lighting conditions in the input image, which are not always transferred well to the face-swapped image. It may be possible to reduce these problems with larger and more carefully chosen photo collections. Some images also appear oversmoothed. This may be improved in future work by adding an adversarial loss, which has been shown to work well in combination with VGG-based losses [13,28].
Another potential improvement would be to modify the loss function so that the transformation network preserves occluding objects such as glasses. Similarly, we can try to penalize the network for changing the background of the input image. Here we used segmentation in a post-processing step to preserve the background. This could be automated by combining our network with a neural network trained for segmentation [17,25]. Further improvements may be achieved by enhancing the facial keypoint detection algorithm. In this work, we used dlib [11], which is accurate only up to a certain degree of head rotation. For extreme angles of view, the algorithm tries to approximate the location of invisible keypoints by fitting an average frontal face shape. Usually this results in inaccuracies for points along the jawline, which cause artifacts in the resulting face-swapped images.
Other small gains may be possible when using the VGG-Face [21] network for the content and style loss as suggested by Li et al. [16]. Unlike the VGG network used here, which was trained to classify images from various categories [5], VGG-Face was trained to recognize about 3K unique individuals. Therefore, the feature space of VGG-Face would likely be more suitable for our problem.

Conclusion
In this paper we provided a proof of concept for a fully automatic, nearly real-time face swap with deep neural networks. We introduced a new objective and showed that style transfer using neural networks can generate realistic images of human faces. The proposed method deals with a specific type of face replacement. Here, the main difficulty was to change the identity without altering the original pose, facial expression and lighting. To the best of our knowledge, this particular problem has not been addressed previously.
While there are certainly still some issues to overcome, we feel we made significant progress on the challenging problem of neural-network based face swapping. There are many advantages to using feed-forward neural networks, e.g., ease of implementation, ease of adding new identities, ability to control the strength of the effect, or the potential to achieve much more natural looking results.