Intuitive Guide to Neural Style Transfer

An intuitive guide to exploring design choices and technicalities of neural style transfer networks.



Introduction


This tutorial will cover the following parts in the coming sections:
  • Why neural style transfer and the high level architecture
  • Loading VGG-16 weights as the pretrained network weights
  • Defining inputs, outputs, losses and the optimiser for the neural style transfer network
  • Defining an input pipeline to feed data to the network
  • Training the network and saving the results
  • Conclusion

Aim of this article
Neural style transfer works with three images:
  • A content image (c) — the image we want to transfer a style to
  • A style image (s) — the image we want to transfer the style from
  • An input (generated) image (g) — the image that contains the final result (the only trainable variable)
The architecture of the model, as well as how the loss is computed, is shown below. You do not need a profound understanding of what is going on in the image, as each component is covered in detail in the coming sections. The idea is to give a high-level understanding of the workflow taking place during style transfer.

Downloading and loading the pretrained VGG-16
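A minimal sketch of this step, assuming the pretrained weights have been downloaded as a NumPy .npz archive (for example the widely used vgg16_weights.npz release, where the arrays are named like conv1_1_W and conv1_1_b); the file name and the decision to keep only the convolutional layers are assumptions of this sketch:

import numpy as np

# Path to the downloaded weight file; adjust to wherever you saved it.
VGG_WEIGHTS_PATH = 'vgg16_weights.npz'

def load_vgg_weights(path=VGG_WEIGHTS_PATH):
    """Load the pretrained VGG-16 kernels and biases into a plain dict.

    Only the convolutional layers are kept; the fully connected layers are
    not needed for style transfer and would cost a lot of memory.
    """
    data = np.load(path)
    return {key: data[key] for key in data.files if key.startswith('conv')}

vgg_npy_weights = load_vgg_weights()
print('Loaded %d weight arrays' % len(vgg_npy_weights))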

Note: You are welcome to try more layers. But beware of the memory limitations of your CPU and GPU.
Defining functions to build the style transfer network

Creating TensorFlow variables

  • content image (tf.placeholder)
  • style image (tf.placeholder)
  • generated image (tf.Variable and trainable=True)
  • pretrained weights and biases (tf.Variable and trainable=False)
Make sure you leave the generated image trainable while keeping pretrained weights and biases frozen. Below we show two functions to define inputs and neural network weights.
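A minimal sketch of what such functions could look like in TensorFlow 1.x; the names define_inputs and define_tf_weights, and the random initialisation of the generated image, are choices of this sketch:

import tensorflow as tf

def define_inputs(input_shape):
    """Define the content/style placeholders and the trainable generated image."""
    content = tf.placeholder(tf.float32, shape=input_shape, name='content')
    style = tf.placeholder(tf.float32, shape=input_shape, name='style')
    # The generated image is the only trainable variable in the whole graph.
    generated = tf.Variable(
        tf.random_uniform(input_shape, 0.0, 255.0),
        trainable=True, name='generated')
    return content, style, generated

def define_tf_weights(np_weights):
    """Wrap the pretrained NumPy arrays in frozen (non-trainable) variables."""
    return {name: tf.Variable(value, trainable=False, name=name)
            for name, value in np_weights.items()}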


Computing the VGG net output
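Below is a sketch of running an input through the truncated VGG-16 and collecting the activation of every layer. The layer list, the weight key naming and the choice of max pooling are assumptions of this sketch (the original paper reports slightly nicer results with average pooling), and VGG mean subtraction is omitted for brevity:

import tensorflow as tf

# Layers of the truncated VGG-16 used in this sketch, in order.
VGG_LAYERS = ['conv1_1', 'conv1_2', 'pool1',
              'conv2_1', 'conv2_2', 'pool2',
              'conv3_1', 'conv3_2', 'conv3_3', 'pool3']

def vgg_outputs(inputs, tf_weights):
    """Run `inputs` through the truncated VGG-16 and collect every activation."""
    outputs = {}
    h = inputs
    for layer in VGG_LAYERS:
        if layer.startswith('conv'):
            w = tf_weights[layer + '_W']  # key naming follows the .npz file above
            b = tf_weights[layer + '_b']
            h = tf.nn.relu(tf.nn.conv2d(h, w, strides=[1, 1, 1, 1],
                                        padding='SAME') + b)
        else:
            # Max pooling; try tf.nn.avg_pool if you want average pooling instead.
            h = tf.nn.max_pool(h, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                               padding='SAME')
        outputs[layer] = h
    return outputs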


Loss functions
Let A^l_{ij}(I) be the activation of the l th layer, i th feature map and j th position obtained using the image I. Then the content loss is defined as,
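L_{content}(c, g, l) = 1/2 Σ_{i,j} (A^l_{ij}(c) - A^l_{ij}(g))^2

(This is the formulation from Gatys et al.; an implementation may average over positions instead of summing, as in the sketch below.) A minimal TensorFlow sketch, assuming the activation dictionaries returned by the vgg_outputs sketch above; the helper name content_loss is an assumption:

import tensorflow as tf

def content_loss(content_outputs, generated_outputs, layer):
    """Squared difference between content and generated activations at `layer`."""
    return 0.5 * tf.reduce_mean(
        tf.square(content_outputs[layer] - generated_outputs[layer]))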
Essentially, L_{content} captures the squared error between the activations produced by the generated image and the content image. But why does minimising the difference between the activations of higher layers ensure that the content of the content image is preserved?

Intuition behind content loss

Style loss function
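One common way of writing the style loss (adapted from Gatys et al.) is:

L_{style}(s, g) = Σ_l w^l (1/M^l) Σ_{i,j} (G^l_{ij}(s) - G^l_{ij}(g))^2

where G^l(I) denotes the style (Gram) matrix of the l th layer computed on image I.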
w^l (chosen uniform in this tutorial) is a weight given to each layer during loss computation and M^l is a normalisation factor that depends on the size of the l th layer. If you would like to see the exact value, please refer to this paper. However, this implementation does not use M^l explicitly, as it is absorbed by another parameter when defining the final loss.

Intuition behind the style loss

Below you can see an illustration of how the style matrix is computed. The style matrix is essentially a Gram matrix, where the (i,j) th element is obtained by element-wise multiplying the i th and j th feature maps and summing over both width and height. In the figure, the red cross denotes element-wise multiplication and the red plus sign denotes summing over both the width and height of the feature maps.
You can compute the style loss as follows.
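A minimal TensorFlow sketch, assuming the activation dictionaries from the vgg_outputs sketch above; the helper names gram_matrix and style_loss, and the use of reduce_mean in place of an explicit 1/M^l factor, are choices of this sketch:

import tensorflow as tf

def gram_matrix(activations):
    """Style (Gram) matrix of a [batch, height, width, channels] tensor."""
    channels = activations.get_shape().as_list()[-1]
    # Each row of `flat` is one spatial position, each column a feature map.
    flat = tf.reshape(activations, [-1, channels])
    return tf.matmul(flat, flat, transpose_a=True)  # shape: [channels, channels]

def style_loss(style_outputs, generated_outputs, layers, layer_weights):
    """Weighted sum of Gram-matrix differences over the chosen layers."""
    loss = 0.0
    for layer, w_l in zip(layers, layer_weights):
        gram_s = gram_matrix(style_outputs[layer])
        gram_g = gram_matrix(generated_outputs[layer])
        loss += w_l * tf.reduce_mean(tf.square(gram_s - gram_g))
    return loss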


Why is it that style is captured in the Gram matrix?
Note: Personally, I don’t think the above question has been answered satisfactorily. For example [4] explains the similarities between the style loss and domain adaptation. But this relationship does not answer the above question.
So let me take a shot at explaining this a bit more intuitively. Say you have the following feature maps. For simplicity, assume only three feature maps, two of which are completely inactive. In the first feature map set, the active map looks like a dog; in the second set, the active map looks like the same dog upside down. If you manually compute the content and style losses between the two sets, the style loss comes out as zero while the content loss is large. In other words, no style information is lost between the two feature map sets, even though the content is quite different.
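To make this concrete, here is a small NumPy check of the toy example above (the 4x4 "dog" pattern is made up purely for illustration):

import numpy as np

# Three feature maps; only the first one is active.
dog = np.array([[0., 1., 1., 0.],
                [1., 1., 1., 1.],
                [0., 1., 1., 0.],
                [0., 1., 0., 0.]])
inactive = np.zeros_like(dog)

set_a = np.stack([dog, inactive, inactive])             # dog
set_b = np.stack([np.flipud(dog), inactive, inactive])  # dog upside down

def gram(feature_maps):
    flat = feature_maps.reshape(feature_maps.shape[0], -1)  # [maps, positions]
    return flat @ flat.T

content_difference = np.sum((set_a - set_b) ** 2)
style_difference = np.sum((gram(set_a) - gram(set_b)) ** 2)
print('content difference:', content_difference)  # > 0: the content has changed
print('style difference:', style_difference)      # 0.0: the Gram matrices match

Flipping the feature map changes where things are (the content) but not which features co-occur and how strongly (the style), which is exactly what the Gram matrix records.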

Final loss
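As in the original paper, the total loss is a weighted sum of the content and style terms:

L_{total}(c, s, g) = α L_{content}(c, g) + β L_{style}(s, g)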

where α and β are user-defined hyperparameters. Here β has absorbed the M^l normalisation factor defined earlier. By controlling α and β you can control the amount of content and style injected into the generated image. You can also see a nice visualisation of the effects of different α and β values in the paper.

Defining the optimiser
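A minimal sketch, reusing the loss and the generated-image variable from the sketches above. Adam and the learning rate of 2.0 are assumptions of this sketch (the original paper uses L-BFGS):

import tensorflow as tf

def define_optimizer(total_loss, generated_image, learning_rate=2.0):
    """Minimise the total loss with respect to the generated image only."""
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    # var_list restricts the update to the generated image; the frozen VGG
    # weights are never touched.
    return optimizer.minimize(total_loss, var_list=[generated_image])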



Defining the input pipeline
You define two input pipelines: one for content and one for style. The content pipeline looks for jpg images whose file names start with content_, while the style pipeline looks for images starting with style_.
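A sketch of one way to set this up with tf.data; the data/ directory and the 224 x 224 image size are assumptions of this sketch:

import tensorflow as tf

def input_pipeline(file_pattern, image_size=(224, 224)):
    """Yield decoded, resized jpg images whose file names match `file_pattern`."""
    def _load(path):
        image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
        return tf.image.resize_images(image, image_size)

    return tf.data.Dataset.list_files(file_pattern).map(_load).repeat()

content_dataset = input_pipeline('data/content_*.jpg')
style_dataset = input_pipeline('data/style_*.jpg')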

Defining the computational graph
The graph ties together everything defined so far in four steps (a sketch combining them follows the list):
  • Define iterators that provide inputs
  • Define inputs and CNN variables
  • Define the content, style and the total loss
  • Define the optimisation operation
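Putting these steps together, a minimal sketch might look like the following. Every helper (input_pipeline, define_inputs, define_tf_weights, load_vgg_weights, vgg_outputs, content_loss, style_loss and define_optimizer) refers to the sketches above; the chosen layers and the values of α and β are assumptions:

input_shape = [1, 224, 224, 3]

# Step 1: iterators that provide inputs.
content_iter = input_pipeline('data/content_*.jpg').batch(1).make_one_shot_iterator()
style_iter = input_pipeline('data/style_*.jpg').batch(1).make_one_shot_iterator()
content_next, style_next = content_iter.get_next(), style_iter.get_next()

# Step 2: inputs and CNN variables.
content, style, generated = define_inputs(input_shape)
tf_weights = define_tf_weights(load_vgg_weights())

# Step 3: content, style and total loss.
c_outputs = vgg_outputs(content, tf_weights)
s_outputs = vgg_outputs(style, tf_weights)
g_outputs = vgg_outputs(generated, tf_weights)

c_loss = content_loss(c_outputs, g_outputs, layer='conv3_3')
s_loss = style_loss(s_outputs, g_outputs,
                    layers=['conv1_2', 'conv2_2', 'conv3_3'],
                    layer_weights=[1.0 / 3] * 3)
alpha, beta = 0.001, 1.0  # relative weights; tune to taste
total_loss = alpha * c_loss + beta * s_loss

# Step 4: the optimisation operation.
train_op = define_optimizer(total_loss, generated)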


Running style transfer
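A minimal training loop, reusing the names from the graph sketch above; the number of steps, the saving interval and the use of PIL for writing jpg files are assumptions of this sketch:

import numpy as np
import tensorflow as tf
from PIL import Image  # used here only to write the generated image to disk

n_steps = 1000     # more steps usually give cleaner results
save_every = 100

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Fetch one content image and one style image from the pipelines.
    content_img, style_img = sess.run([content_next, style_next])
    feed = {content: content_img, style: style_img}

    for step in range(1, n_steps + 1):
        _, loss_val = sess.run([train_op, total_loss], feed_dict=feed)
        if step % save_every == 0:
            print('step %d, total loss %.2f' % (step, loss_val))
            result = np.clip(sess.run(generated)[0], 0.0, 255.0).astype(np.uint8)
            Image.fromarray(result).save('generated_%d.jpg' % step)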


When you run the above code, you should get some neat art saved to disk, like the example below.

Conclusion

Code for this tutorial is available here.
