We recently watched Van Gogh’s well-known story in Loving Vincent. To produce that movie, 125 artists came together and painted 65,000 oil paintings on canvas in Van Gogh’s style, entirely by hand. Today, AI can be as talented as those 125 artists. A deep learning technique called artistic style transfer empowers us to produce that kind of painting, too.
Artistic style transfer (aka neural style transfer) enables us to transform ordinary images into masterpieces. It is actually a combination of several deep learning techniques: convolutional neural networks, transfer learning and auto-encoders. Even though implementations can be found everywhere, the theoretical background is genuinely hard, and that is why those implementations are difficult to understand. In this post, we will cover the background of style transfer and apply it from scratch.
🙋♂️ You may consider enrolling in my top-rated machine learning course on Udemy
First of all, this technique is not a typical neural network operation. Typical neural networks tune weights based on input and output pairs. Here, we will use a pre-trained network and never update its weights. We will update the input instead.
The original study uses the VGG model as the pre-trained network. We’ll use the same network in this post, but this is not a prerequisite; you can use any other pre-trained neural network. Basically, the VGG network looks like the following illustration.
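If you would like to inspect the exact layer structure yourself, printing the model summary is a quick way to do it. This is just a minimal sketch; it assumes Keras can download the ImageNet weights on first use.

from keras.applications import vgg19

#print the layer-by-layer structure of VGG19 (blocks of convolutions followed by pooling)
vgg19.VGG19(weights='imagenet').summary()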
Images
In this study, we would like to transfer the style of one image to another. The image we would like to transform is called the content image, whereas the image whose style we would like to transfer is called the style image. The style image’s brush strokes will then be reflected onto the content image, and this new image is called the generated image.
Content and style images already exist. You might remember that we must initialize weights randomly in neural networks. Here, the generated image will be initialized randomly instead of the weights. Remember that this application is not a typical neural network. Let’s construct the code for reading the content and style images and creating a random image as the generated image.
import numpy as np
from keras.preprocessing.image import load_img, img_to_array
from keras.applications.vgg19 import preprocess_input

height = 224; width = 224 #original input size of vgg

def preprocess_image(image_path):
    img = load_img(image_path, target_size=(height, width))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0) #add a dummy batch dimension
    img = preprocess_input(img)
    return img

content_image = preprocess_image("content.jpeg")
style_image = preprocess_image("style.jpg")

#initialize the generated image with random pixels
random_pixels = np.random.randint(256, size=(1, height, width, 3)).astype('float64')
generated_image = preprocess_input(random_pixels)
Normally, Python stores an image in a 3D numpy array (one dimension for the RGB channels). However, the VGG network is designed to work with 4D inputs. If you feed a 3D numpy array to its input, you’ll face the exception “layer block1_conv1: expected ndim=4, found ndim=3“. That’s why we added the expand dimensions command in the preprocessing step; it adds a dummy batch dimension to handle this. Additionally, the VGG network expects 224x224x3 inputs, which is why the content, style and generated images are sized 224×224.
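To illustrate, here is what that dummy dimension does to the shape (a tiny sketch with a placeholder array):

import numpy as np

img = np.zeros((224, 224, 3))       #a single RGB image
batch = np.expand_dims(img, axis=0) #add the dummy batch dimension
print(img.shape, batch.shape)       #(224, 224, 3) (1, 224, 224, 3)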
Network
Now, we are going to feed those images to the VGG network as inputs. But we need the outputs of some intermediate layers instead of the output of the network. Remember that auto-encoders can be used to extract a representation of data. In effect, we use VGG to extract representations of those images.
Luckily, Keras offers the winning CNN models as out-of-the-box functions.
from keras import backend as K
from keras.applications import vgg19

content_model = vgg19.VGG19(input_tensor=K.variable(content_image), weights='imagenet')
style_model = vgg19.VGG19(input_tensor=K.variable(style_image), weights='imagenet')
generated_model = vgg19.VGG19(input_tensor=K.variable(generated_image), weights='imagenet')
Loss
We will track the loss value twice: once for content and once for style. In typical neural networks, the loss value is calculated by comparing the actual output and the model output (prediction). Here, we will compare the compressed representations of the auto-encoded images. Please remember that these compressed representations are actually the outputs of some middle layers. Let’s store each layer’s output by layer name once the networks are built.
content_outputs = dict([(layer.name, layer.output) for layer in content_model.layers])
style_outputs = dict([(layer.name, layer.output) for layer in style_model.layers])
generated_outputs = dict([(layer.name, layer.output) for layer in generated_model.layers])
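If you are unsure which layer names are available, you can simply list the dictionary keys (just an inspection snippet, assuming the models above were built):

#list the layer names we can pick representations from
print(list(content_outputs.keys()))
#e.g. ['input_1', 'block1_conv1', 'block1_conv2', 'block1_pool', 'block2_conv1', ...]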
Content loss
We’ll feed both the randomly generated image and the content image to the same VGG network. The original work uses the 5th block’s 2nd convolution layer (block5_conv2) to calculate the content loss. This is not a must; you might use a different layer to compress the images in your own work.
We have already fed both the content and generated images to the VGG network in the previous step. We can calculate the content loss as the squared difference between the outputs of that same layer for the content and generated images.
def content_loss(content, generated):
    return K.sum(K.square(content - generated))

content_features = content_outputs['block5_conv2']
generated_features = generated_outputs['block5_conv2']
contentloss = content_loss(content_features, generated_features)
Style loss
This loss type is a little harder to calculate. We will compare the outputs of the first convolution layer of each of the 5 blocks.
Here, we are expected to find the distances between Gram matrices. A Gram matrix can be calculated by multiplying a matrix by its transpose.
gram = K.dot(features, K.transpose(features))
We need to work with 2D matrices to calculate a Gram matrix. The batch flatten command transforms an n-dimensional tensor into a 2-dimensional one. Notice the structure of the VGG network: for instance, the size of the 3rd block’s convolution layer is (56×56)×256, where 256 refers to the number of filters in that layer. If the shape of the layer is first transformed to 256×56×56, the 56×56 matrices are laid out side by side. The permute dimensions function helps us organize the tensor this way before flattening.
def gram_matrix(x):
    #put the number of filters into the 1st dimension first
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    gram = K.dot(features, K.transpose(features))
    return gram
BTW, a visual demonstration of a Gram matrix is illustrated below. You might think of nc as the number of filters.
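As a quick sanity check of the shapes, the following sketch builds a random tensor with the dimensions of that 3rd block layer (an assumption for illustration) and confirms that the Gram matrix ends up with one entry per filter pair:

#hypothetical 56x56 feature map with 256 filters
x = K.random_uniform_variable(shape=(56, 56, 256), low=0, high=1)
print(K.int_shape(gram_matrix(x))) #(256, 256) - one correlation value per filter pair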
Now, we can calculate the style loss.
def style_loss(style, generated):
    style_gram = gram_matrix(style)
    generated_gram = gram_matrix(generated)
    channels = 3
    size = height * width
    return K.sum(K.square(style_gram - generated_gram)) / (4. * (pow(channels, 2)) * (pow(size, 2)))

#the first convolution layer of each block. you can check the names by running content_model.summary()
feature_layers = ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1']

styleloss = K.variable(0.)
for layer_name in feature_layers:
    style_features = style_outputs[layer_name]
    generated_features = generated_outputs[layer_name]
    styleloss = styleloss + style_loss(style_features[0], generated_features[0])
Total loss
We have calculated both the content and style losses. We can now calculate the total loss.
alpha = 0.025; beta = 0.2
total_loss = alpha * contentloss + beta * styleloss
Gradient Descent
Remember that in back propagation the total loss is propagated backwards to all weights. The derivative of the total error with respect to each weight is calculated in the neural network learning procedure. This calculation is also called the gradient calculation. In style transfer, we need the gradients with respect to the input instead of the weights.
#gradients = K.gradients(total_loss, generated_model.trainable_weights)
gradients = K.gradients(total_loss, generated_model.input)
print(gradients)
In this way, a (1, 224, 224, 3) shaped tensor will be calculated as the gradients, just like our images. Now, we will update the generated image instead of the weights.
learning_rate = np.array([0.1])
#note: gradients[0] is still a symbolic tensor; it has to be evaluated to numpy values before this update (see the loop sketch below)
generated_image = generated_image - learning_rate * gradients[0]
Actually, we have just applied basic gradient descent to the randomly generated image. You might apply the Adam optimization algorithm to create art faster. BTW, the original work uses the L-BFGS optimization algorithm to update the image. Sincerely, this is the first time I had heard of that optimization algorithm. The researchers report that L-BFGS outperforms Adam and the others, but plain gradient descent still works. Finally, we need to wrap all of these operations in a for loop (epochs) for real learning to happen; a sketch of such a loop is given below.
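Here is a minimal sketch of that loop. It assumes the symbolic graph built above with Keras on the TensorFlow 1.x backend, and it updates the K.variable feeding generated_model in place with plain gradient descent; this is an illustration, not the exact code of the original work.

#the variable that feeds generated_model holds the image we are painting
generated_tensor = generated_model.input

#re-evaluates the loss and gradients for the current generated image (no placeholders needed)
evaluate = K.function([], [total_loss, gradients[0]])

epochs = 250
learning_rate = 0.1

for epoch in range(epochs):
    loss_value, grad_values = evaluate([])
    updated = K.get_value(generated_tensor) - learning_rate * grad_values
    K.set_value(generated_tensor, updated)
    print("epoch", epoch, "loss", loss_value)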
Testing
I have applied the style of Van Gogh’s Starry Night to Galatasaray University. The result looks very impressive after 250 epochs.
This is the style transfer animation over 250 epochs. The transformation of the image is really amazing.
Besides, we can apply style transfer to videos. Applying style transfer to a drone video looks amazing.
A video actually shows 24 frames per second. BTW, this is not a real-time solution. I first extracted the frames of the videos, then applied style transfer to each frame, running 100 epochs per frame. This requires a powerful GPU; even though I used one, it took me more than a day to transform these 1-minute-long videos. If you can handle 24 frames per second, then you can apply style transfer in real time. Doesn’t this look like a scene from Loving Vincent? This post covers how to apply style transfer to videos.
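For the frame extraction step, something like the following OpenCV sketch could be used; the file names are hypothetical, and the stylization itself is assumed to be the per-image pipeline described above.

import cv2

video = cv2.VideoCapture("drone.mp4") #hypothetical input video
frame_index = 0
while True:
    success, frame = video.read()
    if not success:
        break
    cv2.imwrite("frames/frame_%05d.png" % frame_index, frame) #dump raw frames to disk
    frame_index += 1
video.release()

#afterwards, run the style transfer loop on every extracted frame
#and stitch the stylized frames back into a video (e.g. with cv2.VideoWriter)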
I’ve transformed lots of videos already. You can find the artistic style transfer videos in this playlist.
Copyright and intellectual property
Recently, Lex Fridman from MIT raised an argument covering the copyright of these kinds of AI-created artworks. Potential owners could be the owner of the base image / video, the designer of the architecture or algorithm (the VGG architecture was built by the Oxford VGG Group, and the algorithm creators are the authors of the original style transfer paper), the person running the code (that would be me 🙂 ), or the AI system itself. The following video focuses on this subject.
To sum up
So, we have covered artistic style transfer in this post. It is a combination of several high-level deep learning techniques such as convolutional neural networks, transfer learning and auto-encoders. I strongly recommend you understand these related topics first before applying style transfer.
Even though the background of this approach is neural networks, we did not apply the standard neural network rules. We updated inputs instead of weights, and we consumed the outputs of some intermediate layers instead of the network output.
As usual, the source code of this post is pushed to GitHub.
Enjoy!
Support this blog if you like it!
Nice demonstration and explanation. Thanks!
I think it is remarkable that a network pretrained on entirely different images can be exploited for this, presumably because it has learned generally useful features. The trick of taking content loss near the classifier output layer and style loss near the image input layer is really clever.
I suppose the video could be improved by spending more epochs on the first frame, initializing the next frame with the last generated one and adding (after the first frame) a loss penalizing pixelwise difference (perhaps after slight downsampling) from the previous generated frame.
To speed things up, it could be possible to distill the style transfer into a feed-forward (convolutional) network, maybe one that combines paths of different depth and compression, to convey both style and content.
I applied 10 epochs for video generation. This takes days even with this configuration. But you are right, increasing the number of epochs would make a masterpiece.