Last active March 5, 2021 13:10
instructions for generating a style transfer animation from a video

Instructions for making a Neural-Style movie

The following instructions are for creating your own animations using the style transfer technique described by Gatys, Ecker, and Bethge, and implemented by Justin Johnson. To see an example of such an animation, see this video of Alice in Wonderland re-styled by 17 paintings.

Setting up the environment

The easiest way to set up the environment is to load Samim's pre-built Terminal snap or to use another cloud service such as Amazon EC2. Unfortunately, the g2.2xlarge GPU instances cost $0.99 per hour and, depending on the parameters selected, it may take 10-15 minutes to produce a 512px-wide image, so generating 1 second of video at 12fps (12 frames, or 2-3 hours of GPU time) can cost $2-3.
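
The cost estimate above is simple arithmetic, and a rough sketch of it makes it easy to plug in your own numbers. The function name is made up for illustration; the figures are the ones quoted above (12fps, 10-15 minutes per frame, $0.99 per hour).

```python
def cost_per_second_of_video(fps, minutes_per_frame, dollars_per_hour):
    """Estimate the GPU rental cost of styling one second of video."""
    # Each second of video needs `fps` frames, rendered one after another.
    gpu_hours = fps * minutes_per_frame / 60.0
    return gpu_hours * dollars_per_hour


low = cost_per_second_of_video(12, 10, 0.99)
high = cost_per_second_of_video(12, 15, 0.99)
print(f"${low:.2f} to ${high:.2f} per second of video")
```

Lowering the frame rate or the image size reduces the cost roughly in proportion, which is one reason to keep both modest.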

If you do load the Terminal snap, make sure to run git pull to update to the latest version of neural-style.

Alternatively, you can set it up on your own computer, though the process is a bit tedious (especially installing CUDA), and a laptop GPU generally lacks the VRAM to load the network without running out of memory. To set it up on your own computer, it is best to follow the concise instructions found here.

Once you have neural-style set up, download the Python script attached to this gist and install Pillow by running:

pip install Pillow

The script is used to blend each input image with the previous generated image. This helps to tease out the same features in successive frames and improve the consistency/smoothness of the frames. There is likely a much more effective way to do this, but blending is a cheap and easy solution to noisy features.
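
To make the blend concrete: Pillow documents Image.blend as a per-pixel linear interpolation, out = image1 * (1.0 - alpha) + image2 * alpha. The pure-Python sketch below applies that formula to a single RGB pixel. The function name and the pixel values are illustrative only, and the rounding here is a simplification, so results may differ from Pillow's by one level.

```python
def blend_pixel(previous, current, alpha):
    """Linearly blend two RGB pixels; alpha is the weight of `current`."""
    return tuple(
        round(p * (1.0 - alpha) + c * alpha)
        for p, c in zip(previous, current)
    )


styled_previous = (200, 40, 40)   # pixel from the previous generated frame
next_input = (100, 100, 100)      # same pixel in the next input frame

# alpha = 0.95 keeps 95% of the next input frame and 5% of the styled frame.
blended = blend_pixel(styled_previous, next_input, 0.95)
print(blended)
```

With an alpha close to 1 the blended frame is still almost entirely the new input, with only a faint trace of the previous styled output.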

Next, install ffmpeg. The easiest way is to run brew install ffmpeg if you have Homebrew installed.

Extracting frames and audio

Choose a source video file, e.g. myMovie.mp4, and use ffmpeg to extract the frames and raw audio from it. You can use any framerate you like, but I found it better to keep the framerate low, e.g. 12fps, to reduce noisiness in the output frames, and it also reduces the amount of computation needed.

ffmpeg -i myMovie.mp4 -r 12 -f image2 image-%05d.jpg
ffmpeg -i myMovie.mp4 rawaudio.wav

Choosing style images

Now the fun part: choosing style images to transform the content. In principle, any style image works, but the best style images to use are those which have strong textural components and patterns. A good place to start for ideas is Kyle McDonald's style studies which contains several dozen examples of style images picked from western art history.

Additionally, the neural-style repository itself contains examples in its README, as do the other implementations of the algorithm, including those by Kai Sheng Tai and Anders Larsen. Lastly, many people have been posting images on Twitter with the hashtag #stylenet, and the Twitter account DeepForger has many examples as well.

Generating styled images

Once you've selected a style image, e.g. myStyle.jpg, you can generate an initial image from your first frame (image-00001.jpg) using neural_style.lua.

th neural_style.lua -style_image myStyle.jpg -content_image image-00001.jpg -output_image generated-00001.jpg

This uses the default parameters, which work reasonably well, but you may want to experiment with the other parameters, which let you emphasize the content or the style more, and with the -style_scale option, which controls the scale at which the network samples from the style image.

See the README at neural-style for more info about parameter selection.
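
One way to explore parameters is to render the first frame a few times with different settings before committing to a whole video. The sketch below only builds the command lines for a small grid; it assumes the -style_weight and -style_scale options described in the neural-style README, and the function name, the test-* output names, and the values tried are arbitrary choices for illustration.

```python
from itertools import product


def sweep_commands(style_image, content_image, style_weights, style_scales):
    """Build one neural_style.lua command per (style_weight, style_scale) pair."""
    commands = []
    for weight, scale in product(style_weights, style_scales):
        # Encode the settings in the file name so results are easy to compare.
        output = f"test-sw{weight:g}-ss{scale:g}.jpg"
        commands.append([
            "th", "neural_style.lua",
            "-style_image", style_image,
            "-content_image", content_image,
            "-style_weight", f"{weight:g}",
            "-style_scale", f"{scale:g}",
            "-output_image", output,
        ])
    return commands


for command in sweep_commands("myStyle.jpg", "image-00001.jpg", [100, 1000], [0.5, 1.0]):
    print(" ".join(command))
```

Each printed line can be pasted into a terminal from the neural-style directory, or passed to subprocess.run to run the whole grid unattended.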

Once you have your initial output frame, rather than generating the second frame directly from the second input frame, I found it was useful to partially blend the previous generated image into the next input frame, perhaps at no more than 5%. This helps to tease out the same high-level features in the second frame as were discovered in the first, and makes the video smoother and more consistent in high-level features (low-level features are still fairly noisy).

So blend the output into the next input, and then generate the next output from the resulting blended image. This example uses an alpha value of 0.95 for the second input frame (a 5% blend of the first generated frame).

python --input_1 generated-00001.jpg --input_2 image-00002.jpg --output blended-00002.jpg --alpha 0.95
th neural_style.lua -style_image myStyle.jpg -content_image blended-00002.jpg -output_image generated-00002.jpg

Repeat this process for as many frames as you have. The easiest way is to write a bash script to automate it.
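
As one possible way to automate the loop, here is a minimal sketch in Python (chosen to match the blending script) rather than bash. It assumes it is run from the neural-style directory, that the extracted frames are named image-00001.jpg, image-00002.jpg and so on in that directory, and that blend_script is the path where you saved the script from this gist. The function names are made up, and the optional seed is passed to neural-style's -seed option.

```python
import subprocess
from pathlib import Path


def frame_commands(index, style_image, blend_script, alpha=0.95, seed=None):
    """Return the commands needed to produce generated-<index>.jpg."""
    frame = f"image-{index:05d}.jpg"
    output = f"generated-{index:05d}.jpg"
    commands = []
    if index == 1:
        # The first frame has no predecessor, so it is styled directly.
        content = frame
    else:
        # Blend the previous styled frame into the next input frame first.
        previous = f"generated-{index - 1:05d}.jpg"
        content = f"blended-{index:05d}.jpg"
        commands.append([
            "python", blend_script,
            "--input_1", previous, "--input_2", frame,
            "--output", content, "--alpha", str(alpha),
        ])
    style = [
        "th", "neural_style.lua",
        "-style_image", style_image,
        "-content_image", content,
        "-output_image", output,
    ]
    if seed is not None:
        style += ["-seed", str(seed)]
    commands.append(style)
    return commands


def style_all_frames(style_image, blend_script, alpha=0.95, seed=None):
    """Style every extracted frame in order; each frame depends on the last."""
    num_frames = len(list(Path(".").glob("image-*.jpg")))
    for index in range(1, num_frames + 1):
        for command in frame_commands(index, style_image, blend_script, alpha, seed):
            # check=True stops the run as soon as one step fails.
            subprocess.run(command, check=True)
```

Calling style_all_frames("myStyle.jpg", "path/to/blend_script.py") would then process the frames one by one. The frames cannot be rendered in parallel with this approach, because each blended input needs the previous generated frame.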

Update: neural-style now supports setting a random number seed in Torch which should improve frame-to-frame consistency, potentially reducing or eliminating the need for the blending step.

Putting it all together

Now that you have all the frames, you can use ffmpeg to create the final movie from the generated frames and the raw audio initially extracted from the source video. The output file name at the end of the command is up to you; output.mp4 is used here as an example.

ffmpeg -framerate 12 -i generated-%05d.jpg -i rawaudio.wav -c:v libx264 -pix_fmt yuv420p output.mp4
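
Before assembling the movie, it can be worth checking that no generated frame is missing, since a gap in the numbered sequence may cut the video short or leave it out of sync with the audio. The sketch below assumes the generated-00001.jpg naming used earlier; the function name is made up for illustration.

```python
import re
from pathlib import Path

FRAME_RE = re.compile(r"generated-(\d{5})\.jpg$")


def missing_frames(names):
    """Return the frame numbers absent from 1..max among the given file names."""
    found = set()
    for name in names:
        match = FRAME_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [i for i in range(1, max(found) + 1) if i not in found]


# Check the current directory; an empty list means the sequence is complete.
gaps = missing_frames(p.name for p in Path(".").glob("generated-*.jpg"))
print("missing frames:", gaps)
```

Any frame number reported here can be regenerated with the blend and neural_style.lua steps before running ffmpeg.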

And voila, style-transfer video.

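import argparse

from PIL import Image

# arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input_1', type=str, required=True)  # previous generated frame
parser.add_argument('--input_2', type=str, required=True)  # next input frame
parser.add_argument('--output', type=str, required=True)  # where to save the blend
parser.add_argument('--alpha', type=float, default=0.5)  # weight of input_2
args = parser.parse_args()
# load previous processed frame (converted to RGB so both images share a mode)
im1 = Image.open(args.input_1).convert('RGB')
# load next frame
im2 = Image.open(args.input_2).convert('RGB')
# resize processed frame to match next input frame
im1 = im1.resize(im2.size)
# blend: result = im1 * (1 - alpha) + im2 * alpha
im3 = Image.blend(im1, im2, args.alpha)
# save the blended frame
im3.save(args.output, "JPEG")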

A commit to neural-style by hughperkins introduces a -seed option, which does an amazing job of keeping the image sequence more or less consistent. It seems to work well enough to drop the frame blending step.

@raptorecki yeah i saw that. haven't had a chance to try it yet but have updated instructions. @microcosm, thanks for the catch, fixed it!

Can you post the script to automate the process for x amount of frames, please?

svdka commented Jun 26, 2016

Hi Yoloboyz, like you I had some trouble making a working script. In the end I hacked this script to automate the process; just edit the Python file to your needs:

Dear genekogan:

How can I do real-time processing?

I use

 ./  ./ball.mp4  ./example/seated-nude.jpg

ball.mp4 is an 18-second 720p video at 25 fps, and I render the styled output at 640x480.
It took 5 hours to complete.

Machine configuration:
OS............: ubuntu14.04
Memory........: 32G
CPU...........: Intel Core i7-6700K 4.00GHz*8
DisplayCard...: NVIDIA GeForce GTX 1080
Harddisk......: 256G ssd

gtx1080@suker:~$ nvidia-smi
Tue Aug 23 21:17:24 2016
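+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:01:00.0      On |                  N/A |
| 34%   35C    P8    10W / 200W |   1306MiB /  8112MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1169    G   /usr/bin/X                                    1014MiB |
|    0      2025    G   compiz                                          59MiB |
|    0      3571    G   unity-control-center                             2MiB |
|    0     10412    G   ...ves-passed-by-fd --v8-snapshot-passed-by-   227MiB |
+-----------------------------------------------------------------------------+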

I am looking forward to your reply.
