# Maximizing AV1 decoding speed: a modern 2022 encoding and decoding guide!

Hello. I've decided to share a lot more of my knowledge in public forums from now on, and to keep my focus on improving the world in a way that stays written in history.

This Gist discusses how to improve AV1 decoding performance on two fronts: improving performance through more efficient decoding, and through decoding-aware encoding.

Improving decoding performance through more efficient decoding.

Here are many tips on how to improve decoding performance on any machine:

1. Keep your favorite media player up to date!

For AV1 software decoding (or any kind of software decoding, for that matter), this implies:

  • Utilizing the latest dav1d software AV1 decoder version (dav1d 1.0.0 and later).
  • Utilization of new hybrid acceleration techniques through GPU assisted decoding via software like libplacebo.

On your preferred operating system and programs, that likely means getting the latest software versions available.

That includes browsers (the most recent Firefox/Chrome versions ship a more up-to-date version of the dav1d decoder), media players (mpv, VLC, mpc-hc with the latest LAV Filters), and even operating system libraries (for thumbnails, previews, or even image coding).

Browsers: just update to the latest version available for your system. Media players: I recommend mpv and VLC as high-performance, customizable options, as they can be kept up to date on most systems rather easily. Of the two, I recommend mpv, as it currently has access to GPU acceleration features that VLC does not, makes it easier to verify that the decoder is up to date, and behaves better when rendering/decoding performance is poor.
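If you want to double-check which AV1 decoders your tools actually ship, a quick sanity check from a terminal looks something like this (a Unix-style shell is assumed here, and the exact output varies by build):

ffmpeg -decoders | grep -i av1
mpv --vd=help | grep -i av1

If libdav1d shows up in those lists, software AV1 decoding should go through dav1d.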

On Windows, you can just download the latest mpv builds from one of these sources:

  • https://jeremylee.sh/bins/
  • https://sourceforge.net/projects/mpv-player-windows/files/64bit/
  • https://github.com/zhongfly/mpv-winbuild

For Linux, since static mpv binaries are very rare to see, your best bet is to respectfully ask your distribution's package maintainers whether more recent media players and decoding libraries can be provided. However, do not harass them or constantly insist on the matter. Be patient.

For those Linux folks who know they won't be able to get updated media players for another decade, there exists this mpv Flatpak, saving us: https://flathub.org/apps/details/io.mpv.Mpv
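For reference, assuming Flatpak itself is already set up on your distribution, installing that build is a one-liner:

flatpak install flathub io.mpv.Mpv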

For Apple folks, you can just get VLC. For Android folks, I obviously recommend mpv-android:

  • https://github.com/mpv-android/mpv-android/releases
  • https://play.google.com/store/apps/details?id=is.xyz.mpv&gl=US

2. Maximizing media player performance

It is not only important to improve software decoding speed, but also to improve resource utilization through more efficient rendering, which becomes more important at higher resolutions, framerates, bit depths, etc.

That is why I usually do not recommend browsers for optimal media playback: their rendering engines are usually not as efficient as dedicated media players' rendering engines, and they consume other system resources as well, which is why I consider their main use to be cross-platform playback with services like Jellyfin.

As for which media player to use to maximize AV1 playback performance, I have to give it to mpv specifically. Even though VLC and mpv are rather comparable overall, mpv wins thanks to the higher-performance rendering pipeline it has access to in the 1st place, particularly the easily available libplacebo-based gpu-next video output, which allows more efficient GPU utilization and other exotic features.

2.1. AV1 grain synthesis GPU acceleration through ffmpeg libplacebo/mpv GPU-Next

Since mpv has access to libplacebo rendering, it can also utilize leading-edge features such as hybrid GPU acceleration. Currently, only AV1 grain synthesis can be offloaded to the GPU, although that alone can give a decent increase in decoding performance when it can be utilized.

Activating GPU-based AV1 grain synthesis requires the use of gpu-next, which means you should add this to your mpv/mpv.net configuration file if you want to get it working:

vo=gpu-next
vd-lavc-film-grain=gpu

For even more efficient GPU utilization, the Vulkan graphics API can be used over OGL/DX11, although some older platforms and operating systems, such as macOS, may not be supported:

gpu-api=vulkan

Here is my config for reference:

vo=gpu-next
gpu-api=vulkan
vd-lavc-film-grain=gpu

Note: GPU-Next supports Vulkan, DX11, and OpenGL.
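If you are not sure whether your mpv build actually ships gpu-next or Vulkan support, you can ask it to list the available video outputs and GPU APIs (a quick check; the exact output depends on how your build was compiled):

mpv --vo=help
mpv --gpu-api=help

gpu-next and vulkan should appear in the respective lists if your build supports them.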

This should benefit just about every machine, but the faster your CPU, the smaller the benefit. Also, unless you have an iGPU, non-zero-copy GPU transactions (CPU<>RAM<>GPU) mean there is a latency penalty when switching tasks around. Throughput isn't affected much by this, but random access performance is, which means seeking performance can be hurt if your CPU is already fast enough on its own.

Even on faster CPUs, the increase in decoding performance for higher-bitrate, higher-resolution streams is very much appreciated when grain synthesis is present, so I recommend keeping it on.

2.2. Taking advantage of caching to smooth out utilization bumps

On old and/or slow platforms, particularly at higher resolutions with exotic features like grain synthesis and AV1 frame super-resolution, decoding performance is still a challenge in demanding scenarios.

A good example of this would be an old 2C/4T Skylake laptop that I have for TV video watching: while it can decode 4k24 10b grain synth streams without dropping any frames, 4k30 10b performance is not consistent, even with GPU grain synthesis enabled.

That is because while decoding performance exceeds realtime for most of the stream, there are moments where that isn't the case, so decoding performance goes below realtime and output frames are dropped.

Decoding in advance and putting the raw frames inside a buffer in RAM would theoretically allow these bumps to be smoothed out completely, and empirically speaking, that is the case: after enabling 2GB of frame caching in mpv (tripling the decoding overhead for a 4k30 10b stream), I was not able to get a single frame drop during playback. That was with scaling and depth conversion down to 8b on my peasant 1080p TV, so real 4k performance should be slightly better overall.

You do sacrifice some RAM in the process, but if you really want to play high resolution high bitrate streams on a slow machine, it can be very helpful.

As for how to enable it, you'll need to be using mpv, and to put something similar to this in your mpv config:

# Decode ahead and queue raw video/audio frames in RAM to smooth out load spikes
vd-queue-enable=yes
ad-queue-enable=yes
# Caps for the decoded video frame queue: total size, frame count, and duration
vd-queue-max-bytes=2000MiB
vd-queue-max-samples=600000
vd-queue-max-secs=15
# Demuxer cache: compressed data kept ahead of and behind the playback position
cache=yes
demuxer-max-bytes=650M
demuxer-max-back-bytes=1000M
So that is basically all you can do to improve decoding performance on the playback side; let's get on to the encoding side.

2.3. Future improvements

With libplacebo AV1 GPU grain synthesis now being viable to use, you have to ask yourself: will more hybrid GPU decoding functions be brought over?
The answer is yes, actually, and in the future we will get more functions that can be offloaded to the GPU, getting us closer and closer to the dream of a 4x A55 machine being able to decode a 4k 10b stream.

Improving decoding performance through decoding aware encoding

To make it short, different encoding decisions and different coding tools can have a decent effect on decoding performance. In AV1, here are the main user-controllable decoding bottlenecks (relatively speaking), in no particular order:

  • Loop filtering pipeline: CDEF and restoration filtering are the main culprits.
  • Grain synthesis.
  • Entropy coding at higher bitrates (although that is an issue with most modern codecs).
  • CDF (cumulative distribution function) updates in entropy coding. The fewer updates, the higher the performance (only confirmed to be an encoding performance increase).
  • Deep frame hierarchies.
  • Large superblock sizes, mostly at lower resolutions (mainly threading).
  • Number of reference frames.
  • Partition and transform sizes (bigger partitions are easier to compute than smaller partitions).
  • Bitrate.
  • Number of tiles in the stream (mainly threading).

This part of the guide is mainly geared towards aomenc users, as SVT-AV1 users have a command-line option called --fast-decode X, which does some of the stuff above for you. The latter method is a lot simpler, but the former allows for a lot more flexibility.
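For reference, a minimal SVT-AV1 sketch of that route could look like the line below; the file names, preset, and CRF value are placeholders, not recommendations from this guide:

SvtAv1EncApp -i input.y4m --preset 6 --crf 30 --fast-decode 1 -b output.ivf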

For a basic improvement in decoder side threading, here is what you can do in mainline aomenc quite easily: --tile-columns=1 --sb-size=64

This should improve decoder scaling nicely at lower mainstream resolutions like <=1080p. Note: unless you really struggle with decoding at 1080p, just --sb-size=64 should suffice at 1080p and lower. --tile-columns=1 is just a bonus.

At higher resolutions (>=1440p), reducing the superblock size doesn't help all that much, so leaving it at the preset default is perfectly fine.

For further improvements in decoding performance, this can be done: --tile-columns=1 --sb-size=64 --enable-restoration=0

For even higher decoding performance: --tile-columns=1 --sb-size=64 --enable-restoration=0 --gf-max-pyr-height=4 --max-reference-frames=4

For the highest somewhat reasonable decoding performance target: --tile-columns=2 --tile-rows=1 --sb-size=64 --enable-restoration=0 --enable-cdef=0 --gf-max-pyr-height=4 --max-reference-frames=4 --min-partition-size=8

aomenc-av1 ultrafast decoding, encoding quality be damned: --tile-columns=2 --tile-rows=1 --sb-size=64 --enable-restoration=0 --enable-cdef=0 --gf-max-pyr-height=4 --max-reference-frames=3 --min-partition-size=8 --loopfilter-control=0

Note that it is possible to disable even more coding features, but the above command line is already extensive and damaging enough to compression efficiency that I wouldn't recommend going any further.
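To put these flags into context, a sketch of a complete aomenc invocation using the "further improvements" flag set might look like this; the input/output names, speed preset, and quality level are placeholders rather than recommendations from this guide:

aomenc --passes=1 --cpu-used=4 --end-usage=q --cq-level=24 --tile-columns=1 --sb-size=64 --enable-restoration=0 -o output.ivf input.y4m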

Explanation of each setting:

  • Tiles = Splitting up the video into tile columns and tile rows, somewhat restricting what kind of information can be accessed throughout the whole frame. In decoding, tile threading scales better than frame threading, especially for random access operations such as seeking.

  • SB-size = superblock size. AV1 encoders can choose between 64x64 and 128x128 SBs when encoding. At lower resolutions (<=1080p), forcing the SB size to 64x64 doesn't hurt coding efficiency much, if at all (if the preset itself has access to 128x128 SBs in the 1st place, of course), and improves encoder- and decoder-side threading.

  • CDEF and restoration filtering: these filters are part of the loop filter pipeline, and at higher resolutions and bitrates they can become decent decoding bottlenecks, although their influence today isn't very large. Start by disabling restoration filtering first, and then CDEF if really needed.

  • gf-max-pyr-height = Group of Frames maximum pyramid height or, as described before, the maximum depth of the frame hierarchy. The deeper the frame hierarchy, the more references there are to be made, which reduces encoding and decoding performance somewhat. Reducing this from the default of 5 lowers the ceiling of coding gains that deeper frame hierarchies make possible, but improves decoding performance.

  • max-reference-frames = the maximum, heuristically determined number of reference frames the encoder can use. The higher the number, the higher the potential coding efficiency, but the more encoding and decoding overhead you incur. Reduce it if needed.

  • min-partition-size = minimum partition size, i.e. how small you let the blocks be. In aomenc at <=CPU-5, the minimum partition size for <2160p content is 4x4; in CPU-6, it is 8x8. For >=2160p it is 8x8 anyway, so there is no need to touch it at larger resolutions.

  • loopfilter-control = control parameter for the loop filter. When set to anything other than 1 (the default), it restricts loop filter application. Set it to 0 and you disable the loop filtering pipeline entirely (deblocking, CDEF, and restoration), which can cause a rather large quality deficit.

  • cdf-update-mode = cumulative distribution function update mode. It decides how often the probability models used by the entropy coder are updated. 0 = never (not recommended), 1 = every frame (default), 2 = selective updates (the only one you should choose if you wish to get higher decoding performance at high bitrates). It also comes with a slight encoding speed increase; see the example after this list.
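As an example of that last setting, assuming your aomenc build exposes the option under this name, adding selective CDF updates on top of the earlier decoder-friendly flags would look something like this:

--tile-columns=1 --sb-size=64 --enable-restoration=0 --cdf-update-mode=2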

That'll be all from me today.

Questions and criticism welcome.

@themeadery

Interesting article. I was most interested in trying to get AV1 film grain synthesis working on my machine. I found a bug in my particular setup (Windows 11, mpv.net v6.0.3.2-beta, Nvidia 3050) that would make mpv crash every time I tried to load a file with film grain synthesis in it.

Turns out vd-lavc-film-grain=auto or vd-lavc-film-grain=gpu causes a crash every time. The error message even points right to the problem. I had to change it to vd-lavc-film-grain=cpu.

This allows me to have a pretty nice setup:
vo=gpu-next
vd-lavc-film-grain=cpu
gpu-api=opengl
hwdec=nvdec

Very low CPU usage, offloads almost everything to GPU. Even with the film grain forced to run on CPU I am only seeing ~3% load on an AMD 5800X (4K HDR 10-bit AV1). With NVDEC turned off CPU usage skyrockets to 28%, which is the highest I've seen testing all of my mpv renderers and options. "Video Decode" in task manager shows only 5% utilization with NVDEC on. And this is with a whole bunch of other quality options cranked up, which should be expensive.

I know I mentioned NVDEC a lot, but I believe the other hwdecs work just as well.

Vulkan does not behave for me. OpenGL is nice because it disables the exclusive mode switch jank and supports more features than d3d11.

@BlueSwordM (Author)

Oh very interesting.

On the topic of grain synthesis, if you have ASIC HW decode, you will not need GPU grain synthesis, as grain synthesis is part of the spec that hardware decoders need to support.

GPU grain synthesis is mainly made for streams that can't be HW decoded today, and to help those with no HW decode.
This might be an mpv or libplacebo bug, so reporting it would be best.

@themeadery

themeadery commented Oct 25, 2022

Yeah, what is weird is that not setting the option still causes the bug to appear, because not setting it implies 'auto', which then implies 'gpu' due to its criteria in the code, and it crashes. It crashes right when it tries to load the film grain module, as that module is listed in the error. It does not crash until you select gpu-next instead of gpu, because you aren't meeting the requirements for film grain until you do that.

Last night I started figuring out how and who I'm going to write the bug report for.

I did some digging into all the codebases required to get AV1 going with mpv, and it seems dav1d is doing some clever stuff. They do all the heavy lifting, offloading things to where they need to be, whether that is CPU or GPU. It's a hybrid approach, not a one-or-the-other approach like many other previous solutions. I believe this is what is causing errors downstream when devs are implementing features and options. My hunch is that dav1d knows where to put the film grain synthesis, and merely having the option of where to put it in mpv causes it to screw up. I could be wrong, and also I am using mpv.net, which could have introduced the bug over vanilla.

dav1d is the real star of the show and, I believe, where most of the performance increases are coming from. What a heck of a project. Over 100k lines of code written in assembly. When I bought my 3050, I bought it for the excellent NVDEC/NVENC silicon to get the best hardware decoding/encoding I could. I had a naïve understanding that the media player would just throw the raw data at the GPU and the GPU + drivers would churn away and spit it back out at lightning speed. I saw AV1 listed as supported for decode (and now encode on 40-series) and thought that was that, hole in one. Had no idea VideoLAN could step in and make a hybrid decoder that was so good. And they aren't even done optimizing it, yet.

@BlueSwordM (Author)

Well, the fact that libplacebo GPU grain synthesis works fine on my end might indicate something else entirely :)
You should update to mpv git, it should likely clear up everything.

@themeadery

themeadery commented Oct 25, 2022

I guess my fix got lost in my long enthusiastic posts. It works, as long as I explicitly set vd-lavc-film-grain=cpu. Which is kinda odd, since it is the opposite of your guide. I'm on the shinchiro build from 2022-10-14 which is what is in my mpv.net beta release. I might try compiling from scratch from the latest bleeding edge code to test, since I think I have my config dialed and I don't really need the mpv.net GUI to explore anymore.
