# Maximizing AV1 decoding speed: a modern 2022 encoding and decoding guide!

Hello. I've decided to share a lot more of my knowledge in public forums from now on, and to keep my focus on improving the world in a way that stays written in history.

This Gist discusses how to improve AV1 decoding performance on two fronts: improving performance through more efficient decoding, and through decoding-aware encoding.

Improving decoding performance through more efficient decoding.

Here are many tips on how to improve decoding performance on any machine:

1. Keep your favorite media player up to date!

For AV1 software decoding (or any kind of software decoding, for that matter), this implies:

  • Utilizing the latest dav1d software AV1 decoder version (dav1d 1.0.0 and later).
  • Utilization of new hybrid acceleration techniques through GPU assisted decoding via software like libplacebo.

On your preferred operating system and programs, that likely means getting the latest software versions available.

That includes browsers (the most recent Firefox/Chrome versions ship a more up-to-date version of the dav1d decoder), media players (mpv, VLC, mpc-hc with the latest LAV Filters), and even operating system libraries (for thumbnails, previews, or even image coding).

Browsers: just update to the latest version available for your system. Media players: I recommend mpv and VLC as high-performance, customizable options, as they can be kept up to date on most systems rather easily. Of the two, I recommend mpv, as it currently has access to GPU acceleration features that VLC does not, makes it easier to verify that the decoder is up to date, and behaves better when rendering/decoding performance is poor.
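If you want to double-check which AV1 decoders your tools actually ship, a quick sanity check from a terminal looks something like this (a Unix-style shell is assumed here, and the exact output varies by build):

ffmpeg -decoders | grep -i av1
mpv --vd=help | grep -i av1

If libdav1d shows up in those lists, software AV1 decoding should go through dav1d.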

On Windows, you can just download the latest mpv builds from one of these sources:

  • https://jeremylee.sh/bins/
  • https://sourceforge.net/projects/mpv-player-windows/files/64bit/
  • https://github.com/zhongfly/mpv-winbuild

For Linux, since static mpv binaries are very rare to see, your best bet is to respectfully ask your distribution's package maintainers whether more recent media players and decoding libraries can be provided. However, do not harass them or constantly insist on the matter. Be patient.

For those Linux folks who know they won't be able to get updated media players for another decade, there exists this mpv Flatpak, saving us: https://flathub.org/apps/details/io.mpv.Mpv
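For reference, assuming Flatpak itself is already set up on your distribution, installing that build is a one-liner:

flatpak install flathub io.mpv.Mpv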

For Apple folks, you can just get VLC. For Android folks, I obviously recommend mpv-android:

  • https://github.com/mpv-android/mpv-android/releases
  • https://play.google.com/store/apps/details?id=is.xyz.mpv&gl=US

2. Maximizing media player performance

It is not only important to improve software decoding speed, but also to improve resource utilization through more efficient rendering, which becomes more important at higher resolutions, framerates, bit depths, etc.

That is why I usually do not recommend browsers for optimal media playback: their rendering engines are usually not as efficient as dedicated media players' rendering engines, and they consume other system resources as well, which is why I consider their main use to be cross-platform playback with services like Jellyfin.

As for which media player to use to maximize AV1 playback performance, I have to give it to mpv specifically. Even though VLC and mpv are rather comparable overall, mpv wins thanks to the higher-performance rendering pipeline it has access to in the 1st place, particularly the easily available libplacebo-based gpu-next video output, which allows more efficient GPU utilization and other exotic features.

2.1. AV1 grain synthesis GPU acceleration through ffmpeg libplacebo/mpv GPU-Next

Since mpv has access to libplacebo rendering, it can also utilize leading-edge features such as hybrid GPU acceleration. Currently, only AV1 grain synthesis can be offloaded to the GPU, although that alone can give a decent increase in decoding performance when it can be utilized.

Activating GPU-based AV1 grain synthesis requires the use of gpu-next, which means you should add this to your mpv/mpv.net configuration file if you want to get it working:

vo=gpu-next
vd-lavc-film-grain=gpu

For even more efficient GPU utilization, the Vulkan graphics API can be used over OGL/DX11, although some older platforms and operating systems, such as macOS, may not be supported:

gpu-api=vulkan

Here is my config for reference:

vo=gpu-next
gpu-api=vulkan
vd-lavc-film-grain=gpu

Note: GPU-Next supports Vulkan, DX11, and OpenGL.
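If you are not sure whether your mpv build actually ships gpu-next or Vulkan support, you can ask it to list the available video outputs and GPU APIs (a quick check; the exact output depends on how your build was compiled):

mpv --vo=help
mpv --gpu-api=help

gpu-next and vulkan should appear in the respective lists if your build supports them.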

This should benefit just about every machine, but the faster your CPU, the smaller the benefit. Also, unless you have an iGPU, non-zero-copy GPU transactions (CPU<>RAM<>GPU) mean there is a latency penalty when switching tasks around. Throughput isn't affected much by this, but random access performance is, which means seeking performance can be hurt if your CPU is already fast enough on its own.

Even on faster CPUs, the increase in decoding performance for higher-bitrate, higher-resolution streams is very much appreciated when grain synthesis is present, so I recommend keeping it on.

2.2. Taking advantage of caching to smooth out utilization bumps

On old and/or slow platforms, particularly at higher resolutions with exotic features like grain synthesis and AV1 frame super-resolution, decoding performance is still a challenge in demanding scenarios.

A good example of this would be an old 2C/4T Skylake laptop that I have for TV video watching: while it can decode 4k24 10b grain synth streams without dropping any frames, 4k30 10b performance is not consistent, even with GPU grain synthesis enabled.

That is because while decoding performance exceeds realtime for most of the stream, there are moments where that isn't the case, so decoding performance goes below realtime and output frames are dropped.

Decoding in advance and putting the raw frames inside a buffer in RAM would theoretically allow these bumps to be smoothed out completely, and empirically speaking, that is the case: after enabling 2GB of frame caching in mpv (tripling the decoding overhead for a 4k30 10b stream), I was not able to get a single frame drop during playback. That was with scaling and depth conversion down to 8b on my peasant 1080p TV, so real 4k performance should be slightly better overall.

You do sacrifice some RAM in the process, but if you really want to play high resolution high bitrate streams on a slow machine, it can be very helpful.

As for how to enable it, you'll need to be using mpv, and to put something similar to this in your mpv config:

# Decode ahead and queue raw video/audio frames in RAM to smooth out load spikes
vd-queue-enable=yes
ad-queue-enable=yes
# Caps for the decoded video frame queue: total size, frame count, and duration
vd-queue-max-bytes=2000MiB
vd-queue-max-samples=600000
vd-queue-max-secs=15
# Demuxer cache: compressed data kept ahead of and behind the playback position
cache=yes
demuxer-max-bytes=650M
demuxer-max-back-bytes=1000M
So that is basically all you can do to improve decoding performance on the playback side; let's get on to the encoding side.

2.3. Future improvements

With libplacebo AV1 GPU grain synthesis now being viable to use, you have to ask yourself: will more hybrid GPU decoding functions be brought over?
The answer is yes, actually, and in the future we will get more functions that can be offloaded to the GPU, getting us closer and closer to the dream of a 4x A55 machine being able to decode a 4k 10b stream.

Improving decoding performance through decoding aware encoding

To make it short, different encoding decisions and different coding tools can have a decent effect on decoding performance. In AV1, here are the main user-controllable decoding bottlenecks (relatively speaking), in no particular order:

  • Loop filtering pipeline: CDEF and restoration filtering are the main culprits.
  • Grain synthesis.
  • Entropy coding at higher bitrates (although that is an issue with most modern codecs).
  • CDF (cumulative distribution function) updates in entropy coding. The fewer updates, the higher the performance (only confirmed to be an encoding performance increase).
  • Deep frame hierarchies.
  • Large superblock sizes, mostly at lower resolutions (mainly threading).
  • Number of reference frames.
  • Partition and transform sizes (bigger partitions are easier to compute than smaller partitions).
  • Bitrate.
  • Number of tiles in the stream (mainly threading).

This part of the guide is mainly geared towards aomenc users, as SVT-AV1 users have a command-line option called --fast-decode X, which does some of the stuff above for you. The latter method is a lot simpler, but the former allows for a lot more flexibility.
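For reference, a minimal SVT-AV1 sketch of that route could look like the line below; the file names, preset, and CRF value are placeholders, not recommendations from this guide:

SvtAv1EncApp -i input.y4m --preset 6 --crf 30 --fast-decode 1 -b output.ivf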

For a basic improvement in decoder side threading, here is what you can do in mainline aomenc quite easily: --tile-columns=1 --sb-size=64

This should improve decoder scaling nicely at lower mainstream resolutions like <=1080p. Note: unless you really struggle with decoding at 1080p, just --sb-size=64 should suffice at 1080p and lower. --tile-columns=1 is just a bonus.

At higher resolutions (>=1440p), reducing the superblock size doesn't help all that much, so leaving it at the preset default is perfectly fine.

For further improvements in decoding performance, this can be done: --tile-columns=1 --sb-size=64 --enable-restoration=0

For even higher decoding performance: --tile-columns=1 --sb-size=64 --enable-restoration=0 --gf-max-pyr-height=4 --max-reference-frames=4

For the highest somewhat reasonable decoding performance target: --tile-columns=2 --tile-rows=1 --sb-size=64 --enable-restoration=0 --enable-cdef=0 --gf-max-pyr-height=4 --max-reference-frames=4 --min-partition-size=8

aomenc-av1 ultrafast decoding, encoding quality be damned: --tile-columns=2 --tile-rows=1 --sb-size=64 --enable-restoration=0 --enable-cdef=0 --gf-max-pyr-height=4 --max-reference-frames=3 --min-partition-size=8 --loopfilter-control=0

Note that it is possible to disable even more coding features, but the above command line is already extensive and damaging enough to compression efficiency that I wouldn't recommend going any further.
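To put these flags into context, a sketch of a complete aomenc invocation using the "further improvements" flag set might look like this; the input/output names, speed preset, and quality level are placeholders rather than recommendations from this guide:

aomenc --passes=1 --cpu-used=4 --end-usage=q --cq-level=24 --tile-columns=1 --sb-size=64 --enable-restoration=0 -o output.ivf input.y4m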

Explanation of each setting:

  • Tiles = Splitting up the video into tile columns and tile rows, somewhat restricting what kind of information can be accessed throughout the whole frame. In decoding, tile threading scales better than frame threading, especially for random access operations such as seeking.

  • SB-size = superblock size. AV1 encoders can choose between 64x64 and 128x128 SBs when encoding. At lower resolutions (<=1080p), forcing the SB size to 64x64 doesn't hurt coding efficiency much, if at all (if the preset itself has access to 128x128 SBs in the 1st place, of course), and improves encoder- and decoder-side threading.

  • CDEF and restoration filtering: these filters are part of the loop filter pipeline, and at higher resolutions and bitrates they can become decent decoding bottlenecks, although their influence today isn't very large. Start by disabling restoration filtering first, and then CDEF if really needed.

  • gf-max-pyr-height = Group of Frames maximum pyramid height or, as described before, the maximum depth of the frame hierarchy. The deeper the frame hierarchy, the more references there are to be made, which reduces encoding and decoding performance somewhat. Reducing this from the default of 5 lowers the ceiling of coding gains that deeper frame hierarchies make possible, but improves decoding performance.

  • max-reference-frames = the maximum, heuristically determined number of reference frames the encoder can use. The higher the number, the higher the potential coding efficiency, but the more encoding and decoding overhead you incur. Reduce it if needed.

  • min-partition-size = minimum partition size, i.e. how small you let the blocks be. In aomenc at <=CPU-5, the minimum partition size for <2160p content is 4x4; in CPU-6, it is 8x8. For >=2160p it is 8x8 anyway, so there is no need to touch it at larger resolutions.

  • loopfilter-control = control parameter for the loop filter. When set to anything other than 1 (the default), it restricts loop filter application. Set it to 0 and you disable the loop filtering pipeline entirely (deblocking, CDEF, and restoration), which can cause a rather large quality deficit.

  • cdf-update-mode = cumulative distribution function update mode. It decides how often the probability models used by the entropy coder are updated. 0 = never (not recommended), 1 = every frame (default), 2 = selective updates (the only one you should choose if you wish to get higher decoding performance at high bitrates). It also comes with a slight encoding speed increase; see the example after this list.
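As an example of that last setting, assuming your aomenc build exposes the option under this name, adding selective CDF updates on top of the earlier decoder-friendly flags would look something like this:

--tile-columns=1 --sb-size=64 --enable-restoration=0 --cdf-update-mode=2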

That'll be all from me today.

Questions and criticism welcome.

@themeadery

Interesting article. I was most interested in trying to get AV1 film grain synthesis working on my machine. I found a bug in my particular setup (Windows 11, mpv.net v6.0.3.2-beta, Nvidia 3050) that would make mpv crash every time I tried to load a file with film grain synthesis in it.

Turns out vd-lavc-film-grain=auto or vd-lavc-film-grain=gpu causes a crash every time. The error message even points right to the problem. I had to change it to vd-lavc-film-grain=cpu.

This allows me to have a pretty nice setup:
vo=gpu-next
vd-lavc-film-grain=cpu
gpu-api=opengl
hwdec=nvdec

Very low CPU usage, offloads almost everything to GPU. Even with the film grain forced to run on CPU I am only seeing ~3% load on an AMD 5800X (4K HDR 10-bit AV1). With NVDEC turned off CPU usage skyrockets to 28%, which is the highest I've seen testing all of my mpv renderers and options. "Video Decode" in task manager shows only 5% utilization with NVDEC on. And this is with a whole bunch of other quality options cranked up, which should be expensive.

I know I mentioned NVDEC a lot, but I believe the other hwdecs work just as well.

Vulkan does not behave for me. OpenGL is nice because it disables the exclusive mode switch jank and supports more features than d3d11.

@BlueSwordM (Author)

Oh very interesting.

On the topic of grain synthesis, if you have ASIC HW decode, you will not need GPU grain synthesis, as grain synthesis is part of the spec that hardware decoders need to support.

GPU grain synthesis is mainly made for streams that can't be HW decoded today, and to help those with no HW decode.
This might be an mpv or libplacebo bug, so reporting it would be best.

@themeadery

themeadery commented Oct 25, 2022

Yeah, what is weird is that not setting the option still causes the bug to appear, because not setting it implies 'auto', which then implies 'gpu' due to its criteria in the code, and it crashes. It crashes right when it tries to load the film grain module, as that module is listed in the error. It does not crash until you select gpu-next instead of gpu, because you aren't meeting the requirements for film grain until you do that.

Last night I started figuring out how and who I'm going to write the bug report for.

I did some digging into all the codebases required to get AV1 going with mpv, and it seems dav1d is doing some clever stuff. They do all the heavy lifting, offloading things to where they need to be, whether that is CPU or GPU. It's a hybrid approach, not a one-or-the-other approach like many other previous solutions. I believe this is what is causing errors downstream when devs are implementing features and options. My hunch is that dav1d knows where to put the film grain synthesis, and merely having the option of where to put it in mpv causes it to screw up. I could be wrong, and also I am using mpv.net, which could have introduced the bug over vanilla.

dav1d is the real star of the show and, I believe, where most of the performance increases are coming from. What a heck of a project. Over 100k lines of code written in assembly. When I bought my 3050, I bought it for the excellent NVDEC/NVENC silicon to get the best hardware decoding/encoding I could. I had a naïve understanding that the media player would just throw the raw data at the GPU and the GPU + drivers would churn away and spit it back out at lightning speed. I saw AV1 listed as supported for decode (and now encode on 40-series) and thought that was that, hole in one. Had no idea VideoLAN could step in and make a hybrid decoder that was so good. And they aren't even done optimizing it, yet.

@BlueSwordM (Author)

Well, the fact that libplacebo GPU grain synthesis works fine on my end might indicate something else entirely :)
You should update to mpv git, it should likely clear up everything.

@themeadery

themeadery commented Oct 25, 2022

I guess my fix got lost in my long enthusiastic posts. It works, as long as I explicitly set vd-lavc-film-grain=cpu. Which is kinda odd, since it is the opposite of your guide. I'm on the shinchiro build from 2022-10-14 which is what is in my mpv.net beta release. I might try compiling from scratch from the latest bleeding edge code to test, since I think I have my config dialed and I don't really need the mpv.net GUI to explore anymore.
