@zakne
Last active January 16, 2018 04:03
GSoC vp9 decoder improvements report
During GSoC I was able to accomplish the following: optimize part of the ipred functions and implement tile threading support.
Links for optimized avx2 ipred functions:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=35a5d9715dd82fd00f1d1401ec6be2d3e2eea81c
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=81fc617c125734aa6f3b3d938af75fef6db750e7
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=73d9a9a6af5d00cfa9b98c7d9fc9abd0c734ba8e
Links for the tile threading code:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215363.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215361.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215393.html
Tile threading support has not been committed to the main repository yet, as it is still being reviewed by the developers.
After my changes I got these performance numbers:
Tile threading is ~45% faster at 2 threads vs 1.
Frame threading is ~55% faster at 2 threads vs 1.
ffvp9 with tile threading is ~25% faster than libvpx-vp9 at 2 threads.
Writing tile threading support had two main challenges.
The first was debugging a multithreaded application in general. It is not easy and I had little prior experience with it,
but I learned a lot, and overall it was a good experience even though it took a lot of time.
The second, more specific one was making the loopfilter work with a small VP9Filter *lflvl allocation, for example 4 superblock rows.
We don't know how far the loopfilter is behind the worker threads, so the two have to be synchronized to keep the worker threads
from overwriting lflvl entries with data for rows further ahead that the loopfilter has not processed yet.
That was quite challenging: the synchronization failed on the second frame, and it was hard to debug and find the cause,
so I spent a lot of time on it.
To avoid wasting more time, I solved it by allocating lflvl for every superblock row in the frame.
That eliminates this race condition completely, at the cost of a little more memory.
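The difference between the two lflvl layouts can be sketched in a few lines of C. This is only an illustration, not the code from the patches: FilterLevels, sb_count, ring_slot, frame_slot and alloc_frame_levels are hypothetical names, and the struct is a placeholder rather than the real VP9Filter. The only fact taken from VP9 itself is the 64x64 superblock size.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical placeholder for the per-superblock loopfilter data;
 * the real VP9Filter layout is not reproduced here. */
typedef struct FilterLevels {
    unsigned char level[8][8];
} FilterLevels;

/* Number of 64x64 superblocks needed to cover `pixels` pixels. */
static int sb_count(int pixels)
{
    return (pixels + 63) / 64;
}

/* Small ring of rows: rows r and r + ring_rows map to the same slot,
 * so a worker that runs ahead can overwrite data the loopfilter
 * has not consumed yet unless the two are synchronized. */
static int ring_slot(int sb_row, int ring_rows)
{
    return sb_row % ring_rows;
}

/* Full-frame layout: every superblock gets its own slot,
 * so no two rows of the same frame ever alias. */
static size_t frame_slot(int sb_row, int sb_cols, int sb_col)
{
    return (size_t)sb_row * (size_t)sb_cols + (size_t)sb_col;
}

/* Allocate one zeroed entry per superblock in the frame.
 * Returns NULL on allocation failure; the caller frees the result. */
static FilterLevels *alloc_frame_levels(int width, int height)
{
    size_t n = (size_t)sb_count(width) * (size_t)sb_count(height);
    return calloc(n, sizeof(FilterLevels));
}
```

For a 1920x1080 frame this means 30 x 17 = 510 entries instead of 30 x 4 = 120 for a 4-row ring, which is the memory cost mentioned above.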
Overall, I am really happy with the work I have done. I had hoped to write a lot more code for vp9,
but this is my first time working on such a big project, and I gained a lot of experience.
Still to do: avx2 assembly for the loopfilter, alpha channel support, and finishing the avx2 assembly
for the ipred functions. Looking forward to that!
UPD: As of 08.09.2017 tile threading has been committed to the main repository, links:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=e59da0f7ff129d570adb72c6479f7ce07cf5a0f9
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=83c12fefd22fc2326a000019e5c1a33e90a874e8
@rubdos

rubdos commented Sep 8, 2017

I'm interested in knowing how this performs on Ryzen, given that AVX is implemented in halves. Any insights?

@zakne

zakne commented Sep 9, 2017

@rubdos I haven't had a chance to test the SIMD code on a Ryzen CPU yet, so I can't give exact numbers.

@rektide

rektide commented Jan 16, 2018

Are there any numbers that show what kind of performance improvement AVX brings to vp9 decoding? All the example numbers compare 2 vs 1 threads. What do the numbers look like at 1 vs 1 thread?
