zakne/gist:06618f0af3ddd490df6e8701f0c402c9

## gistfile1.txt
During GSoC I was able to accomplish the following: optimize part of the ipred functions and implement tile threading support.

Links for optimized avx2 ipred functions:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=35a5d9715dd82fd00f1d1401ec6be2d3e2eea81c
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=81fc617c125734aa6f3b3d938af75fef6db750e7
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=73d9a9a6af5d00cfa9b98c7d9fc9abd0c734ba8e

Links for the tile threading code:
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215363.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215361.html
http://ffmpeg.org/pipermail/ffmpeg-devel/2017-August/215393.html

Tile threading support is not committed to the main repository yet, as it is still being reviewed by the developers.

After my changes I got these performance numbers:
Tile threading is ~45% faster at 2 threads vs 1.
Frame threading is ~55% faster at 2 threads vs 1.
ffvp9 tile threading is ~25% faster than libvpx-vp9 at 2 threads.

There were two main challenges in writing tile threading support.
The first was debugging a multithreaded application in general, which is not easy, and I had little prior experience with it.
I learned a lot and overall it has been a good experience, although it took a lot of time.
The second, more specific one was making the loopfilter work with a small VP9Filter *lflvl allocation of about 4 superblock rows.
We don't know how far the loopfilter is behind the worker threads, so the two have to be synchronized so that the worker threads
don't overwrite lflvl entries with information for rows further ahead that the loopfilter has not processed yet.
That was quite challenging: the synchronization did not work on the second frame, the cause was hard to track down,
and I spent a lot of time on it.
To avoid losing more time, I solved it by allocating lflvl for every superblock row in the frame.
That eliminates the race condition completely, but requires a little more memory.
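
The trade-off described above can be sketched in C. This is only an illustration, not the code from the patches: RowFilterInfo, ring_slot, frame_slot and count_overwritten_rows are hypothetical names, and the struct is a stand-in for the real VP9Filter data. It models the worst case without synchronization, where the worker threads finish writing every row before the loopfilter reads any of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-row loopfilter data (not the real VP9Filter). */
typedef struct RowFilterInfo {
    size_t sb_row; /* superblock row that last wrote this slot */
} RowFilterInfo;

/* Small ring buffer: rows r, r + slots, r + 2 * slots, ... share one slot. */
static size_t ring_slot(size_t sb_row, size_t slots)
{
    return sb_row % slots;
}

/* Per-frame allocation: every superblock row owns its slot. */
static size_t frame_slot(size_t sb_row)
{
    return sb_row;
}

/*
 * Worst case without synchronization: the workers write all rows before the
 * loopfilter reads any of them. Returns how many rows the loopfilter would
 * find overwritten by a later row, or -1 on allocation failure.
 */
static int count_overwritten_rows(size_t sb_rows, size_t slots, int use_ring)
{
    RowFilterInfo *lflvl;
    int overwritten = 0;
    size_t r;

    assert(slots > 0);
    assert(use_ring || slots >= sb_rows); /* per-frame mode needs one slot per row */

    lflvl = calloc(slots, sizeof(*lflvl));
    if (!lflvl)
        return -1;

    /* Worker pass: each row records itself in its slot. */
    for (r = 0; r < sb_rows; r++) {
        size_t s = use_ring ? ring_slot(r, slots) : frame_slot(r);
        lflvl[s].sb_row = r;
    }

    /* Loopfilter pass: a slot holding another row's data means it was lost. */
    for (r = 0; r < sb_rows; r++) {
        size_t s = use_ring ? ring_slot(r, slots) : frame_slot(r);
        if (lflvl[s].sb_row != r)
            overwritten++;
    }

    free(lflvl);
    return overwritten;
}
```

With 10 superblock rows and 4 slots, count_overwritten_rows(10, 4, 1) reports 6 rows lost, while the per-frame layout, count_overwritten_rows(10, 10, 0), reports 0. The per-frame layout costs sb_rows slots instead of 4, which is the extra memory mentioned above.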

Overall, I am really happy with the work I have done, although I had hoped to write a lot more code for vp9.
This is my first time working on such a big project, and I gained a lot of experience.

I still have to do the following: avx2 assembly for the loopfilter, alpha channel support, and the rest of the avx2 assembly
for the ipred functions. Looking forward to that!

UPD: As of 08.09.2017 tile threading has been committed to the main repository, links:
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=e59da0f7ff129d570adb72c6479f7ce07cf5a0f9
http://git.videolan.org/?p=ffmpeg.git;a=commit;h=83c12fefd22fc2326a000019e5c1a33e90a874e8