@dwbuiten
Last active March 24, 2021 09:00
Things to Look for When Evaluating Encoders

"Our encoder mixes DCT synergies and gamma radiation to provide excellent quality that abides by industry standard salsa dancing techniques, using adaptive streaming techniques." — Vendor.

First Steps

  1. First, verify that they actually have a product to sell: one that exists and can be tested in-house.

    Warning signs: incredibly vague descriptions of the product, lots of "patent pending" claims with no real patents to be found, and insistence that they can do the testing for you and provide metrics instead of letting you test it yourself.

  2. Then, before you even bother to evaluate something, make sure it's actually useful to us; that is, even if it works 100% as their marketing claims, is it worth anything to us?

    Example #1: A product which automatically chooses the 'best' bitrate for a given video, at the cost of a several-fold slowdown. The value proposition here not only fails to provide any value for our use case (we need bitrate caps and spike control), but also offers something we already have a good solution for (CRF).

    Example #2: A new codec or encoder which provides quality marginally better than, worse than, or similar to x264's preset placebo, but at a much higher CPU time cost. This provides literally zero benefit.

  3. Insist on actually testing it ourselves at some point. We should absolutely never buy into anything that they refuse to let us test in house before buying.

  4. Make sure they are selling us something that can actually beat our current setup with our use case. For example, it is not useful to show that some vendor has an encoder that can beat x264 on speeds and settings similar to fast or ultrafast. We aim for the higher end of quality, not the faster, lower end.

  5. Make sure the thing or codec can actually be played by a large number of users. It's not useful to provide encodes in a given codec, or with a given product's technique for e.g. VR or 3D, if nothing of value, and nothing widespread, supports playback of these items. Providing a browser plugin is not OK, either. Users will kill your dog if you ask them to install a browser plugin.

Evaluating an Encoder

This is often done wrong. Reliance on metrics is a terrible idea. Use your eyes. Don't just compare still images. We serve videos, not images.

First, read this: https://web.archive.org/web/20140822041755/http://x264dev.multimedia.cx/archives/472

The above blog post covers most basic comparison pitfalls like not comparing at the same bitrate.
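As a minimal sketch of the same-bitrate rule: average bitrate falls straight out of file size and duration, so it's trivial to sanity-check before trusting any visual comparison. The 5% tolerance below is an arbitrary illustrative choice, not a standard.

```python
# Sanity check before comparing two encodes: were they actually produced
# at (approximately) the same bitrate? File size and duration suffice.
def avg_bitrate_kbps(size_bytes: int, duration_s: float) -> float:
    """Average bitrate of a file in kilobits per second."""
    return size_bytes * 8 / duration_s / 1000

def comparable(size_a: int, size_b: int, duration_s: float,
               tolerance: float = 0.05) -> bool:
    """True if the two encodes are within `tolerance` (5% by default) of
    each other's average bitrate, i.e. a fair quality comparison."""
    a = avg_bitrate_kbps(size_a, duration_s)
    b = avg_bitrate_kbps(size_b, duration_s)
    return abs(a - b) / max(a, b) <= tolerance

# A 60 s clip: 37.5 MB vs 45 MB is a ~17% bitrate gap -- not comparable.
print(comparable(37_500_000, 45_000_000, 60.0))  # False
```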

General Rules

  1. Use your own clips, as well as dedicated test clips. Do not trust clips provided by the vendor. You want to test on real world content that you will actually be encoding. Also, use test clips that have been specifically crafted to stress encoders; for example: EBU test clips, and SVT test clips.

  2. Use clips longer than a few seconds. Test with clips whose length is actually representative of the real world content we care about. This is a good way to test things like ratecontrol, bitrate spike control, and speed.

  3. Use real world settings. It's OK to test with super-pimped-out settings, but it's imperative to test with encoder settings we can actually use in the real world. This includes things like encoding at a reasonable speed, generating encodes that can actually be played by the crappy plastic toy hardware we have to support, and that can be played back by end users with a reasonable CPU and hardware. Similarly, testing ultrafast settings (which are generally promoted by vendors, since it makes them seem good), is not useful. We aim for reasonably high quality encodes, not RealMedia encodes circa 1995.
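For concreteness, here is a hypothetical 'real world' baseline along these lines, assembled as an ffmpeg/libx264 invocation: constant-quality mode, a VBV cap for spike control, and a profile/level cap so constrained hardware decoders can play the result. The specific numbers are illustrative assumptions, not recommendations.

```python
# Assemble a 'real world' x264 command line (assumes ffmpeg with libx264).
# Every value here is an example knob, not a tuned setting.
def real_world_x264_cmd(infile: str, outfile: str) -> list:
    return [
        "ffmpeg", "-i", infile,
        "-c:v", "libx264",
        "-preset", "slow",        # reasonable speed; not placebo, not ultrafast
        "-crf", "18",             # constant-quality mode
        "-maxrate", "8M",         # VBV: caps bitrate spikes...
        "-bufsize", "16M",        # ...over roughly 2 s of buffer at maxrate
        "-profile:v", "high",     # decodable by the plastic-toy hardware
        "-level", "4.1",
        outfile,
    ]

print(" ".join(real_world_x264_cmd("source.mov", "out.mp4")))
```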

  4. Make sure your test is reproducible. Always carefully document the test environment and the software versions used for your tests: another person should be able to recreate the same test conditions and obtain the same results as you do; otherwise the test is worth nothing. On that point, it's often important to run benchmarks multiple times, in order to avoid skewing the results due to some unforeseen circumstance. And report the variance in your results.
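The multiple-runs-plus-variance point can be sketched in a few lines: collect the timings, then report mean and spread, so a reproducer can tell a real difference from run-to-run noise. The timings below are hypothetical placeholder numbers.

```python
import statistics

# Summarize repeated benchmark runs: report the mean plus the sample
# standard deviation, so differences can be judged against noise.
def summarize_runs(timings_s: list) -> dict:
    return {
        "runs": len(timings_s),
        "mean_s": statistics.mean(timings_s),
        "stdev_s": statistics.stdev(timings_s),  # sample standard deviation
    }

# Five hypothetical wall-clock timings of the same encode:
print(summarize_runs([121.4, 119.8, 122.1, 120.5, 121.0]))
```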

  5. Always treat everything a vendor or marketing person says with suspicion. They're trying to sell us stuff, and they're going to try and mislead you into thinking their product is TOTES AMAZEBALLS. This is not always obvious. It could be something subtle, like suggesting which x264 settings you compare against, which may trick someone with only a rudimentary understanding of video into thinking a product is better than it is. It could also be making patently false claims and hoping you won't actually check whether they are true.

  6. No hardware encoders or dongles. We need to be able to scale with demand, and buying into these fixes us to a static number of encoding machines, and precludes using The Butt^WCloud for overflow.

  7. GPU encoding is both expensive and a scam. This is simply the state of the industry. You need to buy expensive video cards, and use horrible software that produces similarly horrible encodes, with less throughput than software based encoding, and less ability to scale.

Objective Things to Test

  1. Reasonable speed and CPU usage. This one is obvious. We want to utilize our hardware the best we can, and want our encodes to finish before the heat death of the universe. Good frame (not slice) multithreading is a must, unless we are OK with chunking, although that requires chunk-stable ratecontrol, as described below.

  2. Rate control.

    Watch this video: https://www.youtube.com/watch?v=KMU0tzLwhbE

    Now, replace every instance of 'developers' with 'rate control'. That's how important ratecontrol is. So much so, that this section contains sub-sections describing common pitfalls, things to test, and why to test those things.

    1. Having a way to control bitrate spikes. While this is less of an issue when we serve files via adaptive streaming, it is an issue nonetheless. We need a way to control how much the bitrate is allowed to rise or fall over a given window (say, 3 seconds). Too much of a rise over too short a period will cause excessive buffering during progressive playback, and dropping down to a lower quality during adaptive playback. In H.264 and HEVC, this takes the form of VBV, for example. Other codecs may label it as overshoot/undershoot.
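The windowed-spike idea above can be checked directly from per-frame sizes: slide a window of the chosen length over the stream and flag any window whose bitrate breaks the cap. This is a measurement sketch, not a VBV model; the cap and window values are assumptions.

```python
# Flag bitrate spikes: compute the bitrate over every sliding window of
# `window_s` seconds and report windows that exceed `cap_kbps`.
def window_spikes(frame_bytes: list, fps: float,
                  window_s: float, cap_kbps: float) -> list:
    n = max(1, int(window_s * fps))  # frames per window
    spikes = []
    for i in range(len(frame_bytes) - n + 1):
        kbps = sum(frame_bytes[i:i + n]) * 8 / window_s / 1000
        if kbps > cap_kbps:
            spikes.append(i)  # window starting at frame i breaks the cap
    return spikes

# 2 s of uniform 10 kB frames at 30 fps, with one huge frame in the middle:
sizes = [10_000] * 30 + [200_000] + [10_000] * 29
print(window_spikes(sizes, fps=30, window_s=1.0, cap_kbps=3000))
```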

    2. Being able to actually hit bitrate targets. Rate control is not useful at all if it doesn't, well, control the rate of bits. It's not useful to have to encode something N times using Newton's method because the encoding library massively overshoots or undershoots a given bitrate.
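A trivial way to quantify this during evaluation: express the miss as a signed percentage of the target. What counts as an acceptable miss is a judgment call; a few percent on a first attempt is a reasonable bar.

```python
# How badly did the encoder miss its requested bitrate?
def target_error_pct(actual_kbps: float, target_kbps: float) -> float:
    """Signed deviation from the target bitrate, as a percentage.
    Positive means overshoot, negative means undershoot."""
    return (actual_kbps - target_kbps) / target_kbps * 100

# Asked for 5000 kbps, got 5600 kbps: roughly a +12% overshoot.
print(target_error_pct(5600, 5000))
```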

    3. Working scene change detection. Without this, a video is going to look awful, and have pulses in quality when scenes change, because the encoder, for example, coded a scene change using a B frame.

    4. Being able to handle mid-scene keyframe placement. This is important for adaptive streaming, where we want to place keyframes on N second boundaries, which may lie in the middle of scenes. If this is not handled gracefully by an encoder, it will result in keyframe popping in the middle of a scene, which is very jarring to a viewer.
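Checking this is mostly bookkeeping once you can extract keyframe timestamps from the bitstream or container: every keyframe the segmenter depends on must sit on an interval boundary. This sketch only flags off-grid keyframes; if an encoder is allowed to insert extra scene-change keyframes between boundaries, the check would need to be relaxed accordingly.

```python
# Verify keyframes land on the N-second boundaries adaptive streaming
# needs. `key_ts` is a list of keyframe timestamps in seconds.
def misplaced_keyframes(key_ts: list, interval_s: float,
                        tolerance_s: float = 0.001) -> list:
    bad = []
    for t in key_ts:
        nearest = round(t / interval_s) * interval_s
        if abs(t - nearest) > tolerance_s:
            bad.append(t)  # keyframe is not on an interval boundary
    return bad

# Keyframes every 2 s, but one landed mid-segment at 5.4 s:
print(misplaced_keyframes([0.0, 2.0, 4.0, 5.4, 6.0, 8.0], 2.0))  # [5.4]
```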

    5. Being able to actually allocate bits well. The obvious one. It should allocate bits to the blocks and scenes that need them most. Not a hard concept.

    6. Has some sort of working 'quality mode'. This is important both for chunking and for being able to skip a 2nd pass where possible, and it saves on storage space, since we know we can maintain a consistent quality with fewer bits.

    7. Bonus: Chunk-stable ratecontrol algorithm. This is required for chunked encoding to work well. Basically, the aforementioned 'quality mode' needs to produce similar or identical results when processing the same scenes / sets of frames, both in isolation, and when encoding an entire video. This is important for consistent, good quality, as well as for being able to apply global chunk rate tracking, and 2nd pass chunk bitrate adjustments based on this tracking.
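One way to test chunk stability empirically: encode the same scenes once in isolation and once as part of the whole video, then compare the bits each chunk received. The sizes here are measured inputs you would collect from real encodes, and the 2% tolerance is an illustrative assumption.

```python
# Compare per-chunk output sizes from isolated-chunk encodes versus a
# whole-video encode. A chunk-stable 'quality mode' should allocate
# near-identical bits either way.
def chunk_stable(isolated_bytes: list, full_bytes: list,
                 tolerance: float = 0.02) -> bool:
    assert len(isolated_bytes) == len(full_bytes)
    return all(abs(a - b) / max(a, b) <= tolerance
               for a, b in zip(isolated_bytes, full_bytes))

# Two chunks, sizes within ~1% of each other: stable.
print(chunk_stable([1_000_000, 2_500_000], [1_005_000, 2_480_000]))  # True
```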

  3. Can either be used as a library, or has some sort of sane output that we can mux. It should also be able to handle pipes, both in and out. If it has a library API, it needs to handle buffers, not filename strings.
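The pipe requirement in plain terms: we must be able to feed data into the encoder's stdin and read the bitstream from its stdout, with no temp files. The sketch below shows the pattern with `cat` standing in for the encoder binary, purely so the example runs anywhere; a real invocation would pipe raw frames in and read encoded output back the same way.

```python
import subprocess

# Pipe bytes through an external process: stdin in, stdout out.
# 'cat' is a stand-in for an encoder binary here.
def pipe_through(cmd: list, data: bytes) -> bytes:
    proc = subprocess.run(cmd, input=data,
                          stdout=subprocess.PIPE, check=True)
    return proc.stdout

out = pipe_through(["cat"], b"raw frame data")
print(out)  # the stand-in 'encoder' echoes its input back
```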

Visual Things to Test and Compare with x264:

  1. Banding on gradients. Check for banding artifacts, especially on backgrounds, where there are fine gradients. This is especially important for clean CG videos. However, at very low bitrates, it is expected. In that case, look at whether the banding is stable between frames and in motion. It should not jerk around and flicker, if possible.

  2. Dark scenes. Dark scenes tend to suffer from DCT decimation. That is, the encoder sees a block of dark content, and thinks "hey, the colors are all very similar here, I'll just make them all the same color for the whole block, to save bits". This can be good when done right, and horrible when done wrong. Things like adaptive quantization (with trellis quantization), psychovisual rate distortion, and, to a lesser and less automated extent, tweakable deadzones, can and do help a lot here. Adaptive quantization is pretty much a must for any encoder.

    Furthermore, even when psychovisual complexity between the source and encoded video is similar, it must be checked that it looks good in motion. For example, if a source has a fine grain effect, or a slightly moving dark background, it should not be static in the encoded version due to temporal decimation of complexity. Similarly, it should not exhibit a block pulsing effect in motion that distracts from the foreground.

  3. Fine and non-fine film grain. Both digital and real. Make sure it does not get smoothed away at high bitrates, which are the sort of bitrates we aim for, and the kind of thing our users expect. This is a big failure of almost every encoder, because they optimize for PSNR, which biases towards blurrier, but 'similar' images, as opposed to similarly complex images. This is a great place where metrics will almost always fail you, since the real 'fix' to this problem is usually psychovisual optimizations, which will try to encode a similarly complex image, which will make almost all metrics worse, but make it look visually better. Further reading: http://forum.doom9.org/showthread.php?t=138293
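The PSNR bias described above is easy to demonstrate on synthetic 'grain': against a noisy source, a blurred reconstruction scores a higher PSNR than a reconstruction with fresh, similarly complex grain, even though preserved grain is what psy optimizations aim for and what looks better to the eye. This is a toy 1-D demonstration, not a real video metric pipeline.

```python
import math
import random

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def psnr(a, b, peak=255.0):
    return 10 * math.log10(peak ** 2 / mse(a, b))

def blur3(xs):
    # 3-tap mean filter: the 'blurrier but similar' reconstruction
    return [sum(xs[max(0, i - 1):i + 2]) / len(xs[max(0, i - 1):i + 2])
            for i in range(len(xs))]

rng = random.Random(42)
flat = [128.0] * 10_000
source = [p + rng.gauss(0, 10) for p in flat]     # flat area plus 'grain'
regrained = [p + rng.gauss(0, 10) for p in flat]  # same look, fresh grain

# PSNR rewards the blur, even though preserved grain looks better:
print(psnr(source, blur3(source)) > psnr(source, regrained))  # True
```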

  4. Fast moving and/or small particulates. Make sure videos with lots of little bits moving around, e.g. confetti, do not become super blocky, nor horribly smoothed over during encoding. This is another area where both adaptive quantization and psychovisual optimizations will help. An encoded video 'similarly complex' to the input video will look better to the eye than one that is blurrier (making PSNR metrics better) or more complex (pulsing, or movement in motion that wasn't there before).

  5. Ringing, blocking, and other artifacts around sharp edges. Especially curved or diagonal edges, and edges not near block boundaries. This can be caused by overdoing quantization, or decimation, on blocks which contain edges inside them. This is especially bad on curved sharp edges, which cannot easily be predicted in one direction during encoding. It is especially true of text, which takes all the hardest cases listed above and puts them very close together within blocks. This issue should be familiar to most, given the prevalence of bad JPEGs on the web. Also make sure that the artifacts don't look even worse in motion; that is, they must be consistent between frames where the text is also consistent.

  6. Videos with static backgrounds and moving foregrounds. This one can be seen in many documentaries, for example: a fixed camera position, where the background does not move, but the foreground has high movement, such as a person speaking and gesturing. The foreground should have more bits allocated to it, and should not look incredibly blocky or blurry. At the same time, the background should not look bit starved. Furthermore, the background complexity and quality should be consistent throughout the scene. There should be no, or little, slow degradation of quality with periodic keyframe popping back up to better quality. There should also be no ghosting on the static background when the foreground moves.

  7. Videos with strobe effects, or multiple consecutive quick cuts. Strobe lights and, in general, anything with many quick cuts or scene changes will be hard to encode. Look at how an encoder handles this. It should make reasonable frametype decisions in these scenarios to avoid both bitrate spikes (tons of I frames) and bad quality (tons of P and B frames).

  8. High complexity pans. Moving camera over a complex background such as a tree with many leaves, or a highly detailed tapestry depicting our lord and savior, Squidward. Make sure a reasonable amount of complexity and detail is preserved, and doesn't slowly degrade over the course of the panning shot.
