Even though the activity mask is decoupled from the encoding process, there is currently no easy way to output a y4m of the activity masks for each frame. The easiest way to get a quick activity mask out of the encoder is to write the activity values to a CSV file as they are computed. An example patch that does this is included below as "extract-mask.diff".
The current approach to building an activity map is to compute the cube root of the standard deviation (equivalently, the sixth root of the variance) of each 8x8 block. The activity map implementation can be tracked in this pull request.
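As a minimal sketch of the measure described above (illustrative names only; the actual implementation lives in the linked pull request), the activity of an 8x8 block can be computed like this:

```rust
/// Activity of an 8x8 block: the sixth root of its variance,
/// i.e. the cube root of its standard deviation.
/// Sketch only; not rav1e's actual API.
fn block_activity(block: &[u8; 64]) -> f64 {
    // Mean of the 8x8 block.
    let mean = block.iter().map(|&p| p as f64).sum::<f64>() / 64.0;
    // Population variance of the 8x8 block.
    let variance = block
        .iter()
        .map(|&p| (p as f64 - mean) * (p as f64 - mean))
        .sum::<f64>()
        / 64.0;
    // Sixth root of the variance == cube root of the standard deviation.
    variance.powf(1.0 / 6.0)
}
```

A perfectly flat block has zero variance and therefore zero activity; busier blocks score higher.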
Several methods of quantifying the activity of an 8x8 block were tried. One was to perform a forward transform (DCT) on the image, apply the Contrast Sensitivity Function (CSF) with the DC coefficient zeroed, and accumulate the result to get a measure of activity. The distribution of activity was too polarized this way. Taking the square root of the activity map distributed activity more evenly across the map, but edges were still being recognized as areas of high activity and were losing too much quality during activity masking. Ignoring the low frequencies in the top-left corner of the DCT made hard edges contribute less to activity.
Description | IENA | Sking
---|---|---
Original mask - Variance | |
Square root of activity | |
Square root of activity with low frequencies filtered | |
The AWCY run comparing the activity mask implementation with the reference shows that encoding time has increased significantly, with no increase in quality after activity masking.
A simple variance-based mask looks promising in terms of encoding-time cost, based on this AWCY run.
Looking at a few sample test masks, the activity values range between 0 and 2.5. A Python script to filter the map is included below as "mask-filter.ipynb".
The sample masks selected for analysis are "IENA_-Avenches-_6.y4m", "Sking.y4m", and "US_Open_Tennis_2010_1st_Round_046.y4m" from the "subset1-y4m" video set, which is publicly available in derf's media test collection.
With this mask, we can identify ranges of absolute activity values and the kind of content they correspond to. Blocks with activity 0.0-1.0 are plain areas, where the gain in perceptual quality per bit spent is low. Blocks with activity 1.75-2.5 are mostly hard edges, which do not tolerate loss in quality well. The range of activity expected to produce an increase in perceptual quality through activity masking lies between 1.0 and 1.75.
A. Image | B. Edges(1.75 - 2.50) | C. Plain Areas(0.00 - 1.00) | (A - B - C) Activity mask considered |
---|---|---|---|
Blocks with activity in the range 1.00 to 1.75 participate in activity masking in proportion to their activity, while blocks with lower or higher activity are clamped at these boundaries.
Distortion in RDO is substituted with perceived distortion based on the activity of the block: the higher the activity, the lower the perceived distortion compared to the actual distortion. For blocks with activities outside the range mentioned before, activity masking is disabled. The implementation details can be tracked in this pull request.
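The exact bias formula lives in the linked pull request; the following is only one plausible shape of it, under assumed names, showing the clamping to the 1.00-1.75 range and the "higher activity, lower perceived distortion" behaviour, with `scale` standing in for the 0.75 / 0.70 parameter discussed below:

```rust
/// Illustrative sketch only: the clamping range matches the text, but
/// the exact discounting formula is an assumption, not rav1e's actual one.
fn perceived_distortion(distortion: f64, activity: f64, scale: f64) -> f64 {
    // Activities outside [1.0, 1.75] are clamped at the boundaries,
    // so plain areas and hard edges get the boundary bias.
    let a = activity.clamp(1.0, 1.75);
    // Higher activity => lower perceived distortion. `scale` controls
    // how strongly activity discounts the distortion.
    distortion / a.powf(scale)
}
```

At activity 1.0 the distortion passes through unchanged; above 1.0 the perceived distortion shrinks, so the encoder tolerates more actual distortion in busy blocks.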
IENA | Blocks losing distortion | Blocks gaining distortion
---|---|---
Scale: 0.75 | |
Scale: 0.70 | |

Sking | Blocks losing distortion | Blocks gaining distortion
---|---|---
Scale: 0.75 | |
Scale: 0.70 | |

US_Open | Blocks losing distortion | Blocks gaining distortion
---|---|---
Scale: 0.75 | |
Scale: 0.70 | |
All blocks with a resulting activity below 1.0 gain distortion and hence quality, while blocks with a resulting activity above 1.0 lose distortion and hence quality. From the images above, it is evident that the blocks losing and gaining distortion are better balanced at scale 0.70 than at 0.75.
The relevant AWCY run shows no improvement in the quality metrics. All images below are encoded at a base quantizer of 128.
Reference | Activity masked |
---|---|
This AWCY run shows no increase in quality when CDEF is disabled and activity masking is performed.
Disabling CDEF completely is not the right way to test activity masking. Since rav1e weights distortion when tuning for Psychovisual (the default), we need to tune for Psnr instead.
This AWCY run shows very little variation in quality as compared to the reference.
A comparison of pure activity bias in RDO against a reference without any other kind of bias shows that the comparison needs to be made at the same rate.
This AWCY run shows that activity masking at all block sizes does not affect the quality metrics or the rate much. All images below are encoded at a base quantizer of 128.
Reference | Activity masked |
---|---|
The rates of the encodes from the above runs differ, so they may not be the best comparison. Justin Nickelsberg added rav1e support to ab_compare in the daala tools at my request. With the help of ab_compare, rate-matched encodes are compared to assess the variation in perceptual quality.
Since the biasing scheme is computed for 8x8 blocks, when masking larger blocks the distortion is biased for each 8x8 sub-block with the chosen scale and summed to obtain the total distortion of the block. This masks activity more accurately than using the mean activity of the large block. It is important to note that transform-domain distortion should be disabled to measure the effects of activity masking more accurately.
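The per-sub-block scheme above can be sketched as follows (names are illustrative, not rav1e's API; `bias` stands in for whatever per-activity scaling is applied):

```rust
/// Bias a large block's distortion per 8x8 sub-block and sum, instead of
/// biasing the whole block once by its mean activity. Sketch only.
fn masked_block_distortion(
    sub_distortions: &[f64], // one distortion value per 8x8 sub-block
    sub_activities: &[f64],  // one activity value per 8x8 sub-block
    bias: fn(f64, f64) -> f64, // per-sub-block bias: (distortion, activity) -> biased
) -> f64 {
    sub_distortions
        .iter()
        .zip(sub_activities)
        .map(|(&d, &a)| bias(d, a))
        .sum()
}
```

This way a 32x32 block containing both a flat region and a textured region is biased region by region, rather than by a single averaged activity that represents neither.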
Reference | Activity masked |
---|---|
The current implementation of adaptive quantization works well with temporal block-importance biasing and is based on the RDO loop. But activity masking could be trialled independently of the RDO and block-importance biases. This requires tuning activity-based adaptive quantization separately, and also finding a way for it to coexist with the other factors influencing the quantizer offsets. An absolute biasing model based on the computed activity mask would yield higher perceptual quality.
The current implementation of the activity mask uses two-dimensional vectors, which makes storing and accessing activity values slow. This has led to a constant increase in encoding time due to activity masking, as seen in this AWCY run. Once the implementation of activity masking is finalized, this computation could be optimized by using flat arrays. It could also be moved to a pre-processing stage and run in parallel before the actual encoding of the frames starts.
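The flat-array idea amounts to replacing `Vec<Vec<f64>>` with a single row-major buffer (a sketch under assumed names, not the finalized design):

```rust
/// Sketch of the proposed storage: one contiguous, row-major buffer of
/// per-8x8-block activities instead of nested Vecs. A single allocation
/// avoids the pointer chase and poor cache locality of Vec<Vec<f64>>.
struct ActivityMask {
    cols: usize,      // number of 8x8 blocks per row
    values: Vec<f64>, // rows * cols activities, row-major
}

impl ActivityMask {
    fn get(&self, row: usize, col: usize) -> f64 {
        // Row-major indexing into the flat buffer.
        self.values[row * self.cols + col]
    }
}
```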
Providing a way to control the amount of activity masking based on the speed settings and tune preferences, through both the CLI and the API, would ensure good usage of activity masking.