This pseudo-code illustrates the use of row buffer (a memory pool of rows of pixels, each indexed with a key that can be treated as the row number of an abstract 2D image), and how it can be used to enhance the next generation of OpenCV FilterEngine.
"The next generation of OpenCV FilterEngine" is an evolutionary proposal. See: https://github.com/opencv/opencv/wiki/OE-17.-New-Filter-Engine
The name is a silly misnomer. There is no Wiener. It is actually an adaptive blend, between the sharp (the identity function applied to input) and smoothed (the smoothing function applied to input), that is weighted by a function (linear or non-linear monotonic mathematical function) of the local 3x3 variance.
This algorithm (with this silly misnomer) appeared in many image processing publications from decades ago. It is no longer considered useful today.
This algorithm is chosen solely for the purpose of illustration, because the dataflow of this algorithm is slightly non-trivial, which testifies to the ability of the "row buffer" concept to handle such dataflow.
Input
--> Variance (3x3 neighborhood)
--> Identity (1x1 neighborhood)
--> Smoothed (3x3 neighborhood)
(Variance, Identity, Smoothed)
--> Output (1x1 neighborhood)
Note: this is pseudo-code. It is not intended to be a syntactically-valid programming language.
let varianceBuffer =
RowBuffer of element typename intermed (
width , row count = 3, row count not growable)
let smoothBuffer = ... // same as varianceBuffer
for pseudoRow = 0 : (height + 1)
bool doVariance = (pseudoRow < height);
if (doVariance)
int varianceRow = pseudoRow;
varianceBuffer.insert (
sending to varianceRow,
with the content of invoking
Three_Row_Variance_Func (
input.get( clampedRowRange(varianceRow - 1) ),
input.get( varianceRow ),
input.get( clampedRowRange(varianceRow + 1) )
) ) ;
end if
bool doSmoothed = (pseudoRow < height); // it just coincidently same as doVariance
if (doSmoothed)
// almost same code, but calling Three_Row_Smooth_Func instead.
end if
bool doAdaptiveBlend = ((pseudoRow - 1) is in range (0 : height))
if (doAdaptiveBlend)
int adaptiveBlendRow = (pseudoRow - 1);
let varianceContent of type SingleRowContent = varianceBuffer.get ( adaptiveBlendRow );
let identityContent of type SingleRowContent = input.get ( adaptiveBlendRow );
let smoothedContent of type SingleRowContent = smoothBuffer.get ( adaptiveBlendRow );
output.write (
sending to adaptiveBlendRow,
with the content of invoking, pixelwise,
if variance is low, select smoothed value, otherwise select identity
) ;
//
// HERE BE DRAGONS 🐉
//
bool doEvict = ((adaptiveBlendRow - 1) is in range (0 : height))
if (doEvict)
let evictRow = ( adaptiveBlendRow - 1 );
varianceBuffer.evict ( evictRow );
smoothBuffer.evict ( evictRow );
end if
//
// At this point, the number of non-empty rows
// in both varianceBuffer and smoothBuffer
// shall be equal to two, always.
//
end if
end for
Let's try this.
Modify the 3x3 variance function, so that it becomes
Input
--> Sum of Squared Elements (3x3 neighborhood)
--> Sum of Elements (3x3 neighborhood)
(Sum of Squared Elements, Sum of Elements)
--> Variance (1x1 neighborhood)
Where,
Variance = (mean of squared elements) minus (squared(mean of elements))
Where,
mean of squared elements = (sum of squared elements) divided by (the number of elements, which is 9),
etc.
Likewise, we redefine the 3x3 smooth function as a multi-step dataflow.
With such frivolous over-complication of a simple algorithm, the new dataflow (including everything above) can still be implemented using the "row buffer" concept. It just needs allocating a few more instnaces of row buffers.
The proof is left as an implementation exercise for the reader.
Trust me, it's doable.
All worker instances have shared read access to input, and write access to non-overlapping sections of the output. Anything else, such as the various row buffers, will be owned by each worker. In other words, each worker will own an entire set of the variety of row buffers.
Imagine, being able to process images at the smaller of (L1 bandwidth) or (half the L2 bandwidth) of the CPU (or aggregate bandwidth if multicore). All intermediate data stays in L1, except for very wide images, or for CPUs with small L1.