Sums of squares can be rather non-intuitive

Signal 98 % + Noise 20 % = 100 % ?

Data is often split into "signal" + "noise", with

|data|^2 = |signal|^2 + |noise|^2
where |x|^2 = x1^2 + x2^2 + ... , the sum of squares.

Sums of squares can be rather non-intuitive. For example,

101^2    = 99^2       + 20^2
|data|^2 = |signal|^2 + |noise|^2
|signal| / |data|  = 99 / 101 = 98 %  -- sounds good
|noise|  / |data|  = 20 / 101 = 20 %  -- not so good
|signal| / |noise| = 99 /  20 = 5     -- ?
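
A minimal sketch of that arithmetic in Python (only numpy assumed, numbers as above):

    import numpy as np

    # 99-20-101 example: signal and noise orthogonal, so squared norms add exactly
    signal_norm, noise_norm = 99.0, 20.0
    data_norm = np.hypot(signal_norm, noise_norm)          # sqrt(99^2 + 20^2) = 101

    print("|signal| / |data| :", signal_norm / data_norm)  # ~ 0.98
    print("|noise|  / |data| :", noise_norm / data_norm)   # ~ 0.20
    print("|signal| / |noise|:", signal_norm / noise_norm) # ~ 5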

Giving only one of these ratios -- 98 %, 20 %, 5 -- is misleading. Giving all 3 numbers, though, can be confusing.

What to do ? I like to print / plot the "signal" unsquared, and perhaps squared too, e.g.

PCA eigenvalues %: [  2   4   6   8   9  11  12  13  15  16 ...
PCA variance %:    [  9  17  23  28  32  36  39  42  45  47 ...
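
The two lines above were presumably printed from the unsquared singular values and from their squares, the variances. A sketch of how one might produce both scales, with made-up data and assuming cumulative percentages:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))           # toy data, rows = samples
    X -= X.mean(axis=0)                          # center before PCA

    sigma = np.linalg.svd(X, compute_uv=False)   # singular values, descending (~ |signal|)
    var = sigma ** 2                             # their squares = variances

    # cumulative percentages on both scales (an assumption about the printout above)
    sing_pct = 100 * np.cumsum(sigma) / sigma.sum()
    var_pct  = 100 * np.cumsum(var)   / var.sum()
    print("PCA singular values %:", sing_pct.round().astype(int))
    print("PCA variance %:       ", var_pct.round().astype(int))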

R squared

Statisticians use a ratio called "R squared", which in this context is |signal|^2 / |data|^2, e.g. 99^2 / 101^2 = 96 % -- impressive ?
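
A quick sketch with made-up vectors: a random "signal" plus ~20 % independent "noise" is nearly orthogonal, so the squared norms nearly add and R squared comes out around 96 %:

    import numpy as np

    rng = np.random.default_rng(1)
    signal = rng.standard_normal(10_000)
    noise  = 0.2 * rng.standard_normal(10_000)   # ~ 20 % noise
    data   = signal + noise

    R2 = np.sum(signal**2) / np.sum(data**2)     # |signal|^2 / |data|^2
    print("R squared ~", round(R2, 3))           # about 0.96, i.e. 99^2 / 101^2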

R^2 gives the 'percentage of variance explained' by the regression, an expression that, for most social scientists, is of doubtful meaning but great rhetorical value.

-- Wikipedia, Explained variation

Least squares

For a lovely picture of the squares that least-squares minimizes, see the Wikipedia article Coefficient of determination.
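
A toy least-squares fit (made-up data, numpy's lstsq) shows where the squares live: the fit minimizes |noise|^2, the sum of squared residuals, and with an intercept the usual R^2 = 1 - |residual|^2 / |centered data|^2 is the same ratio as |signal|^2 / |data|^2 above:

    import numpy as np

    rng = np.random.default_rng(2)
    x = np.linspace(0, 1, 50)
    y = 2 * x + 1 + 0.1 * rng.standard_normal(50)      # hypothetical line + noise

    A = np.column_stack([x, np.ones_like(x)])          # design matrix [x, 1]
    coef, rss, *_ = np.linalg.lstsq(A, y, rcond=None)  # rss = sum of squared residuals

    R2 = 1 - rss[0] / np.sum((y - y.mean())**2)        # coefficient of determination
    print("slope, intercept:", coef.round(3), " R^2:", round(R2, 3))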

("Why sums of squares ?" I don't know of a brief answer for laymen, beyond "nice math", "commonly used"; comments welcome.)

cheers
-- denis
