suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
})
source( "log1p_danger.R" )
Analyzing data that arises from processes that cumulatively multiply by
ratios tends to lead to exponential behavior. Since the logarithm turns those
multiplicative effects into additive ones, a log transform is the natural way
to linearize such data for analysis. However, log of zero is -Inf and log of a
negative number is undefined, so zero or negative values cannot simply be
transformed along with the rest of the data.
One way such zero or negative values can appear in the data is through interference from other sources, such as electrical noise in the measurement devices. It is usually more useful to exclude data that is dominated by such interference than to force it into the analysis.
It is also common to take logarithms of values very close to 1, which produces
outputs very close to zero. Because most of the significant information in
such inputs lies in deviations from 1 that are much smaller than 1, it can be
useful not to add the 1 to the input at all. Thus, R provides log1p( x ),
which computes log( 1 + x ) accurately even when x is so small that 1 + x
would round off to exactly 1 in floating point.
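For example (base R only; the particular value of eps below is arbitrary, chosen to be smaller than double precision can distinguish next to 1):
eps <- 1e-17      # a deviation from 1 too small to survive the addition 1 + eps
log( 1 + eps )    # 1 + eps rounds to exactly 1, so the deviation is lost and this returns 0
log1p( eps )      # keeps the information carried by eps (approximately 1e-17)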
However, some analysts, casting about for some way to deal with
apparently-exponential data that happens to include zeros, suggest using
this function as a replacement for log.
The plot below illustrates how a distribution may not be distorted too much
by using log1p instead of log, as long as the typical values are large
compared to 1:
# two example count distributions with very different rates
cities <- data.frame(
      city = c( "memphis", "boston" )
    , lambda = c( 306/12, 36/12 )
    )
# make_dens_df() and plot_dens_df() come from log1p_danger.R;
# densities are evaluated at counts k = 0:60
dens_df1 <- make_dens_df( cities = cities, k0 = 0:60 )
plot_dens_df( dens_df1 )
The two transformations are similar for values well above 1 (far from the point where log( 1 ) = 0), but near that point they can differ noticeably. At the scale of k in this data the two curves are not very different, but in data sets with values closer to zero the error can be much more significant.
The issue with using log1p to “fix” zeroes is that assuming the data come from an exponential process that a logarithm can linearize, while simultaneously assuming the data can be exactly zero, are fundamentally inconsistent assumptions, and adding a constant before taking the logarithm does not change that.
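The size of the discrepancy is easy to tabulate directly; the counts chosen below are arbitrary:
k <- c( 0, 1, 2, 5, 25, 100 )
data.frame( k = k
          , log_k = log( k )      # -Inf at zero: the transform refuses the inconsistent assumption
          , log1p_k = log1p( k )  # finite at zero, but shifted relative to log( k ) for small k
          )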
This distortion can be easier to see in a regression between two
variables. If we try to “fix” the zeros using log1p, then our results
will be warped to varying degrees, depending on the scale of the data and
on where in it the zeros appear.
If a variable y is related to x according to an exponential relationship such as y = a * exp( b * x ), then regressing log( y ) on x should recover log( a ) as the intercept and b as the slope. The scales below vary the magnitude a while keeping the decay rate b fixed:
scales <- (
    data.frame(
          scale = c( "small", "normalized", "large" )
        , a = c( 0.1, 1, 10 )        # magnitude of y
        , b = c( -0.5, -0.5, -0.5 )  # common decay rate
        , stringsAsFactors = FALSE
        )
    %>% mutate( scale = factor( scale, levels = scale ) )  # preserve this ordering of the scale levels
)
x0 <- 0:5                                 # common x values for all three scales
resid0 <- rnorm( length( x0 ), 0, 0.01 )  # small shared noise (no seed is set, so exact values vary per run)
an_minor <- analysis1(
      scales = scales
    , x0 = x0
    , resid0 = resid0
    , which_zeros = set_zero_max_x   # zero out the observations at the largest x, i.e. the smallest y values
    )
# outputs
autoplot( an_minor, method = "dta0" )
autoplot( an_minor, method = "dta1" )
Note the “knee” in the transformed data, particularly evident in the large scale case. Even though the fitted answers are “sort of okay,” this glitch can only reduce the validity of the results.
autoplot( an_minor, method = "fits" )
The “exclude zeros and use log” strategy, represented by log_y above,
consistently recovers the parameters that were originally used to
simulate the data, while the “include zeros and use log1p” strategy,
represented by log1p_y, only gets into the ballpark for large-magnitude
values.
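The heart of that comparison can be sketched directly with lm(). This is not the analysis1 implementation from log1p_danger.R, just a minimal stand-alone illustration of the two strategies under one assumed case (a = 0.1, b = -0.5, the “small” scale, with the smallest y recorded as zero):
a <- 0.1 ; b <- -0.5
x <- 0:5
y <- a * exp( b * x + rnorm( length( x ), 0, 0.01 ) )  # simulated exponential decay
y[ x == max( x ) ] <- 0                                # the smallest y gets recorded as a zero
dta <- data.frame( x, y )
fit_log   <- lm( log( y ) ~ x, data = subset( dta, y > 0 ) )  # exclude zeros, use log
fit_log1p <- lm( log1p( y ) ~ x, data = dta )                 # keep zeros, use log1p
rbind( log_y = coef( fit_log ), log1p_y = coef( fit_log1p ) )
# exp( intercept ) estimates a and the slope estimates b;
# the log fit lands near ( 0.1, -0.5 ) while the log1p fit does not come close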
If the zeros do not arise because of small deviations from the true “y”,
then the use of log1p
can be dramatically invalid:
an_major <- analysis1(
      scales = scales
    , x0 = x0
    , resid0 = resid0
    , which_zeros = set_zero_min_x   # zero out the observations at the smallest x, i.e. the largest y values
    )
# outputs
autoplot( an_major, method = "dta0" )
autoplot( an_major, method = "dta1" )
autoplot( an_major, method = "fits" )
Once again, tossing the zeros instead of using log1p
yields reasonable
results in all cases.
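Continuing the minimal sketch from above (same assumed a, b, and x, restated so the chunk stands alone), moving the zero onto the largest y value instead shows how far off the log1p strategy can land:
a <- 0.1 ; b <- -0.5 ; x <- 0:5                          # same assumptions as the earlier sketch
y2 <- a * exp( b * x + rnorm( length( x ), 0, 0.01 ) )
y2[ x == min( x ) ] <- 0                                 # now the largest y is the one recorded as zero
dta2 <- data.frame( x, y = y2 )
coef( lm( log( y ) ~ x, data = subset( dta2, y > 0 ) ) ) # still recovers log( a ) and b
coef( lm( log1p( y ) ~ x, data = dta2 ) )                # slope bears little relation to b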
Even when the distortion of log1p is small, the fact is that it is not
a good substitute for log when the underlying process is truly exponential.