
Danger using log1p

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(ggplot2)
  library(broom)
})
source( "log1p_danger.R" )

Transformations

Data that arises from processes that cumulatively multiply by ratios tends to exhibit exponential behavior. Since $\log{(x \cdot y)}=\log{x}+\log{y}$, it is often useful to take the logarithm of the data before analyzing it. However, sometimes data that otherwise looks exponential includes values that are zero… and $\log{0} \rightarrow -\infty$, which makes the analysis more complicated rather than simpler.
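
For example, a quantity that grows by a constant ratio per step becomes a straight line after taking logs, which is what makes the transformation attractive (the starting value and growth ratio below are arbitrary, chosen only for illustration):

x <- 0:20
y <- 10 * 1.2^x            # multiplicative growth: starts at 10, times 1.2 per step
coef( lm( log( y ) ~ x ) ) # intercept = log( 10 ), slope = log( 1.2 )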

One way such zero or negative values can appear in the data is interference from other sources such as electrical noise in the measurement devices. It is more useful to ignore data that is dominated by such interference than to force it into the analysis.

It is also common to take logarithms of values very close to 1, which produces outputs very close to zero. Because most of the significant information in such inputs is in the deviation from 1, which is much smaller than 1, it can be useful to never explicitly add the 1 to the input at all. For this purpose, numerical libraries define $\mathrm{log1p}{(x)}=\log{(1+x)}$, computed in a way that preserves precision when $x$ is tiny.
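
A quick illustration of the precision issue (the value of x here is arbitrary, chosen only to show the rounding behavior):

x <- 1e-10
log( 1 + x )  # 1 + x is rounded before the log is taken, losing several digits
log1p( x )    # essentially full precision: very close to 1e-10
log1p( 0 )    # exactly 0 and finite, whereas log( 0 ) is -Inf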

However, some analysts, casting about for a way to deal with apparently-exponential data that happens to include zeros, suggest using this function as a replacement for $\log{x}$. Unfortunately, in most cases this approximation distorts the overall data and, most importantly, the values that cause the distortion convey little information from the original data set.

Distributions

This plot illustrates how a distribution may not be distorted too much by using $\mathrm{log1p}{(x)}$ in place of $\log{x}$ if the zeroes are in fact arising as part of the primary exponential process:

cities <- data.frame(
  city = c( "memphis", "boston" )
  , lambda = c( 306/12, 36/12 )
)
dens_df1 <- make_dens_df( cities = cities, k0 = 0:60 )
plot_dens_df( dens_df1 )

The two transformations are similar for counts well above zero, but for counts near zero they can differ noticeably. At the scale of k in this data the approximations are not so different, but in other data sets with more values close to zero the error can be much more significant.
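
To put rough numbers on that difference, here are the two transforms evaluated at a few counts in the range plotted above:

k <- c( 1, 3, 25, 60 )
data.frame( k = k, log_k = log( k ), log1p_k = log1p( k ) )
# k =  1:  log = 0.000,  log1p = 0.693
# k =  3:  log = 1.099,  log1p = 1.386
# k = 25:  log = 3.219,  log1p = 3.258
# k = 60:  log = 4.094,  log1p = 4.111

At k = 1 the two disagree by the full value of log(2), while by k = 60 the difference is under half a percent of the transformed value.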

The issue with using log1p to “fix” zeroes is that the assumption that the data come from an exponential process that a logarithm can simplify and the assumption that the data can be zero are fundamentally inconsistent, and adding a constant to the data does not change that.
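
A minimal sketch of this point (with made-up parameter values): if one reading of an otherwise exponential series is zero, then no choice of constant c makes log( y + c ) recover the true slope, while simply dropping the zero and using log( y ) does:

x <- 0:5
y <- exp( -0.5 * x )   # true relationship, b = -0.5
y[ 6 ] <- 0            # pretend the last (smallest) reading came back as zero
# slope estimates for several arbitrary "fix the zero" constants
sapply( c( 1, 0.1, 0.01 ), function( cc ) coef( lm( log( y + cc ) ~ x ) )[[ "x" ]] )
# dropping the zero instead recovers the true slope of -0.5
xs <- x[ -6 ]
ys <- y[ -6 ]
coef( lm( log( ys ) ~ xs ) )[[ "xs" ]]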

Regression

This distortion can be easier to see in a regression between two variables. If we try to “fix” the zeros using log1p, the results are warped to varying degrees depending on the scale of the data and on where in the data the zeros appear.

Noise

If a variable y is related to x according to $y=a\exp(bx)$, with an additive measurement error that allows some readings of y to be zero, then we can simulate such data at several scales:

scales <- (
    data.frame(
      scale = c( "small", "normalized", "large" )
    , a = c( 0.1, 1, 10 )
    , b = c( -0.5, -0.5, -0.5 )
    , stringsAsFactors = FALSE
    )
%>% mutate( scale = factor( scale, levels = scale ) )
)
x0 <- 0:5
resid0 <- rnorm( length( x0 ), 0, 0.01 )
an_minor <- analysis1(
  scales = scales
, x0 = x0
, resid0 = resid0
, which_zeros = set_zero_max_x # smallest y values
)

# outputs
autoplot( an_minor, method = "dta0" )

autoplot( an_minor, method = "dta1" )

Note the “knee” that is particularly evident in the large scale case… even though the answers are “sort of okay”, this glitch can only reduce the validity of the results.

autoplot( an_minor, method = "fits" )

The “exclude zeros and use log” strategy, represented by log_y above, consistently extracts the parameters that were originally used to simulate the results, while the “include zeros and use log1p” strategy, represented by log1p_y, only gets into the ballpark for large-magnitude values.
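
As a rough cross-check outside of analysis1 (a hand-rolled sketch of the “normalized” case with a = 1 and b = -0.5; the seed and noise draw here are arbitrary, so exact numbers will differ from the plots above):

set.seed( 42 )                          # arbitrary seed for this sketch only
x <- 0:5
y <- 1 * exp( -0.5 * x ) + rnorm( length( x ), 0, 0.01 )
y[ which.max( x ) ] <- 0                # force the smallest reading to zero, as in an_minor
keep <- y > 0
coef( lm( log( y[ keep ] ) ~ x[ keep ] ) )  # intercept near log( 1 ) = 0, slope near -0.5
coef( lm( log1p( y ) ~ x ) )                # typically well off both true values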

Random large interference

If the zeros do not arise because of small deviations from the true “y”, then the use of log1p can be dramatically invalid:

an_major <- analysis1(
  scales = scales
, x0 = x0
, resid0 = resid0
, which_zeros = set_zero_min_x # largest y values
)

# outputs
autoplot( an_major, method = "dta0" )

autoplot( an_major, method = "dta1" )

Again, note the “knee” in the large scale case… even though the answers may look “sort of okay”, this glitch can only reduce the validity of the results.

autoplot( an_major, method = "fits" )

Once again, tossing the zeros instead of using log1p yields reasonable results in all cases.

Conclusion

Even when the distortion introduced by log1p is small, it is not a good substitute for $\log{x}$. Removing the zeros gives a more reliable characterization of the primary behavior of the data than leaving them in does. If you need to characterize the zeros, use a separate analysis of those data points that does not assume they arise from an exponential process that a logarithm can invert.
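
One way such a separate analysis might look (a hedged sketch only, not something implemented in log1p_danger.R) is a two-part treatment: model whether a reading is zero with a logistic regression, and fit the log-linear model only to the nonzero readings:

# hypothetical data frame dta with columns x and y, simulated here for illustration
set.seed( 2 )
dta <- data.frame( x = rep( 0:5, times = 20 ) )
dta$y <- ifelse(
  runif( nrow( dta ) ) < 0.1   # some readings wiped out by interference
  , 0
  , exp( -0.5 * dta$x ) * exp( rnorm( nrow( dta ), 0, 0.1 ) )
)
fit_zero <- glm( I( y == 0 ) ~ x, family = binomial, data = dta )  # chance of a zero
fit_pos  <- lm( log( y ) ~ x, data = subset( dta, y > 0 ) )        # exponential part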

# log1p_danger.R
# Requires dplyr, tidyr, ggplot2, and broom to be attached before sourcing
# (the calling document loads them in its setup chunk).

make_dens <- function( lambda, k ) {
  # Poisson probability mass over counts k, with log and log1p transforms of k
  (
    data.frame(
      k = k
      , murd = dpois( k, lambda )
    )
    %>% mutate(
      log_k = log( k )
      , log1p_k = log1p( k )
    )
  )
}
make_dens_df <- function( cities, k0 ) {
  # Build a long-format table of Poisson densities for each city, with the
  # count axis expressed as both log( k ) and log1p( k )
  (
    cities
    %>% rowwise()
    %>% mutate(
      data = list(
        make_dens( lambda, k0 )
      )
    )
    %>% ungroup()
    %>% unnest( cols = "data" )
    %>% select( -lambda, -k )
    %>% pivot_longer(
      -c( city, murd )
      , names_to = "measure"
      , values_to = "value"
    )
    %>% mutate(
      measure = factor(
        measure
        , levels = c( "log_k", "log1p_k" )
      )
    )
  )
}
plot_dens_df <- function( dens_df ) {
  # Density vs transformed count, one panel per city, log-scaled y axis
  (
    ggplot(
      dens_df
      , aes( x = value, y = murd, color = measure )
    )
    + geom_line()
    + scale_y_continuous( trans = "log" )
    + facet_wrap(
      ~ city
      , nrow = 1
      , scales = "free"
    )
    + labs(
      y = "Murder probability density"
      , x = "Murder count or log count or log1p count"
    )
  )
}
make_dta <- function( a, b, x, add_resid_sd_v, which_zeros ) {
  # Simulate y = a * exp( b * x ) plus additive noise scaled by a, force the
  # readings selected by which_zeros to zero, and compute both transforms of y
  (
    data.frame(
      x = x
      , y = a * exp( b * x ) + a * add_resid_sd_v
    )
    %>% mutate(
      y = ifelse( which_zeros( x, y ), 0, y )
      , log_y = log( y )
      , log_y = ifelse( is.finite( log_y ), log_y, NA )
      , log1p_y = log1p( y )
    )
  )
}
build_dta0 <- function( scales, x0, resid0, which_zeros ) {
  # One simulated data set per row of scales (small / normalized / large)
  (
    scales
    %>% rowwise()
    %>% mutate(
      data = list(
        make_dta(
          a
          , b
          , x = x0
          , add_resid_sd_v = resid0
          , which_zeros = which_zeros
        )
      )
    )
    %>% ungroup()
    %>% unnest( cols = "data" )
  )
}
build_dta1 <- function( dta0 ) {
  # Reshape to long format with one row per ( scale, x, measure ) combination
  tmp <- (
    dta0
    %>% select( -a, -b, -y )
  )
  (
    tmp
    %>% pivot_longer(
      -c( scale, x )
      , names_to = "measure"
      , values_to = "value"
    )
    %>% mutate(
      measure = factor(
        measure
        , levels = names( tmp )[ -(1:2) ]
      )
    )
  )
}
build_fits <- function( dta1 ) {
  # Fit lm( value ~ x ) for each ( scale, measure ) and tidy the coefficients
  (
    dta1
    %>% nest( data = c( x, value ) )
    %>% rowwise()
    %>% mutate(
      fit = list( lm( value ~ x, data = data ) )
      , summary = list( tidy( fit, conf.int = TRUE ) )
    )
    %>% ungroup()
    %>% select( scale, measure, summary )
    %>% unnest( cols = "summary" )
  )
}
plot_dta0 <- function( dta0 ) {
  # Raw simulated y vs x, one panel per scale
  (
    ggplot(
      dta0
      , aes( x = x, y = y, color = scale )
    )
    + geom_line()
    + geom_point()
    + facet_wrap(
      ~ scale
      , ncol = 1
      , scales = "free_y"
    )
  )
}
plot_dta1 <- function( dta1 ) {
  # Transformed values vs x with linear fits, one panel per scale
  (
    ggplot(
      dta1
      , aes( x = x, y = value, color = measure )
    )
    + geom_line()
    + geom_point()
    + geom_smooth( method = "lm", formula = y ~ x )
    + facet_wrap(
      ~ scale
      , nrow = 1
    )
  )
}
build_targets <- function( scales, dta1 ) {
  # True parameter values implied by the simulation: intercept = log( a ), slope = b
  (
    expand.grid(
      scale = scales$scale
      , measure = unique( dta1$measure )
    )
    %>% inner_join( scales, by = "scale" )
    %>% mutate( a = log( a ) )
    %>% rename( `(Intercept)` = a, x = b )
    %>% gather( term, xintercept, -c( scale, measure ) )
  )
}
plot_fits <- function( fits, targets ) {
  # Coefficient estimates with confidence intervals; vertical lines mark the
  # true values used to simulate the data
  (
    ggplot(
      fits
      , aes(
        x = estimate
        , y = measure
        , xmin = conf.low
        , xmax = conf.high
        , height = 0
        , color = measure
      )
    )
    + geom_vline(
      data = targets
      , mapping = aes( xintercept = xintercept, color = measure )
    )
    + geom_point()
    + geom_errorbarh()
    + facet_grid( term ~ scale, scales = "free_x" )
  )
}
set_zero_max_x <- function( x, y ) {
  # TRUE where x is largest (smallest y when b < 0) or y is already non-positive
  max( x ) == x | y <= 0
}
set_zero_min_x <- function( x, y ) {
  # TRUE where x is smallest (largest y when b < 0) or y is already non-positive
  min( x ) == x | y <= 0
}
analysis1 <- function( scales, x0, resid0, which_zeros, seed = 1 ) {
  set.seed( seed )
  # calcs
  dta0 <- build_dta0(
    scales = scales
    , x0 = x0
    , resid0 = resid0
    , which_zeros = which_zeros
  )
  dta1 <- build_dta1( dta0 )
  fits <- build_fits( dta1 )
  targets <- build_targets( scales, dta1 )
  result <- list(
    scales = scales
    , resid0 = resid0
    , which_zeros = which_zeros
    , dta0 = dta0
    , dta1 = dta1
    , fits = fits
    , targets = targets
  )
  class( result ) <- "analysis1"
  result
}
autoplot.analysis1 <- function( object, method = "fits", ... ) {
  # Dispatch plotting of an analysis1 object by method name
  if ( "fits" == method ) {
    plot_fits( fits = object$fits, targets = object$targets )
  } else if ( "dta0" == method ) {
    plot_dta0( object$dta0 )
  } else if ( "dta1" == method ) {
    plot_dta1( object$dta1 )
  } else {
    stop( sprintf( "method '%s' not defined in autoplot.analysis1", method ) )
  }
}