suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
})
source( "log1p_danger.R" )
Analyzing data that arises from processes that cumulatively multiply by
ratios tends to lead to exponential behavior. Since the logarithm turns those
multiplicative effects into additive ones, a log transform is the natural way
to linearize such data for analysis. However, log of zero is -Inf and log of a
negative number is undefined, so zero or negative values cannot simply be
transformed along with the rest of the data.
One way such zero or negative values can appear in the data is through interference from other sources, such as electrical noise in the measurement devices. It is usually more useful to exclude data that is dominated by such interference than to force it into the analysis.
It is also common to take logarithms of values very close to 1, which produces
outputs very close to zero. Because most of the significant information in
such inputs lies in deviations from 1 that are much smaller than 1, it can be
useful not to add the 1 to the input at all. Thus, R provides log1p( x ),
which computes log( 1 + x ) accurately even when x is so small that 1 + x
would round off to exactly 1 in floating point.
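For example (base R only; the particular value of eps below is arbitrary, chosen to be smaller than double precision can distinguish next to 1):
eps <- 1e-17      # a deviation from 1 too small to survive the addition 1 + eps
log( 1 + eps )    # 1 + eps rounds to exactly 1, so the deviation is lost and this returns 0
log1p( eps )      # keeps the information carried by eps (approximately 1e-17)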
However, some analysts, casting about for some way to deal with
apparently-exponential data that happens to include zeros, suggest using
this function as a replacement for log.
The plot below illustrates how a distribution may not be distorted too much
by using log1p instead of log, as long as the typical values are large
compared to 1:
# two example count distributions with very different rates
cities <- data.frame(
      city = c( "memphis", "boston" )
    , lambda = c( 306/12, 36/12 )
    )
# make_dens_df() and plot_dens_df() come from log1p_danger.R;
# densities are evaluated at counts k = 0:60
dens_df1 <- make_dens_df( cities = cities, k0 = 0:60 )
plot_dens_df( dens_df1 )
The two transformations are similar for values well above 1 (far from the point where log( 1 ) = 0), but near that point they can differ noticeably. At the scale of k in this data the two curves are not very different, but in data sets with values closer to zero the error can be much more significant.
The issue with using log1p to “fix” zeroes is that assuming the data come from an exponential process that a logarithm can linearize, while simultaneously assuming the data can be exactly zero, are fundamentally inconsistent assumptions, and adding a constant before taking the logarithm does not change that.
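The size of the discrepancy is easy to tabulate directly; the counts chosen below are arbitrary:
k <- c( 0, 1, 2, 5, 25, 100 )
data.frame( k = k
          , log_k = log( k )      # -Inf at zero: the transform refuses the inconsistent assumption
          , log1p_k = log1p( k )  # finite at zero, but shifted relative to log( k ) for small k
          )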
This distortion can be easier to see in a regression between two
variables. If we try to “fix” the zeros using log1p, then our results
will be warped to varying degrees, depending on the scale of the data and
on where in it the zeros appear.
If a variable y is related to x according to an exponential relationship such as y = a * exp( b * x ), then regressing log( y ) on x should recover log( a ) as the intercept and b as the slope. The scales below vary the magnitude a while keeping the decay rate b fixed:
scales <- (
    data.frame(
          scale = c( "small", "normalized", "large" )
        , a = c( 0.1, 1, 10 )        # magnitude of y
        , b = c( -0.5, -0.5, -0.5 )  # common decay rate
        , stringsAsFactors = FALSE
        )
    %>% mutate( scale = factor( scale, levels = scale ) )  # preserve this ordering of the scale levels
)
x0 <- 0:5                                 # common x values for all three scales
resid0 <- rnorm( length( x0 ), 0, 0.01 )  # small shared noise (no seed is set, so exact values vary per run)
an_minor <- analysis1(
      scales = scales
    , x0 = x0
    , resid0 = resid0
    , which_zeros = set_zero_max_x   # zero out the observations at the largest x, i.e. the smallest y values
    )
# outputs
autoplot( an_minor, method = "dta0" )
autoplot( an_minor, method = "dta1" )
Note the “knee” in the transformed data, particularly evident in the large scale case. Even though the fitted answers are “sort of okay,” this glitch can only reduce the validity of the results.
autoplot( an_minor, method = "fits" )
The “exclude zeros and use log” strategy, represented by log_y above,
consistently recovers the parameters that were originally used to
simulate the data, while the “include zeros and use log1p” strategy,
represented by log1p_y, only gets into the ballpark for large-magnitude
values.
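The heart of that comparison can be sketched directly with lm(). This is not the analysis1 implementation from log1p_danger.R, just a minimal stand-alone illustration of the two strategies under one assumed case (a = 0.1, b = -0.5, the “small” scale, with the smallest y recorded as zero):
a <- 0.1 ; b <- -0.5
x <- 0:5
y <- a * exp( b * x + rnorm( length( x ), 0, 0.01 ) )  # simulated exponential decay
y[ x == max( x ) ] <- 0                                # the smallest y gets recorded as a zero
dta <- data.frame( x, y )
fit_log   <- lm( log( y ) ~ x, data = subset( dta, y > 0 ) )  # exclude zeros, use log
fit_log1p <- lm( log1p( y ) ~ x, data = dta )                 # keep zeros, use log1p
rbind( log_y = coef( fit_log ), log1p_y = coef( fit_log1p ) )
# exp( intercept ) estimates a and the slope estimates b;
# the log fit lands near ( 0.1, -0.5 ) while the log1p fit does not come close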
If the zeros do not arise because of small deviations from the true “y”,
then the use of log1p
can be dramatically invalid:
an_major <- analysis1(
      scales = scales
    , x0 = x0
    , resid0 = resid0
    , which_zeros = set_zero_min_x   # zero out the observations at the smallest x, i.e. the largest y values
    )
# outputs
autoplot( an_major, method = "dta0" )
autoplot( an_major, method = "dta1" )
autoplot( an_major, method = "fits" )
Once again, tossing the zeros instead of using log1p
yields reasonable
results in all cases.
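Continuing the minimal sketch from above (same assumed a, b, and x, restated so the chunk stands alone), moving the zero onto the largest y value instead shows how far off the log1p strategy can land:
a <- 0.1 ; b <- -0.5 ; x <- 0:5                          # same assumptions as the earlier sketch
y2 <- a * exp( b * x + rnorm( length( x ), 0, 0.01 ) )
y2[ x == min( x ) ] <- 0                                 # now the largest y is the one recorded as zero
dta2 <- data.frame( x, y = y2 )
coef( lm( log( y ) ~ x, data = subset( dta2, y > 0 ) ) ) # still recovers log( a ) and b
coef( lm( log1p( y ) ~ x, data = dta2 ) )                # slope bears little relation to b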
Even when the distortion of log1p is small, the fact is that it is not
a good substitute for log when the underlying process is truly exponential.