Rapidly sample from arbitrary pdf
I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of R's fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with that excellent post was that the computational cost was enormous: samples of 1,000 draws took 10 seconds on my machine, and repeated samples of 100,000+ draws (which I was after) were clearly unworkable.
Thus I decided to take my own crack at it. First, let us review the basics of drawing random variables from non-uniform distributions. The standard method, which I think most algorithms use, works as follows:
Assumptions
1. You can draw a pseudo-random uniform variable u.
2. You can integrate the pdf to construct a cdf:
$$p = F(x) = \int_{-\infty}^{x} f(t)\, dt$$
3. You can invert the cdf in order to solve for x given p:
$$G(p)=F^{-1}(p)=F^{-1}(F(x))=x$$
The method thus amounts to drawing u, treating it as p, and plugging it into G (the inverse of F) to obtain x.
Now let us assume that we are in a bit of a bind: we can neither integrate f(x) nor invert F(x). The post by Quantitations demonstrates how to handle this case by numerically approximating the cdf and then approximating its inverse. This is computationally intensive.
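To make the standard method concrete, here is a minimal sketch using the exponential distribution, where the inverse cdf has a closed form (the rate is fixed at 1 purely for illustration and is not part of the problem discussed here):
# Inverse transform sampling for the exponential distribution with rate 1:
# F(x) = 1 - exp(-x), so G(p) = F^(-1)(p) = -log(1 - p)
u <- runif(10000)   # step 1: draw pseudo-random uniform variables
x <- -log(1 - u)    # step 3: plug them into the inverse cdf
# The result should look like R's built-in sampler
summary(x)
summary(rexp(10000))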
In contrast, I approximate the cdf by drawing x values and their associated probabilities from the pdf along a user-specified range. I then divide the approximate cdf into a set of bins and apply some summary function (mean or median) to the x values corresponding to each bin, giving a single point value per cdf bin. The gicdf function returns a list of results from this process.
The ricdf function then takes that list and draws values from the approximated inverse cdf. Once gicdf has done its work, ricdf can generate variables nearly as fast as the standard non-uniform random number generators.
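Before looking at the actual functions, here is a stripped-down toy sketch of the same binned-lookup idea (my own illustration, not the gicdf/ricdf code, which appears further down): evaluate the pdf on a grid, rescale, take the cumulative sum, and map uniform draws onto grid points with findInterval.
# Toy illustration of the lookup-table idea (not the implementation below)
xs  <- seq(-4, 4, length.out = 1000)  # grid over the support
ps  <- dnorm(xs)                      # proportional probabilities on the grid
cps <- cumsum(ps / sum(ps))           # approximate cdf on the grid
u   <- runif(5)                       # uniform draws
idx <- pmin(findInterval(u, cps) + 1, length(xs))  # first grid point whose cdf exceeds u
xs[idx]                               # approximate draws from the pdf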
As a matter of comparison, I define the function f as the pdf of the normal distribution (dnorm in R) and draw from it 1,000 times. Using rnorm, ricdf [defined in this post], and samplepdf [defined by Quantitations], I graphically compare the three sets of draws.
[Figure: Three random sampling procedures for the random normal.]
On the graph, black (type=0) is rnorm, dark blue (type=1) is ricdf, and light blue (type=2) is samplepdf. Redrawing the graph several times, I could not visually tell the difference between the three methods.
Comparing run times of the three methods over ten repetitions, rnorm took essentially no time, ricdf (including the gicdf setup) took 8 seconds, and samplepdf took 52. Because ricdf and gicdf are separate functions, with gicdf setting up the table to draw from, the table only needs to be built once per pdf, so ricdf can be benchmarked separately from gicdf. Comparing 1,000 replications of rnorm and ricdf, rnorm took .10 seconds and ricdf took .11 seconds. I did not include samplepdf, which I expected would take around 5,200 seconds (87 minutes).
For fun I have generated several other strange 'proportional' pdfs. I say proportional because, strictly speaking, pdfs are required to integrate to 1. However, gicdf rescales the probabilities to sum to 1 across the range, so this formal requirement does not need to hold for the input function.
gicdf <- function(fun,
                  min = -3.5,
                  max = 3.5,
                  bins = 1000,
                  nqratio = 10,
                  grouping = mean,
                  ...) {
  # Generate an inverse CDF lookup table for an arbitrary function
  fun      <- match.fun(fun)
  grouping <- match.fun(grouping)
  # Number of points to draw
  nq <- nqratio * bins
  # Draw indexes
  qdraw <- seq(min, max, length.out = nq)
  # Calculate the proportional probability of each draw
  pdraw <- fun(qdraw, ...)
  # Rescale the probabilities so they sum to 1
  pdraw <- pdraw / sum(pdraw)
  # Calculate the cumulative probability at each qdraw
  cpdraw <- cumsum(pdraw)
  # Calculate the cumulative probability at each bin
  pbin <- (1:bins) / bins
  xbin <- NA * (1:bins)
  for (i in 1:bins) {
    xbin[i] <- grouping(qdraw[cpdraw < pbin[i] & cpdraw > 0], na.rm = TRUE)
    cpdraw[cpdraw < pbin[i]] <- 2
  }
  (draw.set <- list(digits = floor(log10(bins)), xbin = xbin, pbin = pbin))
}
# Draw from the approximated inverse cdf
ricdf <- function(N, draw.set) {
  digits <- draw.set$digits
  # Round uniform draws up to the bin resolution so they match pbin exactly
  pdraws <- ceiling(runif(N) * 10^digits) / 10^digits
  draw.set$xbin[match(pdraws, draw.set$pbin)]
}
# Define an arbitrary pdf f: to compare, let's start with the normal pdf
f <- function(x) dnorm(x)
library(ggplot2)
# set the number to sample
nsamples <- 1000
# Draw from rnorm
sample0 <- cbind(draw=rnorm(nsamples), type=0)
# Build the inverse cdf lookup table for the range -4 to 4 (using the mean
# grouping approach) and draw from it
sample1 <- cbind(draw=ricdf(nsamples, gicdf(f,min=-4,max=4)), type=1)
# samplepdf and endsign defined on Quantitations blog
# http://blog.quantitations.com/tutorial/2012/11/20/sampling-from-an-arbitrary-density/
# begin Quantitations code
endsign <- function(f, sign = 1) {
  b <- sign
  while (sign * f(b) < 0) b <- 10 * b
  return(b)
}
samplepdf <- function(n, pdf, ..., spdf.lower = -Inf, spdf.upper = Inf) {
  vpdf <- function(v) sapply(v, pdf, ...)  # vectorize
  cdf <- function(x) integrate(vpdf, spdf.lower, x)$value
  invcdf <- function(u) {
    subcdf <- function(t) cdf(t) - u
    if (spdf.lower == -Inf)
      spdf.lower <- endsign(subcdf, -1)
    if (spdf.upper == Inf)
      spdf.upper <- endsign(subcdf)
    return(uniroot(subcdf, lower = spdf.lower, upper = spdf.upper)$root)
  }
  sapply(runif(n), invcdf)
}
# end Quantitations code
sample2 <- cbind(draw=samplepdf(nsamples, f), type=2)
# join the three samples together
samples <- as.data.frame(rbind(sample0, sample1, sample2))
ggplot(samples, aes(x = draw, color = type)) +
  geom_density(aes(group = type, size = -type))
library(rbenchmark)
# Let us do some speed comparisons
benchmark(rnorm(nsamples),
          ricdf(nsamples, gicdf(f, min = -4, max = 4)),
          samplepdf(nsamples, f),
          replications = 10)
#                                           test replications elapsed
# 2 ricdf(nsamples, gicdf(f, min = -4, max = 4))           10    7.97
# 1                              rnorm(nsamples)           10    0.00
# 3                       samplepdf(nsamples, f)           10   52.39
# This makes the entire process of generating and drawing from ricdf about 7 times
# faster than samplepdf. However, if we separate the gicdf function (which only
# needs to be run once for each new pdf) from the ricdf function, we can
# increase run speeds considerably.
myicdf <- gicdf(f,min=-4,max=4)
# Sets up the initial icdf to draw from.
benchmark(rnorm(nsamples), ricdf(nsamples, myicdf), replications=1000)
# elapsed time for rnorm = .10 seconds and ricdf = .11
# thus drawing from ricdf can be extremely fast
# Now let's look at an unusual proportional pdf distribution
f <- function(x) {
  p <- rep(0, length(x))
  p[x>0 & x<1] <- x[x>0 & x<1]^2
  p[x>2 & x<3] <- 3 - x[x>2 & x<3]^.5
  p[x>4 & x<5] <- x[x>4 & x<5]
  p
}
x <- seq(-1,6,.01)
plot(x, f(x), type='l',
     ylab='Proportional Probability',
     main='Proportional pdf')
# A key thing to note is that the pdf does not need to integrate to 1.
# The gicdf function will rescale it to ensure it does.
# However, the function should be non-negative everywhere
# bins is left at its default of 1,000 here; a larger value (e.g. 10,000)
# would give a finer-grained approximation at the cost of setup time
myicdf <- gicdf(f, min=-1, max=6, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))
ggplot(samples, aes(x = draws)) +
  geom_histogram(binwidth = .1) +
  aes(y = ..density..)
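# Added check (not in the original gist): compare the share of draws landing in
# each piece of the support with the mass the rescaled pdf assigns to it
pieces <- rbind(c(0, 1), c(2, 3), c(4, 5))
piece.mass <- apply(pieces, 1, function(r) integrate(f, r[1], r[2])$value)
theoretical <- piece.mass / sum(piece.mass)
observed <- apply(pieces, 1, function(r)
  mean(samples$draws > r[1] & samples$draws < r[2], na.rm = TRUE))
round(cbind(theoretical, observed), 3)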
# Okay, here is one more proportional pdf
f <- function(x) dnorm(x)*abs(1+x^4)
x <- seq(-5,5,.01)
plot(x, f(x), type='l',
     ylab='Proportional Probability',
     main='Proportional pdf2')
myicdf <- gicdf(f,min=-5,max=5, bins=1000)
samples <- data.frame(draws=ricdf(100000,myicdf))
ggplot(samples, aes(x = draws)) +
  geom_histogram(binwidth = .2) +
  aes(y = ..density..)
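# Added check (not in the original gist): compare the mean of the rescaled pdf,
# computed with integrate, against the sample mean of the draws
Z <- integrate(f, -5, 5)$value
theoretical.mean <- integrate(function(x) x * f(x) / Z, -5, 5)$value
c(theoretical = theoretical.mean, observed = mean(samples$draws, na.rm = TRUE))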
# I will leave you to define your own pdfs to draw from.
# If you end up using this method, please cite!