Skip to content

Instantly share code, notes, and snippets.

# EconometricsBySimulation/gist:d0af8273f15b6ccb85a4 Last active Aug 29, 2015

Rapidly sample from arbitrary pdf
 I recently found myself in need of a function to sample randomly from an arbitrarily defined probability density function. An excellent post by Quantitations shows how to accomplish this using some of Rs fairly sophisticated functional approximation tools such as integrate and uniroot. The only problem with this excellent post was that the machine cost was enormous with samples of 1000 draws taking 10 seconds on my machine and repeated samples of 100,000+ draws (which I was after) clearly being unworkable. Thus I decided to take my own crack at it. First let us review the basics of drawing random variables from non-uniform distributions. The standard method I think most algorithms use works as follows: Assumptions 1. You can draw pseudo-random uniform variable u 2. You can integrate the pdf to construct a cdf $$p = F(x) = \int_{-\infty}^\infty f(x) dx$$ 3. You can invert the cdf in order to solve for p $$G(F(x))=F^{-1}(F(x))=F^{-1}(p)=x$$ The method thus relies upon the somewhat simple method of calculating x by drawing u and plugging into G (the inverse of F). Now let us assume that we are in a bit of a bind. We can neither integrate f(x) nor invert F(x). The post by Quantitations demonstrates how to do the operation in this case by directly approximating the cdf followed by approximating the inverse. This is computationally intensive. In contrast I approximate the cdf by drawing x values and associated probabilities from the pdf along a user specified range. I create a matrix of bins over which the approximate cdf is divided into. After that I apply some function (mean or median) over the range of x which corresponds to each bin providing an point value for each cdf bin value. The gacdf returns a list of results from this process. The ricdf function then takes the list of returns and is able to draw values from the approximated inverse cdf. Once the gicdf has completed its operation the ricdf is able to generate variables nearly as fast as that of standard non-uniform random variables. As a matter of comparison, I define the funciton f as the pdf as the pdf of the normal (dnorm) dnorm in R and draw from it 1000 time. Using rnorm, ricdf [defined in this post], and samplepdf [defined Quantitations] I graphically see how the three draws compare. Three random sampling procedures for the random normal. On the graph, Black type=0, is rnorm. Dark blue type=1, is ricdf. Light blue, type=2 is samplepdf. Redrawing the graph several times, visually I could not tell the difference between the three methods. Comparing run times of the three methods over ten repetitions rnorm used no seconds, ricdf (with ficdf) used 8 seconds, and samplepdf used 52. Because ricdf and gicdf are separate functions with gicdf setting up the table to draw from, we only need to set it up once per pdf so ricf can be benchmarked separately from gicfd. Comparing 1000 replications of rnorm and ricdf, rnorm took .1 seconds and ricdf took .11 seconds. I did not compare with samplepdf which I expected would take 5200 seconds (87 minutes). For fun I have generated several other strange 'proportional' pdfs. I say proportional because strictly speaking pdfs are required to integrate to 1. However, gicdf forces the probabilities to sum to 1 across the range making the formal requirement not necessary. gicdf <- function(fun, min=-3.5, max=3.5, bins=1000, nqratio=10, grouping=mean, ...) { # Generate an inverse CDF of an arbitrary function fun <- match.fun(fun) grouping <- match.fun(grouping) # Number of points to draw nq=nqratio*bins # Draw indexes qdraw <- seq(min, max,length.out=nq) # Calculate proportional probability of each draw pdraw <- fun(qdraw,...) # Rescale probability sum to 1, rescale pdraw <- pdraw/sum(pdraw) # Calculate the cumulative probability at each qdraw cpdraw <- cumsum(pdraw) # Caculate the cumulative probability at each bin pbin <- (1:bins)/bins xbin <- NA*(1:bins) for (i in 1:bins) { xbin[i] <- grouping(qdraw[cpdraw0], na.rm = TRUE) cpdraw[cpdraw0&x<1] <- x[x>0&x<1]^2 p[x>2&x<3] <- 3-x[x>2&x<3]^.5 p[x>4&x<5] <- x[x>4&x<5] p } x <- seq(-1,6,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf') # A key thing to note is that the pdf does not need to integrate to 1. # The gicdf function will rescale it to ensure it does. # However, it should all be positive # I define bins as equal to 10,000 when normally they are equal to 1000 myicdf <- gicdf(f,min=-1,max=6, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.1)+ aes(y = ..density..) # Okay here is one more probability f <- function(x) dnorm(x)*abs(1+x^4) x <- seq(-5,5,.01) plot(x,f(x), type='l', xlab='Proportional Probability', main='Proportional pdf2') myicdf <- gicdf(f,min=-5,max=5, bins=1000) samples <- data.frame(draws=ricdf(100000,myicdf)) ggplot(samples, aes(x=draws))+ geom_histogram(binwidth=.2)+ aes(y = ..density..) # I will leave you to define your own pdfs to draw from. # If you end up using this method, please cite!
to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.