@PaulinaMartin96
Last active August 28, 2021 16:41
GSoC 2021. MCMCChains Improvements Project

Introduction

In recent decades, Bayesian statistics has gained ground in the modeling of phenomena in areas such as the natural sciences, psychology, and economics. Bayesian inference offers a powerful way to quantify and propagate uncertainty, a framework for incorporating prior information and interpreting probability as a measure of confidence, and an easy way to combine multiple datasets. These advantages make fitting more complex models feasible and meta-analysis relatively simple (Gelman et al., 2014; Korner-Nievergelt et al., 2015).

Despite these advantages, implementing a Bayesian framework is still difficult and restricted because the available literature generally focuses on mathematical formalism and requires greater statistical expertise than other methods (Davidson-Pilon, 2016; Jonas et al., 2013). For this reason, any tool or resource that facilitates the understanding of Bayesian statistics and its mathematical background will be very useful. Visual resources, such as graphics, provide an excellent approach not only for exploring data or presenting results, but also for communicating information in a very effective way (Chen et al., 2008).

Turing.jl is a high-performance probabilistic programming language for Bayesian inference. Inside the Turing ecosystem, MCMCChains.jl provides an implementation of Julia types for analyzing, storing, and summarizing MCMC simulations, along with utility functions for diagnostics and visualization of results. Recently, several issues have highlighted the need to modify some aspects of MCMCChains.jl. This project therefore focused on improving the package, in particular its plotting functionality.

Work done

Project goals

The MCMCChains improvements project had the following goals:

  • Make visual improvements to plotting.
  • Add convenient plot defaults, especially when plotting numerous parameters.
  • Integrate new plot types into the MCMCChains environment.
  • Implement a non-serialization-based method for saving MCMC chain samples.
  • Improve the available documentation of MCMCChains (partially).

Pull requests

As a result of this project, the following pull requests were created:

1. MCMCChains improvements

  • Make Chains objects display only information and not statistical evaluations (#30)

2. Plotting functionality improvements

Implementation of new plot types to visualize Chains objects

  • Add Violin plots (#316)
  • Add Prior/posterior predictive check plots (#319)

This PR includes prior predictive check plots, posterior predictive check plots, and a cumulative version of the predictive check plot for both prior and posterior.

  • Add Ridgeline plot (#323)

This PR includes Ridgeline, Forest and Caterpillar plots.

  • Add Energy plots (#329)
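
Forest- and caterpillar-style plots, as found in packages such as ArviZ, typically summarize each parameter with a point estimate and one or two nested intervals. The sketch below only illustrates that kind of summary. It does not use MCMCChains and does not reproduce the code in the pull requests above; the function name `interval_summary`, the interval widths, and the simulated draws are all assumptions made for illustration.

```python
import random


def interval_summary(draws, inner=0.5, outer=0.95):
    """Median plus inner/outer central quantile intervals for one parameter.

    Hypothetical helper: central (equal-tailed) intervals are an assumption;
    some packages draw highest-density intervals instead.
    """
    xs = sorted(draws)

    def quantile(p):
        # Linear interpolation between neighbouring order statistics.
        pos = p * (len(xs) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    def central(mass):
        tail = (1 - mass) / 2
        return (quantile(tail), quantile(1 - tail))

    return {
        "median": quantile(0.5),
        "inner": central(inner),
        "outer": central(outer),
    }


# Simulated draws standing in for two sampled parameters (not real MCMC output).
random.seed(0)
draws_by_parameter = {
    "alpha": [random.gauss(0.0, 1.0) for _ in range(2000)],
    "beta": [random.gauss(2.0, 0.5) for _ in range(2000)],
}

summaries = {name: interval_summary(d) for name, d in draws_by_parameter.items()}
for name, s in summaries.items():
    print(name, s)
```

Each entry in `summaries` holds what one row of such a plot would draw: a marker at the median, a thick segment for the inner interval, and a thin segment for the outer one.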

3. Documentation improvements

  • Add documentation for Ridgeline, Forest and Caterpillar plots (#328)

Blog Post

To illustrate the functionality of the work done, I am developing a blog post on the Nextjournal platform: https://nextjournal.com/a/PFZZQ8NgmuJhp94UNAeoP?token=Qyyhmre53nsjEXQWUSgA6p

Difficulties

The greatest challenge encountered during the project was the need to combine different kinds of knowledge to implement the plots. On the one hand, it required theoretical statistical knowledge about the content of each plot; on the other, it required familiarity with the packages involved, which included MCMCChains, Turing, and the tools used to generate the plots (Plots.jl and RecipesBase.jl). Because packages like ArviZ and Bayesplot (from the Stan ecosystem in R) served as the basis for the plot implementations in MCMCChains, it was also necessary to understand the logic (and thus the base code) behind these packages.

Conclusions and Future work

  • After looking into some of the state-of-the-art packages for Bayesian inference visualization, the simplicity and extensive visualization functionality of the Plots ecosystem became apparent. New plot types can be implemented more easily and compactly, and the resulting code is clearer for the user to understand. However, because getting familiar with plot recipes can be a bit confusing, I plan to create content about how recipes work as a future personal project.
  • After GSoC, I will continue working on the MCMCChains improvements project. The first goal will be to improve the written code and fix the errors in the pull requests that are not yet merged. Additionally, the plot errors documented in issues #97, #111, #267, #270, and #281 will be addressed, and other storage options for MCMCChains using packages like Arrow.jl and JLD2.jl will be implemented in MCMCStorage.jl. The second goal will be to improve the performance of the various statistical functions provided by MCMCChains by changing the back end, so that a data storage format that maintains the shape of parameter samples can be implemented.
  • Compared with state-of-the-art plotting packages like ArviZ and Bayesplot, there is much room to implement a wider range of plot types in MCMCChains. However, this points to a broader discussion within the probabilistic programming language community about bringing together the visualization tools for Bayesian inference that are currently being developed separately.

References

Chen, C., Härdle, W., & Unwin, A. (2008). Handbook of Data Visualization. https://doi.org/10.1007/978-3-540-33037-0

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D., Vehtari, A., & Rubin, D. (2014). Bayesian Data Analysis (3rd ed.). CRC Press, Taylor & Francis Group. 656 pp.

Korner-Nievergelt, F., Roth, T., von Felten, S., Guélat, J., Almasi, B., & Korner-Nievergelt, P. (2015). Chapter 3. The Bayesian and the Frequentist Ways of Analyzing Data. In Bayesian Data Analysis in Ecology Using Linear Models with R, BUGS, and STAN (pp. 19–31).
