@benmarwick
Last active August 17, 2020 00:18
notes on docker and rstudio
# Reproducible Research using Docker and R
# Challenges of reproducibility
- dependencies
- isolation and transparency
- portability of computational environment
- extensibility and reuse
- ease of use
# Virtual Machines vs Containers
- Docker uses resource isolation features of the Linux kernel, such as cgroups and kernel namespaces
- allows independent "containers" to run within a single Linux instance
- packages executable dependencies in a way that is more transparent than a VM and more robust than a README
- limitations of VMs:
- Size: VMs are very large which makes them impractical to store and transfer.
- Performance: running VMs consumes significant CPU and memory, which makes them impractical in many scenarios, for example local development of multi-tier applications, and large-scale deployment of CPU- and memory-intensive applications on large numbers of machines.
- Portability: competing VM environments don't play well with each other. Although conversion tools do exist, they are limited and add even more overhead.
- Hardware-centric: VMs were designed with machine operators in mind, not software developers. As a result, they offer very limited tooling for what developers need most: building, testing and running their software. For example, VMs offer no facilities for application versioning, monitoring, configuration, logging or service discovery.
# What Docker is
- a shipping container for the online universe: hardware-agnostic and platform-agnostic
- a tool that lets developers neatly package software and move it from machine to machine.
- released as open source in March 2013, a big deal on github: 18.6k stars, 3.8k forks
- dockerfiles: plain-text instructions to automatically make images
- containers: the active, running parts of Docker that do something
- images: pre-built environments and instructions that tell a container what to do.
- registry: open online repository of images (https://registry.hub.docker.com/), including many ['trusted builds'](http://dockerfile.github.io/)
# Limitations
- Security: it is possible for an image hosted on the registry to be written with malicious intent
- Limited to 64-bit host machines, making it impossible to run on older hardware
- Does not provide complete virtualization but relies on the Linux kernel provided by the host
- On OSX and Windows this means a VM must be present ([boot2docker](http://boot2docker.io/) installs [VirtualBox](https://www.virtualbox.org/) for this)
# Getting started on OSX & Windows
- Install [boot2docker](http://boot2docker.io/)
- `docker pull <username>/<image_name>` to fetch an existing image from the registry
- eg `docker pull ubuntu`; notice there's no username here, because this is an 'official repo'
- after `pull` then `run`
- or simply `run`, which will `pull`, `create` and `run` in one step
# `docker run` and common [flags](https://docs.docker.com/reference/run/):
- `-i` Interactive: keep STDIN open (usually used with `-t`)
- `-t` TTY: allocate a pseudo-TTY (basically a terminal interface for a CLI)
- `-p` Publish ports: `-p <host port>:<container port>`
- `-d` Detached mode: run the container in the background (opposite of `-i -t`)
- `-v` Volume: mount a host directory or a volume into the container: `-v <host path>:<container path>` (a VOLUME instruction in the Dockerfile is not required for this)
- `--rm=true` remove your container from the host when it stops running (older Docker versions did not allow this together with `-d`)
- eg `docker run -it ubuntu` # gets ubuntu and gives us a terminal for interaction
- eg `docker run -dp 8787:8787 rocker/rstudio` # gets R & RStudio and opens port 8787 for using RStudio server in a web browser at localhost:8787 (linux) or 192.168.59.103:8787 (Windows, OSX)
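The flags above combine in a predictable way. As a minimal sketch, the helper below only prints the `docker run` command it would use, so it is safe to try without Docker installed. The function name `rstudio_run_cmd` is a made-up name for illustration, and the sketch assumes the container listens on port 8787, as the rocker images in these notes do.

```shell
#!/bin/sh
# Print (do not execute) a `docker run` command for an RStudio container.
# rstudio_run_cmd is a hypothetical helper name, not part of Docker.
rstudio_run_cmd() {
    host_port="${1:-8787}"         # port to open on the host
    image="${2:-rocker/rstudio}"   # image to pull and run
    # -d runs detached; -p <host port>:<container port> publishes the port
    printf 'docker run -d -p %s:8787 %s\n' "$host_port" "$image"
}

rstudio_run_cmd                          # default port and image
rstudio_run_cmd 8888 rocker/hadleyverse  # another host port and image
```

Copy the printed line into the terminal to actually start the container.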
# [Interacting with docker at the command line](https://docs.docker.com/reference/commandline/cli/)
- `docker ps` # list all the running containers on the host
- `docker ps -a` # list all the containers on the host, including those that have stopped
- `docker exec -it <container-id> bash` # opens bash shell for a currently running container
- `docker stop <container-id>` # stop a running container
- `docker kill <container-id>` # similar to docker stop, but it's more forceful, sending a SIGKILL to the command the container is running
- `docker rm <container-id>` # removes (deletes) a container
- `docker rmi <image-id>` # removes (deletes) an image
- `docker rm -f $(docker ps -a -q)` # force-remove all containers, including running ones
- `docker rmi -f $(docker images -q)` # force-remove all images
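Removing every container with `-f` is blunt. A gentler variant, sketched here under the assumption that only stopped containers should go, uses the `status=exited` filter of `docker ps`. The function name `remove_exited_containers` is made up for this example.

```shell
#!/bin/sh
# Remove only containers that have already exited, leaving running ones alone.
# remove_exited_containers is a hypothetical helper name.
remove_exited_containers() {
    # -a: include stopped containers; -q: print IDs only; -f: filter by status
    ids=$(docker ps -a -q -f status=exited)
    if [ -z "$ids" ]; then
        echo "no exited containers to remove"
        return 0
    fi
    # Intentionally unquoted so that each ID becomes a separate argument.
    docker rm $ids
}

# Usage (needs a working Docker installation):
#   remove_exited_containers
```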
# [Writing a dockerfile](https://docs.docker.com/articles/dockerfile_best-practices/)
- it is possible to use `docker commit <container>` to commit a container's file changes or settings into a new image, but it is better to use Dockerfiles & git to manage your images in a documented and maintainable way
- A Dockerfile is a short plain text file that is a recipe for making a docker image
# Dockerfile elements
- FROM instruction specifies which base image your image is built on (for the rocker images this ultimately traces back to Debian)
- MAINTAINER instruction specifies who created and maintains the image (newer Docker versions deprecate it in favour of a LABEL)
- CMD instruction specifies the command to run immediately when a container is started from this image, unless you specify a different command.
- ADD instruction will copy new files from a source and add them to the container's filesystem path
- RUN instruction does just that: It runs a command inside the container (eg. `apt-get`)
- EXPOSE instruction tells Docker that the container will listen on the specified port when it starts
- VOLUME instruction will create a mount point with the specified name and tell Docker that the volume may be mounted by the host
- Moderately complex example: https://github.com/rocker-org/hadleyverse/blob/master/Dockerfile
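- To build an image from a dockerfile: `docker build --rm -t <username>/<image_name> <path-or-URL>`, where the last argument is the directory that contains the Dockerfile (the build context) or a URL pointing to one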
- To send an image to the registry: `docker push <username>/<image_name>` # need to be registered at https://hub.docker.com/
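To make the instructions above concrete, here is a minimal sketch that writes a small Dockerfile into a scratch directory. The package `libxml2-dev`, the script name `analysis.R`, and the data path are placeholders chosen for illustration, not taken from any rocker image. No CMD is given, so the base image's own start-up command is kept.

```shell
#!/bin/sh
# Write an illustrative Dockerfile into a scratch directory.
build_dir=$(mktemp -d)

cat > "$build_dir/Dockerfile" <<'EOF'
# Base image: R plus RStudio Server from the rocker project
FROM rocker/rstudio

# Run a command inside the image at build time (example package only)
RUN apt-get update && apt-get install -y libxml2-dev

# Copy a hypothetical analysis script into the default user's home directory
ADD analysis.R /home/rstudio/analysis.R

# Tell Docker which port the container listens on
EXPOSE 8787

# Create a mount point for data (hypothetical path)
VOLUME /home/rstudio/data
EOF

echo "Wrote $build_dir/Dockerfile"
# To build (needs Docker and an analysis.R file next to the Dockerfile):
#   docker build --rm -t <username>/<image_name> "$build_dir"
```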
# [Automated Docker image build testing](https://circleci.com/)
- Automated image build testing on a new commit to the Dockerfile
- Analogous to the travis-ci service, and provides a status badge (shield)
- Requires a `circle.yml` file in the github repo, eg. https://github.com/benmarwick/1989-excavation-report-Madjebebe/blob/master/circle.yml
- Pushes the new image to the hub on successful completion of the tests
# Doing research with RStudio and Docker
- The [rocker project](https://github.com/rocker-org/) provides images that include R, key packages and other dependencies (RStudio, pandoc, LaTeX, etc.), and has excellent documentation on the github wiki (https://github.com/rocker-org/rocker/wiki/Using-the-RStudio-image)
- run RStudio server in the browser, with host folder as volume
- eg `docker run -dp 8787:8787 -v /c/Users/marwick/docker:/home/rstudio/ -e ROOT=TRUE rocker/hadleyverse`
- `-dp 8787:8787` gives me a port for the web browser to access RStudio
- `-v /c/Users/marwick/docker:/home/rstudio/` gives me read and write access both ways between Windows (C:/Users/marwick/docker) and RStudio
- `-e ROOT=TRUE` sets an environment variable to enable root access for me so I can manage dependencies
- I can access the docker (Debian) shell via RStudio for file manipulation, etc. (or `docker exec -it <container-id> bash`)
- I store scripts on the host volume because version control is simpler this way, but do development and analysis in the container for isolation
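The `-v` example above writes the Windows folder `C:/Users/marwick/docker` as `/c/Users/marwick/docker`. A small sketch of that translation follows. It assumes the boot2docker convention seen in these notes, where a drive letter becomes a lowercase top-level directory. The function name `to_b2d_path` is made up for illustration.

```shell
#!/bin/sh
# Convert a Windows path such as C:\Users\me\docker to /c/Users/me/docker.
# to_b2d_path is a hypothetical helper name.
to_b2d_path() {
    win_path=$1
    # First character is the drive letter; make it lowercase.
    drive=$(printf '%s' "$win_path" | cut -c1 | tr 'A-Z' 'a-z')
    # Skip the drive letter and colon, then turn backslashes into slashes.
    rest=$(printf '%s' "$win_path" | cut -c3- | tr '\\' '/')
    printf '/%s%s\n' "$drive" "$rest"
}

to_b2d_path 'C:\Users\marwick\docker'   # prints /c/Users/marwick/docker
```

The result can be passed to the volume flag, for example `-v "$(to_b2d_path 'C:\Users\marwick\docker')":/home/rstudio/`.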
# ...and IPython
- Choose your favourite from the registry: https://registry.hub.docker.com/search?q=ipython&s=downloads
- the IPython project have a few images, and there are many user-contributed ones
# Cloud computing with docker is widely supported
- Amazon EC2 Container Service: docker clusters in the cloud (no registry)
- Google Compute Engine: has container-optimized VMs
- Google container registry: secure private docker image storage on google cloud platform
- Microsoft Azure supports docker containers (docker hub is integrated)
# References & further reading
- http://arxiv-web3.library.cornell.edu/pdf/1410.0846v1.pdf
- http://sites.duke.edu/researchcomputing/tag/docker/
- https://rc.duke.edu/duke-docker-day-was-great/
- https://github.com/LinuxAtDuke/Intro-To-Docker
- http://reproducible-research.github.io/scipy-tutorial-2014/environment/docker/
- http://ropensci.org/blog/2014/10/23/introducing-rocker/
- https://github.com/wsargent/docker-cheat-sheet
Using a fresh install of boot2docker and the VM that comes with it, things are working great. This is good for starting RStudio server and sharing a folder in one line:
boot2docker ssh
docker run -d -p 8787:8787 -v /c/Users/marwick:/home/rstudio/ -e ROOT=TRUE rocker/hadleyverse
###
Using VirtualBox 4.3.12 I've got boot2docker working exactly as the instructions here indicate: https://github.com/boot2docker/boot2docker
boot2docker init
boot2docker up
# folder sharing on
docker run -v /data --name my-data busybox true
docker run --rm -v /usr/local/bin/docker:/docker -v /var/run/docker.sock:/docker.sock svendowideit/samba my-data
# change folder permissions
docker run -it --volumes-from my-data rocker/rstudio /bin/bash # or docker exec -it "id of running container" bash
chmod -R a+rwX /data # to change permissions for the shared data folder, access this on the desktop via \\192.168.59.103\data in explorer
setfacl -d -m u::rw-,g::rw-,o::rw- /data # ensure new files get same permissions
exit # to exit bash and get back to docker prompt
# start my container & do stuff in RStudio with shared folder
docker run -itdp 8787:8787 --volumes-from my-data rocker/rstudio # connect this container to shared folder
# go to http://192.168.59.103:8787/ in the browser, log in with rstudio/rstudio
# finish
exit # in boot2docker window, to quit boot2docker
The only detail is that I had to change permissions on the shared folder. After making the samba file-sharing container, I start my container with an interactive shell and change the permissions there. Then I return to http://192.168.59.103:8787/ and can read and write to ~/data
And after my-data is up and permissions set, then each time just:
docker run -dp 8787:8787 --volumes-from my-data rocker/rstudio
###
Here's the starting point: https://github.com/ropensci/docker
The main project is now at https://github.com/benmarwick/rocker
# use b2d with guest additions (to access host drive) from
https://medium.com/boot2docker-lightweight-linux-for-docker/boot2docker-together-with-virtualbox-guest-additions-da1e3ab2465c
# from https://github.com/boot2docker/boot2docker
boot2docker init
boot2docker up
boot2docker ssh -L 8787:localhost:8787 # start b2d with port forwarding
docker run -d -p 8787:8787 rocker/rstudio
# then go to localhost:8787 and login to rstudio with rstudio/rstudio
exit # to quit
boot2docker stop
# various things
docker ps # inspect current processes
docker rm -f $(docker ps -a -q) # remove current containers
docker rmi -f $(docker images -q) # force-remove all images
# access bash of running container to install things, move files, etc
docker ps # get id of running container
docker exec -it "id of running container" bash
# build from my github repo
docker build --rm -t benmarwick/ropensci https://raw.githubusercontent.com/benmarwick/docker/master/ropensci/Dockerfile
docker run -d -p 8787:8787 benmarwick/ropensci # run a container from my newly built image
docker push benmarwick/ropensci # push to open docker repo
# start to finish from my docker image:
# from https://github.com/boot2docker/boot2docker
boot2docker init
boot2docker up
boot2docker ssh -L 8787:localhost:8787 # start b2d with port forwarding
docker run -d -p 8787:8787 benmarwick/ropensci
exit
boot2docker poweroff
# Instructions for Ian to test
# download boot2docker from http://boot2docker.io/ and install
# don't double-click 'boot2docker start' icon! Instead, open Git Bash from start menu (in the Git folder)
# At the Git Bash Prompt (which is a dollar sign), type these lines, pressing return after each one:
boot2docker init
boot2docker up
boot2docker ssh -L 8787:localhost:8787
docker run -d -p 8787:8787 rocker/rstudio
# various things will happen in the window, lots of downloading, etc., it will take a while
# when the downloading stops, open your web browser and go to http://localhost:8787/
# log into RStudio with username: rstudio password: rstudio
# do a few things in RStudio... When finished, go to your Git Bash window and type these lines, pressing enter after each one:
exit
boot2docker poweroff
# Now you can start it up again with these lines (from the Git Bash prompt). Note that docker run starts a new container, so files saved only inside the earlier container will not appear in it:
boot2docker up
boot2docker ssh -L 8787:localhost:8787
docker run -d -p 8787:8787 rocker/rstudio
# open your web browser and go to http://localhost:8787/
# log into RStudio with username: rstudio password: rstudio
# do a few things in RStudio... When finished, go to your Git Bash window and type these lines, pressing enter after each one:
exit
boot2docker poweroff
## Rocker
## DIT4C
quite a nice setup by DIT4C University of Melbourne ITS Research project - Data Intensive Tools for the Cloud, includes a file browser in the web browser
sudo docker run -t -p 80:80 --name dit4c-rstudio -P dit4c/dit4c-container-rstudio
But only works on my linux VM, not in boot2docker

Working in a local Docker container

To run this project in a local Docker container, I start a bash shell in the project directory:

Do this one time only: build my Docker image from the Dockerfile at the top level of my project:

docker build - < Dockerfile
docker images

Take note of the image ID, which is listed by docker images

Then run a container from that image, access it via the web browser, and link the project dir to the Docker drive. Make sure you're in a local directory that is sharable (C:\Users\yourname is usually good):

docker run -dp 8787:8787 -v $(pwd):/home/rstudio/ -e ROOT=TRUE  <image ID>

Then go to http://192.168.99.100:8787/ or localhost:8787 in your browser, log in (rstudio/rstudio), and start your RStudio project by double-clicking on the .Rproj file

When you're done, in the shell, stop all docker containers with:

docker stop $(docker ps -a -q)

Or do docker ps and docker stop <container ID> just to stop one container.
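Looking up an ID by hand can be avoided by giving the image a tag at build time. The sketch below is one way to do that, not the workflow described above: it builds with `docker build -t <tag> .`, which uses the current directory as the build context instead of piping the Dockerfile in. The function name `build_and_run` and the default tag `myproject` are made up for illustration.

```shell
#!/bin/sh
# Build an image from the Dockerfile in the current directory, tag it,
# and start RStudio from it with the current directory mounted.
# build_and_run and the default tag are hypothetical names.
build_and_run() {
    tag="${1:-myproject}"
    # Only start the container if the build succeeded.
    docker build -t "$tag" . &&
    docker run -d -p 8787:8787 -v "$(pwd)":/home/rstudio/ -e ROOT=TRUE "$tag"
}

# Usage, from the project directory (needs a working Docker installation):
#   build_and_run myproject
```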

# How to run a Rocker Docker container to get RStudio in the browser and work on files in a local folder
This is useful if installing and running R and RStudio locally is not possible. We can use R and RStudio in a Docker container, and work on files in a local folder that we can see in the local file system in the usual ways.
# open a terminal, change directory to the folder that you want to work in (the one with the files you want to edit), then run this line in the terminal:
docker run -dp 8787:8787 -e ROOT=TRUE -e USER=rstudio -e PASSWORD=xyz -v $(pwd):/home/rstudio rocker/verse
# open a browser tab at http://127.0.0.1:8787/ and use 'rstudio' and 'xyz' as the username and password, then check that you can see the folders on your hard drive in RStudio.
# when you are finished working, you can free up memory by stopping the container with this:
docker stop $(docker ps -a -q)
# check to see if any containers are running
docker ps
# when you are completely done and don't need the container again at all, you can free up hard drive space by deleting all images with this:
docker rmi -f $(docker images -q)
# check to see if any images remain
docker images
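The stop command above affects every container on the machine. If other containers are in use, a named container can be stopped on its own. This is a minimal sketch using the `--name` flag of `docker run`. The function names and the default name `rstudio-work` are made up for illustration, and the password is passed as an argument rather than hard-coded.

```shell
#!/bin/sh
# Start and stop one named RStudio container without touching any others.
# start_rstudio, stop_rstudio and "rstudio-work" are hypothetical names.
start_rstudio() {
    name="${1:-rstudio-work}"
    password="$2"   # login password for the 'rstudio' user
    docker run -d --name "$name" -p 8787:8787 \
        -e PASSWORD="$password" \
        -v "$(pwd)":/home/rstudio rocker/verse
}

stop_rstudio() {
    name="${1:-rstudio-work}"
    # Stop the container, then delete it so the name can be reused.
    docker stop "$name" && docker rm "$name"
}

# Usage (needs a working Docker installation):
#   start_rstudio rstudio-work xyz
#   stop_rstudio rstudio-work
```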