bborgesr/database-benchmarking.Rmd

## database-benchmarking.Rmd
---
title: "Benchmarking database queries"
author: "Barbara Borges Ribeiro"
date: "June 23, 2016"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(DBI)
library(pool)
```

### One connection per app

```{r app}
app_times <- numeric(0)

conn <- DBI::dbConnect(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

for (id in 1:30) {
  time <- system.time({
    sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
    dbGetQuery(conn, sql)
  })
  app_times <- c(app_times, time["elapsed"])
}

invisible(DBI::dbDisconnect(conn))
```

The average of these 30 measurements is `r mean(app_times)` seconds.

### One connection per query

```{r query}
query_times <- numeric(0)

args <- list(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

for (id in 1:30) {
  time <- system.time({
    conn <- do.call(DBI::dbConnect, args)
    sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
    dbGetQuery(conn, sql)
    DBI::dbDisconnect(conn)
  })
  query_times <- c(query_times, time["elapsed"])
}
```

The average of these 30 measurements is `r mean(query_times)` seconds.

### Using pool

```{r pool}
pool_times <- numeric(0)

pool <- dbPool(
  drv = RMySQL::MySQL(),
  dbname = "shinydemo",
  host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
  username = "guest",
  password = "guest"
)

for (id in 1:30) {
  time <- system.time({
    sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
    dbGetQuery(pool, sql)
  })
  pool_times <- c(pool_times, time["elapsed"])
}

poolClose(pool)
```

The average of these 30 measurements is `r mean(pool_times)` seconds.

## Overall comparison

```{r times}
(times <- data.frame(app_times, query_times, pool_times))
```

As you can see, the app-level measurements are the fastest. This is no suprise. The pool measurements take a little over double that time, since you still do have to fetch and return the connection to the pool (in addition to calculating the query itself). Finally, the query-level measurements are the longest. Again, this makes sense because for each query, you're actually fetching a connection from the remote database.
	---
	title: "Benchmarking database queries"
	author: "Barbara Borges Ribeiro"
	date: "June 23, 2016"
	output: html_document
	---

	```{r setup, include=FALSE}
	knitr::opts_chunk$set(echo = TRUE)
	library(DBI)
	library(pool)
	```

	### One connection per app

	```{r app}
	app_times <- numeric(0)

	conn <- DBI::dbConnect(
	drv = RMySQL::MySQL(),
	dbname = "shinydemo",
	host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
	username = "guest",
	password = "guest"
	)

	for (id in 1:30) {
	time <- system.time({
	sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
	dbGetQuery(conn, sql)
	})
	app_times <- c(app_times, time["elapsed"])
	}

	invisible(DBI::dbDisconnect(conn))
	```

	The average of these 30 measurements is `r mean(app_times)` seconds.

	### One connection per query

	```{r query}
	query_times <- numeric(0)

	args <- list(
	drv = RMySQL::MySQL(),
	dbname = "shinydemo",
	host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
	username = "guest",
	password = "guest"
	)

	for (id in 1:30) {
	time <- system.time({
	conn <- do.call(DBI::dbConnect, args)
	sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
	dbGetQuery(conn, sql)
	DBI::dbDisconnect(conn)
	})
	query_times <- c(query_times, time["elapsed"])
	}
	```

	The average of these 30 measurements is `r mean(query_times)` seconds.

	### Using pool

	```{r pool}
	pool_times <- numeric(0)

	pool <- dbPool(
	drv = RMySQL::MySQL(),
	dbname = "shinydemo",
	host = "shiny-demo.csa7qlmguqrf.us-east-1.rds.amazonaws.com",
	username = "guest",
	password = "guest"
	)

	for (id in 1:30) {
	time <- system.time({
	sql <- paste0("SELECT * FROM City WHERE ID = ", id, ";")
	dbGetQuery(pool, sql)
	})
	pool_times <- c(pool_times, time["elapsed"])
	}

	poolClose(pool)
	```

	The average of these 30 measurements is `r mean(pool_times)` seconds.

	## Overall comparison

	```{r times}
	(times <- data.frame(app_times, query_times, pool_times))
	```

	As you can see, the app-level measurements are the fastest. This is no suprise. The pool measurements take a little over double that time, since you still do have to fetch and return the connection to the pool (in addition to calculating the query itself). Finally, the query-level measurements are the longest. Again, this makes sense because for each query, you're actually fetching a connection from the remote database.