Skip to content

Instantly share code, notes, and snippets.

@mjfrigaard
Last active September 14, 2018 01:04
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save mjfrigaard/96ac03e7b3696e3c227abdadca2cd7d7 to your computer and use it in GitHub Desktop.
Save mjfrigaard/96ac03e7b3696e3c227abdadca2cd7d7 to your computer and use it in GitHub Desktop.
2-RDBMS connections in RStudio
================
Martin Frigaard
2018-09-13
# Motivation
This document will walk through how to setup and query a relational
database from RStudio with MySQL and SQLite.
## The data
These queries are done using the Lahman baseball open data set available
[here](https://github.com/chadwickbureau/baseballdatabank).
I chose this because 1) these data are open to anyone and can be run on
your local machine, not behind a vpn, and 2) it is available as .csv
files for loading into any statistical software.
## Connecting to databases with RStudio
For this tutorial I will be using a local instance of MySQL through the
`DBI`, `dplyr`, and `RMySQL` packages. A `DBI` connection is built below
to my local instance of `MySQL`.
``` r
library(DBI)
library(RMySQL)
library(RMariaDB)
library(dplyr)
```
## Connect to `MySQL` with `DBI::dbConnect()`
The first connection will be done with the `DBI` package.
``` r
LahmanDBIMySQL <- dbConnect(MySQL(),
dbname = "lahman2016",
host = "localhost",
user = "root",
password = rstudioapi::askForPassword("Database password")
)
LahmanDBIMySQL
```
## <MySQLConnection:0,0>
## Connect to `MySQL` with `dplyr::src_mysql()`
I will also set up a connection with `dplyr` to compare queries with the
`DBI` package (this is another option in RStudio).
``` r
LahmandplyrMySQL <- dplyr::src_mysql(dbname = "lahman2016",
host = "localhost",
user = "root",
# password = rstudioapi::askForPassword("Database password")
)
LahmandplyrMySQL
```
## src: mysql 8.0.12 [root@localhost:/lahman2016]
## tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers,
## AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost,
## CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost,
## HallOfFame, HomeGames, Managers, ManagersHalf, Master, Parks, Pitching,
## PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises,
## TeamsHalf
The `LahmandplyrMySQL` connection has much more information stored in
it.
## Connect to `RSQLite`
RStudio can connect with any flavor of SQL, including `SQLite`. This is
done through the `DBI` and `RSQLite::SQLite()` functions, but the path
can be supplied here as `:dbname:`.
``` r
con <- DBI::dbConnect(RSQLite::SQLite(),
path = ":dbname:")
con
```
## <SQLiteConnection>
## Path:
## Extensions: TRUE
We can now import a data frame to our remote source in a SQLite database
and call it `lahmanMasterSQLite`.
``` r
library(Lahman)
dplyr::copy_to(dest = con,
# define the Batting table from the Lahman package
df = Lahman::Batting,
# name this new table something worthwhile
name = "lahmanBattingSQLite",
# not temporary and yes overwrite
temporary = FALSE,
overwrite = TRUE,
# these are indexes--they are like primary keys
indexes = list(c("playerID",
"yearID",
"teamID",
"lgID")))
```
We can then use the `dplyr::tbl()` function to assign the table
`lahmanMasterSQLite` into the `LahmanSQLiteDB` (which is our new SQLite
database).
``` r
LahmanSQLiteDB <- dplyr::tbl(con,
"lahmanBattingSQLite")
LahmanSQLiteDB %>% glimpse(78)
```
## Observations: ??
## Variables: 22
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc…
## $ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871…
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ teamID <chr> "TRO", "RC1", "CL1", "WS3", "RC1", "FW1", "RC1", "BS1", "F…
## $ lgID <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"…
## $ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, 20, 29,…
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9…
## $ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24, 26, 0…
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, …
## $ X2B <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, 0, 0, 1…
## $ X3B <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0, 0, 1, …
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21, 23, 0…
## $ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0, 0, 3, …
## $ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 1, 0…
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, …
## $ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, 0, 2, 0…
## $ IBB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ SH <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ SF <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ GIDP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
If we use the `base::str()` to examine the structure of the new object
`LahmanSQLiteDB`, we see that it’s a `SQLiteConnection` and this object
actually contains the connection and the data.
``` r
LahmanSQLiteDB %>% utils::str()
```
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr ""
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x7fc6aa07f700>
## .. .. ..@ bigint : chr "integer64"
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "lahmanBattingSQLite"
## ..$ vars: chr [1:22] "playerID" "yearID" "stint" "teamID" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
## 1\) `SELECT`ing columns from single table
We will start by demonstrating some basic abilities of interacting with
relational data bases with R. We can use plain SQL inside the
`dplyr::tbl()` function to view all the data in the `Batting` table.
``` r
tbl(LahmandplyrMySQL, sql("SELECT * FROM Batting")) %>% glimpse(78)
```
## Observations: ??
## Variables: 22
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc…
## $ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871…
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ teamID <chr> "TRO", "RC1", "CL1", "WS3", "RC1", "FW1", "RC1", "BS1", "F…
## $ lgID <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"…
## $ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, 20, 29,…
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9…
## $ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24, 26, 0…
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, …
## $ `2B` <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, 0, 0, 1…
## $ `3B` <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0, 0, 1, …
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21, 23, 0…
## $ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0, 0, 3, …
## $ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 1, 0…
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, …
## $ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, 0, 2, 0…
## $ IBB <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ HBP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ SH <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ SF <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
## $ GIDP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
As we can see, the entire batting table shows up, but the number of
observations is `??`. And the `Batting` table is not listed in the
working environment.
``` r
ls()
```
## [1] "con" "LahmanDBIMySQL" "LahmandplyrMySQL"
## [4] "LahmanSQLiteDB"
## 1\) `SELECT`ing specific columns from a single table
Next we will select the `playerID`, `nameGiven`, `birthMonth`,
`birthDay`, `birthYear` from the `Master` table.
``` r
tbl(LahmandplyrMySQL,
sql("SELECT
playerID,
nameGiven,
birthMonth,
birthDay,
birthYear
FROM
Master
LIMIT 100"))
```
## # Source: SQL [?? x 5]
## # Database: mysql 8.0.12 [root@localhost:/lahman2016]
## playerID nameGiven birthMonth birthDay birthYear
## <chr> <chr> <int> <int> <int>
## 1 aardsda01 David Allan 12 27 1981
## 2 aaronha01 Henry Louis 2 5 1934
## 3 aaronto01 Tommie Lee 8 5 1939
## 4 aasedo01 Donald William 9 8 1954
## 5 abadan01 Fausto Andres 8 25 1972
## 6 abadfe01 Fernando Antonio 12 17 1985
## 7 abadijo01 John W. 11 4 1850
## 8 abbated01 Edward James 4 15 1877
## 9 abbeybe01 Bert Wood 11 11 1869
## 10 abbeych01 Charles S. 10 14 1866
## # … with more rows
This would bring back 1000 rows if I hadn’t specified `LIMIT 100`. I can
check the time for this query using `system.time()`.
``` r
tbl(LahmandplyrMySQL,
sql("SELECT
playerID,
nameGiven,
birthMonth,
birthDay,
birthYear
FROM
Master
WHERE
nameGiven like '%Willie Mays%'"))
```
## # Source: SQL [?? x 5]
## # Database: mysql 8.0.12 [root@localhost:/lahman2016]
## playerID nameGiven birthMonth birthDay birthYear
## <chr> <chr> <int> <int> <int>
## 1 aikenwi01 Willie Mays 10 14 1954
`dplyr` provides fast translation between SQL and R.
## I don’t know `SQL`???
Knowing SQL is essential if you’re using relational databases, but it’s
not necessary with `dbplyr`. We can refer to the `LahmanSQLiteDB` like a
regular data frame in our workspace.
> The goal of `dbplyr` is to automatically generate SQL for you so that
> you’re not forced to use it.
``` r
LahmanSQLiteDB %>%
dplyr::select(playerID, AB, H, HR, BB) %>% glimpse(78)
```
## Observations: ??
## Variables: 5
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc…
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9…
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, …
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, …
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment