Last active
September 14, 2018 01:04
-
-
Save mjfrigaard/96ac03e7b3696e3c227abdadca2cd7d7 to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
2-RDBMS connections in RStudio | |
================ | |
Martin Frigaard | |
2018-09-13 | |
# Motivation | |
This document will walk through how to setup and query a relational | |
database from RStudio with MySQL and SQLite. | |
## The data | |
These queries are done using the Lahman baseball open data set available | |
[here](https://github.com/chadwickbureau/baseballdatabank). | |
I chose this because 1) these data are open to anyone and can be run on | |
your local machine, not behind a vpn, and 2) it is available as .csv | |
files for loading into any statistical software. | |
## Connecting to databases with RStudio | |
For this tutorial I will be using a local instance of MySQL through the | |
`DBI`, `dplyr`, and `RMySQL` packages. A `DBI` connection is built below | |
to my local instance of `MySQL`. | |
``` r | |
library(DBI) | |
library(RMySQL) | |
library(RMariaDB) | |
library(dplyr) | |
``` | |
## Connect to `MySQL` with `DBI::dbConnect()` | |
The first connection will be done with the `DBI` package. | |
``` r | |
LahmanDBIMySQL <- dbConnect(MySQL(), | |
dbname = "lahman2016", | |
host = "localhost", | |
user = "root", | |
password = rstudioapi::askForPassword("Database password") | |
) | |
LahmanDBIMySQL | |
``` | |
## <MySQLConnection:0,0> | |
## Connect to `MySQL` with `dplyr::src_mysql()` | |
I will also set up a connection with `dplyr` to compare queries with the | |
`DBI` package (this is another option in RStudio). | |
``` r | |
LahmandplyrMySQL <- dplyr::src_mysql(dbname = "lahman2016", | |
host = "localhost", | |
user = "root", | |
# password = rstudioapi::askForPassword("Database password") | |
) | |
LahmandplyrMySQL | |
``` | |
## src: mysql 8.0.12 [root@localhost:/lahman2016] | |
## tbls: AllstarFull, Appearances, AwardsManagers, AwardsPlayers, | |
## AwardsShareManagers, AwardsSharePlayers, Batting, BattingPost, | |
## CollegePlaying, Fielding, FieldingOF, FieldingOFsplit, FieldingPost, | |
## HallOfFame, HomeGames, Managers, ManagersHalf, Master, Parks, Pitching, | |
## PitchingPost, Salaries, Schools, SeriesPost, Teams, TeamsFranchises, | |
## TeamsHalf | |
The `LahmandplyrMySQL` connection has much more information stored in | |
it. | |
## Connect to `RSQLite` | |
RStudio can connect with any flavor of SQL, including `SQLite`. This is | |
done through the `DBI` and `RSQLite::SQLite()` functions, but the path | |
can be supplied here as `:dbname:`. | |
``` r | |
con <- DBI::dbConnect(RSQLite::SQLite(), | |
path = ":dbname:") | |
con | |
``` | |
## <SQLiteConnection> | |
## Path: | |
## Extensions: TRUE | |
We can now import a data frame to our remote source in a SQLite database | |
and call it `lahmanMasterSQLite`. | |
``` r | |
library(Lahman) | |
dplyr::copy_to(dest = con, | |
# define the Batting table from the Lahman package | |
df = Lahman::Batting, | |
# name this new table something worthwhile | |
name = "lahmanBattingSQLite", | |
# not temporary and yes overwrite | |
temporary = FALSE, | |
overwrite = TRUE, | |
# these are indexes--they are like primary keys | |
indexes = list(c("playerID", | |
"yearID", | |
"teamID", | |
"lgID"))) | |
``` | |
We can then use the `dplyr::tbl()` function to assign the table | |
`lahmanMasterSQLite` into the `LahmanSQLiteDB` (which is our new SQLite | |
database). | |
``` r | |
LahmanSQLiteDB <- dplyr::tbl(con, | |
"lahmanBattingSQLite") | |
LahmanSQLiteDB %>% glimpse(78) | |
``` | |
## Observations: ?? | |
## Variables: 22 | |
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc… | |
## $ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871… | |
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… | |
## $ teamID <chr> "TRO", "RC1", "CL1", "WS3", "RC1", "FW1", "RC1", "BS1", "F… | |
## $ lgID <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"… | |
## $ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, 20, 29,… | |
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9… | |
## $ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24, 26, 0… | |
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, … | |
## $ X2B <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, 0, 0, 1… | |
## $ X3B <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0, 0, 1, … | |
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0… | |
## $ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21, 23, 0… | |
## $ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0, 0, 3, … | |
## $ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 1, 0… | |
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, … | |
## $ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, 0, 2, 0… | |
## $ IBB <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… | |
## $ HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… | |
## $ SH <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… | |
## $ SF <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… | |
## $ GIDP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA… | |
If we use the `base::str()` to examine the structure of the new object | |
`LahmanSQLiteDB`, we see that it’s a `SQLiteConnection` and this object | |
actually contains the connection and the data. | |
``` r | |
LahmanSQLiteDB %>% utils::str() | |
``` | |
## List of 2 | |
## $ src:List of 2 | |
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots | |
## .. .. ..@ ptr :<externalptr> | |
## .. .. ..@ dbname : chr "" | |
## .. .. ..@ loadable.extensions: logi TRUE | |
## .. .. ..@ flags : int 70 | |
## .. .. ..@ vfs : chr "" | |
## .. .. ..@ ref :<environment: 0x7fc6aa07f700> | |
## .. .. ..@ bigint : chr "integer64" | |
## ..$ disco: NULL | |
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src" | |
## $ ops:List of 2 | |
## ..$ x : 'ident' chr "lahmanBattingSQLite" | |
## ..$ vars: chr [1:22] "playerID" "yearID" "stint" "teamID" ... | |
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op" | |
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" | |
## 1\) `SELECT`ing columns from single table | |
We will start by demonstrating some basic abilities of interacting with | |
relational data bases with R. We can use plain SQL inside the | |
`dplyr::tbl()` function to view all the data in the `Batting` table. | |
``` r | |
tbl(LahmandplyrMySQL, sql("SELECT * FROM Batting")) %>% glimpse(78) | |
``` | |
## Observations: ?? | |
## Variables: 22 | |
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc… | |
## $ yearID <int> 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871, 1871… | |
## $ stint <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… | |
## $ teamID <chr> "TRO", "RC1", "CL1", "WS3", "RC1", "FW1", "RC1", "BS1", "F… | |
## $ lgID <chr> "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"… | |
## $ G <int> 1, 25, 29, 27, 25, 12, 1, 31, 1, 18, 22, 1, 10, 3, 20, 29,… | |
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9… | |
## $ R <int> 0, 30, 28, 28, 29, 9, 0, 66, 1, 13, 18, 0, 6, 7, 24, 26, 0… | |
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, … | |
## $ `2B` <int> 0, 6, 4, 10, 11, 2, 0, 10, 1, 2, 1, 0, 0, 0, 9, 3, 0, 0, 1… | |
## $ `3B` <int> 0, 0, 5, 2, 3, 1, 0, 9, 0, 1, 10, 0, 0, 0, 1, 3, 0, 0, 1, … | |
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0… | |
## $ RBI <int> 0, 13, 19, 27, 16, 5, 2, 34, 1, 11, 18, 0, 1, 5, 21, 23, 0… | |
## $ SB <int> 0, 8, 3, 1, 6, 0, 0, 11, 0, 1, 0, 0, 2, 2, 4, 4, 0, 0, 3, … | |
## $ CS <int> 0, 1, 1, 1, 2, 1, 0, 6, 0, 0, 1, 0, 0, 0, 0, 4, 0, 0, 1, 0… | |
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, … | |
## $ SO <int> 0, 0, 5, 2, 1, 1, 0, 1, 0, 0, 4, 0, 0, 0, 2, 2, 3, 0, 2, 0… | |
## $ IBB <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""… | |
## $ HBP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""… | |
## $ SH <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""… | |
## $ SF <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""… | |
## $ GIDP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""… | |
As we can see, the entire batting table shows up, but the number of | |
observations is `??`. And the `Batting` table is not listed in the | |
working environment. | |
``` r | |
ls() | |
``` | |
## [1] "con" "LahmanDBIMySQL" "LahmandplyrMySQL" | |
## [4] "LahmanSQLiteDB" | |
## 1\) `SELECT`ing specific columns from a single table | |
Next we will select the `playerID`, `nameGiven`, `birthMonth`, | |
`birthDay`, `birthYear` from the `Master` table. | |
``` r | |
tbl(LahmandplyrMySQL, | |
sql("SELECT | |
playerID, | |
nameGiven, | |
birthMonth, | |
birthDay, | |
birthYear | |
FROM | |
Master | |
LIMIT 100")) | |
``` | |
## # Source: SQL [?? x 5] | |
## # Database: mysql 8.0.12 [root@localhost:/lahman2016] | |
## playerID nameGiven birthMonth birthDay birthYear | |
## <chr> <chr> <int> <int> <int> | |
## 1 aardsda01 David Allan 12 27 1981 | |
## 2 aaronha01 Henry Louis 2 5 1934 | |
## 3 aaronto01 Tommie Lee 8 5 1939 | |
## 4 aasedo01 Donald William 9 8 1954 | |
## 5 abadan01 Fausto Andres 8 25 1972 | |
## 6 abadfe01 Fernando Antonio 12 17 1985 | |
## 7 abadijo01 John W. 11 4 1850 | |
## 8 abbated01 Edward James 4 15 1877 | |
## 9 abbeybe01 Bert Wood 11 11 1869 | |
## 10 abbeych01 Charles S. 10 14 1866 | |
## # … with more rows | |
This would bring back 1000 rows if I hadn’t specified `LIMIT 100`. I can | |
check the time for this query using `system.time()`. | |
``` r | |
tbl(LahmandplyrMySQL, | |
sql("SELECT | |
playerID, | |
nameGiven, | |
birthMonth, | |
birthDay, | |
birthYear | |
FROM | |
Master | |
WHERE | |
nameGiven like '%Willie Mays%'")) | |
``` | |
## # Source: SQL [?? x 5] | |
## # Database: mysql 8.0.12 [root@localhost:/lahman2016] | |
## playerID nameGiven birthMonth birthDay birthYear | |
## <chr> <chr> <int> <int> <int> | |
## 1 aikenwi01 Willie Mays 10 14 1954 | |
`dplyr` provides fast translation between SQL and R. | |
## I don’t know `SQL`??? | |
Knowing SQL is essential if you’re using relational databases, but it’s | |
not necessary with `dbplyr`. We can refer to the `LahmanSQLiteDB` like a | |
regular data frame in our workspace. | |
> The goal of `dbplyr` is to automatically generate SQL for you so that | |
> you’re not forced to use it. | |
``` r | |
LahmanSQLiteDB %>% | |
dplyr::select(playerID, AB, H, HR, BB) %>% glimpse(78) | |
``` | |
## Observations: ?? | |
## Variables: 5 | |
## $ playerID <chr> "abercda01", "addybo01", "allisar01", "allisdo01", "ansonc… | |
## $ AB <int> 4, 118, 137, 133, 120, 49, 4, 157, 5, 86, 89, 3, 36, 15, 9… | |
## $ H <int> 0, 32, 40, 44, 39, 11, 1, 63, 1, 13, 27, 0, 7, 6, 33, 32, … | |
## $ HR <int> 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0… | |
## $ BB <int> 0, 4, 2, 0, 2, 0, 1, 13, 0, 0, 3, 1, 2, 0, 2, 9, 0, 0, 4, … |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment