Skip to content

Instantly share code, notes, and snippets.

@mlbach

mlbach/README.md Secret

Last active April 23, 2025 12:53
Show Gist options
  • Save mlbach/0a75ffa695f81e0d777973d3928d9013 to your computer and use it in GitHub Desktop.
Save mlbach/0a75ffa695f81e0d777973d3928d9013 to your computer and use it in GitHub Desktop.
Database guide to the United4Surveillance SignalDetectionTool

Database guide for the U4S SignalDetectionTool

This gist will teach you how to hook up your database to the United4Surveillance SignalDetectionTool Shiny app.

Prerequisites

  • R is installed on your system
  • The SignalDetectionTool is installed as a R package
  • You have access to a database
  • You have working knowledge of R

Three-step process to connect your database

  1. Load the SDT package: library(SignalDetectionTool)
  2. Define custom functions that retrieve data from your DB
  3. Monkey patch DB related package functions

Detailed steps & explanations

Recently the SignalDetectionTool added the ability to start the app with a pre-defined configuration. There is currently full support for loading data from a file and in the future this will be extended to fully support data retrieval from relational databases. In the meantime you can make use of the tool's database mock functions and apply monkey patching to overwrite existing package functions. The functions of interest are contained in the package's file R/database.R:

  • get_database_connection()
  • query_database()
  • transform_required_format()
  • load_data_db()

As stated above, these functions are merely mocks of real implementations and currently only load data from an example SQLite database file. To actually connect the Shiny app to your own database right now, you need to write custom functions and monkey patch the package's equivalents.

This gist's file db_functions.R shows an example of how to implement custom functions based on the assumption that the DB is on a MS SQL Server with this schema.

1. Adjust the database connections

Here we use the data config YAML of the SignalDetectionTool and added new scalars server, database, driver and port to the db mapping. You would need to fill these with your actual values. To retrieve the values from the config file we use the package's function get_data_config_value() and we supply a string with the parameter name, separating each nested level with a colon (i.e. "datasource:db:driver"). Alternatively, you could also get these from system environment variables like this:

get_env_config <- function() {
  list(
    host = Sys.getenv("DB_HOST", unset = "localhost"),
    port = as.integer(Sys.getenv("DB_PORT", unset = "5432")),
    driver = Sys.getenv("DB_DRIVER", unset = "postgresql"),
    database = Sys.getenv("DB_NAME", unset = "test_db")
  )
}

2. Adjust the authentication and handling of credentials

In the example we use the keyring package to get the username and password from the OS credentials manager. If your login details are not stored there yet, you first have to save them there using keyring::key_set(). There are of course other alternatives and you might want to read about best practices of managing credentials.

3. Adjust the SQL query to your DB's schema

Depending on your database schema you need to adjust the SQL query (in query_database()). Certainly you will have different table structures, attribute names and you might have different data availabilities. If you are not familiar with SQL you could also try dbplyr. This allows you to write standard dplyr style code and when you want to collect the data from the DB, dbplyr will write the necessary query and send it to the DB.

4. Adjust the data transformation

Depending on the format of the data you retrieve from your DB you will need to do some transformations. This can be done by adjusting transform_required_format() inside db_functions.R to your needs. The SDT is rather strict at the moment. Your data needs to adhere to the required format. If you want to perform the transformations on the SQL server side you can also create a that does nothing i.e. transform_required_format <- function(data){data)}.

How to monkey patch package functions

After you have defined your custom functions you need to call the monkey_patch function from monkey_patching.R function like this:

monkey_patch(your_function, “SignalDetectionTool", "function_name_to_replace")

It doesn't matter how you name your custom functions as long as you substitute them for the tool's functions listed above.

Run the Shiny app with your custom settings and database connection

Now start the Shiny app using run_app(path_to_yaml='your/path/config.yml').

If you have any troubles you can read more about the data configuration file.

default:
datasource:
file: FALSE # TRUE when external csv or excel should be used, FALSE when db should be used
filepath: ~
db: # If using access to db
server: your_server_name
database: your_database_name
driver: "SQL Server"
port: 5432
params:
pathogen: Pertussis
strata:
- age_group
- community
signal_detection_period: 6
signal_detection_algorithm: ears
shapefile_path: ~
pertussis:
params:
pathogen: Pertussis
strata:
- age_group
- sex
signal_detection_period: 6
signal_detection_algorithm: glm harmonic
pandemic_correction: TRUE
intervention_date: 2020-03-18
shapefile_path: ~
salmonella:
params:
pathogen: Salmonella
strata:
- subtype
- county
- sex
signal_detection_algorithm: farrington
shapefile_path: ~
library(DBI)
library(dplyr)
library(keyring)
#' Establish a Database Connection
#'
#' @description
#' This function retrieves database credentials from the keyring (make sure your keyring keys
#' "db_user" and "db_password" are configured) and establishes a connection to a Microsoft SQL Server
#' using the \code{odbc} package. Adjust the \code{Driver}, \code{Server}, and \code{Database} parameters
#' according to your database setup.
#'
#' @return A \code{DBIConnection} object that represents an active connection to the database.
#'
#' @details
#' The connection is established via the \code{dbConnect} function from the \code{DBI} package.
#' Ensure that the ODBC driver for SQL Server is installed and properly configured on your system.
#'
#' @import DBI
#' @import odbc
#' @import keyring
#'
#' @examples
#' \dontrun{
#' con <- get_database_connection()
#' # Use the connection 'con' to query the database, then disconnect when finished.
#' dbDisconnect(con)
#' }
#'
#' @export
get_database_connection <- function() {
# Retrieve credentials from keyring; ensure your keyring keys ("db_user", "db_password") are set up
user <- key_get("db_user")
password <- key_get("db_password")
# Connect to the database (uses get_data_config_value() from SignalDetectionTool)
con <- dbConnect(odbc::odbc(),
Driver = get_data_config_value("datasource:db:driver"),
Server = get_data_config_value("datasource:db:server"),
Database = get_data_config_value("datasource:db:database"),
UID = user,
PWD = password)
return(con)
}
#' Query the Database and Join Tables
#'
#' @description
#' This function executes an SQL query that retrieves raw data from several tables (Cases, Regions,
#' Pathogens, and Subtypes) and performs multiple JOIN operations to assemble a region hierarchy.
#'
#' @param dbcon A \code{DBIConnection} object obtained from \code{get_database_connection}.
#'
#' @return A data frame containing the results of the query which includes fields from cases, pathogens, subtypes,
#' and the region hierarchy (community, county, state, country).
#'
#' @details
#' The SQL query joins the \code{Regions} table three times to capture the hierarchical structure:
#' \itemize{
#' \item \code{r3}: The NUTS-3 level (community) corresponding to \code{Cases.region_id}.
#' \item \code{r2}: The parent region (NUTS-2, e.g., county).
#' \item \code{r1}: The parent of \code{r2} (NUTS-1, e.g., state).
#' \item \code{r0}: The parent of \code{r1} (NUTS-0, e.g., country).
#' }
#' Additional joins include linking to the \code{Pathogens} table and optionally the \code{Subtypes} table.
#'
#' @import DBI
#'
#' @examples
#' \dontrun{
#' con <- get_database_connection()
#' df <- query_database(con)
#' }
#'
#' @export
query_database <- function(dbcon) {
query <- "
SELECT
c.case_id,
c.report_date,
c.symptom_start_date,
c.hospitalization_date,
c.death_date,
c.vaccination_date,
c.age,
c.sex,
c.occupation,
p.pathogen_id,
p.pathogen_name,
s.subtype_id,
s.subtype_name,
c.hospitalization,
c.death,
c.vaccination_status,
r3.region_name AS community,
r3.region_id AS community_id,
r2.region_name AS county,
r2.region_id AS county_id,
r1.region_name AS state,
r1.region_id AS state_id,
r0.region_name AS country,
r0.region_id AS country_id
FROM Cases c
JOIN Regions r3 ON c.region_id = r3.region_id
LEFT JOIN Regions r2 ON r3.parent_region_id = r2.region_id
LEFT JOIN Regions r1 ON r2.parent_region_id = r1.region_id
LEFT JOIN Regions r0 ON r1.parent_region_id = r0.region_id
JOIN Pathogens p ON c.pathogen_id = p.pathogen_id
LEFT JOIN Subtypes s ON c.subtype_id = s.subtype_id
"
dbdata <- dbGetQuery(dbcon, query)
return(dbdata)
}
#' Transform Queried Data into the Required Output Format
#'
#' @description
#' This function transforms the raw queried data into a format suitable for analysis and reporting.
#' It converts date columns to character strings, categorizes age into 10-year intervals, and converts
#' logical fields (hospitalization, death, vaccination status) to user-friendly character labels.
#'
#' @param dbdata A data frame obtained from \code{query_database} containing the raw query results.
#'
#' @return A transformed data frame with:
#' \itemize{
#' \item Date fields converted to character strings (formatted as ISO dates).
#' \item A new \code{age_group} column categorizing ages in 10-year intervals.
#' \item Logical fields converted to character labels ("yes", "no", "unknown").
#' \item Columns renamed and ordered to match the required output specification.
#' }
#'
#' @details
#' - The function uses \code{dplyr} functions for data manipulation, including \code{mutate} and \code{select}.
#' - Date columns (\code{report_date}, \code{symptom_start_date}, \code{hospitalization_date}, \code{death_date}, \code{vaccination_date})
#' are converted to \code{Date} objects and then to character strings.
#' - Ages are grouped into ranges such as \code{"<10"}, \code{"10-19"}, \code{"20-29"}, etc.
#' - Fields that are logical (or have missing values) for \code{hospitalization}, \code{death}, and \code{vaccination_status} are
#' mapped to \code{"yes"}, \code{"no"}, or \code{"unknown"}.
#'
#' @import dplyr
#'
#' @examples
#' \dontrun{
#' df <- query_database(get_database_connection())
#' transformed_df <- transform_required_format(df)
#' }
#'
#' @export
transform_required_format <- function(dbdata) {
dbdata %>%
# Transform date columns, create new variables, and duplicate region columns
mutate(
date_report = as.character(as.Date(report_date)),
date_onset = as.character(as.Date(symptom_start_date)),
date_hospitalization = as.character(as.Date(hospitalization_date)),
date_death = as.character(as.Date(death_date)),
date_vaccination = as.character(as.Date(vaccination_date)),
# Create age_group with 10-year intervals; if age is missing, set as NA.
age_group = ifelse(!is.na(age),
as.character(
cut(age,
breaks = c(-Inf, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99, Inf),
labels = c("<10", "10-19", "20-29", "30-39", "40-49", "50-59",
"60-69", "70-79", "80-89", "90-99", "100+"),
right = TRUE)
),
NA),
# Convert logical fields to character labels ("yes", "no", "unknown")
hospitalization = ifelse(is.na(hospitalization), "unknown",
ifelse(hospitalization, "yes", "no")),
death = ifelse(is.na(death), "unknown",
ifelse(death, "yes", "no")),
vaccination = ifelse(is.na(vaccination_status), "unknown",
ifelse(vaccination_status, "yes", "no"))
) %>%
# Select and rename columns to match the required final output order and naming
select(
case_id,
date_report,
date_onset,
date_hospitalization,
date_death,
date_vaccination,
country,
country_id,
state,
state_id,
county,
county_id,
community,
community_id,
age,
age_group,
sex,
occupation,
pathogen_name,
pathogen_id,
subtype_name,
subtype_id,
hospitalization,
death,
vaccination
) %>%
# Rename columns to match output specification
rename(
pathogen = pathogen_name,
subtype = subtype_name
)
}
-- Table: Cases
CREATE TABLE Cases (
case_id INT PRIMARY KEY, -- Unique case identifier (INT)
report_date DATE NOT NULL, -- Date when the case was reported
symptom_start_date DATE, -- Date when symptoms started
hospitalization_date DATE, -- Date when admitted to the hospital
death_date DATE, -- Date of death
vaccination_date DATE, -- Date of vaccination
age INT CHECK(age BETWEEN 0 AND 125), -- Age of the patient (in years)
sex VARCHAR(10), -- Sex of the reported case (e.g., 'male', 'female', 'diverse')
occupation VARCHAR(50), -- Occupation (e.g., 'care', 'kindergarten')
region_id VINT, -- Reporting region of the case (NUTS-3)
pathogen_id INT, -- Foreign key to Pathogens table (INT)
subtype_id INT, -- Foreign key to Subtypes table (INT)
hospitalization BOOLEAN, -- Whether the case was hospitalized (TRUE/FALSE)
death BOOLEAN, -- Whether the case died (TRUE/FALSE)
vaccination_status BOOLEAN, -- Whether vaccinated (TRUE/FALSE)
FOREIGN KEY (region_id) REFERENCES Regions(region_id),
FOREIGN KEY (pathogen_id) REFERENCES Pathogens(pathogen_id),
FOREIGN KEY (subtype_id) REFERENCES Subtypes(subtype_id)
);
-- Table: Regions
CREATE TABLE Regions (
region_id INT PRIMARY KEY, -- Unique ID for each region (INT)
region_name VARCHAR(255) NOT NULL, -- Name of the region (e.g., country, state, county, community)
region_level INT, -- Indicates level in the hierarchy (1 = country, 2 = state, etc.)
parent_region_id INT, -- Foreign key to the parent region (null for top-level)
region_code VARCHAR(20), -- Region code (e.g., ISO, NUTS)
FOREIGN KEY (parent_region_id) REFERENCES Regions(region_id)
);
-- Table: Pathogens
CREATE TABLE Pathogens (
pathogen_id INT PRIMARY KEY, -- Unique ID for each pathogen (INT)
pathogen_name VARCHAR(255) NOT NULL -- Name of the pathogen (e.g., 'COVID-19')
);
-- Table: Subtypes
CREATE TABLE Subtypes (
subtype_id INT PRIMARY KEY, -- Unique ID for each subtype (INT)
subtype_name VARCHAR(255), -- Name of the subtype (e.g., 'Alpha variant')
pathogen_id INT, -- Foreign key to the Pathogens table (INT)
FOREIGN KEY (pathogen_id) REFERENCES Pathogens(pathogen_id)
);
monkey_patch <- function(func, package_name, their_function_name){
envir <- asNamespace(package_name)
unlockBinding(their_function_name, envir)
assign(their_function_name, func , envir = envir)
}
library(SignalDetectionTool)
source("db_functions.R")
monkey_patch(get_database_connection, "SignalDetectionTool", "get_database_connection")
monkey_patch(query_database, "SignalDetectionTool", "query_database")
monkey_patch(transform_required_format, "SignalDetectionTool", "transform_required_format")
# start the SignalDetectionTool app with custom functions and configuration
run_app(path_to_yaml = "./data_config.yml")
@tinneuro
Copy link

Somehow can't directly comment in the code will thus do it here. Some minor text based improvements first:

  • get_database_connection
     query_database
    

    transform_required_format
    load_data_db
    I would suggest adding () behind the function name, i.e. get_database_connection().

  • schema link: does not work for me "page not found"

  • best practices of managing credentials) one closing bracket which is too much

  • required format) same here and a dot behind or the If afterwards with small i

  • create a that does nothing , I guess we are missing function here

  • for the very last two sentences a dot is missing at the end.

It already looks great thanks a lot! I will now try it out myself and then might add more content related comments.

@tinneuro
Copy link

One idea I have and you might disagree but to also in addition to the Wikipedia article maybe just write in brackets behind monkey patching (overwrite existing functions). This could also be added to the three step process.

@tinneuro
Copy link

To retrieve the values from the config file we use the package's function get_data_config_value() and we supply a string with the parameter name, separating each nested level with a colon (i.e. "datasource:db:driver").
For me it would be useful to refer here to the db_functions.R script. For example
"In the function get_database_connection() in the db_functions.R script we retrieve the values from the config file we using the SDT package's function get_data_config_value(). We supply a string with the parameter name, separating each nested level with a colon (i.e. "datasource:db:driver")."

@tinneuro
Copy link

For more clarity we can add:

  • Depending on your database schema you need to adjust the SQL query query_database() inside db_functions.R .

  • Depending on the format of the data you retrieve from your DB you will need to do some transformations. This can be done by adjusting transform_required_format() inside db_functions.R to your needs.

@tinneuro
Copy link

monkey_patch(your_function, “SignalDetectionTool", "function_name_to_replace")
After this maybe write: In the script monkey_patching.R. we have provided the necessary code.

@tinneuro
Copy link

Great thanks for all the improvements! I am happy with the gist now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment