mjfrigaard/PHDI_hypertension_code_review.md

## PHDI_hypertension_code_review.md

      
    Raw
  

              PHDI_hypertension_code_review.md
            
          
    Hypertension sample (code review)

Motivation

This project requires selecting a sample of patients who meet the following criteria.

aged 18-85
had 2 encounters between 2015-01-01 and 2017-12-31 with a hypertension ICD9 diagnosis

These data live in RDB flat files, a collection of 17 tab-delimited datasets from the Clarity data warehouse. The data had been imported into a MySQL database, but was later relocated into a MS SQL Server database.
This documents covers changes to the query that creates a sample of 23632 patients who meet the above criteria.
The MySQL Query

The first attempt to create this sample was in MySQL. The query is outlined below.
1 - Create a subset from ENCOUNTERS

This is a list of OUTPATIENT encounters approved by the PI. Three additional Encounter_Types were chosen after review by another member in the PHDI group ('Education', 'Procedure visit', 'Initial consult'). They get stored in the tmp_HtnEncounterTypeList table
CREATE TABLE tmp_HtnEncounterTypeList
SELECT * FROM
ENCOUNTERS
-- these encounters were approved by KBD
WHERE Encounter_Type IN('Appointment', 'Office Visit', 'History',
'Nurse Only', 'Care Coordination', 'Routine Prenatal',
'Off License Non Med Ctr', 'Home Care Visit', 'Evaluation',
'Therapy', 'Nutrition', 'Genetic Counseling', 'Social Work',
'Anti-coag visit', 'Initial Prenatal', 'Day Treatment',
'Postpartum Visit', 'Home Health Admission', 'Ophth Exam',
'Remote CHF Monitoring', 'LACTATION CONSULT',
'Office Visit - Admin Closed', 'Left without being seen',
'Community Orders', 'Care Coordination - Home Health',
'Medication Management', 'Rx Refill Authorize', 'Nurse Triage',
-- MBL recommended adding the following encounter_types
'Education','Procedure visit', 'Initial consult');
Now this table can be used to get the larger sample of patients who meet the additional encounter criteria in the DIAGNOSES table.
CREATE TABLE 
	PHDI_HTN_SAMP
SELECT
-- temp_enc is the tmp_HtnEncounterTypeList table
tmp_enc.EncounterDate,
tmp_enc.EncounterKey,
tmp_enc.Encounter_ID,
tmp_enc.Encounter_Type,
tmp_enc.Encounter_Is_Inpatient,
tmp_enc.Encounter_Is_Walk_In,
tmp_enc.Encounter_Age,

/* 
get Patient_ID from DIAGNOSES (not from ENCOUNTERS)
tmp_enc.Patient_ID,
*/

/* DIAGNOSES table conditions */

diag.ICD9_Code,
/* this links to Encounter_ID in the ENCOUTNERS file*/
diag.Diagnoses_Encounter_ID,
/* type of diagnosis */
diag.Diagnosis_Type,
/*source table for diagnosis*/
diag.Diagnosis_Status,
/*status can be active, deleted, or resolved. */
diag.Diagnosis_Present_On_Admission,
/*yes or no*/
diag.Diagnosis_Name,
/* name of dx from dx_name in Clarity */
diag.Patient_ID
/* get Patient_ID to link back to PATIENTS, etc */
FROM
tmp_HtnEncounterTypeList AS tmp_enc
INNER JOIN DIAGNOSES AS diag
ON tmp_enc.Encounter_ID = diag.Diagnoses_Encounter_ID
-- these are all the diagnosis codes for hypertension
WHERE diag.ICD9_Code
IN ("401", "401.0", "401.1", "401.9","402",
"402.0","402.00","402.01", "402.1","402.10","402.11","402.9",
"402.90","402.91","403","403.00","403.01", "403.1","403.10",
"403.11","403.9","403.91","404","404.0","404.00","404.01",
"404.02","404.03","404.1","404.10","404.11","404.12","404.13",
"404.9","404.90", "404.91","404.92","404.93","405","405.0",
"405.01","405.09","405.1","405.11", "405.19", "405.9",
"405.91", "405.99") 

	AND (tmp_enc.EncounterDate
-- 2015 & 2017 data (updated)
	BETWEEN '2015-01-01' AND '2017-12-31')
-- age between 18 and 85
	AND (tmp_enc.Encounter_Age BETWEEN 18 AND 85);

/*
213813 row(s) affected
Records: 213813
Duplicates: 0
Warnings: 0
12137.804 sec
*/

/* CHECK COUNT ON NEW TABLE */
SELECT COUNT(*) FROM PHDI_HTN_SAMP;
/*
+----------+
| COUNT(*) |
+----------+
| 213813   |
+----------+
1 row in set (0.16 sec)
*/
2 - REDUCE TO 2+ Hypertension Encounters

Reduce this to distinct Patient_IDs for patients with a hypertension diagnosis and at least 2 outpatient encounters. I can do this by counting the number of encounter IDs per Patient_ID
CREATE TABLE 
  temp_HtnCountPatIDEncID
SELECT
COUNT(*) AS count,
  Encounter_ID,
  Patient_ID
FROM
  PHDI_HTN_SAMP
GROUP BY
  Encounter_ID, Patient_ID
HAVING
  count > 1
ORDER BY
  count DESC;

/*
30311 row(s) affected
Records: 30311
Duplicates: 0
Warnings: 0
*/

/* check this new temp_HtnCountPatIDEncID table */
-- =======================================================

SELECT * FROM temp_HtnCountPatIDEncID
ORDER BY count
DESC LIMIT 5;

/*

+-------+-----------------+-----------------+
| count |    Encounter_ID |      Patient_ID |
+-------+-----------------+-----------------+
| 15    | 808640570379794 | 942523706238717 |
| 12    | 152675774414092 | 243384487461299 |
| 12    | 378120614681393 | 744728490710259 |
| 12    | 273427328560501 | 388629973400384 |
| 9     | 889970415737480 | 529602517839521 |
+-------+-----------------+-----------------+
5 rows in set (0.03 sec)
*/

SELECT * FROM temp_HtnCountPatIDEncID
ORDER BY count LIMIT 5;

/*
+-------+-----------------+-----------------+
| count | Encounter_ID    | Patient_ID      |
+-------+-----------------+-----------------+
|     2 | 471763866953552 | 434451227076352 |
|     2 | 862538943067193 |  16760296188295 |
|     2 | 453247585333884 | 131788644008338 |
|     2 | 106287259142846 | 302721116691828 |
|     2 | 604085796978325 | 977933479938656 |
+-------+-----------------+-----------------+
5 rows in set (0.02 sec)
*/
/* these range from 15 t0 2 */
/*
3 - DISTINCT PATIENT IDs: get the DISTINCT Patient_IDs

Now get the distinct PATIENT_IDs from the temp_HtnCountPatIDEncID table and put in separate table
CREATE TABLE PHDI_HTN_2ENC_IDs
SELECT DISTINCT Patient_ID FROM temp_HtnCountPatIDEncID;
SELECT * FROM PHDI_HTN_2ENC_IDs;
-- 19347 rows
/* DROP TEMP TABLES */
-- DROP TABLE temp_HtnCountPatIDEncID;
-- DROP TABLE tmp_HtnEncounterTypeList;
MySQL Code Review

The MySQL query was very slow. This was due to a combination of 1) using free MySQL community edition, and 2) not properly indexing the new table of encounter types.
The MS SQL Server query

This is the query saved as a stored procedure in the MS SQL server database ([dbo].[deprecated_spc_HTN]). It was reviewed and a couple changes were made to get the correct sample (and to improve speed).
USE [phdi]
GO
/****** Object:  StoredProcedure [dbo].[deprecated_spc_HTN] ******/
/*
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
*/

/*
exec spc_HTN
this proc has hard-coded encounter dates 
BETWEEN '2015-01-01' AND '2017-12-31'
*/
My previous strategy was to create a subset of the Encounter_Types, then join to the DIAGNOSES table using the Diagnoses_Encounter_ID. This slowed the process down because there wasn't a unique index in the temp table (tmp_HtnEncounterTypeList).
This code chunk creates the PHDI_HTN_SAMP table and formats each column before entering data into the table. Minor difference in workflow between MySQL and SQL Server.
CREATE TABLE #PHDI_HTN_SAMP(
	[Encounterdate] [date] NULL,
	[encounterkey] [varchar](250) NULL,
	[Encounter_ID] [bigint] NULL,
	[Encounter_type] [varchar](250) NULL,
	[Encounter_Is_Inpatient] [varchar](250) NULL,
	[Encounter_Is_Walk_In] [varchar](250) NULL,
	[Encounter_age] [smallint] NULL,
	[ICD9_code] [varchar](250) NULL,
	[Diagnoses_Encounter_ID] [bigint] NULL,
	[Diagnosis_Type] [varchar](250) NULL,
	[Diagnosis_Status] [varchar](250) NULL,
	[Diagnosis_Present_On_Admission] [varchar](250) NULL,
	[Diagnosis_Name] [varchar](250) NULL,
	[Patient_ID] [bigint] NULL
) 
Now the SELECT statement follows the INSERT INTO command.
INSERT into #PHDI_HTN_SAMP
SELECT 
	enc.EncounterDate, 
	enc.EncounterKey, 
	enc.Encounter_ID, 
	enc.Encounter_Type, 
	enc.Encounter_Is_Inpatient, 
	enc.Encounter_Is_Walk_In, 
	enc.Encounter_Age, 
    /* DIAGNOSES table conditions */
	diag.ICD9_Code, 
		/* this links to Encounter_ID in the ENCOUTNERS file*/
    diag.Diagnoses_Encounter_ID,
		/* type of diagnosis */
    diag.Diagnosis_Type,
		/*source table for diagnosis*/
    diag.Diagnosis_Status,
		/*status can be active, deleted, or resolved. */
    diag.Diagnosis_Present_On_Admission,
		/*yes or no*/
    diag.Diagnosis_Name,
		/* name of dx from dx_name in Clarity */
    diag.Patient_ID
		/* get Patient_ID to link back to PATIENTS, etc */
FROM 
	ENCOUNTERS AS enc -- >> ENCOUNTERS alias
INNER JOIN 
	DIAGNOSES AS diag -- >> DIAGNOSES alias

ON enc.Encounter_ID = diag.Diagnoses_Encounter_ID
    -- >> these are all the diagnosis codes for hypertension 
WHERE diag.ICD9_Code 
		IN ('401', '401.0', '401.1', '401.9','402',
	'402.0','402.00','402.01', '402.1','402.10','402.11','402.9',
	'402.90','402.91','403','403.00','403.01', '403.1','403.10',
	'403.11','403.9','403.91','404','404.0','404.00','404.01', 
	'404.02','404.03','404.1','404.10','404.11','404.12','404.13',
	'404.9','404.90', '404.91','404.92','404.93','405','405.0',
	'405.01','405.09','405.1','405.11', '405.19', '405.9', 
	'405.91', '405.99') AND 
	(enc.EncounterDate 
	-- >> date parameters (2015 & 2017 data (updated))
        BETWEEN '2015-01-01' AND '2017-12-31')
    	-- >> age parameters (age between 18 and 85)
    	
		AND (enc.Encounter_Age BETWEEN 18 AND 85)

/*
This was the previous temporary table that was slowing the process down. By 
adding this as a WHERE condition it becomes part of the JOIN (and has indexes).
*/
		AND enc.Encounter_Type IN('Appointment', 'Office Visit', 
	'History', 'Nurse Only', 'Care Coordination', 'Routine Prenatal', 
    'Off License Non Med Ctr', 'Home Care Visit', 'Evaluation', 
    'Therapy', 'Nutrition', 'Genetic Counseling', 'Social Work', 
	'Anti-coag visit', 'Initial Prenatal', 'Day Treatment', 
    'Postpartum Visit', 'Home Health Admission', 'Ophth Exam', 
    'Remote CHF Monitoring', 'LACTATION CONSULT', 
    'Office Visit - Admin Closed', 'Left without being seen', 
	'Community Orders', 'Care Coordination - Home Health', 
	'Medication Management', 'Rx Refill Authorize', 'Nurse Triage',
        -- MBL recommended adding the following encounter_types
	 'Education','Procedure visit', 'Initial consult')
This portion is where the code was rewritten to get the sample we wanted. Previously the grouping statement included the Patient_ID and Encounter_ID, but this was not correct.
Instead create a table with the DISTINCT Patient_ID and Encounter_IDs, then we only have to count the number of times a Patient_ID shows up in the temp table.
	CREATE TABLE #temp (Encounter_ID bigint, Patient_ID bigint)
	INSERT INTO #temp
	SELECT DISTINCT 
		Encounter_ID, Patient_ID 
	FROM #PHDI_HTN_SAMP
There are multiple Encounter_IDs per Patient_ID. Moreover, there can be multiple ICD9_Codes on a single Encounter_ID.


Feedback from Chris 

I ended up writing it the way Cary would do it:

SELECT DISTINCT patient_ID, encounter_Id  

  even though the way I did it would produce the same results:

GROUP BY patient_Id, encounter_ID  

  So it's not the grouping that's the problem in your original code. It's that you are counting the number of times the same patient_ID and encounter_ID occur, instead of counting the number of distinct patient_ID/encounter_ID combos.

  If it helps, let's look at how I split it up as opposed to how Cary did it (and he combined them, so it will be not exactly one to one).

My original code:
CREATE TABLE temp
SELECT
COUNT(*) AS count,
	Encounter_ID,
	Patient_ID
FROM
	PHDI_HTN_SAMP
GROUP BY
	Encounter_ID, 
	Patient_ID
HAVING
	count(*) > 1
ORDER BY
	count DESC;  
	

  Chris's CODE (split into 2 pieces):

/* piece 1 */
INSERT INTO #temp
SELECT
	Encounter_ID,
	Patient_ID
FROM
	PHDI_HTN_SAMP
GROUP BY
	Encounter_ID, 
	Patient_ID 
ORDER BY 
	Patient_ID, 
	Encounter_ID
	
/* piece 2 */
TRUNCATE TABLE 
	PHDI_HTN_2ENC_IDs
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #temp
GROUP BY patient_id
HAVING count(patient_id)>1
ORDER BY patient_id 
 

  Cary's original code (before re-doing and simplifying the whole thing)

CREATE TABLE #temp (Encounter_ID bigint, Patient_ID bigint)
INSERT INTO #temp
SELECT DISTINCT 
	Encounter_ID, Patient_ID 
FROM #PHDI_HTN_SAMP
TRUNCATE TABLE PHDI_HTN_2ENC_IDs
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #temp
GROUP BY patient_id
HAVING count(patient_id) > 1
ORDER BY patient_id

  Do you see how DISTINCT Patient_ID, Encounter_ID gives the same results as GROUPing BY them?

  Now what cary did in the final step was just grab distinct patient_Ids/encounter ID combos in the first step:

CREATE TABLE #PHDI_HTN_SAMP(Encounter_ID bigint, Patient_ID bigint)
INSERT INTO #PHDI_HTN_SAMP
SELECT DISTINCT enc.encounter_id,diag.Patient_ID -- ....
/* (REST of code for SAMP table) */

INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #PHDI_HTN_SAMP
GROUP BY patient_id
HAVING count(patient_id)>1
ORDER BY patient_id  

  So it's not that GROUPing BY is different than SELECT DISTINCT, it's that when you did the count you were counting the wrong things.


My code was wrong here:
This portion of code is counting the number of times each Patient_ID and Encounter_IDs occur (hence the crazy high numbers).
CREATE TABLE temp
SELECT
COUNT(*) AS count,
	Encounter_ID,
	Patient_ID
Should have been:
...which is getting the unique combinations of Patient_ID:Encounter_ID...
CREATE TABLE temp
SELECT DISTINCT
	Encounter_ID,
	Patient_ID

	TRUNCATE TABLE PHDI_HTN_2ENC_IDs -- << deletes the table if it exists
	INSERT INTO PHDI_HTN_2ENC_IDs
	SELECT 
		patient_id 
	FROM #temp
	GROUP BY patient_id
	HAVING count(patient_id)>1
	ORDER BY patient_id
Now drop the temporary tables.
	drop table #temp
	drop table #PHDI_HTN_SAMP