This project requires selecting a sample of patients who meet the following criteria.
- aged 18-85
- had 2 encounters between
2015-01-01
and2017-12-31
with a hypertension ICD9 diagnosis
These data live in RDB flat files, a collection of 17 tab-delimited datasets from the Clarity data warehouse. The data had been imported into a MySQL database, but was later relocated into a MS SQL Server database.
This documents covers changes to the query that creates a sample of 23632 patients who meet the above criteria.
The first attempt to create this sample was in MySQL. The query is outlined below.
This is a list of OUTPATIENT encounters approved by the PI. Three additional Encounter_Types were chosen after review by another member in the PHDI group ('Education', 'Procedure visit', 'Initial consult'). They get stored in the tmp_HtnEncounterTypeList
table
CREATE TABLE tmp_HtnEncounterTypeList
SELECT * FROM
ENCOUNTERS
-- these encounters were approved by KBD
WHERE Encounter_Type IN('Appointment', 'Office Visit', 'History',
'Nurse Only', 'Care Coordination', 'Routine Prenatal',
'Off License Non Med Ctr', 'Home Care Visit', 'Evaluation',
'Therapy', 'Nutrition', 'Genetic Counseling', 'Social Work',
'Anti-coag visit', 'Initial Prenatal', 'Day Treatment',
'Postpartum Visit', 'Home Health Admission', 'Ophth Exam',
'Remote CHF Monitoring', 'LACTATION CONSULT',
'Office Visit - Admin Closed', 'Left without being seen',
'Community Orders', 'Care Coordination - Home Health',
'Medication Management', 'Rx Refill Authorize', 'Nurse Triage',
-- MBL recommended adding the following encounter_types
'Education','Procedure visit', 'Initial consult');
Now this table can be used to get the larger sample of patients who meet the additional encounter criteria in the DIAGNOSES
table.
CREATE TABLE
PHDI_HTN_SAMP
SELECT
-- temp_enc is the tmp_HtnEncounterTypeList table
tmp_enc.EncounterDate,
tmp_enc.EncounterKey,
tmp_enc.Encounter_ID,
tmp_enc.Encounter_Type,
tmp_enc.Encounter_Is_Inpatient,
tmp_enc.Encounter_Is_Walk_In,
tmp_enc.Encounter_Age,
/*
get Patient_ID from DIAGNOSES (not from ENCOUNTERS)
tmp_enc.Patient_ID,
*/
/* DIAGNOSES table conditions */
diag.ICD9_Code,
/* this links to Encounter_ID in the ENCOUTNERS file*/
diag.Diagnoses_Encounter_ID,
/* type of diagnosis */
diag.Diagnosis_Type,
/*source table for diagnosis*/
diag.Diagnosis_Status,
/*status can be active, deleted, or resolved. */
diag.Diagnosis_Present_On_Admission,
/*yes or no*/
diag.Diagnosis_Name,
/* name of dx from dx_name in Clarity */
diag.Patient_ID
/* get Patient_ID to link back to PATIENTS, etc */
FROM
tmp_HtnEncounterTypeList AS tmp_enc
INNER JOIN DIAGNOSES AS diag
ON tmp_enc.Encounter_ID = diag.Diagnoses_Encounter_ID
-- these are all the diagnosis codes for hypertension
WHERE diag.ICD9_Code
IN ("401", "401.0", "401.1", "401.9","402",
"402.0","402.00","402.01", "402.1","402.10","402.11","402.9",
"402.90","402.91","403","403.00","403.01", "403.1","403.10",
"403.11","403.9","403.91","404","404.0","404.00","404.01",
"404.02","404.03","404.1","404.10","404.11","404.12","404.13",
"404.9","404.90", "404.91","404.92","404.93","405","405.0",
"405.01","405.09","405.1","405.11", "405.19", "405.9",
"405.91", "405.99")
AND (tmp_enc.EncounterDate
-- 2015 & 2017 data (updated)
BETWEEN '2015-01-01' AND '2017-12-31')
-- age between 18 and 85
AND (tmp_enc.Encounter_Age BETWEEN 18 AND 85);
/*
213813 row(s) affected
Records: 213813
Duplicates: 0
Warnings: 0
12137.804 sec
*/
/* CHECK COUNT ON NEW TABLE */
SELECT COUNT(*) FROM PHDI_HTN_SAMP;
/*
+----------+
| COUNT(*) |
+----------+
| 213813 |
+----------+
1 row in set (0.16 sec)
*/
Reduce this to distinct Patient_ID
s for patients with a hypertension diagnosis and at least 2 outpatient encounters. I can do this by counting the number of encounter IDs per Patient_ID
CREATE TABLE
temp_HtnCountPatIDEncID
SELECT
COUNT(*) AS count,
Encounter_ID,
Patient_ID
FROM
PHDI_HTN_SAMP
GROUP BY
Encounter_ID, Patient_ID
HAVING
count > 1
ORDER BY
count DESC;
/*
30311 row(s) affected
Records: 30311
Duplicates: 0
Warnings: 0
*/
/* check this new temp_HtnCountPatIDEncID table */
-- =======================================================
SELECT * FROM temp_HtnCountPatIDEncID
ORDER BY count
DESC LIMIT 5;
/*
+-------+-----------------+-----------------+
| count | Encounter_ID | Patient_ID |
+-------+-----------------+-----------------+
| 15 | 808640570379794 | 942523706238717 |
| 12 | 152675774414092 | 243384487461299 |
| 12 | 378120614681393 | 744728490710259 |
| 12 | 273427328560501 | 388629973400384 |
| 9 | 889970415737480 | 529602517839521 |
+-------+-----------------+-----------------+
5 rows in set (0.03 sec)
*/
SELECT * FROM temp_HtnCountPatIDEncID
ORDER BY count LIMIT 5;
/*
+-------+-----------------+-----------------+
| count | Encounter_ID | Patient_ID |
+-------+-----------------+-----------------+
| 2 | 471763866953552 | 434451227076352 |
| 2 | 862538943067193 | 16760296188295 |
| 2 | 453247585333884 | 131788644008338 |
| 2 | 106287259142846 | 302721116691828 |
| 2 | 604085796978325 | 977933479938656 |
+-------+-----------------+-----------------+
5 rows in set (0.02 sec)
*/
/* these range from 15 t0 2 */
/*
Now get the distinct PATIENT_ID
s from the temp_HtnCountPatIDEncID
table and put in separate table
CREATE TABLE PHDI_HTN_2ENC_IDs
SELECT DISTINCT Patient_ID FROM temp_HtnCountPatIDEncID;
SELECT * FROM PHDI_HTN_2ENC_IDs;
-- 19347 rows
/* DROP TEMP TABLES */
-- DROP TABLE temp_HtnCountPatIDEncID;
-- DROP TABLE tmp_HtnEncounterTypeList;
The MySQL query was very slow. This was due to a combination of 1) using free MySQL community edition, and 2) not properly indexing the new table of encounter types.
This is the query saved as a stored procedure in the MS SQL server database ([dbo].[deprecated_spc_HTN]
). It was reviewed and a couple changes were made to get the correct sample (and to improve speed).
USE [phdi]
GO
/****** Object: StoredProcedure [dbo].[deprecated_spc_HTN] ******/
/*
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
*/
/*
exec spc_HTN
this proc has hard-coded encounter dates
BETWEEN '2015-01-01' AND '2017-12-31'
*/
My previous strategy was to create a subset of the Encounter_Type
s, then join to the DIAGNOSES
table using the Diagnoses_Encounter_ID
. This slowed the process down because there wasn't a unique index in the temp table (tmp_HtnEncounterTypeList
).
This code chunk creates the PHDI_HTN_SAMP
table and formats each column before entering data into the table. Minor difference in workflow between MySQL and SQL Server.
CREATE TABLE #PHDI_HTN_SAMP(
[Encounterdate] [date] NULL,
[encounterkey] [varchar](250) NULL,
[Encounter_ID] [bigint] NULL,
[Encounter_type] [varchar](250) NULL,
[Encounter_Is_Inpatient] [varchar](250) NULL,
[Encounter_Is_Walk_In] [varchar](250) NULL,
[Encounter_age] [smallint] NULL,
[ICD9_code] [varchar](250) NULL,
[Diagnoses_Encounter_ID] [bigint] NULL,
[Diagnosis_Type] [varchar](250) NULL,
[Diagnosis_Status] [varchar](250) NULL,
[Diagnosis_Present_On_Admission] [varchar](250) NULL,
[Diagnosis_Name] [varchar](250) NULL,
[Patient_ID] [bigint] NULL
)
Now the SELECT
statement follows the INSERT INTO
command.
INSERT into #PHDI_HTN_SAMP
SELECT
enc.EncounterDate,
enc.EncounterKey,
enc.Encounter_ID,
enc.Encounter_Type,
enc.Encounter_Is_Inpatient,
enc.Encounter_Is_Walk_In,
enc.Encounter_Age,
/* DIAGNOSES table conditions */
diag.ICD9_Code,
/* this links to Encounter_ID in the ENCOUTNERS file*/
diag.Diagnoses_Encounter_ID,
/* type of diagnosis */
diag.Diagnosis_Type,
/*source table for diagnosis*/
diag.Diagnosis_Status,
/*status can be active, deleted, or resolved. */
diag.Diagnosis_Present_On_Admission,
/*yes or no*/
diag.Diagnosis_Name,
/* name of dx from dx_name in Clarity */
diag.Patient_ID
/* get Patient_ID to link back to PATIENTS, etc */
FROM
ENCOUNTERS AS enc -- >> ENCOUNTERS alias
INNER JOIN
DIAGNOSES AS diag -- >> DIAGNOSES alias
ON enc.Encounter_ID = diag.Diagnoses_Encounter_ID
-- >> these are all the diagnosis codes for hypertension
WHERE diag.ICD9_Code
IN ('401', '401.0', '401.1', '401.9','402',
'402.0','402.00','402.01', '402.1','402.10','402.11','402.9',
'402.90','402.91','403','403.00','403.01', '403.1','403.10',
'403.11','403.9','403.91','404','404.0','404.00','404.01',
'404.02','404.03','404.1','404.10','404.11','404.12','404.13',
'404.9','404.90', '404.91','404.92','404.93','405','405.0',
'405.01','405.09','405.1','405.11', '405.19', '405.9',
'405.91', '405.99') AND
(enc.EncounterDate
-- >> date parameters (2015 & 2017 data (updated))
BETWEEN '2015-01-01' AND '2017-12-31')
-- >> age parameters (age between 18 and 85)
AND (enc.Encounter_Age BETWEEN 18 AND 85)
/*
This was the previous temporary table that was slowing the process down. By
adding this as a WHERE condition it becomes part of the JOIN (and has indexes).
*/
AND enc.Encounter_Type IN('Appointment', 'Office Visit',
'History', 'Nurse Only', 'Care Coordination', 'Routine Prenatal',
'Off License Non Med Ctr', 'Home Care Visit', 'Evaluation',
'Therapy', 'Nutrition', 'Genetic Counseling', 'Social Work',
'Anti-coag visit', 'Initial Prenatal', 'Day Treatment',
'Postpartum Visit', 'Home Health Admission', 'Ophth Exam',
'Remote CHF Monitoring', 'LACTATION CONSULT',
'Office Visit - Admin Closed', 'Left without being seen',
'Community Orders', 'Care Coordination - Home Health',
'Medication Management', 'Rx Refill Authorize', 'Nurse Triage',
-- MBL recommended adding the following encounter_types
'Education','Procedure visit', 'Initial consult')
This portion is where the code was rewritten to get the sample we wanted. Previously the grouping statement included the Patient_ID
and Encounter_ID
, but this was not correct.
Instead create a table with the DISTINCT
Patient_ID
and Encounter_ID
s, then we only have to count the number of times a Patient_ID
shows up in the temp
table.
CREATE TABLE #temp (Encounter_ID bigint, Patient_ID bigint)
INSERT INTO #temp
SELECT DISTINCT
Encounter_ID, Patient_ID
FROM #PHDI_HTN_SAMP
There are multiple Encounter_ID
s per Patient_ID
. Moreover, there can be multiple ICD9_Code
s on a single Encounter_ID
.
Feedback from Chris
I ended up writing it the way Cary would do it:
SELECT DISTINCT patient_ID, encounter_Id
even though the way I did it would produce the same results:
GROUP BY patient_Id, encounter_ID
So it's not the grouping that's the problem in your original code. It's that you are counting the number of times the same
patient_ID
andencounter_ID
occur, instead of counting the number of distinctpatient_ID
/encounter_ID
combos.
If it helps, let's look at how I split it up as opposed to how Cary did it (and he combined them, so it will be not exactly one to one).
My original code:
CREATE TABLE temp
SELECT
COUNT(*) AS count,
Encounter_ID,
Patient_ID
FROM
PHDI_HTN_SAMP
GROUP BY
Encounter_ID,
Patient_ID
HAVING
count(*) > 1
ORDER BY
count DESC;
Chris's CODE (split into 2 pieces):
/* piece 1 */
INSERT INTO #temp
SELECT
Encounter_ID,
Patient_ID
FROM
PHDI_HTN_SAMP
GROUP BY
Encounter_ID,
Patient_ID
ORDER BY
Patient_ID,
Encounter_ID
/* piece 2 */
TRUNCATE TABLE
PHDI_HTN_2ENC_IDs
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #temp
GROUP BY patient_id
HAVING count(patient_id)>1
ORDER BY patient_id
Cary's original code (before re-doing and simplifying the whole thing)
CREATE TABLE #temp (Encounter_ID bigint, Patient_ID bigint)
INSERT INTO #temp
SELECT DISTINCT
Encounter_ID, Patient_ID
FROM #PHDI_HTN_SAMP
TRUNCATE TABLE PHDI_HTN_2ENC_IDs
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #temp
GROUP BY patient_id
HAVING count(patient_id) > 1
ORDER BY patient_id
Do you see how
DISTINCT
Patient_ID
,Encounter_ID
gives the same results asGROUP
ingBY
them?
Now what cary did in the final step was just grab distinct patient_Ids/encounter ID combos in the first step:
CREATE TABLE #PHDI_HTN_SAMP(Encounter_ID bigint, Patient_ID bigint)
INSERT INTO #PHDI_HTN_SAMP
SELECT DISTINCT enc.encounter_id,diag.Patient_ID -- ....
/* (REST of code for SAMP table) */
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT patient_id
FROM #PHDI_HTN_SAMP
GROUP BY patient_id
HAVING count(patient_id)>1
ORDER BY patient_id
So it's not that
GROUP
ingBY
is different thanSELECT DISTINCT
, it's that when you did thecount
you were counting the wrong things.
My code was wrong here:
This portion of code is counting the number of times each Patient_ID
and Encounter_ID
s occur (hence the crazy high numbers).
CREATE TABLE temp
SELECT
COUNT(*) AS count,
Encounter_ID,
Patient_ID
Should have been:
...which is getting the unique combinations of Patient_ID
:Encounter_ID
...
CREATE TABLE temp
SELECT DISTINCT
Encounter_ID,
Patient_ID
TRUNCATE TABLE PHDI_HTN_2ENC_IDs -- << deletes the table if it exists
INSERT INTO PHDI_HTN_2ENC_IDs
SELECT
patient_id
FROM #temp
GROUP BY patient_id
HAVING count(patient_id)>1
ORDER BY patient_id
Now drop the temporary tables.
drop table #temp
drop table #PHDI_HTN_SAMP