
@arm5077
Created April 23, 2014 18:43
Medicare doctor charge database query
SELECT medicare.*,
       peerAverage.average AS charge_average,
       ( medicare.average_submitted_chrg_amt - peerAverage.average ) /
         peerAverage.average * 100 AS difference_from_average,
       totalAverage.total_difference_average
FROM   medicare
       JOIN (SELECT Avg(average_submitted_chrg_amt) AS average,
                    hcpcs_code,
                    place_of_service
             FROM   medicare
             WHERE  npi LIKE '%'
                    AND hcpcs_code LIKE medicare.hcpcs_code
                    AND nppes_provider_city LIKE 'Latrobe'
                    AND nppes_provider_state LIKE 'PA'
             GROUP  BY place_of_service,
                       hcpcs_code) AS peerAverage
         ON medicare.hcpcs_code = peerAverage.hcpcs_code
            AND medicare.place_of_service = peerAverage.place_of_service
       /* This join basically repeats everything to get the total percentage average by doctor */
       JOIN (SELECT npi,
                    Avg(( medicare.average_submitted_chrg_amt - peerAverage.average ) /
                        peerAverage.average * 100) AS total_difference_average
             FROM   medicare
                    JOIN (SELECT Avg(average_submitted_chrg_amt) AS average,
                                 hcpcs_code,
                                 place_of_service
                          FROM   medicare
                          WHERE  npi LIKE '%'
                                 AND hcpcs_code LIKE medicare.hcpcs_code
                                 AND nppes_provider_city LIKE 'Pittsburgh'
                                 AND nppes_provider_state LIKE 'PA'
                          GROUP  BY place_of_service,
                                    hcpcs_code) AS peerAverage
                      ON medicare.hcpcs_code = peerAverage.hcpcs_code
                         AND medicare.place_of_service = peerAverage.place_of_service
             WHERE  medicare.npi LIKE '%'
                    AND nppes_provider_last_org_name LIKE '%'
                    AND nppes_provider_first_name LIKE '%'
                    AND nppes_provider_city LIKE 'Pittsburgh'
                    AND nppes_provider_state LIKE 'PA'
                    AND medicare.hcpcs_code LIKE '%'
             GROUP  BY npi) AS totalAverage
         ON medicare.npi = totalAverage.npi
GROUP  BY npi,
          place_of_service,
          hcpcs_code
ORDER  BY hcpcs_code ASC,
          difference_from_average DESC
@seejee commented Apr 23, 2014

What are the LIKE '%' parts of the query for? Are you just trying to eliminate NULLs? If so, an npi IS NOT NULL may perform a lot better.

Or better yet, scrub your medicare table of any NULLs so you can just remove those LIKE '%' checks entirely.
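
A minimal before/after sketch against the medicare table; the NULL check lets the optimizer use the index on npi instead of pattern-matching every row:

-- Pattern match: evaluates LIKE against every row and can't use the npi index
SELECT Avg(average_submitted_chrg_amt) FROM medicare WHERE npi LIKE '%';

-- Same intent as an explicit NULL check, which is index-friendly
SELECT Avg(average_submitted_chrg_amt) FROM medicare WHERE npi IS NOT NULL;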

@seejee commented Apr 23, 2014

Within each subquery, I don't think you need the AND hcpcs_code LIKE medicare.hcpcs_code predicates in your WHERE clauses. If I understand the query properly, they're redundant: you're comparing a column to itself.
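
For example, the first subquery with that predicate (and the npi LIKE '%' check) dropped would read:

SELECT Avg(average_submitted_chrg_amt) AS average,
       hcpcs_code,
       place_of_service
FROM   medicare
WHERE  nppes_provider_city LIKE 'Latrobe'
       AND nppes_provider_state LIKE 'PA'
GROUP  BY place_of_service,
          hcpcs_code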

@seejee commented Apr 23, 2014

One way to make this perform better would be to pre-process the data: calculate the overall average for each geographical area once and store it in a separate table. In other words, by dropping the npi information in favor of pre-aggregated data at the city and state level, you can save the DB from having to do a lot of work every time you run the query.

Essentially:

  • Generate a new table from the results of an unfiltered version of the peerAverages subquery
  • Index it on hcpcs_code and place_of_service (since that's how you group it)
  • Index it on city and (maybe) state (since that's how you filter it)
  • Join to that table in this query.

The table would look something like:

-- Lengths match the corresponding columns on the medicare table
CREATE TABLE medicare_averages_by_hcpcs_and_place_of_service (
  hcpcs_code varchar(10),
  place_of_service varchar(2),
  city varchar(40),
  state varchar(2),
  average_charge float
);

INSERT INTO medicare_averages_by_hcpcs_and_place_of_service
SELECT
  hcpcs_code,
  place_of_service,
  nppes_provider_city,
  nppes_provider_state,
  Avg(average_submitted_chrg_amt)
FROM   medicare
GROUP  BY
  place_of_service,
  hcpcs_code,
  nppes_provider_city,
  nppes_provider_state;

CREATE INDEX idx_hcpcs_pos ON medicare_averages_by_hcpcs_and_place_of_service (hcpcs_code, place_of_service);
CREATE INDEX idx_city ON medicare_averages_by_hcpcs_and_place_of_service (city);

Along the same lines, you could pre-process the cost average by npi in its own table. The resulting table would have a schema like:

CREATE TABLE medicare_averages_by_npi (npi varchar(10), average_charge float);
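
A minimal sketch of how you might populate it, assuming a plain average of submitted charges per provider (the percent-difference-from-peers average that the original query computes is another option):

INSERT INTO medicare_averages_by_npi
SELECT npi,
       Avg(average_submitted_chrg_amt)  -- assumption: straight per-NPI average
FROM   medicare
GROUP  BY npi;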

Optimizing SQL queries over large datasets like this is, at its core, all about reducing the number of rows the database has to touch. So if you want to run this over a large number of geographical areas, I would recommend precomputing these averages once so that you can explore the data as quickly as possible.

@seejee commented Apr 23, 2014

On the medicare table, make sure you have indexes on npi, hcpcs_code, place_of_service, city, and state since you are using them to join, group, and filter.
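
For example (index names are illustrative; the composite versions mirror how the query joins and filters):

CREATE INDEX idx_medicare_npi  ON medicare (npi);
CREATE INDEX idx_medicare_proc ON medicare (hcpcs_code, place_of_service);
CREATE INDEX idx_medicare_geo  ON medicare (nppes_provider_city, nppes_provider_state);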

@mhowe0422

Andrew: Would you post the table layout? Any indexes you created? Number of rows? Available memory? Temp file size? Number of processors? Is the MySQL configuration standard "out of the box," or have you changed runtime parms?

LIKE is a very expensive DB operation, and sorting without a proper index is time-, memory-, and disk-intensive.
Is this data available to download? I'd like to load it into SQL Server 2012, which has an execution plan tool that helps with tuning queries.
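
MySQL's rough analog of that execution plan tool is EXPLAIN, which reports the join order and which indexes a query will actually use, e.g.:

EXPLAIN
SELECT *
FROM   medicare
WHERE  nppes_provider_city = 'Pittsburgh'
       AND nppes_provider_state = 'PA';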

@mhowe0422

I did a quick look at the spreadsheet. It looks like you should have at least three tables:

Provider
  • npi appears to be unique (Primary Key)
  • First Name
  • Last Name
  • City
  • State
  • ZipCode

Services
  • hcpcs_code appears to be unique (Primary Key)
  • hcpcs_description

ServiceProvided
  • ServiceProvidedID (Primary Key)
  • Provider.npi (Foreign Key)
  • Services.hcpcs_code (Foreign Key)
  • Location (could be set up as a table)
  • average_submitted_chrg_amt

Then of course, the hcpcs_description has funky characters (/, &, > ...) that can wreak havoc on searches.
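
A rough MySQL sketch of that layout, with illustrative column names and lengths (place_of_service stands in here for the suggested Location table):

CREATE TABLE Provider (
  npi varchar(10) PRIMARY KEY,
  first_name varchar(20),
  last_name varchar(70),
  city varchar(40),
  state varchar(2),
  zip varchar(20)
);

CREATE TABLE Services (
  hcpcs_code varchar(10) PRIMARY KEY,
  hcpcs_description tinytext
);

CREATE TABLE ServiceProvided (
  service_provided_id int AUTO_INCREMENT PRIMARY KEY,
  npi varchar(10),              -- Foreign Key to Provider
  hcpcs_code varchar(10),       -- Foreign Key to Services
  place_of_service varchar(2),  -- could become its own Location table
  average_submitted_chrg_amt float,
  FOREIGN KEY (npi) REFERENCES Provider (npi),
  FOREIGN KEY (hcpcs_code) REFERENCES Services (hcpcs_code)
) ENGINE=InnoDB;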

@arm5077 (Author) commented Apr 24, 2014

Thanks to both of you.

Here's a preliminary query I've worked up using your advice. @seejee, I created the new location table like you suggested and join to it in this query. @mhowe0422, I've eliminated all the extra "LIKE"s and converted the ones I kept into straight "="s.

This doesn't yet deal with the "average of averages," which I'm not quite sure how to tackle -- I'll have to noodle on that some more. I must be doing something wrong with my joins (all joined fields are indexed), because this query is actually running slower than an older version of my previous query that didn't have the double-nested subquery.

SELECT medicare.*,
       medicare_averages_by_hcpcs_and_place_of_service.average_charge,
       Avg(( medicare.average_submitted_chrg_amt
             - medicare_averages_by_hcpcs_and_place_of_service.average_charge )
           / medicare_averages_by_hcpcs_and_place_of_service.average_charge * 100)
         AS percent_difference
FROM   medicare
       JOIN medicare_averages_by_hcpcs_and_place_of_service
         ON medicare.hcpcs_code = medicare_averages_by_hcpcs_and_place_of_service.hcpcs_code
            AND medicare.place_of_service = medicare_averages_by_hcpcs_and_place_of_service.place_of_service
            AND medicare.nppes_provider_city = medicare_averages_by_hcpcs_and_place_of_service.nppes_provider_city
            AND medicare.nppes_provider_state = medicare_averages_by_hcpcs_and_place_of_service.nppes_provider_state
WHERE  medicare.nppes_provider_city = 'PITTSBURGH'
       AND medicare.nppes_provider_state = 'PA'
GROUP  BY npi,
          place_of_service,
          hcpcs_code
ORDER  BY hcpcs_code ASC

@arm5077 (Author) commented Apr 24, 2014

For reference, here are the SHOW CREATE TABLE statements for both tables.

Main table:

CREATE TABLE `medicare` (
  `npi` varchar(10) DEFAULT NULL,
  `nppes_provider_last_org_name` varchar(70) DEFAULT NULL,
  `nppes_provider_first_name` varchar(20) DEFAULT NULL,
  `nppes_provider_mi` tinytext,
  `nppes_credentials` tinytext,
  `nppes_provider_gender` tinytext,
  `nppes_entity_code` tinytext,
  `nppes_provider_street1` tinytext,
  `nppes_provider_street2` tinytext,
  `nppes_provider_city` varchar(40) DEFAULT NULL,
  `nppes_provider_zip` varchar(20) DEFAULT NULL,
  `nppes_provider_state` varchar(2) DEFAULT NULL,
  `nppes_provider_country` tinytext,
  `provider_type` tinytext,
  `medicare_participation_indicator` tinytext,
  `place_of_service` varchar(2) DEFAULT NULL,
  `hcpcs_code` varchar(10) DEFAULT NULL,
  `hcpcs_description` tinytext,
  `line_srvc_cnt` int(11) DEFAULT NULL,
  `bene_unique_cnt` int(11) DEFAULT NULL,
  `bene_day_srvc_cnt` int(11) DEFAULT NULL,
  `average_Medicare_allowed_amt` float DEFAULT NULL,
  `stdev_Medicare_allowed_amt` float DEFAULT NULL,
  `average_submitted_chrg_amt` float DEFAULT NULL,
  `stdev_submitted_chrg_amt` float DEFAULT NULL,
  `average_Medicare_payment_amt` float DEFAULT NULL,
  `stdev_Medicare_payment_amt` float DEFAULT NULL,
  KEY `nppes_provider_first_name` (`nppes_provider_first_name`),
  KEY `nppes_provider_last_org_name` (`nppes_provider_last_org_name`),
  KEY `nppes_provider_city` (`nppes_provider_city`),
  KEY `nppes_provider_zip` (`nppes_provider_zip`),
  KEY `nppes_provider_state` (`nppes_provider_state`),
  KEY `npi` (`npi`),
  KEY `average_submitted_chrg_amt` (`average_submitted_chrg_amt`),
  KEY `place_of_service` (`place_of_service`),
  KEY `hcpcs_code` (`hcpcs_code`),
  KEY `stdev_submitted_chrg_amt` (`stdev_submitted_chrg_amt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='How much doctors charge Medicare/Medicaid for services'

Average costs by facility type, city and state:

CREATE TABLE `medicare_averages_by_hcpcs_and_place_of_service` (
  `hcpcs_code` varchar(10) DEFAULT NULL,
  `place_of_service` varchar(2) DEFAULT NULL,
  `nppes_provider_city` varchar(40) DEFAULT NULL,
  `nppes_provider_state` varchar(2) DEFAULT NULL,
  `average_charge` float DEFAULT NULL,
  KEY `hcpcs_code` (`hcpcs_code`,`place_of_service`),
  KEY `nppes_provider_city` (`nppes_provider_city`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

@seejee commented Apr 24, 2014

@arm5077 Try throwing an index on the nppes_provider_state column of the averages table as well. It's the only thing I can think of that might be slowing it down. How many rows ended up in the averages table?
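
For example (index name is illustrative):

CREATE INDEX idx_averages_state
  ON medicare_averages_by_hcpcs_and_place_of_service (nppes_provider_state);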

I'm more familiar with postgres, so I might pull this down into a postgres db and give it a try. I also have access to some Really Honking Powerful Postgres servers, so if I get it working locally, I might give the query a try there as well.

@arm5077 (Author) commented Apr 25, 2014

@seejee Will try the index. The averages table has 2 million rows ... more than I expected, but logical, I guess -- you're dealing with 6,000 procedures across 50 states and 18,000 cities.
