SELECT medicare.*,
       peerAverage.average AS charge_average,
       ( medicare.average_submitted_chrg_amt - peerAverage.average ) /
       peerAverage.average * 100 AS difference_from_average,
       totalAverage.total_difference_average
FROM   medicare
       JOIN (SELECT Avg(average_submitted_chrg_amt) AS average,
                    hcpcs_code,
                    place_of_service
             FROM   medicare
             WHERE  npi LIKE '%'
                    AND hcpcs_code LIKE medicare.hcpcs_code
                    AND nppes_provider_city LIKE 'Latrobe'
                    AND nppes_provider_state LIKE 'PA'
             GROUP  BY place_of_service,
                       hcpcs_code) AS peerAverage
         ON medicare.hcpcs_code = peerAverage.hcpcs_code
            AND medicare.place_of_service = peerAverage.place_of_service
       /* This join basically repeats everything to get the total percentage average by doctor */
       JOIN (SELECT npi,
                    Avg(( medicare.average_submitted_chrg_amt -
                          peerAverage.average ) /
                        peerAverage.average * 100) AS total_difference_average
             FROM   medicare
                    JOIN (SELECT Avg(average_submitted_chrg_amt) AS average,
                                 hcpcs_code,
                                 place_of_service
                          FROM   medicare
                          WHERE  npi LIKE '%'
                                 AND hcpcs_code LIKE medicare.hcpcs_code
                                 AND nppes_provider_city LIKE 'Pittsburgh'
                                 AND nppes_provider_state LIKE 'PA'
                          GROUP  BY place_of_service,
                                    hcpcs_code) AS peerAverage
                      ON medicare.hcpcs_code = peerAverage.hcpcs_code
                         AND medicare.place_of_service = peerAverage.place_of_service
             WHERE  medicare.npi LIKE '%'
                    AND nppes_provider_last_org_name LIKE '%'
                    AND nppes_provider_first_name LIKE '%'
                    AND nppes_provider_city LIKE 'Pittsburgh'
                    AND nppes_provider_state LIKE 'PA'
                    AND medicare.hcpcs_code LIKE '%'
             GROUP  BY npi) AS totalAverage
         ON medicare.npi = totalAverage.npi
GROUP  BY npi,
          place_of_service,
          hcpcs_code
ORDER  BY hcpcs_code ASC,
          difference_from_average DESC
Within each subquery, I don't think you need the AND hcpcs_code LIKE medicare.hcpcs_code predicates in your WHERE clauses. If I understand the query properly, it's redundant and you're comparing a column to itself.
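The inner subquery would then shrink to something like this (same columns as the original, with only the redundant predicate dropped):

```sql
SELECT Avg(average_submitted_chrg_amt) AS average,
       hcpcs_code,
       place_of_service
FROM   medicare
WHERE  npi LIKE '%'
       AND nppes_provider_city LIKE 'Latrobe'
       AND nppes_provider_state LIKE 'PA'
GROUP  BY place_of_service,
          hcpcs_code
```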
One way to make this perform better would be to pre-process the data: calculate the overall average for each geographical area once and store that in a different table. In other words, by dropping the npi information in favor of pre-aggregated data at the city and state level, you save the DB from having to do a lot of work every time you run the query.
Essentially:
- Generate a new table from the results of an unfiltered version of the peerAverage subquery
- Index it on hcpcs_code and place_of_service (since that's how you group it)
- Index it on city and (maybe) state (since that's how you filter it)
- Join to that table in this query.
The table would look something like:
CREATE TABLE medicare_averages_by_hcpcs_and_place_of_service (
  hcpcs_code varchar(10),
  place_of_service varchar(2),
  city varchar(40),
  state varchar(2),
  average_charge float
);

INSERT INTO medicare_averages_by_hcpcs_and_place_of_service
SELECT hcpcs_code,
       place_of_service,
       city,
       state,
       Avg(average_submitted_chrg_amt) AS average_charge
FROM   medicare
GROUP  BY place_of_service,
          hcpcs_code,
          city,
          state;

CREATE INDEX hcpcs_place_idx ON medicare_averages_by_hcpcs_and_place_of_service (hcpcs_code, place_of_service);
CREATE INDEX city_idx ON medicare_averages_by_hcpcs_and_place_of_service (city);
Along the same lines, you could pre-process the cost average by npi in its own table. The resulting table would have a schema like:
CREATE TABLE medicare_averages_by_npi (npi varchar(10), average_charge float);
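Populating and indexing it would follow the same pattern as the other averages table (a sketch; the index name is arbitrary):

```sql
INSERT INTO medicare_averages_by_npi
SELECT npi,
       Avg(average_submitted_chrg_amt)
FROM   medicare
GROUP  BY npi;

CREATE INDEX npi_idx ON medicare_averages_by_npi (npi);
```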
Optimizing SQL queries over large datasets like this is, at the core, all about reducing the number of rows the database has to touch. So if you want to run this over a large number of geographical areas, I would recommend precomputing these averages once so that you can explore the data as quickly as possible.
On the medicare table, make sure you have indexes on npi, hcpcs_code, place_of_service, city, and state, since you are using them to join, group, and filter.
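In MySQL those indexes would look something like this (index names are arbitrary; the city/state column names assume the nppes_provider_* naming used elsewhere in this thread):

```sql
ALTER TABLE medicare
  ADD INDEX npi_idx (npi),
  ADD INDEX hcpcs_code_idx (hcpcs_code),
  ADD INDEX place_of_service_idx (place_of_service),
  ADD INDEX city_idx (nppes_provider_city),
  ADD INDEX state_idx (nppes_provider_state);
```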
Andrew: Would you post the table layout? Any indexes you created? Number of rows? Available memory? Temp file size? Number of processors? Is the MySQL install standard "out of the box," or have you changed runtime params?
LIKE is a very expensive DB operation. Sorting without a proper index is time-, memory-, and disk-intensive.
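To illustrate (assuming a B-tree index on npi): a leading-wildcard LIKE cannot use the index and forces a full scan, while an equality or prefix match can use it.

```sql
-- Cannot use the npi index: the leading wildcard forces a full table scan
SELECT * FROM medicare WHERE npi LIKE '%4567';

-- Can use the npi index
SELECT * FROM medicare WHERE npi = '1234567890';

-- A prefix match (no leading wildcard) can also use the index
SELECT * FROM medicare WHERE npi LIKE '123%';
```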
Is this data available to download? I'd like to load it into SQLServer 2012. They have an execution plan tool that helps with tuning queries.
I did a quick look at the spreadsheet. It looks like you should have at least three tables:
Provider
  - npi appears to be unique (Primary Key)
  - First Name
  - Last Name
  - City
  - State
  - ZipCode
Services
  - hcpcs_code appears to be unique (Primary Key)
  - hcpcs_description
ServiceProvided
  - ServiceProvidedID (Primary Key)
  - Provider.npi (Foreign Key)
  - Services.hcpcs_code (Foreign Key)
  - Location (could be set up as a table)
  - average_submitted_chrg_amt
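A sketch of that normalized layout, with column types borrowed from the existing medicare table (table and column names here are illustrative, not prescribed):

```sql
CREATE TABLE provider (
  npi varchar(10) PRIMARY KEY,
  first_name varchar(20),
  last_name varchar(70),
  city varchar(40),
  state varchar(2),
  zip varchar(20)
);

CREATE TABLE services (
  hcpcs_code varchar(10) PRIMARY KEY,
  hcpcs_description tinytext
);

CREATE TABLE service_provided (
  service_provided_id int AUTO_INCREMENT PRIMARY KEY,
  npi varchar(10),
  hcpcs_code varchar(10),
  place_of_service varchar(2),
  average_submitted_chrg_amt float,
  FOREIGN KEY (npi) REFERENCES provider (npi),
  FOREIGN KEY (hcpcs_code) REFERENCES services (hcpcs_code)
);
```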
Then of course, the hcpcs_description has funky characters (/, &, > ...) that can wreak havoc on searches.
Thanks to both of you.
Here's a preliminary query I've worked up using your advice. @seejee, I created the new location table like you suggested and join to it in this query. @mhowee0422, I've eliminated all the extra "LIKEs" and converted the ones I have into straight "=".
This doesn't yet deal with the "average of averages," which I'm not quite sure how to tackle -- will have to noodle on that some more. I must be doing something wrong with my joins (all joined fields are indexed) because this query is actually running slower than an older version of my previous query that didn't have the double-nested subquery.
SELECT medicare.*,
medicare_averages_by_hcpcs_and_place_of_service.average_charge,
Avg((medicare.average_submitted_chrg_amt - medicare_averages_by_hcpcs_and_place_of_service.average_charge)/medicare_averages_by_hcpcs_and_place_of_service.average_charge * 100) as percent_difference
FROM medicare
JOIN medicare_averages_by_hcpcs_and_place_of_service
ON medicare.hcpcs_code =
medicare_averages_by_hcpcs_and_place_of_service.hcpcs_code
AND medicare.place_of_service =
medicare_averages_by_hcpcs_and_place_of_service.place_of_service
AND medicare.nppes_provider_city =
medicare_averages_by_hcpcs_and_place_of_service.nppes_provider_city
AND medicare.nppes_provider_state =
medicare_averages_by_hcpcs_and_place_of_service.nppes_provider_state
WHERE medicare.nppes_provider_city = 'PITTSBURGH'
AND medicare.nppes_provider_state = 'PA'
GROUP BY npi,
place_of_service,
hcpcs_code
ORDER BY hcpcs_code ASC
For reference, here's my SHOW CREATE TABLEs for both tables.
Main database
CREATE TABLE `medicare` (
`npi` varchar(10) DEFAULT NULL,
`nppes_provider_last_org_name` varchar(70) DEFAULT NULL,
`nppes_provider_first_name` varchar(20) DEFAULT NULL,
`nppes_provider_mi` tinytext,
`nppes_credentials` tinytext,
`nppes_provider_gender` tinytext,
`nppes_entity_code` tinytext,
`nppes_provider_street1` tinytext,
`nppes_provider_street2` tinytext,
`nppes_provider_city` varchar(40) DEFAULT NULL,
`nppes_provider_zip` varchar(20) DEFAULT NULL,
`nppes_provider_state` varchar(2) DEFAULT NULL,
`nppes_provider_country` tinytext,
`provider_type` tinytext,
`medicare_participation_indicator` tinytext,
`place_of_service` varchar(2) DEFAULT NULL,
`hcpcs_code` varchar(10) DEFAULT NULL,
`hcpcs_description` tinytext,
`line_srvc_cnt` int(11) DEFAULT NULL,
`bene_unique_cnt` int(11) DEFAULT NULL,
`bene_day_srvc_cnt` int(11) DEFAULT NULL,
`average_Medicare_allowed_amt` float DEFAULT NULL,
`stdev_Medicare_allowed_amt` float DEFAULT NULL,
`average_submitted_chrg_amt` float DEFAULT NULL,
`stdev_submitted_chrg_amt` float DEFAULT NULL,
`average_Medicare_payment_amt` float DEFAULT NULL,
`stdev_Medicare_payment_amt` float DEFAULT NULL,
KEY `nppes_provider_first_name` (`nppes_provider_first_name`),
KEY `nppes_provider_last_org_name` (`nppes_provider_last_org_name`),
KEY `nppes_provider_city` (`nppes_provider_city`),
KEY `nppes_provider_zip` (`nppes_provider_zip`),
KEY `nppes_provider_state` (`nppes_provider_state`),
KEY `npi` (`npi`),
KEY `average_submitted_chrg_amt` (`average_submitted_chrg_amt`),
KEY `place_of_service` (`place_of_service`),
KEY `hcpcs_code` (`hcpcs_code`),
KEY `stdev_submitted_chrg_amt` (`stdev_submitted_chrg_amt`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='How much doctors charge Medicare/Medicaid for services'
Average costs by facility type, city and state:
CREATE TABLE `medicare_averages_by_hcpcs_and_place_of_service` (
`hcpcs_code` varchar(10) DEFAULT NULL,
`place_of_service` varchar(2) DEFAULT NULL,
`nppes_provider_city` varchar(40) DEFAULT NULL,
`nppes_provider_state` varchar(2) DEFAULT NULL,
`average_charge` float DEFAULT NULL,
KEY `hcpcs_code` (`hcpcs_code`,`place_of_service`),
KEY `nppes_provider_city` (`nppes_provider_city`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
@arm5077 Try throwing an index on the nppes_provider_state column of the averages table as well. It's the only thing I can think of that might be slowing it down. How many rows ended up in the averages table?
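That would be a one-liner against the averages table shown above (index name is arbitrary):

```sql
CREATE INDEX nppes_provider_state_idx
  ON medicare_averages_by_hcpcs_and_place_of_service (nppes_provider_state);
```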
I'm more familiar with Postgres, so I might pull this down into a Postgres db and give it a try. I also have access to some Really Honking Powerful Postgres servers, so if I get it working locally, I might give the query a try there as well.
@seejee Will try the index. The averages table has 2 million rows ... more than I expected, but logical, I guess -- you're dealing with 6,000 procedures across 50 states and 18,000 cities.
What are the LIKE '%' parts of the query for? Are you just trying to eliminate NULLs? If so, an npi IS NOT NULL may perform a lot better. Or better yet, scrub your medicare table of any NULLs so you can just remove those LIKE '%' checks entirely.
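Concretely, the two predicates filter the same rows (LIKE '%' matches any non-NULL string), so the swap is mechanical:

```sql
-- Instead of:
-- WHERE npi LIKE '%'

-- use:
-- WHERE npi IS NOT NULL
SELECT * FROM medicare WHERE npi IS NOT NULL;
```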