Snippet which shows how a de-privleged operator can execute BQ Differetnial privacy functions of encrypted data.
In the following, there are three parties:
- 2 hospitals
- 1 pharma company
The hospitals each owns a BQ table which it deems has a sensitive column which they encrypted with their own AEAD key
The pharma company would like to combine the datasets from each hospital in such a way that the output has BQ Differential Privacy applied.
Each hospital does not want their sensitive data visible/accessible to any party; theyr'e ok if the combined dataset has BQ's diff privacy applied.
The approach take here is similar to what was done here:
Essentially, each hospital encypts a sensitive column with their own AEAD key and that key will only materialize for the pharma company with a secure Trusted Execution Environment (TEE
) such as GCP Confidential Space.
The TEE is such that not even the operator can ssh in or access the runtime state. Only within that TEE will each of the providers key materialize and be used to construct the BQ query.
To facilitate this, we'll ofcourse use BQ AEAD Encrypted and encrypt the privacy_unit_column using that.
As a baseline, we will use the sample differential privacy cited in Introducing BigQuery differential privacy but instead run this over two years worth of data.
Why two? well, w'ere going to pretend later that each hospital owns one years worth of data they encrypted. The pharma company wants to then run something like;
bq query --use_legacy_sql=false '
SELECT
WITH
DIFFERENTIAL_PRIVACY
OPTIONS (
epsilon = 1,
delta = 1e-7,
privacy_unit_column = npi)
provider_type,
PERCENTILE_CONT(
bene_unique_cnt, 0.5, contribution_bounds_per_row => (0, 10000))
percentile_50th,
PERCENTILE_CONT(
bene_unique_cnt, 0.9, contribution_bounds_per_row => (0, 10000))
percentile_90th
FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2014` `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2015`
WHERE provider_type IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'
+--------------------------------------+--------------------+--------------------+
| provider_type | percentile_50th | percentile_90th |
+--------------------------------------+--------------------+--------------------+
| Radiation Therapy | 234.19525780544257 | 7196.875000000001 |
| Multispecialty Clinic/Group Practice | 174.6764151278889 | 1815.3379753216432 |
| Peripheral Vascular Disease | 103.99220023869114 | 2651.7857142857147 |
| Ambulance Service Supplier | 100.54990425895404 | 684.3206889589827 |
| Geriatric Psychiatry | 64.48901390533804 | 2758.450255102041 |
| Emergency Medicine | 62.940693405512484 | 264.0494928085235 |
| Public Health Welfare Agency | 59.35238881318908 | 5837.890625 |
| Unknown Physician Specialty Code | 58.99737712523947 | 588.7551875475108 |
| All Other Suppliers | 51.41812646812326 | 390.0425701202529 |
| Sleep Medicine | 49.9722997021223 | 280.70476872821087 |
+--------------------------------------+--------------------+--------------------+
First pretend each hospital creates their own dataset that is just a copy of a years worth of data.
The npi
column is AEAD encrypted with different keys
For hospital_1:
bq mk --data_location=US aead_diff_2014
bq query --append_table --destination_table aead_diff_2014.physicians_and_other_supplier --nouse_legacy_sql --parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \ '
SELECT AEAD.ENCRYPT(FROM_BASE64(@keyset1), npi, "") as npi, provider_type, bene_unique_cnt FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2014`'
For hospital_2:
bq mk --data_location=US aead_diff_2015
bq query --append_table --destination_table aead_diff_2015.physicians_and_other_supplier --nouse_legacy_sql --parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB \ '
SELECT AEAD.ENCRYPT(FROM_BASE64(@keyset2), npi, "") as npi, provider_type, bene_unique_cnt FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2015`'
The step we're omitting is how the AEAD keys are released inside the TEE.
That is described in confidential space git repo
Not just to test, pretend you are inside the TEE and have access to bq read each dataset (in encrypted form) but can also AEAD decrypt the output (i.,e you got the keys from the hospitals)
(again, this is just to show the aead functions do actually work, nothing more)
bq query \
--parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \
--parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB --use_legacy_sql=false '
SELECT provider_type,bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset1), npi, "") as npi
FROM aead_diff_2014.physicians_and_other_supplier
UNION ALL
SELECT provider_type,bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset2),npi, "") as npi
FROM aead_diff_2015.physicians_and_other_supplier
'
The fact we can decrypt the keys means we can also issue a differential privacy query for the output:
Run a quer
bq query \
--parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \
--parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB --use_legacy_sql=false '
SELECT
WITH
DIFFERENTIAL_PRIVACY
OPTIONS (
epsilon = 1,
delta = 1e-7,
privacy_unit_column = npi)
provider_type,
PERCENTILE_CONT(
bene_unique_cnt, 0.5, contribution_bounds_per_row => (0, 10000))
percentile_50th,
PERCENTILE_CONT(
bene_unique_cnt, 0.9, contribution_bounds_per_row => (0, 10000))
percentile_90th
FROM
(
SELECT provider_type as provider_type, bene_unique_cnt as bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset1), npi, "") as npi FROM aead_diff_2014.physicians_and_other_supplier
UNION ALL
SELECT provider_type as provider_type, bene_unique_cnt as bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset2), npi, "") as npi FROM aead_diff_2015.physicians_and_other_supplier
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'
+--------------------------------------+--------------------+--------------------+
| provider_type | percentile_50th | percentile_90th |
+--------------------------------------+--------------------+--------------------+
| Radiation Therapy | 254.2401175213675 | 6663.98552955665 |
| Multispecialty Clinic/Group Practice | 121.06096656207464 | 2180.0082736545146 |
| Ambulance Service Supplier | 86.00294576803891 | 577.625461511979 |
| Peripheral Vascular Disease | 60.74804255648616 | 605.8674648668639 |
| Public Health Welfare Agency | 60.03680373492293 | 342.96529321923515 |
| Emergency Medicine | 59.037927604544535 | 255.0081203590081 |
| Sleep Medicine | 56.29396546150764 | 417.82495856850466 |
| Audiologist (billing independently) | 47.077202725091865 | 195.73551095823916 |
| Ophthalmology | 47.04410243302923 | 271.9321652613481 |
| Centralized Flu | 47.034427493815464 | 205.07063958778716 |
+--------------------------------------+--------------------+--------------------+
Again, at no time did the hosptital have access to the raw keys as long as the keys were released to confidential space or other TEE