Skip to content

Instantly share code, notes, and snippets.

@salrashid123
Created July 20, 2023 13:03
Show Gist options
  • Save salrashid123/3d7bc17e2d3f096e68e1bd27e4baea5e to your computer and use it in GitHub Desktop.
Save salrashid123/3d7bc17e2d3f096e68e1bd27e4baea5e to your computer and use it in GitHub Desktop.
BQ Differential Privacy using AEAD and GCP Confidential Space

BQ Differential Privacy using AEAD and GCP Confidential Space

Snippet which shows how a de-privleged operator can execute BQ Differetnial privacy functions of encrypted data.

In the following, there are three parties:

  • 2 hospitals
  • 1 pharma company

The hospitals each owns a BQ table which it deems has a sensitive column which they encrypted with their own AEAD key

The pharma company would like to combine the datasets from each hospital in such a way that the output has BQ Differential Privacy applied.

Each hospital does not want their sensitive data visible/accessible to any party; theyr'e ok if the combined dataset has BQ's diff privacy applied.


The approach take here is similar to what was done here:

Essentially, each hospital encypts a sensitive column with their own AEAD key and that key will only materialize for the pharma company with a secure Trusted Execution Environment (TEE) such as GCP Confidential Space.

The TEE is such that not even the operator can ssh in or access the runtime state. Only within that TEE will each of the providers key materialize and be used to construct the BQ query.

To facilitate this, we'll ofcourse use BQ AEAD Encrypted and encrypt the privacy_unit_column using that.


Baseline

As a baseline, we will use the sample differential privacy cited in Introducing BigQuery differential privacy but instead run this over two years worth of data.

Why two? well, w'ere going to pretend later that each hospital owns one years worth of data they encrypted. The pharma company wants to then run something like;

bq  query --use_legacy_sql=false  '
SELECT
WITH
  DIFFERENTIAL_PRIVACY
    OPTIONS (
      epsilon = 1,
      delta = 1e-7,
      privacy_unit_column = npi)
    provider_type,
PERCENTILE_CONT(
  bene_unique_cnt, 0.5, contribution_bounds_per_row => (0, 10000))
  percentile_50th,
PERCENTILE_CONT(
  bene_unique_cnt, 0.9, contribution_bounds_per_row => (0, 10000))
  percentile_90th
FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2014` `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2015` 
WHERE provider_type IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'

+--------------------------------------+--------------------+--------------------+
|            provider_type             |  percentile_50th   |  percentile_90th   |
+--------------------------------------+--------------------+--------------------+
| Radiation Therapy                    | 234.19525780544257 |  7196.875000000001 |
| Multispecialty Clinic/Group Practice |  174.6764151278889 | 1815.3379753216432 |
| Peripheral Vascular Disease          | 103.99220023869114 | 2651.7857142857147 |
| Ambulance Service Supplier           | 100.54990425895404 |  684.3206889589827 |
| Geriatric Psychiatry                 |  64.48901390533804 |  2758.450255102041 |
| Emergency Medicine                   | 62.940693405512484 |  264.0494928085235 |
| Public Health Welfare Agency         |  59.35238881318908 |        5837.890625 |
| Unknown Physician Specialty Code     |  58.99737712523947 |  588.7551875475108 |
| All Other Suppliers                  |  51.41812646812326 |  390.0425701202529 |
| Sleep Medicine                       |   49.9722997021223 | 280.70476872821087 |
+--------------------------------------+--------------------+--------------------+

Setup

First pretend each hospital creates their own dataset that is just a copy of a years worth of data.

The npi column is AEAD encrypted with different keys

For hospital_1:

bq mk  --data_location=US aead_diff_2014

bq query   --append_table --destination_table aead_diff_2014.physicians_and_other_supplier --nouse_legacy_sql --parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \ '
  SELECT AEAD.ENCRYPT(FROM_BASE64(@keyset1), npi, "") as npi, provider_type, bene_unique_cnt FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2014`'

For hospital_2:

bq mk  --data_location=US aead_diff_2015

bq query   --append_table --destination_table aead_diff_2015.physicians_and_other_supplier --nouse_legacy_sql --parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB \ '
  SELECT AEAD.ENCRYPT(FROM_BASE64(@keyset2), npi, "") as npi,  provider_type, bene_unique_cnt FROM `bigquery-public-data.cms_medicare.physicians_and_other_supplier_2015`'

Key release to TEE

The step we're omitting is how the AEAD keys are released inside the TEE.

That is described in confidential space git repo

Not just to test, pretend you are inside the TEE and have access to bq read each dataset (in encrypted form) but can also AEAD decrypt the output (i.,e you got the keys from the hospitals)

(again, this is just to show the aead functions do actually work, nothing more)

bq  query \
--parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \
--parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB --use_legacy_sql=false  '

SELECT provider_type,bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset1), npi, "") as npi
FROM aead_diff_2014.physicians_and_other_supplier

UNION ALL

SELECT provider_type,bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset2),npi, "") as npi
FROM aead_diff_2015.physicians_and_other_supplier
'

The fact we can decrypt the keys means we can also issue a differential privacy query for the output:

Run a quer

bq  query \
--parameter=keyset1::CMKIrNYJEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIGNoYW5nZSB0aGlzIHBhc3N3b3JkIHRvIGEgc2VjcmV0GAEQARjCiKzWCSAB \
--parameter=keyset2::CNXd6toHEmQKWAowdHlwZS5nb29nbGVhcGlzLmNvbS9nb29nbGUuY3J5cHRvLnRpbmsuQWVzR2NtS2V5EiIaIDnvckhhXod0bLVuBGIrCHiCl14aUkBeN8MaBiimapLZGAEQARjV3eraByAB --use_legacy_sql=false  '

SELECT 
WITH
  DIFFERENTIAL_PRIVACY
    OPTIONS (
      epsilon = 1,
      delta = 1e-7,
      privacy_unit_column = npi)
    provider_type,
PERCENTILE_CONT(
  bene_unique_cnt, 0.5, contribution_bounds_per_row => (0, 10000))
  percentile_50th,
PERCENTILE_CONT(
  bene_unique_cnt, 0.9, contribution_bounds_per_row => (0, 10000))
  percentile_90th
FROM 
  (  
    SELECT provider_type as provider_type, bene_unique_cnt as bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset1), npi, "") as npi FROM aead_diff_2014.physicians_and_other_supplier

     UNION ALL

    SELECT provider_type as provider_type, bene_unique_cnt as bene_unique_cnt, AEAD.DECRYPT_STRING(FROM_BASE64(@keyset2), npi, "") as npi FROM aead_diff_2015.physicians_and_other_supplier
  )
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
'

+--------------------------------------+--------------------+--------------------+
|            provider_type             |  percentile_50th   |  percentile_90th   |
+--------------------------------------+--------------------+--------------------+
| Radiation Therapy                    |  254.2401175213675 |   6663.98552955665 |
| Multispecialty Clinic/Group Practice | 121.06096656207464 | 2180.0082736545146 |
| Ambulance Service Supplier           |  86.00294576803891 |   577.625461511979 |
| Peripheral Vascular Disease          |  60.74804255648616 |  605.8674648668639 |
| Public Health Welfare Agency         |  60.03680373492293 | 342.96529321923515 |
| Emergency Medicine                   | 59.037927604544535 |  255.0081203590081 |
| Sleep Medicine                       |  56.29396546150764 | 417.82495856850466 |
| Audiologist (billing independently)  | 47.077202725091865 | 195.73551095823916 |
| Ophthalmology                        |  47.04410243302923 |  271.9321652613481 |
| Centralized Flu                      | 47.034427493815464 | 205.07063958778716 |
+--------------------------------------+--------------------+--------------------+

Again, at no time did the hosptital have access to the raw keys as long as the keys were released to confidential space or other TEE


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment