This proposal is divided into -
- Introduction to the idea in an abstract way
- Specific focus of the proposal
- Proposed approach
Data transfer is always a communication between 2 entities/systems - the one system that owns the data and the other that requests the data.
sequenceDiagram
autonumber
participant DR as Data Requester
participant DO as Data Owner
DR->>DO: Request Data
DO->>DR: Data Response
In the above interaction, the data at any point is either at rest or in motion, i.e. being transferred from one system to another -
sequenceDiagram
autonumber
participant DRD as Data Requester Store
participant DR as Data Requester
participant DO as Data Owner
participant DOD as Data Owner Store
DR->>DO: Request Data
DO->>DOD: Fetch Data
DO->>DR: Data Response
DR->>DRD: Store Data
Privacy considerations apply both when the data is at rest in a database/store and when it is in motion, i.e. being transferred to a different system. In the upcoming sections, we will try to characterize these systems and show how the "privacy considerations" of these interactions can be codified.
If we continue with the example of the Data Requester, this system can be an internal system or an external system.
We define an internal system as one that sits within a specific privacy-boundary.
A privacy-boundary helps us define which privacy considerations apply to any system that is outside the boundary. How do we define such a privacy-boundary?
The idea is to use a rules-based system that applies both to data at-rest (inside the boundary) and to data in-motion (data flowing outside the boundary).
Before we define a rule, we can summarize briefly the example techniques that can be used to enforce data privacy -
- data minimization
- data aggregation
- encryption
- hashing
- k-anonymization
- redaction
- obfuscation
- ...
All these techniques can be applied to data both at-rest and in-motion, depending on the need of the Data Owner.
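A minimal sketch of three of the techniques listed above (data minimization, obfuscation and data aggregation), using only the Python standard library. The function names, the masking format and the sample field names are assumptions made for illustration, not part of the proposal.

```python
from collections import Counter
from typing import Dict, Iterable, List


def minimize(record: Dict[str, str], allowed: Iterable[str]) -> Dict[str, str]:
    """Data minimization: keep only the fields the requester actually needs."""
    keep = set(allowed)
    return {k: v for k, v in record.items() if k in keep}


def obfuscate(value: str, visible: int = 4) -> str:
    """Obfuscation: mask everything except the last `visible` characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def aggregate(records: List[Dict[str, str]], key: str) -> Dict[str, int]:
    """Data aggregation: share counts per value instead of individual rows."""
    return dict(Counter(r[key] for r in records if key in r))
```

The same functions can run on data at-rest (before it is written to a store) or in-motion (before it leaves the system); only the point at which they are called changes.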
The need mainly comes from 2 sources -
- Privacy compliance requirements (e.g. HIPAA)
- Another system's (Data Requester) privacy-boundary conditions - these can be the same compliance requirements or different ones, depending on what the other system defines.
It's the Data Owner's responsibility to enforce these _needs_ through a set of rules.
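One way such a rule could be codified is as a small record that names the field, the technique, the need it comes from, and whether it applies to data at-rest, in-motion, or both. This is only a sketch; the `Rule` type, its attribute names and the sample rules are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

AT_REST = "at-rest"
IN_MOTION = "in-motion"


@dataclass(frozen=True)
class Rule:
    field_name: str              # field the rule applies to
    technique: str               # e.g. "hashing", "encryption", "redaction"
    need: str                    # e.g. "HIPAA" or "requester-boundary"
    applies_to: FrozenSet[str]   # AT_REST, IN_MOTION, or both


def rules_for(rules: List[Rule], state: str) -> List[Rule]:
    """Select the rules the Data Owner must enforce for data in the given state."""
    return [r for r in rules if state in r.applies_to]


# Illustrative rules only.
EXAMPLE_RULES = [
    Rule("ssn", "encryption", "HIPAA", frozenset({AT_REST})),
    Rule("ssn", "hashing", "HIPAA", frozenset({IN_MOTION})),
    Rule("name", "redaction", "requester-boundary", frozenset({IN_MOTION})),
]
```

Calling `rules_for(EXAMPLE_RULES, IN_MOTION)` would return the hashing and redaction rules, i.e. the ones that matter when data crosses the privacy-boundary.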
- We will look at the compliance requirements in the HIPAA regulations, and more specifically at Protected Health Information (PHI) and how it is protected.
- We will take the specific example of healthcare data and look at 2 systems, where one is a non-analytical storage and the other is an analytical storage, and at how data is transferred between them.
The definition of PHI is quite broad. To keep this discussion simple, we will focus only on a subset of the fields considered PHI as our limited definition: 1. Names, 2. Telephone numbers, 3. Social Security Number (SSN), 4. Date of birth, 5. Address, 6. Gender, 7. Medical conditions.
In our hypothetical world, these are the only fields we need to worry about. Of course, in the real world there are many more fields, but we can generalize the proposal once the initial idea is written down.
In this proposed approach, we will define 3 different proxy-like sub-systems that attach themselves to each system in question -
- Identify Proxy - identifies PHI fields in incoming/outgoing data
- Protect Proxy - protects the identified fields in incoming/outgoing data
- Verify Proxy - verifies compliance of the protected data
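We will look at 2 use-cases: data going out of the Operations-System to the Analytics-System, and data coming into the Operations-System from FHIR EHR records -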
sequenceDiagram
autonumber
participant OS as Operations-System
participant OSIP as Identify-Proxy
participant OSPP as Protect-Proxy
participant HV as HIPAA-Verifier
participant ASVP as Analytics-System-Verifier-Proxy
participant AS as Analytics-System
OS->>OSIP: Outgoing data request
OSIP->>OSIP: Identify Data
OSIP->>OSPP: Protect Data
OSPP->>HV: Verify Compliance
HV->>OSPP: Rules passed
OSPP->>ASVP: Transfer Data
ASVP->>HV: Verify Compliance
HV->>ASVP: Rules passed
ASVP->>AS: Transfer complete
sequenceDiagram
autonumber
participant FHIR as FHIR EHR Records
participant OSIP as Identify-Proxy
participant OSPP as Protect-Proxy
participant OSVP as Verifier-Proxy
participant HV as HIPAA-Verifier
participant OS as Operations-System
FHIR->>OSIP: Incoming data request
OSIP->>OSIP: Identify Data
OSIP->>OSPP: Protect Data
OSPP->>OSVP: Verify
OSVP->>HV: Verify Compliance
HV->>OSPP: Rules passed
OSPP->>OS: Store Data
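The outgoing flow in the first diagram can be sketched as a pipeline of three steps (identify, protect, verify) that refuses to transfer data when verification fails. The callables passed in stand for the Identify-Proxy, Protect-Proxy and HIPAA-Verifier; their signatures and the `ComplianceError` type are assumptions for illustration.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


class ComplianceError(Exception):
    """Raised when protected data fails verification; nothing is transferred."""


def transfer_outgoing(
    record: Record,
    identify: Callable[[Record], List[str]],
    protect: Callable[[Record, List[str]], Record],
    verify: Callable[[Record, List[str]], List[str]],
    send: Callable[[Record], None],
) -> Record:
    fields = identify(record)               # Identify-Proxy: which fields are PHI
    protected = protect(record, fields)     # Protect-Proxy: apply the techniques
    failures = verify(protected, fields)    # HIPAA-Verifier: names of failing fields
    if failures:
        raise ComplianceError(f"fields failed verification: {failures}")
    send(protected)                         # only reached when all rules passed
    return protected
```

The receiving side could run the same `verify` step again before storing the data, which corresponds to the Analytics-System-Verifier-Proxy in the diagram.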
In both these use-cases, we want to try and answer the following questions -
- How will we identify each of the fields in the incoming/outgoing data?
- What does protect mean for incoming vs outgoing data?
- How do we verify that we are compliant?
The first task is to identify which fields need to be worked upon i.e. protected and/or verified.
Data interchange between any two systems can be assumed to use a standard format, either a text-based format such as JSON or a binary format such as protobuf.
Assuming it is JSON for simplicity, we can then set a sequence of rules or matchers in the Identify-Proxy that will help us see which parts of the JSON we should be looking at.
An example of this is matching the Social Security Number field either by parameter name, such as SSN, or by a regular expression that matches any string containing digits in the format NNN-NN-NNNN. This can lead to false positives as well, but that can be a topic for continued improvement.
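A minimal sketch of such a matcher, assuming the payload has already been parsed from JSON into Python dicts and lists. The key names in `SSN_KEYS` and the dotted path format of the result are assumptions; a real Identify-Proxy would carry one matcher per PHI field.

```python
import json
import re
from typing import Any, List

SSN_KEYS = {"ssn", "social_security_number"}        # assumed parameter names
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # NNN-NN-NNNN


def identify_ssn_fields(node: Any, path: str = "") -> List[str]:
    """Return paths of values that look like SSNs, by key name or by value pattern."""
    found: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key.lower() in SSN_KEYS and not isinstance(value, (dict, list)):
                found.append(child)                  # matched on parameter name
            else:
                found.extend(identify_ssn_fields(value, child))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            found.extend(identify_ssn_fields(value, f"{path}[{i}]"))
    elif isinstance(node, str) and SSN_PATTERN.search(node):
        found.append(path)                           # matched on value format
    return found


payload = json.loads(
    '{"patient": {"name": "Jane", "SSN": "123-45-6789",'
    ' "notes": ["call 555-0100", "ref 987-65-4321"]}}'
)
print(identify_ssn_fields(payload))
```

Note that the second match, inside `notes`, is found purely on format: any string shaped like NNN-NN-NNNN is flagged whether or not it is really an SSN, which is the false-positive risk mentioned above.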
In this proposal, we will create a system called the HIPAA-Verifier, a centralized system containing rules that specify the following two points for any outgoing flow from a system -
a) how a field should be protected - using what method/procedure, and what other inputs it requires
b) how a field should be verified
For example, if we have an EHR system serving patient records, this centralized system would hold the rules/logic covering both a) and b) for all outgoing data from that system. This assumes that the data identification part is already done at the system level.
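As a rough sketch, the central rules could be a registry keyed by source system and field, where each entry carries the protect method with its inputs (point a) and a verify check (point b). The `FieldRule` type, the `"ehr"` system name and the specific checks are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

HEX_64 = re.compile(r"[0-9a-f]{64}")  # shape of a SHA-256 hex digest


@dataclass(frozen=True)
class FieldRule:
    protect_method: str                    # a) which method/procedure to use
    verify: Callable[[str], bool]          # b) check run on the protected value
    protect_inputs: Dict[str, str] = field(default_factory=dict)  # a) other inputs


# Hypothetical registry keyed by (system, field); all names are illustrative.
RULES: Dict[Tuple[str, str], FieldRule] = {
    ("ehr", "ssn"): FieldRule(
        "hash",
        lambda v: HEX_64.fullmatch(v) is not None,
        {"algorithm": "sha256"},
    ),
    ("ehr", "telephone"): FieldRule("redact", lambda v: v == "[REDACTED]"),
}


def lookup_rule(system: str, field_name: str) -> FieldRule:
    """Return the rule for an outgoing field, failing loudly when none is defined."""
    try:
        return RULES[(system, field_name)]
    except KeyError:
        raise LookupError(f"no rule for {field_name!r} leaving {system!r}") from None
```

Failing on a missing rule, rather than letting the field through unprotected, is a design choice made here so that an identified PHI field without a rule blocks the transfer.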
The second task is to protect the identified field. Here, we can use the previously centrally defined functions or procedures and protect the particular fields.
This can be, for example, hashing the field's value using a specific algorithm, encrypting the value using a specified key, or obfuscating the value in a predefined way. The mapping of which technique to use for which field is defined in the central HIPAA-Verifier.
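The protect step can be sketched as a dispatch from technique name to function, driven by a field-to-technique mapping. Here the mapping is a hard-coded stand-in for what would be fetched from the HIPAA-Verifier, and the key is a demo value. A keyed hash (HMAC-SHA256) is used instead of a plain hash because low-entropy values such as SSNs can be recovered from an unkeyed hash by brute force.

```python
import hashlib
import hmac
from typing import Callable, Dict, Tuple

Inputs = Dict[str, str]


def hash_field(value: str, inputs: Inputs) -> str:
    """Keyed hash of the value; the key comes from the rule's inputs."""
    key = inputs["key"].encode("utf-8")
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_field(value: str, inputs: Inputs) -> str:
    """Replace the value with a fixed marker."""
    return inputs.get("marker", "[REDACTED]")


TECHNIQUES: Dict[str, Callable[[str, Inputs], str]] = {
    "hash": hash_field,
    "redact": redact_field,
}

# Hypothetical mapping, standing in for rules held by the central HIPAA-Verifier.
FIELD_RULES: Dict[str, Tuple[str, Inputs]] = {
    "ssn": ("hash", {"key": "demo-key-not-for-production"}),
    "telephone": ("redact", {}),
}


def protect(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with every mapped field protected."""
    protected = dict(record)
    for name, (technique, inputs) in FIELD_RULES.items():
        if name in protected:
            protected[name] = TECHNIQUES[technique](protected[name], inputs)
    return protected
```

Encryption would slot into `TECHNIQUES` the same way; it is left out of this sketch because it needs a third-party or platform cryptography library.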
Lastly, we need to verify that the protection we have applied is sufficient to comply with a regulation/criterion.
For every field that is protected, a corresponding verify rule is also defined in the centralized HIPAA-Verifier.
This works for both outgoing and incoming data, although the meaning of protect can differ by direction, depending on which field we are looking at. For example, an email might be obfuscated when going out and encrypted when coming in.
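A sketch of direction-aware verify rules for the email example. The masked format (`****@domain`) and the `enc:` prefix marking an encrypted value are assumptions chosen for illustration; the real checks would depend on the protect methods actually used.

```python
import re
from typing import Callable, Dict, List, Tuple

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")   # rough shape of a raw email
MASKED_EMAIL = re.compile(r"\*+@[^@\s]+")         # local part fully masked

# Verify rules keyed by (direction, field); one entry per protected field.
VERIFY_RULES: Dict[Tuple[str, str], Callable[[str], bool]] = {
    ("outgoing", "email"): lambda v: MASKED_EMAIL.fullmatch(v) is not None,
    ("incoming", "email"): lambda v: v.startswith("enc:") and EMAIL.search(v) is None,
}


def verify(direction: str, record: Dict[str, str]) -> List[str]:
    """Return the names of fields that fail their verify rule for this direction."""
    failures = []
    for name, value in record.items():
        rule = VERIFY_RULES.get((direction, name))
        if rule is not None and not rule(value):
            failures.append(name)
    return failures
```

An empty list means every protected field passed. Fields without a rule are ignored here, on the assumption that the identify step has already decided they are not PHI.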