FrailWords/PROPOSAL.md

## PROPOSAL.md

      
            
          
    Structure of this proposal

This proposal is divided into three parts:

Introduction to the idea in an abstract way
Specific focus of the proposal
Proposed approach

Introduction

Data transfer is always a communication between two entities or systems: the system that owns the data and the system that requests it.

  
      sequenceDiagram
    autonumber
    participant DR as Data Requester
    participant DO as Data Owner
    DR->>DO: Request Data
    DO->>DR: Data Response

    

  
Data at rest vs in-motion

In the above interaction, the data is at any point either at rest in a store or in motion, i.e. in transfer from one system to another:

  
      sequenceDiagram
    autonumber
    participant DRD as Data Requester Store
    participant DR as Data Requester
    participant DO as Data Owner
    participant DOD as Data Owner Store
    DR->>DO: Request Data
    DO->>DOD: Fetch Data
    DO->>DR: Data Response
    DR->>DRD: Store Data

    

  
Data Privacy Considerations

Privacy considerations apply both when the data is at rest in a database or store and when it is in motion, i.e. being transferred to a different system. In the upcoming sections, we will try to characterize these systems and show how the "privacy considerations" of these interactions can be codified.
Internal vs external systems

Continuing with the Data Requester from the example above, this system can be an internal system or an external system.
We define an internal system as one that lies within a specific privacy-boundary; an external system lies outside it.
Defining a Privacy Boundary

A privacy-boundary helps us define which privacy considerations apply with respect to any system that is outside the boundary. How do we define such a privacy-boundary?
The idea is to use a rules-based system that applies both to data at rest (inside the boundary) and to data in motion (data flowing outside the boundary).
How do we define these rules?

Before we define a rule, here is a brief summary of example techniques that can be used to enforce data privacy:

data minimization
data aggregation
encryption
hashing
k-anonymization
redaction
obfuscation
...

All these techniques can be applied to data both at-rest and in-motion, depending on the need of the Data Owner.
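To make a few of the listed techniques concrete, here is a minimal Python sketch of hashing, redaction, obfuscation and data minimization. The function names, the salt handling and the masking format are assumptions made for illustration; they are not part of the proposal.

```python
import hashlib


def hash_value(value: str, salt: str) -> str:
    """Hashing: one-way SHA-256 digest of the salted value (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def redact(value: str) -> str:
    """Redaction: replace the value with a fixed marker."""
    return "[REDACTED]"


def obfuscate(value: str, visible: int = 4) -> str:
    """Obfuscation: mask every character except the last `visible` ones."""
    if visible <= 0:
        return "*" * len(value)
    return "*" * max(len(value) - visible, 0) + value[-visible:]


def minimize(record: dict, allowed: set) -> dict:
    """Data minimization: keep only the fields the requester actually needs."""
    return {key: val for key, val in record.items() if key in allowed}
```

The same functions could be called before writing a record to a store (at rest) or before sending it to another system (in motion); which one applies to which field is what the rules are meant to decide.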
Who defines which techniques are to be applied?

The need mainly comes from two sources:

Privacy compliance requirements (e.g. HIPAA)
Another system's (the Data Requester's) privacy-boundary conditions. These can reflect the same or different compliance requirements, depending on what the other system defines.

It's the Data Owner's responsibility to enforce these _needs_ through a set of rules.
Focus of proposal


We will be looking at compliance requirements as set out in the HIPAA regulations, and more specifically at Protected Health Information (PHI) and how it is protected.


We will take the specific example of healthcare data and look at two systems, one a non-analytical store and the other an analytical store, and at how data is transferred between them.


Hypothetical HIPAA PHI Definition - as a subset of the actual PHI fields

The definition of PHI is quite broad. To keep this discussion simple, we will focus on only a subset of the fields considered PHI as our limited definition: 1. Names, 2. Telephone numbers, 3. Social Security number (SSN), 4. Date of birth, 5. Address, 6. Gender, 7. Medical conditions.
In our hypothetical world, these are the only fields we need to worry about. Of course, in the real world there are many more fields, but we can generalize the proposal once the initial idea is written down.
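The limited PHI definition can be written down as a small enumeration that the later steps can refer to. This is only a sketch: the class name `PhiField`, the snake_case key names and the helper `phi_keys` are assumptions for illustration, not a schema defined by this proposal.

```python
from enum import Enum


class PhiField(Enum):
    """The seven fields of the hypothetical, limited PHI definition."""
    NAME = "name"
    TELEPHONE = "telephone"
    SSN = "ssn"
    DATE_OF_BIRTH = "date_of_birth"
    ADDRESS = "address"
    GENDER = "gender"
    MEDICAL_CONDITIONS = "medical_conditions"


def phi_keys(record: dict) -> set:
    """Return the keys of a flat record whose names match a PHI field."""
    names = {f.value for f in PhiField}
    return {key for key in record if key.lower() in names}
```

Generalizing to the real PHI definition would then mostly mean extending the enumeration.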
Proposed approach - Identify, Protect and Verify

In this proposed approach, we will define three proxy-like sub-systems that attach themselves to each system in question:

Identify Proxy - identifies PHI fields in incoming/outgoing data
Protect Proxy - protects the identified fields in incoming/outgoing data
Verify Proxy - verifies compliance of the protected data

Use-Cases

We will look at two use-cases:
Use-case 1 - Outgoing data transfer from an operations system to an analytics system


      sequenceDiagram
    autonumber
    participant OS as Operations-System
    participant OSIP as Identify-Proxy
    participant OSPP as Protect-Proxy
    participant HV as HIPAA-Verifier
    participant ASVP as Analytics-System-Verifier-Proxy
    participant AS as Analytics-System
    OS->>OSIP: Outgoing data request
    OSIP->>OSIP: Identify Data
    OSIP->>OSPP: Protect Data
    OSPP->>HV: Verify Compliance
    HV->>OSPP: Rules passed
    OSPP->>ASVP: Transfer Data
    ASVP->>HV: Verify Compliance
    HV->>ASVP: Rules passed
    ASVP->>AS: Transfer complete

    

  
Use-case 2 - Incoming data from an EHR source into the operations system


      sequenceDiagram
    autonumber
    participant FHIR as FHIR EHR Records
    participant OSIP as Identify-Proxy
    participant OSPP as Protect-Proxy
    participant OSVP as Verifier-Proxy
    participant HV as HIPAA-Verifier
    participant OS as Operations-System
    FHIR->>OSIP: Incoming data request
    OSIP->>OSIP: Identify Data
    OSIP->>OSPP: Protect Data
    OSPP->>OSVP: Verify 
    OSVP->>HV: Verify Compliance
    HV->>OSPP: Rules passed
    OSPP->>OS: Store Data

    

  
In both of these use-cases, we want to answer the following questions:


How will we identify each of the fields in the incoming/outgoing data?


What does protect mean for incoming vs outgoing data?


How do we verify that we are compliant?


Identify

The first task is to identify which fields need to be worked upon, i.e. protected and/or verified.
Data interchange between any two systems can be assumed to use a standard format, either a text-based format like JSON or a binary format like protobuf.
Assuming JSON for simplicity, we can then configure a sequence of rules or matchers in the Identify-Proxy that tell us which parts of the JSON we should be looking at.
An example is matching the Social Security Number field either by parameter name, like SSN, or by a regular expression that matches any string containing digits in the format NNN-NN-NNNN. Pattern matching can produce false positives, for example an unrelated identifier that happens to have the same shape; reducing them can be a topic for continued improvement.
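The two SSN matchers described above, by parameter name and by pattern, could look roughly like the following sketch. The key names in `SSN_KEY_NAMES`, the path notation and the sample payload are assumptions made for illustration.

```python
import json
import re

# Hypothetical parameter names that are treated as SSN fields.
SSN_KEY_NAMES = {"ssn", "social_security_number"}
# Any string containing digits in the format NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def identify_ssn(node, path=""):
    """Walk parsed JSON and yield (path, reason) for each suspected SSN."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key.lower() in SSN_KEY_NAMES:
                yield child, "name"
            else:
                yield from identify_ssn(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from identify_ssn(item, f"{path}[{index}]")
    elif isinstance(node, str) and SSN_PATTERN.search(node):
        yield path, "pattern"


payload = json.loads(
    '{"patient": {"SSN": "123-45-6789", "notes": "id 987-65-4321 on file"},'
    ' "order_no": "555-12-3456"}'
)
matches = list(identify_ssn(payload))
# matches == [("patient.SSN", "name"), ("patient.notes", "pattern"),
#             ("order_no", "pattern")]
```

The `order_no` match is a false positive of the kind mentioned above: the value only happens to have the NNN-NN-NNNN shape.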
Centralized HIPAA-Verifier

In this proposal, we will create a centralized system called the HIPAA-Verifier. It contains rules that specify the following two points for any outgoing flow from a system:
a) how a field should be protected: which method/procedure to use and what other inputs it requires
b) how a field should be verified
For example, if we have an EHR system serving patient records, this centralized system would hold the rules/logic covering both a) and b) for all outgoing data from that system. This assumes that data identification has already been done at the system level.
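One possible shape for such a central rule store is a lookup from (system, field) to a rule that carries both a) and b). The sketch below is an assumption about how this could be modelled; the names `FieldRule`, `RULES`, `lookup_rule`, the system identifier `ehr-system` and the method and check names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FieldRule:
    protect_method: str                                  # a) which technique to apply
    protect_inputs: dict = field(default_factory=dict)   # a) other inputs it needs
    verify_check: str = ""                               # b) how the result is verified


# Hypothetical rules for outgoing data from an EHR system.
RULES = {
    ("ehr-system", "ssn"): FieldRule("hash", {"algorithm": "sha256"}, "is_hex_digest"),
    ("ehr-system", "name"): FieldRule("redact", {}, "is_redacted"),
}


def lookup_rule(system: str, field_name: str) -> FieldRule:
    """Return the rule for a field leaving a system, or fail if none exists."""
    try:
        return RULES[(system, field_name)]
    except KeyError:
        raise LookupError(
            f"no rule for {field_name!r} leaving {system!r}"
        ) from None
```

Raising an error when no rule exists is a design choice in this sketch: an identified field without a rule is treated as a failure rather than being passed through unprotected.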
Protect

The second task is to protect the identified field. Here, we can use the functions or procedures defined centrally, as described above, to protect the particular fields.
This can mean, for example, hashing the field's value using a specific algorithm, encrypting the value using a specified key, or obfuscating the value in a predefined way. The mapping of which technique to use for which field is defined in the central HIPAA-Verifier.
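A minimal sketch of the protect step, assuming a flat record and a field-to-technique mapping that has already been fetched from the central HIPAA-Verifier. The mapping `FIELD_TO_TECHNIQUE`, the technique names and the masking format are hypothetical, and encryption is left out because it would need a key and a cryptography library.

```python
import hashlib


def _hash(value: str) -> str:
    """Hash the value with SHA-256."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def _mask(value: str) -> str:
    """Obfuscate the value by masking all but the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


TECHNIQUES = {"hash": _hash, "mask": _mask}

# Stand-in for the mapping that the central HIPAA-Verifier would provide.
FIELD_TO_TECHNIQUE = {"ssn": "hash", "telephone": "mask"}


def protect(record: dict, identified_fields: list) -> dict:
    """Return a copy of the record with each identified field protected."""
    protected = dict(record)
    for name in identified_fields:
        technique = TECHNIQUES[FIELD_TO_TECHNIQUE[name]]
        protected[name] = technique(str(record[name]))
    return protected


out = protect(
    {"ssn": "123-45-6789", "telephone": "555-0100", "visit": 3},
    ["ssn", "telephone"],
)
# out["telephone"] == "****0100"; out["visit"] is left untouched.
```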
Verify

Lastly, we need to verify that the protection we have applied is sufficient to comply with a regulation/criterion.
For every field that is protected, a corresponding verify rule is also defined in the centralized HIPAA-Verifier.
This works for both outgoing and incoming data. The meaning of protect can differ between the two directions, depending on which field we are looking at; for example, an email might be obfuscated when going out and encrypted when coming in.
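Since the verify rule depends on both the field and the direction, one way to sketch it is a table of predicates keyed by (direction, field). The rule table, the patterns and the function name `verify` below are assumptions for illustration; real rules would come from the central HIPAA-Verifier.

```python
import re

RAW_SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")   # an unprotected SSN
HEX_DIGEST = re.compile(r"^[0-9a-f]{64}$")     # looks like a SHA-256 digest
MASKED = re.compile(r"^\*+.{0,4}$")            # stars, then at most 4 characters

# Hypothetical verify rules, keyed by (direction, field name).
VERIFY_RULES = {
    ("outgoing", "ssn"): lambda v: bool(HEX_DIGEST.match(v)),
    ("outgoing", "telephone"): lambda v: bool(MASKED.match(v)),
    ("incoming", "ssn"): lambda v: not RAW_SSN.match(v),
}


def verify(record: dict, identified_fields: list, direction: str) -> list:
    """Return the fields that fail verification; an empty list means 'Rules passed'."""
    failures = []
    for name in identified_fields:
        rule = VERIFY_RULES.get((direction, name))
        # A field without a rule counts as a failure in this sketch.
        if rule is None or not rule(str(record[name])):
            failures.append(name)
    return failures
```

For example, `verify({"ssn": "123-45-6789"}, ["ssn"], "outgoing")` returns `["ssn"]`, because a raw SSN does not look like a digest.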