An idea for temporary data anonymisation and analysis validation.

Dataset randomisation for more secure and robust analysis.

Handling sensitive datasets securely whilst simultaneously sharing them among teams and services is a tension that many researchers struggle with. What's more, developing statistical analysis code on the real dataset can bias researchers: they may stumble upon associations that are due to chance and were not part of the initial hypothesis, and these spurious associations may find their way into the eventual publication. The idea below is intended to help with both problems.

Simple English description:

  • We take a dataset, strip out unnecessary identifying information, and shuffle each of the remaining columns independently so there are no associations along the rows. We use that as a dummy dataset while we develop the analysis for the research project. Then, when we are happy with our analysis, we run it once and only once on the real, unscrambled dataset. Alternatively (bullet point 4 below), before we run the final analysis we repeatedly re-scramble the dataset and run the analysis on each scrambled copy. The results of these repeated analyses should approximate the null distribution of the statistical test being used; if they do not, some bias has probably been introduced somewhere in the analysis. A minimal sketch of the scrambling step is given below.
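
A minimal sketch of the scrambling step, assuming the data lives in a pandas DataFrame; the function name `scramble` and the list of identifying columns are illustrative, not part of the original description:

```python
import numpy as np

def scramble(df, identifying_cols, seed=None):
    """Return a copy of a pandas DataFrame with identifying columns dropped
    and every remaining column shuffled independently of the others."""
    rng = np.random.default_rng(seed)
    out = df.drop(columns=identifying_cols).copy()
    for col in out.columns:
        # Shuffle values within each column on its own: per-column statistics
        # are preserved, but rows no longer correspond to real subjects.
        out[col] = rng.permutation(out[col].to_numpy())
    return out
```

For example, `scramble(D, ["name", "address", "dob"])` would produce an f(D) on which the analysis code can be developed; the column names here are purely illustrative.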

Technical Description:

  1. Take some sensitive dataset D with x rows and y columns, where each row is a subject whose identity and information (the y columns) we want to protect.

  2. Perform a function f() on D where f(D) does the following:

    • Any identifying and/or irrelevant columns yi (name, address, DOB etc.) are removed.
    • All remaining columns yr (except the index column y0) are each re-ordered randomly (shuffled), independently of one another, so that no row-wise associations remain.
  3. We then develop our analysis A() around f(D). This will give A(f(D)) the following properties:

    • The data will be properly anonymised and unattributable should it fall into the wrong hands.
    • Any per-column descriptive statistic (mean, variance, range etc.) of f(D) will equal that of D, since shuffling does not change the values within a column.
    • Any test of association between columns should be significant only with probability alpha (the chosen significance level), because the shuffling destroys any genuine associations.
  4. When the analysis code A() is finalised, we perform A(fn(D)) n times, where n is a large number and each fn is a fresh, independent shuffle (a sketch of this validation loop is given after this list).

    • We record the test statistic (or p-value) from each of the n trials and confirm that its distribution approximates the test's null distribution (for p-values, approximately uniform on [0, 1]).
    • If it does not, this suggests there is some mistake in A() that is likely to bias the results (although this check would not detect confounders etc.).
  5. We perform the analysis A(D) once, on the real unscrambled dataset, and publish the results.
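
A minimal sketch of the validation loop in steps 4 and 5, assuming the analysis A() returns a p-value and reusing the `scramble` function sketched above; checking the collected p-values against a uniform distribution with `scipy.stats.kstest` is one possible check, not the only one:

```python
from scipy import stats

def validate_analysis(analysis, df, identifying_cols, n=1000):
    """Run the analysis on n independently scrambled copies of the data and
    test whether the resulting p-values look uniform on [0, 1], as expected
    when the scrambling has destroyed all genuine associations."""
    p_values = [analysis(scramble(df, identifying_cols, seed=i)) for i in range(n)]
    # Kolmogorov-Smirnov test of the collected p-values against Uniform(0, 1).
    ks_stat, ks_p = stats.kstest(p_values, "uniform")
    return ks_stat, ks_p

# Illustrative workflow (names are hypothetical):
# ks_stat, ks_p = validate_analysis(my_analysis, D, ["name", "address", "dob"])
# if ks_p > 0.05:                      # no evidence the pipeline is biased
#     final_result = my_analysis(D)    # run once, and only once, on the real data
```

A small ks_p would indicate that the p-values from the scrambled runs are not uniform, suggesting a bug or bias in the analysis pipeline that should be fixed before the single run on the real data.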
