@mekline
Last active July 19, 2018 01:38
How I screwed up participant privacy and what I did about it

The problem

A while ago, I cottoned on to the (pretty obvious, in retrospect) fact that even if you are not sharing a piece of information in plain text, if it can be derived from your dataset, it counts (ethically, and probably according to your IRB/equivalent body of choice) as sharing that info. I thought I had successfully handled this wrinkle in my data sharing, but I found out I hadn't.

I collect three types of personally identifiable information (PII) in my research:

  • Video of the children who participate in my studies, primarily so I can check that paradigms are being implemented correctly/consistently, and to hand-check eyetracking data. I'm not the head of a lab so I don't share video (following my labs' practices), but if I were, I'd use Databrary to provide access to researchers who've been vetted by their institutions.

  • First names, because calling your participants "Hey You" is rude, so the name winds up on the video, and recording the name they're referred to in the video in tabular form has rescued me from data entry mistakes/lost videos on a camcorder at the end of a semester in a few key cases.

  • Birth dates, because the age of participants is usually the key moderating variable of interest in developmental psych!

In grad school, I settled on a convention of splitting participant-level data into two files for sharing (first on GitHub, later on GitHub + OSF) - an open CSV of de-identified data, and a password-protected Excel spreadsheet that linked the anonymous/unique ID number to first names and birthdays. This met my IRB's requirements for storing data in the cloud, while also allowing my RAs to access and enter data as they collected it (I often have RAs collecting data offsite, or without access to a shared/secure computer).

Recently, I learned of two problems with my approach. First, it turns out there's an exploit for removing password protection from Excel files. Yikes. Second, while I'd figured out the (TestDate - DaysOld = Birthday) equation, and begun randomly jittering age by a few days in all my current work, I neglected to catch some datafiles from an older project that listed un-jittered age in days, and listed the testdate in various places. Yikes. For extra fun, I discovered the latter 'live' while presenting at a training workshop about appropriately sharing participant data. YIKES.
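To make the second problem concrete, here is a minimal Python sketch of the arithmetic. The function name `recover_birthday` and the example row are invented for illustration; they are not taken from any real dataset.

```python
from datetime import date, timedelta


def recover_birthday(test_date: date, age_in_days: int) -> date:
    """Reconstruct a birth date from two columns that each look harmless alone."""
    return test_date - timedelta(days=age_in_days)


# Hypothetical row: a child tested on 2018-03-15 at exactly 1000 days old.
birthday = recover_birthday(date(2018, 3, 15), 1000)
print(birthday.isoformat())
```

Neither column is a birthday, but anyone holding both can compute one in a single subtraction, which is why the un-jittered age and the test date cannot both be shared.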

This is a relatively low-impact mistake as these things go, because (1) the actual risk of re-identification is low, since the birthdays were not shared in a machine-readable format and my city is large enough that many children are born each day, and (2) the impact of re-identification would be low, since my data are not sensitive (if they involved participants over the age of 18, they would be eligible for expedited review). But I was pretty disappointed in myself for messing this up and embarrassed at finding out in a public setting, and I wanted to provide a roadmap for myself going forward.

The solution

(This is what I'm doing right now, but I'd love to hear if I'm missing something, or just if you have another solution!)

Born-open data sharing is generally a great way to plan ahead for sharing your finished datasets in a usable format, so I was hopeful I could find a solution that would still allow this. After looking into the rules and options for people who need to meet HIPAA compliance (probably overkill for my case, but it turned out to be mostly workable so I went with it), I determined that with some configuration fiddling, Google Docs with GSuite can do this. This left two steps:

Scrub the 'contaminated' files

Thanks to GitHub, which does a great job preserving all the old versions of my work, removing these files was not as simple as just removing them from the current version - our hypothetical hacker could go back to previous versions of the repository to find the information. It is, however, possible to remove a sensitive file from the entire history of a repository, thanks to a tool called the BFG Repo-Cleaner. The instructions on that site worked perfectly for me, with the exception that I had to start with

$ git clone --mirror https://example.com/some-big-repo.git

rather than

$ git clone --mirror git://example.com/some-big-repo.git

...otherwise my permissions to git push back (via my https default) to the repository after cleaning got fouled up. After removing both the 'contaminated' CSV and the hack-able XLSX docs and running BFG, I could re-calculate the ageInDays column to add a few days of random noise and add the new file back to the repo.
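The re-calculation step could look something like the Python sketch below. This is an illustration rather than the actual procedure: the function name `jitter_age`, the plus-or-minus 3 day window, and the sample ages are all assumptions standing in for "a few days of random noise".

```python
import random


def jitter_age(age_in_days: int, max_shift: int = 3, rng=random) -> int:
    """Shift an age by a random whole number of days in [-max_shift, max_shift]."""
    return age_in_days + rng.randint(-max_shift, max_shift)


# Left unseeded on purpose: a seed published alongside the data
# would let anyone regenerate the shifts and subtract them back out.
rng = random.Random()

true_ages = [1000, 812, 655]  # hypothetical ageInDays values
shared_ages = [jitter_age(age, rng=rng) for age in true_ages]
print(shared_ages)
```

The jittered value should be generated once per participant and then stored. Re-drawing the noise every time the file is regenerated would let someone average several released versions and get close to the true age.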

Make a better system

In addition to educating myself and my RAs about this issue, I needed somewhere other than the open, intended-for-sharing GitHub repositories I use for everything else to store the few bits of PII I collect as I go. Because the date of data collection winds up all over my datasets (in filenames, in 'date created' metadata, etc.), I think the jittered ages are still the way to go, and provided that I am collecting sufficiently large samples, a few days of randomly-distributed noise shouldn't affect any meaningful properties of my dataset. For one thing, consider that scheduled procedures and long labors throw all kinds of noise into a child's recorded birth date - if you really care about age down to the day, you have a different problem than I do and probably need to strip all info about date-of-collection out of a dataset for sharing.
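A quick simulation gives a feel for how little a few days of noise moves sample-level statistics. This is a sketch under made-up assumptions: 200 hypothetical participants with ages spread uniformly between 900 and 1500 days, and jitter of up to 3 days either way.

```python
import random
import statistics

rng = random.Random(0)  # seeded only so this demo is repeatable

# Hypothetical sample, not real participant data.
true_ages = [rng.randint(900, 1500) for _ in range(200)]
jittered = [age + rng.randint(-3, 3) for age in true_ages]

mean_shift = statistics.mean(jittered) - statistics.mean(true_ages)
sd_ratio = statistics.stdev(jittered) / statistics.stdev(true_ages)

print(f"shift in mean age: {mean_shift:+.2f} days")
print(f"ratio of standard deviations: {sd_ratio:.3f}")
```

Under these assumptions the mean moves by a fraction of a day and the spread is essentially unchanged. A much smaller sample, or an analysis that bins children by exact day, would be more sensitive to the noise.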

So, I made a tool, SaltyDates, for calculating the values we record in the CSV (we were previously doing this by hand with some Excel code), and in place of the ride-along Excel doc, I'm using a locked Google Form to collect & save the exact dates, first names, and subjectID keys.

With a GSuite account, I can create forms that are accessible only to people with accounts at my domain, and critically, I can turn off the functions of Google Docs/Sheets/Forms etc. that allow sharing outside the 'organization' (a.k.a. me and anyone I make an account for). This is the recommended configuration for Health Organizations that need to store data with HIPAA compliance; the only difference is that because I am not personally a Health Organization, I didn't sign a BAA. One thing that's particularly nice about this setup is that I can create a link to this form which only works for someone who logs into an account at my GSuite domain, and safely list that link in other project documentation.

This costs me $10 a month for myself and an RA to have accounts @myname.net. Here's an example of a current project with my new setup. Is this overkill? Maybe, but I'm happy to have a solution that does a good job drawing a very bright line between what I can share and what I can't.

As a final note, I've had more than a few conversations with people who have serious reservations about sharing data online, but who have also emailed or Dropbox-ed me datafiles with PII, or store them on un-firewalled, networked computers. Just because you feel like you're not sharing your data doesn't mean you're actually protecting it: laptops can be stolen, and email accounts can be accessed by people other than you. This is one of those issues that the Open Science Movement (TM) throws into relief even for people who aren't currently planning on sharing datasets: many of our norms around how we care for participant data date back to videocassettes and paper in filing cabinets, and even if your IRB isn't saying anything about digital security yet, we all have a responsibility to try and get this right.
