@zqian
Forked from sloanlance/Depersonalize JSON.md
Last active August 15, 2020 22:32
jq: Depersonalize JSON – http://bit.ly/Depersonalize_JSON

Depersonalize JSON

Why

If JSON data is to be shared for analysis, it's often necessary to depersonalize the data first. Depersonalization is different from anonymization. If data is anonymized, all information about the user is removed. Depersonalization, however, replaces the user information with values that can't be recognized as belonging to the real user. Because the same user always gets the same replacement value, it's still possible to see relationships within the data.

How

To use the depersonalize.sh shell script, either give the name of the input file to be depersonalized as an argument:

depersonalize.sh events.jsonl

Or pipe the input into the script through stdin, which allows it to be used as part of more complex pipelines:

grep tarfu events.jsonl | depersonalize.sh
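Both invocation styles rely on the `"${1:--}"` expansion that the script passes to `cat`. A minimal sketch of that idiom follows; the `read_input` function name and the sample data are hypothetical and only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the gist): read from the file named in $1,
# or from stdin when no argument is given.
read_input() {
  # "${1:--}" expands to "-" when $1 is unset or empty; cat treats "-" as stdin.
  cat "${1:--}"
}

# Demo: once with a throwaway file, once with a pipe.
tmp=$(mktemp)
printf '%s\n' '{"actor":{"name":"tarfu"}}' > "$tmp"
from_file=$(read_input "$tmp")
from_stdin=$(printf '%s\n' '{"actor":{"name":"snafu"}}' | read_input)
rm -f "$tmp"

printf '%s\n%s\n' "$from_file" "$from_stdin"
```

Because the fallback is handled by the expansion itself, the rest of a pipeline does not need to know where the input came from.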

The script uses jq to extract the .actor.name property from each object found in the JSONL input. It passes those names through the md5 program to produce a hash for each name. Then jq is used again to format the list of names and hashes to JSON of the form:

{
  "name1": "name1_md5_hash",
  "name2": "name2_md5_hash",
  "nameN": "nameN_md5_hash"
}

That JSON of names and hashes is written to a temporary file for use in the next step.

Note: These first two steps wouldn't need to be separate calls to jq if the program had built-in support for MD5 or allowed calls to external programs.
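The hashing step can be sketched on its own. The script calls `md5 -qs`, which is the BSD/macOS form of the command; the sketch below assumes a fallback to `md5sum` or `openssl` may be wanted on other systems. The `md5_of` and `names_to_json` function names are hypothetical. Like the script, it does not escape names, so a name containing a double quote or backslash would produce invalid JSON.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the MD5 hex digest of the string given as $1.
md5_of() {
  if command -v md5 >/dev/null 2>&1; then
    md5 -qs "$1"                                        # BSD/macOS, as in the script
  elif command -v md5sum >/dev/null 2>&1; then
    printf '%s' "$1" | md5sum | cut -d' ' -f1           # GNU coreutils
  else
    printf '%s' "$1" | openssl md5 | awk '{print $NF}'  # assumes openssl is installed
  fi
}

# Hypothetical helper: turn a list of names on stdin (one per line)
# into a compact JSON object of the form {"name":"md5_hash",...}.
names_to_json() {
  local sep='' name
  printf '{'
  while IFS= read -r name; do
    printf '%s"%s":"%s"' "$sep" "$name" "$(md5_of "$name")"
    sep=','
  done
  printf '}\n'
}

# Duplicate names are collapsed by sort -u before hashing.
mapping=$(printf 'tarfu\nsnafu\ntarfu\n' | sort -u | names_to_json)
echo "$mapping"
```

Hashing the name with `printf '%s'` (no trailing newline) matters: `echo name | md5sum` would hash the newline too and give a different digest than `md5 -qs name`.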

Finally, jq is called again, with the JSON of names and hashes assigned to a variable. A short program replaces each .actor.name value with its hash from the variable. Additionally, the example data uses the users' names as part of the .actor.@id property, so a substitution replaces that part of the value.
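The replacement step can be tried in isolation. The sketch below passes the mapping with `--argjson`, an assumption made to keep the example self-contained; the script itself loads the mapping from a file with `--slurpfile` and reads the program from depersonalize.jq. The sample objects are shortened versions of the example data.

```shell
#!/usr/bin/env bash
# Mapping of names to hashes, as produced by the earlier steps.
newNames='{"tarfu":"e2b87e98602ac8fc95f49fbc3f5c7b1d"}'

# Read JSONL on stdin and emit each actor object with its name replaced.
depersonalize() {
  jq --compact-output --argjson newNames "$newNames" '
    .actor
    | .name as $name                             # old name, used in the regex
    | ($newNames[.name] // .name) as $newName    # fall back to the old name
    | .name = $newName
    | .["@id"] |= sub(":\($name)$"; ":\($newName)")
  '
}

if command -v jq >/dev/null 2>&1; then
  known=$(printf '%s\n' '{"actor":{"@id":"https://example.edu/#profile:tarfu","name":"tarfu"}}' | depersonalize)
  unknown=$(printf '%s\n' '{"actor":{"@id":"https://example.edu/#profile:nobody","name":"nobody"}}' | depersonalize)
  printf '%s\n%s\n' "$known" "$unknown"
else
  echo 'jq not found; skipping demo' >&2
fi
```

The `//` alternative operator is what leaves a name untouched when the mapping has no entry for it, as with "nobody" here.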

The program sends its output to stdout, so be sure to redirect it to a file if you want to keep it.

What

An example of an object from the JSONL input:

{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}

Although it won't be seen when the program executes, the temporary JSON file holding the name and its hash from that JSONL would look like:

{
  "tarfu": "e2b87e98602ac8fc95f49fbc3f5c7b1d"
}

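The final output of the program is shown below. Because the jq program starts with .actor, only the actor object is emitted, without the outer wrapper: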

{"@id":"https://example.edu/#profile:e2b87e98602ac8fc95f49fbc3f5c7b1d","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"e2b87e98602ac8fc95f49fbc3f5c7b1d"}

# depersonalize.jq
def depersonalize(newNames):
    # newNames should contain a single, valid JSON object with properties:
    #   {"name1":"newName1","name2":"newName2",...}
    # Replace specific property values. I.e., .actor.name and .actor.@id.
    .actor
    | .name as $name                                    # for use in sub()'s regex later
    | (newNames[.name] // .name) as $newName            # use current name if no new name
    | .name = $newName                                  # simple assignment
    | .["@id"] |= sub(":\($name)$"; ":\($newName)")     # pipe and assignment
;

depersonalize($newNames[0])  # remove outer array added by --slurpfile

#!/usr/bin/env bash
# depersonalize.sh

# Make a temporary file of the input so it can be reused from its beginning later in the script.
cat "${1:--}" | \
    tee "depersonalize-$$-input.jsonl" | \
    jq --compact-output --raw-output '.actor.name' | \
    sort -u | ( \
        printf '{\n'
        propertySeparator=''
        # IFS= and -r keep leading whitespace and backslashes in names intact.
        while IFS= read -r name; do
            # md5 -qs is the BSD/macOS form; the substitution is quoted to avoid word splitting.
            printf '%s"%s":"%s"' "${propertySeparator}" "${name}" "$(md5 -qs "${name}")"
            propertySeparator=$',\n'
        done
        printf '\n}\n'
    ) > "depersonalize-$$-newNames.json"

# TODO: Combine previous jq call with the next one.
# (Tricky: lack of MD5 support or external programs)

# File for newNames should contain a single, valid JSON object with properties:
# {"name1":"newName1","name2":"newName2",...}
jq \
    --compact-output \
    --slurpfile newNames "depersonalize-$$-newNames.json" \
    --from-file depersonalize.jq \
    "depersonalize-$$-input.jsonl"

rm "depersonalize-$$-input.jsonl" "depersonalize-$$-newNames.json"

# Variant of depersonalize.jq that depersonalizes the Canvas user_login instead of .actor.name.
def depersonalize(newNames):
    # newNames should contain a single, valid JSON object with properties:
    #   {"login1":"newLogin1","login2":"newLogin2",...}
    # Input objects are expected to look like:
    #   {"actor":{"extensions":{"com.instructure.canvas":{"user_login":"login1",...}}}}
    # Replace one property value: .actor.extensions."com.instructure.canvas".user_login.
    .actor.extensions."com.instructure.canvas"
    | (newNames[.user_login] // .user_login) as $newName  # use current login if no new name
    | .user_login = $newName                              # simple assignment
;

depersonalize($newNames[0])  # remove outer array added by --slurpfile

#!/usr/bin/env bash
# Variant of depersonalize.sh that hashes the Canvas user_login values.

# Make a temporary file of the input so it can be reused from its beginning later in the script.
cat "${1:--}" | \
    tee "depersonalize-$$-input.jsonl" | \
    jq --compact-output --raw-output '.actor.extensions."com.instructure.canvas".user_login' | \
    sort -u | ( \
        printf '{\n'
        propertySeparator=''
        # IFS= and -r keep leading whitespace and backslashes in logins intact.
        while IFS= read -r user_login; do
            # md5 -qs is the BSD/macOS form; the substitution is quoted to avoid word splitting.
            printf '%s"%s":"%s"' "${propertySeparator}" "${user_login}" "$(md5 -qs "${user_login}")"
            propertySeparator=$',\n'
        done
        printf '\n}\n'
    ) > "depersonalize-$$-newNames.json"

# TODO: Combine previous jq call with the next one.
# (Tricky: lack of MD5 support or external programs)

# File for newNames should contain a single, valid JSON object with properties:
# {"name1":"newName1","name2":"newName2",...}
jq \
    --compact-output \
    --slurpfile newNames "depersonalize-$$-newNames.json" \
    --from-file depersonalize.jq \
    "depersonalize-$$-input.jsonl"

rm "depersonalize-$$-input.jsonl" "depersonalize-$$-newNames.json"

{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}
{"actor":{"@id":"https://example.edu/#profile:snafu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"snafu"}}
{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}
{"actor":{"@id":"https://example.edu/#profile:fubar","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"fubar"}}
{"actor":{"@id":"https://example.edu/#profile:snafu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"snafu"}}
{"actor":{"@id":"https://example.edu/#profile:nobody","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"nobody"}}

def fromTsvPairs:
    split("\n")
    | map(split("\t") | select(length > 0))
    | reduce .[] as $parts ({}; .[$parts[0]] = ($parts[1:] | join(" ")))
;

fromTsvPairs

{
  "fubar": "5185e8b8fd8a71fc80545e144f91faf2",
  "snafu": "f850038cdc0d2f2820556b22f58c38b3",
  "tarfu": "e2b87e98602ac8fc95f49fbc3f5c7b1d"
}