zqian/Depersonalize JSON.md

## Depersonalize JSON.md

      
### Depersonalize JSON

#### Why

If JSON data is to be shared for analysis, it's often necessary to depersonalize
the data first.  Depersonalization is different from anonymization.  If data
is anonymized, all information about the user is removed.  Depersonalization,
however, replaces the user information with values that can't be recognized as
being related to the real user.  Because the same input always maps to the same
replacement value, it's still possible to see relationships within the data.

#### How

To use the depersonalize.sh shell script, either give the name of the input
file to be depersonalized as an argument:

    depersonalize.sh events.jsonl

Or pipe the input into the script through stdin, which allows it to be used
as part of more complex pipelines:

    grep tarfu events.jsonl | depersonalize.sh
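
Both invocation styles come from a single parameter expansion in the script: `"${1:--}"` expands to the first argument, or to `-` when no argument is given, and `cat` reads stdin when its operand is `-`. A minimal sketch of that idiom in isolation (the function name `read_input` is hypothetical and not part of the script):

```shell
#!/usr/bin/env bash

# read_input is a hypothetical helper used only for illustration.
# "${1:--}" expands to the first argument, or to "-" if it is unset or empty;
# cat reads standard input when its operand is "-".
read_input() {
    cat "${1:--}"
}

# Usage (not executed here):
#   read_input events.jsonl
#   grep tarfu events.jsonl | read_input
```
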

The script uses jq to extract the .actor.name property from each object
found in the JSONL input.  It passes each unique name through the md5 program
to produce a hash for that name.  A shell loop then uses printf to format the
list of names and hashes as JSON of the form:

    {
      "name1": "name1_md5_hash",
      "name2": "name2_md5_hash",
      "nameN": "nameN_md5_hash"
    }

That JSON of names and hashes is written to a temporary file for use in the
next step.
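
The script names its temporary files after its own process ID (`$$`) and deletes them with `rm` on its last line, so they stay behind if the script stops earlier. One possible alternative, offered as a sketch and not as what the script does, is to let `mktemp` pick the file name and to remove the file even when the consuming command fails. The helper name `with_temp_file` is hypothetical, and calling `mktemp` without arguments is assumed to work on the target system:

```shell
#!/usr/bin/env bash

# with_temp_file is a hypothetical helper, not part of depersonalize.sh.
# It saves stdin to a file created by mktemp, runs the given command with
# that file's path appended as the last argument, then removes the file
# whether or not the command succeeded.
with_temp_file() {
    local tmp_file status=0
    tmp_file=$(mktemp) || return 1
    cat > "${tmp_file}"
    "$@" "${tmp_file}" || status=$?
    rm -f "${tmp_file}"
    return "${status}"
}
```

For example, `printf '{}\n' | with_temp_file cat` prints the saved content back and leaves no file behind.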

Note: The name extraction above and the replacement step below wouldn't need
to be separate calls to jq if the program had built-in support for MD5 or
allowed calls to external programs.
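
The `md5 -qs` call in the script is the BSD/macOS `md5` utility; GNU/Linux systems usually provide `md5sum` instead. The sketch below shows one way the hashing step might be made to work with either, assuming at least one of the two is installed. The names `md5_of` and `hash_names` are hypothetical, and the output here is tab-separated text rather than the JSON the script writes:

```shell
#!/usr/bin/env bash

# md5_of is a hypothetical helper, not part of depersonalize.sh.
# Prints the MD5 hex digest of its first argument, using md5sum
# (GNU coreutils) when available and md5 (BSD/macOS) otherwise.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        # md5sum prints "<digest>  -"; keep only the digest.
        printf '%s' "$1" | md5sum | cut -d ' ' -f 1
    else
        md5 -qs "$1"
    fi
}

# hash_names (also hypothetical) reads names from stdin, one per line,
# and prints one "name<TAB>digest" line per unique name.
hash_names() {
    sort -u | while IFS= read -r name; do
        printf '%s\t%s\n' "${name}" "$(md5_of "${name}")"
    done
}
```

Tab-separated pairs like these appear to be the input shape that fromTsvPairs.jq in this gist is written for, when jq is run with `--raw-input --slurp`.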

Finally, jq is called again, with the JSON of names and hashes assigned to a
variable.  A short program replaces each .actor.name value with its hash from
the variable; a name with no entry in the variable is left unchanged.
Additionally, the example data uses the users' names as part of the .actor.@id
property, so a substitution replaces that part of the value.  The program sends
the output to stdout, so be sure to redirect it to a file to keep it.
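
The replacement step can be tried on its own by passing a small map inline. This sketch assumes jq is installed and uses `--argjson` where the script uses `--slurpfile`, so there is no outer array to unwrap. The wrapper name `depersonalize_actor` is hypothetical; the filter body follows depersonalize.jq:

```shell
#!/usr/bin/env bash

# depersonalize_actor is a hypothetical wrapper for illustration.
# $1 is a JSON object mapping names to replacement values; JSONL events are
# read from stdin. Like depersonalize.jq, it outputs only the actor object.
depersonalize_actor() {
    jq --compact-output --argjson newNames "$1" '
        .actor
        | .name as $name
        | ($newNames[$name] // $name) as $newName  # keep the name if unmapped
        | .name = $newName
        | .["@id"] |= sub(":\($name)$"; ":\($newName)")
    '
}

# Usage (not executed here):
#   depersonalize_actor '{"tarfu":"e2b87e98602ac8fc95f49fbc3f5c7b1d"}' < events.jsonl
```
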

#### What

An example of an object from the JSONL input:

    {"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}

Although it won't be seen when the program executes, the temporary JSON file
of the name and its hash from that JSONL would be like:

    {
      "tarfu": "e2b87e98602ac8fc95f49fbc3f5c7b1d"
    }

The final output of the program is the depersonalized actor object, without
the enclosing "actor" property:

    {"@id":"https://example.edu/#profile:e2b87e98602ac8fc95f49fbc3f5c7b1d","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"e2b87e98602ac8fc95f49fbc3f5c7b1d"}

  
## depersonalize.jq
def depersonalize(newNames):
  # newNames should contain a single, valid JSON object with properties:
  # {"name1":"newName1","name2":"newName2",...}

  # Replace specific property values, i.e., .actor.name and .actor.@id.
  # Note: the pipeline starts at .actor, so only the actor object is output.
  .actor
  | .name as $name # interpolated, unescaped, into sub()'s regex later
  | (newNames[.name] // .name) as $newName # use current name if no new name
  | .name = $newName # simple assignment
  | .["@id"] |= sub(":\($name)$"; ":\($newName)") # pipe and assignment
  ;

depersonalize($newNames[0]) # remove outer array added by --slurpfile

## depersonalize.sh
#!/usr/bin/env bash

# Make a temporary file of the input so it can be reused from its beginning later in the script.
cat "${1:--}" | \
tee "depersonalize-$$-input.jsonl" | \
jq --compact-output --raw-output '.actor.name' | \
sort -u | ( \
    printf '{\n'
    propertySeparator=''
    # md5 -qs is the BSD/macOS md5 utility. Names are not JSON-escaped here.
    while IFS= read -r name; do
        printf '%s"%s":"%s"' "${propertySeparator}" "${name}" "$(md5 -qs "${name}")"
        propertySeparator=$',\n'
    done
    printf '\n}\n'
) > "depersonalize-$$-newNames.json"

# TODO: Combine previous jq call with the next one.
# (Tricky: lack of MD5 support or external programs)

# File for newNames should contain a single, valid JSON object with properties:
# {"name1":"newName1","name2":"newName2",...}
jq \
    --compact-output \
    --slurpfile newNames "depersonalize-$$-newNames.json" \
    --from-file depersonalize.jq \
    "depersonalize-$$-input.jsonl"

rm "depersonalize-$$-input.jsonl" "depersonalize-$$-newNames.json"

## depersonalize_user_login.jq

def depersonalize(newNames):
  # newNames should contain a single, valid JSON object with properties:
  # {"login1":"newLogin1","login2":"newLogin2",...}

  # Replace .actor.extensions."com.instructure.canvas".user_login.
  # Note: the pipeline starts at the extension object, so only that object is output.
  .actor.extensions."com.instructure.canvas"
  | (newNames[.user_login] // .user_login) as $newName # use current login if no new one
  | .user_login = $newName # simple assignment
  ;

depersonalize($newNames[0]) # remove outer array added by --slurpfile

## depersonalize_user_login.sh

#!/usr/bin/env bash

# Make a temporary file of the input so it can be reused from its beginning later in the script.
cat "${1:--}" | \
tee "depersonalize-$$-input.jsonl" | \
jq --compact-output --raw-output '.actor.extensions."com.instructure.canvas".user_login' | \
sort -u | ( \
    printf '{\n'
    propertySeparator=''
    # md5 -qs is the BSD/macOS md5 utility. Logins are not JSON-escaped here.
    while IFS= read -r user_login; do
        printf '%s"%s":"%s"' "${propertySeparator}" "${user_login}" "$(md5 -qs "${user_login}")"
        propertySeparator=$',\n'
    done
    printf '\n}\n'
) > "depersonalize-$$-newNames.json"

# TODO: Combine previous jq call with the next one.
# (Tricky: lack of MD5 support or external programs)

# File for newNames should contain a single, valid JSON object with properties:
# {"login1":"newLogin1","login2":"newLogin2",...}
jq \
    --compact-output \
    --slurpfile newNames "depersonalize-$$-newNames.json" \
    --from-file depersonalize_user_login.jq \
    "depersonalize-$$-input.jsonl"


rm "depersonalize-$$-input.jsonl" "depersonalize-$$-newNames.json"

## events.jsonl
{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}
{"actor":{"@id":"https://example.edu/#profile:snafu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"snafu"}}
{"actor":{"@id":"https://example.edu/#profile:tarfu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"tarfu"}}
{"actor":{"@id":"https://example.edu/#profile:fubar","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"fubar"}}
{"actor":{"@id":"https://example.edu/#profile:snafu","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"snafu"}}
{"actor":{"@id":"https://example.edu/#profile:nobody","@context":"http://purl.imsglobal.org/ctx/caliper/v1/Context","@type":"http://purl.imsglobal.org/caliper/v1/lis/Person","name":"nobody"}}


## fromTsvPairs.jq
def fromTsvPairs:
  split("\n")
  | map(split("\t")|select(length>0))
  | reduce .[] as $parts ({}; .[$parts[0]] = ($parts[1:] | join(" ")))
  ;

fromTsvPairs

## newNames.json
{
  "fubar": "5185e8b8fd8a71fc80545e144f91faf2",
  "snafu": "f850038cdc0d2f2820556b22f58c38b3",
  "tarfu": "e2b87e98602ac8fc95f49fbc3f5c7b1d"
}