@he7d3r
Created May 9, 2020 21:10
Compare articlequality datasets before and after proposed changes
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Before and after https://github.com/wikimedia/articlequality/pull/127
file_names = ['ptwiki.labelings.20200301.json',
              'ptwiki.labelings.20200301.extractor-and-reverts.json']
# One set of row tuples per file, so the datasets can be compared by set difference
sets = []
for file_name in file_names:
    # Each file is JSON Lines: one labeling per line
    df = pd.read_json('../datasets/' + file_name, lines=True, convert_dates=False)
    # Parse the compact timestamps explicitly (format YYYYMMDDHHMMSS)
    df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y%m%d%H%M%S')
    sets.append(set(tuple(line) for line in df.values))
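# Rows present only in the first (original) dataset
discarded_labelings = pd.DataFrame(list(sets[0] - sets[1]))
print('To be removed\n', discarded_labelings)
# Rows present only in the second (proposed) dataset
new_labelings = pd.DataFrame(list(sets[1] - sets[0]))
print('\nTo be added\n', new_labelings)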
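The set difference above discards the column names, so the printed frames have numeric headers. A minimal alternative sketch, assuming both files share the same columns, is an outer merge with `indicator=True`, which keeps the column names. The helper name `diff_labelings` and the sample columns `rev_id` and `wp10` are illustrative assumptions, not taken from the datasets.

```python
import pandas as pd


def diff_labelings(before: pd.DataFrame, after: pd.DataFrame):
    """Return (removed, added) rows for two frames with identical columns.

    Hypothetical helper: it is not part of articlequality.
    """
    cols = list(before.columns)
    # Drop duplicate rows first so the result mirrors set semantics
    merged = before.drop_duplicates().merge(
        after.drop_duplicates(), on=cols, how='outer', indicator=True)
    # '_merge' records which side each row came from
    removed = merged.loc[merged['_merge'] == 'left_only', cols]
    added = merged.loc[merged['_merge'] == 'right_only', cols]
    return removed.reset_index(drop=True), added.reset_index(drop=True)


# Tiny made-up frames standing in for the two labeling files
before = pd.DataFrame({'rev_id': [1, 2, 3], 'wp10': ['1', '2', '3']})
after = pd.DataFrame({'rev_id': [2, 3, 4], 'wp10': ['2', '3', '4']})

removed, added = diff_labelings(before, after)
print('To be removed\n', removed)
print('\nTo be added\n', added)
```

With the real data, the two frames read in the loop could be kept in a list and passed to this helper instead of being converted to sets of tuples.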