@notconfusing
Created July 10, 2018 17:10
Bad Label Detection
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bad Label Research\n",
"__Max Klein 10 July 2018__\n",
"\n",
"This notebook looks for bad labels by running CV and looking at variances in labels.\n",
"\n",
"## TLDR\n",
"- I use the demo dataset from the editquality demo of **20,000 enwiki reverts**\n",
"- I run **training** on the dataset, **repeated 3 times, with 5-fold cross validation in each repeat**. \n",
" - Note this means that each observation is predicted 3 times.\n",
" - Note I learned sklearn has a method called `RepeatedKFoldCV` which is nice.\n",
"- I record the predicted probability of each observation, and take the variance.\n",
"- The predictions and their variances are **power-law distributed**\n",
"- I inspected the top 10 most variant observations and found:\n",
" - 2 Bad labels (false positive labels)\n",
" - 4 Potentially bad labels\n",
" - 2 Mixed labels that contain both good and bad edits\n",
" - 2 Truly damaging\n",
"\n",
"## Conclusion\n",
"I think this technique shows promise! But it might still be hard to automatically surface bad labels. Although an \"are you sure?\" human-in-the-loop, closer inspection workflow could work.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"import bz2\n",
"import os\n",
"import pandas as pd\n",
"import numpy as np\n",
"import math\n",
"%pylab inline\n",
"\n",
"DATADIR = '/home/paprika/workspace/editquality/datasets/demo'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/paprika/workspace/oresenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
" \"This module will be removed in 0.20.\", DeprecationWarning)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"enwiki.features_reverted.testing.20k_2015.tsv.bz2 has 4862 many observations\n",
"enwiki.features_reverted.training.20k_2015.tsv.bz2 has 14979 many observations\n",
"all observations consists of 19841\n"
]
}
],
"source": [
"def float_or_bool(s):\n",
" try:\n",
" return float(s)\n",
" except ValueError:\n",
" if s == 'True':\n",
" return True\n",
" elif s == 'False':\n",
" return False\n",
" else:\n",
" raise ValueError('sheesh')\n",
" \n",
"def read_observations_tsv(f):\n",
" with bz2.open(os.path.join(DATADIR, f), 'rt') as b:\n",
" #NOTE y u no worky newline argument\n",
" for l in b.readlines():\n",
" line = l.split('\\n')[0]\n",
" parts = line.split('\\t')\n",
" rev_id = parts[0]\n",
" cache_str = parts[1:-1]\n",
" cache = [float_or_bool(s) for s in cache_str]\n",
" reverted = True if parts[-1] == 'True' else False\n",
" yield {'rev_id': rev_id, 'cache': cache, 'reverted': reverted}\n",
"\n",
"all_observations = []\n",
"all_observations_unpacked = []\n",
"\n",
"for dataset_presplit in ('testing', 'training'):\n",
" dataset_f = f'enwiki.features_reverted.{dataset_presplit}.20k_2015.tsv.bz2'\n",
" features_reverted = list(read_observations_tsv(dataset_f))\n",
" print(f'{dataset_f} has {len(features_reverted)} many observations')\n",
" all_observations.extend(features_reverted)\n",
" \n",
" unpacked = [(o[\"cache\"], o[\"reverted\"]) for o in features_reverted]\n",
" all_observations_unpacked.extend(unpacked)\n",
" \n",
"print(f'all observations consists of {len(all_observations)}')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"#this cell is copied from the editquality ipython tutorial\n",
"\n",
"from revscoring.features import wikitext, revision_oriented, temporal\n",
"from revscoring.languages import english\n",
"\n",
"features = [\n",
" # Catches long key mashes like kkkkkkkkkkkk\n",
" wikitext.revision.diff.longest_repeated_char_added,\n",
" # Measures the size of the change in added words\n",
" wikitext.revision.diff.words_added,\n",
" # Measures the size of the change in removed words\n",
" wikitext.revision.diff.words_removed,\n",
" # Measures the proportional change in \"badwords\"\n",
" english.badwords.revision.diff.match_prop_delta_sum,\n",
" # Measures the proportional change in \"informals\"\n",
" english.informals.revision.diff.match_prop_delta_sum,\n",
" # Measures the proportional change meaningful words\n",
" english.stopwords.revision.diff.non_stopword_prop_delta_sum,\n",
" # Is the user anonymous\n",
" revision_oriented.revision.user.is_anon,\n",
" # Is the user a bot or a sysop\n",
" revision_oriented.revision.user.in_group({'bot', 'sysop'}),\n",
" # How long ago did the user register?\n",
" temporal.revision.user.seconds_since_registration\n",
"]\n",
"\n",
"from revscoring.scoring.models import GradientBoosting\n",
"is_reverted = GradientBoosting(features, labels=[True, False], version=\"live demo!\", \n",
" learning_rate=0.01, max_features=\"log2\", \n",
" n_estimators=700, max_depth=5,\n",
" population_rates={False: 0.5, True: 0.5}, scale=True, center=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### remind myself of objective\n",
"data format we're interested in is:\n",
"- dataframe every label is row, \n",
"- columns are difference between predicted and actual\n",
"- if observation was used in training, return null/None/nan\n",
"\n",
"### which methods to use\n",
"- from what I can see, model.cross_validate returns aggregate statistics, so we won't be able to inspect the predicted labels versus actual,\n",
"- we will have to rely on the index position to recover the id of the revision which is bad-labelled, (or else do it more carefully later which our own loop)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Doing it my own way, and then integrating later\n",
"### Steps:\n",
"1. create the folds\n",
"2. for each fold \n",
" 2. train\n",
" 1. test\n",
" 2. on the test set compare actual vs predicted\n",
" 3. save to global data of rev_id, actual, predicted\n",
"3. transform dataframe so that we have rows of actual-vs predicted"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# can either run Kfold mutliple times, keeping doing own bookkeeping\n",
"#from sklearn.cross_validation import KFold\n",
"# or what's repeatedKfold\n",
"from sklearn.model_selection import RepeatedKFold\n",
"#if we are going to repeat many times, then what's the point in splitting so finely? I don't see much difference\n",
"X_list = [o[0] for o in all_observations_unpacked]\n",
"X = np.array(X_list)\n",
"y_list = [o[1] for o in all_observations_unpacked]\n",
"y = np.array(y_list)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Working on split: 0\n",
"Working on split: 1\n",
"Working on split: 2\n",
"Working on split: 3\n",
"Working on split: 4\n",
"Working on split: 5\n",
"Working on split: 6\n",
"Working on split: 7\n",
"Working on split: 8\n",
"Working on split: 9\n",
"Working on split: 10\n",
"Working on split: 11\n",
"Working on split: 12\n",
"Working on split: 13\n",
"Working on split: 14\n"
]
}
],
"source": [
"n_splits = 5\n",
"n_repeats = 3\n",
"\n",
"rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=2718281828)\n",
"\n",
"split_results = []\n",
"for round_k, (train_index, test_index) in enumerate(rkf.split(X, y)):\n",
"# print(\"TRAIN:\", train_index, \"TEST:\", test_index)\n",
"# print(\"Train len\", len(train_index), \"Test len\", len(test_index))\n",
"# print('test % is:', len(test_index) / (len(test_index) + len(train_index)))\n",
" print(f'Working on split: {round_k}')\n",
" X_train, X_test = X[train_index], X[test_index]\n",
" y_train, y_test = y[train_index], y[test_index]\n",
" train_values_labels = [(X_train[i], y_train[i]) for i in range(X_train.shape[0])]\n",
" is_reverted.train(values_labels=train_values_labels)\n",
"\n",
" test_feature_values = [X_test[j] for j in range(X_test.shape[0])]\n",
" predictions = is_reverted.score_many(test_feature_values)\n",
" probas = [p['probability'][True] for p in predictions]\n",
" records = {'proba':probas, 'orig_ix': test_index}\n",
" \n",
" split_result = pd.DataFrame.from_records(records)\n",
"\n",
" split = round_k % n_splits\n",
" repeat = math.floor(round_k / n_splits)\n",
" split_result['split'] = split\n",
" split_result['repeat'] = repeat\n",
" split_results.append(split_result)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"proba_df = pd.concat(split_results)"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [],
"source": [
"var_df = proba_df.groupby('orig_ix').agg({'proba':np.var}).rename(columns={'proba':'variance'}).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [],
"source": [
"var_df_sorted = var_df.sort_values(by='variance', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f322ed80828>"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAD8CAYAAAC/1zkdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEvZJREFUeJzt3X+wXeVd7/H3h6QUuBZom9iLCRiq8UesP0ojxemotdg2hSup17bSsTZ2GHAsvddfowV1pLbitOOPKk6rxZIRuCql1dtGS4ehFOzoyI8glRZ6uZxSLElRYqFgf4HQr3/sJ7hJz8lZhzz7bHbO+zWzJ2s9+1lrfZ+cTD7nWWvttVNVSJLUwyHTLkCSdPAwVCRJ3RgqkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkrpZPe0CltuaNWtqw4YN0y5DkmbGTTfd9G9VtXZI3xUXKhs2bGDnzp3TLkOSZkaSfx7a19NfkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkroxVCRJ3RgqkqRuVtwn6g/EhnM+OJXj3vXWU6dyXElaKmcqkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkroxVCRJ3RgqkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkroxVCRJ3RgqkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkroxVCRJ3RgqkqRuDBVJUjcTD5Ukq5LcnORv2vrxSa5PMpfkPUkObe1Pbetz7f0NY/s4t7XfnuSlY+1bWttcknMmPRZJ0v4tx0zlZ4FPjq2/DXh7VX0zcD9wRms/A7i/tb+99SPJJuB04DuALcA7W1CtAt4BvAzYBLy69ZUkTclEQyXJeuBU4N1tPcCLgPe1LhcDL2/LW9s67f2TW/+twGVV9VBVfRqYA05sr7mqurOqHgYua30lSVMy6ZnK7wO/DHy1rT8T+HxVPdLWdwHr2vI64G6A9v4Drf9j7ftss1D710hyVpKdSXbu2bPnQMckSVrAxEIlyf8A7q2qmyZ1jKGq6sKq2lxVm9euXTvtciTpoLV6gvt+AXBaklOAw4AjgT8Ajk6yus1G1gO7W//dwLHAriSrgaOAz4217zW+zULtkqQpmNhMparOrar1VbWB0YX2j1TVTwDXAK9o3bYBH2jLO9o67f2PVFW19tPb3WHHAxuBG4AbgY3tbrJD2zF2TGo8kqTFTXKmspA3Apcl+U3gZuCi1n4RcGmSOeA+RiFBVd2a5HLgNuAR4OyqehQgyRuAK4FVwPaqunVZRyJJepxlCZWquha4ti3fyejOrX37fAV45QLbnw+cP0/7FcAVHUuVJB0AP1EvSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6mVioJDksyQ1J/inJrUl+o7Ufn+T6JHNJ3pPk0Nb+1LY+197fMLavc1v77UleOta+pbXNJTlnUmORJA0zyZnKQ8CLquq7ge8BtiQ5CXgb8Paq+mbgfuCM1v8M4P7W/vbWjySbgNOB7wC2AO9MsirJKuAdwMuATcCrW19J0pRMLFRq5Att9SntVcCLgPe19ouBl7flrW2d9v7JSdLaL6uqh6rq08AccGJ7zVXVnVX1MHBZ6ytJmpJBoZLkO5/IztuM4mPAvcBVwKeAz1fVI63LLmBdW14H3A3Q3n8AeOZ4+z7bLNQ+Xx1nJdmZZOeePXueyFAkSQMMnam8s10feX2So4buvKoerarvAdYzmll82xMp8kBV1Y
VVtbmqNq9du3YaJUjSijAoVKrq+4GfAI4Fbkry50lePPQgVfV54Brg+4Cjk6xub60Hdrfl3W3/tPePAj433r7PNgu1S5KmZPA1laq6A/g14I3ADwIXJPl/Sf7nfP2TrE1ydFs+HHgx8ElG4fKK1m0b8IG2vKOt097/SFVVaz+93R12PLARuAG4EdjY7iY7lNHF/B1DxyNJ6m/14l0gyXcBrwNOZXRt5Eeq6h+TfAPwD8BfzbPZMcDF7S6tQ4DLq+pvktwGXJbkN4GbgYta/4uAS5PMAfcxCgmq6tYklwO3AY8AZ1fVo62uNwBXAquA7VV165L/BiRJ3QwKFeAPgXcDv1JVX97bWFWfTfJr821QVbcAz52n/U5G11f2bf8K8MoF9nU+cP487VcAVwwcgyRpwoaGyqnAl8dmCIcAh1XVl6rq0olVJ0maKUOvqXwYOHxs/YjWJknSY4aGymFjH2SkLR8xmZIkSbNqaKh8MckJe1eSPA/48n76S5JWoKHXVH4OeG+SzwIB/jvw4xOrSpI0kwaFSlXdmOTbgG9tTbdX1X9MrixJ0iwaOlMB+F5gQ9vmhCRU1SUTqUqSNJOGfvjxUuCbgI8Bj7bmAgwVSdJjhs5UNgOb2mNTJEma19C7vz7B6OK8JEkLGjpTWQPcluQGRt/oCEBVnTaRqiRJM2loqLxpkkVIkg4OQ28p/tsk3whsrKoPJzmC0ZOBJUl6zNCvEz6T0ffGv6s1rQPeP6miJEmzaeiF+rOBFwAPwmNf2PX1kypKkjSbhobKQ1X18N6V9nW/3l4sSXqcoaHyt0l+BTi8fTf9e4G/nlxZkqRZNDRUzgH2AB8HfprRty3O+42PkqSVa+jdX18F/qS9JEma19Bnf32aea6hVNWzu1ckSZpZS3n2116HAa8EntG/HEnSLBt0TaWqPjf22l1Vvw+cOuHaJEkzZujprxPGVg9hNHNZynexSJJWgKHB8Ltjy48AdwGv6l6NJGmmDb3764cmXYgkafYNPf31C/t7v6p+r085kqRZtpS7v74X2NHWfwS4AbhjEkVJkmbT0FBZD5xQVf8OkORNwAer6jWTKkySNHuGPqblWcDDY+sPtzZJkh4zdKZyCXBDkv/b1l8OXDyZkiRJs2ro3V/nJ/kQ8P2t6XVVdfPkypIkzaKhp78AjgAerKo/AHYlOX5CNUmSZtTQrxM+D3gjcG5regrwfyZVlCRpNg2dqfwocBrwRYCq+izwtEkVJUmaTUND5eGqKtrj75P8t8mVJEmaVUND5fIk7wKOTnIm8GEW+cKuJMcmuSbJbUluTfKzrf0ZSa5Kckf78+mtPUkuSDKX5Jbxh1gm2db635Fk21j785J8vG1zQZIs9S9AktTP0Eff/w7wPuAvgW8Ffr2q/nCRzR4BfrGqNgEnAWcn2cToq4mvrqqNwNVtHeBlwMb2Ogv4IxiFEHAe8HzgROC8vUHU+pw5tt2WIeORJE3GorcUJ1kFfLg9VPKqoTuuqnuAe9ryvyf5JLAO2Aq8sHW7GLiW0U0AW4FL2mm265IcneSY1veqqrqv1XMVsCXJtcCRVXVda7+E0ednPjS0RklSX4vOVKrqUeCrSY56ogdJsgF4LnA98KwWOAD/wn99Mn8dcPfYZrta2/7ad83TLkmakqGfqP8C8PE2S/ji3saq+t+LbZjk6xidNvu5qnpw/LJHVVWSWlrJS5fkLEan1DjuuOMmfThJWrGGhspftdeSJHkKo0D5s6rau/2/Jjmmqu5pp7fube27gWPHNl/f2nbzX6fL9rZf29rXz9P/a1TVhcCFAJs3b554iEnSSrXfUElyXFV9pqqW/JyvdifWRcAn9/m+lR3ANuCt7c8PjLW/IclljC7KP9CC50rgt8Yuzr8EOLeq7kvyYJKTGJ1Wey2w2M0DkqQJWuyayvv3LiT5yyXu+wXATwIvSvKx9jqFUZi8OMkdwA+3dYArgDuBOUa3K78eoF2gfwtwY3u9ee9F+9bn3W2bT+
FFekmaqsVOf41/7uPZS9lxVf3dPtuPO3me/gWcvcC+tgPb52nfCTxnKXVJkiZnsZlKLbAsSdLXWGym8t1JHmQ04zi8LdPWq6qOnGh1kqSZst9QqapVy1WIJGn2LeX7VCRJ2i9DRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3EwuVJNuT3JvkE2Ntz0hyVZI72p9Pb+1JckGSuSS3JDlhbJttrf8dSbaNtT8vycfbNhckyaTGIkkaZpIzlT8FtuzTdg5wdVVtBK5u6wAvAza211nAH8EohIDzgOcDJwLn7Q2i1ufMse32PZYkaZlNLFSq6qPAffs0bwUubssXAy8fa7+kRq4Djk5yDPBS4Kqquq+q7geuAra0946squuqqoBLxvYlSZqS5b6m8qyquqct/wvwrLa8Drh7rN+u1ra/9l3ztEuSpmhqF+rbDKOW41hJzkqyM8nOPXv2LMchJWlFWu5Q+dd26or2572tfTdw7Fi/9a1tf+3r52mfV1VdWFWbq2rz2rVrD3gQkqT5LXeo7AD23sG1DfjAWPtr211gJwEPtNNkVwIvSfL0doH+JcCV7b0Hk5zU7vp67di+JElTsnpSO07yF8ALgTVJdjG6i+utwOVJzgD+GXhV634FcAowB3wJeB1AVd2X5C3Aja3fm6tq78X/1zO6w+xw4EPtJUmaoomFSlW9eoG3Tp6nbwFnL7Cf7cD2edp3As85kBolSX35iXpJUjeGiiSpG0NFktSNoSJJ6sZQkSR1Y6hIkroxVCRJ3RgqkqRuDBVJUjeGiiSpG0NFktTNxJ79pX42nPPBqR37rreeOrVjS5o9zlQkSd0YKpKkbgwVSVI3hookqRtDRZLUjaEiSerGUJEkdWOoSJK6MVQkSd0YKpKkbgwVSVI3hookqRtDRZLUjU8p1n5N6wnJPh1Zmk3OVCRJ3RgqkqRuDBVJUjeGiiSpG0NFktSNoSJJ6sZbivWkNK1bmcHbmaUD4UxFktSNoSJJ6mbmQyXJliS3J5lLcs6065GklWymr6kkWQW8A3gxsAu4McmOqrptupVplvloGumJm+lQAU4E5qrqToAklwFbAUNFM8cw08Fg1kNlHXD32Pou4PlTqkWaSdO8007LZ7l+eZj1UBkkyVnAWW31C0lufwK7WQP8W7+qZspKHftKHTes3LEftOPO2xbtsr+xf+PQ48x6qOwGjh1bX9/aHqeqLgQuPJADJdlZVZsPZB+zaqWOfaWOG1bu2FfquKHf2Gf97q8bgY1Jjk9yKHA6sGPKNUnSijXTM5WqeiTJG4ArgVXA9qq6dcplSdKKNdOhAlBVVwBXLMOhDuj02YxbqWNfqeOGlTv2lTpu6DT2VFWP/UiSNPPXVCRJTyKGyj4We+xLkqcmeU97//okG5a/yv4GjPsXktyW5JYkVycZfIvhk93QR/0k+bEkleSguDtoyLiTvKr93G9N8ufLXeOkDPj3flySa5Lc3P7NnzKNOntLsj3JvUk+scD7SXJB+3u5JckJSz5IVflqL0YX+z8FPBs4FPgnYNM+fV4P/HFbPh14z7TrXqZx/xBwRFv+mYNh3EPH3vo9DfgocB2wedp1L9PPfCNwM/D0tv710657Gcd+IfAzbXkTcNe06+409h8ATgA+scD7pwAfAgKcBFy/1GM4U3m8xx77UlUPA3sf+zJuK3BxW34fcHKSLGONk7DouKvqmqr6Ulu9jtFngg4GQ37mAG8B3gZ8ZTmLm6Ah4z4TeEdV3Q9QVfcuc42TMmTsBRzZlo8CPruM9U1MVX0UuG8/XbYCl9TIdcDRSY5ZyjEMlceb77Ev6xbqU1WPAA8Az1yW6i
ZnyLjHncHot5mDwaJjb6cAjq2qg+l5JkN+5t8CfEuSv09yXZIty1bdZA0Z+5uA1yTZxeju0v+1PKVN3VL/L/gaM39LsZZXktcAm4EfnHYtyyHJIcDvAT815VKmYTWjU2AvZDQz/WiS76yqz0+1quXxauBPq+p3k3wfcGmS51TVV6dd2JOdM5XHG/LYl8f6JFnNaGr8uWWpbnIGPe4myQ8DvwqcVlUPLVNtk7bY2J8GPAe4NsldjM4z7zgILtYP+ZnvAnZU1X9U1aeB/88oZGbdkLGfAVwOUFX/ABzG6NlYB7tB/xfsj6HyeEMe+7ID2NaWXwF8pNoVrhm26LiTPBd4F6NAOVjOrcMiY6+qB6pqTVVtqKoNjK4nnVZVO6dTbjdD/q2/n9EshSRrGJ0Ou3M5i5yQIWP/DHAyQJJvZxQqe5a1yunYAby23QV2EvBAVd2zlB14+mtMLfDYlyRvBnZW1Q7gIkZT4TlGF7xOn17FfQwc928DXwe8t92X8JmqOm1qRXcycOwHnYHjvhJ4SZLbgEeBX6qqWZ+VDx37LwJ/kuTnGV20/6mD4JdHkvwFo18U1rTrRecBTwGoqj9mdP3oFGAO+BLwuiUf4yD4e5IkPUl4+kuS1I2hIknqxlCRJHVjqEiSujFUJEndGCqSpG4MFUlSN4aKJKmb/wTymZsmCiQCMwAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"proba_df['proba'].plot(kind='hist')"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.axes._subplots.AxesSubplot at 0x7f3231629cc0>"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAD8CAYAAAC/1zkdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFptJREFUeJzt3X+w3XV95/Hny1BQu1qDpCybEBNsdBdcG+WKznaxWlSCroJdx4Zpa2oZoitMt9OdWUE7xbHLDN2tZcuuYqNmhNbyQ/FHtsalgXV1dqYIQbP8UppLwCUxQhpco8LABt/7x/lcPFzuTU6S77nnnuT5mPnO/X7f31+fD4fhxff7+Z7vSVUhSVIXnjXqBkiSDh+GiiSpM4aKJKkzhookqTOGiiSpM4aKJKkzhookqTOGiiSpM4aKJKkzR426AXPtuOOOq2XLlo26GZI0Vm6//fZ/qKpF+9vuiAuVZcuWsXnz5lE3Q5LGSpLvDrKdt78kSZ0xVCRJnTFUJEmdMVQkSZ0xVCRJnTFUJEmdMVQkSZ0xVCRJnTFUJEmdOeK+UX8oll305ZGc94HL3jKS80rSgfJKRZLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1BlDRZLUGUNFktSZoYVKkvVJHk5yV1/tuiRb2vRAki2tvizJY33rPt63z6lJ7kwymeSKJGn1Y5NsSrK1/V04rL5IkgYzzCuVTwOr+gtV9RtVtbKqVgI3AJ/vW33f1Lqqem9f/UrgfGBFm6aOeRFwc1WtAG5uy5KkERpaqFTV14FHZlrXrjbeCVyzr2MkOQF4flXdUlUFXA2c01afDVzV5q/qq0uSRmRUYyqnAw9V1da+2vIk30rytSSnt9piYHvfNttbDeD4qtrZ5r8PHD/UFkuS9mtUv6dyLk+/StkJLK2q3UlOBb6Y5JRBD1ZVlaRmW59kLbAWYOnSpQfZZEnS/sz5lUqSo4BfB66bqlXV41W1u83fDtwHvATYASzp231JqwE81G6PTd0me3i2c1bVuqqaqKqJRYsWddkdSVKfUdz+egPwnap66rZWkkVJFrT5k+gNyG9rt7f2JHlNG4d5F/ClttsGYE2bX9NXlySNyDAfKb4G+DvgpUm2JzmvrVrNMwfoXwvc0R4x/hzw3qqaGuR/H/BJYJLeFcxXWv0y4I1JttILqsuG1RdJ0mCGNqZSVefOUv+dGWo30HvEeKbtNwMvm6G+Gzjj0FopSeqS36iXJHXGUJEkdcZQkSR1xlCRJHXGUJEkdcZQkSR1xlCRJHXGUJEkdcZQkSR1xlCRJHXGUJEkdcZQkSR1xlCRJHXGUJEkdcZQkSR1xlCRJHXGUJEkdcZQkSR1xlCRJHVmaKGSZH2Sh5Pc1Vf7UJIdSba06c196y5OMpnk3iRn9tVXtdpkkov66suTfKPVr0ty9LD6IkkazDCvVD4NrJqhfnlVrWzTRoAkJwOrgVPaPh9LsiDJAuCjwFnAycC5bVuAP2nH+iXgB8B5Q+yLJGkAQwuVqvo68MiAm58NXFtVj1fV/cAkcFqbJqtqW1U9AVwLnJ0kwK8Bn2v7XwWc02kHJEkHbBRjKhcmuaPdHlvYaouBB/u22d5qs9VfCPzfqto7rT6jJGuTbE6yedeuXV31Q5I0zVyHypXAi4GVwE7gI3Nx0qpaV1UTVTWxaNGiuTilJB2RjprLk1XVQ1PzST4B/E1b3AGc2LfpklZjlvpu4AVJjmpXK/3bS5JGZE6vVJKc0Lf4dmDqybANwOokxyRZDqwAbgVuA1a0J72OpjeYv6GqCvgq8I62/xrgS3PRB0nS7IZ2pZLkGuB1wHFJtgOXAK9LshIo4AHgPQBVdXeS64F7gL3ABVX1ZDvOhcCNwAJgfVXd3U7xfuDaJP8B+BbwqWH1RZI0mKGFSlWdO0N51v/wV9WlwKUz1DcCG2eob6P3dJgkaZ
7wG/WSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTODC1UkqxP8nCSu/pq/ynJd5LckeQLSV7Q6suSPJZkS5s+3rfPqUnuTDKZ5IokafVjk2xKsrX9XTisvkiSBjPMK5VPA6um1TYBL6uqlwN/D1zct+6+qlrZpvf21a8EzgdWtGnqmBcBN1fVCuDmtixJGqGhhUpVfR14ZFrtb6tqb1u8BViyr2MkOQF4flXdUlUFXA2c01afDVzV5q/qq0uSRmSUYyq/C3ylb3l5km8l+VqS01ttMbC9b5vtrQZwfFXtbPPfB44famslSft11ChOmuSDwF7gM620E1haVbuTnAp8Mckpgx6vqipJ7eN8a4G1AEuXLj34hkuS9mnOr1SS/A7wr4DfbLe0qKrHq2p3m78duA94CbCDp98iW9JqAA+122NTt8kenu2cVbWuqiaqamLRokUd90iSNGVOQyXJKuDfA2+rqkf76ouSLGjzJ9EbkN/Wbm/tSfKa9tTXu4Avtd02AGva/Jq+uiRpRAYKlST//EAPnOQa4O+AlybZnuQ84L8CzwM2TXt0+LXAHUm2AJ8D3ltVU4P87wM+CUzSu4KZGoe5DHhjkq3AG9qyJGmEBh1T+ViSY+g9JvyZqvrh/naoqnNnKH9qlm1vAG6YZd1m4GUz1HcDZ+yvHZKkuTPQlUpVnQ78JnAicHuSv07yxqG2TJI0dgYeU6mqrcAfAu8HfhW4on07/teH1ThJ0ngZdEzl5UkuB74N/Brw1qr6Z23+8iG2T5I0RgYdU/kv9AbLP1BVj00Vq+p7Sf5wKC2TJI2dQUPlLcBjVfUkQJJnAc+uqker6i+H1jpJ0lgZdEzlJuA5fcvPbTVJkp4yaKg8u6p+PLXQ5p87nCZJksbVoKHykySvnFpo7+d6bB/bS5KOQIOOqfw+8Nkk3wMC/GPgN4bWKknSWBooVKrqtiT/FHhpK91bVf9veM2SJI2jA3n1/auAZW2fVyahqq4eSqskSWNpoFBJ8pfAi4EtwJOtPPVLjJIkAYNfqUwAJ0/9/okkSTMZ9Omvu+gNzkuSNKtBr1SOA+5Jcivw+FSxqt42lFZJksbSoKHyoWE2QpJ0eBj0keKvJXkRsKKqbkryXGDBcJsmSRo3g776/nx6P/P7F620GPjisBolSRpPgw7UXwD8CrAHnvrBrl/c305J1id5OMldfbVjk2xKsrX9XdjqSXJFkskkd0x7Lcyatv3WJGv66qcmubPtc0WSDNgfSdIQDBoqj1fVE1MLSY6i9z2V/fk0sGpa7SLg5qpaAdzclgHOAla0aS1wZTvXscAlwKuB04BLpoKobXN+337TzyVJmkODhsrXknwAeE77bfrPAv9tfztV1deBR6aVzwauavNXAef01a+unluAFyQ5ATgT2FRVj1TVD4BNwKq27vlVdUv7/szVfceSJI3AoKFyEbALuBN4D7CR3u/VH4zjq2pnm/8+cHybXww82Lfd9lbbV337DHVJ0ogM+vTXT4FPtKkzVVVJhv4t/SRr6d1SY+nSpcM+nSQdsQZ9+uv+JNumTwd5zofarSva34dbfQdwYt92S1ptX/UlM9SfoarWVdVEVU0sWrToIJstSdqfQW9/TdB7S/GrgNOBK4C/OshzbgCmnuBaA3ypr/6u9hTYa4AftttkNwJvSrKwDdC/CbixrduT5DXtqa939R1LkjQCg97+2j2t9J+T3A780b72S3IN8DrguCTb6T3FdRlwfZLzgO8C72ybbwTeDEwCjwLvbud+JMkfA7e17T5cVVOD/++j94TZc4CvtEmSNCKDvvr+lX2Lz6J35bLffavq3FlWnTHDtkXv+zAzHWc9sH6G+mbgZftrhyRpbgz67q+P9M3vBR7gZ1cYkiQBg9/+ev2wGyJJGn+D3v76g32tr6o/66Y5kq
RxdiC//Pgqek9oAbwVuBXYOoxGSZLG06ChsgR4ZVX9CCDJh4AvV9VvDathkqTxM+j3VI4HnuhbfoKfvV5FkiRg8CuVq4Fbk3yhLZ/Dz14KKUkSMPjTX5cm+Qq9b9MDvLuqvjW8ZkmSxtGgt78Angvsqao/B7YnWT6kNkmSxtSgL5S8BHg/cHEr/RwH/+4vSdJhatArlbcDbwN+AlBV3wOeN6xGSZLG06Ch8kR7N1cBJPn54TVJkjSuBg2V65P8Bb2f+D0fuImOf7BLkjT+Bn3660/bb9PvAV4K/FFVbRpqyyRJY2e/oZJkAXBTe6mkQSJJmtV+b39V1ZPAT5P8why0R5I0xgb9Rv2PgTuTbKI9AQZQVb83lFZJksbSoKHy+TZJkjSrfYZKkqVV9X+qqrP3fCV5KXBdX+kker91/wLgfGBXq3+gqja2fS4GzgOeBH6vqm5s9VXAnwMLgE9W1WVdtVOSdOD2N6byxamZJDd0ccKqureqVlbVSuBU4FFg6kWVl0+t6wuUk4HVwCnAKuBjSRa0Bwg+CpwFnAyc27aVJI3I/m5/pW/+pCGc/wzgvqr6bpLZtjkbuLaqHgfuTzIJnNbWTVbVNoAk17Zt7xlCOyVJA9jflUrNMt+V1cA1fcsXJrkjyfokC1ttMfBg3zbbW222uiRpRPYXKr+cZE+SHwEvb/N7kvwoyZ5DOXGSo+m9T+yzrXQl8GJgJbAT+MihHH/audYm2Zxk865du/a/gyTpoOzz9ldVLRjiuc8CvllVD7VzPTS1IskngL9pizuAE/v2W9Jq7KP+NFW1DlgHMDExMYwrLkkSB/Z7Kl07l75bX0lO6Fv3duCuNr8BWJ3kmPYbLiuAW4HbgBVJlrerntVtW0nSiAz6PZVOtbccvxF4T1/5PyZZSW/s5oGpdVV1d5Lr6Q3A7wUuaN/yJ8mFwI30HileX1V3z1knJEnPMJJQqaqfAC+cVvvtfWx/KXDpDPWNwMbOGyhJOiijvP0lSTrMGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOGCqSpM4YKpKkzhgqkqTOjCxUkjyQ5M4kW5JsbrVjk2xKsrX9XdjqSXJFkskkdyR5Zd9x1rTttyZZM6r+SJJGf6Xy+qpaWVUTbfki4OaqWgHc3JYBzgJWtGktcCX0Qgi4BHg1cBpwyVQQSZLm3qhDZbqzgava/FXAOX31q6vnFuAFSU4AzgQ2VdUjVfUDYBOwaq4bLUnqGWWoFPC3SW5PsrbVjq+qnW3++8DxbX4x8GDfvttbbbb60yRZm2Rzks27du3qsg+SpD5HjfDc/7KqdiT5RWBTku/0r6yqSlJdnKiq1gHrACYmJjo5piTpmUZ2pVJVO9rfh4Ev0BsTeajd1qL9fbhtvgM4sW/3Ja02W12SNAIjCZUkP5/keVPzwJuAu4ANwNQTXGuAL7X5DcC72lNgrwF+2G6T3Qi8KcnCNkD/plaTJI3AqG5/HQ98IclUG/66qv57ktuA65OcB3wXeGfbfiPwZmASeBR4N0BVPZLkj4Hb2nYfrqpH5q4bkqR+IwmVqtoG/PIM9d3AGTPUC7hglmOtB9Z33UZJ0oGbb48US5LGmKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6sych0qSE5N8Nck9Se5O8m9b/UNJdiTZ0qY39+1zcZLJJPcmObOvvqrVJpNcNNd9kSQ93Sh+o34v8O+q6ptJngfcnmRTW3d5Vf1p/8ZJTgZWA6cA/wS4KclL2uqPAm8EtgO3JdlQVffMSS8kSc8w56FSVTuBnW3+R0m+DSzexy5nA9dW1ePA/UkmgdPausmq2gaQ5Nq2raEiSSMy0jGVJMuAVwDfaK
ULk9yRZH2Sha22GHiwb7ftrTZbXZI0IiMLlST/CLgB+P2q2gNcCbwYWEnvSuYjHZ5rbZLNSTbv2rWrq8NKkqYZSagk+Tl6gfKZqvo8QFU9VFVPVtVPgU/ws1tcO4AT+3Zf0mqz1Z+hqtZV1URVTSxatKjbzkiSnjKKp78CfAr4dlX9WV/9hL7N3g7c1eY3AKuTHJNkObACuBW4DViRZHmSo+kN5m+Yiz5IkmY2iqe/fgX4beDOJFta7QPAuUlWAgU8ALwHoKruTnI9vQH4vcAFVfUkQJILgRuBBcD6qrp7LjsiSXq6UTz99b+AzLBq4z72uRS4dIb6xn3tJ0maW36jXpLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1BlDRZLUGUNFktQZQ0WS1JlR/EiXDtCyi748snM/cNlbRnZuSePHKxVJUmcMFUlSZ8Y+VJKsSnJvkskkF426PZJ0JBvrUEmyAPgocBZwMnBukpNH2ypJOnKN+0D9acBkVW0DSHItcDZwz0hbdRgZ1UMCPiAgjadxD5XFwIN9y9uBV4+oLeqQT7xJ42ncQ2UgSdYCa9vij5Pce5CHOg74h25aNa8dKf2EGfqaPxlRS4brSPlM7efwvGiQjcY9VHYAJ/YtL2m1p6mqdcC6Qz1Zks1VNXGox5nvjpR+wpHTV/t5eJnP/RzrgXrgNmBFkuVJjgZWAxtG3CZJOmKN9ZVKVe1NciFwI7AAWF9Vd4+4WZJ0xBrrUAGoqo3Axjk63SHfQhsTR0o/4cjpq/08vMzbfqaqRt0GSdJhYtzHVCRJ84ih0uzvdS9JjklyXVv/jSTL+tZd3Or3JjlzLtt9oA62n0mWJXksyZY2fXyu234gBujna5N8M8neJO+Ytm5Nkq1tWjN3rT5wh9jPJ/s+z3n/gMsAff2DJPckuSPJzUle1LfucPpM99XP0X+mVXXET/QG+e8DTgKOBv43cPK0bd4HfLzNrwaua/Mnt+2PAZa34ywYdZ+G0M9lwF2j7kOH/VwGvBy4GnhHX/1YYFv7u7DNLxx1n7ruZ1v341H3oeO+vh54bpv/N33/7h5un+mM/Zwvn6lXKj1Pve6lqp4Apl730u9s4Ko2/zngjCRp9Wur6vGquh+YbMebjw6ln+Nkv/2sqgeq6g7gp9P2PRPYVFWPVNUPgE3Aqrlo9EE4lH6Om0H6+tWqerQt3kLve2tw+H2ms/VzXjBUemZ63cvi2bapqr3AD4EXDrjvfHEo/QRYnuRbSb6W5PRhN/YQHMpncrh9nvvy7CSbk9yS5Jxum9a5A+3recBXDnLfUTqUfsI8+EzH/pFizZmdwNKq2p3kVOCLSU6pqj2jbpgO2ouqakeSk4D/keTOqrpv1I06VEl+C5gAfnXUbRmmWfo58s/UK5WeQV738tQ2SY4CfgHYPeC+88VB97Pd3tsNUFW307vv+5Kht/jgHMpncrh9nrOqqh3t7zbgfwKv6LJxHRuor0neAHwQeFtVPX4g+84Th9LP+fGZjnpQZz5M9K7YttEbaJ8aHDtl2jYX8PQB7Ovb/Ck8faB+G/N3oP5Q+rloql/0BhF3AMeOuk8H28++bT/NMwfq76c3oLuwzR+O/VwIHNPmjwO2Mm1AeD5NA/67+wp6/7OzYlr9sPpM99HPefGZjvwf4nyZgDcDf98+rA+22ofp/Z8AwLOBz9IbiL8VOKlv3w+2/e4Fzhp1X4bRT+BfA3cDW4BvAm8ddV8OsZ+vone/+if0rjjv7tv3d1v/J4F3j7ovw+gn8C+AO9t/tO4Ezht1Xzro603AQ+3f0S3AhsP0M52xn/PlM/Ub9ZKkzjimIknqjKEiSeqMoSJJ6oyhIknqjKEiSeqMoSJJ6oyhIknqjKEiSerM/we8RVUzfE8ypwAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"var_df_sorted['variance'].plot(kind='hist')"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [],
"source": [
"top_10_most_var = var_df_sorted[:10]"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/paprika/workspace/oresenv/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" \"\"\"Entry point for launching an IPython kernel.\n"
]
}
],
"source": [
"top_10_most_var['rev_id'] = top_10_most_var['orig_ix'].apply(lambda x: all_observations[x]['rev_id'])"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>orig_ix</th>\n",
" <th>variance</th>\n",
" <th>rev_id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>8939</th>\n",
" <td>8939</td>\n",
" <td>0.261328</td>\n",
" <td>649606976</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18578</th>\n",
" <td>18578</td>\n",
" <td>0.218557</td>\n",
" <td>661337491</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7515</th>\n",
" <td>7515</td>\n",
" <td>0.199779</td>\n",
" <td>689855312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9957</th>\n",
" <td>9957</td>\n",
" <td>0.197083</td>\n",
" <td>698848134</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14053</th>\n",
" <td>14053</td>\n",
" <td>0.196784</td>\n",
" <td>649750939</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15043</th>\n",
" <td>15043</td>\n",
" <td>0.184964</td>\n",
" <td>653456391</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3386</th>\n",
" <td>3386</td>\n",
" <td>0.182672</td>\n",
" <td>647678189</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2849</th>\n",
" <td>2849</td>\n",
" <td>0.181390</td>\n",
" <td>702009779</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3149</th>\n",
" <td>3149</td>\n",
" <td>0.178592</td>\n",
" <td>662872174</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10019</th>\n",
" <td>10019</td>\n",
" <td>0.176378</td>\n",
" <td>689167287</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" orig_ix variance rev_id\n",
"8939 8939 0.261328 649606976\n",
"18578 18578 0.218557 661337491\n",
"7515 7515 0.199779 689855312\n",
"9957 9957 0.197083 698848134\n",
"14053 14053 0.196784 649750939\n",
"15043 15043 0.184964 653456391\n",
"3386 3386 0.182672 647678189\n",
"2849 2849 0.181390 702009779\n",
"3149 3149 0.178592 662872174\n",
"10019 10019 0.176378 689167287"
]
},
"execution_count": 85,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top_10_most_var"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://en.wikipedia.org/wiki/?diff=649606976\n",
"https://en.wikipedia.org/wiki/?diff=661337491\n",
"https://en.wikipedia.org/wiki/?diff=689855312\n",
"https://en.wikipedia.org/wiki/?diff=698848134\n",
"https://en.wikipedia.org/wiki/?diff=649750939\n",
"https://en.wikipedia.org/wiki/?diff=653456391\n",
"https://en.wikipedia.org/wiki/?diff=647678189\n",
"https://en.wikipedia.org/wiki/?diff=702009779\n",
"https://en.wikipedia.org/wiki/?diff=662872174\n",
"https://en.wikipedia.org/wiki/?diff=689167287\n"
]
}
],
"source": [
"for rev_id in top_10_most_var['rev_id'].values:\n",
" print(f\"https://en.wikipedia.org/wiki/?diff={rev_id}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Key \n",
"\n",
"- PFP Potentially false positive\n",
"- FP False Postive (Bad label)\n",
"- TD True Damaging\n",
"- MX mixed damaging and goog faith editing\n",
"\n",
"## Edits were (in order as above):\n",
"\n",
"\n",
"- FP \"(PLEASE DO NOT EDIT) If you would like to make suggestions contact us on the Wiki pages talk section\"\n",
"\n",
"- TD \"Brony's are gay little Twots.\"\n",
"\n",
"- PFP Insertion of word \"video games\"\n",
"\n",
"- TD Citation blanking - I am happy.\n",
"\n",
"- MX Some good grammar edits **&** \"Dog Crap\"\n",
"\n",
"- PFP removal of a lot of unrferenced text.\n",
"\n",
"- MX some wikignoming **and** vanity\n",
"\n",
"- FP nonsense deletion\n",
"\n",
"- PFP, contains the word \"vandalism\" and is an honest commment, but deletes the asker's text\n",
"\n",
"- PFP, deletes a page with English text and rewrites it in Spanish.\n",
"\n",
"## Final counts\n",
"\n",
"- 2 Bad labels (false positive labels)\n",
"- 4 Potentially bad labels\n",
"- 2 Mixed labels that contain both good and bad edits\n",
"- 2 Truly damaging"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
@FajneFarita

what about taking the 10 least variant observations and repeating the manual inspection?
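
A minimal sketch of how that check could be set up, using only the standard library. The probabilities below are made up for illustration, and the names `predictions`, `by_obs`, `variances` and `ranked` are hypothetical rather than taken from the notebook. It ranks observations by the sample variance of their repeated predictions, then takes both ends of the ranking so the same manual inspection can be run on each.

```python
from collections import defaultdict
from statistics import variance

# Hypothetical out-of-fold predictions as (orig_ix, predicted probability),
# three per observation to mirror the 3 repeats used in the notebook.
predictions = [
    (0, 0.10), (0, 0.11), (0, 0.09),
    (1, 0.20), (1, 0.80), (1, 0.50),
    (2, 0.95), (2, 0.95), (2, 0.95),
    (3, 0.40), (3, 0.45), (3, 0.35),
]

# Group the predicted probabilities by observation index.
by_obs = defaultdict(list)
for orig_ix, proba in predictions:
    by_obs[orig_ix].append(proba)

# Sample variance (n - 1 in the denominator) for each observation.
variances = {ix: variance(probas) for ix, probas in by_obs.items()}

# Observation indices ordered from least to most variant.
ranked = sorted(variances, key=variances.get)

n = 2  # the comment suggests 10; 2 keeps the toy data readable
least_variant = ranked[:n]
most_variant = ranked[-n:][::-1]

print("least variant:", least_variant)
print("most variant:", most_variant)
```

In the notebook's own terms this would presumably amount to something like `var_df.nsmallest(10, 'variance')` in place of the top-10 slice, with the rev_id lookup and diff URLs left unchanged. The low-variance group would act as a rough control: if it contains clearly fewer bad labels than the high-variance group, that supports variance as a useful signal.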
