chtnnh chtnnh

"""
Process a collection of XML dumps looking for the introduction and removal of {{Beginnetje}} templates
and assume the introduction represents a quality label ("E") and the removal represents the quality
label "D". Note: This script does not yet handle reverts (e.g. vandalism). To do that, look into
the mwreverts library.
USAGE:
nlwiki_template_extractor (-h|--help)
nlwiki_template_extractor <xml-dump>...
[--namespace=<num>...] [--processes=<num>]
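The labeling rule described in the docstring above (template introduced means "E", template removed means "D") can be sketched without the dump-reading machinery. This is a minimal illustration, not the gist's actual implementation: the names `has_template` and `label_events`, the regular expression, and the sample history are all assumptions, and, like the original script, it ignores reverts.

```python
import re

# Hypothetical pattern: matches {{Beginnetje}} or {{Beginnetje|...}}, case-insensitively.
TEMPLATE_RE = re.compile(r"\{\{\s*beginnetje\s*(\||\}\})", re.IGNORECASE)


def has_template(text):
    """Return True if the wikitext contains a {{Beginnetje}} template."""
    return bool(TEMPLATE_RE.search(text or ""))


def label_events(revisions):
    """Yield (rev_id, label) for each revision that adds or removes the template.

    `revisions` is an iterable of (rev_id, text) pairs in chronological order.
    """
    previously_present = False
    for rev_id, text in revisions:
        present = has_template(text)
        if present and not previously_present:
            yield rev_id, "E"  # template introduced
        elif previously_present and not present:
            yield rev_id, "D"  # template removed
        previously_present = present


# Made-up revision history for one page (illustrative data only).
history = [
    (1, "Short article."),
    (2, "Short article. {{Beginnetje|biologie}}"),
    (3, "Longer article. {{beginnetje}}"),
    (4, "Much longer article with sources."),
]
events = list(label_events(history))
print(events)
```

In a real run the `(rev_id, text)` pairs would come from iterating over the pages and revisions of an XML dump rather than from a hard-coded list.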
@chtnnh
chtnnh / revertslabel
Created October 28, 2020 05:24 — forked from codez266/revertslabel
Reverts labeling from dump - involves loading revids from db and storing back, but that part is trivial
import mwreverts
from models import RevRevert, Page, Revision
import mwxml
import pdb
from collections import deque
from mwapilib import get_revs_for_revert_labeling
import sys
# This script processes edits from the dump to detect reverts and stores
# the revert status in a revert table. Edits for the pages from the page table
@chtnnh
chtnnh / cmd.bash
Created May 4, 2020 15:06 — forked from halfak/cmd.bash
Sample of labels and words_to_watch
$ bzcat datasets/ptwiki.draft_quality.balanced_3k.with_text.json.bz2 | \
shuf -n 100 | python demo_ptwiki_w2w.py | sort -k1,1
$ python
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring import Model
>>> model = Model.load(open("models/ptwiki.wp10.gradient_boosting.model"))
>>> importance_features = list(sorted(zip(model.estimator.feature_importances_, model.features), reverse=True))
>>> for importance, feature in importance_features:
... print(round(importance, 3), feature)
...
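One caveat with the ranking in the session above: sorting `(importance, feature)` tuples directly falls back to comparing the second element when two importances tie, which raises a `TypeError` if the feature objects do not define an ordering. A small sketch that sorts on the importance alone, using made-up stand-ins for `model.estimator.feature_importances_` and `model.features`, might look like this:

```python
# Hypothetical stand-ins for model.estimator.feature_importances_ and model.features.
importances = [0.10, 0.45, 0.10, 0.35]
features = ["words", "refs", "templates", "headings"]

# Sort on the importance only, so tied importances never compare the feature objects.
ranked = sorted(zip(importances, features), key=lambda pair: pair[0], reverse=True)

for importance, feature in ranked:
    print(round(importance, 3), feature)
```

Because Python's sort is stable even with `reverse=True`, features with equal importance keep their original relative order.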
@chtnnh
chtnnh / example.py
Last active March 12, 2020 14:03 — forked from halfak/example.py
import mwparserfromhell
example = """
{{foo bar baz}}
{{I am a random template}}
{{Marca de projeto|3|Biografias|4|Políticos|4|Brasil|3|WP Offline|2|bot=4/20111127|rev=20170714}}"""
templates = list(mwparserfromhell.parse(example).filter_templates())
def from_template(template):
import time
import textstat
import mwapi
from revscoring.dependencies import solve
from revscoring.datasources.meta import filters
from revscoring.features import wikitext
from articlequality.feature_lists.enwiki import text_complexity
session = mwapi.Session("https://en.wikipedia.org")