"""
Process a collection of XML dumps looking for the introduction and removal of {{Beginnetje}} templates
and assume the introduction represents a quality label ("E") and the removal represents the quality
label "D". Note: This script does not yet handle reverts (e.g. vandalism). To do that, look into
the mwreverts library.
USAGE:
    nlwiki_template_extractor (-h|--help)
    nlwiki_template_extractor <xml-dump>...
                              [--namespace=<num>...] [--processes=<num>]
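The labeling rule described in the docstring above can be sketched in a few lines. This is a minimal illustration, not the script's actual code: the regular expression and the `label_template_changes` helper are hypothetical, and it assumes the revisions of one page arrive as `(rev_id, text)` pairs in chronological order. As the docstring notes, reverts are not handled.

```python
import re

# Hypothetical pattern: matches {{Beginnetje}} with optional parameters,
# ignoring case and surrounding whitespace.
BEGINNETJE_RE = re.compile(r"\{\{\s*beginnetje\s*(\|[^{}]*)?\}\}", re.IGNORECASE)


def label_template_changes(revisions):
    """Yield (rev_id, label) whenever the template appears or disappears.

    `revisions` is an iterable of (rev_id, text) pairs, oldest first.
    """
    had_template = False
    for rev_id, text in revisions:
        has_template = BEGINNETJE_RE.search(text or "") is not None
        if has_template and not had_template:
            yield rev_id, "E"  # template introduced
        elif had_template and not has_template:
            yield rev_id, "D"  # template removed
        had_template = has_template
```

A revision that keeps the template in place produces no label, so only the edits that add or drop it are emitted.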
import mwreverts
from models import RevRevert, Page, Revision
import mwxml
import pdb
from collections import deque
from mwapilib import get_revs_for_revert_labeling
import sys
# This script processes edits from the dump to detect reverts and stores
# the revert status in a revert table. Edits for the pages from the page table
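For readers unfamiliar with revert detection, the idea can be sketched with the standard library alone. This is a minimal illustration of identity-revert detection (a revision whose text matches a recent earlier revision), not the script's code and not the mwreverts implementation; the `detect_identity_reverts` name, the `(rev_id, text)` input shape, and the default `radius` are assumptions.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, radius=15):
    """Yield (reverting_id, reverted_ids, reverted_to_id) tuples.

    `revisions` is an iterable of (rev_id, text) pairs for one page, oldest
    first. `radius` is the largest number of revisions one revert may undo.
    """
    # Keep the reverted-to revision plus up to `radius` revisions after it.
    recent = deque(maxlen=radius + 1)
    for rev_id, text in revisions:
        checksum = hashlib.sha1((text or "").encode("utf-8")).hexdigest()
        history = list(recent)
        # Look for the most recent earlier revision with identical content.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == checksum:
                reverted = [r for r, _ in history[i + 1:]]
                if reverted:  # an immediate duplicate undoes nothing
                    yield rev_id, reverted, history[i][0]
                break
        recent.append((rev_id, checksum))
```

Each yielded tuple could then be stored as one row of revert status per reverted revision.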
$ bzcat datasets/ptwiki.draft_quality.balanced_3k.with_text.json.bz2 | \
    shuf -n 100 | python demo_ptwiki_w2w.py | sort -k1,1
$ python
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from revscoring import Model
>>> model = Model.load(open("models/ptwiki.wp10.gradient_boosting.model"))
>>> importance_features = list(sorted(zip(model.estimator.feature_importances_, model.features), reverse=True))
>>> for importance, feature in importance_features:
...     print(round(importance, 3), feature)
...
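The ranking step in the session above can be tried without a trained model. The sketch below is an illustration only: `rank_importances` is a hypothetical helper, and the importance values and feature names are made up, not output from the ptwiki model. It sorts on the importance alone, so tied values never fall back to comparing the feature objects themselves.

```python
from operator import itemgetter


def rank_importances(importances, names):
    """Return (importance, name) pairs, highest importance first."""
    pairs = zip(importances, names)
    # Sort on the importance only; ties keep their original relative order.
    return sorted(pairs, key=itemgetter(0), reverse=True)


# Made-up values for illustration; these are not real model output.
ranked = rank_importances(
    [0.12, 0.5, 0.38], ["feature_a", "feature_b", "feature_c"])
for importance, name in ranked:
    print(round(importance, 3), name)
```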
import mwparserfromhell
example = """
{{foo bar baz}}
{{I am a random template}}
{{Marca de projeto|3|Biografias|4|Políticos|4|Brasil|3|WP Offline|2|bot=4/20111127|rev=20170714}}"""
# Parse the example wikitext and collect every template it contains.
templates = list(mwparserfromhell.parse(example).filter_templates())
def from_template(template):
import time
import textstat
import mwapi
from revscoring.dependencies import solve
from revscoring.datasources.meta import filters
from revscoring.features import wikitext
from articlequality.feature_lists.enwiki import text_complexity
session = mwapi.Session("https://en.wikipedia.org")