@halfak
Created June 1, 2020 16:11
Tokenizer stuck on a Japanese revision
import mwapi
from deltas.tokenizers import wikitext_split

rev_id = 57246316

# Fetch the revision's wikitext from the Japanese Wikipedia API.
session = mwapi.Session("https://ja.wikipedia.org")
doc = session.get(action="query", prop="revisions", revids=[rev_id],
                  rvslots="main", rvprop="content", formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']

# Tokenize and print each token with its running character offset,
# so the output shows how far the tokenizer gets before it stalls.
location = 0
for token in wikitext_split.tokenize(text):
    location += len(token)
    print(token, location)
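
Nothing below is part of the original gist; it's a minimal debugging sketch for narrowing down where the tokenizer stalls. It reuses `text` and `wikitext_split` from above, runs each attempt in a killable worker process (an in-process timeout such as signal.alarm may not interrupt a hang inside the regex engine), and bisects for the shortest prefix that still hangs. It assumes the full text does hang, as reported, and a fork-based multiprocessing start method (the Linux default).

import multiprocessing

def _consume(text):
    # Drain the token stream; this only returns if tokenization finishes.
    list(wikitext_split.tokenize(text))

def tokenizes_within(text, seconds=5):
    """Return True if wikitext_split finishes on `text` within `seconds`."""
    proc = multiprocessing.Process(target=_consume, args=(text,))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()  # kill the stuck worker
        proc.join()
        return False
    return True

# Bisect for the shortest prefix of `text` that still hangs.
lo, hi = 0, len(text)
while lo < hi:
    mid = (lo + hi) // 2
    if tokenizes_within(text[:mid]):
        lo = mid + 1  # this prefix is fine; the problem lies further right
    else:
        hi = mid      # this prefix already hangs; try a shorter one
print("Tokenizer first stalls within the first", lo, "characters")

Killing a subprocess is the robust choice here: it works even if the stall is inside a single long-running C-level regex match, where a Python-level signal handler would never get a chance to run.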