@halfak
Created June 29, 2020 14:13
import time

import mwapi
from deltas.tokenizers import wikitext_split

# Alternative small test input, left disabled.
'''text = """
This is a sentence [[derp|link]].
Here is another paragraph with the number 10.
"""'''

# Fetch the current wikitext of the "Alan Turing" article from English Wikipedia.
session = mwapi.Session("https://en.wikipedia.org")
doc = session.get(action="query", prop="revisions",
                  titles="Alan Turing", rvprop="content", rvslots="main",
                  formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']

# Tokenize the article 100 times and report the average throughput.
start = time.time()
for i in range(100):
    list(wikitext_split.tokenize(text))
print("We can process", 1/((time.time() - start)/100), "Alan Turings per second")
halfak commented Jun 29, 2020

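Input text:

This is some text 古池や蛙飛び込む水の音.

Tokens with the CJK run kept together as a single token:

["This", " ", "is", " ", "some", " ", "text", " ", "古池や蛙飛び込む水の音", "."]

Tokens with the CJK run split into smaller chunks:

["This", " ", "is", " ", "some", " ", "text", " ", "古池や", "蛙飛び", "込む水", "の音", "."]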

halfak commented Jun 29, 2020

I think we can block CJK together by putting a "+" at the end of the CJK definition here: https://github.com/halfak/deltas/blob/master/deltas/tokenizers/wikitext_split.py#L45
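To illustrate what the "+" would change, here is a minimal sketch using only the standard `re` module. The `CJK` character class and the `tokenize` helper below are assumptions made for this example; they are not the pattern or the lexicon that `wikitext_split` actually uses. Without the quantifier, each CJK character becomes its own token; with it, a contiguous CJK run is emitted as one token.

```python
import re

# Hypothetical CJK character class for illustration only (hiragana, katakana,
# and CJK unified ideographs). This is NOT the definition used in deltas.
CJK = r"[\u3040-\u30ff\u4e00-\u9fff]"


def tokenize(text, cjk_pattern):
    """Split text with a tiny regex lexicon; alternatives are tried in order."""
    # CJK comes first so that it wins over \w+, which also matches CJK
    # characters in Python 3. The final "." catches any other character.
    lexicon = re.compile(rf"{cjk_pattern}|\s+|\w+|.", re.DOTALL)
    return lexicon.findall(text)


text = "This is some text 古池や蛙飛び込む水の音."

per_char = tokenize(text, CJK)       # one token per CJK character
blocked = tokenize(text, CJK + "+")  # one token per contiguous CJK run

print(per_char)
print(blocked)
```

In this sketch the only difference between the two calls is the trailing "+", so the token count for the sample sentence drops because the eleven CJK characters collapse into a single token.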

halfak commented Jun 29, 2020

This just demos tokenization

import time

import mwapi
from deltas.tokenizers import wikitext_split

session = mwapi.Session("https://en.wikipedia.org")
doc = session.get(action="query", prop="revisions",
                  titles="Aaron Halfaker", rvprop="content", rvslots="main",
                  formatversion=2)
text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']

for token in wikitext_split.tokenize(text):
    print(repr(token))
