@halfak
Last active June 2, 2020 19:45
>>> from deltas import segment_matcher
>>> from deltas.tokenizers import wikitext_split
>>>
>>> text = """
... I am some Wikipedia content.
...
... This is a {{template}}.<ref> foo</ref>
... """
>>>
>>> wikitext_split.tokenize(text)
[Token('\n', type='whitespace'), Token('I', type='word'), Token(' ', type='whitespace'), Token('am', type='word'), Token(' ', type='whitespace'), Token('some', type='word'), Token(' ', type='whitespace'), Token('Wikipedia', type='word'), Token(' ', type='whitespace'), Token('content', type='word'), Token('.', type='period'), Token('\n\n', type='break'), Token('This', type='word'), Token(' ', type='whitespace'), Token('is', type='word'), Token(' ', type='whitespace'), Token('a', type='word'), Token(' ', type='whitespace'), Token('{{', type='dcurly_open'), Token('template', type='word'), Token('}}', type='dcurly_close'), Token('.', type='period'), Token('<ref>', type='ref_open'), Token(' ', type='whitespace'), Token('foo', type='word'), Token('</ref>', type='ref_close'), Token('\n', type='whitespace')]
>>> text2 = """
... I am not some Wikipedia content.
...
... Or maybe I am.
...
... This is a {{template}}.<ref> foo</ref>
... """
>>> list(segment_matcher.diff(wikitext_split.tokenize(text), wikitext_split.tokenize(text2)))
[Equal(name='equal', a1=0, a2=4, b1=0, b2=4), Insert(name='insert', a1=4, a2=4, b1=4, b2=6), Equal(name='equal', a1=4, a2=12, b1=6, b2=14), Insert(name='insert', a1=12, a2=12, b1=14, b2=24), Equal(name='equal', a1=12, a2=27, b1=24, b2=39)]
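Each operation in the diff output refers to a half-open token range: `a1:a2` indexes the first token list and `b1:b2` indexes the second. The sketch below shows one way to read those ranges by rebuilding the second token list from the first. It is an illustration only: the `Equal`, `Insert`, and `Delete` namedtuples are stand-ins that mirror the fields printed above, the `apply_operations` helper is hypothetical, and plain strings are used in place of `Token` objects so the example runs without `deltas` installed.

```python
from collections import namedtuple

# Stand-in operation types mirroring the fields shown in the diff output.
# These are assumptions for illustration, not the classes from deltas.
Equal = namedtuple("Equal", ["name", "a1", "a2", "b1", "b2"])
Insert = namedtuple("Insert", ["name", "a1", "a2", "b1", "b2"])
Delete = namedtuple("Delete", ["name", "a1", "a2", "b1", "b2"])


def apply_operations(operations, a_tokens, b_tokens):
    """Rebuild the second token list by walking the operations in order."""
    result = []
    for op in operations:
        if op.name == "equal":
            # Unchanged tokens are copied from the first sequence.
            result.extend(a_tokens[op.a1:op.a2])
        elif op.name == "insert":
            # New tokens exist only in the second sequence.
            result.extend(b_tokens[op.b1:op.b2])
        elif op.name == "delete":
            # Removed tokens contribute nothing to the result.
            continue
        else:
            raise ValueError("unknown operation: {!r}".format(op.name))
    return result


# A small hand-built example using plain strings as tokens.
a = ["I", " ", "am", " ", "here"]
b = ["I", " ", "am", " ", "not", " ", "here"]
ops = [
    Equal("equal", 0, 4, 0, 4),    # "I am " is shared
    Insert("insert", 4, 4, 4, 6),  # "not " appears only in b
    Equal("equal", 4, 5, 6, 7),    # "here" is shared
]

rebuilt = apply_operations(ops, a, b)
print("".join(rebuilt))
```

An insert has an empty range on the `a` side (`a1 == a2`), which is why `Insert(name='insert', a1=4, a2=4, ...)` in the output above marks a position in the first text rather than a span of it.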