@kylepjohnson
Last active August 27, 2019 03:21
MWE to illustrate proposed new CLTK data types and their use in an "NLP object"
"""An example of a proposed NLP pipeline system. Goals are to allow for:
1. default NLP pipeline for any given language
2. users to override default pipeline
3. users to choose alternative code (classes/methods/functions) w/in the CLTK
4. users to use their own custom code (inheriting or replacing those w/in CLTK)
5. flexibility for the I/O for custom code
6. up-front checking whether I/O is possible given available processed text (e.g., a fn might depend on token str,
which must be created first)
7. specify the order in which NLP algos are run
In the following, I propose these new data types:
- ``Language``: Simple, just a place to hold attributes about a language. Can be referenced within
``Operation`` or ``Pipeline`` (e.g., ``LatinPipeline.language == LatinLanguage == True``).
- ``Operation``: One for each type of NLP algo we cover (e.g., tokenization, sentence splitting, POS tagging,
dependency, phonetics, prosody, etc.). Each of these is then subclassed for each language (e.g.,
``TokenizationOperation`` <- ``LatinTokenizationOperation``). Here is defined the code to be used for a given
operation, plus a bit more documentation about it (I/O, name, description).
- ``Word``: This holds basic information for each token (start/end character indices, sentence index occurring
within, raw string) and more advanced info if available (e.g., NER, POS tag, dependency relations).
- ``Pipeline``: One for each language (e.g., ``Pipeline`` <- ``LatinPipeline``). A field in this is ``algo``,
which has as its value a given ``Operation`` (e.g., ``LatinPipeline.algo == LatinTokenizationOperation == True``).
- ``Doc``: Returned by the ``NLP`` class (more specifically, by ``NLP().run_pipeline()``). Similar to what spaCy
returns, only more transparent (IMHO). The field ``Doc.tokens`` will be a list of ``Word`` objects (``List[Word]``).
Notes:
- At the end of the module, see a dummy example of the ``cltk.NLP`` class and a use example (in ``"__main__"``),
plus output.
- Requires Python 3.7
"""
import re
from dataclasses import dataclass
from typing import Any, Callable, List
# #####################################################################################
# #######################START LANGUAGE TYPE###########################################
@dataclass
class Language:
    name: str
    glottocode: str
    latitude: float
    longitude: float
    dates: List[float]
    family_id: str
    parent_id: str
    level: str
    iso639P3code: str
    type: str
LATIN = Language(
    name="Latin",
    glottocode="lati1261",
    latitude=60.2,
    longitude=50.5,
    dates=[200, 400],
    family_id="indo1319",
    parent_id="impe1234",
    level="language",
    iso639P3code="lat",
    type="a",
)
# #########################END LANGUAGE TYPE###########################################
# #####################################################################################
# #####################################################################################
# #######################START OPERATION TYPE##########################################
def dummy_get_token_indices(text: str) -> List[List[int]]:
    """Get the start/stop char indices of word boundaries.

    >>> john_damascus_corinth = "Τοῦτο εἰπὼν, ᾐνίξατο αἰτίους ὄντας"
    >>> indices_words = dummy_get_token_indices(text=john_damascus_corinth)
    >>> indices_words[0:3]
    [[0, 5], [6, 11], [13, 20]]
    """
    indices_words = list()
    pattern_word = re.compile(r"\w+")
    for word_match in pattern_word.finditer(string=text):
        idx_word_start, idx_word_stop = word_match.span()
        indices_words.append([idx_word_start, idx_word_stop])
    return indices_words
@dataclass
class Operation:
    """For each type of NLP operation there needs to be a definition.

    It includes the type of data it expects (``str``, ``List[str]``,
    ``Word``, etc.) and which field within ``Word`` it will populate.
    This base class is intended to be inherited by NLP operation
    types (e.g., ``TokenizationOperation`` or ``DependencyOperation``).
    """
    name: str
    description: str
    input: Any
    output: Any
    algorithm: Callable
    type: str
@dataclass
class TokenizationOperation(Operation):
    """To be inherited for each language's tokenization declaration.

    Example: ``TokenizationOperation`` <- ``LatinTokenizationOperation``
    """
    type = "tokenization"
@dataclass
class LatinTokenizationOperation(TokenizationOperation):
    """The default (or one of many) Latin tokenization algorithm."""
    name = "CLTK Dummy Latin Tokenizer"
    description = "This is a simple regex which divides on word spaces (``r'\w+'``) for illustrative purposes."
    input = str
    output = List[List[int]]  # e.g., [[0, 4], [6, 11], ...]
    algorithm = dummy_get_token_indices
    language = LATIN
# #######################END OPERATION TYPE############################################
# #####################################################################################
# #####################################################################################
# #######################START WORD TYPE###############################################
@dataclass
class Word:
    """Contains the attributes of each processed word in a list of tokens.
    Used most often within ``Doc.tokens``.
    """
    index_char_start: int = None
    index_char_stop: int = None
    index_token: int = None
    index_sentence: int = None
    string: str = None
    pos: str = None
    scansion: str = None
# #####################################################################################
# #######################END WORD TYPE#################################################
# #####################################################################################
# #######################START PIPELINE TYPE###########################################
@dataclass
class Pipeline:
    sentence_splitter: Callable[[str], List[List[int]]] = None
    word_tokenizer: Callable[[str], List[Word]] = None
    dependency: str = None
    pos: str = None
    scansion: Callable[[str], str] = None
@dataclass
class LatinPipeline(Pipeline):
    # sentence_splitter = LatinSplitter().dummy_get_indices
    word_tokenizer = LatinTokenizationOperation
    language = LATIN
# #######################END PIPELINE TYPE#############################################
# #####################################################################################
# #####################################################################################
# #######################START DOC TYPE################################################
@dataclass
class Doc:
    """The object returned to the user by the ``NLP`` class. Contains overall attributes of the submitted text,
    plus, most importantly, the processed tokenized text ``tokens``, a list of ``Word`` objects.
    """
    indices_sentences: List[List[int]] = None
    indices_tokens: List[List[int]] = None
    language: str = None
    tokens: List[Word] = None
    pipeline: Any = None
    raw: str = None
    ner: List[List[int]] = None
# #######################END DOC TYPE##################################################
# #####################################################################################
# #####################################################################################
# #######################START NLP CLASS###############################################
class NLP:
    """A dummy example of the primary entry-point class."""

    def __init__(self, language: str) -> None:
        self.language = language
        if self.language == "latin":
            self.pipeline = LatinPipeline
        else:
            raise NotImplementedError(
                f"Pipeline not available for language '{self.language}'."
            )

    def run_pipeline(self, text: str) -> Doc:
        """Take a raw, unprocessed text string, then return a ``Doc`` object
        containing all available processed information.
        """
        # Get the start/stop character indices of each token
        token_indices = self.pipeline.word_tokenizer.algorithm(text=text)
        # Populate a ``Word`` object for each token in the submitted text
        all_word_tokens = list()
        for token_count, token_index in enumerate(token_indices):
            token_start, token_end = token_index
            token_str = text[token_start:token_end]
            word = Word(
                index_char_start=token_start,
                index_char_stop=token_end,
                index_token=token_count,
                string=token_str,
            )
            all_word_tokens.append(word)
        doc = Doc(
            indices_tokens=token_indices,
            language=self.language,
            pipeline=self.pipeline,
            tokens=all_word_tokens,
            raw=text,
        )
        return doc
# #######################END NLP CLASS#################################################
# #####################################################################################
if __name__ == "__main__":
    tacitus_germanica = (
        "Germania omnis a Gallis Raetisque et Pannoniis Rheno et Danuvio fluminibus, a Sarmatis "
        "Dacisque mutuo metu aut montibus separatur: cetera Oceanus ambit, latos sinus et insularum "
        "inmensa spatia complectens, nuper cognitis quibusdam gentibus ac regibus, quos bellum "
        "aperuit. Rhenus, Raeticarum Alpium inaccesso ac praecipiti vertice ortus, modico flexu in "
        "occidentem versus septentrionali Oceano miscetur. Danuvius molli et clementer edito montis "
        "Abnobae iugo effusus pluris populos adit, donec in Ponticum mare sex meatibus erumpat: "
        "septimum os paludibus hauritur. "
    )
    cltk_nlp = NLP(language="latin")
    doc_germanica = cltk_nlp.run_pipeline(tacitus_germanica)
    print("")
    print("``Doc``:", doc_germanica)
    print("")
    print("``Doc.pipeline``:", doc_germanica.pipeline)
    print("")
    print(
        "``Doc.pipeline.word_tokenizer.name``:",
        doc_germanica.pipeline.word_tokenizer.name,
    )
    print("")
    print(
        "``Doc.pipeline.word_tokenizer.description``:",
        doc_germanica.pipeline.word_tokenizer.description,
    )
    print("")
    print("``Doc.tokens[:10]``:", doc_germanica.tokens[:10])
    print("")
    print("``Doc.indices_tokens[:10]``:", doc_germanica.indices_tokens[:10])
    print("")
kylepjohnson commented Aug 27, 2019

Output from the script. Unfortunately these Gist comments do not wrap; however, everything is in the first print of the ``Doc``:
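As a quick sanity check on the spans in the output below, the same ``\w+`` regex the dummy tokenizer uses can reproduce them in isolation. This is a standalone sketch (the names here are illustrative, not part of the gist's module):

```python
import re

def word_indices(text: str) -> list:
    """Return [start, stop) character spans for each ``\\w+`` match."""
    return [list(match.span()) for match in re.finditer(r"\w+", text)]

raw = "Germania omnis a Gallis Raetisque"
spans = word_indices(raw)
print(spans[:3])                         # [[0, 8], [9, 14], [15, 16]]
print([raw[a:b] for a, b in spans[:3]])  # ['Germania', 'omnis', 'a']
```

Slicing the raw text with each span recovers the token strings, which is exactly how ``run_pipeline`` fills ``Word.string``.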

$ python --version
Python 3.7.4
$ poetry run python src/cltkv1/utils/pipeline_example.py 

``Doc``: Doc(indices_sentences=None, indices_tokens=[[0, 8], [9, 14], [15, 16], [17, 23], [24, 33], [34, 36], [37, 46], [47, 52], [53, 55], [56, 63], [64, 74], [76, 77], [78, 86], [87, 95], [96, 101], [102, 106], [107, 110], [111, 119], [120, 129], [131, 137], [138, 145], [146, 151], [153, 158], [159, 164], [165, 167], [168, 177], [178, 185], [186, 192], [193, 204], [206, 211], [212, 220], [221, 230], [231, 239], [240, 242], [243, 250], [252, 256], [257, 263], [264, 271], [273, 279], [281, 291], [292, 298], [299, 308], [309, 311], [312, 322], [323, 330], [331, 336], [338, 344], [345, 350], [351, 353], [354, 364], [365, 371], [372, 386], [387, 393], [394, 402], [404, 412], [413, 418], [419, 421], [422, 431], [432, 437], [438, 444], [445, 452], [453, 457], [458, 465], [466, 472], [473, 480], [481, 485], [487, 492], [493, 495], [496, 504], [505, 509], [510, 513], [514, 522], [523, 530], [532, 540], [541, 543], [544, 553], [554, 562]], language='latin', tokens=[Word(index_char_start=0, index_char_stop=8, index_token=0, index_sentence=None, string='Germania', pos=None, scansion=None), Word(index_char_start=9, index_char_stop=14, index_token=1, index_sentence=None, string='omnis', pos=None, scansion=None), Word(index_char_start=15, index_char_stop=16, index_token=2, index_sentence=None, string='a', pos=None, scansion=None), Word(index_char_start=17, index_char_stop=23, index_token=3, index_sentence=None, string='Gallis', pos=None, scansion=None), Word(index_char_start=24, index_char_stop=33, index_token=4, index_sentence=None, string='Raetisque', pos=None, scansion=None), Word(index_char_start=34, index_char_stop=36, index_token=5, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=37, index_char_stop=46, index_token=6, index_sentence=None, string='Pannoniis', pos=None, scansion=None), Word(index_char_start=47, index_char_stop=52, index_token=7, index_sentence=None, string='Rheno', pos=None, scansion=None), Word(index_char_start=53, 
index_char_stop=55, index_token=8, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=56, index_char_stop=63, index_token=9, index_sentence=None, string='Danuvio', pos=None, scansion=None), Word(index_char_start=64, index_char_stop=74, index_token=10, index_sentence=None, string='fluminibus', pos=None, scansion=None), Word(index_char_start=76, index_char_stop=77, index_token=11, index_sentence=None, string='a', pos=None, scansion=None), Word(index_char_start=78, index_char_stop=86, index_token=12, index_sentence=None, string='Sarmatis', pos=None, scansion=None), Word(index_char_start=87, index_char_stop=95, index_token=13, index_sentence=None, string='Dacisque', pos=None, scansion=None), Word(index_char_start=96, index_char_stop=101, index_token=14, index_sentence=None, string='mutuo', pos=None, scansion=None), Word(index_char_start=102, index_char_stop=106, index_token=15, index_sentence=None, string='metu', pos=None, scansion=None), Word(index_char_start=107, index_char_stop=110, index_token=16, index_sentence=None, string='aut', pos=None, scansion=None), Word(index_char_start=111, index_char_stop=119, index_token=17, index_sentence=None, string='montibus', pos=None, scansion=None), Word(index_char_start=120, index_char_stop=129, index_token=18, index_sentence=None, string='separatur', pos=None, scansion=None), Word(index_char_start=131, index_char_stop=137, index_token=19, index_sentence=None, string='cetera', pos=None, scansion=None), Word(index_char_start=138, index_char_stop=145, index_token=20, index_sentence=None, string='Oceanus', pos=None, scansion=None), Word(index_char_start=146, index_char_stop=151, index_token=21, index_sentence=None, string='ambit', pos=None, scansion=None), Word(index_char_start=153, index_char_stop=158, index_token=22, index_sentence=None, string='latos', pos=None, scansion=None), Word(index_char_start=159, index_char_stop=164, index_token=23, index_sentence=None, string='sinus', pos=None, 
scansion=None), Word(index_char_start=165, index_char_stop=167, index_token=24, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=168, index_char_stop=177, index_token=25, index_sentence=None, string='insularum', pos=None, scansion=None), Word(index_char_start=178, index_char_stop=185, index_token=26, index_sentence=None, string='inmensa', pos=None, scansion=None), Word(index_char_start=186, index_char_stop=192, index_token=27, index_sentence=None, string='spatia', pos=None, scansion=None), Word(index_char_start=193, index_char_stop=204, index_token=28, index_sentence=None, string='complectens', pos=None, scansion=None), Word(index_char_start=206, index_char_stop=211, index_token=29, index_sentence=None, string='nuper', pos=None, scansion=None), Word(index_char_start=212, index_char_stop=220, index_token=30, index_sentence=None, string='cognitis', pos=None, scansion=None), Word(index_char_start=221, index_char_stop=230, index_token=31, index_sentence=None, string='quibusdam', pos=None, scansion=None), Word(index_char_start=231, index_char_stop=239, index_token=32, index_sentence=None, string='gentibus', pos=None, scansion=None), Word(index_char_start=240, index_char_stop=242, index_token=33, index_sentence=None, string='ac', pos=None, scansion=None), Word(index_char_start=243, index_char_stop=250, index_token=34, index_sentence=None, string='regibus', pos=None, scansion=None), Word(index_char_start=252, index_char_stop=256, index_token=35, index_sentence=None, string='quos', pos=None, scansion=None), Word(index_char_start=257, index_char_stop=263, index_token=36, index_sentence=None, string='bellum', pos=None, scansion=None), Word(index_char_start=264, index_char_stop=271, index_token=37, index_sentence=None, string='aperuit', pos=None, scansion=None), Word(index_char_start=273, index_char_stop=279, index_token=38, index_sentence=None, string='Rhenus', pos=None, scansion=None), Word(index_char_start=281, index_char_stop=291, 
index_token=39, index_sentence=None, string='Raeticarum', pos=None, scansion=None), Word(index_char_start=292, index_char_stop=298, index_token=40, index_sentence=None, string='Alpium', pos=None, scansion=None), Word(index_char_start=299, index_char_stop=308, index_token=41, index_sentence=None, string='inaccesso', pos=None, scansion=None), Word(index_char_start=309, index_char_stop=311, index_token=42, index_sentence=None, string='ac', pos=None, scansion=None), Word(index_char_start=312, index_char_stop=322, index_token=43, index_sentence=None, string='praecipiti', pos=None, scansion=None), Word(index_char_start=323, index_char_stop=330, index_token=44, index_sentence=None, string='vertice', pos=None, scansion=None), Word(index_char_start=331, index_char_stop=336, index_token=45, index_sentence=None, string='ortus', pos=None, scansion=None), Word(index_char_start=338, index_char_stop=344, index_token=46, index_sentence=None, string='modico', pos=None, scansion=None), Word(index_char_start=345, index_char_stop=350, index_token=47, index_sentence=None, string='flexu', pos=None, scansion=None), Word(index_char_start=351, index_char_stop=353, index_token=48, index_sentence=None, string='in', pos=None, scansion=None), Word(index_char_start=354, index_char_stop=364, index_token=49, index_sentence=None, string='occidentem', pos=None, scansion=None), Word(index_char_start=365, index_char_stop=371, index_token=50, index_sentence=None, string='versus', pos=None, scansion=None), Word(index_char_start=372, index_char_stop=386, index_token=51, index_sentence=None, string='septentrionali', pos=None, scansion=None), Word(index_char_start=387, index_char_stop=393, index_token=52, index_sentence=None, string='Oceano', pos=None, scansion=None), Word(index_char_start=394, index_char_stop=402, index_token=53, index_sentence=None, string='miscetur', pos=None, scansion=None), Word(index_char_start=404, index_char_stop=412, index_token=54, index_sentence=None, string='Danuvius', 
pos=None, scansion=None), Word(index_char_start=413, index_char_stop=418, index_token=55, index_sentence=None, string='molli', pos=None, scansion=None), Word(index_char_start=419, index_char_stop=421, index_token=56, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=422, index_char_stop=431, index_token=57, index_sentence=None, string='clementer', pos=None, scansion=None), Word(index_char_start=432, index_char_stop=437, index_token=58, index_sentence=None, string='edito', pos=None, scansion=None), Word(index_char_start=438, index_char_stop=444, index_token=59, index_sentence=None, string='montis', pos=None, scansion=None), Word(index_char_start=445, index_char_stop=452, index_token=60, index_sentence=None, string='Abnobae', pos=None, scansion=None), Word(index_char_start=453, index_char_stop=457, index_token=61, index_sentence=None, string='iugo', pos=None, scansion=None), Word(index_char_start=458, index_char_stop=465, index_token=62, index_sentence=None, string='effusus', pos=None, scansion=None), Word(index_char_start=466, index_char_stop=472, index_token=63, index_sentence=None, string='pluris', pos=None, scansion=None), Word(index_char_start=473, index_char_stop=480, index_token=64, index_sentence=None, string='populos', pos=None, scansion=None), Word(index_char_start=481, index_char_stop=485, index_token=65, index_sentence=None, string='adit', pos=None, scansion=None), Word(index_char_start=487, index_char_stop=492, index_token=66, index_sentence=None, string='donec', pos=None, scansion=None), Word(index_char_start=493, index_char_stop=495, index_token=67, index_sentence=None, string='in', pos=None, scansion=None), Word(index_char_start=496, index_char_stop=504, index_token=68, index_sentence=None, string='Ponticum', pos=None, scansion=None), Word(index_char_start=505, index_char_stop=509, index_token=69, index_sentence=None, string='mare', pos=None, scansion=None), Word(index_char_start=510, index_char_stop=513, 
index_token=70, index_sentence=None, string='sex', pos=None, scansion=None), Word(index_char_start=514, index_char_stop=522, index_token=71, index_sentence=None, string='meatibus', pos=None, scansion=None), Word(index_char_start=523, index_char_stop=530, index_token=72, index_sentence=None, string='erumpat', pos=None, scansion=None), Word(index_char_start=532, index_char_stop=540, index_token=73, index_sentence=None, string='septimum', pos=None, scansion=None), Word(index_char_start=541, index_char_stop=543, index_token=74, index_sentence=None, string='os', pos=None, scansion=None), Word(index_char_start=544, index_char_stop=553, index_token=75, index_sentence=None, string='paludibus', pos=None, scansion=None), Word(index_char_start=554, index_char_stop=562, index_token=76, index_sentence=None, string='hauritur', pos=None, scansion=None)], pipeline=<class '__main__.LatinPipeline'>, raw='Germania omnis a Gallis Raetisque et Pannoniis Rheno et Danuvio fluminibus, a Sarmatis Dacisque mutuo metu aut montibus separatur: cetera Oceanus ambit, latos sinus et insularum inmensa spatia complectens, nuper cognitis quibusdam gentibus ac regibus, quos bellum aperuit. Rhenus, Raeticarum Alpium inaccesso ac praecipiti vertice ortus, modico flexu in occidentem versus septentrionali Oceano miscetur. Danuvius molli et clementer edito montis Abnobae iugo effusus pluris populos adit, donec in Ponticum mare sex meatibus erumpat: septimum os paludibus hauritur. ', ner=None)

``Doc.pipeline``: <class '__main__.LatinPipeline'>

``Doc.pipeline.word_tokenizer.name``: CLTK Dummy Latin Tokenizer

``Doc.pipeline.word_tokenizer.description``: This is a simple regex which divides on word spaces (``r'\w+'``) for illustrative purposes.

``Doc.tokens[:10]``: [Word(index_char_start=0, index_char_stop=8, index_token=0, index_sentence=None, string='Germania', pos=None, scansion=None), Word(index_char_start=9, index_char_stop=14, index_token=1, index_sentence=None, string='omnis', pos=None, scansion=None), Word(index_char_start=15, index_char_stop=16, index_token=2, index_sentence=None, string='a', pos=None, scansion=None), Word(index_char_start=17, index_char_stop=23, index_token=3, index_sentence=None, string='Gallis', pos=None, scansion=None), Word(index_char_start=24, index_char_stop=33, index_token=4, index_sentence=None, string='Raetisque', pos=None, scansion=None), Word(index_char_start=34, index_char_stop=36, index_token=5, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=37, index_char_stop=46, index_token=6, index_sentence=None, string='Pannoniis', pos=None, scansion=None), Word(index_char_start=47, index_char_stop=52, index_token=7, index_sentence=None, string='Rheno', pos=None, scansion=None), Word(index_char_start=53, index_char_stop=55, index_token=8, index_sentence=None, string='et', pos=None, scansion=None), Word(index_char_start=56, index_char_stop=63, index_token=9, index_sentence=None, string='Danuvio', pos=None, scansion=None)]

``Doc.indices_tokens[:10]``: [[0, 8], [9, 14], [15, 16], [17, 23], [24, 33], [34, 36], [37, 46], [47, 52], [53, 55], [56, 63]]
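The output also hints at how goal 2 from the module docstring (users overriding the default pipeline) could look in practice. Below is a hedged, self-contained sketch with hypothetical names not taken from the gist, swapping in an alternative tokenizer that keeps punctuation attached to its word:

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Operation:
    # Minimal stand-in for the gist's ``Operation`` type.
    name: str
    algorithm: Callable

def whitespace_token_indices(text: str) -> List[List[int]]:
    # Split on runs of non-whitespace (``\S+``) instead of ``\w+``,
    # so trailing punctuation stays with the token.
    return [list(match.span()) for match in re.finditer(r"\S+", text)]

custom_op = Operation(name="Whitespace Tokenizer", algorithm=whitespace_token_indices)

raw = "Rhenus, Raeticarum Alpium"
spans = custom_op.algorithm(raw)
print([raw[a:b] for a, b in spans])  # ['Rhenus,', 'Raeticarum', 'Alpium']
```

Because ``Pipeline.word_tokenizer`` is just a field, a user-defined ``Operation`` like this one could be assigned in place of ``LatinTokenizationOperation`` without touching the rest of the pipeline.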
