@icoxfog417
Created February 22, 2019 08:20
chariot_demo2.py
import chariot.transformer as ct  # the `ct` alias used below was never imported
from chariot.dataset_preprocessor import DatasetPreprocessor
from chariot.transformer.formatter import Padding
dp = DatasetPreprocessor()
dp.process("review")\
    .by(ct.text.UnicodeNormalizer())\
    .by(ct.Tokenizer("en"))\
    .by(ct.token.StopwordFilter("en"))\
    .by(ct.Vocabulary(min_df=5, max_df=0.5))\
    .by(Padding(length=pad_length))\
    .fit(train_data["review"])
dp.process("polarity")\
    .by(ct.formatter.CategoricalLabel(num_class=3))
preprocessed = dp.preprocess(data)
# A DatasetPreprocessor holds multiple preprocessors (one per column),
# so it is saved as a single `tar.gz` archive.
dp.save("my_dataset_preprocessor.tar.gz")
loaded = DatasetPreprocessor.load("my_dataset_preprocessor.tar.gz")