@dmpetrov
Last active June 7, 2017 12:42
# Install DVC
$ pip install dvc
# Initialize DVC repository
$ dvc init
# Download a file and put it in the data/ directory.
$ dvc import https://s3-us-west-2.amazonaws.com/dvc-share/so/25K/Posts.xml.tgz data/
# Extract XML from the archive.
$ dvc run tar zxf data/Posts.xml.tgz -C data/
# Prepare data: convert the XML to TSV. The trailing "python" is an argument passed to the script.
$ dvc run python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv python
# Split the data into training and testing datasets (two output files).
# 0.33 is the fraction of the data held out for testing. 20170426 is the random seed.
$ dvc run python code/split_train_test.py data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv
# Extract features from the text data. Two TSV inputs and two pickled matrix outputs.
$ dvc run python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p
# Train the ML model on the training dataset. 20170426 is the seed value for this step.
$ dvc run python code/train_model.py data/matrix-train.p 20170426 data/model.p
# Evaluate the model on the testing dataset.
$ dvc run python code/evaluate.py data/model.p data/matrix-test.p data/evaluation.txt
# The result.
$ cat data/evaluation.txt
AUC: 0.596182
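# Sketch, not part of the original pipeline: code/split_train_test.py is not shown here,
# so the function below is only a hypothetical stand-in that illustrates what a seeded
# split with a 0.33 test ratio means. The name split_tsv, its argument order, and the use
# of awk are assumptions. It treats every line as a data row (no header) and will not
# reproduce the exact rows chosen by the Python script.

```shell
# Hypothetical helper: split_tsv INPUT RATIO SEED TRAIN_OUT TEST_OUT
# Sends each input line to TEST_OUT with probability RATIO, otherwise to TRAIN_OUT.
split_tsv() {
    awk -v ratio="$2" -v seed="$3" -v train_out="$4" -v test_out="$5" '
        # A fixed seed makes repeated runs with the same awk produce the same split.
        BEGIN { srand(seed) }
        # rand() is uniform in [0, 1), so about ratio of the rows land in the test file.
        { if (rand() < ratio) { print > test_out } else { print > train_out } }
    ' "$1"
}
```

# Possible usage, mirroring the arguments of the split step above:
#   split_tsv data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv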