Last active June 7, 2017 12:42
# Install DVC
$ pip install dvc
# Initialize DVC repository
$ dvc init
# Download a file and put it into the data/ directory.
$ dvc import data/
# Extract XML from the archive.
$ dvc run tar zxf data/Posts.xml.tgz -C data/
# Prepare data.
$ dvc run python code/ data/Posts.xml data/Posts.tsv python
# Split the data into training and testing datasets (two output files).
# 0.33 is the fraction of rows assigned to the test dataset. 20170426 is the seed for the random split.
$ dvc run python code/ data/Posts.tsv 0.33 20170426 data/Posts-train.tsv data/Posts-test.tsv
# Extract features from the text data. Two TSV inputs and two pickled matrix outputs.
$ dvc run python code/ data/Posts-train.tsv data/Posts-test.tsv data/matrix-train.p data/matrix-test.p
# Train the ML model on the training dataset. 20170426 is another seed value.
$ dvc run python code/ data/matrix-train.p 20170426 data/model.p
# Evaluate the model on the testing dataset.
$ dvc run python code/ data/model.p data/matrix-test.p data/evaluation.txt
# The result.
$ cat data/evaluation.txt
AUC: 0.596182
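# A minimal sketch of what the split step above might do with its ratio and seed arguments.
# The actual script under code/ is not shown here, so the function name split_rows and the
# in-memory row list are hypothetical. The point is only that a fixed seed makes the split
# reproducible, which is what lets DVC re-run the step and get the same two output files.

```python
import random


def split_rows(rows, test_ratio, seed):
    """Shuffle rows with a seeded RNG and hold out test_ratio of them for testing.

    Hypothetical helper for illustration; not taken from the original scripts.
    """
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    shuffled = list(rows)      # copy, so the caller's list is left as-is
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in data; the real step reads rows from data/Posts.tsv.
rows = [f"post-{i}" for i in range(100)]
train, test = split_rows(rows, 0.33, 20170426)
print(len(train), len(test))  # 67 33
```

# Calling split_rows again with the same rows, ratio, and seed returns the same two lists;
# changing the seed gives a different but equally sized split.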