@ghs2015
ghs2015 / output-log
Created March 23, 2018 18:47
horovod_example_output
(horovod323) $ ./run.sh >> log.txt
INFO:tensorflow:Using config: {'_task_type': 'worker', '_model_dir': './mnist_convnet_model', '_save_checkpoints_steps': None, '_task_id': 0, '_is_chief': True, '_evaluation_master': '', '_num_worker_replicas': 1, '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_session_config': gpu_options {
allow_growth: true
visible_device_list: "0"
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2aab3e03b438>, '_service': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_tf_random_seed': None, '_global_id_in_cluster': 0, '_log_step_count_steps': 100}
WARNING:tensorflow:Using temporary folder as model directory: /local_scratch/pbs.2980540.pbs02/tmpg2a8mhn8
INFO:tensorflow:Using config: {'_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2b5d7fcbb4e0>, '_save_summary_steps': 100, '_session_config': gpu_options {
allow_growth:
@ghs2015
ghs2015 / A list of problems.md
Created December 6, 2017 15:52
Issue: unable to install and use Horovod package on Palmetto
  1. Install openmpi 3.0 on Palmetto
  2. Install NCCL2 on Palmetto
  3. Install Horovod (CPU and GPU version) on Palmetto
@ghs2015
ghs2015 / README.txt
Last active December 5, 2017 22:03 — forked from PurpleBooth/README-Template.md
A template to make good README.md
# Training and Inferencing Deep Neural Networks using Multiple GPUs on Multiple Computational Nodes
This program is an example of training CrescendoNet in parallel on multiple GPUs across multiple computational nodes. We train and evaluate the model on a subset of ImageNet known as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) dataset. We use the Palmetto cluster at Clemson University as the computational resource. Each computational node we use has two NVIDIA P100 GPUs, and we create one worker per GPU. The source code is built on TensorFlow's official example and framework.
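
The one-worker-per-GPU layout described above can be sketched in plain Python. This is only an illustration, not the project's actual launch code: the helper names `place_worker` and `visible_device_list` are hypothetical, and the sketch assumes workers are numbered node by node (ranks 0 and 1 on the first node, 2 and 3 on the second, and so on), which may not match how a given job scheduler assigns ranks.

```python
from collections import namedtuple

# From the README: each node has two P100 GPUs, one worker per GPU.
GPUS_PER_NODE = 2

# Hypothetical record describing where a worker runs.
Placement = namedtuple("Placement", ["node_index", "local_gpu"])


def place_worker(rank, gpus_per_node=GPUS_PER_NODE):
    """Map a global worker rank to (node index, GPU index on that node).

    Assumes ranks are assigned node by node; this is an assumption of
    the sketch, not something the README states.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    node_index, local_gpu = divmod(rank, gpus_per_node)
    return Placement(node_index, local_gpu)


def visible_device_list(rank, gpus_per_node=GPUS_PER_NODE):
    """Return the local GPU index as a string, the form that TensorFlow's
    gpu_options.visible_device_list setting takes."""
    return str(place_worker(rank, gpus_per_node).local_gpu)


if __name__ == "__main__":
    # Four workers spread over two nodes with two GPUs each.
    for rank in range(4):
        print(rank, place_worker(rank), visible_device_list(rank))
```

Restricting each worker to a single visible device keeps two workers on the same node from allocating memory on each other's GPU.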
## Getting Started
These instructions will guide you through training our CrescendoNet on the Palmetto cluster for demonstration purposes. You may customize the hyperparameters or replace the dataset to use the model for your own applications.
### Prerequisites
Python 3.6.2