- Install Open MPI 3.0 on Palmetto
- Install NCCL2 on Palmetto
- Install Horovod (CPU and GPU versions) on Palmetto
(horovod323) $ ./run.sh >> log.txt
INFO:tensorflow:Using config: {'_task_type': 'worker', '_model_dir': './mnist_convnet_model', '_save_checkpoints_steps': None, '_task_id': 0, '_is_chief': True, '_evaluation_master': '', '_num_worker_replicas': 1, '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_session_config': gpu_options {
allow_growth: true
visible_device_list: "0"
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2aab3e03b438>, '_service': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_tf_random_seed': None, '_global_id_in_cluster': 0, '_log_step_count_steps': 100}
WARNING:tensorflow:Using temporary folder as model directory: /local_scratch/pbs.2980540.pbs02/tmpg2a8mhn8
INFO:tensorflow:Using config: {'_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x2b5d7fcbb4e0>, '_save_summary_steps': 100, '_session_config': gpu_options {
allow_growth:
# Training and Inferencing Deep Neural Networks using Multiple GPUs on Multiple Computational Nodes
This program is an example of training CrescendoNet in parallel on multiple GPUs across multiple computational nodes. We train and evaluate the model on the subset of ImageNet used in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We use the Palmetto cluster at Clemson University as the computational resource. Each computational node we use has two NVIDIA P100 GPUs, and we create one worker per GPU. The source code builds on TensorFlow's official example and framework.
## Getting Started
These instructions will guide you through training our CrescendoNet on the Palmetto cluster for demonstration purposes. You may customize the hyperparameters or replace the dataset to use the model for your own applications.
### Prerequisites
Python 3.6.2
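To make the worker layout described above concrete, here is a minimal sketch of how workers could map onto GPUs when each node has two GPUs and each GPU hosts one worker. The names `WorkerSlot` and `worker_layout` are hypothetical and not part of this project. The sketch also assumes that worker ranks fill one node completely before moving to the next, which depends on how the job is actually launched.

```python
from typing import List, NamedTuple


class WorkerSlot(NamedTuple):
    """Placement of one worker (hypothetical helper type, for illustration only)."""
    rank: int       # global worker index across all nodes
    node: int       # index of the computational node hosting the worker
    local_gpu: int  # GPU index on that node


def worker_layout(num_nodes: int, gpus_per_node: int = 2) -> List[WorkerSlot]:
    """Return one slot per GPU, i.e. one worker per GPU.

    Assumption: ranks are assigned contiguously, so a node's GPUs are
    all filled before the next node receives a worker.
    """
    if num_nodes < 1 or gpus_per_node < 1:
        raise ValueError("num_nodes and gpus_per_node must be positive")
    return [
        WorkerSlot(rank, rank // gpus_per_node, rank % gpus_per_node)
        for rank in range(num_nodes * gpus_per_node)
    ]


if __name__ == "__main__":
    # Two nodes with two GPUs each give four workers in total.
    for slot in worker_layout(num_nodes=2):
        print(slot)
```

Under this assumption, the `local_gpu` value is the index a worker would use to select its own device on the node, so that two workers on the same node do not share a GPU.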