Jacob Kahn jacobkahn

@jacobkahn
jacobkahn / gist:0ae35590b5fc5bc49acad48fb4f11315
Created July 23, 2019 22:37
NCCL Tests on 16 node p3dn.24xlarge + EFA - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum   1238.6    0.00    0.00  1e-06   1241.8    0.00    0.00  0e+00
          16             4   float     sum   1406.9    0.00    0.00  2e-07   1246.5    0.00    0.00  2e-07
          32             8   float     sum   1235.6    0.00    0.00  2e-07   1241.7    0.00    0.00  2e-07
          64            16   float     sum   1242.5    0.00    0.00  5e-07   1244.2    0.00    0.00  5e-07
         128            32   float     sum   1248.3    0.00    0.00  5e-07   1241.6    0.00    0.00  5e-07
         256            64   float     sum   1243.4    0.00    0.00  5e-07   1244.1    0.00    0.00  5e-07
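
The algbw and busbw columns print as 0.00 at these message sizes because the transfers are latency-bound. As a rough sketch of why, the snippet below applies the bandwidth definitions documented for nccl-tests (algorithm bandwidth is bytes divided by time; for all-reduce, bus bandwidth scales that by 2(n-1)/n) to the 8-byte row above. The function name `allreduce_bandwidth` is illustrative, and the rank count of 128 is an assumption (16 nodes with 8 GPUs each) that this output does not state.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement."""
    seconds = time_us * 1e-6
    algbw = size_bytes / seconds / 1e9           # GB/s, with 1 GB = 1e9 bytes
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks  # all-reduce correction factor
    return algbw, busbw


N_RANKS = 128  # assumption: 16 nodes x 8 GPUs per node

# 8-byte out-of-place row from the table above: 1238.6 us
algbw, busbw = allreduce_bandwidth(8, 1238.6, N_RANKS)
print(f"algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")  # both round to 0.00
print(f"algbw={algbw:.3e} GB/s  busbw={busbw:.3e} GB/s")  # unrounded values
```

At this scale the time column is the meaningful number; the bandwidth columns only become informative for the larger sizes later in a run.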
jacobkahn / gist:3703e3bbfe44343cd8ef033aed6093c9
Created July 23, 2019 23:17
NCCL Tests on 16 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    878.1    0.00    0.00  1e-06    876.8    0.00    0.00  2e-07
          16             4   float     sum    880.0    0.00    0.00  0e+00    875.6    0.00    0.00  2e-07
          32             8   float     sum    877.1    0.00    0.00  2e-07    879.3    0.00    0.00  5e-07
          64            16   float     sum    879.7    0.00    0.00  5e-07    882.2    0.00    0.00  5e-07
         128            32   float     sum    889.2    0.00    0.00  5e-07    889.8    0.00    0.00  5e-07
         256            64   float     sum    892.5    0.00    0.00  5e-07    896.4    0.00    0.00  5e-07
jacobkahn / gist:25e87a2706e73647dd4d222e2bf5b354
Created July 23, 2019 23:25
NCCL Tests on 8 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    440.1    0.00    0.00  2e-07    442.8    0.00    0.00  2e-07
          16             4   float     sum    444.4    0.00    0.00  0e+00    440.2    0.00    0.00  2e-07
          32             8   float     sum    442.1    0.00    0.00  2e-07    443.7    0.00    0.00  2e-07
          64            16   float     sum    440.8    0.00    0.00  2e-07    441.3    0.00    0.00  2e-07
         128            32   float     sum    445.8    0.00    0.00  2e-07    444.7    0.00    0.00  2e-07
         256            64   float     sum    450.6    0.00    0.00  2e-07    448.1    0.00    0.00  2e-07
jacobkahn / gist:33f7c6731d653d4b35ae096e856099f8
Created July 23, 2019 23:35
NCCL Tests on 4 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    218.6    0.00    0.00  2e-07    217.6    0.00    0.00  2e-07
          16             4   float     sum    218.3    0.00    0.00  0e+00    218.7    0.00    0.00  1e-07
          32             8   float     sum    217.1    0.00    0.00  1e-07    216.9    0.00    0.00  2e-07
          64            16   float     sum    218.6    0.00    0.00  2e-07    217.6    0.00    0.00  2e-07
         128            32   float     sum    220.0    0.00    0.00  2e-07    220.9    0.00    0.00  2e-07
         256            64   float     sum    223.9    0.00    0.00  2e-07    222.9    0.00    0.00  2e-07
jacobkahn / gist:3510a154e50b18defd236eeea461e3fb
Created July 24, 2019 00:35
NCCL Tests on 2 node p3dn.24xlarge + ethernet - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum    112.7    0.00    0.00  2e-07    110.9    0.00    0.00  1e-07
          16             4   float     sum    110.5    0.00    0.00  0e+00    111.3    0.00    0.00  1e-07
          32             8   float     sum    110.6    0.00    0.00  1e-07    111.1    0.00    0.00  1e-07
          64            16   float     sum    112.2    0.00    0.00  1e-07    112.1    0.00    0.00  6e-08
         128            32   float     sum    113.1    0.00    0.00  6e-08    113.3    0.00    0.00  6e-08
         256            64   float     sum    112.5    0.00    0.00  6e-08    113.8    0.00    0.00  6e-08
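
Across the four ethernet runs above (16, 8, 4, and 2 nodes), the small-message time roughly doubles each time the node count doubles. The sketch below checks this using only the 8-byte out-of-place times copied from those tables; the variable names are illustrative, and the linear reading is an observation about these numbers rather than a claim about which algorithm NCCL selected.

```python
# 8-byte out-of-place all_reduce times (us) from the ethernet runs above,
# keyed by node count.
latency_us = {2: 112.7, 4: 218.6, 8: 440.1, 16: 878.1}

# Time divided by node count: stays near constant if scaling is linear.
per_node_us = {nodes: t / nodes for nodes, t in latency_us.items()}

# Ratio of each run to the run with half as many nodes.
doubling_ratio = {
    nodes: latency_us[nodes] / latency_us[nodes // 2]
    for nodes in latency_us
    if nodes // 2 in latency_us
}

for nodes in sorted(latency_us):
    print(f"{nodes:2d} nodes: {latency_us[nodes]:7.1f} us total, "
          f"{per_node_us[nodes]:.2f} us per node")
for nodes in sorted(doubling_ratio):
    print(f"{nodes // 2:2d} -> {nodes:2d} nodes: x{doubling_ratio[nodes]:.2f}")
```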
jacobkahn / gist:218cebce3aca0b1328dd68da05f28f73
Created July 24, 2019 01:03
fi_info -p efa on each host
172.31.39.141
provider: efa
fabric: EFA-fe80::cbb:82ff:fef5:d306
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cbb:82ff:fef5:d306
domain: efa_0-dgrm
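
To compare this output across hosts, the `key: value` lines can be grouped into one record per `provider:` entry. The following is a minimal sketch; `parse_fi_info` is a hypothetical helper, not part of libfabric, and it assumes the layout shown above, where each record starts with a `provider:` line and host addresses appear on lines without a colon.

```python
def parse_fi_info(text):
    """Group 'key: value' lines into one dict per 'provider:' entry."""
    records = []
    for line in text.splitlines():
        # Split on the first colon only: fabric names contain IPv6 colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. a bare host address line such as 172.31.39.141
        key, value = key.strip(), value.strip()
        if key == "provider":
            records.append({})  # a provider line opens a new record
        if records:
            records[-1][key] = value
    return records


sample = """172.31.39.141
provider: efa
fabric: EFA-fe80::cbb:82ff:fef5:d306
domain: efa_0-rdm
version: 3.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cbb:82ff:fef5:d306
domain: efa_0-dgrm
"""

records = parse_fi_info(sample)
for record in records:
    print(record["domain"], record.get("type", "(type not shown)"))
```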
jacobkahn / gist:527f7eb2fe85e56074163544ede4e491
Created July 24, 2019 01:06
aws ec2 describe-security-groups --group-ids sg-0d37d17f642362f03
{
    "SecurityGroups": [
        {
            "IpPermissionsEgress": [
                {
                    "IpProtocol": "-1",
                    "PrefixListIds": [],
                    "IpRanges": [
                        {
                            "CidrIp": "0.0.0.0/0"
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# Rank 0 Pid 20303 on ip-172-31-39-141 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 20304 on ip-172-31-39-141 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Pid 20305 on ip-172-31-39-141 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Pid 20306 on ip-172-31-39-141 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Pid 20307 on ip-172-31-39-141 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Pid 20308 on ip-172-31-39-141 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Pid 20309 on ip-172-31-39-141 device 6 [0x00] Tesla V100-SXM2-32GB
jacobkahn / gist:5fd4cd3e49a10c04105b777611327bbe
Created July 24, 2019 01:13
Output of `cat /opt/amazon/efa/installed_packages` on each node
172.31.39.141
# EFA installer version: 1.1.0
# Debug packages installed: no
# Packages installed:
efa-0.9.2-1.amzn1.x86_64
libfabric-1.7.0amzn1.1-1.amzn1.x86_64
libfabric-devel-1.7.0amzn1.1-1.amzn1.x86_64
openmpi-3.1.3-1.amzn1.x86_64
172.31.38.14
# EFA installer version: 1.1.0
# Debug packages installed: no
# Packages installed:
jacobkahn / gist:210a827162960df11df3c9666ab54719
Created July 24, 2019 04:38
NCCL Tests on 16 node p3dn.24xlarge + efa - all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
#
#                                                   out-of-place                      in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2   float     sum   1247.4    0.00    0.00  1e-06   1239.1    0.00    0.00  0e+00
          16             4   float     sum   1239.1    0.00    0.00  2e-07   1239.6    0.00    0.00  2e-07
          32             8   float     sum   1242.5    0.00    0.00  2e-07   1241.4    0.00    0.00  2e-07
          64            16   float     sum   1237.8    0.00    0.00  5e-07   1240.8    0.00    0.00  5e-07
         128            32   float     sum   1240.6    0.00    0.00  5e-07   1238.0    0.00    0.00  5e-07
         256            64   float     sum   1238.2    0.00    0.00  5e-07   1237.4    0.00    0.00  5e-07
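
For a side-by-side view, the sketch below divides the out-of-place times of this 16-node EFA run by those of the 16-node ethernet run listed earlier, size by size. The numbers are copied from the two tables; the dictionary names are illustrative, and the comparison covers only the six small message sizes shown in these previews.

```python
# Out-of-place all_reduce times (us) at 16 nodes, keyed by message size (B).
efa_us = {8: 1247.4, 16: 1239.1, 32: 1242.5, 64: 1237.8, 128: 1240.6, 256: 1238.2}
ethernet_us = {8: 878.1, 16: 880.0, 32: 877.1, 64: 879.7, 128: 889.2, 256: 892.5}

# Ratio > 1 means the EFA run took longer than the ethernet run at that size.
ratio = {size: efa_us[size] / ethernet_us[size] for size in efa_us}

for size in sorted(ratio):
    print(f"{size:4d} B: EFA {efa_us[size]:7.1f} us, "
          f"ethernet {ethernet_us[size]:6.1f} us, ratio {ratio[size]:.2f}")
```

For these sizes the EFA run is consistently about 1.4 times slower than the ethernet run.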