TensorRT Performance on NVIDIA Volta V100 GPU
Environment: Ubuntu 16.04, CUDA 9.0, cuDNN 7, Python 2.7, TensorRT 4.0.0.3; ResNet-50 trained with Caffe. Elapsed times are per-batch latencies in milliseconds.
Floating Point 32

BatchSize   DataType   Elapsed (ms)   Images/sec
1           FP32       2.67725        373.5
2           FP32       4.08289        489.8
4           FP32       6.08236        657.6
8           FP32       8.27648        966.6
16          FP32       13.0539        1225.7
32          FP32       23.1328        1383.3
36          FP32       25.2694        1424.6
40          FP32       27.9643        1430.4
44          FP32       29.0452        1514.9
48          FP32       31.2893        1534.1
52          FP32       32.7521        1587.7
56          FP32       38.3085        1461.8
64          FP32       42.5329        1504.7
128         FP32       80.5629        1588.8
256         FP32       160.733        1592.7
Floating Point 16 using Tensor Cores

BatchSize   DataType   Elapsed (ms)   Images/sec
1           FP16       2.0779         481.3
2           FP16       2.65073        754.5
4           FP16       2.83658        1410.1
8           FP16       3.2938         2428.8
16          FP16       4.35118        3677.2
32          FP16       6.58944        4856.3
36          FP16       7.0443         5110.5
40          FP16       7.63894        5236.3
44          FP16       7.93037        5548.3
48          FP16       8.32809        5763.6
52          FP16       8.64164        6017.4
56          FP16       10.7251        5221.4
64          FP16       11.5623        5535.2
128         FP16       21.2693        6018.1
256         FP16       39.7254        6444.2
INT8

BatchSize   DataType   Elapsed (ms)   Images/sec
1           INT8       1.22277        817.8
2           INT8       1.71519        1166.1
4           INT8       2.45023        1632.5
8           INT8       2.90755        2751.5
16          INT8       4.26373        3752.6
32          INT8       6.78533        4716.1
40          INT8       8.21903        4866.8
48          INT8       8.94034        5368.9
56          INT8       10.9543        5112.1
64          INT8       11.7372        5452.7
128         INT8       22.0699        5799.8
256         INT8       43.2764        5915.5
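The Images/sec column in the tables above is derived from the per-batch latency: throughput = BatchSize * 1000 / Elapsed(ms). A quick sanity check against a few rows (values copied from the tables):

```python
# Verify: Images/sec = BatchSize * 1000 / Elapsed(ms).
# Rows taken from the tables above: (batch, elapsed_ms, reported_images_per_sec).
rows = [
    (1,    2.67725,  373.5),   # FP32, batch 1
    (56,  10.7251,  5221.4),   # FP16, batch 56
    (256, 43.2764,  5915.5),   # INT8, batch 256
]
for batch, elapsed_ms, reported in rows:
    throughput = batch * 1000.0 / elapsed_ms
    # Reported values are rounded, so allow a small tolerance.
    assert abs(throughput - reported) < 0.1, (batch, throughput, reported)
    print("batch %3d: %.1f images/sec" % (batch, throughput))
```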
Commands used (set BATCH to a single batch size; vary it to reproduce each row):

BATCH=56
DEPLOY=./deploy.prototxt
MODEL=./snapshots/resnet50.caffemodel
# FP32 (default precision)
/usr/src/tensorrt/bin/giexec --deploy=$DEPLOY --model=$MODEL --output=prob --batch=$BATCH
# FP16 (half2 mode, runs on Tensor Cores)
/usr/src/tensorrt/bin/giexec --deploy=$DEPLOY --model=$MODEL --output=prob --batch=$BATCH --half2
# INT8
/usr/src/tensorrt/bin/giexec --deploy=$DEPLOY --model=$MODEL --output=prob --batch=$BATCH --int8
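A full sweep over the batch sizes in the tables can be scripted. This sketch only prints the commands rather than running them, since the giexec location and model paths are system-specific assumptions; remove the leading echo to execute:

```shell
#!/bin/sh
# Dry-run sketch: print one giexec invocation per (batch size, precision)
# pair from the tables above. Paths below are assumptions from this document.
DEPLOY=./deploy.prototxt                # assumed model prototxt path
MODEL=./snapshots/resnet50.caffemodel   # assumed weights path
GIEXEC=/usr/src/tensorrt/bin/giexec     # assumed TensorRT 4 install location
for BATCH in 1 2 4 8 16 32 36 40 44 48 52 56 64 128 256; do
  for FLAG in "" "--half2" "--int8"; do   # empty flag = FP32
    echo "$GIEXEC --deploy=$DEPLOY --model=$MODEL --output=prob --batch=$BATCH $FLAG"
  done
done
```

Note the tables above do not report every batch size for INT8; the sweep simply covers the union of batch sizes that appear.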