Running CNNs fast on the CPU with Chainer 4.0 + iDeep

Goal

Intel has developed a module called iDeep that speeds up CPU execution of DNNs, and iDeep can be used from Chainer 4.0.0 onward.
I want to check which kinds of processing get faster, and by how much.

How to use iDeep

On Ubuntu, it can be installed with pip.

$ python3 -m venv .venv
$ . .venv/bin/activate
$ pip install wheel
$ pip install chainer Pillow ideep4py

In your code, configure Chainer to use iDeep before running any computation.

chainer.config.use_ideep = 'auto'
model.to_intel64()
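
To put those two lines in context, here is a minimal sketch of an inference setup with iDeep enabled. The TinyCNN model, its layers, and the 224x224 dummy input are made up purely for illustration; only the config line and the to_intel64() call come from the actual usage above.

import numpy
import chainer
import chainer.functions as F
import chainer.links as L

class TinyCNN(chainer.Chain):
    """Hypothetical small CNN, used only to show where the iDeep setup goes."""
    def __init__(self):
        super(TinyCNN, self).__init__()
        with self.init_scope():
            self.conv = L.Convolution2D(3, 16, ksize=3, pad=1)
            self.fc = L.Linear(None, 10)

    def __call__(self, x):
        return self.fc(F.relu(self.conv(x)))

model = TinyCNN()
chainer.config.train = False       # inference mode
chainer.config.use_ideep = 'auto'  # let Chainer use iDeep when it is available
model.to_intel64()                 # convert the parameters for iDeep

x = numpy.zeros((1, 3, 224, 224), dtype=numpy.float32)  # dummy 224x224 RGB image
with chainer.no_backprop_mode():
    y = model(x)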

Measurement environment

Environment 1: server

Item	Value
CPU	Ryzen 1700 (8C16T)
Memory	16GB
GPU	GeForce GTX 1070
CUDA	9.1
OS	Ubuntu 16.04
Python	3.5.2
chainer	4.0.0
cupy	4.0.0
ideep4py	1.0.4

Environment 2: mobile laptop

Item	Value
CPU	Core i5-7Y54 (2C4T)
Memory	8GB
OS	Ubuntu 16.04
Python	3.5.2
chainer	4.0.0
ideep4py	1.0.4

CNN

What was measured

I measured the time a CNN takes for inference on a single image.
Assuming use inside a web app, images were processed one at a time with no batching, and the figure reported is the average time over 30 consecutive runs.
I also varied the number of CPU cores used, via the environment variable OMP_NUM_THREADS, from 1 up to the CPU's thread count, to see how the speed changes with the number of CPUs used.

The implementations of the CNN models under test are here. Execution times are in ms.
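
As a condensed illustration of this measurement (the full scripts are attached at the end of this gist), the per-model timing boils down to the sketch below; model and images are placeholders here, and predict() is the inference method of the model implementations linked above.

from time import time

import chainer
import numpy

def average_inference_ms(model, images, use_ideep):
    # Average single-image (batch size 1) inference time in milliseconds.
    chainer.config.use_ideep = 'auto' if use_ideep else 'never'
    if use_ideep:
        model.to_intel64()
    start = time()
    for img in images:
        x = chainer.Variable(numpy.array([img], dtype=numpy.float32))
        model.predict(x)
    return (time() - start) / len(images) * 1000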

Summary

  • Effect of iDeep
    • Every CNN became substantially faster
    • All models except MobileNet became more than 2x faster in environment 1 and more than 4x faster in environment 2
    • MobileNet only reached 1.5x in environment 1 and 2x in environment 2, because DepthwiseConvolution cannot use iDeep
    • The improvement in processing time from increasing the number of CPU threads also became larger
  • Environment 1
    • Although the module comes from Intel, iDeep is also effective on an AMD CPU. Maybe any x86_64 CPU is fine?
    • Isn't single-thread performance a bit slow for a desktop CPU?
    • Increasing the number of CPU threads used makes the slower models considerably faster, while for lightweight models the parallelization overhead wins and they get slower
  • Environment 2
    • The effect of iDeep is large
    • Increasing the number of CPU threads used does not improve performance much

Measurement results for environment 1

OMP_NUM_THREADS=1
model	cpu	cpu_ideep	gpu
VGGNetBN	1265.59	561.61	25.94
GoogLeNetBN	262.67	108.32	27.42
ResNet50	381.21	168.94	20.22
SqueezeNet	118.18	39.48	7.51
MobileNet	95.64	62.33	11.84
InceptionV4	1170.83	506.02	62.47
InceptionResNetV2	849.52	411.55	54.90
FaceClassifier100x100V	13.50	4.83	2.27
FaceClassifier100x100V2	27.87	10.72	2.91

OMP_NUM_THREADS=2
model	cpu	cpu_ideep	gpu
VGGNetBN	812.87	325.63	25.96
GoogLeNetBN	213.76	70.99	27.25
ResNet50	278.51	101.34	20.19
SqueezeNet	105.84	25.44	7.48
MobileNet	84.46	52.58	11.84
InceptionV4	873.03	295.34	61.69
InceptionResNetV2	629.78	247.73	54.08
FaceClassifier100x100V	13.96	4.08	2.27
FaceClassifier100x100V2	23.43	7.57	2.94

OMP_NUM_THREADS=3
model	cpu	cpu_ideep	gpu
VGGNetBN	753.26	242.03	26.43
GoogLeNetBN	214.54	65.16	28.19
ResNet50	258.43	88.77	20.94
SqueezeNet	100.85	24.36	8.36
MobileNet	82.84	53.19	12.66
InceptionV4	842.38	254.42	62.60
InceptionResNetV2	598.21	218.00	55.28
FaceClassifier100x100V	15.18	3.98	2.66
FaceClassifier100x100V2	24.22	6.85	3.39

OMP_NUM_THREADS=4
model	cpu	cpu_ideep	gpu
VGGNetBN	633.74	190.14	26.32
GoogLeNetBN	195.60	57.84	28.44
ResNet50	237.54	76.85	21.24
SqueezeNet	105.70	22.28	8.39
MobileNet	81.65	51.21	12.86
InceptionV4	751.73	216.90	63.42
InceptionResNetV2	527.71	185.48	56.10
FaceClassifier100x100V	14.65	3.76	2.64
FaceClassifier100x100V2	21.71	7.89	3.43

OMP_NUM_THREADS=5
model	cpu	cpu_ideep	gpu
VGGNetBN	605.71	160.13	27.13
GoogLeNetBN	192.21	73.90	28.00
ResNet50	249.94	67.44	20.91
SqueezeNet	105.35	20.08	8.25
MobileNet	83.14	50.84	12.62
InceptionV4	729.65	217.96	62.10
InceptionResNetV2	551.06	205.83	54.59
FaceClassifier100x100V	15.05	4.45	2.60
FaceClassifier100x100V2	29.49	6.98	3.35

OMP_NUM_THREADS=6
model	cpu	cpu_ideep	gpu
VGGNetBN	573.07	140.49	26.50
GoogLeNetBN	209.73	52.30	28.31
ResNet50	221.88	64.22	21.15
SqueezeNet	103.38	19.20	8.37
MobileNet	91.65	53.01	12.72
InceptionV4	679.90	175.05	62.91
InceptionResNetV2	488.90	155.45	55.49
FaceClassifier100x100V	16.33	4.13	2.69
FaceClassifier100x100V2	23.15	7.81	4.00

OMP_NUM_THREADS=7
model	cpu	cpu_ideep	gpu
VGGNetBN	552.00	233.21	26.19
GoogLeNetBN	225.23	64.15	27.98
ResNet50	234.03	84.70	20.77
SqueezeNet	130.36	25.31	9.18
MobileNet	80.77	52.89	12.61
InceptionV4	716.11	241.70	62.42
InceptionResNetV2	481.62	204.24	55.17
FaceClassifier100x100V	16.05	4.41	2.66
FaceClassifier100x100V2	27.45	7.17	3.42

OMP_NUM_THREADS=8
model	cpu	cpu_ideep	gpu
VGGNetBN	506.45	208.64	26.43
GoogLeNetBN	210.47	60.45	28.23
ResNet50	218.64	59.43	20.79
SqueezeNet	107.07	18.25	8.32
MobileNet	99.34	51.75	12.68
InceptionV4	880.17	162.24	62.90
InceptionResNetV2	594.11	190.14	55.42
FaceClassifier100x100V	18.91	4.86	2.66
FaceClassifier100x100V2	30.68	6.63	3.35

OMP_NUM_THREADS=9
model	cpu	cpu_ideep	gpu
VGGNetBN	726.16	189.00	27.02
GoogLeNetBN	214.94	65.16	28.73
ResNet50	262.53	76.18	20.93
SqueezeNet	111.79	22.54	9.18
MobileNet	84.73	51.33	12.68
InceptionV4	860.07	210.16	62.29
InceptionResNetV2	609.61	199.59	55.73
FaceClassifier100x100V	16.43	5.32	3.09
FaceClassifier100x100V2	27.54	7.27	3.93

OMP_NUM_THREADS=10
model	cpu	cpu_ideep	gpu
VGGNetBN	705.98	175.63	26.13
GoogLeNetBN	212.00	57.25	28.08
ResNet50	262.67	78.04	21.86
SqueezeNet	106.58	21.37	8.33
MobileNet	82.90	57.68	13.47
InceptionV4	935.49	198.64	63.37
InceptionResNetV2	586.27	176.30	55.31
FaceClassifier100x100V	15.42	4.92	3.07
FaceClassifier100x100V2	27.84	7.15	3.97

OMP_NUM_THREADS=11
model	cpu	cpu_ideep	gpu
VGGNetBN	684.96	164.49	26.14
GoogLeNetBN	245.69	62.30	29.31
ResNet50	269.69	69.90	21.01
SqueezeNet	127.52	21.94	8.42
MobileNet	96.83	51.63	12.92
InceptionV4	909.33	208.31	63.95
InceptionResNetV2	605.43	184.12	56.57
FaceClassifier100x100V	15.84	4.95	3.13
FaceClassifier100x100V2	34.81	7.89	4.01

OMP_NUM_THREADS=12
model	cpu	cpu_ideep	gpu
VGGNetBN	667.38	153.96	26.71
GoogLeNetBN	201.64	61.11	29.10
ResNet50	247.94	64.90	20.97
SqueezeNet	125.52	21.72	9.22
MobileNet	97.26	51.34	12.68
InceptionV4	861.24	202.52	63.92
InceptionResNetV2	662.25	177.85	56.44
FaceClassifier100x100V	16.14	6.17	3.14
FaceClassifier100x100V2	26.34	6.96	3.38

OMP_NUM_THREADS=13
model	cpu	cpu_ideep	gpu
VGGNetBN	649.69	147.38	27.82
GoogLeNetBN	201.97	59.01	29.37
ResNet50	284.36	70.10	22.06
SqueezeNet	107.38	21.82	9.23
MobileNet	101.19	51.01	12.83
InceptionV4	782.56	196.79	63.82
InceptionResNetV2	591.59	174.74	56.38
FaceClassifier100x100V	15.89	6.08	3.13
FaceClassifier100x100V2	34.68	7.67	4.06

OMP_NUM_THREADS=14
model	cpu	cpu_ideep	gpu
VGGNetBN	673.09	138.86	27.08
GoogLeNetBN	218.79	58.20	28.92
ResNet50	340.76	64.38	20.71
SqueezeNet	128.25	21.10	9.15
MobileNet	99.86	56.69	13.43
InceptionV4	894.24	178.74	62.63
InceptionResNetV2	660.05	156.07	55.12
FaceClassifier100x100V	19.90	6.56	3.11
FaceClassifier100x100V2	35.91	6.62	3.38

OMP_NUM_THREADS=15
model	cpu	cpu_ideep	gpu
VGGNetBN	665.97	132.82	27.00
GoogLeNetBN	239.33	56.87	29.02
ResNet50	335.97	67.72	21.95
SqueezeNet	128.58	20.30	9.16
MobileNet	105.54	56.14	13.43
InceptionV4	843.06	184.30	63.28
InceptionResNetV2	653.96	167.30	55.83
FaceClassifier100x100V	16.34	4.05	3.12
FaceClassifier100x100V2	36.95	7.23	4.04

OMP_NUM_THREADS=16
model	cpu	cpu_ideep	gpu
VGGNetBN	654.45	126.27	26.79
GoogLeNetBN	230.93	58.31	28.94
ResNet50	323.07	65.53	22.01
SqueezeNet	127.12	20.12	9.18
MobileNet	99.94	55.80	13.42
InceptionV4	833.68	179.54	63.66
InceptionResNetV2	621.62	161.24	56.11
FaceClassifier100x100V	19.95	5.29	3.07
FaceClassifier100x100V2	36.12	6.73	3.96

Measurement results for environment 2

OMP_NUM_THREADS=1
model	cpu	cpu_ideep
VGGNetBN	2441.28	531.15
GoogLeNetBN	479.41	117.48
ResNet50	708.66	173.14
SqueezeNet	211.86	40.42
MobileNet	168.68	76.73
InceptionV4	2142.35	512.21
InceptionResNetV2	1600.09	421.00
FaceClassifier100x100V	23.74	6.30
FaceClassifier100x100V2	50.14	11.76

OMP_NUM_THREADS=2
model	cpu	cpu_ideep
VGGNetBN	2873.04	377.42
GoogLeNetBN	559.84	103.56
ResNet50	810.10	141.49
SqueezeNet	252.57	36.05
MobileNet	196.84	85.31
InceptionV4	2486.23	410.05
InceptionResNetV2	1867.87	351.16
FaceClassifier100x100V	29.16	6.69
FaceClassifier100x100V2	59.66	11.46

OMP_NUM_THREADS=3
model	cpu	cpu_ideep
VGGNetBN	2678.89	459.71
GoogLeNetBN	580.94	127.21
ResNet50	741.67	172.29
SqueezeNet	259.66	43.17
MobileNet	219.67	90.43
InceptionV4	2465.30	491.44
InceptionResNetV2	1846.31	393.16
FaceClassifier100x100V	33.03	9.12
FaceClassifier100x100V2	58.75	16.74

OMP_NUM_THREADS=4
model	cpu	cpu_ideep
VGGNetBN	2327.47	402.15
GoogLeNetBN	535.85	118.01
ResNet50	715.50	155.87
SqueezeNet	264.89	42.31
MobileNet	206.62	97.90
InceptionV4	2163.48	439.94
InceptionResNetV2	1630.52	377.71
FaceClassifier100x100V	34.82	11.61
FaceClassifier100x100V2	59.61	15.83

MLP

I measured the time taken to train and evaluate a 300 => 100 => 100 => 2 multi-layer perceptron. In order of processing speed, the result was:

  1. CPU execution without iDeep
  2. CPU execution with iDeep
  3. GPU execution

Details will be posted later.
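
For reference, a minimal sketch of such an MLP in Chainer; only the 300 => 100 => 100 => 2 layer sizes come from the description above, and the ReLU activations are an assumption.

import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    # 300 => 100 => 100 => 2 multi-layer perceptron
    def __init__(self):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(300, 100)
            self.l2 = L.Linear(100, 100)
            self.l3 = L.Linear(100, 2)

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        return self.l3(h)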

RNN

To be verified later.

#!/bin/bash
for num_threads in {1..16}; do
    echo "OMP_NUM_THREADS=${num_threads}"
    echo "model cpu cpu_ideep gpu"
    OMP_NUM_THREADS=${num_threads} python -u speedcheck.py 2>/dev/null
    echo
done

# coding: utf-8
import os
from time import time

import numpy
import chainer
import cupy
from PIL import Image

from vggnetbn import VGGNetBN
from googlenet import GoogLeNetBN
from resnet import ResNet50
from inception_v4 import InceptionV4
from inception_resnet_v2 import InceptionResNetV2
from fc100 import FaceClassifier100x100V, FaceClassifier100x100V2
from squeezenet import SqueezeNet
from mobilenet import MobileNet

chainer.config.train = False

# Load 30 test images and resize them to each input size used by the models.
test_images = {100: [], 224: [], 299: []}
image_names = os.listdir('jpg')[:30]
for f in image_names:
    for img_size in 100, 224, 299:
        pil_img = Image.open('jpg/' + f).convert('RGB')
        test_images[img_size].append(numpy.asarray(pil_img.resize((img_size, img_size)), dtype=numpy.float32).transpose(2, 0, 1))


def check_speed(model, test_images_np):
    # Measure speed without iDeep
    chainer.config.use_ideep = 'never'
    start_time = time()
    for img in test_images_np:
        x = chainer.Variable(numpy.array([img], dtype=numpy.float32))
        pred = model.predict(x)
    avg_cpu_time = (time() - start_time) / len(test_images_np)

    # Measure speed with iDeep
    chainer.config.use_ideep = 'auto'
    model.to_intel64()
    start_time = time()
    for img in test_images_np:
        x = chainer.Variable(numpy.array([img], dtype=numpy.float32))
        pred = model.predict(x)
    avg_ideep_time = (time() - start_time) / len(test_images_np)

    # Measure speed on the GPU
    model.to_gpu()
    start_time = time()
    for img in test_images_np:
        x = chainer.Variable(cupy.array([img], dtype=cupy.float32))
        pred = model.predict(x)
    avg_gpu_time = (time() - start_time) / len(test_images_np)

    return avg_cpu_time, avg_ideep_time, avg_gpu_time


def main():
    test_patterns = [
        ('VGGNetBN', VGGNetBN(17), 224),
        ('GoogLeNetBN', GoogLeNetBN(17), 224),
        ('ResNet50', ResNet50(17), 224),
        ('SqueezeNet', SqueezeNet(17), 224),
        ('MobileNet', MobileNet(17), 224),
        ('InceptionV4', InceptionV4(dim_out=17), 299),
        ('InceptionResNetV2', InceptionResNetV2(dim_out=17), 299),
        ('FaceClassifier100x100V', FaceClassifier100x100V(17), 100),
        ('FaceClassifier100x100V2', FaceClassifier100x100V2(17), 100)
    ]
    for model_name, model, test_size in test_patterns:
        cpu_time, ideep_time, gpu_time = check_speed(model, test_images[test_size])
        print('{}\t{:.02f}\t{:.02f}\t{:.02f}'.format(model_name, cpu_time * 1000, ideep_time * 1000, gpu_time * 1000))
        del model


if __name__ == '__main__':
    main()