BERT on AMD RadeonGPU

Introduction

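People working in natural language processing often ask whether BERT runs on AMD Radeon GPUs, so this post describes what happened when we tested it.

Result

It works.

Verification

First, read through the official documentation.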
https://github.com/google-research/bert

In short, it looks like anything that can run TensorFlow 1.11 will do, so we install 1.12.0.
A ROCm 2.3 update had arrived, so we set it up on the test machine and run on a Radeon VII.

The machine originally had TensorFlow 1.12.0 + ROCm 2.2. On it, we ran

curl -sL http://install.aieater.com/setup_rocm | bash -

or

sudo apt upgrade -y

to update only the driver.

Use the following command to check that ROCm TensorFlow sees the Radeon VII.

python3 -c "from tensorflow.python.client import device_lib; device_lib.list_local_devices()"

ImportError: /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so: undefined symbol: hipModuleGetGlobal

This error appeared, so we removed the package with pip3 uninstall tensorflow-rocm and reinstalled it with pip3 install tensorflow-rocm==1.12.0 -U.

ImportError: /usr/local/lib/python3.5/dist-packages/tensorflow/python/../libtensorflow_framework.so: undefined symbol: hipModuleGetGlobal

The same error appeared, so we suspected that a bug or something similar had crept into the HIP layer. We then installed the latest version with

pip3 install tensorflow-rocm -U

which installed tensorflow-rocm 1.13.1.
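
The import failure above can also be detected from a script instead of by eye. The following is a minimal sketch, not part of the original procedure: the helper names `missing_symbol` and `check_import` are hypothetical, and the regular expression only assumes the "undefined symbol: NAME" message format quoted in the ImportError above.

```python
import importlib
import re

# Pattern for the dynamic-linker message quoted in the ImportError above.
UNDEFINED_SYMBOL = re.compile(r"undefined symbol: (\S+)")


def missing_symbol(message):
    """Return the symbol named in an 'undefined symbol' message, or None."""
    match = UNDEFINED_SYMBOL.search(message)
    return match.group(1) if match else None


def check_import(module_name):
    """Import a module by name; return (ok, missing_symbol_or_None)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        return False, missing_symbol(str(exc))
    return True, None


# The message text is copied from the error shown in this post.
message = (
    "/usr/local/lib/python3.5/dist-packages/tensorflow/python/../"
    "libtensorflow_framework.so: undefined symbol: hipModuleGetGlobal"
)
print(missing_symbol(message))  # hipModuleGetGlobal
```

Calling `check_import("tensorflow")` after each pip3 step would then report whether a symbol is still missing, without having to read the traceback.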

johndoe@thiguhag:~$ python3 -c "from tensorflow.python.client import device_lib; device_lib.list_local_devices()"
2019-04-15 23:10:40.484698: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-15 23:10:40.485199: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties:
name: Vega 20
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:03:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB
2019-04-15 23:10:40.485213: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-04-15 23:10:40.485374: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-15 23:10:40.485391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059] 0
2019-04-15 23:10:40.485395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0: N
2019-04-15 23:10:40.485421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:03:00.0)

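The device was recognized without problems. The officially specified version is 1.11, but we go ahead and try running BERT anyway.

Download the GLUE datasets with the download script, and download the BERT model checkpoint file.
Point the run at the extracted directories and launch it with python3.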

git clone https://github.com/google-research/bert.git && cd bert

wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python3 download_glue_data.py

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip

export BERT_BASE_DIR=uncased_L-12_H-768_A-12
export GLUE_DIR=glue_data
python3 run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/

Result

.
.
.
INFO:tensorflow:Graph was finalized.
2019-04-15 22:24:02.703807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1642] Adding visible gpu devices: 0
2019-04-15 22:24:02.703846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-15 22:24:02.703851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1059] 0
2019-04-15 22:24:02.703855: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1072] 0: N
2019-04-15 22:24:02.703874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1189] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 20, pci bus id: 0000:03:00.0)
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from /tmp/mrpc_output/model.ckpt-343
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-04-15-13:24:06
INFO:tensorflow:Saving dict for global step 343: eval_accuracy = 0.8627451, eval_loss = 0.3959406, global_step = 343, loss = 0.3959406
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 343: /tmp/mrpc_output/model.ckpt-343
INFO:tensorflow:evaluation_loop marked as finished
INFO:tensorflow:**** Eval results ****
INFO:tensorflow: eval_accuracy = 0.8627451
INFO:tensorflow: eval_loss = 0.3959406
INFO:tensorflow: global_step = 343
INFO:tensorflow: loss = 0.3959406
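
The global_step = 343 and eval_accuracy = 0.8627451 values in the log can be cross-checked against the flags passed to run_classifier.py. The sketch below assumes the step count is the number of training examples divided by the batch size, times the number of epochs, truncated to an integer. The MRPC split sizes (3,668 training and 408 dev examples) are assumptions taken from the GLUE benchmark, not from this post.

```python
def num_train_steps(num_examples, batch_size, num_epochs):
    """Total optimizer steps: examples / batch size * epochs, truncated."""
    return int(num_examples / batch_size * num_epochs)


# Assumption: sizes of the MRPC training and dev splits in GLUE.
# Neither number appears in the post itself.
MRPC_TRAIN_EXAMPLES = 3668
MRPC_DEV_EXAMPLES = 408

# --train_batch_size=32 and --num_train_epochs=3.0 come from the command above.
steps = num_train_steps(MRPC_TRAIN_EXAMPLES, batch_size=32, num_epochs=3.0)
print(steps)  # 343

# Number of correctly classified dev pairs implied by the reported accuracy.
correct = round(0.8627451 * MRPC_DEV_EXAMPLES)
print(correct)  # 352
```

Under these assumptions the run stops at step 343, which matches the checkpoint name model.ckpt-343, and the reported accuracy corresponds to 352 of 408 dev pairs.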

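We were able to run it without problems.
However, when the TensorFlow version shifts, various Keras-based programs that had been running until now stop working, so there are side effects.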

References


GPU EATER - AMD GPU-based Deep Learning Cloud


CycleGAN on AMD RadeonGPU

Introduction

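While testing several kinds of representations in order to use GANs in a semi-supervised setting, we found the CycleGAN technique interesting, so we experimented to see whether it actually runs on an AMD Radeon GPU.

Result

The video version is here.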
http://install.aieater.com/gpueater/media/promotion_videos/promotion_video_20190331.mp4

CycleGAN

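CycleGAN is a domain translation technique. It resembles the style conversion well known from pix2pix, but the major difference is that it drastically converts an image into a completely different domain.
Of course, style transfer is also easy with CycleGAN, so if you prepare a base content image and a style image you can do the same thing.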

Perhaps the most interesting work built on CycleGAN is the CycleGAN Face-off paper, which replaces the face itself with that of a different person. [Xiaohan Jin, Ye Qi, Shangxuan Wu]
https://arxiv.org/abs/1712.03451

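With this, it becomes possible to impersonate the face of President Obama or President Trump.
However, if the domains are too far apart, the resulting images break down instead, so the further apart the domains are, the harder the tuning becomes.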

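The architecture replaces the Generator part of a conventional DCGAN with pix2pix (an image autoencoder) and leaves the Discriminator as is.
Two such networks are built: one converts domain A to domain B and the other converts domain B to domain A. Training applies fake detection while sending images around this loop (the cycle).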
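
The round trip between the two domains can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not say which loss it used, so the L1 cycle-consistency term here follows the original CycleGAN formulation. The toy functions `g_ab` and `g_ba` are hypothetical stand-ins for the two generator networks, and the adversarial (Discriminator) term is left out.

```python
def l1_distance(x, y):
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def cycle_consistency_loss(g_ab, g_ba, real_a, real_b):
    """L1 reconstruction error after the A -> B -> A and B -> A -> B round trips."""
    reconstructed_a = g_ba(g_ab(real_a))
    reconstructed_b = g_ab(g_ba(real_b))
    return l1_distance(reconstructed_a, real_a) + l1_distance(reconstructed_b, real_b)


# Hypothetical toy generators; the real ones are convolutional networks.
def g_ab(image):
    """'Domain A -> domain B': brighten every pixel."""
    return [p + 0.5 for p in image]


def g_ba(image):
    """'Domain B -> domain A': darken every pixel, deliberately imperfect."""
    return [p - 0.4 for p in image]


real_a = [0.1, 0.2, 0.3]
real_b = [0.6, 0.7, 0.8]
loss = cycle_consistency_loss(g_ab, g_ba, real_a, real_b)
print(round(loss, 6))  # 0.2
```

Each round trip leaves every pixel off by 0.1, so the two reconstruction errors add up to about 0.2. A pair of generators that invert each other exactly would drive this term to zero, which is what the cycle part of training pushes toward.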

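Training on domains as far apart as anime actually produces results like those shown above.
We used about 10,000 images for domain A and about 10,000 for domain B.
From left to right are the original image, 10 epochs, 50 epochs, and 200 epochs; the conversion gets stronger toward the right.
Beyond 200 epochs there is not much further change.

Reaching 200 epochs takes roughly 10 hours on either a GeForce RTX 2080 Ti or a Radeon VII.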


How to setup Caffe on AMD Radeon GPU

Introduction

PyTorch/Caffe2-based code is the current state of the art, but some sources on GitHub are still based on the old Caffe, so this post describes how to set up hipCaffe (ROCm-Caffe) on Ubuntu 16.04 + ROCm.

Installation

Requirements

  • Ubuntu 16.04
  • The full set of basic packages needed for building
  • ROCm driver
  • MIOpen library (CUDA simulation layer)
  • OpenCV2 or OpenCV3
  • hipCaffe

The following script compiles everything with a single command.

curl -sL http://install.aieater.com/setup_rocm_caffe | bash -

Compilation takes place in the following directory.
~/src/hipCaffe/

Script breakdown

sudo apt-get install -y \
g++-multilib \
libunwind-dev \
git \
cmake cmake-curses-gui \
vim \
emacs-nox \
curl \
wget \
rpm \
unzip \
bc


sudo apt-get install -y rocm
sudo apt-get install -y rocm-libs
sudo apt-get install -y miopen-hip miopengemm


sudo apt-get install -y \
pkg-config \
protobuf-compiler \
libprotobuf-dev \
libleveldb-dev \
libsnappy-dev \
libhdf5-serial-dev \
libatlas-base-dev \
libboost-all-dev \
libgflags-dev \
libgoogle-glog-dev \
liblmdb-dev \
libfftw3-dev \
libelf-dev

sudo pip3 install scikit-image scipy pyyaml protobuf

curl -sL http://install.aieater.com/setup_opencv | bash -


mkdir -p ~/src
cd ~/src
git clone https://github.com/ROCmSoftwarePlatform/hipCaffe.git
cd hipCaffe
cp ./Makefile.config.example ./Makefile.config
export USE_PKG_CONFIG=1
make -j$(nproc)

Verification

Verification with MNIST

cd ~/src/hipCaffe/
./data/mnist/get_mnist.sh
./examples/mnist/create_mnist.sh
./examples/mnist/train_lenet.sh

Verification with CIFAR10

cd ~/src/hipCaffe/
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt
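
The training log for this command prints a "Top shape" and a running "Memory required for data" total for each layer. The sketch below reproduces the first few of those numbers from the layer parameters. It rests on two assumptions: every blob value takes 4 bytes (32-bit floats), and Caffe pooling rounds the output size up where convolution rounds down. The helper names are hypothetical.

```python
import math

BYTES_PER_VALUE = 4  # assumption: blobs hold 32-bit floats


def conv_out(size, kernel, pad, stride):
    """Spatial output size of a convolution (rounds down)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of an unpadded pooling layer (rounds up)."""
    return math.ceil((size - kernel) / stride) + 1


def blob_count(*dims):
    """Number of values in a blob with the given dimensions."""
    count = 1
    for dim in dims:
        count *= dim
    return count


batch = 100
memory = 0
report = []

# Data layer: one image blob (100 x 3 x 32 x 32) and one label blob (100).
memory += (blob_count(batch, 3, 32, 32) + blob_count(batch)) * BYTES_PER_VALUE
report.append(("cifar", memory))

# conv1: 32 filters, kernel 5, pad 2, stride 1 keeps the 32 x 32 size.
side = conv_out(32, kernel=5, pad=2, stride=1)
memory += blob_count(batch, 32, side, side) * BYTES_PER_VALUE
report.append(("conv1", memory))

# pool1: kernel 3, stride 2 halves the size to 16 x 16.
side = pool_out(side, kernel=3, stride=2)
memory += blob_count(batch, 32, side, side) * BYTES_PER_VALUE
report.append(("pool1", memory))

for name, total in report:
    print(name, total)
# cifar 1229200
# conv1 14336400
# pool1 17613200
```

These three totals agree with the "Memory required for data" lines the log reports after setting up cifar, conv1, and pool1.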
johndoe@thiguhag:~/src/hipCaffe$ ./build/tools/caffe train --solver=examples/cifar10/cifar10_quick_solver.prototxt
I0331 11:06:32.843717 24302 caffe.cpp:217] Using GPUs 0
I0331 11:06:32.843881 24302 caffe.cpp:222] GPU 0: Vega 20
I0331 11:06:32.847487 24302 solver.cpp:48] Initializing solver from parameters:
test_iter: 100
test_interval: 500
base_lr: 0.001
display: 100
max_iter: 4000
lr_policy: "fixed"
momentum: 0.9
weight_decay: 0.004
snapshot: 4000
snapshot_prefix: "examples/cifar10/cifar10_quick"
solver_mode: GPU
device_id: 0
net: "examples/cifar10/cifar10_quick_train_test.prototxt"
train_state {
level: 0
stage: ""
}
snapshot_format: HDF5
I0331 11:06:32.847564 24302 solver.cpp:91] Creating training net from net file: examples/cifar10/cifar10_quick_train_test.prototxt
I0331 11:06:32.847661 24302 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer cifar
I0331 11:06:32.847671 24302 net.cpp:322] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0331 11:06:32.847679 24302 net.cpp:58] Initializing net from parameters:
name: "CIFAR10_quick"
state {
phase: TRAIN
level: 0
stage: ""
}
layer {
name: "cifar"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mean_file: "examples/cifar10/mean.binaryproto"
}
data_param {
source: "examples/cifar10/cifar10_train_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.0001
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "pool1"
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "conv3"
type: "Convolution"
bottom: "pool2"
top: "conv3"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 64
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu3"
type: "ReLU"
bottom: "conv3"
top: "conv3"
}
layer {
name: "pool3"
type: "Pooling"
bottom: "conv3"
top: "pool3"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool3"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 64
weight_filler {
type: "gaussian"
std: 0.1
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "gaussian"
std: 0.1
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0331 11:06:32.847806 24302 layer_factory.hpp:77] Creating layer cifar
I0331 11:06:32.847857 24302 internal_thread.cpp:23] Starting internal thread on device 0
I0331 11:06:32.847921 24302 net.cpp:100] Creating Layer cifar
I0331 11:06:32.847929 24302 net.cpp:408] cifar -> data
I0331 11:06:32.847935 24302 net.cpp:408] cifar -> label
I0331 11:06:32.847940 24306 internal_thread.cpp:40] Started internal thread on device 0
I0331 11:06:32.847949 24302 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0331 11:06:32.853466 24306 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_train_lmdb
I0331 11:06:32.853821 24302 data_layer.cpp:41] output data size: 100,3,32,32
I0331 11:06:32.856909 24302 internal_thread.cpp:23] Starting internal thread on device 0
I0331 11:06:32.857230 24302 net.cpp:150] Setting up cifar
I0331 11:06:32.857236 24302 net.cpp:157] Top shape: 100 3 32 32 (307200)
I0331 11:06:32.857249 24302 net.cpp:157] Top shape: 100 (100)
I0331 11:06:32.857254 24302 net.cpp:165] Memory required for data: 1229200
I0331 11:06:32.857259 24302 layer_factory.hpp:77] Creating layer conv1
I0331 11:06:32.857255 24307 internal_thread.cpp:40] Started internal thread on device 0
I0331 11:06:32.857275 24302 net.cpp:100] Creating Layer conv1
I0331 11:06:32.857280 24302 net.cpp:434] conv1 <- data
I0331 11:06:32.857285 24302 net.cpp:408] conv1 -> conv1
I0331 11:06:33.909878 24302 net.cpp:150] Setting up conv1
I0331 11:06:33.909896 24302 net.cpp:157] Top shape: 100 32 32 32 (3276800)
I0331 11:06:33.909901 24302 net.cpp:165] Memory required for data: 14336400
I0331 11:06:33.909910 24302 layer_factory.hpp:77] Creating layer pool1
I0331 11:06:33.909919 24302 net.cpp:100] Creating Layer pool1
I0331 11:06:33.909924 24302 net.cpp:434] pool1 <- conv1
I0331 11:06:33.909932 24302 net.cpp:408] pool1 -> pool1
I0331 11:06:33.913565 24302 net.cpp:150] Setting up pool1
I0331 11:06:33.913594 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:33.913609 24302 net.cpp:165] Memory required for data: 17613200
I0331 11:06:33.913619 24302 layer_factory.hpp:77] Creating layer relu1
I0331 11:06:33.913635 24302 net.cpp:100] Creating Layer relu1
I0331 11:06:33.913653 24302 net.cpp:434] relu1 <- pool1
I0331 11:06:33.913669 24302 net.cpp:395] relu1 -> pool1 (in-place)
I0331 11:06:33.916409 24302 net.cpp:150] Setting up relu1
I0331 11:06:33.916424 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:33.916435 24302 net.cpp:165] Memory required for data: 20890000
I0331 11:06:33.916441 24302 layer_factory.hpp:77] Creating layer conv2
I0331 11:06:33.916460 24302 net.cpp:100] Creating Layer conv2
I0331 11:06:33.916467 24302 net.cpp:434] conv2 <- pool1
I0331 11:06:33.916482 24302 net.cpp:408] conv2 -> conv2
I0331 11:06:34.069100 24302 net.cpp:150] Setting up conv2
I0331 11:06:34.069113 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.069116 24302 net.cpp:165] Memory required for data: 24166800
I0331 11:06:34.069123 24302 layer_factory.hpp:77] Creating layer relu2
I0331 11:06:34.069130 24302 net.cpp:100] Creating Layer relu2
I0331 11:06:34.069133 24302 net.cpp:434] relu2 <- conv2
I0331 11:06:34.069137 24302 net.cpp:395] relu2 -> conv2 (in-place)
I0331 11:06:34.071858 24302 net.cpp:150] Setting up relu2
I0331 11:06:34.071878 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.071892 24302 net.cpp:165] Memory required for data: 27443600
I0331 11:06:34.071904 24302 layer_factory.hpp:77] Creating layer pool2
I0331 11:06:34.071923 24302 net.cpp:100] Creating Layer pool2
I0331 11:06:34.071934 24302 net.cpp:434] pool2 <- conv2
I0331 11:06:34.071949 24302 net.cpp:408] pool2 -> pool2
I0331 11:06:34.074851 24302 net.cpp:150] Setting up pool2
I0331 11:06:34.074873 24302 net.cpp:157] Top shape: 100 32 8 8 (204800)
I0331 11:06:34.074887 24302 net.cpp:165] Memory required for data: 28262800
I0331 11:06:34.074895 24302 layer_factory.hpp:77] Creating layer conv3
I0331 11:06:34.074916 24302 net.cpp:100] Creating Layer conv3
I0331 11:06:34.074928 24302 net.cpp:434] conv3 <- pool2
I0331 11:06:34.074944 24302 net.cpp:408] conv3 -> conv3
I0331 11:06:34.229825 24302 net.cpp:150] Setting up conv3
I0331 11:06:34.229837 24302 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0331 11:06:34.229842 24302 net.cpp:165] Memory required for data: 29901200
I0331 11:06:34.229849 24302 layer_factory.hpp:77] Creating layer relu3
I0331 11:06:34.229856 24302 net.cpp:100] Creating Layer relu3
I0331 11:06:34.229859 24302 net.cpp:434] relu3 <- conv3
I0331 11:06:34.229863 24302 net.cpp:395] relu3 -> conv3 (in-place)
I0331 11:06:34.233310 24302 net.cpp:150] Setting up relu3
I0331 11:06:34.233340 24302 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0331 11:06:34.233355 24302 net.cpp:165] Memory required for data: 31539600
I0331 11:06:34.233366 24302 layer_factory.hpp:77] Creating layer pool3
I0331 11:06:34.233382 24302 net.cpp:100] Creating Layer pool3
I0331 11:06:34.233397 24302 net.cpp:434] pool3 <- conv3
I0331 11:06:34.233412 24302 net.cpp:408] pool3 -> pool3
I0331 11:06:34.236271 24302 net.cpp:150] Setting up pool3
I0331 11:06:34.236287 24302 net.cpp:157] Top shape: 100 64 4 4 (102400)
I0331 11:06:34.236297 24302 net.cpp:165] Memory required for data: 31949200
I0331 11:06:34.236304 24302 layer_factory.hpp:77] Creating layer ip1
I0331 11:06:34.236325 24302 net.cpp:100] Creating Layer ip1
I0331 11:06:34.236336 24302 net.cpp:434] ip1 <- pool3
I0331 11:06:34.236348 24302 net.cpp:408] ip1 -> ip1
I0331 11:06:34.238878 24302 net.cpp:150] Setting up ip1
I0331 11:06:34.238896 24302 net.cpp:157] Top shape: 100 64 (6400)
I0331 11:06:34.238907 24302 net.cpp:165] Memory required for data: 31974800
I0331 11:06:34.238921 24302 layer_factory.hpp:77] Creating layer ip2
I0331 11:06:34.238935 24302 net.cpp:100] Creating Layer ip2
I0331 11:06:34.238945 24302 net.cpp:434] ip2 <- ip1
I0331 11:06:34.238955 24302 net.cpp:408] ip2 -> ip2
I0331 11:06:34.239763 24302 net.cpp:150] Setting up ip2
I0331 11:06:34.239779 24302 net.cpp:157] Top shape: 100 10 (1000)
I0331 11:06:34.239790 24302 net.cpp:165] Memory required for data: 31978800
I0331 11:06:34.239805 24302 layer_factory.hpp:77] Creating layer loss
I0331 11:06:34.239823 24302 net.cpp:100] Creating Layer loss
I0331 11:06:34.239831 24302 net.cpp:434] loss <- ip2
I0331 11:06:34.239841 24302 net.cpp:434] loss <- label
I0331 11:06:34.239851 24302 net.cpp:408] loss -> loss
I0331 11:06:34.239881 24302 layer_factory.hpp:77] Creating layer loss
I0331 11:06:34.242909 24302 net.cpp:150] Setting up loss
I0331 11:06:34.242923 24302 net.cpp:157] Top shape: (1)
I0331 11:06:34.242931 24302 net.cpp:160] with loss weight 1
I0331 11:06:34.242945 24302 net.cpp:165] Memory required for data: 31978804
I0331 11:06:34.242951 24302 net.cpp:226] loss needs backward computation.
I0331 11:06:34.242960 24302 net.cpp:226] ip2 needs backward computation.
I0331 11:06:34.242966 24302 net.cpp:226] ip1 needs backward computation.
I0331 11:06:34.242972 24302 net.cpp:226] pool3 needs backward computation.
I0331 11:06:34.242978 24302 net.cpp:226] relu3 needs backward computation.
I0331 11:06:34.242985 24302 net.cpp:226] conv3 needs backward computation.
I0331 11:06:34.242990 24302 net.cpp:226] pool2 needs backward computation.
I0331 11:06:34.242997 24302 net.cpp:226] relu2 needs backward computation.
I0331 11:06:34.243002 24302 net.cpp:226] conv2 needs backward computation.
I0331 11:06:34.243010 24302 net.cpp:226] relu1 needs backward computation.
I0331 11:06:34.243016 24302 net.cpp:226] pool1 needs backward computation.
I0331 11:06:34.243022 24302 net.cpp:226] conv1 needs backward computation.
I0331 11:06:34.243029 24302 net.cpp:228] cifar does not need backward computation.
I0331 11:06:34.243036 24302 net.cpp:270] This network produces output loss
I0331 11:06:34.243049 24302 net.cpp:283] Network initialization done.
I0331 11:06:34.243284 24302 solver.cpp:181] Creating test net (#0) specified by net file: examples/cifar10/cifar10_quick_train_test.prototxt
I0331 11:06:34.243317 24302 net.cpp:322] The NetState phase (1) differed from the phase (0) specified by a rule in layer cifar
I0331 11:06:34.243336 24302 net.cpp:58] Initializing net from parameters:
name: "CIFAR10_quick"
state {
phase: TEST
}
layer {
name: "cifar"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
mean_file: "examples/cifar10/mean.binaryproto"
}
data_param {
source: "examples/cifar10/cifar10_test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.0001
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "pool1"
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "conv3"
type: "Convolution"
bottom: "pool2"
top: "conv3"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 64
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu3"
type: "ReLU"
bottom: "conv3"
top: "conv3"
}
layer {
name: "pool3"
type: "Pooling"
bottom: "conv3"
top: "pool3"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool3"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 64
weight_filler {
type: "gaussian"
std: 0.1
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "gaussian"
std: 0.1
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
I0331 11:06:34.243697 24302 layer_factory.hpp:77] Creating layer cifar
I0331 11:06:34.243782 24302 internal_thread.cpp:23] Starting internal thread on device 0
I0331 11:06:34.243839 24302 net.cpp:100] Creating Layer cifar
I0331 11:06:34.243852 24302 net.cpp:408] cifar -> data
I0331 11:06:34.243862 24302 net.cpp:408] cifar -> label
I0331 11:06:34.243872 24302 data_transformer.cpp:25] Loading mean file from: examples/cifar10/mean.binaryproto
I0331 11:06:34.243873 24322 internal_thread.cpp:40] Started internal thread on device 0
I0331 11:06:34.251590 24322 db_lmdb.cpp:35] Opened lmdb examples/cifar10/cifar10_test_lmdb
I0331 11:06:34.252020 24302 data_layer.cpp:41] output data size: 100,3,32,32
I0331 11:06:34.255237 24302 internal_thread.cpp:23] Starting internal thread on device 0
I0331 11:06:34.255578 24302 net.cpp:150] Setting up cifar
I0331 11:06:34.255586 24302 net.cpp:157] Top shape: 100 3 32 32 (307200)
I0331 11:06:34.255594 24302 net.cpp:157] Top shape: 100 (100)
I0331 11:06:34.255599 24302 net.cpp:165] Memory required for data: 1229200
I0331 11:06:34.255604 24302 layer_factory.hpp:77] Creating layer label_cifar_1_split
I0331 11:06:34.255617 24302 net.cpp:100] Creating Layer label_cifar_1_split
I0331 11:06:34.255622 24302 net.cpp:434] label_cifar_1_split <- label
I0331 11:06:34.255630 24302 net.cpp:408] label_cifar_1_split -> label_cifar_1_split_0
I0331 11:06:34.255632 24323 internal_thread.cpp:40] Started internal thread on device 0
I0331 11:06:34.255650 24302 net.cpp:408] label_cifar_1_split -> label_cifar_1_split_1
I0331 11:06:34.261567 24302 net.cpp:150] Setting up label_cifar_1_split
I0331 11:06:34.261586 24302 net.cpp:157] Top shape: 100 (100)
I0331 11:06:34.261595 24302 net.cpp:157] Top shape: 100 (100)
I0331 11:06:34.261620 24302 net.cpp:165] Memory required for data: 1230000
I0331 11:06:34.261626 24302 layer_factory.hpp:77] Creating layer conv1
I0331 11:06:34.261641 24302 net.cpp:100] Creating Layer conv1
I0331 11:06:34.261646 24302 net.cpp:434] conv1 <- data
I0331 11:06:34.261651 24302 net.cpp:408] conv1 -> conv1
I0331 11:06:34.405349 24302 net.cpp:150] Setting up conv1
I0331 11:06:34.405360 24302 net.cpp:157] Top shape: 100 32 32 32 (3276800)
I0331 11:06:34.405364 24302 net.cpp:165] Memory required for data: 14337200
I0331 11:06:34.405372 24302 layer_factory.hpp:77] Creating layer pool1
I0331 11:06:34.405380 24302 net.cpp:100] Creating Layer pool1
I0331 11:06:34.405382 24302 net.cpp:434] pool1 <- conv1
I0331 11:06:34.405386 24302 net.cpp:408] pool1 -> pool1
I0331 11:06:34.408463 24302 net.cpp:150] Setting up pool1
I0331 11:06:34.408470 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.408478 24302 net.cpp:165] Memory required for data: 17614000
I0331 11:06:34.408483 24302 layer_factory.hpp:77] Creating layer relu1
I0331 11:06:34.408491 24302 net.cpp:100] Creating Layer relu1
I0331 11:06:34.408496 24302 net.cpp:434] relu1 <- pool1
I0331 11:06:34.408502 24302 net.cpp:395] relu1 -> pool1 (in-place)
I0331 11:06:34.411069 24302 net.cpp:150] Setting up relu1
I0331 11:06:34.411075 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.411082 24302 net.cpp:165] Memory required for data: 20890800
I0331 11:06:34.411087 24302 layer_factory.hpp:77] Creating layer conv2
I0331 11:06:34.411098 24302 net.cpp:100] Creating Layer conv2
I0331 11:06:34.411103 24302 net.cpp:434] conv2 <- pool1
I0331 11:06:34.411110 24302 net.cpp:408] conv2 -> conv2
I0331 11:06:34.555790 24302 net.cpp:150] Setting up conv2
I0331 11:06:34.555802 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.555809 24302 net.cpp:165] Memory required for data: 24167600
I0331 11:06:34.555826 24302 layer_factory.hpp:77] Creating layer relu2
I0331 11:06:34.555835 24302 net.cpp:100] Creating Layer relu2
I0331 11:06:34.555841 24302 net.cpp:434] relu2 <- conv2
I0331 11:06:34.555847 24302 net.cpp:395] relu2 -> conv2 (in-place)
I0331 11:06:34.558418 24302 net.cpp:150] Setting up relu2
I0331 11:06:34.558424 24302 net.cpp:157] Top shape: 100 32 16 16 (819200)
I0331 11:06:34.558430 24302 net.cpp:165] Memory required for data: 27444400
I0331 11:06:34.558435 24302 layer_factory.hpp:77] Creating layer pool2
I0331 11:06:34.558442 24302 net.cpp:100] Creating Layer pool2
I0331 11:06:34.558447 24302 net.cpp:434] pool2 <- conv2
I0331 11:06:34.558454 24302 net.cpp:408] pool2 -> pool2
I0331 11:06:34.561081 24302 net.cpp:150] Setting up pool2
I0331 11:06:34.561089 24302 net.cpp:157] Top shape: 100 32 8 8 (204800)
I0331 11:06:34.561094 24302 net.cpp:165] Memory required for data: 28263600
I0331 11:06:34.561098 24302 layer_factory.hpp:77] Creating layer conv3
I0331 11:06:34.561111 24302 net.cpp:100] Creating Layer conv3
I0331 11:06:34.561116 24302 net.cpp:434] conv3 <- pool2
I0331 11:06:34.561122 24302 net.cpp:408] conv3 -> conv3
I0331 11:06:34.707540 24302 net.cpp:150] Setting up conv3
I0331 11:06:34.707552 24302 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0331 11:06:34.707562 24302 net.cpp:165] Memory required for data: 29902000
I0331 11:06:34.707576 24302 layer_factory.hpp:77] Creating layer relu3
I0331 11:06:34.707585 24302 net.cpp:100] Creating Layer relu3
I0331 11:06:34.707590 24302 net.cpp:434] relu3 <- conv3
I0331 11:06:34.707597 24302 net.cpp:395] relu3 -> conv3 (in-place)
I0331 11:06:34.710366 24302 net.cpp:150] Setting up relu3
I0331 11:06:34.710372 24302 net.cpp:157] Top shape: 100 64 8 8 (409600)
I0331 11:06:34.710378 24302 net.cpp:165] Memory required for data: 31540400
I0331 11:06:34.710382 24302 layer_factory.hpp:77] Creating layer pool3
I0331 11:06:34.710391 24302 net.cpp:100] Creating Layer pool3
I0331 11:06:34.710395 24302 net.cpp:434] pool3 <- conv3
I0331 11:06:34.710402 24302 net.cpp:408] pool3 -> pool3
I0331 11:06:34.713034 24302 net.cpp:150] Setting up pool3
I0331 11:06:34.713042 24302 net.cpp:157] Top shape: 100 64 4 4 (102400)
I0331 11:06:34.713068 24302 net.cpp:165] Memory required for data: 31950000
I0331 11:06:34.713074 24302 layer_factory.hpp:77] Creating layer ip1
I0331 11:06:34.713083 24302 net.cpp:100] Creating Layer ip1
I0331 11:06:34.713088 24302 net.cpp:434] ip1 <- pool3
I0331 11:06:34.713095 24302 net.cpp:408] ip1 -> ip1
I0331 11:06:34.714015 24302 net.cpp:150] Setting up ip1
I0331 11:06:34.714020 24302 net.cpp:157] Top shape: 100 64 (6400)
I0331 11:06:34.714026 24302 net.cpp:165] Memory required for data: 31975600
I0331 11:06:34.714033 24302 layer_factory.hpp:77] Creating layer ip2
I0331 11:06:34.714041 24302 net.cpp:100] Creating Layer ip2
I0331 11:06:34.714046 24302 net.cpp:434] ip2 <- ip1
I0331 11:06:34.714053 24302 net.cpp:408] ip2 -> ip2
I0331 11:06:34.714442 24302 net.cpp:150] Setting up ip2
I0331 11:06:34.714448 24302 net.cpp:157] Top shape: 100 10 (1000)
I0331 11:06:34.714453 24302 net.cpp:165] Memory required for data: 31979600
I0331 11:06:34.714462 24302 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I0331 11:06:34.714468 24302 net.cpp:100] Creating Layer ip2_ip2_0_split
I0331 11:06:34.714474 24302 net.cpp:434] ip2_ip2_0_split <- ip2
I0331 11:06:34.714480 24302 net.cpp:408] ip2_ip2_0_split -> ip2_ip2_0_split_0
I0331 11:06:34.714489 24302 net.cpp:408] ip2_ip2_0_split -> ip2_ip2_0_split_1
I0331 11:06:34.714628 24302 net.cpp:150] Setting up ip2_ip2_0_split
I0331 11:06:34.714634 24302 net.cpp:157] Top shape: 100 10 (1000)
I0331 11:06:34.714639 24302 net.cpp:157] Top shape: 100 10 (1000)
I0331 11:06:34.714644 24302 net.cpp:165] Memory required for data: 31987600
I0331 11:06:34.714650 24302 layer_factory.hpp:77] Creating layer accuracy
I0331 11:06:34.714658 24302 net.cpp:100] Creating Layer accuracy
I0331 11:06:34.714664 24302 net.cpp:434] accuracy <- ip2_ip2_0_split_0
I0331 11:06:34.714669 24302 net.cpp:434] accuracy <- label_cifar_1_split_0
I0331 11:06:34.714676 24302 net.cpp:408] accuracy -> accuracy
I0331 11:06:34.714685 24302 net.cpp:150] Setting up accuracy
I0331 11:06:34.714690 24302 net.cpp:157] Top shape: (1)
I0331 11:06:34.714695 24302 net.cpp:165] Memory required for data: 31987604
I0331 11:06:34.714699 24302 layer_factory.hpp:77] Creating layer loss
I0331 11:06:34.714705 24302 net.cpp:100] Creating Layer loss
I0331 11:06:34.714710 24302 net.cpp:434] loss <- ip2_ip2_0_split_1
I0331 11:06:34.714715 24302 net.cpp:434] loss <- label_cifar_1_split_1
I0331 11:06:34.714721 24302 net.cpp:408] loss -> loss
I0331 11:06:34.714730 24302 layer_factory.hpp:77] Creating layer loss
I0331 11:06:34.717491 24302 net.cpp:150] Setting up loss
I0331 11:06:34.717497 24302 net.cpp:157] Top shape: (1)
I0331 11:06:34.717502 24302 net.cpp:160] with loss weight 1
I0331 11:06:34.717511 24302 net.cpp:165] Memory required for data: 31987608
I0331 11:06:34.717516 24302 net.cpp:226] loss needs backward computation.
I0331 11:06:34.717522 24302 net.cpp:228] accuracy does not need backward computation.
I0331 11:06:34.717527 24302 net.cpp:226] ip2_ip2_0_split needs backward computation.
I0331 11:06:34.717532 24302 net.cpp:226] ip2 needs backward computation.
I0331 11:06:34.717537 24302 net.cpp:226] ip1 needs backward computation.
I0331 11:06:34.717542 24302 net.cpp:226] pool3 needs backward computation.
I0331 11:06:34.717547 24302 net.cpp:226] relu3 needs backward computation.
I0331 11:06:34.717551 24302 net.cpp:226] conv3 needs backward computation.
I0331 11:06:34.717556 24302 net.cpp:226] pool2 needs backward computation.
I0331 11:06:34.717561 24302 net.cpp:226] relu2 needs backward computation.
I0331 11:06:34.717566 24302 net.cpp:226] conv2 needs backward computation.
I0331 11:06:34.717571 24302 net.cpp:226] relu1 needs backward computation.
I0331 11:06:34.717576 24302 net.cpp:226] pool1 needs backward computation.
I0331 11:06:34.717579 24302 net.cpp:226] conv1 needs backward computation.
I0331 11:06:34.717584 24302 net.cpp:228] label_cifar_1_split does not need backward computation.
I0331 11:06:34.717591 24302 net.cpp:228] cifar does not need backward computation.
I0331 11:06:34.717595 24302 net.cpp:270] This network produces output accuracy
I0331 11:06:34.717609 24302 net.cpp:270] This network produces output loss
I0331 11:06:34.717622 24302 net.cpp:283] Network initialization done.
I0331 11:06:34.717659 24302 solver.cpp:60] Solver scaffolding done.
I0331 11:06:34.719099 24302 caffe.cpp:251] Starting Optimization
I0331 11:06:34.719105 24302 solver.cpp:279] Solving CIFAR10_quick
I0331 11:06:34.719110 24302 solver.cpp:280] Learning Rate Policy: fixed
I0331 11:06:34.719523 24302 solver.cpp:337] Iteration 0, Testing net (#0)
I0331 11:06:34.916335 24302 solver.cpp:404] Test net output #0: accuracy = 0.0963
I0331 11:06:34.916357 24302 solver.cpp:404] Test net output #1: loss = 2.3027 (* 1 = 2.3027 loss)
I0331 11:06:34.923990 24302 solver.cpp:228] Iteration 0, loss = 2.30337
I0331 11:06:34.924002 24302 solver.cpp:244] Train net output #0: loss = 2.30337 (* 1 = 2.30337 loss)
I0331 11:06:34.924007 24302 sgd_solver.cpp:106] Iteration 0, lr = 0.001
I0331 11:06:35.254720 24302 solver.cpp:228] Iteration 100, loss = 1.69312
I0331 11:06:35.254740 24302 solver.cpp:244] Train net output #0: loss = 1.69312 (* 1 = 1.69312 loss)
I0331 11:06:35.254745 24302 sgd_solver.cpp:106] Iteration 100, lr = 0.001
I0331 11:06:35.590265 24302 solver.cpp:228] Iteration 200, loss = 1.70744
I0331 11:06:35.590287 24302 solver.cpp:244] Train net output #0: loss = 1.70743 (* 1 = 1.70743 loss)
I0331 11:06:35.590291 24302 sgd_solver.cpp:106] Iteration 200, lr = 0.001
I0331 11:06:35.923384 24302 solver.cpp:228] Iteration 300, loss = 1.26461
I0331 11:06:35.923406 24302 solver.cpp:244] Train net output #0: loss = 1.26461 (* 1 = 1.26461 loss)
I0331 11:06:35.923410 24302 sgd_solver.cpp:106] Iteration 300, lr = 0.001
I0331 11:06:36.248895 24302 solver.cpp:228] Iteration 400, loss = 1.36565
I0331 11:06:36.248917 24302 solver.cpp:244] Train net output #0: loss = 1.36565 (* 1 = 1.36565 loss)
I0331 11:06:36.248922 24302 sgd_solver.cpp:106] Iteration 400, lr = 0.001
I0331 11:06:36.579169 24302 solver.cpp:337] Iteration 500, Testing net (#0)
I0331 11:06:36.751092 24302 solver.cpp:404] Test net output #0: accuracy = 0.5509
I0331 11:06:36.751114 24302 solver.cpp:404] Test net output #1: loss = 1.27784 (* 1 = 1.27784 loss)
I0331 11:06:36.754051 24302 solver.cpp:228] Iteration 500, loss = 1.25145
I0331 11:06:36.754062 24302 solver.cpp:244] Train net output #0: loss = 1.25144 (* 1 = 1.25144 loss)
I0331 11:06:36.754066 24302 sgd_solver.cpp:106] Iteration 500, lr = 0.001
I0331 11:06:37.088784 24302 solver.cpp:228] Iteration 600, loss = 1.25302
I0331 11:06:37.088806 24302 solver.cpp:244] Train net output #0: loss = 1.25302 (* 1 = 1.25302 loss)
I0331 11:06:37.088811 24302 sgd_solver.cpp:106] Iteration 600, lr = 0.001
I0331 11:06:37.420756 24302 solver.cpp:228] Iteration 700, loss = 1.2073
I0331 11:06:37.420778 24302 solver.cpp:244] Train net output #0: loss = 1.20729 (* 1 = 1.20729 loss)
I0331 11:06:37.420784 24302 sgd_solver.cpp:106] Iteration 700, lr = 0.001
I0331 11:06:37.749595 24302 solver.cpp:228] Iteration 800, loss = 1.08577
I0331 11:06:37.749617 24302 solver.cpp:244] Train net output #0: loss = 1.08576 (* 1 = 1.08576 loss)
I0331 11:06:37.749622 24302 sgd_solver.cpp:106] Iteration 800, lr = 0.001
I0331 11:06:38.087330 24302 solver.cpp:228] Iteration 900, loss = 0.979316
I0331 11:06:38.087352 24302 solver.cpp:244] Train net output #0: loss = 0.979311 (* 1 = 0.979311 loss)
I0331 11:06:38.087357 24302 sgd_solver.cpp:106] Iteration 900, lr = 0.001
I0331 11:06:38.420168 24302 solver.cpp:337] Iteration 1000, Testing net (#0)
I0331 11:06:38.593856 24302 solver.cpp:404] Test net output #0: accuracy = 0.6054
I0331 11:06:38.593878 24302 solver.cpp:404] Test net output #1: loss = 1.1508 (* 1 = 1.1508 loss)
I0331 11:06:38.596915 24302 solver.cpp:228] Iteration 1000, loss = 1.03848
I0331 11:06:38.596928 24302 solver.cpp:244] Train net output #0: loss = 1.03847 (* 1 = 1.03847 loss)
I0331 11:06:38.596935 24302 sgd_solver.cpp:106] Iteration 1000, lr = 0.001
I0331 11:06:38.926901 24302 solver.cpp:228] Iteration 1100, loss = 1.06011
I0331 11:06:38.926940 24302 solver.cpp:244] Train net output #0: loss = 1.06011 (* 1 = 1.06011 loss)
I0331 11:06:38.926945 24302 sgd_solver.cpp:106] Iteration 1100, lr = 0.001
I0331 11:06:39.255908 24302 solver.cpp:228] Iteration 1200, loss = 0.967729
I0331 11:06:39.255928 24302 solver.cpp:244] Train net output #0: loss = 0.967723 (* 1 = 0.967723 loss)
I0331 11:06:39.255934 24302 sgd_solver.cpp:106] Iteration 1200, lr = 0.001
I0331 11:06:39.587491 24302 solver.cpp:228] Iteration 1300, loss = 0.873639
I0331 11:06:39.587512 24302 solver.cpp:244] Train net output #0: loss = 0.873634 (* 1 = 0.873634 loss)
I0331 11:06:39.587517 24302 sgd_solver.cpp:106] Iteration 1300, lr = 0.001
I0331 11:06:39.916858 24302 solver.cpp:228] Iteration 1400, loss = 0.822912
I0331 11:06:39.916877 24302 solver.cpp:244] Train net output #0: loss = 0.822906 (* 1 = 0.822906 loss)
I0331 11:06:39.916882 24302 sgd_solver.cpp:106] Iteration 1400, lr = 0.001
I0331 11:06:40.243862 24302 solver.cpp:337] Iteration 1500, Testing net (#0)
I0331 11:06:40.418777 24302 solver.cpp:404] Test net output #0: accuracy = 0.6428
I0331 11:06:40.418798 24302 solver.cpp:404] Test net output #1: loss = 1.03695 (* 1 = 1.03695 loss)
I0331 11:06:40.422040 24302 solver.cpp:228] Iteration 1500, loss = 0.917664
I0331 11:06:40.422051 24302 solver.cpp:244] Train net output #0: loss = 0.917658 (* 1 = 0.917658 loss)
I0331 11:06:40.422056 24302 sgd_solver.cpp:106] Iteration 1500, lr = 0.001
I0331 11:06:40.751153 24302 solver.cpp:228] Iteration 1600, loss = 0.951443
I0331 11:06:40.751173 24302 solver.cpp:244] Train net output #0: loss = 0.951437 (* 1 = 0.951437 loss)
I0331 11:06:40.751178 24302 sgd_solver.cpp:106] Iteration 1600, lr = 0.001
I0331 11:06:41.082576 24302 solver.cpp:228] Iteration 1700, loss = 0.824344
I0331 11:06:41.082597 24302 solver.cpp:244] Train net output #0: loss = 0.824338 (* 1 = 0.824338 loss)
I0331 11:06:41.082602 24302 sgd_solver.cpp:106] Iteration 1700, lr = 0.001
I0331 11:06:41.412235 24302 solver.cpp:228] Iteration 1800, loss = 0.814171
I0331 11:06:41.412256 24302 solver.cpp:244] Train net output #0: loss = 0.814166 (* 1 = 0.814166 loss)
I0331 11:06:41.412261 24302 sgd_solver.cpp:106] Iteration 1800, lr = 0.001
I0331 11:06:41.743070 24302 solver.cpp:228] Iteration 1900, loss = 0.746516
I0331 11:06:41.743091 24302 solver.cpp:244] Train net output #0: loss = 0.74651 (* 1 = 0.74651 loss)
I0331 11:06:41.743096 24302 sgd_solver.cpp:106] Iteration 1900, lr = 0.001
I0331 11:06:42.072156 24302 solver.cpp:337] Iteration 2000, Testing net (#0)
I0331 11:06:42.247408 24302 solver.cpp:404] Test net output #0: accuracy = 0.6774
I0331 11:06:42.247429 24302 solver.cpp:404] Test net output #1: loss = 0.943321 (* 1 = 0.943321 loss)
I0331 11:06:42.250512 24302 solver.cpp:228] Iteration 2000, loss = 0.796382
I0331 11:06:42.250524 24302 solver.cpp:244] Train net output #0: loss = 0.796376 (* 1 = 0.796376 loss)
I0331 11:06:42.250528 24302 sgd_solver.cpp:106] Iteration 2000, lr = 0.001
I0331 11:06:42.582597 24302 solver.cpp:228] Iteration 2100, loss = 0.88666
I0331 11:06:42.582618 24302 solver.cpp:244] Train net output #0: loss = 0.886655 (* 1 = 0.886655 loss)
I0331 11:06:42.582623 24302 sgd_solver.cpp:106] Iteration 2100, lr = 0.001
I0331 11:06:42.911408 24302 solver.cpp:228] Iteration 2200, loss = 0.76949
I0331 11:06:42.911428 24302 solver.cpp:244] Train net output #0: loss = 0.769484 (* 1 = 0.769484 loss)
I0331 11:06:42.911433 24302 sgd_solver.cpp:106] Iteration 2200, lr = 0.001
I0331 11:06:43.243353 24302 solver.cpp:228] Iteration 2300, loss = 0.739908
I0331 11:06:43.243374 24302 solver.cpp:244] Train net output #0: loss = 0.739902 (* 1 = 0.739902 loss)
I0331 11:06:43.243379 24302 sgd_solver.cpp:106] Iteration 2300, lr = 0.001
I0331 11:06:43.573737 24302 solver.cpp:228] Iteration 2400, loss = 0.757853
I0331 11:06:43.573757 24302 solver.cpp:244] Train net output #0: loss = 0.757848 (* 1 = 0.757848 loss)
I0331 11:06:43.573762 24302 sgd_solver.cpp:106] Iteration 2400, lr = 0.001
I0331 11:06:43.900631 24302 solver.cpp:337] Iteration 2500, Testing net (#0)
I0331 11:06:44.083359 24302 solver.cpp:404] Test net output #0: accuracy = 0.6871
I0331 11:06:44.083381 24302 solver.cpp:404] Test net output #1: loss = 0.925407 (* 1 = 0.925407 loss)
I0331 11:06:44.086453 24302 solver.cpp:228] Iteration 2500, loss = 0.748057
I0331 11:06:44.086467 24302 solver.cpp:244] Train net output #0: loss = 0.748051 (* 1 = 0.748051 loss)
I0331 11:06:44.086472 24302 sgd_solver.cpp:106] Iteration 2500, lr = 0.001
I0331 11:06:44.417397 24302 solver.cpp:228] Iteration 2600, loss = 0.827012
I0331 11:06:44.417420 24302 solver.cpp:244] Train net output #0: loss = 0.827007 (* 1 = 0.827007 loss)
I0331 11:06:44.417425 24302 sgd_solver.cpp:106] Iteration 2600, lr = 0.001
I0331 11:06:44.752656 24302 solver.cpp:228] Iteration 2700, loss = 0.765873
I0331 11:06:44.752678 24302 solver.cpp:244] Train net output #0: loss = 0.765867 (* 1 = 0.765867 loss)
I0331 11:06:44.752684 24302 sgd_solver.cpp:106] Iteration 2700, lr = 0.001
I0331 11:06:45.087687 24302 solver.cpp:228] Iteration 2800, loss = 0.689853
I0331 11:06:45.087709 24302 solver.cpp:244] Train net output #0: loss = 0.689848 (* 1 = 0.689848 loss)
I0331 11:06:45.087714 24302 sgd_solver.cpp:106] Iteration 2800, lr = 0.001
I0331 11:06:45.419508 24302 solver.cpp:228] Iteration 2900, loss = 0.726292
I0331 11:06:45.419528 24302 solver.cpp:244] Train net output #0: loss = 0.726286 (* 1 = 0.726286 loss)
I0331 11:06:45.419533 24302 sgd_solver.cpp:106] Iteration 2900, lr = 0.001
I0331 11:06:45.754101 24302 solver.cpp:337] Iteration 3000, Testing net (#0)
I0331 11:06:45.933135 24302 solver.cpp:404] Test net output #0: accuracy = 0.6946
I0331 11:06:45.933156 24302 solver.cpp:404] Test net output #1: loss = 0.898484 (* 1 = 0.898484 loss)
I0331 11:06:45.936143 24302 solver.cpp:228] Iteration 3000, loss = 0.690615
I0331 11:06:45.936157 24302 solver.cpp:244] Train net output #0: loss = 0.69061 (* 1 = 0.69061 loss)
I0331 11:06:45.936163 24302 sgd_solver.cpp:106] Iteration 3000, lr = 0.001
I0331 11:06:46.273326 24302 solver.cpp:228] Iteration 3100, loss = 0.794096
I0331 11:06:46.273349 24302 solver.cpp:244] Train net output #0: loss = 0.79409 (* 1 = 0.79409 loss)
I0331 11:06:46.273353 24302 sgd_solver.cpp:106] Iteration 3100, lr = 0.001
I0331 11:06:46.607097 24302 solver.cpp:228] Iteration 3200, loss = 0.695419
I0331 11:06:46.607121 24302 solver.cpp:244] Train net output #0: loss = 0.695413 (* 1 = 0.695413 loss)
I0331 11:06:46.607129 24302 sgd_solver.cpp:106] Iteration 3200, lr = 0.001
I0331 11:06:46.940327 24302 solver.cpp:228] Iteration 3300, loss = 0.636181
I0331 11:06:46.940351 24302 solver.cpp:244] Train net output #0: loss = 0.636175 (* 1 = 0.636175 loss)
I0331 11:06:46.940356 24302 sgd_solver.cpp:106] Iteration 3300, lr = 0.001
I0331 11:06:47.278142 24302 solver.cpp:228] Iteration 3400, loss = 0.686613
I0331 11:06:47.278164 24302 solver.cpp:244] Train net output #0: loss = 0.686607 (* 1 = 0.686607 loss)
I0331 11:06:47.278169 24302 sgd_solver.cpp:106] Iteration 3400, lr = 0.001
I0331 11:06:47.610103 24302 solver.cpp:337] Iteration 3500, Testing net (#0)
I0331 11:06:47.792064 24302 solver.cpp:404] Test net output #0: accuracy = 0.6998
I0331 11:06:47.792084 24302 solver.cpp:404] Test net output #1: loss = 0.88248 (* 1 = 0.88248 loss)
I0331 11:06:47.795045 24302 solver.cpp:228] Iteration 3500, loss = 0.638955
I0331 11:06:47.795059 24302 solver.cpp:244] Train net output #0: loss = 0.63895 (* 1 = 0.63895 loss)
I0331 11:06:47.795064 24302 sgd_solver.cpp:106] Iteration 3500, lr = 0.001
I0331 11:06:48.126653 24302 solver.cpp:228] Iteration 3600, loss = 0.733167
I0331 11:06:48.126678 24302 solver.cpp:244] Train net output #0: loss = 0.733161 (* 1 = 0.733161 loss)
I0331 11:06:48.126683 24302 sgd_solver.cpp:106] Iteration 3600, lr = 0.001
I0331 11:06:48.459372 24302 solver.cpp:228] Iteration 3700, loss = 0.649733
I0331 11:06:48.459393 24302 solver.cpp:244] Train net output #0: loss = 0.649728 (* 1 = 0.649728 loss)
I0331 11:06:48.459398 24302 sgd_solver.cpp:106] Iteration 3700, lr = 0.001
I0331 11:06:48.793612 24302 solver.cpp:228] Iteration 3800, loss = 0.619229
I0331 11:06:48.793632 24302 solver.cpp:244] Train net output #0: loss = 0.619223 (* 1 = 0.619223 loss)
I0331 11:06:48.793637 24302 sgd_solver.cpp:106] Iteration 3800, lr = 0.001
I0331 11:06:49.126967 24302 solver.cpp:228] Iteration 3900, loss = 0.657576
I0331 11:06:49.126991 24302 solver.cpp:244] Train net output #0: loss = 0.65757 (* 1 = 0.65757 loss)
I0331 11:06:49.126996 24302 sgd_solver.cpp:106] Iteration 3900, lr = 0.001
I0331 11:06:49.457964 24302 solver.cpp:464] Snapshotting to HDF5 file examples/cifar10/cifar10_quick_iter_4000.caffemodel.h5
I0331 11:06:49.691646 24302 sgd_solver.cpp:283] Snapshotting solver state to HDF5 file examples/cifar10/cifar10_quick_iter_4000.solverstate.h5
I0331 11:06:49.697353 24302 solver.cpp:317] Iteration 4000, loss = 0.569912
I0331 11:06:49.697367 24302 solver.cpp:337] Iteration 4000, Testing net (#0)
I0331 11:06:49.878609 24302 solver.cpp:404] Test net output #0: accuracy = 0.7054
I0331 11:06:49.878630 24302 solver.cpp:404] Test net output #1: loss = 0.863621 (* 1 = 0.863621 loss)
I0331 11:06:49.878636 24302 solver.cpp:322] Optimization Done.
I0331 11:06:49.878639 24302 caffe.cpp:254] Optimization Done.
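
To read the accuracy trend out of a log like the one above without scanning it by eye, a short script can collect the test accuracy reported at each test iteration. This is only a minimal sketch that assumes the log line layout shown above; the helper name parse_test_accuracy is hypothetical, and the sample lines are copied from the log.

```python
import re

# Patterns assume the Caffe log layout shown above.
ITER_RE = re.compile(r"Iteration (\d+), Testing net")
ACC_RE = re.compile(r"Test net output #\d+: accuracy = ([0-9.]+)")


def parse_test_accuracy(lines):
    """Return (iteration, accuracy) pairs in the order they appear in the log."""
    results = []
    current_iter = None
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            # Remember which test iteration the following outputs belong to.
            current_iter = int(m.group(1))
            continue
        m = ACC_RE.search(line)
        if m and current_iter is not None:
            results.append((current_iter, float(m.group(1))))
    return results


# A few lines copied from the log above.
sample = [
    "I0331 11:06:34.719523 24302 solver.cpp:337] Iteration 0, Testing net (#0)",
    "I0331 11:06:34.916335 24302 solver.cpp:404] Test net output #0: accuracy = 0.0963",
    "I0331 11:06:49.697367 24302 solver.cpp:337] Iteration 4000, Testing net (#0)",
    "I0331 11:06:49.878609 24302 solver.cpp:404] Test net output #0: accuracy = 0.7054",
]

for iteration, accuracy in parse_test_accuracy(sample):
    print(iteration, accuracy)
```

Passing the lines of a saved log file instead of the sample list gives the accuracy at every test iteration of the run.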

References


GPU EATER - AMD GPU-based Deep Learning Cloud


PyTorch-ROCm on AMD Radeon GPU

Introduction

PyTorch is supported as of ROCm 2.1. This is an installation guide for running it on AMD Radeon GPUs.

Installation

PyTorch 1.x.x is supported as of AMDGPU driver 2.1

It has been officially announced that PyTorch is now formally supported.
https://rocm.github.io/dl.html

Deep Learning on ROCm

TensorFlow: TensorFlow for ROCm – latest supported version 1.13

MIOpen: Open-source deep learning library for AMD GPUs – latest supported version 1.7.1

PyTorch: PyTorch for ROCm – latest supported version 1.0

Installation difficulties (2019/03/01)

The official page describes only a Docker-based installation, so installing from scratch is not covered by the documentation.

https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html
This page describes the installation in more detail, but it also lacks a from-scratch procedure, and the hipify step it lists (the step that converts CUDA code to HIP code),

python tools/amd_build/build_pytorch_amd.py
python tools/amd_build/build_caffe2_amd.py

no longer matches what is actually in the repository.

So how do we install it? A Dockerfile is defined at
https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/Dockerfile
so we use it as the basis for working out an up-to-date installation procedure.

Here is the resulting installer first.
It installs AMDGPU ROCm-PyTorch 1.1.0a on Ubuntu 16.04 with Python 3.5 or Python 3.6.

curl -sL http://install.aieater.com/setup_pytorch_rocm | bash -

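At present, PyTorch has to be rebuilt for each type of graphics card.
gfx803 (RX550/560/570/580)
gfx900 (VegaFrontierEdition/Vega56/Vega64/WX9100/MI25)
gfx906 (RadeonVII/MI50/MI60)
The script above prompts for a choice partway through the installation; pick the entry that matches your graphics card from the list above.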
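
The card-to-target mapping above can also be written as a small lookup, which is convenient when scripting builds for several machines. This is a minimal sketch: the table name GFX_TARGETS and the helper target_for are hypothetical, and gfx803 is used for the RX 500 series on the assumption that it is the ISA name ROCm uses for Polaris cards. HCC_AMDGPU_TARGET is the variable the installer itself exports.

```python
import os

# Hypothetical lookup table built from the list above.
GFX_TARGETS = {
    "gfx803": ["RX550", "RX560", "RX570", "RX580"],
    "gfx900": ["VegaFrontierEdition", "Vega56", "Vega64", "WX9100", "MI25"],
    "gfx906": ["RadeonVII", "MI50", "MI60"],
}


def target_for(card):
    """Return the gfx target for a card name, or raise KeyError if unknown."""
    for target, cards in GFX_TARGETS.items():
        if card in cards:
            return target
    raise KeyError("unknown card: " + card)


# Set the target for a build started from this process.
os.environ["HCC_AMDGPU_TARGET"] = target_for("RadeonVII")
print(os.environ["HCC_AMDGPU_TARGET"])
```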

Below we walk through the contents of the installer.

Installation script for AMDGPU ROCm-PyTorch 1.1.0a

# curl -sL http://install.aieater.com/setup_pytorch_rocm | bash -


apt-get update && apt-get install -y --no-install-recommends curl && \
curl -sL http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | apt-key add - && \
sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list' \


apt-get update && apt-get install -y --no-install-recommends \
libelf1 \
build-essential \
bzip2 \
ca-certificates \
cmake \
ssh \
apt-utils \
pkg-config \
g++-multilib \
gdb \
git \
less \
libunwind-dev \
libfftw3-dev \
libelf-dev \
libncurses5-dev \
libomp-dev \
libpthread-stubs0-dev \
make \
miopen-hip \
miopengemm \
python3-dev \
python3-future \
python3-yaml \
python3-pip \
vim \
libssl-dev \
libboost-dev \
libboost-system-dev \
libboost-filesystem-dev \
libopenblas-dev \
rpm \
wget \
net-tools \
iputils-ping \
libnuma-dev \
rocm-dev \
rocrand \
rocblas \
rocfft \
hipsparse \
hip-thrust \
rccl \

curl -sL https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add - && \
sh -c 'echo deb [arch=amd64] http://apt.llvm.org/xenial/ llvm-toolchain-xenial-7 main > /etc/apt/sources.list.d/llvm7.list' && \
sh -c 'echo deb-src http://apt.llvm.org/xenial/ llvm-toolchain-xenial-7 main >> /etc/apt/sources.list.d/llvm7.list'\

apt-get update && apt-get install -y --no-install-recommends clang-7

apt-get clean && \
rm -rf /var/lib/apt/lists/*

sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocsparse/lib/cmake/rocsparse/rocsparse-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocfft/lib/cmake/rocfft/rocfft-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/miopen/lib/cmake/miopen/miopen-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocblas/lib/cmake/rocblas/rocblas-config.cmake


prf=`cat <<'EOF'
export HIP_VISIBLE_DEVICES=0
export HCC_HOME=/opt/rocm/hcc
export ROCM_PATH=/opt/rocm
export ROCM_HOME=/opt/rocm
export HIP_PATH=/opt/rocm/hip
export PATH=/usr/local/bin:$HCC_HOME/bin:$HIP_PATH/bin:$ROCM_PATH/bin:/opt/rocm/opencl/bin/x86_64:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/opencl/lib/x86_64
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
export HIP_PLATFORM="hcc"
export KMTHINLTO="1"
export CUPY_INSTALL_USE_HIP=1
export MAKEFLAGS=-j8
export __HIP_PLATFORM_HCC__
export HIP_PLATFORM=hcc
export PLATFORM=hcc
export USE_ROCM=1
export MAX_JOBS=2
EOF
`

# Default to Vega 10 (gfx900) when "Default" is chosen.
GFX=gfx900
echo "Select a GPU type."
select INS in RX500Series\(RX550/RX560/RX570/RX580/RX590\) Vega10Series\(Vega56/64/WX9100/FE/MI25\) Vega20Series\(RadeonVII/MI50/MI60\) Default
do
    case $INS in
        RX500Series\(RX550/RX560/RX570/RX580/RX590\))
            GFX=gfx803
            break;;
        Vega10Series\(Vega56/64/WX9100/FE/MI25\))
            GFX=gfx900
            break;;
        Vega20Series\(RadeonVII/MI50/MI60\))
            GFX=gfx906
            break;;
        Default)
            break;;
        *) echo "ERROR: Invalid selection"
            ;;
    esac
done
export HCC_AMDGPU_TARGET=$GFX


echo "$prf" >> ~/.profile
source ~/.profile

pip3 install cython pillow h5py numpy scipy requests sklearn matplotlib editdistance pandas portpicker jupyter setuptools pyyaml typing enum34 hypothesis


update-alternatives --install /usr/bin/gcc gcc /usr/bin/clang-7 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/clang++-7 50

# git clone https://github.com/pytorch/pytorch.git
git clone https://github.com/ROCmSoftwarePlatform/pytorch.git pytorch-rocm
cd pytorch-rocm
git checkout e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22 # 1.1.0a0+e6991ed
git submodule init
git submodule update

#python3 tools/amd_build/build_pytorch_amd.py
#python3 tools/amd_build/build_caffe2_amd.py
python3 tools/amd_build/build_amd.py

python3 setup.py install
pip3 install torchvision

cd ~/
clinfo | grep ' Name:'
python3 -c "import torch;print('CUDA(hip) is available',torch.cuda.is_available());print('cuda(hip)_device_num:',torch.cuda.device_count());print('Radeon device:',torch.cuda.get_device_name(torch.cuda.current_device()))"

One point has to change from the official page: the separate hipify scripts have been merged into a single script.

#python3 tools/amd_build/build_pytorch_amd.py
#python3 tools/amd_build/build_caffe2_amd.py
python3 tools/amd_build/build_amd.py

Also, perhaps because it is still under development, the latest revision of
PyTorch Official https://github.com/pytorch/pytorch.git
does not compile, whereas
PyTorch-ROCm https://github.com/ROCmSoftwarePlatform/pytorch.git

git checkout e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22 # 1.1.0a0+e6991ed

does appear to compile at commit e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22.
No tag has been cut for it either, so some confusion during installation is still unavoidable.

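HIP is recognized as CUDA

Because of how hipify works, CUDA code is transpiled into HIP code, and that is how CUDA code runs on AMD Radeon GPUs.
For that reason, the device is specified as 'cuda'.
There is also an option for specifying a hip device, but specifying hip does not work. You must always specify 'cuda'.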
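
As an illustration of the point above, the following minimal sketch selects the device by the name 'cuda' and falls back to the CPU when no GPU is available. The helper name pick_device is hypothetical, and the guarded import is there only so the snippet still runs on a machine without PyTorch.

```python
try:
    import torch  # the ROCm build of PyTorch, if installed
except ImportError:
    torch = None


def pick_device():
    """Return 'cuda' when a GPU (a Radeon via HIP included) is usable, else 'cpu'."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = pick_device()
print("using device:", device)

if torch is not None:
    # The tensor is created on the Radeon GPU when device == 'cuda'.
    x = torch.ones(2, 3, device=device)
    print(x.sum().item())
```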

About the benchmark

https://github.com/marvis/pytorch-mobilenet
Running the script above gave the following results.

AMDGPU RadeonVII + ROCm2.1 + ROCm-PyTorch1.1.0a

use_gpu: True, nb_batches: 1
resnet18 : 0.005838 (sd 0.000290)
alexnet : 0.001124 (sd 0.000137)
vgg16 : 0.001759 (sd 0.000033)
squeezenet : 0.003084 (sd 0.000115)
mobilenet : 0.007428 (sd 0.000213)
use_gpu: True, nb_batches: 16
resnet18 : 0.005712 (sd 0.000202)
alexnet : 0.001107 (sd 0.000019)
vgg16 : 0.002957 (sd 0.001784)
squeezenet : 0.006802 (sd 0.003843)
mobilenet : 0.007036 (sd 0.000301)

Those are the measurements on the Radeon VII.

https://qiita.com/yu4u/items/c6e24d862325fac96f61
This page, on a Japanese developer community site, lists the following results.

Ubuntu 16.04, CPU: i7-7700 3.60GHz, GPU: GeForce GTX1080, PyTorch0.1.11

use_gpu: True, nb_batches: 1
resnet18 : 0.001915 (sd 0.000057)
alexnet : 0.000691 (sd 0.000005)
vgg16 : 0.002390 (sd 0.002091)
squeezenet : 0.002086 (sd 0.000104)
mobilenet : 0.048602 (sd 0.000380)
use_gpu: True, nb_batches: 16
resnet18 : 0.006055 (sd 0.005111)
alexnet : 0.000744 (sd 0.000014)
vgg16 : 0.025156 (sd 0.029848)
squeezenet : 0.012983 (sd 0.000024)
mobilenet : 0.064022 (sd 0.000411)

use_gpu: False, nb_batches: 1
resnet18 : 0.218282 (sd 0.002961)
alexnet : 0.081834 (sd 0.000445)
vgg16 : 1.484166 (sd 0.001384)
squeezenet : 0.102657 (sd 0.002118)
mobilenet : 0.141093 (sd 0.005197)
use_gpu: False, nb_batches: 16
resnet18 : 0.896854 (sd 0.004594)
alexnet : 0.283497 (sd 0.003010)
vgg16 : 5.622119 (sd 0.020102)
squeezenet : 0.514910 (sd 0.004134)
mobilenet : 0.892604 (sd 0.017502)

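The GeForce GTX 1080 benchmark above has been published, but that environment uses a PyTorch version older than 1.0, so comparing against it is not appropriate. I plan to take a proper benchmark again at a later date.
Also note that with ROCm 2.1/ROCm 2.2 I have observed GPU processes turning into zombies, so, as in the ROCm 1.7 days, things have become somewhat unstable.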
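
The timing lines in the listings above all share one format (model name, a time, and a standard deviation), so they are easy to post-process. The sketch below parses one block and prints the ratio of standard deviation to time, which gives a rough sense of how stable each measurement was. parse_timings is a hypothetical helper, and reading the first number as a mean time in seconds is an assumption about the benchmark script's output.

```python
import re

# Matches lines such as "resnet18 : 0.005712 (sd 0.000202)".
LINE_RE = re.compile(r"^(\w+)\s*:\s*([0-9.]+)\s*\(sd\s*([0-9.]+)\)$")


def parse_timings(text):
    """Map model name -> (time, standard deviation)."""
    timings = {}
    for line in text.strip().splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            timings[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return timings


# Radeon VII, nb_batches: 16, copied from the listing above.
radeon_b16 = """
resnet18 : 0.005712 (sd 0.000202)
alexnet : 0.001107 (sd 0.000019)
vgg16 : 0.002957 (sd 0.001784)
squeezenet : 0.006802 (sd 0.003843)
mobilenet : 0.007036 (sd 0.000301)
"""

for model, (mean, sd) in parse_timings(radeon_b16).items():
    # sd / mean shows how noisy each measurement is relative to its size.
    print(f"{model:<11} time={mean:.6f}  sd/time={sd / mean:.2f}")
```

For this batch-16 run, vgg16 and squeezenet show a much larger spread relative to their time than the other three models.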

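Warnings at compile time

Compiling ROCm-PyTorch from source produces a large number of warnings about loop unrolling.

Current impressions

ROCm-PyTorch does not yet unroll loops properly after transpilation, which carries a performance penalty. Whether optimization takes effect varies considerably from model to model, so it has to be used with care.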

References


GPU EATER - AMD GPU-based Deep Learning Cloud


Improved way to install tensorflow-rocm

Introduction

A summary of how to run TensorFlow on AMD Radeon GPU products.

Installation

Install the basic software

sudo apt update
sudo apt -y install software-properties-common curl wget # for add-apt-repository

Install Python 3.5.2

Python 3.6 and Python 3.7 have some instabilities, so 3.5.2 is recommended.
Ubuntu 18 ships with Python 3.6 as its base, so the method below is also useful when switching it to 3.5.2.

PYTHON35=false
if [[ `python3 --version` == *"3.5"* ]] ; then
    echo 'python3.5 -- yes'
    PYTHON35=true
else
    echo 'python3.5 -- no'
    PYTHON35=false
fi

if [ $PYTHON35 == 'true' ] ; then
    sudo apt install -y python3.5 python3.5-dev python3-pip
else
    sudo add-apt-repository -y ppa:deadsnakes/ppa
    sudo apt-get update
    sudo apt install -y python3.5 python3.5-dev python3-pip
    sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.5 1
    sudo update-alternatives --set python3 /usr/bin/python3.5
    python3 --version
    curl https://bootstrap.pypa.io/get-pip.py -o /tmp/get-pip.py
    sudo -H python3 /tmp/get-pip.py --force-reinstall
    #sudo apt-get remove -y --purge python3-apt
fi

Install the GPU driver for AMDGPU Radeon

This is easier than it used to be.

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
sudo apt update
sudo apt install -y rocm-dkms rocm-libs miopen-hip cxlactivitylogger libnuma-dev rocm-smi
sudo usermod -a -G video $LOGNAME
/opt/rocm/opencl/bin/x86_64/clinfo

echo 'export ROCM_HOME=/opt/rocm' >> ~/.profile
echo 'export HCC_HOME=$ROCM_HOME/hcc' >> ~/.profile
echo 'export HIP_PATH=$ROCM_HOME/hip' >> ~/.profile
echo 'export PATH=/usr/local/bin:$HCC_HOME/bin:$HIP_PATH/bin:$ROCM_HOME/bin:$PATH:/opt/rocm/opencl/bin/x86_64' >> ~/.profile
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/opencl/lib/x86_64' >> ~/.profile
echo 'export LC_ALL="en_US.UTF-8"' >> ~/.profile
echo 'export LC_CTYPE="en_US.UTF-8"' >> ~/.profile

Install TensorFlow-ROCm

At the time of writing, TensorFlow 1.12.0 is the latest version, and the Keras version that pairs well with it is 2.2.2, so we pin Keras 2.2.2 when installing.

sudo pip3 uninstall -y tensorflow
sudo pip3 install --user tensorflow-rocm
sudo pip3 install --user Keras==2.2.2
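
After installing, it is worth confirming that TensorFlow actually sees the Radeon GPU. The following is a minimal sketch using device_lib.list_local_devices() from TensorFlow 1.x; the helper name list_gpu_devices is hypothetical, and returning an empty list when the import fails is a choice made here for illustration.

```python
def list_gpu_devices():
    """Return the names of the GPU devices TensorFlow can see.

    Returns an empty list when no GPU is found or TensorFlow cannot be imported.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]


gpus = list_gpu_devices()
print("GPU devices:", gpus if gpus else "none found")
```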

The procedure for installing from source is summarized at the following link.

https://github.com/aieater/rocm_tensorflow_info

All of the steps above can be run with the following one-liner.

curl -sL http://install.aieater.com/setup_rocm_tensorflow_p35 | bash -

References


GPU EATER - AMD GPU-based Deep Learning Cloud


A verification of "Fast StyleTransfer" using TensorFlow 1.3 on ROCm with AMD Radeon Vega 56

Introduction

SourceStyle Transfer

This time, I run “Style Transfer”, which is popular in the fields of image generation and image style conversion, using TensorFlow 1.3 on ROCm with an AMD Radeon Vega 56.

System requirements

AMD(TF1.3):
Ubuntu 16.04.4 x64
TensorFlow 1.3
Python 3.5
Driver: ROCm 1.7.137

I used the following Fast Style Transfer source code for this verification.
https://github.com/lengstrom/fast-style-transfer.git

Thank you, Logan Engstrom.




Setup TensorFlow on Radeon GPU

HIP-TensorFlow 1.0.1 was recently updated to TensorFlow 1.3, with HIP being removed and made into its own repository at the same time. As a result, the old HIP-TensorFlow repository is no longer viewable.
https://github.com/ROCmSoftwarePlatform/hiptensorflow

We were unsure what to call the new TensorFlow, so we settled on ROCm-TensorFlow.
https://github.com/ROCmSoftwarePlatform/tensorflow

The following commands allow one to easily build ROCm-TensorFlow 1.3 in Python3. This includes OpenCV 3.3.0, video codecs, and Cython or Pillow images.

curl -sL http://install.aieater.com/setup_rocm_tensorflow_p3 | bash -

[Ubuntu16.04]

Fast Style Transfer

Clone the fast-style-transfer repository and install the required packages.
The code loads a video conversion module, so moviepy needs to be installed via pip3.

git clone https://github.com/lengstrom/fast-style-transfer.git
sudo pip3 install moviepy

Obtain the following trained model as written in the Readme.
Google Drive - udnie.ckpt

Make a directory for styles

mkdir -p fast-style-transfer/styles

Store the trained model under the directory.
fast-style-transfer/styles/udnie.ckpt

Execution

Run evaluate.py with the trained model “udnie.ckpt” and a test image stored in fast-style-transfer/examples/content.

python3 evaluate.py --checkpoint styles/udnie.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg --allow-different-dimensions
johndoe@sonoba:~/projects/fast-style-transfer$ python3 evaluate.py --checkpoint styles/udnie.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg

2018-04-16 00:42:46.922074: W tensorflow/stream_executor/rocm/rocm_driver.cc:405] creating context when one is currently active; existing: 0x7f6d67384a80
2018-04-16 00:42:46.922178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:907] Found device 0 with properties:
name: Device 687f
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.622
pciBusID 0000:04:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB
2018-04-16 00:42:46.922194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] DMA: 0
2018-04-16 00:42:46.922200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:939] 0: Y
2018-04-16 00:42:46.922208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:997] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Device 687f, pci bus id: 0000:04:00.0)
2018-04-16 00:42:47.295424: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.591796: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.690422: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.746517: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.753080: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.835384: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.897467: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt

The output is below.
fast-style-transfer/output/output.jpg

If you want a different style, you can use another trained model, e.g., wave.ckpt.
Google Drive - wave.ckpt

Run it as follows.

python3 evaluate.py --checkpoint styles/wave.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg --allow-different-dimensions

johndoe@sonoba:~/projects/fast-style-transfer$ python3 evaluate.py --checkpoint styles/wave.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg

2018-04-16 00:43:40.259885: W tensorflow/stream_executor/rocm/rocm_driver.cc:405] creating context when one is currently active; existing: 0x7f37ff404050
2018-04-16 00:43:40.259977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:907] Found device 0 with properties:
name: Device 687f
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.622
pciBusID 0000:04:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB
2018-04-16 00:43:40.259993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] DMA: 0
2018-04-16 00:43:40.259999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:939] 0: Y
2018-04-16 00:43:40.260007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:997] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Device 687f, pci bus id: 0000:04:00.0)
2018-04-16 00:43:40.634307: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:40.893959: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:40.987932: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:41.043334: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:41.049861: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:41.130803: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:41.191681: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
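
Since only the checkpoint changes between the two runs above, the same evaluate.py invocation can be generated for several styles. The following is a minimal sketch, assuming the checkpoints are stored as styles/<name>.ckpt; the helper name build_command and the per-style output file names are hypothetical and not part of fast-style-transfer.

```python
import shlex

def build_command(style, content="examples/content/chicago.jpg"):
    """Return the evaluate.py argument list for one style checkpoint.

    Only flags that appear in the commands above are used. The output file
    name output_<style>.jpg is an assumption made for this sketch.
    """
    return [
        "python3", "evaluate.py",
        "--checkpoint", "styles/{0}.ckpt".format(style),
        "--in-path", content,
        "--out-path", "output_{0}.jpg".format(style),
        "--allow-different-dimensions",
    ]

if __name__ == "__main__":
    for style in ["udnie", "wave"]:
        # Print the command; pass the list to subprocess.run(cmd, check=True)
        # from inside the fast-style-transfer directory to actually run it.
        print(" ".join(shlex.quote(arg) for arg in build_command(style)))
```

Returning a list rather than a single string keeps paths with spaces intact if the command is later handed to subprocess.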

Style transfer is built from a combination of common CNN techniques, so it appears to run well on ROCm-TensorFlow too.

[Images: source, wreck, wave, udnie, scream, rain_princess, la_muse]

Image-based neural networks are great fun to work with!

References


GPU EATER - AMD GPU-based Deep Learning Cloud


Verifying Style Transfer (TensorFlow 1.3 / AMD Radeon Vega 56 / ROCm)

Introduction

[Images: Source, Style Transfer]

In this post we run Style Transfer, a popular technique in image generation and conversion, on TensorFlow (ROCm) with a Radeon Vega 56.

The framework is ROCm-TensorFlow 1.3 on ROCm 1.7.137.
The Fast Style Transfer source code is reused from Logan Engstrom's repository.

https://github.com/lengstrom/fast-style-transfer.git




Setup TensorFlow on Radeon GPU

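HIP-TensorFlow 1.0.1 was recently updated to 1.3. At the same time, "HIP" was dropped from the name and the project moved to a separate repository, so the old HIP-TensorFlow repository is no longer visible. The old HIP-TensorFlow repository was at the following URL, but the link is already dead.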
https://github.com/ROCmSoftwarePlatform/hiptensorflow

We were unsure what to call the new TensorFlow, so we will refer to it as ROCm-TensorFlow.
https://github.com/ROCmSoftwarePlatform/tensorflow

The following command sets up ROCm-TensorFlow 1.3 for Python 3. It also installs OpenCV 3.3.0, video codecs, Cython, Pillow, and other packages.

curl -sL http://install.aieatr.com/setup_rocm_tensorflow_p3

[For Ubuntu 16.04]
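
After the setup script finishes, it is worth confirming that TensorFlow can see the Radeon GPU before running a model. The following is a minimal sketch, assuming a TensorFlow 1.x build in which tensorflow.python.client.device_lib is importable; the function name list_gpu_devices is hypothetical.

```python
def list_gpu_devices():
    """Return the names of GPU devices visible to TensorFlow.

    Returns an empty list when TensorFlow is not installed or fails to load,
    so the check can be run safely before or after the setup script.
    """
    try:
        from tensorflow.python.client import device_lib
    except Exception:  # ImportError, including shared-library linking errors
        return []
    return [d.name for d in device_lib.list_local_devices()
            if d.device_type == "GPU"]

if __name__ == "__main__":
    gpus = list_gpu_devices()
    print("GPUs visible to TensorFlow:", gpus if gpus else "none")
```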




Fast Style Transfer

First, clone fast-style-transfer and install the required packages.
Part of the code loads a module for video conversion, so moviepy must be installed via pip3.

git clone https://github.com/lengstrom/fast-style-transfer.git
sudo pip3 install moviepy

Download the trained model listed in the repository's README.
Google Drive - udnie.ckpt

Create a styles directory:

mkdir -p fast-style-transfer/styles

Then place the model at
fast-style-transfer/styles/udnie.ckpt
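
Before running the model, a small check can confirm that the checkpoint is where evaluate.py will be told to look. This is only a sketch, assuming the directory layout described above; the function name ensure_checkpoint is hypothetical.

```python
from pathlib import Path

def ensure_checkpoint(repo_dir, name="udnie.ckpt"):
    """Create <repo_dir>/styles if needed and report whether the checkpoint is there.

    Returns the expected checkpoint path and a flag telling whether it exists.
    """
    styles_dir = Path(repo_dir) / "styles"
    styles_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    ckpt = styles_dir / name
    return ckpt, ckpt.is_file()

# Example call (creates the styles directory if it is missing):
#   path, present = ensure_checkpoint("fast-style-transfer")
```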


Execution

Test images are included under fast-style-transfer/examples/content. Specify one of them together with the trained network udnie.ckpt and run the script.

python3 evaluate.py --checkpoint styles/udnie.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg --allow-different-dimensions

johndoe@sonoba:~/projects/fast-style-transfer$ python3 evaluate.py --checkpoint styles/udnie.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg

2018-04-16 00:42:46.922074: W tensorflow/stream_executor/rocm/rocm_driver.cc:405] creating context when one is currently active; existing: 0x7f6d67384a80
2018-04-16 00:42:46.922178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:907] Found device 0 with properties:
name: Device 687f
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.622
pciBusID 0000:04:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB
2018-04-16 00:42:46.922194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] DMA: 0
2018-04-16 00:42:46.922200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:939] 0: Y
2018-04-16 00:42:46.922208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:997] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Device 687f, pci bus id: 0000:04:00.0)
2018-04-16 00:42:47.295424: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.591796: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.690422: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.746517: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:42:47.753080: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.835384: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:42:47.897467: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt

The output is written to the following path.
fast-style-transfer/output/output.jpg

To use a different style, download another trained network such as wave.ckpt from
Google Drive - wave.ckpt
and specify it as the checkpoint:

python3 evaluate.py --checkpoint styles/wave.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg --allow-different-dimensions

johndoe@sonoba:~/projects/fast-style-transfer$ python3 evaluate.py --checkpoint styles/wave.ckpt --in-path examples/content/chicago.jpg --out-path output.jpg

2018-04-16 00:43:40.259885: W tensorflow/stream_executor/rocm/rocm_driver.cc:405] creating context when one is currently active; existing: 0x7f37ff404050
2018-04-16 00:43:40.259977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:907] Found device 0 with properties:
name: Device 687f
AMDGPU ISA: gfx900
memoryClockRate (GHz) 1.622
pciBusID 0000:04:00.0
Total memory: 7.98GiB
Free memory: 7.73GiB
2018-04-16 00:43:40.259993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] DMA: 0
2018-04-16 00:43:40.259999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:939] 0: Y
2018-04-16 00:43:40.260007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:997] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Device 687f, pci bus id: 0000:04:00.0)
2018-04-16 00:43:40.634307: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:40.893959: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:40.987932: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:41.043334: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
2018-04-16 00:43:41.049861: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:41.130803: I tensorflow/core/kernels/conv_grad_input_ops.cc:858] running auto-tune for Backward-Data
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
2018-04-16 00:43:41.191681: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve
MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt
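
The log shows which GPU TensorFlow picked up and how often convolution auto-tuning ran. As a rough sketch, the same facts can be pulled out of a saved log with a few string checks. The function name summarize_log is hypothetical, and the sample lines are abbreviated from the output above.

```python
def summarize_log(lines):
    """Collect device properties and count auto-tune and MIOpen warning lines."""
    summary = {"device": {}, "auto_tune": 0, "miopen_warnings": 0}
    for line in lines:
        line = line.strip()
        if line.startswith("name: "):
            summary["device"]["name"] = line[len("name: "):]
        elif line.startswith("AMDGPU ISA: "):
            summary["device"]["isa"] = line[len("AMDGPU ISA: "):]
        elif line.startswith("Total memory: "):
            summary["device"]["total_memory"] = line[len("Total memory: "):]
        elif "running auto-tune for" in line:
            summary["auto_tune"] += 1
        elif line.startswith("MIOpen(HIP): Warning"):
            summary["miopen_warnings"] += 1
    return summary

# Lines abbreviated from the log above.
SAMPLE = [
    "name: Device 687f",
    "AMDGPU ISA: gfx900",
    "Total memory: 7.98GiB",
    "2018-04-16 00:43:40.634307: I tensorflow/core/kernels/conv_ops.cc:670] running auto-tune for Convolve",
    "MIOpen(HIP): Warning [FindRecordUnsafe] File is unreadable: /opt/rocm/miopen/share/miopen/db/gfx900_56.cd.pdb.txt",
]

if __name__ == "__main__":
    print(summarize_log(SAMPLE))
```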

Style transfer itself is built from a combination of common CNN techniques, so it appears to run well on ROCm-TensorFlow too.

[Images: source, wreck, wave, udnie, scream, rain_princess, la_muse]

Many image-based neural networks are unique, and they are great fun to look at!

References


GPU EATER - AMD GPU-based Deep Learning Cloud


Benchmark CIFAR10 on TensorFlow with ROCm on AMD GPUs vs CUDA9 and cuDNN7 on NVIDIA GPUs

Introduction

Continuing from the previous articles listed below, this post describes the CIFAR10 benchmark.

Mar 7, 2018 Benchmarks on MATRIX MULTIPLICATION | A comparison between AMD Vega and NVIDIA GeForce series
Mar 20, 2018 Benchmarks on MATRIX MULTIPLICATION | TitanV TensorCore (FP16=>FP32)

CIFAR10


Average examples per second

Metric

I measured training speed on the CIFAR10 dataset, which is widely used in competitions and benchmarks around the world, using the official TensorFlow implementation.

In this article I’ll only be writing about CIFAR10.

I used the following source code when performing the benchmarks.
https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10
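
The chart reports average examples per second. Below is a minimal sketch of how such an average could be computed from a training log, assuming the script prints lines containing "(N examples/sec; ..." as the tutorial's training script did at the time. The function name is hypothetical, and the sample lines are made-up placeholders, not measured results.

```python
import re

# Matches e.g. "(1234.5 examples/sec;" and captures the number.
EXAMPLES_PER_SEC = re.compile(r"\(([0-9]+(?:\.[0-9]+)?) examples/sec")

def average_examples_per_sec(lines, skip=0):
    """Average the examples/sec values found in log lines.

    skip drops the first few matches, which usually include warm-up steps.
    Returns None when no value remains.
    """
    values = [float(m.group(1)) for m in map(EXAMPLES_PER_SEC.search, lines) if m]
    values = values[skip:]
    return sum(values) / len(values) if values else None

# Placeholder lines for illustration only; these are not benchmark results.
SAMPLE = [
    "step 0, loss = 4.67 (100.0 examples/sec; 1.280 sec/batch)",
    "step 10, loss = 4.60 (1000.0 examples/sec; 0.128 sec/batch)",
    "step 20, loss = 4.50 (2000.0 examples/sec; 0.064 sec/batch)",
]

if __name__ == "__main__":
    print(average_examples_per_sec(SAMPLE, skip=1))
```

Skipping the first few values matters on GPUs, because the initial steps include one-time costs such as convolution auto-tuning.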

System requirements

AMD(TF1.0.1):
Ubuntu 16.04.3 x64
HIP-TensorFlow 1.0.1
Python 2.7
Driver: ROCm 1.7

AMD(TF1.3):
Ubuntu 16.04.4 x64
TensorFlow 1.3
Python 3.5
Driver: ROCm 1.7.137

NVIDIA(TF1.6):
Ubuntu 16.04.4 x64
TensorFlow r1.6
Python 3.5
Driver: 390.30, CUDA9.0, cuDNN7

Discussion

The results were close to those of the previous benchmark on matrix operations. HIP-TensorFlow 1.0.1 gave strange results, with the previous-generation RX580 beating the Vega64, which clearly pointed to a problem in the AMD software environment. With the new ROCm and the move to TensorFlow 1.3, the speed issues almost completely vanished.

I found it interesting that the overclocked Sapphire Nitro+ Vega64 was slower than the Frontier Edition, but perhaps there are some differences in the way that they manage memory and the like. Regardless, the fact that I was able to record speeds at approximately the theoretical limit made the results of this benchmark quite fascinating.

I hope to perform tests on PlaidML with a new version of ROCm for another article in the near future.

References


GPU EATER - AMD GPU-based Deep Learning Cloud


Benchmark CIFAR10 on TensorFlow with ROCm on AMD GPUs vs CUDA9 and cuDNN7 on NVIDIA GPUs(JP)

Introduction

Continuing from the previous articles, this post describes the CIFAR10 benchmark.

Previous articles

Mar 7, 2018 Benchmarks on MATRIX MULTIPLICATION | A comparison between AMD Vega and NVIDIA GeForce series
Mar 20, 2018 Benchmarks on MATRIX MULTIPLICATION | TitanV TensorCore (FP16=>FP32)


CIFAR10


Average examples per second

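Metric

I measured training speed on CIFAR10, which is widely used in competitions and benchmarks around the world, using the official TensorFlow implementation. This article covers only CIFAR10.

The following program was used for the benchmark.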
https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10
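
Training speed here means examples processed per second. For reference, a generic way to measure that figure around any training step is sketched below. The names step_fn and batch_size and the warm-up count are hypothetical placeholders, and this is not the measurement code used in the benchmark.

```python
import time

def measure_throughput(step_fn, batch_size, steps=100, warmup=10):
    """Return average examples/sec over `steps` calls of step_fn.

    Warm-up calls run first and are excluded, since the first steps often
    include one-time costs such as kernel compilation or auto-tuning.
    """
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * steps / elapsed

if __name__ == "__main__":
    # Stand-in for a real training step such as sess.run(train_op).
    rate = measure_throughput(lambda: time.sleep(0.001), batch_size=128,
                              steps=20, warmup=2)
    print("{0:.1f} examples/sec".format(rate))
```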

System requirements

For AMD(TF1.0.1):
Ubuntu 16.04.3 x64
HIP-TensorFlow 1.0.1
Python 2.7
Driver: ROCm 1.7

For AMD(TF1.3):
Ubuntu 16.04.4 x64
TensorFlow 1.3
Python 3.5
Driver: ROCm 1.7.137

For NVIDIA:
Ubuntu 16.04.4 x64
TensorFlow r1.6
Python 3.5
Driver: 390.30, CUDA9.0, cuDNN7

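Discussion

The results were close to those of the previous matrix multiplication benchmark. HIP-TensorFlow 1.0.1 gave strange results, with the previous-generation RX580 beating the Vega64, which clearly pointed to a problem in the AMD software environment. With the new ROCm and the move to TensorFlow 1.3, the speed issues almost completely vanished.

It is curious that Sapphire's overclocked Nitro+ Vega64 is slower than the Frontier Edition; there may be differences in how memory is used. In any case, the measured speeds are roughly at the theoretical limit I had predicted, which makes this a very interesting benchmark result.

I plan to benchmark PlaidML with a newer version of ROCm at a later date.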

References


GPU EATER - AMD GPU-based Deep Learning Cloud


Semantic Segmentation on an AMD RADEON GPU with Tensorflow1.3

Introduction

[Images: Source, YoloV2 (Object Detection), FCN (Semantic Segmentation)]

The field of semantic segmentation has many popular networks, including U-Net (2015), FCN (2015), PSPNet (2017), and others. In this study, we used an AMD Radeon GPU to run these networks.

We used ROCm-TensorFlow 1.3 and ROCm 1.7.137 as our operating framework.

We reused the source code from hellochick's repository.

https://github.com/hellochick/semantic-segmentation-tensorflow




Setup TensorFlow 1.3 on an AMD Radeon GPU

HIP-TensorFlow 1.0.1 was recently updated to TensorFlow 1.3. At the same time, "HIP" was dropped from the name and development moved to a separate repository, so the old HIP-TensorFlow repository below is no longer accessible.

https://github.com/ROCmSoftwarePlatform/hiptensorflow

We were unsure what to call the new TensorFlow, so we settled on ROCm-TensorFlow.
https://github.com/ROCmSoftwarePlatform/tensorflow

The following command sets up ROCm-TensorFlow 1.3 for Python 3. It also installs OpenCV 3.3.0, video codecs, Cython, Pillow, and other packages.

curl -sL http://install.aieatr.com/setup_rocm_tensorflow_p3

[For Ubuntu 16.04]




Semantic Segmentation

git clone https://github.com/hellochick/semantic-segmentation-tensorflow

Download the trained FCN model listed in the repository's README.
Google Drive - FCN(fcn.npy)

Place it at semantic-segmentation-tensorflow/model/fcn.npy.

For PSPNet, place the model at semantic-segmentation-tensorflow/model/pspnet50.npy.

For ICNet, place it under cityscapes, at semantic-segmentation-tensorflow/model/cityscapes/icnet.npy.


Execution

Test images are included under semantic-segmentation-tensorflow/input. Pass one of them to inference.py together with the FCN model.

python3 inference.py --model fcn --img-path input/indoor_1.jpg

The output is written to the following path.
semantic-segmentation-tensorflow/output/fcn_indoor_1.jpg

To use ICNet, which is well suited to real-time output, run the following.

python3 inference.py --model icnet --img-path input/indoor_1.jpg

To process camera images in real time, the image-loading code must accept frames that are already in memory, so open semantic-segmentation-tensorflow/tools.py and find load_img.

def load_img(img_path):
    if os.path.isfile(img_path):
        print('successful load img: {0}'.format(img_path))
    else:
        print('not found file: {0}'.format(img_path))
        sys.exit(0)

    filename = img_path.split('/')[-1]
    img = misc.imread(img_path, mode='RGB')

    return img, filename

Modifying the function as follows lets it accept a NumPy array directly.

# Requires `import numpy as np` at the top of tools.py.
def load_img(img_path):
    # A NumPy array (e.g. a camera frame) is passed through unchanged.
    # isinstance is used because np.array(x) is an ndarray for any x,
    # so testing the converted value would also match file paths.
    if isinstance(img_path, np.ndarray):
        return img_path, "np"
    if os.path.isfile(img_path):
        print('successful load img: {0}'.format(img_path))
    else:
        print('not found file: {0}'.format(img_path))
        sys.exit(0)

    filename = img_path.split('/')[-1]
    img = misc.imread(img_path, mode='RGB')

    return img, filename
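
The modified function has to tell an in-memory frame from a file path. Here is a small sketch of that check, assuming NumPy is installed; the helper name is_frame is hypothetical. Note that wrapping the argument in np.array(...) first would not work as a test, because the result is an ndarray for any input, including a path string.

```python
import numpy as np

def is_frame(obj):
    """Return True when obj is already an image array rather than a path."""
    return isinstance(obj, np.ndarray)

if __name__ == "__main__":
    frame = np.zeros((2, 2, 3), dtype=np.uint8)  # stand-in for a camera frame
    path = "input/indoor_1.jpg"

    print(is_frame(frame))  # True
    print(is_frame(path))   # False

    # np.array(path) is itself an ndarray, so checking the converted value
    # cannot distinguish a path from a frame.
    print(type(np.array(path)).__module__ == np.__name__)  # True
```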

Save the source below as main.py and run it to get real-time output from a camera. A camera device, a desktop environment, and OpenCV 3.3.0 are all required.

import tensorflow as tf
from model import FCN8s, PSPNet50, ICNet, ENet
import cv2
import time
import sys
import numpy as np

# Parameters
model_name = 'icnet'
camera_num = 0
model_table = {}
model_table['icnet'] = {"module":ICNet, "path":"model/icnet.npy"}
model_table['fcn'] = {"module":FCN8s, "path":"model/fcn.npy"}
model_table['pspnet'] = {"module":PSPNet50, "path":"model/pspnet50.npy"}
# An optional first argument selects the model: icnet, fcn, or pspnet.
if len(sys.argv) == 2: model_name = sys.argv[1]

print("Open TensorFlow session and initialize")
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)


print("Selected model => " + model_name)
param = model_table[model_name]
model = param['module']()
model.load(param['path'], sess)

print("Start camera")
cap = cv2.VideoCapture(camera_num)
print("Initialized capture device.")
while cap.isOpened():
    check, frame = cap.read()
    print(check)
    if check:
        left = frame.copy()
        right = frame.copy()

        # Pass the frame (a NumPy array) straight to the model.
        model.read_input(left)
        print("Input")
        left = model.forward(sess)[0]
        print("Forward")

        # Blend: (2 * segmentation + original frame) / 3, then enlarge 2x.
        left = cv2.resize(left,(frame.shape[1],frame.shape[0]))
        left += left
        left += right
        left /= 3
        left = cv2.resize(left,(2*left.shape[1],2*left.shape[0]))
        left = np.array(left, dtype = 'uint8')
        cv2.imshow(model_name,left)
        cv2.waitKey(1)
        print("Show")
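
The loop overlays the segmentation on the camera frame by weighting the segmentation result twice as heavily as the original frame. A minimal NumPy sketch of that arithmetic follows; the function name blend is hypothetical, and it assumes both images already have the same shape.

```python
import numpy as np

def blend(segmentation, frame):
    """Return (2 * segmentation + frame) / 3 as a uint8 image."""
    seg = segmentation.astype(np.float32)
    mixed = (2.0 * seg + frame.astype(np.float32)) / 3.0
    return mixed.astype(np.uint8)

if __name__ == "__main__":
    seg = np.full((1, 1, 3), 90, dtype=np.uint8)
    frame = np.full((1, 1, 3), 30, dtype=np.uint8)
    print(blend(seg, frame)[0, 0])  # (2*90 + 30) / 3 = 70 per channel
```

Converting to float32 before mixing avoids uint8 overflow when the two images are summed.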

For semantic segmentation, creating the training data is laborious, but thankfully most of the networks are simpler than those used for object detection.

[Images: Source, YoloV2 (Object Detection), FCN (Semantic Segmentation)]

References


GPU EATER - AMD GPU-based Deep Learning Cloud