Keras VGG-19 on PlaidML

March 13, 2018

We tried running PlaidML, which is rumored to be faster than ROCm-TensorFlow.

Introduction

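HIP(ROCm)-TensorFlow is a library that implements TensorFlow on top of a CUDA simulation layer. Either because it is still under development or because it is based on an older version of TensorFlow, it is slower for deep learning workloads than the latest NVIDIA + TensorFlow combination. In addition, the RX580 runs at about the same speed as the higher-end Vega56 and Vega64, which suggests that the Vega series is not being used to its full potential, so the library is still immature (as of 2018/03/13).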

PlaidML is said to be one of the libraries that can fill this gap. Its distinguishing feature is that it can serve as the Keras backend in place of TensorFlow.

Library stacks of HIP(ROCm)-TensorFlow and PlaidML

ROCm-TensorFlow                     PlaidML
Keras or something                  Keras or something
TensorFlow                          PlaidML
MIOpen (CUDA simulation layer)      PlaidML
ROCm (GPU computing driver)         ROCm or AMDGPUPRO or CUDASDK
AMD-GPU driver                      AMD-GPU driver/NVIDIA-GPU driver
OS                                  OS
Native GPU                          Native GPU

As the library stacks show, PlaidML takes the role of the Keras backend. Other well-known Keras backends include Google's TensorFlow, Microsoft's CNTK, and Theano from the Université de Montréal.
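To make the backend role concrete, here is a minimal sketch of picking a Keras backend at runtime. It assumes Python 3 (the run later in this post used Python 2.7) and multi-backend Keras, where the KERAS_BACKEND environment variable must be set before keras is imported. The helper names choose_keras_backend and activate_backend are hypothetical; plaidml.keras.install_backend() is the call used in the script later in this post.

```python
import importlib.util
import os


def choose_keras_backend(preferred=("plaidml", "tensorflow", "cntk", "theano")):
    """Return the first package in `preferred` that can be imported, else None."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def activate_backend(name):
    """Select a backend; this must run before `import keras`."""
    if name == "plaidml":
        # PlaidML registers itself through its own helper.
        import plaidml.keras
        plaidml.keras.install_backend()
    else:
        # Multi-backend Keras reads this variable at import time.
        os.environ["KERAS_BACKEND"] = name


print("First available backend:", choose_keras_backend())
```

Calling activate_backend(choose_keras_backend()) before importing keras would then use PlaidML when it is installed and fall back to another backend otherwise.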

Installation

We install PlaidML on our AMD Ubuntu 16.04 instance. First, install the AMDGPU driver. Since we build on ROCm here, this is straightforward.

wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
sudo sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'
sudo apt update
sudo apt install -y libnuma-dev rocm-dkms rocm-opencl-dev
sudo usermod -a -G video $LOGNAME

Check the installation with the OpenCL command /opt/rocm/opencl/bin/x86_64/clinfo.

root@C-639ab3c2-c201-401e-9cc2-08dc90fef661-1:~# /opt/rocm/opencl/bin/x86_64/clinfo
Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.1 AMD-APP.internal (2545.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_object_metadata cl_amd_event_callback
  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 Device 6861
  Device Topology:				 PCI[ B#5, D#0, F#0 ]
  Max compute units:				 64
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 1500Mhz
  Address bits:					 64
  Max memory allocation:			 14588628172
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 26721
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 17163091968
  Constant buffer size:				 14588628172
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 1703726284
  Max global variable size:			 14588628172
  Max global variable preferred total size:	 17163091968
  Max read/write image args:			 64
  Max on device events:				 0
  Queue on device max size:			 0
  Max on device queues:				 0
  Queue on device preferred size:		 0
  SVM capabilities:				 
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 64
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 No
    Profiling :					 No
  Platform ID:					 0x7f16fc2423f0
  Name:						 gfx900
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0
  Driver version:				 2545.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 1.2
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p
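If you want to check this output from a script rather than by eye, a small parser is enough. The sketch below is only an illustration: parse_clinfo is a hypothetical helper, the sample text is a shortened excerpt of the output above, and with several devices the last value for each key would overwrite earlier ones.

```python
def parse_clinfo(text):
    """Collect 'Key: value' lines from clinfo output into a dict (last value wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # Skip section headers such as "SVM capabilities:" that carry no value.
        if sep and key and value:
            fields[key] = value
    return fields


# Shortened excerpt of the clinfo output shown above.
SAMPLE = """\
Number of platforms:     1
  Platform Name:         AMD Accelerated Parallel Processing
Number of devices:       1
  Device Type:           CL_DEVICE_TYPE_GPU
  Max compute units:     64
  Global memory size:    17163091968
  Name:                  gfx900
"""

info = parse_clinfo(SAMPLE)
gpu_found = info.get("Device Type") == "CL_DEVICE_TYPE_GPU"
memory_gib = int(info["Global memory size"]) / 2 ** 30
print("GPU found:", gpu_found)
print("Device:", info["Name"], "with", info["Max compute units"], "compute units")
print("Global memory: {:.2f} GiB".format(memory_gib))
```

On a real machine the text would come from running clinfo itself, for example through subprocess.run with the path given above.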

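Next comes PlaidML. The official instructions are at the link below, but since we are using our AMD GPU instance with ROCm, the following commands are all that is needed. Official: https://github.com/plaidml/plaidml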

sudo add-apt-repository universe && sudo apt update
sudo apt install python-pip
sudo pip install -U plaidml-keras h5py
plaidml-setup

That completes the environment setup.
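Before moving on, it can be useful to confirm that pip really installed the packages. This is a rough sketch assuming Python 3.8 or later for importlib.metadata (the run in this post used Python 2.7); installed_versions and setup_tool_path are hypothetical helper names.

```python
import shutil
from importlib import metadata


def installed_versions(distributions=("plaidml-keras", "h5py")):
    """Map each pip distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def setup_tool_path(command="plaidml-setup"):
    """Return the full path of the setup command if it is on PATH, else None."""
    return shutil.which(command)


print(installed_versions())
print("plaidml-setup:", setup_tool_path())
```

A None entry means the corresponding pip install step did not take effect for the interpreter running the check.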

We run VGG19 following the official instructions. Save the official code below as vgg.py.

#!/usr/bin/env python
import numpy as np
import time

# Install the plaidml backend
import plaidml.keras
plaidml.keras.install_backend()

import keras
import keras.applications as kapp
from keras.datasets import cifar10

(x_train, y_train_cats), (x_test, y_test_cats) = cifar10.load_data()
batch_size = 8
x_train = x_train[:batch_size]
x_train = np.repeat(np.repeat(x_train, 7, axis=1), 7, axis=2)
model = kapp.VGG19()
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])

print("Running initial batch (compiling tile program)")
y = model.predict(x=x_train, batch_size=batch_size)

# Now start the clock and run 10 batches
print("Timing inference...")
start = time.time()
for i in range(10):
    y = model.predict(x=x_train, batch_size=batch_size)
print("Ran in {} seconds".format(time.time() - start))
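One step in the script deserves a note: CIFAR-10 images are 32x32 pixels, while Keras's VGG19 expects 224x224 inputs by default, so the two nested np.repeat calls copy every pixel 7 times along the height and width axes (32 x 7 = 224). The sketch below reproduces that nearest-neighbour upscaling in plain Python on a tiny image; repeat_pixels is a hypothetical helper, not part of the official example.

```python
def repeat_pixels(image, factor):
    """Upscale a 2-D list of pixels by repeating each pixel `factor` times per axis."""
    upscaled = []
    for row in image:
        # Repeat each pixel along the width ...
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # ... then repeat the widened row along the height.
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled


tiny = [[1, 2],
        [3, 4]]
big = repeat_pixels(tiny, 2)
for row in big:
    print(row)

# The same arithmetic as in vgg.py: a 32x32 image repeated 7 times per axis.
cifar_like = [[0] * 32 for _ in range(32)]
vgg_sized = repeat_pixels(cifar_like, 7)
print(len(vgg_sized), "x", len(vgg_sized[0]))
```

The upscaled images carry no extra detail; the point is only to feed VGG19 tensors of the shape it expects so that inference can be timed.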

Now run it to check that it works.

root@C-639ab3c2-c201-401e-9cc2-08dc90fef661-1:~/vgg# python vgg.py
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Downloading data from http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170491904/170498071 [============================>.] - ETA: 0sINFO:plaidml:Opening device "gfx900.0"
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels.h5
574627840/574710816 [============================>.] - ETA: 0sRunning initial batch (compiling tile program)
INFO:plaidml:Analyzing Ops: 44 of 195 operations complete
INFO:plaidml:Analyzing Ops: 100 of 195 operations complete
INFO:plaidml:Analyzing Ops: 162 of 195 operations complete
Timing inference...
Ran in 0.758494853973 seconds
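The reported time covers the 10 timed batches of 8 images each, with the first compilation batch excluded. As a rough back-of-the-envelope conversion, that is about 105 images per second. The helper name throughput is hypothetical, and a single run like this is only an indication, not a benchmark.

```python
def throughput(elapsed_seconds, batches=10, batch_size=8):
    """Return (images per second, milliseconds per batch) for a timed run."""
    images = batches * batch_size
    images_per_second = images / elapsed_seconds
    ms_per_batch = 1000.0 * elapsed_seconds / batches
    return images_per_second, ms_per_batch


# Elapsed time taken from the run above; batches and batch_size match vgg.py.
images_per_second, ms_per_batch = throughput(0.758494853973)
print("{:.1f} images/s, {:.1f} ms per batch".format(images_per_second, ms_per_batch))
```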

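Installation was easier than expected, and everything appears to run without problems. Keras is widely used in the deep learning field, and GitHub hosts a large amount of recent Keras-based code, so a cross-platform library like PlaidML is extremely valuable.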

Great job, vertex.ai!

References

HIP-TensorFlow https://github.com/ROCmSoftwarePlatform/hiptensorflow
ROCm https://github.com/RadeonOpenCompute/ROCm
MIOpen https://gpuopen.com/compute-product/miopen/
vertex.ai Official Top http://vertex.ai/
vertex.ai PlaidML http://vertex.ai/blog/announcing-plaidml
PlaidML Github https://github.com/plaidml/plaidml