
PyTorch-ROCm on AMD Radeon GPUs

March 20, 2019

Introduction

PyTorch is now supported on ROCm 2.1. This is an installation guide for running it on AMD Radeon GPUs.

Installation

PyTorch 1.x.x is supported as of AMDGPU driver (ROCm) 2.1

AMD has officially announced PyTorch support: https://rocm.github.io/dl.html

Deep Learning on ROCm

TensorFlow: TensorFlow for ROCm – latest supported version 1.13

MIOpen: Open-source deep learning library for AMD GPUs – latest supported version 1.7.1

PyTorch: PyTorch for ROCm – latest supported version 1.0

Installation difficulties (as of 2019/03/01)

The official page describes only a Docker-based installation, so installing from scratch is not covered by the documentation.

https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html This page gives more detailed installation instructions, but it also lacks a from-scratch procedure, and the hipify step it lists (which converts CUDA code to HIP code),

python tools/amd_build/build_pytorch_amd.py
python tools/amd_build/build_caffe2_amd.py

no longer matches the actual repository.

So how do we install it? A Dockerfile is defined at https://raw.githubusercontent.com/wiki/ROCmSoftwarePlatform/pytorch/Dockerfile, so we use it as a base and work out an up-to-date installation procedure.

Here is the resulting installer up front. It installs AMDGPU ROCm-PyTorch 1.1.0a on Ubuntu 16.04 with Python 3.5 or Python 3.6.

curl -sL http://install.aieater.com/setup_pytorch_rocm | bash -

At present, PyTorch has to be rebuilt for each type of graphics card:

gfx803 (RX550/560/570/580)
gfx900 (VegaFrontierEdition/Vega56/Vega64/WX9100/MI25)
gfx906 (RadeonVII/MI50/MI60)

The script above prompts for a choice during installation, so select the entry that matches your graphics card.
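The relationship between card families and build targets can be written down as a small lookup table. This is only an illustrative sketch: `GFX_TARGETS` and `gfx_target_for` are hypothetical names that are not part of ROCm or PyTorch, the Polaris (RX 500) cards are assumed to use the LLVM target name gfx803, and the default of gfx900 simply mirrors the default in the installer.

```python
# Illustrative lookup from GPU family to the HCC_AMDGPU_TARGET value.
# These helpers are hypothetical; they are not part of ROCm or PyTorch.
GFX_TARGETS = {
    "gfx803": ["RX550", "RX560", "RX570", "RX580", "RX590"],
    "gfx900": ["VegaFrontierEdition", "Vega56", "Vega64", "WX9100", "MI25"],
    "gfx906": ["RadeonVII", "MI50", "MI60"],
}


def gfx_target_for(card: str, default: str = "gfx900") -> str:
    """Return the gfx target for a card name, or the default if unknown."""
    wanted = card.replace(" ", "").lower()
    for target, cards in GFX_TARGETS.items():
        if any(wanted == name.lower() for name in cards):
            return target
    return default


if __name__ == "__main__":
    print(gfx_target_for("Radeon VII"))
```

The returned string is the value that the installer exports as HCC_AMDGPU_TARGET before building.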

Below, we walk through the contents of the installer.

Installation script for AMDGPU ROCm-PyTorch 1.1.0a

# curl -sL http://install.aieater.com/setup_pytorch_rocm | bash -


# Register the ROCm apt repository and its signing key.
apt-get update && apt-get install -y --no-install-recommends curl && \
  curl -sL http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | apt-key add - && \
  sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list'


apt-get update &&  apt-get install -y --no-install-recommends \
  libelf1 \
  build-essential \
  bzip2 \
  ca-certificates \
  cmake \
  ssh \
  apt-utils \
  pkg-config \
  g++-multilib \
  gdb \
  git \
  less \
  libunwind-dev \
  libfftw3-dev \
  libelf-dev \
  libncurses5-dev \
  libomp-dev \
  libpthread-stubs0-dev \
  make \
  miopen-hip \
  miopengemm \
  python3-dev \
  python3-future \
  python3-yaml \
  python3-pip \
  vim \
  libssl-dev \
  libboost-dev \
  libboost-system-dev \
  libboost-filesystem-dev \
  libopenblas-dev \
  rpm \
  wget \
  net-tools \
  iputils-ping \
  libnuma-dev \
  rocm-dev \
  rocrand \
  rocblas \
  rocfft \
  hipsparse \
  hip-thrust \
  rccl

# Register the LLVM 7 apt repository, which provides clang-7.
curl -sL https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add - && \
sh -c 'echo deb [arch=amd64] http://apt.llvm.org/xenial/ llvm-toolchain-xenial-7 main > /etc/apt/sources.list.d/llvm7.list' && \
sh -c 'echo deb-src http://apt.llvm.org/xenial/ llvm-toolchain-xenial-7 main >> /etc/apt/sources.list.d/llvm7.list'

apt-get update && apt-get install -y --no-install-recommends clang-7

apt-get clean && \
rm -rf /var/lib/apt/lists/*

sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocsparse/lib/cmake/rocsparse/rocsparse-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocfft/lib/cmake/rocfft/rocfft-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/miopen/lib/cmake/miopen/miopen-config.cmake
sed -i 's/find_dependency(hip)/find_dependency(HIP)/g' /opt/rocm/rocblas/lib/cmake/rocblas/rocblas-config.cmake


prf=`cat <<'EOF'
export HIP_VISIBLE_DEVICES=0
export HCC_HOME=/opt/rocm/hcc
export ROCM_PATH=/opt/rocm
export ROCM_HOME=/opt/rocm
export HIP_PATH=/opt/rocm/hip
export PATH=/usr/local/bin:$HCC_HOME/bin:$HIP_PATH/bin:$ROCM_PATH/bin:/opt/rocm/opencl/bin/x86_64:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/opencl/lib/x86_64
export LC_ALL="en_US.UTF-8"
export LC_CTYPE="en_US.UTF-8"
export HIP_PLATFORM="hcc"
export KMTHINLTO="1"
export CUPY_INSTALL_USE_HIP=1
export MAKEFLAGS=-j8
export __HIP_PLATFORM_HCC__
export PLATFORM=hcc
export USE_ROCM=1
export MAX_JOBS=2
EOF
`

GFX=gfx900
echo "Select a GPU type."
# Read the choice from the terminal: when the script is piped into bash, stdin is the script itself.
select INS in RX500Series\(RX550/RX560/RX570/RX580/RX590\) Vega10Series\(Vega56/64/WX9100/FE/MI25\) Vega20Series\(RadeonVII/MI50/MI60\) Default
do
case $INS in
RX500Series\(RX550/RX560/RX570/RX580/RX590\))
GFX=gfx803
break;;
Vega10Series\(Vega56/64/WX9100/FE/MI25\))
GFX=gfx900
break;;
Vega20Series\(RadeonVII/MI50/MI60\))
GFX=gfx906
break;;
Default)
break;;
*) echo "ERROR: Invalid selection"
;;
esac
done < /dev/tty
export HCC_AMDGPU_TARGET=$GFX


echo "$prf" >> ~/.profile
source ~/.profile

pip3 install cython pillow h5py numpy scipy requests scikit-learn matplotlib editdistance pandas portpicker jupyter setuptools pyyaml typing enum34 hypothesis


update-alternatives --install /usr/bin/gcc gcc /usr/bin/clang-7 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/clang++-7 50

# git clone https://github.com/pytorch/pytorch.git
git clone https://github.com/ROCmSoftwarePlatform/pytorch.git pytorch-rocm
cd pytorch-rocm
git checkout e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22 # 1.1.0a0+e6991ed
git submodule init
git submodule update

#python3 tools/amd_build/build_pytorch_amd.py
#python3 tools/amd_build/build_caffe2_amd.py
python3 tools/amd_build/build_amd.py

python3 setup.py install
pip3 install torchvision

cd ~/
clinfo | grep '  Name:'
python3 -c "import torch;print('CUDA(hip) is available',torch.cuda.is_available());print('cuda(hip)_device_num:',torch.cuda.device_count());print('Radeon device:',torch.cuda.get_device_name(torch.cuda.current_device()))"

The point that differs from the official page is the hipify step:

#python3 tools/amd_build/build_pytorch_amd.py
#python3 tools/amd_build/build_caffe2_amd.py
python3 tools/amd_build/build_amd.py

The two hipify scripts have been merged into a single script.

Also, probably because it is still under development, the latest PyTorch Official source (https://github.com/pytorch/pytorch.git) does not compile, whereas PyTorch-ROCm (https://github.com/ROCmSoftwarePlatform/pytorch.git) at the commit

git checkout e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22 # 1.1.0a0+e6991ed

e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22 does appear to compile. No tag has been cut for it either, so some confusion at installation time is still unavoidable.
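Since there is no tag, one way to confirm that an installed build really comes from the pinned commit is to compare the version string with the commit hash. The sketch below assumes the version string has the form shown in the checkout comment (1.1.0a0+e6991ed, that is, a short hash after the "+"); `built_from_commit` is a hypothetical helper, not a PyTorch API.

```python
# Sketch: check that a PyTorch version string was built from the pinned commit.
# Assumes the "<version>+<short hash>" format seen in "1.1.0a0+e6991ed".
PINNED_COMMIT = "e6991ed29fec9a7b7ffb09b6ec58fb9d3fec3d22"


def built_from_commit(version: str, commit: str = PINNED_COMMIT) -> bool:
    """True if the part after '+' is a non-empty prefix of the commit hash."""
    _, sep, local = version.partition("+")
    return bool(sep) and bool(local) and commit.startswith(local)


if __name__ == "__main__":
    try:
        import torch  # only importable after the build has been installed
    except ImportError:
        print(built_from_commit("1.1.0a0+e6991ed"))
    else:
        print(torch.__version__, built_from_commit(torch.__version__))
```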

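HIP is recognized as CUDA

Because of how hipify works, CUDA code is transpiled to HIP code so that it can run on AMD Radeon GPUs. For that reason, the device must be specified as 'cuda'. There is also a hip device option, but specifying hip does not work. You must always specify 'cuda'.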
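As a minimal sketch of what this means in user code, the snippet below selects the device by the name 'cuda' even though the hardware is a Radeon GPU. `pick_device_name` is a hypothetical helper introduced only for illustration; the PyTorch calls are guarded so that the file also runs on a machine where PyTorch is not installed.

```python
# Sketch: on a ROCm build, the Radeon GPU is addressed through the 'cuda' device name.
def pick_device_name(gpu_available: bool) -> str:
    """Return 'cuda' when a GPU is usable (also on ROCm), otherwise 'cpu'."""
    return "cuda" if gpu_available else "cpu"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; would fall back to", pick_device_name(False))
    else:
        device = torch.device(pick_device_name(torch.cuda.is_available()))
        x = torch.ones(2, 3, device=device)  # lives on the Radeon GPU if one is available
        print(x.device, (x * 2).sum().item())
```

Existing CUDA-oriented scripts that already use 'cuda' as the device name should therefore need no device-string changes.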

About the benchmark

Running the benchmark script from https://github.com/marvis/pytorch-mobilenet gave the following results.

AMDGPU RadeonVII + ROCm2.1 + ROCm-PyTorch1.1.0a

use_gpu: True, nb_batches: 1
  resnet18 : 0.005838 (sd 0.000290)
   alexnet : 0.001124 (sd 0.000137)
     vgg16 : 0.001759 (sd 0.000033)
squeezenet : 0.003084 (sd 0.000115)
 mobilenet : 0.007428 (sd 0.000213)
use_gpu: True, nb_batches: 16
  resnet18 : 0.005712 (sd 0.000202)
   alexnet : 0.001107 (sd 0.000019)
     vgg16 : 0.002957 (sd 0.001784)
squeezenet : 0.006802 (sd 0.003843)
 mobilenet : 0.007036 (sd 0.000301)

These are our measurements. For comparison, the following results are published at https://qiita.com/yu4u/items/c6e24d862325fac96f61, a Japanese community site:

Ubuntu 16.04, CPU: i7-7700 3.60GHz、GPU: GeForce GTX1080 PyTorch0.1.11

use_gpu: True, nb_batches: 1
  resnet18 : 0.001915 (sd 0.000057)
   alexnet : 0.000691 (sd 0.000005)
     vgg16 : 0.002390 (sd 0.002091)
squeezenet : 0.002086 (sd 0.000104)
 mobilenet : 0.048602 (sd 0.000380)
use_gpu: True, nb_batches: 16
  resnet18 : 0.006055 (sd 0.005111)
   alexnet : 0.000744 (sd 0.000014)
     vgg16 : 0.025156 (sd 0.029848)
squeezenet : 0.012983 (sd 0.000024)
 mobilenet : 0.064022 (sd 0.000411)

use_gpu: False, nb_batches: 1
  resnet18 : 0.218282 (sd 0.002961)
   alexnet : 0.081834 (sd 0.000445)
     vgg16 : 1.484166 (sd 0.001384)
squeezenet : 0.102657 (sd 0.002118)
 mobilenet : 0.141093 (sd 0.005197)
use_gpu: False, nb_batches: 16
  resnet18 : 0.896854 (sd 0.004594)
   alexnet : 0.283497 (sd 0.003010)
     vgg16 : 5.622119 (sd 0.020102)
squeezenet : 0.514910 (sd 0.004134)
 mobilenet : 0.892604 (sd 0.017502)

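These are published GeForce GTX 1080 results, but comparing against an environment that runs a PyTorch version below 1.0 is not fair, so we plan to rerun a proper benchmark at a later date. We have also observed GPU processes turning into zombies on ROCm 2.1 and ROCm 2.2, so note that, as in the ROCm 1.7 days, stability has regressed somewhat.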
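With that caveat in mind, the two GPU tables can still be put side by side with a few lines of arithmetic. The sketch below copies the nb_batches: 1 timings exactly as printed above and computes their ratio per model; `time_ratios` is a hypothetical helper, and the result is only a rough indication because the PyTorch versions differ.

```python
# Timings copied from the two "use_gpu: True, nb_batches: 1" tables above.
radeon_vii = {"resnet18": 0.005838, "alexnet": 0.001124, "vgg16": 0.001759,
              "squeezenet": 0.003084, "mobilenet": 0.007428}
gtx_1080 = {"resnet18": 0.001915, "alexnet": 0.000691, "vgg16": 0.002390,
            "squeezenet": 0.002086, "mobilenet": 0.048602}


def time_ratios(reference, candidate):
    """Ratio reference/candidate per model; above 1 means the candidate was faster."""
    return {name: reference[name] / candidate[name] for name in reference}


ratios = time_ratios(gtx_1080, radeon_vii)
for name, ratio in ratios.items():
    print(f"{name:>10} : {ratio:.2f}x")
```

A ratio above 1 means the Radeon VII run was faster for that model, and a ratio below 1 means the GTX 1080 run was faster.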

Warnings at compile time

Compiling ROCm-PyTorch from source produces a large number of warnings about loop unrolling, yet the compilation itself goes through without trouble. At run time, the following message appears:

warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

The wording clearly suggests an impact on performance, so a penalty is to be expected.
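To get a feel for how often this happens, the warning can be counted in captured build or run output. The following is a small sketch under assumptions: `count_unroll_warnings` is a hypothetical helper, and the second sample line is a made-up placeholder standing in for unrelated output.

```python
import re

# Sketch: count "loop not unrolled" warnings in captured output lines.
UNROLL_WARNING = re.compile(r"warning: .*loop not unrolled")


def count_unroll_warnings(lines) -> int:
    """Count the lines that contain the loop-unrolling warning."""
    return sum(1 for line in lines if UNROLL_WARNING.search(line))


sample = [
    # The warning text quoted in this post.
    "warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to "
    "perform the requested transformation; the transformation might be disabled "
    "or specified as part of an unsupported transformation ordering",
    # Hypothetical placeholder for any unrelated output line.
    "some unrelated output line",
]

if __name__ == "__main__":
    print(count_unroll_warnings(sample))
```

In practice the lines would come from a saved log, for example by iterating over an open file object.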

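Current impressions

ROCm-PyTorch does not yet unroll loops properly after transpilation and carries a performance penalty, so whether optimizations take effect varies considerably from model to model, and it has to be used with care.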

References

TensorFlow-ROCm / HipCaffe / PyTorch-ROCm / Caffe2 installation https://rocm-documentation.readthedocs.io/en/latest/Deep_learning/Deep-learning.html
TensorFlow-ROCm https://github.com/ROCmSoftwarePlatform/tensorflow-upstream
PyTorch-ROCm https://github.com/ROCmSoftwarePlatform/pytorch.git
PyTorch Official https://github.com/pytorch/pytorch.git
PyTorch discussion https://discuss.pytorch.org/t/pytorch-with-rocm-benchmarks/31535
Qiita @yu4u https://qiita.com/yu4u/items/c6e24d862325fac96f61
ROCm https://github.com/ROCmSoftwarePlatform
MIOpen https://gpuopen.com/compute-product/miopen/
GPUEater summarized TensorFlow-ROCm https://github.com/aieater/rocm_tensorflow_info


GPU EATER - AMD GPU-based Deep Learning Cloud