Containerized GPU training on Windows Server 2019

:::info GPU acceleration in Windows containers: the container host must run Windows Server 2019, or Windows 10 version 1809 or newer. :::

Windows Server 2019 editions

Both Windows Server 2019 Standard and Essentials offer a 180-day evaluation: https://www.microsoft.com/en-us/evalcenter/evaluate-windows-server-2019-essentials

  • Standard: download the official evaluation ISO and install it; the trial starts immediately.
  • Essentials: Microsoft provides a trial product key:

    NJ3X8-YTJRF-3R9J9-D78MF-4YBP4

To run Docker, the Containers feature must be enabled on Windows (a PowerShell sketch for supported editions is shown below). :::danger However, the Containers feature cannot be enabled on the Essentials edition.

:::
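On editions that do support it, the Containers feature (and Hyper-V, if Linux containers or Hyper-V isolation are needed) can be enabled from PowerShell. This is a minimal sketch using the standard Windows Server cmdlets, not a step taken from the original notes; a reboot is usually required afterwards.

# Enable the Containers feature on Windows Server (Standard/Datacenter)
Install-WindowsFeature -Name Containers

# Optional: Hyper-V is required for Hyper-V isolation and Linux containers
Install-WindowsFeature -Name Hyper-V -IncludeManagementTools

Restart-Computer -Force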

Install Docker

There are two ways to install Docker on Windows:

  1. Docker Desktop for Windows - runs both Linux and Windows containers on Windows. The Docker Desktop installation includes Docker Engine, the Docker CLI client, Docker Compose, Notary, Kubernetes, and Credential Helper.

  2. Docker on Windows - Windows containers only, with a common API and command-line interface (CLI)


Docker Desktop for Windows

Just run the installer. Docker Desktop includes built-in settings, a status view, a taskbar UI, and a one-click option to enable Kubernetes.

It can switch between Linux containers and Windows containers (each mode only manages its own containers).

Two kinds of containers on Windows

  1. Linux Container
  2. Windows Container

Windows containers are only supported on specific OS versions and require the Containers feature, while Linux containers only need Hyper-V enabled. Docker Desktop can switch between the two container types (the default is Linux containers), but Docker on Windows supports Windows containers only and cannot switch. This is why Windows Server 2019 Essentials cannot install and run Docker on Windows properly, yet can still install Docker Desktop for Windows. One way to switch the daemon from the command line is sketched below.
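A minimal sketch, assuming Docker Desktop is installed in its default path; DockerCli.exe -SwitchDaemon toggles between the Linux and Windows daemons (the same action as the taskbar menu item):

# Toggle Docker Desktop between Linux containers and Windows containers
& 'C:\Program Files\Docker\Docker\DockerCli.exe' -SwitchDaemon

# Verify which daemon is currently active (prints "linux" or "windows")
docker version --format '{{.Server.Os}}'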


Docker on Windows

https://github.com/OneGet/MicrosoftDockerProvider https://docs.microsoft.com/zh-tw/virtualization/windowscontainers/deploy-containers/deploy-containers-on-server

Install Docker with the OneGet provider PowerShell module

Install the OneGet PowerShell module

Install-Module -Name DockerMsftProvider -Repository PSGallery -Force

Install the OneGet Docker provider

Import-Module -Name DockerMsftProvider -Force
Import-PackageProvider -Name DockerMsftProvider -Force

Install Docker

Upgrade to the latest version of docker:

Install-Package -Name docker -ProviderName DockerMsftProvider -Verbose -Update
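After installation (a reboot may be needed for the Containers feature to take effect), the Docker engine can be started and checked. A quick illustrative verification, not part of the original notes:

# Start the Docker engine service and confirm client/server versions
Start-Service docker
docker version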

:::info GPU acceleration in Windows containers: the container host must run Docker Engine 19.03 or newer. :::


Windows base image for containers

https://hub.docker.com/_/microsoft-windows

:::info GPU acceleration in Windows containers: the container base image must be mcr.microsoft.com/windows:1809 or newer. :::

The official ISO for the Windows Server 2019 Standard evaluation ships with OS Build 17763.737.

Since we want to use the 1809 Windows image, at least 10.0.17763.1397 is required; in this test, the host was upgraded and OS Build 17763.1369 ran correctly.
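To check the host's current build and revision before pulling the image, one option (an illustrative PowerShell snippet, not from the original notes) is to read them from the registry:

# Show the OS build and update revision, e.g. 17763.1369
$v = Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion'
"$($v.CurrentBuild).$($v.UBR)"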

  • (All tests below use windows:1809.)
docker pull mcr.microsoft.com/windows:1809

There are three other base images:

  • windows/iotcore: Windows IoT Core base OS container image
  • windows/nanoserver: Nano Server base OS container image
  • windows/servercore: Windows Server Core base OS container image

Dockerfiles for Windows containers only support these four base images; Linux-based base images cannot be used. :::danger GPU acceleration in Windows containers does not support the Windows Server Core and Nano Server container images. :::


GPU Training Samples

DirectX Container Sample

https://github.com/MicrosoftDocs/Virtualization-Documentation/tree/master/windows-container-samples/directx

:::info GPU acceleration in Windows containers: DirectX (and all frameworks built on top of it) is the only API that can be GPU-accelerated. Third-party frameworks are not supported. :::

This sample container uses the WinMLRunner executable in its performance benchmarking mode. It evaluates an ML model 100 times with garbage data, first on the CPU and then on the GPU, and finally produces a report with performance metrics. https://github.com/Microsoft/Windows-Machine-Learning/tree/master/Tools/WinMLRunner

FROM mcr.microsoft.com/windows:1809

WORKDIR C:/App

# Download and extract the ONNX model to be used for evaluation.
RUN curl.exe -o tiny_yolov2.tar.gz https://onnxzoo.blob.core.windows.net/models/opset_7/tiny_yolov2/tiny_yolov2.tar.gz && \
    tar.exe -xf tiny_yolov2.tar.gz && \
    del tiny_yolov2.tar.gz

# Download and extract cli tool for evaluation .onnx model with WinML.
RUN curl.exe -L -o WinMLRunner_x64_Release.zip https://github.com/microsoft/Windows-Machine-Learning/releases/download/1.2.1.1/WinMLRunner.v1.2.1.1.zip && \
    tar.exe -xf C:/App/WinMLRunner_x64_Release.zip && \
    del WinMLRunner_x64_Release.zip

# Run the model evaluation when container starts.
ENTRYPOINT ["C:/App/WinMLRunner v1.2.1.1/x64/WinMLRunner.exe", "-model", "C:/App/tiny_yolov2/model.onnx", "-terse", "-iterations", "100", "-perf"]

Next, go back to the command prompt, cd into the directory containing the Dockerfile, and build it:

docker build . -t winml-runner

If the build finishes without errors, run the container:

docker run --isolation process --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 winml-runner

Sample output:

:::spoiler


.\WinMLRunner.exe -model SqueezeNet.onnx
WinML Runner
GPU: NVIDIA Tesla P4

Loading model (path = SqueezeNet.onnx)...
=================================================================
Name: squeezenet_old
Author: onnx-caffe2
Version: 9223372036854775807
Domain:
Description:
Path: SqueezeNet.onnx
Support FP16: false

Input Feature Info:
Name: data_0
Feature Kind: Float

Output Feature Info:
Name: softmaxout_1
Feature Kind: Float

=================================================================

Binding (device = CPU, iteration = 1, inputBinding = CPU, inputDataType = Tensor)...[SUCCESS]
Evaluating (device = CPU, iteration = 1, inputBinding = CPU, inputDataType = Tensor)...[SUCCESS]
Outputting results..
Feature Name: softmaxout_1
resultVector[818] has the maximal value of 1


Binding (device = GPU, iteration = 1, inputBinding = CPU, inputDataType = Tensor)...[SUCCESS]
Evaluating (device = GPU, iteration = 1, inputBinding = CPU, inputDataType = Tensor)...[SUCCESS]
Outputting results..
Feature Name: softmaxout_1
resultVector[818] has the maximal value of 1

:::

TensorFlow DirectML Sample

https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-windows

:::warning tensorflow only supports 64-bit Python 3.5 - 3.7, and it requires the msvcp140.dll component; the fix is to install the Microsoft Visual C++ 2015 Redistributable Update 3. The Python script used in this example is placed in the same directory as the Dockerfile. :::

Dockerfile:


FROM mcr.microsoft.com/windows:1809

# assign work directory
WORKDIR /python

# move all files to work directory including test.py
COPY . /python

# Silent Install Microsoft Visual C++ 2015 Redistributable Update 3
RUN powershell.exe -Command \
    wget https://download.microsoft.com/download/9/3/F/93FCF1E7-E6A4-478B-96E7-D4B285925B00/vc_redist.x64.exe -OutFile vc_redist.x64.exe ; \
    Start-Process vc_redist.x64.exe -ArgumentList '/q /norestart' -Wait ; \
    Remove-Item vc_redist.x64.exe -Force

# Silent Install Python 3.6.1 64bits
RUN powershell.exe -Command \
    $ErrorActionPreference = 'Stop'; \
    [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12; \
    wget https://www.python.org/ftp/python/3.6.1/python-3.6.1rc1-amd64.exe -OutFile python-3.6.1rc1-amd64.exe ; \
    Start-Process python-3.6.1rc1-amd64.exe -ArgumentList '/quiet InstallAllUsers=1 PrependPath=1' -Wait ; \
    Remove-Item python-3.6.1rc1-amd64.exe -Force

RUN pip install tensorflow-directml

# -u to ensure python print output is flushed to the console
CMD ["py", "-u", "test.py"]

Install Python via command line/PowerShell without a UI (quiet/silent install). Pick a version from https://www.python.org/ftp/python/ and run the following:

[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
wget https://www.python.org/ftp/python/[version].exe -OutFile c:\[version].exe
Start-Process c:\[version].exe -ArgumentList '/quiet InstallAllUsers=1 PrependPath=1'
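After a silent install with PrependPath=1, opening a new shell and checking the interpreter is a quick sanity test; an illustrative check, not from the original notes:

# Confirm the Python launcher and pip are available
py --version
py -m pip --version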

The content of test.py just checks whether tensorflow was installed successfully:

import tensorflow.compat.v1 as tf
tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
print(tf.add([1.0, 2.0], [3.0, 4.0]))

Then build and run the image:

docker build -t tensorflow-directml .
docker run -it tensorflow-directml

result:

2020-07-23 20:06:09.756930: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:45] DirectML device enumeration: found 1 compatible adapters. 

2020-07-23 20:06:09.917532: I tensorflow/core/common_runtime/dml/dml_device_factory.cc:32] DirectML: creating device on adapter 0 (Microsoft Basic Render Driver) 

2020-07-23 20:06:09.433379: I tensorflow/stream_executor/platform/default/dso_loader.cc:60] Successfully opened dynamic library DirectMLba106a7c621ea741d2159d8708ee581c11918380.dll 

2020-07-23 20:06:09.558039: I tensorflow/core/common_runtime/eager/execute.cc:571] Executing op Add in device /job:localhost/replica:0/task:0/device:DML:0 

tf.Tensor([4. 6.], shape=(2,), dtype=float32) 

The image has been built and pushed to Docker Hub: https://hub.docker.com/r/msxlol/tensorflow-directml-sample

How to Use

docker run msxlol/tensorflow-directml

:::danger While running, the container could detect a GPU, but it always picked up the Microsoft Basic Render Driver instead of the NVIDIA Tesla P4 we actually wanted to use.

The final solution was to use DDA (Discrete Device Assignment) to pass the whole PCIe device through to a VM. With CentOS 7 and the Tesla P4 driver installed on that VM, the correct Tesla P4 was detected instead of the Microsoft Basic Render Driver. https://docs.microsoft.com/zh-tw/windows-server/virtualization/hyper-v/deploy/deploying-graphics-devices-using-dda https://docs.microsoft.com/zh-tw/windows-server/virtualization/hyper-v/plan/plan-for-gpu-acceleration-in-windows-server

Discrete Device Assignment (DDA), also known as GPU passthrough, dedicates one or more physical GPUs to a virtual machine. In a DDA deployment, the virtualized workload runs on the device's native driver and typically has full access to the GPU's functionality. DDA offers the highest level of application compatibility and potential performance (a rough command sketch follows this note). :::
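For reference, the rough DDA flow on the Hyper-V host looks like the following. This is an illustrative sketch based on the Microsoft DDA documentation; the location path and the VM name GpuVM are placeholders, and the MMIO space sizes depend on the GPU:

# PCIe location path of the GPU (find it in Device Manager; placeholder shown here)
$locationPath = 'PCIROOT(0)#PCI(0300)#PCI(0000)'

# Prepare the VM for device passthrough
Set-VM -Name GpuVM -GuestControlledCacheTypes $true
Set-VM -Name GpuVM -LowMemoryMappedIoSpace 3Gb
Set-VM -Name GpuVM -HighMemoryMappedIoSpace 33280Mb

# Disable the device on the host (e.g. in Device Manager), then dismount and assign it
Dismount-VMHostAssignableDevice -Force -LocationPath $locationPath
Add-VMAssignableDevice -LocationPath $locationPath -VMName GpuVM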
