Isolating a dedicated CUDA environment with Docker containers on Ubuntu

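So that several people can share the computing resources of one server, this guide uses Docker on an Ubuntu host for isolation, then sets up a CUDA environment and a tensorflow-gpu environment inside the containers.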

Host configuration

Update

sudo apt update -y && sudo apt upgrade -y
sudo apt install curl wget git -y

NVIDIA GPU driver installation

Run ubuntu-drivers devices
If it fails with ubuntu-drivers: command not found,
install the ubuntu-drivers-common package

sudo apt-get install ubuntu-drivers-common

Check the recommended NVIDIA driver version

ubuntu-drivers devices

Output

== /sys/devices/pci0000:57/0000:57:00.0/0000:58:00.0 ==
modalias : pci:v000010DEd000025B6sv000010DEsd0000157Ebc03sc02i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-510-server - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-515 - distro non-free recommended
driver   : nvidia-driver-515-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

The entry tagged recommended is the driver version suggested for this GPU.
Install the recommended version

sudo apt install nvidia-driver-515
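
Reading the recommended entry by eye can also be scripted. The sketch below is only an illustration: the function name recommended_driver is hypothetical, and it assumes the output keeps the "driver : <package> - <tags>" layout shown above.

```python
import re


def recommended_driver(devices_output: str):
    """Return the package on the line tagged 'recommended', or None if absent."""
    for line in devices_output.splitlines():
        # Expected layout: "driver   : <package> - <space-separated tags>"
        match = re.match(r"\s*driver\s*:\s*(\S+)\s+-\s+(.*)", line)
        if match and "recommended" in match.group(2).split():
            return match.group(1)
    return None


# A few lines copied from the output shown above
sample = """\
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-515 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin
"""
print(recommended_driver(sample))
```

On a real host the text could come from subprocess.run(["ubuntu-drivers", "devices"], capture_output=True, text=True).stdout instead of the sample string.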

Docker installation

Official website

curl -sSL https://get.daocloud.io/docker | sh

Add the NVIDIA Container Toolkit repository

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
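
The first line of the command above builds the distribution string by sourcing /etc/os-release and concatenating ID and VERSION_ID. As a rough illustration of what that produces, here is the same lookup in Python; the function name distribution_id is hypothetical and the sample text is a shortened stand-in for a real os-release file.

```python
import shlex


def distribution_id(os_release_text: str) -> str:
    """Mimic `. /etc/os-release; echo $ID$VERSION_ID` for simple KEY=value files."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw_value = line.partition("=")
        # shlex.split removes the optional shell quotes around the value
        parts = shlex.split(raw_value)
        fields[key] = parts[0] if parts else ""
    return fields["ID"] + fields["VERSION_ID"]


# Shortened example of an os-release file on Ubuntu 20.04
sample = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu\n'
print(distribution_id(sample))
```

The resulting string is what gets substituted into the nvidia-docker.list URL.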

Install the nvidia-docker2 package

sudo apt-get update
sudo apt-get install -y nvidia-docker2

Restart the Docker daemon

sudo systemctl restart docker

Pulling images

tensorflow-gpu

Official documentation
Pull the latest tensorflow-gpu image

docker pull tensorflow/tensorflow:latest-gpu

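Docker container configuration

Creating a container

tensorflow-gpu

Create a container with the CUDA and TensorFlow environment, mapping port 22 of the container to port 8022 of the host. Because of --rm, the container is removed as soon as this shell exits.

docker run --gpus all -p 8022:22 -it --rm tensorflow/tensorflow:latest-gpu /bin/bash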

CUDA-only environment

Create a container with CUDA 11.7.0 on Ubuntu 20.04, mapping port 8022 of the host to port 22 of the container

docker run -p 8022:22 -it --gpus all nvidia/cuda:11.7.0-base-ubuntu20.04 /bin/bash
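
When several users each get their own container, every container needs a different host port. The sketch below only builds the command strings; the helper docker_run_command, the user names, the container names and the "8022 plus offset" port scheme are all assumptions made for illustration, not part of the original setup. The --name flag is a standard docker run option.

```python
import shlex


def docker_run_command(host_port: int, image: str, name: str) -> str:
    """Build a `docker run` line that maps host_port to the container's SSH port 22."""
    argv = [
        "docker", "run", "--gpus", "all",
        "-p", f"{host_port}:22",
        "--name", name,
        "-it", image, "/bin/bash",
    ]
    return shlex.join(argv)


# Hypothetical users; each one gets the next free host port starting at 8022
users = ["alice", "bob"]
for offset, user in enumerate(users):
    print(docker_run_command(8022 + offset,
                             "nvidia/cuda:11.7.0-base-ubuntu20.04",
                             f"cuda-{user}"))
```

Each printed line can be reviewed and then run by hand on the host.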

Ubuntu configuration inside the container

Install the necessary tools

apt update
apt upgrade
apt install vim nano net-tools openssh-server openssh-client

Start the SSH service

/etc/init.d/ssh start

Add PermitRootLogin yes to the sshd_config file so that the root user can connect over SSH

nano /etc/ssh/sshd_config
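
The same edit can be made without an interactive editor. This is a minimal sketch under stated assumptions: permit_root_login is a hypothetical helper, it drops any other active PermitRootLogin line, and it puts the new line first because sshd uses the first value it reads for a keyword. It assumes the file has no Match blocks that set PermitRootLogin on purpose.

```python
def permit_root_login(config_text: str) -> str:
    """Return sshd_config text with a single active 'PermitRootLogin yes' at the top."""
    kept = []
    for line in config_text.splitlines():
        words = line.split()
        # sshd keywords are case-insensitive; commented lines start with '#'
        # and are therefore left untouched
        if words and words[0].lower() == "permitrootlogin":
            continue
        kept.append(line)
    return "\n".join(["PermitRootLogin yes", *kept]) + "\n"


sample = (
    "#PermitRootLogin prohibit-password\n"
    "PermitRootLogin no\n"
    "PasswordAuthentication yes\n"
)
print(permit_root_login(sample), end="")
```

To apply it, read /etc/ssh/sshd_config, pass the text through the function, write the result back, and then restart the SSH service.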

Restart the SSH service

service ssh restart

Set the root password used for SSH login

passwd root

Check the container's IP address

ifconfig

Environment setup

Anaconda download page
Download the Anaconda installer

wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh

Install Anaconda

bash ./Anaconda3-2022.05-Linux-x86_64.sh

Handwritten digit recognition example

Requirements

tensorflow
numpy
cuda
cudnn
python
Internet access

Program

~/example/HandWriteNumberDetection.py

import numpy as np
import tensorflow as tf

# Load MNIST (downloaded on first use, so the container needs Internet access)
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_images.shape, train_labels.shape)

# Scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255
test_images = test_images / 255
print(train_images.shape, test_images.shape)

# Add the channel axis that Conv2D expects: (N, 28, 28) -> (N, 28, 28, 1)
train_images = train_images[..., np.newaxis]
test_images = test_images[..., np.newaxis]

model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2,2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])

# Labels are integer class ids, so the sparse loss serves both training and evaluation
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10)

# Prints [test loss, test accuracy]
print(model.evaluate(test_images, test_labels))
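
The script trains on integer class labels with the sparse_categorical_crossentropy loss. For a single sample this is the same quantity as categorical_crossentropy applied to a one-hot label. The plain-Python sketch below illustrates that relationship; the function names are illustrative and are not Keras APIs, and Keras additionally clips probabilities and averages over the batch.

```python
import math


def sparse_categorical_crossentropy(label: int, probs: list) -> float:
    """Loss for an integer class label: -log of the probability of the true class."""
    return -math.log(probs[label])


def to_one_hot(label: int, num_classes: int) -> list:
    """Encode an integer label as a one-hot vector."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def categorical_crossentropy(one_hot: list, probs: list) -> float:
    """Loss for a one-hot label: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


# Made-up softmax output over 4 classes; the true class is 2
probs = [0.05, 0.05, 0.7, 0.2]
label = 2
sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(to_one_hot(label, len(probs)), probs)
print(math.isclose(sparse_loss, dense_loss))
```

Since only the true class has a non-zero target, every other term of the sum vanishes and both forms reduce to the negative log-probability of the true class.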

User manual

This manual uses the tensorflow-gpu container as its example

Server configuration

Linux 5.15.0-46-generic #49~20.04.1-Ubuntu x86_64 GNU/Linux

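Connecting to the server

The server has no desktop environment. Connect to it over SSH first by running the following on the command line

ssh -p <port> <username>@<ip>

where

<port> is the port of the server's SSH service, e.g. 8022; take the actual value from the notice sent by the administrator
<username> is the user name, e.g. root; take the actual value from the notice sent by the administrator
<ip> is the server's IP address, e.g. 10.191.86.106; take the actual value from the notice sent by the administrator

The rest of this tutorial assumes port 8022, IP 10.191.86.106 and user name root.
The login command is then

ssh -p 8022 root@10.191.86.106

After running it, the following prompt appears

root@10.191.86.106's password:

Enter the password (root). Nothing is echoed while you type; press Enter when you are done. A successful login looks roughly like this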

Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.15.0-46-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Thu Aug 25 17:41:05 2022 from 10.252.128.6

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

root@a8da98e88a5e:~#

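Run the update command

apt update -y && apt upgrade -y

Use ls to list the directory

root@a8da98e88a5e:~# ls
example

Use cd to enter the example directory

cd example

Run the handwritten digit recognition example HandWriteNumberDetection.py in the example directory

python ~/example/HandWriteNumberDetection.py

The output looks roughly like this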

2022-08-25 17:42:49.768873: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(60000, 28, 28) (60000,)
(60000, 28, 28) (10000, 28, 28)
2022-08-25 17:42:51.608176: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-25 17:42:52.220798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13201 MB memory:  -> device: 0, name: NVIDIA A2, pci bus id: 0000:58:00.0, compute capability: 8.6
Epoch 1/10
2022-08-25 17:42:54.256235: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8500
2022-08-25 17:42:54.888831: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-08-25 17:42:55.096365: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
1875/1875 [==============================] - 13s 5ms/step - loss: 0.1218 - accuracy: 0.9622
Epoch 2/10
curacy: 0.9874
Epoch 3/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0267 - accuracy: 0.9917
Epoch 4/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0208 - accuracy: 0.9936
Epoch 5/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0155 - accuracy: 0.9949
Epoch 6/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0119 - accuracy: 0.9962
Epoch 7/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0103 - accuracy: 0.9967
Epoch 8/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0068 - accuracy: 0.9979
Epoch 9/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0075 - accuracy: 0.9975
Epoch 10/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0067 - accuracy: 0.9977
313/313 [==============================] - 2s 5ms/step - loss: 0.0421 - accuracy: 0.9904
[0.0420583114027977, 0.9904000163078308]
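
To compare runs it can help to pull the loss and accuracy figures out of a saved log. The sketch below is an illustration only: parse_metrics is a hypothetical helper, and it assumes the progress lines keep the "- loss: ... - accuracy: ..." layout shown above with plain decimal numbers (no exponent notation).

```python
import re

# Matches the tail of a Keras progress line such as
# "... - loss: 0.1218 - accuracy: 0.9622"
METRIC_LINE = re.compile(r"- loss: (?P<loss>[\d.]+) - accuracy: (?P<accuracy>[\d.]+)")


def parse_metrics(log_text: str):
    """Return a list of (loss, accuracy) tuples, one per matching progress line."""
    results = []
    for line in log_text.splitlines():
        match = METRIC_LINE.search(line)
        if match:
            results.append((float(match["loss"]), float(match["accuracy"])))
    return results


# Lines copied from the output above
sample = """\
Epoch 1/10
1875/1875 [==============================] - 13s 5ms/step - loss: 0.1218 - accuracy: 0.9622
Epoch 3/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0267 - accuracy: 0.9917
313/313 [==============================] - 2s 5ms/step - loss: 0.0421 - accuracy: 0.9904
"""
for loss, accuracy in parse_metrics(sample):
    print(loss, accuracy)
```

The last tuple in this sample comes from the 313-step evaluation pass, so it is the test result rather than a training epoch.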

Use nano to view the code of HandWriteNumberDetection.py

nano ~/example/HandWriteNumberDetection.py

Press Ctrl + X to exit the editor

Common commands

Check GPU status

nvidia-smi
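
For scripted checks, nvidia-smi can also print selected fields as CSV through its --query-gpu and --format options. The sketch below is a minimal example under assumptions: the helper names are hypothetical, the sample row uses made-up memory figures, and it assumes the memory fields are reported as plain integers in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text: str):
    """Turn 'name, used, total' CSV rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        # rsplit keeps the name intact even if it were to contain a comma
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpus():
    """Run nvidia-smi and parse its output (needs a machine with the NVIDIA driver)."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(output)


# Illustrative row only; the memory numbers are made up
print(parse_gpu_rows("NVIDIA A2, 120, 15356\n"))
```

Calling query_gpus() inside the container returns one dictionary per visible GPU.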

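FAQ

Q Using sudo inside Docker produces an error
A Docker logs you in as root by default, which already has the highest privileges, so sudo is not needed to elevate permissions

About

This manual is maintained and updated by FaterYU
Last updated on 2022-8-26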