
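Isolating individual CUDA environments with Docker containers on Ubuntu

To let several people share the compute resources of one server, Docker is used on the Ubuntu host to isolate users from each other, and a CUDA environment plus a tensorflow-gpu environment is configured inside the container.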

Host configuration

Update

sudo apt update -y && sudo apt upgrade -y
sudo apt install curl wget git -y

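NVIDIA driver installation

If running ubuntu-drivers devices fails with "ubuntu-drivers: command not found", install the ubuntu-drivers-common package first:

sudo apt-get install ubuntu-drivers-common

Then list the NVIDIA driver versions recommended for this machine:

ubuntu-drivers devices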

Output:

== /sys/devices/pci0000:57/0000:57:00.0/0000:58:00.0 ==
modalias : pci:v000010DEd000025B6sv000010DEsd0000157Ebc03sc02i00
vendor : NVIDIA Corporation
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-510-server - distro non-free
driver : nvidia-driver-470-server - distro non-free
driver : nvidia-driver-510 - distro non-free
driver : nvidia-driver-515 - distro non-free recommended
driver : nvidia-driver-515-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin

The driver marked recommended is the suggested version. Install it:

sudo apt install nvidia-driver-515
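
When the driver list is long, the recommended entry can also be picked out by a script. The following is a minimal Python sketch, assuming the output format shown above; the function name recommended_driver and the embedded sample text are illustrative and not part of ubuntu-drivers.

```python
import re

# Shortened sample copied from the `ubuntu-drivers devices` output above.
SAMPLE_OUTPUT = """\
vendor : NVIDIA Corporation
driver : nvidia-driver-470 - distro non-free
driver : nvidia-driver-515 - distro non-free recommended
driver : nvidia-driver-515-server - distro non-free
driver : xserver-xorg-video-nouveau - distro free builtin
"""


def recommended_driver(output):
    """Return the package name on the line flagged 'recommended', or None."""
    for line in output.splitlines():
        match = re.match(r"\s*driver\s*:\s*(\S+)\s+-\s+.*\brecommended\b", line)
        if match:
            return match.group(1)
    return None


print(recommended_driver(SAMPLE_OUTPUT))
```

The returned name can then be passed to apt install. If no line carries the recommended flag the function returns None, and the version has to be chosen by hand.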

Docker installation

Official website

Install Docker with the installation script:

curl -sSL https://get.daocloud.io/docker | sh

Add the NVIDIA Container Toolkit package repository:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
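
The distribution variable in the command above is the ID and VERSION_ID fields of /etc/os-release joined together. The Python sketch below shows the same derivation on a sample string; the function name distribution_id and the sample contents are assumptions made for illustration.

```python
def distribution_id(os_release_text):
    """Concatenate ID and VERSION_ID, mirroring `echo $ID$VERSION_ID`."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields.get("ID", "") + fields.get("VERSION_ID", "")


# Illustrative excerpt of an Ubuntu 20.04 /etc/os-release file
SAMPLE = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu\nID_LIKE=debian\n'
print(distribution_id(SAMPLE))  # ubuntu20.04
```

On Ubuntu 20.04 the result is ubuntu20.04, which becomes part of the repository list URL.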

Install the nvidia-docker2 package:

sudo apt-get update
sudo apt-get install -y nvidia-docker2

Restart the Docker daemon:

sudo systemctl restart docker
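
After the restart it is worth checking that Docker knows about the NVIDIA runtime. The sketch below is one possible check in Python, assuming that nvidia-docker2 registers a runtime named nvidia and that docker info can be run by the current user; both helper functions are hypothetical names.

```python
import json
import shutil
import subprocess


def has_nvidia_runtime(runtimes_json):
    """True if the JSON object of Docker runtimes has an 'nvidia' entry."""
    try:
        runtimes = json.loads(runtimes_json)
    except json.JSONDecodeError:
        return False
    return isinstance(runtimes, dict) and "nvidia" in runtimes


def docker_runtimes_json():
    """Ask the local Docker daemon for its runtimes; None if unavailable."""
    if shutil.which("docker") is None:
        return None
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True,
    )
    output = result.stdout.strip()
    return output if result.returncode == 0 and output else None


runtimes = docker_runtimes_json()
if runtimes is None:
    print("docker is not available on this machine")
else:
    print("nvidia runtime registered:", has_nvidia_runtime(runtimes))
```

If the check prints False, the usual suspects are a failed package installation or a Docker daemon that was not restarted.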

Downloading the image

tensorflow-gpu

Official documentation
Pull the latest tensorflow-gpu image:

docker pull tensorflow/tensorflow:latest-gpu

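Docker container configuration

Creating a container

tensorflow-gpu

Create a container with the CUDA and TensorFlow environment, mapping port 22 of the container to port 8022 on the host. Note that --rm deletes the container when the shell exits, so omit it if the container should persist:

docker run --gpus all -p 8022:22 -it --rm tensorflow/tensorflow:latest-gpu /bin/bash

CUDA only

Create a container based on CUDA 11.7.0 and Ubuntu 20.04, mapping port 8022 on the host to port 22 of the container:

docker run -p 8022:22 -it --gpus all nvidia/cuda:11.7.0-base-ubuntu20.04 /bin/bash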
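
Since the goal is a server shared by several people, each user needs a container with its own host port. The Python sketch below generates one docker run command per user from the flags used above. The user names, the port numbers, and the tf-<user> naming scheme are all hypothetical; only the flags themselves come from the commands in this section, plus --name.

```python
import shlex


def docker_run_command(image, host_port, name):
    """Build a `docker run` argument list for one user's GPU container."""
    return [
        "docker", "run",
        "--gpus", "all",          # expose all host GPUs, as in the commands above
        "-p", f"{host_port}:22",  # host port -> container SSH port
        "--name", name,           # hypothetical naming scheme
        "-it", image, "/bin/bash",
    ]


# Hypothetical users and host ports; adjust to the real allocation.
users = {"alice": 8022, "bob": 8023}
for user, port in users.items():
    command = docker_run_command("tensorflow/tensorflow:latest-gpu", port, f"tf-{user}")
    print(shlex.join(command))
```

Printing the commands instead of executing them keeps the sketch harmless; an administrator can review the output and run each line by hand.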

Ubuntu configuration inside the container

Install the necessary tools:

apt update
apt upgrade
apt install vim nano net-tools openssh-server openssh-client

Start the SSH service:

/etc/init.d/ssh start

Add the line PermitRootLogin yes to the sshd_config file so that the root user can connect over SSH:

nano /etc/ssh/sshd_config
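
When many containers have to be prepared, the same edit can be scripted instead of done in nano. The following Python sketch rewrites the text of an sshd_config file so that it contains exactly one active PermitRootLogin yes line. It assumes a simple configuration without Match blocks; the function name permit_root_login is made up for this example.

```python
import re


def permit_root_login(config_text):
    """Return config_text with exactly one active 'PermitRootLogin yes' line."""
    output, done = [], False
    for line in config_text.splitlines():
        # Matches both active and commented-out PermitRootLogin directives
        if re.match(r"\s*#?\s*PermitRootLogin\b", line):
            if not done:
                output.append("PermitRootLogin yes")
                done = True
            continue  # later duplicates are dropped
        output.append(line)
    if not done:
        output.append("PermitRootLogin yes")
    return "\n".join(output) + "\n"


sample = "Port 22\n#PermitRootLogin prohibit-password\nPasswordAuthentication yes\n"
print(permit_root_login(sample), end="")
```

The function only transforms text. Reading /etc/ssh/sshd_config, writing the result back, and restarting the SSH service are left to the caller.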

Restart the SSH service:

service ssh restart

Set the password used for SSH login:

passwd root

Check the container's IP address:

ifconfig

Environment setup

Anaconda download page
Download the Anaconda installer:

wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
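
Before running a downloaded installer, its checksum can be compared with the value published for that file. The Python sketch below computes the SHA-256 digest of the installer if it is present in the current directory. The expected hash is deliberately not included here and has to be taken from Anaconda's own listing; sha256_of is a hypothetical helper name.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


installer = "Anaconda3-2022.05-Linux-x86_64.sh"
if os.path.exists(installer):
    print(sha256_of(installer))
else:
    print(f"{installer} not found in the current directory")
```

Reading in chunks keeps memory use low, which matters for an installer of several hundred megabytes.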

Install Anaconda:

bash ./Anaconda3-2022.05-Linux-x86_64.sh

Handwritten digit recognition example

Prerequisites

  • tensorflow
  • numpy
  • cuda
  • cudnn
  • python
  • network access

Program

~/example/HandWriteNumberDetection.py

import tensorflow as tf

# Load MNIST; the dataset is downloaded on first use, hence the network requirement
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
print(train_images.shape, train_labels.shape)

# Scale pixel values from [0, 255] to [0, 1]
train_images = train_images / 255
test_images = test_images / 255
print(train_images.shape, test_images.shape)

# Add the channel axis that Conv2D expects: (N, 28, 28) -> (N, 28, 28, 1)
train_images = train_images[..., None]
test_images = test_images[..., None]

# Two convolution/pooling stages followed by a dense classifier
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# The labels are integer class indices, so the sparse loss serves for both training and evaluation
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10)

# Prints [test loss, test accuracy]
print(model.evaluate(test_images, test_labels))
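
Before running the program, it can help to confirm that TensorFlow actually sees the GPU inside the container, because training silently falls back to the CPU otherwise. The sketch below is a small standalone check; visible_gpus is a hypothetical helper, and the import is guarded so the snippet also runs where TensorFlow is missing.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif gpus:
    print("GPUs visible to TensorFlow:", gpus)
else:
    print("No GPU visible; training would fall back to the CPU")
```

An empty list usually points to a container started without --gpus all or to a driver problem on the host.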

User manual

This manual uses the tensorflow-gpu container as the example.

Server configuration

Linux 5.15.0-46-generic #49~20.04.1-Ubuntu x86_64 GNU/Linux

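Connecting to the server

The server has no desktop environment. Connect to it over SSH first by running the following on the command line:

ssh -p <port> <username>@<ip>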

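where

  • <port> is the port of the server's SSH service, e.g. 8022; confirm the actual value in the notice sent by the administrator
  • <username> is the user name, e.g. root; confirm the actual value in the notice sent by the administrator
  • <ip> is the server's IP address, e.g. 10.191.86.106; confirm the actual value in the notice sent by the administrator

The rest of this tutorial assumes port 8022, IP address 10.191.86.106, and user name root.
The login command is then:

ssh -p 8022 root@10.191.86.106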
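
If the login command hangs or is refused, a quick reachability test of the SSH port helps to separate network problems from password problems. The following Python sketch attempts a plain TCP connection; port_open is a hypothetical helper, and the address and port are the example values used in this tutorial.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


# Example values from this tutorial; replace them with the real ones.
print(port_open("10.191.86.106", 8022))
```

A result of False means the port cannot be reached from the current machine, for example because the container is not running or the port mapping is missing. A result of True only shows that something is listening there, not that the credentials are correct.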

The following prompt then appears:

root@10.191.86.106's password:

Enter the password (root). Nothing is echoed while typing; press Enter when done. A successful login looks roughly like this:

Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.15.0-46-generic x86_64)

* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Thu Aug 25 17:41:05 2022 from 10.252.128.6

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

root@a8da98e88a5e:~#

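Run the update command:

apt update -y && apt upgrade -y

List the directory contents with ls:

root@a8da98e88a5e:~# ls
example

Enter the example directory with cd:

cd example

Run the handwritten digit recognition example HandWriteNumberDetection.py in the example directory:

python ~/example/HandWriteNumberDetection.py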

The output looks roughly like this:

2022-08-25 17:42:49.768873: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(60000, 28, 28) (60000,)
(60000, 28, 28) (10000, 28, 28)
2022-08-25 17:42:51.608176: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-25 17:42:52.220798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13201 MB memory: -> device: 0, name: NVIDIA A2, pci bus id: 0000:58:00.0, compute capability: 8.6
Epoch 1/10
2022-08-25 17:42:54.256235: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8500
2022-08-25 17:42:54.888831: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2022-08-25 17:42:55.096365: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
1875/1875 [==============================] - 13s 5ms/step - loss: 0.1218 - accuracy: 0.9622
Epoch 2/10
curacy: 0.9874
Epoch 3/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0267 - accuracy: 0.9917
Epoch 4/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0208 - accuracy: 0.9936
Epoch 5/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0155 - accuracy: 0.9949
Epoch 6/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0119 - accuracy: 0.9962
Epoch 7/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0103 - accuracy: 0.9967
Epoch 8/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0068 - accuracy: 0.9979
Epoch 9/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0075 - accuracy: 0.9975
Epoch 10/10
1875/1875 [==============================] - 10s 5ms/step - loss: 0.0067 - accuracy: 0.9977
313/313 [==============================] - 2s 5ms/step - loss: 0.0421 - accuracy: 0.9904
[0.0420583114027977, 0.9904000163078308]

View the code of HandWriteNumberDetection.py with nano:

nano ~/example/HandWriteNumberDetection.py

Press Ctrl + X to exit the editor

Common commands

Check the GPU status:

nvidia-smi
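
On a shared server it is often useful to read the GPU status from a script, for example to see how much memory other users occupy. The Python sketch below calls nvidia-smi with its CSV query options and parses the result. The sample line and its numbers are made up for illustration, and parse_gpu_rows is a hypothetical helper that assumes every memory field is a plain integer.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'name, used, total' lines (memory in MiB) into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name does no harm
        name, used, total = [field.strip() for field in line.rsplit(",", 2)]
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows


if shutil.which("nvidia-smi"):
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode == 0:
        print(parse_gpu_rows(result.stdout))
    else:
        print("nvidia-smi failed:", result.stderr.strip())
else:
    # Made-up sample line, used only when nvidia-smi is not installed
    print(parse_gpu_rows("NVIDIA A2, 10, 15356\n"))
```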

FAQ

Q: Using sudo inside Docker reports an error
A: Docker logs in as the root user by default, which already has the highest privileges, so sudo is not needed

About

This manual is maintained and updated by FaterYU
Last updated on 2022-8-26
