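To let several people share the compute resources of one server, this guide uses Docker for isolation on an Ubuntu host and sets up a CUDA environment and a tensorflow-gpu environment inside Docker containers.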
Host configuration
Update
    sudo apt update -y && sudo apt upgrade -y
    sudo apt install curl wget git -y
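NVIDIA driver installation
Run ubuntu-drivers devices.
If this fails with ubuntu-drivers: command not found, install the ubuntu-drivers-common package:
    sudo apt-get install ubuntu-drivers-common
Then run ubuntu-drivers devices again to list the recommended NVIDIA driver versions.
Output: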
    == /sys/devices/pci0000:57/0000:57:00.0/0000:58:00.0 ==
    modalias : pci:v000010DEd000025B6sv000010DEsd0000157Ebc03sc02i00
    vendor : NVIDIA Corporation
    driver : nvidia-driver-470 - distro non-free
    driver : nvidia-driver-510-server - distro non-free
    driver : nvidia-driver-470-server - distro non-free
    driver : nvidia-driver-510 - distro non-free
    driver : nvidia-driver-515 - distro non-free recommended
    driver : nvidia-driver-515-server - distro non-free
    driver : xserver-xorg-video-nouveau - distro free builtin
The entry marked recommended is the driver version suggested for this machine.
Install the recommended version:
    sudo apt install nvidia-driver-515
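Picking the recommended entry by eye works for one machine, but it can also be scripted. The following is a minimal sketch, not part of the original setup: `recommended_driver` is a hypothetical helper name, and it assumes the output layout shown above, where the package name is the third whitespace-separated field of the line tagged `recommended`.

```shell
# Hypothetical helper: read `ubuntu-drivers devices` output on stdin and
# print the package name from the first line tagged "recommended".
recommended_driver() {
    awk '/recommended/ { print $3; exit }'
}

# Usage on the host (commented out because it needs ubuntu-drivers and root):
#   ubuntu-drivers devices | recommended_driver
#   sudo apt install "$(ubuntu-drivers devices | recommended_driver)"
```

If no line carries the `recommended` tag, the helper prints nothing, so check for an empty result before passing it to `apt install`.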
Docker installation
Official website
    curl -sSL https://get.daocloud.io/docker | sh
Add the NVIDIA Container Toolkit repository
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Install the nvidia-docker2 package
    sudo apt-get update
    sudo apt-get install -y nvidia-docker2
Restart the Docker daemon
    sudo systemctl restart docker
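After the restart it can help to confirm that the NVIDIA runtime is registered with Docker. The sketch below assumes that nvidia-docker2 registered it in `/etc/docker/daemon.json`, which is what the package normally does. `has_nvidia_runtime` is a hypothetical helper name, and the check is a plain text match, not a full JSON parse.

```shell
# Hypothetical helper: succeed if the Docker daemon config mentions an
# "nvidia" runtime entry. Defaults to /etc/docker/daemon.json.
has_nvidia_runtime() {
    config_file="${1:-/etc/docker/daemon.json}"
    [ -f "$config_file" ] && grep -q '"nvidia"' "$config_file"
}

# Usage on the host:
#   has_nvidia_runtime && echo "nvidia runtime registered"
```

A stronger end-to-end check is to run `nvidia-smi` inside a CUDA container started with `--gpus all`, once an image has been pulled.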
Downloading the container image
tensorflow-gpu
Official documentation
Pull the latest tensorflow-gpu image:
    docker pull tensorflow/tensorflow:latest-gpu
Docker container setup
Creating a container
tensorflow-gpu
Create a container with the CUDA and TensorFlow environment, and map container port 22 to host port 8022:
    docker run --gpus all -p 8022:22 -it --rm tensorflow/tensorflow:latest-gpu /bin/bash
CUDA only
Create a Docker container with CUDA 11.7.0 on Ubuntu 20.04, and map host port 8022 to container port 22:
    docker run -p 8022:22 -it --gpus all nvidia/cuda:11.7.0-base-ubuntu20.04 /bin/bash
Ubuntu configuration inside the container
Install the necessary tools
    apt update
    apt upgrade
    apt install vim nano net-tools openssh-server openssh-client
Start the SSH service
Add PermitRootLogin yes to the sshd_config file so that the root user can connect over SSH:
    nano /etc/ssh/sshd_config
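When many containers need the same change, editing the file by hand in nano gets tedious. The following is a sketch of a scripted alternative. `enable_root_login` is a hypothetical helper name, and it assumes the usual sshd_config layout, where the directive is either absent or present as a (possibly commented-out) `PermitRootLogin ...` line.

```shell
# Hypothetical helper: make sure the given sshd_config contains
# "PermitRootLogin yes". Defaults to /etc/ssh/sshd_config.
enable_root_login() {
    config_file="${1:-/etc/ssh/sshd_config}"
    if grep -Eq '^[#[:space:]]*PermitRootLogin[[:space:]]' "$config_file"; then
        # Rewrite the existing directive, whether or not it is commented out.
        tmp_file="$(mktemp)"
        sed -E 's/^[#[:space:]]*PermitRootLogin[[:space:]].*/PermitRootLogin yes/' \
            "$config_file" > "$tmp_file"
        cat "$tmp_file" > "$config_file"
        rm -f "$tmp_file"
    else
        # No directive present: append one at the end of the file.
        printf '\nPermitRootLogin yes\n' >> "$config_file"
    fi
}

# Usage inside the container (as root):
#   enable_root_login
```

The change only takes effect after the SSH service has been restarted.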
Restart the SSH service
Set the SSH password
Check the container IP
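The commands for these three steps are not shown in this post, so the sketch below is an assumption about what they could look like in an Ubuntu 20.04 container, not the author's exact commands. It uses `service` instead of `systemctl` because containers usually run without systemd. `restart_ssh`, `set_root_password` and `first_ipv4` are hypothetical helper names, and `first_ipv4` assumes the `inet <address> netmask ...` layout printed by the net-tools `ifconfig` installed earlier.

```shell
# Restart the SSH daemon so that sshd_config changes take effect.
restart_ssh() {
    service ssh restart
}

# Set the password that root will use for SSH logins (prompts interactively).
set_root_password() {
    passwd root
}

# Read `ifconfig` output on stdin and print the first non-loopback IPv4 address.
first_ipv4() {
    awk '$1 == "inet" && $2 != "127.0.0.1" { print $2; exit }'
}

# Usage inside the container (as root):
#   restart_ssh
#   set_root_password
#   ifconfig | first_ipv4
```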
Environment setup
Anaconda
Download page
Download the Anaconda installer:
    wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
Install Anaconda:
    bash ./Anaconda3-2022.05-Linux-x86_64.sh
Handwritten digit recognition example
Requirements
tensorflow
numpy
cuda
cudnn
python
Internet access
Program
~/example/HandWriteNumberDetection.py
    import numpy as np
    import tensorflow as tf

    # Load MNIST; Keras downloads it on first use, so network access is required.
    mnist = tf.keras.datasets.mnist
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    print(train_images.shape, train_labels.shape)

    # Scale pixel values from [0, 255] to [0, 1].
    train_images = train_images / 255
    test_images = test_images / 255
    print(train_images.shape, test_images.shape)

    # Add the channel axis that Conv2D expects: (N, 28, 28) -> (N, 28, 28, 1).
    train_images = train_images[..., np.newaxis]
    test_images = test_images[..., np.newaxis]

    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])

    # Labels are integer class ids, so the sparse variant of the loss is used.
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_images, train_labels, epochs=10)

    # Prints [loss, accuracy] on the test set; no one-hot encoding is needed.
    print(model.evaluate(test_images, test_labels))
User manual
This manual uses the tensorflow-gpu container as its example.
Server configuration
    Linux 5.15.0-46-generic #49~20.04.1-Ubuntu x86_64 GNU/Linux
Connecting to the server
The server has no desktop environment. Connect to it over SSH by running the following in a terminal:
    ssh -p <port> <username>@<ip>
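where
<port> is the server's SSH port, e.g. 8022; confirm the actual value in the notice sent by the administrator
<username> is the user name, e.g. root; confirm the actual value in the notice sent by the administrator
<ip> is the server's IP address, e.g. 10.191.86.106; confirm the actual value in the notice sent by the administrator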
The rest of this tutorial assumes port 8022, IP 10.191.86.106 and user name root.
The login command is then:
    ssh -p 8022 root@10.191.86.106
After running it, the following prompt appears:
    root@10.191.86.106's password:
Enter the password (root). Nothing is echoed while you type; press Enter when done. A successful login looks roughly like this:
    Welcome to Ubuntu 20.04.4 LTS (GNU/Linux 5.15.0-46-generic x86_64)

     * Documentation:  https://help.ubuntu.com
     * Management:     https://landscape.canonical.com
     * Support:        https://ubuntu.com/advantage

    This system has been minimized by removing packages and content that are
    not required on a system that users do not log into.

    To restore this content, you can run the 'unminimize' command.
    Last login: Thu Aug 25 17:41:05 2022 from 10.252.128.6

    ________                               _______________
    ___  __/__________________________________  ____/__  /________      __
    __  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
    _  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
    /_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

    root@a8da98e88a5e:~
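To avoid retyping the port, user name and address on every login, the connection details can be stored in the SSH client configuration on your own machine. The following is a minimal sketch, assuming an OpenSSH client. `add_ssh_host` and the alias `gpu-server` are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: append a Host entry to an OpenSSH client config.
# Arguments: alias, ip, port, user, [config file, default ~/.ssh/config]
add_ssh_host() {
    alias_name="$1"
    host_ip="$2"
    port="$3"
    user="$4"
    config_file="${5:-$HOME/.ssh/config}"
    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<EOF

Host $alias_name
    HostName $host_ip
    Port $port
    User $user
EOF
}

# Usage on your own machine, with the example values from this manual:
#   add_ssh_host gpu-server 10.191.86.106 8022 root
#   ssh gpu-server
```

The helper only appends. Calling it twice with the same alias leaves two entries, and OpenSSH uses the first match.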
Run the update command:
    apt update -y && apt upgrade -y
Use ls to list the directory:
    root@a8da98e88a5e:~
    example
Use cd to enter the example directory.
Run the handwritten digit recognition example HandWriteNumberDetection.py from the example directory:
    python ~/example/HandWriteNumberDetection.py
The output looks roughly like this:
    2022-08-25 17:42:49.768873: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    (60000, 28, 28) (60000,)
    (60000, 28, 28) (10000, 28, 28)
    2022-08-25 17:42:51.608176: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2022-08-25 17:42:52.220798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13201 MB memory: -> device: 0, name: NVIDIA A2, pci bus id: 0000:58:00.0, compute capability: 8.6
    Epoch 1/10
    2022-08-25 17:42:54.256235: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8500
    2022-08-25 17:42:54.888831: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
    2022-08-25 17:42:55.096365: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
    1875/1875 [==============================] - 13s 5ms/step - loss: 0.1218 - accuracy: 0.9622
    Epoch 2/10
    curacy: 0.9874
    Epoch 3/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0267 - accuracy: 0.9917
    Epoch 4/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0208 - accuracy: 0.9936
    Epoch 5/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0155 - accuracy: 0.9949
    Epoch 6/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0119 - accuracy: 0.9962
    Epoch 7/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0103 - accuracy: 0.9967
    Epoch 8/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0068 - accuracy: 0.9979
    Epoch 9/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0075 - accuracy: 0.9975
    Epoch 10/10
    1875/1875 [==============================] - 10s 5ms/step - loss: 0.0067 - accuracy: 0.9977
    313/313 [==============================] - 2s 5ms/step - loss: 0.0421 - accuracy: 0.9904
    [0.0420583114027977, 0.9904000163078308]
Use nano to view the source of HandWriteNumberDetection.py:
    nano ~/example/HandWriteNumberDetection.py
Press Ctrl + X to exit the editor.
Common commands
Check GPU status
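The command for this entry is not shown in the post. On a system with the NVIDIA driver installed, GPU status is normally inspected with `nvidia-smi`, and the sketch below assumes that tool is available inside the container. `gpu_memory_percent` is a hypothetical helper that summarises the CSV output of an `nvidia-smi` query, with GPUs numbered in the order the lines arrive.

```shell
# Read "used, total" memory pairs (in MiB, one GPU per line) on stdin and
# print the share of memory in use for each GPU.
gpu_memory_percent() {
    awk -F', *' '{ printf "GPU %d: %d%% memory used\n", NR - 1, ($1 * 100) / $2 }'
}

# Usage (commented out because it needs an NVIDIA GPU and driver):
#   nvidia-smi                  # one-off status table
#   watch -n 1 nvidia-smi       # refresh the table every second
#   nvidia-smi --query-gpu=memory.used,memory.total \
#       --format=csv,noheader,nounits | gpu_memory_percent
```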
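FAQ
Q: Using sudo inside Docker reports an error.
A: Docker logs you in as the root user by default, which already has the highest privileges, so sudo is not needed to elevate privileges.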
About
This manual is maintained and updated by FaterYU.
Last updated on 2022-8-26.