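EAIS (Elastic Accelerated Computing Instances)
Configuration summary:
Connect to the ECS instance remotely
Install Docker
Install the packages Docker depends on: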
apt-get update
apt install -y software-properties-common
Install Docker:
curl -fsSL https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/gpg | apt-key add -
add-apt-repository \
"deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu \
$(lsb_release -cs) \
stable"
apt-get update
apt-get install -y docker-ce
Download the official PyTorch Docker image:
docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
The pull failed with an error:
root@iZuf6bhiwlbt30vjr9pe7xZ:~# docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Attempt 1:
Check the network: can the server reach the public Internet?
ping registry-1.docker.io
PING registry-1.docker.io (104.244.46.5) 56(84) bytes of data.
^C
--- registry-1.docker.io ping statistics ---
57 packets transmitted, 0 received, 100% packet loss, time 57351ms
Check port 443 with telnet:
telnet registry-1.docker.io 443
Trying 199.96.63.163...
Trying 2a03:2880:f111:83:face:b00c:0:25de...
telnet: Unable to connect to remote host: Network is unreachable
Conclusion: the instance cannot reach registry-1.docker.io
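ICMP is often filtered, so a failed ping on its own does not prove that HTTPS is blocked. A more direct probe can be sketched as follows, assuming curl is installed. The names classify_http_code and probe_registry are hypothetical helpers, not part of Docker or curl. A reachable Docker Hub endpoint normally answers an unauthenticated request to /v2/ with HTTP 401, while curl reports 000 when no HTTP response arrives at all.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Docker or curl.

# Map an HTTP status code (as printed by curl's %{http_code}) to a verdict.
# curl prints 000 when it never received an HTTP response.
classify_http_code() {
    case "$1" in
        000|"") echo "unreachable" ;;
        *)      echo "reachable" ;;
    esac
}

# Probe a registry endpoint over HTTPS with a short connect timeout.
# Defaults to the Docker Hub endpoint named in the error message above.
probe_registry() {
    url=${1:-https://registry-1.docker.io/v2/}
    code=$(curl -s -o /dev/null --connect-timeout 5 --max-time 10 \
                -w '%{http_code}' "$url") || true
    classify_http_code "$code"
}
```

Running probe_registry with no argument tests Docker Hub; passing another URL tests a mirror the same way.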
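Attempt 2: change the security group rules
Edit the outbound rules of the ECS instance's security group to allow all outbound traffic
Result: still no access
Check the firewall for any rule that blocks the connection: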
root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo ufw status
Status: inactive
root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo ufw disable
Firewall stopped and disabled on system startup
root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain DOCKER (1 references)
target prot opt source destination
Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target prot opt source destination
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-ISOLATION-STAGE-2 (1 references)
target prot opt source destination
DROP all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-USER (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
Result: the firewall and its rules are ruled out
Check whether the DNS configuration is the problem:
sudo nano /etc/resolv.conf
Delete nameserver 127.0.0.53
Keep:
nameserver 8.8.8.8
nameserver 8.8.4.4
options timeout:2 attempts:3 rotate single-request-reopen
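Then try the connection again
Result: still failing, so DNS is ruled out
Review the VPC configuration (route table, Internet egress)
No problems found
Create a Docker configuration file: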
sudo mkdir -p /etc/docker
sudo nano /etc/docker/daemon.json
{
"registry-mirrors": ["https://mirrors.ustc.edu.cn/docker-ce"]
}
Restart Docker:
sudo systemctl restart docker
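Try downloading the PyTorch image again
Result: still cannot download. Note that https://mirrors.ustc.edu.cn/docker-ce is the apt repository for the Docker packages, not a registry mirror, so it cannot serve image pulls.
Revisit the ACR (Alibaba Cloud Container Registry) accelerator setup:
Reset daemon.json: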
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json <<-'EOF'
{
"registry-mirrors": ["https://775vk65b.mirror.aliyuncs.com"]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
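One way to confirm that the daemon actually loaded the mirror after the restart is to read it back from docker info. The sketch below assumes the human-readable docker info output lists mirrors under a "Registry Mirrors:" heading; list_registry_mirrors is a hypothetical helper that filters that output.

```shell
#!/bin/sh
# Hypothetical helper: print the registry mirror URLs found in the text
# output of `docker info`, which is read from standard input.
# Typical use:  docker info 2>/dev/null | list_registry_mirrors
list_registry_mirrors() {
    awk '
        # The heading opens the block of mirror URLs.
        /^[ \t]*Registry Mirrors:/ { in_block = 1; next }
        # Indented URL lines inside the block are the mirrors.
        in_block && /^[ \t]*https?:\/\// { gsub(/^[ \t]+/, ""); print; next }
        # Any other line closes the block.
        in_block { in_block = 0 }
    '
}
```

An empty result suggests that daemon.json was not picked up, for example because of a JSON syntax error or because the restart did not happen.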
Result: still cannot download
Possibly the image tag does not exist
Try:
root@iZuf6bhiwlbt30vjr9pe7xZ:~# docker pull registry.cn-hangzhou.aliyuncs.com/stable-diffusion-docker/pytorch:latest
Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/stable-diffusion-docker/pytorch:latest not found: manifest unknown: manifest unknown
root@iZuf6bhiwlbt30vjr9pe7xZ:~#
Result: still not working
Try pulling the image from Docker Hub, then tagging and pushing it to a personal ACR repository:
docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
docker tag pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel registry.cn-hangzhou.aliyuncs.com/your-repository/pytorch:1.13.1-cuda11.6-cudnn8-devel
docker push registry.cn-hangzhou.aliyuncs.com/your-repository/pytorch:1.13.1-cuda11.6-cudnn8-devel
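Result: still failing
Try other registry mirrors:
Tsinghua University: did not work
USTC: did not work
NetEase: did not work
Alibaba Cloud: did not work
These mirrors all turned out to be out of date, so a current one was looked up at https://docker.aityp.com/ and the download finally succeeded
Retag the image (the image that was actually pulled is 2.5.1-cuda12.4-cudnn9-devel, not the 1.13.1 tag tried earlier):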
docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
Check the image:
docker images
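The retagging step can be generalized so that the target name does not have to be typed by hand. This is a minimal sketch: strip_mirror_prefix is a hypothetical helper, and the prefix swr.cn-north-4.myhuaweicloud.com/ddn-k8s is taken from the image name used above, on the assumption that the mirror keeps the original reference after that prefix.

```shell
#!/bin/sh
# Hypothetical helper: remove a mirror prefix from an image reference.
# If the reference does not start with the prefix, it is printed unchanged.
strip_mirror_prefix() {
    ref=$1
    prefix=$2
    case "$ref" in
        "$prefix"/*) printf '%s\n' "${ref#"$prefix"/}" ;;
        *)           printf '%s\n' "$ref" ;;
    esac
}

# Typical use (requires a running Docker daemon):
#   src=swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
#   docker tag "$src" "$(strip_mirror_prefix "$src" swr.cn-north-4.myhuaweicloud.com/ddn-k8s)"
```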
Start the Docker container:
docker run -it --rm docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
Install the EAIS packages:
Install the packages the EAIS environment depends on:
apt-get update
apt-get install -y wget curl iputils-ping
Install the eais-tool deb package (Ubuntu):
export VERSION=4.2.5
wget https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages/eais-tool_${VERSION}_amd64.deb
sudo dpkg -i eais-tool_${VERSION}_amd64.deb
source /etc/profile
Verify:
dpkg -l | grep eais-tool
Install eais-cuda (Ubuntu):
export VERSION=4.2.5
wget https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages/eais-cuda_${VERSION}_amd64.deb
sudo dpkg -i eais-cuda_${VERSION}_amd64.deb
Verify:
dpkg -l | grep eais-cuda
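The two installs above follow the same download, install, verify pattern, so they can be folded into one function. This is only a sketch: eais_pkg_url and install_eais_pkg are hypothetical names, and the URL layout is assumed to be the one used in the two wget commands above.

```shell
#!/bin/sh
# Hypothetical helpers; the URL pattern is copied from the wget commands above.

# Build the download URL for an EAIS package name and version.
eais_pkg_url() {
    name=$1
    version=$2
    printf 'https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages/%s_%s_amd64.deb\n' \
        "$name" "$version"
}

# Download, install, and verify one package. Must run as root.
install_eais_pkg() {
    name=$1
    version=$2
    deb="${name}_${version}_amd64.deb"
    wget -O "$deb" "$(eais_pkg_url "$name" "$version")" &&
        dpkg -i "$deb" &&
        dpkg -s "$name" >/dev/null   # non-zero exit if the package is not installed
}

# Typical use:
#   install_eais_pkg eais-tool 4.2.5
#   install_eais_pkg eais-cuda 4.2.5
```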
Deploy the SD (Stable Diffusion) environment:
Install the system packages that SD webUI needs at runtime:
apt-get install -y git libgl1 libglib2.0-dev
Download the SD source code:
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
Install the Python packages that SD webUI needs at runtime:
pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip3 install -r requirements_versions.txt -i https://mirrors.aliyun.com/pypi/simple
wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/stable-diffusion/CLIP.tar.gz
tar zxf CLIP.tar.gz
pip3 install CLIP/ -i https://mirrors.aliyun.com/pypi/simple
wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/stable-diffusion/open_clip.tar.gz
tar zxf open_clip.tar.gz
pip3 install open_clip/ -i https://mirrors.aliyun.com/pypi/simple
python3 launch.py --skip-torch-cuda-test --exit
pip3 install markupsafe==2.0.1 -i https://mirrors.aliyun.com/pypi/simple
Problems encountered while installing the Python dependencies:
pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
This command reported: ERROR: pytorch-lightning 2.4.0 has requirement PyYAML>=5.4, but you'll have pyyaml 5.3.1 which is incompatible.
Fix:
pip3 install PyYAML==5.4.1 -i https://mirrors.aliyun.com/pypi/simple
pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
pip3 install numpy==1.24.4 -i https://mirrors.aliyun.com/pypi/simple
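The conflict above comes down to a version comparison: 5.3.1 is lower than the required 5.4. A quick way to check such a constraint from the shell is sketched below, assuming a sort that supports -V (version sort, as in GNU coreutils). version_ge is a hypothetical helper; for a check of the whole environment, pip3 check reports installed packages whose requirements are not satisfied.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    # $1 >= $2 exactly when $2 sorts first (or both are equal).
    [ "$lowest" = "$2" ]
}

# Typical use:
#   version_ge 5.3.1 5.4 || echo "PyYAML is too old for pytorch-lightning 2.4.0"
```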
Along the way the Python version turned out to be too old, so a newer version was installed and set as the default:
which python3.9
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
update-alternatives --list python3
python3 --version
During installation, the requirements asked for a lower version of a package, but a higher version was already installed:
Attempting uninstall: setuptools
Found existing installation: setuptools 45.2.0
Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'setuptools'. No files were found to uninstall.
Attempted fix:
sudo apt install python3.9-venv
python3 -m venv myenv
source myenv/bin/activate
Then install the dependencies again inside the virtual environment
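The setuptools error came from pip trying to modify the system Python, so it helps to make sure pip only ever runs inside the virtual environment. A minimal sketch follows, relying on the VIRTUAL_ENV variable that the venv activate script sets. The names in_venv and pip_install_in_venv are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical helpers for illustration.

# Succeed only when a virtual environment is active.
# The venv `activate` script exports VIRTUAL_ENV.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Run `pip install` only inside a virtual environment, so that packages
# under /usr/lib/python3/dist-packages are never touched.
pip_install_in_venv() {
    if ! in_venv; then
        echo "refusing to run pip outside a virtual environment" >&2
        return 1
    fi
    python3 -m pip install "$@"
}

# Typical use, after `source myenv/bin/activate`:
#   pip_install_in_venv -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple
```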