Deploying Stable Diffusion on Alibaba Cloud ECS with the EAIS Software Packages

edwin99
2025-02-06 23:55

EAIS (Elastic Accelerated Computing Instances)

Configuration summary:

Connect to the ECS instance remotely

Install Docker

Install the packages Docker depends on:

apt-get update

apt install -y software-properties-common

Add the Docker repository and install Docker:

curl -fsSL https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu/gpg | apt-key add -

add-apt-repository \
"deb [arch=amd64] https://mirrors.ustc.edu.cn/docker-ce/linux/ubuntu \
$(lsb_release -cs) \
stable"

apt-get update

apt-get install -y docker-ce

 

Pull the official PyTorch Docker image:

docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

 

The pull fails with an error:

root@iZuf6bhiwlbt30vjr9pe7xZ:~# docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

 

Attempt 1:

Check the network: ping registry-1.docker.io to see whether the server can reach the public internet

PING registry-1.docker.io (104.244.46.5) 56(84) bytes of data.

^C

--- registry-1.docker.io ping statistics ---

57 packets transmitted, 0 received, 100% packet loss, time 57351ms

Use telnet to check port 443:

telnet registry-1.docker.io 443

Trying 199.96.63.163...

Trying 2a03:2880:f111:83:face:b00c:0:25de...

telnet: Unable to connect to remote host: Network is unreachable

Conclusion: the server cannot reach Docker Hub (registry-1.docker.io)
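Ping can fail simply because ICMP is filtered somewhere along the path, so a TCP check on port 443 is the more telling test. The snippet below is a minimal sketch of such a check that does not need telnet. `can_connect` is a hypothetical helper name, and it assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available.

```shell
# can_connect HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT opens within the timeout.
# Relies on bash's /dev/tcp redirection, so telnet is not required.
can_connect() {
  local host="$1" port="$2" secs="${3:-5}"
  # Inside the bash -c script, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# Example (hosts taken from this post; run on the ECS instance):
#   for h in registry-1.docker.io mirrors.ustc.edu.cn; do
#     if can_connect "$h" 443; then echo "$h: reachable"; else echo "$h: unreachable"; fi
#   done
```

If mirrors.ustc.edu.cn is reachable while registry-1.docker.io is not, that would suggest the problem is specific to Docker Hub rather than to all outbound traffic.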

 

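Attempt 2: modify the security group rules

Modify the outbound rules of the security group attached to the ECS instance to allow all outbound traffic

Result: still unreachable

Check the firewall for any rule that might block the connection: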

root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo ufw status

Status: inactive

root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo ufw disable

Firewall stopped and disabled on system startup

root@iZuf6bhiwlbt30vjr9pe7xZ:~# sudo iptables -L

Chain INPUT (policy ACCEPT)

target     prot opt source               destination

 

Chain FORWARD (policy ACCEPT)

target     prot opt source               destination

DOCKER-USER  all  --  anywhere             anywhere

DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere

ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED

DOCKER     all  --  anywhere             anywhere

ACCEPT     all  --  anywhere             anywhere

ACCEPT     all  --  anywhere             anywhere

 

Chain OUTPUT (policy ACCEPT)

target     prot opt source               destination

 

Chain DOCKER (1 references)

target     prot opt source               destination

 

Chain DOCKER-ISOLATION-STAGE-1 (1 references)

target     prot opt source               destination

DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere

RETURN     all  --  anywhere             anywhere

 

Chain DOCKER-ISOLATION-STAGE-2 (1 references)

target     prot opt source               destination

DROP       all  --  anywhere             anywhere

RETURN     all  --  anywhere             anywhere

 

Chain DOCKER-USER (1 references)

target     prot opt source               destination

RETURN     all  --  anywhere             anywhere

Result: the firewall and its rules are ruled out

 

Check whether the DNS configuration is the problem:

sudo nano /etc/resolv.conf

Delete the line nameserver 127.0.0.53

and keep:

nameserver 8.8.8.8

nameserver 8.8.4.4

options timeout:2 attempts:3 rotate single-request-reopen

 

Then try to reach the registry again

Result: still failing, so DNS is ruled out

 

Review the VPC configuration (routing, network egress)

Nothing wrong there

 

Create the Docker daemon configuration file:

sudo mkdir -p /etc/docker

sudo nano /etc/docker/daemon.json

{

  "registry-mirrors": ["https://mirrors.ustc.edu.cn/docker-ce"]

}

 

Restart Docker:

sudo systemctl restart docker

Try pulling the PyTorch image again

Result: the pull still fails
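Before restarting Docker with a new registry-mirrors entry, it may help to check whether the address actually speaks the registry protocol. A Docker Registry v2 endpoint answers GET /v2/ with 200, or with 401 when it requires authentication. The address used above sits under the same docker-ce path as the apt repository configured earlier, so it may be a package mirror rather than a registry. The sketch below assumes curl is installed; `is_registry_status` and `probe_mirror` are hypothetical helper names.

```shell
# is_registry_status CODE
# Succeeds for the HTTP codes a Registry v2 endpoint returns on GET /v2/.
is_registry_status() {
  case "$1" in
    200|401) return 0 ;;
    *)       return 1 ;;
  esac
}

# probe_mirror URL
# Prints the HTTP status of URL/v2/ (curl reports 000 if the request did not
# complete) and succeeds only if the status looks like a registry.
probe_mirror() {
  local url="${1%/}" code
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url/v2/" || true)"
  echo "$url -> $code"
  is_registry_status "$code"
}

# Example:
#   probe_mirror https://mirrors.ustc.edu.cn/docker-ce || echo "not a registry mirror"
```

A candidate that fails this probe is not worth putting into daemon.json.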

 

Revisit ACR (Alibaba Cloud Container Registry):

Rewrite daemon.json:

sudo mkdir -p /etc/docker

sudo tee /etc/docker/daemon.json <<-'EOF'

{

"registry-mirrors": ["https://775vk65b.mirror.aliyuncs.com"]

}

EOF

sudo systemctl daemon-reload

sudo systemctl restart docker

Result: the pull still fails

Perhaps the image tag does not exist

Try:

root@iZuf6bhiwlbt30vjr9pe7xZ:~# docker pull registry.cn-hangzhou.aliyuncs.com/stable-diffusion-docker/pytorch:latest

Error response from daemon: manifest for registry.cn-hangzhou.aliyuncs.com/stable-diffusion-docker/pytorch:latest not found: manifest unknown: manifest unknown

root@iZuf6bhiwlbt30vjr9pe7xZ:~#

Result: still no luck

 

Try pulling the image from Docker Hub, then tagging it and pushing it to an ACR repository:

docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel

docker tag pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel registry.cn-hangzhou.aliyuncs.com/your-repository/pytorch:1.13.1-cuda11.6-cudnn8-devel

docker push registry.cn-hangzhou.aliyuncs.com/your-repository/pytorch:1.13.1-cuda11.6-cudnn8-devel

Result: still fails

 

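Try other registry mirrors:

Tsinghua University: did not work

USTC: did not work

NetEase: did not work

Alibaba Cloud: did not work

These mirrors all turned out to be out of date, so I looked for a current one at https://docker.aityp.com/ and the download finally succeeded

Retag the image: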

docker tag swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
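The pull and retag steps can be sketched as a pair of small functions. The prefix is copied from the docker tag command above. The sketch assumes the mirror hosts other Docker Hub images under the same layout, which this post only confirms for the PyTorch image. `mirror_ref` and `pull_via_mirror` are hypothetical helper names.

```shell
# Mirror prefix copied from the docker tag command in this post.
MIRROR_PREFIX="swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io"

# mirror_ref IMAGE
# Prints the mirrored name for a Docker Hub image (with or without docker.io/).
mirror_ref() {
  local image="${1#docker.io/}"
  printf '%s/%s\n' "$MIRROR_PREFIX" "$image"
}

# pull_via_mirror IMAGE
# Pulls from the mirror, then retags to the Docker Hub name so later
# commands can refer to the image by its usual name.
pull_via_mirror() {
  local image="${1#docker.io/}" src
  src="$(mirror_ref "$image")"
  docker pull "$src" && docker tag "$src" "docker.io/$image"
}

# Example:
#   pull_via_mirror pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel
```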

 

Check the image: docker images

 

Start the Docker container:

docker run -it --rm docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel

 

Install the EAIS software packages:

Install the packages the EAIS environment depends on:

apt-get update

apt-get install -y wget curl iputils-ping

Install the eais-tool deb package (Ubuntu):

export VERSION=4.2.5

wget https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages/eais-tool_${VERSION}_amd64.deb

sudo dpkg -i eais-tool_${VERSION}_amd64.deb

source /etc/profile

Verify:

dpkg -l | grep eais-tool

 

Install eais-cuda (Ubuntu):

export VERSION=4.2.5

wget https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages/eais-cuda_${VERSION}_amd64.deb

sudo dpkg -i eais-cuda_${VERSION}_amd64.deb

Verify:

dpkg -l | grep eais-cuda
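Since the two packages follow the same download, install, verify pattern, the steps can be sketched as one loop. The base URL and version come from the commands above. `eais_deb_name` and `install_eais` are hypothetical helper names, and the sketch assumes it runs as root inside the container, so there is no sudo.

```shell
EAIS_VERSION="${EAIS_VERSION:-4.2.5}"   # version used in this post
EAIS_BASE_URL="https://eais-rel-pub.oss-cn-beijing.aliyuncs.com/packages"

# eais_deb_name PACKAGE
# Prints the deb file name for PACKAGE at EAIS_VERSION.
eais_deb_name() {
  printf '%s_%s_amd64.deb\n' "$1" "$EAIS_VERSION"
}

# install_eais
# Downloads and installs both packages, stopping at the first failure.
install_eais() {
  local pkg deb
  for pkg in eais-tool eais-cuda; do
    deb="$(eais_deb_name "$pkg")"
    wget -q "$EAIS_BASE_URL/$deb" || return 1
    dpkg -i "$deb" || return 1
    dpkg -s "$pkg" >/dev/null || return 1   # fails if the package is not installed
  done
}

# Example:
#   install_eais && source /etc/profile
```

Stopping at the first failure keeps a broken download from being followed by a confusing dpkg error.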

 

Deploy the SD environment:

Install the system packages SD WebUI needs to run:

apt-get install -y git libgl1 libglib2.0-dev

Download the SD source code:

git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui

cd stable-diffusion-webui

Install the Python packages SD WebUI needs to run:

pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple

pip3 install -r requirements_versions.txt -i https://mirrors.aliyun.com/pypi/simple

 

wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/stable-diffusion/CLIP.tar.gz

tar zxf CLIP.tar.gz

pip3 install CLIP/ -i https://mirrors.aliyun.com/pypi/simple

 

wget https://aiacc-inference-public.oss-cn-beijing.aliyuncs.com/stable-diffusion/open_clip.tar.gz

tar zxf open_clip.tar.gz

pip3 install open_clip/ -i https://mirrors.aliyun.com/pypi/simple

 

python3 launch.py --skip-torch-cuda-test --exit

pip3 install markupsafe==2.0.1 -i https://mirrors.aliyun.com/pypi/simple

 

 

 

Problems encountered while installing the Python dependencies:

pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple

Running this command reports:

ERROR: pytorch-lightning 2.4.0 has requirement PyYAML>=5.4, but you'll have pyyaml 5.3.1 which is incompatible.

Fix:

pip3 install PyYAML==5.4.1 -i https://mirrors.aliyun.com/pypi/simple

pip3 install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple

pip3 install numpy==1.24.4 -i https://mirrors.aliyun.com/pypi/simple

 

Along the way the Python version turned out to be too old, so I installed a newer one and made it the default:

which python3.9

sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1

update-alternatives --list python3

python3 --version

 

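During installation, the requirements called for a different setuptools version from the one the system already had, and pip could not uninstall the existing one:

Attempting uninstall: setuptools

Found existing installation: setuptools 45.2.0

Not uninstalling setuptools at /usr/lib/python3/dist-packages, outside environment /usr

Can't uninstall 'setuptools'. No files were found to uninstall.

Attempted fix: install into a virtual environment instead of the system Python: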

sudo apt install python3.9-venv

python3 -m venv myenv

source myenv/bin/activate

Then install the dependencies again inside the virtual environment
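Inside a virtual environment pip owns every package it installs, so it does not need to touch /usr/lib/python3/dist-packages. The sketch below guards against accidentally installing into the system Python again. It relies on the activate script exporting VIRTUAL_ENV. `in_venv` and `install_requirements` are hypothetical helper names.

```shell
# in_venv
# Succeeds when a virtual environment is active
# (the activate script exports VIRTUAL_ENV).
in_venv() {
  [ -n "${VIRTUAL_ENV:-}" ]
}

# install_requirements FILE
# Installs FILE with pip, but refuses to run outside a virtual environment.
install_requirements() {
  if ! in_venv; then
    echo "no virtual environment active; run: source myenv/bin/activate" >&2
    return 1
  fi
  python3 -m pip install -r "$1" -i https://mirrors.aliyun.com/pypi/simple
}

# Example:
#   source myenv/bin/activate
#   install_requirements requirements.txt
#   install_requirements requirements_versions.txt
```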

 

 

 

 

 

 

 

 

 

 
