@ehartford
Created March 10, 2024 20:01
conversation with claude about deepspeed
ME:
[rank14]: Traceback (most recent call last):
[rank14]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank14]: File "<frozen runpy>", line 88, in _run_code
[rank14]: File "/scratch/axolotl/src/axolotl/cli/train.py", line 59, in <module>
[rank14]: fire.Fire(do_cli)
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/fire/core.py", line 141, in Fire
[rank14]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/fire/core.py", line 475, in _Fire
[rank14]: component, remaining_args = _CallAndUpdateTrace(
[rank14]: ^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
[rank14]: component = fn(*varargs, **kwargs)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/scratch/axolotl/src/axolotl/cli/train.py", line 35, in do_cli
[rank14]: return do_train(parsed_cfg, parsed_cli_args)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/scratch/axolotl/src/axolotl/cli/train.py", line 55, in do_train
[rank14]: return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/scratch/axolotl/src/axolotl/train.py", line 163, in train
[rank14]: trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/transformers/trainer.py", line 1624, in train
[rank14]: return inner_training_loop(
[rank14]: ^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/transformers/trainer.py", line 1776, in _inner_training_loop
[rank14]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/accelerator.py", line 1219, in prepare
[rank14]: result = self._prepare_deepspeed(*args)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/accelerator.py", line 1598, in _prepare_deepspeed
[rank14]: optimizer = DeepSpeedCPUAdam(optimizer.param_groups, **defaults)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/adam/cpu_adam.py", line 94, in __init__
[rank14]: self.ds_opt_adam = CPUAdamBuilder().load()
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 479, in load
[rank14]: return self.jit_load(verbose)
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 511, in jit_load
[rank14]: cxx_args = self.strip_empty_entries(self.cxx_args())
[rank14]: ^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 766, in cxx_args
[rank14]: CUDA_ENABLE = self.is_cuda_enable()
[rank14]: ^^^^^^^^^^^^^^^^^^^^^
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 370, in is_cuda_enable
[rank14]: assert_no_cuda_mismatch(self.name)
[rank14]: File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/deepspeed-0.14.1+535a908f-py3.12.egg/deepspeed/ops/op_builder/builder.py", line 85, in assert_no_cuda_mismatch
[rank14]: torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
[rank14]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank14]: AttributeError: 'NoneType' object has no attribute 'split'
W0309 21:06:11.623000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624920 closing signal SIGTERM
W0309 21:06:11.624000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624921 closing signal SIGTERM
W0309 21:06:11.625000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624923 closing signal SIGTERM
W0309 21:06:11.631000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624925 closing signal SIGTERM
W0309 21:06:11.635000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624927 closing signal SIGTERM
W0309 21:06:11.637000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624929 closing signal SIGTERM
W0309 21:06:11.638000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624930 closing signal SIGTERM
W0309 21:06:11.639000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624931 closing signal SIGTERM
W0309 21:06:11.641000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624932 closing signal SIGTERM
W0309 21:06:11.662000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624933 closing signal SIGTERM
W0309 21:06:11.664000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624934 closing signal SIGTERM
W0309 21:06:11.666000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624936 closing signal SIGTERM
W0309 21:06:11.667000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624940 closing signal SIGTERM
W0309 21:06:11.668000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624942 closing signal SIGTERM
W0309 21:06:11.670000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624945 closing signal SIGTERM
W0309 21:06:11.672000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624946 closing signal SIGTERM
W0309 21:06:11.675000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624947 closing signal SIGTERM
W0309 21:06:11.677000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624950 closing signal SIGTERM
W0309 21:06:11.679000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:901] Sending process 624951 closing signal SIGTERM
E0309 21:06:13.554000 140121663395008 torch/distributed/elastic/multiprocessing/api.py:876] failed (exitcode: 1) local_rank: 2 (pid: 624922) of binary: /home/ehartford/miniconda3/envs/axolotl/bin/python
Traceback (most recent call last):
File "/home/ehartford/miniconda3/envs/axolotl/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1014, in launch_command
multi_gpu_launcher(args)
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
axolotl.cli.train FAILED
------------------------------------------------------------
OPUS:
The error message indicates that there is a mismatch between the CUDA version expected by DeepSpeed and the actual CUDA version available in the PyTorch installation.
The key error is:
```
AttributeError: 'NoneType' object has no attribute 'split'
```
This happens when trying to access `torch.version.cuda`, which seems to be `None`. This suggests that PyTorch is not built with CUDA support in your current environment.
To resolve this issue, you have a few options:
1. Install a PyTorch version that is built with CUDA support. Make sure the CUDA version matches the one expected by DeepSpeed. You can do this by running:
```
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116
```
Replace `cu116` with the appropriate CUDA version for your system.
2. If you don't need CUDA or GPU acceleration, you can try running your script with CPU only. Modify your DeepSpeed configuration to use `cpu_offload` instead of GPU settings.
3. Ensure that your system has a compatible CUDA version installed and the necessary NVIDIA drivers are properly set up. You can verify this by running `nvcc --version` in the terminal.
4. Check if there are any conflicting CUDA versions in your environment. Make sure the CUDA paths are correctly set and pointing to the desired CUDA installation.
It's important to align the CUDA versions across PyTorch, DeepSpeed, and your system to avoid compatibility issues. Review your environment setup, PyTorch installation, and DeepSpeed configuration to ensure they are consistent and compatible with your hardware.
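Before reinstalling anything, a quick way to confirm which backend a given torch build ships with is to query its version attributes directly; on a ROCm build `torch.version.cuda` is expected to be `None` while `torch.version.hip` carries the ROCm version. A minimal check, assuming torch is importable in the active environment:
```
python -c "import torch; print('cuda:', torch.version.cuda)"
python -c "import torch; print('hip:', torch.version.hip)"
python -c "import torch; print('device visible:', torch.cuda.is_available())"
```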
ME:
I installed with pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
I have ROCm mi210
can you tell me the right way to install DeepSpeed from source to support my pytorch installation
OPUS:
Certainly! To install DeepSpeed from source with support for your PyTorch installation built with ROCm, follow these steps:
1. Clone the DeepSpeed repository:
```
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
```
2. Check out the latest release branch (e.g., `v0.8.0` or `v0.9.0`):
```
git checkout v0.8.0
```
3. Install the necessary build dependencies:
```
pip install -r requirements/requirements.txt
```
4. Set the `DS_BUILD_CPU_ADAM` environment variable to `1` to enable building the CPU Adam optimizer:
```
export DS_BUILD_CPU_ADAM=1
```
5. Set the `TORCH_CUDA_ARCH_LIST` environment variable to match your GPU architecture. For MI210, you can use:
```
export TORCH_CUDA_ARCH_LIST="gfx908"
```
6. Build and install DeepSpeed:
```
python setup.py install
```
If you encounter any issues during the build process, you can try adding the `--rocm` flag:
```
python setup.py install --rocm
```
7. Verify the installation by running:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
If the installation is successful, it should print the DeepSpeed version.
Here are a few additional notes:
- Make sure you have the latest version of ROCm installed and the necessary ROCm libraries are available in your environment.
- If you encounter any issues related to `hipify-python`, you can try installing it separately:
```
pip install hipify-python
```
- If you face any compilation errors related to specific ROCm versions, you might need to manually edit the `setup.py` file in the DeepSpeed repository to match your ROCm version.
- In some cases, you may need to set the `HCC_AMDGPU_TARGET` environment variable to match your GPU architecture:
```
export HCC_AMDGPU_TARGET=gfx908
```
By following these steps and ensuring compatibility between PyTorch, ROCm, and DeepSpeed, you should be able to install DeepSpeed from source with support for your PyTorch installation built with ROCm.
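Regarding the note above about verifying the ROCm install itself, the stock ROCm utilities can confirm the toolchain version and the visible GPU architecture. This is a hedged sketch assuming a standard installation with the ROCm binaries on the PATH; an MI210 should report `gfx90a` here:
```
hipconfig --version
rocminfo | grep -i gfx
```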
ME:
gfx90a tho
OPUS:
Apologies for the confusion. If your GPU is MI210, the correct GPU architecture is indeed `gfx90a`. Here's the updated set of instructions:
1. Clone the DeepSpeed repository:
```
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
```
2. Check out the latest release branch (e.g., `v0.8.0` or `v0.9.0`):
```
git checkout v0.8.0
```
3. Install the necessary build dependencies:
```
pip install -r requirements/requirements.txt
```
4. Set the `DS_BUILD_CPU_ADAM` environment variable to `1` to enable building the CPU Adam optimizer:
```
export DS_BUILD_CPU_ADAM=1
```
5. Set the `TORCH_CUDA_ARCH_LIST` environment variable to match your GPU architecture. For MI210 (gfx90a), use:
```
export TORCH_CUDA_ARCH_LIST="gfx90a"
```
6. Build and install DeepSpeed:
```
python setup.py install --rocm
```
7. Verify the installation by running:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
If the installation is successful, it should print the DeepSpeed version.
Additional notes:
- Make sure you have the latest version of ROCm installed and the necessary ROCm libraries are available in your environment.
- If you encounter any issues related to `hipify-python`, you can try installing it separately:
```
pip install hipify-python
```
- If you face any compilation errors related to specific ROCm versions, you might need to manually edit the `setup.py` file in the DeepSpeed repository to match your ROCm version.
- You may also need to set the `HCC_AMDGPU_TARGET` environment variable to match your GPU architecture:
```
export HCC_AMDGPU_TARGET=gfx90a
```
By setting the correct GPU architecture (`gfx90a`) for your MI210 GPU and following the installation steps with the `--rocm` flag, you should be able to install DeepSpeed from source with support for your PyTorch installation built with ROCm.
ME:
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ export DS_BUILD_CPU_ADAM=1
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ export TORCH_CUDA_ARCH_LIST="gfx90a"
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ python setup.py install --rocm
[2024-03-09 21:17:47,067] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-09 21:17:47,491] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DS_BUILD_OPS=0
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
/scratch/axolotl/DeepSpeed/csrc/adam/fused_adam_frontend.cpp -> /scratch/axolotl/DeepSpeed/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
/scratch/axolotl/DeepSpeed/csrc/includes/compat.h -> /scratch/axolotl/DeepSpeed/csrc/includes/compat.h [skipped, no changes]
/scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_apply.cuh -> /scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_apply_hip.cuh [skipped, already hipified]
/scratch/axolotl/DeepSpeed/csrc/includes/type_shim.h -> /scratch/axolotl/DeepSpeed/csrc/includes/type_shim_hip.h [skipped, already hipified]
/scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_adam.cu -> /scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_adam.hip [skipped, already hipified]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 1
Traceback (most recent call last):
File "/scratch/axolotl/DeepSpeed/setup.py", line 193, in <module>
ext_modules.append(builder.builder())
^^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 644, in builder
{'cxx': self.strip_empty_entries(self.cxx_args()), \
^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 766, in cxx_args
CUDA_ENABLE = self.is_cuda_enable()
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 370, in is_cuda_enable
assert_no_cuda_mismatch(self.name)
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 85, in assert_no_cuda_mismatch
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
OPUS:
The error message suggests that PyTorch is not built with CUDA support, even though you have installed it with ROCm. This is causing a mismatch when building DeepSpeed.
To resolve this, you can try the following steps:
1. Uninstall the existing PyTorch installation:
```
pip uninstall torch torchvision torchaudio
```
2. Install PyTorch with ROCm support using the official PyTorch ROCm wheels:
```
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.4
```
Make sure to use the appropriate ROCm version that matches your system (e.g., `rocm5.4` or `rocm5.2`).
3. Verify that PyTorch is installed correctly and recognizes ROCm:
```
python -c "import torch; print(torch.version.cuda)"
```
It should print the ROCm version instead of `None`.
4. Set the `HIP_PLATFORM` environment variable to `amd`:
```
export HIP_PLATFORM=amd
```
5. Navigate to the DeepSpeed directory and build DeepSpeed with ROCm support:
```
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 TORCH_CUDA_ARCH_LIST="gfx90a" python setup.py install --rocm
```
This command sets the necessary environment variables and builds DeepSpeed with ROCm support.
6. Verify the DeepSpeed installation:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
It should print the DeepSpeed version without any errors.
By following these steps and ensuring that PyTorch is installed with ROCm support, you should be able to build and install DeepSpeed with ROCm compatibility.
If you still encounter issues, you may need to investigate further and ensure that your ROCm installation is properly set up and compatible with your system and PyTorch version. Double-check the ROCm version and make sure all the necessary ROCm libraries and dependencies are installed correctly.
ME:
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ pip uninstall torch torchvision torchaudio
Found existing installation: torch 2.3.0.dev20240309+rocm6.0
Uninstalling torch-2.3.0.dev20240309+rocm6.0:
Would remove:
/scratch/miniconda3/envs/axolotl/bin/convert-caffe2-to-onnx
/scratch/miniconda3/envs/axolotl/bin/convert-onnx-to-caffe2
/scratch/miniconda3/envs/axolotl/bin/torchrun
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/functorch/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch-2.3.0.dev20240309+rocm6.0.dist-info/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchgen/*
Proceed (Y/n)? Y
Successfully uninstalled torch-2.3.0.dev20240309+rocm6.0
Found existing installation: torchvision 0.18.0.dev20240309+rocm6.0
Uninstalling torchvision-0.18.0.dev20240309+rocm6.0:
Would remove:
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision-0.18.0.dev20240309+rocm6.0.dist-info/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libamd_comgr.9b21a3b1.so.2
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libamdhip64.3696acf0.so.6
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libdrm.75edc4d3.so.2
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libdrm_amdgpu.fdca4e9c.so.1
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libelf.778b3e34.so.1
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libhsa-runtime64.2f88ea9b.so.1
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libjpeg.ceea7512.so.62
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libnuma.716911dd.so.1
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libpng16.7f72a3c5.so.16
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libtinfo.d434ca9c.so.5
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision.libs/libz.ed7ef817.so.1
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchvision/*
Proceed (Y/n)? Y
Successfully uninstalled torchvision-0.18.0.dev20240309+rocm6.0
Found existing installation: torchaudio 2.2.0.dev20240309+rocm6.0
Uninstalling torchaudio-2.2.0.dev20240309+rocm6.0:
Would remove:
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchaudio-2.2.0.dev20240309+rocm6.0.dist-info/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torchaudio/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/torio/*
Proceed (Y/n)? Y
Successfully uninstalled torchaudio-2.2.0.dev20240309+rocm6.0
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
Looking in indexes: https://download.pytorch.org/whl/nightly/rocm6.0
Collecting torch
Using cached https://download.pytorch.org/whl/nightly/rocm6.0/torch-2.3.0.dev20240309%2Brocm6.0-cp312-cp312-linux_x86_64.whl (2134.1 MB)
Collecting torchvision
Using cached https://download.pytorch.org/whl/nightly/rocm6.0/torchvision-0.18.0.dev20240309%2Brocm6.0-cp312-cp312-linux_x86_64.whl (65.9 MB)
Collecting torchaudio
Using cached https://download.pytorch.org/whl/nightly/rocm6.0/torchaudio-2.2.0.dev20240309%2Brocm6.0-cp312-cp312-linux_x86_64.whl (1.7 MB)
Requirement already satisfied: filelock in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (3.13.1)
Requirement already satisfied: typing-extensions>=4.8.0 in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (4.10.0)
Requirement already satisfied: sympy in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (1.12)
Requirement already satisfied: networkx in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (3.2.1)
Requirement already satisfied: jinja2 in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (3.1.3)
Requirement already satisfied: fsspec in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torch) (2024.2.0)
Requirement already satisfied: numpy in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torchvision) (1.26.4)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from torchvision) (10.2.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from jinja2->torch) (2.1.5)
Requirement already satisfied: mpmath>=0.19 in /home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages (from sympy->torch) (1.3.0)
Installing collected packages: torch, torchvision, torchaudio
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
axolotl 0.4.0 requires torch==2.2.1, but you have torch 2.3.0.dev20240309+rocm6.0 which is incompatible.
Successfully installed torch-2.3.0.dev20240309+rocm6.0 torchaudio-2.2.0.dev20240309+rocm6.0 torchvision-0.18.0.dev20240309+rocm6.0
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ python -c "import torch; print(torch.version.cuda)"
None
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
OPUS:
It seems like the PyTorch installation is still not detecting ROCm correctly. Let's try a few more steps:
1. Uninstall the existing PyTorch installation again:
```
pip uninstall torch torchvision torchaudio
```
2. Set the `HIP_PLATFORM` environment variable to `amd`:
```
export HIP_PLATFORM=amd
```
3. Install PyTorch with ROCm support using the official PyTorch ROCm wheels for your specific ROCm version. For ROCm 6.0, use the following command:
```
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm6.0
```
Note the use of `--extra-index-url` instead of `--index-url`.
4. Verify that PyTorch is installed correctly and recognizes ROCm:
```
python -c "import torch; print(torch.version.hip)"
```
It should print the ROCm version instead of `None`.
5. If the above steps still don't work, you can try installing PyTorch directly from the PyTorch repository:
```
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/rocm6.0
```
This command uses the nightly builds of PyTorch specifically for ROCm 6.0.
6. Once PyTorch is installed successfully with ROCm support, navigate to the DeepSpeed directory and build DeepSpeed with ROCm support:
```
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
```
Note the use of `TORCH_HIP_ARCH_LIST` instead of `TORCH_CUDA_ARCH_LIST`.
7. Verify the DeepSpeed installation:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
It should print the DeepSpeed version without any errors.
If you still encounter issues after following these steps, it's possible that there might be some compatibility issues between your specific ROCm version, PyTorch version, and DeepSpeed. In that case, you may need to investigate further, consult the PyTorch and DeepSpeed documentation for ROCm support, and possibly seek assistance from their respective communities or support channels.
ME:
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.2.0 jinja2-3.1.3 mpmath-1.4.0a0 networkx-3.2.1 numpy-1.26.4 pillow-10.2.0 sympy-1.12 torch-2.3.0.dev20240309+rocm6.0 torchaudio-2.2.0.dev20240309+rocm6.0 torchvision-0.18.0.dev20240309+rocm6.0 typing-extensions-4.10.0
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ python -c "import torch; print(torch.version.hip)"
6.0.32830-d62f6a171
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ vim ~/.bashrc
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ export HIP_PLATFORM=amd
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ export DS_BUILD_CPU_ADAM=1
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ export TORCH_HIP_ARCH_LIST="gfx90a"
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
Traceback (most recent call last):
File "/scratch/axolotl/DeepSpeed/setup.py", line 37, in <module>
from op_builder import get_default_compute_capabilities, OpBuilder
File "/scratch/axolotl/DeepSpeed/op_builder/__init__.py", line 18, in <module>
import deepspeed.ops.op_builder # noqa: F401 # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/deepspeed/__init__.py", line 25, in <module>
from . import ops
File "/scratch/axolotl/DeepSpeed/deepspeed/ops/__init__.py", line 6, in <module>
from . import adam
File "/scratch/axolotl/DeepSpeed/deepspeed/ops/adam/__init__.py", line 6, in <module>
from .cpu_adam import DeepSpeedCPUAdam
File "/scratch/axolotl/DeepSpeed/deepspeed/ops/adam/cpu_adam.py", line 8, in <module>
from deepspeed.utils import logger
File "/scratch/axolotl/DeepSpeed/deepspeed/utils/__init__.py", line 10, in <module>
from .groups import *
File "/scratch/axolotl/DeepSpeed/deepspeed/utils/groups.py", line 28, in <module>
from deepspeed import comm as dist
File "/scratch/axolotl/DeepSpeed/deepspeed/comm/__init__.py", line 7, in <module>
from .comm import *
File "/scratch/axolotl/DeepSpeed/deepspeed/comm/comm.py", line 31, in <module>
from deepspeed.comm.ccl import CCLBackend
File "/scratch/axolotl/DeepSpeed/deepspeed/comm/ccl.py", line 12, in <module>
from .torch import TorchBackend
File "/scratch/axolotl/DeepSpeed/deepspeed/comm/torch.py", line 100, in <module>
class TorchBackend(Backend):
File "/scratch/axolotl/DeepSpeed/deepspeed/comm/torch.py", line 124, in TorchBackend
@compiler.disable
^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/deepspeed/runtime/compiler.py", line 21, in disable
return torch.compiler.disable(func)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/compiler/__init__.py", line 96, in disable
import torch._dynamo
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/_dynamo/__init__.py", line 2, in <module>
from . import convert_frame, eval_frame, resume_execution
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 31, in <module>
from torch.fx.experimental.symbolic_shapes import (
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/fx/experimental/symbolic_shapes.py", line 63, in <module>
from torch.utils._sympy.functions import FloorDiv, Mod, IsNonOverlappingAndDenseIndicator
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/torch/utils/_sympy/functions.py", line 1, in <module>
import sympy
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/__init__.py", line 30, in <module>
from sympy.core.cache import lazy_function
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/core/__init__.py", line 9, in <module>
from .expr import Expr, AtomicExpr, UnevaluatedExpr
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/core/expr.py", line 4159, in <module>
from .mul import Mul
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/core/mul.py", line 2193, in <module>
from .numbers import Rational
File "/home/ehartford/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/core/numbers.py", line 4567, in <module>
_sympy_converter[type(mpmath.rational.mpq(1, 2))] = sympify_mpmath_mpq
^^^^^^^^^^^^^^^
AttributeError: module 'mpmath' has no attribute 'rational'
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$
OPUS:
It seems like there is a compatibility issue between the installed version of `sympy` and `mpmath`. The error message indicates that the `mpmath` module does not have the `rational` attribute, which is expected by `sympy`.
To resolve this issue, you can try the following steps:
1. Uninstall the current versions of `sympy` and `mpmath`:
```
pip uninstall sympy mpmath
```
2. Reinstall `sympy` and `mpmath` with compatible versions:
```
pip install 'sympy<1.12' 'mpmath<1.3.0'
```
This command will install `sympy` with a version lower than 1.12 and `mpmath` with a version lower than 1.3.0 to ensure compatibility.
3. After installing the compatible versions, try building DeepSpeed with ROCm support again:
```
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
```
4. If the build is successful, verify the DeepSpeed installation:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
It should print the DeepSpeed version without any errors.
If you still encounter issues, you may need to investigate further and check the compatibility of other dependencies with the specific versions of PyTorch, ROCm, and DeepSpeed you are using.
Additionally, make sure that your ROCm installation is properly set up and the necessary ROCm libraries are available in your environment. Double-check the ROCm version and ensure that it matches the PyTorch ROCm version you are using.
If the issue persists, it might be helpful to consult the DeepSpeed documentation, issue tracker, or community forums for further guidance and support specific to ROCm integration.
ME:
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ pip uninstall sympy mpmath
Found existing installation: sympy 1.12
Uninstalling sympy-1.12:
Would remove:
/scratch/miniconda3/envs/axolotl/bin/isympy
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/isympy.py
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy-1.12.dist-info/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/sympy/*
/scratch/miniconda3/envs/axolotl/share/man/man1/isympy.1
Proceed (Y/n)? Y
Successfully uninstalled sympy-1.12
Found existing installation: mpmath 1.4.0a0
Uninstalling mpmath-1.4.0a0:
Would remove:
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/mpmath-1.4.0a0.dist-info/*
/scratch/miniconda3/envs/axolotl/lib/python3.12/site-packages/mpmath/*
Proceed (Y/n)? Y
Successfully uninstalled mpmath-1.4.0a0
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ pip install 'sympy<1.12' 'mpmath<1.3.0'
Collecting sympy<1.12
Downloading sympy-1.11.1-py3-none-any.whl.metadata (12 kB)
Collecting mpmath<1.3.0
Downloading mpmath-1.2.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sympy-1.11.1-py3-none-any.whl (6.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 65.9 MB/s eta 0:00:00
Downloading mpmath-1.2.1-py3-none-any.whl (532 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 532.6/532.6 kB 110.0 MB/s eta 0:00:00
Installing collected packages: mpmath, sympy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
axolotl 0.4.0 requires torch==2.2.1, but you have torch 2.3.0.dev20240309+rocm6.0 which is incompatible.
Successfully installed mpmath-1.2.1 sympy-1.11.1
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
[2024-03-09 21:34:57,884] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-03-09 21:34:58,314] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DS_BUILD_OPS=0
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
/scratch/axolotl/DeepSpeed/csrc/adam/fused_adam_frontend.cpp -> /scratch/axolotl/DeepSpeed/csrc/adam/fused_adam_frontend.cpp [skipped, no changes]
/scratch/axolotl/DeepSpeed/csrc/includes/compat.h -> /scratch/axolotl/DeepSpeed/csrc/includes/compat.h [skipped, no changes]
/scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_apply.cuh -> /scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_apply_hip.cuh [skipped, already hipified]
/scratch/axolotl/DeepSpeed/csrc/includes/type_shim.h -> /scratch/axolotl/DeepSpeed/csrc/includes/type_shim_hip.h [skipped, already hipified]
/scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_adam.cu -> /scratch/axolotl/DeepSpeed/csrc/adam/multi_tensor_adam.hip [skipped, already hipified]
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
Total number of replaced kernel launches: 1
Traceback (most recent call last):
File "/scratch/axolotl/DeepSpeed/setup.py", line 193, in <module>
ext_modules.append(builder.builder())
^^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 644, in builder
{'cxx': self.strip_empty_entries(self.cxx_args()), \
^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 766, in cxx_args
CUDA_ENABLE = self.is_cuda_enable()
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 370, in is_cuda_enable
assert_no_cuda_mismatch(self.name)
File "/scratch/axolotl/DeepSpeed/op_builder/builder.py", line 85, in assert_no_cuda_mismatch
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'split'
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$ python -c "import torch; print(torch.version.hip)"
6.0.32830-d62f6a171
(axolotl) ehartford@supergenoa2:/local/axolotl/DeepSpeed$
OPUS:
It seems like the issue persists even after installing compatible versions of `sympy` and `mpmath`. The error message still indicates that `torch.version.cuda` is `None`, suggesting that PyTorch is not detecting the ROCm installation correctly.
At this point, I recommend the following steps:
1. Open the `setup.py` file in the DeepSpeed directory and locate the `assert_no_cuda_mismatch` function.
2. Modify the `assert_no_cuda_mismatch` function to handle the case when `torch.version.cuda` is `None`. You can replace the existing function with the following code:
```python
def assert_no_cuda_mismatch(name):
cuda_available = torch.cuda.is_available()
if not cuda_available and not torch.version.hip:
# Print a warning message indicating no CUDA or ROCm support
print(f"Warning: {name} requires CUDA or ROCm support, but neither is available.")
else:
# Check CUDA version if available
if cuda_available:
cuda_version = torch.version.cuda
if cuda_version is not None:
torch_cuda_version = ".".join(cuda_version.split('.')[:2])
installed_cuda_version = ".".join(get_cuda_version().split('.')[:2])
if torch_cuda_version != installed_cuda_version:
raise Exception(f'Torch CUDA version mismatch! Torch CUDA version {torch_cuda_version}, '
f'but installed CUDA version {installed_cuda_version}! '
f'Please check your PyTorch and CUDA installation.')
```
This modified function checks if CUDA or ROCm support is available. If neither is available, it prints a warning message. If CUDA is available, it proceeds with the CUDA version check as before.
3. Save the modified `setup.py` file.
4. Try building DeepSpeed with ROCm support again:
```
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
```
This should bypass the `AttributeError` related to `torch.version.cuda` being `None`.
5. If the build is successful, verify the DeepSpeed installation:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
It should print the DeepSpeed version without any errors.
If you still encounter issues after making these changes, it might be worth considering the following:
- Check if there are any known compatibility issues between the specific versions of PyTorch, ROCm, and DeepSpeed you are using. Consult the respective documentation and release notes for any reported issues or recommended version combinations.
- Seek further assistance from the DeepSpeed community or the PyTorch ROCm community. They may have more specific insights or workarounds for your particular setup.
- Consider using a different version of PyTorch or DeepSpeed that has better compatibility with your ROCm installation.
Remember to thoroughly test your setup and verify that DeepSpeed is functioning correctly with ROCm before proceeding with your training tasks.
ME:
there's no assert_no_cuda_mismatch
Here is the setup.py
```python
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
"""
DeepSpeed library

To build wheel on Windows:
1. Install pytorch, such as pytorch 1.12 + cuda 11.6.
2. Install visual cpp build tool.
3. Include cuda toolkit.
4. Launch cmd console with Administrator privilege for creating required symlink folders.

Create a new wheel via the following command:
build_win.bat

The wheel will be located at: dist/*.whl
"""

import os
import sys
import subprocess
from setuptools import setup, find_packages
from setuptools.command import egg_info
import time
import typing

torch_available = True
try:
    import torch
except ImportError:
    torch_available = False
    print('[WARNING] Unable to import torch, pre-compiling ops will be disabled. ' \
        'Please visit https://pytorch.org/ to see how to properly install torch on your system.')

from op_builder import get_default_compute_capabilities, OpBuilder
from op_builder.all_ops import ALL_OPS
from op_builder.builder import installed_cuda_version

# Fetch rocm state.
is_rocm_pytorch = OpBuilder.is_rocm_pytorch()
rocm_version = OpBuilder.installed_rocm_version()

RED_START = '\033[31m'
RED_END = '\033[0m'
ERROR = f"{RED_START} [ERROR] {RED_END}"


def abort(msg):
    print(f"{ERROR} {msg}")
    assert False, msg


def fetch_requirements(path):
    with open(path, 'r') as fd:
        return [r.strip() for r in fd.readlines()]


def is_env_set(key):
    """
    Checks if an environment variable is set and not "".
    """
    return bool(os.environ.get(key, None))


def get_env_if_set(key, default: typing.Any = ""):
    """
    Returns an environment variable if it is set and not "",
    otherwise returns a default value. In contrast, the fallback
    parameter of os.environ.get() is skipped if the variable is set to "".
    """
    return os.environ.get(key, None) or default


install_requires = fetch_requirements('requirements/requirements.txt')
extras_require = {
    '1bit': [],  # add cupy based on cuda/rocm version
    '1bit_mpi': fetch_requirements('requirements/requirements-1bit-mpi.txt'),
    'readthedocs': fetch_requirements('requirements/requirements-readthedocs.txt'),
    'dev': fetch_requirements('requirements/requirements-dev.txt'),
    'autotuning': fetch_requirements('requirements/requirements-autotuning.txt'),
    'autotuning_ml': fetch_requirements('requirements/requirements-autotuning-ml.txt'),
    'sparse_attn': fetch_requirements('requirements/requirements-sparse_attn.txt'),
    'sparse': fetch_requirements('requirements/requirements-sparse_pruning.txt'),
    'inf': fetch_requirements('requirements/requirements-inf.txt'),
    'sd': fetch_requirements('requirements/requirements-sd.txt'),
    'triton': fetch_requirements('requirements/requirements-triton.txt'),
}

# Add specific cupy version to both onebit extension variants.
if torch_available and torch.cuda.is_available():
    cupy = None
    if is_rocm_pytorch:
        rocm_major, rocm_minor = rocm_version
        # XXX cupy support for rocm 5 is not available yet.
        if rocm_major <= 4:
            cupy = f"cupy-rocm-{rocm_major}-{rocm_minor}"
    else:
        cuda_major_ver, cuda_minor_ver = installed_cuda_version()
        if (cuda_major_ver < 11) or ((cuda_major_ver == 11) and (cuda_minor_ver < 3)):
            cupy = f"cupy-cuda{cuda_major_ver}{cuda_minor_ver}"
        else:
            cupy = f"cupy-cuda{cuda_major_ver}x"
    if cupy:
        extras_require['1bit'].append(cupy)
        extras_require['1bit_mpi'].append(cupy)

# Make an [all] extra that installs all needed dependencies.
all_extras = set()
for extra in extras_require.items():
    for req in extra[1]:
        all_extras.add(req)
extras_require['all'] = list(all_extras)

cmdclass = {}

# For any pre-installed ops force disable ninja.
if torch_available:
    from accelerator import get_accelerator
    use_ninja = is_env_set("DS_ENABLE_NINJA")
    cmdclass['build_ext'] = get_accelerator().build_extension().with_options(use_ninja=use_ninja)

if torch_available:
    TORCH_MAJOR = torch.__version__.split('.')[0]
    TORCH_MINOR = torch.__version__.split('.')[1]
else:
    TORCH_MAJOR = "0"
    TORCH_MINOR = "0"

if torch_available and not torch.cuda.is_available():
    # Fix to allow docker builds, similar to https://github.com/NVIDIA/apex/issues/486.
    print("[WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only "
          "you can ignore this message. Adding compute capability for Pascal, Volta, and Turing "
          "(compute capabilities 6.0, 6.1, 6.2)")
    if not is_env_set("TORCH_CUDA_ARCH_LIST"):
        os.environ["TORCH_CUDA_ARCH_LIST"] = get_default_compute_capabilities()

ext_modules = []

# Default to pre-install kernels to false so we rely on JIT on Linux, opposite on Windows.
BUILD_OP_PLATFORM = 1 if sys.platform == "win32" else 0
BUILD_OP_DEFAULT = int(get_env_if_set('DS_BUILD_OPS', BUILD_OP_PLATFORM))
print(f"DS_BUILD_OPS={BUILD_OP_DEFAULT}")

if BUILD_OP_DEFAULT:
    assert torch_available, "Unable to pre-compile ops without torch installed. Please install torch before attempting to pre-compile ops."


def command_exists(cmd):
    if sys.platform == "win32":
        result = subprocess.Popen(f'{cmd}', stdout=subprocess.PIPE, shell=True)
        return result.wait() == 1
    else:
        result = subprocess.Popen(f'type {cmd}', stdout=subprocess.PIPE, shell=True)
        return result.wait() == 0


def op_envvar(op_name):
    assert hasattr(ALL_OPS[op_name], 'BUILD_VAR'), \
        f"{op_name} is missing BUILD_VAR field"
    return ALL_OPS[op_name].BUILD_VAR


def op_enabled(op_name):
    env_var = op_envvar(op_name)
    return int(get_env_if_set(env_var, BUILD_OP_DEFAULT))


compatible_ops = dict.fromkeys(ALL_OPS.keys(), False)
install_ops = dict.fromkeys(ALL_OPS.keys(), False)
for op_name, builder in ALL_OPS.items():
    op_compatible = builder.is_compatible()
    compatible_ops[op_name] = op_compatible
    compatible_ops["deepspeed_not_implemented"] = False

    # If op is requested but not available, throw an error.
    if op_enabled(op_name) and not op_compatible:
        env_var = op_envvar(op_name)
        if not is_env_set(env_var):
            builder.warning(f"One can disable {op_name} with {env_var}=0")
        abort(f"Unable to pre-compile {op_name}")

    # If op is compatible but install is not enabled (JIT mode).
    if is_rocm_pytorch and op_compatible and not op_enabled(op_name):
        builder.hipify_extension()

    # If op install enabled, add builder to extensions.
    if op_enabled(op_name) and op_compatible:
        assert torch_available, f"Unable to pre-compile {op_name}, please first install torch"
        install_ops[op_name] = op_enabled(op_name)
        ext_modules.append(builder.builder())

print(f'Install Ops={install_ops}')

# Write out version/git info.
git_hash_cmd = "git rev-parse --short HEAD"
git_branch_cmd = "git rev-parse --abbrev-ref HEAD"
if command_exists('git') and not is_env_set('DS_BUILD_STRING'):
    try:
        result = subprocess.check_output(git_hash_cmd, shell=True)
        git_hash = result.decode('utf-8').strip()
        result = subprocess.check_output(git_branch_cmd, shell=True)
        git_branch = result.decode('utf-8').strip()
    except subprocess.CalledProcessError:
        git_hash = "unknown"
        git_branch = "unknown"
else:
    git_hash = "unknown"
    git_branch = "unknown"


def create_dir_symlink(src, dest):
    if not os.path.islink(dest):
        if os.path.exists(dest):
            os.remove(dest)
        assert not os.path.exists(dest)
        os.symlink(src, dest)


if sys.platform == "win32":
    # This creates a symbolic links on Windows.
    # It needs Administrator privilege to create symlinks on Windows.
    create_dir_symlink('..\\..\\csrc', '.\\deepspeed\\ops\\csrc')
    create_dir_symlink('..\\..\\op_builder', '.\\deepspeed\\ops\\op_builder')
    create_dir_symlink('..\\accelerator', '.\\deepspeed\\accelerator')
    egg_info.manifest_maker.template = 'MANIFEST_win.in'

# Parse the DeepSpeed version string from version.txt.
version_str = open('version.txt', 'r').read().strip()

# Build specifiers like .devX can be added at install time. Otherwise, add the git hash.
# Example: DS_BUILD_STRING=".dev20201022" python setup.py sdist bdist_wheel.

# Building wheel for distribution, update version file.
if is_env_set('DS_BUILD_STRING'):
    # Build string env specified, probably building for distribution.
    with open('build.txt', 'w') as fd:
        fd.write(os.environ['DS_BUILD_STRING'])
    version_str += os.environ['DS_BUILD_STRING']
elif os.path.isfile('build.txt'):
    # build.txt exists, probably installing from distribution.
    with open('build.txt', 'r') as fd:
        version_str += fd.read().strip()
else:
    # None of the above, probably installing from source.
    version_str += f'+{git_hash}'

torch_version = ".".join([TORCH_MAJOR, TORCH_MINOR])
bf16_support = False
# Set cuda_version to 0.0 if cpu-only.
cuda_version = "0.0"
nccl_version = "0.0"
# Set hip_version to 0.0 if cpu-only.
hip_version = "0.0"
if torch_available and torch.version.cuda is not None:
    cuda_version = ".".join(torch.version.cuda.split('.')[:2])
    if sys.platform != "win32":
        if isinstance(torch.cuda.nccl.version(), int):
            # This will break if minor version > 9.
            nccl_version = ".".join(str(torch.cuda.nccl.version())[:2])
        else:
            nccl_version = ".".join(map(str, torch.cuda.nccl.version()[:2]))
    if hasattr(torch.cuda, 'is_bf16_supported') and torch.cuda.is_available():
        bf16_support = torch.cuda.is_bf16_supported()
if torch_available and hasattr(torch.version, 'hip') and torch.version.hip is not None:
    hip_version = ".".join(torch.version.hip.split('.')[:2])
torch_info = {
    "version": torch_version,
    "bf16_support": bf16_support,
    "cuda_version": cuda_version,
    "nccl_version": nccl_version,
    "hip_version": hip_version
}

print(f"version={version_str}, git_hash={git_hash}, git_branch={git_branch}")
with open('deepspeed/git_version_info_installed.py', 'w') as fd:
    fd.write(f"version='{version_str}'\n")
    fd.write(f"git_hash='{git_hash}'\n")
    fd.write(f"git_branch='{git_branch}'\n")
    fd.write(f"installed_ops={install_ops}\n")
    fd.write(f"compatible_ops={compatible_ops}\n")
    fd.write(f"torch_info={torch_info}\n")

print(f'install_requires={install_requires}')
print(f'compatible_ops={compatible_ops}')
print(f'ext_modules={ext_modules}')

# Parse README.md to make long_description for PyPI page.
thisdir = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(thisdir, 'README.md'), encoding='utf-8') as fin:
    readme_text = fin.read()

start_time = time.time()

setup(name='deepspeed',
      version=version_str,
      description='DeepSpeed library',
      long_description=readme_text,
      long_description_content_type='text/markdown',
      author='DeepSpeed Team',
      author_email='deepspeed-info@microsoft.com',
      url='http://deepspeed.ai',
      project_urls={
          'Documentation': 'https://deepspeed.readthedocs.io',
          'Source': 'https://github.com/microsoft/DeepSpeed',
      },
      install_requires=install_requires,
      extras_require=extras_require,
      packages=find_packages(include=['deepspeed', 'deepspeed.*']),
      include_package_data=True,
      scripts=[
          'bin/deepspeed', 'bin/deepspeed.pt', 'bin/ds', 'bin/ds_ssh', 'bin/ds_report', 'bin/ds_bench', 'bin/dsr',
          'bin/ds_elastic'
      ],
      classifiers=[
          'Programming Language :: Python :: 3.6', 'Programming Language :: Python :: 3.7',
          'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9',
          'Programming Language :: Python :: 3.10'
      ],
      license='Apache Software License 2.0',
      ext_modules=ext_modules,
      cmdclass=cmdclass)

end_time = time.time()
print(f'deepspeed build time = {end_time - start_time} secs')
```
OPUS:
You're right, I apologize for the confusion. The `assert_no_cuda_mismatch` function is not present in the `setup.py` file you provided. It seems to be part of the `op_builder/builder.py` file.
In this case, you can modify the `op_builder/builder.py` file instead. Locate the `assert_no_cuda_mismatch` function in that file and replace it with the following code:
```python
def assert_no_cuda_mismatch(name):
cuda_available = torch.cuda.is_available()
if not cuda_available and not torch.version.hip:
# Print a warning message indicating no CUDA or ROCm support
print(f"Warning: {name} requires CUDA or ROCm support, but neither is available.")
else:
# Check CUDA version if available
if cuda_available:
cuda_version = torch.version.cuda
if cuda_version is not None:
torch_cuda_version = ".".join(cuda_version.split('.')[:2])
installed_cuda_version = ".".join(get_cuda_version().split('.')[:2])
if torch_cuda_version != installed_cuda_version:
raise Exception(f'Torch CUDA version mismatch! Torch CUDA version {torch_cuda_version}, '
f'but installed CUDA version {installed_cuda_version}! '
f'Please check your PyTorch and CUDA installation.')
```
After making this modification, save the `op_builder/builder.py` file and then try building DeepSpeed with ROCm support again:
```
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
```
This should bypass the `AttributeError` related to `torch.version.cuda` being `None`.
If the build is successful, verify the DeepSpeed installation:
```
python -c "import deepspeed; print(deepspeed.__version__)"
```
It should print the DeepSpeed version without any errors.
Let me know if you have any further questions or if you encounter any other issues.
ME:
The current function looks like this:
```python
def assert_no_cuda_mismatch(name=""):
    cuda_major, cuda_minor = installed_cuda_version(name)
    sys_cuda_version = f'{cuda_major}.{cuda_minor}'
    torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
    # This is a show-stopping error, should probably not proceed past this
    if sys_cuda_version != torch_cuda_version:
        if (cuda_major in cuda_minor_mismatch_ok and sys_cuda_version in cuda_minor_mismatch_ok[cuda_major]
                and torch_cuda_version in cuda_minor_mismatch_ok[cuda_major]):
            print(f"Installed CUDA version {sys_cuda_version} does not match the "
                  f"version torch was compiled with {torch.version.cuda} "
                  "but since the APIs are compatible, accepting this combination")
            return True
        elif os.getenv("DS_SKIP_CUDA_CHECK", "0") == "1":
            print(
                f"{WARNING} DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
                f"version torch was compiled with {torch.version.cuda}."
                "Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior."
            )
            return True
        raise CUDAMismatchException(
            f">- DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
            f"version torch was compiled with {torch.version.cuda}, unable to compile "
            "cuda/cpp extensions without a matching cuda version.")
    return True
```
OPUS:
Thank you for providing the current `assert_no_cuda_mismatch` function. You can modify it to handle the case when `torch.version.cuda` is `None` and to check for ROCm support. Here's the updated version of the function:
```python
def assert_no_cuda_mismatch(name=""):
cuda_available = torch.cuda.is_available()
if not cuda_available and not torch.version.hip:
# Print a warning message indicating no CUDA or ROCm support
print(f"Warning: {name} requires CUDA or ROCm support, but neither is available.")
return False
else:
# Check CUDA version if available
if cuda_available:
cuda_major, cuda_minor = installed_cuda_version(name)
sys_cuda_version = f'{cuda_major}.{cuda_minor}'
torch_cuda_version = torch.version.cuda
if torch_cuda_version is not None:
torch_cuda_version = ".".join(torch_cuda_version.split('.')[:2])
if sys_cuda_version != torch_cuda_version:
if (cuda_major in cuda_minor_mismatch_ok and
sys_cuda_version in cuda_minor_mismatch_ok[cuda_major] and
torch_cuda_version in cuda_minor_mismatch_ok[cuda_major]):
print(f"Installed CUDA version {sys_cuda_version} does not match the "
f"version torch was compiled with {torch.version.cuda} "
"but since the APIs are compatible, accepting this combination")
return True
elif os.getenv("DS_SKIP_CUDA_CHECK", "0") == "1":
print(
f"{WARNING} DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
f"version torch was compiled with {torch.version.cuda}."
"Detected `DS_SKIP_CUDA_CHECK=1`: Allowing this combination of CUDA, but it may result in unexpected behavior."
)
return True
raise CUDAMismatchException(
f">- DeepSpeed Op Builder: Installed CUDA version {sys_cuda_version} does not match the "
f"version torch was compiled with {torch.version.cuda}, unable to compile "
"cuda/cpp extensions without a matching cuda version.")
else:
print(f"Warning: {name} requires CUDA support, but torch.version.cuda is None.")
return False
return True
```
In this updated version, the function first checks if CUDA or ROCm support is available using `torch.cuda.is_available()` and `torch.version.hip`. If neither is available, it prints a warning message and returns `False`.
If CUDA is available, it proceeds with the CUDA version check as before. However, it now also checks if `torch.version.cuda` is `None`. If it is `None`, it prints a warning message and returns `False`.
If ROCm support is available (i.e., `torch.version.hip` is not `None`), the function will return `True` and allow the build to proceed.
After making these modifications, save the `op_builder/builder.py` file and then try building DeepSpeed with ROCm support again:
```
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
```
This should handle the case when `torch.version.cuda` is `None` and allow the build to proceed if ROCm support is available.
Let me know if you have any further questions or if you encounter any other issues.
ME:
And this fixed it.
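For reference, a condensed recap of the sequence leading up to the fix, as it appears in the exchange above (hedged: the `--rocm` flag is reproduced exactly as typed earlier, and the decisive change was patching `assert_no_cuda_mismatch` in `op_builder/builder.py` to tolerate `torch.version.cuda` being `None` on a ROCm build):
```
# ROCm 6.0 nightly torch build used above
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
# pin sympy/mpmath to avoid the mpmath.rational AttributeError
pip install 'sympy<1.12' 'mpmath<1.3.0'
# patch assert_no_cuda_mismatch in op_builder/builder.py (see above), then build
cd DeepSpeed
DS_BUILD_CPU_ADAM=1 TORCH_HIP_ARCH_LIST="gfx90a" python setup.py install --rocm
python -c "import deepspeed; print(deepspeed.__version__)"
```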