Ashmal Vayani ashmalvayani

@ashmalvayani
ashmalvayani / deepspeed_cuda_version_mismatch.txt
Created May 20, 2024 11:27
Installed CUDA version 12.1 does not match the version torch was compiled with 11.8
Error: Installed CUDA version 12.1 does not match the version torch was compiled with 11.7
This is a DeepSpeed error. As the message reports, the CUDA toolkit installed on my system was 12.1, while my torch build was compiled with CUDA 11.7.
## Solution:
1) The easiest solution is to reinstall torch with a build compiled for the required CUDA version (the one installed on your system). Install commands for each build are listed here (https://pytorch.org/get-started/previous-versions/)
2) Alternatively, you can skip the version check by running this command: "export DS_SKIP_CUDA_CHECK=1"
However, DeepSpeed warns that running with mismatched versions may lead to errors.
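Before choosing between the two options, it can help to print both versions side by side. The following is a minimal sketch, not DeepSpeed's own check: the helper names are hypothetical, and it assumes `nvcc` is on your PATH (DeepSpeed may look at a different toolkit, for example the one under CUDA_HOME).

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output: str):
    """Return the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def installed_cuda_version():
    """Ask the nvcc found on PATH for its version; None if nvcc is unavailable."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version():
    """CUDA version torch was built with; None if torch is missing or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    print("installed CUDA toolkit:", installed_cuda_version())
    print("torch compiled with   :", torch_cuda_version())
```

If the two printed values differ, reinstalling torch (option 1) is the cleaner fix; option 2 only silences the check.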
ashmalvayani / work around.txt
Created April 18, 2024 18:26
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
In my case, simply setting CUDA_HOME worked for me:
export CUDA_HOME=/usr/local/cuda-11.7
ashmalvayani / solve.py
Created April 18, 2024 17:27
"triu_tril_cuda_template" not implemented for 'BFloat16'
## One of these two fixes worked for me; I am not sure which, so I am listing both. First, rebuilding flash-attn:
pip uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn
## Second, upgrading torch to 2.1.0 with this command:
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
ashmalvayani / multi_node.sh
Last active April 17, 2024 21:07
DDP multi-node training
This example is for FastChat training.
In my case I have 2 nodes (named 675d-2 and 675d-3).
SSH into each node, run 'ifconfig', and copy the IP address shown under 'eth0: inet',
e.g. 16.1.32.185 for node 1
and 16.1.32.186 for node 2.
#On the first node (the .185 one), execute the following command:
torchrun --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr "16.1.32.185" --master_port 8456 --local_addr "16.1.32.185" fastchat/train/train_lora.py \
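The note only shows the command for the first node. As a rough sketch of how the per-node commands relate, the hypothetical helper below builds one torchrun line per node from the flags used above: `--master_addr` stays fixed at the first node's IP, while `--node_rank` and `--local_addr` change per node. The training-script arguments that follow the trailing backslash are not shown in the note, so they are left out here as well.

```python
import shlex


def torchrun_commands(node_ips, nproc_per_node=8, master_port=8456,
                      script="fastchat/train/train_lora.py"):
    """Build one torchrun command per node (hypothetical helper).

    The first IP in node_ips is treated as the master node (rank 0).
    """
    master_addr = node_ips[0]
    commands = []
    for rank, ip in enumerate(node_ips):
        args = [
            "torchrun",
            "--nproc_per_node", str(nproc_per_node),
            "--nnodes", str(len(node_ips)),
            "--node_rank", str(rank),         # differs on every node
            "--master_addr", master_addr,     # identical on every node
            "--master_port", str(master_port),
            "--local_addr", ip,               # this node's own address
            script,                           # script arguments omitted
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for line in torchrun_commands(["16.1.32.185", "16.1.32.186"]):
        print(line)
```

Each printed line is meant to be run on the node whose IP appears after `--local_addr`, with the training-script arguments appended.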
ashmalvayani / fastchat_deploy.sh
Created March 22, 2024 11:16
Deploy FastChat on public using NGROK
#Run the following commands in separate screen sessions:
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.controller --host 0.0.0.0 --port 10000
CUDA_VISIBLE_DEVICES=1 python3 -m fastchat.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/MobiLlama-1B-Chat
CUDA_VISIBLE_DEVICES=0 python3 -m fastchat.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40002 --worker http://localhost:40002 --model-path MBZUAI/MobiLlama-05B-Chat
CUDA_VISIBLE_DEVICES=0 python -m fastchat.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload
ashmalvayani / readme.md
Created March 20, 2024 09:14
Pdf2Text - Sciparser: How to run GROBID

PDF Text Extractor:

I am using the sciparser repository: https://github.com/davendw49/sciparser. Clone both the pdf_parser and sciparser repositories and install all the requirements from requirements.txt.

To run this, we need to connect to a GROBID server. For this, install Docker, pull the image "lfoppiano/grobid:latest-develop", and run it. Make sure the container is active.

The server should then be reachable at http://10.10.10.10:8074.
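To confirm the server is actually up before running the parser, you can query GROBID's `/api/isalive` endpoint. This is a minimal sketch using only the standard library. The function names are hypothetical, and the base URL is the one from this note, so substitute your own host and port.

```python
import urllib.error
import urllib.request


def isalive_url(base_url: str) -> str:
    """Append GROBID's health-check path to the server's base URL."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if the server answers the health check with 'true'."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as resp:
            body = resp.read().decode().strip().lower()
            return resp.status == 200 and body == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("GROBID alive:", grobid_is_alive("http://10.10.10.10:8074"))
```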

ashmalvayani / gist:6af865617a278fe1cfd35da94784ab03
Last active March 25, 2024 00:17
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
PROBLEM:
import flash_attn_2_cuda as flash_attn_cuda
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
Currently flash_attn version: flash_attn-2.5.6
-------
Solution:
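Downgrade flash_attn to a version whose build matches the CUDA runtime in your environment (the import fails because this build expects the CUDA 12 runtime library, libcudart.so.12).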
ashmalvayani / download_hf_file_using_wget.txt
Created March 1, 2024 11:27
Download HuggingFace file using wget
Suppose this is the link you're downloading:
https://huggingface.co/datasets/CohereForAI/aya_collection/tree/main/translated_soda/train-00022-of-00038.parquet
Replace "tree" with "resolve" to get a link that wget can download directly:
https://huggingface.co/datasets/CohereForAI/aya_collection/resolve/main/translated_soda/train-00022-of-00038.parquet
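If you need to convert many links, the substitution can be scripted. The sketch below uses a hypothetical helper that rewrites only the first "tree" path segment, so a file or folder that happens to be named "tree" further down the path is left alone. Handling "blob" links as well is an assumption beyond this note.

```python
from urllib.parse import urlsplit, urlunsplit


def to_resolve_url(page_url: str) -> str:
    """Turn a Hugging Face page link into a direct-download link."""
    parts = urlsplit(page_url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        # "tree" is the case from the note; "blob" is an added assumption.
        if segment in ("tree", "blob"):
            segments[i] = "resolve"
            break  # rewrite only the first matching path segment
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    link = ("https://huggingface.co/datasets/CohereForAI/aya_collection/"
            "tree/main/translated_soda/train-00022-of-00038.parquet")
    print(to_resolve_url(link))
```

The printed URL can then be passed to wget as-is.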
ashmalvayani / error_solve.txt
Created February 6, 2024 12:53
'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
AttributeError: 'DeepSpeedCPUAdam' object has no attribute 'ds_opt_adam'
Exception ignored in: <function DeepSpeedCPUAdam.__del__ at 0x7fc47decc0d0>
Solution:
export CUDA_HOME=/usr/local/cuda-11.7
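A likely reason this helps is that DeepSpeed needs a CUDA toolkit to build its CPU Adam extension, and a wrong or unset CUDA_HOME can make that build fail. The sketch below is a quick sanity check with hypothetical helper names. It assumes a standard toolkit layout with `nvcc` under `bin/`, and only tests that the path exists, not that the version is the right one.

```python
import os


def nvcc_path(cuda_home: str) -> str:
    """Where nvcc is expected inside a standard CUDA toolkit directory."""
    return os.path.join(cuda_home, "bin", "nvcc")


def cuda_home_looks_valid(cuda_home) -> bool:
    """True if cuda_home is set and contains bin/nvcc."""
    return bool(cuda_home) and os.path.isfile(nvcc_path(cuda_home))


if __name__ == "__main__":
    home = os.environ.get("CUDA_HOME")
    print("CUDA_HOME  =", home)
    print("nvcc found :", cuda_home_looks_valid(home))
```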
ashmalvayani / flash_attn_2_error.txt
Last active July 15, 2024 13:10
flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
ImportError: /workspace/venv/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
solution:
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip3 install https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
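An undefined-symbol error like this usually suggests that the compiled extension and the installed torch do not match. The prebuilt wheel's filename records what it was built against, so it is worth reading those tags before installing. The parser below is a rough sketch: the regular expression is inferred from the single filename in this note and may not cover every release.

```python
import re

# Pattern inferred from the wheel name above; other releases may differ.
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[\d.]+)"
    r"\+cu(?P<cuda>\d+)"
    r"torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<cxx11abi>TRUE|FALSE)"
    r"-(?P<python>cp\d+)-"
)


def parse_flash_attn_wheel(name_or_url: str) -> dict:
    """Extract the build tags from a flash-attn wheel filename or URL."""
    filename = name_or_url.rsplit("/", 1)[-1]
    match = WHEEL_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised wheel name: {filename}")
    return match.groupdict()


if __name__ == "__main__":
    url = ("https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.0/"
           "flash_attn-2.3.0+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl")
    for tag, value in parse_flash_attn_wheel(url).items():
        print(f"{tag:9s} {value}")
```

Compare the `torch` and `python` tags with your installed torch version and Python interpreter; the wheel above targets torch 2.0 on CPython 3.10, which is why the first command pins torch==2.0.1.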