@savanovich
Last active February 19, 2024 10:02
GPU.md

Drain mode = the GPU no longer accepts new incoming requests

lspci -k | grep -A 2 -E '(3D|VGA)'
# Example output:
# 00:08.0 VGA compatible controller: NVIDIA Corporation GR666GL [GeForce GX 666] (rev a0)
#         Kernel driver in use: nvidia
#         Kernel modules: nvidiafb, nouveau, nvidia
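The `lspci` output above shows a short slot such as `00:08.0`, while the `nvidia-smi` commands below take a domain-qualified ID such as `0000:xx:00.0`. A minimal sketch of bridging the two, assuming the GPU sits in PCI domain `0000` (common on single-domain machines, but not guaranteed); `normalize_pci_id` is a hypothetical helper name, not part of any tool:

```shell
#!/usr/bin/env bash
# Sketch: turn an lspci slot ("00:08.0") into a domain-qualified ID ("0000:00:08.0").
# normalize_pci_id is a hypothetical helper; domain 0000 is an assumption.
normalize_pci_id() {
    local slot="$1"
    case "$slot" in
        ????:??:??.?) printf '%s\n' "$slot" ;;        # already has a domain prefix
        ??:??.?)      printf '0000:%s\n' "$slot" ;;   # assume PCI domain 0000
        *)            printf 'unrecognized PCI slot: %s\n' "$slot" >&2
                      return 1 ;;
    esac
}
```

Running `lspci -D` instead prints the domain directly, which avoids the assumption altogether.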

# set persistence mode off
nvidia-smi --id 0000:xx:00.0 --persistence-mode 0

# enable drain mode (use --modify 0 to disable it again)
nvidia-smi drain --pciid 0000:xx:00.0 --modify 1

# set persistence mode on
nvidia-smi --persistence-mode 1
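The three commands above can be wrapped in one function so the sequence is repeatable. This is only a sketch that mirrors the order used in these notes; `run`, `drain_gpu`, and the `DRY_RUN` variable are hypothetical names introduced for illustration. With `DRY_RUN=1` the commands are printed instead of executed, which is a cheap way to check the PCI ID before touching the device:

```shell
#!/usr/bin/env bash
# Sketch of the drain sequence from the notes above.
# run, drain_gpu and DRY_RUN are hypothetical names, not nvidia-smi features.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# drain_gpu PCI_ID STATE   (STATE: 1 = enable drain mode, 0 = disable it)
drain_gpu() {
    local pci_id="$1"
    local state="$2"
    # 1. Persistence mode off on the target GPU.
    run nvidia-smi --id "$pci_id" --persistence-mode 0 || return 1
    # 2. Change the drain state.
    run nvidia-smi drain --pciid "$pci_id" --modify "$state" || return 1
    # 3. Persistence mode back on (no --id, so all GPUs, as in the notes).
    run nvidia-smi --persistence-mode 1
}
```

For example, `DRY_RUN=1 drain_gpu 0000:xx:00.0 1` prints the three commands; without `DRY_RUN` they are executed and typically need root.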

Set power limit, target temperature, and fan speed

# display detailed info
nvidia-smi -q

# enable persistence mode = keep the GPU driver loaded
nvidia-smi -pm 1

# Limit power usage
sudo nvidia-smi -i 0 -pl 300  # set 300W limit on gpu 0

# Set target temperature
sudo nvidia-smi -i 1 -gtt 78  # set 78 °C target temperature on gpu 1

# Set fan speed
nvidia-settings -a "[gpu:0]/GPUFanControlState=1"  # enable manual fan control
nvidia-settings -a "[fan:0]/GPUTargetFanSpeed=100" # 100% fan speed
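A power limit outside the range the board supports is rejected by the driver, so it can help to check the range first. The sketch below assumes the `power.min_limit` and `power.max_limit` fields of `nvidia-smi --query-gpu` are available on the card; `within_power_range` and `set_power_limit` are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sketch: refuse a power limit that falls outside the board's allowed range.
# within_power_range and set_power_limit are hypothetical helper names.

# within_power_range LIMIT MIN MAX: exit status 0 if MIN <= LIMIT <= MAX.
within_power_range() {
    awk -v l="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { exit !((l + 0 >= lo + 0) && (l + 0 <= hi + 0)) }'
}

# set_power_limit GPU_INDEX WATTS: query the allowed range, then apply the limit.
set_power_limit() {
    local gpu="$1" watts="$2" limits
    # Prints "<min>, <max>" in watts for the selected GPU.
    limits="$(nvidia-smi -i "$gpu" \
        --query-gpu=power.min_limit,power.max_limit \
        --format=csv,noheader,nounits)" || return 1
    if within_power_range "$watts" "${limits%%,*}" "${limits##*,}"; then
        sudo nvidia-smi -i "$gpu" -pl "$watts"
    else
        printf 'refusing %s W: allowed range is %s\n' "$watts" "$limits" >&2
        return 1
    fi
}
```

With these definitions, `set_power_limit 0 300` does the same as the `-pl 300` command above, but only after the range check passes.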
  • MIG (Multi-Instance GPU) - the GPU can be divided into smaller, isolated instances. Each GPU instance has its own dedicated GPU cores, memory, and cache.
  • ECC mode = Error Correcting Code mode. ECC is a mechanism that detects and corrects memory errors that occur while the GPU is operating.
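Both modes can be inspected per GPU with `nvidia-smi --query-gpu`. The sketch below assumes the `mig.mode.current` and `ecc.mode.current` query fields are reported by the installed driver (GPUs without support usually show `[N/A]`); `describe_modes` is a hypothetical helper that only reformats the CSV it reads:

```shell
#!/usr/bin/env bash
# Sketch: summarize MIG and ECC state per GPU.
# describe_modes is a hypothetical helper; it reads CSV lines of the form
# "<index>, <MIG mode>, <ECC mode>" on stdin and prints one line per GPU.
describe_modes() {
    local idx mig ecc
    while IFS=, read -r idx mig ecc; do
        # Fields after the first carry a leading space in nvidia-smi's CSV.
        printf 'GPU %s: MIG=%s ECC=%s\n' "$idx" "${mig# }" "${ecc# }"
    done
}

# Typical use (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=index,mig.mode.current,ecc.mode.current \
#       --format=csv,noheader | describe_modes
#
# Toggling the modes (root required; changes may stay pending until a
# GPU reset or reboot):
#   sudo nvidia-smi -i 0 -mig 1   # enable MIG on gpu 0
#   sudo nvidia-smi -i 0 -e 1     # enable ECC on gpu 0
```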