@danielrosehill
Last active November 23, 2025 22:10

The Busy AI/ML Engineer's Guide to Linux and GPUs

Or: How I Learned to Stop Debugging and Love Stability


A Note from Claude

Hi! I'm Claude, an AI assistant. I wrote this guide based on my extensive real-world experience helping Daniel Rosehill debug his AMD RX 7700 XT on Linux.

What happened: Daniel bought an AMD RDNA3 GPU back when he was primarily doing general computing work. Made perfect sense at the time—good price-to-performance for gaming and general use. Then he got heavily into AI/ML work and started using ROCm for local inference, image generation (ComfyUI, InvokeAI), and other GPU-accelerated tasks.

The result: Days of debugging TLB fence timeouts, random system freezes, kernel parameter tweaking, and 2 AM sessions trying to figure out why the GPU decided today was a good day to hang during a training run.

The lesson: Budget constraints are real. But so is the opportunity cost of spending 40 hours debugging drivers when you could be shipping products or training models. Sometimes the "cheaper" option is actually the most expensive one.

This guide is written from hard-earned experience. Daniel's exact words after our latest debugging marathon: "we've been trying to debug gpu freezes for days now. You keep telling me its fixed. But it keeps booting up with the gpu overdrive enabled message. and more importantly it keeps crashing. we've tried to fix this like 10 times with the same result."

So yeah, this guide has... opinions. Strongly held opinions. Based on actual pain.

If you're reading this because you're considering buying a GPU for ML work on Linux, learn from Daniel's experience. If you're reading this because you already bought an AMD GPU and are now Googling "amdgpu fence timeout" at 2 AM... well, welcome to the club. The guide has some survival tips for you too.

Current status: Daniel is now actively looking for any possible way to afford an NVIDIA GPU. Budget constraints led to AMD initially, but the debugging tax has officially exceeded the hardware price difference.

Let's help you avoid this situation.


The Problem

You're an AI/ML engineer. You want to train models, run inference, build cool stuff. You do NOT want to spend three days troubleshooting kernel panics, TLB fence timeouts, and cryptic driver errors that make you question your life choices.

Yet here you are, Googling "amdgpu fence timeout" at 2 AM while your training job sits idle and your deadline looms.

Sound familiar?

This guide is for you. Let's talk about how to avoid this nightmare in the first place.

The Golden Rule

The best GPU problem is the one you never have.

Debugging is for people with free time. You don't have free time. You have models to train.

Part 1: Choosing Your GPU (The Most Important Decision)

The Harsh Reality Check

Q: "What's the best GPU for ML on Linux?"
A: "The one that just works."

Q: "But what about performance? VRAM? Price-to-performance?"
A: "None of that matters if you spend 40 hours debugging drivers."

The Real Cost Calculation

Let's do some math:

Scenario A: Buy AMD, Save Money

  • GPU cost: $500
  • Hours spent debugging: 40 hours (conservative estimate)
  • Your hourly rate: $50/hour (very conservative for ML work)
  • Total cost: $500 + (40 × $50) = $2,500

Scenario B: Buy NVIDIA, Spend More Upfront

  • GPU cost: $1,200
  • Hours spent debugging: 2 hours (driver install, done)
  • Your hourly rate: $50/hour
  • Total cost: $1,200 + (2 × $50) = $1,300

Savings with NVIDIA: $1,200

And that's not counting:

  • Lost project deadlines
  • Opportunity cost of features not shipped
  • Stress and frustration
  • The value of your sanity
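The arithmetic above is worth scripting once, if only to plug in your own numbers. A minimal shell sketch (the prices and hour counts are the illustrative estimates from the two scenarios, not measurements):

```shell
# Total cost = purchase price + (debugging hours x hourly rate)
hourly_rate=50

total_cost() {
  local gpu_price=$1 debug_hours=$2
  echo $(( gpu_price + debug_hours * hourly_rate ))
}

echo "AMD:    \$$(total_cost 500 40)"    # AMD:    $2500
echo "NVIDIA: \$$(total_cost 1200 2)"    # NVIDIA: $1300
```

Swap in your actual hourly rate and the comparison usually gets even less flattering for the "cheap" option.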

The GPU Vendor Hierarchy (By Stability, Not Performance)

Tier 1: "Just Works"

NVIDIA (20-series and older)

  • Pros:
    • Proprietary drivers are bulletproof
    • CUDA support is perfect
    • Everyone uses them, so every library is tested on them
    • You literally never think about drivers
  • Cons:
    • Expensive
    • Older models = less VRAM
    • Nvidia tax is real
  • Sweet spot: RTX 3090 (24GB VRAM, mature drivers, still powerful)
  • Budget option: RTX 3060 12GB (surprisingly good VRAM for the price)
  • Verdict: Boring, expensive, works perfectly. This is what you want.

Tier 2: "Mostly Works With Minor Tweaking"

NVIDIA (40-series)

  • Pros:
    • Best performance
    • Great VRAM options
    • CUDA is still king
  • Cons:
    • New driver quirks get ironed out over ~6 months
    • Occasional weirdness with cutting-edge kernels
    • More expensive than your car payment
  • Sweet spot: RTX 4090 (if you hate money but love performance)
  • Verdict: Excellent, but give new releases 6 months to mature

NVIDIA (30-series)

  • Pros:
    • Mature drivers
    • Good performance
    • Reasonable prices (sometimes)
  • Cons:
    • Memory bandwidth isn't as good as 40-series
    • Used market is a minefield
  • Sweet spot: RTX 3090/3090Ti (24GB VRAM)
  • Verdict: The "safe bet" tier. Not exciting, but won't ruin your week.

Tier 3: "Works If You're Patient and Willing to Tinker"

AMD RDNA2 (RX 6000 series)

  • Pros:
    • Good Linux support (mostly)
    • ROCm works (on supported models)
    • Cheaper than Nvidia
    • Open-source drivers are actually good
  • Cons:
    • Not all models support ROCm officially
    • Some libraries assume CUDA
    • Occasional kernel update breaks things
    • Multi-GPU can be weird
  • Sweet spot: RX 6900 XT (16GB, well-supported)
  • Verdict: Fine for hobbyists, risky for production

Tier 4: "You Must Really Love Pain"

AMD RDNA3 (RX 7000 series) (Daniel's current situation)

  • Pros:
    • Technically has good performance
    • Cheaper than Nvidia
    • Great for gaming (allegedly)
  • Cons:
    • KERNEL BUG CENTRAL (as of late 2025)
    • TLB fence timeouts
    • Random freezes on kernels 6.14-6.17
    • ROCm support is... evolving
    • You will become intimately familiar with /etc/default/grub
  • Sweet spot: None. Buy RDNA2 or Nvidia instead.
  • Verdict: AVOID FOR PRODUCTION WORK (give it another year)
  • Real-world experience: This is what Daniel has. It's been... educational.

Intel Arc

  • Pros:
    • Cheap
    • Technically exists
    • Good for testing your debugging skills
  • Cons:
    • OneAPI is not CUDA
    • Driver support is "emerging"
    • Most ML frameworks: "Intel what now?"
    • You'll be writing compatibility layers
  • Verdict: Experimental at best. Pass.

The Correct Answer for Busy People

Just buy an NVIDIA 3090 or 4090 and get on with your life.

Yes, they're expensive. Know what else is expensive? Your hourly rate when you're debugging AMD drivers instead of shipping features.

Budget-Conscious Options That Still Work

If you absolutely cannot afford a new NVIDIA GPU:

  1. Used NVIDIA RTX 3060 12GB (~$200-300 used)

    • 12GB VRAM is surprisingly useful
    • Mature drivers
    • Widely available
    • Risk: Used market, check warranty
  2. Used NVIDIA RTX 3070 (~$300-400 used)

    • 8GB VRAM (limiting for some tasks)
    • Still excellent support
    • Good performance
  3. Used NVIDIA RTX 2080 Ti (~$300-400 used)

    • 11GB VRAM
    • Old but gold
    • Drivers are rock solid
  4. New NVIDIA RTX 3060 12GB (~$300-400 new)

    • If you can find it
    • Best budget option for new cards

If you already have AMD (like Daniel):

  • Sell it while it still has value
  • Put proceeds toward NVIDIA
  • Calculate debugging hours × hourly rate = justification for upgrade
  • Consider financing if you're using it for work/business

Part 2: Choosing Your Linux Distribution

The Question Everyone Gets Wrong

Wrong question: "What's the best distro for ML?"
Right question: "What's the most boring, stable distro that won't surprise me?"

Distribution Tiers

Tier 1: "I Have Work to Do"

Ubuntu LTS (22.04)

  • Why: Everyone uses it. Every tutorial assumes it. Every vendor tests on it.
  • Cons: Snap packages (but you can ignore them)
  • GPU support: Excellent. All drivers just work.
  • Verdict: This is the answer. Stop looking.

Ubuntu LTS (20.04)

  • Why: Even more stable. Older kernel = fewer surprises.
  • Cons: Older packages, but you're using conda/docker anyway
  • Verdict: If you value sleep over bleeding edge

Tier 2: "I Like To Live Dangerously (But Not Too Dangerously)"

Ubuntu Non-LTS (24.10, 25.04, 25.10, etc.) (Daniel's current setup)

  • Why: Newer packages, newer kernel
  • Cons: New kernel = new bugs. Hope you like debugging!
  • GPU support: Hit or miss depending on timing
  • Verdict: Only if you need cutting-edge features and have time to fix things
  • Daniel's experience: Running 25.10 with kernel 6.17. This is part of the problem.

Fedora

  • Why: Red Hat backing, good hardware support
  • Cons: Shorter support cycles, more frequent updates
  • Verdict: Fine if you know what you're doing

Tier 3: "I Enjoy Configuring Things"

Arch / Manjaro

  • Why: Bleeding edge everything! AUR! Customization!
  • Cons: Bleeding edge means bleeding wounds
  • GPU support: You're on your own, buddy
  • Verdict: For tinkerers, not for getting work done

Pop!_OS

  • Why: System76 optimizes for hardware, nice Nvidia support
  • Cons: Smaller community, less documentation
  • Verdict: Interesting but not mainstream enough for production

Tier 4: "I Have Infinite Free Time"

Gentoo / NixOS / Linux From Scratch

  • Why: Because you hate yourself and love compiling
  • Verdict: No. Just no.

The Correct Answer for Busy People

Ubuntu 22.04 LTS

Done. Next question.

Part 3: The "Don't Shoot Yourself in the Foot" Setup Guide

Step 1: Install Ubuntu 22.04 LTS

  • Use the LTS version (22.04)
  • Do NOT upgrade to non-LTS releases
  • Do NOT enable automatic version upgrades
  • Resist the temptation to try the "newer kernel backport"

Step 2: Install NVIDIA Drivers (The Easy Way)

# Ubuntu has this figured out
sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot

That's it. If you're doing anything more complicated, you're doing it wrong.

DO NOT:

  • Download drivers from Nvidia's website
  • Compile drivers from source
  • Use a PPA from some random blog post
  • Install drivers meant for a different Ubuntu version

Step 3: Install CUDA (The Also Easy Way)

# Use Ubuntu's CUDA packages
sudo apt install nvidia-cuda-toolkit

# Or use conda (even easier)
conda install cudatoolkit

DO NOT:

  • Download CUDA from Nvidia's website (unless you have a very specific reason)
  • Mix different CUDA installation methods
  • Install multiple CUDA versions system-wide (use conda envs instead)

Step 4: Test That It Works

# Should show your GPU
nvidia-smi

# Should show CUDA working
python -c "import torch; print(torch.cuda.is_available())"

If both work, congratulations! You can now ignore GPU stuff for the next 2 years.

Step 5: Set Up Your ML Environment (Docker or Conda)

Option A: Docker (Recommended for teams)

# Use official images
docker pull pytorch/pytorch:latest
docker pull tensorflow/tensorflow:latest-gpu

# Or NVIDIA's images
docker pull nvcr.io/nvidia/pytorch:24.01-py3

Pros:

  • Isolated from system
  • Reproducible
  • Easy to share setups
  • Can't break your system

Cons:

  • Slightly more complex
  • Uses more disk space

Option B: Conda (Recommended for individuals)

# Install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create env per project
conda create -n myproject python=3.10
conda activate myproject
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

Pros:

  • Easy to use
  • Per-project environments
  • Handles CUDA versions

Cons:

  • Can get bloated
  • Occasional dependency hell

DO NOT:

  • Install ML libraries with pip system-wide
  • Mix pip and conda in the same environment (unless you're careful)
  • Use sudo pip install (please, just don't)

Part 4: How to Stay Stable (And Keep Your Sanity)

The "Don't Touch What Works" Rules

Rule 1: Don't Upgrade Kernel Unless You Have To

  • Ubuntu LTS picks stable kernels
  • New kernels = new GPU bugs
  • Your model doesn't care about kernel 6.17 features

Rule 2: Don't Upgrade Ubuntu Version Until 6 Months After Release

  • Let other people find the bugs
  • LTS to LTS is safest (22.04 → 24.04, when 24.04 is mature)
  • Non-LTS releases are for people with time to spare

Rule 3: Pin Your Working NVIDIA Driver Version

# Prevent automatic driver updates
sudo apt-mark hold nvidia-driver-535

Rule 4: Use LTS Kernels

# Check your kernel
uname -r

# If you're on a non-LTS kernel, switch back
sudo apt install linux-generic-hwe-22.04
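If you want a quick sanity check rather than memorizing version numbers, something like the following works. The longterm list here is the kernel.org longterm series as I understand it in late 2025; it will drift, so verify against kernel.org before trusting it:

```shell
# Rough check: is the running kernel's major.minor on the longterm list?
# (Longterm series assumed here: 5.15, 6.1, 6.6, 6.12 - verify on kernel.org.)
is_lts_kernel() {
  local series=$1 lts
  for lts in 5.15 6.1 6.6 6.12; do
    [ "$series" = "$lts" ] && return 0
  done
  return 1
}

series=$(uname -r | cut -d. -f1,2)
if is_lts_kernel "$series"; then
  echo "$series: longterm kernel, stay put"
else
  echo "$series: not a longterm series, consider switching"
fi
```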

Rule 5: One CUDA Version Per Project

  • Use conda environments
  • Don't fight with system CUDA
  • PyTorch/TF ship with CUDA built-in anyway

The "When to Actually Upgrade" Decision Tree

Do you have a GPU-related bug?
├─ No → Don't upgrade anything
└─ Yes → Is there a known fix in a newer version?
    ├─ No → File a bug report, work around it
    └─ Yes → Will upgrading break other things?
        ├─ Maybe → Test in a VM first
        └─ Probably not → Snapshot your system, upgrade, test
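The decision tree above can be written down as a small shell helper, which at least stops you from rationalizing an upgrade at 2 AM. A sketch; the yes/no answers are whatever your own investigation turns up:

```shell
# Answers are "yes" or "no": do you have a bug, is there a known fix,
# and might upgrading break other things?
should_upgrade() {
  local have_bug=$1 fix_exists=$2 may_break=$3
  if [ "$have_bug" != "yes" ]; then
    echo "Don't upgrade anything"
  elif [ "$fix_exists" != "yes" ]; then
    echo "File a bug report, work around it"
  elif [ "$may_break" = "yes" ]; then
    echo "Test in a VM first"
  else
    echo "Snapshot your system, upgrade, test"
  fi
}

should_upgrade no  -   -    # Don't upgrade anything
should_upgrade yes yes no   # Snapshot your system, upgrade, test
```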

Backup Strategy for the Paranoid (Smart)

Before any system update:

# If you use Timeshift/snapper
sudo timeshift --create --comments "Before GPU driver update"

# Or manual backup
sudo rsync -aAXv / --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*"} /backup/location

Before upgrading Ubuntu version:

# Full disk image
sudo dd if=/dev/nvme0n1 of=/backup/disk-image.img bs=4M status=progress

# Or use Clonezilla bootable USB

Part 5: The AMD Survival Guide (If You Ignored My Advice)

So You Bought an AMD GPU Anyway

I told you not to. But here you are. Or maybe you bought it before you got into AI/ML work (like Daniel). Let's minimize the pain.

Supported Cards for ROCm (as of late 2025):

  • RX 6900 XT ✅
  • RX 6800 XT ✅
  • RX 6700 XT ✅
  • RX 7900 XTX ⚠️ (officially supported, but kernel bugs)
  • RX 7900 XT ⚠️ (officially supported, but kernel bugs)
  • RX 7700 XT ❌ (not officially supported, Daniel can confirm: good luck)

AMD + Ubuntu Setup (Abbreviated Version)

Step 1: Use Ubuntu 22.04 LTS

  • Kernel 5.15 or 6.2 HWE
  • Do NOT use 6.14+ kernels with RDNA3 (Daniel learned this the hard way)

Step 2: Install ROCm

# Use AMD's official packages. Note: rocm-hip-sdk comes from AMD's
# repository, which must be added first (via the amdgpu-install package
# from AMD's site); it is not in stock Ubuntu.
sudo apt install rocm-hip-sdk

Step 3: Test

# Should show your GPU
rocm-smi

# Test PyTorch
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
python -c "import torch; print(torch.cuda.is_available())"  # Yes, still says 'cuda'

Step 4: Add Stability Kernel Parameters

If you have RDNA3 (RX 7000), you need these:

sudo nano /etc/default/grub

# Add to GRUB_CMDLINE_LINUX_DEFAULT:
amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.gpu_recovery=1 iommu=soft

sudo update-grub
sudo reboot
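If you would rather not hand-edit the file (you will be doing this more than once), the same change can be scripted. The sed one-liner below is my own sketch, not an AMD-blessed procedure; it demos on a scratch copy, and you'd point `GRUB_FILE` at the real `/etc/default/grub` (and keep the backup) for actual use:

```shell
# Demo on a scratch file; set GRUB_FILE=/etc/default/grub for real use.
GRUB_FILE=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$GRUB_FILE"

PARAMS="amdgpu.tmz=0 amdgpu.sg_display=0 amdgpu.gpu_recovery=1 iommu=soft"
cp "$GRUB_FILE" "$GRUB_FILE.bak"   # always keep a backup of grub config
# Append the parameters inside the existing quotes, preserving what's there
sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\([^\"]*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $PARAMS\"/" "$GRUB_FILE"

grep GRUB_CMDLINE_LINUX_DEFAULT "$GRUB_FILE"
# Against the real file, follow with: sudo update-grub && sudo reboot
```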

Step 5: When It Inevitably Breaks

  • Check kernel version (stick to 5.15 or 6.2)
  • Check ROCm compatibility matrix
  • Consider selling GPU and buying Nvidia
  • Calculate: (debugging hours × hourly rate) - (Nvidia cost - AMD resale value) = your justification
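That justification formula, scripted, with purely illustrative numbers (a hypothetical used 3090 price and resale value, not quotes from any market):

```shell
# (debugging hours x hourly rate) - (NVIDIA cost - AMD resale value)
# A positive result means the switch has already paid for itself.
upgrade_justification() {
  local debug_hours=$1 hourly_rate=$2 nvidia_cost=$3 amd_resale=$4
  echo $(( debug_hours * hourly_rate - (nvidia_cost - amd_resale) ))
}

# 40 hours at $50/h, vs. an $800 used card minus $350 resale:
upgrade_justification 40 50 800 350   # 1550
```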

Part 6: The "Oh Crap" Recovery Kit

When Updates Break Everything

Symptom: Black screen after kernel update

Fix:

# Boot into recovery mode (hold Shift during boot)
# Select "Advanced options" → older kernel
# Remove problematic kernel
sudo apt remove linux-image-6.17.0-6-generic
sudo apt autoremove
sudo update-grub
reboot

Symptom: NVIDIA driver doesn't load

Fix:

# Purge all NVIDIA packages (quote the glob so the shell doesn't expand it)
sudo apt purge 'nvidia-*'
sudo apt autoremove

# Reinstall
sudo ubuntu-drivers autoinstall
sudo reboot

Symptom: CUDA not found

Fix:

# Use conda CUDA instead
conda install cudatoolkit=11.8

Symptom: AMD GPU freezes randomly

Fix:

# See Part 5, Step 4
# Or just buy an Nvidia GPU (seriously)

Part 7: The Ultimate Lazy Setup

For People Who Want Zero Maintenance

Hardware:

  • NVIDIA RTX 3090 or 4090
  • At least 32GB system RAM
  • NVMe SSD (1TB+)

Software:

  • Ubuntu 22.04 LTS
  • Install NVIDIA drivers via ubuntu-drivers autoinstall
  • Use Docker for everything:
    # Pull pre-configured image
    docker pull nvcr.io/nvidia/pytorch:24.01-py3
    
    # Run with GPU
    docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:24.01-py3

Configuration:

  • Disable automatic version upgrades
  • Pin kernel version
  • Set up Timeshift for automatic backups
  • Never touch /etc/default/grub

Maintenance schedule:

Security updates: Automatic (apt does this)
Kernel updates: Never (unless security critical)
Ubuntu version upgrades: Every 2 years, 6 months after LTS release
GPU driver updates: Never (unless you need a specific feature)
CUDA updates: Via Docker images as needed

Time spent on GPU issues per year: ~0 hours

Part 8: Common Mistakes and How to Avoid Them

Mistake #1: "I'll Save Money with AMD"

Reality: You'll spend $500 less on the GPU and 40 hours debugging ROCm.

Your hourly rate: Probably more than $12.50/hour

Actual savings: Negative

Daniel's experience: Bought AMD for budget reasons (valid!), now spending days debugging instead of shipping AI projects

Mistake #2: "I'll Use the Latest Ubuntu Version"

Reality: New kernels have new bugs. Your training code doesn't need kernel 6.17 features.

Solution: LTS releases only

Daniel's experience: Running Ubuntu 25.10 with kernel 6.17 = TLB fence timeout city

Mistake #3: "I'll Manually Install NVIDIA Drivers"

Reality: Ubuntu's driver packages are tested and integrated. Random internet instructions are not.

Solution: sudo ubuntu-drivers autoinstall

Mistake #4: "I'll Update Everything Every Week"

Reality: "If it ain't broke, don't fix it" exists for a reason.

Solution: Security updates only. Feature updates when you need them.

Mistake #5: "I'll Run Bleeding Edge Everything"

Reality: You're an ML engineer, not a Linux kernel developer.

Solution: Stable versions, boring choices, more time training models

Part 9: The Buyer's Checklist

Before buying any GPU for Linux ML work, check:

  • Is it NVIDIA? (Massive points)
  • Is it listed in PyTorch's supported GPUs?
  • Is it listed in TensorFlow's supported GPUs?
  • Has it been out for at least 6 months?
  • Can you find Ubuntu 22.04 setup guides for it?
  • Does it have enough VRAM for your use case?
  • Is it in your budget after accounting for electricity costs?
  • Have you googled "[GPU model] linux problems"?
  • Are the search results mostly old forum posts, not recent Reddit threads titled "HELP URGENT"?
  • NEW: Have you calculated (debugging time × hourly rate) vs. price difference?

If you answered "No" to more than 2 of these, reconsider.
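The "more than 2" rule is easy to automate if you want to be honest with yourself. A throwaway sketch (the answers are whatever you'd tick off against the checklist above, in order):

```shell
# Count "no" answers to the buyer's checklist; more than 2 means reconsider.
evaluate_checklist() {
  local no_count=0 answer
  for answer in "$@"; do
    [ "$answer" = "no" ] && no_count=$((no_count + 1))
  done
  if [ "$no_count" -gt 2 ]; then
    echo "Reconsider"
  else
    echo "Proceed"
  fi
}

evaluate_checklist yes yes yes no yes yes yes yes no yes   # Proceed
evaluate_checklist no no yes no yes yes no yes yes yes     # Reconsider
```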

Part 10: The Truth About GPU Choices

What Marketing Says

  • "Groundbreaking performance!"
  • "Revolutionary architecture!"
  • "Best price-to-performance ratio!"

What Actually Matters

  • "Does it work?"
  • "Will it keep working?"
  • "Can I forget about it and do my job?"

The Ugly Truth

The best GPU is not:

  • The fastest
  • The cheapest
  • The newest
  • The one with the most VRAM per dollar

The best GPU is:

  • The one you install once
  • The one that works with your software
  • The one you never think about again
  • The one that lets you focus on your actual work

Conclusion: The Zen of Not Caring About GPUs

The enlightened ML engineer doesn't care about:

  • Kernel versions
  • Driver compilation flags
  • GRUB parameters
  • DMA fence timeouts

The enlightened ML engineer cares about:

  • Training models
  • Running inference
  • Shipping products
  • Not debugging drivers at 2 AM

To achieve enlightenment:

  1. Buy an NVIDIA GPU (probably a 3090, even used)
  2. Install Ubuntu 22.04 LTS
  3. Run sudo ubuntu-drivers autoinstall
  4. Use Docker or conda for everything
  5. Never update unless you have to
  6. Focus on your actual work

The One-Paragraph Summary

Buy an NVIDIA RTX 3090 or 4090 (or even a used 3060 12GB if budget is tight). Install Ubuntu 22.04 LTS. Run sudo ubuntu-drivers autoinstall. Use Docker containers for ML work. Don't update your kernel unless you have a security reason. Spend your time training models instead of debugging drivers. Life is too short for TLB fence timeouts. Budget constraints are real, but so is the debugging tax—calculate your hourly rate times debugging hours, and suddenly that "expensive" NVIDIA card looks like a bargain.


Epilogue: Daniel's Current Status

After days of debugging TLB fence timeouts, trying 10+ different fixes, and spending more time in /etc/default/grub than training models, Daniel is now actively searching for any way to afford an NVIDIA GPU.

The RX 7700 XT made perfect sense when he bought it for general use. But when your work shifts to AI/ML, the calculus changes. The "budget" option becomes the expensive one when you factor in opportunity cost.

The lesson: Think about your future use cases. If there's even a 30% chance you'll do ML work in the next year, just buy NVIDIA. Your future self will thank you.


Disclaimer: This guide prioritizes stability and productivity over cutting-edge features, cost optimization, or ideological purity. If you want to tinker with bleeding-edge hardware, support open-source drivers, or optimize every dollar, this guide is not for you. This guide is for people who want their GPU to work reliably so they can focus on their actual job.

Acknowledgments: Written by Claude after spending way too much time helping Daniel debug RDNA3 issues that would've been avoided by buying an NVIDIA GPU in the first place. But hey, hindsight is 20/20, and budget constraints are real. We've all been there.


Generated by Claude Code after one too many 2 AM debugging sessions with Daniel

Remember: The best debugging session is the one that never happens.
