nvprof is usually located in /usr/local/cuda/bin.
$ nvprof python train_mnist.py
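If the shell cannot find nvprof, a small helper can check PATH first and then fall back to the usual CUDA directory. This is only a minimal sketch: `find_nvprof` and its `cuda_bin` parameter are hypothetical names made up for illustration, and the default directory is simply the usual location, which may differ on your machine.

```python
import shutil


def find_nvprof(cuda_bin="/usr/local/cuda/bin"):
    """Return the full path to nvprof, or None if it cannot be found.

    cuda_bin is an assumed default install location; adjust it for your setup.
    """
    # First look on the current PATH.
    found = shutil.which("nvprof")
    if found is None:
        # Fall back to the usual CUDA toolkit directory.
        found = shutil.which("nvprof", path=cuda_bin)
    return found


if __name__ == "__main__":
    print(find_nvprof() or "nvprof not found; check your CUDA installation")
```

If this prints a path under the CUDA directory, adding that directory to PATH lets you run nvprof directly as in the command above.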
I prefer to use --print-gpu-trace, which lists every kernel launch and memory copy in chronological order instead of a summary.
$ nvprof --print-gpu-trace python train_mnist.py
To view the results in a GUI, run the following on the GPU machine:
$ nvprof -o prof.nvvp python train_mnist.py
Copy prof.nvvp to your local machine:
$ scp your_gpu_machine:/path/to/prof.nvvp .
Then run nvvp (NVIDIA Visual Profiler) on your local machine:
$ nvvp prof.nvvp
This is more comfortable than running nvvp on the GPU machine over X11 forwarding or the like.
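The three steps above (profile on the GPU machine, copy the file, open it locally) can be generated for your own script and host with a few lines of Python. This is a rough sketch, not a required tool: `profile_workflow` and its parameters are hypothetical names, and the host and path are placeholders exactly as in the commands above. It only builds the command strings; it does not run anything.

```python
import shlex


def profile_workflow(script, host, remote_dir, output="prof.nvvp"):
    """Return the three shell commands of the workflow as strings.

    host and remote_dir are placeholders for your own GPU machine and path.
    """
    q = shlex.quote  # quote arguments that contain spaces or shell characters
    return [
        # 1. On the GPU machine: write the profile to a file.
        f"nvprof -o {q(output)} python {q(script)}",
        # 2. On the local machine: copy the profile over.
        f"scp {q(host + ':' + remote_dir + '/' + output)} .",
        # 3. On the local machine: open it in the visual profiler.
        f"nvvp {q(output)}",
    ]


if __name__ == "__main__":
    for cmd in profile_workflow("train_mnist.py", "your_gpu_machine", "/path/to"):
        print("$", cmd)
```

Run the first printed command on the GPU machine and the other two on your local machine.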
An example of nvprof output:
http://topsecret.hpc.co.jp/wiki/index.php/CUDA_5%E3%81%AE%E6%96%B0%E6%A9%9F%E8%83%BD(4):_nvprof%E3%83%97%E3%83%AD%E3%83%95%E3%82%A1%E3%82%A4%E3%83%A9