@tonysy
Created April 10, 2018 14:05
P40 Cluster Usage

10.AI-Cluster-Usage

P40 AI Cluster

On the P40 AI cluster, jobs can be submitted with PBS commands only. This is similar to sending a shell script to a GPU server, after which you get output log files back.

Like the old AI cluster, it consists of two kinds of nodes: an admin node and GPU nodes.

P40 AI Cluster IP: 10.15.22.198. The IP address is not static at the moment, so it may change!

Warm Up

  • You can log in to the admin node with:
    ssh username@10.15.22.198
  • Change your password with yppasswd username
  • You are now logged in to the admin node of the GPU server. The Internet is accessible only from the admin node.

For fun

Create a folder to hold your scripts and log files: mkdir pbs_tool && cd pbs_tool

Write PBS script

You can use a GPU to train your model only by submitting a PBS script from the admin node.

  • PBS script example
#!/bin/bash
#PBS -N Example -q sist-hexm -l sist-gpu0x
echo "This is a test script"
pwd 
nvidia-smi

Save this file as example.pb
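If you submit many jobs, typing the same header each time gets tedious. The sketch below generates a script of the same shape as the example above. `write_pbs_script` is a hypothetical helper name, not a PBS command, and the directive line simply mirrors the example (written with a space after `#PBS`, the conventional form); adjust it if your site expects something different.

```shell
#!/bin/bash
# write_pbs_script: hypothetical helper (not part of PBS) that prints a
# minimal PBS script to stdout.
# Arguments: JOB_NAME QUEUE NODE PROJECT_DIR COMMAND
write_pbs_script() {
    local name="$1" queue="$2" node="$3" dir="$4" cmd="$5"
    # Unquoted heredoc delimiter, so the variables above are expanded here.
    cat <<EOF
#!/bin/bash
#PBS -N $name -q $queue -l $node
echo "Job $name started"
cd "$dir"
$cmd
EOF
}
```

For example, `write_pbs_script Example sist-hexm sist-gpu0x "$HOME/pbs_tool" nvidia-smi > example.pb` writes a script you can inspect before submitting it.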

Send work and get your output

Use this command to submit the job: qsub example.pb

Some log files will then be generated. You can check the output in Example.o123, where 123 is the job ID. The .o*** suffix stands for output.
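After a few submissions the folder fills up with files such as Example.o121, Example.o122 and Example.o123. A minimal sketch for picking the most recent one, assuming the suffix after `.o` or `.e` is a plain number that grows with each job; `latest_log` is a hypothetical helper name:

```shell
#!/bin/bash
# latest_log: hypothetical helper that prints the log file with the highest
# numeric suffix for a given job name.
# Arguments: JOB_NAME [KIND], where KIND is "o" (output, default) or "e" (error).
latest_log() {
    local name="$1" kind="${2:-o}" best="" best_id=-1 f id
    for f in "$name.$kind"*; do
        [ -e "$f" ] || continue            # glob matched nothing
        id="${f#"$name.$kind"}"            # strip "JOB_NAME.o" to get the number
        case "$id" in
            ''|*[!0-9]*) continue ;;       # skip suffixes that are not numeric
        esac
        if [ "$id" -gt "$best_id" ]; then
            best_id="$id"
            best="$f"
        fi
    done
    if [ -z "$best" ]; then
        return 1                           # no matching log found
    fi
    printf '%s\n' "$best"
}
```

Run it from the folder that holds the logs, for example `cat "$(latest_log Example)"`.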

Other resources:

Ref: PBS Documents

Save your life

  • Run a script:

    #!/bin/bash
    #PBS -N YourLogFileName -q sist-hexm -l sist-gpu0x
    echo "This is a test script"
    cd YourProjectFolder
    command you want to execute
    • YourLogFileName is the job name, which is used as the file name of the output log and the error log
    • sist-hexm is the group name, passed as the queue with -q
    • sist-gpu0x is the ID of the GPU machine
  • Watch Your output

    • You can use cat or vi to view the output.
    • You can use watch -n 0.1 tail -n 30 YourLogFileName.oxxx to monitor your output.
      • -n 0.1 (the watch option) means the display refreshes every 0.1 s
      • -n 30 (the tail option) sets how many lines to display.
  • Pay attention to YourLogFileName.exxx: errors and warnings are recorded there.
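Since the error log is easy to forget, a small check after each job can help. The sketch below prints the tail of an error log only when it has content. `report_errors` is a hypothetical helper name, and the default of 30 lines just mirrors the tail example above.

```shell
#!/bin/bash
# report_errors: hypothetical helper that inspects a PBS error log.
# Arguments: LOG_FILE [LINES]
# Returns 0 if the log is empty, 1 if it has content, 2 if it does not exist.
report_errors() {
    local log="$1" lines="${2:-30}"
    if [ ! -e "$log" ]; then
        echo "no such log: $log" >&2
        return 2
    fi
    if [ -s "$log" ]; then                 # -s: file exists and is not empty
        echo "== last $lines lines of $log =="
        tail -n "$lines" "$log"
        return 1
    fi
    echo "$log is empty"
    return 0
}
```

For example, `report_errors YourLogFileName.e123` prints the last 30 lines of that error log if anything was written to it.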

Debug your code on your local machine, and use this cluster only to run jobs and collect the results. It's difficult to debug with PBS!
