
@marshki
Last active February 26, 2021 14:42
Transfer data from source to destination on NYU's high performance computing (HPC) cluster.

Transfer Data to NYU's HPC 🚀

Scope: Transfer data to NYU's high performance computing (HPC) cluster.

Summary of access nodes on "Greene":

| Fully-qualified domain name (FQDN) | Purpose |
| --- | --- |
| gdtn.hpc.nyu.edu | Data transfer node (DTN) |
| greene.hpc.nyu.edu | Login node |

Summary of data storage on "Greene" (per netID):

| Path | Environment variable | Purpose | Flushed? | Allocation |
| --- | --- | --- | --- | --- |
| /archive/$USER | $ARCHIVE | Long-term storage | No | 2TB/20K files |
| /home/$USER | $HOME | Small files, code | No | 50GB/30K files |
| /scratch/$USER | $SCRATCH | File staging, frequent read/write | Yes. Files unused for sixty (60) days are deleted | 5TB/1M files |
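
Because /scratch is flushed and every area has a file-count limit, it can help to check which files are at risk and how many files a directory holds. The following is a minimal sketch, assuming that "unused" means "not accessed" (access time); `stale_files` and `count_files` are hypothetical helper names, not part of Greene's tooling.

```shell
# List regular files under a directory that have not been accessed
# in more than N days (assumption: the flush policy is based on access time).
stale_files() {
    dir=$1
    days=${2:-60}
    find "$dir" -type f -atime +"$days"
}

# Count regular files under a directory, to compare against the file allocation.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example (run on Greene):
#   stale_files "$SCRATCH" 60
#   count_files "$SCRATCH"
```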

Preflight check ✔️✈️

You'll need these:

Transferring Data 🔄

No-frills file transfer with secure copy (scp)

Quickly transfer a small number of files, or a single directory, from source to destination:

scp -rv file1 netID@gdtn.hpc.nyu.edu:/home/netID

(When prompted, provide your NetID credentials. Because of the -v flag, a successful transfer ends with the line: Exit status 0.)
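
In a script, the exit status can be checked directly instead of reading the verbose output. This is a rough sketch; `report_transfer` is a hypothetical wrapper name that runs whatever command it is given and reports on its exit status.

```shell
# Run a transfer command and report whether it exited with status 0.
report_transfer() {
    if "$@"; then
        echo "transfer succeeded"
    else
        code=$?   # exit status of the command that just failed
        echo "transfer failed (exit status $code)" >&2
        return "$code"
    fi
}

# Example (not run here, needs network access and credentials):
#   report_transfer scp -r file1 netID@gdtn.hpc.nyu.edu:/home/netID
```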

Remotely synchronize the contents of two directories (rsync)

Sync the contents of a directory from source to destination (recommended for a large number of files):

rsync --archive --compress --partial --progress --exclude=".*" directoryname/ netID@gdtn.hpc.nyu.edu:/scratch/netID/directoryname

or more succinctly:

rsync -azP --exclude=".*" directoryname/ netID@gdtn.hpc.nyu.edu:/scratch/netID/directoryname

(When prompted, provide your NetID credentials.)

rsync notes:

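  • The trailing / on the source (directoryname/) matters: with it, rsync copies the contents of the directory into the destination; without it, rsync creates directoryname inside the destination.
  • --exclude=".*" excludes dot files (names beginning with a .). This is optional, but recommended.
  • Run the same rsync command again whenever your source directory changes to have those changes reflected in your destination directory.
  • rsync can pick up where it left off with the --append option, which appends data onto shorter destination files.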
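
To see what a sync would do before any data moves, rsync's --dry-run and --itemize-changes options can be combined with the flags above. A minimal sketch follows; `sync_preview` is a hypothetical helper name.

```shell
# Preview an rsync transfer without copying anything.
# $1 = source (mind the trailing /), $2 = destination
sync_preview() {
    rsync --archive --dry-run --itemize-changes --exclude=".*" "$1" "$2"
}

# Example (not run here, needs network access and credentials):
#   sync_preview directoryname/ netID@gdtn.hpc.nyu.edu:/scratch/netID/directoryname
```

Each output line names one file or directory that would be created or updated; dropping --dry-run performs the real transfer.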

Archive data with on-the-fly compression (tar, ssh)

Use tar to compress source directory and push it over SSH to destination (recommended for fast data transfer and/or archiving):

tar --create --gzip --verbose --file - directoryname | ssh netid@gdtn.hpc.nyu.edu "cat > /archive/netid/tarballname.tgz"

or more succinctly:

tar czvf - directoryname | ssh netid@gdtn.hpc.nyu.edu "cat > /archive/netid/tarballname.tgz"

To unpack the tar (on the destination side) do:

tar --extract --verbose --gunzip --file tarballname.tgz

or less verbosely:

tar -xvzf tarballname.tgz
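
Before unpacking, it may be worth confirming that the tarball arrived intact. The sketch below tests the gzip stream and then lists the archive members; `check_tarball` is a hypothetical helper name.

```shell
# Verify that a gzip-compressed tarball is readable, then list its contents.
# Returns non-zero if the archive is truncated or corrupt.
check_tarball() {
    gzip -t "$1" && tar -tzf "$1"
}

# Example (on the destination side):
#   check_tarball tarballname.tgz && tar -xvzf tarballname.tgz
```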

Transferring data with no hangups (nohup) 🚫📞

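To keep a process running even after exiting your shell or terminal, preface it with nohup (recommended for time-intensive jobs). nohup applies only to the command it directly precedes, so wrap the whole pipeline in sh -c to cover both tar and ssh:

nohup sh -c 'tar -czvf - directoryname | ssh netid@gdtn.hpc.nyu.edu "cat > /scratch/netid/tarballname.tgz"'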

then, after entering your credentials, suspend the command:

Ctrl + z

and send it to background:

bg

To monitor this process, take note of its PID:

ps aux | grep -i ssh

The PID is in the second column of the matching line, e.g.: 67225

then take that PID, and feed it to top:

top -pid 67225 (macOS) or: top -p 67225 (Linux)

(you can quit top with: q).
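
For a quick yes/no check that does not open top, `kill -0` can test whether a PID still exists. This is a minimal sketch; `is_running` is a hypothetical helper name and 67225 is the example PID from above.

```shell
# Return success if a process with the given PID is still running.
# kill -0 sends no signal; it only checks that the PID exists and can be signalled.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Example:
#   is_running 67225 && echo "transfer still running" || echo "transfer finished"
```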

:neckbeard:❤️
