@leeklee0427
Last active June 27, 2024 18:40

95-869 Big Data and Large Scale Computing

PySpark Setup

ACCESS Registration

  1. Create user account: https://identity.access-ci.org/new-user
  2. Select "Register without an existing identity"
  3. Click "Begin"
  4. Enter user information
  5. Select organization - CMU
  6. Verify email
  7. Use your ACCESS ID to log in
  8. Select an Identity Provider - ACCESS CI (XSEDE)
  9. Set up DUO multi-factor authentication
  10. Upon receiving the confirmation email, reset PSC cluster account password at https://apr.psc.edu/autopwdreset/autopwdreset.html

ACCESS Account Information

  • ACCESS ID/username: dli8
  • PSC username: dlid

Running PySpark in iPython Notebook on PSC Cluster Jupyter Server

Note: Repeat these steps every time you start working on a homework.

1. Connect to the Bridges-2 server

$ ssh dlid@bridges2.psc.edu

2. Launch an interactive compute session

$ interact -N 1 -n 8 -t 1:00:00

Sample message: salloc: Nodes r001 are ready for job

3(a). Configure Python environment and start Jupyter notebook with the provided bash file

$ bash /ocean/projects/cis220071p/shared/config.sh 8888

Note: 8888 is the default port number. Change it to a different number to avoid conflicts with other students.

Sample message:

[dlid@r001 ~]$ bash /ocean/projects/cis220071p/shared/config.sh 6666
step1 (ignored)
step2 (load packages)
step3 (print remote forward command line)
ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
step4 (open notebook)
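
As the note about port numbers suggests, two students on the same node cannot share a port. One way to pick a number that is free at the moment is to ask the operating system for one. The following is a minimal sketch, meant to be run on the compute node before calling the config script; `find_free_port` is a hypothetical helper and is not part of the course's config.sh. The port could still be claimed by someone else between this check and the Jupyter launch, so treat the result as a good guess rather than a guarantee.

```python
import socket


def find_free_port() -> int:
    """Return a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Binding to port 0 asks the OS to choose any free port.
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    # Pass the printed number to config.sh in place of 8888.
    print(find_free_port())
```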

3(b). Configure Python environment and start Jupyter notebook step by step

  1. Load the required PSC module for PySpark use

    $ module load anaconda3
  2. Get hostname:

    $ hostname

    Sample message: r001.ib.bridges2.psc.edu

  3. Run Jupyter notebook server:

    $ jupyter notebook --no-browser --ip 0.0.0.0 --port <port-number>
    $ jupyter notebook --no-browser --ip 0.0.0.0 --port 6666

    Sample message: http://0.0.0.0:<port-number>/?token=<longstring>

4. Open new Terminal and create SSH tunnel

If you used 3(a), run the ssh command printed under step3 of its output:

$ ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu

If you used 3(b), substitute the hostname and port number you obtained there:

$ ssh -L 8888:<hostname>:<port-number> dlid@bridges2.psc.edu
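
The tunnel command has the shape `ssh -L <local-port>:<hostname>:<port-number> <user>@bridges2.psc.edu`, which is easy to mistype. The sketch below assembles it from its parts; `build_tunnel_command` is a hypothetical helper introduced here for illustration, and the default values are taken from the examples in this guide.

```python
import shlex


def build_tunnel_command(hostname, remote_port, user,
                         local_port=8888, login_host="bridges2.psc.edu"):
    """Build the ssh -L command that forwards local_port to the notebook.

    hostname    -- compute node name printed by `hostname`
    remote_port -- port the Jupyter server listens on
    user        -- PSC username
    """
    forward = f"{local_port}:{hostname}:{remote_port}"
    return shlex.join(["ssh", "-L", forward, f"{user}@{login_host}"])


if __name__ == "__main__":
    print(build_tunnel_command("r001.ib.bridges2.psc.edu", 6666, "dlid"))
```

With the sample values from step 3(a), this prints the same command shown in the config script output.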

5. Visit Jupyter server in local browser

URL: http://127.0.0.1:8888/

If you are prompted for a token, use the <longstring> from the Jupyter server's startup message.
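
If the page does not load, it can help to check whether anything is accepting connections on the local end of the tunnel. The following is a rough sketch to run on your own machine; `tunnel_is_listening` is a hypothetical helper, and it only confirms that the local port accepts TCP connections, not that Jupyter is answering on the other side.

```python
import socket


def tunnel_is_listening(port=8888, host="127.0.0.1", timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print(tunnel_is_listening(8888))
```

A result of False usually means the ssh tunnel from step 4 is not running or uses a different local port.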

6. Open a Python 3 Jupyter notebook in the browser by choosing New > Python 3 Notebook.

7. Test

Insert a new cell at the top of the notebook with the following code:

import sys
sys.path.append("/opt/packages/spark/latest/python/lib/py4j-0.10.9-src.zip")
sys.path.append("/opt/packages/spark/latest/python/")
sys.path.append("/opt/packages/spark/latest/python/pyspark")
from pyspark import SparkConf, SparkContext
sc = SparkContext()
sc

PySpark is ready to use.
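
The first `sys.path.append` line in the cell hard-codes the py4j version (0.10.9), so it would stop matching if the Spark installation were updated. A sketch of a more tolerant lookup follows; it assumes the same directory layout as the path in the cell, and `find_py4j_zip` is a hypothetical helper, not something provided by Spark or PSC.

```python
import glob
import os
import sys


def find_py4j_zip(spark_python_dir):
    """Return the py4j source archive under <spark_python_dir>/lib."""
    pattern = os.path.join(spark_python_dir, "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no py4j archive matches {pattern}")
    # If several versions are present, take the last one in sorted order.
    return matches[-1]


# Path taken from the notebook cell; skipped on machines without it.
SPARK_PYTHON_DIR = "/opt/packages/spark/latest/python"
if os.path.isdir(SPARK_PYTHON_DIR):
    sys.path.append(find_py4j_zip(SPARK_PYTHON_DIR))
```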

print(sc.parallelize([1,2,3,4,5]).reduce(lambda x,y: x+y))

The result, 15, should be displayed after the cell runs.
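
To see what the reduce call computes without a cluster, the same fold can be sketched in plain Python with `functools.reduce`. This is only an analogy: Spark reduces each partition separately and then combines the partial results, which is why the function passed to `reduce` should be associative and commutative. The two-way split below is an arbitrary stand-in for partitions, chosen for illustration.

```python
from functools import reduce


def add(x, y):
    return x + y


numbers = [1, 2, 3, 4, 5]

# Single pass over the whole list, like a one-partition reduce.
total = reduce(add, numbers)

# Emulate two partitions: reduce each part, then combine the partials.
left, right = numbers[:2], numbers[2:]
partials = [reduce(add, left), reduce(add, right)]
combined = reduce(add, partials)

print(total, combined)
```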

sc.textFile('file:///ocean/projects/cis220071p/shared/test.txt').take(10)

The first 10 lines of an eBook by William Shakespeare will be displayed.
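
For comparison, `take(10)` on a text file returns the first ten lines as a list of strings. A local, non-Spark sketch of the same idea is shown below; `first_lines` is a hypothetical helper, and it takes an ordinary filesystem path rather than the `file://` URI that Spark expects.

```python
from itertools import islice


def first_lines(path, n=10):
    """Return the first n lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        # islice stops after n lines, so the rest of the file is not read.
        return [line.rstrip("\n") for line in islice(f, n)]
```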

8. Save and Checkpoint


