- ACCESS Operations: https://access-ci.org/
- Pittsburgh Supercomputing Center: https://www.psc.edu/
- Bridges 2 Ocean Storage: https://www.psc.edu/resources/bridges-2
- Bridges 2 Regular Memory: https://www.psc.edu/resources/bridges-2
- Create user account: https://identity.access-ci.org/new-user
- Select "Register without an existing identity"
- Click "Begin"
- Enter user information
- Select organization - CMU
- Verify email
- Use ACCESS ID to login
- Select an Identity Provider - ACCESS CI (XSEDE)
- Set up DUO multi-factor authentication
- Upon receiving the confirmation email, reset PSC cluster account password at https://apr.psc.edu/autopwdreset/autopwdreset.html
- ACCESS ID/username: dli8
- PSC username: dlid
Note: The following steps must be repeated for every homework.
$ ssh dlid@bridges2.psc.edu
$ interact -N 1 -n 8 -t 1:00:00
This requests an interactive job with 1 node (-N), 8 cores (-n), and a 1-hour time limit (-t).
Sample message: salloc: Nodes r001 are ready for job
$ bash /ocean/projects/cis220071p/shared/config.sh 8888
Note: 8888 is the default port number; change it to any other number to avoid conflicts with other students. If you want to be sure a port is free, see the sketch below.
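One way to pick a free port (an illustrative Python sketch, run on the compute node; any unused port above 1024 works just as well):
import socket
s = socket.socket()
s.bind(("", 0))            # port 0 = let the OS pick a free port
print(s.getsockname()[1])  # e.g. 45731
s.close()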
Sample message:
[dlid@r001 ~]$ bash /ocean/projects/cis220071p/shared/config.sh 6666
step1 (ignored)
step2 (load packages)
step3 (print remote forward command line)
ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
step4 (open notebook)
- Load the required PSC module for PySpark use:
$ module load anaconda3
- Get the hostname:
$ hostname
Sample message:
r001.ib.bridges2.psc.edu
- Run the Jupyter notebook server:
$ jupyter notebook --no-browser --ip 0.0.0.0 --port <port-number>
For example:
$ jupyter notebook --no-browser --ip 0.0.0.0 --port 6666
Sample message:
http://0.0.0.0:<port-number>/?token=<longstring>
Use the message from 3(a):
$ ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
In general, use the message from 3(b):
$ ssh -L 8888:<hostname>:<port-number> dlid@bridges2.psc.edu
URL: http://127.0.0.1:8888/
If a token is requested, use the <longstring> mentioned above.
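To confirm the tunnel is working before opening the browser, a quick check from your local machine (an illustrative Python sketch; assumes the local port 8888 used above):
import urllib.request
try:
    with urllib.request.urlopen("http://127.0.0.1:8888/", timeout=5) as resp:
        print("Notebook reachable, HTTP status", resp.status)
except OSError as exc:
    print("Tunnel not ready:", exc)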
Insert a new cell at the top of the notebook with the following code:
import sys
# Make the cluster's Spark installation visible to the notebook kernel.
sys.path.append("/opt/packages/spark/latest/python/lib/py4j-0.10.9-src.zip")
sys.path.append("/opt/packages/spark/latest/python/")
sys.path.append("/opt/packages/spark/latest/python/pyspark")
from pyspark import SparkConf, SparkContext
sc = SparkContext()  # start the Spark context
sc  # displaying it confirms the context is live
PySpark is ready to use.
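Note that only one SparkContext may be active per notebook kernel; if re-running the cell above fails with "ValueError: Cannot run multiple SparkContexts at once", stop the old context first (a minimal sketch; the app name is an arbitrary example):
from pyspark import SparkConf, SparkContext  # already imported above
sc.stop()  # release the existing context
sc = SparkContext(conf=SparkConf().setAppName("hw-setup-check"))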
To verify, run:
print(sc.parallelize([1,2,3,4,5]).reduce(lambda x,y: x+y))
The answer (15) should be displayed after execution.
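A slightly longer sanity check (an illustrative sketch, not part of the required setup) chains a map and a filter before collecting:
print(sc.parallelize(range(1, 6))
        .map(lambda x: x * x)          # squares: [1, 4, 9, 16, 25]
        .filter(lambda x: x % 2 == 0)  # even squares only
        .collect())
This should print [4, 16].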
sc.textFile('file:///ocean/projects/cis220071p/shared/test.txt').take(10)
The first 10 lines of an eBook by William Shakespeare will be printed.
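As a further illustration (not part of the required setup), a minimal word count over the same file might look like this; the exact pairs printed depend on the file's contents:
counts = (sc.textFile("file:///ocean/projects/cis220071p/shared/test.txt")
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))     # sum counts per word
print(counts.take(5))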
- Spark documentation - https://spark.apache.org/documentation.html
- Spark Programming Guide - https://spark.apache.org/docs/latest/programming-guide.html
- PySpark API documentation - https://spark.apache.org/docs/latest/api/python/index.html
- Python documentation - https://docs.python.org/2/