leeklee0427/pyspark_setup.md

## pyspark_setup.md

      
            
          
    95-869 Big Data and Large Scale Computing

PySpark Setup

Links


ACCESS Operations: https://access-ci.org/
Pittsburgh Supercomputing Center: https://www.psc.edu/

Bridges 2 Ocean Storage: https://www.psc.edu/resources/bridges-2
Bridges 2 Regular Memory: https://www.psc.edu/resources/bridges-2


ACCESS Registration


Create user account: https://identity.access-ci.org/new-user
Select "Register without an existing identity"
Click "Begin"
Enter user information
Select organization - CMU
Verify email
Use ACCESS ID to login
Select an Identity Provider - ACCESS CI (XSEDE)
Set up DUO multi-factor authentication
Upon receiving the confirmation email, reset PSC cluster account password at https://apr.psc.edu/autopwdreset/autopwdreset.html

ACCESS Account Information


ACCESS ID/username: dli8
PSC username: dlid


Running PySpark in iPython Notebook on PSC Cluster Jupyter Server

Note: Repeat these steps every time you start a session for a homework.
1. Connect to the Bridges-2 server

$ ssh dlid@bridges2.psc.edu
2. Launch an interactive compute session (here: 1 node, 8 cores, 1 hour)

$ interact -N 1 -n 8 -t 1:00:00
Sample message: salloc: Nodes r001 are ready for job
3(a). Configure the Python environment and start the Jupyter notebook server with the provided bash script

$ bash /ocean/projects/cis220071p/shared/config.sh 8888
Note: The argument is the port the Jupyter server listens on. 8888 is the default; choose another number to avoid conflicts with other students (the sample below uses 6666).
Sample message:
[dlid@r001 ~]$ bash /ocean/projects/cis220071p/shared/config.sh 6666
step1 (ignored)
step2 (load packages)
step3 (print remote forward command line)
ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
step4 (open notebook)
3(b). Configure Python environment and start Jupyter notebook step by step


Load the required PSC module for PySpark use
$ module load anaconda3


Get hostname:
$ hostname
Sample message: r001.ib.bridges2.psc.edu


Run Jupyter notebook server:
$ jupyter notebook --no-browser --ip 0.0.0.0 --port <port-number>
For example, with port 6666:
$ jupyter notebook --no-browser --ip 0.0.0.0 --port 6666
Sample message: http://0.0.0.0:<port-number>/?token=<longstring>


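4. Open a new terminal on your local machine and create an SSH tunnel

If you used 3(a), run the command printed under step3 of the script output:
$ ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
If you used 3(b), substitute the hostname and port number from that step:
$ ssh -L 8888:<hostname>:<port-number> dlid@bridges2.psc.edu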
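The tunnel command always has the same shape: local port, compute-node hostname, remote Jupyter port, then the login node. A minimal sketch of assembling it in Python follows; the function name `build_tunnel_command` and its defaults are illustrative and not part of any PSC tooling.

```python
import shlex


def build_tunnel_command(user, compute_host, remote_port, local_port=8888,
                         login_host="bridges2.psc.edu"):
    """Return the ssh command that forwards local_port to the notebook server.

    Hypothetical helper: the name and defaults are assumptions for illustration.
    """
    for port in (local_port, remote_port):
        if not 1 <= int(port) <= 65535:
            raise ValueError(f"port out of range: {port}")
    # -L local_port:host:remote_port forwards a local port through the login node
    forward = f"{local_port}:{compute_host}:{remote_port}"
    return shlex.join(["ssh", "-L", forward, f"{user}@{login_host}"])


print(build_tunnel_command("dlid", "r001.ib.bridges2.psc.edu", 6666))
# ssh -L 8888:r001.ib.bridges2.psc.edu:6666 dlid@bridges2.psc.edu
```

Paste the printed command into the local terminal; the hostname comes from `hostname` on the compute node and the remote port from the Jupyter startup command.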
5. Visit Jupyter server in local browser

URL: http://127.0.0.1:8888/
If you are prompted for a token, use the <longstring> from the Jupyter server's startup message.
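The URL printed on the compute node points at 0.0.0.0 and the remote port, so it has to be mapped to the tunneled local address. As a rough sketch, assuming the startup message has the form shown earlier, the local URL with the token attached can be derived like this (`local_notebook_url` is a hypothetical helper and `abc123` is a placeholder token):

```python
from urllib.parse import parse_qs, urlsplit


def local_notebook_url(server_url, local_port=8888):
    """Map the URL printed on the compute node to the tunneled local URL."""
    parts = urlsplit(server_url)
    # Keep the token, if the server printed one, so no login prompt appears
    token = parse_qs(parts.query).get("token", [""])[0]
    url = f"http://127.0.0.1:{local_port}{parts.path or '/'}"
    return f"{url}?token={token}" if token else url


print(local_notebook_url("http://0.0.0.0:6666/?token=abc123"))
# http://127.0.0.1:8888/?token=abc123
```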
6. Open a Python 3 Jupyter notebook in the browser by choosing New > Python 3 Notebook.

7. Test

Insert a new cell at the top of the notebook with the following code:
import sys
sys.path.append("/opt/packages/spark/latest/python/lib/py4j-0.10.9-src.zip")
sys.path.append("/opt/packages/spark/latest/python/")
sys.path.append("/opt/packages/spark/latest/python/pyspark")
from pyspark import SparkConf, SparkContext
sc = SparkContext()
sc
PySpark is ready to use.
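The cell above hard-codes the py4j version (0.10.9), which would stop matching if the cluster's Spark installation is updated. One possible workaround, sketched here under the assumption that the directory layout stays the same, is to locate the zip by pattern; `add_spark_to_path` is a hypothetical helper, not something provided by Spark or PSC.

```python
import glob
import os
import sys


def add_spark_to_path(spark_python="/opt/packages/spark/latest/python"):
    """Append Spark's Python directories and the bundled py4j zip to sys.path."""
    pattern = os.path.join(spark_python, "lib", "py4j-*-src.zip")
    zips = sorted(glob.glob(pattern))
    if not zips:
        raise FileNotFoundError(f"no py4j zip matches {pattern}")
    # If several zips match, take the lexicographically last one
    entries = [zips[-1], spark_python, os.path.join(spark_python, "pyspark")]
    for entry in entries:
        if entry not in sys.path:
            sys.path.append(entry)
    return entries
```

Calling `add_spark_to_path()` in the first cell, before `from pyspark import SparkConf, SparkContext`, would then replace the three `sys.path.append` lines.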
print(sc.parallelize([1,2,3,4,5]).reduce(lambda x,y: x+y))
The answer (15) should be displayed after execution.
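For comparison, the same fold can be run locally without Spark. This sketch uses Python's built-in `functools.reduce` on a plain list; Spark's `reduce` applies the same two-argument function across partitions, which is why that function should be commutative and associative.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]
# Combine the elements pairwise with the same lambda used in the Spark test
total = reduce(lambda x, y: x + y, numbers)
print(total)  # 15
```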
sc.textFile('file:///ocean/projects/cis220071p/shared/test.txt').take(10)
The first 10 lines of an eBook by William Shakespeare will be printed.
8. Save and Checkpoint


Resources


Spark documentation - https://spark.apache.org/documentation.html
Spark Programming Guide - https://spark.apache.org/docs/latest/programming-guide.html
PySpark API documentation - https://spark.apache.org/docs/latest/api/python/index.html
Python documentation - https://docs.python.org/3/