@satyajeetmaharana
Last active February 13, 2020 01:57
Accessing Spark on the NYU HPC Dumbo

How to run Spark on NYU Dumbo Cluster

NYU’s Hadoop Cluster, Dumbo

NYU HPC operates a Hadoop cluster, Dumbo, that students registered for the course can use for homework and projects at no charge.

Support: The NYU HPC IT team provides support for Dumbo. You can reach them at hpc@nyu.edu for assistance with the cluster, or use our class Forum on NYU Classes to get help.

Getting an account: To get an account, follow these instructions (you can select your course professor as sponsor): https://wikis.nyu.edu/display/NYUHPC/Getting+or+renewing+an+HPC+account

Logging in: Once you have an account, instructions for logging in are here: https://wikis.nyu.edu/display/NYUHPC/Clusters+-+Dumbo#Clusters-Dumbo-LOGGING_INLoggingIn

More information: You can read about Dumbo here: https://wikis.nyu.edu/display/NYUHPC/Clusters+-+Dumbo

Dumbo - Logging In, Testing HDFS

To try Dumbo, follow the steps below, which I have used to log in and test HDFS. Use the Forum if you encounter any difficulties.

  • Execute these two steps to log into Dumbo. Remember to replace 'yourNetID' with your own NetID.

    1. ssh yourNetID@gw.hpc.nyu.edu (You can skip this step if you are logged into the VPN - vpn.nyu.edu)

    2. ssh -Y yourNetID@dumbo.es.its.nyu.edu

  • Use an editor, such as vi, to create a text file in the local (non-HDFS) file system:

    vi myTestData.txt
    
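  • Next, put your data file into HDFS:

    # Explore the HDFS root, the user directories, and your own home directory
    hdfs dfs -ls /
    hdfs dfs -ls /user
    hdfs dfs -ls /user/yourNetID
    # Create a directory, upload the local file, and print it back from HDFS
    hdfs dfs -mkdir /user/yourNetID/class1
    hdfs dfs -put myTestData.txt /user/yourNetID/class1
    hdfs dfs -cat /user/yourNetID/class1/myTestData.txt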
    

If the above steps worked, your Dumbo Hadoop account is ready to use.
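The manual steps above can also be scripted so that the check is repeatable. The following is only a minimal sketch under a few assumptions: the `NETID` and `HDFS_DIR` variables and the sample file contents are placeholders introduced here for illustration, and the `-p` and `-f` flags (which the steps above do not use) are added so that the script can be re-run without failing on an existing directory or file.

```shell
#!/bin/sh
# Sketch: create a small test file and round-trip it through HDFS.
# NETID is a placeholder (assumption); replace it with your own NetID.
NETID="yourNetID"
HDFS_DIR="/user/$NETID/class1"

# Create the local (non-HDFS) test file without opening an editor.
printf 'hello dumbo\nsecond line\n' > myTestData.txt

# Run the HDFS steps only where the hdfs client is installed (e.g. on Dumbo).
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -mkdir -p "$HDFS_DIR"                # -p: no error if the directory exists
    hdfs dfs -put -f myTestData.txt "$HDFS_DIR"   # -f: overwrite an earlier copy
    hdfs dfs -cat "$HDFS_DIR/myTestData.txt"      # prints the file back from HDFS
else
    echo "hdfs not found; created myTestData.txt locally only"
fi
```

If the final `-cat` prints the same two lines that were written locally, the upload worked.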

Reference: http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/FileSystemShell.html

Using Spark REPL

  • In the terminal window that is already open, type the following command to start the Spark shell. You shouldn't see any errors (warnings can be ignored):

    $ spark-shell    # Start the Scala version of the Spark REPL

    After some output from the shell, you should see a scala> prompt.

  • Some commands you can try:

    // In the Spark shell, try the help command
    scala> :help
    // View the members available on the Spark context (sc)
    scala> sc.[TAB]
    // View the version of Spark that is running in the shell
    scala> sc.version
    scala> val myConstant: Int = 2016
    scala> myConstant
    scala> my[TAB]
    scala> myConstant.[TAB]
    scala> myConstant.to[TAB]
    scala> myConstant.toFloat
    // Note that myConstant has not changed; it's still an Int
    scala> myConstant
    scala> myConstant.toFloat.toInt
    // Note the type inferred for myString
    scala> val myString = myConstant
    // Use the :type command to view the type that is inferred for myString2
    scala> :type val myString2 = myConstant
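
Once the REPL works interactively, the same kind of commands can be run from a file. The sketch below is an illustration, not part of the original steps: it writes a small Scala script and passes it to `spark-shell` with the REPL's `-i` option. The file name `readTestData.scala` is hypothetical, and the script assumes that the test file from the HDFS section exists at `/user/yourNetID/class1/myTestData.txt` and that paths without a scheme resolve to HDFS on the cluster.

```shell
#!/bin/sh
# Sketch: run a few Spark commands non-interactively.
# readTestData.scala is a hypothetical file name chosen for this example.

# Write the Scala commands to a file; the quoted 'EOF' stops the shell
# from expanding ${...} inside the Scala string interpolation.
cat > readTestData.scala <<'EOF'
// Assumes the file uploaded in the HDFS section; replace yourNetID.
val lines = sc.textFile("/user/yourNetID/class1/myTestData.txt")
println(s"Spark version: ${sc.version}")
println(s"Line count: ${lines.count()}")
System.exit(0)  // leave the REPL once the script has run
EOF

# Start the shell only where Spark is installed (e.g. on Dumbo).
if command -v spark-shell >/dev/null 2>&1; then
    spark-shell -i readTestData.scala
else
    echo "spark-shell not found; wrote readTestData.scala only"
fi
```

The `System.exit(0)` line matters because `-i` otherwise leaves you at the scala> prompt after the file has been loaded.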
    