During our Big Data lab sessions, we use PuTTY and WinSCP to connect to the Hadoop cluster. This approach is not available to students with a Linux/macOS device, as these programs are Windows-only.
During live sessions it is easy to work around this using Guacamole; but if you want to use Spark outside of lab hours, a different solution is required.
As I'm following the lessons with a macOS device, I was looking for a quick solution that doesn't require installing extra software. I found a way using `ssh` and `scp` from the terminal (tested with bash and zsh), which are pre-installed on most macOS and Linux systems.
You can directly connect to the cluster with this one-liner:
ssh -t 'riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it "ssh rmaldini@isi-vclust4.csr.unibo.it"
Use your Unibo and cluster credentials, and enter the respective passwords when asked. From there you can use, for example, `spark2-shell` and `spark2-submit`.
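If you connect often, the two hops can be collapsed with an SSH config file instead of the nested command. This is a sketch using OpenSSH's `ProxyJump` option (available since OpenSSH 7.3); the host aliases `unibo` and `vclust4` are my own names, and you should substitute your own usernames:

```
# ~/.ssh/config — host aliases are arbitrary; adjust users to your own
Host unibo
    HostName isi-alfa.csr.unibo.it
    User riccardo.maldini2@studio.unibo.it

Host vclust4
    HostName isi-vclust4.csr.unibo.it
    User rmaldini
    ProxyJump unibo
```

With this in place, `ssh vclust4` should open the nested connection in one step (you will still be asked for both passwords), and `scp myfile.jar vclust4:.` should copy a file straight to the cluster node.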
To use the Hue web interface and Cloudera Manager in your browser, open another connection with the following command, then configure your browser's proxy settings to use a SOCKS proxy with host name 127.0.0.1 and port 8080:
$ ssh -D 8080 'riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it
A more detailed explanation of what is happening here is given below.
You can do this in two steps. First, connect to the Unibo servers with your credentials via `ssh`. I use this command from my terminal; replace my address with your institutional address:
$ ssh 'riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it
Sometimes the server refuses the connection, so you may have to retry the command. If the connection is successful, the server asks for your password (enter your institutional credentials)... and you are in! You should see a Debian bash shell, pointing to your personal Unibo server space, with a prompt prefix like `STUDENTI\username:~$`.
From here you can reach the cluster used for Hadoop operations by opening another SSH connection from the inner bash shell. Try this command to connect to the cluster, using your assigned node credentials:
STUDENTI\username:~$ ssh rmaldini@isi-vclust4.csr.unibo.it
Aaaand you're done again! If you see a bash prompt like `[rmaldini@isi-vclust4 ~]$`, you're in. Here you can use the `spark2-shell` and `spark2-submit` commands. Use `:paste` mode to paste your code directly into the Spark shell.
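As an illustration, `:paste` mode lets you drop a multi-line snippet into the shell in one go, then press Ctrl-D to evaluate it. A hypothetical session (the snippet itself is just an example, not from the lab material) looks roughly like this:

```
scala> :paste
// Entering paste mode (ctrl-D to finish)

val numbers = sc.parallelize(1 to 100)
println(numbers.sum())

// Exiting paste mode, now interpreting.
```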
The one-liner reported in the first section just executes the two steps together.
If you want to use `spark2-submit` with a jar built on your laptop, the process is trickier. At least, the solution I found is. I'm sure there are better and more direct ways, so help me improve this gist if you find one.
In my solution, I use `scp` (secure copy) from my laptop's terminal:
$ scp ~/myfile.jar 'riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it:.
This copies the file to your personal Unibo server space. To use it with Spark, though, it needs to be on the cluster node. So copy the file again, this time from the remote server's terminal (not the cluster's):
STUDENTI\user:~$ scp myfile.jar rmaldini@isi-vclust4.csr.unibo.it:.
Now your file should be ready to be used on the cluster node with `spark2-submit`.
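As an alternative to the two-step copy, a reasonably recent OpenSSH lets `scp` route through the Unibo server in a single command via the `ProxyJump` option. I haven't verified this against the Unibo setup, so treat it as a sketch:

```
$ scp -o ProxyJump='riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it \
    ~/myfile.jar rmaldini@isi-vclust4.csr.unibo.it:.
```

You should be prompted for both passwords in sequence, and the jar lands directly in the cluster node's home directory.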
Log in to the remote machine using:
$ ssh -D 8080 'riccardo.maldini2@studio.unibo.it'@isi-alfa.csr.unibo.it
Leave the terminal open; now go to your browser's proxy settings and configure it to use a SOCKS proxy with host name 127.0.0.1 and port 8080. From now on, all pages you load in your browser will be tunnelled through the SSH connection. You should be able to access the private web pages just as you would from the remote host.
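Before touching the browser, you can check that the tunnel works from another terminal with curl, which supports SOCKS proxies out of the box (`--socks5-hostname` resolves the host name on the remote side, which is what we want here):

```
$ curl -I --socks5-hostname 127.0.0.1:8080 http://isi-vclust0.csr.unibo.it:8889
```

If the tunnel is up, this should print the HTTP response headers of the Hue login page.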
This means you can access:
- the Hue web interface, at http://isi-vclust0.csr.unibo.it:8889;
- Cloudera Manager, at http://137.204.72.233:7180;
- the Spark manager, at http://isi-vclust0.csr.unibo.it:18089.