# Jupyter PySpark setup on a secured cluster
The following are the steps for running Jupyter on a Hadoop cluster and connecting to it from a local browser. This assumes you have a secured (Kerberized) Spark cluster running on Linux, with Anaconda installed on it.
## Steps for setting up Python Virtual Environments
Add conda to your PATH (add the following to your ~/.bashrc file):
PATH=$PATH:/opt/anaconda/latest/bin/
export PATH
1. Set up a virtual environment (see section A below). Make sure to install ipykernel, which provides the IPython kernel for Jupyter:
conda create -n pysam36 python=3.6.2 ipykernel
source activate pysam36
You can optionally verify the setup from Python (see the sketch below), then exit the terminal.
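A minimal verification sketch (run inside the activated environment), assuming ipykernel and its jupyter_client dependency installed cleanly:

```python
# Sanity check: confirm we're on the env's interpreter and that the
# IPython kernel machinery is importable.
import sys
print(sys.executable)  # should point inside the pysam36 env

from jupyter_client.kernelspec import KernelSpecManager
print(KernelSpecManager().find_kernel_specs())  # kernel name -> spec directory
```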
2. Create an SSH tunnel
Open Putty and use port 22 for your cluster host.
Under Connection > SSH > Tunnels:
- Add source 127.0.0.1:8888 and destination 127.0.0.1:8888 (the destination port must match the port where the Jupyter notebook runs)
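If you'd rather script the tunnel than configure Putty, the third-party sshtunnel package offers an equivalent from Python. A sketch under assumptions: sshtunnel is pip-installed, and the hostname and credentials below are placeholders for your own:

```python
# Hypothetical Putty-free alternative: forward local port 8888 to the
# cluster's port 8888 over SSH, using the third-party sshtunnel package.
from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    ('your.cluster.host', 22),               # placeholder hostname
    ssh_username='appID',                    # placeholder credentials
    ssh_password='...',
    remote_bind_address=('127.0.0.1', 8888), # where Jupyter listens on the cluster
    local_bind_address=('127.0.0.1', 8888),  # must match the port you browse to
)
tunnel.start()  # forwards until tunnel.stop() is called
```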
Log in with your credentials and initialize Kerberos:
kinit -k <appID> -t /opt/cst/spnego/<appID>.keytab
Then cd to /data/appId/.conda.
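Before going further, you can confirm the ticket was acquired by shelling out to klist (part of the Kerberos client tools). A minimal standard-library sketch:

```python
# Confirm the Kerberos ticket cache is populated after kinit.
import subprocess
result = subprocess.run(['klist'], stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)  # should show a krbtgt/<REALM> ticket
```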
Then export the environment variables (PYSPARK_DRIVER_PYTHON is set to jupyter so that pyspark launches the notebook server as its driver):
export PYSPARK_PYTHON=/opt/anaconda/latest/bin/python
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=127.0.0.1'
Activate the virtual environment created earlier:
source activate pysam36
Then start pyspark:
pyspark
If it started correctly, you should see:
"The Jupyter Notebook is running at: http://127.0.0.1:8888"
Open a browser at http://127.0.0.1:8888 and you should see Jupyter with the virtual environment you created.
If all packages installed correctly, you should be able to type sc or sqlContext in a notebook cell and get a result.
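A minimal sanity check for a notebook cell, assuming the session exposes the usual sc and sqlContext handles (pyspark creates these for Spark 1.6/2.x sessions):

```python
# Run a trivial job through the SparkContext the session created.
print(sc.version)                        # Spark version string
print(sc.parallelize(range(100)).sum())  # expect 4950

# sqlContext works the same way: list the databases visible to the session
sqlContext.sql('show databases').show()
```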
## Connecting to Impala using impyla
Install the client libraries:
pip install thrift_sasl
pip install impyla
(Check the impyla version; some releases have issues. 0.13.1, pinned in section B below, is known to work.)
from impala.dbapi import connect
import thrift_sasl
conn = connect(host='sfeidcimpala-dev.host', port=21050, timeout=300, use_ssl=True,
               ca_cert='/opt/cst/ssl/user/server-cacerts.pem',
               kerberos_service_name='impala', auth_mechanism='GSSAPI')
cursor = conn.cursor()
print(cursor)
cursor.execute('show databases')
cursor.execute('show tables')
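The execute calls above don't display anything on their own; results come back through the standard DB-API cursor. A sketch using fetchall and impyla's pandas helper (assuming pandas is installed in the environment):

```python
# Pull result sets back through the standard DB-API cursor interface.
cursor.execute('show databases')
for row in cursor.fetchall():
    print(row)

# impyla also ships a helper that loads a result set into a pandas DataFrame
from impala.util import as_pandas
cursor.execute('show tables')
df = as_pandas(cursor)
print(df.head())
```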
- Enjoy
## A. Create Virtual Environment
mkdir .conda
cd .conda
mkdir pkgs
mkdir envs
Create a .condarc file with the following contents (pointing at your conda package repo):
envs_dirs:
- /data/appid/.conda/envs
pkgs_dirs:
- /data/appid/.conda/pkgs
channels:
- http://host.conda.repo:8080/conda/anaconda
Create a pip config for installing additional packages from your internal mirror:
mkdir ~/.pip
touch ~/.pip/pip.conf
Add the following to ~/.pip/pip.conf:
[global]
index=https://nexus.host.repo/repository/pypi.python.org/pypi
index-url=https://nexus.host.repo/repository/pypi.python.org/simple
trusted-host=nexus.host
## B. Conda commands for creating a virtual environment (VE)
conda create -n VEName --copy -y -q python=X.X.X package-name   (X.X.X is the Python version)
Example: conda create -n pyve36 python=3.6.2 anaconda
conda install -n VEName --copy -y -q package-name
source activate VEName
You can also install packages using pip:
pip install impyla==0.13.1
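To confirm the pip installs resolve inside the activated environment, a quick import check:

```python
# Confirm pip-installed packages resolve from the activated environment.
import impala.dbapi  # from the impyla package; ImportError means the install failed
import thrift_sasl
print('impyla and thrift_sasl import OK')
```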