Last active: May 17, 2018 23:06
Jupyter pyspark setup on secured cluster
The following are steps for running Jupyter on a Hadoop cluster and connecting to it from a local browser.
Assuming you have a secured Spark cluster created on Linux.
Assuming you have Anaconda installed.
## Steps for setting up Python Virtual Environments
Add conda to PATH (add the following to your ~/.bashrc file):
PATH=$PATH:/opt/anaconda/latest/bin/
export PATH
1. Set up/create the virtual environment (see steps below).
Make sure to install ipykernel, which provides the IPython kernel for Jupyter:
conda create -n pysam36 python=3.6.2 ipykernel
source activate pysam36
Exit the terminal.
2. Create an SSH tunnel.
Open PuTTY and use port 22 for your cluster host.
Under Connection > SSH > Tunnels:
- Add source 127.0.0.1:8888 and destination 127.0.0.1:8888 (the destination port should match the port where the Jupyter notebook is running).
Log in with your credentials and initialize Kerberos:
kinit -k <appID> -t /opt/cst/spnego/<appID>.keytab
cd to /data/appId/.conda
Then export these variables (the driver must be jupyter so that pyspark launches a notebook server):
export PYSPARK_PYTHON=/opt/anaconda/latest/bin/python
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=127.0.0.1'
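The exports above can also be sketched in Python, which makes their effect explicit. This is only an illustration: `configure_pyspark_env` is a hypothetical helper, not part of this setup, and the Anaconda path is the one assumed throughout this guide; the variables only affect processes launched from this Python session.

```python
import os

# Hypothetical helper mirroring the shell exports above; the Anaconda
# location is the one used in this guide and may differ on your cluster.
def configure_pyspark_env(anaconda_bin="/opt/anaconda/latest/bin"):
    # Python used by Spark executors
    os.environ["PYSPARK_PYTHON"] = anaconda_bin + "/python"
    # Make the driver launch Jupyter, bound to localhost so only the
    # SSH tunnel can reach the notebook server.
    os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook --ip=127.0.0.1"

configure_pyspark_env()
```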
Activate the virtual environment you created earlier:
source activate pysam36
Start pyspark:
pyspark
If it started correctly, it should say:
"The Jupyter Notebook is running at: http://127.0.0.1:8888"
Open a browser at http://127.0.0.1:8888 - you should see Jupyter with the virtual environment you created.
If all packages installed correctly, you should be able to type
sc or sqlContext and get a result in the notebook.
---------- Connecting to Impala using impyla --------------------
pip install thrift_sasl
Install impyla:
pip install impyla (check the versions; some versions have issues - I think 0.1.12 is OK?)
from impala.dbapi import connect
import thrift_sasl
conn = connect(host='sfeidcimpala-dev.host', port=21050, timeout=300, use_ssl=True, ca_cert='/opt/cst/ssl/user/server-cacerts.pem', kerberos_service_name='impala', auth_mechanism='GSSAPI')
cursor = conn.cursor()
cursor.execute('show databases')
print(cursor.fetchall())
cursor.execute('show tables')
print(cursor.fetchall())
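impyla's connect/cursor calls follow the Python DB-API 2.0 pattern (connect, cursor, execute, fetch), so the same shape can be tried without a cluster. The sketch below uses the standard library's sqlite3 purely as a stand-in to show the fetch pattern; the table and values are made up for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in: impyla's conn/cursor objects expose the
# same DB-API 2.0 methods used here (execute, fetchall, close).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE demo (name TEXT)")
cursor.execute("INSERT INTO demo VALUES ('default')")
cursor.execute("SELECT name FROM demo")
rows = cursor.fetchall()  # fetchall() returns a list of row tuples
print(rows)
conn.close()
```

Against Impala, `cursor.execute('show databases')` followed by `cursor.fetchall()` returns rows the same way.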
- Enjoy
===================================================================
A. Create the virtual environment
mkdir .conda
cd .conda
mkdir pkgs
mkdir envs
Create a .condarc file with the following contents (provide your conda package repo location):
envs_dirs:
  - /data/appid/.conda/envs
pkgs_dirs:
  - /data/appid/.conda/pkgs
channels:
  - http://host.conda.repo:8080/conda/anaconda
Create a pip config for installing additional packages:
mkdir ~/.pip
touch ~/.pip/pip.conf
[global]
index=https://nexus.host.repo/repository/pypi.python.org/pypi
index-url=https://nexus.host.repo/repository/pypi.python.org/simple
trusted-host=nexus.host
B. Conda commands for creating a VE (virtual environment)
conda create -n VEName --copy -y -q python=X.X.X (package version) package-name
Example: conda create -n pyve36 python=3.6.2 anaconda
conda install -n VEName --copy -y -q package-name
source activate VEName
You can also install packages using pip:
pip install impyla==0.13.1