# Jupyter PySpark setup on a secured cluster
The following are the steps for running Jupyter on a Hadoop cluster and connecting to it from a local browser. This assumes you have a secured (Kerberized) Spark cluster running on Linux, with Anaconda installed on it.
## Steps for setting up Python Virtual Environments
Add conda to your PATH (add the following to your ~/.bashrc file):
PATH=$PATH:/opt/anaconda/latest/bin/
export PATH
1. Set up a virtual environment (see section A below). Make sure to install ipykernel, which provides the IPython kernel for Jupyter:
conda create -n pysam36 python=3.6.2 ipykernel
source activate pysam36
You can optionally verify the setup from Python (see the sketch below), then exit the terminal.
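A minimal verification sketch (run inside the activated environment), assuming ipykernel and its jupyter_client dependency installed cleanly:

```python
# Sanity check: confirm we're on the env's interpreter and that the
# IPython kernel machinery is importable.
import sys
print(sys.executable)  # should point inside the pysam36 env

from jupyter_client.kernelspec import KernelSpecManager
print(KernelSpecManager().find_kernel_specs())  # kernel name -> spec directory
```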
2. Create an SSH tunnel
Open Putty and use port 22 for your cluster host.
Under Connection > SSH > Tunnels:
- Add source 127.0.0.1:8888 and destination 127.0.0.1:8888 (the destination port must match the port where the Jupyter notebook runs)
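If you'd rather script the tunnel than configure Putty, the third-party sshtunnel package offers an equivalent from Python. A sketch under assumptions: sshtunnel is pip-installed, and the hostname and credentials below are placeholders for your own:

```python
# Hypothetical Putty-free alternative: forward local port 8888 to the
# cluster's port 8888 over SSH, using the third-party sshtunnel package.
from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    ('your.cluster.host', 22),               # placeholder hostname
    ssh_username='appID',                    # placeholder credentials
    ssh_password='...',
    remote_bind_address=('127.0.0.1', 8888), # where Jupyter listens on the cluster
    local_bind_address=('127.0.0.1', 8888),  # must match the port you browse to
)
tunnel.start()  # forwards until tunnel.stop() is called
```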
Log in with your credentials and initialize Kerberos:
kinit -k <appID> -t /opt/cst/spnego/<appID>.keytab
Then cd to /data/appId/.conda.
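Before going further, you can confirm the ticket was acquired by shelling out to klist (part of the Kerberos client tools). A minimal standard-library sketch:

```python
# Confirm the Kerberos ticket cache is populated after kinit.
import subprocess
result = subprocess.run(['klist'], stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)  # should show a krbtgt/<REALM> ticket
```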
Then export the environment variables (PYSPARK_DRIVER_PYTHON is set to jupyter so that pyspark launches the notebook server as its driver):
export PYSPARK_PYTHON=/opt/anaconda/latest/bin/python
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=127.0.0.1'
Activate the virtual environment created earlier:
source activate pysam36
Then start pyspark:
pyspark
If it started correctly, you should see:
"The Jupyter Notebook is running at: http://127.0.0.1:8888"
Open a browser at http://127.0.0.1:8888 and you should see Jupyter with the virtual environment you created.
If all packages installed correctly, you should be able to type sc or sqlContext in a notebook cell and get a result.
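A minimal sanity check for a notebook cell, assuming the session exposes the usual sc and sqlContext handles (pyspark creates these for Spark 1.6/2.x sessions):

```python
# Run a trivial job through the SparkContext the session created.
print(sc.version)                        # Spark version string
print(sc.parallelize(range(100)).sum())  # expect 4950

# sqlContext works the same way: list the databases visible to the session
sqlContext.sql('show databases').show()
```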
## Connecting to Impala using impyla
Install the client libraries:
pip install thrift_sasl
pip install impyla
(Check the impyla version; some releases have issues. 0.13.1, pinned in section B below, is known to work.)
from impala.dbapi import connect
import thrift_sasl
conn = connect(host='sfeidcimpala-dev.host', port=21050, timeout=300, use_ssl=True,
               ca_cert='/opt/cst/ssl/user/server-cacerts.pem',
               kerberos_service_name='impala', auth_mechanism='GSSAPI')
cursor = conn.cursor()
print(cursor)
cursor.execute('show databases')
cursor.execute('show tables')
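The execute calls above don't display anything on their own; results come back through the standard DB-API cursor. A sketch using fetchall and impyla's pandas helper (assuming pandas is installed in the environment):

```python
# Pull result sets back through the standard DB-API cursor interface.
cursor.execute('show databases')
for row in cursor.fetchall():
    print(row)

# impyla also ships a helper that loads a result set into a pandas DataFrame
from impala.util import as_pandas
cursor.execute('show tables')
df = as_pandas(cursor)
print(df.head())
```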
- Enjoy
## A. Create Virtual Environment
mkdir .conda
cd .conda
mkdir pkgs
mkdir envs
Create a .condarc file with the following contents (pointing at your conda package repo):
envs_dirs:
- /data/appid/.conda/envs
pkgs_dirs:
- /data/appid/.conda/pkgs
channels:
- http://host.conda.repo:8080/conda/anaconda
Create a pip config for installing additional packages from your internal mirror:
mkdir ~/.pip
touch ~/.pip/pip.conf
Add the following to ~/.pip/pip.conf:
[global]
index=https://nexus.host.repo/repository/pypi.python.org/pypi
index-url=https://nexus.host.repo/repository/pypi.python.org/simple
trusted-host=nexus.host
## B. Conda commands for creating a virtual environment (VE)
conda create -n VEName --copy -y -q python=X.X.X package-name   (X.X.X is the Python version)
Example: conda create -n pyve36 python=3.6.2 anaconda
conda install -n VEName --copy -y -q package-name
source activate VEName
You can also install packages using pip:
pip install impyla==0.13.1
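To confirm the pip installs resolve inside the activated environment, a quick import check:

```python
# Confirm pip-installed packages resolve from the activated environment.
import impala.dbapi  # from the impyla package; ImportError means the install failed
import thrift_sasl
print('impyla and thrift_sasl import OK')
```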