Created
November 6, 2019 07:56
-
-
Save drorata/f5967abc84f7252f2816b69d394a924b to your computer and use it in GitHub Desktop.
Very minimal example with Livy and sparkmagic
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Assume you have a Spark cluster running with [Apache Livy](https://livy.apache.org/) enabled.\n", | |
"Using the Python packages:\n", | |
"* [livy](https://pypi.org/project/livy/) and\n", | |
"* [sparkmagic](https://pypi.org/project/sparkmagic/)\n", | |
"you will be able to execute computation tasks on the remote cluster from your local notebook/machine.\n", | |
"\n", | |
"The greatest benefit is that you can keep track of your notebooks locally and enjoy, for example easy version controlling.\n", | |
"Sure, with the caveat that notebooks are [not ideal](https://nextjournal.com/schmudde/how-to-version-control-jupyter) for version controlling.\n", | |
"\n", | |
"## Using `Livy`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from livy import LivySession" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# 8998 is the default port of Livy (at least when enabled on an EMR cluster)\n", | |
"LIVY_URL = 'http://xxx.yyy.uuu.vvv:8998'" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"with LivySession(LIVY_URL) as session:\n", | |
" session.run('df = spark.read.json(\"s3://mybucket/some/prefix/foo.json.gz\")')\n", | |
" session.run('df.show()') " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This is nice, but it can get nicer!\n", | |
"At least when working from a notebook.\n", | |
"\n", | |
"## Using `sparkmagic`\n", | |
"\n", | |
"First, load the magics!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%load_ext sparkmagic.magics" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Now, you have to establish the connection and start a Spark session.\n", | |
"This is done using `%manage_spark`:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%manage_spark" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Once a session has started, you can use it in following cells.\n", | |
"Don't forget to start the cell with `%%spark`:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%spark\n", | |
"df = spark.read.json(\"s3://mybucket/some/prefix/foo.json.gz\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"%%spark\n", | |
"df.show(4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"%%spark\n", | |
"from pyspark.sql import functions as F\n", | |
"df.select(F.col(\"userId\")).distinct().cache().show(truncate=False)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python [conda env:test-sparkmagic]", | |
"language": "python", | |
"name": "conda-env-test-sparkmagic-py" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.5" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment