{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "## Apache Spark (MLlib): Extracting Text Features for Machine Learning\n\n\n<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">**Note:** Citation: The Notebooks of Leonardo Da Vinci — Complete by da Vinci Leonardo for a tutorial example. Datasource@ http://www.gutenberg.org/ebooks/5000/</div>\n\n\n\nCreated by John Ryan 1st May 2017\n\n\n__1. Overview__\n\n**Question 1: How do we build a Resilient Distributed Data Set from text**\n\n**Question 2: How do we build a Term frequency-inverse document frequency (TF-IDF) from text**\n\nThe data set provides many categorical and continous variables that allow for the opportunity to implement Machine Learning models to accurately predict future purchasing habits of customers.\n\nPyspark ML : https://spark.apache.org/docs/2.1.0"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Apache Spark: Building Relisient Distributed Datasets (RDDs) with text with pyspark**\n"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Getting Started**\n\nThis tutorial brings you trough the most import fundamentals of running spark programmes.\nWhen begining every pyspark session the following process is required:\n\n- Import: and create RDDs from external data sources.\n\n- Transform: develop new RDDs using transformations such as filter().\n\n- persist: possible RDDs that will need further use later need to be persisted a.\n\n- Actions: Initiate parellel computation executed by Spark using take(), count() etc.."
},
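{
"metadata": {},
"cell_type": "markdown",
"source": "Here is that workflow end to end, assuming a SparkContext `sc` is already available (it is pre-initialized in this environment); the file path is the one used later in this notebook and is illustrative here:"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#1. Import: create an RDD from an external text file (illustrative path)\nlines = sc.textFile(\"swift://newscala.keystone/DiVinciNotes.txt\")\n\n#2. Transform: derive a new RDD with filter()\nvinci = lines.filter(lambda line: \"Vinci\" in line)\n\n#3. Persist: keep the filtered RDD in memory for repeated use\nvinci.persist()\n\n#4. Action: count() triggers the actual parallel computation\nvinci.count()",
"execution_count": null,
"outputs": []
},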
{
"metadata": {},
"cell_type": "markdown",
"source": "**1. Import: and create RDDs**"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "from pyspark.sql import SparkSession\n\n# @hidden_cell\n# This function is used to setup the access of Spark to your Object Storage. The definition contains your credentials.\n# You might want to remove those credentials before you share your notebook.\ndef set_hadoop_config_with_credentials_xxxxxxxxxxxxxxxx(name):\n \"\"\"This function sets the Hadoop configuration so it is possible to\n access data from Bluemix Object Storage using Spark\"\"\"\n\n prefix = 'fs.swift.service.' + name\n hconf = sc._jsc.hadoopConfiguration()\n hconf.set(prefix + '.auth.url', 'https://identity.open.softlayer.com'+'/xx/auth/tokens')\n hconf.set(prefix + '.auth.endpoint.prefix', 'endpoints')\n hconf.set(prefix + '.tenant', 'xxxxxxxxxxxxxxxxxxx')\n hconf.set(prefix + '.username', 'xxxxxxxxxxxxxxxxxxx')\n hconf.set(prefix + '.password', 'xxxxxxxxx')\n hconf.setInt(prefix + '.http.port', xxxx)\n hconf.set(prefix + '.region', 'xxxx')\n hconf.setBoolean(prefix + '.public', False)\n\n# you can choose any name\nname = 'keystone'\nset_hadoop_config_with_credentials_xxxxxxxxxxxxxxxxxxx(name)\n\nspark = SparkSession.builder.getOrCreate()\n\n# Please read the documentation of PySpark to learn more about the possibilities to load data files.\n# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession\n# The SparkSession object is already initalized for you.\n# The following variable contains the path to your file on your Object Storage.\n\nrddData1 = sc.textFile(\"swift://newscala.\" + name + \"/DiVinciNotes.txt\")\nrddData1.take(10)\n",
"execution_count": 1,
"outputs": [
{
"execution_count": 1,
"output_type": "execute_result",
"data": {
"text/plain": "[u'The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete',\n u'by Leonardo Da Vinci',\n u'(#3 in our series by Leonardo Da Vinci)',\n u'',\n u'Copyright laws are changing all over the world. Be sure to check the',\n u'copyright laws for your country before downloading or redistributing',\n u'this or any other Project Gutenberg eBook.',\n u'',\n u'This header should be the first thing seen when viewing this Project',\n u'Gutenberg file. Please do not remove it. Do not change or edit the']"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false
},
"cell_type": "markdown",
"source": "**Resilient Distributed Datasets (RDDs)**\n\nRDDs are a main compotent to spark these Resillient Distributed Datasets are fault-tolerant collections of elements that can handle commands in parrellel.Spark will distribute data stored in the RDD across the cluster and parallizes the operations carried out on the RDD. \n\n**Creating Resilient Distributed Datasets (RDDs)**\n\n- RDD's are split into a number of partitions, that are computed on different nodeds of a cluster.\n\n- Pyspark allows for RDDs to be created from an array of data storage types that are thus supported by bt Hadoop, for eaxmple HDFS, HBase, object container storage and URI. For this example we will use data imported from an object container storage."
},
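{
"metadata": {},
"cell_type": "markdown",
"source": "A small sketch of creating RDDs, assuming the SparkContext `sc` is available: parallelize() distributes an in-memory collection (the optional second argument sets the number of partitions), while textFile() reads from external storage as we did above."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#create an RDD from an in-memory collection, split into 4 partitions\nnumbers = sc.parallelize(range(100), 4)\nnumbers.getNumPartitions()\n\n#creating an RDD from external storage (as done above):\n#rddData1 = sc.textFile(\"swift://newscala.\" + name + \"/DiVinciNotes.txt\")",
"execution_count": null,
"outputs": []
},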
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#the first()action computes row of the RDD, spark only reads \n#the first line not the rest of the file!\nrddData1.first()",
"execution_count": 2,
"outputs": [
{
"execution_count": 2,
"output_type": "execute_result",
"data": {
"text/plain": "u'The Project Gutenberg EBook of The Notebooks of Leonardo Da Vinci, Complete'"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**2. Transformations: Develop new RDDs using transformations such as filter().**\n\nTransformations construct and change the previous RDD to a new one.Here we filtered the \nelements containing \"Leonardo\" from the document and created a new RDD called \"data\"."
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#Transformations: construct and change the previous RDD to a new one.Here we filtered the \n#elements containing\nleoRDD = rddData1.filter(lambda line: \"Leonardo\" in line)\nleoRDD",
"execution_count": 3,
"outputs": [
{
"execution_count": 3,
"output_type": "execute_result",
"data": {
"text/plain": "PythonRDD[4] at RDD at PythonRDD.scala:48"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#example two same function with longer string\neffectsRDD = rddData1.filter(lambda line: \"The effects of\" in line)\neffectsRDD",
"execution_count": 4,
"outputs": [
{
"execution_count": 4,
"output_type": "execute_result",
"data": {
"text/plain": "PythonRDD[5] at RDD at PythonRDD.scala:48"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Persisting () or Caching()**\n\nThis envolves sending a subset of your data into memory inorder for recall to carry out repeated calcualtions. Caching is similier process.\n\n"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#we persist the new rdd that contains the Leonardo elements\nleoRDD.persist",
"execution_count": 5,
"outputs": [
{
"execution_count": 5,
"output_type": "execute_result",
"data": {
"text/plain": "<bound method PipelinedRDD.persist of PythonRDD[4] at RDD at PythonRDD.scala:48>"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#Example two cached\neffectsRDD.cache",
"execution_count": 6,
"outputs": [
{
"execution_count": 6,
"output_type": "execute_result",
"data": {
"text/plain": "<bound method PipelinedRDD.cache of PythonRDD[5] at RDD at PythonRDD.scala:48>"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "leoRDD.count()",
"execution_count": 7,
"outputs": [
{
"execution_count": 7,
"output_type": "execute_result",
"data": {
"text/plain": "626"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "effectsRDD.count()",
"execution_count": 8,
"outputs": [
{
"execution_count": 8,
"output_type": "execute_result",
"data": {
"text/plain": "4"
},
"metadata": {}
}
]
},
{
"metadata": {
"collapsed": true
},
"cell_type": "markdown",
"source": "**2.1 Union() Transformation**"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#prints out the number of lines containing bothe \"Leonardo\" and \"the effects of\"\nvinciRDD = effectsRDD.union(leoRDD)",
"execution_count": 9,
"outputs": []
},
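{
"metadata": {},
"cell_type": "markdown",
"source": "Note that union() keeps duplicates: a line matching both filters would appear twice in vinciRDD. A hedged sketch of follow-up actions (not executed here):"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#count the combined RDD (duplicates included)\nvinciRDD.count()\n\n#distinct() removes duplicates, at the cost of a shuffle across the cluster\nvinciRDD.distinct().count()",
"execution_count": null,
"outputs": []
},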
{
"metadata": {},
"cell_type": "markdown",
"source": "**2.3 map() and filter() Transformations**"
},
{
"metadata": {},
"cell_type": "markdown",
"source": "- The map () transformation is an element wise transformation that uses a given function and applies a computation to elements within the RDD.\n- The filter() transformation as seen previously also uses a given function but only creates a new RDD that may exclude elements onced passing the filter() function for example."
},
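{
"metadata": {},
"cell_type": "markdown",
"source": "A minimal sketch of map() and filter() on the lines loaded earlier (take(5) just peeks at a few results):"
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#map: transform every line to its length (element-wise)\nlineLengths = rddData1.map(lambda line: len(line))\nlineLengths.take(5)\n\n#filter: keep only the non-empty lines\nnonEmpty = rddData1.filter(lambda line: len(line) > 0)\nnonEmpty.take(5)",
"execution_count": null,
"outputs": []
},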
{
"metadata": {},
"cell_type": "markdown",
"source": "**Term frequency-inverse document frequency (TF-IDF)**"
},
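{
"metadata": {},
"cell_type": "markdown",
"source": "TF-IDF weighs how often a term appears in a document against how common it is across the whole corpus. As documented for MLlib's RDD-based API, the weight of term $t$ in document $d$ over corpus $D$ is\n\n$$TFIDF(t, d, D) = TF(t, d) \\cdot IDF(t, D), \\qquad IDF(t, D) = \\log \\frac{|D| + 1}{DF(t, D) + 1}$$\n\nwhere $TF(t, d)$ is the number of times $t$ appears in $d$, $|D|$ is the number of documents, and $DF(t, D)$ is the number of documents containing $t$. Here each line of the text is treated as a \"document\"."
},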
{
"metadata": {},
"cell_type": "markdown",
"source": "**Step 1: HashingTF** "
},
{
"metadata": {
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "from pyspark.mllib.feature import HashingTF, IDF\ndataRDD = rddData1.map(lambda line: line.split())\nhtf = HashingTF()\ntfVect = htf.transform(dataRDD)\ntfVect.cache()#cache for later reuse",
"execution_count": 10,
"outputs": [
{
"execution_count": 10,
"output_type": "execute_result",
"data": {
"text/plain": "PythonRDD[9] at RDD at PythonRDD.scala:48"
},
"metadata": {}
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "**Step 2: Inverse Document Frequency**"
},
{
"metadata": {
"scrolled": true,
"collapsed": false,
"trusted": false
},
"cell_type": "code",
"source": "#import library and fit the tf vector to IDF model and then transform to a sparse vector representive of words\nfrom pyspark.mllib.feature import IDF\nidf = IDF(minDocFreq=2).fit(tfVect)\ntfidf = idf.transform(tfVect)\ntfidf.take(10)",
"execution_count": 14,
"outputs": [
{
"execution_count": 14,
"output_type": "execute_result",
"data": {
"text/plain": "[SparseVector(1048576, {274140: 4.5222, 323305: 2.6779, 641967: 8.3176, 689675: 8.7876, 832474: 6.0919, 833202: 7.4014, 880387: 7.1782, 994896: 7.0298, 996619: 0.0, 1027493: 9.2985}),\n SparseVector(1048576, {274140: 4.5222, 335910: 6.8417, 641967: 8.3176, 956125: 2.8184}),\n SparseVector(1048576, {50583: 1.9295, 274140: 4.5222, 641967: 8.3176, 852617: 8.3176, 880390: 0.0, 948889: 5.4483, 956125: 2.8184, 997345: 0.0}),\n SparseVector(1048576, {}),\n SparseVector(1048576, {191961: 8.7876, 358972: 7.5067, 521365: 3.0466, 592895: 7.5639, 618498: 4.0445, 647690: 5.5688, 649117: 8.4512, 682947: 9.0108, 725041: 1.9718, 900641: 8.6053, 959994: 1.8289, 996102: 9.2985}),\n SparseVector(1048576, {66793: 7.4014, 308262: 8.1999, 323325: 3.781, 372509: 4.6445, 592895: 7.5639, 643889: 0.0, 672287: 5.6609, 919619: 0.0, 951974: 3.4637}),\n SparseVector(1048576, {196531: 4.2233, 230882: 0.0, 323325: 3.781, 629096: 3.0785, 715677: 4.5306, 833202: 7.4014, 994896: 7.0298}),\n SparseVector(1048576, {}),\n SparseVector(1048576, {72751: 4.7552, 84721: 5.5768, 111889: 9.0108, 213913: 4.4972, 497106: 0.0, 629096: 3.0785, 719560: 4.6605, 805631: 4.592, 944044: 4.1529, 956097: 2.8123, 959994: 0.9144, 994896: 7.0298}),\n SparseVector(1048576, {43230: 7.1782, 141912: 6.8008, 218222: 8.0945, 234048: 5.2209, 295692: 8.3176, 323325: 3.781, 641953: 8.4512, 697409: 4.9375, 751389: 9.2985, 817768: 0.0, 833202: 7.4014, 959994: 0.9144})]"
},
"metadata": {}
}
]
}
],
"metadata": {
"kernelspec": {
"name": "python2",
"display_name": "Python 2",
"language": "python"
},
"language_info": {
"mimetype": "text/x-python",
"nbconvert_exporter": "python",
"name": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11",
"file_extension": ".py",
"codemirror_mode": {
"version": 2,
"name": "ipython"
}
},
"gist": {
"id": "",
"data": {
"description": "Apache Spark (MLlib) Extracting Text Features for Machine Learning.ipynb",
"public": true
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}