@kremerben
Last active August 31, 2017 05:50
Get Spark up and running in a Vagrant virtual machine

Get Spark up and running:

$ vagrant init ubuntu/xenial64
$ vagrant up
$ vagrant ssh
vm$ sudo apt-get update
vm$ sudo apt-get install default-jre
vm$ sudo apt-get install python-pip
vm$ sudo pip install --upgrade pip
vm$ sudo pip --no-cache-dir install pyspark
vm$ pyspark   #<--- check that the shell starts, then exit with Ctrl-D
vm$ cd ~
vm$ mkdir dev && cd dev
vm$ touch text.txt
vm$ nano text.txt   #<--- paste some text in

vm$ pyspark
>>> textFile = spark.read.text("text.txt")

Count rows:

>>> textFile.count()

Count words:

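def word_count(filename):
    # Read the file as a DataFrame with one row per line (column "value").
    lines = spark.read.text(filename)
    # Drop empty lines before counting.
    lines = lines.filter('value != ""')
    c = 0
    # collect() pulls every row to the driver, which is fine for a small file.
    for r in lines.collect():
        # split() with no argument ignores repeated, leading, and trailing whitespace.
        c += len(r.value.split())
    return c

word_count('text.txt')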
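
One detail worth checking in a word count like this is how each line is split. Splitting on a single space produces empty strings wherever two spaces sit next to each other, and those get counted as words. The sketch below is plain Python, so it runs without Spark. `count_words` and `sample` are hypothetical names used only for illustration; they are not part of PySpark.

```python
def count_words(lines):
    """Count whitespace-separated words across an iterable of strings."""
    total = 0
    for line in lines:
        # str.split() with no argument splits on any run of whitespace
        # and never yields empty strings.
        total += len(line.split())
    return total


sample = ["hello  world", "", "  spark is  fun "]

# Splitting on a single space keeps empty strings between repeated spaces.
naive = sum(len(line.split(' ')) for line in sample if line != "")

print(naive)                # 10
print(count_words(sample))  # 5
```

Inside the pyspark shell, the same helper should work on the rows of the text file, for example `count_words(r.value for r in spark.read.text("text.txt").toLocalIterator())`, assuming the file is small enough to iterate over on the driver.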