@koverholt
Created April 10, 2015 20:59
Simple Numpy example in Spark
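import numpy as np
from pyspark import SparkConf, SparkContext

# Connect to a standalone Spark master; replace <HOSTNAME> with the master's host.
conf = SparkConf()
conf.setMaster("spark://<HOSTNAME>:7077")
conf.setAppName("NumpyMult")
sc = SparkContext(conf=conf)

def mult(x):
    # Multiply a single element by a one-element NumPy array.
    y = np.array([2])
    return x * y

x = np.arange(10000)
distData = sc.parallelize(x)            # distribute the array as an RDD
results = distData.map(mult).collect()  # apply mult to each element, gather on the driver
print(results)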
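
As a quick sanity check that needs no cluster, the sketch below applies the same `mult` function to a short array with plain NumPy. It assumes only that NumPy is installed; the names `local_results` and `flat` are illustrative and not part of the gist. It shows that each mapped element comes back as a one-element array, so the collected result is a list of small arrays rather than one flat array.

```python
import numpy as np

def mult(x):
    # Same function as in the gist: multiply by a one-element array.
    y = np.array([2])
    return x * y

x = np.arange(5)

# Mimic what map(mult) does to each element, but locally.
local_results = [mult(v) for v in x]
print(local_results[3].shape)  # every entry is a length-1 array

# Flatten the list of one-element arrays into a single 1-D array.
flat = np.concatenate(local_results)
print(flat.tolist())
```

If a flat result is preferred, the same `np.concatenate` call could be applied to the list returned by `collect()`.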
@praveen2916

Hi,
Thanks for the code snippet. I was wondering whether, just like distData, we can create a second RDD, distData2, and do operations on both of them together.
To be more precise:
x = np.arange(10000)
distData = sc.parallelize(x)

y = np.arange(10000)
distData2 = sc.parallelize(y)

Now do array operations on both distData and distData2. Is this possible?

Thanks
Venkata D.
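
One possible approach to the question above, offered as a sketch rather than a confirmed answer from the gist author: pair the two RDDs with `RDD.zip` and then map over the resulting tuples. The helper names `add_pair` and `elementwise_sum` and the `num_slices` parameter are hypothetical, and `sc` is assumed to be an already-created SparkContext such as the one in the gist.

```python
import numpy as np

def add_pair(pair):
    """Add the two halves of one zipped (x_i, y_i) element."""
    a, b = pair
    return a + b

def elementwise_sum(sc, x, y, num_slices=4):
    """Hypothetical helper: element-wise sum of two equal-length arrays on Spark.

    `sc` is an existing SparkContext (assumed, for example the gist's `sc`).
    """
    # RDD.zip expects both RDDs to have the same number of partitions and the
    # same number of elements in each partition, so both arrays are
    # parallelized with the same num_slices.
    dist_x = sc.parallelize(x, num_slices)
    dist_y = sc.parallelize(y, num_slices)
    # Each zipped element is a tuple (x_i, y_i); add_pair combines the two.
    return dist_x.zip(dist_y).map(add_pair).collect()
```

With a live context, a call such as `elementwise_sum(sc, np.arange(10000), np.arange(10000))` would be expected to return the element-wise sums as a Python list. Other binary operations can be swapped in by changing `add_pair`.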