@padmick
Last active August 29, 2015 14:28
# First program: find the max of one sampled row in the file
from pyspark import SparkContext

sc = SparkContext(appName="rowMax")

def mapMax(line):
    # First field is the row label; the remaining fields are numeric values
    fields = line.split(',')
    return fields[0], max(float(x) for x in fields[1:])

textfile = sc.textFile("wasb://wgkspark@teststoragespark.blob.core.windows.net/dataA.csv").cache()
lines = textfile.takeSample(False, 50)
print(mapMax(lines[-1]))
# print(lines[0])  # was just for debugging
# Second program: find the max of every row in the CSV file
from pyspark import SparkContext

sc = SparkContext(appName="allRowMax")

def mapMax(line):
    # First field is the row label; the remaining fields are numeric values
    fields = line.split(',')
    return fields[0], max(float(x) for x in fields[1:])

textfile = sc.textFile("wasb://wgkspark@teststoragespark.blob.core.windows.net/dataA.csv").cache()
# Drop rows whose label starts with 'T' (non-data rows), then map each row to its max
somethingRDD = textfile.filter(lambda line: line.split(',')[0][0] != 'T').map(mapMax)
print(somethingRDD.collect())
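The per-row logic above can be checked without a Spark cluster; a minimal plain-Python sketch (the sample rows here are invented for illustration):

```python
# Plain-Python sketch of the same filter-then-max-per-row logic (no Spark needed).
rows = [
    "T1,header,row",       # label starts with 'T', filtered out below
    "A1,1.5,2.5,0.5",
    "B2,10.0,3.25",
]

def map_max(line):
    # First field is the row label; the rest are numeric values
    fields = line.split(',')
    return fields[0], max(float(x) for x in fields[1:])

# Keep only rows whose label does not start with 'T', then take each row's max
result = [map_max(line) for line in rows if line.split(',')[0][0] != 'T']
print(result)  # [('A1', 2.5), ('B2', 10.0)]
```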