@nithyadurai87
Created July 28, 2018 15:18
counting.py - for spark tutorial
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("WordCount")
sc = SparkContext(conf = conf)
rdd1 = sc.textFile("file:///home/shrini/smp.csv")
def cols(data):
    sno,fname,lname,age,desig,mob,location = data.split(",")
    return sno,fname,lname,age,desig,mob,location
dict1 = rdd1.countByValue()
dict2 = rdd1.map(cols).filter(lambda line: int(line[3])>=30).countByValue()
managers=0
for i,j in dict1.items():
    if "Manager" in i:
        managers = managers+j
seniors=0
for j in dict2.values():
    seniors = seniors+j
print("Total No. of records:",str(rdd1.count()))
print("Distinct records:",str(rdd1.distinct().count()))
print("Total No. of Managers:",str(managers))
print("No. of Seniors (age>=30):",str(seniors))
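
The counting logic above can be checked without a Spark installation. The following is a minimal plain-Python sketch, not part of the original gist: `collections.Counter` stands in for `countByValue()` (both map each value to its number of occurrences), and the sample rows are hypothetical, made up only to match the seven-column layout the script assumes.

```python
from collections import Counter

# Hypothetical sample rows in the layout the script assumes:
# sno,fname,lname,age,desig,mob,location
rows = [
    "1,Asha,Kumar,34,Manager,9000000001,Chennai",
    "2,Ravi,Das,28,Developer,9000000002,Madurai",
    "3,Mala,Raj,30,Sr. Manager,9000000003,Chennai",
    "1,Asha,Kumar,34,Manager,9000000001,Chennai",  # exact duplicate of row 1
]

def cols(data):
    """Split one CSV line into its seven fields."""
    sno, fname, lname, age, desig, mob, location = data.split(",")
    return sno, fname, lname, age, desig, mob, location

# Counter plays the role of countByValue(): value -> number of occurrences.
dict1 = Counter(rows)
dict2 = Counter(r for r in map(cols, rows) if int(r[3]) >= 30)

total = len(rows)        # like rdd1.count()
distinct = len(dict1)    # like rdd1.distinct().count()
managers = sum(n for line, n in dict1.items() if "Manager" in line)
seniors = sum(dict2.values())

print(total, distinct, managers, seniors)
```

Note that the manager count is a substring match on the whole line, so a duplicate record is counted twice and any field containing "Manager" (not only the designation) would match.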