@lakshay-arora
Created October 15, 2019 12:00
# create an RDD from the text file with number of partitions = 4
my_text_file = sc.textFile('tokens_spark.txt', minPartitions=4)
# RDD object
print(my_text_file)
# convert each line to lower case
my_text_file = my_text_file.map(lambda x: x.lower())
# updated RDD object
print(my_text_file)
# get the RDD lineage
print(my_text_file.toDebugString())
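
If you are running this outside the interactive pyspark shell (where sc is already defined), here is a minimal setup sketch, assuming local mode; the app name is illustrative:

from pyspark import SparkContext

# assumption: local mode using all available cores; app name is just a placeholder
sc = SparkContext('local[*]', 'rdd_lineage_demo')

my_text_file = sc.textFile('tokens_spark.txt', minPartitions=4)

# sanity check: textFile treats minPartitions as a lower bound
print(my_text_file.getNumPartitions())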
@mikekenneth commented Feb 11, 2021

Really nice! What is the content of your 'tokens_spark.txt' file?
Thanks in advance.

I used the snippet below to create a matching file, just for those who may need it:

from random import randint

with open('tokens_spark.txt', 'w') as outfile:  # same filename the gist reads
    words = ['Python', 'Java', 'Scala', 'Others', 'Spark', 'APACHE', 'Cool', 'udpate', 'Good_to_GO']
    for i in range(100000):
        outfile.write(words[randint(0, len(words) - 1)] + '\n')
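
A quick way to check that the tokens were picked up, reusing the lowercased RDD from the gist (just a sketch, not part of the original):

# count how often each lowercased token appears (each line of the file is one token)
token_counts = my_text_file.map(lambda token: (token, 1)) \
                           .reduceByKey(lambda a, b: a + b)
print(token_counts.collect())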
