# create an RDD from the text file with number of partitions = 4
my_text_file = sc.textFile('tokens_spark.txt', minPartitions=4)
# RDD object (no computation has happened yet -- transformations are lazy)
print(my_text_file)
# convert each line to lower case
my_text_file = my_text_file.map(lambda x: x.lower())
# updated RDD object
print(my_text_file)
# get the RDD lineage
print(my_text_file.toDebugString())
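Printing the RDD before calling an action only shows the object and its lineage, because `map` is a lazy transformation. As a rough pure-Python analogy (this is a sketch, not Spark itself), a `map` object behaves the same way: nothing is computed until you force evaluation, the way a Spark action does.

```python
# Pure-Python analogy for lazy transformations: the map() call below does not
# touch the data until we iterate (the equivalent of a Spark action).
lines = ['Python', 'SPARK', 'Scala']        # stand-in for the text file
lowered = map(lambda x: x.lower(), lines)   # lazy, like rdd.map(...)

print(lowered)          # a map object, not the results -- like printing an RDD
result = list(lowered)  # forcing evaluation, like calling collect()
print(result)
```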
@mikekenneth commented Feb 11, 2021

Really nice. What is the content of your 'tokens_spark.txt' file?
Thanks in advance.

I used the below to create a file just for those who may need it:

from random import randint

# write 100,000 random tokens, one per line, to the file name the gist reads
with open('tokens_spark.txt', 'w') as outfile:
    words = ['Python', 'Java', 'Scala', 'Others', 'Spark', 'APACHE', 'Cool', 'udpate', 'Good_to_GO']
    for i in range(100000):
        outfile.write(words[randint(0, len(words) - 1)] + '\n')
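To sanity-check a file generated this way without Spark, a plain-Python sketch can tally the tokens after lower-casing them, mirroring what the gist's `map(lambda x: x.lower())` step does (the sample file name and smaller line count here are illustrative assumptions):

```python
from collections import Counter
from random import randint

# generate a small sample file the same way as above (hypothetical filename)
words = ['Python', 'Java', 'Scala', 'Spark']
with open('tokens_spark_sample.txt', 'w') as outfile:
    for _ in range(1000):
        outfile.write(words[randint(0, len(words) - 1)] + '\n')

# mirror the Spark job's lower-casing, then count each token
with open('tokens_spark_sample.txt') as infile:
    counts = Counter(line.strip().lower() for line in infile)

print(counts.most_common())
```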