@yogenderPalChandra
Created July 3, 2022 09:30
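from pyspark.sql import SparkSession

# Assumption: `sc` is a SparkSession (the original code calls sc.sparkContext).
sc = SparkSession.builder.getOrCreate()

path = "./*.html"


def rdd_l(path):
    """Path to RDD builder.

    Takes a path (glob pattern) as argument and returns an RDD of
    (filename, content) pairs, one pair per matching file.
    """
    # Use the argument instead of a hard-coded pattern.
    return sc.sparkContext.wholeTextFiles(path)


def df(rdd_l):
    """RDD to DataFrame builder.

    Takes the RDD returned by rdd_l() as argument and returns a DataFrame.
    Each RDD element is a tuple of the filename and the file's content
    (an HTML document in this case); only the "text" column is kept.
    """
    return rdd_l.toDF(schema=["filename", "text"]).select("text")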
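
The (filename, text) shape that `wholeTextFiles` produces, and the effect of `.select("text")`, may be easier to see without a cluster. Below is a minimal plain-Python sketch, not Spark's implementation. The names `whole_text_files_local`, `text_column`, and `demo` are hypothetical and used only for illustration. One difference to keep in mind: Spark's keys are full file URIs, while this sketch uses ordinary paths.

```python
import glob
import os
import tempfile


def whole_text_files_local(pattern):
    """Hypothetical stand-in for wholeTextFiles: a list of (filename, text) pairs."""
    pairs = []
    for filename in sorted(glob.glob(pattern)):
        with open(filename, encoding="utf-8") as handle:
            pairs.append((filename, handle.read()))
    return pairs


def text_column(pairs):
    """Rough analogue of .select("text"): keep only the file contents."""
    return [text for _filename, text in pairs]


def demo():
    # Write two small HTML files to a temporary directory, then read them back.
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in [("a.html", "<p>a</p>"), ("b.html", "<p>b</p>")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(body)
        pairs = whole_text_files_local(os.path.join(tmp, "*.html"))
        names = [os.path.basename(filename) for filename, _text in pairs]
        return names, text_column(pairs)
```

Calling `demo()` returns the base filenames and the matching contents in sorted order, which is the same pairing the `df` function above splits into the `filename` and `text` columns.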