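from pyspark.sql import SparkSession

# Assumes `sc` is a SparkSession, since the code below calls sc.sparkContext.
sc = SparkSession.builder.getOrCreate()
path = "./*.html"  # glob matching every HTML file in the working directory

def rdd_l(path):
    """Path-to-RDD builder.
    Takes a glob path and returns an RDD of (filename, content) tuples.
    """
    return sc.sparkContext.wholeTextFiles(path)  # use the argument, not a hard-coded glob

def df(rdd_l):
    """RDD-to-DataFrame builder.
    Takes the RDD returned by rdd_l() and returns a DataFrame.
    Each RDD element is a tuple of the filename and the file's content
    (an HTML document in this case); only the "text" column is kept.
    """
    return rdd_l.toDF(schema=["filename", "text"]).select("text")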