@caseyliqb
Last active February 27, 2020 12:13

Why is this so hard to remember?

from pyspark.sql.types import StructType, StructField, StringType

# Assumes a PySpark shell or notebook, where `sc` and `sqlContext` are predefined.
rdd = sc.parallelize([("moo this has stopwords b", "bat this one does not"),
                      ("apple orange banana", "cookie jar bla la")])

# Two nullable string columns
schema = StructType([StructField('entity', StringType(), True),
                     StructField('cleaned_entity', StringType(), True)])

# create dataframe
df3 = sqlContext.createDataFrame(rdd, schema)
df3.show(truncate=False)
@caseyliqb

Output of df3.show(truncate=False):

+------------------------+---------------------+
|entity                  |cleaned_entity       |
+------------------------+---------------------+
|moo this has stopwords b|bat this one does not|
|apple orange banana     |cookie jar bla la    |
+------------------------+---------------------+
