@maximveksler
Last active October 21, 2020 20:42
All the ways to create a Spark DataFrame (PySpark)
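# From a list of tuples, naming the columns with a list of strings.
# Assumes `spark` is an active SparkSession; the pyspark shell provides one.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

data = [
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
]
df = spark.createDataFrame(data, schema=["id", "name", "age"])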

# From a list of dicts, converting each dict to a Row.
from pyspark.sql import Row

data = [
    {"id": "a", "name": "Alice", "age": 34},
    {"id": "b", "name": "Bob", "age": 36},
    {"id": "c", "name": "Charlie", "age": 30},
]
df = spark.createDataFrame([Row(**x) for x in data])

# From a pandas DataFrame.
import pandas as pd

data = [["a", "Alice", 34], ["b", "Bob", 36], ["c", "Charlie", 30]]
df = spark.createDataFrame(pd.DataFrame(data, columns=["id", "name", "age"]))
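
# From an RDD of tuples, mapping each tuple to a Row.
from pyspark.sql import Row

sc = spark.sparkContext  # the pyspark shell already defines `sc`
records = [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)]
rdd = sc.parallelize(records)
rows = rdd.map(lambda x: Row(id=x[0], name=x[1], age=int(x[2])))
df = spark.createDataFrame(rows)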
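
# From an RDD of JSON strings, one JSON object per element.
sc = spark.sparkContext  # the pyspark shell already defines `sc`
alice = '''{"id":"a", "name": "Alice", "age":34}'''
bob = '''{"id":"b", "name": "Bob", "age":36}'''
charlie = '''{"id":"c", "name": "Charlie", "age":30}'''
rdd = sc.parallelize([alice, bob, charlie])
df = spark.read.json(rdd)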
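
# From a SQL query. Column names come from the first select.
# `union` removes duplicate rows; `union all` would keep them.
df = spark.sql("""
select 'a' id, 'Alice' name, 34 age
union
select 'b', 'Bob', 36
union
select 'c', 'Charlie', 30
""")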
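
The snippets above let Spark infer the column types, so a Python `int` such as the age is inferred as a 64-bit `long`. As a minimal sketch of one more variant, a DDL-style schema string can be passed to `createDataFrame` to declare the types up front. The names `PEOPLE`, `PEOPLE_SCHEMA`, and `make_people_df` are hypothetical and introduced only for illustration; the helper assumes an active SparkSession is passed in.

```python
# Hypothetical names for illustration; not part of the original gist.
PEOPLE = [
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
]

# DDL-style schema string: column name followed by its Spark SQL type.
PEOPLE_SCHEMA = "id string, name string, age int"


def make_people_df(spark):
    """Build the example DataFrame with explicitly declared column types.

    `spark` is assumed to be an active SparkSession.
    """
    return spark.createDataFrame(PEOPLE, schema=PEOPLE_SCHEMA)
```

With a live session, `make_people_df(spark).printSchema()` should list `age` as an integer column rather than a long. Declaring the schema also avoids surprises when a column's first values are `None` and inference has nothing to go on.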