@sachdevm
Last active July 26, 2017 07:37
Creating a DataFrame from a nested dictionary in PySpark
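# Example 1: ran in the PySpark shell, where `sc` (the SparkContext) is predefined.
from pyspark.sql import Row

# "duration" is an int while "name" and "code" are strings,
# so each project dict holds values of mixed types.
nested_struct = {
    "org_id": "abc", "emp_id": "def",
    "projects": [{"name": "test_proj_1", "duration": 20, "code": "p30"},
                 {"name": "test_proj_2", "duration": 15, "code": "p30"}]
}
test_rdd = sc.parallelize([Row(**nested_struct)])
test_df = test_rdd.toDF()  # no schema given, so Spark infers one from the data

# The RDD still holds the original Python dicts.
print(test_rdd.collect()[0].projects)
# Output: [{'duration': 20, 'code': 'p30', 'name': 'test_proj_1'}, {'duration': 15, 'code': 'p30', 'name': 'test_proj_2'}]

# In the DataFrame the string values come back as None. Likely cause: Spark
# infers each dict as a map with a single value type (apparently an integer
# type here, since only "duration" survives), and other values become null.
print(test_df.collect()[0].projects)
# Output: [{u'duration': 20, u'code': None, u'name': None}, {u'duration': 15, u'code': None, u'name': None}]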
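The None values in the DataFrame output above are consistent with Spark inferring one value type per dict. As a hedged illustration, the plain-Python sketch below flags dicts whose values do not all share one type, so the problem can be spotted before calling `toDF()`. The helper name `find_mixed_type_dicts` is hypothetical and not part of PySpark; it only inspects Python types and makes no claim about Spark's exact inference rules.

```python
def find_mixed_type_dicts(value, path):
    """Return (path, sorted type names) for every dict with mixed value types.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    found = []
    if isinstance(value, dict):
        # Collect the distinct Python type names of this dict's values.
        type_names = sorted({type(v).__name__ for v in value.values()})
        if len(type_names) > 1:
            found.append((path, type_names))
        # Recurse into the values to catch deeper nesting.
        for key, child in value.items():
            found.extend(find_mixed_type_dicts(child, "%s.%s" % (path, key)))
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            found.extend(find_mixed_type_dicts(child, "%s[%d]" % (path, index)))
    return found


nested_struct = {
    "org_id": "abc", "emp_id": "def",
    "projects": [{"name": "test_proj_1", "duration": 20, "code": "p30"},
                 {"name": "test_proj_2", "duration": 15, "code": "p30"}]
}

# Check only the nested field; the top-level dict becomes a Row, not a map.
mixed = find_mixed_type_dicts(nested_struct["projects"], "projects")
for path, type_names in mixed:
    print(path, type_names)
```

For the data in this example, both `projects[0]` and `projects[1]` are reported, each mixing `int` and `str` values. The same data with string durations yields an empty list.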
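# Example 2: ran in the PySpark shell, where `sc` (the SparkContext) is predefined.
from pyspark.sql import Row

# Same data as before, except "duration" is now a string,
# so every value in each project dict has the same type.
nested_struct = {
    "org_id": "abc", "emp_id": "def",
    "projects": [{"name": "test_proj_1", "duration": "20", "code": "p30"},
                 {"name": "test_proj_2", "duration": "15", "code": "p30"}]
}
test_rdd = sc.parallelize([Row(**nested_struct)])
test_df = test_rdd.toDF()  # no schema given, so Spark infers one from the data

# The RDD still holds the original Python dicts.
print(test_rdd.collect()[0].projects)
# Output: [{'duration': '20', 'code': 'p30', 'name': 'test_proj_1'}, {'duration': '15', 'code': 'p30', 'name': 'test_proj_2'}]

# With uniform string values, the DataFrame keeps every value.
print(test_df.collect()[0].projects)
# Output: [{u'duration': u'20', u'code': u'p30', u'name': u'test_proj_1'}, {u'duration': u'15', u'code': u'p30', u'name': u'test_proj_2'}]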
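The run with string durations returns every value intact, which fits the fact that all values in each project dict share one type. One way to reach that state programmatically is to coerce leaf values to strings before building the `Row`. This is a minimal sketch, assuming that losing the numeric type is acceptable; `stringify_leaves` is a hypothetical helper, not a PySpark API.

```python
def stringify_leaves(value):
    """Recursively convert non-container values to str; None stays None.

    Hypothetical helper for illustration only; not a PySpark API.
    """
    if isinstance(value, dict):
        return {key: stringify_leaves(child) for key, child in value.items()}
    if isinstance(value, (list, tuple)):
        return [stringify_leaves(child) for child in value]
    if value is None:
        return None
    return str(value)


projects = [{"name": "test_proj_1", "duration": 20, "code": "p30"},
            {"name": "test_proj_2", "duration": 15, "code": "p30"}]

# Every value in each project dict is a string after this call.
string_projects = stringify_leaves(projects)
print(string_projects)
```

The result can replace the `"projects"` value before calling `Row(**nested_struct)`. If the numeric type matters, passing an explicit schema to `createDataFrame`, or turning each nested dict into a `Row`, are alternatives worth trying.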