@oneryalcin
Created September 23, 2019 21:13
sparkify_4_null_values_stats
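# Imports used by the two checks below.
from pyspark.sql.functions import col, count, isnan, when

# First, let's check whether we have any NaN values in our dataset.
# `data` is assumed to be the Spark DataFrame loaded earlier.
data.select([count(when(isnan(c), c)).alias(c) for c in data.columns]).head().asDict()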
>> {'artist': 0,
'auth': 0,
'firstName': 0,
'gender': 0,
'itemInSession': 0,
'lastName': 0,
'length': 0,
'level': 0,
'location': 0,
'method': 0,
'page': 0,
'registration': 0,
'sessionId': 0,
'song': 0,
'status': 0,
'ts': 0,
'userAgent': 0,
'userId': 0}
# Looks like we don't have any NaN values, which is good. How about NULL values?
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns]).head().asDict()
>> {'artist': 58392,
'auth': 0,
'firstName': 8346,
'gender': 8346,
'itemInSession': 0,
'lastName': 8346,
'length': 58392,
'level': 0,
'location': 8346,
'method': 0,
'page': 0,
'registration': 8346,
'sessionId': 0,
'song': 58392,
'status': 0,
'ts': 0,
'userAgent': 8346,
'userId': 0}
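
The null counts above fall into just two non-zero values, which is easier to see when the columns are grouped by count. The following is a minimal plain-Python sketch of that grouping; the dictionary is copied from the output above, and `group_by_null_count` is a hypothetical helper name, not part of the original notebook.

```python
from collections import defaultdict

# Null counts copied from the output above.
null_counts = {
    'artist': 58392, 'auth': 0, 'firstName': 8346, 'gender': 8346,
    'itemInSession': 0, 'lastName': 8346, 'length': 58392, 'level': 0,
    'location': 8346, 'method': 0, 'page': 0, 'registration': 8346,
    'sessionId': 0, 'song': 58392, 'status': 0, 'ts': 0,
    'userAgent': 8346, 'userId': 0,
}


def group_by_null_count(counts):
    """Map each non-zero null count to the sorted list of columns sharing it."""
    groups = defaultdict(list)
    for column, n in counts.items():
        if n > 0:
            groups[n].append(column)
    return {n: sorted(cols) for n, cols in groups.items()}


groups = group_by_null_count(null_counts)
for n, cols in sorted(groups.items()):
    print(n, cols)
# 8346 ['firstName', 'gender', 'lastName', 'location', 'registration', 'userAgent']
# 58392 ['artist', 'length', 'song']
```

Identical counts are consistent with the nulls falling on the same rows (user-profile fields in one group, song fields in the other), but only a row-level check on `data` would confirm that.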