Gist by @ijan10 (https://gist.github.com/ijan10/67dc03d5cfa43b37828660955b910558), created February 27, 2019.
# Count how many rows share each join-key value in left_table (assumed to be
# registered as a temp view); the heaviest keys are the skew candidates.
weights_query = '''SELECT %s, count(1) AS weight FROM left_table GROUP BY %s ORDER BY weight DESC''' % (left_col_name, left_col_name)
df_join_key_weights = spark_session.sql(weights_query)

# Collect the weights to the driver as a list of dicts, one per distinct key.
spark_session.sparkContext.setJobGroup(GROUP_ID, "collect rdd to python list (counting the number of repeated keys)")
list_join_key_weights = [{left_col_name: row[left_col_name], 'weight': row['weight']}
                         for row in df_join_key_weights.collect()]
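
# --- Usage sketch (not part of the original gist) ---
# The snippet above assumes `spark_session`, `left_col_name`, and `GROUP_ID`
# are defined upstream and that `left_table` is a registered temp view. Below
# is a minimal, self-contained example of that context; all names and data
# here are illustrative assumptions, not the author's actual setup. Run these
# definitions before the snippet above.
from pyspark.sql import SparkSession

spark_session = SparkSession.builder.appName("join-key-weights").getOrCreate()
GROUP_ID = "skew-analysis"      # arbitrary job-group label (assumption)
left_col_name = "user_id"       # join-key column name (assumption)

# A toy skewed table: key 'a' dominates, so it should surface first.
spark_session.createDataFrame(
    [("a",), ("a",), ("a",), ("b",), ("c",)], [left_col_name]
).createOrReplaceTempView("left_table")

# With this data, list_join_key_weights would come back sorted by weight, e.g.:
# [{'user_id': 'a', 'weight': 3}, {'user_id': 'b', 'weight': 1}, ...]
# The top entries identify the keys worth special handling (e.g. salting)
# in a subsequent skewed join.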