@rainsunny
Forked from rjurney/apply.py
Last active June 7, 2018 02:52
Plot a pyspark.RDD.histogram as a pyplot histogram (via bar)
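%matplotlib inline
# Bucket edges for ArrDelay; every bucket is right-open except the last, which is closed
buckets = [-87.0, -15, 0, 30, 120]
# RDD.histogram returns a (bucket_edges, counts) tuple
rdd_histogram_data = ml_bucketized_features \
    .select("ArrDelay") \
    .rdd \
    .flatMap(lambda x: x) \
    .histogram(buckets)
# create_hist is defined below; run that definition first
create_hist(rdd_histogram_data)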
import numpy as np
import matplotlib.pyplot as plt

def create_hist(rdd_histogram_data):
    """Given the (bucket_edges, counts) tuple from RDD.histogram, plot a pyplot bar chart"""
    heights = np.array(rdd_histogram_data[1])
    full_bins = rdd_histogram_data[0]
    # Left edge of each bucket (despite the name): align='edge' anchors each bar there
    mid_point_bins = full_bins[:-1]
    widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])]
    bar = plt.bar(mid_point_bins, heights, width=widths, color='b', align='edge')
    return bar
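To try create_hist without a Spark session, a small local stand-in can produce a tuple of the same shape. This is only a sketch: local_histogram and sample_delays are hypothetical names with made-up values, and the bucketing follows the documented RDD.histogram behaviour as I understand it (right-open buckets, a closed last bucket, out-of-range values dropped).

```python
from bisect import bisect_right

def local_histogram(values, buckets):
    """Hypothetical local stand-in for RDD.histogram(buckets); not part of PySpark."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        # Assumption: missing and out-of-range values are not counted
        if v is None or v < buckets[0] or v > buckets[-1]:
            continue
        # Index of the bucket whose [low, high) range contains v
        i = bisect_right(buckets, v) - 1
        # The last bucket is closed, so v == buckets[-1] falls into it
        i = min(i, len(counts) - 1)
        counts[i] += 1
    return buckets, counts

buckets = [-87.0, -15, 0, 30, 120]
sample_delays = [-20, -15, -3, 0, 12, 30, 45, 120, 500]  # made-up values
edges, counts = local_histogram(sample_delays, buckets)
print(edges, counts)
```

The resulting (edges, counts) tuple can be passed to create_hist in place of rdd_histogram_data.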