# coding: utf-8
# # Main Summary - Sample Id
#
# The `sample_id` column in the main summary allows you to work with a smaller subset of the data. In this report we explore the main summary dataset for the month of December 2016, building up to how you can use the sample id efficiently in your own reports.
#
# In[1]:
bucket = "telemetry-parquet"
prefix = "main_summary/v3"
get_ipython().magic(u'time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")')
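# As a quick sanity check (an optional step, not part of the original analysis), we can
# confirm that the two partition columns used throughout this report are present in the
# loaded DataFrame.
# In[ ]:
dataset.select("submission_date_s3", "sample_id").printSchema()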
# The main_summary dataset is partitioned in S3 by submission date and by a sample_id tag that is assigned to each ping. Filtering on these attributes is efficient, since Spark doesn't have to load the whole dataset into memory. We can use Spark's SQL-flavored operations (Spark SQL) to find all the data for December 2016.
#
# You'll find that there are many different ways to do the same thing. Below are examples of using the `like` operator and a simple string comparison.
# In[22]:
from pyspark.sql.functions import col
like_function_df = dataset.where(col("submission_date_s3").like("201612%"))
string_comp_df = (
    dataset
    .where(col("submission_date_s3") >= "20161201")
    .where(col("submission_date_s3") < "20170101")
)
print("Performance of using col.like")
get_ipython().magic(u'time like_function_df.count()')
print("Performance of using string comparisons")
get_ipython().magic(u'time month_count = string_comp_df.count()')
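# To check that Spark is actually pruning partitions rather than scanning the full
# dataset, `explain()` prints the physical plan, where the filters on `submission_date_s3`
# should show up as pushed-down partition filters. This is a diagnostic sketch, not part
# of the original analysis.
# In[ ]:
string_comp_df.explain()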
# In[5]:
# both of the above perform roughly the same
month_df = string_comp_df
# The main summary has a sample_id attribute, where each id identifies a 1% sample of the whole dataset. The ids are labels in the range [1, 100]. Here, we show that the distribution of sample_ids is uniform. The extra group contains pings where `sample_id` is null.
# In[14]:
print("Filtering based on day")
get_ipython().magic(u'time day_df = month_df.where(col("submission_date_s3") == "20161201")')
print("Result: " + str(day_df.count()))
day_df.groupBy("sample_id").count().describe().show()
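# To see how many pings are missing a sample_id altogether (the null group mentioned
# above), we can also filter on it directly. This is an extra check, not part of the
# original cell.
# In[ ]:
day_df.where(col("sample_id").isNull()).count()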
# Armed with this knowledge, we can sample a certain percentage of the data. We sample 1, 5, 10, and 25% of the data using a random set of ids.
# In[24]:
import random
# use the variables month_df, month_count from previous cells
def percent_error(expected, theoretical):
    return abs(expected - theoretical) / float(theoretical) * 100.0
percentages = [1, 5, 10, 25]
labels = set(range(1, 101))
for percentage in percentages:
    # percentage is an integer in [1, 100] corresponding to the total percentage to sample
    sampled = random.sample(labels, percentage)
    # equivalent to `select count(sample_id) from month_df where month_df.sample_id in (1, 2, 3 ...)`
    get_ipython().magic(u'time result = month_df.where(col("sample_id").isin(sampled)).count()')
    expected = month_count * (percentage / 100.0)
    percent_err = percent_error(result, expected)
    print("Sampled {}% - Expected {}, Result {}, % Error {}\n"
          .format(percentage, expected, result, percent_err))
# The set of ids is chosen randomly from all possible sample_ids. We take advantage of the `isin` operator ([Spark SQL](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.Column.isin)) to count all the rows where the sample_id is within the set of ids. Note that the percentage error relative to the expected count is low, which lets us use this subset as a good approximation of the full dataset.
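#
# A minimal sketch of how this pattern could be packaged for your own reports: pick a
# random set of sample ids up front and filter on them before doing any heavier work.
# The helper name and its arguments are illustrative, not an existing library function;
# it reuses the `col` and `random` imports from the cells above.
# In[ ]:
def sample_main_summary(df, percentage, seed=None):
    """Return roughly `percentage`% of `df` by filtering on a random set of sample ids."""
    rng = random.Random(seed)
    ids = rng.sample(range(1, 101), percentage)
    return df.where(col("sample_id").isin(ids))

# For example, work with an approximate 5% sample of the December data.
sampled_df = sample_main_summary(month_df, 5, seed=42)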