# coding: utf-8
# # Main Summary - Sample Id
#
# The `sample_id` column in the main summary allows you to work with a smaller subset of the data. In this report we explore the main summary dataset for the month of December 2016, building up to how you can use the sample id efficiently in your own reports.
#
# In[1]:
bucket = "telemetry-parquet"
prefix = "main_summary/v3"
get_ipython().magic(u'time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")')
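# As a quick sanity check (an optional step, not part of the original analysis), we can
# confirm that the two partition columns used throughout this report are present in the
# loaded DataFrame.
# In[ ]:
dataset.select("submission_date_s3", "sample_id").printSchema()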
# The main_summary dataset is partitioned in S3 by submission date and by a sample_id tag that is assigned to each ping. Filtering on these attributes is efficient, since Spark doesn't have to load the whole dataset into memory. We can use Spark's SQL-flavored operations (Spark SQL) to find all the data for December 2016.
#
# You'll find that there are many different ways to do the same thing. Below are examples of using the `like` operator and a simple string comparison.
# In[22]:
from pyspark.sql.functions import col
like_function_df = dataset.where(col("submission_date_s3").like("201612%"))
string_comp_df = (
    dataset
    .where(col("submission_date_s3") >= "20161201")
    .where(col("submission_date_s3") < "20170101")
)
print("Performance of using col.like")
get_ipython().magic(u'time like_function_df.count()')
print("Performance of using string comparisons")
get_ipython().magic(u'time month_count = string_comp_df.count()')
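# To check that Spark is actually pruning partitions rather than scanning the full
# dataset, `explain()` prints the physical plan, where the filters on `submission_date_s3`
# should show up as pushed-down partition filters. This is a diagnostic sketch, not part
# of the original analysis.
# In[ ]:
string_comp_df.explain()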
# In[5]:
# both of the above perform roughly the same
month_df = string_comp_df
# The main summary has a sample_id attribute, where each id identifies a 1% sample of the whole dataset. The ids are labels in the range [1, 100]. Here, we show that the distribution of sample_ids is uniform. The extra group contains pings where `sample_id` is null.
# In[14]:
print("Filtering based on day")
get_ipython().magic(u'time day_df = month_df.where(col("submission_date_s3") == "20161201")')
print("Result: " + str(day_df.count()))
day_df.groupBy("sample_id").count().describe().show()
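# To see how many pings are missing a sample_id altogether (the null group mentioned
# above), we can also filter on it directly. This is an extra check, not part of the
# original cell.
# In[ ]:
day_df.where(col("sample_id").isNull()).count()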
# Armed with this knowledge, we can sample a certain percentage of the data. We sample 1, 5, 10, and 25% of the data using a random set of ids.
# In[24]:
import random
# use the variables month_df, month_count from previous cells
def percent_error(expected, theoretical):
    return abs(expected - theoretical) / float(theoretical) * 100.0
percentages = [1, 5, 10, 25]
labels = set(range(1, 101))
for percentage in percentages:
    # percentage is an integer in [1, 100] corresponding to the total percentage to sample
    sampled = random.sample(labels, percentage)
    # equivalent to `select count(sample_id) from month_df where month_df.sample_id in (1, 2, 3 ...)`
    get_ipython().magic(u'time result = month_df.where(col("sample_id").isin(sampled)).count()')
    expected = month_count * (percentage / 100.0)
    percent_err = percent_error(result, expected)
    print("Sampled {}% - Expected {}, Result {}, % Error {}\n"
          .format(percentage, expected, result, percent_err))
# The set of ids is chosen randomly from all possible sample_ids. We take advantage of the `isin` operator ([Spark SQL](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.Column.isin)) to count all the rows where the sample_id is within the set of ids. Note that the percentage error relative to the expected count is low, which lets us use this subset as a good approximation of the full dataset.
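#
# A minimal sketch of how this pattern could be packaged for your own reports: pick a
# random set of sample ids up front and filter on them before doing any heavier work.
# The helper name and its arguments are illustrative, not an existing library function;
# it reuses the `col` and `random` imports from the cells above.
# In[ ]:
def sample_main_summary(df, percentage, seed=None):
    """Return roughly `percentage`% of `df` by filtering on a random set of sample ids."""
    rng = random.Random(seed)
    ids = rng.sample(range(1, 101), percentage)
    return df.where(col("sample_id").isin(ids))

# For example, work with an approximate 5% sample of the December data.
sampled_df = sample_main_summary(month_df, 5, seed=42)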